Budgeting, Building, and Scaling Data Labeling Operations
A common mistake, recognized even by AI industry heavyweights, is underestimating the effort that data labeling requires. In fact, the most common challenges AI/ML companies face in their data labeling journey come down to inadequate planning. The symptoms usually surface as questions such as:
- Why did we spend $100,000 on this dataset?
- Where is the dataset we spent five months making?
- Why did our data scientist spend 40 hours this week labeling data?
Fortunately, all of these challenges can be addressed with adequate planning. In this article, we outline the key elements of good planning and forecasting, grouped into five steps:
- Establishing Goals
- Plan your project
- Estimate time and cost
- Evaluate partners
- Appoint a project manager
1. Establishing Goals
Before beginning the planning phase, we must first define what we want to achieve with our project or, in Jeff Bezos’ terms, start with the press release. While your goals will be specific to your needs, some of the most common data labeling project goals are:
- Improving the data labeling process for short-term and long-term projects
- Reducing the overall cost of data labeling
- Freeing up data scientist and ML engineer time
2. Plan your project
In this phase, you need to define the baseline information upon which you can build your project. We found that the most critical information in the planning phase can be categorized into three key areas:
- Data – First, understand the data you are working with. Consider the data volume: how much data there is, and whether delivery is a one-off or recurring. Establishing a representative sample will help your team assess the work ahead far better than starting blindly on the full dataset. Consider the I/O format as well. For satellite imagery, for example, the input could be .jpeg or .geotiff files, and the output could be an industry-standard format such as COCO, VOC, or YOLO. Also consider how the data is shared and stored, for example through email or shared drives.
- Guidelines – Limit the number of variables when creating a dataset to remove inconsistencies. To do so, first determine the high-level goal, or success metric, of the algorithm. For example, this could be ‘identify drivable areas based on satellite imagery with 95% accuracy’. Next, consider what you need to know specifically for your industry. For example, when looking at drivable areas in a satellite image, this could be whether the roads are delineated and whether an area is pedestrian-only. Written instructions help annotators understand which classes, attributes, and labels to use. Lastly, existing examples give a large team of annotators a shared baseline and help ensure consistency throughout the dataset.
- Tooling – Your choice of data annotation software affects your overall process. You have multiple options, such as tools built in-house, open-source software, or third-party applications. Whichever you choose, you need clear documentation and guidelines so that your labeling team can learn how to use the tool. With larger data labeling teams, access and permissions also need to be considered. Typical user profiles include annotator, reviewer, manager, and admin. The tool must also be configurable to support your classes, attributes, relationships, limitations, and multiple annotation types.
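To make the output-format point concrete, here is a minimal sketch of what a COCO-style object detection file can contain, along with a simple sanity check that could be run on a delivery. The file name, category names, and the `check_coco` helper are illustrative assumptions, and only the core `images`, `categories`, and `annotations` keys are shown.

```python
# A minimal COCO-style object detection dataset (all values are made up).
coco = {
    "images": [
        {"id": 1, "file_name": "tile_0001.jpeg", "width": 1024, "height": 1024},
    ],
    "categories": [
        {"id": 1, "name": "road"},
        {"id": 2, "name": "pedestrian_area"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels, measured from the top-left corner.
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [100, 200, 300, 50], "area": 15000, "iscrowd": 0},
    ],
}


def check_coco(dataset):
    """Return a list of human-readable problems found in a COCO-style dict."""
    problems = []
    images = {img["id"]: img for img in dataset["images"]}
    category_ids = {cat["id"] for cat in dataset["categories"]}
    for ann in dataset["annotations"]:
        if ann["image_id"] not in images:
            problems.append(f"annotation {ann['id']}: unknown image_id")
            continue
        if ann["category_id"] not in category_ids:
            problems.append(f"annotation {ann['id']}: unknown category_id")
        x, y, w, h = ann["bbox"]
        img = images[ann["image_id"]]
        inside = x >= 0 and y >= 0 and x + w <= img["width"] and y + h <= img["height"]
        if w <= 0 or h <= 0 or not inside:
            problems.append(f"annotation {ann['id']}: bbox is empty or outside the image")
    return problems


print(check_coco(coco))
```

A check like this is cheap to run on every delivery and catches structural mistakes before they reach model training. It does not say anything about whether the labels themselves are correct.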
3. Estimate time and cost
Perhaps the most common mistake in data labeling projects is underestimating the time and cost required to get high-quality labeled data. As before, we look at three main components for better forecasting.
- Estimations – Your estimates should also account for activities that help eliminate bias, such as reviewing the dataset and building a representative sample. The time spent per labeled item depends on its complexity. For example, a satellite image can take anywhere between one and four hours to fully annotate. For a ballpark cost, use your local minimum wage as the hourly rate. For example: 1,000 images × 2 hours per image × $9.35 local minimum wage = $18,700
- Expectations – There’s a substantial difference between short-term and long-term projects. Short-term projects typically require very high accuracy from the start, so in-house data labeling may be the better fit when your labelers are already experienced in the subject. For long-term projects, outsourced data labeling will typically surpass the efficiency of in-house efforts after an initial adjustment period.
- Budgeting – To accurately forecast how much you need to spend on labeling, we recommend splitting your project into two phases: an evaluation phase (~10% of the dataset), which gives you an estimate of the effort required, and a production phase, whose volume and pace you plan based on the evaluation results. For long-term projects, set a minimum target number of new tasks to be labeled each week or month, and revisit throughput at least once a quarter.
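The arithmetic behind the estimate and the two-phase budget can be sketched in a few lines. This is a rough illustration rather than a costing model: the function names and the evaluation-phase figures (100 images taking 210 hours) are hypothetical, and the forecast assumes the production phase proceeds at the same pace as the evaluation phase.

```python
def estimate_cost(num_items, hours_per_item, hourly_rate):
    """Ballpark labeling cost: items x hours per item x hourly rate."""
    return round(num_items * hours_per_item * hourly_rate, 2)


def forecast_production(total_items, eval_items, eval_hours, hourly_rate):
    """Extrapolate the production phase from a timed evaluation phase."""
    if eval_items <= 0 or eval_items > total_items:
        raise ValueError("eval_items must be between 1 and total_items")
    hours_per_item = eval_hours / eval_items
    remaining_items = total_items - eval_items
    return {
        "hours_per_item": hours_per_item,
        "remaining_items": remaining_items,
        "remaining_hours": remaining_items * hours_per_item,
        "remaining_cost": estimate_cost(remaining_items, hours_per_item, hourly_rate),
    }


# The example from the text: 1,000 images, 2 hours each, $9.35 per hour.
print(estimate_cost(1000, 2, 9.35))

# Hypothetical evaluation phase: 100 of 1,000 images (~10%) took 210 hours.
print(forecast_production(1000, 100, 210, 9.35))
```

Replacing the guessed hours per image with the figure measured during evaluation is the main benefit of the split: the production budget is then based on observed effort rather than an assumption.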
We recommend the following tips for optimizing your budgets:
- Filter out time-consuming edge cases
- Limit the number of images by selecting a smaller but diverse dataset
- Reduce the number of classes and attributes
- Start with a small team and ramp up slowly to ensure data quality in a cost-effective way
4. Evaluate partners
Partnering with a vendor who specializes in data labeling can give you numerous long-term benefits, including a steady supply of labeled data, specialized annotators, and predictable costs.
When evaluating potential data labeling partners, we recommend looking at the following:
- Experience and familiarity with your area of expertise – How are the annotators trained to become specialized in your industry?
- Team size – Simply put, the larger the team, the higher the output. The vendor’s team size must guarantee they have the capacity to deliver the output you require
- Location – This impacts multiple other factors such as price, time zone availability, and resources
- Communication and Engagement – We recommend working with vendors who can provide a dedicated account manager to act as a single point of contact
- Tooling and hardware – Does the vendor have the required hardware to work with your data? For example, do they have adequate graphics cards to render high-resolution graphics?
- Data facility and security – How does the vendor store your data and keep it secure? This can include physical security over storage units, as well as virtual private networks and multi-factor authentication
- Bandwidth and electricity – Does the vendor have the underlying infrastructure to support very large data transfers and a steady electricity supply for power and cooling?
- Pricing and turnaround – Are the vendors offering competitive pricing and suitable service level agreements?
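One possible way to compare vendors across these criteria is a weighted scoring matrix. The sketch below is an assumption about how such a comparison could be set up, not a prescribed method: the weights, the 1 to 5 scores, and the vendor names are all hypothetical and should be replaced with values agreed by your own stakeholders.

```python
# Hypothetical weights: how much each criterion matters to this project.
WEIGHTS = {
    "experience": 3,
    "team_size": 2,
    "location": 1,
    "communication": 2,
    "tooling_hardware": 2,
    "security": 3,
    "infrastructure": 1,
    "pricing_turnaround": 2,
}


def weighted_score(scores, weights=WEIGHTS):
    """Weighted average of 1-5 criterion scores; every criterion must be scored."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


# Hypothetical scores collected during vendor evaluation.
vendors = {
    "vendor_a": {"experience": 5, "team_size": 3, "location": 4, "communication": 4,
                 "tooling_hardware": 3, "security": 5, "infrastructure": 4,
                 "pricing_turnaround": 2},
    "vendor_b": {"experience": 3, "team_size": 5, "location": 3, "communication": 3,
                 "tooling_hardware": 4, "security": 3, "infrastructure": 5,
                 "pricing_turnaround": 5},
}

ranking = sorted(vendors, key=lambda name: weighted_score(vendors[name]), reverse=True)
for name in ranking:
    print(name, round(weighted_score(vendors[name]), 3))
```

The score is only a conversation starter. A vendor that fails a hard requirement, such as a security standard, should be excluded regardless of its total.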
5. Appoint a project manager
Appointing an in-house project manager for the data labeling process can drastically streamline the project and ensure data scientists and developers are focused on their tasks.
A project manager would be responsible for the following:
- Planning the project – Working with stakeholders to understand the requirements, timelines, and costs associated with the project
- Evaluating partners – Understanding the strengths and weaknesses of each vendor and working together with them to achieve the best results
- Liaising between stakeholders – As the main point of contact, the project manager will ensure clear communication between all stakeholders involved in the project
- Reviewing data quality – Ensuring the resulting labeled data meets the required standard
- Tracking the progress – Making sure that data is delivered on time and keeping costs within the budget
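Progress tracking can be as simple as projecting the current pace forward. The following sketch assumes a constant labeling pace and a constant cost per item, which is a simplification, and the function name and all figures in the example call are hypothetical.

```python
import math


def progress_report(total_items, labeled_items, weeks_elapsed, weeks_planned,
                    spent, budget):
    """Project completion time and total cost from the pace observed so far."""
    if labeled_items <= 0 or weeks_elapsed <= 0:
        raise ValueError("need at least some labeled items and elapsed time")
    items_per_week = labeled_items / weeks_elapsed
    cost_per_item = spent / labeled_items
    projected_weeks = math.ceil(total_items / items_per_week)
    projected_cost = cost_per_item * total_items
    return {
        "items_per_week": items_per_week,
        "projected_weeks": projected_weeks,
        "projected_cost": projected_cost,
        "on_schedule": projected_weeks <= weeks_planned,
        "within_budget": projected_cost <= budget,
    }


# Hypothetical status after 4 of 12 planned weeks.
report = progress_report(total_items=1000, labeled_items=300, weeks_elapsed=4,
                         weeks_planned=12, spent=6000, budget=18700)
print(report)
```

Running a report like this every week gives the project manager an early signal to add annotators, simplify guidelines, or renegotiate scope before the deadline or budget is actually missed.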
Conclusion
Based on our experience, if you apply these best practices to your own requirements in upcoming labeling projects, you will be able to:
- Accurately estimate time and effort
- Minimize cost and reduce risk
- Ensure output quality
- Maintain clear communications