Crowdsourcing Annotation with CVAT and Human Protocol: Real Data Experiment Showed Amazing Results

Introduction to Crowdsourcing

As dataset sizes grow, the demand for scalable and efficient data annotation methods increases. 

Crowdsourcing can be a solution, as it offers significant advantages like scalability and reduced costs but comes with challenges in management, communication, and technical requirements. 

To address these challenges, we recently introduced a crowdsourcing solution combining CVAT and HUMAN Protocol, now available for use.

In this article, we demonstrate the benefits of our approach through a real-world dataset annotation experiment, showing how efficient it can be for potential users. The experiment also revisits key platform features and highlights the roles of crowdsourcing participants.

Crowdsourcing involves two key participants:

  • Requesters: ML model developers, researchers, and competition organizers seeking precise data annotation.
  • Annotators: Individuals eager to earn through data annotation, ranging from those seeking extra income to full-time professionals.

If you're a Requester, CVAT and Human Protocol make the whole annotation process easy and automated for you — from setting up and managing tasks to checking the work and handling payments based on how well the job is done. To get an annotated dataset, you only need to create an annotation specification (how you want the data to be annotated), upload data, and set your quality and payment expectations. Our platform does everything else, giving you back a dataset that meets your required quality standards without any hassle.

If you're an Annotator, starting to earn money is just a few simple steps away. Sign up, pick a task, and follow the clear instructions for annotating. We keep assignments short to fit your schedule and boost efficiency. Once your completed tasks have been validated, you'll receive your earnings (tokens) in your wallet.

Compensation for Annotators is in cryptocurrency, so a digital wallet is required. Requesters also have the option of paying by bank card. The funds are earmarked at task initiation and disbursed after the task is completed and validated.

Now, we’re ready to look at our annotation experiment and investigate its outcomes.

Why did we do it?

We conducted an experiment to test the efficiency of using a crowd for data annotation in real-world tasks. Our goal was to evaluate several factors:

  • What's the time investment like?
  • What quality level can we realistically achieve?
  • How cost-effective is it?

If you’re a Requester, understanding these factors is crucial for deciding whether crowd-sourced annotation suits your specific task and meets your needs for speed, cost, and quality.

The Dataset

For our experiment, we chose the Oxford Pets dataset, a publicly available collection with approximately 3.5k images featuring various types of annotations such as classification, bounding boxes, and segmentation masks. While the dataset is moderately sized, it offers real-world, manually curated annotations for each image. The dataset originally contains over 30 classes*, but we simplified our task to focus on just two categories: cats and dogs. Our goal was to have annotators precisely mark the heads of these animals with tight bounding boxes, which is critical for applications designed to distinguish between different pet species.

*In this context, a "class" refers to a category or type of object that the model is trained to identify. Each class represents a distinct group, such as 'cat' or 'dog', allowing the model to categorize images based on the characteristics defined for each class during training.

The Experiment

We recruited 10 random annotators without previous experience and closely monitored their performance. The primary goal was to reach a quality level of 80%, a benchmark that, while challenging, reflects the precision machine learning models need. This standard is a starting point that may need adjustment based on the specifics of your dataset.

To guarantee annotation accuracy, our system employs Ground Truth (GT) annotations, also known as honeypots. GT is a small subset of a dataset, typically 3-10% depending on its size, used for validating annotations. Datasets usually have no annotations at the start, so the GT has to be annotated as a separate task, then manually reviewed and accepted. Since we had original annotations for each image, we used them for the GT.

To ensure accuracy and consistency in our study, we meticulously prepared task descriptions and selected 63 Ground Truth (GT) images (2% of the total dataset) to assess annotation quality. Annotators were assigned small batches of images for labeling. After completion, their annotations were automatically compared to the GT to evaluate accuracy. This process allowed us to systematically verify the quality of the annotations provided.
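The automatic comparison against GT can be illustrated with a minimal Python sketch. The article does not state which metric the platform uses, so everything here is an assumption for illustration: the helper names (`iou`, `gt_accuracy`, `meets_target`) are hypothetical, an annotation counts as correct when its label matches the GT label and the boxes overlap with an intersection over union (IoU) of at least 0.5, and each GT image is assumed to contain exactly one object.

```python
from typing import List, Tuple

# Hypothetical types for illustration: a box is (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]
Labeled = Tuple[str, Box]  # (class label, bounding box)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(0.0, inter_w) * max(0.0, inter_h)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def gt_accuracy(pairs: List[Tuple[Labeled, Labeled]],
                iou_threshold: float = 0.5) -> float:
    """Share of GT images where the label matches and IoU clears the threshold.

    Each pair is (annotator's annotation, GT annotation) for one GT image.
    The 0.5 threshold is an assumed value, not taken from the article.
    """
    if not pairs:
        return 0.0
    hits = sum(
        1
        for (ann_label, ann_box), (gt_label, gt_box) in pairs
        if ann_label == gt_label and iou(ann_box, gt_box) >= iou_threshold
    )
    return hits / len(pairs)


def meets_target(pairs: List[Tuple[Labeled, Labeled]],
                 target: float = 0.8,
                 iou_threshold: float = 0.5) -> bool:
    """True when GT accuracy reaches the quality target (80% in our experiment)."""
    return gt_accuracy(pairs, iou_threshold) >= target
```

In a setup like this, a batch whose GT images fall below the target would be rejected and sent back for re-annotation, while passing batches would be accepted.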

Execution and Results

Let’s return to the questions we asked at the start of the article and answer them one by one, based on the experiment's outcomes.

What's the time investment like?

Our experiment showed that high-quality annotations can be achieved without significant delays. Initially, we estimated that an experienced team of annotators would complete the dataset in 1-3 days, including validation and assignment management. For a team with no prior experience, the actual time taken was 3-4 days. This figure excludes the time spent on some necessary adjustments on our side, but includes periods when some annotators were temporarily unavailable.

We see this as a highly positive outcome: with such a setup, it is not obvious that the full dataset can be completed at all. In future runs, having learned from the mistakes and adjustments of the first one, we expect to reduce the time required and bring it closer to our original estimate.

What quality level can we realistically achieve?

When it comes to crowd-sourced annotation, we always expect the quality to be lower than that of a professional team. Still, our experiment delivered some promising insights. We set a high bar with an accuracy target of 80% (it can certainly be set higher), aiming for the level of precision that machine learning models need to function reliably. We achieved this quality!

The resulting annotation quality is decent. There are certainly errors of different kinds, but overall, the results can definitely be used for model training.

Note that in our case, full annotations were available for every image, so we were able to check our statistical estimates against them. There is some quality drift on the full dataset compared to the Ground Truth portion alone, but this is expected, since the Ground Truth set contained only 2% of the images.
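To see why some drift is expected with a small Ground Truth set, here is a rough back-of-the-envelope sketch in Python. It is only an illustration under simplifying assumptions that are not from the experiment itself: each GT image is treated as an independent pass/fail check, the normal approximation for a proportion is used, and the 80% target stands in for the measured accuracy. The function name `margin_of_error` is hypothetical.

```python
import math


def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation interval for a proportion.

    p_hat: observed share of correct annotations (assumed value here)
    n:     number of independent checks (63 GT images in our experiment)
    z:     z-score for the confidence level (1.96 for roughly 95%)
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)


# With 63 GT images and accuracy around 0.8, the estimate is only good
# to within roughly 10 percentage points in either direction.
gt_margin = margin_of_error(0.8, 63)

# With all 3,600 images checked, the margin shrinks to about 1 point.
full_margin = margin_of_error(0.8, 3600)
```

Under these assumptions, a difference of a few percentage points between the GT-based estimate and the quality measured on the full dataset is well within what sampling alone would produce.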

We can also see that our annotation quality surpassed that found on MTurk, where it typically ranges between 61% and 81%. According to research on Data Quality from Crowdsourcing, our results align with the highest standards for annotation quality.

This finding is crucial for anyone considering crowd-sourced annotation for their projects. It means that not only can you expect to get your visual data annotated affordably and swiftly, but you can also rely on the quality of the work to be good enough for training sophisticated deep learning models. 

How cost-effective is it?

Our examination of the cost-effectiveness of crowd-sourced annotation revealed that the expense of annotating a dataset with bounding boxes was remarkably low, costing only $0.02 per bounding box or image (a bit below the market price). This pricing strategy led to a total cost of $72 for the entire dataset, assuming most images featured just one object.

Here's a simple breakdown of the pricing we used:

Each task included up to 10 regular images that we paid for, and 2 Ground Truth (GT) images that were not paid for. Every paid image cost 2 cents, so a full task cost 20 cents, adding up to $72 for all 3,600 images. With this setup, we only paid for work that passed our quality checks, which means Requesters only pay for accurate annotations.

The system uses HMT, a cryptocurrency, for payments, which makes the whole process fast and smooth. We don’t use regular (fiat) money at all, but annotators who wish to can always convert the tokens they receive into any other cryptocurrency or fiat money.

This shows that using CVAT and Human Protocol for crowd-sourced annotation is not just easy on your wallet but also effective, helping you get high-quality data labeled without spending a lot.
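The arithmetic behind this breakdown can be reproduced with a short Python sketch. The function name `crowd_cost` and its parameters are hypothetical and simply restate the figures above; prices are kept in whole cents to avoid floating-point rounding.

```python
import math


def crowd_cost(num_images: int,
               images_per_task: int = 10,
               gt_per_task: int = 2,
               price_cents: int = 2) -> dict:
    """Estimate task count and total cost for a crowd-sourced annotation job.

    Only regular images are paid for; GT images in each task are unpaid.
    """
    tasks = math.ceil(num_images / images_per_task)
    total_cents = num_images * price_cents
    return {
        "tasks": tasks,
        "full_task_cost_usd": images_per_task * price_cents / 100,
        "unpaid_gt_views": tasks * gt_per_task,
        "total_usd": total_cents / 100,
    }


summary = crowd_cost(3600)
print(summary)
```

For 3,600 images this gives 360 tasks at $0.20 each and a total of $72, matching the figures above. The 720 unpaid GT views assume every task carries exactly 2 GT images, so the 63 GT images would each be shown multiple times.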

Summary: was the approach feasible? 

Our experiment shows that crowdsourced annotation is both viable and effective, achieving the desired quality with minimal deviation. We also identified potential improvements. Workforce management was reduced to just onboarding and technical support, while all other tasks, from recruitment to payment, were fully automated. We encourage both Requesters and Annotators to try our service, a streamlined, automated platform for high-quality data annotation tasks. If you need any help setting up the process, you can also drop us an email:

Happy annotating!

Not a user?  Click through and sign up here

February 23, 2024