Picture a self-driving car navigating a busy street, flawlessly avoiding obstacles, adhering to traffic signals, and safely reaching its destination, all without human intervention. This remarkable feat is a testament to the power of artificial intelligence (AI), specifically computer vision. But how do these systems acquire such a sophisticated understanding?

The answer lies in vast amounts of meticulously annotated video data. Through a process known as video annotation, raw video footage is transformed into structured, labeled data that computer vision models can learn from and turn into real-world applications such as autonomous vehicles. If you are interested in learning more about video annotation, this guide provides a comprehensive overview of what the process is, why it matters, and the techniques, applications, and best practices involved.

What Is Video Annotation?

Video annotation is the process of labeling or masking specific objects of interest in videos based on their types or categories. A human annotator highlights specific parts of a video frame and tags them with a label. The annotated video dataset then becomes the ground truth used to train computer vision models, typically through supervised learning.

By learning from each of these labels or masks, the machine learning algorithm becomes more adept at associating visual data with real-life objects, much as humans do. Video annotation is laborious: human labelers patiently identify and classify multiple objects frame after frame. Often, they use automated video annotation software to speed up the process.

Why Is Video Annotation Important for Computer Vision and AI Models?

Startups and global enterprises are in a race to market state-of-the-art computer vision systems. By 2031, the computer vision market is predicted to reach US $72.66 billion.
But to compete and thrive in this industry, relying on state-of-the-art computer vision models isn't enough. By itself, a computer vision model cannot correctly interpret objects in video data. Like other machine learning algorithms, it needs to learn from datasets curated and annotated for a specific application. The video annotation process is what provides the context the model needs to learn.

Take a traffic monitoring system as an example. Without learning from an annotated dataset, the computer vision model can't identify the cars, pedestrians, and other objects the camera captures. The system sees only pixel data: contrasts, hues, and brightness values for each frame that passes through. That changes when you annotate the video. For example, you can place a bounding box on a car to teach the model to identify it as such. Likewise, you can train the model to identify pedestrians by drawing keypoints on people. We'll cover more of this later, but the point is that video annotation makes a computer vision model smarter by training it to interpret video data much as we interpret what we see in real life.

Computer vision models operate on the garbage-in, garbage-out principle: feed the model low-quality data and it produces inaccurate results. The dataset the model trains on is therefore just as critical as the model itself, which is why annotation quality matters so much.

Video Annotation vs. Image Annotation: What's the Difference?

Video annotation is a subset of data annotation, which also includes image annotation. Some people draw similarities between the two. The common argument: video is made up of sequential frames of individual images, so just as you can draw a bounding box on an image, you can do so on a still frame in a video. But that's where the similarity ends.
Video annotation is more suitable in use cases that require more contextual information, such as depth layers and movement. That said, video annotation is also more complex, which is why annotators use automated data labeling tools like CVAT video annotation to assist their efforts. Image annotation, by contrast, is simpler because the annotation is limited to a static visual.

How Is Video Annotation Used in Various Industries?

Many modern AI tools and applications are built on video and image annotation, which allows AI models to capture the world in motion.

Autonomous vehicles

At the heart of an autonomous vehicle is an AI-powered system that processes live video streams to navigate complex environments safely. To achieve this level of precision, perception systems rely on millions of labeled examples of real-world driving scenarios. This training gives the AI the robustness needed to handle unpredictable events, such as sudden pedestrian crossings or multi-lane intersections, so the vehicle adheres to traffic rules and avoids obstacles in real time.

Common annotation tasks used to build these perception systems include:

- Object detection to detect, track, and classify specific entities such as vehicles, pedestrians, and cyclists, helping the system understand the location and number of obstacles.
- Polylines to identify lane markings and road boundaries.
- Semantic segmentation to define the drivable surface area and ensure the vehicle stays on the road.
- 3D LiDAR point clouds to build a depth-aware model of the surrounding environment.

The accuracy of these annotations directly impacts the safety and reliability of the self-driving system, making high-quality video labeling a non-negotiable part of development.

Healthcare

Doctors, nurses, and medical staff benefit from imaging systems trained on annotated video datasets. Conventionally, they rely on manual observation to detect anomalies like polyps, cancer, or fractures.
Now, they're aided by computer vision-powered technologies that help them diagnose more accurately. This technology is moving beyond static scans into dynamic video analysis, allowing AI models to understand procedural flows and temporal changes in tissue. For surgical applications, this means an AI can learn to anticipate a surgeon's next move or highlight critical anatomical structures in real time.

Key applications in healthcare include:

- Annotating surgical videos to train AI-assisted surgical guidance systems.
- Labeling endoscopy and colonoscopy footage to auto-detect polyps and lesions.
- Tracking organ movement in ultrasound and MRI sequences for anomaly detection.
- Monitoring patient video feeds to detect falls or abnormal movement in hospital settings.

By training models on expertly annotated procedural videos, healthcare institutions can improve diagnostic speed, enhance surgical precision, and create more effective training tools for the next generation of clinicians.

Agriculture

In agriculture, video annotation helps train computer vision models to monitor how crops, livestock, and machinery change and move over time. This is especially useful in farming environments, where important patterns such as plant growth, animal behavior, or signs of pest activity often become clear only across a sequence of frames rather than in a single image. Manual inspection across large fields is time-consuming, difficult to scale, and hard to sustain consistently.
That's why professional farmers and agronomists rely on AI systems trained on labeled video data to spot patterns and issues that might otherwise be missed.

Common use cases in agriculture data annotation include:

- Analyzing drone footage to monitor crop health and estimate yields.
- Identifying weeds and pests with polygon annotation for targeted spraying.
- Tracking livestock and analyzing behavior using keypoint and skeleton annotation.
- Mapping machinery paths and detecting obstacles for autonomous farm equipment.

These applications help farmers make more informed, data-driven decisions, leading to increased efficiency, reduced waste, and more sustainable farming practices.

Manufacturing

Product defects, left unnoticed, can hurt manufacturers both financially and reputationally. A visual-inspection system trained on annotated datasets allows for more precise quality checks. Such systems also create a safer workplace by proactively detecting abnormal or unsafe situations.

Modern manufacturing relies on high-speed production lines where human inspection can become a bottleneck. AI-powered quality control, trained on annotated video, can identify subtle defects in real time that are invisible to the naked eye, ensuring higher product quality and throughput.

Typical annotation tasks in manufacturing include:

- Labeling surface defects, cracks, and irregularities on production lines.
- Annotating worker movements and posture to monitor safety compliance.
- Detecting objects for robotic pick-and-place automation systems.
- Tracking assembly progress to verify correct component placement in complex products.

By integrating annotated video into their workflows, manufacturers can significantly reduce error rates, improve worker safety, and increase overall operational efficiency.
A continuous feedback loop between quality inspectors and the AI system also helps the model keep improving over time.

Security surveillance

Another area where video annotation is in demand is security surveillance. CCTV cameras allow security officers to oversee people's movement in real time, but identifying suspicious behavior is difficult, especially when monitoring multiple feeds. With computer vision, untoward incidents can be prevented: the system picks up patterns it was trained to identify and promptly alerts the officers.

Key annotation use cases for surveillance include:

- Detecting and tracking individuals across multiple camera feeds.
- Estimating crowd density and flow using bounding box and polygon annotation.
- Identifying anomalous behavior like loitering, trespassing, or abandoned objects.
- Training facial recognition models using keypoint and bounding box labels.

These AI-driven systems augment human security teams, enabling faster response times and more effective monitoring of large public and private spaces, with modern surveillance platforms surfacing real-time alerts on suspicious activity for security teams to act on.

Traffic management

Traffic rule violations, congestion, and accidents are concerns that governments want to resolve, and computer vision improves the odds of doing so. Once trained, an AI model can analyze traffic patterns, recognize license plates, and identify accidents from camera feeds. Smart city initiatives rely heavily on intelligent traffic systems to improve flow and safety.
By training models on annotated video from roadside cameras, cities can dynamically adjust traffic signals, detect incidents in real time, and gather valuable data for long-term urban planning.

Common annotation tasks for traffic management include:

- Classifying vehicles by type (cars, trucks, motorcycles, buses).
- Annotating license plate regions for automated number plate recognition (ANPR).
- Labeling traffic lights and road signs for intersection management systems.
- Detecting incidents like accidents, stalled vehicles, and road blockages.

This data allows for the creation of adaptive traffic networks that can reduce congestion, lower emissions, and improve the daily commute for thousands of people.

Disaster response

First responders need to make prompt, accurate decisions to save lives and property during large-scale emergencies. Computer vision technologies, coupled with aerial video footage, can help responders plan rescue operations. For example, emergency teams send drones equipped with computer vision algorithms to locate victims affected by wildfires. In the chaotic aftermath of a natural disaster, situational awareness is critical. Annotated aerial and ground-level video helps train models that can quickly assess damage, identify passable routes, and locate signs of human activity, providing a crucial intelligence layer for rescue teams.

Annotation applications in this field include:

- Labeling aerial drone footage to detect survivors and victims.
- Assessing structural damage by identifying destroyed or compromised buildings.
- Segmenting flood and fire boundaries for resource deployment planning.
- Annotating thermal imagery to locate heat signatures in search-and-rescue operations.

Beyond these industries, computer vision systems trained on annotated video are also transforming robotics, sports analytics, retail, and many other sectors.
In every case, the quality of the underlying annotations determines how reliably these systems capture critical data.

What Are the Main Types of Video Annotation Tools?

In video annotation, you aren't just performing standard image annotation on a static frame; you are creating an object track. The goal is to maintain the identity and spatial accuracy of an object as it moves through time. So how do you do this?

#1 Identifying and Monitoring Through Object Tracking

Object tracking is the process of assigning a persistent, unique identifier to a target across a continuous sequence of frames. In a professional environment, tracking is a hybrid process in which human expertise and machine precision work in tandem to ensure data integrity, so that machine learning models receive the highest-quality training data.

Instead of manually drawing a box on every single frame, a high-efficiency tracking process follows this collaborative cycle:

- Initialization and identity: A human annotator identifies the target object and assigns a persistent, unique ID. This ensures that Car 1 in the first frame remains Car 1 throughout the entire sequence, providing the foundational data needed for re-identification and behavioral analysis.
- AI-powered pixel-level locking: Once the object is defined, advanced algorithms like SAM 2 take over. The AI locks onto the specific visual features of the target, automatically adjusting the label coordinates as the object moves, rotates, or changes scale, even through shifts in lighting or camera angle. This is one of the essential features found in modern online video annotation tools.
- Human-in-the-loop verification: The annotator transitions from drawer to supervisor.
They monitor the automated track and step in only to provide corrective keyframes if the model loses its lock due to extreme motion, blur, or complex interactions.

This integrated approach lets your team manage the high-level logic of identity and intent while the machine handles the repetitive pixel tracking.

#2 Scaling Efficiency Through Interpolation and Occlusion Management

Interpolation and occlusion management are the primary mechanisms for handling the high volume and complexity of video data. These processes allow annotators to maintain high-quality labels without manually interacting with every individual frame. A streamlined workflow for managing motion and visual breaks looks like this:

- Keyframe interpolation: Annotators identify the specific keyframes where an object begins, ends, or changes its path of motion. The software uses these anchors to calculate the object's position for all intermediate frames, reducing manual labor by up to 90% in predictable sequences.
- Addressing occlusion: When a target is partially or fully obscured by another object, the track remains active but is marked as occluded. This informs the model that the object is still present in the scene, which is critical for training the spatial awareness required in autonomous systems.
- Re-entry and continuity: When an object re-emerges from behind an obstacle, the annotator resumes the track using the same unique ID. This maintains temporal context, teaching the model that a physical object is a persistent entity even when it is temporarily out of sight.

By focusing manual effort only on frames with significant changes and managing visual breaks with logic-based states, these techniques make it possible to process hours of high-resolution footage.

#3 Classifying Behavior Through Action and Event Annotation

While tracking follows an object, action annotation, also known as temporal segmentation, labels the behavior occurring within a specific timeframe.
Instead of just identifying a person, you are identifying the start and end points of a specific activity. A typical workflow for event-based labeling includes:

- Start and end triggers: Annotators define the exact frame where an action begins (e.g., a car starting a left turn) and where it concludes, creating a temporal segment.
- Multi-labeling tracks: A single object track can carry multiple sequential or overlapping action labels, such as a person walking, then stopping, then checking their phone.
- Global scene classification: Some events apply to the entire video rather than a single object, such as a change in weather or a specific traffic phase (e.g., the duration of a green light).

By segmenting video into these discrete behavioral chunks, you enable models to recognize intent and predict future actions.

#4 Defining Spatial Boundaries With Video Annotation Primitives

Just as in image labeling, labeling a video means using different geometric shapes, or primitives, to define the boundaries of your target object or scene. Choosing between these primitives depends on the particular vision task of your future machine learning model.

Bounding boxes

A bounding box is the simplest type of annotation you can make on a video. The annotator draws a rectangle over an object, which is then tagged with a label. It's suitable when you need to classify an object and aren't concerned about separating background elements. For example, you can draw a rectangular box over a dog and tag it as an animal. While simple, bounding boxes are foundational for many computer vision tasks.
Their efficiency makes them ideal for large-scale projects where the primary goal is to locate and identify objects within the frame, without needing to understand their exact shape.

Common tasks for this annotation type include:

- Drawing rectangles around vehicles, pedestrians, and signs for traffic analysis.
- Placing boxes over products on a shelf for retail inventory management.
- Identifying and classifying different types of animals in wildlife footage.

Despite its simplicity, mastering bounding boxes is a critical skill, as it underpins a wide range of object detection and classification pipelines.

Polygons

Like bounding boxes, polygons enclose an object, but they let you exclude unwanted background information by following the object's outline. This higher level of precision is critical for instance segmentation tasks, where the model must learn the exact shape and boundaries of complex, irregular objects. The additional detail comes at the cost of increased annotation time and effort.

Key applications for polygon annotation involve:

- Outlining individual vehicles in a crowded street scene for autonomous driving.
- Segmenting specific organs or tumors in medical imaging videos.
- Tracing the shape of individual plants for agricultural yield analysis.

When a project demands pixel-level accuracy, polygon annotation is typically preferred over bounding boxes.

Polylines

Polylines are sequences of connected line segments drawn through multiple points. They are helpful when annotating linear objects across frames, such as roads, railways, and pathways. Unlike polygons, polylines do not need to form a closed shape, making them perfect for defining paths, lanes, and trajectories.
They are essential for training models that need to understand directional movement and linear features in an environment.

Typical uses for polylines include:

- Defining road lanes and boundaries for autonomous vehicle navigation.
- Mapping utility lines or cracks in infrastructure from aerial footage.
- Tracking the path of a moving object, such as a ball in a sports game.

In practice, polyline annotation is often used alongside polygon annotation on the same project, with each tool applied to the object type it suits best.

Ellipses

Ellipses are used for objects with round or oval outlines, such as eyes or balls. For these shapes, an ellipse is significantly faster and more efficient than drawing a multi-point polygon, offering a good balance between the speed of a bounding box and the precision of a polygon. Most robust online video annotation tools include it for exactly this reason.

This annotation type is particularly effective for:

- Annotating fruits on a tree for automated harvesting systems.
- Tracking balls and other equipment in sports analytics videos.
- Labeling circular gauges and dials on a control panel for industrial automation.

The ellipse tool is a small but valuable addition to any annotator's toolkit, saving significant time on projects with round or oval objects.

Keypoints & skeletons

Some video annotation projects require pose estimation and motion tracking. That's where keypoint and skeleton annotation come in handy. Keypoints are tags assigned to specific parts of an object, such as body joints and facial features. The machine learning algorithm can then track how these points move relative to each other. You can also join keypoints to form skeletons, which track body movement more precisely.
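As a rough illustration of how keypoints and skeletons are commonly stored, the sketch below pairs named points with the "bones" that connect them. The point names, coordinates, and helper function are hypothetical, not any specific tool's export format:

```python
# Illustrative skeleton: named keypoints (pixel coordinates) plus the
# edges ("bones") that join them into a connected structure.
keypoints = {
    "head": (320, 90),
    "left_shoulder": (290, 160),
    "right_shoulder": (350, 160),
    "left_elbow": (260, 230),
    "right_elbow": (380, 230),
}

edges = [
    ("head", "left_shoulder"),
    ("head", "right_shoulder"),
    ("left_shoulder", "left_elbow"),
    ("right_shoulder", "right_elbow"),
]

def bone_lengths(points, bones):
    """Pixel length of each bone, a common sanity check for pose labels."""
    return {
        (a, b): ((points[a][0] - points[b][0]) ** 2 +
                 (points[a][1] - points[b][1]) ** 2) ** 0.5
        for a, b in bones
    }

lengths = bone_lengths(keypoints, edges)
```

Checking that bone lengths stay roughly constant from frame to frame is one simple way to catch mislabeled joints in a pose-tracking dataset.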
This technique is fundamental for applications that need to understand the posture, gestures, and actions of humans or animals. By tracking the movement of interconnected keypoints, a model can learn complex behaviors that are impossible to capture with other annotation types.

Core applications for this technique are:

- Estimating human poses in fitness and physical therapy applications.
- Analyzing an animal's gait for veterinary science and behavioral studies.
- Capturing subtle facial expressions for emotion recognition and avatar animation.

Skeleton annotation is one of the more technically demanding annotation types, but it unlocks a level of behavioral understanding that no other method can match.

Cuboids

Cuboids let annotators label 3D objects with a fairly uniform structure, such as furniture, buildings, or vehicles. A cuboid can carry spatial information such as orientation, size, and position, which is used to train computer vision models. By adding the third dimension of depth, cuboids provide a much richer understanding of an object's presence in 3D space. This is essential for any application where the model needs to interact with or navigate around real-world objects, such as robotics and autonomous driving.

Annotators use cuboids for tasks like:

- Drawing 3D boxes around cars, trucks, and pedestrians for AV perception.
- Labeling packages on a conveyor belt for automated sorting in logistics.
- Defining the volume of furniture for augmented reality placement.

3D cuboid annotation is increasingly in demand as autonomous systems require more spatially aware training data.

How to Choose the Right Video Annotation Tool

Beyond knowing how to annotate a video, you also need video annotation tools that help you execute the steps. With a growing number of video annotation platforms available, selecting the right one is a critical decision.
The best tool for your project will depend on factors like the annotation types you require, the scale of your dataset, your budget, and whether you need advanced features like AI-assisted video labeling or collaborative workflows. Below is a table outlining the most common video annotation tools.

What Are the Key Challenges for Video Annotation?

Video annotation is key to enabling state-of-the-art computer vision applications, but creating accurate and consistent datasets remains challenging, even for experienced annotators and ML teams. If you're starting a video annotation project, be mindful of these challenges.

Labeling inconsistency

Human labelers play a vital role in video annotation regardless of the tools you use, so annotation results are subject to individual interpretation. For example, one annotator may classify a dog as a Poodle, while another labels it a Toy Poodle. The labels are similar but not the same as far as machine learning algorithms are concerned.

A practical way to enforce consistency is to measure inter-annotator agreement (IAA) regularly. This metric quantifies how often different annotators assign the same label to the same object. Low IAA scores signal that your guidelines need clarification or that additional training is required.

Common ways to improve consistency include:

- Creating a detailed labeling guide with visual examples of edge cases.
- Running calibration sessions where annotators label the same sample and compare results.
- Using consensus annotation, where multiple annotators label the same frame and a majority vote determines the final label.

Inadequate training

Before they annotate, labelers must receive proper training to ensure they're familiar with the video annotation process, tools, and expectations. Otherwise, you risk compromising the outcome with inaccurate labels, rework, and costly delays. Effective annotator training goes beyond a one-time onboarding session.
It should include hands-on practice with the specific annotation tool being used, worked examples covering the most common and ambiguous scenarios in your dataset, and a clear escalation path for edge cases the annotator is unsure about. Ongoing micro-training sessions as new object types or labeling rules are introduced also help maintain quality over the life of a long project.

Immense datasets

Video data is larger than its textual and image counterparts, so annotating video frames can consume resources that not all companies can spare. We recommend the following strategies to manage the scale of video annotation without sacrificing quality:

- Use frame sampling to annotate a representative subset of frames rather than every single one.
- Leverage interpolation to automatically generate labels between manually annotated keyframes.
- Apply pre-trained AI models to generate initial annotations, then use human reviewers to verify and correct them.
- Distribute work across a larger team using a platform with collaborative workflows and task queues.

Combining these approaches can reduce annotation time by a significant margin while keeping dataset quality at the level your model requires.

Data security and privacy

Video annotation requires collecting, storing, and processing large volumes of video content, some of which may contain sensitive information. You need ways to secure datasets throughout the entire labeling pipeline and to comply with data privacy laws.

Key security considerations for a video annotation project include:

- Ensuring data is encrypted both in transit and at rest.
- Restricting annotator access to only the data they need to label.
- Anonymizing or blurring personally identifiable information (PII) such as faces and license plates before annotation begins.

Also, depending on your industry and geography, you may need to comply with regulations such as GDPR, HIPAA, or CCPA.
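As a simplified illustration of the anonymization step, the sketch below zeroes out the pixels inside a rectangular region of a frame. Real pipelines would blur detected regions with an image-processing library; here the frame is a toy nested list of grayscale values and the box coordinates are made up:

```python
def mask_region(frame, box, fill=0):
    """Black out a rectangular region of a frame given as rows of pixel values.

    box is (x, y, width, height) in pixel coordinates; the region might cover
    a detected face or license plate in a real pipeline.
    """
    x, y, w, h = box
    for row in frame[y:y + h]:
        row[x:x + w] = [fill] * w
    return frame

# A tiny 4x6 "frame" of grayscale value 9; imagine the box covers a plate.
frame = [[9] * 6 for _ in range(4)]
masked = mask_region(frame, (1, 1, 3, 2))
# Rows 1-2, columns 1-3 are now zeroed; the rest of the frame is untouched.
```

Running this kind of masking before footage reaches annotators means PII never enters the labeling pipeline at all.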
Project timeline

Time to market is another concern that puts pressure on annotators. Video annotation is a laborious process by itself, and if teams use manual tools, delays can pile up as they spend time addressing labeling issues. Timeline overruns in annotation projects are often caused by unclear requirements discovered mid-project, a high rate of rework due to inconsistent labeling, or bottlenecks in the review and approval process. Mitigating these risks through thorough scoping, a pilot annotation phase, and a clearly defined QA workflow is far more effective than trying to recover time later. Building a realistic buffer into your schedule for edge cases and revisions is equally important.

We know that video labeling can be tedious even with the right tool. That's why we help companies save time and costs with professional video annotation services.

What Are the Best Practices When Annotating Videos?

Don't be discouraged by the hurdles that can complicate video annotation. With the right precautions and smarter approaches, you can improve annotation quality without committing excessive resources. Here are some best practices for video annotation:

#1 Set up an automatic video labeling process

Don't hesitate to automate the labeling process. Automatic annotation isn't perfect, and you'll likely need to review the frames to ensure they're correctly labeled. But automation saves tremendous time that you can better spend on strategizing the computer vision project.

If you use CVAT, you can take automated labeling further with SAM-powered annotation. We integrate SAM 2 (Segment Anything Model 2) with our data labeling software to enable instant segmentation and automated tracking of complex objects.

#2 Prioritize video quality

We know that annotators have little or no control over the video they annotate. But on your part, try to ensure the recordings are high quality to start with.
Also, the annotation software you use matters, as some tools can unknowingly degrade video quality. Poor video quality directly impacts annotation accuracy: motion blur, low resolution, and poor lighting make it harder for annotators to draw precise labels and can introduce ambiguity that reduces dataset quality. Where possible, aim for:

- A minimum resolution of 1080p for most annotation tasks, and higher for fine-grained labeling.
- A frame rate appropriate for the speed of objects in the scene; faster movement requires more frames per second.
- Consistent lighting conditions, as sudden changes in brightness can confuse both annotators and trained models.

#3 Keep labels and datasets organized

Video annotation can get out of hand quickly if you don't stick to an organized workflow. Overlapping classes, misplaced datasets, and similar confusion can limit your annotators' productivity. Thankfully, these problems can be addressed with a user-friendly data annotation tool. Good organization starts with a clear, hierarchical list of all object classes and their attributes before annotation begins. Version-controlling your datasets and annotation files is equally important, as it allows you to roll back to a previous state if errors are introduced. Lastly, naming conventions for tasks, jobs, and exported files should be agreed upon by the whole team from day one.

#4 Interpolate sequences with keyframes

You don't need to label every single frame in a video. Instead, you can assign keyframes at the boundaries of predictable sequences and interpolate between them. This will save you lots of time. Keyframe interpolation works best when objects move along a predictable, linear path between frames. For more complex or erratic motion, you may need to place keyframes more frequently to maintain accuracy. A good rule of thumb is to place a keyframe whenever an object changes direction or speed, or becomes partially occluded.
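The linear interpolation behind this technique can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's implementation; it assumes boxes are (x, y, width, height) tuples and that motion is linear between keyframes:

```python
def interpolate_box(box_a, box_b, t):
    """Linearly interpolate two (x, y, width, height) boxes; t is in [0, 1]."""
    return tuple(a + (b - a) * t for a, b in zip(box_a, box_b))

def interpolate_track(keyframes):
    """Expand a {frame_index: box} dict of keyframes into a dense per-frame track."""
    frames = sorted(keyframes)
    track = {}
    for start, end in zip(frames, frames[1:]):
        span = end - start
        for f in range(start, end):
            t = (f - start) / span
            track[f] = interpolate_box(keyframes[start], keyframes[end], t)
    track[frames[-1]] = keyframes[frames[-1]]  # keep the final keyframe itself
    return track

# Two keyframes ten frames apart; the nine boxes in between are generated
# automatically instead of being drawn by hand.
dense = interpolate_track({0: (100, 50, 40, 20), 10: (200, 100, 40, 20)})
print(dense[5])  # halfway between the keyframes -> (150.0, 75.0, 40.0, 20.0)
```

With two hand-drawn keyframes standing in for eleven frames of manual work, it is easy to see where the large labor savings in predictable sequences come from.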
Reviewing the interpolated frames afterwards is always recommended, as automated interpolation can drift on longer sequences.

#5 Set up a feedback system

Annotators need feedback from domain experts and machine learning engineers to know whether they're labeling correctly. Likewise, any updates to labeling requirements must be communicated to the entire team. Good data annotation software is usually equipped with a feedback mechanism that streamlines this communication.

An effective feedback loop is bidirectional. Reviewers should be able to flag specific frames or objects with comments that annotators can act on directly within the tool. Equally, annotators should have a clear channel to raise ambiguous cases or request clarification on guidelines. Closing this loop quickly prevents small misunderstandings from compounding across thousands of frames.

#6 Import shorter videos

Long video footage clogs up bandwidth when you upload it to an online annotation tool. If you don't want to spend hours waiting for a video to load, break it into smaller clips, preferably under one minute each. Shorter segments also have workflow benefits beyond upload speed: they make it easier to assign discrete chunks of work to individual annotators, track progress at a granular level, and isolate quality issues to a specific segment.

Try Your Hand at Annotating Videos Today

As we've explored, video annotation is the critical engine driving innovation across every major industry, from autonomous transit and smart cities to life-saving medical AI. But while the impact of high-quality data is undeniable, the challenges of managing massive datasets and ensuring pixel-perfect consistency are very real hurdles for any development team. Successfully navigating these technical demands requires a robust infrastructure that can bridge the gap between raw footage and a deployment-ready model.
CVAT is designed to provide exactly this foundation, transforming a laborious manual process into a high-speed, high-accuracy production engine.

Want to try it for yourself? CVAT Online works in your browser without installing or managing infrastructure. The hosted platform supports 2D images, videos, and 3D point clouds, so your team can begin annotating right away.

For teams running annotation at scale, CVAT Enterprise adds dedicated support, enterprise security options such as SSO/LDAP, and collaboration and reporting features that help large production teams monitor quality and throughput.

For teams that need high-quality video datasets but don't want to build or manage the annotation pipeline internally, CVAT Video Labeling Services offers a fully managed option. Our team handles video annotation workflows end to end, helping you produce consistent, production-ready training data for your specific use case.

Video Annotation FAQs

What is the difference between the video annotation process and video tagging?

While the terms are sometimes used interchangeably, video annotation is a more specific and technical process than video tagging. Video tagging generally refers to adding descriptive keywords or labels to an entire video, while video annotation involves labeling individual objects, actions, or events within the video on a frame-by-frame basis.

How much does video annotation cost?

The cost of video annotation varies widely depending on factors such as the length and complexity of the video, the type of annotation required, the level of accuracy needed, the amount of footage, and the cost of labor.

What is the best software for video annotation?

The best software for video annotation depends on your specific needs and budget. For individuals and small teams, open-source annotation tools like CVAT Community can be a great option.
For larger teams and enterprise projects, CVAT Enterprise offers a self-hosted platform and advanced support.

How can I ensure the quality of my video annotations?

Ensuring the quality of your video annotations requires a multi-faceted approach. This includes providing clear, detailed labeling instructions, implementing a multi-level review process, and using an annotation platform with built-in quality control features. It is also important to track key quality metrics, such as inter-annotator agreement and label accuracy.
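In its simplest form, inter-annotator agreement can be computed as the fraction of items two annotators label identically. Chance-corrected measures such as Cohen's kappa are more robust in practice, but the sketch below (with made-up labels) shows the basic idea:

```python
def percent_agreement(labels_a, labels_b):
    """Fraction of items on which two annotators assigned the same label."""
    assert len(labels_a) == len(labels_b), "annotators must label the same items"
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Hypothetical labels from two annotators on the same five objects.
annotator_1 = ["car", "car", "pedestrian", "cyclist", "car"]
annotator_2 = ["car", "truck", "pedestrian", "cyclist", "car"]
print(percent_agreement(annotator_1, annotator_2))  # -> 0.8
```

Tracking this number over time, per label class, quickly surfaces the categories where your guidelines are ambiguous.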
Annotation 101
March 31, 2026
The Ultimate Guide to Video Annotation for Computer Vision (2026)