Blog

Point cloud annotation is fundamentally more demanding and complex than 2D labeling. Annotators have to work in sparse, multi-view environments, constantly switch perspectives, and maintain spatial context while dealing with small or partially occluded objects. Our labeling services team knows better than anyone that even minor issues in navigation, zoom behavior, or camera stability can significantly slow down the workflow.That’s why, in our recent release, we’ve focused on practical updates that remove everyday friction and make 3D annotation faster, smoother, and more predictable.#1 Quick object focus with double-clickFinding and focusing on objects in 3D space used to require manual camera rotation and careful positioning. Now, a simple double-click on any object automatically centers the camera on it. This seemingly small change dramatically speeds up annotation, especially when working with smaller objects that are difficult to locate manually. The feature works consistently across both 3D and 2D annotation tasks.#2 Consistent zoom behavior across modes Previously, zoom settings were inconsistent across different annotation modes, requiring constant readjustment. We've standardized the Attribute annotation zoom margin setting to work uniformly across all annotation modes, including 3D side view projections. Even better, zoom levels now persist when switching between objects—no more losing your carefully adjusted view every time you move to the next annotation.#3 Stable and configurable control pointsControl points are essential for precise annotation, but they were behaving erratically in 3D views. They would change size, become transparent, or scale incorrectly with camera distance. We've fixed this completely: the Control point size setting now works properly in 3D, and control points maintain consistent, readable size regardless of camera position or zoom level.#4 Camera stability across projections and view changesSwitching between the main view and side views used to reset the camera position, forcing annotators to re-navigate each time they wanted to check an object from different angles. The camera now maintains its position when moving between projections, allowing smooth multi-angle verification without repetitive navigation.#5 Improved UI readabilityWe've improved the contrast and readability of contextual menus and reduced transparency in areas where it was obscuring text. Small visual improvements like these reduce eye strain and cognitive load during long annotation sessions.#6 Improved zoom limits and touchpad supportWe've increased the maximum zoom level and completely reworked the zoom algorithm to work properly with trackpads. Navigation is now smoother and more predictable, particularly on laptops where trackpad gestures are the primary input method.These updates are available across all CVAT editions: CVAT Online, Enterprise, and Community. Improvements like these come directly from our hands-on work with real customer data and have immediate impact on annotation speed and quality, especially for complex workflows like 3D and point cloud annotation. If you work with point cloud data, we hope these updates will streamline your labeling workflow. Have feedback or suggestions? We'd love to hear from you! Contact our team via HelpDesk or submit an issue on GitHub.

Product Updates

December 17, 2025

Data Annotation for Robotics AI: Unique Challenges, Key Methods, and Best Practices

Data annotation serves as the foundation of every successful machine learning project, because without accurately labeled datasets, even the most advanced AI models cannot detect objects, classify images, or interpret text with real-world precision. To label and manage these datasets for AI and computer vision applications, data scientists and engineers have begun to rely on open-source data annotation tools. Unlike proprietary platforms that often lock users into closed ecosystems or restrict data access, open-source tools offer transparency, control, and the freedom to customize workflows, making them increasingly attractive for teams prioritizing privacy and long-term scalability. These annotation tools give developers the flexibility to handle everything from bounding box labeling in computer vision to Semantic Segmentation, OCR annotation, and complex Medical Imaging workflows. With so many tools available though, choosing the right one can be a bit tricky. How to Choose the Right Open Source Annotation Tool Each open-source annotation tool is different (from features, to usability, to documentation), and choosing the right one starts with understanding your project’s scope, data types, and workflow requirements. So, before selecting a platform, evaluate how well it aligns with your technical goals and the size of your annotation team. The following factors should guide your decision: Supported Data Types: Ensure the platform supports your required formats, such as images, videos, 3D point clouds, or text documents. A tool that handles multimodal data will save you from migrating later. Quality Control Tools: Look for built-in review features, annotation comparisons, and consensus scoring. Quality assurance prevents mislabeled data that can degrade model performance. Collaboration and Workflow Management: For larger teams, choose a data labeling platform with task assignment, role-based access, and progress tracking to streamline coordination. Automation and AI Assistance: AI-assisted labeling and auto labeling reduce manual effort by pre-labeling data with AI tools and models like Mask R-CNN or Faster R-CNN. This accelerates annotation and helps scale to enterprise workloads. Dataset Compatibility and Integration: A tool that integrates with AWS S3, Microsoft Azure, or TensorFlow OD API allows seamless movement of annotation data between your storage, model training, and MLOps stack. Data scientists and machine learning teams should test a few open-source platforms to see which supports their annotation workflows most efficiently, ensuring consistent data quality and faster model training cycles. The 6 Top Open Source Data Annotation Tools Compared There are many open-source data annotation tools available today, but not all are built for the same purpose. Some focus on simplicity and speed for quick labeling tasks, while others deliver advanced automation, collaboration, and dataset management for large machine learning workflows. To help you make an informed decision, we have closely compared the top 6 open source data annotation tools below. By understanding the strengths and trade-offs of each, you can select the right platform to streamline your data labeling process and produce the high-quality datasets your AI models depend on. Tool Overview Key Features Best For Limitations CVAT (Computer Vision Annotation Tool) Advanced open-source tool built for high-precision computer vision projects; developed by Intel and maintained by CVAT.ai. - Supports bounding boxes, polygons, polylines, keypoints, and 3D cuboids. - AI-assisted labeling with Mask R-CNN, YOLO, SAM. Large-scale image, video, and LiDAR projects in autonomous driving, robotics, and medical imaging. Requires setup and server management; complex for beginners. Label Studio Multi-modal annotation platform by Heartex supporting image, text, audio, video, and time-series labeling. - Flexible interface configuration. - REST API and Python SDK. - Collaboration and review tools. Teams working on cross-domain projects combining computer vision, NLP, and audio. Complex setup for non-technical users; limited 3D support; some enterprise features paid. LabelMe MIT-developed, browser-based image annotation tool designed for simplicity and accessibility. - Polygon and bounding box tools. - Community-shared datasets. - Lightweight, quick to start. Academic, research, and educational projects. No AI-assisted labeling; limited scalability and data type support. Diffgram Enterprise-grade open-source data annotation and management platform built for large-scale, multi-modal AI workflows. - End-to-end dataset management with version control. - Supports images, videos, text, and 3D data. - AI-assisted and active learning labeling. Large AI teams needing automation, governance, and MLOps integration for scalable annotation pipelines. Requires server setup and technical management; may be overkill for small projects. Doccano Open-source text annotation tool built for NLP projects and language model training. - Sequence labeling, text classification, and NER. - Multi-user collaboration with roles. - Easy Docker-based deployment. NLP researchers and teams building datasets for sentiment analysis, chatbots, and translation models. Limited to text-based annotation; no model-assisted labeling or multi-modal support. WEBKNOSSOS Open-source 3D annotation and visualization platform originally built for connectomics and neuroscience research. - Handles terabyte-scale volumetric datasets efficiently. - 3D tracing and segmentation tools for cells and neurons. - Tile-based data streaming for large volumes. Neuroscience, biomedical imaging, and any project requiring high-resolution 3D segmentation and analysis. Interface designed for scientific use; limited support for general-purpose ML labeling formats. CVAT (Computer Vision Annotation Tool) CVAT is an open-source data annotation tool built for computer vision projects that require high precision and scalability. Developed by Intel and now maintained by CVAT.ai, it’s widely used by machine learning teams to prepare training data for object detection, image classification, and video annotation tasks. Key Features Comprehensive Annotation Support: Bounding boxes, polygons, polylines, keypoints, and 3D cuboids for LiDAR and point cloud data. AI-Assisted Labeling: Integrations with models like Mask R-CNN, YOLO, and SAM help automate labeling for faster dataset creation. Video & Object Tracking: Interpolation and object tracking simplify video annotation workflows. Dataset Management: Supports popular export formats like COCO, Pascal VOC, and YOLO. Collaboration & Storage: Multi-user projects with role-based access and direct links to AWS S3 or Azure Blob Storage. Use Cases CVAT is ideal for large-scale projects in autonomous driving, robotics, and military purposes. It supports both manual and semi-automated labeling, fitting seamlessly into MLOps and Active Learning pipelines. Pros Advanced automation and customization Supports multiple data types and formats Strong collaborative tools for teams Cons Requires setup and server maintenance What Users Say: “We have a dedicated annotation team within our company, comprising over 50 annotators. For the past four years, we have been using self-hosted CVAT, which has been functioning exceptionally well. Recently, we acquired a project that requires annotating approximately 1 million images and videos monthly. We tried various tools, such as Supervisely, Label Studio etc, especially for video annotation, but CVAT remains the best option.” Source. Label Studio Label Studio is an open-source data labeling platform developed by Heartex that supports text, image, audio, video, and time-series annotation. It stands out for its flexibility, allowing users to design custom labeling interfaces for different data types. This makes it ideal for teams working across multiple AI applications such as computer vision, NLP, and speech recognition. Key Features Multi-Modal Annotation: Supports labeling for text, images, videos, and audio within the same platform. Custom Interface Builder: Users can design annotation templates for specific workflows using a simple configuration format. Model-Assisted Labeling: Integrates with AI models to suggest pre-labels for human review, enabling active learning and faster project completion. API and SDK Integration: Offers REST API and Python SDK for automation, pipeline integration, and dataset export. Collaboration Tools: Teams can assign roles, review annotations, and track performance metrics. Use Cases Label Studio is used for text classification, sentiment analysis, Named Entity Recognition, document tagging, and multimodal research combining images and text. It is also useful in audio projects like transcription or sound event detection, supporting teams training speech and language models. Pros Works with many data types in one environment Highly customizable labeling interface Active learning and model integration capabilities Strong API for MLOps workflows Cons Configuration can be complex for new users Limited optimization for 3D or high-frame-rate video Some enterprise collaboration features are paid What Users Say: “I used label studio with a custom script to auto label data, manually corrected parts, retained the model, and repeated. It takes some work to learn the model API but it's free and works really well!” Source. LabelMe LabelMe is a long-standing open-source image annotation tool that remains one of the simplest and most accessible options for computer vision projects. Its web-based interface makes it easy for anyone to start labeling without complex setup, making it especially popular in academic and research environments focused on image classification and object detection. Key Features Web-Based Interface: No installation required, allowing immediate access and collaboration through a browser. Polygon and Bounding Box Tools: Designed for accurate segmentation and region-based labeling. Community Dataset Access: Users can contribute to and download from a large shared library of labeled images for training and benchmarking. Simple Data Export: Supports standard formats such as JSON and compatible outputs for training ML models. Lightweight Setup: Minimal system requirements and quick onboarding for teams or students. Use Cases LabelMe is widely used in education, research, and early-stage AI experiments. It is ideal for small to mid-sized datasets where efficiency and accessibility matter more than complex integrations. Common applications include image classification, semantic segmentation, and bounding box labeling for computer vision models. Pros Extremely easy to set up and use Ideal for quick projects and academic research Free and fully open-source with public dataset access Lightweight interface with minimal dependencies Cons Lacks advanced automation or AI-assisted labeling Limited support for video or 3D data Not suited for large enterprise-scale annotation projects What Users Say: “The tool is a lightweight graphical application with an intuitive user interface. It’s a fairly reliable app with a simple functionality for manual image labeling and for a wide range of computer vision tasks.” Source. Doccano Doccano is an open-source text annotation tool widely adopted in natural language processing (NLP) projects. It enables users to label data for tasks like sentiment analysis, named entity recognition (NER), and text classification through a simple, browser-based interface. Key Features Text-Centric Annotation: Supports sequence labeling, document classification, and span-based annotation. Collaborative Labeling: Multi-user support with role management for team projects. Flexible Export Formats: Outputs data in JSON, CSV, and fastText for seamless integration into NLP pipelines. Ease of Use: Simple to install and run via Docker; ideal for both developers and researchers. Language Support: Unicode-compatible, making it suitable for multilingual annotation tasks. Use Cases Doccano is best suited for NLP research teams, data scientists, and machine learning engineers labeling text datasets for chatbots, translation models, or AI-powered content moderation systems. Pros Purpose-built for NLP projects Intuitive, web-based interface Supports multiple export formats Lightweight and easy to deploy Cons Limited support for non-text data types Lacks advanced automation or model-assisted labeling What Users Say: “I used Doccano. Easy to setup with Docker compose. Kind of disliked that the only way to import data was from JSON, CSV or CoNLL format. Other than that, no issues. The UI is simple, it works fine. It's free.” Source. Diffgram Diffgram is a powerful open-source data annotation and management platform designed for production-scale machine learning workflows. It combines labeling, automation, and data governance in one unified system, making it suitable for enterprise-grade projects that require both flexibility and collaboration. Key Features End-to-End Data Pipeline: Handles dataset versioning, task management, and annotation tracking from a single dashboard. Multi-Modal Annotation: Supports image, video, text, and 3D data, with advanced tools for object detection, segmentation, and classification. AI-Assisted Labeling: Integrates with pretrained models for auto-labeling and supports active learning loops. Collaboration and Security: Offers role-based permissions, activity logs, and dataset audit trails for team-based annotation. Cloud and On-Prem Support: Works seamlessly with AWS, GCP, or self-hosted environments for secure data control. Use Cases Diffgram is ideal for AI and MLOps teams working on large, complex datasets where automation and version control are essential. It’s often used in autonomous driving, medical imaging, and industrial inspection where precise labeling and reproducibility are key. Pros Scalable for enterprise and research use Strong automation and AI integration Robust data governance and tracking Multi-user collaboration with granular controls Cons Requires setup and server infrastructure May be overkill for small or simple projects What Users Say: “Diffgram is hands down the best annotation tool I've ever worked with. I'm really impressed by the graphical output it provides, and their customer support is always quick and responsive whenever I need help” Source. WEBKNOSSOS WEBKNOSSOS is an open-source 3D annotation and visualization platform primarily developed for neuroscience research and volumetric data analysis. It allows users to explore, segment, and annotate large-scale 3D image datasets, such as brain scans or microscopy volumes, with precision and efficiency. Originally built to support connectomics projects, it has evolved into a flexible tool for any 3D labeling and reconstruction workflow. Key Features Scalable 3D Visualization: Designed to handle terabyte-scale volumetric datasets efficiently, enabling detailed navigation through dense 3D imagery. Annotation and Segmentation Tools: Provides intuitive tracing and labeling tools for neurons, cells, and other structures across 3D volumes. Tile-Based Data Management: Streams only the data needed for visualization, making it suitable for very large datasets stored remotely or locally. Cross-Platform Support: Runs on Windows, macOS, and Linux, with an interface optimized for both scientific and general 3D annotation tasks. Community and Extensibility: Open-source under GPL license with active contributions from the neuroscience and open data communities. Use Cases WEBKNOSSOS is widely used in connectomics, neuroimaging, and other fields requiring detailed 3D segmentation. Its ability to visualize dense biological structures at microscopic resolution makes it a preferred tool for labs mapping neural circuits or reconstructing biological tissue samples. Pros Handles extremely large 3D datasets efficiently Specialized for neuroscience and volumetric data Free and open-source with active community input Supports detailed tracing and cell segmentation workflows Cons Limited support for non-scientific annotation formats Interface may feel complex for general-purpose ML labeling What Users Say: “webKnossos has all the tools to immediately view and more importantly annotate (large) volume datasets already built-in. Any modification to annotations/segmentations made in webKnossos will show up in third-party tools.” Source. Emerging Trends in Open Source Data Annotation After the Scale AI data leak and subsequent investment by Meta, many organizations have begun reevaluating how they handle sensitive datasets. One way they are doing this is through open-source tools. According to Data Insight Markets, the current open-source data labeling market size is approximately $500 million in 2025, but will grow at a compound annual growth rate (CAGR) of 25% from 2025 to 2033, reaching approximately $2.7 billion by 2033. This clearly highlights how valuable these tools will be for both generative AI and agentic AI. Plus, open-source annotation software is evolving fast, and it’s not just about drawing boxes anymore. Today’s tools are smarter, more flexible, and ready to support complex AI workflows. For example, the integration of AI-assisted labeling powered by models like Segment Anything (SAM). With SAM, CVAT annotators can now generate segmentation masks or bounding boxes automatically, then refine them instead of drawing every shape manually. The pace of innovation doesn’t stop there. CVAT has also introduced an auto-annotation feature powered by Ultralytics YOLO models, expanding its toolkit for AI-assisted labeling. Through the new agentic integration, annotators can automatically detect and tag objects within images using pretrained YOLO weights, then refine the results alongside models like Segment Anything (SAM) for precise segmentation and bounding boxes. This blend of automation and human input has made labeling significantly faster, especially in complex datasets such as autonomous driving, and 3D point cloud annotation. Beyond this, there are other key trends emerging in open source data annotation. These include: Multimodal annotation support: tools handling images, video, text, audio, and 3D point clouds in the same platform Plugin ecosystems and custom modules: community-built extensions for domain needs (e.g., pathology annotation, geospatial overlays) Stronger dataset governance: versioning, audit logs, role permissions, and integration with cloud storage Active learning and pre-labeling loops: the system picks the hardest samples for human review to improve efficiency These changes make open-source annotation tools far more than “free alternatives.” They’re becoming core infrastructure for AI development, helping teams accelerate labeling, maintain quality, and scale data pipelines. Our Final Thoughts on Choosing the Right Tool for Your Needs Now that we've made it to the end of the article, it's time to share our key takeaways. Choosing the right data annotation tool comes down to knowing your goals and workflow. Each platform serves a different purpose, and the best one for you depends on how complex your datasets are, how much automation you need, and how your team collaborates. To keep things short and sweet, always ask these questions: Ease of setup: How quickly can you start annotating? Data type coverage: Does it support your images, text, audio, or 3D point clouds? Automation tools: Can AI help speed up repetitive tasks? Dataset management: How easily can you organize and export your labeled data? If you’re labeling text data, Doccano is one of the best open-source options. It’s built for tasks like text classification, sequence labeling, and sentiment analysis, making it ideal for NLP-focused projects. For image-based datasets, LabelMe offers a lightweight interface that works well for small or academic projects where setup speed and simplicity matter most. Lastly, CVAT and Label Studio are better suited for larger or multi-format projects. They support images, video, and point clouds, and include automation, AI-assisted labeling, and integrations with machine learning pipelines. These platforms are ideal for enterprise or research teams working across computer vision, medical imaging, or multimodal AI. If you want to experience how professional-grade open-source data labeling should feel, give CVAT’s Community edition a try today and see how it can simplify your next annotation project.

Industry Insights & Reviews

November 6, 2025

The 6 Best Open Source Data Annotation Tools in 2026

Welcome to the October edition of the CVAT Digest, your monthly roundup of the latest features, improvements, and fixes in Computer Vision Annotation Tool (CVAT).This month’s updates make CVAT smoother for teams working with large cloud-based datasets, 3D projects, and automation workflows. Whether you’re managing storage, creating tasks, or connecting external tools via API, you’ll notice a faster, more flexible, and more secure experience across the board.NewToken‑based Authentication Across the PlatformYou can now use API access tokens everywhere:Server API supports API access tokens.CLI accepts tokens via the CVAT_ACCESS_TOKEN environment variable.SDK can authenticate with an API token via login()/make_client().Create Tasks Without LabelsSpin up tasks first and finalize taxonomy later. Great for quick intake from cloud or bulk uploads.Cloud Dataset HandlingCVAT now reliably supports related images for both 2D and 3D tasks from cloud storage, and the Dataset Manifest tool handles 3D datasets across all supported layouts.Admin Control: Max Jobs per TaskSet a maximum number of jobs allowed per task to keep workloads tidy and predictable.Configurable Disk‑Usage Health CheckTune the threshold that triggers the disk‑usage health check to better fit your environment.UpdatesSecurity & OperationsRedis upgraded to address a reported CVE.FFmpeg 8.0 is now used for modern codec support.Helm compatibility restored with Kubernetes pre‑release versions; charts now pull images from the public repo.Clarified behavior: CVAT_ALLOW_STATIC_CACHE only affects new tasks (existing tasks keep their configured chunking).SDK & CLI Quality of LifeThe SDK now automatically retries certain transient server errors.SDK supports server URLs that include ports (e.g., https://example.com:8443).Performance ImprovementsFaster task creation from the cloud, even when you don’t have a manifest.Lower memory usage when counting objects in tracks during annotation updates and analytics.Manifest Requirements & ClarityFrame width and height are now required in dataset manifests (2D & 3D).Manifests can include an optional original_name field, and error messages at task creation are clearer.Documentation around supported layouts with related images has been improved.Cleanup and DeprecationsYou can no longer upgrade directly from releases prior to v2.0.0—plan a staged path through a 2.x release.Removed legacy, non‑functional API URL signing code.Upcoming change: overly broad filtering of files that merely contain the string related_images is deprecated. CVAT will filter only actual related‑image files according to the input layout.‍FixesExport integrity: Fixed an issue where tracks could leak between jobs on export.Cloud storage workflows: Bulk delete now removes all selected storages; project/task transfers retain the correct storage reference.Related images: Detection is more reliable across all supported layouts for both 2D and 3D media.UI stability & polish: Resolved a crash when loading annotation format metadata.Fixed a sporadic UI error related to reading points.Corrected model card clipping on small displays.Organization description updates now save correctly.Server resilience: The backend starts correctly even when analytics are disabled, and a packaging quirk that produced a misleading pkg_resources error message is handled.Have suggestions or requests for what you'd like to see next? Open an issue on GitHub.

Product Updates

October 30, 2025

CVAT Digest, October 2025: Smarter Cloud Workflows and Token-Based Automation

The success of every modern computer vision system relies on one thing: data. Specifically, it relies on computer vision datasets that are well-annotated, diverse, and representative of the real world. These datasets are the fuel that drives object detection, semantic segmentation, visual recognition, and other tasks in AI.But in 2026, computer vision is entering a new stage. The rise of generative adversarial networks, synthetic data pipelines, and 3D object detection has changed how teams think about data altogether. Systems are no longer trained on simple labeled images, they now rely on dynamic, multimodal datasets that capture texture, movement, and depth.That's why choosing the right dataset has become less about quantity and more about context, structure, and how well it mirrors the world your model is meant to understand.In this guide, we’ll break down the most influential and widely used computer vision datasets in 2026. Our goal is to compare them based on format, task coverage, relevance, and how well they support emerging use cases like autonomous driving, image captioning, scene recognition, and multimodal AI so that you can make an informed decision.Criteria for Choosing a Computer Vision DatasetChoosing the right computer vision dataset isn’t just about finding the largest collection of images. It’s about aligning the dataset with your task, architecture, and domain constraints.In our opinion, there are four core factors that determine how useful a dataset will be. Let’s walk through each one so you can make confident, well-informed decisions.Scale and StructureLarge datasets are essential for training deep learning models, but volume alone isn’t enough. A high-quality dataset should include:Well-balanced class distributionClearly defined training, validation, and test setsDetailed annotations like bounding boxes, image-level labels, or segmentation masksDatasets like COCO and Open Images V7 offer strong structure and multi-label annotations, making them effective for object detection and visual recognition tasks.Diversity and RealismDiversity improves generalization, and a model trained on narrow or biased data won’t perform well in production. That’s why we suggest you look for datasets with:Variation in environments, weather, lighting, and anglesRepresentation across different demographics, geographies, and object typesRealistic examples that match your deployment settingFor example, Cityscapes is known for capturing a wide range of urban driving scenarios, making it ideal for autonomous vehicles and pedestrian detection.Use Case FitThe dataset must support your specific application. A project focused on face verification requires different annotations than one focused on optical flow or handwriting recognition.Before committing to a dataset, check:Are the right annotations included? (e.g., segmentation masks, temporal data, point clouds)Does the format align with your tooling? (COCO JSON, Pascal VOC XML, TensorFlow TFRecords, etc.)Is the level of detail sufficient for your model type?The more aligned the dataset is with your use case, the less time you’ll spend converting formats or creating custom labels.Adoption and EcosystemA well-adopted dataset benefits from mature documentation, tooling support, and community contributions. When a dataset is widely used, it’s easier to integrate with frameworks like YOLO.Highly adopted datasets often come with:Active GitHub communitiesPrebuilt loaders and evaluation scriptsLong-term maintenance and version trackingHigh adoption also signals trust. If other teams are using the dataset for training ML models or benchmarking Vision AI systems, it’s more likely to fit into your pipeline without friction.Computer Vision Datasets ComparedEvery dataset plays a different role in how teams build, test, and refine machine learning models. Some focus on broad image classification, while others capture depth, motion, or real-world context for 3D object detection and scene understanding.The table below gives a quick overview of each dataset’s strengths and best uses.DatasetKey StrengthsBest Used ForImageNetOver 14 million labeled images across 21,000 categories. Strong benchmark for classification and transfer learning.Image classification, object recognition, pretraining ML models, face recognition.COCO (Common Objects in Context)330K+ images with detailed bounding boxes, segmentation masks, and captions. Context-rich, multi-object scenes.Object detection, instance segmentation, pedestrian detection, scene recognition, optical flow validation.Open Images Dataset (by Google)9M+ images with 600+ categories, 15M bounding boxes, and 2.8M segmentation masks. Cloud-scale and diverse.Large-scale model training, 3D object detection, object recognition, handwriting recognition, transfer learning.Pascal VOC20-class dataset with bounding boxes and segmentation masks in VOC XML format. Simple and lightweight.Model prototyping, educational projects, small-scale image segmentation and detection tests.LVISOver 1,000 fine-grained categories with long-tail coverage and 2M+ masks. COCO-compatible JSON format.Instance segmentation, fine-grained classification, long-tail recognition, rare-object detection.ADE20K25K+ images with pixel-level annotations for 150 categories, covering both “stuff” and “object” classes.Semantic segmentation, scene parsing, AR/VR model training, 3D face recognition, synthetic data validation (Unreal Engine).ImageNetImageNet is the cornerstone of modern computer vision. Introduced in 2009 by researchers at Princeton and Stanford, it provided the foundation for nearly every major breakthrough in deep learning and visual recognition over the last decade. Containing over 14 million labeled images across more than 21,000 categories, it became the standard benchmark for training and evaluating image classification models.Data FormatEach image in ImageNet is annotated with an image-level label corresponding to a WordNet hierarchy concept. The dataset also includes bounding boxes for over one million images, allowing it to support object detection and localization tasks. The files are typically organized by category folders, making them easily exportable for formats like COCO JSON, TensorFlow TFRecords, or Pascal VOC XML.Key FeaturesLarge-scale dataset covering diverse object categoriesHierarchical labeling system aligned with WordNetAvailability of both classification and detection subsetsSupported by nearly all modern frameworks (PyTorch, TensorFlow, MXNet)Used as a pretraining source for transfer learning in downstream tasksBest Use CasesPretraining for deep learning models in classification and object recognitionTransfer learning for custom datasets and domain adaptationBenchmarking model performance against established standardsFine-tuning tasks like scene recognition, face verification, or image captioningProsExtremely large and diverse datasetUniversally supported across frameworksStrong benchmark for visual recognition modelsEnables faster convergence during model trainingConsLacks domain-specific or multimodal annotationsSome images are outdated or low-resolutionLimited segmentation or 3D data supportLicensing restrictions for certain research usesCurrent RelevanceWhile newer datasets have emerged, ImageNet continues to hold immense value. Its influence is evident in how most Vision AI and generative model pipelines still begin with ImageNet pretraining. Even synthetic datasets are often validated against ImageNet accuracy benchmarks.ImageNet also continues to be cited in thousands of academic papers annually and has appeared in over 40,000 research papers and 250 patents, reflecting its ongoing importance across academia and industry.Even as models evolve toward multimodal and generative architectures, it continues to serve as the baseline reference for training, validation, and performance benchmarking in the field.COCO (Common Objects in Context)The COCO dataset, or Common Objects in Context, is one of the most widely used computer vision datasets for object detection, instance segmentation, and keypoint tracking. Released by Microsoft in 2014, it set a new benchmark for real-world image understanding by emphasizing the importance of context. Rather than focusing on isolated objects, COCO captures how multiple objects interact within complex scenes, making it far more representative of real-world environments.Data FormatCOCO contains over 330,000 images, with more than 1.5 million object instances labeled across 80 core categories. Each image is annotated using COCO JSON format, which supports detailed metadata including segmentation masks, keypoints, and bounding boxes. It also includes captions and labels for image captioning and visual relationship tasks, expanding its utility beyond detection.Key FeaturesRich annotations for object detection, keypoint estimation, and segmentationContext-driven images showing multiple overlapping objectsBuilt-in captions for image captioning and visual recognition tasksFine-grained instance segmentation masksBest Use CasesObject detection and instance segmentationImage captioning and visual question answeringKeypoint estimation and human pose detectionScene recognition and relationship modelingBenchmarking performance for Vision AI and autonomous driving modelsProsHigh-quality, richly annotated datasetComprehensive support for multiple vision tasksStrong compatibility with open-source pipelines and frameworksRemains a universal benchmark across research and industryConsLimited category set compared to datasets like LVIS or Open ImagesFocuses primarily on everyday objects, lacking domain-specific scenesComputationally demanding for model training due to annotation densityCurrent RelevanceAs of 2025, COCO remains one of the most cited and actively used computer vision datasets worldwide, appearing in over 60,000 academic papers in a single year. Its structured format, visual diversity, and consistent annotation standards make it an indispensable resource for anyone developing deep learning models in vision-related tasks.From YOLO and Faster R-CNN to newer architectures like SAM and Ultralytics YOLO11, nearly every major object detection and segmentation benchmark is measured on COCO.Open Images Dataset (by Google)The Open Images Dataset, developed by Google, is one of the largest and most comprehensive computer vision datasets available today. First released in 2016 and continually expanded through multiple versions, it was designed to bridge the gap between image-level classification and fine-grained object detection, segmentation, and visual relationship understanding. Its goal was to create a dataset that could support every stage of modern computer vision development from pretraining and model validation to object recognition and scene analysis.Data FormatThe dataset contains over 9 million images, each annotated with image-level labels and, for a subset, bounding boxes and segmentation masks. It supports a wide range of file formats, including COCO JSON and TensorFlow TFRecords, making it compatible with most ML frameworks. The Open Images V7 release added detailed object relationships, human pose annotations, and localized narratives for image captioning.Key FeaturesOver 600 object categories with bounding boxes for 15 million objects2.8 million instance segmentation masksImage-level labels for over 19,000 visual conceptsAnnotations for object relationships and human posesPublicly hosted through Google Cloud for large-scale accessBest Use CasesLarge-scale model training for image classification and object detectionInstance segmentation and visual relationship modelingBenchmarking Vision AI or multimodal model performanceData augmentation and transfer learning across multiple visual domainsProsMassive dataset with rich, multi-level annotationsCovers a wide range of visual categories and contextsExcellent interoperability with standard formats and frameworksSupported by cloud-hosted infrastructure and community toolsConsHigh storage and computational requirementsAnnotation inconsistencies in certain object categoriesLess suitable for domain-specific or specialized use casesSome subsets require Google authentication for accessCurrent RelevanceOpen Images continues to play a crucial role for teams developing large-scale AI and Vision AI pipelines. Its scale and variety make it ideal for training deep learning models that require high visual diversity and balanced label distribution. Because it integrates both instance segmentation and image-level labeling, it remains useful for general-purpose computer vision tasks and multimodal pretraining.As of 2025, it has appeared in over three thousands research papers and remains a reference point for deep learning and computer vision research across both academia and enterprise.Pascal VOCThe Pascal Visual Object Classes (VOC) dataset is one of the earliest and most influential benchmarks in computer vision. Released between 2005 and 2012 as part of the PASCAL Visual Object Challenge, it helped standardize how researchers evaluate tasks like object detection, classification, and segmentation. Although smaller in scale compared to modern datasets like COCO or Open Images, Pascal VOC remains a cornerstone for model benchmarking and algorithm development.Data FormatPascal VOC includes roughly 20 object categories across thousands of labeled images. Each file comes with annotations in the Pascal VOC XML format, which defines bounding boxes, segmentation masks, and image-level labels. It’s widely supported by frameworks such as TensorFlow, PyTorch, and Keras, and it remains a go-to dataset for educational and prototype-level projects due to its simplicity and accessibility.Key Features20 well-defined object classes for detection and segmentationClear annotation standards in XML formatIncludes both image classification and pixel-level segmentation tasksConsistent train, validation, and test splits for fair benchmarkingLightweight dataset size for fast experimentationBest Use CasesTraining and benchmarking small to medium-sized modelsEducational and academic computer vision researchModel prototyping and pretraining before large-scale deploymentObject detection and semantic segmentation experimentsProsEasy to download, interpret, and integrateLightweight, making it ideal for rapid testingCompatible with a wide range of frameworks and export formatsHistorically important for evaluating visual recognition systemsConsLimited scale and class diversityLacks the contextual depth of modern datasetsNo support for complex relationships or 3D objectsOutdated for large-scale model training and evaluationCurrent RelevanceDespite its age, Pascal VOC remains one of the most recognized names in the field. Its influence extends to nearly every major dataset released since, and its simple, structured annotations continue to teach new generations of data scientists the fundamentals of computer vision dataset design.It’s widely used in academic settings for introducing new architectures or validating lightweight models before scaling up to larger datasets like COCO. The Pascal VOC format also remains foundational, with many modern datasets, such as Cityscapes and Open Images that borrow its structure and export compatibility.While it may no longer set state-of-the-art benchmarks, its influence persists in transfer learning, model validation, and open-source frameworks. Many real-world projects still use Pascal VOC as a quick and reliable dataset for initial model training or small-scale proof-of-concept experiments.LVISThe Large Vocabulary Instance Segmentation (LVIS) dataset was introduced to address a critical limitation in earlier benchmarks like COCO: the lack of diversity and long-tail representation. Developed by researchers from Facebook AI Research, LVIS builds on the COCO dataset but dramatically expands the number of object categories and annotations, making it ideal for fine-grained object detection and instance segmentation.Data FormatLVIS includes over 1,000 object categories across approximately 160,000 images. Each image contains detailed instance segmentation masks, bounding boxes, and object-level annotations stored in JSON format. The dataset structure is COCO-compatible, allowing seamless use with the same APIs, frameworks, and annotation tools. It is also organized to capture both frequent and rare object classes, enabling balanced model training for long-tail distributions.Key FeaturesOver 1,200 object categories, from common to rare classesMore than 2 million segmentation masks with precise boundariesCompatibility with COCO APIs and annotationsInclusion of long-tail and fine-grained object categoriesDesigned to improve generalization in real-world visual recognition tasksBest Use CasesInstance segmentation and object detectionLong-tail recognition and fine-grained classificationTransfer learning and domain adaptation researchBenchmarking generalization and model robustnessProsLarge number of detailed object categoriesExcellent representation of rare and fine-grained classesCompatible with existing COCO tools and pipelinesIdeal for testing generalization and open-vocabulary modelsConsMore complex and computationally intensive than COCOImbalanced category distribution can complicate trainingLimited support for non-visual or 3D dataAnnotation errors may appear in rare classesCurrent RelevanceBy 2026, LVIS will be one of the most important datasets for training deep learning models that need to handle a wide variety of object types. It’s widely used for research in instance segmentation, open-vocabulary detection, and fine-tuning models for edge cases in autonomous vehicles, robotics, and scene understanding.LVIS’s structure also makes it particularly useful for transfer learning, as models trained on LVIS tend to perform better on datasets with rare or domain-specific objects. With the rise of open-vocabulary and multimodal AI systems, LVIS continues to be a standard dataset for evaluating how well models generalize beyond high-frequency object classes.ADE20KThe ADE20K dataset, short for the ADE (Annotated Dataset) for Scene Parsing, is one of the most comprehensive resources for semantic segmentation and scene understanding. Developed by MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), it focuses on parsing complex scenes with pixel-level precision. Unlike datasets centered on individual objects, ADE20K provides a holistic understanding of both foreground and background elements within an image.Data FormatADE20K contains over 25,000 images with detailed pixel-level annotations across 150 object and stuff categories. Every image is manually annotated to include all visible objects and regions, ensuring accurate scene segmentation. The dataset is distributed in a format compatible with COCO JSON and Pascal VOC XML, making it easy to integrate with popular frameworks for training deep learning models. It also includes pre-defined train, validation, and test splits for reproducibility and benchmarking.Key FeaturesPixel-level semantic segmentation for 150 object categoriesCovers both “object” and “stuff” classes for complete scene parsingManually annotated by trained professionals for precisionCompatible with major frameworks like TensorFlow, PyTorch, and Detectron2Frequently used for training segmentation models such as DeepLab and PSPNetBest Use CasesSemantic segmentation and scene parsingTraining and evaluation of segmentation and panoptic modelsBenchmarking transformer-based architectures for visual understandingValidation for synthetic or domain-adapted datasetsProsHigh-quality, dense annotations with strong accuracyBalanced coverage of object and environmental categoriesMaintained by a reputable academic research groupIdeal for developing and benchmarking segmentation modelsConsLimited dataset size compared to Open Images or COCOFocuses mainly on segmentation, with no instance or 3D dataComputationally heavy due to detailed pixel-level labelingCurrent RelevanceAs of 2025, ADE20K remains a top benchmark for semantic segmentation and scene recognition research. Its fine-grained annotations make it essential for developing models that must interpret complex, multi-object environments, particularly in fields like robotics, autonomous driving, and aerial image segmentation. Even as new datasets emerge, it continues to define what “high-quality segmentation data” looks like in the era of large-scale Vision AI and multimodal learning.Examples of Use Cases for Computer Vision DatasetsEvery team uses computer vision datasets differently. For some, it’s about training models that recognize products on a shelf. For others, it’s about helping a car see the road or a robot understand its surroundings. So to shed a bit more light on how they’re used, let’s look at some of the most common and emerging applications shaping the future of computer vision today.1. Image Classification and Object RecognitionThis remains the entry point for most computer vision systems. Datasets like ImageNet, COCO, and Open Images have become industry benchmarks for training models to recognize objects, people, and scenes in real-world contexts.These datasets are ideal for applications such as:Product recognition in retail and e-commerceVisual search and image taggingQuality inspection in manufacturingFace or gesture recognition in security systemsImageNet provides broad visual diversity for general classification, while COCO adds contextual depth through overlapping objects and captions. LVIS and Pascal VOC are excellent choices for refining recognition models on more detailed or long-tail object categories.2. Object Detection and Instance SegmentationFor models that need to locate and classify multiple objects in a single frame, COCO, LVIS, and Open Images remain the gold standard. These datasets feature dense annotations, segmentation masks, and bounding boxes that teach AI to interpret complex, multi-object environments.Key applications include:Autonomous retail checkout and shelf monitoringCrowd and pedestrian detectionIndustrial defect and anomaly detectionWildlife tracking and environmental monitoringLVIS extends COCO’s capabilities with over 1,000 categories, supporting fine-grained detection and rare-object recognition, while Open Images helps scale detection to millions of diverse scenes.3. Scene Understanding and Semantic SegmentationWhen the goal is to help AI understand the entire scene, datasets like ADE20K and Cityscapes are indispensable. They include pixel-level labels for every region of an image, allowing models to learn spatial relationships and contextual awareness.Use cases include:Smart city infrastructure and traffic analyticsAR/VR environment mappingInterior design and robotics navigationAerial and satellite imagery analysisADE20K covers both “stuff” (background) and “object” classes for scene parsing, while Cityscapes provides high-resolution street-level data ideal for autonomous vehicles.4. 3D Object Detection and Spatial MappingDatasets like KITTI, nuScenes, and Matterport3D power AI systems that must understand depth, motion, and geometry. These are critical for self-driving cars, drones, and robots operating in 3D space.They support tasks such as:LiDAR and sensor fusion for autonomous drivingDrone-based warehouse mappingRobotics path planning and obstacle detectionDepth estimation and 3D reconstructionKITTI and nuScenes combine LiDAR, radar, and camera data to train robust perception models, while Matterport3D provides high-quality 3D scans for indoor spatial analysis.5. Medical Imaging and Healthcare AIIn medical data annotation, computer vision datasets help accelerate diagnosis and automate complex visual analysis. Datasets such as LIDC-IDRI, BraTS, and CheXpert provide expertly labeled scans across radiology and pathology disciplines.Common applications include:Tumor segmentation and lesion detection3D organ modeling and reconstructionDisease classification and triage systemsAutomated medical image review and workflow optimizationThese datasets mirror the structure of segmentation datasets like ADE20K, but focus on medical-specific modalities such as CT and MRI.What You Need to Know to Choose the Right Dataset for Your ProjectEvery great computer vision model starts with a decision: what data should it learn from? And that choice matters more than most people realize, as the dataset shapes how your model sees the world, what it pays attention to, and how well it performs once it faces the real thing.If you’re building something practical, start with the classics. ImageNet and COCO are perfect for object recognition, pedestrian detection, or face recognition projects where variety and accuracy matter. But as models grow more specialized, many teams are moving beyond general-purpose datasets to ones built for specific challenges, like Open Images V7 for large-scale training, KITTI for 3D object detection, or ADE20K for scene understanding. And for projects where no ready-made dataset quite fits, the next step is to collect and label their own data. That’s where CVAT can really make a difference.With CVAT, teams can turn raw data into structured, ready-to-train datasets tailored to their exact use case. You can upload images or videos, organize them into datasets, and apply consistent, high-quality annotations using tools like bounding boxes, polygons, segmentation masks, and keypoints. Once complete, datasets can be exported in formats compatible with TensorFlow, PyTorch, and other ML frameworks, making it easy to move from data preparation to model training without friction.If you’re ready to start building, CVAT gives you everything you need to manage, label, and refine your datasets in one place. Use CVAT Online if you prefer a managed cloud solution with no setup required, offering access to advanced labeling and automation features.Set up CVAT Community if you want a self-hosted, open-source version that provides full control and customization.Or, choose CVAT Enterprise if your organization needs a secure, scalable, feature-rich, self-hosted solution with professional support and tailored integrations.

Industry Insights & Reviews

October 29, 2025

The Most Popular Datasets for Computer Vision Applications in 2026

Announcing the new Ultralytics YOLO support for automatic annotation via CVAT agents. Powerful computer vision libraries such as Ultralytics YOLO, Detectron2, and MMDetection have made it easier to train high-performing models for a wide variety of tasks. However, using these models for automated annotation often requires custom code, format conversions, and one-off integrations, especially when labeling workflows span multiple tasks. As a result, many teams fall back on manual labeling because they find automation too complex to adopt at scale. Ultralytics YOLO is one of the most widely used model families in the computer vision community. Until now, CVAT included a single built-in YOLO model for auto-annotation, but expanding beyond that required manual setup. That's why we're excited to announce our new integration with Ultralytics YOLO via the CVAT AI annotation agent. Introducing the new Ultralytics YOLO and CVAT integration With this new integration, you can use native Ultralytics models (YOLOv5, YOLOv8, YOLO11) and third-party YOLO models with Ultralytics compatibility (YOLOv7, YOLOv10, etc.) for automatic image or video annotation for a wide range of computer vision tasks, including: Classification Object detection Instance segmentation Oriented object detection Pose estimation Just pick a YOLO model you want to label your dataset with, connect it to CVAT via the agent, run the agent, and get fully labeled frames or even entire datasets, complete with the right shapes and attributes, and all, in a fraction of the time. Annotation possibilities unlocked This integration opens up multiple workflow optimization and automation opportunities for ML and AI teams. Here are just a few. Pre-label data using the right model for the task Connect the YOLO models that match your annotation goals and run them sequentially to pre-label your data. Each model can be triggered individually through the CVAT interface, allowing you to generate different types of labels for the same dataset without custom scripts or external tools. This works for any YOLO model, out-of-the-box or fine-tuned. Label entire tasks in bulk Working with a large dataset? You don’t have to annotate each frame manually. Apply a YOLO model to the entire task in one step. Just open the Actions menu in your task and select Automatic annotation. CVAT will send the job to the agent and automatically annotate all frames across all jobs in a task, saving you time and reducing repetitive work. Share models across teams and projects Register a model once via a native function and agent, and make it instantly available across your organization in CVAT. Team members can use it in their own tasks without any local setup. Validate model performance on real data Test your fine-tuned YOLO model directly on annotated datasets and compare its predictions side-by-side with human labels in CVAT. Spot mismatches, edge cases, or underperforming classes, all without leaving your annotation environment. How it works Here’s what a typical YOLO auto-annotation setup via agents looks like: Step 1. Write and register the function Start by implementing a native function–a Python script that loads your YOLO model (e.g., yolov8n, yolo11m-seg) and defines how predictions will be generated and returned to CVAT. Then register this function in CVAT using the CLI. Note: You can reuse the same native function both in CLI-based annotation and agent-based mode. Step 2. Start the agent Once the function is registered, launch an agent using the CLI command. This starts a local service that automatically connects to your account in CVAT Online or Enterprise, and listens for annotation requests from CVAT. The agent then runs the model (inside your function), generates predictions, and sends them back to CVAT. For more in-depth information about how to set up automated data annotation with a YOLO or any custom model using a CVAT AI agent, read this article. Step 3. Create or select a task in CVAT Log into your CVAT instance and create a new task (or select an existing one). Upload your images or video, and define the labels you want to annotate (e.g., "person", "car", "helmet"). Depending on your use case, you can define different types of labels such as bounding boxes, polygons, or skeletons to match the expected output from your model. Step 4. Choose the model in the UI Once your task and the job are created and the agent is running, go to the AI Tools panel inside your job. Select the Detector tab and the YOLO model you registered earlier. Step 5. Run AI annotation on selected frames After selecting the model, CVAT sends a request to the running agent. The agent runs the model and returns predictions in the form of shapes (e.g., boxes, polygons, or keypoints), each associated with a label ID. Get started now Ready to speed up your annotation workflow with YOLO? Sign in to your CVAT Online account and try it out yourself. For more information about Ultralytics YOLO models and the tasks they support, check the Ultralytics documentation page. For more information about CVAT AI annotation agents, visit Announcing CVAT AI Agents: A New (and Better) Way to Automate Data Annotation using Your Own Models

Product Updates

October 23, 2025

CVAT Integrates Ultralytics YOLO Models, Unlocking Scalable Auto-Annotation for ML Teams

Industry Insights & Reviews

October 6, 2025

The 10 Biggest AI & Computer Vision Conferences in 2026

Product Updates

September 30, 2025

CVAT Digest, 2025 Wrap Up: The Biggest Product Releases and Milestones of the Year

Save Time,
Annotate Better

Subscribe to the CVAT Newsletter

Product & Services

Company

Resources

Blog

CVAT Digest, 2025 Wrap Up: The Biggest Product Releases and Milestones of the Year

Point Cloud Annotation Updates: Faster Navigation, Better Stability, Cleaner UX

The Critical Role of Data Annotation for Autonomous Vehicles

How Data Annotation is Powering the Next Wave of Agriculture

CVAT Digest, November 2025: Easier K8s Installs, Safer Access, Leaner Backups

Introducing Personal Access Tokens: A More Secure Way to Work with CVAT API

Data Annotation for Robotics AI: Unique Challenges, Key Methods, and Best Practices

The 6 Best Open Source Data Annotation Tools in 2026

CVAT Digest, October 2025: Smarter Cloud Workflows and Token-Based Automation

The Most Popular Datasets for Computer Vision Applications in 2026

CVAT Integrates Ultralytics YOLO Models, Unlocking Scalable Auto-Annotation for ML Teams

The 10 Biggest AI & Computer Vision Conferences in 2026

CVAT Digest, September 2025: Bulk Actions, Task Moves for Organizations, Flexible Cloud Storage, and More

Save Time, Annotate Better

Subscribe to the CVAT Newsletter

Product & Services

Company

Resources

Save Time,
Annotate Better