CVAT SDK PyTorch adapter: using CVAT datasets in your ML pipeline


CVAT is a visual data annotation tool. Using it, you can take a set of images and mark them up with annotations that either classify each image as a whole, or locate specific objects on the image.

But let’s suppose you’ve already done that. What now?

Datasets are, of course, not annotated just for the fun of it. The eventual goal is to use them in machine learning, for either training an ML model, or evaluating its performance. And in order to do that, you have to get the annotations out of CVAT and into your machine learning framework.

Previously, the only way to do that was the following:

1. Export your CVAT project or task in one of the several dataset formats supported by CVAT.

2. Write code to read the annotations in the selected format and convert them into data structures suitable for your ML framework.

This approach is certainly workable, but it does have several drawbacks:

  • The third-party dataset formats supported by CVAT cannot necessarily represent all information that CVAT datasets may contain. Therefore, some information can be lost when annotations are exported in such formats. For example, CVAT supports ellipse-shaped objects, while the COCO format does not. So when a dataset is exported into the COCO format, ellipses are converted into masks, and information about the shape is lost.
  • Even when a format can store the necessary information, it may not be convenient to deal with. For example, in the COCO format, annotations are saved as JSON files. While it is easy to load a generic JSON file, data loaded in this way will not have static type information, so features like code completion and type checking will not be available.
  • Dataset exporting can be a lengthy process, because the server has to convert all annotations (and images, if requested) into the new format. If the server is busy with other tasks, you may end up waiting a long time.
  • If the dataset is updated on the server, you have to remember to re-export it. Otherwise, your ML pipeline will operate on stale data.
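To make the second drawback concrete, here is a small sketch (not CVAT code) that contrasts a COCO-style annotation loaded as generic JSON with the same record wrapped in a typed structure. The BoxAnnotation class, the parse_annotation helper and the sample record are made up for illustration.

```python
import json
from dataclasses import dataclass

# A tiny COCO-style annotation record, invented for this example.
RAW = '{"image_id": 7, "category_id": 3, "bbox": [10.0, 20.0, 30.0, 40.0]}'

# Generic loading yields a plain dict. A typo such as record["bboxx"] is only
# caught at run time, and editors cannot complete the key names.
record = json.loads(RAW)


@dataclass(frozen=True)
class BoxAnnotation:
    """Hypothetical typed wrapper; the field names mirror the record above."""
    image_id: int
    category_id: int
    bbox: tuple  # (x, y, width, height), following the COCO convention


def parse_annotation(data: dict) -> BoxAnnotation:
    # Convert once at the boundary, so that the rest of the pipeline works
    # with attributes that type checkers and editors understand.
    return BoxAnnotation(
        image_id=int(data["image_id"]),
        category_id=int(data["category_id"]),
        bbox=tuple(float(v) for v in data["bbox"]),
    )


annotation = parse_annotation(record)
```

Writing and maintaining this kind of wrapper is exactly the extra work that an intermediate format imposes.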

All of these problems stem from one fundamental source: the use of an intermediate representation. If we could somehow use data directly from the server, they would be eliminated.

So, in CVAT SDK 2.3.0, we introduced a new feature that will, for some use cases, implement exactly that. This feature is the cvat_sdk.pytorch module, also informally known as the PyTorch adapter. The functionality in this module allows you to directly use a CVAT project or task as a PyTorch-compatible dataset.

Let’s play with it and see how it works.


First, let’s create a Python environment and install CVAT SDK. To use the PyTorch adapter, we’ll install the SDK with the pytorch extra, which pulls in PyTorch and torchvision as dependencies. We won’t be using GPUs, so we’ll get the CPU-only build of PyTorch to save download time.
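The steps above could be scripted roughly as follows. This is only a sketch: it assumes a POSIX system, the environment directory name venv is arbitrary, the helper names are hypothetical, and the extra index URL is PyTorch’s package index for CPU-only builds. The equivalent python -m venv and pip install commands typed into a shell achieve the same result.

```python
import subprocess
import venv
from pathlib import Path

# PyTorch's package index for CPU-only builds.
CPU_WHEEL_INDEX = "https://download.pytorch.org/whl/cpu"


def pip_install_command(python):
    """Build the pip command that installs CVAT SDK with the pytorch extra."""
    return [
        str(python), "-m", "pip", "install",
        "cvat_sdk[pytorch]",
        "--extra-index-url", CPU_WHEEL_INDEX,
    ]


def create_environment(env_dir="venv"):
    """Create a virtual environment and install the SDK into it."""
    env_dir = Path(env_dir)
    venv.create(env_dir, with_pip=True)
    # Assumption: POSIX layout, where the interpreter lives in bin/python
    # (on Windows it would be Scripts/python.exe instead).
    subprocess.run(pip_install_command(env_dir / "bin" / "python"), check=True)
```

Calling create_environment() performs the download, so it is left for you to invoke once you are happy with the target directory.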

Now we will need a dataset. Normally, you would use the PyTorch adapter with your own annotated dataset that you already have in CVAT, but for demonstration purposes we’ll use a small public dataset instead.

To follow along, you will need an account on the public CVAT instance. If you have access to a private CVAT instance, you can use that instead. Save your CVAT credentials in environment variables so that CVAT SDK can authenticate itself:

The dataset we’ll be using is the Flowers Dataset available in the Harvard Dataverse Repository. This dataset is in an ad-hoc format, so we won’t be able to directly import it into CVAT. Instead, we’ll upload it using a custom script. We won’t need the entire dataset for this demonstration, so the script will also reduce it to a small fraction.

Get that script from our blog repository and run it:

The script will create tasks for the train, test and validation subsets, and print their IDs. If you open the Tasks page, you will see that the tasks have indeed been created:

And if you open any of these tasks and click the “Job #XXXX” link near the bottom, you will see that each image has a single annotation associated with it: a tag representing the type of the flower.

Interactive usage

Note: the code snippets from this section are also available as a Jupyter Notebook.

We’re now ready to try the PyTorch adapter. Let’s start Python and create a CVAT API client:
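A minimal sketch of this step is shown below. The environment variable names CVAT_HOST, CVAT_USER and CVAT_PASS, the default host app.cvat.ai and both helper functions are assumptions made for the example; substitute the names under which you saved your own credentials. The only SDK call used is make_client.

```python
import logging
import os


def read_cvat_settings(env=None):
    """Return (host, (user, password)) taken from environment variables.

    The variable names and the default host are assumptions of this sketch.
    """
    env = os.environ if env is None else env
    host = env.get("CVAT_HOST", "app.cvat.ai")
    user, password = env.get("CVAT_USER"), env.get("CVAT_PASS")
    if not user or not password:
        raise RuntimeError("CVAT_USER and CVAT_PASS must be set")
    return host, (user, password)


def connect(env=None):
    """Create a CVAT API client from the settings above."""
    # Imported lazily so that read_cvat_settings() works without the SDK.
    from cvat_sdk import make_client

    # Show informational log messages, including those emitted by the SDK.
    logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")
    host, credentials = read_cvat_settings(env)
    return make_client(host, credentials=credentials)
```

In an interactive session you would then write client = connect() and keep the client around for the following steps.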

Now let’s create a dataset object corresponding to our training set. To follow along, you will need to substitute the task ID in the first line with the ID of the Flowers-train task that was printed when you ran the script.

As you can see from the log, the SDK has downloaded the data and annotations for our task from the server. All subsequent operations on train_set will not involve network access.

But what is train_set, anyway? Examining it will reveal that it is a PyTorch Dataset object. Therefore we can query the number of samples in it and index it to retrieve individual samples.
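The two steps just described (creating the dataset, then querying it) might look like the sketch below. TaskVisionDataset is the adapter’s dataset class for a single task; the helper names open_task_dataset and describe are hypothetical, and the task ID is whatever the upload script printed for you.

```python
def open_task_dataset(client, task_id):
    """Build a PyTorch-compatible dataset from a CVAT task."""
    # Imported lazily so that this module can be loaded without the SDK.
    from cvat_sdk.pytorch import TaskVisionDataset

    # Downloads the task's data and annotations, or reuses the cached copy.
    return TaskVisionDataset(client, task_id)


def describe(dataset, index=0):
    """Return the sample count and the types of one (image, target) pair."""
    image, target = dataset[index]
    return len(dataset), type(image).__name__, type(target).__name__
```

For example, describe(open_task_dataset(client, task_id)) reports how many samples the task contains and what kind of objects a sample is made of. Because describe only relies on len() and indexing, it works with any PyTorch map-style dataset.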

The sample format is broadly compatible with that used by torchvision datasets. Each sample is a tuple of two elements:

  1. The first element is a PIL.Image object.
  2. The second element is a cvat_sdk.pytorch.Target object representing the annotations corresponding to the image, as well as some associated data.

The annotations in the Target object are instances of the LabeledImage and LabeledShape classes from the CVAT SDK, which are direct representations of CVAT’s own data structures. This means that any properties you can set on annotations in CVAT (such as attributes and group IDs) are available for use in your code.

In this case, though, we don’t need all this flexibility. After all, the only information contained in the original dataset is a single class label for each image. To serve such simple scenarios, CVAT SDK provides a couple of transforms that reduce the target part of the sample to a simpler data structure. For this scenario (image classification with one tag per image), the transform is called ExtractSingleLabelIndex. Let’s recreate the dataset with this transform applied:
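A minimal sketch of recreating the dataset with the transform, assuming it is passed through the torchvision-style target_transform argument. The wrapper function name is hypothetical.

```python
def open_classification_set(client, task_id):
    """Open a task so that every target is reduced to a single label index."""
    # Imported lazily so that this module can be loaded without the SDK.
    from cvat_sdk.pytorch import ExtractSingleLabelIndex, TaskVisionDataset

    return TaskVisionDataset(
        client,
        task_id,
        # Applied to the target half of each (image, target) sample.
        target_transform=ExtractSingleLabelIndex(),
    )
```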

Note that the task data was not downloaded again, as it had already been cached. The SDK only made one query to the CVAT server, in order to see if the task had changed.

Here’s what the sample targets look like with the transform configured:

Each target is now simply a 0-dimensional PyTorch tensor containing the label index. These indices are automatically assigned by the SDK. You can also use these indices without applying the transform; they are provided by the label_id_to_index field on the Target objects.

ExtractSingleLabelIndex requires each sample to have a single tag. If a sample fails this requirement, the transform will raise an exception when that sample is retrieved.
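To illustrate the contract described above, here is a plain-Python sketch of the same lookup done by hand on an untransformed target. It is not the SDK’s implementation: it returns an ordinary int rather than a tensor, and the stand-in object only mimics the fields mentioned in the text (the tags of the annotations, and label_id_to_index). The label IDs are made up.

```python
from types import SimpleNamespace


def single_label_index(target):
    """Map the single tag of a target to its label index."""
    tags = target.annotations.tags
    if len(tags) != 1:
        # Mirrors the requirement that each sample carries exactly one tag.
        raise ValueError(f"expected exactly one tag, found {len(tags)}")
    return target.label_id_to_index[tags[0].label_id]


# Stand-in for a Target object, with invented server-side label IDs.
example = SimpleNamespace(
    annotations=SimpleNamespace(tags=[SimpleNamespace(label_id=205)], shapes=[]),
    label_id_to_index={203: 0, 205: 1},
)
index = single_label_index(example)
```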

Our dataset is now almost ready to be used for model training, except that we’ll also need to transform the image, as PyTorch cannot directly accept a PIL image as input. torchvision supplies a variety of transforms to convert and postprocess images, which can be applied using the transform argument. For example:
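As one possible illustration, the sketch below uses torchvision’s ToTensor transform, which converts a PIL image into a floating-point tensor. The choice of ToTensor and the wrapper function name are assumptions of this example; any torchvision transform that accepts a PIL image could be used in its place.

```python
def open_tensor_set(client, task_id):
    """Open a task with images as tensors and targets as label indices."""
    # Imported lazily so that this module can be loaded without the
    # SDK or torchvision being installed.
    from cvat_sdk.pytorch import ExtractSingleLabelIndex, TaskVisionDataset
    from torchvision import transforms

    return TaskVisionDataset(
        client,
        task_id,
        transform=transforms.ToTensor(),             # applied to the image
        target_transform=ExtractSingleLabelIndex(),  # applied to the target
    )
```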

Full model training & evaluation example

Equipped with the functionality that we just covered, we can now plug a CVAT dataset into a PyTorch training/evaluation pipeline and have it work the same way it would with any other dataset implementation.
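For orientation, such a pipeline might be wired up roughly as in the sketch below. This is not one of the published scripts: the resize to 224×224, the batch size, the optimizer settings and the num_classes parameter are all assumptions chosen to keep the example short.

```python
def train_one_epoch(model, loader, loss_fn, optimizer):
    """Run one pass over `loader` and return the mean loss per batch."""
    model.train()
    total, batches = 0.0, 0
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
        total += loss.item()
        batches += 1
    return total / max(batches, 1)


def train_on_task(client, task_id, num_classes, epochs=1):
    """Train a ResNet-34 classifier on a CVAT task with one tag per image."""
    # Imported lazily so that train_one_epoch() stays usable on its own.
    import torch
    from torch.utils.data import DataLoader
    from torchvision import models, transforms
    from cvat_sdk.pytorch import ExtractSingleLabelIndex, TaskVisionDataset

    preprocess = transforms.Compose([
        transforms.Resize((224, 224)),  # assumed size, so images can be batched
        transforms.ToTensor(),
    ])
    train_set = TaskVisionDataset(
        client, task_id,
        transform=preprocess,
        target_transform=ExtractSingleLabelIndex(),
    )
    # The CVAT-backed dataset is consumed like any other PyTorch dataset.
    loader = DataLoader(train_set, batch_size=16, shuffle=True)
    model = models.resnet34(num_classes=num_classes)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        train_one_epoch(model, loader, loss_fn, optimizer)
    return model
```

Note that nothing in train_one_epoch is specific to CVAT, which is the point: once the dataset object exists, the rest of the pipeline does not need to know where the data came from.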

Programming an entire training pipeline in interactive mode is a bit cumbersome, so instead we published two example scripts that showcase using a CVAT dataset as part of a simple ML pipeline. You can get these scripts from our blog repository.

The first script trains a neural network (specifically ResNet-34, provided by torchvision) on our sample dataset (or any other dataset with a single tag per image). You run it by passing the training task ID as an argument:

It saves the resulting weights in a file named weights.pth. The evaluation script will read these weights back and evaluate the network on a validation subset, which you again specify via a CVAT task ID:

Since training involves randomness, you may end up seeing a slightly different accuracy number.

Working with objects

Note: the code snippets from this section are also available as a Jupyter Notebook.

The PyTorch adapter also contains a transform designed to simplify working with object detection datasets. First, let’s see how raw CVAT shapes are represented in the CVAT SDK.

Open the Flowers-train task, click on the “Job #XXX” link, open frame #2, and draw rectangles around some sunflowers:

Press “Save”. Now, restart Python and reinitialize the client:

Create the dataset again:

Note that since we have changed the task on the server, the SDK has redownloaded it.

Now let’s examine the frame that we modified:

You can see the newly-added rectangles listed in the shapes field. As before, the values representing the rectangles contain all the properties that are settable via CVAT.
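When a target contains many shapes, a short helper can condense the shapes field into something easier to read. The sketch below is hypothetical: it only touches the type, label_id and points properties of each shape, and the stand-in target with its label IDs and coordinates is invented for illustration.

```python
from types import SimpleNamespace


def summarize_shapes(target):
    """List (shape type, label index, points) for every shape in a target."""
    summary = []
    for shape in target.annotations.shapes:
        # The type may be an enum-like object exposing `.value`, or a plain
        # string as in the stand-in below; handle both.
        shape_type = getattr(shape.type, "value", shape.type)
        label_index = target.label_id_to_index[shape.label_id]
        summary.append((shape_type, label_index, list(shape.points)))
    return summary


# Stand-in for the target of an annotated frame.
example = SimpleNamespace(
    annotations=SimpleNamespace(
        tags=[],
        shapes=[
            SimpleNamespace(type="rectangle", label_id=205,
                            points=[10.0, 20.0, 110.0, 220.0]),
        ],
    ),
    label_id_to_index={203: 0, 205: 1},
)
summary = summarize_shapes(example)
```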

Still, if you’d prefer to work with a simpler representation, there’s a transform for you: ExtractBoundingBoxes.

The output of this transform is a dictionary with keys named “boxes” and “labels”, and tensor values. The same format is accepted by torchvision’s object detection models in training mode, as well as the mAP metric in torchmetrics. So if you want to use those components with CVAT, you can do so without additional conversion.
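The sketch below shows how the transform might be applied, together with a plain-Python approximation of the structure it produces. The include_shape_types argument is how we understand the transform to be configured; boxes_and_labels is only an illustration (it returns lists rather than tensors and is not the SDK’s code), and the stand-in target is invented.

```python
from types import SimpleNamespace


def open_detection_set(client, task_id):
    """Open a task so that every target becomes a boxes/labels dictionary."""
    # Imported lazily so that this module can be loaded without the SDK.
    from cvat_sdk.pytorch import ExtractBoundingBoxes, TaskVisionDataset

    return TaskVisionDataset(
        client, task_id,
        target_transform=ExtractBoundingBoxes(include_shape_types=["rectangle"]),
    )


def boxes_and_labels(target, shape_types=("rectangle",)):
    """Approximate the output format using plain lists."""
    boxes, labels = [], []
    for shape in target.annotations.shapes:
        if getattr(shape.type, "value", shape.type) not in shape_types:
            continue
        # Points are stored as a flat [x, y, x, y, ...] list.
        xs, ys = shape.points[0::2], shape.points[1::2]
        boxes.append([min(xs), min(ys), max(xs), max(ys)])  # x1, y1, x2, y2
        labels.append(target.label_id_to_index[shape.label_id])
    return {"boxes": boxes, "labels": labels}


# Stand-in target with one rectangle and one polygon (which is skipped).
example = SimpleNamespace(
    annotations=SimpleNamespace(tags=[], shapes=[
        SimpleNamespace(type="rectangle", label_id=205,
                        points=[10.0, 20.0, 110.0, 220.0]),
        SimpleNamespace(type="polygon", label_id=203,
                        points=[0.0, 0.0, 5.0, 0.0, 5.0, 5.0]),
    ]),
    label_id_to_index={203: 0, 205: 1},
)
result = boxes_and_labels(example)
```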

Closing remarks

The PyTorch adapter is still new, so it has some limitations. Most notably, it does not support track annotations or video-based datasets. Still, we hope that even in its early stages it can be useful to you.

Meanwhile, we are working on extending the functionality of the adapter. The development version of CVAT SDK already features the following additions:

  • A ProjectVisionDataset class that lets you combine multiple tasks in a CVAT project into a single dataset.
  • Ability to control the cache location.
  • Ability to disable network usage (provided that the dataset has already been cached).

If you have suggestions for how the adapter may be improved, you’re welcome to create a feature request on CVAT's issue tracker.

February 14, 2023