Image Annotation For Advanced Computer Vision

August 28, 2023

The domain of image annotation is as vast and as old as data science itself. Indeed, one of the earliest projects in the AI space involved interpreting and annotating line drawings. In recent times, however, the focus has shifted substantially, thanks to the advent of Big Data and real-world applications of computer vision such as autonomous vehicles, facial recognition, augmented reality, and surveillance.

Teaching a computer how to see is no easy task. A machine learning model for computer vision must be trained on a varied set of annotated images before it can recognize their contents and produce meaningful, accurate predictions.

Models can produce various kinds of outputs: predicting whether or not an object is present in an image, drawing a rectangular box around an object (commonly called a bounding box), or even creating a mask over objects with pixel-level accuracy. Each of these outputs requires training data annotated in the same form, so that the model can learn to produce it independently and accurately.

This blog explores different ways to prepare training datasets with annotated images while understanding the most common labeling tasks.

Types of Labeling Tasks

Before delving deep into image annotation, let’s explore various tasks that image processing systems can perform.


Classification

This type of task checks whether a given property exists in an image or a condition is satisfied. In other words, it assigns an image to one of a set of predetermined categories based on its contents. Classification usually answers a question such as, “Does the image contain a bird?”

Object Detection

Object detection takes classification one step further by determining not only whether an object is present but also where it is. It finds the instance(s) of the objects within an image and returns coordinates that indicate each one’s position. Building on the previous classification question, detection asks, for example, “Where is the bird in the image?”

Image Segmentation

In segmentation, the machine learning model breaks the image down into smaller components. There are two main kinds of things a model can label. First, the model can assign a label to a specific entity, such as a person, a car, or a boat, which has delineated boundaries and is countable. Second, the model can label areas that are not countable and may not have rigid boundaries, such as sky, water, land, or groups of people.

What is commonly called Instance Segmentation is the task of identifying entities together with every pixel that belongs to them, so that each segment captures the entity’s shape. Each instance is labeled separately, even when several instances belong to the same class.

Semantic Segmentation, on the other hand, assigns a class label to every pixel, covering both entities and areas. Most importantly, it does not differentiate between different occurrences of the same object.

Fig. 1. Left, Semantic Segmentation. Right, Instance Segmentation. Source.
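The difference between the two kinds of mask can be made concrete with a small sketch. The tiny 4x6 "image", the instance ids, and the class ids below are all made up for illustration, and the helper names are hypothetical rather than taken from any particular library. The sketch assumes an instance mask that stores one id per object and shows how collapsing those ids into class ids yields a semantic mask.

```python
# Instance mask: 0 is background; each positive id marks one distinct object.
# Hypothetical 4x6 image containing two birds (ids 1 and 2).
instance_mask = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 0, 0],
]

# Assumed lookup tables: instance id -> class name, class name -> class id.
instance_classes = {1: "bird", 2: "bird"}
class_ids = {"background": 0, "bird": 1}


def to_semantic(instance_mask, instance_classes, class_ids):
    """Collapse instance ids into class ids, discarding instance identity."""
    semantic = []
    for row in instance_mask:
        semantic.append([
            class_ids["background"] if value == 0
            else class_ids[instance_classes[value]]
            for value in row
        ])
    return semantic


def count_instances(instance_mask):
    """Count distinct objects; only possible while instance ids are kept."""
    return len({value for row in instance_mask for value in row if value != 0})


semantic_mask = to_semantic(instance_mask, instance_classes, class_ids)
```

In the semantic mask both birds share the same label, so the number of birds can no longer be recovered from it, while the instance mask still distinguishes them.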

Types of Image Annotation

Bounding Boxes

Bounding boxes are the most common approach to image labeling, as they most often fulfill the requirements of models that process images. A bounding box is a rectangular area containing an object of interest. The box defines the location of the object and bounds its size.

Each bounding box is a set of coordinates that delineates the starting and ending positions of the object in both directions. Under the hood, there are two main ways to format such annotations. The first uses two (x, y) points, typically the top-left and the bottom-right corners of the rectangle; the other two corners can be derived from these. The second uses a single (x, y) point, typically the top-left corner of the box, together with a tuple (w, h) giving the width and the height of the bounding box.
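As a minimal sketch, the two formats can be converted into each other with simple arithmetic. The function names are hypothetical, and the code assumes the common image convention in which the origin is the top-left of the image, y grows downward, and (x, y) in the second format is the top-left corner of the box. Some tools use a different anchor point, so check the convention of the format you are working with.

```python
def corners_to_xywh(x_min, y_min, x_max, y_max):
    """Two-corner format -> (x, y, w, h) with (x, y) as the top-left corner."""
    return (x_min, y_min, x_max - x_min, y_max - y_min)


def xywh_to_corners(x, y, w, h):
    """(x, y, w, h) format -> (x_min, y_min, x_max, y_max)."""
    return (x, y, x + w, y + h)


# Example box: top-left at (10, 20), bottom-right at (50, 80).
box_xywh = corners_to_xywh(10, 20, 50, 80)
box_corners = xywh_to_corners(*box_xywh)
```

Both formats carry the same information, so the choice is usually dictated by the dataset format or the model's expected input.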

When do you want to use bounding boxes?

Use bounding boxes when the primary purpose of your model or system is to detect or localize an object of interest. Uses of object detection range from activity recognition and face detection to face recognition and video object co-segmentation, among others.

Polygonal Segmentation

The drawback of bounding boxes is that they cannot fully delineate the shape of the object, only its general position. Polygonal segmentation addresses this problem. The approach relies on drawing points around the object of interest and connecting them to form a polygon. Although the resulting annotation is drawn by humans and is not pixel-perfect, it provides adequate data about the object’s shape, size, and location.

Polygons are stored in various formats, for example as a list of points corresponding to the polygon’s vertices. This can be a list of lists, with one (x, y) pair per vertex, or a single flat list of consecutive x and y values.
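The two layouts are easy to convert between, and a polygon also gives back a bounding box and an area when needed. The following is a sketch with hypothetical helper names, assuming a simple (non-self-intersecting) polygon whose vertices are listed in order; the area uses the standard shoelace formula.

```python
def flat_to_points(flat):
    """[x1, y1, x2, y2, ...] -> [(x1, y1), (x2, y2), ...]."""
    if len(flat) % 2 != 0:
        raise ValueError("flat polygon list must have an even number of values")
    return [(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


def polygon_bbox(points):
    """Smallest axis-aligned box (x_min, y_min, x_max, y_max) around the polygon."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def polygon_area(points):
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]  # wrap around to close the polygon
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# A 40x30 rectangle written as a flat list of consecutive x and y values.
flat_polygon = [10, 10, 50, 10, 50, 40, 10, 40]
points = flat_to_points(flat_polygon)
bbox = polygon_bbox(points)
area = polygon_area(points)
```

Deriving the box from the polygon is lossless in one direction only: a polygon always yields its bounding box, but a box cannot recover the shape.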

When do you want to use polygonal segmentation?

Use polygonal segmentation when the system must determine not only the position of an object of interest but also its shape and size. This makes polygonal segmentation the approach of choice for most segmentation tasks.

How can iMerit help?

iMerit Ango Hub provides an end-to-end, fully managed data labeling service for AI teams, including image annotation. With our ever-growing team of labelers and our in-house labeling platform, we provide efficient and accurate labeling for your raw data.

Ango Hub allows our annotators to label images quickly and efficiently. After labeling, our platform enables reviewers to verify the labels so that they meet high quality requirements.

Once done, we export the annotations in various formats, such as COCO or YOLO, depending on the project. To bring labeling speed to the next level, Ango Hub offers AI-assisted annotation techniques that reduce the time such tasks take from minutes to a matter of seconds.
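To illustrate how such export formats differ, here is a small sketch converting one bounding box between the two conventions as they are commonly documented: COCO stores a box as [x_min, y_min, width, height] in absolute pixels, while a YOLO label line stores the class id followed by the box center and size, normalized by the image dimensions. The function names are hypothetical, and this is not a description of Ango Hub's actual export code.

```python
def coco_to_yolo(bbox, image_width, image_height):
    """COCO [x_min, y_min, w, h] in pixels -> YOLO (x_center, y_center, w, h) in 0..1."""
    x_min, y_min, w, h = bbox
    x_center = (x_min + w / 2) / image_width
    y_center = (y_min + h / 2) / image_height
    return (x_center, y_center, w / image_width, h / image_height)


def format_yolo_line(class_id, yolo_bbox):
    """One line of a YOLO-style label file: class id followed by four floats."""
    return f"{class_id} " + " ".join(f"{value:.6f}" for value in yolo_bbox)


# Hypothetical 400x200 image with a box starting at (100, 50), 200 wide, 100 tall.
yolo_bbox = coco_to_yolo([100, 50, 200, 100], image_width=400, image_height=200)
label_line = format_yolo_line(0, yolo_bbox)
```

Because the YOLO values are normalized, the same label remains valid if the image is resized, whereas COCO pixel coordinates need to be rescaled.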

Are you looking for data annotation to advance your project? Contact us today.