Panoptic Segmentation: Everything You Need to Know

October 05, 2023

Image segmentation is one of the most widespread data labeling tasks, with uses across a wide range of ML applications. Panoptic segmentation is one type of image segmentation and, while among the most time-intensive, it is arguably among the most powerful. In this article, we’ll dive deep into panoptic segmentation and how you can use it.

If you need a broader overview of image annotation, check out our complete guide to image annotation services. For radiology in particular, see our guide to the radiology annotation product suite.

If you need to segment your data and are looking for a data annotation platform, iMerit Ango Hub has all you need to start. If you are looking to outsource segmenting your images, iMerit Service is what you are looking for. But let’s get back to panoptic segmentation.

What is Image Segmentation?

Image segmentation is the process of labeling an image such that its various parts are classified down to the pixel level, making segmentation one of the most information-intensive forms of image labeling.

Segmented images can train powerful ML and deep learning algorithms with detailed information on what is in the image and where. Image segmentation effectively classifies and localizes objects of interest within an image, making it the labeling task of choice when we need to train highly detailed detectors and data resources are available.

Before we delve into the various forms of image segmentation, we need to understand the two key concepts that underpin it. Any image, when segmented, can contain two kinds of elements:

  1. Things (Instance): Any countable object is called a thing. If you can identify and separate the class into multiple objects, it is called a thing. To exemplify – a person, a cat, a car, a key, and a ball are called things.
  2. Stuff (Semantic): An uncountable amorphous region of identical texture is known as stuff. Stuff, in general, forms an indivisible area within an image. For instance, roads, water, and sky belong to the stuff category.

Types of Image Segmentation

Image Labeling Tasks from Detection to Panoptic Segmentation – from the COCO dataset

Knowing the two concepts mentioned above, we can delve into image segmentation. There are three main categories:

  1. Semantic segmentation refers to exhaustively identifying the different classes of objects in an image. Every pixel of the image belongs to a specific class (unlabeled pixels are automatically treated as belonging to the background class).

Fundamentally, this means identifying stuff within an image.

  2. Instance segmentation refers to identifying and localizing the different instances of each semantic category. Each object gets a unique identifier, so instance segmentation can be seen as an extension of semantic segmentation.

Instance segmentation thus identifies things in an image.

  3. Panoptic segmentation combines the merits of both approaches: it labels every pixel with a class and also separates the individual instances of each kind of object in the input image. This gives a global view of the image.

Essentially, the panoptic segmentation of an image contains data related to both the overarching classes and the instances of these classes for each pixel, thus identifying both stuff and things within an image.

Image Classification, Instance Segmentation, Semantic Segmentation, and Panoptic Segmentation on iMerit Ango Hub

The Panoptic Segmentation Format

So, how exactly do we capture both the semantic and the instance categories of the same image? Kirillov et al. at Facebook AI Research and Heidelberg University proposed an intuitive format for this. Panoptic segmentation has the following properties:

Two Labels per Pixel: Panoptic segmentation assigns two labels to each pixel of an image: a semantic label and an instance ID. Pixels with the same semantic label belong to the same class, and the instance ID tells the instances of that class apart.

Annotation File Per Image: Because every pixel is labeled, the annotation for each image is saved as a separate file (by convention, a PNG) holding per-pixel values, rather than as a set of polygons or an RLE encoding.

Non-Overlapping: Unlike instance segmentation, where masks may overlap, each pixel in panoptic segmentation carries exactly one label, which means instances cannot overlap.
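To make these properties concrete, here is a minimal sketch of how two labels per pixel can be packed into a single non-overlapping map. The offset of 1000, the convention that instance ID 0 marks stuff, the class ID 100 for a stuff region, and the helper name `encode_panoptic` are all assumptions made for illustration, not the specification of any particular dataset.

```python
import numpy as np

# Assumed encoding constant; chosen to be larger than the number of classes.
OFFSET = 1000

# Hypothetical 2x4 image. Class 0 is a thing class, class 100 a stuff class.
semantic = np.array([[0, 0, 100, 100],
                     [0, 0, 0, 100]])

# Instance IDs: 0 means "no instance" (stuff); things count up from 1.
instance = np.array([[1, 1, 0, 0],
                     [1, 2, 2, 0]])


def encode_panoptic(semantic, instance, offset=OFFSET):
    """Pack (class ID, instance ID) into one integer per pixel."""
    return instance * offset + semantic


panoptic = encode_panoptic(semantic, instance)
print(panoptic)
```

Because each pixel holds a single integer, two instances can never claim the same pixel, and the encoded map always has the same height and width as the inputs.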

An image with its panoptic segmentation overlaid upon it.

Consider the image above and its resulting panoptic segmentation PNG file. The panoptic segmentation image is saved as a PNG with the same dimensions as the input image. This means that masks are not stored as polygons or in RLE format but as pixel values in a file.

The image above is 600×400 pixels, and the panoptic segmentation is also 600×400. However, while the input image has pixel values in the range of 0-255 (grayscale), the panoptic segmentation image has a very different range of values. Each pixel value in the resulting panoptic segmentation file encodes the class, and the instance, of that pixel.

Storing Annotations in the Panoptic Segmentation Format

Let’s dive into some Python to understand how the labels are represented in the files. 

The key question we want to address is: 

What is the corresponding class for a pixel value in the panoptic segmentation output?

First, let’s check what classes we have:

We find out we have 133 classes in total, representing various categories of objects.

Now, let’s go to the panoptic segmentation output. If we get the unique values of the pixels in the panoptic segmentation, we get the following result:

To get the instance and class IDs for each of these pixel values, here’s how we interpret them.

The instance IDs separate different instances of the same class with a unique identifier. Note that instance IDs are global, not specific to a semantic class: the instance ID is a counter over all the instances in the image. In the case above, since the highest instance ID is 5, we have five thing instances, and the rest is stuff.

Mathematically, we need to decode these pixel values to get the indices of the classes they represent. Usually, the panoptic segmentation encoding is such that pixel value % offset gives us the class ID, where % is the modulus operator. In this example, the offset is 1000.

Applying this operation, 2000 % 1000 = 5000 % 1000 = 0. Thus, pixel value 2000 has the same class as pixel value 5000, and both belong to class 0. Similarly, values 1038 and 3038 both belong to class 38.
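The arithmetic above can be sketched in a few lines of Python. The offset of 1000 is taken from the worked example; recovering the instance ID by integer division is an assumption that matches the values shown here, and the helper name `decode` is hypothetical.

```python
OFFSET = 1000  # offset implied by the worked example (2000 % 1000 == 0)


def decode(pixel_value, offset=OFFSET):
    """Split an encoded pixel value into (instance_id, class_id).

    class_id    = pixel_value %  offset  (as described in the text)
    instance_id = pixel_value // offset  (assumed counterpart)
    """
    instance_id, class_id = divmod(pixel_value, offset)
    return instance_id, class_id


for value in (2000, 5000, 1038, 3038):
    instance_id, class_id = decode(value)
    print(f"{value}: class {class_id}, instance {instance_id}")
```

Under these assumptions, 2000 and 5000 decode to instances 2 and 5 of class 0, while 1038 and 3038 decode to instances 1 and 3 of class 38.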

By correlating our class IDs with the model classes, we find that 38 is tennis_racket and 0 is the person class. This answers our initial question of which pixel values correspond to which class in the panoptic segmentation label.
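As a rough sketch, the lookup from pixel values to class names could look like this. The `CLASS_NAMES` table is a hypothetical stand-in holding only the two entries named in this article (a real table would list all 133 classes), and the offset of 1000 and the use of integer division for the instance ID are assumptions carried over from the worked example.

```python
from collections import defaultdict

OFFSET = 1000  # assumed, as in the worked example

# Hypothetical, partial lookup table: only the two classes named in the text.
CLASS_NAMES = {0: "person", 38: "tennis_racket"}


def instances_by_class(pixel_values, offset=OFFSET):
    """Group instance IDs under the class name each pixel value decodes to."""
    grouped = defaultdict(list)
    for value in sorted(pixel_values):
        instance_id, class_id = divmod(value, offset)
        name = CLASS_NAMES.get(class_id, f"unknown_{class_id}")
        grouped[name].append(instance_id)
    return dict(grouped)


result = instances_by_class([2000, 5000, 1038, 3038])
print(result)
```

With the four example values, this groups two instances under person and two under tennis_racket.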

Image from the first paper on panoptic segmentation

Frameworks for Panoptic Segmentation

Panoptic FPN

Architecture of Panoptic FPN Combining Instance and Semantic Segmentation.

Introduced by the pioneers of panoptic segmentation, this deep learning framework aims to unify the tasks of instance and semantic segmentation at the architectural level, designing a single network for both tasks.

It uses Mask R-CNN to achieve instance segmentation and adds a semantic segmentation branch. Both branches share a Feature Pyramid Network (FPN) backbone for feature extraction. The FPN extracts features at multiple scales, so that the network can still detect objects correctly when they appear at different sizes.
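The multi-scale idea behind an FPN can be illustrated with a heavily simplified sketch of its top-down pathway: coarse feature maps are upsampled and added to finer ones. This is not the Panoptic FPN implementation. It uses single-channel NumPy arrays, omits the learned convolutions a real FPN applies, and the function names and toy inputs are assumptions for illustration.

```python
import numpy as np


def upsample2x(feature):
    """Nearest-neighbour 2x upsampling of an (H, W) feature map."""
    return feature.repeat(2, axis=0).repeat(2, axis=1)


def top_down_pyramid(bottom_up):
    """Merge feature maps ordered fine -> coarse.

    Each map is assumed to be half the height and width of the previous one.
    Returns the merged maps in the same fine -> coarse order.
    """
    merged = [bottom_up[-1]]  # start from the coarsest level
    for lateral in reversed(bottom_up[:-1]):
        # Add upsampled coarse context to the finer level.
        merged.append(lateral + upsample2x(merged[-1]))
    return merged[::-1]


# Toy backbone outputs at three resolutions (constant values for readability).
c3 = np.full((8, 8), 1.0)
c4 = np.full((4, 4), 2.0)
c5 = np.full((2, 2), 3.0)

p3, p4, p5 = top_down_pyramid([c3, c4, c5])
print(p3.shape, p4.shape, p5.shape)
```

Every output level keeps its own resolution but now carries information from the coarser levels above it, which is what lets both branches work with objects of different sizes.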

Surprisingly, this simple baseline remains effective for instance segmentation and yields a lightweight, well-performing method for semantic segmentation. By combining these two tasks, the framework sets the foundation for Panoptic Segmentation architectures.


Mask2Former Architecture

Presented in 2022, this framework tackles instance and semantic segmentation with a single architecture. It also handles panoptic segmentation effectively and advanced the state of the art for panoptic segmentation on various datasets.

The framework is called “Masked-attention Mask Transformer (Mask2Former),” and can address any image segmentation task (panoptic, instance, or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions.  
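To give a feel for masked attention, here is a minimal NumPy sketch of cross-attention in which each query may only attend to the positions inside its predicted mask. It is a simplification under stated assumptions, not the Mask2Former code: there are no learned projections or multiple heads, the function name and toy shapes are hypothetical, and falling back to full attention for an empty mask is included as a safeguard against dividing by zero.

```python
import numpy as np


def masked_cross_attention(queries, keys, values, mask):
    """Cross-attention restricted to mask regions.

    queries: (Q, D); keys, values: (N, D)
    mask: (Q, N) boolean, True where a query is allowed to attend.
    """
    # Safeguard: a query with an empty mask attends everywhere instead.
    mask = np.where(mask.any(axis=1, keepdims=True), mask, True)

    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    scores = np.where(mask, scores, -np.inf)  # block positions outside the mask

    # Numerically stable softmax over the key positions.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights


rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 4))   # two object queries
keys = rng.normal(size=(6, 4))      # six image positions
values = rng.normal(size=(6, 4))

# Query 0 may look at positions 0-2, query 1 at positions 3-5.
mask = np.array([[True, True, True, False, False, False],
                 [False, False, False, True, True, True]])

output, weights = masked_cross_attention(queries, keys, values, mask)
print(weights.round(3))
```

Positions outside a query's mask receive zero attention weight, so each query aggregates features only from its own region.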

This framework also uses two main branches: a pixel decoder branch and a transformer decoder branch. The pixel decoder performs a task similar to the FPN, producing features at several scales. The transformer decoder attends to these multi-scale features, and its output is combined with the pixel decoder’s output to predict the mask and class of each object.

Panoptic Segmentation Datasets

COCO Panoptic

Annotations from the COCO panoptic dataset

The panoptic task uses all the annotated COCO images and includes the 80 thing categories from the detection task and a subset of the 91 stuff categories from the stuff task. This dataset suits general-purpose object recognition, and you’ll often see it used in the panoptic literature to train and fine-tune networks.


Some Annotations from the ADE20K Dataset

The ADE20K semantic segmentation dataset contains more than 20K scene-centric images exhaustively annotated with pixel-level object and object-part labels. There are 150 semantic categories, including “stuff” like sky, road, and grass, and discrete objects like person, car, and bed.


Some Annotations from the Mapillary Dataset

The Mapillary Dataset is a set of 25,000 high-resolution images, annotated with 124 semantic object categories and 100 instance categories. The dataset contains images from all over the globe, covering six continents. The data is well suited to panoptic segmentation tasks in the autonomous vehicle industry.


Annotations from the Cityscapes dataset

Cityscapes is a dataset of stereo video sequences recorded in street scenes from 50 cities, with high-quality pixel-level annotations for 5,000 frames, in addition to a set of 20,000 weakly annotated frames.

Its polygonal annotations combine semantic and instance segmentation across 30 unique classes.


Panoptic segmentation is a highly effective method of segmenting images that combines semantic and instance segmentation. Although it is a recent development, research is moving fast and shaping the future of object detection.

Panoptic segmentation is extremely detail-rich due to its pixel-level class labels and can train powerful deep-learning frameworks. However, labeling data down to the pixel level is a grueling process.

At iMerit, we deliver high-quality and densely annotated images. Whether you’re looking to deploy a panoptic detector for an autonomous vehicle, a medical imagery task, or another problem, we ensure that our experts label each image carefully up to pixel perfection, using iMerit Ango Hub, our state-of-the-art labeling platform natively supporting panoptic labeling.

Book a demo to learn how we can help you solve your data labeling needs.