Renjith Ms
2 min read · Mar 24, 2021

Mask R-CNN

Mask R-CNN is a deep neural network for instance segmentation, a computer vision task. Given an image as input, it outputs a bounding box, a class label, and a segmentation mask for each detected object.
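To make the three outputs concrete, here is a minimal sketch of what a per-object result could look like. The `Instance` container and the `make_instance` helper are hypothetical names made up for illustration, not part of any library, and the masks are toy rectangles rather than real predictions.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Instance:
    """One detected object (hypothetical container, not a library type)."""
    box: tuple        # (x1, y1, x2, y2) in pixels, x2/y2 exclusive
    label: str        # predicted class name
    mask: np.ndarray  # boolean array with the image's height and width


def make_instance(label, x1, y1, x2, y2, image_shape):
    """Build a toy instance whose mask simply fills its box."""
    mask = np.zeros(image_shape, dtype=bool)
    mask[y1:y2, x1:x2] = True
    return Instance(box=(x1, y1, x2, y2), label=label, mask=mask)


# Two toy objects of the same class in a 6x8 image.
instances = [
    make_instance("cat", 0, 0, 3, 3, (6, 8)),
    make_instance("cat", 4, 2, 8, 6, (6, 8)),
]

# Instance segmentation keeps the two cats apart: same label, separate masks.
overlap = np.logical_and(instances[0].mask, instances[1].mask).sum()
print([inst.label for inst in instances], "overlapping pixels:", int(overlap))
```

The point of the sketch is only that each object carries its own mask, which is what separates instance segmentation from labelling every pixel with a class.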

Terms that need to be understood before studying Mask R-CNN

  1. Feature Pyramid Networks for Object Detection

The FPN paper is available at https://arxiv.org/abs/1612.03144

FPN in Brief

FPN is composed of a bottom-up pathway and a top-down pathway. The bottom-up pathway is the usual convolutional network for feature extraction. As we go up, the spatial resolution decreases, while the semantic value of each layer increases because more high-level structures are detected.
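The shrinking resolution of the bottom-up pathway can be sketched with a little arithmetic. The snippet below assumes a ResNet-50-style backbone whose stages C2 to C5 have strides 4, 8, 16, and 32 and the channel widths shown; these numbers and the `bottom_up_shapes` helper are assumptions for illustration, and other backbones will differ.

```python
# Assumed ResNet-50-style backbone: stage name, total stride, channel width.
STAGES = [("C2", 4, 256), ("C3", 8, 512), ("C4", 16, 1024), ("C5", 32, 2048)]


def bottom_up_shapes(height, width):
    """Return (name, channels, height, width) for each backbone stage."""
    shapes = []
    for name, stride, channels in STAGES:
        # Ceiling division, assuming padded strided layers that round sizes up.
        out_h = -(-height // stride)
        out_w = -(-width // stride)
        shapes.append((name, channels, out_h, out_w))
    return shapes


if __name__ == "__main__":
    for name, channels, h, w in bottom_up_shapes(224, 224):
        print(f"{name}: {channels} channels, {h}x{w}")
```

Each stage halves the height and width of the previous one while the channel count grows, which is the trade of resolution for semantic value described above.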

SSD makes detections from multiple feature maps. However, it does not use the bottom layers for object detection: they have high resolution, but their semantic value is too low to justify the significant slow-down that using them would cause. SSD therefore uses only the upper layers for detection and performs much worse on small objects.

FPN provides a top-down pathway that constructs higher-resolution layers from a semantically rich layer.

Although the reconstructed layers are semantically strong, the locations of objects are no longer precise after all the downsampling and upsampling. FPN therefore adds lateral connections between the reconstructed layers and the corresponding bottom-up feature maps to help the detector predict locations better. These connections also act as skip connections that make training easier (similar to what ResNet does).
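The top-down pathway and its lateral connections can be sketched in a few lines of NumPy. This is a simplified illustration, not a full FPN: it assumes each bottom-up map is exactly twice the size of the next coarser one, uses random weights in place of learned ones, and leaves out the 3x3 convolution that the paper applies to each merged map. The function names are made up for this example.

```python
import numpy as np


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def conv1x1(x, weight):
    """1x1 convolution: mixes channels per pixel. weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)


def top_down(features, weights):
    """Merge bottom-up maps (finest first) into pyramid maps (finest first)."""
    # Start from the coarsest, most semantic map.
    merged = conv1x1(features[-1], weights[-1])
    pyramid = [merged]
    # Walk back towards the finer maps.
    for feat, w in zip(features[-2::-1], weights[-2::-1]):
        # Lateral connection plus the upsampled coarser map, added element-wise.
        merged = conv1x1(feat, w) + upsample2x(merged)
        pyramid.append(merged)
    return pyramid[::-1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy bottom-up maps: more channels and lower resolution at each step.
    feats = [rng.standard_normal(s) for s in [(8, 16, 16), (16, 8, 8), (32, 4, 4)]]
    # Random stand-ins for learned 1x1 weights mapping every level to 4 channels.
    ws = [rng.standard_normal((4, f.shape[0])) for f in feats]
    for level in top_down(feats, ws):
        print(level.shape)
```

After merging, every pyramid level has the same number of channels, so one detection head can be shared across all levels.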