ayushdabra/drone-images-semantic-segmentation

Multiclass Semantic Segmentation of Aerial Drone Images Using Deep Learning

Abstract

Semantic segmentation is the task of clustering together parts of an image that belong to the same object class. It is a form of pixel-level prediction, because each pixel in an image is classified into a category. In this project, I performed semantic segmentation on the Semantic Drone Dataset using transfer learning: a UNet CNN with a VGG-16 encoder pre-trained on ImageNet. To artificially increase the amount of data and avoid overfitting, I applied data augmentation to the training set. The model performed well, achieving a dice coefficient of ~87% on the validation set.

Tech Stack

The Jupyter Notebook can be accessed from here.

What is Semantic Segmentation?

Semantic segmentation is the task of classifying each and every pixel in an image into a class, as shown in the image below. Here all persons are red, the road is purple, the vehicles are blue, street signs are yellow, and so on.

Semantic segmentation differs from instance segmentation, in which different objects of the same class receive different labels (e.g. person1, person2) and hence different colours.

Semantic Drone Dataset

The Semantic Drone Dataset focuses on semantic understanding of urban scenes, with the aim of increasing the safety of autonomous drone flight and landing procedures. The imagery depicts more than 20 houses from a nadir (bird's-eye) view, acquired at altitudes of 5 to 30 meters above ground. A high-resolution camera was used to acquire images at a size of 6000x4000 px (24 Mpx). The training set contains 400 publicly available images, and the test set is made up of 200 private images.


Semantic Annotation

The images are labeled densely using polygons and contain the following 24 classes:

| Name | R | G | B |
|------|-----|-----|-----|
| unlabeled | 0 | 0 | 0 |
| paved-area | 128 | 64 | 128 |
| dirt | 130 | 76 | 0 |
| grass | 0 | 102 | 0 |
| gravel | 112 | 103 | 87 |
| water | 28 | 42 | 168 |
| rocks | 48 | 41 | 30 |
| pool | 0 | 50 | 89 |
| vegetation | 107 | 142 | 35 |
| roof | 70 | 70 | 70 |
| wall | 102 | 102 | 156 |
| window | 254 | 228 | 12 |
| door | 254 | 148 | 12 |
| fence | 190 | 153 | 153 |
| fence-pole | 153 | 153 | 153 |
| person | 255 | 22 | 0 |
| dog | 102 | 51 | 0 |
| car | 9 | 143 | 150 |
| bicycle | 119 | 11 | 32 |
| tree | 51 | 51 | 0 |
| bald-tree | 190 | 250 | 190 |
| ar-marker | 112 | 150 | 146 |
| obstacle | 2 | 135 | 115 |
| conflicting | 255 | 0 | 0 |
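Before training, each RGB annotation mask has to be converted into a single-channel map of class indices. A minimal NumPy sketch of that conversion, using the colour table above (the index order here is an assumption; any consistent ordering works):

```python
import numpy as np

# (R, G, B) -> class index, following the annotation table above.
COLOR_MAP = {
    (0, 0, 0): 0,        # unlabeled
    (128, 64, 128): 1,   # paved-area
    (130, 76, 0): 2,     # dirt
    (0, 102, 0): 3,      # grass
    (112, 103, 87): 4,   # gravel
    (28, 42, 168): 5,    # water
    (48, 41, 30): 6,     # rocks
    (0, 50, 89): 7,      # pool
    (107, 142, 35): 8,   # vegetation
    (70, 70, 70): 9,     # roof
    (102, 102, 156): 10, # wall
    (254, 228, 12): 11,  # window
    (254, 148, 12): 12,  # door
    (190, 153, 153): 13, # fence
    (153, 153, 153): 14, # fence-pole
    (255, 22, 0): 15,    # person
    (102, 51, 0): 16,    # dog
    (9, 143, 150): 17,   # car
    (119, 11, 32): 18,   # bicycle
    (51, 51, 0): 19,     # tree
    (190, 250, 190): 20, # bald-tree
    (112, 150, 146): 21, # ar-marker
    (2, 135, 115): 22,   # obstacle
    (255, 0, 0): 23,     # conflicting
}

def rgb_to_class_index(mask_rgb: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) RGB annotation mask to an (H, W) class-index map."""
    index_map = np.zeros(mask_rgb.shape[:2], dtype=np.uint8)
    for color, idx in COLOR_MAP.items():
        index_map[np.all(mask_rgb == color, axis=-1)] = idx
    return index_map
```

The index map can then be one-hot encoded into 24 channels for the softmax output of the network.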

Sample Images

Technical Approach

Data Augmentation using Albumentations Library

Albumentations is a Python library for fast and flexible image augmentations. Albumentations efficiently implements a rich variety of image transform operations that are optimized for performance, and does so while providing a concise, yet powerful image augmentation interface for different computer vision tasks, including object classification, segmentation, and detection.

The dataset contains only 400 images, of which I used 320 (80%) for the training set and the remaining 80 (20%) for the validation set. This is a relatively small amount of data, so to artificially increase it and avoid overfitting I used data augmentation, expanding the training set fivefold. After augmentation, the training set therefore contains 1600 images, while the validation set keeps its 80 images.

Data augmentation is achieved through the following techniques:

  • Random Cropping
  • Horizontal Flipping
  • Vertical Flipping
  • Rotation
  • Random Brightness & Contrast
  • Contrast Limited Adaptive Histogram Equalization (CLAHE)
  • Grid Distortion
  • Optical Distortion

Here are some sample augmented images and masks of the dataset:




VGG-16 Encoder based UNet Model

The UNet was developed by Olaf Ronneberger et al. for biomedical image segmentation. The architecture contains two paths. The first is the contracting path (also called the encoder), which captures context in the image; the encoder is a traditional stack of convolutional and max-pooling layers. The second is the symmetric expanding path (the decoder), which enables precise localization using transposed convolutions. The UNet is thus an end-to-end fully convolutional network (FCN): it contains only convolutional layers and no dense layers, so it can accept images of any size.

In the original paper, the UNet is described as follows:

U-Net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map. The number of channels is denoted on top of the box. The x-y-size is provided at the lower left edge of the box. White boxes represent copied feature maps. The arrows denote the different operations.

Custom VGG16-UNet Architecture

  • VGG16 model pre-trained on the ImageNet dataset has been used as an Encoder network.

  • A decoder network has been extended from the last convolutional layer of the pre-trained model, with skip connections concatenating the outputs of the corresponding encoder convolution blocks.

VGG16 Encoder based UNet CNN Architecture

A detailed layout of the model is available here.
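The two bullet points above can be sketched in Keras as follows. This is a minimal illustration of a VGG16-encoder UNet, not the repository's exact layer configuration (skip-connection layer names and decoder filter counts are assumptions):

```python
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import VGG16

def build_vgg16_unet(input_shape=(512, 512, 3), n_classes=24,
                     weights="imagenet"):
    """Sketch of a UNet whose encoder is VGG16 pre-trained on ImageNet."""
    vgg = VGG16(include_top=False, weights=weights, input_shape=input_shape)

    # Skip connections taken from the end of each VGG16 convolution block.
    skips = [vgg.get_layer(name).output for name in
             ("block1_conv2", "block2_conv2", "block3_conv3", "block4_conv3")]
    x = vgg.get_layer("block5_conv3").output  # bottleneck

    # Decoder: upsample with transposed convolutions, concatenate the
    # matching encoder feature map, then refine with two convolutions.
    for filters, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

    outputs = layers.Conv2D(n_classes, 1, activation="softmax")(x)
    return Model(vgg.input, outputs)
```

Because the network is fully convolutional, the same construction works for any input size divisible by 16 (the encoder's total downsampling factor at the bottleneck).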

Hyper-Parameters

  1. Batch Size = 8
  2. Steps per Epoch = 200
  3. Validation Steps = 10
  4. Input Shape = (512, 512, 3)
  5. Initial Learning Rate = 0.0001 (with Exponential Decay LearningRateScheduler callback)
  6. Number of Epochs = 20 (with ModelCheckpoint & EarlyStopping callbacks)
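The callback setup named in items 5 and 6 can be sketched with standard Keras callbacks. The decay rate, patience, and checkpoint filename here are illustrative assumptions, not the repository's exact values:

```python
import math
from tensorflow.keras.callbacks import (EarlyStopping, LearningRateScheduler,
                                        ModelCheckpoint)

INITIAL_LR = 1e-4

def exponential_decay(epoch, lr):
    # Exponential decay from the initial LR; the decay constant 0.1
    # is illustrative, the repository's exact schedule may differ.
    return INITIAL_LR * math.exp(-0.1 * epoch)

callbacks = [
    LearningRateScheduler(exponential_decay, verbose=1),
    ModelCheckpoint("best_weights.h5", monitor="val_loss",
                    save_best_only=True),
    EarlyStopping(monitor="val_loss", patience=5,
                  restore_best_weights=True),
]
# model.fit(train_gen, steps_per_epoch=200, validation_data=val_gen,
#           validation_steps=10, epochs=20, callbacks=callbacks)
```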

Results

Training Results

| Model | Epochs | Train Dice Coefficient | Train Loss | Val Dice Coefficient | Val Loss | Max. (Initial) LR | Min. LR | Total Training Time |
|-------|--------|------------------------|------------|----------------------|----------|-------------------|---------|---------------------|
| VGG16-UNet | 20 (best weights at epoch 18) | 0.8781 | 0.2599 | 0.8702 | 0.29959 | 1.000 × 10⁻⁴ | 1.122 × 10⁻⁵ | 23569 s (06:32:49) |
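The dice coefficient reported above is a standard overlap metric between predicted and ground-truth masks. A common Keras-style formulation (an assumption for illustration, not necessarily the repository's exact implementation):

```python
import tensorflow as tf

def dice_coefficient(y_true, y_pred, smooth=1.0):
    """Soft dice coefficient over flattened one-hot masks.

    dice = (2 * |A ∩ B| + smooth) / (|A| + |B| + smooth)
    """
    y_true_f = tf.reshape(tf.cast(y_true, tf.float32), [-1])
    y_pred_f = tf.reshape(tf.cast(y_pred, tf.float32), [-1])
    intersection = tf.reduce_sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (
        tf.reduce_sum(y_true_f) + tf.reduce_sum(y_pred_f) + smooth)

def dice_loss(y_true, y_pred):
    # Maximizing overlap == minimizing (1 - dice).
    return 1.0 - dice_coefficient(y_true, y_pred)
```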

The model_training_csv.log file contains epoch-wise training details of the model.

Visual Results

Predictions on Validation Set Images:

All predictions on the validation set are available in the predictions directory.

Activations (Outputs) Visualization

Activations (outputs) of some layers of the model:

block1_conv1

block4_conv1

conv2d_transpose

concatenate

conv2d

conv2d_transpose_1

conv2d_3

conv2d_transpose_2

concatenate_2

conv2d_5

conv2d_7

conv2d_8

Some more activation maps are available in the activations directory.

References

  1. Semantic Drone Dataset: http://dronedataset.icg.tugraz.at/
  2. Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition", arXiv:1409.1556, 2014. [PDF]
  3. Olaf Ronneberger, Philipp Fischer and Thomas Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation", arXiv:1505.04597, 2015. [PDF]
  4. Towards Data Science: Understanding Semantic Segmentation with UNET, by Harshall Lamba
  5. Keract by Philippe Rémy (@github/philipperemy), used under the MIT License, Copyright (c) 2019.