DeepLabV3-Segmentation

A DeepLab V3+ Model with a choice of Encoder to perform Binary Segmentation Tasks.

Report Bug · Request Feature



Changelog

Ensemble Training & Evaluation - 22/11/2023

  • Added support for ensemble training using a script that sequentially trains multiple models.
  • Ensemble evaluation script implemented for combining predictions from multiple models and computing ensemble metrics.
  • Introduced a custom callback to monitor IoU during training and save examples with IoU below a specified threshold (a minimal illustrative sketch follows this list).
  • Implemented a dynamic learning rate reduction for subsequent models during ensemble training.
  • Added a mechanism to scale the IoU threshold with each subsequent model, providing better adaptability.
  • Added function for cleaning up temporary files.
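
For illustration, a minimal sketch of what such an IoU-monitoring callback could look like in Keras is shown below. The class name, metric key, and threshold are hypothetical; this is not the repository's actual implementation.

```python
import tensorflow as tf


class LowIoUMonitor(tf.keras.callbacks.Callback):
    """Hypothetical sketch: track a logged IoU metric each epoch and flag
    epochs that fall below a threshold (where low-IoU examples could be saved)."""

    def __init__(self, iou_threshold=0.5, metric_name="val_iou"):
        super().__init__()
        self.iou_threshold = iou_threshold  # illustrative default
        self.metric_name = metric_name      # must match a compiled metric's log key
        self.low_iou_epochs = []

    def on_epoch_end(self, epoch, logs=None):
        iou = (logs or {}).get(self.metric_name)
        if iou is not None and iou < self.iou_threshold:
            self.low_iou_epochs.append((epoch, float(iou)))
            print(f"\nEpoch {epoch}: {self.metric_name}={iou:.3f} "
                  f"below threshold {self.iou_threshold}")
```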

More Backbones - 23/12/2023

  • Added backbone implementation for Xception and EfficientNetB5. (ef93d12, 2307d8c)
  • Removed 'cloud' branch.
  • Updated the model on 'main' to the best performing configuration.

About The Project

The goal of this research is to develop a DeepLabV3+ model with a choice of ResNet50 or ResNet101 backbone to perform binary segmentation on plant image datasets. Binary segmentation splits an image into discrete subgroups, known as image segments, based on the presence or absence of a certain object or characteristic, which simplifies further processing or analysis by reducing the image's complexity. Segmentation also involves pixel labeling: every pixel assigned to the same category receives a common label.

Plant images with ground-truth binary masks make up the training and validation datasets. The project uses TensorFlow, a well-known deep learning library, for model development, training, and evaluation.¹ During training, the model is optimized using Dice Loss, the Adam optimizer, ReduceLROnPlateau, and Early Stopping, while important metrics such as Intersection over Union (IoU), Precision, Recall, Accuracy, and the Dice Coefficient are tracked.
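
As a rough sketch of this training setup (hyperparameter values and the stand-in model below are illustrative assumptions, not the repository's exact configuration), the loss, optimizer, and callbacks could be wired up in Keras like so:

```python
import tensorflow as tf
from tensorflow.keras import backend as K


def dice_coefficient(y_true, y_pred, smooth=1e-6):
    # Overlap-based similarity between predicted and ground-truth masks.
    y_true_f = K.flatten(y_true)
    y_pred_f = K.flatten(y_pred)
    intersection = K.sum(y_true_f * y_pred_f)
    return (2.0 * intersection + smooth) / (K.sum(y_true_f) + K.sum(y_pred_f) + smooth)


def dice_loss(y_true, y_pred):
    return 1.0 - dice_coefficient(y_true, y_pred)


# Placeholder stand-in for the repository's DeepLabV3+ builder.
inputs = tf.keras.Input(shape=(256, 256, 3))
outputs = tf.keras.layers.Conv2D(1, 1, activation="sigmoid")(inputs)
model = tf.keras.Model(inputs, outputs)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
    loss=dice_loss,
    metrics=[dice_coefficient, "accuracy",
             tf.keras.metrics.Precision(), tf.keras.metrics.Recall()],
)

callbacks = [
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.1, patience=5),
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
]
# model.fit(train_ds, validation_data=val_ds, callbacks=callbacks)
```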

Datasets used during development of this project are described below:

  • EWS Dataset

    The Eschikon Wheat Segmentation (EWS) Dataset consists of 190 images, cropped to 350 by 350 pixel patches and manually labeled with binary masks separating plants from soil. Pixels that the annotator was certain belonged to vegetatively active material from a wheat plant were marked as such; everything else, including dirt, rocks, and dead plants, was categorized as vegetatively inactive material. The masks were then exported as 8-bit, lossless PNG images.

    These images were captured between 2017 and 2020 with a Canon 5D Mark II with a 35mm lens and autofocus, at an approximate distance of 3 m to the ground. In 2017 and 2018, ISO, aperture, and shutter speed were set using the aperture priority mode; in 2019 and 2020, the shutter speed priority mode was used. Each year's collection covers the whole growing season, from emergence to harvest. The photos were taken outdoors under a wide range of sunlight and soil moisture conditions.

  • Plant Semantic Segmentation Dataset by HIL

    Humans in the Loop (HIL) Plant Semantic Segmentation Dataset was made available as an Open-Access Dataset by The Computer Vision and Biosystems Signal Processing Group at the Department of Electrical and Computer Engineering at Aarhus University.

    The dataset contains 144 images of plant seedlings from 3 containers, collected at various intervals over the course of two months. Each container holds up to 40 single plants, and each plant has been given a bounding box to make it easier to locate. The photos are 4096 by 3000 pixels in size and manually annotated.

    The annotations are made as such:

    • ‘Background’ class as black.
    • ‘Plant’ class as green.
  • CVPPP Dataset

    The Computer Vision Problems in Plant Phenotyping (CVPPP) Leaf Counting Challenge (LCC) 2017 Dataset provides 27 images of tobacco and 783 images of Arabidopsis in separate folders, A1 through A4. Tobacco images were gathered with a camera that had a single plant in its field of view; Arabidopsis images were taken with a camera covering a wider field of view and were later cropped. The photographs were shot over a period of days, come from mutants as well as wild types, and stem from two different experimental setups with differing fields of view.

    Additionally, certain plants are slightly more out of focus than others due to the wider field of view. Although the backgrounds of most photographs are plain and static, moss growth or the presence of water in the growing tray occasionally complicates the scene. Each image was manually labeled to obtain ground truth masks for every leaf/plant in the picture.

The ultimate objective of the project is to develop a robust model that can accurately segment plant-related regions within photographs, with applications in a variety of fields such as agriculture, botany, and environmental sciences. The included code demonstrates how to prepare the data, build the model architecture, train it on the dataset, and assess its effectiveness using a variety of metrics.

Working

The objective of binary segmentation, often referred to as semantic binary segmentation, is to categorize each pixel in an image into one of two groups: the foreground (object of interest), or the background. A powerful Encoder-Decoder based architecture for solving binary segmentation challenges, DeepLabV3+ with ResNet50 or ResNet101 as the backbone offers great accuracy and spatial precision.

Architecture of this Repository's Model: DeepLabV3+

DeepLabV3+

Known for its precise pixel-by-pixel image segmentation capabilities, DeepLabV3+ is a powerful semantic segmentation model. It combines a robust feature extractor, such as ResNet50 or ResNet101, with an effective decoder. This architecture captures both local and global context, making it suitable for tasks where accurate object boundaries and fine details are important. A crucial component is the Atrous Spatial Pyramid Pooling (ASPP) module, which uses several dilated convolutions to collect information at multiple scales. The decoder further refines the output by fusing high-level semantic features with precise spatial detail. This fusion of context and location awareness enables highly precise segmentations across a variety of applications.

ResNet Backbone

Residual Networks, often known as ResNets, are a class of deep neural network architectures created to address the vanishing gradient problem that can arise in very deep networks. They were first presented in the 2015 publication Deep Residual Learning for Image Recognition by Kaiming He et al. ResNets have been extensively used for a number of tasks, including image classification, object recognition, and segmentation.

The main novelty in ResNets is the introduction of residual blocks, which allow for the training of extremely deep networks by providing shortcut connections (skip connections) that skip one or more layers. Through these connections, gradients can pass directly through the network without vanishing or exploding, enabling the training of far deeper architectures. A minimal sketch of such a block follows.
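
Below is a minimal identity residual block in Keras, for illustration only; ResNet's actual blocks also use 1x1 projections on the shortcut when dimensions change, and ResNet-50 and deeper use a three-layer bottleneck design.

```python
from tensorflow.keras import layers


def residual_block(x, filters):
    """Identity residual block: two 3x3 convolutions plus a skip connection.
    Assumes x already has `filters` channels; otherwise a 1x1 projection
    on the shortcut would be needed before the addition."""
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.Activation("relu")(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])  # gradients can flow through this shortcut
    return layers.Activation("relu")(y)
```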

ResNets are available in a range of depths, designated ResNet-XX, where XX is the number of layers. ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152 are popular variants. The deeper variants perform better but also consume more computational resources.

Modules used

  • Encoder: The encoder is typically implemented using the ResNet backbone's early layers. It consists of numerous convolutional layers with growing receptive fields, which extract low-level and mid-level features from the input image. The feature maps produced by the encoder are then passed to the ASPP module.

    The encoder is essential for pixel-wise prediction, converting raw pixel data into abstract representations. It consists of several blocks of convolutional and pooling operations that gradually increase the number of channels while decreasing the input's spatial dimensions. This hierarchical structure lets the model capture features at varied levels of complexity, from simple edges and textures to complex object semantics.

  • Atrous Spatial Pyramid Pooling (ASPP): Following the encoder, the ASPP module executes several convolutions with various dilation rates, capturing contextual information at multiple scales. The outputs of these atrous convolutions are concatenated and processed to create context-rich features (see the sketch after this list).

    By gathering information from diverse scales and viewpoints, the ASPP module improves the network's comprehension of the objects in a scene. It is especially useful for handling objects with varying sizes and spatial distributions.

  • Decoder: Through skip connections, the decoder module combines low-level features from the encoder with high-level features from the ASPP module. This aids in recovering spatial detail and producing fine-grained segmentation maps.

    By including skip connections and mixing information from various scales, this module enables the network to generate precise and contextually rich segmentation maps. This is crucial for tasks like semantic segmentation, where accurate delineation of object boundaries is necessary for high-quality results.

  • Squeeze & Excitation (SE): This mechanism is designed to increase the representational power of convolutional neural networks by explicitly modeling channel-wise interactions. It was first presented by Jie Hu et al. in their 2018 paper Squeeze-and-Excitation Networks. The SE module selectively emphasizes informative channels while suppressing less important ones, letting the model focus greater attention on crucial features (see the sketch after this list).

    Global average pooling computes the mean of each channel across all spatial dimensions (the "squeeze"). The result is a channel-wise descriptor that reflects the significance of each channel relative to the overall feature map.

    The channels are then adaptively recalibrated using the squeezed information (the "excitation"), via two fully connected layers. The first layer reduces the dimensionality of the squeezed descriptor and is followed by a non-linear activation (ReLU). The second layer restores the original number of channels and produces a set of channel-wise excitation weights, which indicate how much each channel should be boosted or muted.

    Squeeze & Excitation Module
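
Two of the modules above lend themselves to short code sketches. First, a minimal ASPP block in Keras; the dilation rates and filter counts are illustrative assumptions, not necessarily the repository's, and `keepdims=True` assumes TensorFlow 2.6 or later.

```python
from tensorflow.keras import layers


def aspp_block(x, filters=256, rates=(6, 12, 18)):
    """Sketch of ASPP: parallel dilated convolutions plus image-level pooling,
    concatenated and fused with a 1x1 convolution. Assumes static spatial dims."""
    branches = [layers.Conv2D(filters, 1, padding="same", activation="relu")(x)]
    for rate in rates:
        branches.append(layers.Conv2D(filters, 3, padding="same",
                                      dilation_rate=rate, activation="relu")(x))
    # Image-level branch: global pooling, 1x1 conv, upsample back to x's size.
    pooled = layers.GlobalAveragePooling2D(keepdims=True)(x)
    pooled = layers.Conv2D(filters, 1, activation="relu")(pooled)
    pooled = layers.UpSampling2D(size=(x.shape[1], x.shape[2]),
                                 interpolation="bilinear")(pooled)
    branches.append(pooled)
    y = layers.Concatenate()(branches)
    return layers.Conv2D(filters, 1, padding="same", activation="relu")(y)
```

And a minimal Squeeze & Excitation block, following the squeeze/excite/recalibrate steps described above; the reduction ratio is an illustrative choice.

```python
def squeeze_excite(x, ratio=8):
    """Sketch of SE: global average pooling ('squeeze'), a two-layer
    bottleneck ('excitation'), then channel-wise rescaling."""
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                     # squeeze
    s = layers.Dense(channels // ratio, activation="relu")(s)  # reduce
    s = layers.Dense(channels, activation="sigmoid")(s)        # per-channel weights
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                           # recalibrate
```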

Results

Results of the developed model on the EWS, PSS and CVPPP datasets.

EWS

On the basis of IoU, the results of this repository's best performing model are compared to Zenkl et al. (2022), Yu et al. (2017), Sadeghi-Tehran et al. (2020) and Rico-Fernández et al. (2018).

A Development Flowchart and several model version configurations for ResNet50 and ResNet101 backbone on the EWS Dataset can be found here.

ResNet50 Backbone

Benchmark                       IoU
Repository (Model v1.5)         0.768
Zenkl et al. (2022)             0.775
Yu et al. (2017)                0.666
Sadeghi-Tehran et al. (2020)    0.638
Rico-Fernández et al. (2018)    0.691
ResNet50 Model v1.5 Result (Left to Right: Input Image, Ground Truth, Predicted Mask, Segmented Output)

ResNet101 Backbone

Benchmark                       IoU
Repository (Model v1.7)         0.763
Zenkl et al. (2022)             0.775
Yu et al. (2017)                0.666
Sadeghi-Tehran et al. (2020)    0.638
Rico-Fernández et al. (2018)    0.691
ResNet101 Model v1.7 Result (Left to Right: Input Image, Ground Truth, Predicted Mask, Segmented Output)

CVPPP

Totaling 810 images of tobacco and Arabidopsis plants, the CVPPP LCC 2017 Dataset is divided into 4 directories, A1 through A4. Splits A1, A2, and A4 contain Arabidopsis images, with 128, 31, and 624 images, respectively. A3 contains 27 images of tobacco plants.

An evaluation set of 63 images was assembled from splits A1 through A4, representing each split.

For this repository's results, the model was trained on A1, A2, A3, and A4 separately. A combined split of 267 images, consisting of 46 from A1, 20 from A2, and 201 from A4, was also created and used for training.

A Development Flowchart and several model version configurations for ResNet50 and ResNet101 backbone on the CVPPP Dataset can be found here.

Results can be found below.

ResNet50 Backbone

Split                   IoU      Dice-Loss
A1                      0.454    0.387
A2                      0.915    0.044
A3                      0.450    0.362
A4                      0.921    0.043
A1+A2+A4 (Model v1.5)   0.957    0.051
Model v1.1: A2 Result - ResNet50 (Left to Right: Input Image, Ground Truth, Predicted Mask, Segmented Output)
Model v1.2: A3 Result - ResNet50 (Left to Right: Input Image, Ground Truth, Predicted Mask, Segmented Output)
Model v1.3: A4 Result - ResNet50 (Left to Right: Input Image, Ground Truth, Predicted Mask, Segmented Output)
Model v1.5: A1+A2+A4 Result - ResNet50 (Left to Right: Input Image, Ground Truth, Predicted Mask, Segmented Output)

ResNet101 Backbone

Split                   IoU      Dice-Loss
A1                      0.447    0.398
A2                      0.892    0.056
A3                      0.480    0.322
A4                      0.915    0.045
A1+A2+A4 (Model v1.5)   0.960    0.022
Model v1.1: A2 Result - ResNet101 (Left to Right: Input Image, Ground Truth, Predicted Mask, Segmented Output)
Model v1.2: A3 Result - ResNet101 (Left to Right: Input Image, Ground Truth, Predicted Mask, Segmented Output)
Model v1.3: A4 Result - ResNet101 (Left to Right: Input Image, Ground Truth, Predicted Mask, Segmented Output)
Model v1.5: A1+A2+A4 Result - ResNet101 (Left to Right: Input Image, Ground Truth, Predicted Mask, Segmented Output)

Note

Data Augmentations were used for all training sets except A4.
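
The repository's exact augmentation pipeline is not reproduced here, but a typical sketch for segmentation data, where image and mask must receive the same geometric transform, could look like this; the specific transforms are illustrative assumptions.

```python
import tensorflow as tf


def augment(image, mask, seed=(1, 2)):
    # Same stateless seed => identical geometric transforms for image and mask.
    image = tf.image.stateless_random_flip_left_right(image, seed=seed)
    mask = tf.image.stateless_random_flip_left_right(mask, seed=seed)
    image = tf.image.stateless_random_flip_up_down(image, seed=seed)
    mask = tf.image.stateless_random_flip_up_down(mask, seed=seed)
    # Photometric changes apply to the image only.
    image = tf.image.stateless_random_brightness(image, max_delta=0.1, seed=seed)
    return image, mask
```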

PSS

The Humans in the Loop (HIL) Plant Semantic Segmentation (PSS) Dataset contains 144 images. No additional splits were created because of the dataset's smaller size. The masks from the dataset, however, were thresholded to contain only black or white: black is the background, white is the plant (a minimal sketch follows below).
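
A minimal sketch of such thresholding with OpenCV, assuming OpenCV is installed and using a placeholder file path:

```python
import cv2

# Placeholder path; force the mask to pure black (0) or white (255).
mask = cv2.imread("data/Mask/plant_0001.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
cv2.imwrite("data/Mask/plant_0001.png", binary)
```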

Data Augmentations were used during training of the model.

The best model found for this dataset produced the results listed below.

A Development Flowchart and several model version configurations for ResNet50 and ResNet101 backbone on the PSS Dataset can be found here.

ResNet50 Backbone

Model               IoU      Dice-Loss
Best Model (v1.3)   0.550    0.306
Best Model Result - ResNet50 (Left to Right: Input Image, Ground Truth, Predicted Mask, Segmented Output)

ResNet101 Backbone

Model               IoU      Dice-Loss
Best Model (v1.3)   0.550    0.305
Best Model Result - ResNet101 (Left to Right: Input Image, Ground Truth, Predicted Mask, Segmented Output)

Built With

Python

Tensorflow

  • IDE Used:

VSCode

  • Operating System(s):

Windows 11

WSL2

Getting Started

To get a local copy of this project up and running on your machine, follow these simple steps.

  • Clone a copy of this Repository on your machine.
git clone https://github.com/mukund-ks/DeepLabV3-Segmentation.git

Prerequisites

  • Python 3.9 or above.
python -V
Python 3.9.13
  • CUDA 11.2 or above.
nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_22:08:44_Pacific_Standard_Time_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

Installation

  1. Move into the cloned repo.
cd DeepLabV3-Segmentation
  2. Set up a Virtual Environment
python -m venv env
  3. Activate the Virtual Environment
env/Scripts/activate
  4. Install Dependencies
pip install -r requirements.txt

Note

You can deactivate the Virtual Environment by using env/Scripts/deactivate

Usage

The Model can be trained on the data aforementioned in the About section or on your own data.

  • To train & evaluate the model, use main.py
python main.py --help
Usage: main.py [OPTIONS]

  A DeepLab V3+ Decoder based Binary Segmentation Model with choice of
  Encoders b/w ResNet101 and ResNet50.

  Please make sure that your data is structured according to the folder
  structure specified in the Github Repository.

  See: https://github.com/mukund-ks/DeepLabV3-Segmentation

Options:
  --data-dir TEXT                 Path for Data Directory.  [required]
  --eval-dir TEXT                 Path for Evaluation Directory.  [required]
  -M, --model-type [ResNet101|ResNet50]
                                  Choice of Encoder.  [required]
  -A, --augmentation BOOLEAN      Opt-in to apply augmentations to provided
                                  data. Default - True
  -S, --split-data BOOLEAN        Opt-in to split data into Training and
                                  Validation set. Default - True
  --stop-early BOOLEAN            Opt-in to stop Training early if val_loss
                                  isn't improving. Default - True
  -B, --batch-size INTEGER        Batch size of data during training. Default
                                  - 4
  -E, --epochs INTEGER            Number of epochs during training. Default -
                                  25
  --help                          Show this message and exit.
  • An Example
python main.py --data-dir data --eval-dir eval_data -M ResNet50 -A False -S True -B 16 -E 80 --stop-early False

Folder Structure

The folder structure will alter slightly depending on whether or not your training data has already been divided into a training and testing set.

  • If the data is not already separated, it should be in a directory called data that is further subdivided into Image and Mask subdirectories.

    • main.py should be run with --split-data option as True in this case.

      Example: python main.py --data-dir data --eval-dir eval_data --model-type ResNet50 --split-data True

Note: The data will be split into training and testing sets with a ratio of 0.2.

$ tree -L 2
.
├── data
│   ├── Image
│   └── Mask
└── eval_data
    ├── Image
    └── Mask
  • If the data has already been separated, it should be in a directory called data that is further subdivided into the subdirectories Train and Test, both of which contain the subdirectories Image and Mask.

    • main.py should be run with --split-data option as False in this case.

      Example: python main.py --data-dir data --eval-dir eval_data --model-type ResNet50 --split-data False

$ tree -L 3
.
├── data
│   ├── Test
│   │   ├── Image
│   │   └── Mask
│   └── Train
│       ├── Image
│       └── Mask
└── eval_data
    ├── Image
    └── Mask
  • The structure of eval_data remains the same in both cases, holding Image and Mask sub-directories.

Note

The directory names are case-sensitive.

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  • If you have suggestions for adding or removing projects, feel free to open an issue to discuss it, or directly create a pull request after you edit the README.md file with necessary changes.
  • Please make sure you check your spelling and grammar.
  • Create an individual PR for each suggestion.
  • Please also read through the Code Of Conduct before posting your first idea.

Creating A Pull Request

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b MyBranch)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push -u origin MyBranch)
  5. Open a Pull Request

License

Distributed under the Apache 2.0 License. See LICENSE for more information.

Authors

Acknowledgements

To Cite this Repository

Surehli, M. K., Aggarwal, N., & Joshi, G. (2023, August 7). GitHub - mukund-ks/DeepLabV3-Segmentation: A DeepLab V3+ Model with ResNet 50 / ResNet101 Encoder for Binary Segmentation. Implemented with Tensorflow. Retrieved from https://github.com/mukund-ks/DeepLabV3-Segmentation

Footnotes

  1. A PyTorch implementation can be found here.
