My background in segmenting cell nuclei
In early 2018, I built a series of submissions for Kaggle’s Data Science Bowl in order to learn more about image segmentation with convolutional neural networks. The aim of the Data Science Bowl competition is to help streamline medical research by providing methods that automatically segment cell nuclei in images of cells.
The first stage of the competition provided 670 training images of cells, with a further 65 held out for scoring and testing. Within the data set there is a significant amount of variation in the style and size of the input images, which adds to the complexity of the problem; the images in figure 1 illustrate that variation. Each training sample had two main parts: a base image used as input and a collection of image masks representing every nucleus contained in that image. Scores for the competition were generated by feeding in the test images, generating image masks using any programmatic method, identifying the locations of the individual nuclei in each image, and then submitting that list to Kaggle for scoring.
This challenge can effectively be summarized as an image segmentation task, and I approached it as such. Since I had not previously had the opportunity to work on many segmentation tasks, my first step was a literature review. Three of the models that I looked at were U-Net, LinkNet, and Mask R-CNN.
The original paper on U-Net came out in 2015 and laid out an image segmentation method using a fully convolutional neural network. Among the benefits the authors cite are improved accuracy over previous models and the ability to train on very small data sets. In fact, one of the initial challenges was training U-Net to perform segmentation using a data set of only 30 512×512 images. While this number is extremely low, the authors were able to achieve good results using aggressive data augmentation. The structure of U-Net is a series of convolutional layers whose outputs are passed to corresponding deconvolutional layers, which allows U-Net to reconstruct the image and bring it back up to the dimensions of the original input. The model outputs a classification for each pixel in the image, and together these form a segmentation mask.
Most of my work has been based on variations of the overall U-Net architecture.
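The encoder–decoder pattern with skip connections can be sketched in a few lines of Keras. This is a deliberately tiny illustration of the U-Net idea, not my competition model; the layer widths and depth here are placeholders.

```python
# A minimal sketch of the U-Net pattern: a contracting path, a bottleneck,
# and an expanding path that concatenates matching encoder outputs.
# Widths and depth are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers

def tiny_unet(input_shape=(128, 128, 1)):
    inputs = layers.Input(shape=input_shape)

    # Contracting path: convolve, then downsample.
    c1 = layers.Conv2D(16, 3, activation="relu", padding="same")(inputs)
    p1 = layers.MaxPooling2D(2)(c1)
    c2 = layers.Conv2D(32, 3, activation="relu", padding="same")(p1)
    p2 = layers.MaxPooling2D(2)(c2)

    # Bottleneck.
    b = layers.Conv2D(64, 3, activation="relu", padding="same")(p2)

    # Expanding path: upsample and concatenate the matching encoder output.
    u2 = layers.Conv2DTranspose(32, 2, strides=2, padding="same")(b)
    u2 = layers.concatenate([u2, c2])
    c3 = layers.Conv2D(32, 3, activation="relu", padding="same")(u2)
    u1 = layers.Conv2DTranspose(16, 2, strides=2, padding="same")(c3)
    u1 = layers.concatenate([u1, c1])
    c4 = layers.Conv2D(16, 3, activation="relu", padding="same")(u1)

    # Per-pixel classification: one sigmoid output per pixel of the input.
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(c4)
    return tf.keras.Model(inputs, outputs)
```

The key detail is that the deconvolutional layers restore the spatial dimensions, so the output mask lines up pixel-for-pixel with the input image.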
From my experience, LinkNet is lightning fast, which is one of the main improvements the authors cite in their summary. LinkNet is a relatively lightweight network with around 11.5 million parameters; networks like VGG have more than ten times that amount.
The structure of LinkNet is a series of encoder and decoder blocks that break the image down and build it back up before passing it through a few final convolutional layers. The network was designed to minimize the number of parameters so that segmentation could be done in real time.
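A large part of how LinkNet stays so lean is that each decoder block receives its matching encoder input by element-wise addition rather than concatenation, and squeezes channels through 1×1 convolutions. The sketch below shows one encoder/decoder pair in that style; the widths are illustrative, not the paper's exact configuration.

```python
# Sketch of the LinkNet idea: the encoder input is "linked" to the decoder
# output by addition (not concatenation), keeping the channel count, and
# therefore the parameter count, low. Illustrative widths only.
import tensorflow as tf
from tensorflow.keras import layers

def linknet_block_pair(input_shape=(64, 64, 16)):
    inputs = layers.Input(shape=input_shape)

    # Encoder block: convolve and downsample.
    e = layers.Conv2D(32, 3, strides=2, activation="relu", padding="same")(inputs)

    # Decoder block: 1x1 reduce, upsample, 1x1 expand back to the input width.
    d = layers.Conv2D(8, 1, activation="relu")(e)
    d = layers.Conv2DTranspose(8, 3, strides=2, activation="relu", padding="same")(d)
    d = layers.Conv2D(16, 1, activation="relu")(d)

    # The link: element-wise addition of the encoder input to the decoder output.
    out = layers.add([d, inputs])
    return tf.keras.Model(inputs, out)
```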
I performed some tests of the LinkNet architecture, but did not spend too much time iterating to improve the models.
Mask R-CNN was developed by Facebook as an extension of their previous work on Fast R-CNN and Faster R-CNN, with the goal of pixel-level segmentation. Faster R-CNN was designed for object detection, where the output was a standard bounding box with an associated class prediction. Mask R-CNN extends this further by adding a fully convolutional branch in parallel to the previous network, allowing the new architecture to output a prediction for every pixel in a given image, assign a class, and create a bounding box. The construction of the network itself is relatively intuitive and illustrates the power of neural networks at solving a variety of complicated tasks.
The grey section in figure 4 represents the previous Faster R-CNN model developed by Facebook. Mask R-CNN adds fully convolutional layers in parallel to that network to generate pixel-level image masks along with class predictions and bounding boxes.
After getting familiar with the data for the competition, I built out a series of U-Net models, starting with the classic U-Net architecture depicted above. At this point I was still using the raw training images as input. Once that was complete, I started to experiment with increasing the number of convolutional and deconvolutional blocks to increase the network's receptive field.
For guidance on exploring extensions of the U-Net and LinkNet architectures, I looked to the winners' interview for the Carvana Image Masking Challenge, which took place in 2017. The winners used an ensemble of U-Nets and LinkNets to generate their submissions and described some of the trials, errors, and take-home lessons they gained in the competition. Given that that challenge is very similar to the task in the Data Science Bowl, it made for a good transfer of knowledge.
What I found is that by making the network deeper and applying regularization, in the form of L2 and to some extent batch normalization, I was able to create a network which outperformed the standard U-Net architecture and generated decent output as part of the competition.
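That combination of L2 weight decay and batch normalization can be packaged as a reusable convolutional block. This is a sketch of the pattern rather than my exact layers, and the 1e-4 penalty is an illustrative value, not the one I tuned to.

```python
# Sketch of a regularized convolutional block: L2 weight decay on the kernel
# plus batch normalization before the nonlinearity. The penalty is illustrative.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def regularized_conv_block(x, filters, weight_decay=1e-4):
    # Convolution with an L2 penalty applied to the kernel weights.
    x = layers.Conv2D(filters, 3, padding="same",
                      kernel_regularizer=regularizers.l2(weight_decay))(x)
    # Normalize activations, then apply the nonlinearity.
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)
```

Stacking more of these blocks is how the network gets deeper while the regularization keeps the extra capacity from overfitting.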
Given that this data set has a fair amount of variation, I applied the techniques discussed by Kevin Mader on Kaggle in an attempt to standardize the input images. The main variation that I wanted to remove was the difference in colors of the backgrounds, cell material, and nuclei. The technique produces a fluorescent-looking final color, and in cases where the cells are the darkest part of the image the colors are inverted, which gives all the images a similar final state as opposed to the raw image inputs.
I have trained several other U-Nets and LinkNets which performed well on these images and on black and white versions of these normalized images. Converting them to black and white gives a fairly uniform input, and since black and white images have only a single channel, there is less variation that the network must account for.
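The core of that black-and-white standardization can be shown in plain NumPy. To be clear, this is a simplified stand-in for the kernel's approach, not its actual code: collapse to grayscale, then invert any image whose cells are darker than its background so that nuclei always end up as the bright regions. The 0.5 brightness threshold is a hypothetical heuristic.

```python
# Simplified sketch of the standardization idea: grayscale plus conditional
# inversion so nuclei are always brighter than the background.
import numpy as np

def standardize_image(img):
    """img: float array in [0, 1] with shape (H, W, 3). Returns (H, W)."""
    gray = img.mean(axis=-1)
    # Heuristic (assumed): if the image is mostly bright, the cells are the
    # dark part, so flip it to match the dark-background images.
    if gray.mean() > 0.5:
        gray = 1.0 - gray
    return gray
```

After this step, every training image presents the network with the same convention: dark background, bright nuclei.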
For all these networks I had to perform a significant amount of data augmentation, as was done in the academic literature. The types of augmentation that I applied were vertical flips, rotations, width and height shifts, shearing, and zooming. This helped to avoid overfitting on the training set and improved the generalization of the networks. One additional aspect of this challenge was that the same augmentations had to be applied to both the input image and the target image mask; otherwise there would be a mismatch between the input and output data, which would be catastrophic for the network. In Keras this was done by fixing the initial random seed of the image generators so that each input and target mask would have the same augmentations applied to it.
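The seed trick looks like this in Keras: two `ImageDataGenerator`s with identical augmentation settings, each given the same seed when `flow` is called, so the random transforms stay in lockstep between images and masks. The specific parameter values below are illustrative, not my tuned settings.

```python
# Paired augmentation sketch: the shared seed keeps the two random streams
# in lockstep, so each image and its mask receive the same transform.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

aug = dict(rotation_range=45, width_shift_range=0.1, height_shift_range=0.1,
           shear_range=0.1, zoom_range=0.2, vertical_flip=True)

image_gen = ImageDataGenerator(**aug)
mask_gen = ImageDataGenerator(**aug)

def paired_batches(images, masks, batch_size=8, seed=42):
    # Passing the same seed to both flows is what keeps them synchronized.
    image_flow = image_gen.flow(images, batch_size=batch_size, seed=seed)
    mask_flow = mask_gen.flow(masks, batch_size=batch_size, seed=seed)
    return zip(image_flow, mask_flow)
```

If the seeds differed, an image could be rotated while its mask was flipped, exactly the catastrophic mismatch described above.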
Another area of experimentation was testing SGD, RMSProp, and Adam optimizers. While the general approach of the deep learning community is to solve most problems with an Adam optimizer, a 2017 paper from Berkeley caught my eye: The Marginal Value of Adaptive Gradient Methods in Machine Learning. Its point is that while more advanced methods have been developed, SGD-trained networks perform just as well or better, particularly when it comes to generalizing to unseen test data. So I experimented with SGD-optimized networks. They took a good amount of time to converge, but not extraordinarily so.
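Swapping optimizers in Keras is a one-line change at compile time, which made this comparison cheap to run. The learning rates below are illustrative defaults rather than the values I settled on.

```python
# Sketch of the optimizer comparison: the same model compiled with plain SGD
# (with momentum), RMSProp, or Adam. Learning rates are illustrative.
import tensorflow as tf

def compile_with(model, optimizer_name):
    optimizers = {
        "sgd": tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
        "rmsprop": tf.keras.optimizers.RMSprop(learning_rate=1e-3),
        "adam": tf.keras.optimizers.Adam(learning_rate=1e-3),
    }
    model.compile(optimizer=optimizers[optimizer_name],
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```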
This was my first extensive project on image segmentation and I enjoyed diving into the literature and working with fully convolutional networks. It was a great opportunity to learn a new subject matter area and test several hypotheses in the areas of data augmentation, regularization, network structure, and optimization.