How do masks and images work with each other in UNET? - machine-learning

Let's say we have 1000 images with their corresponding masks. Correct me if I am wrong: if we use UNET, the input passes through a number of different convolutional layers, ReLU, pooling, etc. It learns the features of the images according to their corresponding masks: it assigns a label to the objects and learns their features during training. It matches the object in the image with its corresponding mask, so it learns only the object's features, not the features of irrelevant objects. For example, if we pass an image of a cat whose background is filled with clutter (bins, a table, a chair, etc.),
then, according to the cat's mask, it will learn the features of the cat only. Kindly elaborate on your answer if I am wrong.

Yes, you are right.
However, this is not specific to UNET: every segmentation algorithm works the same way, in that it learns to detect the features that are masked and ignores unnecessary objects (as you mentioned).
By the way, people typically choose Fast R-CNN or YOLO over UNET for multi-class segmentation of real-world objects (like chairs, tables, cats, cars, etc.).

So here is a short explanation (but not an exhaustive one).
1- Every segmentation network, or let's say task (in a more general sense), uses the actual image and the ground truth (your masks) to learn a classification task.
Is it really a classification task, like logistic regression or a decision tree? (Then why such a complex name?)
Ans: Cool, intrinsically YES, your network is learning to classify. But it's a bit different from your decision tree or logistic regression.
So a network like UNET tries to learn how to classify each pixel in the image, and this learning is completely supervised, since you have a ground truth (the masks) that tells the network which class each pixel in the image belongs to. Hence, when you train, the network weights (the weights of all your conv layers and so on) are adjusted so that it learns to classify each pixel in the image into its corresponding class.
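To make the "classify each pixel" idea concrete, here is a minimal sketch (not from the original answer) of how the mask supervises a per-pixel cross-entropy loss; the tiny stand-in model, the shapes, and the two-class setup are illustrative assumptions only.

```python
# Hypothetical sketch: per-pixel classification supervised by integer masks.
import tensorflow as tf

num_classes = 2  # e.g. "cat" vs. "background" (assumed)

# Stand-in for a real UNET: any encoder-decoder that outputs one logit per
# class per pixel would plug in here the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, padding="same", activation="relu",
                           input_shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(num_classes, 1),  # (H, W, num_classes) logits
])

# The mask holds the class id of every pixel, so ordinary sparse cross-entropy
# applied pixel-wise is the whole "classification" loss.
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# images: (N, 128, 128, 3) floats, masks: (N, 128, 128) integer class ids
# model.fit(images, masks, epochs=10)
```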

Related

Understanding Faster R-CNN

I'm trying to understand Fast(er) R-CNN, and these are the questions I'm trying to answer:
1. To train a Fast R-CNN model, do we have to give bounding box information in the training phase?
2. If you have to give bounding box information, then what's the role of the ROI layer?
3. Can we use a pre-trained model that was only trained for classification, not object detection, and use it for Fast(er) R-CNN?
Your answers:
1.- Yes.
2.- The ROI layer is used to produce a fixed-size vector from variable-sized regions. This is performed using max-pooling, but instead of using the typical n-by-n cells, the region is divided into an n-by-n grid of non-overlapping bins (which vary in size) and the maximum value in each bin is output. The ROI layer also does the job of projecting the bounding box from input space to feature space (a rough sketch of the pooling step follows these answers).
3.- Faster R-CNN MUST be used with a pretrained network (typically on ImageNet); it cannot be trained end-to-end from scratch. This might be a bit hidden in the paper, but the authors do mention that they use features from a pretrained network (VGG, ResNet, Inception, etc.).
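The following numpy sketch is only meant to illustrate the bin-wise max-pooling described in answer 2; the function name, grid size, and rounding scheme are illustrative assumptions, not the actual Fast R-CNN implementation.

```python
# Illustrative ROI max-pooling: split a box into a fixed n x n grid of
# variable-sized bins and take the max of each bin, so any box size maps to
# the same fixed-size output.
import numpy as np

def roi_max_pool(feature_map, box, n=7):
    """feature_map: (H, W) array; box: (x0, y0, x1, y1) in feature-map coordinates."""
    x0, y0, x1, y1 = box
    xs = np.linspace(x0, x1, n + 1).round().astype(int)  # bin edges along x
    ys = np.linspace(y0, y1, n + 1).round().astype(int)  # bin edges along y
    out = np.zeros((n, n), dtype=feature_map.dtype)
    for i in range(n):
        for j in range(n):
            # max() over each (possibly differently sized, never empty) bin
            cell = feature_map[ys[i]:max(ys[i + 1], ys[i] + 1),
                               xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max()
    return out

fmap = np.random.rand(32, 48)                    # a fake feature map
print(roi_max_pool(fmap, (5, 3, 30, 20)).shape)  # always (7, 7)
```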

Can I retrain Inception's final layer using depth images from Kinect?

I would like to know whether I can use a dataset of signs captured with a Kinect to retrain Inception's final layer, as described in the TensorFlow tutorial that uses ordinary RGB images. I am new to this field. Opinions are much appreciated.
The short answer is "No. You cannot just fine tune only the last layer. But you can fine tune the whole pre-trained network." The first layers of the pre-trained network are looking for RGB features. Your depth frames will hardly provide enough entropy to match that. Your options are:
If the recognised/tracked objects (hands) are not masked and you have actual depth data for the background, you can train from scratch on depth images with a bit of contrast stretching and data whitening ((x - mu) / sigma). This will take a very long time for the ivy-league networks like Inception and ResNet. Also, keep in mind that most Python-based deep learning frameworks rely on PIL image loaders, which by default assume images have 8-bit channels mapped into the range [0, 1]; these loaders will cast all your 16-bit pixels to ones.
If the recognised/tracked objects (hands) are masked, which means your background is set to the same value or barely has any gradient in it, the network will overfit on the silhouette of the object, because that is where the strongest edges are. The solution is to colorise the depth image using normal maps or HSA, HSV, or JET colour coding to convert it into a 3x8-bit-channel image (a rough sketch follows below). This makes training converge much faster, and in my latest experiments we found that you can fine tune the ivy-league networks on the colorised depth.
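Below is a small hedged sketch of the "colorise the depth image" suggestion using OpenCV's JET colormap; the 0-4500 mm normalisation range is an assumed typical Kinect span and should be adapted to your sensor.

```python
# Assumed example: turn a 16-bit Kinect depth frame into a 3x8-bit JET image.
import cv2
import numpy as np

depth16 = np.random.randint(0, 4500, size=(424, 512), dtype=np.uint16)  # fake depth frame

# Squash 16-bit depth into 8 bits, then map it to a 3-channel colour image.
depth8 = cv2.convertScaleAbs(depth16, alpha=255.0 / 4500.0)
depth_jet = cv2.applyColorMap(depth8, cv2.COLORMAP_JET)  # (424, 512, 3) uint8

cv2.imwrite("depth_jet.png", depth_jet)
```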
Since you are new to this field, I would suggest reading up on transfer learning and all three of its types, and then applying whichever form of transfer learning suits your dataset. If your dataset is very similar to the one the model you are using was trained on, you can get away with retraining only the last layers; if your data is not similar, you have to fine tune more of the existing model and then use it.
As the depth of a neural network increases, the features it extracts become more specific to the training data, so you have to take care with the later layers if your dataset is not very similar to the dataset the pre-built model was trained on. The starting layers contain more generic features.

Reduce dimensions of model's fully connected layer for image retrieval task

I'm working on an image retrieval task (not involving faces), and one of the things I am trying is to swap out the softmax layer in the CNN model and use the LMNN classifier. For this purpose I fine tuned the model and then extracted the features at the fully connected layer. I have about 3000 images right now. The fully connected layer gives a 4096-dim vector, so my final matrix is 3000x4096, with about 700 classes (each class has 2+ images). I believe this is an extremely large dimensionality, on which the LMNN algorithm is going to take forever (it really did take forever).
How can I reduce the number of dimensions? I tried PCA, but that didn't squeeze down the dimensions much (it only got down to 3000x3000). I am thinking a 256/512/1024-dim vector should help. If I were to add another layer to reduce dimensions, say a new fully connected layer, would I have to fine tune my network again? Inputs on how to do that would be great!
I am also currently trying to augment my data to get more images per class and increase the size of my dataset.
Thank you.
PCA should let you reduce the data further - you should be able to specify the desired dimensionality; see the Wikipedia article.
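As a concrete (hedged) illustration of that suggestion, scikit-learn's PCA takes the target dimensionality directly; the variable names below are placeholders for the asker's extracted fully-connected-layer features.

```python
# Sketch: reduce 3000 x 4096 features to 3000 x 512 with PCA.
import numpy as np
from sklearn.decomposition import PCA

features = np.random.rand(3000, 4096)  # stand-in for the fc-layer activations

pca = PCA(n_components=512)
reduced = pca.fit_transform(features)
print(reduced.shape)                         # (3000, 512)
print(pca.explained_variance_ratio_.sum())   # fraction of variance kept by 512 dims
```

Without n_components, PCA keeps min(n_samples, n_features) components by default, which would explain why the earlier attempt stopped at 3000x3000.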
As well as PCA you can try t-distributed stochastic neighbor embedding (t-SNE). I really enjoyed Wattenberg, et al.'s article - worth a read if you want to get an insight into how it works and some of the pitfalls.
In a neural net the standard way to reduce dimensionality is by adding more, smaller layers, as you suggested. Since they can only learn during training, you'll need to re-run your fine-tuning. Ideally you would re-run the entire training process whenever you change the model structure, but if you have enough data it may still be OK.
To add new layers in TensorFlow, you would add a fully connected layer whose input is the output of your 4096-element layer, with the output size set to the desired number of elements. You may repeat this if you want to go down gradually (e.g. 4096 -> 1024 -> 512). You would then perform your training (or fine tuning) again.
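Here is a hedged sketch of that idea in Keras; base_model stands in for the asker's fine-tuned CNN (a stock VGG16 is used purely so the example runs), and the 1024/512 sizes are the gradual reduction mentioned above.

```python
# Sketch: bolt smaller fully connected layers onto an existing 4096-d fc layer.
import tensorflow as tf

base_model = tf.keras.applications.VGG16(weights="imagenet", include_top=True)
features = base_model.get_layer("fc2").output          # the 4096-d layer

x = tf.keras.layers.Dense(1024, activation="relu")(features)
x = tf.keras.layers.Dense(512, activation="relu")(x)   # use this 512-d output for retrieval
outputs = tf.keras.layers.Dense(700, activation="softmax")(x)  # ~700 classes

model = tf.keras.Model(inputs=base_model.input, outputs=outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(images, labels, ...)  # re-run fine-tuning before extracting features
```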
Lastly, I did a quick search and found this paper that claims to support LMNN over large datasets through random sampling. You might be able to use that to save a few headaches: Fast LMNN Algorithm through Random Sampling

Altering trained images to train neural network

I am currently trying to make a program to differentiate rotten oranges from edible oranges solely based on their external appearance. To do this, I am planning on using a Convolutional Neural Network trained on rotten oranges and normal oranges. After some searching I could only find one database of approx. 150 rotten oranges and 150 normal oranges on a black background (http://www.cofilab.com/downloads/). Obviously, a machine learning model will need at least a few thousand oranges to achieve an accuracy above 90 percent or so. However, can I alter these 150 oranges in some way to produce more photos of oranges? By alter, I mean adding different shades of orange to the citrus fruit to make a "different" orange. Would this be an effective method of training a neural network?
It is a very good way to increase the amount of data you have. What you do depends on your data. For example, if you are training on data obtained from a sensor, you may want to add some noise to the training data so that you can increase your dataset; after all, you can expect some noise coming from the sensor later on.
Assuming that you will train on images, here is a very good GitHub repository that provides the means to use those techniques. This Python library helps you with augmenting images for your machine learning projects. It converts a set of input images into a new, much larger set of slightly altered images (a minimal usage sketch follows the feature list below).
Link: https://github.com/aleju/imgaug
Features:
Most standard augmentation techniques are available.
Techniques can be applied to both images and keypoints/landmarks on images.
Define your augmentation sequence once at the start of the experiment, then apply it many times.
Define flexible stochastic ranges for each augmentation, e.g. "rotate each image by a value between -45 and 45 degrees" or "rotate each image by a value sampled from the normal distribution N(0, 5.0)".
Easily convert all stochastic ranges to deterministic values to augment different batches of images in exactly the same way (e.g. images and their heatmaps).
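A minimal imgaug sketch of that workflow, assuming the images are uint8 numpy arrays; the specific augmenters chosen here are illustrative, not a recommendation.

```python
# Define the augmentation sequence once, then apply it to every batch.
import numpy as np
import imgaug.augmenters as iaa

seq = iaa.Sequential([
    iaa.Fliplr(0.5),                     # horizontally flip half of the images
    iaa.Affine(rotate=(-45, 45)),        # rotate each image between -45 and 45 degrees
    iaa.GaussianBlur(sigma=(0.0, 3.0)),  # blur with a randomly sampled sigma
])

images = np.random.randint(0, 255, size=(16, 128, 128, 3), dtype=np.uint8)
augmented = seq(images=images)           # same shapes, randomly altered copies
```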
Data augmentation is what you are looking for. In your case you can do different things:
Apply filters to get slightly different images; as has been said, you can use Gaussian blur.
Cut the orange out and put it on different backgrounds.
Scale the oranges with different scale factors.
Rotate the images.
Create synthetic rotten oranges.
Mix all of the different combinations of the above. With this kind of augmentation you can easily create thousands of different oranges.
I did something like that with a dataset of 12,000 images and was able to create 630,000 samples.
That is indeed a good way to increase your dataset. You can, for example, apply Gaussian blur to the images: they will become blurred, but different from the original. You can invert the images too. Or, as a last resort, look for new images and apply the techniques cited above.
Data augmentation is a really good way to boost the training set, but it is still not enough to train a deep network end to end on its own, given the likelihood that it will overfit. You should look at domain adaptation, where you take a pretrained model like Inception, which was trained on the ImageNet dataset, and fine tune it for your problem. Since you only have to learn the parameters required to classify your use case, it is possible to achieve good accuracy with relatively little training data. I have hosted a demo of classification with this technique here. Try it out with your dataset and see if it helps. The demo takes care of the pretrained model as well as data augmentation for the dataset that you upload.
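A hedged sketch of that fine-tuning recipe in Keras: an ImageNet-pretrained Inception is frozen and used as a feature extractor, and only a small new head is trained for rotten vs. edible; the layer choices and image size are assumptions.

```python
# Sketch: fine-tune only a new classification head on top of frozen InceptionV3.
import tensorflow as tf

base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                         input_shape=(299, 299, 3))
base.trainable = False  # keep the ImageNet features, learn only the new head

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # rotten / edible
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=5)  # a few hundred (augmented) oranges
```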

image augmentation algorithms for preparing deep learning training set

To prepare large amounts of data for training deep learning-based image classification models, we usually have to rely on image augmentation methods. I would like to know what the usual image augmentation algorithms are, and whether there are any considerations when choosing them.
The literature on data augmentation is very large and very dependent on your kind of application.
The first things that come to my mind are the galaxy competition's rotations and Jasper Snoeke's data augmentation.
But really, all papers have their own tricks to get good scores on particular datasets, for example stretching the image to a specific size before cropping it, or whatever, and all of this in a very specific order.
More practically, to train models on the likes of CIFAR or ImageNet, use random crops and random contrast and luminosity perturbations, in addition to the obvious flips and noise addition.
Look at the CIFAR-10 tutorial on the TF website; it is a good start. Plus, TF now has random_crop_and_resize(), which is quite useful.
EDIT: The papers I am referencing here and there.
It depends on the problem you have to address, but most of the time you can:
Rotate the images.
Flip the images (X or Y symmetry).
Add noise.
Do all of the previous at the same time (a short sketch of these perturbations follows).
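As a hedged illustration of the crop / flip / brightness-contrast / noise perturbations mentioned above, here is a small tf.image sketch; the exact parameter ranges are illustrative only.

```python
# Sketch: CIFAR-style random perturbations applied to a single float image.
import tensorflow as tf

def augment(image):
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, max_delta=0.2)     # luminosity perturbation
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    image = tf.image.random_crop(image, size=(24, 24, 3))        # random crop
    image = image + tf.random.normal(tf.shape(image), stddev=0.01)  # mild noise
    return image

image = tf.random.uniform((32, 32, 3))
print(augment(image).shape)  # (24, 24, 3)
```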
