Why use sliding windows with convolutional neural nets in object detection? - machine-learning

I read that CNNs (with both convolution and max-pooling layers) are shift-invariant, but most object detection methods used a sliding window detector with non-maximum suppression. Is it necessary to use sliding windows with CNNs when doing object detection?
Basically, instead of training the network on small 50x50 patches of images containing the desired object, why not train on entire images where the object is present somewhere? All I can think of is practical/performance reasons (doing forward pass on smaller patches instead of whole images), but is there also a theoretical explanation I'm overlooking?

internally, CNN is doing a sliding window. Convolution in terms of 2d image is nothing more than a linear filter applied in the sliding window manner. This is simply nice, mathematical expression of the very same operation, which helps us do neat optimization. Max pooling on the other hand helps us to be robust in terms of small shifts/noise. So efficiently feeding image to the network is using (many!) sliding windows on it. Can we pass big images instead of small ones? Sure, but you wil get extremely big tensors (just compute how many numbers you will need, this is huge), and you will get really complex optimization problem. Nowadays we optimize in milions-dimensional space. Working with whole images might lead to bilions (or even bigger) number of dimensions. Optimization complexity grows exponentialy with the growth of the dimension, thus you will end up with extremely slow method (not in terms of computation itself - but convergence).

Related

Can we explicitly specify what feature to be extracted from an image while using CNN

Last day I learned about the convolution neural network, And went through some implementations of CNN using Tensorflow, All the implementation only specify the size, number of filters and strides for the filter. But when I learned about the filter it says that filter on each layer extracts different feature like edges, corners etc.
My question is can we explicitly specify filter which all feature we should extract, Or which portion for the image is more important etc
All the explanation says that we take a small part of an input image a slide across it with convolving. If so do we take all the parts of image and convolve across the image?
can we explicitly specify filter which all feature we should extract, Or which portion for the image is more important etc
Sure, this could be done. But the advantage of CNNs is that they learn the best features themselves (or at least very good ones; better ones than we can come up with in most cases).
One famous example is the ImageNet dataset:
In 2012 the first end-to-end learned CNN was used. End-to-end means that the network gets the raw data on one end as input and the optimization objective on the other end.
Before CNNs, the computer vision community used manually designed features for many years. After AlexNet in 2012, nobody did so (for "typical" computer vision - there are special applications where it is still worth a shot).
All the explanation says that we take a small part of an input image a slide across it with convolving. If so do we take all the parts of image and convolve across the image?
It is always the complete image which is convolved with a small filter. The convolution operation is local, meaning you can compute much of it in parallel as the result of the convolution in the upper left corner is not
dependent of the convolution in the lower left corner.
I think you may be confusing filters and channels. A filter is the weight window size in your convolution which can be used to produce channels from the convolution output. It is typically these channels that represent different features:
In this car identification example you can see some of the earlier channels picking up things like the hood, doors, and other borders of the car. It is hard to truly specify which features the network is extracting. If you already have knowledge of features that are important to the network you can feed them in as an additional mask layer or using some type of weighting matrix on them.

semantic segmentation for large images

I am working on a limited number of large size images, each of which can have 3072*3072 pixels. To train a semantic segmentation model using FCN or U-net, I construct a large sample of training sets, each training image is 128*128.
In the prediction stage, what I do is to cut a large image into small pieces, the same as trainning set of 128*128, and feed these small pieces into the trained model, get the predicted mask. Afterwards, I just stitch these small patches together to get the mask for the whole image. Is this the right mechanism to perform the semantic segmentation against the large images?
Your solution is often used for this kind of problem. However, I would argue that it depends on the data if it truly makes sense. Let me give you two examples you can still find on kaggle.
If you wanted to mask certain parts of satellite images, you would probably get away with this approach without a drop in accuracy. These images are highly repetitive and there's likely no correlation between the segmented area and where in the original image it was taken from.
If you wanted to segment a car from its background, it wouldn't be desirable to break it into patches. Over several layers the network will learn the global distribution of a car in the frame. It's very likely that the mask is positive in the middle and negative in the corners of the image.
Since you didn't give any specifics what you're trying to solve, I can only give a general recommendation: Try to keep the input images as large as your hardware allows. In many situation I would rather downsample the original images than breaking it down into patches.
Concerning the recommendation of curio1729, I can only advise against training on small patches and testing on the original images. While it's technically possible thanks to fully convolutional networks, you're changing the data to an extend, that might very likely hurt performance. CNNs are known for their extraction of local features, but there's a large amount of global information that is learned over the abstraction of multiple layers.
Input image data:
I would not advice feeding the big image (3072x3072) directly into the caffe.
Batch of small images will fit better into the memory and parallel programming will too come into play.
Data Augmentation will also be feasible.
Output for big Image:
As for the output of big Image, you better recast the input size of FCN to 3072x3072 during test phase. Because, layers of FCN can accept inputs of any size.
Then you will get 3072x3072 segmented image as output.

Data Augmentation for Object Detection using Deep Learning

I have a question regarding data augmentation for training the deep neural network for object detection.
I have quite limited data set (nearly 300 images). I augmented the data by rotating each image from 0-360 degrees with stepsize of 15 degree. Consequently I got 24 rotated images out of just one. So in total, I got around 7200 images. Then I drew bounding box around the object of interest in each augmented image.
Does it seem to be a reasonable approach to enhance the data?
Best Regards
In order to train a good model you need lots of representative data. Your augmentation is representative only for rotations, so yes, it is a good method, if you are concerned about having not enough object rotations. However, it will not help in any sense with generalization to other objects/transformations.
It seems like you are on the right track, rotation is usually a very useful transformation for augmenting the training data. I would suggest to try other transformations like shift (you most probably want to detect partially present objects), zoom (makes your model invariant to the scale), shear, flip, etc. By combining different transformations you can introduce additional diversity in your training data. Training set of 300 images is a very small number, so you would definitely need more than one transformation to augment so tiny training set.
This is a good approach as long as you don't implicitly change the labels when you do rotation. E.g. An image containing the digit 6 will become digit 9 on rotation of 180 deg. So, you've to pay some attention in such scenarios.
But, you could also do other geometric transformations like scaling, translation
Other augmentation that you can consider is using the pre-trained model such as ImageNet, if your problem domain has some resemblance to the ImageNet data. This will allow you to train deeper models even for your data scarce situation.
Even though rotation increases the representational complexity of your image, it might be not enough. Instead you probably need to add other types of augmentation as well.
Color augmentations are useful if they still represent the real distribution of your data.
Spatial augmentations work very good. Keep in mind that most modern systems use a lot of cropping, so that might help.
Actually I have a few scripts that I am trying to turn into a library that might work for you. Check them https://github.com/lozuwa/impy if you would like to.

find mosquitos' head in the image

I have images of mosquitos similar to these ones and I would like to automatically circle around the head of each mosquito in the images. They are obviously in different orientations and there are random number of them in different images. some error is fine. Any ideas of algorithms to do this?
This problem resembles a face detection problem, so you could try a naïve approach first and refine it if necessary.
First you would need to recreate your training set. For this you would like to extract small images with examples of what is a mosquito head or what is not.
Then you can use those images to train a classification algorithm, be careful to have a balanced training set, since if your data is skewed to one class it would hit the performance of the algorithm. Since images are 2D and algorithms usually just take 1D arrays as input, you will need to arrange your images to that format as well (for instance: http://en.wikipedia.org/wiki/Row-major_order).
I normally use support vector machines, but other algorithms such as logistic regression could make the trick too. If you decide to use support vector machines I strongly recommend you to check libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), since it's a very mature library with bindings to several programming languages. Also they have a very easy to follow guide targeted to beginners (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf).
If you have enough data, you should be able to avoid tolerance to orientation. If you don't have enough data, then you could create more training rows with some samples rotated, so you would have a more representative training set.
As for the prediction what you could do is given an image, cut it using a grid where each cell has the same dimension that the ones you used on your training set. Then you pass each of this image to the classifier and mark those squares where the classifier gave you a positive output. If you really need circles then take the center of the given square and the radius would be the half of the square side size (sorry for stating the obvious).
So after you do this you might have problems with sizes (some mosquitos might appear closer to the camera than others) , since we are not trained the algorithm to be tolerant to scale. Moreover, even with all mosquitos in the same scale, we still might miss some of them just because they didn't fit in our grid perfectly. To address this, we will need to repeat this procedure (grid cut and predict) rescaling the given image to different sizes. How many sizes? well here you would have to determine that through experimentation.
This approach is sensitive to the size of the "window" that you are using, that is also something I would recommend you to experiment with.
There are some research may be useful:
A Multistep Approach for Shape Similarity Search in Image Databases
Representation and Detection of Shapes in Images
From the pictures you provided this seems to be an extremely hard image recognition problem, and I doubt you will get anywhere near acceptable recognition rates.
I would recommend a simpler approach:
First, if you have any control over the images, separate the mosquitoes before taking the picture, and use a white unmarked underground, perhaps even something illuminated from below. This will make separating the mosquitoes much easier.
Then threshold the image. For example here i did a quick try taking the red channel, then substracting the blue channel*5, then applying a threshold of 80:
Use morphological dilation and erosion to get rid of the small leg structures.
Identify blobs of the right size to be moquitoes by Connected Component Labeling. If a blob is large enough to be two mosquitoes, cut it out, and apply some more dilation/erosion to it.
Once you have a single blob like this
you can find the direction of the body using Principal Component Analysis. The head should be the part of the body where the cross-section is the thickest.

Using flipped images for machine learning dataset

I'v got a binary classification problem. I'm trying to train a neural network to recognize objects from images. Currently I've about 1500 50x50 images.
The question is whether extending my current training set by the same images flipped horizontally is a good idea or not? (images are not symetric)
Thanks
I think you can do this to a much larger extent, not just flipping the images horizontally, but changing the angle of the image by 1 degree. This will result in 360 samples for every instance that you have in your training set. Depending on how fast your algorithm is, this may be a pretty good way to ensure that the algorithm isn't only trained to recognize images and their mirrors.
It's possible that it's a good idea, but then again, I don't know what's the goal or the domain of the image recognition. Let's say the images contain characters and you're asking the image recognition software to determine if an image contains a forward slash / or a back slash \ then flipping the image will make your training data useless. If your domain doesn't suffer from such issues, then I'd think it's a good idea to flip them and even rotate with varying degrees.
I have used flipped images in AdaBoost with great success in the course: http://www.csc.kth.se/utbildning/kth/kurser/DD2427/bik12/Schedule.php
from the zip "TrainingImages.tar.gz".
I know there are some information on pros/cons with using flipped images somewhere in the slides (at the homepage) but I can't find it. Also a great resource is http://www.csc.kth.se/utbildning/kth/kurser/DD2427/bik12/DownloadMaterial/FaceLab/Manual.pdf (together with the slides) going thru things like finding things in different scales and orientation.
If the images patches are not symmetric I don't think its a good idea to flip. Better idea is to do some similarity transforms to the training set with some limits. Another way to increase the dataset is to add gaussian smoothed templates to it. Make sure that the number of positive and negative samples are proportional. Too many positive and too less negative might skew the classifier and give bad performance on testing set.
It depends on what your NN is based on. If you are extracting rotation invariant features or features that do not depend on the spatial position within the the image (like histograms or whatever) and train your NN with these features, then rotating will not be a good idea.
If you are training directly on pixel values, then it might be a good idea.
Some more details might be useful.

Resources