I have a question regarding data augmentation for training a deep neural network for object detection.
I have a quite limited data set (nearly 300 images). I augmented the data by rotating each image from 0 to 360 degrees with a step size of 15 degrees, so I got 24 rotated images out of each one, around 7,200 images in total. Then I drew a bounding box around the object of interest in each augmented image.
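For illustration, a rotation step like the one described could be sketched as follows (a minimal sketch only; OpenCV and the directory names are assumptions, not my actual code):

import os
import cv2

def rotate(image, angle_deg):
    """Rotate an image about its centre without changing its size."""
    h, w = image.shape[:2]
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
    return cv2.warpAffine(image, matrix, (w, h))

src_dir, dst_dir = "images", "augmented"   # placeholder directories
os.makedirs(dst_dir, exist_ok=True)

for name in os.listdir(src_dir):
    img = cv2.imread(os.path.join(src_dir, name))
    if img is None:
        continue
    for angle in range(0, 360, 15):        # 24 rotations per source image
        out = rotate(img, angle)
        cv2.imwrite(os.path.join(dst_dir, f"{angle:03d}_{name}"), out)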
Does it seem to be a reasonable approach to enhance the data?
Best Regards
In order to train a good model you need lots of representative data. Your augmentation is representative only of rotations, so yes, it is a good method if you are concerned about not having enough object rotations. However, it will not help in any sense with generalization to other objects/transformations.
It seems like you are on the right track; rotation is usually a very useful transformation for augmenting the training data. I would suggest trying other transformations like shift (you most probably want to detect partially visible objects), zoom (makes your model invariant to scale), shear, flip, etc. By combining different transformations you can introduce additional diversity into your training data. A training set of 300 images is very small, so you will definitely need more than one transformation to augment such a tiny training set.
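For example, a few of these transformations can be combined in plain NumPy/OpenCV (a rough sketch; the 10% shift and the 0.9-1.1 zoom range are arbitrary example values, and for detection the bounding boxes would have to be transformed with the same parameters):

import cv2
import numpy as np

def augment(image, rng=np.random):
    """Apply a random flip, shift, and mild zoom to one image (example values only)."""
    h, w = image.shape[:2]
    if rng.rand() < 0.5:
        image = cv2.flip(image, 1)                 # horizontal flip
    dx = rng.randint(-w // 10, w // 10 + 1)        # shift by up to ~10% of the size
    dy = rng.randint(-h // 10, h // 10 + 1)
    shift = np.float32([[1, 0, dx], [0, 1, dy]])
    image = cv2.warpAffine(image, shift, (w, h))
    scale = rng.uniform(0.9, 1.1)                  # mild zoom in/out
    zoom = cv2.getRotationMatrix2D((w / 2, h / 2), 0, scale)
    return cv2.warpAffine(image, zoom, (w, h))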
This is a good approach as long as you don't implicitly change the labels when you rotate. For example, an image containing the digit 6 will become the digit 9 after a rotation of 180 degrees, so you have to pay some attention in such scenarios.
You could also do other geometric transformations like scaling and translation.
Another option you can consider is using a model pre-trained on ImageNet, if your problem domain has some resemblance to the ImageNet data. This will allow you to train deeper models even in your data-scarce situation.
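For illustration, one common way to reuse ImageNet weights looks roughly like this (a sketch using Keras's built-in VGG16; the input size and the small classification head are assumptions, and a detection model would instead use such a base as its backbone):

from keras.applications import VGG16
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

# Convolutional base pre-trained on ImageNet; its original classification head is discarded.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                       # freeze the pre-trained features at first

x = GlobalAveragePooling2D()(base.output)
x = Dense(256, activation="relu")(x)         # small head, sized for a small dataset
outputs = Dense(1, activation="sigmoid")(x)  # e.g. object present / not present

model = Model(base.input, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])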
Even though rotation increases the representational diversity of your images, it might not be enough; you probably need to add other types of augmentation as well.
Color augmentations are useful if they still represent the real distribution of your data.
Spatial augmentations work very well. Keep in mind that most modern systems use a lot of cropping, so that might help.
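For example, a random crop is just array slicing (a minimal NumPy sketch; the crop size below is an arbitrary example):

import numpy as np

def random_crop(image, crop_h, crop_w, rng=np.random):
    """Return a random crop of size (crop_h, crop_w) from an HxWxC image."""
    h, w = image.shape[:2]
    top = rng.randint(0, h - crop_h + 1)
    left = rng.randint(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

# Example usage: crop = random_crop(img, 224, 224)  # sizes are illustrative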
Actually, I have a few scripts that I am trying to turn into a library that might work for you. Check them out at https://github.com/lozuwa/impy if you would like to.
I understand the power of data augmentation and the different ways of doing it, like rotating, flipping, normalization, etc.
Is shifting the object around the image really needed? Will there be a difference in the results of the convolution?
If you are sure that in the test data the objects are always centered, then shifting might not be needed. But in the real world, that is not the case.
For example, you can't expect a cat to always stay at the center of the image. In test data it might appear at any position. Your model will learn better if you cover such cases in your training data.
Note: the image referenced above is just for easy understanding; its data was augmented using rotation as well, not just shift, but it serves the purpose. (Image Source)
As far as the difference in results is concerned, we can't be sure how significant the change in performance will be until you try it out. But people generally find that shift helps improve performance, usually used along with flip, rotate, scale, etc.
Centered objects do not require shifting, but in real-world test data you might have objects that are not centered, so in that case it becomes pretty important. In Keras, for example, shifts can be combined with other augmentations using ImageDataGenerator:
from keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator(
    rotation_range=10,          # random rotations up to +/-10 degrees
    width_shift_range=0.1,      # horizontal shift, as a fraction of image width
    height_shift_range=0.1,     # vertical shift, as a fraction of image height
    shear_range=0.15,
    zoom_range=0.1,
    channel_shift_range=10.,    # random shifts of colour channel values
    horizontal_flip=True)
Yes, shifting the object around is quite important.
If the object that you are trying to detect/classify is most of the time around the center of the image, then your model will probably adjust its weights so that it focuses on searching the center of the image.
You can force your model to search all regions of an image by shifting the target object around the image. You can also improve the model's training by changing the object's scale (e.g. by zooming in on the image).
This repository is a quite good introduction to commonly used data augmentations in object detection:
https://github.com/kochlisGit/random-data-augmentations
I am trying to classify a series of images like this one, with each class comprising images taken from similar cellular structures:
I've built a simple network in Keras to do this, structured as:
1000 - 10
Unaltered, the network achieves very high (>90%) accuracy on MNIST classification, but almost never higher than 5% on these types of images. Is this because they are too complex? My next approach would be to try stacked deep autoencoders.
Seriously - I don't expect any nonconvolutional model to work well on this type of data.
A non-convolutional net works well for MNIST because the data is well preprocessed (the digits are centered and resized to a fixed size). Your images are not.
You may notice in your pictures that certain motifs recur - like the darker dots - at different positions and sizes. If you don't use a convolutional model you will not capture that efficiently (e.g. you will have to recognize a dark dot moved a little bit in the image as a completely different object).
Because of this I think you should try a convolutional MNIST model instead of the classic one, or simply try to design your own.
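For comparison, a minimal convolutional baseline in Keras might look like this (a sketch only; the input shape and the number of classes are placeholders for your data):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

num_classes = 10                      # placeholder - match your own labels
model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(64, 64, 1)),  # placeholder input shape
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation="relu"),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])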
First question: if you run the training longer, do you get better accuracy? You may not have trained long enough.
Also, what is the accuracy on training data and what is the accuracy on testing data? If they are both low, you can train longer or use a more complex model. If training accuracy is much better than testing accuracy, you are essentially at the limits of your data (i.e. brute-force scaling of the model size won't help, but clever improvements might, e.g. convolutional nets).
Finally, with complex and noisy data you may need a lot of examples to make a reasonable classification, so you may need many, many images.
Deep stacked autoencoders are, as I understand it, an unsupervised method, which isn't directly suitable for classification.
I'm making a program to detect shapes from an R/C plane for a competition. I have no real images of the targets, but I do have computer-generated examples of them in the rules.
My question is: can I train my program to detect real-world objects based on computer-generated shapes, or should I find a different method to complete this task?
I would like to know before I foolishly generate 5k samples and find them useless in the end.
EDIT: I also don't know the exact color of the objects. If I feed the program samples of varying color, will it be a problem?
Thanks in advance!!
Edit2: Here's what groups from my school detected in previous years
As you can see, the detected images are not nearly as flawless as what would appear in real life. If you can suggest a better method, that would help.
If you think that the real images will have unique colors with simple geometric shapes, then you could probably try to create a normalized hue histogram and use it to train an SVM classifier (a rough sketch is given after the list below). The benefit of using a hue histogram is that it is rotation and scale invariant.
Keep a few precautions in mind:
Don't forget to remove illumination effects.
Sometimes white and black pixels create problems in the hue-histogram calculation, so try to exclude them by considering only those pixels which have S > 0 and V > 0 in the S and V channels of the HSV image.
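A rough sketch of that idea, assuming OpenCV and scikit-learn (the 32-bin choice and the variable names are arbitrary):

import cv2
import numpy as np
from sklearn.svm import SVC

def hue_histogram(bgr_image, bins=32):
    """Normalized hue histogram, ignoring pixels with S == 0 or V == 0."""
    hsv = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    mask = (s > 0) & (v > 0)                                    # drop near-white/black pixels
    hist, _ = np.histogram(h[mask], bins=bins, range=(0, 180))  # OpenCV hue runs 0..179
    return hist / max(hist.sum(), 1)                            # normalize for scale invariance

# features = np.array([hue_histogram(img) for img in training_images])  # placeholder list
# clf = SVC(kernel="rbf").fit(features, labels)                         # labels: shape class ids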
I would rather suggest that you use real-world images, because performance is largely dependent on the training data (my personal experience). Also, why don't you try using SIFT/SURF descriptors to train an SVM (support vector machine)? SIFT/SURF are scale as well as rotation invariant.
I have images of mosquitoes similar to these ones, and I would like to automatically circle the head of each mosquito in the images. They are obviously in different orientations, and there is a varying number of them in different images. Some error is fine. Any ideas of algorithms to do this?
This problem resembles a face detection problem, so you could try a naïve approach first and refine it if necessary.
First you would need to create your training set. For this you would extract small images with examples of what is a mosquito head and what is not.
Then you can use those images to train a classification algorithm. Be careful to have a balanced training set, since if your data is skewed to one class it will hurt the performance of the algorithm. Also, since images are 2D and algorithms usually just take 1D arrays as input, you will need to flatten your images to that format (for instance: http://en.wikipedia.org/wiki/Row-major_order).
I normally use support vector machines, but other algorithms such as logistic regression could do the trick too. If you decide to use support vector machines, I strongly recommend you check out libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), since it's a very mature library with bindings for several programming languages. They also have a very easy-to-follow guide targeted at beginners (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf).
If you have enough data, you should not need to worry about tolerance to orientation. If you don't have enough data, you could create more training rows with some samples rotated, so you would have a more representative training set.
As for prediction, what you could do is, given an image, cut it using a grid where each cell has the same dimensions as the images you used in your training set. Then you pass each of these cells to the classifier and mark those squares where the classifier gave a positive output. If you really need circles, take the center of the given square, and the radius would be half the square's side length (sorry for stating the obvious).
After you do this you might have problems with sizes (some mosquitoes might appear closer to the camera than others), since we have not trained the algorithm to be tolerant to scale. Moreover, even with all mosquitoes at the same scale, we still might miss some of them just because they didn't fit in our grid perfectly. To address this, we will need to repeat this procedure (grid cut and predict), rescaling the given image to different sizes. How many sizes? Well, here you would have to determine that through experimentation.
This approach is sensitive to the size of the "window" that you are using; that is also something I would recommend you experiment with.
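A rough sketch of that grid-and-rescale procedure (the window size, the scales, and the classifier object are placeholders):

import cv2

def detect(image, clf, win=48, scales=(1.0, 0.75, 0.5)):
    """Slide a fixed window over the image at several scales; return hit centres and radii."""
    hits = []
    for s in scales:
        resized = cv2.resize(image, None, fx=s, fy=s)
        h, w = resized.shape[:2]
        for top in range(0, h - win + 1, win):            # non-overlapping grid, as described
            for left in range(0, w - win + 1, win):
                patch = resized[top:top + win, left:left + win]
                if clf.predict([patch.reshape(-1)])[0] == 1:   # flatten to 1D before predicting
                    cx, cy = (left + win // 2) / s, (top + win // 2) / s
                    hits.append((cx, cy, (win // 2) / s))      # centre + radius in original scale
    return hits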
Here is some research that may be useful:
A Multistep Approach for Shape Similarity Search in Image Databases
Representation and Detection of Shapes in Images
From the pictures you provided this seems to be an extremely hard image recognition problem, and I doubt you will get anywhere near acceptable recognition rates.
I would recommend a simpler approach:
First, if you have any control over the images, separate the mosquitoes before taking the picture, and use a plain white background, perhaps even something illuminated from below. This will make separating the mosquitoes much easier.
Then threshold the image. For example, here I did a quick try: taking the red channel, subtracting 5 times the blue channel, then applying a threshold of 80:
Use morphological dilation and erosion to get rid of the small leg structures.
Identify blobs of the right size to be mosquitoes by connected component labeling. If a blob is large enough to be two mosquitoes, cut it out and apply some more dilation/erosion to it.
Once you have a single blob like this
you can find the direction of the body using Principal Component Analysis. The head should be the part of the body where the cross-section is the thickest.
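Putting those steps together, a rough sketch with OpenCV/NumPy might look like this (the file name, kernel size, and area limits are placeholders; the channel arithmetic and threshold are the example values from above):

import cv2
import numpy as np

img = cv2.imread("mosquitoes.png")                         # placeholder file name
b = img[:, :, 0].astype(np.int32)                          # OpenCV loads images as BGR
r = img[:, :, 2].astype(np.int32)
diff = np.clip(r - 5 * b, 0, 255).astype(np.uint8)         # red channel minus 5x blue, as above
_, binary = cv2.threshold(diff, 80, 255, cv2.THRESH_BINARY)

kernel = np.ones((5, 5), np.uint8)
binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)  # erosion then dilation removes thin legs

num, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
for i in range(1, num):                                    # label 0 is the background
    if not (500 < stats[i, cv2.CC_STAT_AREA] < 5000):      # size limits are placeholders
        continue
    ys, xs = np.nonzero(labels == i)
    pts = np.column_stack([xs, ys]).astype(np.float64)
    cov = np.cov(pts - pts.mean(axis=0), rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    body_axis = eigvecs[:, np.argmax(eigvals)]             # principal direction of the blob
    # Walk along body_axis from the centroid and look for the thickest cross-section: the head.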
I've got a binary classification problem. I'm trying to train a neural network to recognize objects in images. Currently I have about 1500 50x50 images.
The question is whether extending my current training set with the same images flipped horizontally is a good idea or not (the images are not symmetric).
Thanks
I think you can do this to a much larger extent, not just flipping the images horizontally but also rotating the image in 1-degree steps. This will result in 360 samples for every instance in your training set. Depending on how fast your algorithm is, this may be a pretty good way to ensure that the algorithm isn't trained only on the images and their mirrors.
It's possible that it's a good idea, but then again, I don't know the goal or the domain of the image recognition. Let's say the images contain characters and you're asking the image recognition software to determine whether an image contains a forward slash / or a backslash \; then flipping the image will make your training data useless. If your domain doesn't suffer from such issues, then I'd think it's a good idea to flip them and even rotate by varying degrees.
I have used flipped images in AdaBoost with great success in the course: http://www.csc.kth.se/utbildning/kth/kurser/DD2427/bik12/Schedule.php
from the zip "TrainingImages.tar.gz".
I know there is some information on the pros/cons of using flipped images somewhere in the slides (on the homepage), but I can't find it. Another great resource is http://www.csc.kth.se/utbildning/kth/kurser/DD2427/bik12/DownloadMaterial/FaceLab/Manual.pdf (together with the slides), which goes through things like finding objects at different scales and orientations.
If the image patches are not symmetric, I don't think it's a good idea to flip. A better idea is to apply some similarity transforms to the training set, within some limits. Another way to increase the dataset is to add Gaussian-smoothed templates to it. Make sure that the numbers of positive and negative samples stay proportional; too many positive and too few negative samples might skew the classifier and give bad performance on the test set.
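A tiny sketch of the Gaussian-smoothed-templates idea, assuming OpenCV (the kernel sizes are arbitrary examples):

import cv2

def smoothed_copies(patch, kernel_sizes=(3, 5)):
    """Return Gaussian-blurred variants of a training patch."""
    return [cv2.GaussianBlur(patch, (k, k), 0) for k in kernel_sizes]

# augmented = patches + [v for p in patches for v in smoothed_copies(p)]
# Check the positive/negative ratio again after augmenting, so the classes stay balanced.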
It depends on what your NN is based on. If you are extracting rotation-invariant features, or features that do not depend on spatial position within the image (like histograms or whatever), and you train your NN on these features, then flipping or rotating will not add anything useful.
If you are training directly on pixel values, then it might be a good idea.
Some more details might be useful.