Recognize objects while falling - viewpoint variation - machine-learning

I have a problem statement: recognize 10 classes of variations (in color and size) of the same object (a bottle cap) while it is falling, taking into account that the camera sees the object from different viewpoints. I have split this into sub-tasks:
1) Trained a deep learning model to classify only the flat surface of the object; this attempt was successful.
Flat Faces of sample 2 class
2) Instead of taking the fall into account, trained a model on the possible perspective changes; this was not successful.
Perception changes of sample 2 class
What approaches are there for recognizing the object even under perspective changes? I am not constrained to a single-camera solution, and I am open to any ideas for approaching this problem of variable perspectives.
Any help would be really appreciated. Thanks in advance!

The answer I want to give you is: CapsNets
You should definitely check out the paper, where you will be introduced to some shortcomings of CNNs and how the authors tried to fix them.
That said, I find it hard to believe that your architecture cannot solve the problem successfully when the perspective changes. Is your dataset extremely small? I'd expect the neural network to learn filters for the ribbed edges, which can be seen from all perspectives.
If you're not limited to one camera, you could train a "normal" classifier, feed it multiple images in production, and average the predictions. Or you could build an architecture that takes in multiple perspectives at once. You will have to try for yourself what works best.
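As a rough sketch of the averaging idea, assuming the classifier outputs one probability vector per camera view (the function names and the numbers below are made up for illustration):

```python
def average_predictions(per_view_probs):
    """Average the class-probability vectors predicted for several views of one object."""
    n_views = len(per_view_probs)
    n_classes = len(per_view_probs[0])
    return [sum(view[c] for view in per_view_probs) / n_views
            for c in range(n_classes)]


def predict_class(per_view_probs):
    """Return the index of the class with the highest averaged probability."""
    mean_probs = average_predictions(per_view_probs)
    return max(range(len(mean_probs)), key=mean_probs.__getitem__)


# Hypothetical softmax outputs of the same classifier for three camera views.
views = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.7, 0.2, 0.1],
]
print(predict_class(views))
```

Here the second view alone would vote for class 1, but the average over all three views still favors class 0.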
Also, never underestimate the power of old school image preprocessing. If you have 3 different perspectives, you could take the one that comes closest to the "flat" perspective. This is probably as easy as using the image with the largest colored area, where img.sum() is the highest.
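A minimal sketch of that selection step, assuming grayscale images stored as nested lists on a dark background (so a larger sum means more visible cap area). With NumPy arrays, `image_sum` is just `img.sum()`. The helper names and the toy images are illustrative only:

```python
def image_sum(img):
    """Sum of all pixel values; the plain-Python equivalent of img.sum()."""
    return sum(sum(row) for row in img)


def pick_flattest_view(images):
    """Return the index of the image with the largest total intensity."""
    return max(range(len(images)), key=lambda i: image_sum(images[i]))


# Toy 3x3 views of the same cap: edge-on, tilted, and flat.
edge_on = [[0, 0, 0], [9, 9, 9], [0, 0, 0]]
tilted = [[0, 9, 0], [9, 9, 9], [0, 9, 0]]
flat = [[9, 9, 9], [9, 9, 9], [9, 9, 9]]
print(pick_flattest_view([edge_on, tilted, flat]))
```

This only works if the background is darker than the cap; with a bright background you would have to segment the cap first and count its pixels instead.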
Another idea is to figure out the color through explicit programming, which should be fairly easy and then feed the network a grayscale image. Maybe your network is confused by the strong correlation of the color and ignores the shape altogether.

Related

What is the best way to be sure that the detected object is a vehicle?

I'm working on a project in which I need to detect moving vehicles. I'm using a background subtraction method to extract them and it works pretty well. After applying some morphological transforms, I am able to detect moving vehicles (you can see the output in the rightmost image).
How can I be sure that the moving object is a vehicle and not a non-vehicle moving object?
Whether it involves training, pattern recognition, or something else, I'm looking for the best solution with the lowest possible computational cost for a real-time system.
detected moving vehicles
I think it depends on how much effort you are able to give. I believe that a small Convolutional Neural Network could do this job amazingly. One simpler way is to take several examples of cars and use histograms or SIFT to define features most common to cars in your samples.

Can Haar Cascade be too accurate to be useful in this situation?

I'm making a program to detect shapes from an r/c plane for a competition. I have no real images of the targets, but I do have computer-generated examples of them in the rules.
My question is, can I train my program to detect real world objects based on computer generated shapes or should I find a different method to complete this task?
I would like to know before I foolishly generate 5k samples and find them useless in the end.
EDIT: I also don't know the exact color of the objects. If I feed the program samples of varying color, will it be a problem?
Thanks in advance!!
Edit2: Here's what groups from my school detected in previous years
As you can see, the detected images are not nearly as flawless as what would appear in real life. If you can suggest a better method, that would help.
If you think that the real images will have unique colors with simple geometric shapes, then you could try creating a normalized hue histogram and using it to train an SVM classifier. The benefit of a hue histogram is that it is rotation- and scale-invariant.
Keep a few precautions in mind:
Don't forget to remove illumination effects.
White and black pixels sometimes cause problems in the hue-histogram calculation, so exclude them by considering only those pixels that have S>0 and V>0 in the S and V channels of the HSV image.
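To make the hue-histogram idea concrete, here is a small sketch using only the standard library. It assumes pixels are given as (r, g, b) tuples in the 0-255 range; the function name and the bin count are my own choices, and in practice you would probably do this with OpenCV on the whole image:

```python
import colorsys


def hue_histogram(pixels, bins=16):
    """Normalized hue histogram over the pixels that have S > 0 and V > 0."""
    hist = [0.0] * bins
    for r, g, b in pixels:
        h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        if s > 0 and v > 0:  # skip white, gray, and black pixels
            hist[min(int(h * bins), bins - 1)] += 1
    total = sum(hist)
    return [count / total for count in hist] if total else hist


# Two red pixels, one blue pixel, plus white and black pixels that are ignored.
sample = [(255, 0, 0), (200, 30, 30), (0, 0, 255), (255, 255, 255), (0, 0, 0)]
hist = hue_histogram(sample)
print(hist[0], hist[10])
```

The resulting vector (one per image) is what you would feed to the SVM. Raising the S and V cut-offs above zero may also help with near-white and near-black pixels, but that is something to tune on your data.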
I would suggest using real-world images instead, because performance depends largely on the training data (in my personal experience). You could also try using SIFT/SURF descriptors to train an SVM (support vector machine), since SIFT/SURF are both scale- and rotation-invariant.

Detecting "city" background versus "desert" background in images using image processing/computer vision

I'm searching for algorithms/methods that are used to classify or differentiate between two outdoor environments. Given an image with vehicles, I need to be able to detect whether the vehicles are in a natural desert landscape, or whether they're in the city.
I've searched but can't seem to find relevant work on this. Perhaps because I'm new at computer vision, I'm using the wrong search terms.
Any ideas? Is there any work (or related) available in this direction?
I'd suggest reading Prince's Computer Vision: Models, Learning, and Inference (free PDF available). It covers image classification, as well as many other areas of CV. I was fortunate enough to take the Machine Vision course at UCL which the book was designed for and it's an excellent reference.
Addressing your problem specifically, a simple MAP or MLE model on pixel colours will probably provide a reasonable benchmark. From there you could look at more involved models and feature engineering.
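A pixel-colour baseline of this kind could look roughly like the sketch below. It assumes one independent Gaussian per RGB channel for each class, which is a simplification I am choosing here, not something from the book, and the training pixels are invented:

```python
import math


def fit_gaussian(pixels):
    """Maximum-likelihood mean and variance for each RGB channel."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    # A small constant keeps the variance strictly positive.
    variances = [sum((p[c] - means[c]) ** 2 for p in pixels) / n + 1e-6
                 for c in range(3)]
    return means, variances


def log_likelihood(pixels, model):
    """Log-likelihood of all pixels under a per-channel Gaussian model."""
    means, variances = model
    total = 0.0
    for p in pixels:
        for c in range(3):
            total -= 0.5 * math.log(2 * math.pi * variances[c])
            total -= (p[c] - means[c]) ** 2 / (2 * variances[c])
    return total


def classify(pixels, models):
    """Pick the class whose model gives the image pixels the highest likelihood."""
    return max(models, key=lambda label: log_likelihood(pixels, models[label]))


# Invented training pixels: sandy tones for desert, gray tones for city.
models = {
    "desert": fit_gaussian([(210, 180, 140), (220, 190, 150), (200, 170, 130)]),
    "city": fit_gaussian([(120, 120, 125), (90, 95, 100), (150, 150, 155)]),
}
print(classify([(215, 185, 145), (205, 175, 138)], models))
```

This is the MLE variant; adding the log of a class prior to each score would turn it into MAP.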
Seemingly complex classifications such as "civilization" vs. "nature" can sometimes be solved simply, using a few heuristics along with classification based on color. Like Gilevi said, city scenes are sure to contain many straight lines and right angles, while desert scenes are dominated by rolling dunes and so on.
To address this directly, you could run OpenCV's Hough lines algorithm on the images (tuned for this problem, of course) and look at:
a) how many lines are fit to the image at a given threshold
b) of the lines that are fit, how the angles between pairs of them are distributed; if the angles are uniformly distributed, then chances are it's nature, but if the angles are clumped around multiples of pi/2 (more right angles and parallel lines), then it is more likely to be a cityscape.
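Check (b) can be sketched as follows, assuming you already have the line angles in radians (for example the theta values returned by a Hough transform). The function name, the 5-degree tolerance, and the sample angles are my own assumptions:

```python
import math
from itertools import combinations


def right_angle_fraction(thetas, tol=math.radians(5)):
    """Fraction of line pairs whose relative angle is within tol of a multiple of pi/2."""
    pairs = list(combinations(thetas, 2))
    if not pairs:
        return 0.0
    near = 0
    for a, b in pairs:
        diff = abs(a - b) % (math.pi / 2)  # fold the difference onto [0, pi/2)
        if min(diff, math.pi / 2 - diff) <= tol:
            near += 1
    return near / len(pairs)


# Made-up angles: mostly horizontal/vertical lines vs. scattered directions.
city_thetas = [0.0, 0.01, math.pi / 2, math.pi / 2 + 0.02, 0.0]
desert_thetas = [0.1, 0.5, 0.9, 1.3, 2.0, 2.6]
print(right_angle_fraction(city_thetas), right_angle_fraction(desert_thetas))
```

For uniformly distributed angles this fraction should sit near 2 * tol / (pi/2), about 0.11 with the default tolerance, so values far above that hint at a cityscape.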
Color components, textures, and degree of smoothness (variation or gradient of the image) may differentiate the desert and city backgrounds. You may also try the Hough transform, which is used for line detection; lines can be viewed as a city feature (buildings, roads, bridges, cars, etc.).
I would recommend this research, which is very similar to your project. The article presents a comparison of different classification techniques for building a scene classifier (urban, highway, and rural) based on images.
See my answer here: How to match texture similarity in images?
You can use the same method. I have solved problems like the one you describe with this method in the past.
The problem you are describing is that of scene categorization. Search for works that use the SUN database.
However, you are only working with two relatively different categories, so I don't think you need to kill yourself implementing state-of-the-art algorithms. I think taking GIST features + color features and training a non-linear SVM would do the trick.
Urban environments are usually characterized by a lot of horizontal and vertical lines, and GIST captures that information.

find mosquitos' head in the image

I have images of mosquitos similar to these ones, and I would like to automatically draw a circle around the head of each mosquito in the images. They are obviously in different orientations, and there is a random number of them in each image. Some error is fine. Any ideas for algorithms to do this?
This problem resembles a face detection problem, so you could try a naïve approach first and refine it if necessary.
First you would need to create your training set. For this you would extract small images as examples of what is and what is not a mosquito head.
Then you can use those images to train a classification algorithm. Be careful to have a balanced training set, since data skewed toward one class will hurt the performance of the algorithm. Since images are 2D and algorithms usually just take 1D arrays as input, you will need to arrange your images in that format as well (for instance: http://en.wikipedia.org/wiki/Row-major_order).
I normally use support vector machines, but other algorithms such as logistic regression could do the trick too. If you decide to use support vector machines, I strongly recommend checking out libsvm (http://www.csie.ntu.edu.tw/~cjlin/libsvm/), since it's a very mature library with bindings to several programming languages. It also has a very easy-to-follow guide targeted at beginners (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf).
If you have enough data, the classifier should become tolerant to orientation on its own. If you don't have enough data, you could create more training rows from rotated copies of some samples, so you would have a more representative training set.
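The flattening and rotation steps can be sketched like this, assuming each training patch is a small square grayscale image stored as a nested list. The helper names are hypothetical, and only 90-degree rotations are shown because they need no interpolation:

```python
def flatten_row_major(image):
    """Turn a 2D image into a 1D feature vector, row after row."""
    return [pixel for row in image for pixel in row]


def rotate90(image):
    """Rotate a 2D image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def augment(image):
    """Return the flattened image in all four 90-degree orientations."""
    rows = []
    for _ in range(4):
        rows.append(flatten_row_major(image))
        image = rotate90(image)
    return rows


patch = [[1, 2],
         [3, 4]]
for row in augment(patch):
    print(row)
```

Each returned row gets the same label as the original patch. Finer rotation angles are possible too, but they require resampling the patch with an image library.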
As for prediction: given an image, cut it using a grid where each cell has the same dimensions as the images in your training set. Then pass each of these cells to the classifier and mark the squares where the classifier gives a positive output. If you really need circles, take the center of the given square as the center and half of the square's side length as the radius (sorry for stating the obvious).
After you do this you might have problems with sizes (some mosquitos might appear closer to the camera than others), since we have not trained the algorithm to be tolerant to scale. Moreover, even with all mosquitos at the same scale, we might still miss some of them just because they don't fit our grid perfectly. To address this, we need to repeat the procedure (grid cut and predict) after rescaling the given image to different sizes. How many sizes? That is something you would have to determine through experimentation.
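Here is a sketch of the grid-and-rescale loop described above. The classifier is passed in as `is_head`, a hypothetical callback standing in for "rescale the image, crop this window, and ask the trained model"; the actual resizing and cropping are left out:

```python
def grid_windows(width, height, window):
    """Yield the top-left corner (x, y) of every full grid cell of side `window`."""
    for y in range(0, height - window + 1, window):
        for x in range(0, width - window + 1, window):
            yield x, y


def detect_circles(width, height, window, scales, is_head):
    """Run the grid search at several scales; return circles in original coordinates.

    is_head(scale, x, y, window) is a stand-in for the trained classifier applied
    to the window at (x, y) in the image rescaled by `scale`.
    """
    circles = []
    for scale in scales:
        scaled_w, scaled_h = int(width * scale), int(height * scale)
        for x, y in grid_windows(scaled_w, scaled_h, window):
            if is_head(scale, x, y, window):
                # Map the cell center and half-side back to the original image.
                cx = (x + window / 2) / scale
                cy = (y + window / 2) / scale
                circles.append((cx, cy, window / 2 / scale))
    return circles


# Fake classifier for demonstration: fires once, at half scale, in the first cell.
def fake_is_head(scale, x, y, window):
    return scale == 0.5 and (x, y) == (0, 0)


print(detect_circles(64, 64, 16, [1.0, 0.5], fake_is_head))
```

A detection in the half-size image maps back to a circle twice as large in the original, which is how the larger (closer) mosquitos get picked up.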
This approach is sensitive to the size of the "window" that you are using, so that is also something I would recommend experimenting with.
Some research that may be useful:
A Multistep Approach for Shape Similarity Search in Image Databases
Representation and Detection of Shapes in Images
From the pictures you provided this seems to be an extremely hard image recognition problem, and I doubt you will get anywhere near acceptable recognition rates.
I would recommend a simpler approach:
First, if you have any control over the images, separate the mosquitoes before taking the picture, and use a plain white background, perhaps even something illuminated from below. This will make separating the mosquitoes much easier.
Then threshold the image. For example, in a quick try I took the red channel, subtracted the blue channel * 5, and applied a threshold of 80.
Use morphological dilation and erosion to get rid of the small leg structures.
Identify blobs of the right size to be mosquitoes by Connected Component Labeling. If a blob is large enough to be two mosquitoes, cut it out, and apply some more dilation/erosion to it.
Once you have a single blob, you can find the direction of the body using Principal Component Analysis. The head should be the part of the body where the cross-section is the thickest.
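The thresholding and PCA steps above could be sketched as follows. The image is assumed to be rows of (r, g, b) tuples, and which side of the threshold counts as "mosquito" is my assumption, since the original result image is not available. The morphology and blob labeling steps are left out:

```python
import math


def threshold_mask(pixels, factor=5, level=80):
    """Mark pixels where red - factor * blue exceeds level."""
    return [[(r - factor * b) > level for r, g, b in row] for row in pixels]


def principal_axis_angle(mask):
    """Angle (radians, image coordinates) of the blob's main axis via 2x2 PCA."""
    points = [(x, y) for y, row in enumerate(mask)
              for x, on in enumerate(row) if on]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / n
    syy = sum((y - mean_y) ** 2 for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Orientation of the leading eigenvector of the 2x2 covariance matrix.
    return 0.5 * math.atan2(2 * sxy, sxx - syy)


# Toy image: white background with a horizontal brownish bar as the "body".
white, body = (255, 255, 255), (150, 60, 10)
image = [[white] * 5, [white, body, body, body, white], [white] * 5]
mask = threshold_mask(image)
print(principal_axis_angle(mask))
```

Once you know the axis, you can project the blob pixels onto it and measure the blob width along the axis to look for the thickest cross-section.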

What is the best method for object detection in low-resolution moving video?

I'm looking for the fastest and most efficient method of detecting an object in a moving video. Things to note about this video: it is very grainy and low resolution, and both the background and foreground are moving simultaneously.
Note: I'm trying to detect a moving truck on a road in a moving video.
Methods I've tried:
Training a Haar Cascade - I attempted to train the classifier to identify the object by cropping multiple images of the desired object. This produced either many false detections or no detections at all (the desired object was never detected). I used about 100 positive images and 4000 negatives.
SIFT and SURF Keypoints - When attempting to use either of these methods, which are based on features, I discovered that the object I wanted to detect was too low in resolution, so there were not enough features to match to make an accurate detection. (The desired object was never detected.)
Template Matching - This is probably the best method I've tried. It's the most accurate, although the most hacky of them all. I can detect the object for one specific video using a template cropped from the video. However, there is no guarantee of accuracy, because all that is known is the best match for each frame; no analysis is done of how well the template actually matches the frame. Basically, it only works if the object is always in the video; otherwise it will produce a false detection.
So those are the three main methods I've tried, and all have failed. What would work best is something like template matching but with scale and rotation invariance (which led me to try SIFT/SURF), but I have no idea how to modify the template matching function.
Does anyone have any suggestions how to best accomplish this task?
Apply optical flow to the image and then segment it based on flow field. Background flow is very different from "object" flow (which mainly diverges or converges depending on whether it is moving towards or away from you, with some lateral component also).
Here's an oldish project which worked this way:
http://users.fmrib.ox.ac.uk/~steve/asset/index.html
This vehicle detection paper uses a Gabor filter bank for low-level detection and then uses the responses to create the feature space in which it trains an SVM classifier.
The technique seems to work well and is at least scale invariant. I am not sure about rotation though.
Not knowing your application, my initial impression is normalized cross-correlation, especially since I remember seeing a purely optical cross-correlator that had vehicle-tracking as the example application. (Tracking a vehicle as it passes using only optical components and an image of the side of the vehicle - I wish I could find the link.) This is similar (if not identical) to "template matching", which you say kind of works, but this won't work if the images are rotated, as you know.
However, there's a related method based on log-polar coordinates that will work regardless of rotation, scale, shear, and translation.
I imagine this would also enable tracking that the object has left the scene of the video, too, since the maximum correlation will decrease.
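To illustrate the point about the maximum correlation, here is a plain-Python sketch of zero-mean normalized cross-correlation with a score cut-off, so that a frame without the object returns no match instead of a false one. It is not the log-polar variant, the 0.8 threshold is an arbitrary assumption, and a real implementation would use an optimized library routine rather than nested loops:

```python
import math


def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equally sized 2D lists."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a)
                    * sum((y - mean_b) ** 2 for y in b))
    return num / den if den else 0.0


def best_match(frame, template, min_score=0.8):
    """Return (score, x, y) of the best match, or None if it is below min_score."""
    th, tw = len(template), len(template[0])
    best = None
    for y in range(len(frame) - th + 1):
        for x in range(len(frame[0]) - tw + 1):
            patch = [row[x:x + tw] for row in frame[y:y + th]]
            score = ncc(patch, template)
            if best is None or score > best[0]:
                best = (score, x, y)
    return best if best and best[0] >= min_score else None


template = [[1, 9],
            [9, 1]]
frame = [[0, 0, 0, 0, 0],
         [0, 0, 1, 9, 0],
         [0, 0, 9, 1, 0],
         [0, 0, 0, 0, 0]]
empty = [[0] * 5 for _ in range(4)]
print(best_match(frame, template), best_match(empty, template))
```

Because the score is normalized to the range [-1, 1], one threshold can be reused across frames with different brightness, which is what makes the "object has left the scene" test possible.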
How low resolution are we talking? Could you also elaborate on the object? Is it a specific color? Does it have a pattern? The answers affect what you should be using.
Also, I might be reading your template matching statement wrong, but it sounds like you are overtraining it (by testing on the same video you extracted the object from?).
A Haar Cascade is going to require significant training data on your part, and will be poor for any adjustments in orientation.
Your best bet might be to combine template matching with an algorithm similar to CamShift in OpenCV (5.7 MB PDF), along with a probabilistic model (you'll have to figure this one out) of whether the truck is still in the image.
