I am new to machine learning. I got a task to find the total number of vehicles from an image using machine learning concept. I am using neural network. My image of worst case is given here.
Traffic Image
I need to find the total number of cars from this image. My idea is to cut this big image into small patches of image and train the network to count the vehicles from the small patches. Each patch will be having count less than 5. Then in the processing of new image, I could make use of a sliding window to get the total count of vehicles.
I just want to know whether this idea is possible or not OR should I go for feature extraction and training neural network with those features. If possible, whether there is any conditions for the dataset and training.
What you are looking for is called object detection. A starting point can be Deep Neural Networks for Object Detection or Region-based Convolutional Networks for Accurate Object Detection and Segmentation.
A similar, but much more difficult task is instance segmentation. One of the latest papers I've seen in this area is Pixel-level Encoding and Depth Layering for Instance-level Semantic Labeling.
Instance segmentation is probably the hardest tasks in Computer Vision. When you're new to machine learning / computer vision, you might first want to do image classification. If you want to go into the direction of instance segmentation, then you should continue with semantic segmentation and then instance segmentation.
A simple sliding window approach, where you only predict "car" / "no car" will not work, because in the image the cars are not separated by any "no car".
Related
In the application I am developing, I have about 5000 product label images.(One label per product).
One functionality of my application is that user can take a picture using his camera and get a possible match(es) against the product labels registered the system.
Since initially, my system only has one sample per product, I decided to go with traditional Computer Vision techniques. I managed to implement this using Feature extraction and Descriptor matching.(using OpenCV SIFT and FLANN techniques referring this: https://github.com/kipr/opencv/blob/master/samples/cpp/matching_to_many_images.cpp)
Now I am thinking how to improve the accuracy by combining with CNN or Deep Learning techniques since when users approve matches, it gradually add more label samples for a product.
Is it possible to build a hybrid image matching system combining Computer Vision techniques and CNN/Deep Learning techniques?
Are there any similar services already available as services?
You should learn more about Distance Metrics Learning (DML). There is a lot of information on the internet, but briefly:
You must get embeddings (vector representation) for each image from your base (e.g. get feature vector from last convolutional layer of one of the modern CNN's (Inception, VGG, ResNet, DenseNet))
Then, when you get new image, you should create vector representation of the current image and find the closest vector from your base (by Euclidean distance, for example)
This topic is quite complicated, so study it carefully :)
Have a luck!
I am a beginner and have just started studying machine learning and neural networks and have just understood the very basics of this vast and interesting domain.
From my basic knowledge, I know that a model/classifier can be used to Classify an image as something. But I was curious if there is a way to detect multiple instances of the same object and count the same.
Basically I wanted to calculate the density of traffic at a red light to dynamically control the flow of traffic, so I was curious if there was a way to detect multiple cars and count the number of cars at a red light by training the ConvNet on Images of Cars (and if there is a way to implement the same using tensor-flow)
You might consider using an off the shelf object detector, e.g., the Tensorflow Object Detection API (github.com/tensorflow/models/tree/master/object_detection) to first detect cars, and then count them.
CNN is one branch of the machine learning. It can be trained to classify different cars as one class, just like many other technologies applied in machine learning.
My understanding of your question is: you want to count the number of cars at the red light and make decision of the traffic dynamically. So I would seperate your question into two part
Count the number of cars
Optimize the traffic flow
For the question 1 which you are actually interested in
I would suggest you to have a look at:
Counting the number of vehicles from an image with machine learning
I hope this can be helpful
I'm trying to implement a face recognition algorithm using Python. I want to be able to receive a directory of images, and compute pair-wise distances between them, when short distances should hopefully correspond to the images belonging to the same person. The ultimate goal is to cluster images and perform some basic face identification tasks (unsupervised learning).
Because of the unsupervised setting, my approach to the problem is to calculate a "face signature" (a vector in R^d for some int d) and then figure out a metric in which two faces belonging to the same person will indeed have a short distance between them.
I have a face detection algorithm which detects the face, crops the image and performs some basic pre-processing, so the images i'm feeding to the algorithm are gray and equalized (see below).
For the "face signature" part, I've tried two approaches which I read about in several publications:
Taking the histogram of the LBP (Local Binary Pattern) of the entire (processed) image
Calculating SIFT descriptors at 7 facial landmark points (right of mouth, left of mouth, etc.), which I identify per image using an external application. The signature is the concatenation of the square root of the descriptors (this results in a much higher dimension, but for now performance is not a problem).
For the comparison of two signatures, I'm using OpenCV's compareHist function (see here), trying out several different distance metrics (Chi Square, Euclidean, etc).
I know that face recognition is a hard task, let alone without any training, so I'm not expecting great results. But all I'm getting so far seems completely random. For example, when calculating distances from the image on the far right against the rest of the image, I'm getting she is most similar to 4 Bill Clintons (...!).
I have read in this great presentation that it's popular to carry out a "metric learning" procedure on a test set, which should significantly improve results. However it does say in the presentation and elsewhere that "regular" distance measures should also get OK results, so before I try this out I want to understand why what I'm doing gets me nothing.
In conclusion, my questions, which I'd love to get any sort of help on:
One improvement I though of would be to perform LBP only on the actual face, and not the corners and everything that might insert noise to the signature. How can I mask out the parts which are not the face before calculating LBP? I'm using OpenCV for this part too.
I'm fairly new to computer vision; How would I go about "debugging" my algorithm to figure out where things go wrong? Is this possible?
In the unsupervised setting, is there any other approach (which is not local descriptors + computing distances) that could work, for the task of clustering faces?
Is there anything else in the OpenCV module that maybe I haven't thought of that might be helpful? It seems like all the algorithms there require training and are not useful in my case - the algorithm needs to work on images which are completely new.
Thanks in advance.
What you are looking for is unsupervised feature extraction - take a bunch of unlabeled images and find the most important features describing these images.
The state-of-the-art methods for unsupervised feature extraction are all based on (convolutional) neural networks. Have look at autoencoders (http://ufldl.stanford.edu/wiki/index.php/Autoencoders_and_Sparsity) or Restricted Bolzmann Machines (RBMs).
You could also take an existing face detector such as DeepFace (https://www.cs.toronto.edu/~ranzato/publications/taigman_cvpr14.pdf), take only feature layers and use distance between these to group similar faces together.
I'm afraid that OpenCV is not well suited for this task, you might want to check Caffe, Theano, TensorFlow or Keras.
I am right now in a serious problem, I need to compare images of flowers (carnations) using a genetic algorithm, the program must determine which variety does the flower belongs to (until now I am using 15 different varieties), the thing is I am having difficulties constructing the chromosome, right know I am only analysing the HSV of each image, then a take every channel and calculate the mean for each (n=255), after that I calculate the correlation between HS, HV and SV, I expected that the mean would be enough to locate any new flower next to the clusters of flowers of the variety it belongs (by the way, I have a database of all the flowers used for training purpose) by calculating the distance between the mean of the flower and the centroids of each cluster, and probably using the correlations for adjustment, but that distance is usually way smaller to a different variety than the one it must be. Is there a way to classify this flowers using ONLY colours (I've read of applications that uses the texture, but that's way out of my league), especially using a genetic algorithm (I know Neural Networks are more appropriate to this kind of analysis but that's what the teacher asked)?. Thank you very much. By the way I am working on OpenCV, don’t know if it's relevant. PS: Excuse my English if any mistakes were done, not my native language.
I know that most common object detection involves Haar cascades and that there are many techniques for feature detection such as SIFT, SURF, STAR, ORB, etc... but if my end goal is to recognizes objects doesn't both ways end up giving me the same result? I understand using feature techniques on simple shapes and patterns but for complex objects these feature algorithms seem to work as well.
I don't need to know the difference in how they function but whether or not having one of them is enough to exclude the other. If I use Haar cascading, do I need to bother with SIFT? Why bother?
thanks
EDIT: for my purposes I want to implement object recognition on a broad class of things. Meaning that any cups that are similarly shaped as cups will be picked up as part of class cups. But I also want to specify instances, meaning a NYC cup will be picked up as an instance NYC cup.
Object detection usually consists of two steps: feature detection and classification.
In the feature detection step, the relevant features of the object to be detected are gathered.
These features are input to the second step, classification. (Even Haar cascading can be used
for feature detection, to my knowledge.) Classification involves algorithms
such as neural networks, K-nearest neighbor, and so on. The goal of classification is to find
out whether the detected features correspond to features that the object to be detected
would have. Classification generally belongs to the realm of machine learning.
Face detection, for example, is an example of object detection.
EDIT (Jul. 9, 2018):
With the advent of deep learning, neural networks with multiple hidden layers have come into wide use, making it relatively easy to see the difference between feature detection and object detection. A deep learning neural network consists of two or more hidden layers, each of which is specialized for a specific part of the task at hand. For neural networks that detect objects from an image, the earlier layers arrange low-level features into a many-dimensional space (feature detection), and the later layers classify objects according to where those features are found in that many-dimensional space (object detection). A nice introduction to neural networks of this kind is found in the Wolfram Blog article "Launching the Wolfram Neural Net Repository".
Normally objects are collections of features. A feature tends to be a very low-level primitive thing. An object implies moving the understanding of the scene to the next level up.
A feature might be something like a corner, an edge etc. whereas an object might be something like a book, a box, a desk. These objects are all composed of multiple features, some of which may be visible in any given scene.
Invariance, speed, storage; few reasons, I can think on top of my head. The other method to do would be to keep the complete image and then check whether the given image is similar to glass images you have in your database. But if you have a compressed representation of the glass, it will need lesser computation (thus faster), will need lesser storage and the features tells you the invariance across images.
Both the methods you mentioned are essentially the same with slight differences. In case of Haar, you detect the Haar features then you boost them to increase the confidence. Boosting is nothing but a meta-classifier, which smartly chooses which all Harr features to be included in your final meta-classification, so that it can give a better estimate. The other method, also more or less does this, except that you have more "sophisticated" features. The main difference is that, you don't use boosting directly. You tend to use some sort of classification or clustering, like MoG (Mixture of Gaussian) or K-Mean or some other heuristic to cluster your data. Your clustering largely depends on your features and application.
What will work in your case : that is a tough question. If I were you, I would play around with Haar and if it doesn't work, would try the other method (obs :>). Be aware that you might want to segment the image and give some sort of a boundary around for it to detect glasses.