Image Comparison using SURF - image-processing

Problem:
I have a "face" images database of multiple persons, in which for each person I have multiple images(each have something different in it in terms of facial expression like smiling, thinking, simple etc).
While testing, I am having a testing data set of "smiling face image" of persons for whom image already exist in database but images in database and test data set are not exactly same (i.e. two images of same person smiling at different time, out of which one is in database and other is in test data set).
Now, the problem is my application detects the person correctly but in facial expressions it mis-matches ex.: in place of "smiling face" sometimes it gives "simple face".
PS: Efficiency in terms of finding exact person is 100% but facial expression mis-match is a problem.
Algorithm I am using:
Image Normalization and enhancement
SURF Feature Detection and matching
Can anyone suggest what may have gone wrong, or a better algorithm/approach to solve this problem?
Is there a better algorithm than SURF for comparing two images?

I would use other face recognition algorithms, for example LBP + SVM.
You can use face-rec.org to read about face recognition algorithms, or the results page of the "Labeled Faces in the Wild" benchmark:
http://vis-www.cs.umass.edu/lfw/results.html
If you're using OpenCV, you can check out OpenCV's module for face recognition:
http://docs.opencv.org/trunk/modules/contrib/doc/facerec/
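For reference, OpenCV's face module provides an LBPH recognizer (nearest-neighbour over LBP histograms rather than LBP + SVM, but built on the same LBP features). A minimal sketch, assuming an opencv-contrib build and hypothetical file names and labels:

    import cv2
    import numpy as np

    # Hypothetical training data: grayscale, aligned face crops per person.
    train_images = [cv2.imread(p, cv2.IMREAD_GRAYSCALE)
                    for p in ["alice_1.jpg", "alice_2.jpg", "bob_1.jpg"]]
    train_labels = np.array([0, 0, 1])  # 0 = Alice, 1 = Bob

    recognizer = cv2.face.LBPHFaceRecognizer_create()
    recognizer.train(train_images, train_labels)

    test = cv2.imread("unknown.jpg", cv2.IMREAD_GRAYSCALE)
    label, distance = recognizer.predict(test)  # smaller distance = closer match
    print(label, distance)

Because LBP histograms summarize local texture over the whole face, this tends to be more robust to expression changes than raw keypoint matching.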

Related

Need advice for object detection and motion classification on real time video

I'm doing research for my final project: I want to build object detection and motion classification like Amazon Go. I have read a lot of research, such as object detection with SSD or YOLO and video classification using CNN+LSTM. I want to propose a pipeline like this:
Real-time detection of multiple objects (in my case: persons) with SSD/YOLO
Get each detected object's bounding box and crop the frame
Feed the cropped frames into a CNN+LSTM to predict the motion (whether the person is walking or taking items)
Is it possible to make this work in a real-time environment?
Or is there a better method for real-time detection and motion classification?
If you want to use this in a real-time application, several other issues must be considered that do not appear until the algorithm is deployed in a real environment.
Your proposed 3-step method could already yield good results, but the first step would have to be very accurate. I think it is better to combine the three steps into one: the way a person moves is itself a good feature for describing that person, so all the steps could be gathered into a single model.
My idea is as follows:
1. Build a video classification dataset that just tags the movement of the person or object.
2. Train a CNN+LSTM-based video classification method on it.
This should solve your problem properly.
This answer needs more detail; if you are interested, I can expand on it.
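As a rough illustration of the CNN+LSTM idea (not the answerer's actual code; clip shape, class count and hyperparameters below are assumptions), a minimal Keras sketch could look like this:

    from tensorflow.keras import layers, models

    NUM_FRAMES, H, W, C = 16, 112, 112, 3  # assumed clip shape
    NUM_CLASSES = 2                        # e.g. walking vs. taking items

    model = models.Sequential([
        layers.Input(shape=(NUM_FRAMES, H, W, C)),
        # The same small CNN is applied to every frame of the clip
        layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D()),
        layers.TimeDistributed(layers.Conv2D(64, 3, activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D()),
        layers.TimeDistributed(layers.Flatten()),
        # Per-frame features are aggregated over time by the LSTM
        layers.LSTM(128),
        layers.Dense(NUM_CLASSES, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.summary()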
I had pretty much the same problem. Motion prediction does not work that well in complex real-life situations.
I'm building a 4K video processing tool. The current approach looks like the following:
do rough but super fast segmentation
extract bounding box and shape
apply some "meta vision magic"
do precise segmentation within identified area
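The answer does not specify the segmentation method; as a hedged sketch, a rough-and-fast first pass over video could be background subtraction followed by contour bounding boxes (the file name and area threshold are assumptions, and the two-value findContours return assumes OpenCV 4):

    import cv2

    cap = cv2.VideoCapture("clip.mp4")  # hypothetical input clip
    bg = cv2.createBackgroundSubtractorMOG2()

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        mask = bg.apply(frame)  # rough foreground mask
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        # Keep only reasonably large blobs as candidate objects
        boxes = [cv2.boundingRect(c) for c in contours
                 if cv2.contourArea(c) > 500]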
As of now, the approach looks far more flexible compared to motion tracking.
"Meta vision" intended to properly track shape evolution:
(See in action)
Let's compare: meta vision disabled vs. meta vision enabled.

Human recognition using Template matching

I am using Emgu-CV to identify each person in a big room.
My camera is static and indoors.
I would like to count the number of persons who visit the room, that is, I want to recognize each person even if I capture them from different angles at different times of the day.
I am using Haar classifiers to detect the face, head and full body in the image, and then I compare these with the previously detected image portions using template matching, so that I can recognize the person. But I am getting very poor results.
Is this the right approach for this problem ? can anyone suggest a better approach ?
Or is there any better libraries available which can solve this problem ?
I think template matching is the weak point in your system. I would suggest training a Haar cascade for each person individually, which replaces (detection + recognition) with detecting one precise object. Of course, this only works if the number of people you want to recognize is rather small. Alternatively you can use something like SURF, but note its license.
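Once such a per-person cascade has been trained (e.g. with the opencv_traincascade tool), using it is a standard detectMultiScale call. A minimal sketch with a hypothetical cascade file:

    import cv2

    # "person_alice.xml" is a hypothetical cascade trained on one person
    detector = cv2.CascadeClassifier("person_alice.xml")

    gray = cv2.cvtColor(cv2.imread("room.jpg"), cv2.COLOR_BGR2GRAY)
    hits = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in hits:
        print("possible match at", x, y, w, h)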

Can Feature Matching using SIFT/SURF be used for classification of similar objects?

I have implemented the SIFT algorithm in OpenCV for feature detection and matching using the following steps:
Background Removal using Otsu's thresholding
Feature Detection using SIFT feature detector
Descriptor Extraction using SIFT feature extractor
Matching feature vectors using BFMatcher (L2 norm) and using the ratio test to filter good matches
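In OpenCV's Python API, steps 2-4 of this pipeline correspond roughly to the following sketch (file names are placeholders; 0.75 is the usual ratio-test rule of thumb):

    import cv2

    img1 = cv2.imread("heel_a.jpg", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("heel_b.jpg", cv2.IMREAD_GRAYSCALE)

    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)

    # Brute-force L2 matching with Lowe's ratio test
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    matches = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    print(len(good), "good matches")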
My objective is to classify images into different categories such as shoes, shirts etc. based on their similarity. For example two different heels should be more similar to each other than a heel and a sports shoe or a heel and a t-shirt.
However, this algorithm works well only when my template image is present in the search image (at any scale and orientation). If I compare two different heels, they don't match well and the matches are also random (the heel of one image matches the flat surface of the other image). There are also many false positives when I compare a heel with a sports shoe, a heel with a t-shirt, or a heel with the picture of a baby!
I would like to look at a heel, identify it as a heel, and return how similar it is to different images in my database, giving maximum similarity to other heels, followed by other shoes. It should not produce any similarity to irrelevant objects such as shirts, phones, or pens.
I understand that the SIFT algorithm produces a descriptor vector for each keypoint based on the gradient values of the pixels around the keypoint, and that images are matched purely on this attribute. Hence it is quite possible that a keypoint near the heel of one shoe is matched to a keypoint on the surface of the other shoe. Therefore, what I gather is that this algorithm can only detect exact matches, not similarity between images.
Could you please tell me if this algorithm can be used for my objective and if I am doing something wrong or suggest any other approach that I should use.
For classification of similar objects, I certainly would go for cascade classifiers.
Basically, a cascade classifier is a machine learning method where you train a classifier to detect an object in different images. For it to work well, you need to train it with many positive images (where your object is) and negative images (where your object is not). The method was invented by Viola and Jones in 2001.
There is a ready-made implementation in OpenCV for face detection; you will find more explanation in the OpenCV documentation.
Now, for the caveats:
First, you need a lot of positive and negative images. The more images you have, the better the algorithm will perform. Beware of overfitting: if your training dataset for heels contains, for instance, too many images of a given model, it is possible that other models will not be detected properly.
Training the cascade classifier can be long and difficult. The end result will depend on how well you choose the training parameters. Some information on this can be found on this webpage: http://coding-robin.de/2013/07/22/train-your-own-opencv-haar-classifier.html

facial expressions classification (emotions) in real time

I am currently working on a project where I have to extract the facial expression of a user (only one user at a time from a webcam) like sad or happy.
The best possibility I found so far:
I used OpenCV for face detection.
Some users on an OpenCV board suggested looking at AAM (active appearance models) and ASM (active shape models), but all I found were papers.
So I'm using Active Shape Models with Stasm, which gives me access to 77 different points within the face (as in the picture).
Now I want to know:
the best learning method to use on the Cohn-Kanade database to classify the emotions (happy, ...)?
the best method to classify facial expressions in a video in real time?
Look here for a similar solution, with a video and a description of the algorithm: http://www2.isr.uc.pt/~pedromartins/ under "Identity and Expression Recognition on Low Dimensional Manifolds" (2009).
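One common baseline for the first question is to feed the Stasm landmarks into an SVM. A hedged scikit-learn sketch, assuming the 77 (x, y) points have already been extracted and labelled from the Cohn-Kanade data (file names are hypothetical):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X = np.load("landmarks.npy")  # shape (n_samples, 154): 77 (x, y) points
    y = np.load("labels.npy")     # emotion labels, e.g. 0=neutral, 1=happy

    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X, y)
    print(clf.predict(X[:1]))

For real-time use, the same trained classifier is simply applied to the landmarks extracted from each incoming webcam frame.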

read numbers and letters from an image using openCV

I am developing an application to read the letters and numbers from an image using OpenCV in C++. I first converted the given colour image and colour template to binary images, then called cvMatchTemplate(). This method just highlights the areas where the template matches, but not clearly. I don't only want to see those areas; I need to parse the characters (letters and numbers) from the image. I am new to OpenCV. Does anybody know another method to get this result?
The image is taken from a camera; the sample image is shown above. I need to get all the text from the LED display (130 and Delft Tanthaf).
Friends, I tried the sample face detection application and it detects faces. The Haar cascade file is provided with OpenCV; I just loaded that file and called cvHaarDetectObjects(). To detect the letters I created an XML file using the letter_recog.cpp application provided with OpenCV. But when I load this file, it shows an error (OpenCV error: Unspecified error > in unknown function, file ........\ocv\opencv\src\cxcore\cxpersistence.cpp, line 4720). I searched the web for this error and found information about the lib files used. I did as suggested, but the error remains. Is the error in my XML file or in the call that loads it ((CvHaarClassifierCascade*)cvLoad("built xml file name",0,0,0))? Please help.
Thanks in advance
As of OpenCV 3.0 (in active development), you can use the built-in "scene text" object detection module.
Reference: http://docs.opencv.org/3.0-beta/modules/text/doc/erfilter.html
Example: https://github.com/Itseez/opencv_contrib/blob/master/modules/text/samples/textdetection.cpp
The text detection is built on these two papers:
[Neumann12] Neumann L., Matas J.: Real-Time Scene Text Localization and Recognition, CVPR 2012. Available online at http://cmp.felk.cvut.cz/~neumalu1/neumann-cvpr2012.pdf
[Gomez13] Gomez L. and Karatzas D.: Multi-script Text Extraction from Natural Scenes, ICDAR 2013. Available online at http://refbase.cvc.uab.es/files/GoK2013.pdf
Once you've found where the text in the scene is, you can run any standard OCR against those slices (Tesseract OCR is common). And there's now an end-to-end sample using OpenCV's new interface to Tesseract:
https://github.com/Itseez/opencv_contrib/blob/master/modules/text/samples/end_to_end_recognition.cpp
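As a small illustration of the OCR step (the bounding box and file name below are placeholders, and this uses the pytesseract wrapper rather than OpenCV's own Tesseract interface):

    import cv2
    import pytesseract

    img = cv2.imread("bus.jpg")
    x, y, w, h = 100, 50, 200, 40   # assumed box from the text detector
    roi = img[y:y + h, x:x + w]     # crop the localized text region
    print(pytesseract.image_to_string(roi))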
Template matching tends not to be robust for this sort of application because of lighting inconsistencies, orientation changes, scale changes, etc. The typical way to solve this problem is to bring in machine learning: training your own boosting classifier, as you are doing, is one possible approach. However, I don't think you are doing the training correctly. You mentioned that you gave it 1 logo as a positive training image and 5 other images not containing the logo as negative examples? Generally you need training samples on the order of hundreds or thousands. You cannot possibly train with 6 samples and expect it to work.
If you are unfamiliar with machine learning, here is roughly what you should do:
1) You need to collect many positive training samples (from hundred onwards but generally the more the merrier) of the object you are trying to detect. If you are trying to detect individual characters in the image, then get cropped images of individual characters. You can start with the MNIST database for this. Better yet, to train the classifier for your particular problem, get many cropped images of the characters on the bus from photos. If you are trying to detect the entire rectangular LED board panel, then use images of them as your positive training samples.
2) You will need to collect many negative training samples. Their number should be in the same order as the number of positive training samples you have. These could be images of the other objects that appear in the images you will run your detector on. For example, you could crop images of the front of the bus, road surfaces, trees along the road etc. and use them as negative examples. This is to help the classifier rule out these objects in the image you run your detector on. Hence, negative examples are not just any image containing objects you don't want to detect. They should be objects that could be mistaken for the object you are trying to detect in the images you run your detector on (at least for your case).
See the following link on how to train the cascade of classifier and produce the XML model file: http://note.sonots.com/SciSoftware/haartraining.html
Even though you mentioned you only want to detect the individual characters instead of the entire LED panel on the bus, I would recommend first detecting the LED panel so as to localize the region containing the characters of interest. After that, either perform template matching within this smaller region, or run a classifier trained to recognize individual characters on patches of pixels obtained from this region with a sliding-window approach, possibly at multiple scales. (Note: the Haar cascade boosting classifier you mentioned above will detect characters, but it won't tell you which character it detected unless you only train it to detect that particular character.) Detecting characters in this region in a sliding-window manner also gives you the order in which the characters appear, so you can string them into words.
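For illustration, a sliding-window scan of the localized panel region might look like this sketch (the window size and step are assumptions, and classify() stands in for whatever per-character classifier you train):

    def sliding_windows(img, win_w=20, win_h=30, step=5):
        """Yield (x, y, patch) for every window position in the region."""
        h, w = img.shape[:2]
        for y in range(0, h - win_h + 1, step):
            for x in range(0, w - win_w + 1, step):
                yield x, y, img[y:y + win_h, x:x + win_w]

    # detections = [(x, classify(patch))
    #               for x, y, patch in sliding_windows(panel)]
    # Sorting detections by x recovers the left-to-right character order.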
Hope this helps.
EDIT:
I happened to chance upon this old post of mine after separately discovering the scene text module in OpenCV 3 mentioned by @KaolinFire.
For those who are curious, this is the result of running that detector on the sample image given by the OP. Notice that the detector is able to localize the text region, even though it returns more than one bounding box.
Note that this method is not foolproof (at least this implementation in OpenCV with the default parameters). It tends to generate false-positives, especially when the input image contains many "distractors".
Here are more examples obtained using this OpenCV 3 text detector on the Google Street View dataset:
Notice that it has a tendency to find "text" between parallel lines (e.g., windows, walls etc). Since the OP's input image is likely going to contain outdoor scenes, this will be a problem especially if he/she does not restrict the region of interest to a smaller region around the LED signs.
It seems that if you are able to localize a "rough" region containing just the text (e.g., just the LED sign in the OP's sample image), then running this algorithm can help you get a tighter bounding box. But you will have to deal with the false-positives though (perhaps discarding small regions or picking among the overlapping bounding boxes using a heuristic based on knowledge about the way letters appear on the LED signs).
Here are more resources (discussion + code + datasets) on text detection.
Code
Extracting text OpenCV
Stroke Width Transform (SWT) implementation: http://libccv.org/doc/doc-swt/
Stroke Width Transform (SWT) implementation (Python): https://github.com/subokita/Robust-Text-Detection
Datasets
You will find the Google Street View and MSRA datasets here. Although the images in these datasets are not exactly the same as the ones of the LED signs on buses, they may be helpful either for picking the best-performing algorithm from among several competing algorithms, or for training a machine learning algorithm from scratch.
http://www.iapr-tc11.org/mediawiki/index.php/Datasets_List
See my answer to "How to read time from recorded surveillance camera video?" You can/should use cvMatchTemplate() to do that.
If you are working with a fixed set of bus destinations, template matching will do.
However, if you want the system to be more flexible, I would imagine you would need some form of contour/shape analysis for each individual letter.
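For a fixed destination set, the template-matching route is a few lines in OpenCV. A minimal sketch (the file names and the 0.8 threshold are assumptions):

    import cv2

    panel = cv2.imread("panel.png", cv2.IMREAD_GRAYSCALE)
    tmpl = cv2.imread("delft_tanthaf.png", cv2.IMREAD_GRAYSCALE)

    res = cv2.matchTemplate(panel, tmpl, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(res)
    if max_val > 0.8:  # assumed confidence threshold
        print("destination found at", max_loc)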
You can also look at EAST: Efficient Scene Text Detector - https://www.learnopencv.com/deep-learning-based-text-detection-using-opencv-c-python/
Under this link you have examples in both C++ and Python. I used this code to detect the numbers of buses (after detecting that a given object is a bus).
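Loading EAST through OpenCV's dnn module looks roughly like the sketch below (the model file path is an assumption; decoding scores/geometry into boxes plus non-maximum suppression is covered by the linked tutorial):

    import cv2

    net = cv2.dnn.readNet("frozen_east_text_detection.pb")  # pre-trained EAST
    img = cv2.imread("bus.jpg")
    blob = cv2.dnn.blobFromImage(img, 1.0, (320, 320),
                                 (123.68, 116.78, 103.94),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    scores, geometry = net.forward(["feature_fusion/Conv_7/Sigmoid",
                                    "feature_fusion/concat_3"])
    # scores/geometry must then be decoded and NMS-filtered into text boxes.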
