I am currently working on a project where I have to extract the facial expression of a user (only one user at a time from a webcam) like sad or happy.
The best approach I have found so far:
I used OpenCV for face detection.
Someone on an OpenCV board suggested looking into AAM (Active Appearance Models) and ASM (Active Shape Models), but all I found were papers.
So I'm using Active Shape Models via Stasm, which gives me access to 77 different points within the face, as shown in the picture.
Now I want to know:
What is the best learning method to use on the Cohn-Kanade database to classify the emotions (happy, ...)?
What is the best method to classify facial expressions in a video in real time?
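For context, this is roughly the kind of classifier I have in mind on top of the 77 Stasm points (just a sketch, assuming scikit-learn; the landmark/label files and the centering step are placeholders, not an actual Cohn-Kanade reader):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Placeholders: landmarks has shape (n_samples, 77, 2) with the Stasm points per image,
    # labels has shape (n_samples,) with emotion ids (0 = happy, 1 = sad, ...).
    landmarks = np.load("ck_landmarks.npy")
    labels = np.load("ck_labels.npy")

    # Center each face on its mean point so absolute position does not matter.
    centered = landmarks - landmarks.mean(axis=1, keepdims=True)
    X = centered.reshape(len(centered), -1)          # flatten to 154-dimensional feature vectors

    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(X_train, y_train)
    print("accuracy:", clf.score(X_test, y_test))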
Look here for a similar solution, with a video and a description of the algorithm: http://www2.isr.uc.pt/~pedromartins/ under "Identity and Expression Recognition on Low Dimensional Manifolds" (2009).
I'm doing research for my final project: I want to build object detection and motion classification like Amazon Go. I have read a lot of research, such as object detection with SSD or YOLO and video classification using CNN+LSTM, and I want to propose a training pipeline like this:
Real-time detection of multiple objects (in my case: persons) with SSD/YOLO
Get the bounding box of each object and crop the frame
Feed the cropped frames to a CNN+LSTM model to predict the motion (whether the person is walking / taking items); a rough sketch of this pipeline is below
Is it possible to make this work in a real-time environment?
Or is there a better method for real-time detection and motion classification?
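Here is the rough sketch of the pipeline I'm proposing (assuming an OpenCV DNN MobileNet-SSD detector; the model files and the CNN+LSTM classifier are placeholders):

    import collections
    import cv2

    # Hypothetical model files; any SSD/YOLO model loadable by cv2.dnn would do.
    net = cv2.dnn.readNetFromCaffe("MobileNetSSD_deploy.prototxt", "MobileNetSSD_deploy.caffemodel")
    PERSON_CLASS_ID = 15                 # "person" class in the VOC-trained MobileNet-SSD
    clip = collections.deque(maxlen=16)  # rolling window of crops for the CNN+LSTM

    def classify_motion(frames):
        # Placeholder for the CNN+LSTM model (e.g., walking vs. taking an item).
        return "walking"

    cap = cv2.VideoCapture(0)
    ok, frame = cap.read()
    while ok:
        h, w = frame.shape[:2]
        blob = cv2.dnn.blobFromImage(cv2.resize(frame, (300, 300)), 0.007843, (300, 300), 127.5)
        net.setInput(blob)
        detections = net.forward()       # shape (1, 1, N, 7)
        for i in range(detections.shape[2]):
            class_id = int(detections[0, 0, i, 1])
            confidence = detections[0, 0, i, 2]
            if class_id == PERSON_CLASS_ID and confidence > 0.5:
                x1, y1, x2, y2 = (detections[0, 0, i, 3:7] * [w, h, w, h]).astype(int)
                clip.append(cv2.resize(frame[y1:y2, x1:x2], (112, 112)))
        if len(clip) == clip.maxlen:
            print(classify_motion(list(clip)))
        ok, frame = cap.read()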
If you want to use it in a real-time application, several other issues must be considered that only show up once the algorithm is deployed in a real environment.
Regarding your proposed 3-step method: it could already turn into a good method, but the first step would have to be very accurate. I think it is better to combine the three steps into one, because a person's type of motion is itself a good feature for identifying that person, so all of the steps could be gathered into a single model.
My idea is as follows:
1. A video classification dataset that simply tags the movement of the person or object
2. A CNN+LSTM-based video classification method (sketched below)
This should handle your project properly. This answer could use more detail; if you're interested, I can elaborate.
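For concreteness, a minimal CNN+LSTM video-classification sketch (assuming Keras; the clip length, input size, layer widths, and the two motion classes are placeholders):

    from tensorflow.keras import layers, models

    # Input: a clip of 16 frames, each 112x112 RGB; output: motion class (e.g., walking vs. taking an item).
    model = models.Sequential([
        layers.Input(shape=(16, 112, 112, 3)),
        layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D()),
        layers.TimeDistributed(layers.Conv2D(64, 3, activation="relu")),
        layers.TimeDistributed(layers.MaxPooling2D()),
        layers.TimeDistributed(layers.Flatten()),
        layers.LSTM(128),
        layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    # model.fit(clips, clip_labels, epochs=10)  # clips: (n, 16, 112, 112, 3), clip_labels: (n,)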
Had pretty much the same problem. Motion prediction does not work that well in complex real-life situations. Here is a simple one:
(See in action)
I'm building a 4K video processing tool (some examples). The current approach looks like the following:
do rough but super fast segmentation
extract bounding box and shape
apply some "meta vision magic"
do precise segmentation within the identified area (a rough code sketch of this two-pass idea is at the end of this answer)
(See in action)
As of now, the approach looks much more flexible compared to motion tracking.
"Meta vision" is intended to properly track shape evolution:
(See in action)
Let's compare:
Meta vision disabled
Meta vision enabled
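For reference, a rough sketch of the rough-then-precise idea (steps 1, 2 and 4), assuming OpenCV, with background subtraction standing in for the fast pass and GrabCut for the precise pass; the actual tool's segmentation and the "meta vision" step are not shown:

    import cv2
    import numpy as np

    cap = cv2.VideoCapture("input_4k.mp4")        # placeholder path
    bg = cv2.createBackgroundSubtractorMOG2()     # rough but fast segmentation

    ok, frame = cap.read()
    while ok:
        small = cv2.resize(frame, None, fx=0.25, fy=0.25)   # work at low resolution first
        rough = bg.apply(small)
        rough = cv2.morphologyEx(rough, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
        contours, _ = cv2.findContours(rough, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)
            if w * h < 400:
                continue
            # Scale the box back to full resolution and refine it with GrabCut.
            rect = (x * 4, y * 4, w * 4, h * 4)
            mask = np.zeros(frame.shape[:2], np.uint8)
            bgd, fgd = np.zeros((1, 65), np.float64), np.zeros((1, 65), np.float64)
            cv2.grabCut(frame, mask, rect, bgd, fgd, 3, cv2.GC_INIT_WITH_RECT)
            precise = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)
        ok, frame = cap.read()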
I am quite new to the area of facial expression recognition, and I'm currently doing research on it via deep learning, specifically CNNs. I have some questions about preparing and/or preprocessing my data.
I have segmented videos of frontal facial expressions (e.g., a 2-3 second video of a person expressing a happy emotion, based on his/her annotations).
Note: the expressions displayed by my participants are of quite low intensity (micro-expressions rather than exaggerated expressions).
General Question: Now, how should I prepare my data for training with a CNN? (I'm leaning toward using a deep learning library, TensorFlow.)
Question 1: I have read some deep learning-based facial expression recognition (FER) papers that suggest taking the peak of the expression (most likely a single image) and using that image as part of the training data. How would I know the peak of an expression? What is my basis? If I take a single image, wouldn't some important frames capturing the subtlety of my participants' expressions be lost?
Question 2: Or would it also be correct to run the segmented video through OpenCV in order to detect (e.g., Viola-Jones), crop, and save the face in every frame, and use those images, with their appropriate labels, as part of my training data? I'm guessing some of the face frames will be redundant. However, since the participants in our data show low-intensity expressions (micro-expressions), some movements of the face could also be important.
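For reference, this is roughly what I mean in Question 2 (a sketch, assuming OpenCV's bundled Haar cascade; the paths and output layout are placeholders):

    import os
    import cv2

    cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture("participant01_happy.mp4")   # one segmented, labelled clip
    os.makedirs("frames/happy", exist_ok=True)

    idx = 0
    ok, frame = cap.read()
    while ok:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in faces:
            crop = cv2.resize(gray[y:y + h, x:x + w], (96, 96))
            cv2.imwrite(f"frames/happy/{idx:05d}.png", crop)
            idx += 1
        ok, frame = cap.read()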
I would really appreciate anyone who can answer, thanks a lot!
As @unique monkey already pointed out, this is generally a supervised learning task. If you wish to extract an independent "peak" point, I recommend that you scan the input images and find the one in each sequence whose reference points deviate most from the subject's resting state.
If you don't have a resting state, then how are the video clips cropped? For instance, were the subjects told to make the expression and hold it? What portion of the total expression (before, during, after) does the clip cover? Take one or both endpoints of the video clip; graph the movements of the reference points from each end, and look for the frame at which the deviation is greatest before it turns back toward the other endpoint.
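For illustration, a minimal sketch of that peak-selection idea, assuming you already have per-frame landmark coordinates as a NumPy array and that the first frame approximates the resting state:

    import numpy as np

    def peak_frame(landmarks):
        """landmarks: shape (n_frames, n_points, 2); frame 0 is taken as the resting state."""
        rest = landmarks[0]
        deviation = np.sqrt(((landmarks - rest) ** 2).sum(axis=(1, 2)))  # per-frame distance from rest
        return int(np.argmax(deviation))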
Answer 1: Commonly we rely on human judgment to decide which frame is the peak of the expression (I think you can distinguish the difference between a smile and a laugh).
Answer 2: If you want a good result, I suggest you not treat the data as crudely as that method does.
I'm curious which technologies and algorithms the large companies use. I only found that Microsoft uses the PhotoDNA technology, but that only covers how photos are compared. I'm also interested in how they automatically detect pornographic images.
For example, are any of these methods used: skin detection, ROI detection, bag-of-visual-words?
Seminal work in this field was done in Fleck, Margaret M., David A. Forsyth, and Chris Bregler, "Finding naked people," Computer Vision - ECCV '96, Springer Berlin Heidelberg, 1996, pp. 593-602. The approach detects skin-colored regions and then determines whether or not the regions match predefined human shapes. More on their skin detection algorithm here: http://www.cs.hmc.edu/~fleck/naked-skin.html
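For a rough illustration of the skin-colored-region step (a sketch with OpenCV; the HSV thresholds below are a common heuristic, not Fleck and Forsyth's actual skin model):

    import cv2
    import numpy as np

    img = cv2.imread("photo.jpg")
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    # Heuristic skin range in HSV; real systems tune this or learn it from data.
    mask = cv2.inRange(hsv, np.array([0, 40, 60]), np.array([25, 180, 255]))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    skin_ratio = cv2.countNonZero(mask) / float(mask.size)
    print("fraction of skin-colored pixels:", skin_ratio)  # one crude signal, fed to further shape analysis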
More recent papers with summaries of current methods are available:
http://iseclab.org/people/cplatzer/papers/sfcs05-platzer.pdf
http://arxiv.org/abs/1402.5792
You may also take a look at: What is the best way to programatically detect porn images?
Update 2016: use a convnet. Convnets are far better at building high-resolution filters. I wrote about this in more detail here:
http://blog.clarifai.com/what-convolutional-neural-networks-see-at-when-they-see-nudity/
https://www.slideshare.net/mobile/RyanCompton1/what-convnets-look-at-when-they-look-at-nudity
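For illustration, a minimal transfer-learning sketch in that spirit (assuming Keras with a MobileNetV2 backbone; the directory layout and classes are placeholders, and this is not any production model):

    import tensorflow as tf

    # Expects data/train/<class>/*.jpg with two classes, e.g. "safe" and "nsfw" (placeholder layout).
    train = tf.keras.utils.image_dataset_from_directory("data/train", image_size=(224, 224), batch_size=32)

    base = tf.keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                             input_shape=(224, 224, 3), weights="imagenet")
    base.trainable = False                                 # train only the new head at first

    model = tf.keras.Sequential([
        tf.keras.layers.Rescaling(1.0 / 127.5, offset=-1), # MobileNetV2 expects inputs in [-1, 1]
        base,
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(train, epochs=3)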
Problem:
I have a "face" images database of multiple persons, in which for each person I have multiple images(each have something different in it in terms of facial expression like smiling, thinking, simple etc).
While testing, I am having a testing data set of "smiling face image" of persons for whom image already exist in database but images in database and test data set are not exactly same (i.e. two images of same person smiling at different time, out of which one is in database and other is in test data set).
Now, the problem is my application detects the person correctly but in facial expressions it mis-matches ex.: in place of "smiling face" sometimes it gives "simple face".
PS: Efficiency in terms of finding exact person is 100% but facial expression mis-match is a problem.
Algo I am using:
Image Normalization and enhancement
SURF Feature Detection and matching
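For reference, this is roughly how my SURF detection and matching step works (a sketch, assuming opencv-contrib built with the non-free modules; the paths are placeholders and the 0.75 ratio-test threshold is the usual heuristic):

    import cv2

    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)   # requires opencv-contrib / non-free build

    db_img = cv2.imread("db/person1_smiling.png", cv2.IMREAD_GRAYSCALE)       # placeholder paths
    test_img = cv2.imread("test/person1_smiling_later.png", cv2.IMREAD_GRAYSCALE)

    kp1, des1 = surf.detectAndCompute(db_img, None)
    kp2, des2 = surf.detectAndCompute(test_img, None)

    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]           # Lowe's ratio test
    score = len(good) / max(len(kp1), 1)                                       # crude similarity score
    print("match score:", score)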
Can anyone suggest what may have gone wrong, or a better algorithm/approach to solve this problem?
Is there a better algorithm than SURF for comparing two images?
I would use other face recognition algorithms, for example LBP + SVM.
You can use face-rec.org to read about face recognition algorithms, or check the results page of the "Labeled Faces in the Wild" benchmark:
http://vis-www.cs.umass.edu/lfw/results.html
If you're using OpenCV, you can check out OpenCV's face recognition module:
http://docs.opencv.org/trunk/modules/contrib/doc/facerec/
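For example, a minimal sketch of the LBP-based route using OpenCV's contrib face module (note this uses LBP histograms with nearest-neighbour matching rather than LBP + SVM, and the file paths and labels are placeholders):

    import cv2
    import numpy as np

    # Equally sized grayscale face crops; labels are integer ids (per person or per expression).
    images = [cv2.imread(p, cv2.IMREAD_GRAYSCALE) for p in ["db/p1_smile.png", "db/p1_simple.png"]]
    labels = [0, 1]
    test_face = cv2.imread("test/p1_smile_later.png", cv2.IMREAD_GRAYSCALE)

    recognizer = cv2.face.LBPHFaceRecognizer_create()   # requires opencv-contrib-python
    recognizer.train(images, np.array(labels))

    label, distance = recognizer.predict(test_face)     # lower distance means a closer match
    print(label, distance)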
I am working with SVM-light. I would like to use SVM-light to train a classifier for object detection. I figured out the syntax to start training:
svm_learn example2/train_induction.dat example2/model
My problem: how can I build the "train_induction.dat" from a set of positive and negative pictures?
There are two parts to this question:
What feature representation should I use for object detection in images with SVMs?
How do I create an SVM-light data file with (whatever feature representation)?
For an intro to the first question, see Wikipedia's outline. Bag of words models based on SIFT or sometimes SURF or HOG features are fairly standard.
For the second, it depends a lot on what language / libraries you want to use. The features can be extracted from the images using something like OpenCV, vlfeat, or many others. You can then convert those features to the SVM-light format as described on the SVM-light homepage (no anchors on that page; search for "The input file").
If you update with what language and library you want to use, we can give more specific advice.
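For illustration, assuming Python, here is a tiny sketch that writes feature vectors (e.g., bag-of-words histograms) in SVM-light's sparse format; the feature extraction itself is omitted:

    def write_svmlight(path, feature_vectors, labels):
        """labels: +1 for positive images, -1 for negative; SVM-light feature indices are 1-based."""
        with open(path, "w") as f:
            for label, feats in zip(labels, feature_vectors):
                pairs = " ".join(f"{i + 1}:{v:g}" for i, v in enumerate(feats) if v != 0)
                f.write(f"{label} {pairs}\n")

    # e.g. write_svmlight("example2/train_induction.dat", bow_histograms, labels)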