I am quite new to the area of facial expression recognition and currently I'm doing a research on this via Deep Learning specifically CNN. I have some questions with regard to preparing and/or preprocessing my data.
I have segmented videos of frontal facial expressions (e.g. 2-3 seconds video of a person expressing a happy emotion based on his/her annotations).
Note: expressions displayed by my participants are quite of low intensity (not exaggerated expressions/micro-expressions)
General Question: Now, how should I prepare my data for training with CNN (I am a bit leaning on using a deep learning library, TensorFlow)?
Question 1: I have read some deep learning-based facial expression recognition (FER) papers that suggest to take the peak of that expression (most probably a single image) and use that image as part of your training data. How would I know the peak of an expression? What's my basis? If I'm going to take a single image, wouldn't some important frames of the subtlety of expression displayed by my participants be lost?
Question 2: Or would it be also correct to execute the segmented video in OpenCV in order to detect (e.g. Viola-Jones), crop and save the faces per frame, and use those images as part of my training data with their appropriate labels? I'm guessing some frames of faces to be redundant. However, since we knew that the participants in our data shows low intensity of expressions (micro-expressions), some movements of the face could also be important.
I would really appreciate anyone who can answer, thanks a lot!
As #unique monkey already pointed out, this is generally a supervised learning task. If you wish to extract an independent "peak" point, I recommend that you scan the input images and find the one in each sequence whose reference points deviate most from the subject's resting state.
If you didn't get a resting state, then how are the video clips cropped? For instance, were the subjects told to make the expression and hold it? What portion of the total expression (before, express, after) does the clip cover? Take one or both endpoints of the video clip; graph the movements of the reference points from each end, and look for a frame in which the difference is greatest, but then turns toward the other endpoint.
answer 1: Commonly we always depend on human's sense to decide which expression is the peak of the expression(I think you can distinguish the difference in smile and laugh)
answer 2: if you want to get a good result, I suggest you not treat data so rude like this method
Related
I dont mean that a neural network can complete the work of traditional image processing algorithm.What i want to say is if it exists a kind of neural network can use the parameters of the traditional method as input and outputs more universal parameters that dont require manual adjustment.Intuitively, my ideas are less efficient than using neural networks directly,but I don't know much about the mathematics of neural networks.
If I understood correctly, what you mean is for a traditional method (let's say thresholding), you want to find the best parameters using ann. It is possible but you have to supply so many training data which needs to be created, processed and evaluated that it will take a lot of time. AFAIK many mobile phones that have AI assisted camera use this method to find the best aperture, exposure..etc.
First of all, thank you very much. I still have two things to figure out. If I wanted to get a (or a set of) relatively optimal parameters, what data set would I need to build (such as some kind of error between input and output and threshold) ? Second, as you give an example, is it more efficient or better than traversal or Otsu to select the optimal threshold through neural networks in practice?To be honest, I wonder if this is really more efficient than training input and output directly using neural networks
For your second question, Otsu only works on cases where the histogram has two distinct peaks. Thresholding is a simple function but the cut-off value is based on your objective; there is no single "best" value valid for every case. So if you want to train a model for thresholding, I think you have to come up with separate models for each case (like a model for thresholding bright objects, another for darker ones...etc.) Maybe an additional output parameter for determining the aim works but I am not sure. Will it be more efficient and better? Depends on the case (and your definition of better). Otsu, traversal or adaptive thresholding does not work all the time (actually Otsu has very specific use cases). If they work for your case, excellent. If not, then things get messy. So to answer your question, it depends on your problem at hand.
For the first question, TBF, it is quite difficult to work with images in traditional ANNs. Images have a lot of pixels, so standard ANNs struggle with inputs. Moreover, when the location/scale of an object in the image changes, the whole pixel data changes even though the content is the same (These are the reasons why CNN's are superior to ANN's for images). For these reasons it is better to use processed metrics which contain condensed and location-invariant information. E.g. for thresholding, you can give the histogram and it returns a thresholding value. Therefore you need an ann with 256 input neurons (for an intensity histogram of 8bit grayscale image), 1 output neuron, and 1-2 middle layers with some deeply connected neurons (128 maybe?). Your training data will be a bunch of histograms as input and corresponding best threshold value for each histogram. Then once training is finished, you can give the ANN a histogram it has never seen before and it will tell you the optimal thresholding value based on its training.
what I want to do is a model that can output different parameters (parameter sets) based on different input images, so I think if you choose a good enough data set it should be somewhat universal.
Most likely, but your data set should be quite inclusive of expected images (in terms of metrics and features), which means it has to be large.
Also, I don't know much about modeling -- can I use a function about the output/parameters (which might be a function about the result of the traditional method) as an error in the back-propagation by create a custom loss function?
I think so, but training the model will be more involved than using predefined loss functions because, well, you have to write them. Also you have to test they work as expected.
I'm in research for my final project, i want to make object detection and motion classification like amazon go, i have read lot of research like object detection with SSD or YOLO and video classification using CNN+LSTM, i want to propose training algorithm like this:
Real time detection for multiple object (in my case: person) with SSD/YOLO
Get the boundary object and crop the frame
Feed cropped frame info to CNN+LSTM algo to make motion prediction (if the person's walking/taking items)
is it possible to make it in real-time environment?
or is there any better method for real-time detection and motion classification
If you want to use it in real-time application, several other things must be considered which are not appeared before implementation of algorithm in real environment.
About your 3-step proposed method, it already could be result in a good method, but the first step would be very accurate. I think it is better to combine the 3 steps in one step. Because the motion type of person is a good feature of a person. Because of that, I think all steps could be gathered in one step.
My idea is as follows:
1. a video classification dataset which just tag the movement of person or object
2. cnn-lstm based video classification method
This would solve your project properly.
This answer need to more details, if u interested in, I can answer u in more details.
Had pretty much the same problem. Motion prediction does not work that well in complex real-life situations. Here is a simple one:
(See in action)
I'm building a 4K video processing tool (some examples). Current approach looks like the following:
do rough but super fast segmentation
extract bounding box and shape
apply some "meta vision magic"
do precise segmentation within identified area
(See in action)
As of now the approach looks way more flexible comparing to motion tracking.
"Meta vision" intended to properly track shape evolution:
(See in action)
Let's compare:
Meta vision disabled
Meta vision enabled
Recently, I have been creating neural networks with the plan to train them to play games that I have made (where the neural networks have access to all of the game data). I have a strong understanding of neural networks, genetic algorithms, and the NEAT implementation. The issue that I have run into, however, is normalizing the input for what the player sees. If we have an enemy object, a medkit object, and a weapon object, they need to be input and treated differently. I saw a video from SethBling here, where he gives a brief explaination to how he set that neural network up. He used only the values 1, 0, and -1. However, for more complex games, that wouldn't work. I've tried getting a small simulation to return true when the input is .25 < x < .75 and false otherwise, however it couldn't find a solution. Therefore, it seems that I cannot just toss the IDs of the objects into the neural network. Any help is much appreciated!
I was going to leave a comment, but unfortunately I don't have enough reputation points. So, I have a couple of suggestions that will hopefully give you some ideas. I am going to assume that you are using the NEAT algorithm as your game playing algorithm. Now, from what I gather your issue is that because you have a varying number of actual objects to interact with in your game it is practically impossible to give everything a label for every frame. Therefore, as you seem to have figured out you would need to provide a non-integer domain for the objects, whether that be normalizing the detected class id by the total number of classes, or some other method. I have 3 propositions for you in order to try to accomplish this (or avoid the problem):
1: Use some type of image manipulation (whether it be segmentation or thresholding), object detection along with image moments in order to create a database of objects that you are interested in, and then while the game is playing, you can re-create a simpler version of the actual game environment
2: Train a semantic segmentation neural network in order to perform something similar to 1
3: Train (or use a pre-trained) deep convolutional neural network to extract high-level features. Then use these features (and potentially some kind of location encoding method) as the input to your NEAT algorithm. Your NEAT algorithm would then select which combination of filters it would like to look at in order to make the decision.
I think I would personally try option number 3 as it requires the least amount of manual work in order to set up initially.
I hope this gives you a couple of ideas.
When working with neural networks a good aproach is to think of them as humans. What would a human need to know?
What I always do is I give a network different sensors for different types of things. My sensors are almost always just rays of a certain length.
In your situation I believe you have no better choice than using for example 3 sensors for coins, 3 sensors for medic cits etc.
You also have to decide if you want to use boolean sensors (which would only tell the network if something triggered the sensor or not, no matter how close) or double sensors (which would return a value telling the network how close the event happened) or if you want to have both combined.
I have a large image (5400x3600) that has multiple CCTVs that I need to detect.
The detection takes lot of time (4-7 minutes) with rotation. But it still fails to resolve certain CCTVs.
What is the best method to match a template like this?
I am using skImage - openCV is not an option for me, but I am open to suggestions on that too.
For example: in the images below, the template is correct matched with the second image - but the first image is not matched - I guess due to the noise created by the text "BLDG..."
Template:
Source image:
Match result:
The fastest method is probably a cascade of boosted classifiers trained with several variations of your logo and possibly a few rotations and some negative examples too (non-logos). You have to roughly scale your overall image so the test and training examples are approximately matched by scale. Unlike SIFT or SURF that spend a lot of time in searching for interest points and creating descriptors for both learning and searching, binary classifiers shift most of the burden to a training stage while your testing or search will be much faster.
In short, the cascade would run in such a way that a very first test would discard a large portion of the image. If the first test passes the others will follow and refine. They will be super fast consisting of just a few intensity comparison in average around each point. Only a few locations will pass the whole cascade and can be verified with additional tests such as your rotation-correlation routine.
Thus, the classifiers are effective not only because they quickly detect your object but because they can also quickly discard non-object areas. To read more about boosted classifiers see a following openCV section.
This problem in general is addressed by Logo Detection. See this for similar discussion.
There are many robust methods for template matching. See this or google for a very detailed discussion.
But from your example i can guess that following approach would work.
Create a feature for your search image. It essentially has a rectangle enclosing "CCTV" word. So the width, height, angle, and individual character features for matching the textual information could be a suitable choice. (Or you may also use the image having "CCTV". In that case the method will not be scale invariant.)
Now when searching first detect rectangles. Then use the angle to prune your search space and also use image transformation to align the rectangles in parallel to axis. (This should take care of the need for the rotation). Then according to the feature choosen in step 1, match the text content. If you use individual character features, then probably your template matching step is essentially a classification step. Otherwise if you use image for matching, you may use cv::matchTemplate.
Hope it helps.
Symbol spotting is more complicated than logo spotting because interest points work hardly on document images such as architectural plans. Many conferences deals with pattern recognition, each year there are many new algorithms for symbol spotting so giving you the best method is not possible. You could check IAPR conferences : ICPR, ICDAR, DAS, GREC (Workshop on Graphics Recognition), etc. This researchers focus on this topic : M Rusiñol, J Lladós, S Tabbone, J-Y Ramel, M Liwicki, etc. They work on several techniques for improving symbol spotting such as : vectorial signatures, graph based signature and so on (check google scholar for more papers).
An easy way to start a new approach is to work with simples shapes such as lines, rectangles, triangles instead of matching everything at one time.
Your example can be recognized by shape matching (contour matching), much faster than 4 minutes.
For good match , you require nice preprocess and denoise.
examples can be found http://www.halcon.com/applications/application.pl?name=shapematch
I am developing an application to read the letters and numbers from an image using opencv in c++. I first changed the given colour image and colour template to binary image, then called the method cvMatchTemplate(). This method just highlighted the areas where the template matches.. But not clear.. I just dont want to see the area.. I need to parse the characters(letters & numbers) from the image. I am new to openCV. Does anybody know any other method to get the result??
Image is taken from camera. the sample image is shown above. I need to get all the texts from the LED display(130 and Delft Tanthaf).
Friends I tried with the sample application of face detection, It detects the faces. the HaarCascade file is provided with the openCV. I just loaded that file and called the method cvHaarDetectObjects(); To detect the letters I created the xml file by using the application letter_recog.cpp provided by openCV. But when I loading this file, it shows some error(OpenCV error: UnSpecified error > in unknown function, file ........\ocv\opencv\src\cxcore\cxpersistence.cpp,line 4720). I searched in web for this error and got the information about lib files used. I did so, but the error still remains. Is the error with my xml file or calling the method to load this xml file((CvHaarClassifierCascade*)cvLoad("builded xml file name",0,0,0);)?? please HELP...
Thanks in advance
As of OpenCV 3.0 (in active dev), you can use the built-in "scene text" object detection module ~
Reference: http://docs.opencv.org/3.0-beta/modules/text/doc/erfilter.html
Example: https://github.com/Itseez/opencv_contrib/blob/master/modules/text/samples/textdetection.cpp
The text detection is built on these two papers:
[Neumann12] Neumann L., Matas J.: Real-Time Scene Text Localization
and Recognition, CVPR 2012. The paper is available online at
http://cmp.felk.cvut.cz/~neumalu1/neumann-cvpr2012.pdf
[Gomez13] Gomez L. and Karatzas D.: Multi-script Text Extraction from
Natural Scenes, ICDAR 2013. The paper is available online at
http://refbase.cvc.uab.es/files/GoK2013.pdf
Once you've found where the text in the scene is, you can run any sort of standard OCR against those slices (Tesseract OCR is common). And there's now an end-to-end sample in opencv using OpenCV's new interface to Tesseract:
https://github.com/Itseez/opencv_contrib/blob/master/modules/text/samples/end_to_end_recognition.cpp
Template matching tend not to be robust for this sort of application because of lighting inconsistencies, orientation changes, scale changes etc. The typical way of solving this problem is to bring in machine learning. What you are trying to do by training your own boosting classifier is one possible approach. However, I don't think you are doing the training correctly. You mentioned that you gave it 1 logo as a positive training image and 5 other images not containing the logo as negative examples? Generally you need training samples to be in the order of hundreds or thousands or more. You cannot possibly train with 6 training samples and expect it to work.
If you are unfamiliar with machine learning, here is roughly what you should do:
1) You need to collect many positive training samples (from hundred onwards but generally the more the merrier) of the object you are trying to detect. If you are trying to detect individual characters in the image, then get cropped images of individual characters. You can start with the MNIST database for this. Better yet, to train the classifier for your particular problem, get many cropped images of the characters on the bus from photos. If you are trying to detect the entire rectangular LED board panel, then use images of them as your positive training samples.
2) You will need to collect many negative training samples. Their number should be in the same order as the number of positive training samples you have. These could be images of the other objects that appear in the images you will run your detector on. For example, you could crop images of the front of the bus, road surfaces, trees along the road etc. and use them as negative examples. This is to help the classifier rule out these objects in the image you run your detector on. Hence, negative examples are not just any image containing objects you don't want to detect. They should be objects that could be mistaken for the object you are trying to detect in the images you run your detector on (at least for your case).
See the following link on how to train the cascade of classifier and produce the XML model file: http://note.sonots.com/SciSoftware/haartraining.html
Even though you mentioned you only want to detect the individual characters instead of the entire LED panel on the bus, I would recommend first detecting the LED panel so as to localize the region containing the characters of interest. After that, either perform template matching within this smaller region or run a classifier trained to recognize individual characters on patches of pixels in this region obtained using sliding window approach, and possibly at multiple scale. (Note: The haarcascade boosting classifier you mentioned above will detect characters but it won't tell you which character it detected unless you only train it to detect that particular character...) Detecting characters in this region in a sliding window manner will give you the order the characters appear so you can string them into words etc.
Hope this helps.
EDIT:
I happened to chance upon this old post of mine after separately discovering the scene text module in OpenCV 3 mentioned by #KaolinFire.
For those who are curious, this is the result of running that detector on the sample image given by the OP. Notice that the detector is able to localize the text region, even though it returns more than one bounding box.
Note that this method is not foolproof (at least this implementation in OpenCV with the default parameters). It tends to generate false-positives, especially when the input image contains many "distractors".
Here are more examples obtained using this OpenCV 3 text detector on the Google Street View dataset:
Notice that it has a tendency to find "text" between parallel lines (e.g., windows, walls etc). Since the OP's input image is likely going to contain outdoor scenes, this will be a problem especially if he/she does not restrict the region of interest to a smaller region around the LED signs.
It seems that if you are able to localize a "rough" region containing just the text (e.g., just the LED sign in the OP's sample image), then running this algorithm can help you get a tighter bounding box. But you will have to deal with the false-positives though (perhaps discarding small regions or picking among the overlapping bounding boxes using a heuristic based on knowledge about the way letters appear on the LED signs).
Here are more resources (discussion + code + datasets) on text detection.
Code
Extracting text OpenCV
http://libccv.org/doc/doc-swt/
Stroke Width Transform (SWT) implementation (Python)
https://github.com/subokita/Robust-Text-Detection
Datasets
You will find the google streetview and MSRA datasets here. Although the images in these datasets are not exactly the same as the ones for the LED signs on buses, they may be helpful either for picking the "best" performing algorithm from among several competing algorithms, or to train a machine learning algorithm from scratch.
http://www.iapr-tc11.org/mediawiki/index.php/Datasets_List
See my answer to How to read time from recorded surveillance camera video? You can/should use cvMatchTemplate() to do that.
If you are working with a fixed set of bus destinations, template matching will do.
However, if you want the system to be more flexible, I would imagine you would need some form of contour/shape analysis for each individual letter.
You can also look at EAST: Efficient Scene Text Detector - https://www.learnopencv.com/deep-learning-based-text-detection-using-opencv-c-python/
Under this link, you have examples with C++ and Python. I used this code to detect numbers of buses (after detecting that given object is a bus).