I know that cv2.face.LBPHFaceRecognizer_create() is used to recognize faces in real time, but I want to know what its function is, what exists inside this instruction, and how it works.
I want to understand its structure: for example, does it take the image and extract characteristic features in LBPH form, then train on the images with some trainer, and compare the images so it can recognize them?
Please share any information or document that could help me.
LBP (Local Binary Patterns) is one way to extract characteristic features of an object (which could be a face, a coffee cup, or anything else that has a visual representation). The LBP algorithm is really straightforward and can even be done manually (pixel thresholding plus pixel-level arithmetic operations).
LBP Algorithm:
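As an illustration, here is a minimal sketch of the basic 3x3 LBP operator in Python/NumPy (OpenCV's recognizer actually uses an extended, circular variant with a configurable radius and neighbour count, but the idea is the same): each pixel's 8 neighbours are thresholded against the centre pixel and packed into an 8-bit code.

    import numpy as np

    def lbp_image(gray):
        # Basic 3x3 LBP: threshold the 8 neighbours of each pixel against
        # the centre pixel and pack the results into an 8-bit code.
        h, w = gray.shape
        center = gray[1:-1, 1:-1]
        out = np.zeros((h - 2, w - 2), dtype=np.uint8)
        # Neighbour offsets, enumerated clockwise from the top-left corner.
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                   (1, 1), (1, 0), (1, -1), (0, -1)]
        for bit, (dy, dx) in enumerate(offsets):
            neighbour = gray[1 + dy : h - 1 + dy, 1 + dx : w - 1 + dx]
            out |= (neighbour >= center).astype(np.uint8) << bit
        return out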
There is a "training" part in OpenCV's FaceRecognizer methods. Don't make this confuse you, there is no deep learning approach here. Just simple math.
OpenCV transforms LBP images into histograms to store the spatial information of faces, using the representation proposed by Timo Ahonen, Abdenour Hadid, and Matti Pietikäinen ("Face recognition with local binary patterns", Computer Vision - ECCV 2004, pages 469-481, Springer, 2004), which divides the LBP image into local regions of size m, extracts a histogram for each region, and concatenates them.
After gathering this information about one person's (one label's) face, the rest is simple. During inference, the recognizer computes the test face's LBP image, divides it into the same regions, and creates a histogram. It then compares the Euclidean distance between this histogram and the trained faces' histograms; if the distance is less than the tolerance value, it counts as a match. (Other distance measures, such as the chi-squared distance or the absolute (L1) distance, can be used as well.)
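A minimal usage sketch of the API the question asks about (requires the opencv-contrib-python package; the file names and labels below are hypothetical):

    import cv2
    import numpy as np

    # Hypothetical training data: aligned grayscale face crops per person.
    faces = [cv2.imread(f, cv2.IMREAD_GRAYSCALE)
             for f in ("alice1.png", "alice2.png", "bob1.png")]
    labels = np.array([0, 0, 1], dtype=np.int32)  # 0 = Alice, 1 = Bob

    recognizer = cv2.face.LBPHFaceRecognizer_create()
    recognizer.train(faces, labels)   # stores regional LBP histograms per image

    test = cv2.imread("unknown.png", cv2.IMREAD_GRAYSCALE)
    label, distance = recognizer.predict(test)  # smaller distance = closer match
    print(label, distance)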
Note: this question was originally asked on the OpenCV forum a couple of days ago.
I'm building an image processing program that makes extensive use of the 2-dimensional DFT (discrete Fourier transform). I'm trying to speed it up so that it runs in real time.
In this application I only use the part of the DFT output specified by a rectangular ROI. My current implementation follows the steps below:
Compute the DFT of the input image f (typically of size 512x512) and get the entire DFT result F
Crop F to a pre-specified region of interest (ROI, typically of size 32x32, located arbitrarily), R
This process basically works well, but it involves wasted computation, since I only need a small part of F. I'm looking for a way to accelerate this calculation by computing only the necessary part of the DFT.
I found that OpenCV built with Intel IPP computes the DFT using Intel IPP functions, which is an order of magnitude faster than the plain OpenCV implementation. I'm wondering if I can accelerate the computation even further by computing only a pre-specified frequency-domain region of the DFT.
Since I'm new to OpenCV, I've lost my way here, so I hope you can suggest a way to do this.
Kindly note that I don't mean applying the DFT to an ROI of the image, i.e. dft(ROI(f)); I want to compute ROI(dft(f)).
Thanks in advance.
Just a partial idea.
The DFT is separable. It is always computed by first applying the FFT algorithm to rows of the image, then to the columns of the result (or the other way around, the order doesn't matter).
If you want only an ROI of the output, in the second step you only need to process the columns that fall within the ROI.
I don't think you'll find a way to compute only a subset of frequencies along each 1D row/column. That would entail hacking together your own FFT, which would most likely end up slower than the heavily optimized implementations in IPP or FFTW.
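To make the separability idea concrete, here is a sketch in Python/NumPy (the sizes are taken from the question; this is not IPP-optimized): a full FFT pass over the rows, a second pass over only the ROI columns, and a final row crop. It reproduces ROI(dft(f)) exactly.

    import numpy as np

    def dft_roi(f, rows, cols):
        r0, r1 = rows
        c0, c1 = cols
        step1 = np.fft.fft(f, axis=1)                # FFT of every row
        step2 = np.fft.fft(step1[:, c0:c1], axis=0)  # FFT of ROI columns only
        return step2[r0:r1, :]                       # crop to ROI rows

    f = np.random.rand(512, 512)
    R = dft_roi(f, (100, 132), (200, 232))
    assert np.allclose(R, np.fft.fft2(f)[100:132, 200:232])

For a 512x512 input and a 32x32 ROI this reduces the second pass from 512 column FFTs to 32, though the first pass is still full size, as noted above.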
I am new to image processing, and I want to extract image features in order to do some classification. I am having problems understanding the pipeline.
As far as I understand, I have images and I run the SIFT algorithm on them. This gives me a set of descriptors for each image; the number of descriptors varies per image, but each has a fixed length of 128.
I then proceed to cluster them, since it is not possible to feed a varying number of features to a learning algorithm. For this, I stack up the descriptors of all images and run the k-means algorithm with the desired number of clusters. What I get are k cluster centres of length 128.
Here is where I am confused: I now have these new descriptors, but what do I do with them? I don't understand how I can plug them into a classifier if they represent all images. Shouldn't each image have its own separate feature vector to be fed into the classifier?
I am sure I have not understood the concept, but can anybody please clarify what happens after I get a k x 128 sized matrix? What is fed into, for example, an SVM classifier, and how? How does this k-means result suffice to train a classifier?
Thanks!
EDIT: I might have confused keypoints and descriptors; sorry, I'm new to image processing!
You should look into the image classification/image retrieval approach known as 'bag of visual words'; it is extremely relevant here. A bag of visual words is a fixed-length feature vector v which summarises the occurrences of the features in an image. It makes use of what is called a codebook (also called a dictionary, from its historical use in text retrieval), which in your case is built from your k-means clustering. To make v for a given image, the simplest approach is to assign v[j] the proportion of SIFT descriptors that are closest to the jth cluster centroid. This means the length of v is K, so it is independent of the number of SIFT features that are detected in the image.
Concretely, suppose you've done k-means clustering with K = 100. Let's use ci to denote the ith cluster centre; for SIFT, this is a vector of size 128. Now, for a given input image, you create the vector v, which is of size 100 and initialized with zeros. You then extract features from the image and their corresponding descriptors. Let's say there are N descriptors, called d0, d1, ..., d(N-1), where dj is the jth descriptor. For each dj you compute the vector distance between it and each of the cluster centres c0, c1, ..., c99. You then take the cluster index k with the lowest distance to dj and increment: v[k] += 1. Note that this process parallelises very well, particularly on GPUs. It can also be sped up by replacing the exact search with Approximate Nearest Neighbours, using e.g. the FLANN library.
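A minimal sketch of that counting step in Python/NumPy (vectorized rather than an explicit loop over descriptors; the final normalization to proportions matches the description above):

    import numpy as np

    def bow_vector(descriptors, centroids):
        # descriptors: (N, 128) SIFT descriptors of one image
        # centroids:   (K, 128) cluster centres from k-means
        d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)   # index of closest centroid per descriptor
        v = np.bincount(nearest, minlength=len(centroids)).astype(float)
        return v / max(v.sum(), 1)    # proportions, so v is independent of N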
PREMISE:
I'm really new to Computer Vision/Image Processing and Machine Learning (luckily, I'm more of an expert in Information Retrieval), so please be kind to this filthy peasant! :D
MY APPLICATION:
We have a mobile application where the user takes a photo (the query) and the system returns the most similar picture previously taken by some other user (the dataset element). Time performance is crucial, followed by precision and finally by memory usage.
MY APPROACH:
First of all, it's quite obvious that this is a 1-Nearest Neighbor (1-NN) problem. LSH is a popular, fast, and relatively precise solution for this problem. In particular, my LSH implementation uses Kernelized Locality-Sensitive Hashing to translate a d-dimensional vector into an s-dimensional binary vector (where s << d) with good precision, and then Fast Exact Search in Hamming Space with Multi-Index Hashing to quickly find the exact nearest neighbor among all the vectors in the dataset (transformed into Hamming space).
In addition, I'm going to use SIFT, since I want a robust keypoint detector and descriptor for my application.
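For reference, a brute-force baseline of exact nearest-neighbour search in Hamming space takes only a few lines of NumPy (this is the naive scan that Multi-Index Hashing is designed to beat; the binary codes here are hypothetical):

    import numpy as np

    def hamming_nn(query_bits, db_bits):
        # query_bits: (s,) 0/1 uint8 vector; db_bits: (n, s) binary codes
        q = np.packbits(query_bits)            # pack bits into bytes
        db = np.packbits(db_bits, axis=1)
        # Hamming distance = popcount of the XOR of the packed codes
        dists = np.unpackbits(db ^ q, axis=1).sum(axis=1)
        return int(dists.argmin())             # index of the exact 1-NN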
WHAT IS MISSING FROM THIS PROCESS?
Well, it seems that I have already decided everything, right? Actually NO: in my linked question I face the problem of how to represent the set of descriptor vectors of a single image as a single vector. Why do I need this? Because a query/dataset element in LSH is a vector, not a matrix (while a SIFT keypoint descriptor set is a matrix). As someone suggested in the comments, the commonest (and most efficient) solution is the Bag of Features (BoF) model, which I'm still not confident with yet.
So, I read this article, but I still have some questions (see QUESTIONS below)!
QUESTIONS:
1. First and most important question: do you think that this is a reasonable approach?
2. Is the k-means used in the BoF algorithm the best clustering choice for such an application? What are the alternative clustering algorithms?
3. Is the dimension of the codeword vector obtained by BoF equal to the number of clusters (i.e. the k parameter of the k-means approach)?
4. If 3. is correct, does a bigger k give a more precise BoF vector?
5. Is there any "dynamic" k-means? Since the query image must be added to the dataset after the computation is done (remember: the dataset is formed from the images of all submitted queries), the clusters can change over time.
6. Given a query image, is the process for obtaining the codebook vector the same as for a dataset image, i.e. do we assign each descriptor to a cluster so that the i-th dimension of the resulting vector equals the number of descriptors assigned to the i-th cluster?
1. It looks like you are building a codebook from a set of keypoint features generated by SIFT.
2. You can try a "mixture of Gaussians" model (a minimal sketch follows this list). K-means assumes that each dimension of a keypoint is independent, while a mixture of Gaussians can model the correlation between the dimensions of the keypoint feature.
3. I can't answer this question. But I remember that a SIFT keypoint descriptor, by default, has 128 dimensions. You probably want a smaller number of clusters, e.g. 50 clusters.
4. N/A
5. You can try the Infinite Gaussian Mixture Model, or look at this paper: "Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis and Michael Jordan!
6. Not sure if I understand this question.
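Regarding point 2, a minimal sketch of fitting a mixture of Gaussians with scikit-learn (the descriptor array here is a random stand-in for real stacked SIFT descriptors):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    descriptors = np.random.rand(5000, 128)  # stand-in for stacked SIFT descriptors

    # Full covariances let each component model correlations between
    # dimensions, which k-means (independent, spherical clusters) cannot do.
    gmm = GaussianMixture(n_components=50, covariance_type="full").fit(descriptors)
    soft = gmm.predict_proba(descriptors[:10])  # soft cluster memberships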
Hope this helps!
My method for classifying facial expressions using k-means is:
Use OpenCV to detect the face in the image.
Use ASM and STASM to get the facial feature points.
Calculate the distances between facial features (as shown in the picture); there will be 5 distances.
Calculate the centroid of each distance for each facial expression (e.g. for the distance D1 there are 7 centroids, one per expression: 'happy', 'angry', ...).
Run 5 k-means, one per distance; each k-means outputs the expression whose centroid (calculated in the previous step) is closest to the measured distance.
The final expression is the one that appears most often among the 5 k-means results.
However, using that method my results are wrong.
Is my method correct, or is it wrong somewhere?
K-means is not a classification algorithm. Once run, it simply finds the centroids of K groups, so it splits the data into K parts, but in most cases this split won't have anything to do with the desired classes. This algorithm (like all clustering methods) should be used when you want to explore data and find distinguishable groups of objects, distinguishable in any sense. If your task is to build a system which recognizes some given classes, then it is a classification problem, not a clustering one. One of the simplest methods, easy to both implement and understand, is KNN (K-nearest neighbours), which roughly does what you are trying to accomplish: it checks to which class's objects a new sample is closest.
To better see the difference, let us consider your case: you are trying to detect emotional state based on face features. Running k-means on such data can split your face photos into many kinds of groups:
if you use photos of different people, it can cluster photos of particular people together (as their feature distances differ from everyone else's);
it can split the data into, for example, men and women, as there are gender-specific differences in such features;
it can even split your data based on the distance from the camera, as perspective changes your features, creating "clusters";
etc.
As you can see, there are dozens of possible "reasonable" splits (and even more completely uninterpretable ones), and k-means (like any other clustering algorithm) will simply find one of them, in most cases an uninterpretable one. Classification methods are used to overcome this issue: they let you "explain" to the algorithm what you are expecting.
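To make the contrast concrete, here is a minimal KNN sketch with scikit-learn (the 5-distance feature vectors and labels below are made up purely for illustration):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Each row: the 5 facial distances (D1..D5) of one labelled training photo.
    X_train = np.array([[0.31, 0.42, 0.18, 0.55, 0.27],
                        [0.29, 0.40, 0.20, 0.53, 0.25],
                        [0.45, 0.33, 0.25, 0.48, 0.31]])
    y_train = ["happy", "happy", "angry"]

    clf = KNeighborsClassifier(n_neighbors=1)  # 1-NN: nearest labelled example wins
    clf.fit(X_train, y_train)
    print(clf.predict([[0.30, 0.41, 0.19, 0.54, 0.26]]))  # -> ['happy']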
I am working on developing an object classifier using 3 different features, i.e. SIFT, HISTOGRAM, and EDGE.
However, these 3 features are vectors of different dimensionality, e.g. SIFT = 128 dimensions, HIST = 256.
Now, these features cannot be concatenated into one vector due to their different sizes. What I am planning to do, though I am not sure whether it is the correct way, is this:
for each feature I train a separate classifier, then I apply classification separately with the 3 different features, count the votes, and finally label the image with the majority vote.
Do you think this is a correct way?
There are several ways to get classification results that take multiple features into account. What you have suggested is one possibility, where instead of combining features you train multiple classifiers and, through some protocol, arrive at a consensus between them. This typically falls under the field of ensemble methods. Try googling boosting and random forests for more details on how to combine the classifiers.
However, it is not true that your feature vectors cannot be concatenated because they have different dimensions. You can still concatenate the features together into one long vector; e.g. joining your SIFT and HIST features will give you a vector of 384 dimensions. Depending on the classifier you use, you will likely have to normalize the entries of the vector so that no single feature dominates simply because, by construction, it has larger values.
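A sketch of that concatenation, with each feature block scaled to unit L2 norm first so that neither block dominates by magnitude alone (the vectors here are random placeholders for real per-image features):

    import numpy as np

    def unit(v):
        # Scale a vector to unit L2 norm (leave zero vectors untouched).
        n = np.linalg.norm(v)
        return v / n if n > 0 else v

    sift_bow = np.random.rand(128)    # placeholder per-image SIFT summary
    hist = np.random.rand(256) * 255  # placeholder histogram, larger raw scale

    combined = np.concatenate([unit(sift_bow), unit(hist)])  # shape (384,)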
EDIT in response to your comment:
It appears that your histogram is some feature vector describing a characteristic of the entire object (e.g. color) whereas your SIFT descriptors are extracted at local interest keypoints of that object. Since the number of SIFT descriptors may vary from image to image, you cannot pass them directly to a typical classifier as they often take in one feature vector per sample you wish to classify. In such cases, you will have to build a codebook (also called visual dictionary) using the SIFT descriptors you have extracted from many images. You will then use this codebook to help you derive a SINGLE feature vector from the many SIFT descriptors you extract from each image. This is what is known as a "bag of visual words (BOW)" model. Now that you have a single vector that "summarizes" the SIFT descriptors, you can concatenate that with your histogram to form a bigger vector. This single vector now summarizes the ENTIRE image/(object in the image).
For details on how to build the bag-of-words codebook and how to use it to derive a single feature vector from the many SIFT descriptors extracted from each image, look at this book (free to download from the author's website, http://programmingcomputervision.com/) under the chapter "Searching Images". It is actually a lot simpler than it sounds.
Roughly, you just run k-means to cluster the SIFT descriptors from many images and take the centroids (each of which is a vector called a "visual word") as the codebook. E.g. for K = 1000 you have a codebook of 1000 visual words. Then, for each image, create a result vector of size K (in this case 1000), where each element corresponds to a visual word. For each SIFT descriptor extracted from the image, find its closest matching vector in the codebook and increment the count in the corresponding cell of the result vector. When you are done, this result vector counts how often the different visual words appear in the image. Similar images will have similar counts for the same visual words, so this vector effectively represents your images. You will also need to normalize this vector to make sure that images with different numbers of SIFT descriptors (and hence different total counts) are comparable. This can be as simple as dividing each entry by the total count in the vector, or a more sophisticated measure such as tf-idf, as described in the book.
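A sketch of that normalization step, using a simple tf-idf variant of the kind the book describes (this is one common weighting; details vary between implementations):

    import numpy as np

    def tfidf(counts):
        # counts: (n_images, K) visual-word counts, one row per image
        tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
        df = (counts > 0).sum(axis=0)                  # images containing each word
        idf = np.log(len(counts) / np.maximum(df, 1))
        return tf * idf                                # down-weights ubiquitous words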
I believe the author also provides Python code on his website to accompany the book. Take a look at it or experiment with it if you are unsure.
A more sophisticated method for combining features is Multiple Kernel Learning (MKL). In this case, you compute a different kernel matrix for each feature. You then find the optimal weights to combine the kernel matrices and use the combined kernel matrix to train an SVM. You can find code for this in the Shogun Machine Learning Library.
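I can't vouch for Shogun's exact API here, but the combined-kernel idea itself is easy to sketch with scikit-learn's precomputed-kernel SVM. Note the weights below are fixed rather than learned, which is exactly the part real MKL adds; the feature matrices are random placeholders.

    import numpy as np
    from sklearn.svm import SVC
    from sklearn.metrics.pairwise import rbf_kernel

    X_sift = np.random.rand(40, 128)  # placeholder BoW-summarized SIFT vectors
    X_hist = np.random.rand(40, 256)  # placeholder histograms
    y = np.random.randint(0, 2, 40)   # placeholder class labels

    # One kernel matrix per feature, combined with fixed (not learned) weights.
    K = 0.5 * rbf_kernel(X_sift) + 0.5 * rbf_kernel(X_hist)
    clf = SVC(kernel="precomputed").fit(K, y)
    # At test time, compute each kernel between test and training samples
    # and combine with the same weights before calling clf.predict.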