How to accelerate DFT by specifying a region of interest in the frequency domain - OpenCV

Note: This question was originally asked on the OpenCV forum a couple of days ago.
I'm building an image processing program which makes extensive use of the 2-dimensional DFT (discrete Fourier transform). I'm trying to speed it up so that it runs in real time.
In that application I only use a part of the DFT output, specified by a rectangular ROI. My current implementation follows the steps below:
Compute the DFT of the input image f (typically 512x512) and get the entire DFT result F
Crop F to a pre-specified region of interest (ROI, typically 32x32, located arbitrarily), giving R
This process basically works well, but involves wasted computation since I only need part of F. I'm looking for a way to accelerate this calculation by computing only the necessary part of the DFT.
I found that OpenCV built with Intel IPP computes the DFT with Intel IPP functions, which is an order of magnitude faster than the plain OpenCV implementation. I'm wondering whether I can accelerate this even further by computing only a pre-specified region of the frequency domain.
Since I'm new to OpenCV, I'm a bit lost here, so I hope you can suggest a way to do this.
Kindly note that I don't mean applying a DFT to an ROI of the image, i.e. dft(ROI(f)); I want to compute ROI(dft(f)).
Thanks in advance.

Just a partial idea.
The DFT is separable. It is always computed by first applying the FFT algorithm to rows of the image, then to the columns of the result (or the other way around, the order doesn't matter).
If you want only an ROI of the output, in the second step you only need to process the columns that fall within the ROI.
I don't think you'll find a way to compute only a subset of frequencies along each 1D row/column transform. That would entail hacking your own FFT, which will likely end up more computationally expensive than using the one in IPP or FFTW.
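To make the idea concrete, here is a minimal NumPy sketch (the 512x512 size and the ROI bounds are just example values):

import numpy as np

def dft_roi(f, row_range, col_range):
    # Compute only an ROI of the 2D DFT by exploiting separability.
    # row_range / col_range are slices of the output frequencies to keep.
    rows_fft = np.fft.fft(f, axis=1)            # step 1: full FFT of every row
    cols_needed = rows_fft[:, col_range]        # step 2: only the columns inside the ROI...
    cols_fft = np.fft.fft(cols_needed, axis=0)  # ...get a column FFT
    return cols_fft[row_range, :]               # keep only the ROI rows of the result

f = np.random.rand(512, 512)
roi = dft_roi(f, slice(100, 132), slice(200, 232))
assert np.allclose(roi, np.fft.fft2(f)[100:132, 200:232])  # matches the full transform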

Related

What is the role of cv2.face.LBPHFaceRecognizer_create()?

I know that cv2.face.LBPHFaceRecognizer_create() is used to recognize faces in real time, but I want to know how it works and what happens inside this call.
I want to know its structure: for example, does it take the image and extract characteristic features in LBPH form, then use some trainer to train on the images, and finally compare the images so it can recognize them?
Please share any information or documentation that can help me.
LBP (Local Binary Patterns) is one way to extract characteristic features of an object (it could be a face, a coffee cup, or anything that has a representation). The LBP algorithm is really straightforward and can even be done manually (pixel thresholding plus pixel-level arithmetic operations).
LBP Algorithm:
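Here is a minimal sketch of the basic 3x3 LBP operator (my own illustration of the thresholding and pixel arithmetic, not OpenCV's internal code):

import numpy as np

def lbp_image(gray):
    # Threshold the 8 neighbours of every pixel against the centre pixel
    # and pack the results into an 8-bit code.
    g = gray.astype(np.int32)
    c = g[1:-1, 1:-1]                                   # centre pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),      # neighbours, clockwise
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(c)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        code |= (neighbour >= c).astype(np.int32) << bit
    return code.astype(np.uint8)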
There is a "training" part in OpenCV's FaceRecognizer methods. Don't let this confuse you: there is no deep learning approach here, just simple math.
OpenCV converts LBP images into histograms in order to store spatial information about faces, using the representation proposed by Timo Ahonen, Abdenour Hadid, and Matti Pietikäinen, "Face recognition with local binary patterns", Computer Vision - ECCV 2004, pages 469-481, Springer, 2004. The LBP image is divided into local regions of size m, a histogram is extracted for each region, and the histograms are concatenated.
Once you have this information about one person's (one label's) face, the rest is simple. At inference time, it computes the test face's LBP, divides it into regions, and creates the concatenated histogram. It then compares the Euclidean distance between this histogram and the trained faces' histograms; if the distance is below the tolerance value, it counts as a match. (Other distance measures can also be used: chi-squared distance, absolute difference, etc.)
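Continuing the sketch above, a rough illustration of the region histograms and a chi-squared comparison (the 8x8 grid and the tolerance are hypothetical values):

def lbph_descriptor(lbp, grid=(8, 8)):
    # Split the LBP image into grid regions, histogram each region, concatenate.
    h, w = lbp.shape
    hists = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            region = lbp[i * h // grid[0]:(i + 1) * h // grid[0],
                         j * w // grid[1]:(j + 1) * w // grid[1]]
            hist, _ = np.histogram(region, bins=256, range=(0, 256))
            hists.append(hist / max(region.size, 1))
    return np.concatenate(hists)

def chi_squared(a, b, eps=1e-10):
    return np.sum((a - b) ** 2 / (a + b + eps))

# a test face matches a trained label if
# chi_squared(lbph_descriptor(test_lbp), trained_descriptor) < tolerance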

Time Series DFT Signals Clustering

I have a number of time series data sets which I want to transform with the DFT in order to reduce dimensionality. After transforming, I want to cluster the resulting DFT data sets using the k-means algorithm.
Since DFT coefficients are complex numbers, how can one cluster them?
You could simply treat the imaginary part as another component of your vectors. In other applications, you may want to ignore it entirely!
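A minimal sketch of that idea, assuming equal-length series and scikit-learn's KMeans (neither of which is mentioned in the question):

import numpy as np
from sklearn.cluster import KMeans

series = np.random.rand(100, 256)       # hypothetical: 100 series of 256 samples each
k_coeffs = 16                           # keep only the first DFT coefficients
F = np.fft.rfft(series, axis=1)[:, :k_coeffs]
features = np.hstack([F.real, F.imag])  # real and imaginary parts as separate components
labels = KMeans(n_clusters=5, n_init=10).fit_predict(features)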
But you'll be facing other, more severe challenges.
Data mining, and clustering in particular, is rarely as easy as applying function a (DFT) and then function b (k-means) and - hooray - you have the result. Sorry, that is not how exploratory data mining works.
First of all, for many time series, DFT will not be helpful at all. On others, you will first have to do appropriate resampling, or segmentation, or get rid of uninteresting effects such as seasonality. Even if DFT works, it may emphasize artifacts such as the sampling frequency or some interferences.
And then you'll run into one major problem: k-means is based on the assumption that all attributes have the same importance. And DFT is based on the very opposite idea: the first components capture most of the signal, the later ones only minor deviations from it (and that is the very motivation for using this as dimensionality reduction).
So based on this intuition, perhaps you should never apply k-means to DFT coefficients at all. At the same time, data mining has repeatedly shown that approaches which are "statistical nonsense" can nevertheless provide useful results... so you can try it, but verify your results with care, and avoid being too enthusiastic or optimistic.
With the help of the FFT, each data set is converted into its DFT coefficients; the FFT computes the DFT efficiently for each small data set.

Bag of Features / Visual Words + Locality Sensitive Hashing

PREMISE:
I'm really new to Computer Vision/Image Processing and Machine Learning (luckily, I'm more of an expert in Information Retrieval), so please be kind to this filthy peasant! :D
MY APPLICATION:
We have a mobile application where the user takes a photo (the query) and the system returns the most similar picture that was previously taken by some other user (the dataset element). Time performance is crucial, followed by precision and finally by memory usage.
MY APPROACH:
First of all, it's quite obvious that this is a 1-Nearest Neighbor (1-NN) problem. LSH is a popular, fast and relatively precise solution for this problem. In particular, my LSH implementation uses Kernelized Locality Sensitive Hashing to translate a d-dimensional vector into an s-dimensional binary vector (where s<<d) with good precision, and then uses Fast Exact Search in Hamming Space with Multi-Index Hashing to quickly find the exact nearest neighbor among all the vectors in the dataset (mapped into Hamming space).
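Just to make the Hamming-space part concrete, here is a toy sketch of exact nearest-neighbour search on binary codes (brute force only; multi-index hashing is what makes this fast at scale, and the 64-bit codes are purely hypothetical):

import numpy as np

rng = np.random.default_rng(0)
codes = rng.integers(0, 2**63, size=100_000, dtype=np.uint64)   # dataset binary codes
query = rng.integers(0, 2**63, dtype=np.uint64)                 # query binary code

xor = np.bitwise_xor(codes, query)                                           # differing bits
dist = np.unpackbits(xor.view(np.uint8).reshape(-1, 8), axis=1).sum(axis=1)  # popcount
nearest = int(np.argmin(dist))                                   # exact 1-NN in Hamming space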
In addition, I'm going to use SIFT, since I want a robust keypoint detector and descriptor for my application.
WHAT IS MISSING IN THIS PROCESS?
Well, it seems that I have already decided everything, right? Actually no: in my linked question I face the problem of how to represent the set of descriptor vectors of a single image as one vector. Why do I need this? Because a query/dataset element in LSH is a vector, not a matrix (while a SIFT keypoint descriptor set is a matrix). As someone suggested in the comments, the most common (and most efficient) solution is the Bag of Features (BoF) model, which I'm still not confident with yet.
So I read this article, but I still have some questions (see QUESTIONS below)!
QUESTIONS:
First and most important question: do you think that this is a reasonable approach?
Is k-means used in the BoF algorithm the best choice for such an application? What are alternative clustering algorithms?
Is the dimension of the codeword vector obtained by BoF equal to the number of clusters (i.e. the k parameter in the k-means approach)?
If the previous point is correct, does a bigger k give a more precise BoF vector?
Is there any "dynamic" k-means? Since the query image must be added to the dataset after the computation is done (remember: the dataset is formed by the images of all submitted queries), the clusters can change over time.
Given a query image, is the process to obtain the codebook vector the same as the one for a dataset image, i.e. we assign each descriptor to a cluster and the i-th dimension of the resulting vector is equal to the number of descriptors assigned to the i-th cluster?
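For concreteness, a minimal sketch of that assignment step (the descriptor shapes are hypothetical, and scikit-learn's k-means is just an assumption for illustration):

import numpy as np
from sklearn.cluster import KMeans

k = 200                                                            # codebook size (visual words)
all_descriptors = np.random.rand(10000, 128).astype(np.float32)   # stacked SIFT descriptors
codebook = KMeans(n_clusters=k, n_init=4).fit(all_descriptors)

def bof_vector(descriptors):
    # dimension i counts the descriptors assigned to cluster i
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=k).astype(np.float32)
    return hist / max(hist.sum(), 1.0)       # normalise so image size doesn't matter

query_descriptors = np.random.rand(350, 128).astype(np.float32)   # SIFT output of one photo
query_bof = bof_vector(query_descriptors)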
It looks like you are building a codebook from a set of keypoint features generated by SIFT.
You can try a "mixture of Gaussians" model. K-means assumes that each dimension of a keypoint is independent, while a mixture of Gaussians can model the correlation between the dimensions of the keypoint feature.
I can't answer this question. But I remember that the SIFT keypoint, by default, has 128 dimensions. You probably want a smaller number of clusters like 50 clusters.
N/A
You can try Infinite Gaussian Mixture Model or look at this paper: "Revisiting k-means: New Algorithms via Bayesian Nonparametrics" by Brian Kulis and Michael Jordan!
Not sure if I understand this question.
Hope this helps!
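As a footnote to the "mixture of Gaussians" suggestion above, a minimal sketch using scikit-learn (an assumption, reusing the arrays from the earlier BoF sketch):

from sklearn.mixture import GaussianMixture

# Full covariances model the correlation between descriptor dimensions;
# this is slower than k-means but captures the structure mentioned above.
gmm = GaussianMixture(n_components=50, covariance_type='full').fit(all_descriptors)
words = gmm.predict(query_descriptors)   # hard assignment, analogous to k-means labels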

Can you still extract features from a digital signal without converting it to analog using MFCC?

I am developing back-end speech recognition software in which the user can import mp3 files. How can I extract features from this digital audio file? Should I convert it back to analog first?
Your question is unclear, since you are using the terms analog and digital incorrectly. Analog is a real-world, continuous function, i.e. voltage, pressure, etc. Digital is a discrete (sampled) and quantized version of the analog signal. You must calculate the FFT of your audio frames when calculating the MFCCs. You can extract MFCCs only from the digital signal - it's rather impossible to do it with the analog one.
If you are asking whether it is possible to extract MFCCs from an mp3 file, then yes - it is possible. All you need is to perform the standard algorithm and you can get your features - the details are obviously outside the scope of this question.
Calculate the FFT for frames of data.
Calculate the power spectral density by squaring the magnitude of each FFT bin.
Apply the mel-filterbank and sum the energy across banks.
Calculate the logarithm of each of the energies.
Calculate the DCT of the logarithms of energies.
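If a library is acceptable, the whole chain above is available off the shelf. A minimal sketch assuming librosa (not mentioned in the original answers; the file name is hypothetical):

import librosa

# librosa decodes the mp3 (via its audioread/ffmpeg backend) into a sampled,
# i.e. digital, signal and computes MFCCs from it directly.
y, sr = librosa.load("speech.mp3", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)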
You're confusing things here; as @jojek said, you can do all of that with the digital signal. This here is a pretty spot-on tutorial:
http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/
This one is more practical:
http://www.speech.cs.cmu.edu/15-492/slides/03_mfcc.pdf
From Wikipedia: [http://en.wikipedia.org/wiki/Mel-frequency_cepstrum]
MFCCs are commonly derived as follows:[1][2]
Take the Fourier transform of (a windowed excerpt of) a signal. (This means a short-time Fourier transform.)
Map the powers of the spectrum obtained above onto the mel scale, using triangular overlapping windows. (Calculation described in the links above)
Take the logs of the powers at each of the mel frequencies.
Take the discrete cosine transform of the list of mel log powers, as if it were a signal.
The MFCCs are the amplitudes of the resulting spectrum.
and here's a Matlab toolbox to help you understand it better:
http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html

GPU based SIFT feature extractor for iOS?

I've been playing with the excellent GPUImage library, which implements several feature detectors: Harris, FAST, Shi-Tomasi, Noble. However, none of those implementations helps with the feature extraction and matching part. They simply output a set of detected corner points.
My understanding (which is shaky) is that the next step would be to examine each of those detected corner points and extract the feature from them, which would result in a descriptor - i.e., a 32- or 64-bit number that could be used to index the point near other, similar points.
From reading Chapter 4.1 of [Computer Vision Algorithms and Applications, Szeliski], I understand that using a best-bin-first approach would help to efficiently find neighbouring features to match, etc. However, I don't actually know how to do this and I'm looking for some example code that does it.
I've found this project [https://github.com/Moodstocks/sift-gpu-iphone] which claims to implement as much as possible of the feature extraction in the GPU. I've also seen some discussion that indicates it might generate buggy descriptors.
And in any case, that code doesn't go on to show how the extracted features would be best matched against another image.
My use case is trying to find objects in an image.
Does anyone have any code that does this, or at least a good implementation that shows how the extracted features are matched? I'm hoping not to have to rewrite the whole set of algorithms.
thanks,
Rob.
First, you need to be careful with SIFT implementations, because the SIFT algorithm is patented and the owners of those patents require license fees for its use. I've intentionally avoided using that algorithm for anything as a result.
Finding good feature detection and extraction methods that also work well on a GPU is a little tricky. The Harris, Shi-Tomasi, and Noble corner detectors in GPUImage are all derivatives of the same base operation, and probably aren't the fastest way to identify features.
As you can tell, my FAST corner detector isn't operational yet. The idea there is to use a lookup texture based on a local binary pattern (which is why I built that filter first, to test the concept), and to have that return whether it's a corner point or not. That should be much faster than the Harris, etc. corner detectors. I also need to finish my histogram pyramid point extractor so that feature extraction isn't done in an extremely slow loop on the GPU.
The use of a lookup texture for a FAST corner detector is inspired by this paper by Jaco Cronje on a technique they refer to as BFROST. In addition to using the quick, texture-based lookup for feature detection, the paper proposes using the binary pattern as a quick descriptor for the feature. There's a little more to it than that, but in general that's what they propose.
Feature matching is done by Hamming distance, but while there are quick CPU-side and CUDA instructions for calculating that, OpenGL ES doesn't have one. A different approach might be required there. Similarly, I don't have a good solution for finding a best match between groups of features beyond something CPU-side, but I haven't thought that far yet.
It is a primary goal of mine to have this in the framework (it's one of the reasons I built it), but I haven't had the time to work on this lately. The above are at least my thoughts on how I would approach this, but I warn you that this will not be easy to implement.
For object recognition these days (as of a couple of weeks ago), it is best to use TensorFlow / convolutional neural networks for this.
Apple has some metal sample code recently added. https://developer.apple.com/library/content/samplecode/MetalImageRecognition/Introduction/Intro.html#//apple_ref/doc/uid/TP40017385
For feature detection within an image, I draw your attention to the KAZE/AKAZE algorithm, which works out of the box with OpenCV.
http://www.robesafe.com/personal/pablo.alcantarilla/kaze.html
For iOS, I glued the AKAZE class together with another stitching sample to illustrate:
cv::Mat mat = cv::imread("input.jpg", cv::IMREAD_GRAYSCALE); // hypothetical input image
std::vector<cv::KeyPoint> keypoints;
cv::Ptr<cv::AKAZE> detector = cv::AKAZE::create();
detector->detect(mat, keypoints); // this will find the keypoints
cv::drawKeypoints(mat, keypoints, mat); // draw them for inspection
// one detected keypoint (debugger dump), playing the role of a SIFT-style keypoint:
// [255] = {
//   pt = (x = 645.707153, y = 56.4605064)
//   size = 4.80000019
//   angle = 0
//   response = 0.00223364262
//   octave = 0
//   class_id = 0 }
https://github.com/johndpope/OpenCVSwiftStitch
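To go from keypoints to actual matches between two images, here is a minimal sketch using OpenCV's Python bindings and brute-force Hamming matching (the image paths are hypothetical):

import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

akaze = cv2.AKAZE_create()
kp1, desc1 = akaze.detectAndCompute(img1, None)   # keypoints + binary descriptors
kp2, desc2 = akaze.detectAndCompute(img2, None)

# AKAZE descriptors are binary, so match them with Hamming distance
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(desc1, desc2), key=lambda m: m.distance)
print(len(matches), "matches")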
Here is a GPU accelerated SIFT feature extractor:
https://github.com/lukevanin/SIFTMetal
The code is written in Swift 5 and uses Metal compute shaders for most operations (scaling, Gaussian blur, keypoint detection and interpolation, feature extraction). The implementation is largely based on the paper and code from the article "Anatomy of the SIFT Method" published in the Image Processing On Line journal (IPOL) in 2014 (http://www.ipol.im/pub/art/2014/82/). Some parts are based on code by Rob Hess (https://github.com/robwhess/opensift), which I believe is now used in OpenCV.
For feature matching I tried using a kd-tree with the best-bin-first (BBF) method proposed by David Lowe. While BBF does provide some benefit up to about 10 dimensions, with a higher number of dimensions such as that used by SIFT it is no better than quadratic search, due to the "curse of dimensionality". That is to say, if you compare 1,000 descriptors against 1,000 other descriptors, it still ends up making 1,000 x 1,000 = 1,000,000 comparisons - the same as a brute-force pairwise comparison.
In the linked code I use a different approach optimised for performance over accuracy. I use a trie to locate the general vicinity for potential neighbours, then search a fixed number of sibling leaf nodes for the nearest neighbours. In practice this matches about 50% of the descriptors, but only makes 1000 * 20 = 20,000 comparisons - about 50x faster and scales linearly instead of quadratically.
I am still testing and refining the code. Hopefully it helps someone.
