I'm a software engineer working on DSP for the first time.
I'm successfully using an FFT library that produces frequency spectra. I also understand how the FFT works in terms of its inputs and outputs, in particular the contents of the two output arrays (the real and imaginary parts).
Now, my problem is that I'm reading some new research reports that suggest that I extract: "the energy, variance, and sum of FFT coefficients".
What are the 'FFT coefficients'? Are those the values of the Real and Imaginary arrays shown above, which (from my understanding) correspond to the amplitudes of the constituent cosine and sine waves?
What is the 'energy' of the FFT coefficients? Is that terminology from statistics or from DSP?
You are correct. FFT coefficients are the signal values in the frequency domain.
"Energy" is the square modulus of the coefficients. The total energy (sum of square modulus of all values) in either time or frequency domain is the same (see Parseval's Theorem).
The real and imaginary arrays, when put together, can represent a complex array. Every element of that complex array in the frequency domain can be considered a frequency coefficient, and has a magnitude sqrt(R*R + I*I). Parseval's theorem says that the sum of the squared magnitudes of all the frequency-domain values is equal to the energy of the time-domain signal (possibly up to a scaling factor involving the FFT length, depending on your particular DFT/FFT library implementation).
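A quick NumPy sanity check of that relationship (the 1/N factor below matches numpy.fft's unnormalized forward transform; other libraries may scale differently):

import numpy as np

# Example signal: any real-valued time-domain data will do.
N = 1024
t = np.arange(N)
x = np.sin(2 * np.pi * 5 * t / N) + 0.5 * np.random.randn(N)

X = np.fft.fft(x)                         # complex FFT coefficients (R + jI)

energy_time = np.sum(x ** 2)              # energy in the time domain
energy_freq = np.sum(np.abs(X) ** 2) / N  # sum of squared magnitudes, scaled by 1/N

print(energy_time, energy_freq)           # the two values agree (Parseval's theorem)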
One example of a time-domain signal is the voltage on a wire: volts times amps (or volts squared divided by ohms) gives power, and power integrated over time gives energy. The word "energy" in the strictly numerical case is probably derived from historical usage in physics and engineering, where the numbers measured something that could burn your fingers.
Related
I'm currently working with Fourier transforms and I notice that the output of an FFT usually has the dimensions (n_fft, ) where n_fft is the number of samples to consider in the FFT, though some implementations discard frequency bins above the Nyquist frequency. What I can't wrap my head around is why the frequency resolution is dependent on the number of samples considered in the FFT.
Could someone please explain the intuition behind this?
The DFT is a change of basis from the time domain to the frequency domain. As Cris put it, "If you have N input samples, you need at least N numbers to fully describe the input." Those N output bins are spread evenly across the full sampling bandwidth, so their spacing (the frequency resolution) is the sample rate divided by N, i.e. the reciprocal of the total duration of the input: the longer you observe, the finer the resolution.
some implementations discard frequency bins above the Nyquist frequency.
Yes, that's right. When the input signal is real-valued, its spectrum is Hermitian symmetric: the components above the Nyquist frequency are the complex conjugates of those below it, so it is common practice to use real-to-complex FFTs that skip explicitly computing and storing them. In other words, if you have N real-valued samples, you only need about N/2 complex numbers to represent them (N/2 + 1 bins for even N, with the DC and Nyquist bins purely real).
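For example, with NumPy's real-to-complex transform (a small sketch; the sampling rate here is arbitrary):

import numpy as np

fs = 1000                    # sampling rate in Hz (arbitrary for this illustration)
N = 8                        # number of real-valued samples
x = np.random.randn(N)

X_full = np.fft.fft(x)       # N complex bins, Hermitian symmetric for real input
X_half = np.fft.rfft(x)      # only bins 0 .. N/2 are kept: N/2 + 1 = 5 values

print(len(X_full), len(X_half))                        # 8, 5
print(np.allclose(X_full[:N//2 + 1], X_half))          # True: rfft just drops the redundant half
print(np.allclose(X_full[N - 1], np.conj(X_full[1])))  # bin N-1 is the conjugate of bin 1

# Bin spacing (frequency resolution) is fs / N, so more samples -> finer resolution.
print(fs / N)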
I'm using the OpenCV implementation of ORB along with the BFMatcher. The OpenCV documentation states that NORM_HAMMING should be used with ORB.
Why is this? What advantages does NORM_HAMMING offer over other methods such as Euclidean distance (NORM_L2), NORM_L1, etc.?
When comparing descriptors in computer vision, the Euclidean distance is usually understood as the square root of the sum of the squared differences between the two vectors' elements.
The ORB descriptors are vectors of binary values. If you apply the Euclidean distance to binary vectors, the squared result of a single element comparison is always 1 or 0, which is not informative when it comes to estimating the difference between the elements. The overall Euclidean distance would be the square root of the sum of those ones and zeroes, again not a good estimator of the difference between the vectors.
That's why the Hamming distance is used. Here the distance is the number of elements that are not the same. As noted by Catree, you can calculate it with a simple boolean operation on the vectors, as illustrated by the example below, where D1 is a single 4-bit descriptor that we compare against the 4 descriptors in D2, and matrix H holds the Hamming distance for each row.
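A small NumPy sketch of that computation (the descriptor bits here are made-up 4-bit examples, just to show the XOR-and-count idea):

import numpy as np

D1 = np.array([0, 1, 1, 0], dtype=np.uint8)    # a single 4-bit descriptor
D2 = np.array([[0, 1, 1, 0],                   # four candidate descriptors
               [1, 1, 1, 0],
               [0, 0, 0, 1],
               [1, 0, 0, 1]], dtype=np.uint8)

# Hamming distance = number of differing bits = popcount(XOR)
H = np.sum(np.bitwise_xor(D1, D2), axis=1)
print(H)    # [0 1 3 4]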
ORB (ORB: an efficient alternative to SIFT or SURF) is a binary descriptor.
It should be more efficient (in terms of computation) to use the HAMMING distance rather than the L1/L2 distance, as the HAMMING distance can be implemented using an XOR followed by a bit count (see BRIEF: Binary Robust Independent Elementary Features):
"Furthermore, comparing strings can be done by computing the Hamming distance, which can be done extremely fast on modern CPUs that often provide a specific instruction to perform a XOR or bit count operation, as is the case in the latest SSE [10] instruction set."
Of course, with a classical descriptor like SIFT, you cannot use the HAMMING distance.
You can test yourself:
D1=01010110
D2=10011010
L2_dist(D1,D2)=sqrt(4)=2
XOR(D1,D2)=11001100 ; bit_count(11001100)=4
L1/L2 distance is used for float-valued descriptors (SIFT, SURF, etc.), and the Hamming distance is used for binary descriptors (AKAZE, ORB, BRIEF, etc.).
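In OpenCV's Python bindings, the typical pairing looks something like this (a minimal sketch; the image file names are placeholders):

import cv2

img1 = cv2.imread("scene1.png", cv2.IMREAD_GRAYSCALE)   # placeholder file names
img2 = cv2.imread("scene2.png", cv2.IMREAD_GRAYSCALE)

orb = cv2.ORB_create()
kp1, des1 = orb.detectAndCompute(img1, None)   # des1/des2 are binary descriptors (uint8 rows)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matcher with Hamming distance, as recommended for ORB
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
print(len(matches), matches[0].distance if matches else None)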
I'm reading the paper below and I have some trouble understanding the concept of negative sampling.
http://arxiv.org/pdf/1402.3722v1.pdf
Can anyone help, please?
The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have
v_c . v_w
-------------------
sum_i(v_ci . v_w)
The numerator is basically the similarity between the words c (the context) and w (the target word). The denominator computes the similarity of all other contexts ci and the target word w. Maximising this ratio ensures words that appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts ci. Negative sampling is one of the ways of addressing this problem: just select a couple of contexts ci at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measured by their dot product) than the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), instead of all other words in the language. This makes word2vec much, much faster to train.
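As a rough illustration of what gets computed per training pair, here is a toy NumPy sketch (the dimensionality, the number of negatives, and the random vectors are arbitrary stand-ins; it only shows the scoring, not a full training loop):

import numpy as np

rng = np.random.default_rng(0)
dim = 50

v_w    = rng.normal(size=dim)         # vector of the target word, e.g. "cat"
v_c    = rng.normal(size=dim)         # vector of an observed context word, e.g. "food"
v_negs = rng.normal(size=(5, dim))    # vectors of 5 randomly chosen words ("democracy", ...)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Negative-sampling objective for this one (target, context) pair:
# push sigma(v_c . v_w) towards 1 and sigma(v_neg . v_w) towards 0.
loss = -np.log(sigmoid(v_c @ v_w)) - np.sum(np.log(sigmoid(-(v_negs @ v_w))))
print(loss)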
Computing the softmax (the function that determines which words are similar to the current target word) is expensive, since it requires summing over all words in V (the denominator), and V is generally very large.
What can be done?
Different strategies have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact but modify its architecture to improve its efficiency (e.g. hierarchical softmax). Sampling-based approaches, on the other hand, completely do away with the softmax layer and instead optimise some other loss function that approximates the softmax (they approximate the normalization in the denominator of the softmax with some other loss that is cheap to compute, such as negative sampling).
The loss function in word2vec, for a single (target word w, context word c) pair, is something like:

J = -log p(c|w) = -log [ exp(v_c . v_w) / sum_{c_i in V} exp(v_ci . v_w) ]

which the logarithm decomposes into:

J = -(v_c . v_w) + log sum_{c_i in V} exp(v_ci . v_w)

With some math and the gradient formula (see the references below for the details), it is converted to:

J = -log sigma(v_c . v_w) - sum_{k=1..K} log sigma(-v_neg_k . v_w)

As you can see, it has been converted into a binary classification task (y=1: positive class, y=0: negative class). Since we need labels to perform this binary classification task, we designate all context words c as true labels (y=1, positive samples) and k words randomly selected from the corpus as false labels (y=0, negative samples).
Consider the following example. Assume our target word is "Word2vec". With a window of 3, our context words are: The, widely, popular, algorithm, was, developed. These context words are treated as positive labels. We also need some negative labels. We randomly pick some words from the corpus (produce, software, Collobert, margin-based, probabilistic) and treat them as negative samples. This technique of picking random examples from the corpus is called negative sampling.
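The random picking itself can be as simple as the following sketch (the word list is a made-up stand-in for a real corpus):

import random

corpus_words = ["The", "widely", "popular", "Word2vec", "algorithm", "was", "developed",
                "produce", "software", "Collobert", "margin-based", "probabilistic",
                "training", "corpus"]
target = "Word2vec"
context = {"The", "widely", "popular", "algorithm", "was", "developed"}

k = 5
candidates = [w for w in corpus_words if w != target and w not in context]
# (the real word2vec draws negatives from a smoothed unigram distribution, not uniformly)
negative_samples = random.sample(candidates, k)
print(negative_samples)   # e.g. ['produce', 'corpus', 'Collobert', 'software', 'probabilistic']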
References:
(1) C. Dyer, "Notes on Noise Contrastive Estimation and Negative Sampling", 2014
(2) http://sebastianruder.com/word-embeddings-softmax/
I wrote a tutorial article about negative sampling here.
Why do we use negative sampling? -> to reduce computational cost
The cost functions for vanilla Skip-Gram (SG) and Skip-Gram negative sampling (SGNS) look roughly like this:

SG:   J = -(1/T) * sum_{t=1..T} sum_{j in window, j != 0} log p(w_t+j | w_t)
SGNS: J = -log sigma(c_pos . w) - sum_{c_neg in W_neg} log sigma(-c_neg . w)

Note that T is the number of positions (center words) in the training corpus, while V below is the size of the vocabulary.
The probability distribution p(w_t+j | w_t) in SG is computed over all V words in the vocabulary with the softmax:

p(w_t+j | w_t) = exp(c_t+j . w_t) / sum_{i=1..V} exp(c_i . w_t)

where w_t is the vector of the center word and the c_i are the output (context) word vectors. V can easily exceed tens of thousands when training a Skip-Gram model. The probability needs to be computed V times, making it computationally expensive. Furthermore, the normalization factor in the denominator requires another V computations.
On the other hand, the probability distribution in SGNS is computed with a sigmoid per word pair:

p(D=1 | w, c) = sigma(c . w) = 1 / (1 + exp(-c . w))

c_pos is the word vector of the positive (actual context) word, and W_neg holds the word vectors of all K negative samples in the output weight matrix. With SGNS, the probability needs to be computed only K + 1 times, where K is typically between 5 and 20. Furthermore, no extra iterations are necessary to compute the normalization factor in the denominator.
With SGNS, only a fraction of the weights are updated for each training sample, whereas SG updates all of the output weights (potentially millions) for each training sample.
How does SGNS achieve this? -> by transforming a multi-class classification task into a binary classification task.
With SGNS, word vectors are no longer learned by predicting the context words of a center word. Instead, the model learns to differentiate the actual context words (positive) from randomly drawn words (negative) sampled from a noise distribution.
In real life, you don't usually observe the target word together with random words like Gangnam-Style or pimples. The idea is that if the model can distinguish between the likely (positive) pairs and the unlikely (negative) pairs, good word vectors will be learned.
As an example, suppose the current positive word-context pair is (drilling, engineer), and the K=5 negative samples randomly drawn from the noise distribution are: minimized, primary, concerns, led, page. As the model iterates through the training samples, the weights are optimized so that the probability for the positive pair approaches p(D=1|w,c_pos) ≈ 1, and the probability for the negative pairs approaches p(D=1|w,c_neg) ≈ 0.
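A minimal sketch of that optimization for this single training sample (toy random vectors stand in for the real embeddings; the learning rate, dimensionality, and iteration count are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
dim, lr = 50, 0.05
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w     = rng.normal(scale=0.1, size=dim)        # center word vector ("drilling")
c_pos = rng.normal(scale=0.1, size=dim)        # positive context vector ("engineer")
C_neg = rng.normal(scale=0.1, size=(5, dim))   # 5 negative vectors ("minimized", "primary", ...)

for _ in range(100):                           # a few gradient steps on this single sample
    g_pos = sigmoid(c_pos @ w) - 1.0           # gradient factor for the positive pair (label 1)
    g_neg = sigmoid(C_neg @ w)                 # gradient factors for the negatives (label 0)
    w_grad = g_pos * c_pos + g_neg @ C_neg
    c_pos -= lr * g_pos * w
    C_neg -= lr * g_neg[:, None] * w
    w     -= lr * w_grad

print(sigmoid(c_pos @ w))    # pushed towards 1: p(D=1|w, c_pos)
print(sigmoid(C_neg @ w))    # pushed towards 0: p(D=1|w, c_neg)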
I am seeing a lot of literature in which they say that by using the FFT one can reach a faster convolution. I know that one needs to take the FFT of the signals, multiply the transforms, and then take the IFFT of the result, but I really do not understand why using the FFT can make the convolution faster.
The FFT speeds up convolution for large enough filters, because direct convolution requires N multiplications (and N-1 additions) for each output sample, hence roughly 2N^2 operations for a block of N samples.
Taking into account that one has to double the block size for FFT processing by zero-padding, each block requires roughly 2*(2N)*log2(2N) operations for the forward FFT, 2N operations for the point-wise multiplication, and another 4N*log2(2N) operations for the inverse FFT. This gives a break-even point where roughly 8*N*log2(N) <= 2*N^2, i.e. 4*log2(N) <= N, which holds for N >= 16 by this rough count; in practice the exact crossover depends on the implementation and constant factors.
The fundamental reasons are:
1) a discrete time-domain signal can be represented as a sum of frequencies;
2) convolution in the time domain (O(N^2)) corresponds to element-wise multiplication in the frequency domain (O(N));
3) the transformation is invertible;
4) there exists a method to convert a signal from the time domain to the frequency domain in fewer than N^2 operations (that's the first F in 'Fast Fourier Transform').
The straightforward DFT is O(N^2): each frequency-domain element is F(k) = sum_{n=0..N-1} f(n) * exp(-2*pi*i*k*n/N), a sum over all N input samples.
However, the FFT is based on the observation that the factors exp(-2*pi*i*k*n/N) have certain symmetries, allowing a length-N transform to be split into two transforms of half the size over the even-indexed and odd-indexed samples, plus O(N) operations to combine them. As this can be repeated until the sub-transforms have length 2, the overall complexity is proportional to N*log(N).
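A quick NumPy check of points 2) to 4): frequency-domain multiplication (with zero-padding, as discussed above) reproduces direct convolution:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)   # signal
h = rng.normal(size=50)    # filter

direct = np.convolve(x, h)                  # O(N*M) time-domain convolution

# Zero-pad both to the full output length, multiply the spectra, transform back.
L = len(x) + len(h) - 1
X = np.fft.fft(x, n=L)
H = np.fft.fft(h, n=L)
fast = np.fft.ifft(X * H).real              # O(L log L) overall

print(np.allclose(direct, fast))            # True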
Using GNU Octave, I'm computing an FFT over a piece of signal, then eliminating some frequencies, and finally reconstructing the signal. This gives me a nice approximation of the signal, but it doesn't give me a way to extrapolate the data.
Suppose, basically, that I have plotted three and a half periods of
f: x -> sin(x) + 0.5*sin(3*x) + 1.2*sin(5*x)
and then added some low-amplitude, zero-centered random noise. With fft/ifft, I can easily remove most of the noise; but then how do I extrapolate 3 more periods of my signal data (other, of course, than duplicating the signal)?
The math way is easy: you have a decomposition of your function as an infinite sum of sines/cosines, and you just need to extract a partial sum and apply it anywhere. But I don't quite get the programmatic way...
Thanks!
The Discrete Fourier Transform relies on the assumption that your time domain data is periodic, so you can just repeat your time domain data ad nauseam - no explicit extrapolation is necessary. Of course this may not give you what you expect if your individual component periods are not exact sub-multiples of the DFT input window duration. This is one reason why we typically apply window functions such as the Hanning Window prior to the transform.
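For the "partial sum" approach from the question, here is a minimal NumPy sketch (the amplitude threshold is arbitrary, and the window deliberately covers a whole number of periods so the components land exactly on FFT bins, which is exactly the sub-multiple caveat raised above):

import numpy as np

# Sample exactly 3 periods of the fundamental so each sine lands on an FFT bin
# (with 3.5 periods you get spectral leakage, hence the windowing caveat above).
N = 3 * 128
x = np.linspace(0.0, 3 * 2 * np.pi, N, endpoint=False)
clean = np.sin(x) + 0.5 * np.sin(3 * x) + 1.2 * np.sin(5 * x)
noisy = clean + 0.1 * np.random.randn(N)

X = np.fft.rfft(noisy)                      # bin k = k cycles per analysis window

# Keep only the strong components (simple amplitude threshold).
keep = np.abs(X) / N > 0.05

# Evaluate the partial sum of sinusoids anywhere, including beyond the window.
x_ext = np.linspace(0.0, 6 * 2 * np.pi, 2 * N, endpoint=False)   # 3 extra periods
recon = np.zeros_like(x_ext)
for k in np.flatnonzero(keep):
    amp = (2.0 if k > 0 else 1.0) * np.abs(X[k]) / N   # factor 2 for non-DC bins
    phase = np.angle(X[k])
    recon += amp * np.cos(k * x_ext / 3.0 + phase)     # k cycles per 3*2*pi window

print(np.max(np.abs(recon[:N] - clean)))    # small: the partial sum fits the clean signal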