Voice Spectrogram - signal-processing

I am working on a spectrogram project and trying to plot the frequency with the highest magnitude in each time section. We recorded a person singing do-re-mi-fa-so-la-ti-do. After plotting the spectrogram, we saw several bands of increased magnitude. In this image, we have circled the frequencies we would ideally like to plot.
However, in some sections the frequency with the highest magnitude fell outside our ideal set of frequencies. For example, between times 6 and 7, the frequency plotted was around 200 instead of 400.
Does anybody have an idea why this happens?

This is normal and expected. The overtone or harmonic with the highest magnitude in speech or singing can vary with the pitch and voicing (the particular vowel sound, etc.). Change the speaker, pitch, or vowel, and the harmonic number of the highest-energy peak can change. It can even change over time for a constant vowel and pitch.
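A small sketch may make this concrete. It synthesizes two frames at the same 200 Hz pitch with made-up harmonic amplitudes (the "vowel A" and "vowel B" mixes, the sample rate, and the frame length are all assumptions for illustration, not measurements) and picks the strongest DFT bin in each. The DFT is computed directly from its definition to keep the example dependency-free; a real project would use an FFT library.

```python
import cmath
import math

def dft_magnitude(samples, k):
    """Magnitude of DFT bin k, computed directly from the definition."""
    n = len(samples)
    return abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                   for i, x in enumerate(samples)))

fs = 4000    # sample rate in Hz (assumed)
n = 400      # frame length, so bins are fs / n = 10 Hz apart
f0 = 200.0   # sung pitch (fundamental) in Hz

# Hypothetical amplitudes of harmonics 1, 2, 3 for two vowels at one pitch.
vowels = {
    "vowel A": [1.0, 0.4, 0.2],   # fundamental is strongest
    "vowel B": [0.3, 1.0, 0.5],   # second harmonic is strongest
}

def strongest_frequency(amplitudes):
    """Frequency in Hz of the largest-magnitude bin below Nyquist."""
    frame = [sum(a * math.sin(2 * math.pi * (h + 1) * f0 * i / fs)
                 for h, a in enumerate(amplitudes))
             for i in range(n)]
    mags = [dft_magnitude(frame, k) for k in range(n // 2)]
    peak_bin = max(range(len(mags)), key=mags.__getitem__)
    return peak_bin * fs / n

peaks = {name: strongest_frequency(a) for name, a in vowels.items()}
print(peaks)
```

Both frames have the same pitch, yet simple peak picking reports the fundamental for one and the second harmonic for the other, which is the kind of jump between 200 and 400 described in the question.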

Related

Sinusoids with frequencies that are random variables - What does the FFT impulse look like?

I'm currently working on a program in C++ in which I am computing the time varying FFT of a wav file. I have a question regarding plotting the results of an FFT.
Say, for example, I have a 70 Hz signal produced by some instrument with certain harmonics. Even though I call this a 70 Hz signal, it's a real signal, and I assume its frequency will vary with some randomness around 70 Hz. Say I sample it for 1 second at a sample rate of 20 kHz. I realize the sample period probably doesn't need to be 1 second, but bear with me.
Because I now have 20000 samples, when I compute the FFT I will have 20000 (or 19999) frequency bins. Let's also assume that my sample rate, in conjunction with some windowing techniques, minimizes spectral leakage.
My question then: will the FFT still produce a relatively ideal impulse at 70 Hz? Or will there appear to be spectral leakage caused by the randomness of the original signal? In other words, what does the FFT of a sinusoid whose frequency is a random variable look like?

Some of the more common modulation schemes will add sidebands that carry the information in the modulation. Depending on the amount and type of modulation with respect to the length of the FFT, the sidebands can either appear separate from the FFT peak, or just "fatten" a single peak.

Your spectrum will appear broadened, and this happens in the real world. Look, for example, at the Voigt profile, which is a Lorentzian (the result of an ideal exponential decay) convolved with a Gaussian of a certain width, the width being determined by stochastic fluctuations, e.g. the Doppler effect on molecules in a gas that is being probed by a narrow-band laser.
You will not get an 'ideal' frequency peak either way. The limit for the resolution of the FFT is one frequency bin (the frequency resolution being given by the inverse of the duration of the time vector), but even that (as @xvan pointed out) is in general broadened by the window function. If you apply no window, i.e. the window is in fact a rectangular window of the length of the time vector, then you'll get spectral peaks that are convolved with a sinc function, and thus broadened.
The best way to visualize this is to make a long vector and plot a spectrogram (often shown for audio signals) with enough resolution that you can see the individual variation. The FFT of the overall signal is then the projection of the moving peaks onto the vertical (frequency) axis of the spectrogram. The FFT of a given time vector does not have any time resolution, but sums up all frequencies that occur during the time you transform. So the spectrogram (often people simply use the STFT, the short-time Fourier transform) has at any given time the 'full' resolution, i.e. the narrow lineshape that you expect. The FFT of the full time vector shows the sum of all your lineshapes and therefore appears broadened.
To sum it up there are two separate effects:
a) broadening from the window function (as the commenters 1 and 2 pointed out)
b) broadening from the effect of frequency fluctuation that you are trying to simulate and that happens in real life (e.g. you sitting on a swing while receiving a radio signal).
Finally, note the significance of @xvan's comment: phi = phi(t). If the phase angle is time dependent, then its derivative is not zero. dphi/dt is a frequency shift: with phi in radians, your instantaneous frequency becomes f0 + (1/(2*pi)) dphi/dt.
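To see effect (b) in isolation, here is a rough sketch. It replaces the random fluctuation with a deterministic sinusoidal phase wobble, phi(t) = beta * sin(2*pi*fm*t), purely so the result is repeatable; the sample rate, the modulation rate fm, and beta are assumed values. All frequencies fall exactly on 1 Hz bins, so the rectangular window adds no leakage of its own and any extra occupied bins come from the phase variation alone.

```python
import cmath
import math

fs = 1000    # sample rate in Hz (assumed)
n = 1000     # one second of data, so bins are 1 Hz apart
f0 = 70.0    # nominal frequency in Hz

def band_magnitudes(samples, lo, hi):
    """DFT magnitudes for bins lo..hi-1, straight from the definition."""
    size = len(samples)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * i / size)
                    for i, x in enumerate(samples)))
            for k in range(lo, hi)]

def tone(beta, fm=2.0):
    """cos(2*pi*f0*t + phi(t)) with phi(t) = beta * sin(2*pi*fm*t)."""
    return [math.cos(2 * math.pi * f0 * i / fs
                     + beta * math.sin(2 * math.pi * fm * i / fs))
            for i in range(n)]

def occupied_bins(samples, fraction=0.1):
    """Count bins from 50 to 89 Hz above `fraction` of the tallest bin."""
    mags = band_magnitudes(samples, 50, 90)
    threshold = fraction * max(mags)
    return sum(m > threshold for m in mags)

steady = occupied_bins(tone(beta=0.0))     # constant phase: a single line
wobbling = occupied_bins(tone(beta=2.5))   # peak deviation beta * fm = 5 Hz
print(steady, wobbling)
```

The steady tone occupies one bin, while the wobbling tone spreads its energy over several bins around 70 Hz. With a random rather than sinusoidal phi(t), the discrete sidebands would smear into a continuous broadened peak.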

How to determine periodicity from FFT?

Let's say I have some data that corresponds to the average temperature in a city, measured every minute for around 1 year. How can I determine whether there are cyclical patterns in the data using an FFT?
I know how it works for sound: I take the FFT of a sound wave, and the magnitude is shown on the Y axis and the frequency in hertz on the X axis, because the sampling frequency is in hertz. But in my example the sampling frequency would be 1 sample every minute, right? So how should I convert it to something meaningful? Would I get cycles per minute instead of cycles per second? And what would cycles per minute mean here?

I think your interpretation is correct: you are just scaling to different units. Once you've found the spectral peak, you might find it more useful to take the reciprocal to express the value in minutes/cycle (i.e. the length of the periodic cycle). Effectively, this is thinking in terms of period rather than frequency.
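As a minimal sketch of that unit conversion: bin k of an N-point FFT corresponds to k / (N * dt) cycles per unit of dt, and the reciprocal is the period. The function name `bin_to_period` and the peak at bin 365 are made up for illustration.

```python
def bin_to_period(k, n_samples, sample_interval=1.0):
    """Return (frequency, period) for DFT bin k.

    sample_interval is in whatever unit the data uses (minutes here), so
    frequency is in cycles per that unit and period is in that unit.
    """
    if not 0 < k <= n_samples // 2:
        raise ValueError("k must be a positive bin at or below Nyquist")
    frequency = k / (n_samples * sample_interval)
    return frequency, 1.0 / frequency

n_samples = 365 * 24 * 60    # one non-leap year of per-minute readings
freq, period = bin_to_period(365, n_samples)   # hypothetical peak at bin 365
print(freq, "cycles/minute")
print(period / 60, "hours per cycle")
```

A peak at bin 365 of a one-year record means 365 cycles per year, i.e. a period of about 24 hours, which is the daily temperature cycle. A peak at bin 1 would be the yearly cycle.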

How can I select an optimal window for Short Time Fourier Transform?

I want to select an optimal window for the STFT for different audio signals. For a signal with frequency content from 10 Hz to 300 Hz, what would be an appropriate window size? Similarly, for a signal with frequency content from 2000 Hz to 20000 Hz, what would be the optimal window size?
I know that a window size of 10 ms gives a frequency resolution of about 100 Hz. But if the frequency content of the signal lies between 100 Hz and 20000 Hz, is 10 ms an appropriate window size? Or should we choose some other window size because of the 20000 Hz content in the signal?
I know the classic "uncertainty principle" of the Fourier transform: you can have high resolution in time or high resolution in frequency, but not both at the same time. The window length lets you trade off between the two.

Windowed analysis is designed for quasi-stationary signals. Quasi-stationary signals change over time, but over some short period they can be considered stable.
One example of a quasi-stationary signal is speech. Its frequency components change over time as the position of the tongue and mouth changes, but over a short period of approximately 0.01 s they can be considered stable because the tongue does not move that fast. The 0.01 s range is determined by our biology: we just can't move the tongue faster than that.
Another example is music. When you pluck a string, you can consider the sound it produces more or less stable for some short period of time, usually 0.05 seconds.
There might be other types of signals; for example, a signal might have a frequency of 10 GHz and be quasi-stationary over 1 ms.
Windowed analysis captures both the stationary properties of the signal and its change over time. Here it does not matter what the sample rate of the signal is, what frequency resolution you need, or whether the main harmonics are near 100 Hz or near 3000 Hz. What matters is the period of time over which the signal is stationary and the period over which it can be considered to be changing.
So for speech a 25 ms window is good simply because speech is quasi-stationary on that scale. For music you usually take longer windows because our fingers move more slowly than our mouth. You need to study your signal to decide the optimal window length, or you need to provide more information about it.

You need to specify your "optimality" criteria.
For a desired frequency resolution df, you need a length or window size of roughly Fs/df samples (or from a fraction of that to twice that length or more, depending on S/N and window type). However, the length also needs to be similar to or shorter than the length of time during which your signal is stationary within your desired frequency resolution bounds. This may not be possible or known, which requires you to specify which criterion (df vs. dt) is more important for your desired "optimality".
If multiple window lengths meet your criteria, then the shortest length that is a product of very small primes is likely to be the most computationally efficient for the FFTs within an STFT computational sequence.
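A rough sketch of that selection rule, assuming "very small primes" means 2, 3, and 5 and that the only constraint is the frequency resolution (the stationarity constraint on dt is left out). The names `is_smooth` and `window_length` are invented for this example.

```python
import math

def is_smooth(n, primes=(2, 3, 5)):
    """True if n has no prime factors other than those in `primes`."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

def window_length(fs, df):
    """Shortest length N (in samples) with fs / N <= df and only small prime factors."""
    n = math.ceil(fs / df)
    while not is_smooth(n):
        n += 1
    return n

n = window_length(44100, 10)    # 10 Hz resolution at a 44.1 kHz sample rate
print(n, "samples,", 1000 * n / 44100, "ms,", 44100 / n, "Hz per bin")
```

The resulting duration in milliseconds can then be compared with how long the signal stays stationary; if the window comes out longer than that, one of the two criteria has to give.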

Based on the sampling theorem, the sampling frequency needs to be larger than twice the highest frequency of the signal. And from the DFT (discrete Fourier transform), we also know that the frequency resolution is the inverse of the entire signal duration, and the entire frequency span is the inverse of the time resolution. Note that frequency is simply the inverse of period, so these quantities are inversely related.
Frequency resolution = 1 / (overall time duration)
Frequency span = 1 / (time resolution)
Having said that, to process a 20 kHz audio signal, we need to sample at 40 kHz or more. And if we want a frequency resolution of, say, 10 Hz, we need to record for a total duration of 0.1 s, which is 1/(10 Hz).
This is why audio files are commonly sampled at 44.1 kHz: human hearing is limited to about 20 kHz, and to add some margin we use a 44.1 kHz sampling frequency instead of 40 kHz.
I think the uncertainty principle reflects the fact that a signal that is more localized in one domain is more spread out in the other. For example, an impulse in the time domain spreads from negative infinity to positive infinity in frequency, i.e. across the entire spectrum. Conversely, a single-frequency signal, which is a single line in the spectrum, stretches from negative infinity to positive infinity in the time domain. This is simply because we would have to observe forever in order to know whether a signal is a pure sinusoid or not.
But for the DFT, we can always get the frequency span we need if we sample at more than twice the highest frequency of the signal, and the resolution we want if we record the signal for long enough. So it is not as uncertain as the uncertainty principle suggests, as long as we know how many samples to take, how fast, and for how long.
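The two relations can be checked with a few lines of arithmetic. This is only a sketch; `dft_grid` and its dictionary keys are names made up for the example, and the numbers are the 0.1 s and 40 kHz case from the text.

```python
def dft_grid(duration_s, sample_rate_hz):
    """Summarize the DFT frequency grid for a recording of the given length."""
    n = round(duration_s * sample_rate_hz)      # number of samples
    return {
        "samples": n,
        "resolution_hz": sample_rate_hz / n,    # = 1 / duration
        "span_hz": sample_rate_hz,              # = 1 / sample interval
        "nyquist_hz": sample_rate_hz / 2,       # highest unambiguous frequency
    }

grid = dft_grid(0.1, 40_000)
for key, value in grid.items():
    print(key, value)
```

Note that this describes one DFT over the whole recording. In an STFT, the "overall time duration" in the first relation is the window length, which is where the time versus frequency trade-off comes back in.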

DFT Frequency Components OpenCV

I used the following link to learn how to use the DFT in OpenCV:
http://docs.opencv.org/doc/tutorials/core/discrete_fourier_transform/discrete_fourier_transform.html
I understand how the magnitude is extracted from the DFT. However, I want to know which frequency each magnitude stands for, so I can tell whether high and low frequencies are present. Could you please help me interpret this? Which frequency is each magnitude a coefficient for?
I want to know this without plotting, as I want to use this data automatically, without manually reading it off a plot. Please help me.

Sounds like you need a signal processing lesson rather than a computer vision lesson. What you get from the DFT is a matrix of complex components as big as the image you put into it. These correspond to the frequency components from 0 (top left) up to the sampling frequency (bottom right). A component with a frequency equal to the sampling frequency is a component with a period of 1 pixel. A component with a horizontal and vertical period of 4 pixels has a frequency of a quarter of the sampling frequency, so it can be found at position [rows/4, cols/4], since a four times longer period means a four times smaller frequency.
Say you are looking for the component with a horizontal period of 10 pixels and a vertical period of 6 pixels. This component can be found at position [rows/6, cols/10] in the DFT result.
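To do the lookup without a plot, the mapping between position and frequency can be written down directly. The sketch below assumes the raw, unshifted DFT output (before any quadrant swapping for display) and uses plain arithmetic rather than any OpenCV call; the function names and the 60 x 100 image size are made up. Indices above half the size are reported as negative frequencies, since for a real image they mirror the low-index components.

```python
def index_to_frequency(index, size):
    """Frequency in cycles per pixel for a DFT index along an axis of `size` pixels.

    Indices above size / 2 are returned as negative frequencies.
    """
    if index > size // 2:
        index -= size
    return index / size

def component_position(rows, cols, vertical_period, horizontal_period):
    """Nearest [row, col] of the component with the given periods in pixels."""
    return round(rows / vertical_period), round(cols / horizontal_period)

rows, cols = 60, 100    # hypothetical image size
row, col = component_position(rows, cols, vertical_period=6, horizontal_period=10)
print(row, col)
print(index_to_frequency(row, rows), index_to_frequency(col, cols))
```

With these two helpers a program can classify each magnitude as low or high frequency by comparing the absolute value of the returned frequency against a threshold between 0 and 0.5 cycles per pixel.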

Pitch detection using FFT for trumpet

How do I get the frequency using an FFT? What's the right procedure and code?

Pitch detection typically involves measuring the interval between harmonics in the power spectrum. The power spectrum is obtained from the FFT by taking the squared magnitude of the first N/2 bins (re^2 + im^2). However, there are more sophisticated techniques for pitch detection, such as cepstral analysis, where we take the FFT of the log of the power spectrum in order to identify periodicity in the spectral peaks.
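A minimal sketch of the harmonic-interval idea, assuming the spectral peaks have already been located by some peak picker. The peak list is made up, and taking the median spacing is just one simple way to turn the intervals into a single estimate.

```python
from statistics import median

def pitch_from_harmonics(peak_frequencies):
    """Estimate the fundamental as the median spacing between adjacent peaks."""
    peaks = sorted(peak_frequencies)
    if len(peaks) < 2:
        raise ValueError("need at least two spectral peaks")
    spacings = [b - a for a, b in zip(peaks, peaks[1:])]
    return median(spacings)

# Hypothetical peaks in Hz; the fundamental itself is too weak to be detected.
peaks_hz = [440.0, 660.0, 880.0, 1100.0]
print(pitch_from_harmonics(peaks_hz))
```

Because it relies on spacing rather than on the lowest or strongest peak, this estimate still works when the fundamental is weak or missing, which simple peak picking cannot handle.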

A sustained note of a musical instrument is a periodic signal, and our friend Fourier (the second "F" in "FFT") tells us that any periodic signal can be constructed by adding a set of sine waves (generally with different amplitudes, frequencies, and phases). The fundamental is the lowest frequency component, and it corresponds to pitch; the remaining components are overtones and are multiples of the fundamental's frequency. It is the relative mixture of fundamental and overtones that determines timbre, or the character of an instrument. A clarinet and a trumpet playing in unison sound "in tune" because they share the same fundamental frequency; however, they are individually identifiable because of their differing timbre (overtone mixture).
For your problem, you could sample the trumpet over a time window, calculate the FFT (which decomposes the sequence of samples into its constituent digital frequencies), and then assert that the pitch is the frequency of the bin with the greatest magnitude. If you desire, this could then be trivially quantized to the nearest musical half step, like E flat. (Look up FFT on Wikipedia if you don't understand the relationship between the sampling frequency and the resultant frequency bins, or if you don't understand the detriment of having too low a sampling frequency.) This will probably meet your needs because the fundamental component often has greater energy than any other component, although for some notes a harmonic can be stronger. The longer the window, the greater the pitch accuracy, because the bin centers become more closely spaced in frequency. However, if the window is so long that the trumpet changes its pitch appreciably over the duration of the window, then the technique's effectiveness will break down considerably.
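The quantization step can be sketched as follows, assuming equal temperament with A4 = 440 Hz and the usual MIDI note numbering (both are assumptions, as are the names `NOTE_NAMES` and `nearest_note`). The input would be the frequency of the strongest bin, i.e. peak_bin * sample_rate / window_length.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab", "A", "Bb", "B"]

def nearest_note(frequency_hz, a4_hz=440.0):
    """Name of the equal-tempered note closest to frequency_hz."""
    # MIDI note 69 is A4; each half step is a factor of 2 ** (1 / 12).
    midi = round(69 + 12 * math.log2(frequency_hz / a4_hz))
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

for f in (261.6, 311.1, 440.0):
    print(f, nearest_note(f))
```

Rounding in the log-frequency domain matters here: half steps are equal ratios, not equal differences in hertz, so rounding the raw frequency to the nearest table entry would bias the result.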

DansTuner is my open source project to solve this problem. I am in fact a trumpet player. It has pitch detection code lifted from Audacity.

I added the org.apache.commons.math.transform.FastFourierTransforme package to the project and it works perfectly.

Here is a short blog article on non-parametric techniques for estimating the PSD (power spectral density), along with some more detailed links. This might get you started on estimating the PSD and then finding the pitch.
