I'm trying to do some FFTs with MKL's ComputeForward method. Sometimes I get bins with zero in both the real and imaginary parts. I.e., I'm doing an FFT of 20480 float samples of a 16K tone sampled at 1.024 Msps, thus 50 Hz resolution per bin. Bin 9920, which corresponds to 496 kHz, is 0+0i.
The rest of the 10240 bins seem correct.
I've done the FFT in Octave and that bin's value should fit in a float without problems.
What can cause this?
NOTE:
Curiously enough, the failing bin is the one symmetric with respect to the 16K tone: the 16K tone is at bin 320, and bin 9920 is the 320th bin counting from the right.
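For context, here is a minimal sketch of how a transform like this can be set up with MKL's classic DFTI C API (a sketch only, under assumed default descriptor settings, not the poster's actual code; the "ComputeForward" in the question might equally refer to the DPC++ oneMKL interface):

// Minimal sketch: single-precision real-to-complex FFT of 20480 samples with
// MKL's classic DFTI C API, out-of-place, CCE storage so out[k] is bin k directly.
#include <vector>
#include "mkl_dfti.h"

int main() {
    const MKL_LONG n = 20480;                     // 1.024 Msps -> 50 Hz per bin
    std::vector<float> in(n);                     // fill with the 16K-tone samples
    std::vector<MKL_Complex8> out(n / 2 + 1);     // bins 0 .. 10240; bin 9920 = 496 kHz

    DFTI_DESCRIPTOR_HANDLE h = nullptr;
    DftiCreateDescriptor(&h, DFTI_SINGLE, DFTI_REAL, 1, n);
    DftiSetValue(h, DFTI_PLACEMENT, DFTI_NOT_INPLACE);
    DftiSetValue(h, DFTI_CONJUGATE_EVEN_STORAGE, DFTI_COMPLEX_COMPLEX);
    DftiCommitDescriptor(h);
    DftiComputeForward(h, in.data(), out.data()); // check every returned status in real code
    DftiFreeDescriptor(&h);
    return 0;
}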
Matlab/Octave is not officially supported by Intel MKL, and that might be causing the error. This could possibly be solved by using a supported language such as C/C++, DPC++, or Fortran.
https://software.intel.com/content/www/us/en/develop/articles/oneapi-math-kernel-library-system-requirements.html
https://community.intel.com/t5/Intel-oneAPI-Math-Kernel-Library/MKL-FFT-bin-zero-real-and-imaginary-parts-part-II/m-p/1296917#M31708
Related
I've been given some digitized sound recordings and asked to plot the sound pressure level per Hz.
The signal is sampled at 40 kHz and the units for the y axis are simply volts.
I've been asked to produce a graph of the SPL as dB/Hz vs Hz.
EDIT: The input units are voltage vs time.
Does this make sense? I thought SPL was a time-domain measure?
If it does make sense how would I go about producing this graph? Apply the dB formula (20 * log10(x) IIRC) and do an FFT on that or...?
What you're describing is a Power Spectral Density. Matlab, for example, has a pwelch function that does literally what you're asking for. To scale to dBSPL/Hz, simply apply 10*log10([psd]) where psd is the output of pwelch. Let me know if you need help with the function inputs.
If you're working with a different framework, let me know which one; I'm 100% sure it will have a version of this function, possibly with a different output format, in which case the scaling might be different.
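If it helps, here is a rough C++ sketch of just the scaling step, assuming you already have the PSD values (e.g. exported from pwelch, in V^2/Hz). Converting to true dB SPL would additionally need the microphone/ADC calibration, which the question doesn't give, so this is only dB relative to 1 V^2/Hz:

// Sketch of the dB scaling only: psd[] is assumed to already hold PSD values in
// V^2/Hz (e.g. pwelch output exported to a file). 10*log10 is used because a PSD
// is a power quantity; 20*log10 would be for amplitudes.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> psd = {1.2e-6, 3.4e-5, 7.8e-4, 5.6e-5}; // placeholder values
    const double fs = 40000.0;                                  // 40 kHz sampling rate
    const double df = (fs / 2.0) / (psd.size() - 1);            // one-sided bin spacing

    for (std::size_t k = 0; k < psd.size(); ++k) {
        const double dB = 10.0 * std::log10(psd[k]);            // dB re 1 V^2/Hz
        std::printf("%8.1f Hz  %8.2f dB\n", k * df, dB);
    }
    return 0;
}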
I am using FMCW radar to find the distance and speed of moving objects with an STM32L476 microcontroller. I transmit the modulation signal as a sawtooth waveform and read the received signal in digital form using the available ADC function. Then I copy this received ADC data into the fft_in array (converting it to float32_t; fft_in array size = 512). After copying, I apply an FFT to this array and process it to find the range of the object. Up to here everything works fine.
Now, in order to find the velocity of the object, I first copy these arrays (fft_in) as rows of a matrix over 64 chirps (matrix size [64][512]). Then I take the peak range-bin column and apply an FFT to this column array. While processing this column array with the FFT, its length reduces to half [32 elements]. Then the peak-value bin multiplied by the frequency resolution gives the phase difference ω, from which the velocity can be calculated as v = λω / (4πTc).
While running this algorithm, I find that when the object is stationary I get the peak value at the 22nd element (out of 32 elements). What does this imply?
I have a sampling frequency of 24502 Hz for the ADC, so the per-bin value for range estimation is 47.8566 Hz (24502/512).
I have 64 chirps and Tc is 0.006325 s, so 1/0.006325 gives 158.10 Hz. What would the per-bin velocity resolution be, is it 2.47 Hz (158.10/64)? I'm a bit confused about this concept. How does the 2nd FFT work for finding velocity in FMCW radar?
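For what it's worth, here is a quick numeric check of the bin spacings quoted in the question; the Doppler FFT across 64 chirps spaced Tc apart does have a bin spacing of 1/(64·Tc) ≈ 2.47 Hz, so the arithmetic above is consistent:

// Quick numeric check of the bin spacings discussed above (numbers from the question).
#include <cstdio>

int main() {
    const double fs    = 24502.0;    // ADC sampling rate, Hz
    const int    nFast = 512;        // samples per chirp (range FFT length)
    const int    nSlow = 64;         // number of chirps (Doppler FFT length)
    const double Tc    = 0.006325;   // chirp repetition time, s

    const double rangeBin   = fs / nFast;        // ~47.86 Hz per range bin
    const double chirpRate  = 1.0 / Tc;          // ~158.10 Hz chirp repetition frequency
    const double dopplerBin = chirpRate / nSlow; // ~2.47 Hz per Doppler bin

    std::printf("range bin spacing   : %.4f Hz\n", rangeBin);
    std::printf("chirp rep frequency : %.2f Hz\n", chirpRate);
    std::printf("Doppler bin spacing : %.2f Hz\n", dopplerBin);
    return 0;
}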
Infineon has excellent resources on this topic, see this FAQ for the basics: https://www.infineon.com/dgdl/Infineon-Radar%20FAQ-PI-v02_00-EN.pdf?fileId=5546d46266f85d6301671c76d2a00614
If you want to know more details, check out the P2G Software User Manual:
https://www.infineon.com/dgdl/Infineon-P2G_Software_User_Manual-UserManual-v01_01-EN.pdf?fileId=5546d4627762291e017769040a233324 (Chapter 4)
There is even the software available with all the algorithms (including FMCW). How to get the software with the "Infineon Toolbox" is described here: https://www.mouser.com/pdfdocs/Infineon_Position2Go_QS.pdf
Some hints from me:
I suggest applying a window function before the FFT (https://en.wikipedia.org/wiki/Window_function) and removing the mean; see the small sketch after these hints.
Read about frequency mixers https://en.wikipedia.org/wiki/Frequency_mixer
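A minimal sketch of the first hint, assuming the 512-sample fft_in buffer from the question (CMSIS's float32_t is just a typedef for float) and using a Hann window purely as an example:

// Remove the DC offset (mean) and apply a Hann window to the chirp buffer
// before the range FFT. Names follow the question; any window could be used.
#include <cmath>

void preprocess(float* fft_in, int n /* = 512 */) {
    // Remove the mean so bin 0 does not dominate the spectrum.
    float mean = 0.0f;
    for (int i = 0; i < n; ++i) mean += fft_in[i];
    mean /= n;

    // Hann window: w[i] = 0.5 * (1 - cos(2*pi*i/(n-1)))
    const float pi = 3.14159265358979f;
    for (int i = 0; i < n; ++i) {
        const float w = 0.5f * (1.0f - std::cos(2.0f * pi * i / (n - 1)));
        fft_in[i] = (fft_in[i] - mean) * w;
    }
}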
It's written that the cuFFT library supports algorithms highly optimized for input sizes that can be written in the form 2^a × 3^b × 5^c × 7^d.
How did they manage to do that?
As far as I know, the FFT provides its best performance only for input sizes of 2^a.
This means that input sizes with prime factors larger than 7 would run slower.
The Cooley-Tukey algorithm can operate on a variety of DFT lengths that can be expressed as N = N_1*N_2. The algorithm recursively re-expresses a DFT of length N as N_1 smaller DFTs of length N_2.
As you note, the fastest is generally the radix-2 factorization, which recursively breaks a DFT of length N into 2 smaller DFTs of length N/2, running in O(N log N).
However, the actual performance will depend on hardware and implementation. For example, if we are considering the cuFFT with a thread warp size of 32 then DFTs that have a length of some multiple of 32 would be optimal (note: just an example, I'm not aware of the actual optimizations that exist under the hood of the cuFFT.)
Short answer: the underlying code is optimized for any size whose prime factors are no larger than 7, using the Cooley-Tukey mixed-radix algorithm.
http://mathworld.wolfram.com/FastFourierTransform.html
https://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm
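To make "sizes of the form 2^a × 3^b × 5^c × 7^d" concrete, here is a small sketch (not anything from cuFFT itself) that checks whether a given length falls into that class:

// Checks whether n can be written as 2^a * 3^b * 5^c * 7^d, i.e. whether it
// falls into the class of sizes that the cuFFT docs call highly optimized.
#include <cstdio>
#include <initializer_list>

bool is2357Smooth(long long n) {
    if (n < 1) return false;
    for (long long p : {2LL, 3LL, 5LL, 7LL})
        while (n % p == 0) n /= p;     // strip out every factor of 2, 3, 5, 7
    return n == 1;                     // anything left over is a prime factor > 7
}

int main() {
    for (long long n : {1024LL, 1080LL, 1331LL, 20480LL})
        std::printf("%lld -> %s\n", n,
                    is2357Smooth(n) ? "2/3/5/7-smooth" : "has a prime factor > 7");
    return 0;
}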
On my system, for a 5 MP image with a large window size (75px) it takes a whopping 140 ms (roughly 20 times as much as linear operations) to complete and I am looking to optimize it. I have noticed that the OpenCV gpu module does not implement a gpu version of the adaptiveThreshold so I have been thinking of implementing that algorithm for the GPU myself.
Can I hope for any speedup if I implement an adaptive threshold algorithm in CUDA, based on a large window size (50px+) and a large image (5 MP+), ignoring the overhead for loading memory into the GPU?
adaptiveThreshold documentation on opencv.org:
http://docs.opencv.org/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold
Building on Eric's answer:
The NPP CUDA library does not implement adaptiveThreshold, but it can be used to get an adaptive threshold in a VERY straightforward way (just tested it and anecdotally it works):
1. Run a box filter on src (i.e. compute the mean window value for every pixel) and store the result in an intermediate image tmp.
2. Subtract a number K from each pixel in tmp.
3. Run a compare function between src and tmp into dst. The end.
The code may look like this (here K=0, so the 2nd step is omitted):
// Step 1: box filter src -> intermediate (per-pixel window mean).
nppiFilterBox_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                     oDeviceIntermediate.data(), oDeviceIntermediate.pitch(),
                     oSizeROI, oAdapThreshWindowSize, oAnchor);
// Step 3: per-pixel compare (src < windowed mean) -> binary result.
nppiCompare_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
                   oDeviceIntermediate.data(), oDeviceIntermediate.pitch(),
                   oDeviceResult.data(), oDeviceResult.pitch(),
                   oSizeROI, NPP_CMP_LESS);
Also, wikipedia claims that applying a box filter 3 times in a row approximates a Gaussian filter to 97% accuracy.
Yes, this algorithm can be optimized on the GPU. I would expect to see an excellent speedup.
For ADAPTIVE_THRESH_MEAN_C, you could use a standard parallel reduction to calculate the arithmetic mean. For ADAPTIVE_THRESH_GAUSSIAN_C, you might use a kernel that performs per-pixel gaussian attenuation combined with a standard parallel reduction for the sum.
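For reference, here is a plain CPU sketch of what the mean variant computes (window mean via a summed-area table, then a per-pixel compare). This is only a reference for the operation you would map onto CUDA, not OpenCV's actual implementation, and it clamps the window at the image borders where OpenCV replicates them:

// CPU reference for the mean variant of adaptive thresholding:
// dst(x,y) = 255 if src(x,y) > (mean of the block around x,y) - C, else 0.
// A summed-area table gives each window mean in O(1).
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

void adaptiveThresholdMeanCPU(const std::vector<uint8_t>& src, std::vector<uint8_t>& dst,
                              int w, int h, int block /* odd, e.g. 75 */, int C) {
    // Summed-area table with a one-element zero border: sat[(y+1)*(w+1)+(x+1)]
    // holds the sum of src over the rectangle [0..y] x [0..x].
    std::vector<int64_t> sat(static_cast<std::size_t>(w + 1) * (h + 1), 0);
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            sat[(y + 1) * (w + 1) + (x + 1)] = src[y * w + x]
                + sat[y * (w + 1) + (x + 1)]
                + sat[(y + 1) * (w + 1) + x]
                - sat[y * (w + 1) + x];

    dst.assign(static_cast<std::size_t>(w) * h, 0);
    const int r = block / 2;
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            const int x0 = std::max(0, x - r), x1 = std::min(w - 1, x + r);
            const int y0 = std::max(0, y - r), y1 = std::min(h - 1, y + r);
            const int64_t sum = sat[(y1 + 1) * (w + 1) + (x1 + 1)]
                              - sat[y0 * (w + 1) + (x1 + 1)]
                              - sat[(y1 + 1) * (w + 1) + x0]
                              + sat[y0 * (w + 1) + x0];
            const double mean = static_cast<double>(sum)
                              / ((x1 - x0 + 1) * (y1 - y0 + 1));
            dst[y * w + x] = (src[y * w + x] > mean - C) ? 255 : 0;
        }
    }
}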
A CUDA implementation should give you a satisfying performance gain.
Since your window size is large, this operation should be compute-bound. The theoretical best-case runtime for a 5 MP image with a 75px window on a Tesla K20X GPU should be about
5e6 * 75 * 75 / 3.95 Tflop/s ≈ 7 ms
Here's a white paper about image convolution. It shows how to implement a high-performance box filter with CUDA.
http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
Nvidia cuNPP library also provides a function nppiFilterBox(), which can be used to implement ADAPTIVE_THRESH_MEAN_C directly.
http://docs.nvidia.com/cuda/cuda-samples/index.html#box-filter-with-npp
For ADAPTIVE_THRESH_GAUSSIAN_C, the function nppiFilter() with a proper mask could be used.
NPP doc pp.1009 http://docs.nvidia.com/cuda/pdf/NPP_Library.pdf
Note - may be more related to computer organization than software, not sure.
I'm trying to understand something related to data compression, say for jpeg photos. Essentially a very dense matrix is converted (via discrete cosine transforms) into a much more sparse matrix. Supposedly it is this sparse matrix that is stored. Take a look at this link:
http://en.wikipedia.org/wiki/JPEG
Compare the original 8x8 sub-block image example to matrix "B", which has been transformed to have overall lower-magnitude values and many more zeros throughout. How is matrix B stored such that it saves much more memory than the original matrix?
The original matrix clearly needs 8x8 (number of entries) x 8 bits/entry since values can range randomly from 0 to 255. OK, so I think it's pretty clear we need 64 bytes of memory for this. Matrix B on the other hand, hmmm. Best case scenario I can think of is that values range from -26 to +5, so at most an entry (like -26) needs 6 bits (5 bits to form 26, 1 bit for sign I guess). So then you could store 8x8x6 bits = 48 bytes.
The other possibility I see is that the matrix is stored in a "zig zag" order from the top left. Then we can specify a start and an end address and just keep storing along the diagonals until we're only left with zeros. Let's say it's a 32-bit machine; then 2 addresses (start + end) will constitute 8 bytes; for the other non-zero entries at 6 bits each, say, we have to go along almost all the top diagonals to store a sum of 28 elements. In total this scheme would take 29 bytes.
To summarize my question: if JPEG and other image encoders are claiming to save space by using algorithms to make the image matrix less dense, how is this extra space being realized in my hard disk?
Cheers
The DCT needs to be accompanied by other compression schemes that take advantage of the zeros / frequently repeated values. A simple example is run-length encoding.
JPEG uses a variant of Huffman coding.
As it says under "Entropy coding", a zig-zag pattern is used together with RLE, which already reduces the size in many cases. However, as far as I know the DCT doesn't give a sparse matrix per se, but it usually makes the matrix much easier to compress. This is the point where the compression becomes lossy: the input matrix is transformed with the DCT, then the values are quantized, and then Huffman encoding is used.
The simplest compression would take advantage of repeated sequences of symbols (zeros). A matrix in memory may look like this (in decimal):
0000000000000100000000000210000000000004301000300000000004
After compression it may look like this
(0,13)1(0,11)21(0,12)43010003(0,11)4
(Symbol,Count)...
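A toy sketch of that (Symbol,Count) idea, compressing only runs of '0' of at least a few symbols so the short runs stay literal; with a minimum run length of 4 it reproduces the compressed line above. This is not JPEG's actual entropy coder, which combines zero-run lengths with Huffman-coded value categories:

// Toy run-length encoder for the line above: runs of '0' of length >= minRun
// become "(0,count)", everything else is copied literally.
#include <cstdio>
#include <string>

std::string rleZeros(const std::string& s, std::size_t minRun = 4) {
    std::string out;
    for (std::size_t i = 0; i < s.size();) {
        if (s[i] == '0') {
            std::size_t j = i;
            while (j < s.size() && s[j] == '0') ++j;   // find the end of the zero run
            const std::size_t run = j - i;
            if (run >= minRun) out += "(0," + std::to_string(run) + ")";
            else               out.append(run, '0');   // short runs stay literal
            i = j;
        } else {
            out += s[i++];
        }
    }
    return out;
}

int main() {
    const std::string m = "0000000000000100000000000210000000000004301000300000000004";
    std::printf("%s\n", rleZeros(m).c_str());  // prints (0,13)1(0,11)21(0,12)43010003(0,11)4
    return 0;
}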
As I understand it, JPEG doesn't only compress, it also drops data. After the 8x8 block is transformed to the frequency domain, it drops the insignificant (high-frequency) data, which means it only has to save the significant 6x6 or even 4x4 data. That's why it can achieve a higher compression rate than lossless methods (like GIF).