1-to-4 broadcast and 4-to-1 reduce in AVX-512


I need to do the following two operations:
float x[4];
float y[16];
// 1-to-4 broadcast
for ( int i = 0; i < 16; ++i )
y[i] = x[i / 4];
// 4-to-1 reduce-add
for ( int i = 0; i < 16; ++i )
x[i / 4] += y[i];
What would be an efficient AVX-512 implementation?

For the reduce-add, just do in-lane shuffles and adds (vmovshdup / vaddps / vpermilps imm8/vaddps) like in Fastest way to do horizontal float vector sum on x86 to get a horizontal sum in each 128-bit lane, and then vpermps to shuffle the desired elements to the bottom. Or vcompressps with a constant mask to do the same thing, optionally with a memory destination.
Once packed down to a single vector, you have a normal SIMD 128-bit add.
If your arrays are actually larger than 16, instead of vpermps you could use vpermt2ps to take every 4th element from each of two source vectors, setting you up for doing the += part into x[] with 256-bit vectors. (Or combine again with another shuffle into 512-bit vectors, but that will probably bottleneck on shuffle throughput on SKX.)
On SKX, vpermt2ps is only a single uop, with 1c throughput / 3c latency, so it's very efficient for how powerful it is. On KNL it has 2c throughput, worse than vpermps, but maybe still worth it. (KNL doesn't have AVX512VL, but for adding to x[] with 256-bit vectors you (or a compiler) can use AVX1 vaddps ymm if you want.)
See https://agner.org/optimize/ for instruction tables.
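As a rough illustration of the shuffle/add route described above, here is a minimal sketch using intrinsics. It is not from the original answer: the function name reduce_add_4to1 is made up, the 0x1111 compress mask and the scalar fallback (used when the compiler is not targeting AVX-512F) are my own choices, and it assumes x[] already holds the running totals to add into.

```cpp
#include <cassert>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: x[j] += y[4j] + y[4j+1] + y[4j+2] + y[4j+3] for j = 0..3.
void reduce_add_4to1(const float* y, float* x) {
    assert(y != nullptr && x != nullptr);
#if defined(__AVX512F__)
    __m512 v = _mm512_loadu_ps(y);
    // vmovshdup + vaddps: even elements now hold y[2k] + y[2k+1].
    __m512 sums = _mm512_add_ps(v, _mm512_movehdup_ps(v));
    // vpermilps imm8 + vaddps: swap the 64-bit halves of each 128-bit lane and
    // add, so element 0 of each lane holds that lane's horizontal sum.
    sums = _mm512_add_ps(sums, _mm512_permute_ps(sums, _MM_SHUFFLE(1, 0, 3, 2)));
    // vcompressps with a constant mask: pack elements 0, 4, 8, 12 to the bottom.
    __m512 packed = _mm512_maskz_compress_ps(0x1111, sums);
    // Once packed down to one vector, this is a normal 128-bit add into x[].
    __m128 acc = _mm_add_ps(_mm_loadu_ps(x), _mm512_castps512_ps128(packed));
    _mm_storeu_ps(x, acc);
#else
    // Scalar fallback so the sketch still builds without AVX-512F.
    for (int j = 0; j < 4; ++j) {
        x[j] += (y[4 * j] + y[4 * j + 1]) + (y[4 * j + 2] + y[4 * j + 3]);
    }
#endif
}
```

Note that the pairwise summation order differs from the scalar loop in the question, so results can differ in the last bits for general floating-point data.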
For the load:
Is this done inside a loop, or repeatedly? (i.e. can you keep a shuffle-control vector in a register?) If so, you could:
do a 128->512 broadcast with VBROADCASTF32X4 (single uop for a load port).
do an in-lane shuffle with vpermilps zmm,zmm,zmm to broadcast a different element within each 128-bit lane. (This has to be separate from the broadcast-load, because a memory-source vpermilps can only have an m512 or m32bcst source. Instructions typically have their memory broadcast granularity = their element size, which is unfortunate in cases like this where it's not at all useful. And vpermilps takes the control vector as its memory operand, not the source data.)
This is slightly better than vpermps zmm,zmm,zmm because the shuffle has 1 cycle latency instead of 3 (on Skylake-avx512).
Even outside a loop, loading a shuffle-control vector might still be your best bet.
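The broadcast-load plus in-lane shuffle could be sketched with intrinsics as follows. This is only an illustration under assumptions: the name broadcast_1to4 is hypothetical, the control vector is built with _mm512_setr_epi32 here (in a real loop you would hoist it), and the scalar branch exists only so the snippet builds when AVX-512F is not enabled.

```cpp
#include <cassert>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: y[i] = x[i / 4] for i = 0..15.
void broadcast_1to4(const float* x, float* y) {
    assert(x != nullptr && y != nullptr);
#if defined(__AVX512F__)
    // vbroadcastf32x4: copy the four floats of x[] into every 128-bit lane.
    __m512 v = _mm512_broadcast_f32x4(_mm_loadu_ps(x));
    // vpermilps zmm,zmm,zmm: lane k selects its own element k (only the low
    // two bits of each control element are used).
    const __m512i ctrl = _mm512_setr_epi32(0, 0, 0, 0, 1, 1, 1, 1,
                                           2, 2, 2, 2, 3, 3, 3, 3);
    _mm512_storeu_ps(y, _mm512_permutevar_ps(v, ctrl));
#else
    // Scalar fallback so the sketch still builds without AVX-512F.
    for (int i = 0; i < 16; ++i) y[i] = x[i / 4];
#endif
}
```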

Related

How to calculate 512 point FFT using 2048 point FFT hardware module

I have a 2048 point FFT IP. How may I use it to calculate 512 point FFT ?

There are different ways to accomplish this, but the simplest is to replicate the input data 4 times, to obtain a signal of 2048 samples. Note that the DFT (which is what the FFT computes) can be seen as assuming that the input signal is replicated infinitely. Thus, we are just providing a larger "view" of this infinitely long periodic signal.
The resulting FFT will have 512 non-zero values, with zeros in between. Each of the non-zero values will also be four times as large as the 512-point FFT would have produced, because there are four times as many input samples (that is, if the normalization is as commonly applied, with no normalization in the forward transform and 1/N normalization in the inverse transform).
Here is a proof of principle in MATLAB:
data = randn(1,512);
ft = fft(data); % 512-point FFT
data = repmat(data,1,4);
ft2 = fft(data); % 2048-point FFT
ft2 = ft2(1:4:end) / 4; % 512-point FFT
assert(all(ft2==ft))
(Very surprising that the values were exactly equal, no differences due to numerical precision appeared in this case!)
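For readers without MATLAB, the same idea can be sketched in C++. This is an illustration under assumptions, not the hardware interface: a naive O(n^2) DFT stands in for the large FFT module, and the names dft and dft_via_replication are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive O(n^2) DFT, standing in for the fixed-size FFT module.
cvec dft(const cvec& in) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    cvec out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            // Reduce k*t modulo n to keep the angle small and accurate.
            const double angle = -2.0 * pi * double((k * t) % n) / double(n);
            acc += in[t] * std::polar(1.0, angle);
        }
        out[k] = acc;
    }
    return out;
}

// Hypothetical wrapper: small transform via one that is `factor` times larger.
cvec dft_via_replication(const cvec& x, std::size_t factor) {
    assert(factor >= 1 && !x.empty());
    cvec big;
    for (std::size_t r = 0; r < factor; ++r) {
        big.insert(big.end(), x.begin(), x.end());  // repeat the input
    }
    const cvec full = dft(big);
    cvec out(x.size());
    for (std::size_t k = 0; k < x.size(); ++k) {
        // Keep every factor-th bin and undo the gain of `factor`.
        out[k] = full[k * factor] / double(factor);
    }
    return out;
}
```

Unlike the MATLAB assert above, a comparison here should use a small tolerance rather than exact equality.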

An alternative to the correct solution provided by Cris Luengo, which does not require any rescaling, is to pad the data with zeros to the required length of 2048 samples. You then get your result by reading every 2048/512 = 4th output (i.e. output[0], output[4], output[8], ... in a 0-based indexing system).
Since you mention making use of a hardware module, this could be implemented in hardware by connecting the first 512 input pins and grounding all other inputs, and reading every 4th output pin (ignoring all other output pins).
Note that this works because the FFT of the zero-padded signal is an interpolation in the frequency-domain of the original signal's FFT. In this case you do not need the interpolated values, so you can just ignore them. Here's an example computing a 4-point FFT using a 16-point module (I've reduced the size of the FFT for brevity, but kept the same ratio of 4 between the two):
x = [1,2,3,4]
fft(x)
ans> 10.+0.j,
-2.+2.j,
-2.+0.j,
-2.-2.j
x = [1,2,3,4,0,0,0,0,0,0,0,0,0,0,0,0]
fft(x)
ans> 10.+0.j, 6.499-6.582j, -0.414-7.242j, -4.051-2.438j,
-2.+2.j, 1.808+1.804j, 2.414-1.242j, -0.257-2.3395j,
-2.+0.j, -0.257+2.339j, 2.414+1.2426j, 1.808-1.8042j,
-2.-2.j, -4.051+2.438j, -0.414+7.2426j, 6.499+6.5822j
As you can see in the second output, the first column (which corresponds to outputs 0, 4, 8 and 12) is identical to the desired output from the first, smaller-sized FFT.
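The zero-padding variant can be sketched the same way in C++. Again this is only an illustration: the naive dft function stands in for the larger FFT module, and dft_via_zero_padding is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive O(n^2) DFT, standing in for the fixed-size FFT module.
cvec dft(const cvec& in) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    cvec out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double angle = -2.0 * pi * double((k * t) % n) / double(n);
            acc += in[t] * std::polar(1.0, angle);
        }
        out[k] = acc;
    }
    return out;
}

// Hypothetical wrapper: pad with zeros, transform, keep every factor-th bin.
cvec dft_via_zero_padding(const cvec& x, std::size_t factor) {
    assert(factor >= 1 && !x.empty());
    cvec big(x);
    big.resize(x.size() * factor);  // appended elements are 0 + 0i
    const cvec full = dft(big);
    cvec out(x.size());
    for (std::size_t k = 0; k < x.size(); ++k) {
        out[k] = full[k * factor];  // bins 0, factor, 2*factor, ...; no rescaling
    }
    return out;
}
```

With x = [1, 2, 3, 4] and factor = 4 this reads bins 0, 4, 8 and 12 of the 16-point transform.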

CUDA: Best number of pixel computed per thread (grayscale)

I'm working on a program to convert an image to grayscale. I'm using the CImg library. For each pixel I have to read the 3 values R-G-B, calculate the corresponding gray value and store the gray pixel in the output image. I'm working with an NVIDIA GTX 480. Some details about the card:
Microarchitecture: Fermi
Compute capability (version): 2.0
Cores per SM (warp size): 32
Streaming Multiprocessors: 15
Maximum number of resident warps per multiprocessor: 48
Maximum amount of shared memory per multiprocessor: 48KB
Maximum number of resident threads per multiprocessor: 1536
Number of 32-bit registers per multiprocessor: 32K
I'm using a square grid with blocks of 256 threads.
This program can have as input images of different sizes (e.g. 512x512 px, 10000x10000 px). I observed that incrementing the number of the pixels assigned to each thread increments the performance, so it's better than computing one pixel per thread. The problem is, how can I determine the number of pixels to assign to each thread statically? Computing tests with every possible number? I know that on the GTX 480, 1536 is the maximum number of resident threads per multiprocessor. Do I have to consider this number? The following is the code executed by the kernel.
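// Grid-stride loop: start at this thread's global index so that the first
// gridDim.x * blockDim.x pixels are not skipped.
for(i = (blockIdx.x * blockDim.x) + threadIdx.x; i < width * height; i += (gridDim.x * blockDim.x)) {
    float grayPix = 0.0f;
    // The input is planar: all R values, then all G values, then all B values.
    float r = static_cast< float >(inputImage[i]);
    float g = static_cast< float >(inputImage[(width * height) + i]);
    float b = static_cast< float >(inputImage[(2 * width * height) + i]);
    grayPix = ((0.3f * r) + (0.59f * g) + (0.11f * b));
    // Darken, then add 0.5 so the cast below rounds to nearest.
    grayPix = (grayPix * 0.6f) + 0.5f;
    darkGrayImage[i] = static_cast< unsigned char >(grayPix);
}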

The problem is, how can I determine the number of pixels to assign to each thread statically? Computing tests with every possible number?
Although you haven't shown any code, you've mentioned an observed characteristic:
I observed that incrementing the number of the pixels assigned to each thread increments the performance,
This is actually a fairly common observation for these types of workloads, and it may also be the case that this is more evident on Fermi than on newer architectures. A similar observation occurs during matrix transpose. If you write a "naive" matrix transpose that transposes one element per thread, and compare it with the matrix transpose discussed here that transposes multiple elements per thread, you will discover, especially on Fermi, that the multiple element per thread transpose can achieve approximately the available memory bandwidth on the device, whereas the one-element-per-thread transpose cannot. This ultimately has to do with the ability of the machine to hide latency, and the ability of your code to expose enough work to allow the machine to hide latency. Understanding the underlying behavior is somewhat involved, but fortunately, the optimization objective is fairly simple.
GPUs hide latency by having lots of available work to switch to, when they are waiting on previously issued operations to complete. So if I have a lot of memory traffic, the individual requests to memory have a long latency associated with them. If I have other work that the machine can do while it is waiting for the memory traffic to return data (even if that work generates more memory traffic), then the machine can use that work to keep itself busy and hide latency.
The way to give the machine lots of work starts by making sure that we have enabled the maximum number of warps that can fit within the machine's instantaneous capacity. This number is fairly simple to compute: it is the product of the number of SMs on your GPU and the maximum number of warps that can be resident on each SM. We want to launch a kernel that meets or exceeds this number, but additional warps/blocks beyond this number don't necessarily help us hide latency.
Once we have met the above number, we want to pack as much "work" as possible into each thread. Effectively, for the problem you describe and the matrix transpose case, packing as much work into each thread means handling multiple elements per thread.
So the steps are fairly simple:
Launch as many warps as the machine can handle instantaneously
Put all remaining work in the thread code, if possible.
Let's take a simplistic example. Suppose my GPU has 2 SMs, each of which can handle 4 warps (128 threads). Note that this is not the number of cores, but the "Maximum number of resident warps per multiprocessor" as indicated by the deviceQuery output.
My objective then is to create a grid of 8 warps, i.e. 256 threads total (in at least 2 threadblocks, so they can distribute to each of the 2 SMs) and make those warps perform the entire problem by handling multiple elements per thread. So if my overall problem space is a total of 1024x1024 elements, I would ideally want to handle 1024*1024/256 elements per thread.
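The arithmetic in this example can be written down as a small host-side helper. This is only a sketch: LaunchPlan and plan_launch are hypothetical names, the warp size of 32 is taken from the device details in the question, and rounding the per-thread element count up is my own choice.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of the sizing rule described above.
struct LaunchPlan {
    std::int64_t resident_warps;       // SMs * max resident warps per SM
    std::int64_t total_threads;        // resident_warps * warp size
    std::int64_t elements_per_thread;  // ceil(elements / total_threads)
};

LaunchPlan plan_launch(int sm_count, int max_warps_per_sm,
                       std::int64_t elements, int warp_size = 32) {
    assert(sm_count > 0 && max_warps_per_sm > 0 && warp_size > 0);
    assert(elements >= 0);
    LaunchPlan p;
    p.resident_warps = static_cast<std::int64_t>(sm_count) * max_warps_per_sm;
    p.total_threads = p.resident_warps * warp_size;
    // Round up so that every element is assigned to some thread.
    p.elements_per_thread = (elements + p.total_threads - 1) / p.total_threads;
    return p;
}
```

For the toy GPU above, plan_launch(2, 4, 1024 * 1024) gives 8 warps, 256 threads and 4096 elements per thread. With the GTX 480 figures from the question (15 SMs, 48 resident warps each) it gives 720 warps, i.e. 23040 threads.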
Note that this method gives us an optimization direction. We do not necessarily have to achieve this objective completely in order to saturate the machine. It might be the case that it is only necessary, for example, to handle 8 elements per thread, in order to allow the machine to fully hide latency, and usually another limiting factor will appear, as discussed below.
Following this method will tend to remove latency as a limiting factor for performance of your kernel. Using the profiler, you can assess the extent to which latency is a limiting factor in a number of ways, but a fairly simple one is to capture the sm_efficiency metric, and perhaps compare that metric in the two cases you have outlined (one element per thread, multiple elements per thread). I suspect you will find, for your code, that the sm_efficiency metric indicates a higher efficiency in the multiple elements per thread case, and this is indicating that latency is less of a limiting factor in that case.
Once you remove latency as a limiting factor, you will tend to run into one of the other two machine limiting factors for performance: compute throughput and memory throughput (bandwidth). In the matrix transpose case, once we have sufficiently dealt with the latency issue, then the kernel tends to run at a speed limited by memory bandwidth.

What actually does the size of FFT mean

While using FFT sample code from Apple documentation, what actually does the N, log2n, n and nOver2 mean?
Does N refer to the window size of the fft or the whole number of samples in a given audio, and
how do I calculate N from an audio file?
how are they related to the audio sampling rate i.e. 44.1kHz?
What would be the FFT frame size in this code?
Code:
/* Set the size of FFT. */
log2n = N;
n = 1 << log2n;
stride = 1;
nOver2 = n / 2;
printf("1D real FFT of length log2 ( %d ) = %d\n\n", n, log2n);
/* Allocate memory for the input operands and check its availability,
* use the vector version to get 16-byte alignment. */
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
originalReal = (float *) malloc(n * sizeof(float));
obtainedReal = (float *) malloc(n * sizeof(float));

N or n typically refers to the number of elements. log2n is the base-two logarithm of n. (The base-two logarithm of 32 is 5.) nOver2 is n/2, n divided by two.
In the context of an FFT, n is the number of samples being fed into the FFT.
n is usually determined by a variety of constraints. You want more samples to provide a better quality result, but you do not want so many samples that processing takes up a lot of computer time or that the result is not available until so late that the user notices a lag. Usually, it is not the length of an audio file that determines the size. Rather, you design a “window” that you will use for processing, then you read samples from the audio file into a buffer big enough to hold your window, then you process the buffer, then you repeat with more samples from the file. Repetitions continue until the entire file is processed.
A higher audio sampling rate means there will be more samples in a given period of time. E.g., if you want to keep your window under 1/30th of a second, then a 44.1 kHz sampling rate will have less than 44.1•1000/30 = 1470 samples. A higher sampling rate means you have more work to do, so you may need to adjust your window size to keep the processing within limits.
That code uses N for log2n, which is unfortunate, since it may confuse people. Otherwise, the code is as I described above, and the FFT frame size is n.
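To make the relation between the sampling rate, the window and the three variables concrete, here is a minimal sketch. FftSize and fft_size_for_window are hypothetical names, and picking the largest power of two that fits in the window is an assumption on my part (the sample code builds n as 1 << log2n, so n is a power of two there).

```cpp
#include <cassert>

// Field names follow the Apple sample: n = 1 << log2n, nOver2 = n / 2.
struct FftSize {
    int log2n;
    int n;
    int nOver2;
};

// Hypothetical helper: largest power-of-two frame that fits in the window.
FftSize fft_size_for_window(double sample_rate_hz, double max_window_seconds) {
    assert(sample_rate_hz > 0.0 && max_window_seconds > 0.0);
    const double max_samples = sample_rate_hz * max_window_seconds;
    assert(max_samples >= 1.0);
    int log2n = 0;
    // Grow the exponent while the next power of two still fits.
    while (log2n < 30 && static_cast<double>(1 << (log2n + 1)) <= max_samples) {
        ++log2n;
    }
    FftSize s;
    s.log2n = log2n;
    s.n = 1 << log2n;
    s.nOver2 = s.n / 2;
    return s;
}
```

For a 44.1 kHz file and a window under 1/30th of a second (1470 samples), this yields log2n = 10, n = 1024 and nOver2 = 512.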
There can be some confusion about FFT size or length when a mix of real data and complex data is involved. Typically, for a real-to-complex FFT, the number of real elements is said to be the length. When doing a complex-to-complex FFT, the number of complex elements is the length.

'N' is the number of samples, i.e., your vector size. Correspondingly, 'log2N' is the logarithm of 'N' with the base 2, and 'nOver2' is half of 'N'.
To answer the other questions, one must know what you want to do with the FFT. This document, even though it is written with a specific system in mind, can serve as a survey of the relation and the meaning of the parameters in the (D)FFT.

Is FFT necessary to find peaks and pits in audio files

I'm able to read a WAV file and its sample values. I need to find the positions and values of its peaks and pits. At first I tried smoothing the signal with the formula (array[i-1] + array[i] + array[i+1]) / 3 and then scanning the array with a test such as array[i-1] > array[i] & direction == 'up' to detect a turning point, but because of noise, and for reasons related to later calculations in the project, I'm trying to find a better approach. For the last couple of days I've been researching the FFT. As I understand it, the FFT translates the audio into a series of sines and cosines. After the FFT, the resulting values are the coefficients a0, ak and bk of a0 + ak * cos(k*x) + bk * sin(k*x) for increasing k, as in this picture:
http://zone.ni.com/images/reference/en-XX/help/371361E-01/loc_eps_sigadd3freqcomp.gif
My question is, does the FFT help me find peaks and pits in audio? Does anybody have experience with this kind of problem?

It depends on exactly what you are trying to do, which you haven't really made clear. "Finding the peaks and pits" is one thing, but since there might be various reasons for doing this there might be various methods. It sounds like you already tried the straightforward thing of actually looking for the local maxima and minima. Here are some tips:
1. You do not need the FFT.
2. Audio data usually swings above and below zero (there are exceptions, including 8-bit wavs, which are unsigned, but these are exceptions), so you must be aware of positive and negative values. Generally, large positive and large negative values carry large amounts of energy, though, so you want to count those as the same.
3. Due to #2, if you want to average, you might want to take the average of the absolute value, or more commonly, the average of the square. Once you find the average of the squares, take the square root of that value and this gives the RMS, which is related to the power of the signal, so you might do something like this if you are trying to indicate signal loudness or intensity, or to approximate an analog meter. The average of absolutes may be more robust against extreme values, but is less commonly used.
4. Another approach is to simply look for the peak of the absolute value over some number of samples. This is commonly done when drawing waveforms, and for digital "peak" meters. It makes less sense to look at the minimum absolute.
Once you've done something like the above, yes you may want to compute the log of the value you've found in order to display the signal in dB, but make sure you use the right formula. 10 * log_10( amplitude ) is not it. Rule of thumb: usually when computing logs from amplitude you will see a 20, not a 10. If you want to compute dBFS (the amount of "headroom" before clipping, which is the standard measurement for digital meters), the formula is -20 * log_10( |amplitude| ), where amplitude is normalized to +/- 1. Watch out for amplitude = 0, which gives an infinite headroom in dB.
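The measurements mentioned in these tips could be sketched as follows. The function names rms, peak_abs and headroom_db are made up for this example, and the samples are assumed to be already normalized to +/- 1.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Root of the mean of the squares; related to the power of the signal.
double rms(const std::vector<double>& samples) {
    assert(!samples.empty());
    double sum_sq = 0.0;
    for (double v : samples) sum_sq += v * v;
    return std::sqrt(sum_sq / static_cast<double>(samples.size()));
}

// Largest absolute value in the block, as used by digital "peak" meters.
double peak_abs(const std::vector<double>& samples) {
    assert(!samples.empty());
    double peak = 0.0;
    for (double v : samples) {
        const double a = std::fabs(v);
        if (a > peak) peak = a;
    }
    return peak;
}

// Headroom before clipping in dB, using the -20 * log10(|amplitude|) formula.
// An amplitude of 0 yields +infinity, as noted in the text.
double headroom_db(double amplitude) {
    return -20.0 * std::log10(std::fabs(amplitude));
}
```

For example, a block whose peak is 0.5 has roughly 6 dB of headroom.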

If I understand you correctly, you just want to estimate the relative loudness/quietness of an audio digital sample at a given point.
For this estimation, you don't need to use the FFT. However, your method of averaging the signal does not produce the appropriate picture either.
The digital signal is the value of the audio wave at a given moment. You need to find the overall amplitude of the signal at that given moment. You can somewhat see it as the local maximum value for a given interval around the moment you want to calculate. You may have a moving max for the signal and get your amplitude estimation.
With a 16-bit sound sample, the magnitude of the signal value can go from 0 up to 32767 (the signed samples themselves range from -32768 to 32767). At a 44.1 kHz sample rate, you can find peaks and pits over intervals of around 0.01 s by finding the max absolute value of the 441 samples around a given moment t.
max=1;
for (i=0; i<441; i++) if (abs(array[t*44100+i])>max) max=abs(array[t*44100+i]);
Then, to represent it on a 0 to 1 scale (not really 0, because we used a minimum of 1):
amplitude = max / 32767.0;
or you might represent it in relative dB logarithmic scale (here you see why we used 1 for the minimum value)
dB = 20 * log10(amplitude);

All you need to do is take dy/dx, which you can get approximately by just scanning through the wave and subtracting the previous value from the current one, and look at where it goes to zero or changes sign (positive to negative at a peak, negative to positive at a trough).
I kept this code brief and simple-minded for the sake of brevity. Of course you could handle the case of dy being zero better, find the 'centre' of a long flat peak, that kind of thing. But if all you need is basic peaks and troughs, this will find them.
lastY = 0;
bool goingup = true;
for( i = 0; i < wave.length; i++ ) {
    y = wave[i];
    dy = y - lastY;
    bool stillgoingup = (dy > 0);
    if( stillgoingup != goingup ) {
        // changed direction - the turning point is the previous sample:
        // note its place (i - 1) and height (lastY)
        goingup = stillgoingup;
    }
    lastY = y;  // remember this sample for the next difference
}
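A self-contained C++ version of the same idea might look like the sketch below. TurningPoints and find_turning_points are hypothetical names. Skipping flat steps, so that a plateau is reported at its last sample, is one possible way to handle dy == 0 and is my own choice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Indices of the local maxima (peaks) and local minima (troughs).
struct TurningPoints {
    std::vector<std::size_t> peaks;
    std::vector<std::size_t> troughs;
};

TurningPoints find_turning_points(const std::vector<double>& wave) {
    TurningPoints tp;
    if (wave.size() < 3) return tp;
    // Skip an initial flat run to establish the first direction.
    std::size_t i = 1;
    while (i < wave.size() && wave[i] == wave[i - 1]) ++i;
    if (i == wave.size()) return tp;  // completely flat signal
    bool going_up = wave[i] > wave[i - 1];
    for (++i; i < wave.size(); ++i) {
        const double dy = wave[i] - wave[i - 1];
        if (dy == 0.0) continue;  // flat step: keep the current direction
        const bool still_up = dy > 0.0;
        if (still_up != going_up) {
            assert(i >= 1);
            // The previous sample is where the direction changed.
            (going_up ? tp.peaks : tp.troughs).push_back(i - 1);
            going_up = still_up;
        }
    }
    return tp;
}
```

On noisy audio this reports every tiny wiggle, so it is usually applied after some smoothing.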

Block dimensions in CUDA

I have a NVIDIA GTX 570 compute capability 2.0 running cuda-4.0.
The deviceQuery executable in the CUDA SDK gives me information on my CUDA device and its various properties. Two of the lines in the output are
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Why is the 3rd dimension of the block restricted to at most 64 threads, whereas the X and Y dimensions can go up to 1024 threads?

EDIT2: Also, please take this with a grain of salt; this is a purely hypothetical answer, or a guess. There may indeed be a clear hardware-based reason why 64 is the maximum. Frankly I don't know, and my answer is based on an assumption that there is no such hardware limit, per se.
It's probably a combination of three things: first, there is a limit to the number of threads which can be resident inside a block; second, block dimensions are typically in multiples of 32, and even more often in powers of 2 greater than 32; third, coordinate systems used in the solution of multi-dimensional problems are most often oriented so that you're looking at the scene directly (i.e., with the important bits more distributed in X and Y than in Z).
CUDA naturally has to support 1D access, as this is an immensely common and efficient access pattern when it is applicable. To support this, the X dimension must be allowed to vary over the entire range of 1024 threads.
To support 2D access, which is less common, CUDA should minimally support up to 512 in the X dimension (using the convention that the X dimension should be oriented in the coordinate system so that it measures the biggest spread) and 32 in the Y dimension. It must support up to 1024 in the X dimension, and I suppose they relax the requirement that the X dimension be no smaller than the Y dimension and allow the full 1024 range of Y values. However, in my understanding, 32 would have been plenty big for the Y dimension maximum.
To support 3D access, maintaining X, Y >= Z and trying to reach 1024, it seems to me that in the best case X=Y=Z=10; so there's no real argument for allowing Z to be greater than 10, given my assumptions.
In summary, I don't see why they couldn't have made the maximums (1024, 32, 10). My question is why make them (1024, 1024, 64)? The only answer I keep coming back to is to allow some flexibility to programmers to violate the X>=Y>=Z coordinate system convention.
Edit: given my summary and hypothetical answer, the real answer to your question is this: it's an arbitrary decision.
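Whatever the reason for the per-dimension maxima, the two deviceQuery lines quoted in the question apply together: each dimension has its own cap, and the product of the three is capped by the threads-per-block limit. A minimal host-side sketch of that check follows. BlockLimits and block_dims_valid are hypothetical names, and the default values are simply the numbers from the question, not something queried from a device.

```cpp
#include <cassert>

// Limits as reported by deviceQuery in the question (assumed, not queried).
struct BlockLimits {
    int max_x = 1024;
    int max_y = 1024;
    int max_z = 64;
    int max_threads = 1024;
};

// True if an (x, y, z) block shape satisfies both kinds of limit.
bool block_dims_valid(int x, int y, int z,
                      const BlockLimits& lim = BlockLimits{}) {
    assert(lim.max_threads > 0);
    if (x < 1 || y < 1 || z < 1) return false;
    if (x > lim.max_x || y > lim.max_y || z > lim.max_z) return false;
    // The total thread count is limited independently of the shape.
    return static_cast<long long>(x) * y * z <= lim.max_threads;
}
```

So (1024, 1, 1), (32, 32, 1), (16, 1, 64) and (10, 10, 10) are all accepted, while (1024, 1024, 64) is rejected even though each dimension is within its own maximum.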

My wild guess is that it is because threadIdx.x, threadIdx.y and threadIdx.z are kept in a single special 32-bit register, possibly even with some other additional data. Maybe the warp id? Or maybe a multiprocessor-block id to identify which block a given thread handles, if a given multiprocessor runs more than one?
This is purely speculative, I have no data to support it, but I would imagine that they want to have as few special registers as possible.
