Is there an SIMD instruction to achieve batch array memory index mapping? - sse

In my RGB to grey case:
Y = (77*R + 150*G + 29*B) >> 8;
I know SIMD (NEON, SSE2) can do like:
foreach 8 elements:
{A0,A1,A2,A3,A4,A5,A6,A7} = 77*{R0,R1,R2,R3,R4,R5,R6,R7}
{B0,B1,B2,B3,B4,B5,B6,B7} = 150*{G0,G1,G2,G3,G4,G5,G6,G7}
{C0,C1,C2,C3,C4,C5,C6,C7} = 29*{B0,B1,B2,B3,B4,B5,B6,B7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {A0,A1,A2,A3,A4,A5,A6,A7} + {B0,B1,B2,B3,B4,B5,B6,B7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} + {C0,C1,C2,C3,C4,C5,C6,C7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} >> 8
However, the multiply instruction take at least 2 clock cycles, and R,G,B in [0-255],
we can use three lookup table(an array, length=256) to store the partial result of
77*R(mark as X), 150*G(mark as Y), 29*B(mark as Z).
So I'm looking for instructions can do the intention:
foreach 8 elements:
{A0,A1,A2,A3,A4,A5,A6,A7} = {X[R0],X[R1],X[R2],X[R3],X[R4],X[R5],X[R6],X[R7]}
{B0,B1,B2,B3,B4,B5,B6,B7} = {Y[G0],Y[G1],Y[G2],Y[G3],Y[G4],Y[G5],Y[G6],Y[G7]}
{C0,C1,C2,C3,C4,C5,C6,C7} = {Z[B0],Z[B1],Z[B2],Z[B3],Z[B4],Z[B5],Z[B6],Z[B7]}
{D0,D1,D2,D3,D4,D5,D6,D7} = {A0,A1,A2,A3,A4,A5,A6,A7} + {B0,B1,B2,B3,B4,B5,B6,B7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} + {C0,C1,C2,C3,C4,C5,C6,C7}
{D0,D1,D2,D3,D4,D5,D6,D7} = {D0,D1,D2,D3,D4,D5,D6,D7} >> 8
Any good suggestions?

There are no byte or word gather instructions in AVX2 / AVX512, and no gathers at all in NEON. The DWORD gathers that do exist are much slower than a multiply! e.g. one per 5 cycle throughput for vpgatherdd ymm,[reg + scale*ymm], ymm, according to Agner Fog's instruction table for Skylake.
You can use shuffles as a parallel table-lookup. But your table for each lookup is 256 16-bit words. That's 512 bytes. AVX512 has some shuffles that select from the concatenation of 2 registers, but that's "only" 2x 64 bytes, and the byte or word element-size versions of those are multiple uops on current CPUs. (e.g. AVX512BW vpermi2w). They are still fantastically powerful compared to vpshufb, though.
So using a shuffle as a LUT won't work in your case, but it does work very well for some cases, e.g. for popcount you can split bytes into 4-bit nibbles and use vpshufb to do 32 lookups in parallel from a 16-element table of bytes.
Normally for SIMD you want to replace table lookups with computation, because computation is much more SIMD friendly.
Suck it up and use pmullw / _mm_mullo_epi16. You have instruction-level parallelism, and Skylake has 2 per clock throughput for 16-bit SIMD multiply (but 5 cycle latency). For image processing, normally throughput matters more than latency, as long as you keep the latency within reason so out-of-order execution can hide it.
If your multipliers ever have few enough 1 bits in their binary representation, you could consider using shift/add instead of an actual multiply. e.g. B * 29 = B * 32 - B - B * 2. Or B<<5 - B<<1 - B. That many instructions probably has more throughput cost than a single multiply, though. If you could do it with just 2 terms, it might be worth it. (But then again, still maybe not, depending on the CPU. Total instruction throughput and vector ALU bottlenecks are a big deal.)

Related

Can dispatch overhead be more expensive than an actual thread work?

Recently I'm thinking about possibilities for hard optimization. I mean those kind of optimization when you sometimes hardcode the loop from 3 iterations just to get something.
So one thought came to my mind. Imagine we have a buffer of 1024 elements. We want to multiply every single element of it by 2. And we create a simple kernel, where we pass a buffer, outBuffer, their size (to check if we outside of the bounds) and [[thread_position_in_grid]]. Then we just do a simple multiplaction and write that number to another buffer.
It will look a bit like that:
kernel void multiplyBy2(constant float* in [[buffer(0)]],
device float* out [[buffer(1)]],
constant Uniforms& uniforms [[buffer(2)]],
uint gid [[thread_position_in_grid]])
{
if (gid >= uniforms.buffer_size) { return; }
out[gid] = in[gid] * 2.0;
}
The thing I'm concerned about is If the actual thread work still worth the overhead that is produced by it's dispatching?
Would it be more effective to, for example, dispatch 4 times less threads, that do something like that
out[gid * 4 + 0] = in[gid + 0] * 2.0;
out[gid * 4 + 1] = in[gid + 1] * 2.0;
out[gid * 4 + 2] = in[gid + 2] * 2.0;
out[gid * 4 + 3] = in[gid + 3] * 2.0;
So that thread can work a little bit longer? Or it is better to make threads as thin as possible?
Yes, and this is true not merely in contrived examples, but in some real-world scenarios too.
For extremely simple kernels like yours, the dispatch overhead can swamp the work to be done, but there's another factor that may have an even bigger effect on performance: sharing fetched data and intermediate results.
If you have a kernel that, for example, reads the 3x3 neighborhood of a pixel from an input texture and writes the average to an output texture, you could share the fetched texture data and partial sums between adjacent pixels by operating on more than one pixel in your kernel function and reducing the total number of threads you dispatch.
Perhaps this sates your curiosity. For any practical application, Scott Hunter is right that you should profile on all target devices before and after optimizing.

Select an integer number of periods

Suppose we have sinusoidal with frequency 100Hz and sampling frequency of 1000Hz. It means that our signal has 100 periods in a second and we are taking 1000 samples in a second. Therefore, in order to select a complete period I'll have to take fs/f=10 samples. Right?
What if the sampling period is not a multiple of the frequency of the signal (like 550Hz)? Do I have to find the minimum multiple M of f and fs, and than take M samples?
My goal is to select an integer number of periods in order to be able to replicate them without changes.
You have f periods a second, and fs samples a second.
If you take M samples, it would cover M/fs part of a second, or P = f * (M/fs) periods. You want this number to be integer.
So you need to take M = fs / gcd(f, fs) samples.
For your example P = 1000 / gcd(100, 1000) = 1000 / 100 = 10.
If you have 60 Hz frequency and 80 Hz sampling frequency, it gives P = 80 / gcd(60, 80) = 80 / 20 = 4 -- 4 samples will cover 4 * 1/80 = 1/20 part of a second, and that will be 3 periods.
If you have 113 Hz frequency and 512 Hz sampling frequency, you are out of luck, since gcd(113, 512) = 1 and you'll need 512 samples, covering the whole second and 113 periods.
In general, an arbitrary frequency will not have an integer number of periods. Irrational frequencies will never even repeat ever. So some means other than concatenation of buffers one period in length will be needed to synthesize exactly periodic waveforms of arbitrary frequencies. Approximation by interpolation for fractional phase offsets is one possibility.

CUDA coalesced access to global memory

I have read CUDA programming guide, but i missed one thing. Let's say that i have array of 32bit int in global memory and i want to copy it to shared memory with coalesced access.
Global array has indexes from 0 to 1024, and let's say i have 4 blocks each with 256 threads.
__shared__ int sData[256];
When is coalesced access performed?
1.
sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y];
Adresses in global memory are copied from 0 to 255, each by 32 threads in warp, so here it's ok?
2.
sData[threadIdx.x] = gData[threadIdx.x * blockIdx.x+gridDim.x*blockIdx.y + someIndex];
If someIndex is not multiple of 32 it is not coalesced? Misaligned adresses? Is that correct?
What you want ultimately depends on whether your input data is a 1D or 2D array, and whether your grid and blocks are 1D or 2D. The simplest case is both 1D:
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + threadIdx.x];
This is coalesced. The rule of thumb I use is that the most rapidly varying coordinate (the threadIdx) is added on as offset to the block offset (blockDim * blockIdx). The end result is that the indexing stride between threads in the block is 1. If the stride gets larger, then you lose coalescing.
The simple rule (on Fermi and later GPUs) is that if the addresses for all threads in a warp fall into the same aligned 128-byte range, then a single memory transaction will result (assuming caching is enabled for the load, which is the default). If they fall into two aligned 128-byte ranges, then two memory transactions result, etc.
On GT2xx and earlier GPUs, it gets more complicated. But you can find the details of that in the programming guide.
Additional examples:
Not coalesced:
shmem[threadIdx.x] = gmem[blockDim.x + blockIdx.x * threadIdx.x];
Not coalesced, but not too bad on GT200 and later:
stride = 2;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];
Not coalesced at all:
stride = 32;
shmem[threadIdx.x] = gmem[blockDim.x * blockIdx.x + stride * threadIdx.x];
Coalesced, 2D grid, 1D block:
int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.x] = gmem[blockIdx.y * elementPitch +
blockIdx.x * blockDim.x + threadIdx.x];
Coalesced, 2D grid and block:
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
int elementPitch = blockDim.x * gridDim.x;
shmem[threadIdx.y * blockDim.x + threadIdx.x] = gmem[y * elementPitch + x];
Your indexing at 1 is wrong (or intentionally so strange it seems wrong), some blocks access same element in each thread, so there is no way for coalesced access in these blocks.
Proof:
Example:
Grid = dim(2,2,0)
t(blockIdx.x, blockIdx.y)
//complete block reads at 0
t(0,0) -> sData[threadIdx.x] = gData[0];
//complete block reads at 2
t(0,1) -> sData[threadIdx.x] = gData[2];
//definetly coalesced
t(1,0) -> sData[threadIdx.x] = gData[threadIdx.x];
//not coalesced since 2 is no multiple of a half of the warp size = 16
t(1,1) -> sData[threadIdx.x] = gData[threadIdx.x + 2];
So its a "luck" game if a block is coalesced, so in general No
But coalesced memory reads rules are not as strict on newer cuda versions as before.
But for compatibility issues you should try to optimise kernels for lowest cuda versions, if it is possible.
Here is some nice source:
http://mc.stanford.edu/cgi-bin/images/0/0a/M02_4.pdf
The rules for which accesses can be coalesced are somewhat complicated and they have changed over time. Each new CUDA architecture is more flexible in what it can coalesce. I would say not to worry about it at first. Instead, do the memory accesses in whatever way is the most convenient and then see what the CUDA profiler says.
Your examples are correct if you intended to use a 1D grid and thread-geometry. I think the indexing you intended to use is [blockIdx.x*blockDim.x + threadIdx.x].
With #1, the 32 threads in a warp execute that instruction 'simultaneously' so their requests, which are sequential and aligned to 128B (32 x 4), are coalesced in both Tesla and Fermi architectures, I believe.
With #2, it is a bit blurry. If someIndex is 1, then it won't coalesce all of the 32 requests in a warp, but it might do partial coalescing. I believe Fermi devices will coalesce the accesses for threads 1-31 in a warp as a part of a 128B sequential segment of memory (and the first 4B, which no thread needs, are wasted). I think Tesla architecture devices would make that an uncoalesced access due to the misalignment, but I am not sure.
With someIndex as, say, 8, Tesla will have 32B aligned addresses, and Fermi might group them as 32B, 64B, and 32B. But the bottom line is, depending on the value of someIndex and the architecture, what happens is blurry, and it won't necessarily be terrible.

Out of memory exception for a matrix

I have the "'System.OutOfMemoryException" exception for this simple code (a 10 000 * 10 000 matrix) multiplied by itself:
#time
#r "Microsoft.Office.Interop.Excel"
#r "FSharp.PowerPack.dll"
open System
open System.IO
open Microsoft.FSharp.Math
open System.Collections.Generic
let mutable Matrix1 = Matrix.create 10000 10000 0.
let matrix4 = Matrix1 * Matrix1
I have the following error:
System.OutOfMemoryException: An exception 'System.OutOfMemoryException' has been raised
Microsoft.FSharp.Collections.Array2DModule.ZeroCreate[T](Int32 length1, Int32 length2)
Microsoft.FSharp.Math.DoubleImpl.mulDenseMatrixDS(DenseMatrix`1 a, DenseMatrix`1 b)
Microsoft.FSharp.Math.SpecializedGenericImpl.mulM[a](Matrix`1 a, Matrix`1 b)
<StartupCode$FSI_0004>.$FSI_0004.main#() dans C:\Users\XXXXXXX\documents\visual studio 2010\Projects\Library1\Library1\Module1.fs:line 92
Stop due to an error
I have therefore 2 questions:
I have a 8 GB memory on my computer and according to my calculation a 10 000 * 10 000 matrix should take 381 MB [computed this way : 10 000 * 10 000 = 100 000 000 integers in the matrix => 100 000 000 * 4 bytes (integers of 32 bits) = 400 000 000 => 400 000 000 / (1024*1024) = 381 MB] so I cannot understand why there is an OutOfMemoryException
More generally (it's not the case here I think), I have the impression that F# interactive registers all the data and therefore overloads the memory, do you know of a way to free all the data registered by F# interactive without exiting F#?
In summary, fsi is a 32-bit process; at most it can hold 2GB of data. Run your test as a 64-bit Windows application; you can increase the size of the matrix, but it still has 2GB limit of .NET objects.
I correct your calculation a little bit. Matrix1 is a float matrix, so each element occupies 8 bytes in memory. The total size of Matrix1 and matrix4 in memory is at least:
2 * 10000 * 10000 * 8 = 1 600 000 000 bytes ~ 1.6 GB
(ignoring some bookkeeping parts of matrix)
So it's no surprise when fsi*32 runs out of memory in this case.
Execute the test as a 64-bit Windows process, you can create float matrices of size around 15000 but not more than that. Check out this informative article for concrete numbers with different types of matrix elements.
The amount of physical memory on your computer is not the relevant bottleneck - see Eric Lippert's great blog post for more information.

linear transformation function

I need to write a function that takes 4 bytes as input, performs a reversible linear transformation on this, and returns it as 4 bytes.
But wait, there is more: it also has to be distributive, so changing one byte on the input should affect all 4 output bytes.
The issues:
if I use multiplication it won't be reversible after it is modded 255 via the storage as a byte (and its needs to stay as a byte)
if I use addition it can't be reversible and distributive
One solution:
I could create an array of bytes 256^4 long and fill it in, in a one to one mapping, this would work, but there are issues: this means I have to search a graph of size 256^8 due to having to search for free numbers for every value (should note distributivity should be sudo random based on a 64*64 array of byte). This solution also has the MINOR (lol) issue of needing 8GB of RAM, making this solution nonsense.
The domain of the input is the same as the domain of the output, every input has a unique output, in other words: a one to one mapping. As I noted on "one solution" this is very possible and I have used that method when a smaller domain (just 256) was in question. The fact is, as numbers get big that method becomes extraordinarily inefficient, the delta flaw was O(n^5) and omega was O(n^8) with similar crappiness in memory usage.
I was wondering if there was a clever way to do it. In a nutshell, it's a one to one mapping of domain (4 bytes or 256^4). Oh, and such simple things as N+1 can't be used, it has to be keyed off a 64*64 array of byte values that are sudo random but recreatable for reverse transformations.
Balanced Block Mixers are exactly what you're looking for.
Who knew?
Edit! It is not possible, if you indeed want a linear transformation. Here's the mathy solution:
You've got four bytes, a_1, a_2, a_3, a_4, which we'll think of as a vector a with 4 components, each of which is a number mod 256. A linear transformation is just a 4x4 matrix M whose elements are also numbers mod 256. You have two conditions:
From Ma, we can deduce a (this means that M is an invertible matrix).
If a and a' differ in a single coordinate, then Ma and Ma' must differ in every coordinate.
Condition (2) is a little trickier, but here's what it means. Since M is a linear transformation, we know that
M(a - a) = Ma - Ma'
On the left, since a and a' differ in a single coordinate, a - a has exactly one nonzero coordinate. On the right, since Ma and Ma' must differ in every coordinate, Ma - Ma' must have every coordinate nonzero.
So the matrix M must take a vector with a single nonzero coordinate to one with all nonzero coordinates. So we just need every entry of M to be a non-zero-divisor mod 256, i.e., to be odd.
Going back to condition (1), what does it mean for M to be invertible? Since we're considering it mod 256, we just need its determinant to be invertible mod 256; that is, its determinant must be odd.
So you need a 4x4 matrix with odd entries mod 256 whose determinant is odd. But this is impossible! Why? The determinant is computed by summing various products of entries. For a 4x4 matrix, there are 4! = 24 different summands, and each one, being a product of odd entries, is odd. But the sum of 24 odd numbers is even, so the determinant of such a matrix must be even!
Here are your requirements as I understand them:
Let B be the space of bytes. You want a one-to-one (and thus onto) function f: B^4 -> B^4.
If you change any single input byte, then all output bytes change.
Here's the simplest solution I have thusfar. I have avoided posting for a while because I kept trying to come up with a better solution, but I haven't thought of anything.
Okay, first of all, we need a function g: B -> B which takes a single byte and returns a single byte. This function must have two properties: g(x) is reversible, and x^g(x) is reversible. [Note: ^ is the XOR operator.] Any such g will do, but I will define a specific one later.
Given such a g, we define f by f(a,b,c,d) = (a^b^c^d, g(a)^b^c^d, a^g(b)^c^d, a^b^g(c)^d). Let's check your requirements:
Reversible: yes. If we XOR the first two output bytes, we get a^g(a), but by the second property of g, we can recover a. Similarly for the b and c. We can recover d after getting a,b, and c by XORing the first byte with (a^b^c).
Distributive: yes. Suppose b,c, and d are fixed. Then the function takes the form f(a,b,c,d) = (a^const, g(a)^const, a^const, a^const). If a changes, then so will a^const; similarly, if a changes, so will g(a), and thus so will g(a)^const. (The fact that g(a) changes if a does is by the first property of g; if it didn't then g(x) wouldn't be reversible.) The same holds for b and c. For d, it's even easier because then f(a,b,c,d) = (d^const, d^const, d^const, d^const) so if d changes, every byte changes.
Finally, we construct such a function g. Let T be the space of two-bit values, and h : T -> T the function such that h(0) = 0, h(1) = 2, h(2) = 3, and h(3) = 1. This function has the two desired properties of g, namely h(x) is reversible and so is x^h(x). (For the latter, check that 0^h(0) = 0, 1^h(1) = 3, 2^h(2) = 1, and 3^h(3) = 2.) So, finally, to compute g(x), split x into four groups of two bits, and take h of each quarter separately. Because h satisfies the two desired properties, and there's no interaction between the quarters, so does g.
I'm not sure I understand your question, but I think I get what you're trying to do.
Bitwise Exclusive Or is your friend.
If R = A XOR B, R XOR A gives B and R XOR B gives A back. So it's a reversible transformation, assuming you know the result and one of the inputs.
Assuming I understood what you're trying to do, I think any block cipher will do the job.
A block cipher takes a block of bits (say 128) and maps them reversibly to a different block with the same size.
Moreover, if you're using OFB mode you can use a block cipher to generate an infinite stream of pseudo-random bits. XORing these bits with your stream of bits will give you a transformation for any length of data.
I'm going to throw out an idea that may or may not work.
Use a set of linear functions mod 256, with odd prime coefficients.
For example:
b0 = 3 * a0 + 5 * a1 + 7 * a2 + 11 * a3;
b1 = 13 * a0 + 17 * a1 + 19 * a2 + 23 * a3;
If I remember the Chinese Remainder Theorem correctly, and I haven't looked at it in years, the ax are recoverable from the bx. There may even be a quick way to do it.
This is, I believe, a reversible transformation. It's linear, in that af(x) mod 256 = f(ax) and f(x) + f(y) mod 256 = f(x + y). Clearly, changing one input byte will change all the output bytes.
So, go look up the Chinese Remainder Theorem and see if this works.
What you mean by "linear" transformation?
O(n), or a function f with f(c * (a+b)) = c * f(a) + c * f(b)?
An easy approach would be a rotating bitshift (not sure if this fullfils the above math definition). Its reversible and every byte can be changed. But with this it does not enforce that every byte is changed.
EDIT: My solution would be this:
b0 = (a0 ^ a1 ^ a2 ^ a3)
b1 = a1 + b0 ( mod 256)
b2 = a2 + b0 ( mod 256)
b3 = a3 + b0 ( mod 256)
It would be reversible (just subtract the first byte from the other, and then XOR the 3 resulting bytes on the first), and a change in one bit would change every byte (as b0 is the result of all bytes and impacts all others).
Stick all of the bytes into 32-bit number and then do a shl or shr (shift left or shift right) by one, two or three. Then split it back into bytes (could use a variant record). This will move bits from each byte into the adjacent byte.
There are a number of good suggestions here (XOR, etc.) I would suggest combining them.
You could remap the bits. Let's use ii for input and oo for output:
oo[0] = (ii[0] & 0xC0) | (ii[1] & 0x30) | (ii[2] & 0x0C) | (ii[3] | 0x03)
oo[1] = (ii[0] & 0x30) | (ii[1] & 0x0C) | (ii[2] & 0x03) | (ii[3] | 0xC0)
oo[2] = (ii[0] & 0x0C) | (ii[1] & 0x03) | (ii[2] & 0xC0) | (ii[3] | 0x30)
oo[3] = (ii[0] & 0x03) | (ii[1] & 0xC0) | (ii[2] & 0x30) | (ii[3] | 0x0C)
It's not linear, but significantly changing one byte in the input will affect all the bytes in the output. I don't think you can have a reversible transformation such as changing one bit in the input will affect all four bytes of the output, but I don't have a proof.

Resources