AVX equivalent for _mm_storeu_ps? - sse

I have quite a fast AVX code, but it's just one single function using AVX, the rest of the huge project is on SSE2, so I do NOT want to set architecture to AVX. At the end of each iteration I need to convert the 4 doubles in one YMM register to 4 floats and store it like this:
__m256d y = ......;
_mm_storeu_ps((float*)dst + i, _mm256_cvtpd_ps(y));
But MSVC generates SSE2 code using movups (without the "v" prefix). Is there a way to force it to use just the one AVX instruction? It seems quite ridiculous to me that the only known way is to target AVX. I want to take advantage of AVX for just a single instruction. The Intel compiler apparently understands it, and since I'm using AVX auto-dispatch it works well there, but in general the Intel compiler doesn't seem the way to go right now; it's slow and the code is worse than MSVC's, well, except for this...

The AVX equivalent of _mm_storeu_ps(float*, __m128) is _mm256_storeu_ps(float*, __m256). To start off with, AVX is an instruction set: a CPU has to physically have the register space and FPU front end to be able to use it. In other words, AVX code given to a non-AVX processor won't run. You have to compile with /arch:AVX because AVX is the latest set and is backwards compatible with SSE, SSE2, SSE3, SSSE3, SSE4.1 and SSE4.2, but the inverse is not true. An AVX CPU implements all the SSE levels, while AVX is an instruction set foreign to SSE-only CPUs.
The memory operations are not particularly optimizable on a large scale because of CPU<-->RAM bandwidth shortcomings. If you are using very large arrays, you will not see the slightest difference between an SSE4 loadu and an AVX loadu: a 6-lane highway will only ever have the capacity to handle so many cars, no matter how many you ask of it at one time. This tends not to be the case with smaller arrays, where you can hide the latency of loading them behind other work. You should also not mix AVX and SSE instructions, not only because of design complications but also because of internal CPU complications.
Furthermore, you do not want to keep switching the FPU state between AVX and SSE, because that in itself causes a performance overhead that most people do not consider. Either keep large homogeneous sections of the code in one state (SSE or AVX), or make everything SSE, or everything AVX.

Related

Any way to convert unsigned char to short using AVX512 CPU intrinsics?

I am just reading through the AVX512 CPU intrinsic sets for Xeon Phi processors, but it seems that the traditional SSE data-type conversion methods don't work in AVX512. So may I ask: is there a similar instruction set in AVX512 that can convert an unsigned char array to a short array? Thanks in advance!
Knights Landing (KNL) unfortunately does not have the AVX512BW subset of instructions, which includes operations on 8-bit and 16-bit quantities. Otherwise you could just use _mm512_cvtepu8_epi16.
Eventually the forthcoming Skylake Xeon (Purley, due out in 2017, not to be confused with existing Skylake CPUs) should have AVX-512 including the AVX512BW subset, but until then you're out of luck, although you can of course still use SSE and AVX2 on KNL to do this kind of thing.

iOS Metal compute pipeline slower than CPU implementation for search task

I made a simple experiment, implementing a naive character-search algorithm over 1,000,000 rows of 50 characters each (a 50-million-char map) on both the CPU and the GPU (using the iOS 8 Metal compute pipeline).
The CPU implementation uses a simple loop; the Metal implementation gives each kernel one row to process (source code below).
To my surprise, the Metal implementation is on average 2-3 times slower than the simple, linear CPU version (if I use 1 core) and 3-4 times slower if I employ 2 cores (each of them searching half of the database)!
I experimented with different threads-per-group values (16, 32, 64, 128, 512) yet still get very similar results.
iPhone 6:
CPU 1 core: approx 0.12 sec
CPU 2 cores: approx 0.075 sec
GPU: approx 0.35 sec (release mode, validation disabled)
I can see the Metal shader spends more than 90% of its time accessing memory (see below).
What can be done to optimise it?
Any insights will be appreciated, as there are not many sources on the internet (besides the standard Apple programming guides) providing details on memory-access internals and the trade-offs specific to the Metal framework.
METAL IMPLEMENTATION DETAILS:
Host code gist:
https://gist.github.com/lukaszmargielewski/0a3b16d4661dd7d7e00d
Kernel (shader) code:
https://gist.github.com/lukaszmargielewski/6b64d06d2d106d110126
GPU frame capture profiling results:
The GPU shader is also striding vertically through memory, whereas the CPU is moving horizontally. Consider the addresses actually touched more or less concurrently by each thread executing in lockstep in your shader as you read charTable. The GPU will probably run a good deal faster if your charTable matrix is transposed.
Also, because this code executes in a SIMD fashion, each GPU thread will probably have to run the loop to the full search phrase length, whereas the CPU will get to take advantage of early outs. The GPU code might actually run a little faster if you remove the early outs and just keep the code simple. Much depends on the search phrase length and likelihood of a match.
I'll take my guesses too: the GPU isn't optimized for if/else, and it doesn't predict branches (it probably executes both sides). Try to rewrite the algorithm in a more linear way, without any conditionals, or reduce them to a bare minimum.

Difference between the AVX instructions vxorpd and vpxor

According to the Intel Intrinsics Guide,
vxorpd ymm, ymm, ymm: Compute the bitwise XOR of packed double-precision (64-bit) floating-point elements in a and b, and store the results in dst.
vpxor ymm, ymm, ymm: Compute the bitwise XOR of 256 bits (representing integer data) in a and b, and store the result in dst.
What is the difference between the two? It appears to me that both instructions would do a bitwise XOR on all 256 bits of the ymm registers. Is there any performance penalty if I use vxorpd for integer data (and vice versa)?
Combining some comments into an answer:
Other than performance, they have identical behaviour (I think even with a memory argument: same lack of alignment requirements for all AVX instructions).
On Nehalem to Broadwell, (V)PXOR can run on any of the 3 ALU execution ports, p0/p1/p5. (V)XORPS/D can only run on p5.
Some CPUs have a "bypass delay" between integer and FP "domains". Agner Fog's microarch docs say that on SnB / IvB, the bypass delay is sometimes zero. e.g. when using the "wrong" type of shuffle or boolean operation. On Haswell, his examples show that orps has no extra latency when used on the result of an integer instruction, but that por has an extra 1 clock of latency when used on the result of addps.
On Skylake, FP booleans can run on any port, but bypass delay depends on which port they happened to run on. (See Intel's optimization manual for a table). Port5 has no bypass delay between FP math ops, but port 0 or port 1 do. Since the FMA units are on port 0 and 1, the uop issue stage will usually assign booleans to port5 in FP heavy code, because it can see that lots of uops are queued up for p0/p1 but p5 is less busy. (How are x86 uops scheduled, exactly?).
I'd recommend not worrying about this. Tune for Haswell, and Skylake will do fine. Or just always use VPXOR on integer data and VXORPS on FP data, and Skylake will do fine (but Haswell might not).
On AMD Bulldozer / Piledriver / Steamroller there is no "FP" version of the boolean ops (see p. 182 of Agner Fog's microarch manual). There is a delay for forwarding data between execution units: 1 cycle for ivec->fp or fp->ivec, 10 cycles for int->ivec (eax -> xmm0), and 8 cycles for ivec->int (8 and 10 on Bulldozer; 4 and 5 on Steamroller for movd/pinsrw/pextrw). So in any case, you can't avoid the bypass delay on AMD by picking the appropriate boolean insn. XORPS does take one less byte to encode than PXOR or XORPD (non-VEX version; VEX versions all take 4 bytes).
In any case, bypass delays are just extra latency, not reduced throughput. If these ops aren't part of the longest dep chain in your inner loop, or if you can interleave two iterations in parallel (so you have multiple dependency chains going at once for out-of-order-execution), then PXOR may be the way to go.
On Intel CPUs before Skylake, packed-integer instructions can always run on more ports than their floating-point counterparts, so prefer integer ops.

why should the disparity value in sgbm be divisble by 16?

I am working with OpenCV's sgbm (semi-global block matching) function, which takes two parameters, minDisparity and numberOfDisparities.
Why does the numberOfDisparities value have to be divisible by 16?
Probably to simplify the code internally, which uses SSE2. In general, SSE2 instructions:
Work on multiple numbers simultaneously; having the total number of pieces of information be evenly divisible by the vector width makes things simpler.
Require 128-bit (16-byte) memory alignment for aligned loads and stores; alignment can be more easily maintained when things are nice multiples of 16.
If you examine the OpenCV source code, you'll see lots of SSE2 code for the SGBM algorithm.

What is "vectorization"?

Several times now I've encountered this term, in MATLAB, Fortran and elsewhere, but I've never found an explanation of what it means and what it does. So I'm asking here: what is vectorization, and what does it mean, for example, that "a loop is vectorized"?
Many CPUs have "vector" or "SIMD" instruction sets which apply the same operation simultaneously to two, four, or more pieces of data. Modern x86 chips have the SSE instructions, many PPC chips have the "Altivec" instructions, and even some ARM chips have a vector instruction set, called NEON.
"Vectorization" (simplified) is the process of rewriting a loop so that instead of processing a single element of an array N times, it processes (say) 4 elements of the array simultaneously N/4 times.
I chose 4 because it's what modern hardware is most likely to directly support for 32-bit floats or ints.
The difference between vectorization and loop unrolling:
Consider the following very simple loop that adds the elements of two arrays and stores the results to a third array.
for (int i=0; i<16; ++i)
C[i] = A[i] + B[i];
Unrolling this loop would transform it into something like this:
for (int i=0; i<16; i+=4) {
    C[i]   = A[i]   + B[i];
    C[i+1] = A[i+1] + B[i+1];
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
}
Vectorizing it, on the other hand, produces something like this:
for (int i=0; i<16; i+=4)
addFourThingsAtOnceAndStoreResult(&C[i], &A[i], &B[i]);
Where "addFourThingsAtOnceAndStoreResult" is a placeholder for whatever intrinsic(s) your compiler uses to specify vector instructions.
Terminology:
Note that most modern ahead-of-time compilers are able to auto vectorize very simple loops like this, which can often be enabled via a compile option (on by default with full optimization in modern C and C++ compilers, like gcc -O3 -march=native). OpenMP #pragma omp simd is sometimes helpful to hint the compiler, especially for "reduction" loops like summing an FP array where vectorization requires pretending that FP math is associative.
More complex algorithms still require help from the programmer to generate good vector code; we call this manual vectorization, often with intrinsics like x86 _mm_add_ps that map to a single machine instruction as in SIMD prefix sum on Intel cpu or How to count character occurrences using SIMD. Or even use SIMD for short non-looping problems like Most insanely fastest way to convert 9 char digits into an int or unsigned int or How to convert a binary integer number to a hex string?
The term "vectorization" is also used to describe a higher-level software transformation where you abstract away the loop altogether and describe operating on whole arrays instead of the elements that comprise them, e.g. writing C = A + B in some language that allows that when those are arrays or matrices, unlike C or C++. In lower-level languages like C or C++, calling BLAS or Eigen library functions instead of manually writing loops could also be described as a vectorized programming style. Some other answers to this question focus on that meaning of vectorization and on higher-level languages.
Vectorization is the term for converting a scalar program to a vector program. Vectorized programs can run multiple operations from a single instruction, whereas a scalar program can only operate on one pair of operands at a time.
From wikipedia:
Scalar approach:
for (i = 0; i < 1024; i++)
{
    C[i] = A[i] * B[i];
}
Vectorized approach:
for (i = 0; i < 1024; i += 4)
{
    C[i:i+3] = A[i:i+3] * B[i:i+3];
}
Vectorization is used heavily in scientific computing, where huge chunks of data need to be processed efficiently.
In real programming applications, I know it's used in NumPy (I'm not sure about others).
NumPy (a package for scientific computing in Python) uses vectorization for fast manipulation of n-dimensional arrays, which is generally slower if done with the built-in Python options for handling arrays.
Although tons of explanations are out there, here is how vectorization is defined in the NumPy documentation page:
Vectorization describes the absence of any explicit looping, indexing, etc., in the code - these things are taking place, of course, just “behind the scenes” in optimized, pre-compiled C code. Vectorized code has many advantages, among which are:
vectorized code is more concise and easier to read
fewer lines of code generally means fewer bugs
the code more closely resembles standard mathematical notation (making it easier, typically, to correctly code mathematical constructs)
vectorization results in more “Pythonic” code. Without vectorization, our code would be littered with inefficient and difficult-to-read for loops.
Vectorization, in simple words, means optimizing the algorithm so that it can utilize the SIMD instructions of the processor.
AVX, AVX2 and AVX-512 are (Intel) instruction sets that perform the same operation on multiple data in one instruction. For example, AVX-512 lets you operate on 16 integer values (4 bytes each) at a time. Say you have a vector of 16 integers and you want to double each value and then add 10 to it. You can either load the values into a general register 16 times and perform the operation 16 times, or load all 16 values into a SIMD register (zmm) and perform the operation once. This speeds up the computation on vector data.
In vectorization we use this to our advantage, by remodelling our data so that we can perform SIMD operations on it and speed up the program.
The only problem with vectorization is handling conditions, because conditions branch the flow of execution. This can be handled by masking: modelling the condition as an arithmetic operation. For example, if we want to add 10 to every value greater than 100, we can either write
if (x[i] > 100) x[i] += 10; // this will branch the execution flow
or model the condition as an arithmetic operation, creating a condition (mask) vector c:
c[i] = -(x[i] > 100); // all-ones mask where the condition holds, zero elsewhere
x[i] = x[i] + (c[i] & 10); // using the mask
This is a very trivial example, of course; c is our mask vector, which we use to perform the binary operation based on its value. This avoids branching of the execution flow and enables vectorization.
Vectorization is as important as parallelization, so we should make use of it as much as possible. All modern-day processors have SIMD instructions for heavy compute workloads. We can optimize our code to use these SIMD instructions via vectorization, similarly to parallelizing our code to run on the multiple cores available on modern processors.
I would like to close by mentioning OpenMP, which lets you vectorize code using pragmas. I consider it a good starting point. The same can be said for OpenACC.
It refers to the ability to do a single mathematical operation on a list -- or "vector" -- of numbers in a single step. You see it often in connection with Fortran because that's associated with scientific computing, which is associated with supercomputing, where vectorized arithmetic first appeared. Nowadays almost all desktop CPUs offer some form of vectorized arithmetic, through technologies like Intel's SSE. GPUs also offer a form of vectorized arithmetic.
Intel's explanation, I think, is easy to grasp:
Vectorization is the process of converting an algorithm from operating
on a single value at a time to operating on a set of values at one
time. Modern CPUs provide direct support for vector operations where a
single instruction is applied to multiple data (SIMD).
For example, a CPU with a 512-bit register could hold 16 32-bit
single-precision floats and do a single calculation:
16 times faster than executing a single instruction at a time. Combining
this with threading and multi-core CPUs leads to orders-of-magnitude
performance gains.
Link https://software.intel.com/en-us/articles/vectorization-a-key-tool-to-improve-performance-on-modern-cpus
In Java, there is an option for this to be included in JDK 15 (2020) or later in JDK 16 (2021). See this official issue.
Vectorization refers to all the techniques that convert a scalar implementation, in which a single operation processes a single entity at a time, into a vector implementation, in which a single operation processes multiple entities at the same time.
Vectorization refers to a technique with the help of which we optimize code to work with huge chunks of data efficiently. Applications of vectorization are seen in scientific libraries like NumPy and pandas; you can also use the technique while working with MATLAB, image processing, NLP, and much more. Overall it optimizes the runtime and memory allocation of the program.
I would define vectorisation as a feature of a given language where the responsibility for how to iterate over the elements of a collection can be delegated from the programmer (e.g. an explicit loop over the elements) to some method provided by the language (e.g. an implicit loop).
Now, why do we ever want to do that ?
Code readability. In some (but not all!) cases, operating over the entire collection at once rather than over its elements is easier to read and quicker to code;
Some interpreted languages (R, Python, Matlab... but not Julia, for example) are really slow at processing explicit loops. In these cases vectorisation uses, under the hood, compiled instructions for the element-by-element processing and can be several orders of magnitude faster than processing each programmer-specified loop operation;
Most modern CPUs (and, nowadays, GPUs) have built-in parallelization that is exploitable when we use the vectorisation method provided by the language rather than our self-implemented ordering of operations over the elements;
In a similar way, our programming language of choice will likely use, for some vectorised operations (e.g. matrix operations), software libraries (e.g. BLAS/LAPACK) that exploit the multi-threading capabilities of the CPU, another form of parallel computation.
Note that for points 3 and 4 some languages (Julia notably) allow these hardware parallelizations to be exploited also using programmer-defined order processing (e.g. for loops), but this happens automatically and under the hood when using the vectorisation method provided by the language.
Now, while vectorisation has many advantages, sometimes an algorithm is more intuitively expressed using an explicit loop than vectorisation (where perhaps we need to resort to complex linear algebra operations, identity and diagonal matrices... all to retain our "vectorised" approach), and if using an explicit ordering form has no computational disadvantages, this one should be preferred.
See the two answers above. I just wanted to add that the reason for wanting to do vectorization is that these operations can easily be performed in parallel by supercomputers and multi-processors, yielding a big performance gain. On single-processor computers there will be no performance gain.
