False autovectorization in Intel C compiler (icc) - sse

I need to vectorize with SSE a some huge loops in a program. In order to save time I decided to let ICC deal with it. For that purpose, I prepare properly the data, taking into account the alignment and I make use of the compiler directives #pragma simd, #pragma aligned, #pragma ivdep. When compiling with the several -vec-report options, compiler tells me that loops were vectorized. A quick look to the assembly generated by the compiler seems to confirm that, since you can find there plenty of vectorial instructions that works with packed single precision operands (all operations in the serial code handler float operands).
The problem is that when I take hardware counters with PAPI the number of FP operations I get (PAPI_FP_INS and PAPI_FP_OPS) is pretty the same in the auto-vectorized code and the original one, when one would expect to be significantly less in the auto-vectorized code. What's more, a vectorized by-hand a simplified problem of the one that concerns and in this case I do get something like 3 times less of FP operations.
Has anyone experienced something similar with this?

Spills may destroy the advantage of vectorization, thus 64-bit mode may gain significantly over 32-bit mode. Also, icc may version a loop and you may be hitting a scalar version even though there is a vector version present. icc versions issued in the last year or 2 have fixed some problems in this area.

Related

What is the difference between loadu_ps and set_ps when using unformatted data?

I have some data that isn't stored as structure of arrays. What is the best practice for loading the data in registers?
__m128 _mm_set_ps (float e3, float e2, float e1, float e0)
// or
__m128 _mm_loadu_ps (float const* mem_addr)
With _mm_loadu_ps, I'd copy the data in a temporary stack array, vs. copying the data as values directly. Is there a difference?
It can be a tradeoff between latency and throughput, because separate stores into an array will cause a store-forwarding stall when you do a vector load. So it's high latency, but throughput could still be ok, and it doesn't compete with surrounding code for the vector shuffle execution unit. So it can be a throughput win if the surrounding code also has shuffle operations, vs. 3 shuffles to insert 3 elements into an XMM register after a scalar load of the first one. Either way it's still a lot of total uops, and that's another throughput bottleneck.
Most compilers like gcc and clang do a pretty good job with _mm_set_ps () when optimizing with -O3, whether the inputs are in memory or registers. I'd recommend it, except in some special cases.
The most common missed-optimization with _mm_set is when there's some locality between the inputs. e.g. don't do _mm_set_ps(a[i+2], a[i+3], a[i+0], a[i+1]]), because many compilers will use their regular pattern without taking advantage of the fact that 2 pairs of elements are contiguous in memory. In that case, use (the intrinsics for) movsd and movhps to load in two 64-bit chunks. (Not movlps: it merges into an existing register instead of zeroing the high elements, so it has a false dependency on the old contents while movsd zeros the high half.) Or a shufps if some reordering is needed between or within the 64-bit chunks.
The "regular pattern" that compilers use will usually be movss / insertps from memory if compiling with SSE4, or movss loads and unpcklps shuffles to combine pairs and then another unpcklps, unpcklpd, or movlhps to shuffle into one register. Or a shufps or shufpd if the compiler likes to waste code-side on immediate shuffle-control operands instead of using fixed shuffles intelligently.
See also Agner Fog's optimization guides for some handy tables of data-movement instructions to get a better idea of what the compiler has to work with, and how stuff performs. Note that Haswell and later can only do 1 shuffle per clock. Also other links in the x86 tag wiki.
There's no really cheap way for a compiler or human to do this, in the general case when you have 4 separate scalars that aren't contiguous in memory at all. Or for register inputs, where it can't optimize the way they're generated in registers in the first place to have some of them already packed together. (e.g. for function args passed in registers to a function that can't / doesn't inline.)
Anyway, it's not a big deal unless you have this inside an inner loop. In that case, definitely worry about it (and check the compiler's asm output to see if it made a mess or could do better if you program the gather yourself with intrinsics that map to single instructions like _mm_load_ss / _mm_shuffle_ps).
If possible, rearrange your data layout to make data contiguous in at least small chunks / stripes. (See https://stackoverflow.com/tags/sse/info, specifically these slides. But sometimes one part of the program needs the data one way, and the other needs another. Choose the layout that's good for the case that needs to be faster, or that runs more often, or whatever, and suck it up and do the best you can for the other part of the program. :P Possibly transpose / convert once to set up for multiple SIMD operations, but extra passes over data with no computation just suck up time and can hurt your computational intensity (how much ALU work you do for each time you load data into registers) more than they help.
And BTW, actual gather instructions (like AVX2 vgatherdps) are not very fast; even on Skylake it's probably not worth using a gather instruction for four 32-bit elements at known locations. On Broadwell / Haswell, gather is definitely not worth using for this.

Largest amount of entries in lua table

I am trying to build a Sieve of Eratosthenes in Lua and i tried several things but i see myself confronted with the following problem:
The tables of Lua are to small for this scenario. If I just want to create a table with all numbers (see example below), the table is too "small" even with only 1/8 (...) of the number (the number is pretty big I admit)...
max = 600851475143
numbers = {}
for i=1, max do
table.insert(numbers, i)
end
If I execute this script on my Windows machine there is an error message saying: C:\Program Files (x86)\Lua\5.1\lua.exe: not enough memory. With Lua 5.3 running on my Linux machine I tried that too, error was just killed. So it is pretty obvious that lua can´t handle the amount of entries.
I don´t really know whether it is just impossible to store that amount of entries in a lua table or there is a simple solution for this (tried it by using a long string aswell...)? And what exactly is the largest amount of entries in a Lua table?
Update: And would it be possible to manually allocate somehow more memory for the table?
Update 2 (Solution for second question): The second question is an easy one, I just tested it by running every number until the program breaks: 33.554.432 (2^25) entries fit in one one-dimensional table on my 12 GB RAM system. Why 2^25? Because 64 Bit per number * 2^25 = 2147483648 Bits which are exactly 2 GB. This seems to be the standard memory allocation size for the Lua for Windows 32 Bit compiler.
P.S. You may have noticed that this number is from the Euler Project Problem 3. Yes I am trying to accomplish that. Please don´t give specific hints (..). Thank you :)
The Sieve of Eratosthenes only requires one bit per number, representing whether the number has been marked non-prime or not.
One way to reduce memory usage would be to use bitwise math to represent multiple bits in each table entry. Current Lua implementations have intrinsic support for bitwise-or, -and etc. Depending on the underlying implementation, you should be able to represent 32 or 64 bits (number flags) per table entry.
Another option would be to use one or more very long strings instead of a table. You only need a linear array, which is really what a string is. Just have a long string with "t" or "f", or "0" or "1", at every position.
Caveat: String manipulation in Lua always involves duplication, which rapidly turns into n² or worse complexity in terms of performance. You wouldn't want one continuous string for the whole massive sequence, but you could probably break it up into blocks of a thousand, or of some power of 2. That would reduce your memory usage to 1 byte per number while minimizing the overhead.
Edit: After noticing a point made elsewhere, I realized your maximum number is so large that, even with a bit per number, your memory requirements would optimally be about 73 gigabytes, which is extremely impractical. I would recommend following the advice Piglet gave in their answer, to look at Jon Sorenson's version of the sieve, which works on segments of the space instead of the whole thing.
I'll leave my suggestion, as it still might be useful for Sorenson's sieve, but yeah, you have a bigger problem than you realize.
Lua uses double precision floats to represent numbers. That's 64bits per number.
600851475143 numbers result in almost 4.5 Terabytes of memory.
So it's not Lua's or its tables' fault. The error message even says
not enough memory
You just don't have enough RAM to allocate that much.
If you would have read the linked Wikipedia article carefully you would have found the following section:
As Sorenson notes, the problem with the sieve of Eratosthenes is not
the number of operations it performs but rather its memory
requirements.[8] For large n, the range of primes may not fit in
memory; worse, even for moderate n, its cache use is highly
suboptimal. The algorithm walks through the entire array A, exhibiting
almost no locality of reference.
A solution to these problems is offered by segmented sieves, where
only portions of the range are sieved at a time.[9] These have been
known since the 1970s, and work as follows
...

Is it possible to clear the FPU?

I'm using Delphi XE6 to perform a complicated floating point calculation. I realize the limitations of floating point numbers so understand the inaccuracies inherent in FP numbers. However this particular case, I always get 1 of 2 different values at the end of the calculation.
The first value and after a while (I haven't figured out why and when), it flips to the second value, and then I can't get the first value again unless I restart my application. I can't really be more specific as the calculation is very complicated. I could almost understand if the value was somewhat random, but just 2 different states is a little confusing. This only happens in the 32-bit compiler, the 64 bit compiler gives one single answer no matter how many times I try it. This number is different from the 2 from the 32-bit calculation, but I understand why that is happening and I'm fine with it. I need consistency, not total accuracy.
My one suspicion is that perhaps the FPU is being left in a state after some calculation that affects subsequent calculations, hence my question about clearing all registers and FPU stack to level out the playing field. I'd call this CLEARFPU before I start of the calculation.
After some more investigation I realized I was looking in the wrong place. What you see is not what you get with floating point numbers. I was looking at the string representation of the numbers and thinking here are 4 numbers going into a calculation ALL EQUAL and the result is different. Turns out the numbers only seemed to be the same. I started logging the hex equivalent of the numbers, worked my way back and found an external dll used for matrix multiplication the cause of the error. I replaced the matrix multiplication with a routine written in Delphi and all is well.
Floating point calculations are deterministic. The inputs are the input data and the floating point control word. With the same input, the same calculation will yield repeatable output.
If you have unpredictable results, then there will be a reason for it. Either the input data or the floating point control word is varying. You have to diagnose what that reason for that is. Until you understand the problem fully, you should not be looking for a problem. Do not attempt to apply a sticking plaster without understanding the disease.
So the next step is to isolate and reproduce the problem in a simple piece of code. Once you can reproduce the issue you can solve the problem.
Possible explanations include using uninitialized data, or external code modifying the floating point control word. But there could be other reasons.
Uninitialized data is plausible. Perhaps more likely is that some external code is modifying the floating point control word. Instrument your code to log the floating point control word at various stages of execution, to see if it ever changes unexpectedly.
You've probably been bitten by combination of optimization and excess x87 FPU precision resulting in the same bit of floating-point code in your source code being duplicated with different assembly code implementations with different rounding behaviour.
The problem with x87 FPU math
The basic problem is that while x87 FPU the supports 32-bit, 64-bit and 80-bit floating-point value, it only has 80-bit registers and the precision of operations is determined by the state of the bits in the floating point control word, not the instruction used. Changing the rounding bits is expensive, so most compilers don't, and so all floating point operations end being be performed at the same precision regardless of the data types involved.
So if the compiler sets the FPU to use 80-bit rounding and you add three 64-bit floating point variables, the code generated will often add the first two variables keeping the unrounded result in a 80-bit FPU register. It would then add the third 64-bit variable to 80-bit value in the register resulting in another unrounded 80-bit value in a FPU register. This can result in a different value being calculated than if the result was rounded to 64-bit precision after each step.
If that resulting value is then stored in a 64-bit floating-point variable then the compiler might write it to memory, rounding it to 64 bits at this point. But if the value is used in later floating point calculations then the compiler might keep it in a register instead. This means what rounding occurs at this point depends on the optimizations the compiler performs. The more its able to keep values in a 80-bit FPU register for speed, the more the result will differ from what you'd get if all floating point operation were rounded according to the size of actual floating point types used in the code.
Why SSE floating-point math is better
With 64-bit code the x87 FPU isn't normally used, instead equivalent scalar SSE instructions are used. With these instructions the precision of the operation used is determined by the instruction used. So with the adding three numbers example, the compiler would emit instructions that added the numbers using 64-bit precision. It doesn't matter if the result gets stored in memory or stays in register, the value remains the same, so optimization doesn't affect the result.
How optimization can turn deterministic FP code into non-deterministic FP code
So far this would explain why you'd get a different result with 32-bit code and 64-bit code, but it doesn't explain why you can get a different result with the same 32-bit code. The problem here is that optimizations can change the your code in surprising ways. One thing the compiler can do is duplicate code for various reasons, and this can cause the same floating point code being executed in different code paths with different optimizations applied.
Since optimization can affect floating point results this can mean the different code paths can give different results even though there's only one code path in the source code. If the code path chosen at run time is non-deterministic then this can cause non-deterministic results even when the in the source code the result isn't dependent on any non-deterministic factor.
An example
So for example, consider this loop. It performs a long running calculation, so every few seconds it prints a message letting the user know how many iterations have been completed so far. At the end of the loop there's simple summation performed using floating-point arithmetic. While there's non-deterministic factor in the loop, the floating-point operation isn't dependent on it. It's always performed regardless of whether progress updated is printed or not.
while ... do
begin
...
if TimerProgress() then
begin
PrintProgress(count);
count := 0
end
else
count := count + 1;
sum := sum + value
end
As optimization the compiler might move the last summing statement into the end of both blocks of the if statement. This lets both blocks finish by jumping back to the start of the loop, saving a jump instruction. Otherwise one of the blocks has to end with a jump to the summing statement.
This transforms the code into this:
while ... do
begin
...
if TimerProgress() then
begin
PrintProgress(count);
count := 0;
sum := sum + value
end
else
begin
count := count + 1;
sum := sum + value
end
end
This can result in the two summations being optimized differently. It may be in one code path the variable sum can be kept in a register, but in the other path its forced out in to memory. If x87 floating point instructions are used here this can cause sum to be rounded differently depending on a non-deterministic factor: whether or not its time to print the progress update.
Possible solutions
Whatever the source of your problem, clearing the FPU state isn't going to solve it. The fact that the 64-bit version works, provides an possible solution, using SSE math instead x87 math. I don't know if Delphi supports this, but it's common feature of C compilers. It's very hard and expensive to make x87 based floating-point math conforming to the C standard, so many C compilers support using SSE math instead.
Unfortunately, a quick search of the Internet suggests the Delphi compiler doesn't have option for using SSE floating-point math in 32-bit code. In that case your options would be more limited. You can try disabling optimization, that should prevent the compiler from creating differently optimized versions of the same code. You could also try to changing the rounding precision in the x87 floating-point control word. By default it uses 80-bit precision, but all your floating point variables are 64-bit then changing the FPU to use 64-bit precision should significantly reduce the effect optimization has on rounding.
To do the later you can probably use the Set8087CW procedure MBo mentioned, or maybe System.Math.SetPrecisionMode.

Why dead code in OpenCL kernel influence result in Nvidia GTX550ti?

I am using OpenCL dev software of Nvidia on GTX550ti graphics card, and encounter a strange problem. (I am freshman for OpenCL).
My kernel code is like this:
__kernel void kernel_name(...)
{
size_t d = get_local_id(0);
char abc[8];
...
}
Actually, the char abc[8] is useless (dead code) for my case. But, if I have the char abc[8] in my kernel code, the result will be totally messy and the running time of kernel will be much longer (2095712 ns). If I comment out the char abc[8], the result becomes correct, and the running time of kernel becomes shorter (697856 ns). The compiler of kernel won't wipe off the dead code?
The above is just an explicit example that I can repeat. I also encounter more stranger case that one program gets different result when run at different time in totally the same environment.
Is that related to memory allocation or..? Anyone can give me some advice on how to find the problem?
By the way, oclDeviceQuery output information is listed as follows:
Platform Version = OpenCL 1.1
CUDA 4.2.1,
SDK Revision = 7027912
My OS is Windows XP.
Today is 2012-07-17, and I think I have resolved this problem.
don't use #include in kernel source file.
don't use ultra length line (for example, you write program to generate some line data for kernel source file) in kernel source file.
You're right, that shouldn't effect anything.
That's not your real code though, and I suspect given those run-times that your kernel isn't a simple thing. Possibly you're pushing your locals over some limit which means that variables are having to be stored in some slower memory which pushes your run-times up.
Something like that might also cause a change in behaviour if you had an uninitialised variable bug somewhere. In the fast store it happens to get a value that works. In the slow store it gets something else.
To check this theory I'd try to remove some other local data structure and see if it has the same effect. Anything else 8 bytes or larger should have the same effect.
...of course it's possibly you've found a bug in the OpenCL implementation, but that's easy to check. Just compile the kernel for a different OpenCL device, e.g. the CPU. This is worth doing anyway because different compiler pick up different issues.
Other than that I think you're back to standard debug techniques.
BTW: at one point in your question you call the array abs[8] rather than abc[8]. I assume that's a typo, but if it isn't then that could be your problem as the abs name will clash with the abs() function. That could confuse a stupid compiler.

What's the deal with 17- and 40-bit math in TI DSPs?

The TMS320C55x has a 17-bit MAC unit and a 40-bit accumulator. Why the non-power-of-2-width units?
The 40-bit accumulator is common in a few TI DSPs. The idea is basically that you can accumulate up to 256 arbitrary 32-bit products without overflow. (vs. in C where if you take a 32-bit product, you can overflow fairly quickly unless you resort to using 64-bit integers.)
The only way you access these features is by assembly code or special compiler intrinsics. If you use regular C/C++ code, the accumulator is invisible. You can't get a pointer to it.
So there's not any real need to adhere to a power-of-2 scheme. DSP cores have been fairly optimized for power/performance tradeoffs.
I may be talking through my hat here, but I'd expect to see the 17-bit stuff used to avoid the need for a separate carry bit when adding/subtracting 16-bit samples.

Resources