Why does dead code in an OpenCL kernel influence the result on an Nvidia GTX 550 Ti?

I am using Nvidia's OpenCL development software on a GTX 550 Ti graphics card and have encountered a strange problem. (I am new to OpenCL.)
My kernel code is like this:
__kernel void kernel_name(...)
{
size_t d = get_local_id(0);
char abc[8];
...
}
Actually, the char abc[8] is useless (dead code) in my case. But if I leave char abc[8] in the kernel code, the result is totally wrong and the kernel runs much longer (2095712 ns). If I comment out char abc[8], the result becomes correct and the kernel runs faster (697856 ns). Shouldn't the kernel compiler strip out dead code?
The above is just one explicit example that I can reproduce. I have also encountered stranger cases where the same program gives different results on different runs in exactly the same environment.
Is this related to memory allocation, or something else? Can anyone give me some advice on how to track down the problem?
By the way, the oclDeviceQuery output is as follows:
Platform Version = OpenCL 1.1
CUDA 4.2.1,
SDK Revision = 7027912
My OS is Windows XP.
Update (2012-07-17): I think I have resolved this problem.
Don't use #include in the kernel source file.
Don't use ultra-long lines (for example, program-generated lines of data) in the kernel source file.

You're right, that shouldn't affect anything.
That's not your real code though, and given those run times I suspect your kernel isn't a simple thing. Possibly you're pushing your locals over some limit, which means variables have to be stored in slower memory, and that pushes your run times up.
Something like that might also cause a change in behaviour if you have an uninitialised-variable bug somewhere. In the fast store it happens to get a value that works; in the slow store it gets something else.
To check this theory, I'd try removing some other local data structure and see if it has the same effect. Anything else 8 bytes or larger should behave the same way.
...of course it's possible you've found a bug in the OpenCL implementation, but that's easy to check: just compile the kernel for a different OpenCL device, e.g. the CPU. This is worth doing anyway, because different compilers pick up different issues.
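For that cross-check, here is a minimal host-side sketch (assuming a CPU OpenCL platform such as Intel's or AMD's is installed; the function and variable names are illustrative, not from the question):

/* Build the same kernel source for a CPU device so a second compiler checks it. */
#include <stdio.h>
#include <CL/cl.h>

void check_on_cpu(const char *source)
{
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);

    for (cl_uint p = 0; p < num_platforms; ++p) {
        cl_device_id dev;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_CPU, 1, &dev, NULL) != CL_SUCCESS)
            continue;                 /* this platform has no CPU device */

        cl_int err;
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);

        if (clBuildProgram(prog, 1, &dev, "", NULL, NULL) != CL_SUCCESS) {
            char log[16384];
            clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, sizeof(log), log, NULL);
            printf("CPU build log:\n%s\n", log);   /* a second opinion on the kernel */
        }
        clReleaseProgram(prog);
        clReleaseContext(ctx);
    }
}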
Other than that I think you're back to standard debug techniques.
BTW: at one point in your question you call the array abs[8] rather than abc[8]. I assume that's a typo, but if it isn't, that could be your problem, since the name abs will clash with the abs() function. That could confuse a stupid compiler.

Related

Passing arguments through __local memory in OpenCL

I am confused about __local memory in OpenCL here.
I read in the spec that data has to flow from the host to
__global memory, and only then into __local memory.
But I also see some kernel function like this:
__kernel void foo(__local float * a)
I was wondering how the data gets transferred directly into __local
memory in this way?
Thanks.
It is not possible to fill a local buffer from the host side. Therefore you have to follow the flow host -> __global -> __local.
A local buffer can either be created on the host side and passed in as a kernel parameter, or declared on the GPU side inside the kernel.
Creating the local buffer on the host side has the advantage that you can decide its size just before the kernel is run, which matters if the local buffer size needs to be different each time the kernel is launched. Both options are sketched below.
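To make the two options concrete, here is a hedged kernel-side sketch (the kernel names and the trivial scaling computation are illustrative, not from the question):

/* 1) Size chosen by the host: the kernel receives an uninitialised __local
 *    buffer whose size is set by clSetKernelArg on the host side. */
__kernel void scale_with_dynamic_local(__global const float *in,
                                       __global float *out,
                                       __local float *tmp)
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    tmp[lid] = in[gid];               /* stage data from __global into __local */
    barrier(CLK_LOCAL_MEM_FENCE);     /* make it visible to the whole work-group */
    out[gid] = 2.0f * tmp[lid];
}

/* 2) Size fixed at compile time: the buffer is declared inside the kernel. */
__kernel void scale_with_static_local(__global const float *in,
                                      __global float *out)
{
    __local float tmp[256];           /* must match the work-group size used */
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    tmp[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);
    out[gid] = 2.0f * tmp[lid];
}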
Local memory is not visible to anything but a single work-group, and may be allocated as the work-group is dispatched by hardware on many architectures. Hardware that can mix multiple work-groups from different kernels on each CU will allow the scheduling component to chunk up the local memory for each of the groups being issued. It doesn't exist before the group is launched, and does not exist after the group terminates. The size of this region is what you pass in as other answers have pointed out.
The result of this is that on many architectures the only way to fill local memory from the host would be for the compiler to insert kernel code that copies data in from global memory. Given that, it isn't any worse in terms of performance for the programmer to do it manually, and it gives more control over exactly what happens. You don't end up in a situation where the compiler always generates copy code and copies more than was really necessary because the API didn't make clear which memory was copy-in and which was not.
In summary, you cannot fill local memory in any automated way. In practice you will rarely want to, because doing it manually gives you the opportunity to put only the result of a first stage into local memory (removing extra copy operations), or to transform the data on the way into local memory, allowing padding or data transposition to remove bank conflicts and so on.
As @doqtor said, the size of a local-memory kernel parameter can be specified with clSetKernelArg, as in the sketch below.
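A hedged host-side sketch of that call, matching the dynamic-local kernel above (the kernel, buffer and queue handles are assumed to exist already):

size_t local_size = 256;                      /* work-items per group */
size_t global_size = 1024;

clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);
/* Argument 2 is the __local float *tmp parameter: pass a size in bytes and a
 * NULL pointer; the runtime allocates that much local memory per work-group. */
clSetKernelArg(kernel, 2, local_size * sizeof(cl_float), NULL);

clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);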
Fortunately, OpenCL 1.2+ supports variable length arrays (VLAs), so a local-memory kernel parameter is not required any more.

OpenCL code behavior is different for AMD vs NVIDIA cards

I have a constant at the top of my code...
__constant uint uintmaxx = (uint)( (((ulong)1)<<32) - 1 );
It compiles fine on AMD and NVIDIA OpenCL compilers... then executes.
(correct) on ATI cards, it returns 4294967295 (all 32 bits = 1)
(wrong) on NVIDIA cards, it returns 2147483648 (only the 32nd bit = 1)
I also tried -1 + 1<<32 and it worked on ATI but not NVIDIA.
What gives? Am I just missing something?
While I'm on the topic of OpenCL compiler differences, does anyone know a good resource that lists the compiler differences between AMD and NVIDIA?
OpenCL conveniently provides that for you already. You can use the predefined UINT_MAX in your kernel code and the implementation will guarantee that it holds the correct value.
However, there is also nothing wrong with the method you use. The spec guarantees that uint is 32 bits and ulong is 64 bits, that ints are two's complement, and that everything not explicitly mentioned works exactly as written in the C99 spec.
Even just this should work and give you the correct result:
uint uintmaxx = -1;
It seems that NVidia just has a broken compiler; if not, I really hope I'll be corrected on the issue. The really odd part is: how on earth is the 32nd bit 1? Shifting left by 32 moves the original bit to the 33rd place. So what on earth puts a bit in the 32nd spot? The only thing that comes to mind is that they don't respect operator ordering at all and transform the formula into (ulong)1 << (32-1) or something like that.
You should probably file a bug report. But to be frank, considering that they hate OpenCL as much as Microsoft hates OpenGL, if not more, I wouldn't anticipate fast response times.
I fully agree with @sharpneli's answer. But just try this:
__constant uint uintmaxx = -1;
And like sharpneli said, use the UINT_MAX macro; it is the safer way.
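Putting the suggestions together, here is a minimal kernel-side sketch (the kernel name and output buffer are illustrative); on a conforming implementation all three values should come back as 4294967295:

__kernel void show_uint_max(__global uint *out)
{
    uint a = UINT_MAX;                          /* predefined by OpenCL C */
    uint b = (uint)(-1);                        /* all bits set via two's-complement wrap */
    uint c = (uint)((((ulong)1) << 32) - 1);    /* the original expression */
    out[0] = a;
    out[1] = b;
    out[2] = c;
}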

Where is memory interleaving and memory split up into ranks happening in Linux kernel?

I am working on a course homework assignment on the sysfs virtual file system in the Linux kernel. As part of setting up the sysfs virtual file system, the Linux kernel organizes physical memory into blocks, and further into sections, under the directory /sys/devices/system/memory. In that directory, memory chunks are represented as memory0, memory1, memory2, etc.
After digging through the Linux kernel, I found that memory is split into 128MB blocks and then further into sections, and I found the code which does this in the C file here: Memory.c. In that file, the function memory_dev_init() has the logic for splitting the whole memory into blocks and dividing it into sections (or that's what I understood :) ). According to my professor, memory in Linux is split into ranks, and ranks contain interleaved memory addresses as shown below:
rank0: [0-512KB] [2048KB-2560KB] [4096KB-4608KB] ...
rank1: [512KB-1024KB] [2560KB-3072KB] [4608KB-5120KB] ...
rank2: [1024KB-1536KB] [3072KB-3584KB] [5120KB-...
rank3: [1536KB-2048KB] [3584KB-4096KB] ...
As part of my homework, I want to change the rank format to the following, so that I can get contiguous memory blocks:
rank0: [0-512KB] [512KB-1024KB] [1024KB-1536KB]...
rank1: [1536KB-2048KB] [2048KB-2560KB] [2560KB-3072KB]...
rank2: [3072KB-3584KB] [3584KB-4096KB] [4096KB-4608KB]...
rank3: [4608KB-5120KB] ...
So I just want to know where exactly this memory interleaving and the existing ranking happen in the current Linux kernel. Could anyone please point me in the right direction?
I'm not quite sure, as I don't see any practical use for this; it is indeed a sort of academic research... What you are trying to achieve can be done by disabling memory interleaving entirely. I guess that after you disable interleaving you will see the proper "picture" in sysfs as well.
In other words -- no coding required, just the change of configuration.
Have a look at the memory interleave settings in the BIOS. Here's a post which describes how to do this on a couple of platforms.

False autovectorization in Intel C compiler (icc)

I need to vectorize some huge loops in a program with SSE. To save time I decided to let ICC deal with it. For that purpose, I prepare the data properly, taking alignment into account, and I make use of the compiler directives #pragma simd, #pragma aligned, and #pragma ivdep. When compiling with the various -vec-report options, the compiler tells me that the loops were vectorized. A quick look at the assembly generated by the compiler seems to confirm that, since you can find plenty of vector instructions that work with packed single-precision operands (all operations in the serial code handle float operands).
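For reference, a hedged sketch of the loop shape being described (the function, array names, loop body and the 32-byte alignment are assumptions, not the asker's code):

/* icc-specific hints: __assume_aligned promises alignment, #pragma ivdep
 * asserts no loop-carried dependences, #pragma simd requests vectorization. */
void fma_loop(float *restrict a, const float *restrict b,
              const float *restrict c, int n)
{
    __assume_aligned(a, 32);
    __assume_aligned(b, 32);
    __assume_aligned(c, 32);

#pragma ivdep
#pragma simd
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * c[i] + a[i];    /* packed single-precision body */
}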
The problem is that when I take hardware counters with PAPI, the number of FP operations I get (PAPI_FP_INS and PAPI_FP_OPS) is pretty much the same in the auto-vectorized code and the original one, when one would expect it to be significantly lower in the auto-vectorized code. What's more, I vectorized a simplified version of the problem by hand, and in that case I do get something like 3 times fewer FP operations.
Has anyone experienced something similar with this?
Spills may destroy the advantage of vectorization, so 64-bit mode may gain significantly over 32-bit mode. Also, icc may version a loop, and you may be hitting the scalar version even though a vector version is present. icc versions issued in the last year or two have fixed some problems in this area.

Subkernel memory control in Mathematica

I have a somewhat similar question as:
Mathematica running out of memory
I am interested in something like this:
ParallelTable[F[i], {i, 0, 14.9, 0.001}]
where F[i] is a complicated numerical integral (I haven't yet found an easy way to reproduce the problem without page-filling definitions for the integral).
My problem is that the subkernels blow up in memory and I have to stop the evaluation if I don't want the machine to start swapping.
But even after I have stopped the evaluation, the kernels won't free their occupied memory.
I have tried
ClearSystemCache[]
and even
ParallelEvaluate[ClearSystemCache[]]
but
ParallelEvaluate[MemoryInUse[]]
stays at
{823185944, 833146832, 812429208, 840150336, 850057024, 834441704,
847068768, 850424224}
It seems that all memory control only works for the main kernel?
For now, the only way is to shut down all the kernels and launch them again.
I really hope there are some solutions out there...
Thanks a lot.
Memory control works for the kernel where control expressions involving functions such as MemoryConstrained, MemoryInUse, Clear, Unset, Remove, $HistoryLength, ClearSystemCache, etc. are evaluated. It seems that in your case the source of the memory leaks is not Mathematica's internal caching mechanism (thanks for the link, BTW!).
Have you tried evaluating $HistoryLength=0; in all subkernels before using them for computations? If you have not yet, I strongly recommend trying it.
Since you are working with numerical integration functions, I also suggest trying to optimize how you use them. For example, if you do a numerical integration using NDSolve and need only a limited set of calculated points (or even a single point), you should use the form NDSolve[eqns,y,{x,x_needed_min,x_needed_max}] (or even NDSolve[eqns,y,{x,x_max,x_max}]) instead of NDSolve[eqns,y,{x,x_min,x_max}] or NDSolve[eqns,y,{x,0,x_max}]. This can dramatically reduce memory usage in some cases! You can also use EventLocator for memory control.
I was (am?) having the exact same problem, almost word for word. I just had some good luck adding this option to the problem integral:
Method-> {"GlobalAdaptive", "SymbolicProcessing"->False}
You can probably choose any other method if you'd like, but I had success with this within the last few minutes. Also, a lot of the nasty inconsistencies I used to get are gone, and integration proceeds MUCH faster.
