Wrong semaphore in case of OpenCL usage - memory

Solution:
Finally I managed to solve, or at least find a good workaround for, my problem.
This kind of semaphore doesn't work on NVIDIA.
I think that comment is right.
So I decided to use atomic_add(), which is a mandatory part of OpenCL 1.1.
I have a resultBuffer array and a resultBufferSize global variable, the latter initialized to zero.
When I have results (my result is always exactly x numbers), I simply call
position = atomic_add(resultBufferSize, x);
and I can be sure that no one else writes into the buffer between position and position + x.
Don't forget that the global variable must be volatile.
When the threads run into endless loops, the resource becomes unavailable, hence the -5 (CL_OUT_OF_RESOURCES) error code when reading the buffer.
Update:
When I read back:
oclErr |= clEnqueueReadBuffer(cqCommandQueue, cm_inputNodesArraySizes, CL_TRUE, 0, lastMapCounter*sizeof(cl_uint), (void*)&inputNodesArraySizes, 0, NULL, NULL);
The value of lastMapCounter changes. This is strange, because in the OpenCL code I don't touch it, and I take care with the sizes: what I specified at buffer creation, what I copied in, and what I read back are the same. A hidden buffer overflow can indeed cause many strange things.
End of update
I wrote the following code, and there is a bug in it. I want a semaphore to guard changes to the resultBufferSize global variable (for now I just want to see how it works) and to get back a big number (each work item is supposed to add something). But I always get 3, or sometimes errors. I can see no logic in what the compiler does.
__kernel void findCircles(__global uint *inputNodesArray,
                          __global uint *inputNodesArraySizes,
                          uint lastMapCounter,
                          __global uint *resultBuffer,
                          __global uint *resultBufferSize,
                          volatile __global uint *sem)
{
    for (; atom_xchg(sem, 1) > 0;)
        (*resultBufferSize) = (*resultBufferSize) + 3;
    atom_xchg(sem, 0);
}
I get -48 (CL_INVALID_KERNEL) during kernel execution; sometimes it's OK, and then I get -5 (CL_OUT_OF_RESOURCES) when I want to read back the buffer (the size buffer).
Do you have any idea where I can find the bug?
NVIDIA OpenCL 1.1 is being used.
Of course, on the host I configure everything properly:
uint32 resultBufferSize = 0;
uint32 sem;

cl_mem cmresultBufferSize = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE,
                                           sizeof(uint32), NULL, &ciErrNum);
cl_mem cmsem = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE,
                              sizeof(uint32), NULL, &ciErrNum);

ciErrNum = clSetKernelArg(ckKernel, 4, sizeof(cl_mem), (void*)&cmresultBufferSize);
ciErrNum = clSetKernelArg(ckKernel, 5, sizeof(cl_mem), (void*)&cmsem);

ciErrNum |= clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 1, NULL,
                                   &szGlobalWorkSize, &szLocalWorkSize, 0, NULL, NULL);
ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmresultBufferSize, CL_TRUE, 0,
                               sizeof(uint32), (void*)&resultBufferSize, 0, NULL, NULL);
(with this code the kernel is OK, and the last read returns -5)

I know you have come to a conclusion on this, but I want to point out two things:
1) The semaphore is non-portable because it isn't SIMD safe, as pointed out in the linked thread.
2) The memory model is not strong enough to give a meaning to the code. The update of the result buffer could move out of the critical section - nothing in the model says otherwise. At the very least you'd need fences, but the language around fences in the 1.x specs is also fairly weak. You'd need an OpenCL 2.0 implementation to be confident that this aspect is safe.

Related

Is atomic_fetch_add_explicit atomic within threadgroup memory?

The following code snippet in a Metal compute kernel suggests that atomic_fetch_add_explicit does not perform an atomic read-modify-write within threadgroup memory.
The value of i is not unique within the threadgroup, as I expected it to be.
Am I using it wrong?
threadgroup atomic_int index;
atomic_store_explicit( &index, 0, memory_order_relaxed );
threadgroup_barrier( mem_flags::mem_none );
int i = atomic_fetch_add_explicit( &index, 1, memory_order_relaxed );
This is indeed correct and functions atomically as expected.
The error was in my code verifying the uniqueness of i.

How to read vertices from vertex buffer in Direct3d11

I have a question regarding vertex buffers: how does one read the vertices from a vertex buffer in D3D11? I want to get a particular vertex's position for calculations; if this approach is wrong, how would one do it? The following code (obviously) does not work.
VERTEX* vert;
D3D11_MAPPED_SUBRESOURCE ms;
devcon->Map(pVBufferSphere, NULL, D3D11_MAP_READ, NULL, &ms);
vert = (VERTEX*) ms.pData;
devcon->Unmap(pVBufferSphere, NULL);
Thanks.
Where your code is wrong:
You're asking the GPU to give you an address into its memory (Map()),
storing this address (operator=()),
then saying: "Thanks, I don't need it anymore" (Unmap()).
After unmapping, you can't really say where your pointer points. It might point to memory where something else has since been allocated, or to memory on your girlfriend's laptop (just kidding =) ).
You must copy the data (all of it or a part), not the pointer, between Map() and Unmap(): use memcpy, a for loop, anything. Put it in an array, a std::vector, a BST, whatever.
Typical mistakes newcomers make here:
Not checking the HRESULT return value of the ID3D11DeviceContext::Map method. If Map fails, it can return whatever pointer it likes; dereferencing such a pointer leads to undefined behavior. So, better to check the return value of any DirectX function.
Not checking the D3D11 debug output. It can clearly state what's wrong and what to do, in plain good English (clearly better than my English =) ), so you can fix bugs almost instantly.
You can only read from an ID3D11Buffer if it was created with the D3D11_CPU_ACCESS_READ CPU access flag, which means you must also set the D3D11_USAGE_STAGING usage flag.
How do we usually read from a buffer:
We don't use staging buffers for rendering/calculations: it's slow.
Instead, we copy from the main buffer (non-staging and not CPU-readable) to a staging one (ID3D11DeviceContext::CopyResource() or ID3D11DeviceContext::CopySubresourceRegion()), and then copy the data to system memory (memcpy()).
We don't do this much in release builds; it hurts performance.
There are two main real-life uses of staging buffers: debugging (seeing whether a buffer contains wrong data and fixing some bug in an algorithm) and reading back final non-pixel data (for example, if you compute scientific data in a compute shader).
In most cases you can avoid staging buffers entirely by designing your code well. Think as if the CPU<->GPU connection were one-way: CPU->GPU.
The following code only gets the address of the mapped resource; you don't read anything before Unmap:
vert = (VERTEX*) ms.pData;
If you want to read data from the mapped resource, first allocate enough memory, then use memcpy to copy the data. I don't know your VERTEX structure, so treat the allocation as raw bytes and cast it yourself:
vert = (VERTEX*) new BYTE[ms.DepthPitch];
memcpy(vert, ms.pData, ms.DepthPitch);
Drop's answer was helpful. I figured out that the reason I wasn't able to read the buffer was that I didn't have the CPU_ACCESS_FLAG set to D3D11_CPU_ACCESS_READ before. Here:
D3D11_BUFFER_DESC bufferDesc;
ZeroMemory(&bufferDesc, sizeof(bufferDesc));
bufferDesc.ByteWidth = iNumElements * sizeof(T);
bufferDesc.Usage = D3D11_USAGE_DEFAULT;
bufferDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ | D3D11_CPU_ACCESS_WRITE;
bufferDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE ;
bufferDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
bufferDesc.StructureByteStride = sizeof(T);
And then, to read the data, I did:
const ID3D11Device& device = *DXUTGetD3D11Device();
ID3D11DeviceContext& deviceContext = *DXUTGetD3D11DeviceContext();
D3D11_MAPPED_SUBRESOURCE ms;
HRESULT hr = deviceContext.Map(g_pParticles, 0, D3D11_MAP_READ, 0, &ms);
Particle* p = (Particle*)malloc(sizeof(Particle) * g_iNumParticles); // sizeof(Particle), not sizeof(Particle*)
ZeroMemory(p, sizeof(Particle) * g_iNumParticles);
memcpy(p, ms.pData, sizeof(Particle) * g_iNumParticles); // copy the full byte count; sizeof(ms.pData) is just the pointer size
deviceContext.Unmap(g_pParticles, 0);
free(p); // malloc'd memory is released with free, not delete[]
I agree it's a performance hit; I wanted to do this just to be able to debug the values!
Thanks anyway! =)

How do I transfer an integer to __constant__ device memory?

I have a weird problem, so I thought I would ask and see if someone more experienced than me could see a solution.
I am writing a program in CUDA C/C++, and I have some constant integers that specify various things, like the coordinates of the bounds of the calculation, etc. Currently I just keep them in global device memory. They are accessed by every thread in every kernel call, so I figured that if they are in global memory then they are never cached or broadcast (right?). So these little integers are taking up a lot (relatively speaking) of overhead and have a lot of 'read redundancy.'
So I declare in a header:
__constant__ int* number;
I include that header, and, when I do memory stuff, I do:
cutilSafeCall( cudaMemcpyToSymbol(number, &(some_host_int), sizeof(int)) );
I then pass number into all my kernels:
__global__ void magical_kernel(int* number, ...) {
    // and I access 'number' like this
    int data_thingy = big_array[ *number ];
}
My code crashes. With number in global memory it is just fine. I have determined that it crashes at some point upon accessing number within the kernel. This means that I am either accessing or allocating it wrongly. If it holds the wrong value, that will also cause a crash, because it is used to index into arrays.
To conclude, I will ask a few questions. First, what am I doing wrong? As a bonus: is there a better way than constant memory to accomplish this task? I don't know the value of number at compile time, so a simple #define won't work. Will constant memory even speed the code up at all, or has it been cached and broadcast all along? Could I somehow put the data in shared memory for each thread block and have it remain in shared memory across multiple kernel calls?
There are several problems here:
You have declared number as a pointer, but never assigned it a value which is a valid address in GPU memory.
You have a variable scope conflict: the argument variable int *number defined in magical_kernel is not the same variable as the __constant__ int * variable defined at compilation unit scope.
The first argument of the cudaMemcpyToSymbol call is almost certainly incorrect.
If you don't understand why either of the first two points is true, you have some revision to do on pointers and scope in C++.
Based on your response to a now deleted answer, I suspect what you are actually trying to do is this:
__constant__ int number;
__global__ void magical_kernel(...){
int data_thingy = big_array[ number ];
}
cudaMemcpyToSymbol("number", &(some_host_int), sizeof(int));
i.e. number is intended to be an integer in constant memory, not a pointer, and not a kernel argument.
EDIT: here is an example which shows this in action:
#include <cstdio>

__constant__ int number;

__global__ void magical_kernel(int * out)
{
    out[threadIdx.x] = number;
}

int main()
{
    const int value = 314159;
    const size_t sz = size_t(32) * sizeof(int);
    cudaMemcpyToSymbol("number", &value, sizeof(int));

    int * _out, * out;
    out = (int *)malloc(sz);
    cudaMalloc((void **)&_out, sz);

    magical_kernel<<<1,32>>>(_out);
    cudaMemcpy(out, _out, sz, cudaMemcpyDeviceToHost);

    for(int i=0; i<32; i++)
        fprintf(stdout, "%d %d\n", i, out[i]);

    return 0;
}
You should be able to run this yourself and confirm it works as advertised.

How to declare local memory in OpenCL?

I'm running the OpenCL kernel below with a two-dimensional global work size of 1000000 x 100 and a local work size of 1 x 100.
__kernel void myKernel(
    const int length,
    const int height
    /* and a bunch of other parameters */ )
{
    // declare some local arrays to be shared by all 100 work items in this group
    __local float LP[length];
    __local float LT[height];
    __local int bitErrors = 0;
    __local bool failed = false;

    // here come my actual computations which utilize the space in LP and LT
}
This, however, refuses to compile, since the parameters length and height are not known at compile time. But it is not at all clear to me how to do this correctly. Should I use pointers with malloc? How do I handle this so that the memory is allocated only once for the entire workgroup and not once per work item?
All I need is two arrays of floats, one int and one bool shared among the entire workgroup (all 100 work items). But I can't find any method that does this correctly...
It's relatively simple, you can pass the local arrays as arguments to your kernel:
kernel void myKernel(const int length, const int height, local float* LP,
local float* LT, a bunch of other parameters)
You then set the kernel argument with a value of NULL and a size equal to the size you want to allocate for the argument (in bytes). Therefore it should be:
clSetKernelArg(kernel, 2, length * sizeof(cl_float), NULL);
clSetKernelArg(kernel, 3, height* sizeof(cl_float), NULL);
Local memory is always shared by the workgroup (as opposed to private memory), so I think the bool and the int should be fine; but if not, you can always pass those as arguments too.
Not really related to your problem (and not necessarily relevant, since I don't know what hardware you plan to run this on), but GPUs at least don't particularly like work sizes that are not a multiple of a particular power of two (I think it was 32 for NVIDIA and 64 for AMD), meaning this will probably create workgroups of 128 items, of which the last 28 are basically wasted. So if you are running OpenCL on a GPU, it might help performance to use workgroups of size 128 directly (and change the global work size appropriately).
As a side note: I never understood why everyone uses the underscore variants of kernel, local and global; they seem much uglier to me.
You could also declare your arrays like this:
__local float LP[LENGTH];
And pass the LENGTH as a define in your kernel compile.
int lp_size = 128; // this is an example; could be dynamically calculated
char compileArgs[64];
sprintf(compileArgs, "-DLENGTH=%d", lp_size);
clBuildProgram(program, 0, NULL, compileArgs, NULL, NULL);
You do not have to allocate all your local memory outside the kernel, especially when it is a simple variable rather than an array.
The reason your code cannot compile is that OpenCL does not support local memory initialization. This is specified in the documentation (https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/local.html). It is also not feasible in CUDA (Is there a way of setting default value for shared memory array?).
PS: Grizzly's answer is good enough, and it would have been better to post this as a comment, but I am restricted by the reputation policy. Sorry.

memset() behaving undesirably

I am using the memset function in C and I'm having a problem. Here it is:
char* tail;
tail = //some memory address
int pbytes = 5;
When I call memset like
memset(tail+pbytes, 0, 8); // it gives no error
When I call memset like
memset(tail+pbytes, 0, 9); // it goes into an infinite loop
When I call memset like
memset(tail+pbytes, 0, 10); // with the last parameter 10 or above, it gives a segmentation fault
What can the reason for this be? The program runs and gives the desired output, but it gives a segmentation fault at the end. I am using a 64-bit Linux virtual machine.
Any help would be appreciated.
OK, let me clarify what I am doing. I am building 128 bytes of data (bytes 0-127 in an array). I write 0 (NUL) into bytes 112 to 119 and that goes well, but when I try to write 0 into the 120th byte and run the program, it goes into an infinite loop. If I write 1, 2, 4 or 6 into the 120th byte, the program runs well. If I write other numbers into the 120th byte, the program gives a segmentation fault. Basically, there is something wrong with bytes 120 to 127.
There is nothing wrong with memset. There is something wrong with how you defined your pointer variable tail.
If you simply wrote
char tail[128];
memset(tail+5, 0, 9);
of course it would work fine. Your problem is that you're not doing anything that simple and correct; you're doing something obscure and incorrect, such as
char tail[1];
memset(tail+5, 0, 9);
or
void foo(int x) {
char *tail = &x;
memset(tail+5, 0, 9);
}
To paraphrase Charles Babbage: When you put wrong code into the machine, wrong answers come out.
The segfault is probably because you're trying to write to a virtual address that has not yet been allocated. The infinite loop might be because you're overwriting some part of the memset's return address, so that it returns to the wrong place (such as into the middle of an infinite loop) instead of returning to the place it was called from.