Is atomic_fetch_add_explicit atomic within threadgroup memory?

The following code snippet in a Metal compute kernel suggests that atomic_fetch_add_explicit does not have an atomic read-modify-write within threadgroup memory.
The value of i is not unique within the threadgroup as I expect it to be.
Am I using it wrong?
threadgroup atomic_int index;
atomic_store_explicit( &index, 0, memory_order_relaxed );
threadgroup_barrier( mem_flags::mem_none );
int i = atomic_fetch_add_explicit( &index, 1, memory_order_relaxed );

This is indeed correct and functions atomically as expected.
The error was in my code verifying the uniqueness of i.
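
For comparison, the same pattern can be tried on the CPU. The sketch below is a host-side C++ analogue, not Metal code: std::atomic<int> and std::thread stand in for the threadgroup atomic and the GPU threads, and takeTickets and allUnique are hypothetical helper names. Each thread takes one value with a relaxed fetch_add, and a separate step checks uniqueness, which is the part that went wrong in the original verification.

```cpp
#include <atomic>
#include <cassert>
#include <set>
#include <thread>
#include <vector>

// Each thread takes one ticket with a relaxed fetch_add, mirroring the kernel line
// `int i = atomic_fetch_add_explicit(&index, 1, memory_order_relaxed);`.
// Returns the tickets in thread order so uniqueness can be checked afterwards.
std::vector<int> takeTickets(int threadCount) {
    assert(threadCount >= 0);
    std::atomic<int> index{0};
    std::vector<int> tickets(threadCount, -1);
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t) {
        threads.emplace_back([&index, &tickets, t] {
            // Each thread writes only its own slot, so the vector needs no lock.
            tickets[t] = index.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : threads) th.join();
    return tickets;
}

// Verification step: every ticket must appear exactly once.
bool allUnique(const std::vector<int>& tickets) {
    std::set<int> seen(tickets.begin(), tickets.end());
    return seen.size() == tickets.size();
}
```

Relaxed ordering is enough here because only the atomicity of the read-modify-write matters, not the ordering of other memory accesses around it.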

Related

Wrong semaphore in case of OpenCL usage

Solution:
I finally solved my problem, or at least found a good workaround.
This kind of semaphore doesn't work on NVIDIA.
I think this comment is right.
So I decided to use atomic_add(), which is a mandatory part of OpenCL 1.1.
I have a resultBuffer array and a resultBufferSize global variable, and the latter is set to zero.
When I have results (a result is always x numbers), I simply call
position = atomic_add(resultBufferSize, x);
and I can be sure no one else writes between position and position + x in the buffer.
Don't forget the global variable must be volatile.
When the threads run into endless loops, the resource is not available, hence the -5 error code (CL_OUT_OF_RESOURCES) when reading the buffer.
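
The reservation idea can be sketched on the host side as well. This is a C++ analogue under stated assumptions, not OpenCL code: std::atomic's fetch_add plays the role of atomic_add(), and ResultBuffer, reserveSlots and fillFromWorkers are hypothetical names introduced for illustration. The sketch also assumes the buffer is large enough for all reservations.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Shared output buffer plus a counter of used slots (initially zero),
// playing the roles of resultBuffer and resultBufferSize in the post.
struct ResultBuffer {
    std::vector<unsigned> data;
    std::atomic<unsigned> size{0};
    explicit ResultBuffer(std::size_t capacity) : data(capacity, 0u) {}
};

// Reserve x consecutive slots; the returned start position belongs to the caller alone.
unsigned reserveSlots(ResultBuffer& buffer, unsigned x) {
    unsigned position = buffer.size.fetch_add(x);
    assert(position + x <= buffer.data.size());  // the sketch assumes enough capacity
    return position;
}

// Hypothetical driver: each worker writes x copies of (id + 1) into its own range.
void fillFromWorkers(ResultBuffer& buffer, unsigned workers, unsigned x) {
    std::vector<std::thread> threads;
    for (unsigned id = 0; id < workers; ++id) {
        threads.emplace_back([&buffer, id, x] {
            unsigned position = reserveSlots(buffer, x);
            for (unsigned k = 0; k < x; ++k) buffer.data[position + k] = id + 1;
        });
    }
    for (auto& th : threads) th.join();
}
```

Because the counter is bumped by the full x in a single atomic step, two workers can never receive overlapping ranges, so no lock is needed around the writes themselves.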
Update:
When I read back:
oclErr |= clEnqueueReadBuffer(cqCommandQueue, cm_inputNodesArraySizes, CL_TRUE, 0, lastMapCounter*sizeof(cl_uint), (void*)&inputNodesArraySizes, 0, NULL, NULL);
The value of lastMapCounter changes. This is strange, because the OpenCL code does nothing to it and I am careful with sizes: what I specify at buffer creation, what I copy, and what I read back all have the same size. A hidden buffer overflow can indeed cause many strange things.
End of update
I wrote the following code and there is a bug in it. I want a semaphore to guard updates to the resultBufferSize global variable (for now I just want to see how it works), and I expect to get back a big number (each worker is supposed to write something). But I always get 3, or sometimes errors. I see no logic in how the compiler behaves.
__kernel void findCircles(__global uint *inputNodesArray,
                          __global uint *inputNodesArraySizes, uint lastMapCounter,
                          __global uint *resultBuffer,
                          __global uint *resultBufferSize, volatile __global uint *sem)
{
    for(;atom_xchg(sem, 1) > 0;)
        (*resultBufferSize) = (*resultBufferSize) + 3;
    atom_xchg(sem, 0);
}
I get -48 (CL_INVALID_KERNEL) during kernel execution, though sometimes it runs OK, and I get -5 (CL_OUT_OF_RESOURCES) when I try to read back the buffer (the size buffer).
Do you have any idea where I can find the bug?
I am using NVIDIA's OpenCL 1.1.
Of course, on the host I configure everything properly:
uint32 resultBufferSize = 0;
uint32 sem;
cl_mem cmresultBufferSize = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE,
                                           sizeof(uint32), NULL, &ciErrNum);
cl_mem cmsem = clCreateBuffer(cxGPUContext, CL_MEM_READ_WRITE, sizeof(uint32), NULL,
                              &ciErrNum);
ciErrNum = clSetKernelArg(ckKernel, 4, sizeof(cl_mem), (void*)&cmresultBufferSize);
ciErrNum = clSetKernelArg(ckKernel, 5, sizeof(cl_mem), (void*)&cmsem);
ciErrNum |= clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 1, NULL,
                                   &szGlobalWorkSize, &szLocalWorkSize, 0, NULL, NULL);
ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmresultBufferSize, CL_TRUE, 0,
                               sizeof(uint32), (void*)&resultBufferSize, 0, NULL, NULL);
(With this code the kernel runs OK and the final read returns -5.)

I know you have come to a conclusion on this, but I want to point out two things:
1) The semaphore is non-portable because it isn't SIMD safe, as pointed out in the linked thread.
2) The memory model is not strong enough to give a meaning to the code. The update of the result buffer could move out of the critical section - nothing in the model says otherwise. At the very least you'd need fences, but the language around fences in the 1.x specs is also fairly weak. You'd need an OpenCL 2.0 implementation to be confident that this aspect is safe.
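
To illustrate the second point, here is what explicit ordering looks like in a memory model that does define it. This is a CPU-side C++11 sketch, not OpenCL, and it does nothing about the SIMD-safety problem from the first point; SpinLock and addUnderLock are hypothetical names. The acquire on the exchange and the release on the store are what keep the protected update inside the critical section.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Minimal spinlock: acquire on lock and release on unlock prevent the protected
// update from being reordered outside the critical section.
class SpinLock {
public:
    void lock() {
        while (flag_.exchange(1, std::memory_order_acquire) != 0) {
            // spin until the previous owner stores 0
        }
    }
    void unlock() { flag_.store(0, std::memory_order_release); }
private:
    std::atomic<int> flag_{0};
};

// Each thread adds 3 to a plain (non-atomic) counter while holding the lock.
unsigned addUnderLock(unsigned threadCount) {
    assert(threadCount > 0);
    SpinLock lock;
    unsigned resultBufferSize = 0;
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < threadCount; ++t) {
        threads.emplace_back([&lock, &resultBufferSize] {
            lock.lock();
            resultBufferSize += 3;
            lock.unlock();
        });
    }
    for (auto& th : threads) th.join();
    return resultBufferSize;
}
```

Note that the empty loop body in lock() is deliberate: the thread only proceeds to the update once the exchange has returned 0.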

How to read vertices from a vertex buffer in Direct3D 11

I have a question regarding vertex buffers. How does one read the vertices from the vertex buffer in D3D11? I want to get a particular vertex's position for calculations; if this approach is wrong, how would one do it? The following code (obviously) does not work.
VERTEX* vert;
D3D11_MAPPED_SUBRESOURCE ms;
devcon->Map(pVBufferSphere, NULL, D3D11_MAP_READ, NULL, &ms);
vert = (VERTEX*) ms.pData;
devcon->Unmap(pVBufferSphere, NULL);
Thanks.

Where your code is wrong:
You ask the GPU to give you an address into its memory (Map()),
store this address (operator=()),
then say: "Thanks, I don't need it anymore" (Unmap()).
After Unmap, you can't really say where your pointer points. It may point to a memory location where something else has already been allocated, or at the memory of your girlfriend's laptop (just kidding =) ).
You must copy the data (all or part of it), not the pointer, between Map() and Unmap(): use memcpy, a for loop, anything. Put it in an array, std::vector, BST, whatever.
Typical mistakes that newcomers make here:
Not checking the HRESULT return value of the ID3D11DeviceContext::Map method. If Map fails, it can return whatever pointer it likes. Dereferencing such a pointer leads to undefined behavior. So check the return value of every DirectX function.
Not checking the D3D11 debug output. It can say clearly what's wrong and what to do in plain good English (clearly better than my English =) ). So you can fix the bug almost instantly.
You can only read from an ID3D11Buffer if it was created with the D3D11_CPU_ACCESS_READ CPU access flag, which means that you must also set the D3D11_USAGE_STAGING usage flag.
How do we usually read from a buffer:
We don't use staging buffers for rendering/calculations: it's slow.
Instead we copy from the main buffer (non-staging and not readable by the CPU) to a staging one (ID3D11DeviceContext::CopyResource() or ID3D11DeviceContext::CopySubresourceRegion()), and then copy the data to system memory (memcpy()).
We don't do this much in release builds; it harms performance.
There are two main real-life uses of staging buffers: debugging (checking whether a buffer contains wrong data and fixing a bug in an algorithm) and reading final non-pixel data (for example, if you are calculating scientific data in a compute shader).
In most cases you can avoid staging buffers altogether by designing your code well. Think as if the CPU<->GPU connection were one-way only: CPU->GPU.
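
The "copy the data, not the pointer" rule can be shown without any graphics API. The sketch below deliberately uses a mock: FakeBuffer, map, unmap and readBack are hypothetical stand-ins and are not part of D3D11. Only the order of operations matters here: map, copy into memory you own, unmap, and use the copy afterwards.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

struct Vertex { float x, y, z; };

// Hypothetical stand-in for a mappable GPU buffer; NOT the D3D11 API.
class FakeBuffer {
public:
    explicit FakeBuffer(std::vector<Vertex> data) : storage_(std::move(data)) {}
    // The returned pointer must be treated as valid only until unmap() is called.
    const void* map() { return storage_.data(); }
    void unmap() {}
    std::size_t byteSize() const { return storage_.size() * sizeof(Vertex); }
private:
    std::vector<Vertex> storage_;
};

// Copy the contents out while the buffer is mapped, then release the mapping.
std::vector<Vertex> readBack(FakeBuffer& buffer) {
    std::vector<Vertex> copy(buffer.byteSize() / sizeof(Vertex));
    const void* src = buffer.map();
    if (src != nullptr && !copy.empty()) {
        std::memcpy(copy.data(), src, buffer.byteSize());
    }
    buffer.unmap();
    return copy;  // the caller owns this memory; the mapped pointer is never kept
}
```

With a real device the same shape applies, with the Map result checked before the copy and a staging buffer as the source.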

The following code only gets the address of the mapped resource; you didn't read anything before Unmap.
vert = (VERTEX*) ms.pData;
If you want to read data from the mapped resource, first allocate enough memory, then use memcpy to copy the data. I don't know your VERTEX structure, so I assume vert is a void*; you can convert it yourself.
vert = new BYTE[ms.DepthPitch];
memcpy(vert, ms.pData, ms.DepthPitch);

Drop's answer was helpful. I figured out that the reason I wasn't able to read the buffer was that I hadn't set the CPU access flags to D3D11_CPU_ACCESS_READ. Here is the buffer description:
D3D11_BUFFER_DESC bufferDesc;
ZeroMemory(&bufferDesc, sizeof(bufferDesc));
bufferDesc.ByteWidth = iNumElements * sizeof(T);
bufferDesc.Usage = D3D11_USAGE_DEFAULT;
bufferDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ | D3D11_CPU_ACCESS_WRITE;
bufferDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE ;
bufferDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
bufferDesc.StructureByteStride = sizeof(T);
And then to read data I did
ID3D11DeviceContext& deviceContext = *DXUTGetD3D11DeviceContext();
D3D11_MAPPED_SUBRESOURCE ms;
HRESULT hr = deviceContext.Map(g_pParticles, 0, D3D11_MAP_READ, 0, &ms);
if (SUCCEEDED(hr))
{
    // Size of the particles themselves, not of pointers to them.
    const size_t byteCount = sizeof(Particle) * g_iNumParticles;
    Particle* p = (Particle*)malloc(byteCount);
    if (p != NULL)
        memcpy(p, ms.pData, byteCount); // copy while the buffer is still mapped
    deviceContext.Unmap(g_pParticles, 0);
    // ... inspect p here ...
    free(p); // malloc'd memory is released with free, not delete[]
}
I agree it hurts performance; I only wanted this to be able to debug the values!
Thanks anyway! =)

How do I transfer an integer to constant device memory?

I have a weird problem, so I thought I would ask and see if someone more experienced than me could see a solution.
I am writing a program with CUDA C/C++, and I have some constant integers that specify various things, like the coordinates of the bounds of the calculation, etc. Currently I just have those things in global device memory. They are accessed by every thread in every kernel call, and so I figured that if they are in global memory, then they are never cached or broadcast (right?). So these little integers carry a lot of overhead (relatively), with a lot of 'read redundancy.'
So I declare in a header:
__constant__ int* number;
I include that header, and, when I do memory stuff, I do:
cutilSafeCall( cudaMemcpyToSymbol(number, &(some_host_int), sizeof(int)) );
I then pass number into all my kernels:
__global__ void magical_kernel(int* number, ...){
//and I access 'number' like this
int data_thingy = big_array[ *number ];
}
My code crashes. With number in global memory, it is just fine. I have determined that it crashes sometime upon accessing number within the kernel. This means that either I am accessing or allocating it wrong. If it holds the wrong value, it will also cause a crash, because it is used to index into arrays.
To conclude, I will ask a few questions. First, what am I doing wrong? As a bonus: is there a better way than constant memory to accomplish this task? I don't know the value of number at compile time, so a simple #define won't work. Will constant memory even speed the code up at all, or has it been cached and broadcast all along? Could I somehow put the data in shared memory for each thread block and have it remain in shared memory across multiple kernel calls?

There are several problems here:
You have declared number as a pointer, but never assigned it a value which is a valid address in GPU memory.
You have a variable scope conflict: the argument variable int * number defined in magical_kernel is not the same variable as the __constant__ int * variable defined at compilation unit scope.
The first argument of the cudaMemcpyToSymbol call is almost certainly incorrect.
If you don't understand why either of the first two points is true, you have some revision to do on pointers and scope in C++.
Based on your response to a now deleted answer, I suspect what you are actually trying to do is this:
__constant__ int number;
__global__ void magical_kernel(...){
int data_thingy = big_array[ number ];
}
// pass the symbol itself; the string-name form ("number") was removed in CUDA 5.0
cudaMemcpyToSymbol(number, &(some_host_int), sizeof(int));
i.e. number is intended to be an integer in constant memory, not a pointer, and not a kernel argument.
EDIT: here is an example which shows this in action:
#include <cstdio>
#include <cstdlib>

__constant__ int number;

__global__ void magical_kernel(int * out)
{
    out[threadIdx.x] = number;
}

int main()
{
    const int value = 314159;
    const size_t sz = size_t(32) * sizeof(int);
    // pass the symbol itself; the string-name form ("number") was removed in CUDA 5.0
    cudaMemcpyToSymbol(number, &value, sizeof(int));

    int * _out, * out;
    out = (int *)malloc(sz);
    cudaMalloc((void **)&_out, sz);
    magical_kernel<<<1,32>>>(_out);
    cudaMemcpy(out, _out, sz, cudaMemcpyDeviceToHost);

    for(int i=0; i<32; i++)
        fprintf(stdout, "%d %d\n", i, out[i]);

    cudaFree(_out);
    free(out);
    return 0;
}
You should be able to run this yourself and confirm it works as advertised.

How to declare local memory in OpenCL?

I'm running the OpenCL kernel below with a two-dimensional global work size of 1000000 x 100 and a local work size of 1 x 100.
__kernel void myKernel(
    const int length,
    const int height,
    and a bunch of other parameters) {
    //declare some local arrays to be shared by all 100 work items in this group
    __local float LP [length];
    __local float LT [height];
    __local int bitErrors = 0;
    __local bool failed = false;
    //here come my actual computations which utilize the space in LP and LT
}
This, however, refuses to compile, since the parameters length and height are not known at compile time. But it is not clear to me at all how to do this correctly. Should I use pointers with memalloc? How do I handle this so that the memory is allocated only once for the entire workgroup and not once per work item?
All that I need is 2 arrays of floats, 1 int and 1 boolean that are shared among the entire workgroup (so all 100 work items). But I fail to find any method that does this correctly...

It's relatively simple, you can pass the local arrays as arguments to your kernel:
kernel void myKernel(const int length, const int height, local float* LP,
local float* LT, a bunch of other parameters)
You then set the kernel argument with a value of NULL and a size equal to the size you want to allocate for the argument (in bytes). Therefore it should be:
clSetKernelArg(kernel, 2, length * sizeof(cl_float), NULL);
clSetKernelArg(kernel, 3, height* sizeof(cl_float), NULL);
Local memory is always shared by the workgroup (as opposed to private), so I think the bool and int should be fine, but if not you can always pass those as arguments too.
Not really related to your problem (and not necessarily relevant, since I do not know what hardware you plan to run this on), but GPUs at least don't particularly like work-group sizes that are not a multiple of a particular power of two (I think it was 32 for NVIDIA, 64 for AMD). This means a size of 100 will probably result in workgroups of 128 items, of which the last 28 are basically wasted. So if you are running OpenCL on a GPU, it might help performance to use workgroups of size 128 directly (and change the global work size accordingly).
As a side note: I never understood why everyone uses the underscore variant for kernel, local and global, seems much uglier to me.
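
The arithmetic behind the 128 and 28 figures in the work-group size remark is a plain round-up to the next multiple. A small host-side sketch, with roundUpToMultiple and idleWorkItems as hypothetical helper names, and with the granularities 32 and 64 taken from the answer's recollection rather than from any vendor documentation:

```cpp
#include <cassert>
#include <cstddef>

// Round value up to the next multiple of `multiple` (multiple must be > 0).
inline std::size_t roundUpToMultiple(std::size_t value, std::size_t multiple) {
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

// Number of work items in the padded group that do no useful work.
inline std::size_t idleWorkItems(std::size_t used, std::size_t multiple) {
    return roundUpToMultiple(used, multiple) - used;
}
```

For a local size of 100, roundUpToMultiple(100, 32) and roundUpToMultiple(100, 64) both give 128, so idleWorkItems reports 28 in either case.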

You could also declare your arrays like this:
__local float LP[LENGTH];
and pass LENGTH as a define when compiling the kernel:
int lp_size = 128; // this is an example; could be dynamically calculated
char compileArgs[64];
sprintf(compileArgs, "-DLENGTH=%d", lp_size);
clBuildProgram(program, 0, NULL, compileArgs, NULL, NULL);

You do not have to allocate all your local memory outside the kernel, especially when it is a simple variable instead of an array.
The reason your code does not compile is that OpenCL does not support initialization of local memory. This is specified in the documentation (https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/local.html). It is also not feasible in CUDA (see "Is there a way of setting default value for shared memory array?").
PS: The answer from Grizzly is good enough, and it would be better if I could post this as a comment, but I am restricted by the reputation policy. Sorry.

Timeout in CUDA? / fermi / gtx465

I am using CUDA SDK 3.1 on MS VS2005 with GPU GTX465 1 GB. I have such a kernel function:
__global__ void CRT_GPU_2(float *A, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
    int holo_x = blockIdx.x*20 + threadIdx.x;
    int holo_y = blockIdx.y*20 + threadIdx.y;
    float k=2.0f*3.14f/0.000000054f;
    if (firstTime[0]==1.0f)
    {
        pIntensity[holo_x+holo_y*MAX_FINAL_X]=0.0f;
    }
    for (int i=0; i<pointsNumber[0]; i++)
    {
        pIntensity[holo_x+holo_y*MAX_FINAL_X]=pIntensity[holo_x+holo_y*MAX_FINAL_X]+A[i]*cosf(k*sqrtf(pow(holo_x-X[i],2.0f)+pow(holo_y-Y[i],2.0f)+pow(Z[i],2.0f)));
    }
    __syncthreads();
}
and this is the function which calls the kernel:
extern "C" void go2(float *pDATA, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
    dim3 blockGridRows(MAX_FINAL_X/20,MAX_FINAL_Y/20);
    dim3 threadBlockRows(20, 20);
    CRT_GPU_2<<<blockGridRows, threadBlockRows>>>(pDATA, X, Y, Z, pIntensity,firstTime, pointsNumber);
    CUT_CHECK_ERROR("multiplyNumbersGPU() execution failed\n");
    CUDA_SAFE_CALL( cudaThreadSynchronize() );
}
I load all the parameters for this function in a loop (for example, 4096 elements for each parameter in one loop iteration). In total I want to run this kernel for 32768 elements per parameter across all loop iterations.
The MAX_FINAL_X is 1920 and MAX_FINAL_Y is 1080.
When I start the algorithm, the first iteration goes very fast, and after one or two more iterations I get a CUDA timeout error. I used this algorithm on a GTX 260 GPU and it did better, as far as I remember...
Could you help me? Maybe I am making some mistake related to the new Fermi architecture in this algorithm?

It would be better to call CUT_CHECK_ERROR after cudaThreadSynchronize(), because kernels run asynchronously and you must wait for the kernel to finish to learn about errors. Maybe in the second iteration you receive an error from the first kernel launch.
Be sure that you have a valid number in the most interesting variable, pointsNumber[0] (a bad value might cause a very long inner loop).
You could also improve the speed of your kernel function:
Use better blocks. A 20x20 thread configuration will cause very slow memory access (see the Programming Guide and the Best Practices Guide). Try 16x16 blocks.
Do not use the pow(..., 2.0) function. It's faster to use a SQR macro (#define SQR(x) ((x)*(x))).
You don't use shared memory, so __syncthreads() is not required.
PS: You can also pass parameters to CUDA functions by value, not only as pointers. Speed will be the same.
PPS: Please improve the code's readability. Right now you must edit six places to change the block configuration. Inside the kernel you could use the blockDim variable, and you could use constants in the go2 function.
You could also make firstTime a bool; that would be MUCH better than a float.
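
Two of the suggestions above can be checked on the host with ordinary C++. The sketch below is an illustration under assumptions, not code from the thread: SQR_UNSAFE, sqr and blocksFor are hypothetical names. It shows why a squaring macro needs outer parentheses (or can simply be an inline function), and how many 16x16 blocks are needed to cover a 1920x1080 image when the extent is not an exact multiple of the block size.

```cpp
#include <cassert>

// A squaring macro WITHOUT outer parentheses; it expands textually,
// so `1.0f / SQR_UNSAFE(2.0f)` becomes `1.0f / (2.0f)*(2.0f)`.
#define SQR_UNSAFE(x) (x)*(x)

// Inline function alternative: evaluates its argument once and composes safely.
inline float sqr(float v) { return v * v; }

// Number of blocks needed to cover `extent` pixels with blocks of `blockSize` threads.
inline int blocksFor(int extent, int blockSize) {
    assert(blockSize > 0);
    return (extent + blockSize - 1) / blockSize;
}
```

Since 1080 is not divisible by 16, blocksFor(1080, 16) rounds up to 68 blocks, which means a kernel launched this way would also need a bounds check such as `if (holo_x < MAX_FINAL_X && holo_y < MAX_FINAL_Y)` before touching pIntensity.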

Is your GPU connected to a display? If so, I believe the default is that kernel execution will be aborted after 5 seconds. You can check whether kernel execution will time out by calling cudaGetDeviceProperties and looking at the kernelExecTimeoutEnabled field of the returned cudaDeviceProp (see its reference page).

In the kernel's loop you write to the same array that you read from. For global memory this is the worst case, because warps from different blocks wait for each other.
