How to declare local memory in OpenCL?

I'm running the OpenCL kernel below with a two-dimensional global work size of 1000000 x 100 and a local work size of 1 x 100.
__kernel void myKernel(
    const int length,
    const int height,
    and a bunch of other parameters) {

    //declare some local arrays to be shared by all 100 work items in this group
    __local float LP [length];
    __local float LT [height];
    __local int bitErrors = 0;
    __local bool failed = false;

    //here come my actual computations which utilize the space in LP and LT
}
This however refuses to compile, since the parameters length and height are not known at compile time. But it is not clear to me at all how to do this correctly. Should I use pointers with malloc? How do I handle this in a way that the memory is only allocated once for the entire workgroup and not once per work item?
All that I need is 2 arrays of floats, 1 int and 1 boolean that are shared among the entire workgroup (so all 100 work items). But I fail to find any method that does this correctly...

It's relatively simple, you can pass the local arrays as arguments to your kernel:
kernel void myKernel(const int length, const int height, local float* LP,
local float* LT, a bunch of other parameters)
You then set the kernel argument with a value of NULL and a size equal to the size you want to allocate for the argument (in bytes). Therefore it should be:
clSetKernelArg(kernel, 2, length * sizeof(cl_float), NULL);
clSetKernelArg(kernel, 3, height * sizeof(cl_float), NULL);
local memory is always shared by the workgroup (as opposed to private), so I think the bool and int should be fine, but if not you can always pass those as arguments too.
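For completeness, here is a minimal host-side sketch of the setup described above (a sketch only: error checking is omitted, "queue" and "kernel" are assumed to have been created elsewhere, and the length/height values are illustrative, not from the question):

#include <CL/cl.h>

/* Hedged sketch: set the arguments of the kernel shown above and launch it. */
void launch_my_kernel(cl_command_queue queue, cl_kernel kernel)
{
    cl_int length = 256;   /* illustrative sizes; LP and LT must fit in local memory */
    cl_int height = 100;

    clSetKernelArg(kernel, 0, sizeof(cl_int), &length);          /* const int length */
    clSetKernelArg(kernel, 1, sizeof(cl_int), &height);          /* const int height */
    clSetKernelArg(kernel, 2, length * sizeof(cl_float), NULL);  /* local float* LP  */
    clSetKernelArg(kernel, 3, height * sizeof(cl_float), NULL);  /* local float* LT  */

    size_t global[2] = { 1000000, 100 };  /* global work size from the question */
    size_t local[2]  = { 1, 100 };        /* one workgroup = the 100 work items */
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
}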
Not really related to your problem (and not necessarily relevant, since I do not know what hardware you plan to run this on), but GPUs in particular don't like work sizes that are not a multiple of a particular power of two (I think it was 32 for NVIDIA and 64 for AMD), which means it will probably create workgroups with 128 items, of which the last 28 are basically wasted. So if you are running OpenCL on a GPU, it might help performance if you directly use workgroups of size 128 (and change the global work size appropriately).
As a side note: I never understood why everyone uses the underscore variant for kernel, local and global, seems much uglier to me.

You could also declare your arrays like this:
__local float LP[LENGTH];
And pass the LENGTH as a define in your kernel compile.
int lp_size = 128; // this is an example; could be dynamically calculated
char compileArgs[64];
sprintf(compileArgs, "-DLENGTH=%d", lp_size);
clBuildProgram(program, 0, NULL, compileArgs, NULL, NULL);

You do not have to allocate all your local memory outside the kernel, especially when it is a simple variable rather than an array.
The reason that your code cannot compile is that OpenCL does not support local memory initialization. This is specified in the documentation (https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/local.html). It is also not feasible in CUDA (see: Is there a way of setting default value for shared memory array?).
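To make that concrete, here is a hedged OpenCL C sketch (reusing the names from the question) of initializing the shared scalars inside the kernel instead of at their declarations:

__kernel void myKernel(const int length, const int height,
                       __local float* LP, __local float* LT)
{
    __local int bitErrors;   /* shared by the workgroup, initialized below    */
    __local int failed;      /* 0/1 flag; an int is used here for portability */

    /* One work item initializes the shared scalars, then everybody waits so
       the whole workgroup sees the initialized values. */
    if (get_local_id(0) == 0 && get_local_id(1) == 0) {
        bitErrors = 0;
        failed = 0;
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    /* ... the actual computations using LP, LT, bitErrors and failed ... */
}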
PS: The answer from Grizzly is good enough, and it would be better if I could post this as a comment, but I am restricted by the reputation policy. Sorry.

Related

Any way to check if a bool** pointer to pointer contains a true value in C without loops?

I am trying to optimise an iOS app and just wanted some advice on an issue I am having.
I have a bool** which can hold up to 1024 * 1024 elements. Each element defaults to false but they can also be changed to true at random.
I wanted to know if there is a streamlined way to check whether a true value is contained in any of the elements, as, in a worst-case scenario, over a million iterations would be needed to do this check using two loops.
I could be completely wrong, but I had thought about casting the memory to an int, in the belief that, since a false value is equal to 0, the result of casting it to an int would be 0 if all the elements were false. This is not the case, however.
What I may have to go with is to keep a tally of the number of true values as they are toggled, but this could turn quite messy very quickly.
I hope I have made myself clear enough without code but, if you need to see any code, just ask.
--Edit--
So I decided to go with mvp's answer. Then when I need to check if a certain bit is set:
uint32_t mask32_t[] = {
0x01, 0x02, 0x04, 0x08,
0x10, 0x20, 0x40, 0x80,
0x100, 0x200, 0x400, 0x800,
0x1000, 0x2000, 0x4000, 0x8000,
0x10000, 0x20000, 0x40000, 0x80000,
0x100000, 0x200000, 0x400000, 0x800000,
0x1000000, 0x2000000, 0x4000000, 0x8000000,
0x10000000, 0x20000000, 0x40000000, 0x80000000
};
bool bitIsSet(uint32_t word, int n) {
    return ( word & mask32_t[ n ] ) != 0x00;
}

bool isSetAtPoint( uint32_t** arr, int x, int y ) {
    return bitIsSet( arr[ (int)floor(x / 32.0) ][ y ], x % 32 );
}
Probably the best optimization you can make is to convert your 1024x1024 array of booleans into an array of bits. This approach has some benefits and drawbacks:
+ The bit array will consume 8 times less memory: 1024*1024/8 = 128KB.
+ You can quickly test many bits at once by checking 32-bit integers. In other words, finding the first non-zero bit can be 32 times as fast.
- You need to create custom routines to read and write bits in this array. However, this is a rather simple task - just a bit of bit twiddling :) (see the sketch below).
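A minimal sketch of such routines, assuming a fixed 1024x1024 grid (the names grid_t, grid_set, grid_any_true are illustrative, not from the question):

#include <stdint.h>
#include <stdbool.h>

enum { GRID_DIM = 1024, WORDS_PER_ROW = GRID_DIM / 32 };

/* 1024x1024 booleans packed into 32-bit words: 128KB in total. */
typedef struct { uint32_t bits[GRID_DIM][WORDS_PER_ROW]; } grid_t;

static void grid_set(grid_t *g, int x, int y, bool value) {
    uint32_t mask = 1u << (x % 32);
    if (value) g->bits[y][x / 32] |=  mask;
    else       g->bits[y][x / 32] &= ~mask;
}

static bool grid_get(const grid_t *g, int x, int y) {
    return (g->bits[y][x / 32] >> (x % 32)) & 1u;
}

/* "Does any element hold true?" now tests 32 booleans per comparison. */
static bool grid_any_true(const grid_t *g) {
    for (int y = 0; y < GRID_DIM; ++y)
        for (int w = 0; w < WORDS_PER_ROW; ++w)
            if (g->bits[y][w] != 0) return true;
    return false;
}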
Basically, the answer is no. You need to write a loop to check.
You cannot cast a composite value to an int. You don't report a compiler error, so I suspect you actually tried casting the pointer to your array to int, which is legal; however, the result will only be 0 if the pointer is NULL (that is, not pointing to anything.)
You can streamline the checking code by using a bitvector instead of an array of bools. (You'd also save quite a bit of memory.) However, that involves a lot more code, and it's fairly messy. If you were using C++, you would have access to std::bitset, which is a lot easier than writing your own, but has the disadvantage of needing a compile-time size.
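For illustration, a tiny sketch of the std::bitset variant (a sketch only; the 1024x1024 size must be known at compile time, and the helper names are made up):

#include <bitset>
#include <cstddef>

const std::size_t kDim = 1024;
std::bitset<kDim * kDim> grid;   // every bit starts out false; ~128KB total

inline void setAt(std::size_t x, std::size_t y, bool v) { grid[y * kDim + x] = v; }
inline bool getAt(std::size_t x, std::size_t y)         { return grid[y * kDim + x]; }
inline bool anyTrue()                                    { return grid.any(); }  // no explicit loop in user code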
Using an integer array has been suggested here already and is probably the best option, but here's another approach.
If the number of true nodes is very low compared to false nodes, you could turn the problem upside down: store the coordinates of the true nodes instead.
I'm thinking of a hash set or BST where you would use the position of the true node as a key. Then checking for the existence of a true value would be trivial: check whether your data structure has any coordinates stored in it.
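A rough sketch of that idea using a hash set (the packed key and the helper names are illustrative assumptions):

#include <unordered_set>
#include <cstdint>

std::unordered_set<uint32_t> trueCells;   // empty set means "no true value anywhere"

// Pack (x, y) with x, y in 0..1023 into one 32-bit key.
inline uint32_t cellKey(int x, int y)  { return (uint32_t(y) << 10) | uint32_t(x); }

inline void setTrue(int x, int y)      { trueCells.insert(cellKey(x, y)); }
inline void setFalse(int x, int y)     { trueCells.erase(cellKey(x, y)); }
inline bool isTrue(int x, int y)       { return trueCells.count(cellKey(x, y)) != 0; }
inline bool anyTrue()                  { return !trueCells.empty(); }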

How to read vertices from vertex buffer in Direct3d11

I have a question regarding vertex buffers. How does one read the vertices from the vertex buffer in D3D11? I want to get a particular vertex's position for calculations; if this approach is wrong, how would one do it? The following code (obviously) does not work.
VERTEX* vert;
D3D11_MAPPED_SUBRESOURCE ms;
devcon->Map(pVBufferSphere, NULL, D3D11_MAP_READ, NULL, &ms);
vert = (VERTEX*) ms.pData;
devcon->Unmap(pVBufferSphere, NULL);
Thanks.
Where your code is wrong:
You ask the GPU to give you an address into its memory (Map()),
store this address (operator=()),
and then say: "Thanks, I don't need it anymore" (Unmap()).
After unmapping, you can't really say where your pointer points. It may point to a memory location where something else has already been allocated, or to memory on your girlfriend's laptop (just kidding =) ).
You must copy the data (all of it or part of it), not the pointer, between Map() and Unmap(): use memcpy, a for loop, anything. Put it in an array, a std::vector, a BST, whatever.
Typical mistakes that newcomers make here:
Not checking the HRESULT return value of the ID3D11DeviceContext::Map method. If Map fails, it can return whatever pointer it likes; dereferencing such a pointer leads to undefined behavior. So it's better to check the return value of any DirectX function.
Not checking the D3D11 debug output. It can clearly say what's wrong and what to do, in plain good English (clearly better than my English =) ), so you can fix the bug almost instantly.
You can only read from an ID3D11Buffer if it was created with the D3D11_CPU_ACCESS_READ CPU access flag, which means that you must also set the D3D11_USAGE_STAGING usage flag.
How we usually read from a buffer:
We don't use staging buffers for rendering/calculations: they're slow.
Instead we copy from the main buffer (non-staging and non-readable by the CPU) to a staging one (ID3D11DeviceContext::CopyResource() or ID3D11DeviceContext::CopySubresourceRegion()), and then copy the data to system memory (memcpy()); a sketch of this pattern follows the list.
We don't do this too often in release builds, as it will harm performance.
There are two main real-life usages of staging buffers: debugging (see if a buffer contains wrong data and fix some bug in the algorithm) and reading back final non-pixel data (for example, if you calculate scientific data in a compute shader).
In most cases you can avoid staging buffers entirely by designing your code well. Think as if CPU<->GPU were connected one way only: CPU->GPU.
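A hedged sketch of that copy-to-staging-then-map pattern (the helper and its signature are illustrative, not from the question):

#include <d3d11.h>
#include <cstring>

// Create a CPU-readable staging copy of a GPU-only buffer and fill it via
// CopyResource. The caller maps it, reads the bytes, unmaps and releases it.
ID3D11Buffer* CreateStagingCopy(ID3D11Device* device, ID3D11DeviceContext* context,
                                ID3D11Buffer* srcBuffer)
{
    D3D11_BUFFER_DESC desc = {};
    srcBuffer->GetDesc(&desc);
    desc.Usage          = D3D11_USAGE_STAGING;      // CPU-readable copy
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;
    desc.BindFlags      = 0;                        // staging buffers cannot be bound
    desc.MiscFlags      = 0;

    ID3D11Buffer* staging = nullptr;
    if (FAILED(device->CreateBuffer(&desc, nullptr, &staging)))
        return nullptr;

    context->CopyResource(staging, srcBuffer);      // GPU -> staging copy
    return staging;
}

// Usage: map the staging copy, memcpy the bytes out, then unmap and release.
//   ID3D11Buffer* staging = CreateStagingCopy(device, context, gpuBuffer);
//   D3D11_MAPPED_SUBRESOURCE ms;
//   if (staging && SUCCEEDED(context->Map(staging, 0, D3D11_MAP_READ, 0, &ms))) {
//       std::memcpy(cpuData, ms.pData, byteWidth);
//       context->Unmap(staging, 0);
//   }
//   if (staging) staging->Release();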
The following code only gets the address of the mapped resource; you didn't read anything before Unmap.
vert = (VERTEX*) ms.pData;
If you want to read data from the mapped resource, first allocate enough memory, then use memcpy to copy the data. I don't know your VERTEX structure, so suppose vert is a raw byte pointer; you can convert it yourself:
vert = new BYTE[ms.DepthPitch];
memcpy(vert, ms.pData, ms.DepthPitch);
Drop's answer was helpful. I figured out that the reason I wasn't able to read the buffer was that I didn't have the CPU access flag set to D3D11_CPU_ACCESS_READ before. Here is the buffer description I use now:
D3D11_BUFFER_DESC bufferDesc;
ZeroMemory(&bufferDesc, sizeof(bufferDesc));
bufferDesc.ByteWidth = iNumElements * sizeof(T);
bufferDesc.Usage = D3D11_USAGE_DEFAULT;
bufferDesc.CPUAccessFlags = D3D11_CPU_ACCESS_READ | D3D11_CPU_ACCESS_WRITE;
bufferDesc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE ;
bufferDesc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
bufferDesc.StructureByteStride = sizeof(T);
And then to read data I did
const ID3D11Device& device = *DXUTGetD3D11Device();
ID3D11DeviceContext& deviceContext = *DXUTGetD3D11DeviceContext();

D3D11_MAPPED_SUBRESOURCE ms;
HRESULT hr = deviceContext.Map(g_pParticles, 0, D3D11_MAP_READ, 0, &ms);

Particle* p = (Particle*)malloc(sizeof(Particle) * g_iNumParticles);
ZeroMemory(p, sizeof(Particle) * g_iNumParticles);
memcpy(p, ms.pData, sizeof(Particle) * g_iNumParticles);

deviceContext.Unmap(g_pParticles, 0);
free(p);
I agree it's a performance hit; I wanted to do this just to be able to debug the values!
Thanks anyway! =)

CUDA kernels and memory access (one kernel doesn't execute entirely and the next doesn't get launched)

I'm having trouble here. I launch two kernels, check if some value is the one expected (memcpy to the host); if it is I stop, if it isn't I launch the two kernels again.
The first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
    int obj = threadIdx.x;
    int ant = blockIdx.x;
    int id = threadIdx.x + blockIdx.x * blockDim.x;

    *(data->added) = 1;

    while(*(data->added) == 1)
    {
        *(data->added) = 0;

        //check if obj fits
        int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
        fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));

        if(obj == 0)
            printf("ant %d going..\n", ant);
        __syncthreads();
        ...
The code goes on after this. But that printf never gets printed; the __syncthreads is there just for debugging purposes.
The "added" variable was shared, but since shared memory is a PITA and usually throws bugs in the code, I just removed it for now. This "added" variable isn't the smartest thing to do, but it's faster than the alternative, which is checking on the host whether any variable within an array has some value and deciding whether to keep iterating or not.
The getElement function simply does the matrix memory calculation with the pitch to access the right position and returns the element there:
int* el = (int*) ((char*)mat + row * pitch) + col;
return *el;
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right at the beginning, and it doesn't get printed either. After the printf, it sets a variable in device memory, and this memory is copied to the CPU after the kernel finishes; it doesn't hold the right value when I print it in the CPU code. So I think that kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances: when I launch 8 blocks and 512 threads, it runs OK; 32 blocks and 512 threads, OK. But with 8 blocks and 1024 threads this happens and the kernel doesn't work, and neither does 32 blocks and 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: tried removing the "added" variable and the while loop, so it should execute just once. It still doesn't work, nothing gets printed, even if the printf is right after the three initial lines, and the next kernel doesn't print anything either.
edit: another thing, I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with 512 as a maximum, or check how high I can set this value.
__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers from a race condition and is nondeterministic, so it most probably evaluates to different results for different threads.
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't, due to the problem mentioned above, so the output never shows up. You could have figured this out by checking the return codes of all CUDA function calls for errors.
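For illustration, a minimal error-checking sketch (CHECK_CUDA is an illustrative macro, not part of the CUDA toolkit):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call reports an error.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err__ = (call);                                   \
        if (err__ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err__), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// A kernel launch itself returns void, so check cudaGetLastError() for
// launch/configuration errors and a synchronizing call for execution errors:
//   aco_step<<<blocks, threads>>>(data);
//   CHECK_CUDA(cudaGetLastError());
//   CHECK_CUDA(cudaDeviceSynchronize());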

How do I transfer an integer to __constant__ device memory?

I have a weird problem, so I thought I would ask and see if someone more experienced than me could see a solution.
I am writing a program with CUDA C/C++, and I have some constant integers that specify various things, like coordinates of the bounds of the calculation, etc. Currently I just have those things in global device memory. They are accessed by every thread in every kernel call, and so I figured that if they are in global memory, then they are never being cached or broadcast (right?). And so these little integers are taking up a lot (relatively) of overhead, and have a lot of 'read redundancy'.
So I declare in a header:
__constant__ int* number;
I include that header, and, when I do memory stuff, I do:
cutilSafeCall( cudaMemcpyToSymbol(number, &(some_host_int), sizeof(int)) );
I then pass number into all my kernels:
__global__ void magical_kernel(int* number, ...){
//and I access 'number' like this
int data_thingy = big_array[ *number ];
}
My code crashes. With number in global memory, it is just fine. I have determined that it crashes sometime upon accessing number within the kernel. This means that I am either accessing it or allocating it wrong. If it holds the wrong value, it will also cause a crash, because it is used to index into arrays.
To conclude, I will ask a few questions. First, what am I doing wrong? As a bonus: is there a better way than constant memory to accomplish this task - I don't know the value of number at compile time, so a simple #define won't work. Will constant memory even speed the code up at all, or has it been cached and broadcasted all along? Could I somehow put the data in shared memory for each threadblock and have it remain in shared memory through multiple kernel calls?
There are several problems here:
You have declared number as a pointer, but never assigned it a value which is a valid address in GPU memory.
You have a variable scope conflict: the argument variable int* number defined in magical_kernel is not the same variable as the __constant__ int* variable defined at compilation unit scope.
The first argument of the cudaMemcpyToSymbol call is almost certainly incorrect.
If you don't understand why either of the first two points is true, you have some revision to do on pointers and scope in C++.
Based on your response to a now deleted answer, I suspect what you are actually trying to do is this:
__constant__ int number;
__global__ void magical_kernel(...){
int data_thingy = big_array[ number ];
}
cudaMemcpyToSymbol("number", &(some_host_int), sizeof(int));
i.e. number is intended to be an integer in constant memory, not a pointer, and not a kernel argument.
EDIT: here is an example which shows this in action:
#include <cstdio>

__constant__ int number;

__global__ void magical_kernel(int * out)
{
    out[threadIdx.x] = number;
}

int main()
{
    const int value = 314159;
    const size_t sz = size_t(32) * sizeof(int);
    cudaMemcpyToSymbol("number", &value, sizeof(int));

    int * _out, * out;
    out = (int *)malloc(sz);
    cudaMalloc((void **)&_out, sz);

    magical_kernel<<<1,32>>>(_out);
    cudaMemcpy(out, _out, sz, cudaMemcpyDeviceToHost);

    for(int i=0; i<32; i++)
        fprintf(stdout, "%d %d\n", i, out[i]);

    return 0;
}
You should be able to run this yourself and confirm it works as advertised.

Timeout in CUDA? / fermi / gtx465

I am using CUDA SDK 3.1 on MS VS2005 with GPU GTX465 1 GB. I have such a kernel function:
__global__ void CRT_GPU_2(float *A, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
    int holo_x = blockIdx.x*20 + threadIdx.x;
    int holo_y = blockIdx.y*20 + threadIdx.y;
    float k=2.0f*3.14f/0.000000054f;

    if (firstTime[0]==1.0f)
    {
        pIntensity[holo_x+holo_y*MAX_FINAL_X]=0.0f;
    }

    for (int i=0; i<pointsNumber[0]; i++)
    {
        pIntensity[holo_x+holo_y*MAX_FINAL_X]=pIntensity[holo_x+holo_y*MAX_FINAL_X]+A[i]*cosf(k*sqrtf(pow(holo_x-X[i],2.0f)+pow(holo_y-Y[i],2.0f)+pow(Z[i],2.0f)));
    }

    __syncthreads();
}
and this is the function that calls the kernel:
extern "C" void go2(float *pDATA, float *X, float *Y, float *Z, float *pIntensity, float *firstTime, float *pointsNumber)
{
dim3 blockGridRows(MAX_FINAL_X/20,MAX_FINAL_Y/20);
dim3 threadBlockRows(20, 20);
CRT_GPU_2<<<blockGridRows, threadBlockRows>>>(pDATA, X, Y, Z, pIntensity,firstTime, pointsNumber);
CUT_CHECK_ERROR("multiplyNumbersGPU() execution failed\n");
CUDA_SAFE_CALL( cudaThreadSynchronize() );
}
I load all the parameters to this function in a loop (for example, 4096 elements for each parameter in one loop iteration). In total I want to run this kernel for 32768 elements per parameter over all loop iterations.
The MAX_FINAL_X is 1920 and MAX_FINAL_Y is 1080.
When I start the algorithm the first iteration goes very fast, and after one or two more iterations I get a CUDA timeout error. I used this algorithm on a GTX 260 GPU and it was doing better, as far as I remember...
Could you help me? Maybe I am making some mistake in this algorithm related to the new Fermi architecture?
It would be better to call CUT_CHECK_ERROR after cudaThreadSynchronize(), because kernels run asynchronously and you must wait for the kernel to finish to know about errors... Maybe in the second iteration you receive an error from the first kernel run.
Be sure that you have a valid number in the most interesting variable, pointsNumber[0] (it might cause a long internal loop).
You could also improve the speed of your kernel function:
Use better blocks. A thread configuration of 20x20 will cause very slow memory usage (see the Programming Guide and Best Practices). Try to use 16x16 blocks.
Do not use the pow(..., 2.0) function. It's faster to use an SQR macro (#define SQR(x) ((x)*(x))).
You don't use shared memory, so __syncthreads() is not required.
PS: You could also pass value parameters to CUDA functions, not only pointers. Speed will be the same.
PPS: please improve the code's readability... Right now you must edit six places to change the block configuration... Inside the kernel you could use the blockDim variable, and you could use constants in the go2 function.
You could also use bool firstTime - it would be MUCH better than float.
Is your GPU connected to a display? If so, I believe the default is that kernel execution will be aborted after 5 seconds. You can check whether kernel execution will time out by using cudaGetDeviceProperties - see the reference page.
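For example, a small hedged sketch of that query (device 0 assumed):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0; adjust as needed
    printf("%s: run time limit on kernels %s\n",
           prop.name, prop.kernelExecTimeoutEnabled ? "ENABLED" : "disabled");
    return 0;
}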
In the kernel's loop you write to the same array that you read from - for global memory usage this is the worst case, because warps from different blocks wait for each other.
