OpenCL global memory

My OpenCL kernel needs a few MB of input data, about 300 MB of temporary global memory to work in, and it returns only a few MB. The only way I know to give the kernel this temporary memory is to allocate it on the host with malloc and pass it via clCreateBuffer, but copying 300 MB to the GPU takes time and also ties up 300 MB of host RAM. Is it possible to skip this step, either by allocating global device memory inside the kernel, or by declaring a 300 MB buffer without backing it with malloc and without copying it to the GPU?

If you just call clCreateBuffer without using a host pointer, then memory will be allocated on the device without copying any data from the host. For example:
buffer = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);
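A slightly fuller sketch of that pattern (error handling trimmed; it assumes a context, queue, kernel, host input/output buffers, and the various sizes already exist). The scratch buffer is created without a host pointer and without CL_MEM_COPY_HOST_PTR, so nothing is allocated or copied on the host; only the small input and output buffers cross the bus:
/* Input and output are small; the 300 MB scratch buffer lives only on the device. */
cl_mem input   = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                input_size, host_input, &err);
cl_mem scratch = clCreateBuffer(context, CL_MEM_READ_WRITE,      /* no host pointer:        */
                                300 * 1024 * 1024, NULL, &err);  /* device allocation only  */
cl_mem output  = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                output_size, NULL, &err);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &scratch);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &output);

clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);

/* Only the small result is copied back to the host. */
clEnqueueReadBuffer(queue, output, CL_TRUE, 0, output_size, host_output, 0, NULL, NULL);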

Related

How can I use Hugepage memory from kernel space?

I need to be able to allocate 2 MB or 4 MB sized pages of memory in a kernel module.
In the Linux kernel, you can allocate physically contiguous memory with:
__get_free_pages(flags, order);
where flags are the usual GFP flags and order defines the number of pages allocated: number of pages = 2 ^ order. You can use this function as a proxy between the kernel and your calling code.
Another approach is to allocate a huge page directly, if one is available.
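A minimal sketch of the __get_free_pages() approach in a module (the 4 MB size is just the figure from the question; get_order() converts the byte count to the page order):
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/gfp.h>

static unsigned long buf;      /* kernel virtual address of the region */
static unsigned int  order;    /* log2 of the number of base pages     */

static int __init demo_init(void)
{
    order = get_order(4 * 1024 * 1024);        /* 4 MB -> order 10 with 4 KB pages */
    buf = __get_free_pages(GFP_KERNEL, order); /* physically contiguous allocation */
    return buf ? 0 : -ENOMEM;
}

static void __exit demo_exit(void)
{
    if (buf)
        free_pages(buf, order);                /* must match the allocation order */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
Note that order 10 is typically the largest order the page allocator will serve, and such allocations fail easily once memory is fragmented, which is one reason the huge-page route is also worth considering.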

Can a userspace process kfree() memory with GFP_USER?

I have a kernel module that handles IOCTL calls from userspace. One of the calls needs to return a variable length buffer from the kernel into userspace. From the module, I can kmalloc( ..., GFP_USER) a buffer for the userspace process to use. But, my question is, can this buffer be freed from userspace, or does it need to be freed from kernel space?
Alternatively, is there a better way to handle data transfer with variable length data?
No, user space can't free kernel memory. Your module would have to offer another call / ioctl to let user space tell your kernel code to free the memory. You would also have to track your allocations and free them when the user space process exits, so as not to leak memory. Also, kernel memory is not swappable; if user space makes you allocate again and again, it could run the kernel out of memory, so you have to guard against that too.
The easier method is to just let user space offer the buffer from its own memory. Include a maximum length argument in the call so that you won't write more than user space expects and return partial data or an error if the size is too small, as appropriate.
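A minimal sketch of that pattern, using a hypothetical ioctl command MYDEV_IOC_READ and a hypothetical request struct (the names and the placeholder data are illustrative, not from the question): user space supplies the buffer and its maximum length, and the kernel copies at most that much with copy_to_user():
#include <linux/fs.h>
#include <linux/ioctl.h>
#include <linux/uaccess.h>

/* Hypothetical request layout, shared with user space via a header. */
struct mydev_read_req {
    void __user *buf;   /* buffer supplied by user space     */
    size_t       len;   /* maximum length user space accepts */
};
#define MYDEV_IOC_READ _IOW('M', 1, struct mydev_read_req)

static char   data[64] = "example payload";   /* placeholder driver data */
static size_t data_len = sizeof(data);

static long mydev_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    struct mydev_read_req req;
    size_t n;

    if (cmd != MYDEV_IOC_READ)
        return -ENOTTY;
    if (copy_from_user(&req, (void __user *)arg, sizeof(req)))
        return -EFAULT;

    n = req.len < data_len ? req.len : data_len;  /* never write more than asked for */
    if (copy_to_user(req.buf, data, n))
        return -EFAULT;

    return n;                                     /* report how many bytes were copied */
}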
GFP_USER means the allocation is kernel-space memory that you may allow user space to access (it is used as a marker for pages shared between kernel and user space). Note that the allocation can sleep/block and can only be made in process context.
However, memory that is allocated in kernel space always gets freed in kernel space, and vice versa for user space.

Can I load data from RAM by using pointer to memory with physical addressing?

Can I load data from RAM by using a pointer to memory with physical addressing (not virtual) from my driver (Linux kernel), without allocating pages (PDEs/PTEs) for virtual addressing?
Yes! "/dev/mem" is an image of physical memory, and you can even access this from user space.
For example, to access physical address 0x7000000, the code below summarizes the steps:
fd = open("/dev/mem", O_RDWR);
map = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0x7000000);
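A slightly more complete sketch of the same steps (the 0x7000000 address and the 4 KB size are just examples; the mmap offset must be page-aligned, and on kernels built with CONFIG_STRICT_DEVMEM access to ordinary RAM through /dev/mem may be restricted):
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define PHYS_ADDR 0x7000000UL   /* example physical address (page-aligned) */
#define MAP_SIZE  4096UL        /* example size to map                     */

int main(void)
{
    int fd = open("/dev/mem", O_RDWR | O_SYNC);
    if (fd < 0) { perror("open /dev/mem"); return 1; }

    void *map = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, PHYS_ADDR);
    if (map == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    /* Read the first 32-bit word at that physical address. */
    printf("value: 0x%08x\n", *(volatile uint32_t *)map);

    munmap(map, MAP_SIZE);
    close(fd);
    return 0;
}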

How to allocate all of the available shared memory to a single block in CUDA?

I want to allocate all the available shared memory of an SM to one block. I am doing this because I don't want multiple blocks to be assigned to the same SM.
My GPU card has 64KB (Shared+L1) memory. In my current configuration, 48KB is assigned to the Shared memory and 16KB to the L1.
I wrote the following code to use up all of the available Shared memory.
__global__ void foo()
{
    __shared__ char array[49152];
    ...
}
I have two questions:
How can I make sure that all of the shared memory space is used up?
I can increase "48K" to a much higher value (without getting any error or warning). Can anyone explain why this is allowed?
Thanks in advance,
Iman
You can read the amount of shared memory available per block from cudaDeviceProp::sharedMemPerBlock, which you obtain by calling cudaGetDeviceProperties().
You do not have to hard-code the size of your array. Instead, you can declare it as extern __shared__ and pass the shared memory size dynamically as the third kernel launch configuration parameter.
The "clock" CUDA SDK sample illustrates how you can specify shared memory size at launch time.

Maximum memory allocation on openCL CPU

I have read that there is a limit on the maximum memory allocation of around 60% of device memory, and that this can be changed via the GPU_MAX_HEAP_SIZE and GPU_MAX_ALLOC_SIZE environment variables for the GPU.
I am wondering whether the AMD SDK has something similar for the CPU, in case I want to raise the memory allocation limit.
For my current configuration, it returns the following:
CL_DEVICE_MAX_MEM_ALLOC_SIZE = 2973.37MB
CL_DEVICE_GLOBAL_MEM_SIZE = 11893.5MB
Thanks.
I was able to change this on my system. I don't know if this method was possible when you originally asked the question.
Set the environment variable CPU_MAX_ALLOC_PERCENT to the percentage of total memory you want to be able to allocate for a single global buffer. I have 8 GB of system memory, and after setting CPU_MAX_ALLOC_PERCENT to 80, clinfo reports the following:
Max memory allocation: 6871207116
Success! 6.399GB
You can also use GPU_MAX_ALLOC_PERCENT in the same way for your GPU devices.
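To confirm the new limit from your own code rather than through clinfo, you can query the device directly (a sketch assuming device already holds the cl_device_id of the CPU device):
cl_ulong max_alloc = 0, global_mem = 0;

clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE,
                sizeof(max_alloc), &max_alloc, NULL);
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                sizeof(global_mem), &global_mem, NULL);

printf("Max single allocation: %llu bytes\n", (unsigned long long)max_alloc);
printf("Global memory:         %llu bytes\n", (unsigned long long)global_mem);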
