I have been using cl_mem in some of my OpenCL boilerplate code, but mostly by imitation rather than with a sharp understanding of what exactly it is. I use it as the type for the memory I push onto and off the board, which so far has been floats. I tried looking at the OpenCL docs, but cl_mem doesn't show up (does it?). Is there any documentation on it, or is it simple enough that someone can explain it?
The cl_mem type is a handle to a "Memory Object" (as described in Section 3.5 of the OpenCL 1.1 Spec). These are essentially the inputs and outputs of OpenCL kernels, and they are returned from OpenCL API calls in host code, such as clCreateBuffer:
cl_mem clCreateBuffer(cl_context context, cl_mem_flags flags,
                      size_t size, void *host_ptr, cl_int *errcode_ret)
The memory areas represented can be given different access permissions (e.g. read-only) or be allocated in different memory regions, depending on the flags set in the create-buffer call.
The handle is typically stored to allow a later call to release the memory, e.g.:
cl_int clReleaseMemObject (cl_mem memobj)
In short, it provides an abstraction over where the memory actually is: you can copy data into the associated memory, or back out, via the OpenCL APIs clEnqueueWriteBuffer and clEnqueueReadBuffer, but the OpenCL implementation can allocate the space wherever it wants.
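To illustrate, here is a minimal sketch of the typical cl_mem lifecycle (assuming a context and command queue already exist; error handling omitted):

cl_int err;
const size_t n = 1024;
float host_data[1024];
for (size_t i = 0; i < n; ++i)
    host_data[i] = (float)i;

/* The returned cl_mem is only a handle; the implementation decides
   where the underlying memory actually lives. */
cl_mem buf = clCreateBuffer(context, CL_MEM_READ_WRITE,
                            n * sizeof(float), NULL, &err);

/* Copy host -> device (blocking write). */
clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                     host_data, 0, NULL, NULL);

/* ... set the buffer as a kernel argument and run the kernel here ... */

/* Copy device -> host (blocking read). */
clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                    host_data, 0, NULL, NULL);

/* Release the handle when done. */
clReleaseMemObject(buf);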
To the computer, a cl_mem is a number (like a file descriptor on Linux) reserved for use as a memory identifier: the API/driver stores information about your memory under this number, so it knows what the object holds, how big it is, and so on.
Related
I am confused about the __local memory in OpenCL here. I read in the spec that the data flow has to be from host to __global, and then to __local. But I also see kernel functions like this:
__kernel void foo(__local float * a)
I was wondering how the data is transferred directly into __local memory this way?
Thanks.
It is not possible to fill a local buffer from the host side. Therefore you have to follow the flow host -> __global -> __local.
A local buffer can be created either on the host side, in which case it is passed as a kernel parameter, or on the GPU side inside the kernel.
Creating the local buffer on the host side has the advantage that you can decide its size just before the kernel is run, which matters if the local buffer size needs to be different each time the kernel runs. Both options are sketched below.
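As a sketch, the two options look like this (kernel names and the size of 256 floats are just for illustration):

/* Option 1: the host decides the size. The kernel declares a __local
   pointer parameter and the host passes only a size, with a NULL value: */
/* __kernel void foo(__local float *scratch) { ... } */
clSetKernelArg(kernel, 0, 256 * sizeof(float), NULL);

/* Option 2: the size is fixed inside the kernel itself: */
/* __kernel void bar(...) { __local float scratch[256]; ... } */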
Local memory is not visible to anything but a single work-group, and may be allocated as the work-group is dispatched by hardware on many architectures. Hardware that can mix multiple work-groups from different kernels on each CU will allow the scheduling component to chunk up the local memory for each of the groups being issued. It doesn't exist before the group is launched, and does not exist after the group terminates. The size of this region is what you pass in as other answers have pointed out.
The result of this is that the only way on many architectures for filling local memory from the host would be for kernel code to be inserted by the compiler that would copy data in from global memory. Given that as the basis, it isn't any worse in terms of performance for the programmer to do it manually, and gives more control over exactly what happens. You do not end up in a situation where the compiler always generates copy code and ends up copying more than was really necessary because the API didn't make it clear what memory was copy-in and what was not.
In summary, you cannot fill local memory in any automated way. In practice you will rarely want to, because doing it manually gives you the opportunity to only put the result of a first stage into local, removing extra copy operations, or to transform the data on the way in to local, allowing padding or data transposition to remove bank conflicts and so on.
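A sketch of such a manual fill: each work-item stages one element from __global into __local, then the work-group synchronizes before using the tile (names and the one-element-per-item scheme are illustrative):

__kernel void process(__global const float *in,
                      __global float *out,
                      __local float *tile)  /* size set by the host */
{
    size_t lid = get_local_id(0);
    size_t gid = get_global_id(0);

    /* Each work-item copies one element into local memory. */
    tile[lid] = in[gid];

    /* Make the staged data visible to the whole work-group. */
    barrier(CLK_LOCAL_MEM_FENCE);

    /* ... operate on tile[] here, e.g. neighbour accesses ... */
    out[gid] = tile[lid];
}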
As @doqtor said, the size of a local-memory kernel parameter can be specified with a clSetKernelArg call.
Fortunately, OpenCL 1.2+ supports variable-length arrays (VLAs), so a local-memory kernel parameter is not required any more.
I am working on a PCIe-based network driver. Different examples use one of pci_alloc_consistent() or dma_alloc_coherent() to get memory for the transmission and reception descriptors. Which one is better, if either, and what is the difference between the two?
The difference is subtle but quite important.
pci_alloc_consistent() is the older function of the two and legacy drivers still use it.
Nowadays, pci_alloc_consistent() just calls dma_alloc_coherent().
The difference? The type of the allocated memory.
pci_alloc_consistent() - allocates memory of type GFP_ATOMIC. The allocation does not sleep, for use in e.g. interrupt handlers and bottom halves.
dma_alloc_coherent() - you specify yourself what type of memory to allocate. You should not use the high-priority GFP_ATOMIC memory unless you need it; in most cases you will be fine with GFP_KERNEL allocations.
The kernel 3.18 definition of pci_alloc_consistent() is very simple, namely:
static inline void *
pci_alloc_consistent(struct pci_dev *hwdev, size_t size,
                     dma_addr_t *dma_handle)
{
    return dma_alloc_coherent(hwdev == NULL ? NULL : &hwdev->dev,
                              size, dma_handle, GFP_ATOMIC);
}
In short, use dma_alloc_coherent().
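As a hedged sketch of its use in a driver's probe path (the descriptor struct and function names here are made up for illustration):

#include <linux/pci.h>
#include <linux/dma-mapping.h>

struct my_desc { u64 addr; u32 len; u32 flags; };  /* hypothetical descriptor */
#define RING_BYTES (256 * sizeof(struct my_desc))

struct my_priv {
    void       *ring_cpu;  /* CPU-visible address of the ring */
    dma_addr_t  ring_dma;  /* bus address to program into the device */
};

static int my_alloc_ring(struct pci_dev *pdev, struct my_priv *priv)
{
    /* Probe runs in process context, so GFP_KERNEL is fine and may sleep. */
    priv->ring_cpu = dma_alloc_coherent(&pdev->dev, RING_BYTES,
                                        &priv->ring_dma, GFP_KERNEL);
    if (!priv->ring_cpu)
        return -ENOMEM;
    return 0;
}

static void my_free_ring(struct pci_dev *pdev, struct my_priv *priv)
{
    dma_free_coherent(&pdev->dev, RING_BYTES,
                      priv->ring_cpu, priv->ring_dma);
}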
In order to reduce the transfer time from host to device for my application, I want to use pinned memory. NVIDIA's best practices guide proposes mapping buffers and writing the data using the following code:
cDataIn = (unsigned char *)clEnqueueMapBuffer(cqCommandQue, cmPinnedBufIn,
                                              CL_TRUE, CL_MAP_WRITE, 0,
                                              memSize, 0, NULL, NULL, NULL);
for (unsigned int i = 0; i < memSize; i++)
{
    cDataIn[i] = (unsigned char)(i & 0xff);
}
clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0,
                     memSize, cDataIn, 0, NULL, NULL);
Intel's optimization guide recommends using calls to clEnqueueMapBuffer and clEnqueueUnmapMemObject instead of calls to clEnqueueReadBuffer or clEnqueueWriteBuffer.
What is the right way to use pinned memory/mapped memory? Is it necessary to write the data using enqueueWriteBuffer or is enqueueMapBuffer sufficient?
Also, what is the difference between CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR?
This is an interesting topic that few people describe in detail. I will try to define exactly how it works.
Pinned memory refers to memory that, as well as being on the device, exists on the host, so that DMA transfers are possible between these two memories, increasing copy performance. That is why it needs CL_MEM_ALLOC_HOST_PTR in the buffer-creation flags.
On the other hand, CL_MEM_USE_HOST_PTR takes a host pointer at buffer creation; it is unclear from the spec whether this can be pinned memory. But generally speaking, memory created this way should NOT be pinned, since the host pointer was not reserved by the OpenCL API and it is not clear where it resides in memory.
Regarding the Map/Read question: both are OK, and they will give the same performance. The difference between the two techniques is:
For Map/Unmap: you need to map before writing/reading and unmap afterwards. That way you ensure the consistency of the data. These are API calls, and they take time to complete as well as being asynchronous. The good thing is that you don't need to keep hold of anything other than the buffer object.
For Map+Read/Write: at the creation of the memory zone you do a Map and save the pointer value. Then, at the destruction of the buffer, you first Unmap and then destroy it. You need to hold the buffer and the mapped pointer all along. The good thing is that you can now just clEnqueueRead/Write to that mapped pointer. The API will wait for the pinned data to be consistent and then consider it done. It is easier to use, since it is like doing a map+unmap in one shot.
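As a minimal sketch of the Map/Unmap technique (assuming an existing context and queue, and reusing memSize from the question; error handling omitted):

cl_int err;
/* Ask the implementation to back the buffer with (ideally pinned) host memory. */
cl_mem pinned = clCreateBuffer(context,
                               CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                               memSize, NULL, &err);

/* Map, fill from the host, then unmap so the device sees consistent data. */
unsigned char *p = (unsigned char *)clEnqueueMapBuffer(
        queue, pinned, CL_TRUE, CL_MAP_WRITE, 0, memSize,
        0, NULL, NULL, &err);
for (size_t i = 0; i < memSize; ++i)
    p[i] = (unsigned char)(i & 0xff);
clEnqueueUnmapMemObject(queue, pinned, p, 0, NULL, NULL);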
The Read/Write mode is easier to use, especially for repeated reads, but it is not as versatile as the manual map option, since you CAN'T write to a read-only map, nor read from a write-only map. But for general use, variables that are read will never be written, and vice versa.
My understanding is that Intel's recommendation means "use Map, not plain Read/Write", rather than "when you use Map, don't use Read/Write on the mapped pointers".
Did you check this NVIDIA recommendation on Intel hardware? I think it should work; however, I don't know if the operation would be as optimal as on AMD or NVIDIA hardware.
I am working on a project that needs a lot of OpenCL code. I am using OpenCV's ocl module to develop my project faster but there are some functions not implemented and I will have to write my own OpenCL code.
My question is this: what is the quickest and cheapest way to transfer data from Mat and/or oclMat to a cl_mem array. Re-wording this, is there a good way to transfer or enqueue (clEnqueueWriteBuffer) data from oclMat or Mat?
Currently, I am using a for-loop to read data from Mat (or download from oclMat and then use for-loops) and then enqueuing it. This is turning out to be costly, hence my question.
Thanks to anyone who sees this question :)
I've written a set of interop functions for the Boost.Compute library which ease the use of OpenCL and OpenCV. Take a look at the opencv_copy_mat_to_buffer() function.
There are also functions for copying from an OpenCL buffer back to a host cv::Mat and for copying a cv::Mat to an OpenCL image2d object.
Calculate the memory bandwidth achieved on the host-device interconnect.
If you get ~60% or more of the maximal bandwidth, there is nothing to do; the memory transfer is as fast as it can be. But if your bandwidth results are lower than 55-60% of the theoretical maximum, try using multiple command queues with non-blocking operations (don't forget to sync at the end), as sketched below. Also, pay attention to the average image size: small data transfers usually have a high overhead ratio.
If your device uses shared memory, use memory mapping instead of read/write; this may save a dramatic amount of time. If the device has its own memory, apply the pinned-memory technique, which is well described in the NVIDIA OpenCL Best Practices Guide.
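For instance, splitting one large transfer across two queues with non-blocking writes might look like this sketch (assuming both queues target the same device; buf0/buf1, the per-half byte count half, and the unsigned char pointer src are illustrative):

/* Issue each half as a non-blocking write on its own queue... */
clEnqueueWriteBuffer(queue0, buf0, CL_FALSE, 0, half, src,        0, NULL, NULL);
clEnqueueWriteBuffer(queue1, buf1, CL_FALSE, 0, half, src + half, 0, NULL, NULL);
/* ...and don't forget to sync at the end. */
clFinish(queue0);
clFinish(queue1);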
The documentation of oclMat shows that it exposes the underlying OpenCL memory object:
//! pointer to the data(OCL memory object)
uchar *data;
If you already have the oclMat on the device, you can simply perform a buffer copy from oclMat.data to your cl_mem buffer. But you will have to hack the memory a little, accessing some private members of the oclMat.
Something like:
clEnqueueCopyBuffer(command_queue, (cl_mem)oclMat.data, dst_buffer, 0, 0, size, 0, NULL, NULL);
NOTE: Take care with the cast; you may have to cast a different pointer.
Regarding your comment, that's right: the oclMat can be used as a cl_mem (void *) on the device, since it was allocated by the OpenCL device.
Additionally, you can create SVM memory first (for example, void *svmdata) and then wrap a Mat around it: Mat A(rows, cols, CV_32FC1, svmdata).
Now you can process the Mat A between host and device without any memory copy.
(PS: SVM memory is a new feature of OpenCL 2.0; it can be created with clSVMAlloc.)
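A sketch of that SVM approach (requires an OpenCL 2.0 context; sizes and names are illustrative):

/* Allocate coarse-grained SVM and wrap a Mat header around it (no copy). */
void *svmdata = clSVMAlloc(context, CL_MEM_READ_WRITE,
                           rows * cols * sizeof(float), 0);
cv::Mat A(rows, cols, CV_32FC1, svmdata);

/* The device side sees the same memory via the kernel argument. */
clSetKernelArgSVMPointer(kernel, 0, svmdata);

/* ... later, after all use: clSVMFree(context, svmdata); */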
I'm using a Tesla, and for the first time, I'm running low on CPU memory instead of GPU memory! Hence, I thought I could cut the size of my host memory by switching all integers to short (all my values are below 255).
However, I want my device memory to use integers, since the memory access is faster. So is there a way to copy my host memory (in short) to my device global memory (in int)? I guess this won't work:
short *buf_h = new short[100];
int *buf_d = NULL;
cudaMalloc((void **)&buf_d, 100*sizeof(int));
cudaMemcpy( buf_d, buf_h, 100*sizeof(short), cudaMemcpyHostToDevice );
Any ideas? Thanks!
There isn't really a way to do what you are asking directly. The CUDA API doesn't support "smart copying" with padding or alignment, or "deep copying" of nested pointers, or anything like that. Memory transfers require linear host and device memory, and alignment must be the same between source and destination memory.
Having said that, one approach to circumvent this restriction would be to copy the host short data to an allocation of short2 on the device. Your device code can retrieve a short2 containing two packed shorts, extract the value it needs, and then cast that value to int. This will give the code 32-bit memory transactions per thread, allowing for memory coalescing and (if you are using Fermi GPUs) good L1 cache hit rates, because adjacent threads within a block would be reading the same 32-bit word. On non-Fermi GPUs, you could probably use a shared memory scheme to efficiently retrieve all the values for a block using coalesced reads.
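A sketch of the short2 idea (illustrative kernel; assumes an even element count, so the question's 100 shorts become 50 short2 loads):

/* Each thread performs one coalesced 32-bit load, then widens both
   packed shorts to int on the fly. */
__global__ void widen(const short2 *in, int *out, int n2)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        short2 s = in[i];
        out[2 * i]     = (int)s.x;
        out[2 * i + 1] = (int)s.y;
    }
}

/* Host side: copy the short array as-is, then launch, e.g.:
   cudaMemcpy(d_shorts, buf_h, 100 * sizeof(short), cudaMemcpyHostToDevice);
   widen<<<blocks, threads>>>((const short2 *)d_shorts, buf_d, 50); */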