Memory transfer between two devices in OpenCL

I want to develop an application with OpenCL that runs on multiple GPUs. At some point, data from one GPU needs to be transferred to another one. Is there any way to avoid transferring it through the host? In CUDA this can be done via the cudaMemcpyPeerAsync function. Is there a similar function in OpenCL?

In OpenCL, a context is treated as a memory space. So if you have multiple devices associated with the same context, and you create a command queue per device, you can potentially access the same buffer object from multiple devices.
When you access a memory object from a specific device, the memory object first needs to be migrated to the device so it can physically access it. Migration can be done explicitly using clEnqueueMigrateMemObjects.
So a simple producer-consumer sequence with two devices can be implemented like so:
Command queue on device 1:
  1. Migrate buffer1 to device 1.
  2. Enqueue the kernels that process this buffer.
  3. Save the last event associated with the processing of buffer1.
Command queue on device 2:
  1. Migrate buffer1 to device 2, using the event produced by queue 1 to synchronize the migration.
  2. Enqueue the kernels that process this buffer.
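In code, the same sequence might look roughly like the sketch below. This is only a sketch: the context, the per-device queues, the buffer and the produce/consume kernels are assumed to exist already, and error checking is omitted.

    #include <CL/cl.h>

    /* Producer on device 1 (queue1), consumer on device 2 (queue2). */
    void produce_then_consume(cl_command_queue queue1, cl_command_queue queue2,
                              cl_mem buf, cl_kernel produce, cl_kernel consume,
                              size_t global_size)
    {
        cl_event produced, migrated;

        /* Device 1: make sure buf is resident there, then write into it. */
        clEnqueueMigrateMemObjects(queue1, 1, &buf, 0 /* to queue1's device */,
                                   0, NULL, NULL);
        clSetKernelArg(produce, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(queue1, produce, 1, NULL, &global_size, NULL,
                               0, NULL, &produced);

        /* Device 2: migrate buf only after device 1 has finished writing it. */
        clEnqueueMigrateMemObjects(queue2, 1, &buf, 0 /* to queue2's device */,
                                   1, &produced, &migrated);
        clSetKernelArg(consume, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(queue2, consume, 1, NULL, &global_size, NULL,
                               1, &migrated, NULL);

        clFinish(queue2);
        clReleaseEvent(produced);
        clReleaseEvent(migrated);
    }

The event returned by the producer's clEnqueueNDRangeKernel is what orders the second migration; without it the runtime would be free to move the buffer before device 1 has finished writing it.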
Exactly how the migration happens under the hood I cannot say, but I assume it is either a DMA transfer from device 1 to device 2 or (more likely) a DMA from device 1 to the host followed by another DMA from the host to device 2.
If you wish to avoid the limitation of using a single context, or want to ensure the data transfer is efficient, then you are at the mercy of vendor-specific extensions.
For example, AMD offers its DirectGMA technology, which allows explicit remote DMA between a GPU and any other PCIe device (including other GPUs). In my experience it works very well.

Related

How do DMA and PCIe play together?

In a PCIe configuration, devices have dedicated addresses and send data to each other in peer-to-peer mode - every device can write whenever it wants, and the switches take care of forwarding the data correctly. There is no need for a "bus master" that decides when and how data is transmitted.
How does DMA come into play in such a configuration? To me it seems that DMA is an outdated feature that is not needed in a PCIe configuration. Every device can send data to main memory, or read from it - obviously main memory will always be the "slave" in such operations.
Or is there some other functionality of DMA that I am missing?
Thank you in advance!
When a device other than a CPU accesses memory that is attached to a CPU, this is called direct memory access (DMA). So any PCIe read or write requests issued from PCIe devices constitute DMA operations. This can be extended with 'device to device' or 'peer to peer' DMA where devices perform reads and writes against each other without involving the CPU or system memory.
There are two main advantages of DMA: First, DMA operations can move data into and out of memory with minimal CPU load, improving software efficiency. Second, the CPU can only issue reads and writes of its word size, which results in very poor throughput over the PCIe bus due to TLP headers and other protocol overheads. Devices issuing read and write requests directly can use much larger payloads, resulting in higher throughput and more efficient use of the bus bandwidth.
So, DMA is absolutely not obsolete or outdated - basically all high-performance devices connected over PCIe will use DMA to use the bus efficiently.

Maximum data a GPU can take?

I have a large dataset, say 5 GB, and I am doing stream-wise processing on the data. I need to figure out how much data I can send to the GPU at a time for processing, so that I can utilize the GPU memory to the fullest.
Also, if my RAM is not sufficient to hold/process 5 GB of data, what is the workaround for this?
A pipelined application might use 3 buffers on the GPU. One buffer is used to hold the data currently being transferred to the GPU (from the host), one buffer to hold the data currently being processed by the GPU, and one buffer to hold the data (results) currently being transferred from the GPU (to the host).
This implies that your application processing can be broken into "chunks". This is true for many applications that work on large data sets.
CUDA streams enable the developer to write code that allows these 3 operations (transfer to, process, transfer from) to run simultaneously.
There is no specific number that defines the size of the buffers in the above scenario. Certainly, a straightforward implementation would create 3 buffers, each of which is smaller than 1/3 of the total memory on the GPU, leaving some memory left over for overhead and other data that may need to live in GPU memory. So if your GPU has 5GB, you might be able to run with three 1GB buffers. But there is no tool like deviceQuery that will tell you this; it is not a property of the device.
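As a rough sketch of such a pipeline (not a complete program: process_chunk is a placeholder kernel, the chunk size is arbitrary, and error checking is omitted):

    #include <cuda_runtime.h>

    #define NSTREAMS 3

    /* Placeholder kernel standing in for the real per-chunk processing. */
    __global__ void process_chunk(const float *in, float *out, size_t n)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * 2.0f;
    }

    /* While chunk i is being processed, chunk i+1 can be uploading and the
     * results of chunk i-1 can be downloading, each on its own stream. */
    void run_pipeline(const float *h_in, float *h_out, size_t total, size_t chunk)
    {
        cudaStream_t stream[NSTREAMS];
        float *d_in[NSTREAMS], *d_out[NSTREAMS];

        for (int s = 0; s < NSTREAMS; ++s) {
            cudaStreamCreate(&stream[s]);
            cudaMalloc((void **)&d_in[s],  chunk * sizeof(float));
            cudaMalloc((void **)&d_out[s], chunk * sizeof(float));
        }

        for (size_t off = 0, i = 0; off < total; off += chunk, ++i) {
            int s = i % NSTREAMS;          /* round-robin over the buffers */
            size_t n = (total - off < chunk) ? (total - off) : chunk;
            unsigned blocks = (unsigned)((n + 255) / 256);

            /* h_in/h_out should be pinned (cudaHostAlloc/cudaHostRegister)
             * for the copies to actually overlap with kernel execution. */
            cudaMemcpyAsync(d_in[s], h_in + off, n * sizeof(float),
                            cudaMemcpyHostToDevice, stream[s]);
            process_chunk<<<blocks, 256, 0, stream[s]>>>(d_in[s], d_out[s], n);
            cudaMemcpyAsync(h_out + off, d_out[s], n * sizeof(float),
                            cudaMemcpyDeviceToHost, stream[s]);
        }
        cudaDeviceSynchronize();

        for (int s = 0; s < NSTREAMS; ++s) {
            cudaFree(d_in[s]);
            cudaFree(d_out[s]);
            cudaStreamDestroy(stream[s]);
        }
    }

Because operations within a single stream execute in order, reusing a stream's buffers for a later chunk is safe: the new upload cannot start until the previous download on that stream has finished.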
You may want to read the CUDA Programming Guide section on asynchronous concurrent execution carefully, as well as review the CUDA simpleStreams sample code.

Consistency Rules for cudaHostAllocMapped

Does anyone know of documentation on the memory consistency guarantees for a memory region allocated with cudaHostAlloc(..., cudaHostAllocMapped)? For instance, it would be useful to know when writes from the device become visible to reads from the host (it could be after the kernel completes, at the earliest possible time during kernel execution, etc.).
Writes from the device are guaranteed to be visible on the host (or on peer devices) after the performing thread has executed a __threadfence_system() call (which is only available on compute capability 2.0 or higher).
They are also visible after the kernel has finished, i.e. after a cudaDeviceSynchronize() or after one of the other synchronization methods listed in the "Explicit Synchronization" section of the Programming Guide has been successfully completed.
Mapped memory should never be modified from the host while a kernel using it is or could be running, as CUDA currently does not provide any way of synchronization in that direction.
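For illustration, here is a hedged sketch of the fence-based direction. The flag/result names and the polling pattern are illustrative, not something mandated by the documentation, and error checking is omitted.

    #include <cuda_runtime.h>
    #include <stdio.h>

    /* The kernel writes a result into mapped host memory, issues
     * __threadfence_system() so the write is visible to the host, and only
     * then raises a flag that the host is polling. */
    __global__ void produce(volatile int *result, volatile int *flag)
    {
        *result = 42;              /* payload written to mapped host memory */
        __threadfence_system();    /* make the payload visible first        */
        *flag = 1;                 /* then publish it                       */
    }

    int main(void)
    {
        int *h_result, *h_flag, *d_result, *d_flag;

        cudaSetDeviceFlags(cudaDeviceMapHost);
        cudaHostAlloc((void **)&h_result, sizeof(int), cudaHostAllocMapped);
        cudaHostAlloc((void **)&h_flag,   sizeof(int), cudaHostAllocMapped);
        *h_result = 0;
        *h_flag = 0;
        cudaHostGetDevicePointer((void **)&d_result, h_result, 0);
        cudaHostGetDevicePointer((void **)&d_flag,   h_flag,   0);

        produce<<<1, 1>>>(d_result, d_flag);

        /* The host can observe the result while the kernel is still live.
         * (On Windows/WDDM a flush, e.g. cudaStreamQuery(0), may be needed
         * to ensure the launch is submitted before spinning.) */
        while (*(volatile int *)h_flag == 0)
            ;                      /* spin until the device publishes */
        printf("result = %d\n", *(volatile int *)h_result);

        cudaDeviceSynchronize();
        cudaFreeHost(h_result);
        cudaFreeHost(h_flag);
        return 0;
    }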

Inter-thread data transfer - Linux

My program has two threads created from the main thread. Each thread operates on a separate external communication device connected to it.
main thread
  ├─ thread_1
  └─ thread_2
Thread_1 receives data packets from its external device. Each data packet is a structure of 20 bytes.
Now I want thread_2 to read the data received by thread_1 and transfer it to the device connected to thread_2.
How can I transfer data between my two threads?
What exactly are the Linux types/primitives to use in this case?
Your problem is a classic example of the Producer Consumer Problem.
There are a number of possible ways to implement this depending on the context - your post is tagged with both pthreads and linux-device-drivers. Is this kernel-space, user-space, or kernel-space -> user-space?
Kernel-space
A solution is likely to involve a ring buffer (if you anticipate that multiple messages between threads can be in flight at once) and a semaphore.
Chapter 5 of Linux Device Drivers 3rd Edition would be a good place to start.
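A very rough kernel-space sketch along those lines uses a typed kfifo of 20-byte packets plus a counting semaphore. The names and sizes are illustrative, a real driver would call pkt_queue_init() from its init path and decide what to do when the FIFO is full:

    #include <linux/kfifo.h>
    #include <linux/semaphore.h>
    #include <linux/spinlock.h>
    #include <linux/errno.h>
    #include <linux/types.h>

    struct packet { u8 bytes[20]; };

    static DEFINE_KFIFO(pkt_fifo, struct packet, 64);
    static struct semaphore pkt_avail;          /* counts queued packets */
    static DEFINE_SPINLOCK(pkt_lock);

    static void pkt_queue_init(void)
    {
        sema_init(&pkt_avail, 0);               /* nothing available yet */
    }

    /* Producer side (thread_1): queue a packet and wake the consumer. */
    static void push_packet(const struct packet *p)
    {
        unsigned long flags;
        unsigned int queued;

        spin_lock_irqsave(&pkt_lock, flags);
        queued = kfifo_in(&pkt_fifo, p, 1);     /* returns 0 if the FIFO is full */
        spin_unlock_irqrestore(&pkt_lock, flags);
        if (queued)
            up(&pkt_avail);
    }

    /* Consumer side (thread_2): sleep until a packet is available. */
    static int pop_packet(struct packet *p)
    {
        unsigned long flags;

        if (down_interruptible(&pkt_avail))
            return -ERESTARTSYS;
        spin_lock_irqsave(&pkt_lock, flags);
        kfifo_out(&pkt_fifo, p, 1);
        spin_unlock_irqrestore(&pkt_lock, flags);
        return 0;
    }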
User-space
If both threads are in user-space, the producer-consumer pattern within the same process is usually implemented with a pthread condition variable. A worked example of how to do it is here.
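A minimal user-space sketch of that pattern, with a mutex, two condition variables and a small ring buffer of 20-byte packets (packet_t and RING_SIZE are illustrative names):

    #include <pthread.h>

    #define RING_SIZE 64

    typedef struct { char bytes[20]; } packet_t;   /* the 20-byte packet */

    static packet_t ring[RING_SIZE];
    static int head = 0, tail = 0, count = 0;
    static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

    /* Called by thread_1 after reading a packet from its device. */
    void put_packet(const packet_t *p)
    {
        pthread_mutex_lock(&lock);
        while (count == RING_SIZE)
            pthread_cond_wait(&not_full, &lock);   /* wait for free space */
        ring[head] = *p;
        head = (head + 1) % RING_SIZE;
        count++;
        pthread_cond_signal(&not_empty);           /* wake the consumer */
        pthread_mutex_unlock(&lock);
    }

    /* Called by thread_2 before writing to its device. */
    void get_packet(packet_t *p)
    {
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&not_empty, &lock);  /* wait for data */
        *p = ring[tail];
        tail = (tail + 1) % RING_SIZE;
        count--;
        pthread_cond_signal(&not_full);            /* wake the producer */
        pthread_mutex_unlock(&lock);
    }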
Kernel-space -> User-space
The general approach used in Linux is for the user-space thread (thread_2) to block on a file system object signalled by the kernel-space thread (thread_1). Typically the file system object in question is in /dev or /sys. LDD3 has examples of both approaches.

How to mitigate host + device memory transfer bottlenecks in OpenCL/CUDA

If my algorithm is bottlenecked by host to device and device to host memory transfers, is the only solution a different or revised algorithm?
There are a couple things you can try to mitigate the PCIe bottleneck:
Asynchronous transfers - permits overlapping computation and bulk transfer
Mapped memory - allows a kernel to stream data to/from the GPU during execution
Note that neither of these techniques makes the transfer itself go faster; they just reduce the time the GPU spends waiting for the data to arrive.
With the cudaMemcpyAsync API function you can initiate a transfer, launch one or more kernels that do not depend on the result of the transfer, synchronize the host and device, and then launch kernels that were waiting on the transfer to complete. If you can structure your algorithm such that you're doing productive work while the transfer is taking place, then asynchronous copies are a good solution.
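Sketched out, that pattern might look like the following (placeholder kernels, a pinned host buffer is assumed, error checking is omitted):

    #include <cuda_runtime.h>

    /* Stand-ins for real kernels. */
    __global__ void independent_work(void) { }
    __global__ void dependent_work(const float *data, size_t n)
    {
        size_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) { float x = data[i]; (void)x; }
    }

    void overlap_transfer(const float *h_data /* pinned */, float *d_data, size_t n)
    {
        cudaStream_t copy_stream, compute_stream;
        cudaStreamCreate(&copy_stream);
        cudaStreamCreate(&compute_stream);

        /* Start the transfer, then immediately launch work that does not
         * depend on it. */
        cudaMemcpyAsync(d_data, h_data, n * sizeof(float),
                        cudaMemcpyHostToDevice, copy_stream);
        independent_work<<<128, 256, 0, compute_stream>>>();

        /* Wait for the copy, then launch the work that needs the data. */
        cudaStreamSynchronize(copy_stream);
        dependent_work<<<128, 256, 0, compute_stream>>>(d_data, n);

        cudaDeviceSynchronize();
        cudaStreamDestroy(copy_stream);
        cudaStreamDestroy(compute_stream);
    }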
With the cudaHostAlloc API function you can allocate host memory that can be read and written directly from the GPU. The reason this is faster is that a block that needs host data only needs to wait for a small portion of that data to be transferred. In contrast, the usual approach makes all blocks wait until the entire transfer is complete. Mapped memory essentially breaks a big monolithic transfer into a bunch of smaller copy operations, so the latency is reduced.
You can read more about these topics in Sections 3.2.6-3.2.7 of the CUDA Programming Guide and Section 3.1 of the CUDA Best Practices Guide. Chapter 3 of the OpenCL Best Practices Guide explains how to use these features in OpenCL.
You really need to do the math to be certain that you're going to be doing enough processing on the GPU to make it worthwhile transferring data between host and GPU. Ideally you do this at the design stage, before doing any coding, since it can be a deal-breaker.
