Inter-thread data transfer - Linux - pthreads

My program has two threads created from the main thread. Each thread operates on a separate external communication device connected to it.
main thread
 +-- thread_1
 +-- thread_2
thread_1 receives data packets from its external device. Each data packet is a structure of 20 bytes.
Now I want thread_2 to read the data received by thread_1 and transfer it to the device connected to thread_2.
How can I transfer data between the two threads?
What are the exact names of the Linux variable types to use in this case?

Your problem is a classic example of the Producer Consumer Problem.
There are a number of possible ways to implement this depending on the context - your post is tagged with both pthreads and linux-device-drivers. Is this kernel space, user space, or kernel space -> user space?
Kernel Space
A solution is likely to involve a ring buffer (if you anticipate that multiple messages between threads can be in flight at once) and a semaphore.
Chapter 5 of Linux Device Drivers 3rd Edition would be a good place to start.
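For illustration only, here is one possible shape of that pattern using the kernel's kfifo and a counting semaphore. This is a sketch assuming a single producer and a single consumer; locking details and error handling are omitted, pkt mirrors the 20-byte packet from the question, and push_packet/pop_packet are made-up names:

    #include <linux/errno.h>
    #include <linux/kfifo.h>
    #include <linux/semaphore.h>

    struct pkt { unsigned char bytes[20]; };          /* 20-byte packet */

    static DEFINE_KFIFO(pkt_fifo, struct pkt, 64);    /* ring buffer, power-of-two slots */
    static struct semaphore pkt_avail;                /* sema_init(&pkt_avail, 0) at init time */

    /* Producer side: store the packet and wake the consumer. */
    void push_packet(const struct pkt *p)
    {
        kfifo_in(&pkt_fifo, p, 1);    /* silently drops the packet if the fifo is full */
        up(&pkt_avail);
    }

    /* Consumer side: sleep until a packet is available, then take it. */
    int pop_packet(struct pkt *out)
    {
        if (down_interruptible(&pkt_avail))
            return -EINTR;
        return kfifo_out(&pkt_fifo, out, 1) == 1 ? 0 : -EAGAIN;
    }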
User-space
If both threads are in user-space, the producer-consumer pattern in the same process is usually implemented with a pthread condition variable. A worked example of how to do it is here.
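In case that link goes stale, here is a minimal sketch of the pattern with a pthread mutex and condition variable, assuming the 20-byte packet structure from the question and a single-slot hand-off (producer_put and consumer_get are illustrative names):

    #include <pthread.h>

    typedef struct { unsigned char bytes[20]; } packet_t;   /* 20-byte packet */

    static packet_t slot;                    /* single-slot hand-off buffer */
    static int slot_full = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    /* Called by thread_1 after it reads a packet from its device. */
    void producer_put(const packet_t *p)
    {
        pthread_mutex_lock(&lock);
        while (slot_full)                    /* wait until thread_2 has drained the slot */
            pthread_cond_wait(&cond, &lock);
        slot = *p;
        slot_full = 1;
        pthread_cond_signal(&cond);          /* wake thread_2 */
        pthread_mutex_unlock(&lock);
    }

    /* Called by thread_2 before it writes the packet to its device. */
    void consumer_get(packet_t *out)
    {
        pthread_mutex_lock(&lock);
        while (!slot_full)                   /* wait until thread_1 has produced a packet */
            pthread_cond_wait(&cond, &lock);
        *out = slot;
        slot_full = 0;
        pthread_cond_signal(&cond);          /* wake thread_1 if it is waiting for space */
        pthread_mutex_unlock(&lock);
    }

Replacing the single slot with a small ring buffer lets several packets be in flight at once without changing the locking scheme.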
Kernel-space -> User-space
The general approach used in Linux is for the user-space thread thread_2 to block on a file system object signalled by the kernel-space thread thread_1. Typically the file system object in question is in /dev or /sys. LDD3 has examples of both approaches.

Related

Memory transfer between two devices in OpenCL

I want to develop an application with OpenCL to run on multiple GPUs. At some point, data from one GPU should be transferred to another one. Is there any way to avoid transferring it through the host? This can be done in CUDA via the cudaMemcpyPeerAsync function. Is there any similar function in OpenCL?
In OpenCL, a context is treated as a memory space. So if you have multiple devices associated with the same context, and you create a command queue per device, you can potentially access the same buffer object from multiple devices.
When you access a memory object from a specific device, the memory object first needs to be migrated to the device so it can physically access it. Migration can be done explicitly using clEnqueueMigrateMemObjects.
So a sequence of a simple producer-consumer with multiple devices can be implemented like so:
command queue on device 1:
migrate memory buffer1
enqueue kernels that process this buffer
save last event associated with buffer1 processing
command queue on device 2:
migrate memory buffer1 - use the event produced by queue 1 to sync the migration.
enqueue kernels that process this buffer
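A rough sketch of that sequence in host code, assuming queue1 and queue2 are command queues for device 1 and device 2 in the same context, buf1 is the shared buffer, and the kernels and work size are placeholders (error checking omitted):

    /* assumes <CL/cl.h> and that context, queues, kernels and buf1 already exist */
    size_t global_size = 1024;               /* illustrative work size */
    cl_event done_on_dev1, migrated_to_dev2;

    /* Device 1: pull the buffer in, process it, keep the last event. */
    clEnqueueMigrateMemObjects(queue1, 1, &buf1, 0, 0, NULL, NULL);
    clSetKernelArg(producer_kernel, 0, sizeof(cl_mem), &buf1);
    clEnqueueNDRangeKernel(queue1, producer_kernel, 1, NULL,
                           &global_size, NULL, 0, NULL, &done_on_dev1);

    /* Device 2: migrate the same buffer, synchronized on device 1's event. */
    clEnqueueMigrateMemObjects(queue2, 1, &buf1, 0, 1, &done_on_dev1,
                               &migrated_to_dev2);
    clSetKernelArg(consumer_kernel, 0, sizeof(cl_mem), &buf1);
    clEnqueueNDRangeKernel(queue2, consumer_kernel, 1, NULL,
                           &global_size, NULL, 1, &migrated_to_dev2, NULL);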
How exactly migration occurs under the hood I cannot tell, but I assume that it can either be DMA from device 1 to device 2 or (more likely) DMA from device 1 to host and then host to device 2.
If you wish to avoid the limitation of using a single context, or would like to ensure the data transfer is efficient, then you are at the mercy of vendor-specific extensions.
For example, AMD offers its DirectGMA technology, which allows explicit remote DMA between the GPU and any other PCIe device (including other GPUs). From experience it works very nicely.

The reasons why rps procedure use spinlock with local_irq_disable

These days I'm studying the kernel's internal networking code, especially the RPS code. There are a lot of functions involved, but I am focusing on those that handle SMP queue processing, such as enqueue_to_backlog and process_backlog.
I wonder about the synchronization between two cores (or on a single core) through these two functions, enqueue_to_backlog and process_backlog.
In those functions, a core (A) holds the spin_lock of another core (B) in order to queue packets into B's input_pkt_queue and schedule B's NAPI. Core B likewise holds the spin_lock in order to splice input_pkt_queue onto its own process_queue and clear its NAPI schedule. I understand that the spin_lock must be held so that the two cores cannot access the same queue while it is being processed.
But I can't understand why the spin_lock is taken with local_irq_disable (or local_irq_save). I think the queues and the rps_lock of core B are not accessed from interrupt context (top half) when an interrupt preempts the current context (softirq, bottom half). Of course, the napi struct can be accessed by the top half to schedule NAPI, but it keeps IRQs disabled until the packet has been queued. So I wonder why the spin_lock is taken with IRQs disabled.
I also think it is impossible for another bottom half, such as a tasklet, to preempt the current context (NAPI, softirq). Is that true? And does local_irq_disable literally disable IRQs on all cores, or just on the current core? I have read a book on kernel development, but I don't think I understand preemption well enough.
Could someone explain why the RPS code uses spin_lock with local_irq_disable?
Disabling interrupts affects the current core (only). When disabled, therefore, no other code on the same core will be able to interfere with an update to a data structure. The point of spinlocks is to extend the "lock-out" to other cores (although it's cooperative, not hardware-enforced).
It's dangerous/irresponsible to take a spin lock in the kernel without disabling interrupts because, when an interrupt then occurs, the current code will be suspended, and now you are preventing other cores from making progress while some unrelated interrupt handler is running (even if another user process or tasklet on the original core won't be able to preempt). Other cores might be in an interrupt or BH context themselves and now you're delaying the entire system. Spin locks are supposed to be held for very brief periods to do critical updates to shared data structures.
It's also a good way to generate deadlocks. Consider if the scenario above were replicated in another subsystem (or possibly another device in the same subsystem, but I'll describe the former).
Here, core A takes a spinlock in subsystem 1 without disabling interrupts. At the same time, core B takes a spinlock in subsystem 2 also without disabling interrupts. Now what happens if an interrupt related to subsystem 2 happens on core A, and while executing the subsystem 2 interrupt handler, core A needs to update a structure protected by the spinlock held in core B. But at about the same time, a subsystem 1 interrupt happens on core B, which needs to update a data structure in that subsystem. Now both cores are busy-waiting for a spinlock held by the other core, and the entire system is frozen until you do a hard reset.
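To avoid all of that, kernel code normally takes the lock and disables local interrupts in one step. A generic sketch of the usual pattern (not the actual RPS code) looks like this:

    #include <linux/spinlock.h>

    static DEFINE_SPINLOCK(queue_lock);

    void touch_shared_queue(void)
    {
        unsigned long flags;

        /* Disables IRQs on *this* core only, then takes the lock. */
        spin_lock_irqsave(&queue_lock, flags);

        /* ... brief critical section updating the shared queue ... */

        /* Releases the lock, then restores the previous local IRQ state. */
        spin_unlock_irqrestore(&queue_lock, flags);
    }

Because interrupts are off on the local core while the lock is held, the critical section cannot be suspended by an interrupt handler, so other cores spin only for the short time the update actually takes.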

stack management in CLR

I understand the basic concepts of stack and heap, but it would be great if anyone could resolve the following confusions:
Is there a single stack for the entire application process, or is a new stack created for each thread that starts?
Is there a single heap for the entire application process, or is a new heap created for each thread that starts?
If a stack is created for each thread, then how does the process manage the sequential flow of threads (and hence stacks)?
There is a separate stack for every thread. This is true not only for CLR, and not only for Windows, but pretty much for every OS or platform out there.
There is a single heap for every application domain. A single process may run several app domains at once. A single app domain may run several threads.
To be more precise, there are usually two heaps per domain: one regular and one for really large objects (like, say, a 64K array).
I don't understand what you mean by "sequential flow of threads".
One stack for each thread, all threads share the same heaps.
There is no 'sequential flow' of threads. A thread is an operating system object that stores a copy of the processor state. The processor state includes the register values. One of them is ESP, the stack pointer. Another really important one is EIP, the instruction pointer. When the operating system switches between threads, it stores the processor state in the current thread object and reloads the state from the thread object for the thread that was selected to run next. The processor now simply continues executing where it left off previously.
Getting a thread started is perhaps now easy to understand as well. The operating system allocates a megabyte of memory for the stack, initializes the ESP register to point to that memory, and sets the EIP register to the address of the method where the thread should start executing - the value of the ThreadStart delegate in C#.
Each thread must have its own stack; that's where local variables, parameters, and the return addresses of the calling functions are held.
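Although the question is about the CLR, the "one stack per thread" point is easy to observe in any native program as well. In this small C/pthreads sketch (illustrative only), each thread prints the address of one of its local variables; the addresses fall in different regions because every thread gets its own stack:

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread prints where one of its locals lives. */
    static void *report_stack(void *arg)
    {
        int local = 0;
        printf("thread %ld: local variable at %p\n", (long)arg, (void *)&local);
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, report_stack, (void *)1L);
        pthread_create(&t2, NULL, report_stack, (void *)2L);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }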

How to mitigate host + device memory tranfer bottlenecks in OpenCL/CUDA

If my algorithm is bottlenecked by host to device and device to host memory transfers, is the only solution a different or revised algorithm?
There are a couple things you can try to mitigate the PCIe bottleneck:
Asynchronous transfers - permits overlapping computation and bulk transfer
Mapped memory - allows a kernel to stream data to/from the GPU during execution
Note that neither of these techniques makes the transfer go faster, they just reduce the time the GPU is waiting on the data to arrive.
With the cudaMemcpyAsync API function you can initiate a transfer, launch one or more kernels that do not depend on the result of the transfer, synchronize the host and device, and then launch kernels that were waiting on the transfer to complete. If you can structure your algorithm such that you're doing productive work while the transfer is taking place, then asynchronous copies are a good solution.
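A minimal sketch of that flow with the CUDA runtime API (buffer names and the kernel launches are placeholders, error checking is omitted, and h_in must be pinned memory, e.g. from cudaMallocHost, for the copy to be truly asynchronous):

    cudaStream_t copy_stream;
    cudaStreamCreate(&copy_stream);

    /* 1. Start the host-to-device transfer; the call returns immediately. */
    cudaMemcpyAsync(d_in, h_in, nbytes, cudaMemcpyHostToDevice, copy_stream);

    /* 2. Launch kernels that do NOT depend on d_in here (on other streams),
     *    so the GPU does useful work while the copy is in flight. */

    /* 3. Either enqueue the kernels that consume d_in on copy_stream so they
     *    wait automatically, or synchronize before launching them elsewhere. */
    cudaStreamSynchronize(copy_stream);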
With the cudaHostAlloc API function you can allocate host memory that can be read and written directly from the GPU. The reason this is faster is that a block that needs host data only needs to wait for a small portion of the data to be transferred. In contrast, the usual approach makes all blocks wait until the entire transfer is complete. Mapped memory essentially breaks a big monolithic transfer into a bunch of smaller copy operations, so the latency is reduced.
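And a hedged sketch of mapped (zero-copy) memory with cudaHostAlloc (buffer names and sizes are placeholders, the kernel launch itself is omitted, and on some devices cudaSetDeviceFlags(cudaDeviceMapHost) must be called before any CUDA allocation):

    float *h_buf, *d_buf;

    /* Page-locked host memory that the GPU can access directly. */
    cudaHostAlloc((void **)&h_buf, nbytes, cudaHostAllocMapped);

    /* Device-side pointer that aliases the same host memory. */
    cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);

    /* Pass d_buf to a kernel: each block pulls the data it needs over PCIe
     * on demand instead of waiting for one big up-front transfer. */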
You can read more about these topics in Section 3.2.6-3.2.7 of the CUDA Programming Guide and Section 3.1 of the CUDA Best Practices Guide. Chapter 3 of the OpenCL Best Practices Guide explains how to use these features in OpenCL.
You really need to do the math to be certain that you're going to be doing enough processing on the GPU to make it worthwhile transferring data between host and GPU. Ideally you do this at the design stage, before doing any coding, since it can be a deal-breaker.

Regarding interrupt based communication

We have a simple architecture:
Main chip (arm9 based)
PIC controller
The PIC communicates with the ARM via an interrupt-based I2C communication protocol for the transfer of data. Inside the interrupt handler we signal a task which reads the data from the I2C layer (bus).
When the amount of data is small we usually have no problem reading it and sending it to the upper layer. When the data is very large, the interrupt will be tied up for a long time.
The first question is:
Am I right?
If I am right, how can this be avoided? Or is there a different solution?
Have some kind of 'worker thread', sometimes called a kernel thread, whose job it is to pull data out of the I2C interface and buffer it, hand it off to other parts of your system, etc. Use the interrupt routine only to un-block the kernel thread. That way, if there are other duties the system has to perform, it is not prevented from doing so by the interrupt handler, and you still get your data in from your device in a timely manner.
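A rough sketch of that split on Linux, using a completion to wake a kernel thread (i2c_drain_fifo is a placeholder for whatever actually reads your I2C hardware; requesting the IRQ and starting the kthread are omitted):

    #include <linux/completion.h>
    #include <linux/interrupt.h>
    #include <linux/kthread.h>

    static DECLARE_COMPLETION(data_ready);

    extern void i2c_drain_fifo(void);    /* placeholder for the real read routine */

    /* Top half: keep it short - just acknowledge the device and wake the worker. */
    static irqreturn_t i2c_irq_handler(int irq, void *dev_id)
    {
        complete(&data_ready);
        return IRQ_HANDLED;
    }

    /* Worker (kernel) thread: does the slow reading outside interrupt context. */
    static int i2c_worker(void *arg)
    {
        while (!kthread_should_stop()) {
            if (wait_for_completion_interruptible(&data_ready))
                continue;
            i2c_drain_fifo();    /* read the data and buffer it */
            /* hand the buffered data to the upper layer here */
        }
        return 0;
    }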
You shouldn't read a complete packet in one execution of the interrupt routine. Depending on the hardware support, you should handle one sample/bit/byte per interrupt, store the data in a buffer, and only signal the task when the packet is complete.
