Using pthread_yield to return control over the CPU back to Kernel - pthreads

Consider the following scenario:
in a POSIX system, some thread from a user program is running and timer_interrupt has been disabled.
to my understanding, unless it terminates - the currently running thread won't willingly give away control over the CPU.
My question is as follows: will calling pthread_yield() from within the thread give the kernel control over the CPU?
any kind of help with this question would be greatly appreciated.

Turning off the operating system's timer interrupt would change it into a cooperative multitasking system. This is how Windows 1,2,3 and Mac OS 9 worked. The running task only changed when the program made a system call.
Since pthread_yield results in a system call, yes, the kernel would get control back from the program.
If you are writing programs on a cooperative multitasking system, it is very important to not hog the CPU. If your program does hog the CPU the entire system comes to a halt.
This is why Windows MFC has the idle message in its message loop. Programs doing long term tasks would do it in that message handler by operating on one or two items and then returning to the operating system to check if the user had clicked on something.

It can easily relinquish control by issuing system calls that perform blocking inter-thread comms or requesting I/O. The timer interrupt is not absolutely required for a multithreaded OS, though it is very useful for providing timeouts for such system calls and helping out if the system is overloaded with ready threads.

Related

The reasons why rps procedure use spinlock with local_irq_disable

These days, I'm studying kernel internal network code, especially RPS code. You know, there are a lot of functions about that. But I am focusing on some functions about SMP queue processing such as enqueue_to_backlog and process_backlog.
I wonder about synchronization btw two cores(or single core) by using two functions -enqueue_to_backlog and process_backlog-.
In that functions, A core(A) holds a spin_lock of the other core(B) for queueing packets into input_pkt_queue and scheduling napi of the core(B). And A Core(B) also holds a spin_lock for splicing input_pkt_queue to process_queue of the core(B) and removing napi schedule by itself. I know that spin_lock should be held to prevent two core from accessing the same queue each other during processing queue.
But I can't understand why spin_lock is called with local_irq_disable(or local_irq_save). I think that there is no accessing the queues or rps_lock of the core(B) by Interrupts Context(TH), when interrupts(TH) preempt current context(softirq, BH). - Of course, napi struct can be accessed for scheduling napi by TH, but it holds disabling irq until queueing packet- So I wonder about why spin_lock is called with irq disable.
I think it is impossible to preempt current context(napi, softirq) by other BH such as tasklet. Is it true? And I want to know whether local_irq_disable disable all cores irq or just current core's irq literally? Actually, I read a book about kernel development, but I think i don't understand preemption enough.
Would explain the reasons why rps procedure use spin_lock with local_irq_disable?
Disabling interrupts affects the current core (only). When disabled, therefore, no other code on the same core will be able to interfere with an update to a data structure. The point of spinlocks is to extend the "lock-out" to other cores (although it's cooperative, not hardware-enforced).
It's dangerous/irresponsible to take a spin lock in the kernel without disabling interrupts because, when an interrupt then occurs, the current code will be suspended, and now you are preventing other cores from making progress while some unrelated interrupt handler is running (even if another user process or tasklet on the original core won't be able to preempt). Other cores might be in an interrupt or BH context themselves and now you're delaying the entire system. Spin locks are supposed to be held for very brief periods to do critical updates to shared data structures.
It's also a good way to generate deadlocks. Consider if the scenario above were replicated in another subsystem (or possibly another device in the same subsystem, but I'll describe the former).
Here, core A takes a spinlock in subsystem 1 without disabling interrupts. At the same time, core B takes a spinlock in subsystem 2 also without disabling interrupts. Now what happens if an interrupt related to subsystem 2 happens on core A, and while executing the subsystem 2 interrupt handler, core A needs to update a structure protected by the spinlock held in core B. But at about the same time, a subsystem 1 interrupt happens on core B, which needs to update a data structure in that subsystem. Now both cores are busy-waiting for a spinlock held by the other core, and the entire system is frozen until you do a hard reset.

How do modern OS's achieve idempotent cleanup functions for process deaths?

Let's say that I have an OS that implements malloc by storing a list of segments that the process points to in a process control block. I grab my memory from a free list and give it to the process.
If that process dies, I simply remove the reference to the segment from the process control block, and move the segment back to my free list.
Is it possible to create an idempotent function that does this process cleanup? How is it possible to create a function such that it can be called again, regardless of whether it was called many times before or if previous calls died in the middle of executing the cleanup function? It seems to me that you can't execute two move commands atomically.
How do modern OS's implement the magic involved in culling memory from processes that randomly die? How do they implement it so that it's okay for even the process performing the cull to randomly die, or is this a false assumption that I made?
I'll assume your question boils down to how the OS culls a process's memory if that process crashes.
Although I'm self educated in these matters, I'll give you two ways an OS can make sure any memory used by a process is reclaimed if the process crashes.
In a typical modern CPU and modern OS with virtual memory:
You have two layers of allocation. Whenever the process calls malloc, malloc tries to satisfy the request from already available memory pages the kernel gave the process. If not enough pages are available, malloc asks the kernel to allocate more pages.
In this case, whenever a process crashes or even if it exits normally, the kernel doesn't care what malloc did, or what memory the process forgot to release. It only needs to free all the pages it gave the process.
In a simpler OS that doesn't care much about performance, memory fragmentation or virtual memory and maybe not even about memory protection:
Malloc/free is implemented completely on the kernel side (e.g: system calls). Whenever a process calls malloc/free, the kernel does all the work, and therefore knows about all the memory that needs to be freed. Once the process crashes or exits, the kernel can cleanup. Since the kernel is never supposed to crash, and keep a record of all the allocated memory per process, it's trivial.
Like I said, I'm self educated, and I didn't check how for example Linux or Windows implement it.

Consistency Rules for cudaHostAllocMapped

Does anyone know of documentation on the memory consistency model guarantees for a memory region allocated with cudaHostAlloc(..., cudaHostAllocMapped)? For instance, when writes from the device become visible to reads from the host would be useful (could be after the kernel completes, at earliest possible time during kernel execution, etc).
Writes from the device are guaranteed to be visible on the host (or on peer devices) after the performing thread has executed a __threadfence_system() call (which is only available on compute capability 2.0 or higher).
They are also visible after the kernel has finished, i.e. after a cudaDeviceSynchronize() or after one of the other synchronization methods listed in the "Explicit Synchronization" section of the Programming Guide has been successfully completed.
Mapped memory should never be modified from the host while a kernel using it is or could be running, as CUDA currently does not provide any way of synchronization in that direction.

DSP on Beaglebone

I have a Beaglebone running Ubuntu. We want to continuously sample from 3 on-board ATD converters at 100KS/s, and every window of samples we will run a cross correlation DSP algorithm. Once we find a correlation value above a threshold, we will send the value to a PC.
My concern is the process scheduling in Ubuntu. If our process gets swapped out and an ATD sample becomes available during this time, the process will miss the sample. We need to ensure that our process will capture every sample and save it in memory.
With this being said, is there a way to trigger interrupts on the Beaglebone so that if an ATD sample is ready, the sample will be saved in the memory of our program even if the program does not have the processor at the time?
Thanks!
You might be able to trigger the EDMA or use the PRUSS. Probably best to ask on beagleboard#googlegroups.com. There isn't a DSP per-se on the BeagleBone.
This is not exactly an answer to your question, but hopefully it explains how the process works. Since you didn't mention what hardware you are running for AD conversion, maybe this is the best that can be done:
With audio hardware, which faces the same problem, the solution comes from the hardware and the drivers working together: whenever the hardware has filled up enough of the buffer it signals the driver (via an interrupt or some similar mechanism). In some cases, it's also possible that the driver polls the hardware or something like that, but that's a less efficient solution, and I'm not sure anyone does it that way anymore (maybe on cheaper hardware?). From there, the driver process may call right into the end-user process, or it may simply mark the relevant end-user process as "runnable". Either way, control needs to be transferred to the end user process.
For that to happen, the end user process must be running at a higher priority than anything else occupying the CPUs at that moment. To guarantee that your process will always be first in the queue, you can run it at a high priority, with the appropriate permissions, you can even run in very high priorities.
The time it takes for the top priority process to go from runnable to running is sometimes called the "latency" of the OS, though I am sure there's a more specific technical term. The latency of Linux is on the order of 1 ms, but since it's not a "hard" real-time OS, this is not a guarantee. If this is too long to handle your chunks of data, you may have to buffer some of it in your driver.

Are cuda kernel calls synchronous or asynchronous

I read that one can use kernel launches to synchronize different blocks i.e., If i want all blocks to complete operation 1 before they go on to operation 2, I should place operation 1 in one kernel and operation 2 in another kernel. This way, I can achieve global synchronization between blocks. However, the cuda c programming guide mentions that kernel calls are asynchronous ie. the CPU does not wait for the first kernel call to finish and thus, the CPU can also call the second kernel before the 1st has finished. However, if this is true, then we cannot use kernel launches to synchronize blocks. Please let me know where i am going wrong
Kernel calls are asynchronous from the point of view of the CPU so if you call 2 kernels in succession the second one will be called without waiting for the first one to finish. It only means that the control returns to the CPU immediately.
On the GPU side, if you haven't specified different streams to execute the kernel they will be executed by the order they were called (if you don't specify a stream they both go to the default stream and are executed serially). Only after the first kernel is finished the second one will execute.
This behavior is valid for devices with compute capability 2.x which support concurrent kernel execution. On the other devices even though kernel calls are still asynchronous the kernel execution is always sequential.
Check the CUDA C programming guide on section 3.2.5 which every CUDA programmer should read.
The accepted answer is not always correct.
In most cases, kernel launch is asynchronous. But in the following case, it is synchronous. And they are easily ignored by people.
environment variable CUDA_LAUNCH_BLOCKING equals to 1.
using a profiler(nvprof), without enabling concurrent kernel profiling
memcpy that involve host memory which is not page-locked.
Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.
Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.
From the NVIDIA CUDA programming guide(http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution-host-device).
Concurrent kernel execution is supported since 2.0 CUDA capability version.
In addition, a return to the CPU code can be made earlier than all the warp kernel to have worked.
In this case, you can provide synchronization yourself.

Resources