Why the RPS code uses spin_lock with local_irq_disable - network-programming

These days I'm studying the kernel's internal networking code, especially the RPS code. There are a lot of functions involved, but I am focusing on the ones that do SMP queue processing, namely enqueue_to_backlog and process_backlog.
I wonder about the synchronization between two cores (or on a single core) through these two functions, enqueue_to_backlog and process_backlog.
In these functions, one core (A) holds the spinlock of another core (B) in order to queue packets into B's input_pkt_queue and to schedule B's NAPI. Core B likewise holds its own spinlock to splice input_pkt_queue onto its process_queue and to clear its own NAPI schedule. I understand that the spinlock must be held so that the two cores cannot access the same queue at the same time while it is being processed.
What I can't understand is why spin_lock is called together with local_irq_disable (or local_irq_save). I don't think interrupt context (the top half) ever touches core B's queues or rps_lock when an interrupt preempts the current context (softirq, bottom half). Of course, the napi struct can be accessed by the top half to schedule NAPI, but it keeps IRQs disabled until the packet has been queued. So I wonder why spin_lock is called with IRQs disabled.
I also think it is impossible for another bottom half, such as a tasklet, to preempt the current context (NAPI, softirq). Is that true? And does local_irq_disable disable interrupts on all cores, or literally just on the current core? I have read a book on kernel development, but I don't think I understand preemption well enough.
Could someone explain why the RPS code uses spin_lock with local_irq_disable?

Disabling interrupts affects the current core (only). When disabled, therefore, no other code on the same core will be able to interfere with an update to a data structure. The point of spinlocks is to extend the "lock-out" to other cores (although it's cooperative, not hardware-enforced).
It's dangerous/irresponsible to take a spin lock in the kernel without disabling interrupts because, when an interrupt then occurs, the current code is suspended, and you are now preventing other cores from making progress while some unrelated interrupt handler runs (even though another user process or tasklet on the original core won't be able to preempt). Other cores might themselves be in an interrupt or BH context, and now you're delaying the entire system. Spin locks are supposed to be held for very brief periods to do critical updates to shared data structures.
It's also a good way to generate deadlocks. Consider what happens if the scenario above were replicated in another subsystem (or possibly another device in the same subsystem, but I'll describe the former).
Here, core A takes a spinlock in subsystem 1 without disabling interrupts. At the same time, core B takes a spinlock in subsystem 2, also without disabling interrupts. Now suppose an interrupt related to subsystem 2 arrives on core A and, while executing the subsystem 2 interrupt handler, core A needs to update a structure protected by the spinlock held by core B. At about the same time, a subsystem 1 interrupt arrives on core B, and its handler needs to update a data structure protected by the lock held by core A. Now both cores are busy-waiting for a spinlock held by the other core, and the entire system is frozen until you do a hard reset.
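To make the safe idiom concrete, here is a minimal sketch of a per-CPU backlog-style queue in the spirit of what enqueue_to_backlog does. It is not the actual RPS code; the structure and function names (my_backlog, backlog_enqueue) are invented, but the locking pattern is the standard kernel one for data touched both from softirq/process context and from hardirq context, and it combines the two steps asked about (disable local IRQs, then take the lock):

    #include <linux/spinlock.h>
    #include <linux/skbuff.h>

    /* Illustrative only: a queue that may be touched from softirq context
     * on this CPU, from a hardirq handler on this CPU, and from other CPUs. */
    struct my_backlog {
        spinlock_t          lock;
        struct sk_buff_head queue;
    };

    static void backlog_enqueue(struct my_backlog *b, struct sk_buff *skb)
    {
        unsigned long flags;

        /* Disable interrupts on *this* core only, then take the lock that
         * other cores contend for.  With a plain spin_lock() a local
         * interrupt could fire while the lock is held; every other CPU
         * spinning on b->lock would then wait out that unrelated handler,
         * or deadlock if the handler itself needs a lock held elsewhere. */
        spin_lock_irqsave(&b->lock, flags);
        __skb_queue_tail(&b->queue, skb);
        spin_unlock_irqrestore(&b->lock, flags);
    }

Note that spin_lock_irqsave() only masks interrupts on the local core; it is the spinlock itself that keeps the other cores out.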

Related

How do modern OSes achieve idempotent cleanup functions for process deaths?

Let's say that I have an OS that implements malloc by storing a list of segments that the process points to in a process control block. I grab my memory from a free list and give it to the process.
If that process dies, I simply remove the reference to the segment from the process control block, and move the segment back to my free list.
Is it possible to create an idempotent function that does this process cleanup? How is it possible to create a function such that it can be called again, regardless of whether it was called many times before or if previous calls died in the middle of executing the cleanup function? It seems to me that you can't execute two move commands atomically.
How do modern OSes implement the magic involved in culling memory from processes that randomly die? How do they implement it so that it's okay even for the process performing the cull to die at an arbitrary point, or is that a false assumption on my part?
I'll assume your question boils down to how the OS culls a process's memory if that process crashes.
Although I'm self-educated in these matters, I'll give you two ways an OS can make sure any memory used by a process is reclaimed if the process crashes.
In a typical modern CPU and modern OS with virtual memory:
You have two layers of allocation. Whenever the process calls malloc, malloc tries to satisfy the request from already available memory pages the kernel gave the process. If not enough pages are available, malloc asks the kernel to allocate more pages.
In this case, whenever a process crashes or even if it exits normally, the kernel doesn't care what malloc did, or what memory the process forgot to release. It only needs to free all the pages it gave the process.
In a simpler OS that doesn't care much about performance, memory fragmentation or virtual memory and maybe not even about memory protection:
malloc/free is implemented completely on the kernel side (e.g., as system calls). Whenever a process calls malloc or free, the kernel does all the work and therefore knows about all the memory that needs to be freed. Once the process crashes or exits, the kernel can clean up. Since the kernel is never supposed to crash and keeps a record of all the memory allocated per process, this is trivial.
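As a rough sketch of that second approach, and of why per-entry cleanup is easy to make repeatable, imagine the kernel keeps a per-process list of segments and tears it down one entry at a time. The structures and names below (struct segment, struct proc, proc_reclaim_memory, free_list) are invented for illustration, not taken from any real kernel:

    #include <stddef.h>

    /* Hypothetical bookkeeping: each process records the segments it owns. */
    struct segment {
        struct segment *next;
        void *base;
        size_t len;
    };

    struct proc {
        struct segment *owned;  /* allocations handed to this process */
    };

    /* Global free list; a real kernel would protect it with a lock. */
    static struct segment *free_list;

    /* Each iteration first unlinks one segment from the process, then pushes
     * it onto the free list.  If the cleanup is interrupted and called again,
     * it simply resumes with whatever is still linked into p->owned; segments
     * already moved are never touched twice. */
    void proc_reclaim_memory(struct proc *p)
    {
        while (p->owned) {
            struct segment *s = p->owned;
            p->owned = s->next;     /* step 1: detach from the process */
            s->next = free_list;    /* step 2: hand back to the free list */
            free_list = s;
        }
    }

The only remaining gap is a failure between the two pointer updates, which would leak that single segment rather than corrupt anything; as the answer notes, the cleanup runs in the kernel, which is assumed not to die halfway through, so that case is covered by trust in the kernel rather than by strict idempotence.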
Like I said, I'm self-educated, and I didn't check how, for example, Linux or Windows implements this.

Using pthread_yield to return control over the CPU back to the kernel

Consider the following scenario:
In a POSIX system, some thread from a user program is running, and the timer interrupt has been disabled.
To my understanding, unless it terminates, the currently running thread won't willingly give up control of the CPU.
My question is as follows: will calling pthread_yield() from within the thread give the kernel control over the CPU?
Any kind of help with this question would be greatly appreciated.
Turning off the operating system's timer interrupt would change it into a cooperative multitasking system. This is how Windows 1, 2, and 3 and Mac OS 9 worked. The running task only changed when the program made a system call.
Since pthread_yield results in a system call, yes, the kernel would get control back from the program.
If you are writing programs on a cooperative multitasking system, it is very important to not hog the CPU. If your program does hog the CPU the entire system comes to a halt.
This is why Windows MFC has the idle message in its message loop. Programs doing long-term tasks would do the work in that message handler, operating on one or two items and then returning to the operating system to check whether the user had clicked on something.
A thread can easily relinquish control by issuing system calls that perform blocking inter-thread communication or request I/O. The timer interrupt is not absolutely required for a multithreaded OS, though it is very useful for providing timeouts for such system calls and for helping out if the system is overloaded with ready threads.
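As a small, hedged illustration in plain C (not tied to any particular cooperative OS): a long-running thread can hand the CPU back explicitly between chunks of work. sched_yield() is the standard POSIX call; pthread_yield() is a non-standard alias for it on some systems. The worker function and chunk count below are made up:

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Hypothetical long-running job split into small chunks so the thread
     * can voluntarily give the CPU back to the scheduler between chunks. */
    static void *worker(void *arg)
    {
        (void)arg;
        for (int chunk = 0; chunk < 1000; chunk++) {
            /* ... do one small, bounded piece of work here ... */

            /* Enter the kernel and let it pick another runnable thread.
             * If nothing else is runnable, we just continue immediately. */
            sched_yield();
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        puts("worker finished");
        return 0;
    }

Because sched_yield() is a system call, each call is a point where the kernel regains control, which is exactly what makes this workable on a system without a timer interrupt.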

Consistency Rules for cudaHostAllocMapped

Does anyone know of documentation on the memory consistency model guarantees for a memory region allocated with cudaHostAlloc(..., cudaHostAllocMapped)? For instance, it would be useful to know when writes from the device become visible to reads from the host (it could be after the kernel completes, at the earliest possible time during kernel execution, etc.).
Writes from the device are guaranteed to be visible on the host (or on peer devices) after the performing thread has executed a __threadfence_system() call (which is only available on compute capability 2.0 or higher).
They are also visible after the kernel has finished, i.e. after a cudaDeviceSynchronize() or after one of the other synchronization methods listed in the "Explicit Synchronization" section of the Programming Guide has been successfully completed.
Mapped memory should never be modified from the host while a kernel using it is or could be running, as CUDA currently does not provide any way of synchronization in that direction.
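A minimal sketch of the device-to-host direction, assuming a device that supports mapped host memory: the kernel writes a value, issues __threadfence_system() so the write is visible system-wide, and only then sets a flag that the host polls through a volatile pointer. All names (producer, h_data, h_flag) are invented, the busy-wait is for illustration only, and error checking is omitted:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Device writes data, fences, then raises a flag the host is polling.
    __global__ void producer(volatile int *data, volatile int *flag)
    {
        *data = 42;
        __threadfence_system();   // make *data visible to the host first
        *flag = 1;                // only then publish "data is ready"
    }

    int main()
    {
        cudaSetDeviceFlags(cudaDeviceMapHost);

        int *h_data, *h_flag;
        cudaHostAlloc((void **)&h_data, sizeof(int), cudaHostAllocMapped);
        cudaHostAlloc((void **)&h_flag, sizeof(int), cudaHostAllocMapped);
        *h_data = 0;
        *h_flag = 0;

        int *d_data, *d_flag;
        cudaHostGetDevicePointer((void **)&d_data, h_data, 0);
        cudaHostGetDevicePointer((void **)&d_flag, h_flag, 0);

        producer<<<1, 1>>>(d_data, d_flag);

        // The host only *reads* the mapped region while the kernel runs;
        // per the note above, it must not write to it.
        while (*(volatile int *)h_flag == 0) { /* spin */ }
        printf("value from device: %d\n", *(volatile int *)h_data);

        cudaDeviceSynchronize();
        cudaFreeHost(h_data);
        cudaFreeHost(h_flag);
        return 0;
    }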

Are CUDA kernel calls synchronous or asynchronous?

I read that one can use kernel launches to synchronize different blocks, i.e., if I want all blocks to complete operation 1 before they go on to operation 2, I should place operation 1 in one kernel and operation 2 in another kernel. This way, I can achieve global synchronization between blocks. However, the CUDA C Programming Guide mentions that kernel calls are asynchronous, i.e., the CPU does not wait for the first kernel call to finish, and thus the CPU can call the second kernel before the first has finished. However, if this is true, then we cannot use kernel launches to synchronize blocks. Please let me know where I am going wrong.
Kernel calls are asynchronous from the point of view of the CPU, so if you call two kernels in succession, the second one will be called without waiting for the first one to finish. It only means that control returns to the CPU immediately.
On the GPU side, if you haven't specified different streams for the kernels, they will be executed in the order they were called (if you don't specify a stream, they both go to the default stream and are executed serially). The second kernel will only execute after the first has finished.
Concurrent kernel execution (in different streams) is only possible on devices with compute capability 2.x or higher. On other devices, even though kernel calls are still asynchronous, kernel execution is always sequential.
Check section 3.2.5 of the CUDA C Programming Guide, which every CUDA programmer should read.
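A hedged sketch of that behaviour (the kernel names k1 and k2 and the launch configuration are invented): both launches return to the CPU immediately, but because they go to the default stream, k2 does not start on the device until every block of k1 has finished, and cudaDeviceSynchronize() is what actually makes the CPU wait:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void k1() { /* operation 1, run by all blocks */ }
    __global__ void k2() { /* operation 2, must see operation 1 completed */ }

    int main()
    {
        k1<<<128, 256>>>();   // returns to the CPU immediately
        k2<<<128, 256>>>();   // also returns immediately, but on the GPU it is
                              // queued in the default stream behind k1, so it
                              // starts only after every block of k1 finishes
        printf("CPU got control back before the kernels finished\n");

        cudaDeviceSynchronize();   // now the CPU really waits for both kernels
        printf("both kernels done\n");
        return 0;
    }

So the two statements in the question are both true: the launch is asynchronous for the CPU, while the back-to-back kernels still give you a global synchronization point between blocks on the GPU.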
The accepted answer is not always correct.
In most cases kernel launches are asynchronous, but in the following cases they are synchronous, and these are easily overlooked:
The environment variable CUDA_LAUNCH_BLOCKING is set to 1.
A profiler (nvprof) is collecting hardware counters without concurrent kernel profiling enabled.
A memcpy involves host memory that is not page-locked.
Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.
Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.
From the NVIDIA CUDA Programming Guide (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution-host-device).
Concurrent kernel execution has been supported since compute capability 2.0.
In addition, control can return to the CPU code before all of the kernel's warps have finished their work.
In that case, you can provide the synchronization yourself.
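A hedged sketch of those last points (buffer names and sizes are invented): an async copy only behaves asynchronously when the host buffer is page-locked, and whenever the host relies on the results, the synchronization is added explicitly:

    #include <cstdlib>
    #include <cuda_runtime.h>

    __global__ void work(float *d, int n) { /* ... uses the copied data ... */ }

    int main()
    {
        const int n = 1 << 20;
        float *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(float));

        // Pageable host memory: per the quote above, this "async" copy
        // behaves synchronously with respect to the host.
        float *pageable = (float *)malloc(n * sizeof(float));
        cudaMemcpyAsync(d_buf, pageable, n * sizeof(float),
                        cudaMemcpyHostToDevice, 0);

        // Page-locked host memory: the copy can genuinely overlap host work.
        float *pinned;
        cudaMallocHost((void **)&pinned, n * sizeof(float));
        cudaMemcpyAsync(d_buf, pinned, n * sizeof(float),
                        cudaMemcpyHostToDevice, 0);

        work<<<256, 256>>>(d_buf, n);

        // The launch and the async copies returned immediately; if the host
        // needs the results (or the buffers back), synchronize explicitly.
        cudaDeviceSynchronize();

        cudaFreeHost(pinned);
        free(pageable);
        cudaFree(d_buf);
        return 0;
    }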

Stack management in the CLR

I understand the basic concept of the stack and the heap, but it would be great if anyone could resolve the following confusions:
Is there a single stack for the entire application process, or is a new stack created for each thread started in the program?
Is there a single heap for the entire application process, or is a new heap created for each thread started in the program?
If stacks are created for each thread, then how does the process manage the sequential flow of threads (and hence stacks)?
There is a separate stack for every thread. This is true not only for CLR, and not only for Windows, but pretty much for every OS or platform out there.
There is single heap for every Application Domain. A single process may run several app domains at once. A single app domain may run several threads.
To be more precise, there are usually two heaps per domain: one regular and one for really large objects (like, say, a 64K array).
I don't understand what you mean by "sequential flow of threads".
One stack for each thread, all threads share the same heaps.
There is no 'sequential flow' of threads. A thread is an operating system object that stores a copy of the processor state. The processor state includes the register values. One of them is ESP, the stack pointer. Another really important one is EIP, the instruction pointer. When the operating system switches between threads, it stores the processor state in the current thread object and reloads the state from the thread object for the thread that was selected to run next. The processor now simply continues executing where it left off previously.
Getting a thread started is perhaps now easy to understand as well. The operating system allocates a megabyte of memory for the stack, initializes the ESP register to point to that memory, and sets the EIP register to the address of the method where the thread should start executing; in C#, that is the method given by the ThreadStart delegate.
Each thread must have its own stack; that's where local variables, parameters, and the return addresses of the calling functions are held.
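The one-stack-per-thread point is easy to observe directly. The short C sketch below is not CLR-specific (it uses pthreads, and the thread names are made up), but it shows the same underlying OS mechanism: a local variable in each thread ends up in a different stack region:

    #include <pthread.h>
    #include <stdio.h>

    /* Each thread gets its own stack, so a local variable in each thread
     * lives at a clearly different address. */
    static void *show_stack(void *name)
    {
        int local = 0;   /* lives on *this* thread's stack */
        printf("%s: local variable at %p\n", (char *)name, (void *)&local);
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, show_stack, "thread A");
        pthread_create(&b, NULL, show_stack, "thread B");
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        int local = 0;   /* and main's thread has its own stack, too */
        printf("main: local variable at %p\n", (void *)&local);
        return 0;
    }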
