Consistency Rules for cudaHostAllocMapped

Consistency Rules for cudaHostAllocMapped - memory

Does anyone know of documentation on the memory consistency model guarantees for a memory region allocated with cudaHostAlloc(..., cudaHostAllocMapped)? For instance, when writes from the device become visible to reads from the host would be useful (could be after the kernel completes, at earliest possible time during kernel execution, etc).

Writes from the device are guaranteed to be visible on the host (or on peer devices) after the performing thread has executed a __threadfence_system() call (which is only available on compute capability 2.0 or higher).
They are also visible after the kernel has finished, i.e. after a cudaDeviceSynchronize() or after one of the other synchronization methods listed in the "Explicit Synchronization" section of the Programming Guide has been successfully completed.
Mapped memory should never be modified from the host while a kernel using it is or could be running, as CUDA currently does not provide any way of synchronization in that direction.

Related

Operating Systems: Processes, Pagination and Memory Allocation doubts

I have several doubts about processes and memory management. List the main. I'm slowly trying to solve them by myself but I would still like some help from you experts =).
I understood that the data structures associated with a process are more or less these:
text, data, stack, kernel stack, heap, PCB.
If the process is created but the LTS decides to send it to secondary memory, are all the data structures copied for example on SSD or maybe just text and data (and PCB in kernel space)?
Pagination allows you to allocate processes in a non-contiguous way:
How does the kernel know if the process is trying to access an illegal memory area? After not finding the index on the page table, does the kernel realize that it is not even in virtual memory (secondary memory)? If so, is an interrupt (or exception) thrown? Is it handled immediately or later (maybe there was a process switch)?
If the processes are allocated non-contiguously, how does the kernel realize that there has been a stack overflow since the stack typically grows down and the heap up? Perhaps the kernel uses virtual addresses in PCBs as memory pointers that are contiguous for each process so at each function call it checks if the VIRTUAL pointer to the top of the stack has touched the heap?
How do programs generate their internal addresses? For example, in the case of virtual memory, everyone assumes starting from the address 0x0000 ... up to the address 0xffffff ... and is it then up to the kernel to proceed with the mapping?
How did the processes end? Is the system call exit called both in case of normal termination (finished last instruction) and in case of killing (by the parent process, kernel, etc.)? Does the process itself enter kernel mode and free up its associated memory?
Kernel schedulers (LTS, MTS, STS) when are they invoked? From what I understand there are three types of kernels:
separate kernel, below all processes.
the kernel runs inside the processes (they only change modes) but there are "process switching functions".
the kernel itself is based on processes but still everything is based on process switching functions.
I guess the number of pages allocated the text and data depend on the "length" of the code and the "global" data. On the other hand, is the number of pages allocated per heap and stack variable for each process? For example I remember that the JVM allows you to change the size of the stack.
When a running process wants to write n bytes in memory, does the kernel try to fill a page already dedicated to it and a new one is created for the remaining bytes (so the page table is lengthened)?
I really thank those who will help me.
Have a good day!

I think you have lots of misconceptions. Let's try to clear some of these.
If the process is created but the LTS decides to send it to secondary memory, are all the data structures copied for example on SSD or maybe just text and data (and PCB in kernel space)?
I don't know what you mean by LTS. The kernel can decide to send some pages to secondary memory but only on a page granularity. Meaning that it won't send a whole text segment nor a complete data segment but only a page or some pages to the hard-disk. Yes, the PCB is stored in kernel space and never swapped out (see here: Do Kernel pages get swapped out?).
How does the kernel know if the process is trying to access an illegal memory area? After not finding the index on the page table, does the kernel realize that it is not even in virtual memory (secondary memory)? If so, is an interrupt (or exception) thrown? Is it handled immediately or later (maybe there was a process switch)?
On x86-64, each page table entry has 12 bits reserved for flags. The first (right-most bit) is the present bit. On access to the page referenced by this entry, it tells the processor if it should raise a page-fault. If the present bit is 0, the processor raises a page-fault and calls an handler defined by the OS in the IDT (interrupt 14). Virtual memory is not secondary memory. It is not the same. Virtual memory doesn't have a physical medium to back it. It is a concept that is, yes implemented in hardware, but with logic not with a physical medium. The kernel holds a memory map of the process in the PCB. On page fault, if the access was not within this memory map, it will kill the process.
If the processes are allocated non-contiguously, how does the kernel realize that there has been a stack overflow since the stack typically grows down and the heap up? Perhaps the kernel uses virtual addresses in PCBs as memory pointers that are contiguous for each process so at each function call it checks if the VIRTUAL pointer to the top of the stack has touched the heap?
The processes are allocated contiguously in the virtual memory but not in physical memory. See my answer here for more info: Each program allocates a fixed stack size? Who defines the amount of stack memory for each application running?. I think stack overflow is checked with a page guard. The stack has a maximum size (8MB) and one page marked not present is left underneath to make sure that, if this page is accessed, the kernel is notified via a page-fault that it should kill the process. In itself, there can be no stack overflow attack in user mode because the paging mechanism already isolates different processes via the page tables. The heap has a portion of virtual memory reserved and it is very big. The heap can thus grow according to how much physical space you actually have to back it. That is the size of the swap file + RAM.
How do programs generate their internal addresses? For example, in the case of virtual memory, everyone assumes starting from the address 0x0000 ... up to the address 0xffffff ... and is it then up to the kernel to proceed with the mapping?
The programs assume an address (often 0x400000) for the base of the executable. Today, you also have ASLR where all symbols are kept in the executable and determined at load time of the executable. In practice, this is not done much (but is supported).
How did the processes end? Is the system call exit called both in case of normal termination (finished last instruction) and in case of killing (by the parent process, kernel, etc.)? Does the process itself enter kernel mode and free up its associated memory?
The kernel has a memory map for each process. When the process dies via abnormal termination, the memory map is crossed and cleared off of that process's use.
Kernel schedulers (LTS, MTS, STS) when are they invoked?
All your assumptions are wrong. The scheduler cannot be called otherwise than with a timer interrupt. The kernel isn't a process. There can be kernel threads but they are mostly created via interrupts. The kernel starts a timer at boot and, when there is a timer interrupt, the kernel calls the scheduler.
I guess the number of pages allocated the text and data depend on the "length" of the code and the "global" data. On the other hand, is the number of pages allocated per heap and stack variable for each process? For example I remember that the JVM allows you to change the size of the stack.
The heap and stack have portions of virtual memory reserved for them. The text/data segment start at 0x400000 and end wherever they need. The space reserved for them is really big in virtual memory. They are thus limited by the amount of physical memory available to back them. The JVM is another thing. The stack in JVM is not the real stack. The stack in JVM is probably heap because JVM allocates heap for all the program's needs.
When a running process wants to write n bytes in memory, does the kernel try to fill a page already dedicated to it and a new one is created for the remaining bytes (so the page table is lengthened)?
The kernel doesn't do that. On Linux, the libstdc++/libc C++/C implementation does that instead. When you allocate memory dynamically, the C++/C implementation keeps track of the allocated space so that it won't request a new page for a small allocation.
EDIT
Do compiled (and interpreted?) Programs only work with virtual addresses?
Yes they do. Everything is a virtual address once paging is enabled. Enabling paging is done via a control register set at boot by the kernel. The MMU of the processor will automatically read the page tables (among which some are cached) and will translate these virtual addresses to physical ones.
So do pointers inside PCBs also use virtual addresses?
Yes. For example, the PCB on Linux is the task_struct. It holds a field called pgd which is an unsigned long*. It will hold a virtual address and, when dereferenced, it will return the first entry of the PML4 on x86-64.
And since the virtual memory of each process is contiguous, the kernel can immediately recognize stack overflows.
The kernel doesn't recognize stack overflows. It will simply not allocate more pages to the stack then the maximum size of the stack which is a simple global variable in the Linux kernel. The stack is used with push pops. It cannot push more than 8 bytes so it is simply a matter of reserving a page guard for it to create page-faults on access.
however the scheduler is invoked from what I understand (at least in modern systems) with timer mechanisms (like round robin). It's correct?
Round-robin is not a timer mechanism. The timer is interacted with using memory mapped registers. These registers are detected using the ACPI tables at boot (see my answer here: https://cs.stackexchange.com/questions/141870/when-are-a-controllers-registers-loaded-and-ready-to-inform-an-i-o-operation/141918#141918). It works similarly to the answer I provided for USB (on the link I provided here). Round-robin is a scheduler priority scheme often called naive because it simply gives every process a time slice and executes them in order which is not currently used in the Linux kernel (I think).
I did not understand the last point. How is the allocation of new memory managed.
The allocation of new memory is done with a system call. See my answer here for more info: Who sets the RIP register when you call the clone syscall?.
The user mode process jumps into a handler for the system call by calling syscall in assembly. It jumps to an address specified at boot by the kernel in the LSTAR64 register. Then the kernel jumps to a function from assembly. This function will do the stuff the user mode process requires and return to the user mode process. This is often not done by the programmer but by the C++/C implementation (often called the standard library) that is a user mode library that is linked against dynamically.
The C++/C standard library will keep track of the memory it allocated by, itself, allocating some memory and by keeping records. Then, if you ask for a small allocation, it will use the pages it already allocated instead of requesting new ones using mmap (on Linux).

The reasons why rps procedure use spinlock with local_irq_disable

These days, I'm studying kernel internal network code, especially RPS code. You know, there are a lot of functions about that. But I am focusing on some functions about SMP queue processing such as enqueue_to_backlog and process_backlog.
I wonder about synchronization btw two cores(or single core) by using two functions -enqueue_to_backlog and process_backlog-.
In that functions, A core(A) holds a spin_lock of the other core(B) for queueing packets into input_pkt_queue and scheduling napi of the core(B). And A Core(B) also holds a spin_lock for splicing input_pkt_queue to process_queue of the core(B) and removing napi schedule by itself. I know that spin_lock should be held to prevent two core from accessing the same queue each other during processing queue.
But I can't understand why spin_lock is called with local_irq_disable(or local_irq_save). I think that there is no accessing the queues or rps_lock of the core(B) by Interrupts Context(TH), when interrupts(TH) preempt current context(softirq, BH). - Of course, napi struct can be accessed for scheduling napi by TH, but it holds disabling irq until queueing packet- So I wonder about why spin_lock is called with irq disable.
I think it is impossible to preempt current context(napi, softirq) by other BH such as tasklet. Is it true? And I want to know whether local_irq_disable disable all cores irq or just current core's irq literally? Actually, I read a book about kernel development, but I think i don't understand preemption enough.
Would explain the reasons why rps procedure use spin_lock with local_irq_disable?

Disabling interrupts affects the current core (only). When disabled, therefore, no other code on the same core will be able to interfere with an update to a data structure. The point of spinlocks is to extend the "lock-out" to other cores (although it's cooperative, not hardware-enforced).
It's dangerous/irresponsible to take a spin lock in the kernel without disabling interrupts because, when an interrupt then occurs, the current code will be suspended, and now you are preventing other cores from making progress while some unrelated interrupt handler is running (even if another user process or tasklet on the original core won't be able to preempt). Other cores might be in an interrupt or BH context themselves and now you're delaying the entire system. Spin locks are supposed to be held for very brief periods to do critical updates to shared data structures.
It's also a good way to generate deadlocks. Consider if the scenario above were replicated in another subsystem (or possibly another device in the same subsystem, but I'll describe the former).
Here, core A takes a spinlock in subsystem 1 without disabling interrupts. At the same time, core B takes a spinlock in subsystem 2 also without disabling interrupts. Now what happens if an interrupt related to subsystem 2 happens on core A, and while executing the subsystem 2 interrupt handler, core A needs to update a structure protected by the spinlock held in core B. But at about the same time, a subsystem 1 interrupt happens on core B, which needs to update a data structure in that subsystem. Now both cores are busy-waiting for a spinlock held by the other core, and the entire system is frozen until you do a hard reset.

How do modern OS's achieve idempotent cleanup functions for process deaths?

Let's say that I have an OS that implements malloc by storing a list of segments that the process points to in a process control block. I grab my memory from a free list and give it to the process.
If that process dies, I simply remove the reference to the segment from the process control block, and move the segment back to my free list.
Is it possible to create an idempotent function that does this process cleanup? How is it possible to create a function such that it can be called again, regardless of whether it was called many times before or if previous calls died in the middle of executing the cleanup function? It seems to me that you can't execute two move commands atomically.
How do modern OS's implement the magic involved in culling memory from processes that randomly die? How do they implement it so that it's okay for even the process performing the cull to randomly die, or is this a false assumption that I made?

I'll assume your question boils down to how the OS culls a process's memory if that process crashes.
Although I'm self educated in these matters, I'll give you two ways an OS can make sure any memory used by a process is reclaimed if the process crashes.
In a typical modern CPU and modern OS with virtual memory:
You have two layers of allocation. Whenever the process calls malloc, malloc tries to satisfy the request from already available memory pages the kernel gave the process. If not enough pages are available, malloc asks the kernel to allocate more pages.
In this case, whenever a process crashes or even if it exits normally, the kernel doesn't care what malloc did, or what memory the process forgot to release. It only needs to free all the pages it gave the process.
In a simpler OS that doesn't care much about performance, memory fragmentation or virtual memory and maybe not even about memory protection:
Malloc/free is implemented completely on the kernel side (e.g: system calls). Whenever a process calls malloc/free, the kernel does all the work, and therefore knows about all the memory that needs to be freed. Once the process crashes or exits, the kernel can cleanup. Since the kernel is never supposed to crash, and keep a record of all the allocated memory per process, it's trivial.
Like I said, I'm self educated, and I didn't check how for example Linux or Windows implement it.

MIPS memory execution prevention

I'm doing some research with the MIPS architecture and was wondering how operating systems are implemented with the limited instructions and memory protection that mips offers. I'm specifically wondering about how an operating system would prevent certain addresses ranges from being executed. For example, how could an operating system limit PC to operate in a particular range? In other words, prevent something such as executing from dynamically allocated memory?
The first thing that came to mind is with TLBs, but TLBs only offer memory write protection (and not execute).
I don't quite see how it could be handled by the OS either, because that would imply that every instruction would result in an exception and then MANY cycles would be burned just checking to see if PC was in a sane address range.
If anyone knows, how is it typically done? Is it handled somehow by the hardware during initialization (e.g. It's given an address range and an exception is hit if its out of range?)

Most of protection checks are done in hardware, by the CPU itself, and do not need much involvement from the OS side.
The OS sets up some special tables (page tables or segment descriptors or some such) where memory ranges have associated read, write, execute and user/kernel permissions that the CPU then caches internally.
The CPU then on every instruction checks whether or not the memory accesses comply with the OS-established permissions and if everything's OK, carries on. If there's an attempt to violate those permissions the CPU raises an exception (a form of an interrupt similar to those from external to the CPU I/O devices) that the OS handles. In most cases the OS simply terminates the offending application when it gets such an exception.
In some other cases it tries to handle them and make the seemingly broken code work. One of these cases is support for virtual, on-disk memory. The OS marks a region as unpresent/inaccessible when it's not backed up by physical memory and it's data is somewhere on the disk. When the app tries to use that region, the OS catches an exception from the instruction that tries to access this memory region, backs the region with physical memory, fills it in with data from the disk, marks it as present/accessible and restarts the instruction that's caused the exception. Whenever the OS is low on memory, it can offload data from certain ranges to the disk, mark those ranges as unpresent/inaccessible again and reclaim the memory from those regions for other purposes.
There may also be specific hard-coded by the CPU memory ranges inaccessible to software running outside of the OS kernel and the CPU can easily make a check here as well.
This seems to be the case for MIPS (from "Application Note 235 - Migrating from MIPS to ARM"):
3.4.2 Memory protection
MIPS offers memory protection only to the extent described earlier i.e. addresses
in the upper 2GB of the address space are not permitted when in user mode.
No finer-grained protection regime is possible.
This document lists "MEM - page fault on data fetch; misaligned memory access; memory-protection violation" among the other MIPS exceptions.
If a particular version of the MIPS CPU doesn't have any more fine-grained protection checks, they can only be emulated by the OS and at a significant cost. The OS would need to execute code instruction by instruction or translate it into almost equivalent code with inserted address and access checks and execute that instead of the original code.

This is indeed done with TLBs. No Execute Bits (NX bits) became popular only a few years ago, so older MIPS processors do not support it. The latest version of the MIPS architecture (Release 3) and the SmartMIPS Application-Specific Extension support it as an optional feature under the name of XI (Execute Inhibit).
If you have a chip without this feature you are out of luck. Like Alex already said, there is no simple way to emulate this feature.

Are cuda kernel calls synchronous or asynchronous

I read that one can use kernel launches to synchronize different blocks i.e., If i want all blocks to complete operation 1 before they go on to operation 2, I should place operation 1 in one kernel and operation 2 in another kernel. This way, I can achieve global synchronization between blocks. However, the cuda c programming guide mentions that kernel calls are asynchronous ie. the CPU does not wait for the first kernel call to finish and thus, the CPU can also call the second kernel before the 1st has finished. However, if this is true, then we cannot use kernel launches to synchronize blocks. Please let me know where i am going wrong

Kernel calls are asynchronous from the point of view of the CPU so if you call 2 kernels in succession the second one will be called without waiting for the first one to finish. It only means that the control returns to the CPU immediately.
On the GPU side, if you haven't specified different streams to execute the kernel they will be executed by the order they were called (if you don't specify a stream they both go to the default stream and are executed serially). Only after the first kernel is finished the second one will execute.
This behavior is valid for devices with compute capability 2.x which support concurrent kernel execution. On the other devices even though kernel calls are still asynchronous the kernel execution is always sequential.
Check the CUDA C programming guide on section 3.2.5 which every CUDA programmer should read.

The accepted answer is not always correct.
In most cases, kernel launch is asynchronous. But in the following case, it is synchronous. And they are easily ignored by people.
environment variable CUDA_LAUNCH_BLOCKING equals to 1.
using a profiler(nvprof), without enabling concurrent kernel profiling
memcpy that involve host memory which is not page-locked.
Programmers can globally disable asynchronicity of kernel launches for all CUDA applications running on a system by setting the CUDA_LAUNCH_BLOCKING environment variable to 1. This feature is provided for debugging purposes only and should not be used as a way to make production software run reliably.
Kernel launches are synchronous if hardware counters are collected via a profiler (Nsight, Visual Profiler) unless concurrent kernel profiling is enabled. Async memory copies will also be synchronous if they involve host memory that is not page-locked.
From the NVIDIA CUDA programming guide(http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#concurrent-execution-host-device).

Concurrent kernel execution is supported since 2.0 CUDA capability version.
In addition, a return to the CPU code can be made earlier than all the warp kernel to have worked.
In this case, you can provide synchronization yourself.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart