Kernel Read/Write Userspace memory - memory

First, I malloc a buffer in userspace and fill it with 'A' characters.
Then I pass the pointer to the buffer to the kernel, using a netlink socket.
Finally, I can read and write the buffer in the kernel, using the raw pointer passed directly from userspace.
Why?
Why is direct access to userspace memory from the kernel allowed?
Linux Device Drivers, Third Edition, page 415, says that the kernel cannot directly manipulate memory that is not mapped into the kernel’s address space.

The point is that accessing user addresses directly in the kernel only sometimes works.
The access will work as long as all of the following hold: you access the user address in the context of the same process that allocated it, the process has already faulted the page in, you are using a kernel with a 3:1 memory split (as opposed to the 4:4 split that is sometimes used), and the kernel has not swapped out the page the allocation is in.
The problem is that these conditions are not always true, and they can change even from one run of the program to the next. Therefore kernel driver writers must not count on being able to access user addresses directly.
The worst thing that can happen is for you to assume it works, have it always work in the lab, and have it crash at a customer site every so often. This is the reason for the statement in the book.

In the book, the words 'The kernel cannot directly manipulate memory that is not mapped into the kernel’s address space' are about physical memory. In other words, the kernel has only 800-900 MB of address space (on 32-bit x86) that can be mapped to physical memory at one time. To access all of physical memory, the kernel needs to constantly remap this region.
Netlink does not deal with physical memory at all. It is designed for bidirectional communication between userspace<->userspace or userspace<->kernelspace.

Related

Operating Systems: Processes, Pagination and Memory Allocation doubts

I have several doubts about processes and memory management. I list the main ones below. I'm slowly trying to solve them by myself but I would still like some help from you experts =).
I understood that the data structures associated with a process are more or less these:
text, data, stack, kernel stack, heap, PCB.
If the process is created but the LTS decides to send it to secondary memory, are all the data structures copied for example on SSD or maybe just text and data (and PCB in kernel space)?
Pagination allows you to allocate processes in a non-contiguous way:
How does the kernel know if the process is trying to access an illegal memory area? After not finding the index on the page table, does the kernel realize that it is not even in virtual memory (secondary memory)? If so, is an interrupt (or exception) thrown? Is it handled immediately or later (maybe there was a process switch)?
If the processes are allocated non-contiguously, how does the kernel realize that there has been a stack overflow since the stack typically grows down and the heap up? Perhaps the kernel uses virtual addresses in PCBs as memory pointers that are contiguous for each process so at each function call it checks if the VIRTUAL pointer to the top of the stack has touched the heap?
How do programs generate their internal addresses? For example, in the case of virtual memory, everyone assumes starting from the address 0x0000 ... up to the address 0xffffff ... and is it then up to the kernel to proceed with the mapping?
How did the processes end? Is the system call exit called both in case of normal termination (finished last instruction) and in case of killing (by the parent process, kernel, etc.)? Does the process itself enter kernel mode and free up its associated memory?
Kernel schedulers (LTS, MTS, STS) when are they invoked? From what I understand there are three types of kernels:
separate kernel, below all processes.
the kernel runs inside the processes (they only change modes) but there are "process switching functions".
the kernel itself is based on processes but still everything is based on process switching functions.
I guess the number of pages allocated the text and data depend on the "length" of the code and the "global" data. On the other hand, is the number of pages allocated per heap and stack variable for each process? For example I remember that the JVM allows you to change the size of the stack.
When a running process wants to write n bytes in memory, does the kernel try to fill a page already dedicated to it and a new one is created for the remaining bytes (so the page table is lengthened)?
I really thank those who will help me.
Have a good day!
I think you have lots of misconceptions. Let's try to clear some of these.
If the process is created but the LTS decides to send it to secondary memory, are all the data structures copied for example on SSD or maybe just text and data (and PCB in kernel space)?
I don't know what you mean by LTS. The kernel can decide to send some pages to secondary memory, but only at page granularity. It won't send a whole text segment or a complete data segment, only a page or some pages, to the hard disk. Yes, the PCB is stored in kernel space and never swapped out (see here: Do Kernel pages get swapped out?).
How does the kernel know if the process is trying to access an illegal memory area? After not finding the index on the page table, does the kernel realize that it is not even in virtual memory (secondary memory)? If so, is an interrupt (or exception) thrown? Is it handled immediately or later (maybe there was a process switch)?
On x86-64, each page table entry has its low 12 bits reserved for flags. The first (right-most) bit is the present bit. On access to the page referenced by this entry, it tells the processor whether it should raise a page fault. If the present bit is 0, the processor raises a page fault and calls a handler defined by the OS in the IDT (vector 14). Virtual memory is not the same thing as secondary memory. Virtual memory doesn't have a physical medium of its own to back it. It is a concept that is implemented in hardware, but with logic rather than with a storage medium. The kernel holds a memory map of the process in the PCB. On a page fault, if the access was not within this memory map, the kernel will kill the process (on Linux, by sending it SIGSEGV).
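To make the flag bits concrete, here is a minimal sketch in C of how the low bits of an x86-64 page table entry can be tested. The helper names (pte_present, pte_frame, user_access_faults) are made up for illustration and are not kernel APIs. It only looks at a single entry, whereas real hardware checks the flags at every level of the page walk and also honors bits such as NX.

```c
#include <stdbool.h>
#include <stdint.h>

/* x86-64 page table entry (4 KiB pages): the low 12 bits are flags,
 * bits 12..51 hold the physical frame address. */
#define PTE_PRESENT   (1ULL << 0)  /* bit 0: page is present */
#define PTE_WRITABLE  (1ULL << 1)  /* bit 1: writes allowed */
#define PTE_USER      (1ULL << 2)  /* bit 2: user-mode access allowed */
#define PTE_ADDR_MASK 0x000FFFFFFFFFF000ULL

/* Hypothetical helper: is the present bit set? */
bool pte_present(uint64_t pte)
{
    return (pte & PTE_PRESENT) != 0;
}

/* Hypothetical helper: physical address of the frame the entry points to. */
uint64_t pte_frame(uint64_t pte)
{
    return pte & PTE_ADDR_MASK;
}

/* Hypothetical helper: would a user-mode access through this one entry
 * raise a page fault? Simplified: ignores upper-level entries and NX. */
bool user_access_faults(uint64_t pte, bool is_write)
{
    if (!pte_present(pte))
        return true;
    if (!(pte & PTE_USER))
        return true;
    if (is_write && !(pte & PTE_WRITABLE))
        return true;
    return false;
}
```

For example, an entry of 0x1005 (present and user, but not writable) lets a user-mode read through and faults on a user-mode write.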
If the processes are allocated non-contiguously, how does the kernel realize that there has been a stack overflow since the stack typically grows down and the heap up? Perhaps the kernel uses virtual addresses in PCBs as memory pointers that are contiguous for each process so at each function call it checks if the VIRTUAL pointer to the top of the stack has touched the heap?
The processes are allocated contiguously in virtual memory but not in physical memory. See my answer here for more info: Each program allocates a fixed stack size? Who defines the amount of stack memory for each application running?. I think stack overflow is detected with a guard page. The stack has a maximum size (8MB by default on typical Linux systems) and one page marked not present is left underneath it, so that if this page is accessed, the kernel is notified via a page fault that it should kill the process. A stack overflow in one user mode process cannot reach another process's memory, because the paging mechanism already isolates different processes via the page tables. The heap has a very large portion of virtual memory reserved for it. The heap can thus grow according to how much physical space you actually have to back it, that is, the size of the swap file + RAM.
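The maximum stack size mentioned above is not hard-wired: on Linux and other POSIX systems it is a per-process resource limit that can be queried with getrlimit. A small sketch, assuming a POSIX system; the wrapper name stack_soft_limit is made up for illustration:

```c
#include <sys/resource.h>

/* Hypothetical wrapper around getrlimit(RLIMIT_STACK).
 * Returns 0 on success and stores the soft limit in bytes in *out
 * (the value may be RLIM_INFINITY if the stack is unlimited).
 * Returns -1 if the limit could not be read. */
int stack_soft_limit(rlim_t *out)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_STACK, &lim) != 0)
        return -1;
    *out = lim.rlim_cur;
    return 0;
}
```

On many Linux installations this reports 8 MiB, but the value depends on the shell's ulimit settings, so it should not be assumed.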
How do programs generate their internal addresses? For example, in the case of virtual memory, everyone assumes starting from the address 0x0000 ... up to the address 0xffffff ... and is it then up to the kernel to proceed with the mapping?
Programs traditionally assume a fixed address (often 0x400000) for the base of the executable. Today, you also have ASLR, where relocation information is kept in the executable and the addresses are determined at load time. This requires the executable to be built as position-independent code, which many current Linux distributions do by default.
How did the processes end? Is the system call exit called both in case of normal termination (finished last instruction) and in case of killing (by the parent process, kernel, etc.)? Does the process itself enter kernel mode and free up its associated memory?
The kernel has a memory map for each process. When the process dies via abnormal termination, the kernel walks this memory map and releases the memory that the process was using.
Kernel schedulers (LTS, MTS, STS) when are they invoked?
All your assumptions are wrong. The kernel isn't a process. There can be kernel threads, but the kernel itself runs in response to interrupts, exceptions and system calls. The kernel starts a timer at boot and, when there is a timer interrupt, the kernel calls the scheduler. The scheduler also runs when the current task blocks (for example, while waiting for I/O) or gives up the CPU voluntarily.
I guess the number of pages allocated the text and data depend on the "length" of the code and the "global" data. On the other hand, is the number of pages allocated per heap and stack variable for each process? For example I remember that the JVM allows you to change the size of the stack.
The heap and stack have portions of virtual memory reserved for them. The text/data segments start at 0x400000 and end wherever they need to. The space reserved for them in virtual memory is really big. They are thus limited by the amount of physical memory available to back them. The JVM is another thing. The stack in the JVM is not the real stack. The stack in the JVM is probably heap memory, because the JVM allocates heap for all the program's needs.
When a running process wants to write n bytes in memory, does the kernel try to fill a page already dedicated to it and a new one is created for the remaining bytes (so the page table is lengthened)?
The kernel doesn't do that. On Linux, the libstdc++/libc C++/C implementation does that instead. When you allocate memory dynamically, the C++/C implementation keeps track of the allocated space so that it won't request a new page for a small allocation.
EDIT
Do compiled (and interpreted?) Programs only work with virtual addresses?
Yes they do. Everything is a virtual address once paging is enabled. Paging is enabled via a control register (CR0 on x86) set at boot by the kernel. The MMU of the processor automatically reads the page tables (some entries of which are cached in the TLB) and translates these virtual addresses to physical ones.
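As an illustration of what the MMU does with a virtual address, the sketch below splits an x86-64 address into the four table indices and the page offset used by a 4-level walk with 4 KiB pages. The struct and function names are made up for this example. It performs only the bit slicing, not the memory reads of the actual walk.

```c
#include <stdint.h>

/* Indices used by a 4-level x86-64 page walk with 4 KiB pages. */
struct va_parts {
    unsigned pml4;   /* bits 39..47: index into the PML4 */
    unsigned pdpt;   /* bits 30..38: index into the page directory pointer table */
    unsigned pd;     /* bits 21..29: index into the page directory */
    unsigned pt;     /* bits 12..20: index into the page table */
    unsigned offset; /* bits 0..11: byte offset inside the page */
};

/* Hypothetical helper: slice a virtual address into its walk indices. */
struct va_parts split_va(uint64_t va)
{
    struct va_parts p;

    p.pml4   = (unsigned)((va >> 39) & 0x1FF);
    p.pdpt   = (unsigned)((va >> 30) & 0x1FF);
    p.pd     = (unsigned)((va >> 21) & 0x1FF);
    p.pt     = (unsigned)((va >> 12) & 0x1FF);
    p.offset = (unsigned)(va & 0xFFF);
    return p;
}
```

For the address 0x400123 this gives page directory index 2, page table index 0 and offset 0x123, with both upper indices equal to 0.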
So do pointers inside PCBs also use virtual addresses?
Yes. For example, the PCB on Linux is the task_struct. It points to an mm_struct, which holds a field called pgd of type pgd_t*. That field holds a virtual address and, when dereferenced, it returns the first entry of the PML4 on x86-64.
And since the virtual memory of each process is contiguous, the kernel can immediately recognize stack overflows.
The kernel doesn't recognize stack overflows as such. It will simply not allocate more pages to the stack than the maximum stack size, which on Linux is a per-process resource limit. The stack is used with push and pop instructions. A single push cannot write more than 8 bytes, so it is simply a matter of reserving a guard page below the stack to create page faults on access.
however the scheduler is invoked from what I understand (at least in modern systems) with timer mechanisms (like round robin). It's correct?
Round-robin is not a timer mechanism. The timer is interacted with using memory mapped registers. These registers are detected using the ACPI tables at boot (see my answer here: https://cs.stackexchange.com/questions/141870/when-are-a-controllers-registers-loaded-and-ready-to-inform-an-i-o-operation/141918#141918). It works similarly to the answer I provided for USB (on the link I provided here). Round-robin is a scheduling policy, often called naive because it simply gives every process a time slice and executes them in order. It is not the default policy in the Linux kernel (I think), although Linux offers it as the SCHED_RR real-time policy.
I did not understand the last point. How is the allocation of new memory managed.
The allocation of new memory is done with a system call. See my answer here for more info: Who sets the RIP register when you call the clone syscall?.
The user mode process jumps into a handler for the system call by executing the syscall instruction in assembly. The processor jumps to an address specified at boot by the kernel in the IA32_LSTAR model-specific register. Then the kernel jumps to a function from assembly. This function does what the user mode process requested and returns to the user mode process. This is often not done by the programmer but by the C++/C implementation (often called the standard library), which is a user mode library that is dynamically linked against.
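To see the standard library's role, one can bypass the usual wrapper and issue a system call by number. A minimal sketch, assuming Linux with glibc, where syscall() and SYS_getpid are available; the function name raw_getpid is made up for illustration:

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Issues the getpid system call by number through the generic syscall()
 * entry point instead of the dedicated libc wrapper getpid(). */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Both raw_getpid() and getpid() end up in the same kernel handler, so they return the same process ID. The wrapper only hides the system call number and the register setup.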
The C++/C standard library will keep track of the memory it allocated by, itself, allocating some memory and by keeping records. Then, if you ask for a small allocation, it will use the pages it already allocated instead of requesting new ones using mmap (on Linux).
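The bookkeeping idea can be sketched with a toy allocator. This is not how glibc's malloc is implemented (real allocators keep free lists, reuse freed memory and handle large requests). It only illustrates serving small requests from a page that was already obtained and asking the kernel for a new page when the current one is full. All names here (toy_alloc, mmap_calls, CHUNK_SIZE) are made up, and a Linux or POSIX system with MAP_ANONYMOUS is assumed.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define CHUNK_SIZE 4096u       /* assumed page size for this sketch */

static unsigned char *chunk;   /* current page obtained from the kernel */
static size_t used;            /* bytes already handed out from chunk */
unsigned mmap_calls;           /* how many times the kernel was asked */

/* Hypothetical toy allocator: serves small requests from the current page
 * and calls mmap only when the request no longer fits. Never frees. */
void *toy_alloc(size_t n)
{
    n = (n + 15u) & ~(size_t)15u;      /* round up to keep 16-byte alignment */
    if (n == 0 || n > CHUNK_SIZE)
        return NULL;                   /* zero-size and large requests not handled */

    if (chunk == NULL || used + n > CHUNK_SIZE) {
        void *p = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        chunk = p;                     /* the old page is simply abandoned */
        used = 0;
        mmap_calls++;
    }

    void *out = chunk + used;
    used += n;
    return out;
}
```

Two small allocations in a row come out of the same page with a single mmap call. Only a request that does not fit in the remaining space triggers another one.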

I/O-mapped I/O - are port addresses a part of the RAM

In I/O-mapped I/O (as opposed to memory-mapped I/O), a certain set of addresses are fixed for I/O devices. Are these addresses a part of the RAM, and thus that much physical address space is unusable ? Does it correspond to the 'Hardware Reserved' memory in the attached picture ?
If yes, how is it decided which bits of an address are to be used for addressing I/O devices (because the I/O address space would be much smaller than the actual memory. I have read this helps to reduce the number of pins/bits used by the decoding circuit) ?
What would happen if one tries to access, in assembly, any address that belongs to this address space ?
I/O mapped I/O doesn't use the same address space as memory mapped I/O. The latter does use part of the address space normally used by RAM and therefore "steals" addresses that no longer belong to RAM.
The set of address ranges that are used by different memory mapped I/O is what you see as "Hardware reserved".
As for how it is decided how to address memory mapped devices, this is largely covered by the PnP subsystem, either in the BIOS or in the OS. Memory-mapped devices, with few exceptions, are PnP devices, which means that for each of them, its base address can be changed (for PCI devices, the base address of the memory mapped registers, if any, is contained in a BAR (Base Address Register), which is part of the PCI configuration space).
Saving pins for decoding devices (lazy decoding) is (was) done on early 8-bit systems, to save decoders and reduce costs. It has nothing to do with memory mapped / IO mapped devices. Lazy decoding may be used in both situations. For example, a designer could decide that the 16-bit address range C000-FFFF is going to be reserved for memory mapped devices. To decide whether to enable some memory chip, or some device, it's enough to look at the value of A15 and A14. If both address lines are high, then the block addressed is C000-FFFF and that means that memory chip enables will be deasserted. On the other hand, a designer could decide that the 8 bit IO port 254 is going to be assigned to a device, and to decode this address, the device only looks at the state of A0, needing no decoders to find out the port address (this is, for example, what the ZX Spectrum does for addressing the ULA).
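The two lazy decoding examples can be written down as bit tests. This is only a software model of what the decoding logic would do. The function names are made up, and the A0-only rule follows the description above, where the device looks at nothing but address line A0.

```c
#include <stdbool.h>
#include <stdint.h>

/* Memory mapped block C000-FFFF: selected when A15 and A14 are both high.
 * The other 14 address lines are not looked at. */
bool mmio_block_selected(uint16_t addr)
{
    return (addr & 0xC000u) == 0xC000u;
}

/* I/O device decoded on A0 only: it responds whenever A0 is low, so
 * port 254 (0xFE) is just one of many port numbers that select it. */
bool a0_device_selected(uint16_t port)
{
    return (port & 0x0001u) == 0;
}
```

A side effect of decoding on a single line is aliasing: in this model every even port number selects the device, not just 254.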
If a program (written in whatever language allows you to access and write to arbitrary memory locations) tries to access a memory address reserved for a device, and assuming that the paging and protection mechanism allows such access, what happens will depend solely on what the device does when that address is accessed. A well known memory mapped device in PCs is the frame buffer. If the graphics card is configured to display color text mode with its default base address, any 8-bit write operation performed to even physical addresses between B8000 and B8F9F will cause the character whose ASCII code is the value written to show on screen, in a location that depends on the address chosen.
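The relation between address and screen position can be worked out as follows, assuming the standard 80x25 color text mode in which each cell takes two bytes (the character at the even address, its attribute at the following odd address). The function name is made up. It only computes the address and does not touch video memory.

```c
#include <stdint.h>

#define TEXT_BASE 0xB8000u   /* default base of the color text buffer */
#define TEXT_COLS 80u
#define TEXT_ROWS 25u

/* Physical address of the character byte for the cell at (row, col).
 * The attribute byte follows at the next (odd) address.
 * Returns 0 for cells outside the 80x25 screen. */
uint32_t text_cell_addr(unsigned row, unsigned col)
{
    if (row >= TEXT_ROWS || col >= TEXT_COLS)
        return 0;
    return TEXT_BASE + 2u * (row * TEXT_COLS + col);
}
```

The top-left cell is at B8000 and the bottom-right cell at B8F9E, whose attribute byte at B8F9F is the end of the range quoted above.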
I/O mapped devices don't collide with memory, as they use a different address space, with different instructions to read and write values to addresses (ports). These devices cannot be addressed using machine code instructions that target memory.
Memory mapped devices share the address space with RAM. Depending on the system configuration, memory mapped registers can be present all the time, using some addresses and thus preventing the system from using them for RAM, or memory mapped devices may "shadow" memory at times, allowing the program to change the I/O configuration to choose whether a certain memory region is decoded as in use by a device or used by regular RAM (this is, for example, what the Commodore 64 does to let the user have 64KB of RAM while still allowing access to device registers sometimes, by temporarily disabling access to the RAM that is "behind" the device currently being accessed at that very same address).
At the hardware level, what is happening is that there are two different signals: MREQ and IOREQ. The first one is asserted on every memory instruction, the second one on every I/O instruction. So in this code...
MOV BX,1234h
MOV AL,[BX] ;reads memory address 1234h (memory address space)
MOV DX,1234h
IN AL,DX ;reads I/O port 1234h (I/O address space)
Both reads put the value 1234h on the CPU address bus, and both assert the RD pin to indicate a read, but the first one will assert MREQ to indicate that the address belongs to the memory address space, and the second one will assert IOREQ to indicate that it belongs to the I/O address space. The I/O device at port 1234h is connected to the system bus so that it is enabled only if the address is 1234h, RD is asserted and IOREQ is asserted. This way, it cannot collide with a RAM chip addressed at 1234h, because the latter will be enabled only if MREQ is asserted (the CPU ensures that IOREQ and MREQ cannot be asserted at the same time).
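The enable logic described above can be modeled in a few lines. This is a software sketch of the decoding, not real hardware. The struct and function names are made up, and the RAM is simplified to the single location 1234h.

```c
#include <stdbool.h>
#include <stdint.h>

/* One bus cycle as seen by the chips; fields mirror the signals above. */
struct bus_cycle {
    uint16_t addr;  /* value on the address bus */
    bool rd;        /* read strobe asserted */
    bool mreq;      /* cycle targets the memory address space */
    bool ioreq;     /* cycle targets the I/O address space */
};

/* RAM location 1234h responds only to memory reads. */
bool ram_selected(struct bus_cycle c)
{
    return c.mreq && c.rd && c.addr == 0x1234u;
}

/* The I/O device at port 1234h responds only to I/O reads. */
bool io_device_selected(struct bus_cycle c)
{
    return c.ioreq && c.rd && c.addr == 0x1234u;
}
```

With the same address on the bus, a memory read selects only the RAM and an I/O read selects only the device, because MREQ and IOREQ are never asserted together.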
These two address spaces don't exist in all CPUs. In fact, the majority of them don't have a separate I/O address space, and therefore they have to memory map all their devices.

Is there device side pointer of host memory for kernel use in OpenCL (like CUDA)?

In CUDA, we can achieve kernel managed data transfer from host memory to device shared memory by using a device side pointer to host memory. Like this:
int *a,*b,*c; // host pointers
int *dev_a, *dev_b, *dev_c; // device pointers to host memory
…
cudaHostGetDevicePointer(&dev_a, a, 0); // mem. copy to device not needed now, but ptrs needed instead
cudaHostGetDevicePointer(&dev_b, b, 0);
cudaHostGetDevicePointer(&dev_c ,c, 0);
…
//kernel launch
add<<<B,T>>>(dev_a,dev_b,dev_c);
// dev_a, dev_b, dev_c are passed into kernel for kernel accessing host memory directly.
In the above example, kernel code can access host memory via dev_a, dev_b and dev_c. The kernel can use these pointers to move data from host memory to shared memory directly, without relaying it through global memory.
But it seems that this is mission impossible in OpenCL? (Local memory in OpenCL is the counterpart of shared memory in CUDA.)
You can find an exactly identical API in OpenCL.
How it works on CUDA:
According to this presentation and the official documentation.
The money quote about cudaHostGetDevicePointer :
Passes back device pointer of mapped host memory allocated by
cudaHostAlloc or registered by cudaHostRegister.
CUDA's cudaHostAlloc with cudaHostGetDevicePointer works exactly like CL_MEM_ALLOC_HOST_PTR with MapBuffer does in OpenCL. Basically, if it's a discrete GPU, the results are cached on the device, and if it's a GPU that shares memory with the host, it will use the memory directly. So there is no actual 'zero copy' operation with a discrete GPU in CUDA.
The function cudaHostGetDevicePointer does not take raw malloc'ed pointers, which is the same limitation as in OpenCL. From the API user's point of view, the two approaches are exactly identical, allowing the implementation to do pretty much identical optimizations.
With a discrete GPU, the pointer you get points to an area that the GPU can directly transfer data to and from via DMA. Otherwise the driver would take your pointer, copy the data to the DMA area and then initiate the transfer.
However, in OpenCL 2.0 that is explicitly possible, depending on the capabilities of your devices. With the finest granularity of sharing you can use arbitrary malloc'ed host pointers and even use atomics with the host, so you could even dynamically control the kernel from the host while it is running.
http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf
See page 162 for the shared virtual memory spec. Do note that when you write kernels even these are still just __global pointers from the kernel point of view.

Kernel mode - can it access user mode?

As far as I know, kernel mode code can access any available address (high privilege), but if I pass a user mode pointer to a kernel mode function, will it be changed before it is used? I mean: will it be resolved with paging/segmentation systems (or just paging for long mode) as it would in user mode?
First of all, you don't "supply a pointer to a kernel mode function". Kernel calls aren't simple jumps, they are either special instructions or software interrupts. Kernel function calling conventions are also different than your usual function calls.
In any event, exactly how accessing user memory from a kernel context works depends on the operating system in question. The kernel typically has a (virtual) address space of its own. This can be a completely independent address space from user process spaces (e.g. 32-bit OSX) or it can be in a special region (the high/low address split in many OSes). In the high/low model, the kernel can typically dereference pointers to user space while it is executing in the context of that process. In the general case, the kernel can explicitly look up the underlying physical memory the user virtual address refers to, and then map that into its own virtual address space.
As user space can maliciously supply bad pointers, they must never be used by the kernel without first checking for validity. This and the subsequent access must be atomic with regard to the user process's memory map, otherwise the process could munmap() the range in the time between the kernel's pointer validity check and actually reading/writing the memory. For this reason, most kernels have helper functions that are essentially a safe memcpy between user- and kernel space that is guaranteed to be safe or return an error in the case of an invalid pointer.
In any case, the kernel code has to do all of this explicitly, there is nothing "automatic" about it. Your syscall may pass through layers of abstraction that do automate this before reaching your kernel module, of course.
Update: Modern hardware supports SMAP (supervisor mode access prevention) which is designed to prevent accidental/malicious dereferencing of pointers to user address space from the kernel. Various operating systems have started enabling this feature, so in those cases you absolutely must go through the special kernel functions for accessing user memory.
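The idea behind the helper functions mentioned above (on Linux they are copy_from_user and copy_to_user, which return the number of bytes that could not be copied) can be modeled in plain user space C. The sketch below is not kernel code: user_space is a byte array standing in for the pages mapped for a process, user_addr is an offset into it rather than a real pointer, and the name checked_copy_from_user is made up. It shows only the validate-then-copy contract, not the fault handling that makes the real helpers safe against a concurrent munmap().

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Simulated user address space (hypothetical stand-in for mapped pages). */
#define USER_SPACE_SIZE 4096u
unsigned char user_space[USER_SPACE_SIZE];

/* Model of a checked copy: validates the whole range before touching it.
 * Returns 0 on success or -EFAULT if any byte lies outside the mapping. */
int checked_copy_from_user(void *dst, size_t user_addr, size_t n)
{
    /* Compare against the remaining space instead of computing
     * user_addr + n, which could wrap around. */
    if (user_addr > USER_SPACE_SIZE || n > USER_SPACE_SIZE - user_addr)
        return -EFAULT;
    memcpy(dst, user_space + user_addr, n);
    return 0;
}
```

A caller never dereferences the user-supplied address itself. It passes the address to the helper and acts on the error code, which is the same discipline kernel code follows with the real functions.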

How does the kernel know about a segmentation fault?

When a segmentation fault occurs, it means I accessed memory that is not allocated or is protected. But how does the kernel or CPU know it? Is it implemented by the hardware? What data structures does the CPU need to look up? When a set of memory is allocated, what data structures need to be modified?
The details will vary, depending on what platform you're talking about, but typically the MMU will generate an exception (interrupt) when you attempt an invalid memory access and the kernel will then handle this as part of an interrupt service routine.
A seg fault generally happens when a process attempts to access memory that the CPU cannot physically address. It is the hardware that notifies the OS about a memory access violation. The OS kernel then sends a signal to the process which caused the exception.
To answer the second part of your question, again it depends on hardware and OS. In a typical system (i.e. x86) the CPU consults the segment registers (via the global or local descriptor tables) to turn the segment relative address into a virtual address (this is usually, but not always, a no-op on modern x86 operating systems), and then (the MMU does this bit really, but on x86 it's part of the CPU) consults the page tables to turn that virtual address into a physical address. When it encounters a page which is not marked present (the present bit is not set in the page directory or tables) it raises an exception. When the OS handles this exception, it will either give up (giving rise to the segfault signal you see when you make a mistake, or to a panic) or it will modify the page tables to make the memory valid and continue from the exception. Typically the OS has some bookkeeping which says which pages could be valid, and how to get the page. This is how demand paging occurs.
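The give-up-or-map decision can be sketched as a lookup in the OS bookkeeping. The region list below is a hypothetical stand-in for whatever structure a real kernel keeps (Linux uses its per-process VMA list). The names are made up, and the sketch only classifies the fault without modifying any page tables.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping entry: one valid region of a process. */
struct region {
    uint64_t start;   /* first valid address (inclusive) */
    uint64_t end;     /* end of the region (exclusive) */
    bool writable;    /* are writes permitted? */
};

enum fault_action { FAULT_MAP_PAGE, FAULT_SEGV };

/* Decide what to do with a page fault at addr: map a page if the address
 * lies in a region that permits the access, otherwise signal a segfault. */
enum fault_action handle_fault(const struct region *regions, size_t count,
                               uint64_t addr, bool is_write)
{
    for (size_t i = 0; i < count; i++) {
        const struct region *r = &regions[i];

        if (addr >= r->start && addr < r->end) {
            if (is_write && !r->writable)
                return FAULT_SEGV;      /* valid address, forbidden access */
            return FAULT_MAP_PAGE;      /* demand paging: make the page valid */
        }
    }
    return FAULT_SEGV;                  /* address is in no known region */
}
```

A read inside a read-only text region would be resolved by mapping the page, while a write to the same address, or any access to an address outside every region, would end in a segfault.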
It all depends on the particular architecture, but all architectures with paged virtual memory work essentially the same. There are data structures in memory that describe the virtual-to-physical mapping of each allocated page of memory. For every memory access, the CPU/MMU hardware looks up those tables to find the mapping. This would be horribly slow, of course, so there are hardware caches to speed it up.

Resources