How can I get the CR3 value?
Furthermore, how can I get the CR3 value of process A (say Firefox)?
Is there any command I could use to read the current CR3?
Thanks!
From here:
I am trying to understand to what extent the PGD (page global directory) value stored in the CR3 register indicates the process/thread currently scheduled by the Linux scheduler.
I know that each process has its own PGD value, but what I am confused about is the value of the CR3 register when kernel threads are scheduled.
Kernel threads simply borrow the last scheduled process's PGD (that is, its entire address space). This is done to avoid unnecessary TLB flushes, since a kernel thread operates only in kernel space, which is the same for all processes.
So to avoid TLB (Translation Lookaside Buffer) flushes, the kernel simply keeps using the PGD of the last user-mode process. For user-mode execution, on the other hand, CR3 is reloaded on every switch to a different process, because each process has its own page tables and therefore its own mapping.
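To make that borrowing concrete, here is a standalone C sketch; the struct and function names are invented for illustration (loosely inspired by Linux's mm/active_mm handling), not the real scheduler code:

    /* Standalone sketch: types and functions invented for illustration,
     * loosely inspired by Linux's mm/active_mm handling. */
    #include <stdio.h>
    #include <stddef.h>

    struct mm   { unsigned long pgd; };          /* page-table root (PGD) */
    struct task { struct mm *mm, *active_mm; };  /* mm == NULL => kernel thread */

    static void load_pgd(unsigned long pgd)      /* stand-in for writing CR3 */
    {
        printf("CR3 <- %#lx (user-mapping TLB entries dropped)\n", pgd);
    }

    static void switch_mm_sketch(struct task *prev, struct task *next)
    {
        if (next->mm == NULL) {
            /* Kernel thread: keep running on prev's page tables, no TLB flush. */
            next->active_mm = prev->active_mm;
            printf("kernel thread borrows PGD %#lx\n", next->active_mm->pgd);
        } else {
            /* User task: install its own page tables. */
            next->active_mm = next->mm;
            load_pgd(next->mm->pgd);
        }
    }

    int main(void)
    {
        struct mm   firefox_mm = { 0x1aa000 };
        struct task firefox    = { &firefox_mm, &firefox_mm };
        struct task kworker    = { NULL, NULL };

        switch_mm_sketch(&firefox, &kworker);  /* kernel thread: no CR3 write */
        switch_mm_sketch(&kworker, &firefox);  /* back to the user task: CR3 reload */
        return 0;
    }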
But on Intel processors vulnerable to, for example, the Meltdown vulnerability, a new mitigation has been introduced: KPTI (Kernel Page-Table Isolation). With KPTI, the CR3 used while your process runs in user mode maps only a very small part of the kernel (only the code/structures needed to enter the interrupt/syscall handlers). On entry to the kernel, CR3 is switched to the kernel's page tables, which are equivalent to your process's except that they also map the whole kernel and map the userland pages (of the calling process) with the NX bit set.
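Coming back to the original question (is there a command to read CR3): not from user space, since CR3 is only readable in ring 0. A minimal kernel-module sketch, assuming x86_64 Linux (the module name and message are made up for the example), that dumps whatever CR3 is live on the current CPU at load time:

    /* Sketch of a tiny Linux kernel module (x86_64 assumed; name and message
     * are made up).  CR3 is only readable in ring 0, so user space has no
     * plain command for this. */
    #include <linux/module.h>
    #include <linux/kernel.h>

    static int __init cr3_dump_init(void)
    {
        unsigned long cr3;

        /* Read CR3 of whatever task is current on this CPU right now. */
        asm volatile("mov %%cr3, %0" : "=r"(cr3));
        pr_info("cr3_dump: current CR3 = %#lx\n", cr3);
        return 0;
    }

    static void __exit cr3_dump_exit(void)
    {
    }

    module_init(cr3_dump_init);
    module_exit(cr3_dump_exit);
    MODULE_LICENSE("GPL");

To get the page-table root of one particular process such as Firefox, you would instead look up its task_struct and read mm->pgd; CR3 holds the physical address of that PGD (possibly combined with PCID/KPTI bits).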
Related
Not sure if anyone here can answer this.
I've learned that an operating system checks whether an instruction of a program changes something outside of the program's allocated memory, and if it does, the OS won't allow the program to do it.
But if the OS has to check this for every instruction, won't this take up at least 5/6 of the CPU time? I tried to estimate it, and that is roughly the share of clock cycles I came up with for checking every instruction.
If I've understood something wrong, please correct me, because I can't imagine that an OS takes up that much of the CPU.
There are several safe-guards in place to ensure a non-privileged process behaves. I will discuss two of them in the context of the x86_64 architecture, but these concepts (mostly) extend to other major platforms.
Privilege Levels
The two low-order bits of the CS register indicate the current privilege level (CPL). These privilege levels are often called rings, where ring 0 corresponds to the kernel (i.e. highest privilege) and ring 3 corresponds to a userspace process (i.e. lowest privilege). There are other rings, but they're not relevant to this introduction.
Certain instructions in x86_64 may only be executed by privileged processes. The current ring must be 0 to execute a privileged instruction. If you try to execute this instruction without the correct privileges, the processor raises a general protection fault. The kernel synchronously processes this interrupt, and will almost certainly kill the userspace process.
The privilege level can only be raised through kernel-controlled entry points (interrupts and system calls), so a userspace process can't simply switch itself from ring 3 to ring 0.
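As a small demonstration of this safeguard (assuming x86_64 Linux; the output text is made up), executing a privileged instruction such as HLT from ring 3 raises a general protection fault, which the kernel delivers to the process as SIGSEGV:

    /* User-space demo (x86_64 Linux assumed; message text is made up):
     * HLT is a privileged instruction, so executing it in ring 3 raises a
     * general protection fault, which the kernel delivers as SIGSEGV. */
    #include <signal.h>
    #include <unistd.h>
    #include <stdlib.h>

    static void on_fault(int sig)
    {
        static const char msg[] = "privileged instruction refused in ring 3\n";
        (void)sig;
        write(STDOUT_FILENO, msg, sizeof msg - 1);
        _exit(0);
    }

    int main(void)
    {
        signal(SIGSEGV, on_fault);
        asm volatile("hlt");   /* #GP from ring 3 -> SIGSEGV */
        return 1;              /* never reached */
    }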
Execute Permission in Page Tables
All instructions to be executed are stored in memory. Many architectures (including x86_64) use page tables to store mappings from virtual addresses to physical addresses. These page table entries carry several permission bits as well, one of which controls execute permission (on x86_64 it is expressed as a no-execute, NX, bit). If the page containing the instruction being fetched does not allow execution, the processor raises a page fault. As before, the kernel synchronously processes this exception and will likely kill the offending process.
When are these execute permissions set? They can be changed dynamically via mmap(2) and mprotect(2), but in most cases the toolchain emits dedicated code sections (such as .text) in the binaries it generates, and when the OS loads the binary into memory it marks the pages backing those sections as executable in their page table entries.
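A short demonstration of this (assuming x86_64 Linux): a page mapped without PROT_EXEC cannot be executed, but after mprotect(2) adds PROT_EXEC the same byte (a single x86-64 ret) runs fine:

    /* Demo (x86_64 Linux assumed): bytes in a page without PROT_EXEC cannot be
     * executed; after mprotect() sets the execute permission the same byte
     * (a single x86-64 "ret", 0xC3) runs fine. */
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED) { perror("mmap"); return 1; }

        page[0] = 0xC3;                       /* x86-64 "ret" */

        /* Jumping here now would fault (page is non-executable).
         * Flip the execute permission first: */
        if (mprotect(page, 4096, PROT_READ | PROT_EXEC) != 0) {
            perror("mprotect"); return 1;
        }

        void (*fn)(void) = (void (*)(void))page;
        fn();                                  /* returns immediately */
        puts("executed from the page after granting execute permission");
        return 0;
    }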
Who's checking these bits?
You're right to ask about the performance penalty of an OS checking these bits for every single instruction. If the OS were doing this, it would be prohibitively expensive. Instead, the processor supports privilege levels and page tables (with the execute bit). The OS can set these bits, and rely on the processor to generate interrupts when a process acts outside its privileges.
These hardware checks are very fast.
A page, memory page, or virtual page is a fixed-length contiguous block of virtual memory, described by a single entry in the page table.
I want to know: can kernel memory also be pageable?
Yes, e.g. on architectures with an MMU, every virtual address (user space and kernel space) is translated by the MMU. There is an area where the kernel is directly mapped, i.e. the virtual addresses are at a fixed offset from their physical addresses.
When for example a system call needs to access an address in kernel space, the page table of the last process that ran is used. It does not matter which one, since the kernel space is shared between all processes and thus is the same for all.
There is one case where physical addresses are used directly and that is in the boot process before paging is enabled.
As Giacomo Catenazzi correctly mentioned in the comments, these pages are handled differently; e.g. they cannot be swapped out.
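As an illustration of the direct mapping mentioned above, a small kernel-module sketch (x86_64 Linux assumed; the module name is made up) can print the virtual and physical address of a kmalloc'ed buffer; the difference is the constant base of the kernel's direct map:

    /* Kernel-module sketch (x86_64 Linux assumed; module name is made up):
     * prints the virtual and physical address of a kmalloc'ed buffer.  The
     * difference is the constant base of the kernel's direct mapping. */
    #include <linux/module.h>
    #include <linux/slab.h>
    #include <linux/io.h>

    static void *buf;

    static int __init directmap_demo_init(void)
    {
        phys_addr_t phys;

        buf = kmalloc(64, GFP_KERNEL);
        if (!buf)
            return -ENOMEM;

        phys = virt_to_phys(buf);
        pr_info("directmap_demo: virt=%px phys=%pa offset=%#lx\n",
                buf, &phys, (unsigned long)buf - (unsigned long)phys);
        return 0;
    }

    static void __exit directmap_demo_exit(void)
    {
        kfree(buf);
    }

    module_init(directmap_demo_init);
    module_exit(directmap_demo_exit);
    MODULE_LICENSE("GPL");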
In a modern PC, where will
MOV [0x0000], 7
put a 7? Is it the first byte of my RAM, or is it the first byte of the process's address space? Assuming it triggers a memory violation.
You mean assuming it doesn't trigger an access violation? Every process has its own virtual address space. The first 64 KiB are normally kept unmapped, so NULL-pointer accesses actually fault noisily instead of letting programs silently do Bad Things.
In a user-space process on a typical OS, an absolute address of 0 does refer to the first byte of your process's virtual address space.
With paging enabled, there's no way even for the kernel to use physical addresses directly. To write to a given physical address, the kernel would have to create a page table entry mapping that physical page to a virtual page (or find an existing mapping), execute invlpg to make sure the TLB isn't caching a stale entry, and then use that virtual address.
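A tiny user-space demonstration of the point above (assuming Linux/x86-64): the write to address 0 never reaches byte 0 of physical RAM; it faults because that virtual page is not mapped in the process's address space:

    /* Demo (Linux/x86-64 assumed): the write to address 0 never reaches byte 0
     * of physical RAM; the virtual page is unmapped, so the process gets SIGSEGV. */
    #include <signal.h>
    #include <unistd.h>
    #include <stdlib.h>

    static void on_fault(int sig)
    {
        static const char msg[] = "write to address 0 faulted (page not mapped)\n";
        (void)sig;
        write(STDOUT_FILENO, msg, sizeof msg - 1);
        _exit(0);
    }

    int main(void)
    {
        signal(SIGSEGV, on_fault);
        *(volatile char *)0 = 7;   /* the MOV [0x0000], 7 from the question */
        return 1;                  /* never reached */
    }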
It depends on the system architecture. Every architecture provides an instruction set and a memory layout. Furthermore, it depends on the operating system you use; e.g. real-time operating systems often do not provide virtual memory.
greets
Consider the following CPU instruction which takes the memory at address 16777386 (decimal) and stores it in Register 1:
Move &0x010000AA, R1
Traditionally, programs are translated to machine code at compile time (let's ignore more complex modern setups like JIT compilation).
However, if this address assignment is done statically at compile time, how does the OS ensure that two processes do not use the same memory (e.g. if you ran the same compiled program twice concurrently)?
Question:
How, and when, does a program get its memory addresses assigned?
Virtual Memory:
I understand most (if not all) modern systems use a Memory Management Unit in hardware to allow for the use of virtual memory, with the high-order bits of a virtual address selecting the page. This would allow for memory protection if each process used different pages. However, if this is how memory protection is enforced, the original question still persists, only this time as: how are page numbers assigned?
EDIT:
CPU:
One possibility is that the CPU could handle memory protection by requiring a process ID to be assigned by the OS before executing memory-based instructions. However, this is only speculation; it would require hardware support in the CPU architecture, something I'm not sure RISC ISAs would be designed to provide.
With virtual memory, each process has a separate address space, so 0x010000AA in one process will refer to a different value than in another process.
Address spaces are implemented with kernel-controlled page tables that the processor uses to translate virtual page addresses to physical ones. Two processes using the same virtual page number is not an issue, since the processes have separate page tables and the physical memory mapped behind them can be different.
Usually executable code and global variables will be mapped statically, the stack will be mapped at a random address (some exploits are more difficult that way), and dynamic allocation routines will use syscalls to map more pages.
(Ignoring the Unix fork.) The initial state of a process's memory is set up by the executable loader. The linker defines the initial memory state and the loader creates it. That state usually includes memory for static data, executable code, writable data, and the stack.
In most systems a process can modify the address space by adding pages (possibly removing them as well).
[Ignoring system addresses] In virtual (logical) memory systems each process has an address space starting at zero (usually the first page is not mapped). The address space is divided into pages. The operating system maps (and remaps) logical pages to physical pages.
Address 0x010000AA in one process then corresponds to a different physical memory address in each process.
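A quick way to see this (a sketch assuming a Unix-like system): after fork(), parent and child print the same virtual address for the same global variable, yet a write in the child is invisible to the parent because the page ends up backed by different physical memory (copy-on-write):

    /* Demo (any Unix-like system): parent and child print the same virtual
     * address for the same global, but the child's write stays invisible to
     * the parent because the page is backed by different physical memory. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    static int value = 0;

    int main(void)
    {
        if (fork() == 0) {               /* child */
            value = 42;
            printf("child : &value=%p value=%d\n", (void *)&value, value);
            return 0;
        }
        wait(NULL);                      /* let the child go first */
        printf("parent: &value=%p value=%d\n", (void *)&value, value);  /* still 0 */
        return 0;
    }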
When working on a NUMA system, memory can be local or remote relative to the current NUMA node.
To make memory more local, there is a "first-touch" policy (the default memory-to-node binding strategy):
http://lse.sourceforge.net/numa/status/description.html
Default Memory Binding
It is important that user programs' memory is allocated on a node close to the one containing the CPU on which they are running. Therefore, by default, page faults are satisfied by memory from the node containing the page-faulting CPU. Because the first CPU to touch the page will be the CPU that faults the page in, this default policy is called "first touch".
http://techpubs.sgi.com/library/dynaweb_docs/0640/SGI_Developer/books/OrOn2_PfTune/sgi_html/ch08.html
The default policy is called first-touch. Under this policy, the process that first touches (that is, writes to, or reads from) a page of memory causes that page to be allocated in the node on which the process is running. This policy works well for sequential programs and for many parallel programs as well.
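A small sketch of first-touch in action (assumes libnuma, linking with -lnuma, and a machine with more than one node): pin the thread to node 0, first-touch a freshly mapped page, then ask the kernel where the page landed via move_pages(2) with a NULL nodes argument:

    /* Sketch (assumes libnuma, link with -lnuma, and more than one NUMA node):
     * pin the thread to node 0, first-touch a fresh anonymous page, then query
     * its location with move_pages(2) (nodes == NULL means "only report"). */
    #include <numa.h>
    #include <numaif.h>
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support here\n");
            return 1;
        }

        numa_run_on_node(0);                       /* fault pages in from node 0 */

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 1;                                  /* first touch allocates the page */

        void *pages[1]  = { p };
        int   status[1] = { -1 };
        move_pages(0, 1, pages, NULL, status, 0);  /* query only, don't migrate */
        printf("page resides on node %d\n", status[0]);
        return 0;
    }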
There are also other, non-local policies, and there is a function to explicitly move a memory segment to a given NUMA node.
But sometimes (in the context of a single application with many threads) it can be useful to have a "next-touch" policy: call some function to "unbind" a memory region (up to hundreds of MB) holding some data and re-arm a "first-touch"-like handler on that region, so that the next touch (read or write) migrates each page to the NUMA node of the accessing thread.
This policy is useful when there is a huge amount of data to be processed by many threads and the access pattern changes between phases (e.g. in the first phase the 2D array is split by columns across threads; in the second, the same data is split by rows).
Such a policy has been supported in Solaris since version 9 via madvise with the MADV_ACCESS_LWP flag:
https://cims.nyu.edu/cgi-systems/man.cgi?section=3C&topic=madvise
    MADV_ACCESS_LWP: Tell the kernel that the next LWP to touch the specified address range will access it most heavily, so the kernel should try to allocate the memory and other resources for this range and the LWP accordingly.
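For reference, applying that advice on Solaris is a single call; here is a hedged sketch (base and len are placeholders, and this compiles only on Solaris, where MADV_ACCESS_LWP exists):

    /* Solaris-only sketch: re-arm "next-touch" placement for an existing region.
     * base and len are placeholders for whatever region should migrate on the
     * next access; MADV_ACCESS_LWP is Solaris-specific. */
    #include <sys/types.h>
    #include <sys/mman.h>

    void rearm_next_touch(caddr_t base, size_t len)
    {
        /* The next LWP (thread) to touch the range is treated as its main user,
         * so the kernel will try to place the memory near that LWP. */
        madvise(base, len, MADV_ACCESS_LWP);
    }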
There was (May 2009) a patch to the Linux kernel named "affinity-on-next-touch", http://lwn.net/Articles/332754/ (thread), but as I understand it, it was not accepted into mainline, was it?
Also there were Lee Schermerhorn's "migrate_on_fault" patches http://free.linux.hp.com/~lts/Patches/PageMigration/.
So, the question: is there any next-touch support for NUMA in the current vanilla Linux kernel, or in some major fork such as the Red Hat or Oracle Linux kernels?
To my understanding, there isn't anything similar in the vanilla kernel. numactl has functions to migrate pages manually, but that's probably not helpful in your case. (The NUMA policy description is in Documentation/vm/numa_memory_policy if you want to check for yourself.)
I think those patches were not merged, as I don't see any of the relevant code showing up in the current kernel.