The Intel Architecture manual says that on the first write access to a memory page, the CPU sets the dirty bit of the page-table entry. I have some questions about this.
1. The 'dirty bit' in this context is used to guarantee the correctness of swapping memory pages out to and in from disk. Is this correct?
2. Is this performed automatically by the hardware, or is it implemented by the operating system?
3. If it is performed automatically by the hardware, is there any noteworthy difference compared to ordinary memory updates performed by software instructions?
Thank you in advance.
1 The 'dirty bit' in this context is used to guarantee the correctness of swapping memory pages out to and in from disk. Is this correct?
This is the hardware part of paging support. The bit gives the OS a fast and efficient way to determine which pages actually have to be written back to disk: if a page is about to be paged out and space for it is already allocated in the page file, the OS can skip writing the page to disk when this flag is clear, because the copy on disk is still up to date. This is just one example of how an OS can use the flag in its paging implementation.
2 Is this performed automatically by the hardware, or is it implemented by the operating system?
Software clears this flag. Hardware sets this flag:
3.7.6 Page-Directory and Page-Table Entries
Dirty (D) flag, bit 6
Indicates whether a page has been written to when set. (This flag is
not used in page-directory entries that point to page tables.) Memory
management software typically clears this flag when a page is
initially loaded into physical memory. The processor then sets this
flag the first time a page is accessed for a write operation.
3 If it is performed automatically by the hardware, is there any noteworthy difference compared to ordinary memory updates performed by software instructions?
These updates are atomic and have LOCK semantics: the processor sets the flag with a locked read-modify-write of the page-table entry.
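As an illustration, here is a minimal sketch of how paging code might test and clear the flag, assuming the 64-bit PTE layout from the SDM excerpt above (the helper names pte_is_dirty and pte_clear_dirty are made up for this example):

#include <stdint.h>
#include <stdbool.h>

/* Bit 6 of a page-table entry is the Dirty flag (see the SDM excerpt above). */
#define PTE_DIRTY (1ULL << 6)

/* Hypothetical helper a pager might use: has the page been written to
   since the flag was last cleared? */
static bool pte_is_dirty(uint64_t pte) {
    return (pte & PTE_DIRTY) != 0;
}

/* Clear the flag after writing the page out; the CPU will set it again
   on the next write.  A real kernel must also invalidate the TLB entry,
   or the CPU may keep using a stale cached copy of the PTE. */
static uint64_t pte_clear_dirty(uint64_t pte) {
    return pte & ~PTE_DIRTY;
}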
I know that Operating systems usually keep Page Tables to map a chunk of virtual memory to a chunk of physical memory.
My question is, does the CPU load the whole chunk when it's loading a given byte?
Let's say I have:
ld %r0, 0x4(%r1)
Assuming my page size is 4 KB, does the CPU load all 4 KB at once, or does it manage to load only a byte given the offset?
Is the page size mandated by the hardware or configurable by software and the OS?
Edit:
I figured out that the page size is mandated by hardware:
The available page sizes depend on the instruction set architecture, processor type, and operating (addressing) mode. The operating system selects one or more sizes from the sizes supported by the architecture
Your question touches many levels of CPU/memory architecture, so without knowing the exact CPU architecture, memory architecture and version you have in mind: although the instruction targets only one byte, the access will make the memory controller locate the containing physical page and bring the relevant cache line (at least; prefetching might kick in and fetch more) into the second/first-level cache. Your data is transferred after the cache line has been filled.
On a typical modern CPU, yes, it loads the whole page.
It couldn't really work any other way, since there are only two states in the page tables for a given page: present and not present. If the page is present, it must be mapped to some page in physical memory. If not present, every access to that page produces a page fault. There is no "partially present" state.
In order for it to be safe for the OS to mark the page present, it has to load the entire page into physical memory and update the page tables to point the virtual page to the physical page. If it only loaded a single byte or a smaller amount, the application might later try to access some other byte on the same page that hadn't been loaded, and it'd read garbage. There's no way for the CPU to generate another page fault in that case to let the OS fix things up, unless the page were marked not present, in which case the original access wouldn't be able to complete either.
The page size is fixed in hardware, though some architectures offer a few different choices that the OS can select from. For instance, recent x86-64 CPUs allow pages to be either 4 KB, 2 MB or 1 GB. The OS can mix-and-match these at runtime; there are bits in the page tables to indicate the size of each page.
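For example, on a POSIX system you can ask at runtime which base page size the OS selected from the sizes the hardware supports (a small sketch; on most x86-64 systems it prints 4096):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* The OS reports the (base) page size it chose from the sizes
       supported by the hardware. */
    long page_size = sysconf(_SC_PAGESIZE);
    printf("page size: %ld bytes\n", page_size);
    return 0;
}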
In a typical memory layout there are 4 items:
code/text (where the compiled code of the program itself resides)
data
stack
heap
I am new to memory layouts, so I am wondering whether V8, which is a JIT compiler and dynamically generates code, stores this code in the "code" segment of memory, or just stores it in the heap along with everything else. I'm not sure whether the operating system gives you access to the code/text segment, so I'm not sure if this is a dumb question.
The below is true for the major operating systems running on the major CPUs in common use today. Things will differ on old or some embedded operating systems (in particular things are a lot simpler on operating systems without virtual memory) or when running code without an OS or on CPUs with no support for memory protection.
The picture in your question is a bit of a simplification. One thing it does not show is that (virtual) memory is made up of pages provided to you by the operating system. Each page has its own permissions controlling whether your process can read, write and/or execute the data in that page.
The text section of a binary will be loaded onto pages that are executable, but not writable. The read-only data section will be loaded onto pages that are neither writable nor executable. All other memory in your picture ((un)initialized data, heap, stack) will be stored on pages that are writable, but not executable.
These permissions prevent security flaws (such as buffer overruns) that could otherwise allow attackers to execute arbitrary code by making the program jump into code provided by the attacker or letting the attacker overwrite code in the text section.
Now the problem with these permissions, with regard to JIT compilation, is that you can't execute your JIT-compiled code: if you store it on the stack or the heap (or within a global variable), it won't be on an executable page, so the program will crash when you try to jump into the code. If you try to store it in the text area (by making use of left-over memory on the last page or by overwriting parts of the JIT compiler's code), the program will crash because you're trying to write to read-only memory.
But thankfully operating systems allow you to change the permissions of a page (on POSIX systems this can be done using mprotect and on Windows using VirtualProtect). So your first idea might be to store the generated code on the heap and then simply make the containing pages executable. However, this can be somewhat problematic: VirtualProtect and some implementations of mprotect require a pointer to the beginning of a page, but your array does not necessarily start at the beginning of a page if you allocated it using malloc (or new or your language's equivalent). Furthermore, your array may share a page with other data, which you don't want to be executable.
To prevent these issues, you can use functions, such as mmap on Unix-like operating systems and VirtualAlloc on Windows, that give you pages of memory "to yourself". These functions will allocate enough pages to contain as much memory as you requested and return a pointer to the beginning of that memory (which will be at the beginning of the first page). These pages will not be available to malloc. That is, even if your array is significantly smaller than the size of a page on your OS, the page will only be used to store your array - a subsequent call to malloc will not return a pointer to memory in that page.
So the way that most JIT-compilers work is that they allocate read-write memory using mmap or VirtualAlloc, copy the generated machine instructions into that memory, use mprotect or VirtualProtect to make the memory executable and non-writable (for security reasons you never want memory to be executable and writable at the same time if you can avoid it) and then jump into it. In terms of its (virtual) address, the memory will be part of the heap's area of the memory, but it will be separate from the heap in the sense that it won't be managed by malloc and free.
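To make that flow concrete, here is a minimal sketch for a Unix-like x86-64 system: allocate a read-write region with mmap, copy in a few hand-assembled bytes (a tiny function that just returns 42), flip the pages to read+execute with mprotect, and call it. The byte sequence and the function-pointer cast are illustrative; the cast is a common but formally non-standard idiom.

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* Hand-assembled x86-64 machine code for: mov eax, 42; ret */
    static const uint8_t code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    /* Ask the OS for a fresh, private, anonymous read-write region
       (rounded up to whole pages, not shared with malloc). */
    void *mem = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return 1;

    /* "JIT-compile": copy the generated instructions into the region. */
    memcpy(mem, code, sizeof code);

    /* Flip the pages to read+execute so the code can run but can no
       longer be modified. */
    if (mprotect(mem, sizeof code, PROT_READ | PROT_EXEC) != 0) return 1;

    /* Jump into the generated code. */
    int (*fn)(void) = (int (*)(void))mem;
    printf("%d\n", fn());   /* prints 42 */

    munmap(mem, sizeof code);
    return 0;
}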
Heap and stack are the memory regions where programs can allocate at runtime. This is not specific to V8, or JIT compilers. For more detail, I humbly suggest that you read whatever book that illustration came from ;-)
I was wondering about this because it's a potential security hole if process A can malloc 50 megs of data that is not zero'd out and that chunk of memory turns out to include what had been physical pages from process B and still contain process B's data.
Is malloc'd data zeroed in Objective-C?
Mostly Yes. There's a zero-page writer that is part of the memory manager which provides a process with zero'd pages. The memory manager will call memory_object_data_unavailable to tell the kernel to supply zero-filled memory for the region.
If the process calls free and then mallocs again, the page is not re-zero'd. Zeroization only occurs when a new page is demanded. In fact, the page is probably not returned to the system upon free. The process retains the page for its own use due to the runtime. Related, see Will malloc implementations return free-ed memory back to the system?
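A small sketch of that point in plain C (which applies equally to Objective-C): memory recycled by the allocator within the process is generally not re-zero'd, so the old contents may still be visible. The exact behaviour depends on the malloc implementation, so treat this as an illustration rather than a guarantee.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    char *a = malloc(64);
    if (!a) return 1;
    strcpy(a, "secret");          /* put recognizable data in the block */
    free(a);                      /* the page stays with the process */

    char *b = malloc(64);         /* often hands back the same block */
    if (!b) return 1;
    /* Not guaranteed, but with many allocators the old bytes are still
       here, because recycled blocks are not re-zeroed. */
    printf("reused block starts with: %.6s\n", b);

    free(b);
    return 0;
}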
If a page is returned to the system under a low-memory condition, the page will be re-zero'd even if the process formerly held the page. The memory manager does not keep track of the last owner of a page; it just assumes a new page needs to be zero'd to avoid an information leak across processes.
Note: Microsoft calls it the zero-page writer. Darwin has the same component, but I don't recall seeing it named. Also see Mac OS X Internals: A Systems Approach by Singh. It's a bit dated, but it provides a lot of system information. Chapter 8, Memory, is the chapter of interest.
Singh's book goes into other details, like cases where a page is demanded but does not need to be zeroized. In this case, there was some shared data among processes, and a new page was allocated to the process under a Copy-on-Write (COW) scheme. Effectively, the new page was populated from existing data rather than zeros. The function of interest is memory_object_data_request.
Linux has an interesting discussion of the zero page at Some ado about zero. It's interesting reading about a topic that seems mundane on the surface.
When the CPU is executing a program, does it move all data through the memory pipeline? Then any piece of data would be moved from RAM -> cache -> registers, so all the data the program touches would go through the CPU registers at some point. Or does it somehow select what it puts in those faster memory types, or can you, as a programmer, select specific code you want to keep in, for example, the cache for optimization?
The answer to this question is an entire course in itself! A very brief summary of what (usually) happens is that:
You, the programmer, specify what goes in RAM. Well, the compiler does it on your behalf, but you're in control of this by how you declare your variables.
Whenever your code accesses a variable, the CPU's cache hardware will check whether the value is in the cache and, if it is not, fetch the 'line' that contains the variable from RAM into the cache. Some CPU instruction sets may let you influence this behaviour for specific low-frequency operations, but it requires very low-level code to do so (see the prefetch sketch at the end of this answer). When you update a value, the modified cache line will eventually be written back ('flushed') to RAM. Again, you can affect how and when this happens with low-level code, and it also depends on the cache configuration, such as whether the cache is write-through, etc.
If you are going to do any kind of operation on the value that will require it being used by an ALU (Arithmetic Logic Unit) or similar, then it will be loaded into an appropriate register from the cache. Which register will depend on the instruction the compiler generated.
Some systems support Direct Memory Access (DMA), which provides a shortcut for operations that do not really require the CPU to be involved. These include memory-to-memory copies and the transfer of data between memory and memory-mapped peripheral control blocks (such as UARTs and other I/O blocks). These will cause data to be moved, read or written in RAM without actually involving the CPU core at all.
At a higher level, some operating systems that support multiple processes will save the RAM allocated to the current process to the hard disk when the process is swapped out, and load it back in again from the disk when the process runs again. (This is why you may find 'Page Files' on your C: drive and the options to limit their size.) This allows all of the running processes to utilise most of the available RAM, even though they can't actually share it all simultaneously. Paging is yet another subject worthy of a course on its own. (Thanks to Leeor for mentioning this.)
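Coming back to the cache point above: as an example of the kind of low-level influence it mentions, GCC and Clang expose a __builtin_prefetch hint that asks the hardware to start pulling a cache line in ahead of time. It is only a hint, the CPU is free to ignore it, and the function below is just a sketch of how it is typically used:

#include <stddef.h>

/* Sum an array while asking the hardware to start fetching data a few
   iterations ahead.  __builtin_prefetch is a GCC/Clang extension. */
long sum_with_prefetch(const long *data, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&data[i + 16], /* rw = */ 0, /* locality = */ 1);
        total += data[i];
    }
    return total;
}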
Linux's synchronization primitives (spinlocks, mutexes, RCU) use memory barrier instructions to keep memory access instructions from getting reordered. This reordering can be done either by the CPU itself or by the compiler.
Can someone show some examples of GCC-produced code where such reordering is done? I am mainly interested in x86. The reason I am asking is to understand how GCC decides which instructions can be reordered. Different x86 microarchitectures (for example, Sandy Bridge vs. Ivy Bridge) use different cache architectures, so I am wondering how GCC does effective reordering that helps execution performance irrespective of the cache architecture. Some example C code and the reordered GCC-generated code would be very useful. Thanks!
The reordering that GCC may do is unrelated to the reordering an (x86) CPU may do.
Let's start off with compiler reordering. The C language rules are such that GCC is forbidden from reordering volatile loads and stores with respect to each other, or deleting them, when a sequence point occurs between them (thanks to bobc for this clarification). That is to say, in the assembly output those memory accesses will appear, and will be sequenced precisely in the order you specified. Non-volatile accesses, on the other hand, can be reordered with respect to all other accesses, volatile or not, provided that (by the as-if rule) the end result of the calculation is the same.
For instance, a non-volatile load in the C code could be done as many times as the code says, but in a different order (e.g. If the compiler feels it's more convenient to do it earlier or later when more registers are available). It could be done fewer times than the code says (e.g. If a copy of the value happened to still be available in a register in the middle of a large expression). Or it could even be deleted (e.g. if the compiler can prove the uselessness of the load, or if it moved a variable entirely into a register).
To prevent compiler reorderings at other times, you must use a compiler-specific barrier. GCC uses __asm__ __volatile__("":::"memory"); for this purpose.
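For example, in the sketch below GCC would otherwise be free to reorder or combine the two plain stores, since the end result is the same; the empty asm with a "memory" clobber forbids that while emitting no machine instruction. (Whether a compiler-only barrier is enough for inter-thread communication is a separate question, covered by the CPU memory model discussed next.)

int data;
int ready;

void publish(int value) {
    data = value;                         /* plain store #1 */
    /* Compiler-only barrier: GCC may not move memory accesses across
       this point, but it emits no machine instruction. */
    __asm__ __volatile__("" ::: "memory");
    ready = 1;                            /* plain store #2 */
}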
This is different from CPU reordering, a.k.a. the memory-ordering model. Ancient CPUs executed instructions precisely in the order they appeared in the program; this is called program ordering, or the strong memory-ordering model. Modern CPUs, however, sometimes resort to "cheats" to run faster, by weakening the memory model a little.
The way x86 CPUs weaken the memory model is documented in Intel's Software Developer Manuals, Volume 3, Chapter 8, Section 8.2.2 "Memory Ordering in P6 and More Recent Processor Families". This is, in part, what it reads:
Reads are not reordered with other reads.
Writes are not reordered with older reads.
Writes to memory are not reordered with other writes, with [some] exceptions.
Reads may be reordered with older writes to different locations but not with older writes to the same location.
Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
Reads cannot pass earlier LFENCE and MFENCE instructions.
Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
LFENCE instructions cannot pass earlier reads.
SFENCE instructions cannot pass earlier writes.
MFENCE instructions cannot pass earlier reads or writes.
It also gives very good examples of what can and cannot be reordered, in Section 8.2.3 "Examples Illustrating the Memory-Ordering Principles".
As you can see, one uses FENCE instructions to prevent an x86 CPU from reordering memory accesses inappropriately.
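As a sketch of the one reordering x86 does allow (a read passing an older write to a different location), here is a Dekker-style handshake where a full fence is needed between the store and the load. With GCC, atomic_thread_fence(memory_order_seq_cst) compiles to MFENCE (or a locked instruction) on x86; the function names are made up for the example.

#include <stdatomic.h>

atomic_int flag0, flag1;

/* Each side raises its own flag, then checks the other's.  Without the
   fence, x86 may reorder the older store after the younger load to the
   other location, so both sides could read 0 and both would "enter". */
int side0_may_enter(void) {
    atomic_store_explicit(&flag0, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* full barrier */
    return atomic_load_explicit(&flag1, memory_order_relaxed) == 0;
}

int side1_may_enter(void) {
    atomic_store_explicit(&flag1, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(&flag0, memory_order_relaxed) == 0;
}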
Lastly, you may be interested in this link, which goes into further detail and comes with the assembly examples you crave.
The C language rules are such that GCC is forbidden from reordering volatile loads and store memory accesses with respect to each other, or deleting them.
That is not true, and is quite misleading. The C spec does not make such a guarantee.
See When is a Volatile Object Accessed?
The standard encourages compilers to refrain from optimizations concerning accesses to volatile objects, but leaves it implementation defined as to what constitutes a volatile access. The minimum requirement is that at a sequence point all previous accesses to volatile objects have stabilized and no subsequent accesses have occurred. Thus an implementation is free to reorder and combine volatile accesses that occur between sequence points, but cannot do so for accesses across a sequence point.
Traditionally, programmers have relied on volatile as a cheap synchronization method but this is no longer a reliable method.
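As a sketch of the modern alternative: volatile only stops the compiler from caching the flag in a register, but it neither orders the surrounding accesses nor inserts CPU fences, whereas C11 atomics do both. The names below are made up for the example.

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool done;

/* Producer: everything written before this release store is visible to
   a consumer that observes done == true with an acquire load. */
void mark_finished(void) {
    atomic_store_explicit(&done, true, memory_order_release);
}

/* Consumer: pairs with the release store above. */
bool is_finished(void) {
    return atomic_load_explicit(&done, memory_order_acquire);
}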