Does an Operating System check every Instruction? - memory

Not sure if anyone here can answer this.
I've learned that an Operating System checks if an instruction of a program changes something outside of its allocated memory, and if it does then the OS won't allow the program to do this.
But if the OS has to check this for every instruction, won't that take up at least 5/6 of the CPU? I tried to estimate it, and that is roughly the fraction of clock cycles I came up with for checking every instruction.
If I've understood something wrong, please correct me, because I can't imagine that an OS takes up that much of the CPU.

There are several safeguards in place to ensure a non-privileged process behaves. I will discuss two of them in the context of the x86_64 architecture, but these concepts (mostly) extend to other major platforms.
Privilege Levels
There is a field in a particular CPU register (on x86_64, the low two bits of the CS segment selector) that indicates the current privilege level. These privileges are often called rings, where ring 0 corresponds to the kernel (i.e. highest privilege) and ring 3 corresponds to a userspace process (i.e. lowest privilege). There are other rings, but they're not relevant to this introduction.
Certain instructions in x86_64 may only be executed by privileged code: the current ring must be 0. If you try to execute such an instruction without the correct privileges, the processor raises a general protection fault. The kernel synchronously processes this fault, and will almost certainly kill the userspace process.
The ring level can only be changed while in ring 0, so the userspace process can't simply change from ring 3 to ring 0 by itself.
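To make this concrete, here is a minimal sketch of my own (not from the original answer) for Linux/x86_64: hlt is a ring-0-only instruction, so executing it from ring 3 raises a general protection fault and the kernel kills the process with SIGSEGV.

#include <stdio.h>

/* Attempt a privileged (ring-0-only) instruction from ring 3.
   The CPU raises a general protection fault, the kernel handles it,
   and the process dies with SIGSEGV -- the OS never had to inspect
   the instruction stream itself. */
int main(void) {
    printf("about to execute hlt from userspace...\n");
    __asm__ volatile ("hlt");   /* #GP -> kernel delivers SIGSEGV */
    printf("never reached\n");
    return 0;
}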
Execute Permission in Page Tables
All instructions to be executed are stored in memory. Many architectures (including x86_64) use page tables to store mappings from virtual addresses to physical addresses. These page tables carry several bookkeeping entries as well, one of which is an execute permission bit. If this bit is not set for the page containing the instruction being fetched, the processor raises a page fault. As before, the kernel will synchronously process this fault, and likely kill the offending process.
When are these execute bits set? They can be set dynamically via mmap(2), but in most cases the compiler emits dedicated code sections (such as .text) in the binaries it generates, and when the OS loads the binary into memory it sets the execute bit in the page table entries for the pages that correspond to those sections.
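As a rough illustration of the mmap(2) path (a sketch, assuming Linux on x86_64): map a writable page, copy machine code into it, then use mprotect(2) to flip it to read+execute before calling it. Jumping to the page while the execute bit is missing would fault instead.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void) {
    /* x86_64 machine code for: mov eax, 42; ret */
    unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    /* Writable (but not executable) anonymous page. */
    void *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) { perror("mmap"); return 1; }
    memcpy(page, code, sizeof code);

    /* Set the execute permission; without this, calling into the
       page would raise a fault and the process would be killed. */
    if (mprotect(page, 4096, PROT_READ | PROT_EXEC) != 0) {
        perror("mprotect"); return 1;
    }

    int (*fn)(void) = (int (*)(void))page;
    printf("%d\n", fn());   /* prints 42 */
    return 0;
}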
Who's checking these bits?
You're right to ask about the performance penalty of an OS checking these bits for every single instruction. If the OS were doing this in software, it would be prohibitively expensive. Instead, the processor itself supports privilege levels and page tables (with the execute bit). The OS sets these bits up and relies on the processor to generate a fault whenever a process acts outside its privileges.
These hardware checks happen as part of the MMU's normal address translation, so they add essentially no overhead.

When do memory addresses get assigned?

Consider the following CPU instruction, which loads the value at memory address 16777386 (decimal) into register 1:
Move &0x010000AA, R1
Traditionally, programs are translated to machine code at compile time (let's ignore more complex modern systems like JIT compilation).
However, if this address allocation is completed statically at compile time, how does the OS ensure that two processes do not use the same memory (e.g. if you ran the same compiled program twice concurrently)?
Question:
How, and when, does a program get its memory addresses assigned?
Virtual Memory:
I understand most (if not all) modern systems use memory management units in hardware to provide virtual memory, with the upper bits of an address selecting a page. This would allow for memory protection if each process used different pages. However, if this is how memory protection is enforced, the original question still persists, only this time: how are page numbers assigned?
EDIT:
CPU:
One possibility is that the CPU handles memory protection by requiring that a process ID be assigned by the OS before memory-based instructions execute. However, this is only speculation, and it requires hardware support from the CPU architecture, something I'm not sure RISC ISAs would be designed to provide.
With virtual memory, each process has a separate address space, so 0x010000AA in one process will refer to a different value than in another process.
Address spaces are implemented with kernel-controlled page tables that the processor uses to translate virtual page addresses to physical ones. Two processes using the same page number is not an issue, since the processes have separate page tables and the physical memory mapped behind them can differ.
Usually executable code and global variables are mapped statically, the stack is mapped at a random address (some exploits are more difficult that way), and dynamic allocation routines use syscalls to map more pages.
(Ignoring the Unix fork.) The initial state of a process's memory is set up by the executable loader: the linker defines the initial memory state and the loader creates it. That state usually includes memory for static data, executable code, writable data, and the stack.
In most systems a process can modify the address space by adding pages (possibly removing them as well).
[Ignoring system addresses] In virtual (logical) memory systems, each process has an address space starting at zero (usually the first page is not mapped). The address space is divided into pages, and the operating system maps (and remaps) logical pages to physical pages.
Address 0x010000AA therefore refers to a different physical memory address in each process.
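If you want to watch the virtual-to-physical mapping directly on Linux, /proc/self/pagemap exposes it. This is a sketch of my own, not from the answers above; note that modern kernels report the frame number as zero unless you run as root. Running the same binary twice shows the same virtual address backed by different physical frames.

#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* /proc/self/pagemap holds one 64-bit entry per virtual page;
   bits 0-54 contain the physical page frame number (PFN). */
int main(void) {
    static int probe = 42;                  /* lives in a mapped page */
    long page_size = sysconf(_SC_PAGESIZE);
    uintptr_t vaddr = (uintptr_t)&probe;

    int fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t entry;
    off_t offset = (off_t)(vaddr / page_size) * sizeof entry;
    if (pread(fd, &entry, sizeof entry, offset) != (ssize_t)sizeof entry) {
        perror("pread"); return 1;
    }
    close(fd);

    uint64_t pfn = entry & ((1ULL << 55) - 1);
    printf("virtual 0x%" PRIxPTR " -> physical frame 0x%" PRIx64 "\n",
           vaddr, pfn);
    return 0;
}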

MPI/Pthread program does not scale

I have an MPI/Pthread program in which each MPI process runs on a separate compute node. Within each MPI process, a certain number of Pthreads (1-8) are launched. However, no matter how many Pthreads are launched within an MPI process, the overall performance is pretty much the same. I suspect all the Pthreads are running on the same CPU core. How can I assign threads to different CPU cores?
Each compute node has 8 cores (two quad-core Nehalem processors).
Open MPI 1.4
Linux x86_64
Questions like this often depend on the problem at hand. Most likely, you are running into a lock contention issue (the threads are competing for a lock) -- this would look like only one core doing any work, because only one thread can (effectively) make progress at any given time.
Setting CPU affinity for a particular thread is not a good solution. You should let the OS scheduler determine the optimal physical core assignment for a given pthread.
Look at your code and figure out where you are locking when you shouldn't be, or whether you've actually arrived at a correct parallel solution to the problem at hand. You should also test a version of the program using only pthreads (not MPI) and see whether scaling is achieved.
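That said, if you do decide to pin threads despite the advice above, glibc on Linux exposes pthread_setaffinity_np for exactly this. A minimal sketch (the worker body and the thread count of 8 are placeholders for your MPI process's pthreads):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single core. Non-portable (note the
   _np suffix); Linux/glibc only. */
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
}

static void *worker(void *arg) {
    int core = *(int *)arg;
    if (pin_to_core(core) != 0)
        fprintf(stderr, "failed to pin to core %d\n", core);
    /* ... real work goes here ... */
    return NULL;
}

int main(void) {
    enum { NTHREADS = 8 };              /* one per Nehalem core */
    pthread_t tids[NTHREADS];
    int cores[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        cores[i] = i;
        pthread_create(&tids[i], NULL, worker, &cores[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}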

Is there NUMA next-touch policy in modern Linux

When working on a NUMA system, memory can be local or remote relative to the current NUMA node.
To make memory more local there is a "first-touch" policy (the default memory-to-node binding strategy):
http://lse.sourceforge.net/numa/status/description.html
Default Memory Binding
It is important that user programs' memory is allocated on a node close to the one containing the CPU on which they are running. Therefore, by default, page faults are satisfied by memory from the node containing the page-faulting CPU. Because the first CPU to touch the page will be the CPU that faults the page in, this default policy is called "first touch".
http://techpubs.sgi.com/library/dynaweb_docs/0640/SGI_Developer/books/OrOn2_PfTune/sgi_html/ch08.html
The default policy is called first-touch. Under this policy, the process that first touches (that is, writes to, or reads from) a page of memory causes that page to be allocated in the node on which the process is running. This policy works well for sequential programs and for many parallel programs as well.
There are also some other, non-local policies, and there is a function to request explicit migration of a memory segment to some NUMA node.
But sometimes (in the context of the many threads of a single application) it can be useful to have a "next-touch" policy: call some function to "unbind" a memory region (up to hundreds of MB) holding some data and reapply the "first touch"-like handler to this region, which will migrate each page on the next touch (read or write) to the NUMA node of the accessing thread.
This policy is useful when there is a huge amount of data to be processed by many threads and the access patterns differ between phases (e.g. first phase: split a 2D array by columns across threads; second phase: split the same data by rows).
Such a policy has been supported in Solaris since version 9 via madvise with the MADV_ACCESS_LWP flag:
https://cims.nyu.edu/cgi-systems/man.cgi?section=3C&topic=madvise
MADV_ACCESS_LWP  Tell the kernel that the next LWP to touch the specified address range will access it most heavily, so the kernel should try to allocate the memory and other resources for this range and the LWP accordingly.
There was a patch (May 2009) to the Linux kernel named "affinity-on-next-touch", http://lwn.net/Articles/332754/ (thread), but as I understand it, it was not accepted into mainline, was it?
There were also Lee Schermerhorn's "migrate_on_fault" patches: http://free.linux.hp.com/~lts/Patches/PageMigration/.
So, the question: is there some next-touch policy for NUMA in the current vanilla Linux kernel, or in some major fork like the Red Hat or Oracle Linux kernels?
As far as I understand, there is nothing similar in the vanilla kernel. numactl has functions to migrate pages manually, but that's probably not helpful in your case. (The NUMA policy description is in Documentation/vm/numa_memory_policy if you want to check for yourself.)
I think those patches were not merged, as I don't see any of the relevant code showing up in the current kernel.
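For reference, the manual migration mentioned above can be done from C with move_pages(2) via libnuma. A sketch, under the assumption that node 1 exists on your machine (compile with -lnuma); it migrates eagerly, which is exactly the work a next-touch policy would defer:

#include <numaif.h>   /* move_pages, MPOL_MF_MOVE; link with -lnuma */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    long page_size = sysconf(_SC_PAGESIZE);
    void *buf;
    if (posix_memalign(&buf, page_size, page_size) != 0) return 1;
    *(volatile char *)buf = 1;     /* fault the page in (first touch) */

    void *pages[1] = { buf };
    int nodes[1]   = { 1 };        /* target NUMA node (assumed) */
    int status[1];

    /* pid 0 means the calling process. */
    if (move_pages(0, 1, pages, nodes, status, MPOL_MF_MOVE) != 0) {
        perror("move_pages");
        return 1;
    }
    printf("page now on node %d\n", status[0]);
    return 0;
}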

Get the number of cores in Erlang with Linux

I am writing a concurrent program and I need to know the number of cores in the system, so that the program will know how many processes to open.
Is there a command to get this inside Erlang code?
Thanks.
You can use
erlang:system_info(logical_processors_available)
to get the number of cores that can be used by the erlang runtime system.
There is also:
erlang:system_info(schedulers_online)
which tells you how many scheduler threads are actually running.
To get the number of available cores, use the logical_processors flag to erlang:system_info/1:
1> erlang:system_info(logical_processors).
8
There are two companion flags to this one: logical_processors_online shows how many are in use, and logical_processors_available shows how many are available (it will return unknown when all available logical processors are online).
To know how to parallelize your code, you should rely on schedulers_online which will return the number of actual Erlang schedulers that are available in your current VM instance:
1> erlang:system_info(schedulers_online).
8
Note however that parallelizing on this value alone might not be enough. Sometimes you have other processes running that need some CPU time and sometimes your algorithm would benefit from even more parallelism (waiting on IO for example). A rule of thumb is to use the value obtained from schedulers_online as a multiplier for parallelism, but always test with different multiples to see what works best for your application.
How this information is exposed will be very operating system specific (unless you happen to be writing an operating system of course).
You didn't say which operating system you're working on. In the case of Linux, you can get the data from /proc/cpuinfo; however, there are subtleties around hyperthreading and around multiple cores on the same die sharing an L2 cache (effectively you've got a NUMA architecture).
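If you'd rather not parse /proc/cpuinfo by hand, a small C sketch using sysconf(3) gets the same counts on Linux (roughly what Erlang's logical_processors flags report):

#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Processors configured on the system vs. currently online. */
    long configured = sysconf(_SC_NPROCESSORS_CONF);
    long online     = sysconf(_SC_NPROCESSORS_ONLN);
    printf("configured: %ld, online: %ld\n", configured, online);
    return 0;
}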

Windows Mobile memory corruption

Does the WM operating system protect processes' memory from one another?
Can one badly written application crash some other application just by mistakenly writing over the first one's memory?
Windows Mobile, at least in all current incarnations, is built on Windows CE 5.0 and therefore uses CE 5.0's memory model (which is the same as it was in CE 3.0). The OS doesn't actually do a lot to protect process memory, but it does enough to generally keep processes from interfering with one another. It's not hard and fast, though.
CE processes run in "slots", of which there are 32. The currently running process gets swapped into slot zero, and its addresses are re-based to zero (so all memory in the running process effectively has two addresses: the slot 0 address and its non-zero slot address). These addresses are protected (though there's a simple API call to cross the boundary). This means that pointer corruption and the like will not step on other apps, but if you want to, you still can.
CE also has the concept of shared memory. All processes have access to this area, and it is 100% unprotected. Your app may be using shared memory without realizing it: the memory manager can give you a shared address without you specifically asking, depending on your allocation and its size. If you have shared memory then yes, any process can access that data, including corrupting it, and you will get no error or warning in either process.
Does the WM operating system protect processes' memory from one another?
Yes.
Can one badly written application crash some other application just by mistakenly writing over the first one's memory?
No (but it might do other things like use up all the 'disk' space).
Even if you're a device driver, to get permission to write to memory that's owned by a different process there's an API which you must invoke explicitly.
While ChrisW's answer is technically correct, my experience of Windows Mobile is that it is much easier to crash the entire device from an application than it is on the desktop. I could guess at a few reasons why this is the case:
The operating system is often much more heavily OEMed than desktop Windows; that is, the amount of manufacturer-specific low-level code can be very high, which leads to manufacturer-specific bugs at a level that can cause bad crashes. On many devices it is common to see a new firmware revision every month or so, where the revisions are fixes to such bugs.
Resources are scarcer, and an application that exhausts all available resources is liable to cause a crash.
The protection mechanisms and architecture vary quite a bit. The device I'm currently working with is SH4-based, while you mostly see ARM, x86 and the odd MIPS CPU.
