MIPS memory execution prevention

I'm doing some research with the MIPS architecture and was wondering how operating systems are implemented with the limited instructions and memory protection that MIPS offers. I'm specifically wondering how an operating system would prevent certain address ranges from being executed. For example, how could an operating system limit the PC to operate in a particular range, i.e. prevent something such as executing from dynamically allocated memory?
The first thing that came to mind was TLBs, but TLBs only offer write protection (not execute protection).
I don't quite see how it could be handled by the OS either, because that would imply that every instruction would result in an exception, and MANY cycles would be burned just checking whether the PC was in a sane address range.
If anyone knows, how is it typically done? Is it handled somehow by the hardware during initialization (e.g. it's given an address range and an exception is raised if the PC is out of range)?

Most protection checks are done in hardware, by the CPU itself, and need little involvement from the OS.
The OS sets up some special tables (page tables or segment descriptors or some such) where memory ranges have associated read, write, execute and user/kernel permissions that the CPU then caches internally.
On every instruction, the CPU then checks whether the memory accesses comply with the OS-established permissions and, if everything's OK, carries on. If there's an attempt to violate those permissions, the CPU raises an exception (a form of interrupt, similar to those from I/O devices external to the CPU) that the OS handles. In most cases the OS simply terminates the offending application when it receives such an exception.
In some other cases it tries to handle the exception and make the seemingly broken code work. One such case is support for virtual, on-disk memory. The OS marks a region as not-present/inaccessible when it isn't backed by physical memory and its data is somewhere on the disk. When the app tries to use that region, the OS catches the exception from the instruction that tried to access it, backs the region with physical memory, fills it with data from the disk, marks it as present/accessible and restarts the instruction that caused the exception. Whenever the OS is low on memory, it can offload data from certain ranges to the disk, mark those ranges as not-present/inaccessible again and reclaim their memory for other purposes.
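In schematic form, the handler the OS installs for such exceptions looks something like this (a minimal sketch in C; every type and helper here is invented for illustration, not any real kernel's API):

    /* Schematic demand-paging fault handler; all names are hypothetical. */
    typedef unsigned long vaddr_t;
    typedef unsigned long paddr_t;

    struct region { int present; int permissions; };

    /* hypothetical helpers, declared only */
    struct region *find_region(vaddr_t addr);
    int      allowed(const struct region *r, int access_type);
    paddr_t  alloc_phys_page(void);           /* may evict another page first */
    void     read_from_swap(const struct region *r, vaddr_t a, paddr_t p);
    void     map_page(vaddr_t a, paddr_t p, int perms);
    void     terminate_current_process(void);

    void page_fault_handler(vaddr_t fault_addr, int access_type)
    {
        struct region *r = find_region(fault_addr);

        if (r == NULL || !allowed(r, access_type)) {
            terminate_current_process();   /* genuine protection violation */
            return;
        }
        if (!r->present) {
            paddr_t page = alloc_phys_page();           /* back with physical memory */
            read_from_swap(r, fault_addr, page);        /* fill from disk */
            map_page(fault_addr, page, r->permissions); /* mark present/accessible */
            r->present = 1;
        }
        /* returning from the exception restarts the faulting instruction */
    }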
There may also be memory ranges hard-coded by the CPU as inaccessible to software running outside of the OS kernel, and the CPU can easily make that check as well.
This seems to be the case for MIPS (from "Application Note 235 - Migrating from MIPS to ARM"):
3.4.2 Memory protection
MIPS offers memory protection only to the extent described earlier, i.e. addresses in the upper 2GB of the address space are not permitted when in user mode. No finer-grained protection regime is possible.
This document lists "MEM - page fault on data fetch; misaligned memory access; memory-protection violation" among the other MIPS exceptions.
If a particular version of the MIPS CPU doesn't have any finer-grained protection checks, they can only be emulated by the OS, and at a significant cost: the OS would need to execute the code instruction by instruction, or translate it into almost-equivalent code with address and access checks inserted, and execute that instead of the original code.
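For illustration, here is roughly the kind of guard such a translator would have to insert before every computed jump (a hypothetical sketch in C; all three helpers are invented):

    #include <stdint.h>

    /* Hypothetical helpers a binary translator would rely on. */
    int  in_executable_range(uint32_t target);
    void raise_protection_fault(uint32_t target);
    void jump_to(uint32_t target);

    /* Instead of emitting the original "jr $t0" directly, the translator
     * emits a guarded version: a software stand-in for a hardware check. */
    void guarded_indirect_jump(uint32_t target)
    {
        if (!in_executable_range(target))
            raise_protection_fault(target);
        jump_to(target);
    }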

This is indeed done with TLBs. No-Execute bits (NX bits) became popular only a few years ago, so older MIPS processors do not support them. The latest version of the MIPS architecture (Release 3) and the SmartMIPS Application-Specific Extension support it as an optional feature under the name XI (Execute Inhibit).
If you have a chip without this feature, you are out of luck. As Alex already said, there is no simple way to emulate it.
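Conceptually, the check the hardware performs on every instruction fetch looks like this (a simplified model in C; the field names are invented and do not match the real MIPS EntryHi/EntryLo layout):

    #include <stdint.h>

    /* Simplified model of a TLB entry; field names are invented. */
    struct tlb_entry {
        uint32_t vpn;          /* virtual page number */
        uint32_t pfn;          /* physical frame number */
        int      valid;        /* classic MIPS V bit */
        int      dirty;        /* D bit: clear means write-protected */
        int      exec_inhibit; /* the optional XI bit (Release 3 / SmartMIPS) */
    };

    /* What the hardware conceptually does on an instruction fetch. */
    int fetch_allowed(const struct tlb_entry *e)
    {
        if (!e->valid)        return 0;  /* TLB Invalid exception */
        if (e->exec_inhibit)  return 0;  /* XI set: fetch raises an exception */
        return 1;                        /* fetch proceeds from page e->pfn */
    }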

Related

cudaMallocManaged vs cudaMalloc - Device memory limitation scenario

I understand that cudaMallocManaged simplifies memory access by eliminating the need for explicit memory allocations on host and device. Consider a scenario where the host memory is significantly larger than the device memory, say 16 GB host & 2 GB device, which is fairly common these days. If I am dealing with input data of large size, say 4-5 GB, read from an external data source, am I forced to resort to explicit host and device memory allocation (as device memory is insufficient to accommodate it all at once), or does the CUDA unified memory model have a way to get around this (something like auto allocate/deallocate on a need basis)?
Am I forced to resort to explicit host and device memory allocation?
You are not forced to resort to explicit host and device memory allocation, but you will be forced to handle the amount of allocated memory manually. This is because, on current hardware at least, the CUDA unified virtual memory doesn't allow you to oversubscribe GPU memory. In other words, cudaMallocManaged will fail once you allocate more memory than what is available on the device. But that doesn't mean you can't use cudaMallocManaged, it merely means you have to keep track of the amount of memory allocated and never exceed what the device could support, by "streaming" your data instead of allocating everything at once.
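A minimal sketch of that streaming pattern, assuming a hypothetical 4 GB input processed through a 1 GB managed buffer on a 2 GB device (host-side C against the CUDA runtime API; the chunk-filling step and the kernel launch are elided):

    #include <cuda_runtime.h>

    #define TOTAL_BYTES (4ULL << 30)   /* assumed input size: 4 GB */
    #define CHUNK_BYTES (1ULL << 30)   /* stays well under the 2 GB device */

    int main(void)
    {
        unsigned char *buf;
        /* One managed allocation, sized so we never oversubscribe. */
        if (cudaMallocManaged((void **)&buf, CHUNK_BYTES,
                              cudaMemAttachGlobal) != cudaSuccess)
            return 1;

        for (unsigned long long off = 0; off < TOTAL_BYTES; off += CHUNK_BYTES) {
            /* 1. fill buf with the next chunk from the external source */
            /* 2. launch the processing kernel on buf (elided here)     */
            cudaDeviceSynchronize();   /* finish before reusing the buffer */
        }
        cudaFree(buf);
        return 0;
    }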
As for lifting that limitation: pure speculation, as I can't speak for NVIDIA, but I believe this could be one of the future improvements on upcoming hardware.
And indeed, one year and a half after the above prediction, as of CUDA 8, Pascal GPUs are now enhanced with a page-faulting capability that allows memory pages to migrate between the host and the device without explicit intervention from the programmer.

memory management and segmentation faults in modern day systems (Linux)

In modern-day operating systems, memory is available as an abstracted resource. A process is exposed to a virtual address space (which is independent from address space of all other processes) and a whole mechanism exists for mapping any virtual address to some actual physical address.
My doubt is:
If each process has its own address space, then it should be free to access any address within it. So apart from permission-restricted sections like .data, .bss, .text etc., one should be free to change the value at any address. But this usually gives a segmentation fault. Why?
For acquiring dynamic memory, we need to do a malloc. If the whole virtual space is made available to a process, then why can't it directly access it?
Different runs of a program result in different addresses for variables (both on the stack and the heap). Why is it so, when the environment for each run is the same? Does it not affect the amount of addressable memory available for use? (Does it have something to do with address space randomization?)
Some links on memory allocation (e.g. in heap).
The information available in different places is very confusing, as it mixes up older and modern systems, often without distinguishing between them. It would be helpful if someone could clarify these doubts with modern systems, say Linux, in mind.
Thanks.
Technically, the operating system is able to allocate any memory page on access, but there are important reasons why it shouldn't or can't:
different memory regions serve different purposes.
code. It can be read and executed, but shouldn't be written to.
literals (strings, const arrays). This memory is read-only and should be (a sketch of what happens on a write to such a region follows this list).
the heap. It can be read and written, but not executed.
the thread stack. There is no reason for two threads to access each other's stacks, so the OS might as well forbid that. Moreover, the thread stack can be de-allocated when the thread ends.
memory-mapped files. Any changes to this region should affect a specific file. If the file is open for reading, the same memory page may be shared between processes because it's read-only.
the kernel space. Normally the application should not (or can not) access that region - only kernel code can. It's basically a scratch space for the kernel and it's shared between processes. The network buffer may reside there, so that it's always available for writes, no matter when the packet arrives.
...
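To make the read-only case concrete: mapping a page without write permission and then storing to it gets the process killed with SIGSEGV (a minimal sketch using POSIX mmap; Linux behaviour assumed):

    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Ask the OS for one read-only, private, anonymous page. */
        char *p = mmap(NULL, 4096, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        printf("read works: %d\n", p[0]);   /* reading is permitted */
        p[0] = 42;                          /* writing violates PROT_READ:
                                               the MMU faults, the kernel sends
                                               SIGSEGV and the process dies */
        return 0;
    }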
The OS might assume that all unrecognised memory access is an attempt to allocate more heap space, but:
if an application touches kernel memory from user code, it must be killed. On 32-bit Windows, all memory above 1<<31 (top bit set) or above 3<<30 (top two bits set) is kernel memory: you should not assume any unallocated memory region is in user space.
if an application thinks about using a memory region but doesn't tell the OS, the OS may allocate something else there (OS: "sure, your file is at 0x12341234"; app: "but I wanted to store my data there"). You could tell the OS by touching the end of your array (which is unreliable anyway), but it's easier to just call an OS function. It's also a good idea that the call is "give me 10MB of heap", not "give me 10MB of heap starting at 0x12345678".
If the application allocates memory just by using it, then it typically never de-allocates at all. This can be problematic, as the OS still has to hold the unused pages (but the Java Virtual Machine doesn't de-allocate either, so hey).
Different runs of a program result in different addresses for variables
This is called memory layout randomisation and is used, alongside proper permissions (stack space is not executable), to make buffer-overflow attacks much more difficult. You can still crash the app, but not execute arbitrary code.
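You can observe this directly: with ASLR enabled, the same program prints different addresses on every run:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int local;                      /* lives on the stack */
        void *heap = malloc(16);        /* lives on the heap */

        /* With ASLR enabled, both addresses change from run to run. */
        printf("stack: %p  heap: %p\n", (void *)&local, heap);
        free(heap);
        return 0;
    }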
Some links on memory allocation (e.g. in heap).
Do you mean which algorithm the allocator uses? The simplest algorithm is to always allocate at the first available position, link each memory block to the next, and store a flag saying whether it's a free or a used block. More advanced algorithms allocate blocks whose sizes are a power of two, or a multiple of some fixed size, to prevent memory fragmentation (lots of small free blocks), or link the blocks into different structures to find a free block of sufficient size faster.
An even simpler approach is to never de-allocate: just point to the first (and only) free block and hold its size. If the remaining space is too small, throw it away and ask the OS for a new one.
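That never-de-allocate scheme is often called a bump allocator; a minimal sketch (backed by a static array here instead of a real request to the OS):

    #include <stddef.h>
    #include <stdint.h>

    /* A bump allocator: one pointer into the single free block, no free(). */
    static uint8_t arena[1 << 20];           /* 1 MiB "region from the OS" */
    static size_t  used = 0;

    void *bump_alloc(size_t size)
    {
        size = (size + 15) & ~(size_t)15;    /* keep 16-byte alignment */
        if (used + size > sizeof arena)
            return NULL;                     /* "ask the OS for a new one" here */
        void *p = arena + used;
        used += size;                        /* bump past the allocation */
        return p;
    }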
There's nothing magical about memory allocators: all they do is ask the OS for a large region and partition it into smaller chunks, without wasting too much space or taking too long.
Anyway, the Wikipedia article about memory allocation is http://en.wikipedia.org/wiki/Memory_management .
One interesting algorithm is called "(binary) buddy blocks". It holds several pools of power-of-two sizes and splits them recursively into smaller regions. Each region is then either fully allocated, fully free, or split in two regions (buddies) that are not both fully free. If it's split, then one byte suffices to hold the size of the largest free block within this block.
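Here is a sketch of the classic free-list formulation of buddy allocation (one free list per power-of-two order, with the links stored inside the free blocks themselves, rather than the one-byte-per-node encoding mentioned above):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define MAX_ORDER 20                     /* whole pool: 1 MiB */

    static uint8_t pool[1u << MAX_ORDER];

    /* Freed blocks store the free-list link inside themselves, so the
     * requested order must be >= 4 (16 bytes) to hold the pointer. */
    typedef struct block { struct block *next; } block_t;
    static block_t *free_list[MAX_ORDER + 1]; /* one list per order */

    void buddy_init(void)
    {
        memset(free_list, 0, sizeof free_list);
        free_list[MAX_ORDER] = (block_t *)pool;  /* one big free block */
        free_list[MAX_ORDER]->next = NULL;
    }

    void *buddy_alloc(int order)             /* returns 1 << order bytes */
    {
        int o = order;
        while (o <= MAX_ORDER && !free_list[o]) o++;  /* smallest fit */
        if (o > MAX_ORDER) return NULL;
        block_t *b = free_list[o];
        free_list[o] = b->next;
        while (o > order) {                  /* split; each split frees a buddy */
            o--;
            block_t *buddy = (block_t *)((uint8_t *)b + ((size_t)1 << o));
            buddy->next = free_list[o];
            free_list[o] = buddy;
        }
        return b;
    }

    void buddy_free(void *p, int order)      /* caller remembers the order */
    {
        size_t off = (size_t)((uint8_t *)p - pool);
        while (order < MAX_ORDER) {
            size_t boff = off ^ ((size_t)1 << order);  /* the buddy's offset */
            block_t **cur = &free_list[order];
            while (*cur && *cur != (block_t *)(pool + boff))
                cur = &(*cur)->next;
            if (!*cur) break;                /* buddy not free: stop merging */
            *cur = (*cur)->next;             /* unlink buddy and merge */
            off &= ~((size_t)1 << order);    /* merged block: lower half */
            order++;
        }
        block_t *b = (block_t *)(pool + off);
        b->next = free_list[order];
        free_list[order] = b;
    }

buddy_init() must run once first; buddy_alloc(12) then returns a 4 KiB block, and buddy_free(p, 12) merges it back with its buddy whenever the buddy is also free.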

How does the kernel know about segment fault?

When a segmentation fault occurs, it means I accessed memory which is not allocated or is protected. But how does the kernel or the CPU know that? Is it implemented in hardware? What data structures does the CPU look up? When a set of memory is allocated, what data structures need to be modified?
The details will vary, depending on what platform you're talking about, but typically the MMU will generate an exception (interrupt) when you attempt an invalid memory access and the kernel will then handle this as part of an interrupt service routine.
A seg fault generally happens when a process attempts a memory access that its virtual address space does not permit. It is the hardware that notifies the OS about the memory access violation, and the OS kernel then sends a signal to the process which caused the exception.
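That signal is SIGSEGV, and the process itself can observe it; a minimal sketch using POSIX sigaction, printing the faulting address the kernel reports in si_addr:

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* Report the faulting address the kernel passes along, then exit.
     * (fprintf is not strictly async-signal-safe; fine for a demo.) */
    static void on_segv(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        fprintf(stderr, "SIGSEGV at address %p\n", info->si_addr);
        _exit(1);   /* returning would re-run the faulting instruction */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_segv;
        sa.sa_flags = SA_SIGINFO;          /* request the siginfo_t argument */
        sigaction(SIGSEGV, &sa, NULL);

        *(volatile int *)0 = 42;           /* invalid write: MMU faults, kernel
                                              delivers SIGSEGV to this process */
        return 0;
    }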
To answer the second part of your question, again it depends on the hardware and OS. In a typical system (i.e. x86) the CPU consults the segment registers (via the global or local descriptor tables) to turn the segment-relative address into a virtual address (this is usually, but not always, a no-op on modern x86 operating systems), and then (the MMU really does this bit, but on x86 it's part of the CPU) consults the page tables to turn that virtual address into a physical address. When it encounters a page which is not marked present (the present bit is not set in the page directory or tables), it raises an exception. When the OS handles this exception, it will either give up (giving rise to the segfault signal you see when you make a mistake, or a panic) or it will modify the page tables to make the memory valid and continue from the exception. Typically the OS has some bookkeeping which says which pages could be valid, and how to get the page. This is how demand paging occurs.
It all depends on the particular architecture, but all architectures with paged virtual memory work essentially the same. There are data structures in memory that describe the virtual-to-physical mapping of each allocated page of memory. For every memory access, the CPU/MMU hardware looks up those tables to find the mapping. This would be horribly slow, of course, so there are hardware caches to speed it up.

How to find number of memory accesses

Can anybody tell me a Unix command that can be used to find the number of memory accesses that took place in a given interval? vmstat, top and sar only give the amount of physical memory occupied/available, but do not give the number of memory accesses in a given interval.
If I understand what you're asking, such a feature would almost certainly require hardware support at a very low level (e.g. a counter of some sort that monitors memory bus activity).
I don't think such support is available for the common architectures supported by Unix or Linux, so I'm going to go out on a limb and say that no such Unix command exists.
The situation is somewhat different when considering memory in units of pages, because most architectures that support virtual memory have dedicated MMU hardware which operates at that level of granularity and can be accessed by the operating system. But as far as I know, the sorts of counter data you'd get from the MMU would represent events like page faults, allocations, and releases, rather than individual reads or writes.
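Those page-level events are indeed what gets exposed to user space; for example, POSIX getrusage reports a process's own minor and major page-fault counts:

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage ru;
        if (getrusage(RUSAGE_SELF, &ru) == 0) {
            /* minor faults: resolved without I/O; major faults: required
               reading a page in from disk */
            printf("minor faults: %ld, major faults: %ld\n",
                   ru.ru_minflt, ru.ru_majflt);
        }
        return 0;
    }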

How is external memory, internal memory, and cache organized?

Consider a system as follows: a hardware board having say ARM Cortex-A8 and Neon Vector coprocessor, and Embedded Linux OS running on Cortex-A8. On this environment, if some application - say, a video decoder - is executing, then:
How is it decided which buffers would be in external memory, which ones would be allocated in internal SRAM, etc.
When one calls calloc/malloc on such a system/code, the pointer returned is from which memory: internal or external?
Can a user make buffers to be allocated in the memories of his choice (internal/external)?
In ARM architectures, there is another memory called "tightly coupled memory" (TCM). What is that and how can user enable and use it? Can I declare buffers in this memory?
Do I need to see the memory map (if any) of the hardware board to understand about all these different physical memories present in a typical hardware board?
How much of a role does the OS play in distinguishing these different memories?
Sorry for multiple questions, but I think they are all interlinked.
Please note that I'm not familiar with the ARM nor embedded Linux specifically, so all of my comments will be from a general point of view.
First, about cache: Very early during boot, the operating system will do some amount of cache initialization. Exactly what this entails will vary from processor to processor, but the net effect is to ensure cache is initialized properly, and then enable its use by the processor. After this, the cache is operated exclusively by the processor with no further interaction by the operating system or your programs.
Now, on to external (off-chip) and internal (on-chip) memories:
The operating system owns all hardware on the system, including the internal and external memories, and so is ultimately responsible for discovering, configuring, and allocating these resources within the kernel and to user processes. In a typical system (e.g., your desktop or a 1U server) there won't usually be any special internal (on-chip) RAM, so the operating system can treat all DRAM equally. It will go into a general pool of pages (usually 4k) for allocation to processes, file system buffers, etc. On a system with special memory of various sorts (NVRAM, high-speed on-chip memory, and a few others), the operating system's general policies aren't usually correct.
How this is presented to the user will depend on choices made while porting the OS to this system.
One could modify the OS to be explicitly aware of this special memory, and provide special system calls to allocate it to user-land processes. However, this could be quite a bit of work unless the embedded Linux being used already has at least some support for this sort of thing.
The approach I'd probably take would be to avoid modifying the kernel itself, and instead write a device driver for the internal memory. A driver of this sort would typically provide some sort of mmap interface to allow user processes to get simple address-based access to the internal memory.
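From the user process's side, such a driver interface would look roughly like this (a sketch; /dev/sram is an invented device node and the size is assumed, since the real names depend on the driver):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define SRAM_SIZE 65536   /* hypothetical: 64 KiB of on-chip SRAM */

    int main(void)
    {
        /* "/dev/sram" is an invented node a driver of this sort might expose. */
        int fd = open("/dev/sram", O_RDWR);
        if (fd < 0) { perror("open"); return 1; }

        unsigned char *sram = mmap(NULL, SRAM_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
        if (sram == MAP_FAILED) { perror("mmap"); return 1; }

        sram[0] = 0xAB;        /* ordinary pointer access now hits on-chip RAM */

        munmap(sram, SRAM_SIZE);
        close(fd);
        return 0;
    }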
Here are answers to some of your concrete questions.
How much of a role does the OS play in distinguishing these different memories?
If your system has taken the device driver approach described above, then the OS probably knows only about external memory, or perhaps just enough about the internal memories to initialize them properly, although that would likely be in the device driver too, if at all possible. If the OS knows more explicitly about the on-chip memory, then it will definitely contain any needed initialization code, as well as some sort of scheme to provide access to the user processes.
How is it decided which buffers would be in external memory, which ones would be allocated in internal SRAM, etc.
It seems unlikely to me that the operating system would try to automate such choices. Instead, I suspect that either the OS or a device driver would provide a generic interface to provide access to the on-chip memory, and leave it up to your user code to decide what to do with it.
When one calls calloc/malloc on such a system/code, the pointer returned is from which memory: internal or external?
Almost certainly, malloc and friends will return pointers into the general off-chip memory. In the driver-based approach suggested above, you'd use mmap to gain access to the on-chip memory. If you needed to do finer-grained allocation than that, you'd need to write your own allocator, or find one that can be given an explicit region of memory to work in.
Can a user make buffers to be allocated in the memories of his choice (internal/external)?
If by buffers you mean the regions returned from the standard malloc calls, probably not. But, if you mean "can a user program somehow get a pointer to the on-chip memory", then the answer is almost certainly yes, but the mechanism will depend on choices made when porting linux to this system.
In ARM architectures, there is another memory called "tightly coupled memory" (TCM). What is that and how can user enable and use it? Can I declare buffers in this memory?
I don't know what this is. If I had to guess, I'd assume it's just another form of on-chip ram, but since it has a different name, perhaps I'm wrong.
Do I need to see the memory map (if any) of the hardware board to understand about all these different physical memories present in a typical hardware board?
If the OS and/or device drivers have provided some sort of abstract access to these memory regions, then you won't need to know explicitly about the address map. This knowledge is, however, needed to implement this access in either the kernel or a device driver.
I hope this helps somewhat.
