Write-Only Memory

I know there exist read-only values in many languages (final in Java, const in C++, etc.), but does such a thing as "write-only" values exist? I've heard a variation of this in jokes, such as write-only code, but I'm wondering if this is actually a legitimate concept in computer science. To be honest, I can't see how it would be helpful in any situation, but I'm just wondering.

In unix shell scripting there is a concept of write only memory. But it's not part of any shell or scripting language, it's a device: /dev/null.
The write-only device /dev/null is used to discard output you don't want. Generally by allowing the caller to redirect stdout and/or stderr to it.
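A minimal sketch of that behaviour from a program's point of view, using Python's os.devnull (which names the null device on the current platform). The helper names discard and read_back are made up for illustration. Strictly speaking the null device can be opened for reading too, but a read returns end-of-file immediately, so nothing written to it can ever be read back.

```python
import os

def discard(payload):
    """Write payload to the null device; the bytes are accepted and thrown away."""
    with open(os.devnull, "wb", buffering=0) as sink:
        return sink.write(payload)  # number of bytes the device accepted

def read_back(count=16):
    """Try to read from the null device; it reports end-of-file straight away."""
    with open(os.devnull, "rb", buffering=0) as source:
        return source.read(count)

written = discard(b"hello")
leftover = read_back()
```

Here written is the length of the payload and leftover is empty, which is the "write-only" effect described above.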
There are other write-only memories on a computer. One example is your sound card, which on some (older) unix machines is mapped to /dev/audio or /dev/dsp. Writing values to it makes your speaker produce sound, but reading from it gets you nothing.
At the lower level of the device drivers themselves, these hardware devices are often connected to a specific memory or I/O address (some CPU architectures don't have separate memory and I/O address spaces, just a single address space shared by RAM and all other hardware). So in a real sense these memory locations are really write-only.

There were certainly some FPUs for PCs that used a somewhat weird setup, by existing as memory-mapped devices. To perform some operations, you would simply write the value you wanted to operate on, to a memory address indicating what operation you wanted performed, the value would then (eventually) be available at another address.
I don't know if you would define this, strictly, as "write-only memory", it is rather memory where (part of) the address is used as an opcode.
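To make the "address as opcode" idea concrete, here is a toy software model. It does not describe any real FPU: the class name ToyMappedFPU, the register offsets and the two operations are all invented for illustration.

```python
import math

class ToyMappedFPU:
    """Toy model of a device where the address written to selects the operation."""

    # Hypothetical register offsets, not taken from any real hardware.
    OP_SQRT = 0x00
    OP_NEGATE = 0x04
    RESULT = 0x08

    def __init__(self):
        self._result = 0.0
        self._ops = {
            self.OP_SQRT: math.sqrt,
            self.OP_NEGATE: lambda x: -x,
        }

    def write(self, offset, value):
        """Writing a value to an operation register performs that operation."""
        op = self._ops.get(offset)
        if op is None:
            raise ValueError("no operation register at offset %#x" % offset)
        self._result = op(value)

    def read(self, offset):
        """Only the result register is readable; operation registers read as 0."""
        if offset != self.RESULT:
            return 0.0
        return self._result

fpu = ToyMappedFPU()
fpu.write(ToyMappedFPU.OP_SQRT, 9.0)      # "store 9.0 at the sqrt address"
answer = fpu.read(ToyMappedFPU.RESULT)    # "load from the result address"
```

In this model the operation registers behave as write-only locations, while the operand's meaning comes entirely from which address it was stored to.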

Related

Does simulating memory-mapped I/O using VMX require instruction decoding?

I am wondering how a hypervisor using Intel's VMX / VT technology would simulate memory-mapped I/O (so that the guest could think it was performing memory-mapped I/O against a device).
I think the basic principle would be to set up the EPT page tables in such a way that the memory addresses in question would cause an EPT violation (i.e. VM exit) by setting them such that they cannot be read or written? However, the next question is how to process the VM exit. Such a VM-exit would fill out all the exit qualification reasons etc. including the guest-linear and guest-physical address etc. But what I am missing in these exit qualification fields is some field indicating - in case of a write instruction - the value that was attempted to be written and the size of the write. Likewise, for a read instruction it would be nice with some bit fields indicating the destination of the read, say a register or a memory location (in case of memory-to-memory string operations). This would make it very easy for the hypervisor to figure out what the guest was trying to do and then simulate the device behavior towards the guest.
But the trouble is, I can't find such fields among the exit qualifications. I can see an instruction pointer to where the faulting instruction is, so I could walk the page tables to read in the instruction and then decode it to understand the instruction, then simulate the I/O behavior. However, this requires the hypervisor to have a fairly complete picture of all x86 instructions, and be able to decode them. That seems to be quite a heavy burden on the hypervisor, and will also require it to stay current with later instruction additions. And the CPU should already have this information.
There's a chance that I am missing the relevant fields because the documentation is quite extensive, but I have searched carefully and have not been able to find them. Maybe someone can point me in the right direction OR confirm that the hypervisor will need to contain an instruction decoder.

I believe most VMs decode the instruction. It's not actually that hard, and most VMs have software emulators to fall back on when the CPU VM extensions aren't available or up to the task. You don't need to handle every instruction, just those that can take memory operands, and you can probably ignore everything that isn't a 1, 2, or 4 byte memory operand since you're not likely to be emulating device registers of other sizes. (For memory-mapped device buffers, like video memory, you don't want to be trapping every memory access because that's too slow, so you'll have to take a different approach.)
However, there is one way you can let the CPU do the work for you, but it's much slower than decoding the instruction yourself and it's not entirely perfect. You can single-step the instruction while temporarily mapping in a valid page of RAM. The VM exit will tell you the guest physical address accessed and whether it was a read or a write. Unfortunately it doesn't reliably tell you whether it was a read-modify-write instruction; those may just set the write flag, and with some device registers that can make a difference. It might be easier to copy the instruction (it can be at most 15 bytes, but watch out for page boundaries) and execute it in the host, but that requires that you can map the page to the same virtual address in the host as in the guest.
You could combine these techniques, decode the common instructions that are actually used to access memory mapped device registers, while using single stepping for the instructions you don't recognize.
Note that by choosing to write your own hypervisor you've put a heavy burden on yourself. Having to decode instructions in software is a pretty minor burden compared to the task of emulating an entire IBM PC compatible computer. The Intel virtualisation extensions aren't designed to make this easier, they're just designed to make it more efficient. It would be easier to write a pure software emulator that interpreted the instructions. Handling memory mapped I/O would be just a matter of dispatching the reads and writes to the correct function.
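As a rough sketch of what "dispatching the reads and writes to the correct function" could look like in a pure software emulator: every guest memory access that misses RAM is routed through a small bus object that finds the device covering the address. The names MmioBus and ToySerialPort, the base address and the register layout are all hypothetical.

```python
class MmioBus:
    """Routes guest physical addresses to the emulated device that owns them."""

    def __init__(self):
        self._regions = []  # list of (base, size, device)

    def map(self, base, size, device):
        self._regions.append((base, size, device))

    def _find(self, addr):
        for base, size, device in self._regions:
            if base <= addr < base + size:
                return device, addr - base
        raise LookupError("no device mapped at %#x" % addr)

    def read(self, addr, width):
        device, offset = self._find(addr)
        return device.read(offset, width)

    def write(self, addr, width, value):
        device, offset = self._find(addr)
        device.write(offset, width, value)


class ToySerialPort:
    """Hypothetical device: offset 0 is a write-only data register, offset 4 is status."""

    def __init__(self):
        self.output = bytearray()

    def read(self, offset, width):
        # Status always reports "ready"; the data register reads back as 0.
        return 0x1 if offset == 4 else 0

    def write(self, offset, width, value):
        if offset == 0:
            self.output.append(value & 0xFF)


bus = MmioBus()
uart = ToySerialPort()
bus.map(0x10000000, 8, uart)
for ch in b"hi":
    bus.write(0x10000000, 1, ch)  # what a guest "store byte" would turn into
```

A hypervisor that decodes the faulting instruction ends up calling the same kind of read/write interface; the decoder's job is only to recover the address, width and value arguments.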

I don't know in detail how VT-x works, but I think I see a flaw in the way your wishlist would have it work:
Remember that x86 is not a load/store machine. The load part of add [rdi], 2 doesn't have an architecturally-visible destination, so your proposed solution of telling the hypervisor where to find or put the data doesn't really work, unless there's some temporary location that isn't part of the guest's architectural state, used only for communication between the hypervisor and the VMX hardware.
To handle a read-modify-write instruction with a memory destination efficiently, the VM should do the whole thing with one VM exit. So you can't just provide separate load and store interfaces.
More importantly, handling atomic read-modify-writes is a special case. lock add [rdi], 2 can't just be done as a separate load and store.

How does compiler lay out code in memory

Ok I have a bit of a noob student question.
So I'm familiar with the fact that stacks contain subroutine calls, and heaps contain variable length data structures, and global static variables are assigned to permanent memory locations.
But how does it all work on a less theoretical level?
Does the compiler just assume it's got an entire memory region to itself from address 0 to address infinity? And then just start assigning stuff?
And where does it layout the instructions, stack, and heap? At the top of the memory region, end of memory region?
And how does this then work with virtual memory? The virtual memory is transparent to the program?
Sorry for a bajilion questions but I'm taking programming language structures and it keeps referring to these regions and I want to understand them on a more practical level.
THANKS much in advance!

A comprehensive explanation is probably beyond the scope of this forum. Entire texts are devoted to the subject. However, at a simplistic level you can look at it this way.
The compiler does not lay out the code in memory. It does assume it has the entire memory region to itself. The compiler generates object files where the symbols in the object files typically begin at offset 0.
The linker is responsible for pulling the object files together, linking symbols to their new offset location within the linked object and generating the executable file format.
The linker doesn't lay out code in memory either. It packages code and data into sections typically labeled .text for the executable code instructions and .data for things like global variables and string constants. (and there are other sections as well for different purposes) The linker may provide a hint to the operating system loader where to relocate symbols but the loader doesn't have to oblige.
It is the operating system loader that parses the executable file and decides where code and data are laid out in memory. The exact locations depend entirely on the operating system. Typically the stack is located in a higher memory region than the program instructions and data and grows downward.
Each program is compiled/linked with the assumption it has the entire address space to itself. This is where virtual memory comes in. It is completely transparent to the program and managed entirely by the operating system.
Virtual memory typically ranges from address 0 and up to the max address supported by the platform (not infinity). This virtual address space is partitioned off by the operating system into kernel addressable space and user addressable space. Say on a hypothetical 32-bit OS, the addresses above 0x80000000 are reserved for the operating system and the addresses below are for use by the program. If the program tries to access memory above this partition it will be aborted.
The operating system may decide the stack starts at the highest addressable user memory and grows down with the program code located at a much lower address.
The location of the heap is typically managed by the run-time library against which you've built your program. It could live beginning with the next available address after your program code and data.
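One way to see such a layout is the Linux /proc/<pid>/maps file, which lists a process's mapped regions one per line. The sketch below parses text in that format. The SAMPLE text is a made-up illustration (addresses and the path /usr/bin/example are hypothetical), and parse_maps is a helper written for this example.

```python
def parse_maps(text):
    """Parse lines in the Linux /proc/<pid>/maps format.

    Returns a list of (start, end, permissions, name) tuples.
    """
    regions = []
    for line in text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        start, end = (int(part, 16) for part in fields[0].split("-"))
        name = fields[5].strip() if len(fields) > 5 else ""
        regions.append((start, end, fields[1], name))
    return regions

# Hypothetical sample in the maps format: code, data, heap, then stack up high.
SAMPLE = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/example
00651000-00652000 rw-p 00051000 08:02 173521 /usr/bin/example
00e03000-00e24000 rw-p 00000000 00:00 0 [heap]
7ffc1e4d9000-7ffc1e4fa000 rw-p 00000000 00:00 0 [stack]
"""

regions = parse_maps(SAMPLE)
by_name = {name: (start, end) for start, end, _, name in regions}
stack_above_heap = by_name["[stack]"][0] > by_name["[heap]"][1]
```

On a Linux machine, calling parse_maps(open("/proc/self/maps").read()) would show the real layout of the running interpreter, which will differ from the sample.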

This is a wide open question with lots of topics.
Assuming the typical compiler -> assembler -> linker toolchain. The compiler doesn't know a whole lot; it simply encodes stack-relative accesses and doesn't care how big the stack is or where it is. That is the purpose/beauty of a stack: you don't have to care. The compiler generates assembly, the assembler turns that into an object file, and then the linker takes info from a linker script of some flavor, or command line arguments, that tells it the details of the memory space. When you run
gcc hello.c -o hello
your installation of binutils has a default linker script which is tailored to your target (Windows, Mac, Linux, whatever you are running on). That script contains the info about where the program space starts, and from there it knows where to start the heap (after the text, data and bss). The stack pointer is likely set either by that linker script and/or the OS manages it some other way. And that defines your stack.
For an operating system with an MMU, which is what your Windows, Linux, Mac and BSD laptop or desktop computers have, then yes, each program is compiled assuming it has its own address space starting at 0x0000. That doesn't mean that the program is linked to start running at 0x0000; that depends on the operating system's rules, and some start at 0x8000 for example.
For a desktop-like application, where there is somewhat a single linear address space from your program's perspective, you will likely have .text first, then either .data or .bss, and then the heap aligned at some point after all of that. The stack, however it is set, is typically up high and works down, but that can be processor and operating system specific. That stack is typically, within the program's view of the world, at the top of its memory.
Virtual memory is invisible to all of this; the application normally doesn't know or care about virtual memory. If and when the application fetches an instruction or does a data transfer, it goes through hardware which is configured by the operating system and which converts between virtual and physical addresses. If the MMU indicates a fault, meaning that space has not been mapped to a physical address, that can sometimes be intentional, and then another use of the term "virtual memory" applies. Under this second definition the operating system can, for example, take some other chunk of memory, yours or someone else's, move that to hard disk, mark that other chunk as not being there, then mark your chunk as having some RAM and let you execute, not knowing you were interrupted and given some RAM that had to be taken from someone else. Your application by design doesn't want to know any of this; it just wants to run. The operating system takes care of managing physical memory and the MMU that gives you a virtual (zero-based) address space...
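The translation step described above can be sketched as a toy model. This is not how any particular MMU is implemented: real hardware uses multi-level tables and a TLB. The 4 KiB page size, the single flat dictionary and the names ToyMMU and PageFault are assumptions for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch

class PageFault(Exception):
    """Raised when a virtual address has no physical mapping."""

class ToyMMU:
    def __init__(self):
        # virtual page number -> physical frame number, filled in by the "OS"
        self.page_table = {}

    def map_page(self, vpn, frame):
        self.page_table[vpn] = frame

    def translate(self, vaddr):
        """Convert a virtual address to a physical one, or fault if unmapped."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.page_table:
            raise PageFault(hex(vaddr))
        return self.page_table[vpn] * PAGE_SIZE + offset

mmu = ToyMMU()
mmu.map_page(0, 7)                 # virtual page 0 lives in physical frame 7
physical = mmu.translate(0x10)     # same offset, different page number
```

In the swapping scenario, the operating system's fault handler would catch the equivalent of PageFault, find or free a frame, call something like map_page, and resume the program at the faulting access.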
If you were to do a little bit of bare metal programming, without mmu stuff at first then later with, microcontroller, qemu, raspberry pi, beaglebone, etc you can get your hands dirty both with the compiler, linker script and configuring an mmu. I would use an arm or mips for this not x86, just to make your life easier, the overall big picture all translates directly across targets.

It depends.
If you're compiling a bootloader, which has to start from scratch, you can assume you've got the entire memory for yourself.
On the other hand, if you're compiling an application, you can assume you've got the entire memory for yourself.
The minor difference is that in the first case, you have all physical memory for yourself. As a bootloader, there's nothing else in RAM yet. In the second case, there's an OS in memory, but it will (normally) set up virtual memory for you so that it appears you have the entire address space for yourself. Usually you still have to ask the OS for actual memory, though.
The latter does mean that the OS imposes some rules. E.g. the OS very much would like to know where the first instruction of your program is. A simple rule might be that your program always starts at address 0, so the C compiler could put int main() there. The OS typically would like to know where the stack is, but this is already a more flexible rule. As far as "the heap" is concerned, the OS really couldn't care.

operating systems memory management - malloc() invocation

I'm studying up on OS memory management, and I wish to verify that I got the basic mechanism of allocation \ virtual memory \ paging straight.
Let's say a process calls malloc(), what happens behind the scenes?
my answer: The runtime library finds an appropriately sized block of memory in its virtual memory address space.
(This is where allocation algorithms such as first-fit, best-fit that deal with fragmentation come into play)
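For reference, a first-fit search can be sketched as follows. This is a toy model that hands out offsets within a fixed-size region and keeps a sorted list of free holes; it is not how any particular malloc() is implemented, and the class name FirstFitAllocator is made up.

```python
class FirstFitAllocator:
    """Toy first-fit allocator over a region of `size` bytes."""

    def __init__(self, size):
        self.free = [(0, size)]  # sorted list of (offset, length) holes

    def alloc(self, n):
        """Return the offset of the first hole that is large enough, or None."""
        for i, (off, length) in enumerate(self.free):
            if length >= n:
                if length == n:
                    del self.free[i]
                else:
                    self.free[i] = (off + n, length - n)
                return off
        return None

    def release(self, off, n):
        """Give a block back and merge it with adjacent holes."""
        self.free.append((off, n))
        self.free.sort()
        merged = []
        for o, length in self.free:
            if merged and merged[-1][0] + merged[-1][1] == o:
                merged[-1] = (merged[-1][0], merged[-1][1] + length)
            else:
                merged.append((o, length))
        self.free = merged

heap = FirstFitAllocator(100)
a = heap.alloc(30)
b = heap.alloc(30)
heap.release(a, 30)     # leaves a 30-byte hole at the start
c = heap.alloc(20)      # first fit reuses that hole
d = heap.alloc(35)      # too big for the 10 bytes left there, so it goes later
```

The 10-byte remainder left behind by c is the kind of fragmentation the different fit strategies trade off against search time.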
Now let's say the process accesses that memory, how is that done?
my answer: The memory address, as seen by the process, is in fact virtual. The OS checks if that address is currently mapped to a physical memory address and if so performs the access. If it isn't mapped - a page fault is raised.
Am I getting this straight? i.e. the compiler\runtime library are in charge of allocating virtual memory blocks, and the OS is in charge of a mapping between processes' virtual address and physical addresses (and the paging algorithm that entails)?
Thanks!

About right. The memory needs to exist in the virtual memory of the process for a page fault to actually allocate a physical page though. You can't just start poking around anywhere and expect the kernel to put physical memory where you happen to access.
There is much more to it than this. Read up on mmap(), anonymous and not, shared and private. And brk() too. malloc() builds on brk() and mmap().
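As a small illustration of the anonymous case, Python's mmap module can request an anonymous mapping by passing -1 as the file descriptor. The helper name anonymous_region is made up, and this only shows the request from user space, not what malloc() does internally.

```python
import mmap

def anonymous_region(length=mmap.PAGESIZE):
    """Ask the OS for `length` bytes of anonymous (not file-backed) memory."""
    return mmap.mmap(-1, length)

region = anonymous_region()
region[:5] = b"hello"      # first touch of the page; the OS backs it on demand
untouched = region[5:8]    # anonymous memory starts out zero-filled
```

Only after a region like this exists in the process's address space can a page fault inside it be resolved by handing out a physical page.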

You've almost got it. The one thing you missed is how the process asks the system for more virtual memory in the first place. As Thomas pointed out, you can't just write where you want. There's no reason an OS couldn't be designed to allow that, but it's much more efficient if it has some idea where you're going to be writing and the space where you do it is contiguous.
On Unixy systems, userland processes have a region called the data segment, which is what it sounds like: it's where the data goes. When a process needs memory for data, it calls brk(), which asks the system to extend the data segment to a specified pointer value. (For example, if your existing data segment was empty and you wanted to extend it to 2M, you'd call brk(0x200000).)
Note that while very common, brk() is not a standard; in fact it was yanked out of POSIX.1 a decade ago because C specifies malloc() and there's no reason to mandate the interface for data segment allocation.

Does Erlang always copy messages between processes on the same node?

A faithful implementation of the actor message-passing semantics means that message contents are deep-copied from a logical point of view, even for immutable types. Deep-copying of message contents remains a bottleneck for implementations of the actor model, so for performance some implementations support zero-copy message passing (although it's still deep-copy from the programmer's point of view).
Is zero-copy message-passing implemented at all in Erlang? Between nodes it obviously can't be implemented as such, but what about between processes on the same node? This question is related.

I don't think your assertion is correct at all - deep copying of inter-process messages isn't a bottleneck in Erlang, and with the default VM build/settings, this is exactly what all Erlang systems are doing.
Erlang process heaps are completely separate from each other, and the message queue is located in the process heap, so messages must be copied. This is also true for transferring data into and out of ETS tables as their data is stored in a separate allocation area from process heaps.
There are a number of shared datastructures however. Large binaries (>64 bytes long) are generally allocated in a node-wide area and are reference counted. Erlang processes just store references to these binaries. This means that if you create a large binary and send it to another process, you're only sending the reference.
Sending data between processes is actually worse in terms of allocation size than you might imagine - sharing inside a term isn't preserved during the copy. This means that if you carefully construct a term with sharing to reduce memory consumption, it will expand to its unshared size in the other process. You can see a practical example in the OTP Efficiency Guide.
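The size blow-up can be illustrated with an analogy in Python (this is not Erlang code and says nothing about the VM's actual copy routine). A nested term is built where each level refers to the same sub-term twice; counting distinct objects gives the shared size, while counting every reference as if it were its own copy gives the size after a sharing-unaware copy. The function names are invented for this sketch.

```python
def build(depth):
    """Build a nested pair where both slots refer to the SAME sub-term."""
    term = ()
    for _ in range(depth):
        term = (term, term)
    return term

def shared_size(term, seen=None):
    """Count distinct tuple objects, i.e. the size with sharing preserved."""
    seen = set() if seen is None else seen
    if id(term) in seen:
        return 0
    seen.add(id(term))
    return 1 + sum(shared_size(child, seen) for child in term)

def flattened_size(term):
    """Count tuples as if every reference were copied separately."""
    return 1 + sum(flattened_size(child) for child in term)

term = build(10)
compact = shared_size(term)
expanded = flattened_size(term)
```

With depth 10, compact is 11 while expanded is 2047: linear growth with sharing, exponential growth once sharing is lost.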
As Nikolaus Gradwohl pointed out, there was an experimental hybrid heap mode for the VM which did allow term sharing between processes and enabled zero-copy message passing. It hasn't been a particularly promising experiment as I understand it - it requires extra locking and complicates the existing ability of processes to independently garbage collect. So not only is copying inter-process messages not the usual bottleneck in Erlang systems, allowing it actually reduced performance.

AFAIK there was/is experimental support for zero-copy message passing in Erlang using the -shared or -hybrid model. I read a blog post in 2009 claiming that it's broken on SMP machines, but I have no idea about the current status.

As has been mentioned here and in other questions current versions of Erlang basically copy everything except for larger binaries. In older pre-SMP times it was feasible to not copy but pass references. While this resulted in very fast message passing it created other problems in the implementation, primarily it made garbage collection more difficult and complicated implementation. I think that today passing references and having shared data could result in excessive locking and synchronisation which is, of course, not a Good Thing.

I wrote the accepted answer to that other question you're referencing, and in it I give you a direct pointer to this line of code:
message = copy_struct(message, msize, &hp, &bp->off_heap);
This is in a function called when the Erlang run-time system needs to send a message, and it's not inside any kind of "if" that could cause it to be skipped. So, as far as I can tell, the answer is "yes, it's always copied." (That's not strictly true -- there is an "if", but it seems to be dealing with exceptional cases, not the normal code-flow path.)
(I'm ignoring the hybrid heap option brought up by Nikolaus. It looks like he's right, but since this isn't the way Erlang is normally built and it has its own penalties, I don't see that it's worth considering as a way to answer your concern.)
I don't know why you're considering 10 GByte/sec a bottleneck, though. Nothing short of registers or CPU cache goes faster in the computer, and such memories are small, thus constituting a kind of bottleneck themselves. Besides which, the zero-copy idea you're proposing would require locking in the case of cross-CPU message passing in a multi-core system, which is also a bottleneck. We're already paying the locking penalty once in this function to copy the message into the other process's message queue; why pay it again later when that process gets around to reading the message?
Bottom line, I don't think your ideas of ways to make it go faster would actually help much.

How is external memory, internal memory, and cache organized?

Consider a system as follows: a hardware board having say ARM Cortex-A8 and Neon Vector coprocessor, and Embedded Linux OS running on Cortex-A8. On this environment, if some application - say, a video decoder - is executing, then:
How is it decided which buffers would be in external memory, which ones would be allocated in internal SRAM, etc.
When one calls calloc/malloc on such a system/code, the pointer returned is from which memory: internal or external?
Can a user make buffers to be allocated in the memories of his choice (internal/external)?
In ARM architectures, there is another memory called "tightly coupled memory" (TCM). What is that and how can user enable and use it? Can I declare buffers in this memory?
Do I need to see the memory map (if any) of the hardware board to understand about all these different physical memories present in a typical hardware board?
How much of a role does the OS play in distinguishing these different memories?
Sorry for multiple questions, but I think they are all interlinked.

Please note that I'm not familiar with ARM or embedded Linux specifically, so all of my comments will be from a general point of view.
First, about cache: Very early during boot, the operating system will do some amount of cache initialization. Exactly what this entails will vary from processor to processor, but the net effect is to ensure cache is initialized properly, and then enable its use by the processor. After this, the cache is operated exclusively by the processor with no further interaction by the operating system or your programs.
Now, on to external (off-chip) and internal (on-chip) memories:
The operating system owns all hardware on the system, including the internal and external memories and so is ultimately responsible for discovering, configuring, and allocating these resources within the kernel and to user processes. In a typical system (eg, your desktop or a 1u server) there won't usually be any special internal (on-chip) ram, and so the operating system can treat all dram equally. It will go into a general pool of pages (usually 4k) for allocation to processes, file system buffers, etc. On a system with special memory of various sorts (nvram, high-speed on-chip memory, and a few others), the operating system's general policies aren't usually correct.
How this is presented to the user will depend on choices made while porting the OS to this system.
One could modify the OS to be explicitly aware of this special memory, and provide special system calls to allocate it to userland processes. However, this could be quite a bit of work unless the embedded Linux being used has at least some support for this sort of thing.
The approach I'd probably take would be to avoid modifying the kernel itself, and instead write a device driver for the internal memory. A driver of this sort would typically provide some sort of mmap interface to allow user processes to get simple address-based access to the internal memory.
Here are answers to some of your concrete questions.
How much of a role does the OS play in distinguishing these different memories?
If your system has taken the device driver approach described above, then the OS probably knows only about external memory, or perhaps just enough about the internal memories to initialize them properly although that would likely be in the device driver too, if at all possible. If the OS knows more explicitly about the on-chip memory, then it will definitely contain any needed initialization code, as well as some sort of scheme to provide access to the user processes.
How is it decided which buffers would be in external memory, which ones would be allocated in internal SRAM, etc.
It seems unlikely to me that the operating system would try to automate such choices. Instead, I suspect that either the OS or a device driver would provide a generic interface to provide access to the on-chip memory, and leave it up to your user code to decide what to do with it.
When one calls calloc/malloc on such a system/code, the pointer returned is from which memory: internal or external?
Almost certainly, malloc and friends will return pointers into the general off-chip memory. In the driver-based approach suggested above, you'd use mmap to gain access to the on-chip memory. If you needed to do finer-grained allocation than that, you'd need to write your own allocator, or find one that can be given an explicit region of memory to work in.
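A sketch of that last idea: a very simple bump allocator that hands out offsets inside an explicitly provided mapped region. No real on-chip memory driver is assumed here. A temporary file stands in for whatever device node such a driver would expose, and the names BumpAllocator, sram and pool are made up for illustration.

```python
import mmap
import tempfile

class BumpAllocator:
    """Hands out aligned offsets from a fixed region; blocks are never freed."""

    def __init__(self, region):
        self.region = region
        self.next = 0

    def alloc(self, n, align=8):
        start = (self.next + align - 1) // align * align  # round up to alignment
        if start + n > len(self.region):
            raise MemoryError("region exhausted")
        self.next = start + n
        return start

# Stand-in for the driver's device node: a 4 KiB temporary file.
backing = tempfile.TemporaryFile()
backing.truncate(4096)
sram = mmap.mmap(backing.fileno(), 4096)

pool = BumpAllocator(sram)
a = pool.alloc(10)
b = pool.alloc(16)                 # starts at the next 8-byte boundary
sram[a:a + 10] = b"0123456789"     # use the block through the mapping
```

With a real driver, only the two lines that create the backing file would change; the allocator itself just needs a region with a known length.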
Can a user make buffers to be allocated in the memories of his choice (internal/external)?
If by buffers you mean the regions returned from the standard malloc calls, probably not. But, if you mean "can a user program somehow get a pointer to the on-chip memory", then the answer is almost certainly yes, but the mechanism will depend on choices made when porting linux to this system.
In ARM architectures, there is another memory called "tightly coupled memory" (TCM). What is that and how can user enable and use it? Can I declare buffers in this memory?
I don't know what this is. If I had to guess, I'd assume it's just another form of on-chip ram, but since it has a different name, perhaps I'm wrong.
Do I need to see the memory map (if any) of the hardware board to understand about all these different physical memories present in a typical hardware board?
If the OS and/or device drivers have provided some sort of abstract access to these memory regions, then you won't need to know explicitly about the address map. This knowledge is, however, needed to implement this access in either the kernel or a device driver.
I hope this helps somewhat.
