Exclusives Reservation Granule (ERG) on Apple's processors - ios

Does anyone know what the ERG is on Apple's A5, A5X, A6 and A6X processors?
We ran into an obscure bug with LDREX/STREX instructions and the behavior is different between A5's and A6's. The only explanation I have is that they have different ERGs, but I can't find anything on that. I also could not find a way to retrieve this value; the MRC instruction seems to be prohibited in user mode on iOS.
Thank you!

On OMAP 4460 (ARM Cortex-A9, same as Apple A5/A5X) the ERG is 32 bytes (which is the same as the cache line size).
I don't know what those values are on A6/A6X (and there is no way to find out without loading your own driver, which you cannot do on Apple devices), but my guesstimate is that the cache line size increased to 64 bytes, and so did the ERG.
Alternatively, you may optimize the algorithm for the architectural maximum of 512 words (2K bytes).

ERG size is a critical consideration when using ldrex/strex.
When an ldrex has been issued, if a memory access occurs in the ERG within which the ldrex read occurred, the strex will fail.
It is not unusual to have a structure which contains an ldrex/strex target and some additional data, where the additional data is accessed between the ldrex/strex pair (for example, to store the value loaded by ldrex).
If the ldrex/strex target has insufficient padding in the structure (i.e. the ERG size chosen is too small) the access to other members of the structure will therefore cause the strex to ALWAYS fail.
Game over, lights out.
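To make the padding point concrete, here is a minimal Python sketch of the arithmetic. The helper names and the example offsets are my own (not from any ARM or Apple API), and it assumes the structure itself starts on a granule boundary. It checks whether two byte offsets fall into the same reservation granule and how much padding is needed after the ldrex/strex target so that the next field starts in a fresh granule.

```python
def same_granule(offset_a, offset_b, erg_bytes):
    """True if two byte offsets share one exclusives reservation granule.

    Assumes offsets are measured from a base aligned to erg_bytes.
    """
    return offset_a // erg_bytes == offset_b // erg_bytes


def padding_after(target_size, erg_bytes):
    """Padding needed after a target at offset 0 so the next field starts in a new granule."""
    return (-target_size) % erg_bytes


# A 4-byte ldrex/strex target at offset 0, followed directly by other data at offset 4.
# 32 = the Cortex-A9 value mentioned above, 64 = the guess for A6, 2048 = architectural maximum.
for erg in (32, 64, 2048):
    print(erg, same_granule(0, 4, erg), padding_after(4, erg))
```

With no padding, the neighbouring field at offset 4 shares the granule for every ERG size tried, which is exactly the situation that makes the strex fail.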
Regarding ldrex/strex, ARM implements a "local monitor" and a "global monitor". On systems with only a local monitor, the only way an ldrex/strex can fail is if two ldrexs are issued on the same address prior to an strex having been issued - only systems with global monitors actually notice memory bus traffic within the ERG of the ldrex/strex target.
ARM systems vary hugely and I suspect there are systems which have only local monitors and so do not in fact actually support ldrex/strex.

Does simulating memory-mapped I/O using VMX require instruction decoding?

I am wondering how a hypervisor using Intel's VMX / VT technology would simulate memory-mapped I/O (so that the guest could think it was performing memory-mapped I/O against a device).
I think the basic principle would be to set up the EPT page tables in such a way that the memory addresses in question would cause an EPT violation (i.e. VM exit) by setting them such that they cannot be read or written? However, the next question is how to process the VM exit. Such a VM-exit would fill out all the exit qualification reasons etc. including the guest-linear and guest-physical address etc. But what I am missing in these exit qualification fields is some field indicating - in case of a write instruction - the value that was attempted to be written and the size of the write. Likewise, for a read instruction it would be nice with some bit fields indicating the destination of the read, say a register or a memory location (in case of memory-to-memory string operations). This would make it very easy for the hypervisor to figure out what the guest was trying to do and then simulate the device behavior towards the guest.
But the trouble is, I can't find such fields among the exit qualifications. I can see an instruction pointer to where the faulting instruction is, so I could walk the page tables to read in the instruction and then decode it to understand the instruction, then simulate the I/O behavior. However, this requires the hypervisor to have a fairly complete picture of all x86 instructions, and be able to decode them. That seems to be quite a heavy burden on the hypervisor, and will also require it to stay current with later instruction additions. And the CPU should already have this information.
There's a chance that I am missing these relevant fields because the documentation is quite extensive, but I have searched carefully and have not been able to find them. Maybe someone can point me in the right direction OR confirm that the hypervisor will need to contain an instruction decoder.
I believe most VMs decode the instruction. It's not actually that hard, and most VMs have software emulators to fall back on when the CPU VM extensions aren't available or up to the task. You don't need to handle every instruction, just those that can take memory operands, and you can probably ignore everything that isn't a 1, 2, or 4 byte memory operand since you're not likely to be emulating device registers other than those sizes. (For memory-mapped device buffers, like video memory, you don't want to be trapping every memory access because that's too slow, and so you'll have to take a different approach.)
However, there is one way you can let the CPU do the work for you, but it's much slower than decoding the instruction itself and it's not entirely perfect. You can single step the instruction while temporarily mapping in a valid page of RAM. The VM exit will tell you the guest physical address accessed and whether it was a read or write. Unfortunately it doesn't reliably tell you whether it was a read-modify-write instruction; those may just set the write flag, and with some device registers that can make a difference. It might be easier to copy the instruction (it can only be at most 15 bytes, but watch out for page boundaries) and execute it in the host, but that requires that you can map the page to the same virtual address in the host as in the guest.
You could combine these techniques: decode the common instructions that are actually used to access memory-mapped device registers, while using single stepping for the instructions you don't recognize.
Note that by choosing to write your own hypervisor you've put a heavy burden on yourself. Having to decode instructions in software is a pretty minor burden compared to the task of emulating an entire IBM PC compatible computer. The Intel virtualisation extensions aren't designed to make this easier, they're just designed to make it more efficient. It would be easier to write a pure software emulator that interpreted the instructions. Handling memory mapped I/O would be just a matter of dispatching the reads and writes to the correct function.
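As a rough illustration of that last point, here's a minimal Python sketch of dispatching guest-physical reads and writes to device handler functions. Everything in it is made up for the example (the `MmioBus` class, the toy serial device, the 0xFEB00000 base address and the register layout); it isn't how any particular hypervisor or emulator structures this.

```python
class MmioBus:
    """Routes guest-physical accesses to per-device handlers (hypothetical design)."""

    def __init__(self):
        self.regions = []  # list of (base, length, device)

    def map(self, base, length, device):
        self.regions.append((base, length, device))

    def _find(self, addr, size):
        # Pick the device whose region fully contains the access.
        for base, length, device in self.regions:
            if base <= addr and addr + size <= base + length:
                return device, addr - base
        raise LookupError(f"no device at {addr:#x}")

    def read(self, addr, size):
        device, offset = self._find(addr, size)
        return device.read(offset, size)

    def write(self, addr, size, value):
        device, offset = self._find(addr, size)
        device.write(offset, size, value)


class ToySerial:
    """Made-up device: offset 0 is a data register, offset 4 a status register."""

    def __init__(self):
        self.sent = []

    def read(self, offset, size):
        return 1 if offset == 4 else 0  # status register always reports "ready"

    def write(self, offset, size, value):
        if offset == 0:
            self.sent.append(value & 0xFF)  # only the low byte is "transmitted"


bus = MmioBus()
uart = ToySerial()
bus.map(0xFEB00000, 8, uart)
bus.write(0xFEB00000, 1, ord("A"))
print(bus.read(0xFEB00004, 4), uart.sent)
```

In a pure interpreter the load/store implementation would call `bus.read`/`bus.write` directly; with VMX, the decoded instruction from the EPT violation would supply the address, size and value.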
I don't know in detail how VT-x works, but I think I see a flaw in the way your wishlist could work:
Remember that x86 is not a load/store machine. The load part of add [rdi], 2 doesn't have an architecturally-visible destination, so your proposed solution of telling the hypervisor where to find or put the data doesn't really work, unless there's some temporary location that isn't part of the guest's architectural state, used only for communication between the hypervisor and the VMX hardware.
To handle a read-modify-write instruction with a memory destination efficiently, the VM should do the whole thing with one VM exit. So you can't just provide separate load and store interfaces.
More importantly, handling atomic read-modify-writes is a special case. lock add [rdi], 2 can't just be done as a separate load and store.

Does AArch64 support unaligned access?

Does AArch64 support unaligned access natively? I am asking because currently ocamlopt assumes "no".
Providing the hardware bit for strict alignment checking is not turned on (which, as on x86, no general-purpose OS is realistically going to do), AArch64 does permit unaligned data accesses to Normal (not Device) memory with the regular load/store instructions.
However, there are several reasons why a compiler would still want to maintain aligned data:
Atomicity of reads and writes: naturally-aligned loads and stores are guaranteed to be atomic, i.e. if one thread reads an aligned memory location simultaneously with another thread writing the same location, the read will only ever return the old value or the new value. That guarantee does not apply if the location is not aligned to the access size - in that case the read could return some unknown mixture of the two values. If the language has a concurrency model which relies on that not happening, it's probably not going to allow unaligned data.
Atomic read-modify-write operations: If the language has a concurrency model in which some or all data types can be updated (not just read or written) atomically, then for those operations the code generation will involve using the load-exclusive/store-exclusive instructions to build up atomic read-modify-write sequences, rather than plain loads/stores. The exclusive instructions will always fault if the address is not aligned to the access size.
Efficiency: On most cores, an unaligned access at best still takes at least 1 cycle longer than a properly-aligned one. In the worst case, a single unaligned access can cross a cache line boundary (which has additional overhead in itself), and generate two cache misses or even two consecutive page faults. Unless you're in an incredibly memory-constrained environment, or have no control over the data layout (e.g. pulling packets out of a network receive buffer), unaligned data is still best avoided.
Necessity: If the language has a suitable data model, i.e. no pointers, and any data from external sources is already marshalled into appropriate datatypes at a lower level, then there's really no need for unaligned accesses anyway, and it makes the compiler's life that much easier to simply ignore the idea altogether.
I have no idea what concerns OCaml in particular, but I certainly wouldn't be surprised if it were "all of the above".
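To illustrate the alignment and efficiency points, here's a small Python sketch that classifies an access by address and size. The 64-byte cache line and 4 KiB page are assumptions for the example (both vary between cores and configurations), and the helper names are mine.

```python
def is_naturally_aligned(addr, size):
    """True if addr is a multiple of the access size (size is a power of two)."""
    return addr % size == 0


def crosses_boundary(addr, size, boundary):
    """True if the bytes [addr, addr + size) span two boundary-sized blocks."""
    return addr // boundary != (addr + size - 1) // boundary


CACHE_LINE = 64   # assumed; varies by core
PAGE = 4096       # assumed; AArch64 also supports 16 KiB and 64 KiB pages

# Columns: address, naturally aligned for an 8-byte access, crosses a line, crosses a page.
for addr in (0x1000, 0x1004, 0x103E, 0x1FFE):
    print(hex(addr),
          is_naturally_aligned(addr, 8),
          crosses_boundary(addr, 8, CACHE_LINE),
          crosses_boundary(addr, 8, PAGE))
```

Only the first address keeps the single-copy atomicity guarantee for an 8-byte access; the last one is the worst case, touching two cache lines and two pages.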

ARM: memory address ... why is it 0x04030201...?

Can someone explain to me why we represent the memory address itself in this way:
"Word on address =0x00":
0x04030201,
I know each of the 01, 02, 03, 04 is one byte, but can someone explain to me where that byte is, what does it represent? a memory cell in a register? I am totally confused...
An address, memory or otherwise, is really no different than an address on a building. Sometimes systematically and well chosen, sometimes haphazardly. In any case some buildings are fire stations, some are grocery stores, some are apartments and some are houses and so on. But the addressing system used by a city or nation can get your item directly to that building.
When we talk about addresses in software it is no different. The processor at the lowest level doesn't know or care what lives at an address; there is an address bus, an interface where the address is projected outside the processor. As you add on each layer of logic, like an onion, around the processor and eventually other chips (memory, USB controllers, hard drive controllers, etc.), portions of that address are extracted, just like the parts of an address on an envelope, and the read or write command is delivered to the individual logic that wears that address on the side of its building.
You can't simply ask what is address 0x04030201 without any context. Address schemes are fairly specific to their system. There are hundreds or thousands or tens of thousands of ARM-based systems, all of which have different address schemes. That address could point to nothing, in which case, with nobody to answer, the request dies and possibly hangs the processor; or it could be some RAM; or it could be a register in a USB controller or video controller or disk drive controller.
Generally you have read and write operations. In this analogy, once the letter makes it to the individual at the address on the envelope, the contents of the letter contain instructions: do this (write), or get this and mail it back (read). The individual, in the case of hardware, just does what it is told without asking. If it is a read then it performs a read within the context of whatever that device is. For a hard disk controller, a read of a particular address might return a temperature sensor value, or a status register that contains the speed at which the motor is spinning, or some data that had been recently read from the hard disk. In the simple case of memory it is likely just some memory, some bytes.
How much stuff is being read is another item that is specified on the processor's bus, and what is available to the programmer varies from processor to processor. Sometimes you can request to read or write individual bytes, sometimes 16-bit items or 32 or 64, etc.
Then you get into address translation. Using the mail analogy, this is kind of like having your mail forwarded to another address. You write one address on the letter, the post office has a forwarding request for that address, so they change your address to the new address and then complete the delivery of the letter. When you hear of a memory management unit (MMU), and in some uses of the term virtual memory, that is the kind of thing that is going on. Let's say that we want to make the programmer's life simple and we tell everyone that RAM starts at address 0x00000000. That makes it much easier to have a compiler choose memory locations where our variables and arrays and programs live; it can compile every program the same way based on that address. But how is it that I can have many programs running at once if they all share the same memory? Well, they don't. One program thinks it is writing to address 0x00000000, but in reality there is some unique address, which can be completely different, that belongs only to that program, let's say address 0x10000000. The MMU is like the mail carrier at the post office that changes the address: the processor knows from information about which task is running that it needs to convert 0x00000000 to 0x10000000. When another program accesses what it thinks is 0x00000000 it might have that address changed to 0x11000000, and another program's 0x00000000 might map to physical address 0x12000000. The address that actually hits the memory is called the physical address; the address that the program uses is called the virtual address. It isn't real, it is destined to be changed.
This MMU stuff not only gives compilers and programmers an easier job, it also allows us to protect one program or the operating system from another. Application programs run at a certain protection level which the MMU uses to know what that user is allowed to do. Suppose a program generates a virtual address that is outside of its address space, say the system has 1 gig of memory and the program tries to address 1 gig plus a little bit more. Instead of converting that to a physical address, the MMU generates an interrupt to the processor, which switches the processor into a mode that has more permissions, basically the operating system. The operating system can then decide to try to use that other kind of virtual memory and allow the program to have more memory, or it may kill the program and put up a warning message to the user that such and such program has had a protection fault and was killed.
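The forwarding and protection ideas can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions: a real MMU translates page by page through page tables, while this `ToyMmu` (a made-up name, as are the task names and sizes) gives each task one contiguous block, reusing the example addresses from above.

```python
class ProtectionFault(Exception):
    """Raised when a task uses a virtual address outside its address space."""


class ToyMmu:
    """One contiguous region per task: virtual 0..size maps to base..base+size."""

    def __init__(self):
        self.tasks = {}  # task name -> (physical base, size in bytes)

    def add_task(self, name, phys_base, size):
        self.tasks[name] = (phys_base, size)

    def translate(self, name, virt_addr):
        phys_base, size = self.tasks[name]
        if not 0 <= virt_addr < size:
            # A real MMU would interrupt the processor; here we raise instead.
            raise ProtectionFault(f"{name}: bad address {virt_addr:#x}")
        return phys_base + virt_addr


mmu = ToyMmu()
mmu.add_task("prog_a", 0x10000000, 0x01000000)
mmu.add_task("prog_b", 0x11000000, 0x01000000)

print(hex(mmu.translate("prog_a", 0x0)))   # 0x10000000
print(hex(mmu.translate("prog_b", 0x0)))   # 0x11000000
try:
    mmu.translate("prog_a", 0x01000000)    # first byte past the end of prog_a
except ProtectionFault as fault:
    print("fault:", fault)
```

Both programs use virtual address 0 yet land on different physical addresses, and the out-of-range access ends up in the "operating system" (the except block) rather than in somebody else's memory.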
Address schemes for computers are generally a lot more thought out than the schemes developers use to number houses in new neighborhoods, though not always, but they are not that far removed from an address on an envelope. You pick apart bits in the address, and those chunks of bits mean something and deliver this processor request to the individual at that address. How the bits are parsed is very specific to the processor and platform, and in some cases is dynamic or programmable on the fly, so if your next question is what is 0xabcd on my system, we may still not be able to help you. You may have to do more research or give us a lot of info...
Think of memory as an array of bytes. 'Word on address' can mean different things depending what the CPU designers consider a Word. In your case it seems a Word is 32 bits long.
So 'Word on address=0x00: 0x04030201' means:
'Beginning at memory cell 0x00 (inclusive), the 'next' four bytes are 0x04 0x03 0x02 0x01.'
Also, depending on the endianness of your CPU the meaning of 'next' changes. It could be that 0x04 is stored in cell 0x00, or that 0x01 is stored there.
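A quick way to see both layouts is to let Python's standard `struct` module lay the word out in memory order. This is just a sketch for illustration; the cell addresses 0x00 to 0x03 are the ones from the question, and which layout your ARM system actually uses depends on how it is configured.

```python
import struct

word = 0x04030201

little = struct.pack("<I", word)  # least significant byte at the lowest address
big = struct.pack(">I", word)     # most significant byte at the lowest address

# Print the byte stored in each memory cell for both byte orders.
for name, layout in (("little", little), ("big", big)):
    cells = ", ".join(f"[{addr:#04x}]={byte:#04x}" for addr, byte in enumerate(layout))
    print(f"{name:>6}: {cells}")

# Reading the four cells back with the matching byte order recovers the word.
assert int.from_bytes(little, "little") == word
assert int.from_bytes(big, "big") == word
```

On a little-endian system cell 0x00 holds 0x01; on a big-endian system it holds 0x04.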

MIPS memory execution prevention

I'm doing some research with the MIPS architecture and was wondering how operating systems are implemented with the limited instructions and memory protection that MIPS offers. I'm specifically wondering about how an operating system would prevent certain address ranges from being executed. For example, how could an operating system limit PC to operate in a particular range? In other words, prevent something such as executing from dynamically allocated memory?
The first thing that came to mind is with TLBs, but TLBs only offer memory write protection (and not execute).
I don't quite see how it could be handled by the OS either, because that would imply that every instruction would result in an exception and then MANY cycles would be burned just checking to see if PC was in a sane address range.
If anyone knows, how is it typically done? Is it handled somehow by the hardware during initialization (e.g. it's given an address range and an exception is hit if it's out of range)?
Most protection checks are done in hardware, by the CPU itself, and do not need much involvement from the OS side.
The OS sets up some special tables (page tables or segment descriptors or some such) where memory ranges have associated read, write, execute and user/kernel permissions that the CPU then caches internally.
On every instruction the CPU then checks whether or not the memory accesses comply with the OS-established permissions and, if everything's OK, carries on. If there's an attempt to violate those permissions the CPU raises an exception (a form of interrupt similar to those raised by I/O devices external to the CPU) that the OS handles. In most cases the OS simply terminates the offending application when it gets such an exception.
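As a toy illustration of that division of labour (not how any real CPU or OS is implemented), here is a Python sketch in which the "OS" fills in a permission table once and the "CPU" consults it on every access. The table layout, page numbers and names are all invented for the example.

```python
class AccessViolation(Exception):
    """Stands in for the exception the CPU raises to the OS."""


PAGE_SIZE = 4096

# Hypothetical permission table set up by the "OS": page number -> allowed access kinds.
page_permissions = {
    0x400: {"read", "execute"},   # program code
    0x401: {"read", "write"},     # heap: writable but not executable
}


def check_access(addr, kind):
    """Mimic the per-access hardware check; kind is 'read', 'write' or 'execute'."""
    allowed = page_permissions.get(addr // PAGE_SIZE, set())
    if kind not in allowed:
        raise AccessViolation(f"{kind} at {addr:#x} not permitted")


check_access(0x400010, "execute")        # fetching code: fine
check_access(0x401000, "write")          # writing to the heap: fine
try:
    check_access(0x401000, "execute")    # jumping into the heap
except AccessViolation as exc:
    print("OS handler sees:", exc)
```

The check itself costs the OS nothing at run time; it only gets involved in the except branch, which is where it would typically terminate the offending application.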
In some other cases it tries to handle them and make the seemingly broken code work. One of these cases is support for virtual, on-disk memory. The OS marks a region as unpresent/inaccessible when it's not backed by physical memory and its data is somewhere on the disk. When the app tries to use that region, the OS catches an exception from the instruction that tries to access this memory region, backs the region with physical memory, fills it in with data from the disk, marks it as present/accessible and restarts the instruction that caused the exception. Whenever the OS is low on memory, it can offload data from certain ranges to the disk, mark those ranges as unpresent/inaccessible again and reclaim the memory from those regions for other purposes.
There may also be specific memory ranges, hard-coded by the CPU, that are inaccessible to software running outside of the OS kernel, and the CPU can easily make a check here as well.
This seems to be the case for MIPS (from "Application Note 235 - Migrating from MIPS to ARM"):
3.4.2 Memory protection
MIPS offers memory protection only to the extent described earlier i.e. addresses
in the upper 2GB of the address space are not permitted when in user mode.
No finer-grained protection regime is possible.
This document lists "MEM - page fault on data fetch; misaligned memory access; memory-protection violation" among the other MIPS exceptions.
If a particular version of the MIPS CPU doesn't have any more fine-grained protection checks, they can only be emulated by the OS and at a significant cost. The OS would need to execute code instruction by instruction or translate it into almost equivalent code with inserted address and access checks and execute that instead of the original code.
This is indeed done with TLBs. No-execute bits (NX bits) became popular only a few years ago, so older MIPS processors do not support them. The latest version of the MIPS architecture (Release 3) and the SmartMIPS Application-Specific Extension support this as an optional feature under the name of XI (Execute Inhibit).
If you have a chip without this feature you are out of luck. Like Alex already said, there is no simple way to emulate this feature.

How to find number of memory accesses

Can anybody tell me a unix command that can be used to find the number of memory accesses that took place in a given interval? vmstat, top and sar only give the amount of physical memory space occupied/available, but do not give the number of memory accesses in a given interval.
If I understand what you're asking, such a feature would almost certainly require hardware support at a very low level (e.g. a counter of some sort that monitors memory bus activity).
I don't think such support is available for the common architectures supported by
Unix or Linux, so I'm going to go out on a limb and say that no such Unix command exists.
The situation is somewhat different when considering memory in units of pages,
because most architectures that support virtual memory have dedicated MMU hardware
which operates at that level of granularity, and can be accessed by the operating
system. But as far as I know, the sorts of counter data you'd get from the MMU would
represent events like page faults, allocations, and releases, rather than individual
reads or writes.
