Does AArch64 support unaligned access natively? I am asking because currently ocamlopt assumes "no".
Providing the hardware bit for strict alignment checking is not turned on (which, as on x86, no general-purpose OS is realistically going to do), AArch64 does permit unaligned data accesses to Normal (not Device) memory with the regular load/store instructions.
However, there are several reasons why a compiler would still want to maintain aligned data:
Atomicity of reads and writes: naturally-aligned loads and stores are guaranteed to be atomic, i.e. if one thread reads an aligned memory location simultaneously with another thread writing the same location, the read will only ever return the old value or the new value. That guarantee does not apply if the location is not aligned to the access size - in that case the read could return some unknown mixture of the two values. If the language has a concurrency model which relies on that not happening, it's probably not going to allow unaligned data.
Atomic read-modify-write operations: If the language has a concurrency model in which some or all data types can be updated (not just read or written) atomically, then for those operations the code generation will involve using the load-exclusive/store-exclusive instructions to build up atomic read-modify-write sequences, rather than plain loads/stores. The exclusive instructions will always fault if the address is not aligned to the access size.
Efficiency: On most cores, an unaligned access at best still takes at least 1 cycle longer than a properly-aligned one. In the worst case, a single unaligned access can cross a cache line boundary (which has additional overhead in itself), and generate two cache misses or even two consecutive page faults. Unless you're in an incredibly memory-constrained environment, or have no control over the data layout (e.g. pulling packets out of a network receive buffer), unaligned data is still best avoided.
Necessity: If the language has a suitable data model, i.e. no pointers, and any data from external sources is already marshalled into appropriate datatypes at a lower level, then there's really no need for unaligned accesses anyway, and it makes the compiler's life that much easier to simply ignore the idea altogether.
I have no idea what concerns OCaml in particular, but I certainly wouldn't be surprised if it were "all of the above".
Related
I'm in an assembly language course focusing on x86 Pentium processors, and am working on a Linux system. I understand that programs get loaded into memory and that you can perform operations directly within the registers but I'm not sure you can avoid creating a data segment altogether.
A yes or no, followed by a brief explanation as to why would be great.
It is not required. A data segment is simply a block of memory allocated for data and thus can be written to and read from. Code segments are read only. If you try to write to a code segment the hardware will generate an interrupt. However, assembly codes can be fed any address in memory, and if protected mode is disabled, then the hardware won't generate an interrupt.
As an example, the boot sector loads into a very restricted space on launch, and it is quite common (because space is so restricted) to place variables among the code bytes. Once I even wrote a boot sector that adjusted its own byte-code to accommodate differences in booting from different disks. So this is a case of code using code addresses as variables.
However, while you definitely can avoid creating a data segment, 99.99% of the time you do separate out a data segment.
You may also want to read up on protected mode to understand this better.
Linux's synchronization primitives (spinlock, mutex, RCUs) use memory barrier instructions to force the memory access instructions from getting re-ordered. And this reordering can be done either by the CPU itself or by the compiler.
Can someone show some examples of GCC produced code where such reordering is done ? I am interested mainly in x86. The reason I am asking this is to understand how GCC decides what instructions can be reordered. Different x86 mirco architectures (for ex: sandy bridge vs ivy bridge) use different cache architecture. Hence I am wondering how GCC does effective reordering that helps in the execution performance irrespective of the cache architecture. Some example C code and reordered GCC generated code would be very useful. Thanks!
The reordering that GCC may do is unrelated to the reordering an (x86) CPU may do.
Let's start off with compiler reordering. The C language rules are such that GCC is forbidden from reordering volatile loads and store memory accesses with respect to each other, or deleting them, when a sequence point occurs between them (Thanks to bobc for this clarification). That is to say, in the assembly output, those memory accesses will appear, and will be sequenced precisely in the order you specified. Non-volatile accesses, on the other hand, can be reordered with respect to all other accesses, volatile or not, provided that (by the as-if rule) the end result of the calculation is the same.
For instance, a non-volatile load in the C code could be done as many times as the code says, but in a different order (e.g. If the compiler feels it's more convenient to do it earlier or later when more registers are available). It could be done fewer times than the code says (e.g. If a copy of the value happened to still be available in a register in the middle of a large expression). Or it could even be deleted (e.g. if the compiler can prove the uselessness of the load, or if it moved a variable entirely into a register).
To prevent compiler reorderings at other times, you must use a compiler-specific barrier. GCC uses __asm__ __volatile__("":::"memory"); for this purpose.
This is different from CPU reordering, a.k.a. the memory-ordering model. Ancient CPUs executed instructions precisely in the order they appeared in the program; This is called program ordering, or the strong memory-ordering model. Modern CPUs, however, sometimes resort to "cheats" to run faster, by weakening a little the memory model.
The way x86 CPUs weaken the memory model is documented in Intel's Software Developer Manuals, Volume 3, Chapter 8, Section 8.2.2 "Memory Ordering in P6 and More Recent Processor Families". This is, in part, what it reads:
Reads are not reordered with other reads.
Writes are not reordered with older reads.
Writes to memory are not reordered with other writes, with [some] exceptions.
Reads may be reordered with older writes to different locations but not with older writes to the same location.
Reads or writes cannot be reordered with I/O instructions, locked instructions, or serializing instructions.
Reads cannot pass earlier LFENCE and MFENCE instructions.
Writes cannot pass earlier LFENCE, SFENCE, and MFENCE instructions.
LFENCE instructions cannot pass earlier reads.
SFENCE instructions cannot pass earlier writes.
MFENCE instructions cannot pass earlier reads or writes.
It also gives very good examples of what can and cannot be reordered, in Section 8.2.3 "Examples Illustrating the Memory-Ordering Principles".
As you can see, one uses FENCE instructions to prevent an x86 CPU from reordering memory accesses inappropriately.
Lastly, you may be interested in this link, which goes into further detail and comes with the assembly examples you crave.
The C language rules are such that GCC is forbidden from reordering volatile loads and store memory accesses with respect to each other, or deleting them.
That is not true, and is quite misleading. The C spec does not make such guarantee.
See When is a Volatile Object Accessed?
The standard encourages compilers to refrain from optimizations concerning accesses to volatile objects, but leaves it implementation defined as to what constitutes a volatile access. The minimum requirement is that at a sequence point all previous accesses to volatile objects have stabilized and no subsequent accesses have occurred. Thus an implementation is free to reorder and combine volatile accesses that occur between sequence points, but cannot do so for accesses across a sequence point.
Traditionally, programmers have relied on volatile as a cheap synchronization method but this is no longer a reliable method.
I am really a new starter to Cortex A and I am aware the ARM applies weakly-ordered memory model, and there are three mutually exclusive memory types:
Strongly-ordered
Device
Normal
I roughly understand what Normal is for and what Strongly-ordered and Device mean. However the diffrence between strongly-ordered and device is confusing to me.
According to the Cortex-A Series Programmer's Guide, the only difference is that:
A write to Strongly-ordered memory can complete only when it reaches the peripheral or memory component accessed by the write.
A write to Device memory is permitted to complete before it reaches the peripheral or memory component accessed by the write.
I am not quite sure about what the real implification of this. I am guessing that, the order of the access to the memory typed with Strongly-ordered or Device should be coherent with programmers' codes (no out-of-order access). But the CPU will potentially execute the next instruction while accessing the memory if typed Device, and it will simply wait untill the access to be complete if typed Strongly-ordered.
Correct me if I am wrong and please tell me what is the meaning of doing this.
Thanks in advance.
One important bit to understand is that memory types have no guaranted effect on the instruction stream as a whole - they affect only the ordering of memory accesses. (They may have a specific effect on a specific processor integrated in a specific way with a specific interconnect - but that can never be relied on by software.)
Another important thing to understand is that even Strongly-ordered memory provides implicit guarantees of ordering only with regards to accesses to the same peripheral. Any ordering requirements more strict than that require use of explicit barrier instructions.
A third important point is that any implicit memory access ordering that takes place due to memory types does not affect the ordering of accesses to other memory types. Again, if your application has dependencies like this, explicit barrier instructions are required.
Now, against that background - a simpler way of describing the difference between Device and Strongly-ordered memory is that Device memory accesses can be buffered - in the processor itself or in the interconnect. The difference being that a buffered access can be signalled as complete to the processor before it has completed (or even initiated) at the end point.
This provides better performance at the cost of losing the synchronous reporting of any error condition.
Currently coding for atmel tiny45 microcontroller and I use several lookup tables. Where is the best place to store them? Could you give me a general idea about the memory speed differences between sram-flash-eeprom?
EEPROM is by far the slowest alternative, with write access times in the area of 10ms. Read access is about as fast as FLASH access, plus the overhead of address setup and triggering. Because there's no auto-increment in the EEPROM's address registers, every byte read will require at least four instructions.
SRAM access is the fastest possible (except for direct register access).
FLASH is a little slower than SRAM and needs indirect addressing in every case (Z-pointer), which may or may not be needed for SRAM access, depending on the structure and access pattern of your table.
For execution times of instructions see AVR Instruction Set, especially the LPM vs. the LDS, LD, and LDD instructions.
Are there any Intel/AMD desktop processors that support weakly ordered memory or this is a feature of a server setup with multiple processors?
I'm not sure if this is what you're after, but IIRC there have been some architectures that had weakly-ordered memory accesses in that they could be ordered arbitrarily, and you had to insert memory barriers to ensure a particular ordering.
Modern processors use what's called a "load-store queue" that hides memory reordering, making it look almost as though it's happening in program order. Reads are often reordered (but with some care), writes can be made out of order but are committed in-order (although multiple writes to the same location are consolidated), and reads and writes are reordered wrt each other only carefully and speculatively. The latter is called "hoisting", where a read is performed speculatively ahead of a write (that appears earlier in the instruction sequence) and may be canceled (like a mispredicted branch) if it turns out that preceding write would have affected it.
Also, if memory is marked as uncached, then CPUs generally infer that to mean it's I/O space and perform no access reordering. x86 and SPARC are like this. However, PowerPC will still reorder reads to I/O memory space, and we have to use the EIEIO (ensure in order execution of I/O) instruction to force a particular ordering. IIRC, we also had to use memory barriers on PA-RISC and Alpha. Moreover, there are memory barriers on x86, but I'm not familiar with their use (possibly to ensure ordering of accesses to cached memory space).
You mention multi-core systems. In general, elaborate cache coherency protocols are employed to make all memory access appear to conform to certain interleaving rules, such that accesses hit the last-level caches and main memory in an order that would be possible if there were no caching.
Many modern processors now use out-of-order execution to improve performance by hiding memory latencies. This is not related to multiple processors/cores, it can be done with a single-cored processor acting alone. For this reason, you should not rely on memory ordering.