Which PCRs can be extended by our own code?

There are 24 PCRs in the TPM 1.2 specification. Some of these PCRs are reserved and cannot be extended by user code. Below are the PCR indexes and their usage:
PCR Index   PCR Usage
0           CRTM, BIOS and Platform Extensions
1           Platform Configuration
2           Option ROM Code
3           Option ROM Configuration and Data
4           IPL Code (MBR Information and Bootloader Stage 1)
5           IPL Code and Configuration Data (for use by IPL Code)
6           State Transition and Wake Events
7           Reserved for future usage. Do not use.
8           Bootloader Stage 2 Part 1
9           Bootloader Stage 2 Part 2
10          Not in Use
11          Not in Use
12          Bootloader Commandline Arguments
13          Files checked via checkfile routine
14          Files which are actually loaded (e.g. Linux kernel, initrd, modules...)
15          Not in Use
16          Not in Use
17          DRTM
18-23       Not in Use
What I understood is that a user can extend all the PCRs which are not in use. Is this correct? I ask because I have written my own code to extend PCRs (following the TrouSerS coding guidelines), and it turns out that I can extend all the PCRs except PCR 17 to PCR 22. My understanding was that I could only extend a few of them, and in particular could not touch the lower ones from PCR 0 to PCR 7.

It depends on the locality; I was in locality 0. PCRs 17 to 22 are the dynamic (DRTM) PCRs: they can only be extended from higher localities, which is why code running at locality 0 can extend every PCR except 17 through 22.
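For reference, here is a minimal TrouSerS-based sketch of an extend at locality 0 (error handling omitted; the header path, the PCR index and the all-zero 20-byte digest are illustrative assumptions, adjust them to your setup):

#include <tss/tspi.h>   /* TrouSerS TSS 1.2 API (assumed install path) */

int main(void)
{
    TSS_HCONTEXT hContext;
    TSS_HTPM     hTPM;
    BYTE         digest[20] = { 0 };   /* SHA-1-sized value to extend with */
    UINT32       newLen;
    BYTE        *newValue;

    Tspi_Context_Create(&hContext);
    Tspi_Context_Connect(hContext, NULL);        /* NULL = local TCS daemon */
    Tspi_Context_GetTpmObject(hContext, &hTPM);

    /* Extending PCR 10 succeeds at locality 0; PCRs 17-22 will not. */
    Tspi_TPM_PcrExtend(hTPM, 10, sizeof(digest), digest, NULL,
                       &newLen, &newValue);

    Tspi_Context_FreeMemory(hContext, NULL);
    Tspi_Context_Close(hContext);
    return 0;
}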


STM32 Current Flash Vector Address

I'm working on a dual-OS system with an STM32F103. I have two separate programs flashed at different locations. Since both programs are identical, the only way to know which one is running is by its start vector address.
But how can I read the current program's start vector address on an STM32?
After reading the comments, it sounds like what you have/want is a bootloader. If your goal here is to have two different applications, one to do your main processing and real time handling and the other to just program new firmware, then you want to make a bootloader in your default boot flash space.
Bootloaders fundamentally do a few things; everything else is extra:
1. Check themselves using some type of data integrity check, like a CRC.
2. Check the application.
3. Jump to the application.
Bootloaders will also program applications in the app space and verify they are programmed correctly before jumping as well. Colin gave some good advice about appending a CRC to the hex file before it is programmed in flash space to verify the applications.
There are a few things to look out for. The first is the linker script, and this is extremely important. A linker script maps input sections to output sections and determines which memory regions they go into. For both of your applications, you need to create a memory map of how you want both programs to sit inside of the flash space. From this point, you can make linker scripts for both programs so that a hex file can be generated within the parameters of what you deem acceptable flash space for each program. Each project will have its own linker script. An example would look something like this:
LR_IROM1 0x08000000 0x00010000 {    ; load region size_region
  ER_IROM1 0x08000000 0x00010000 {  ; load address = execution address
    *.o (RESET, +First)
    *(InRoot$$Sections)
    .ANY (+RO)
  }
  RW_IRAM1 0x20000000 0x00018000 {  ; RW data
    .ANY (+RW +ZI)
  }
}
This will give RAM for the application to use as well as a starting point for the application.
After that, you can start the bootloader and give it information about where the application space lies for jumping and programming. Once again this is all determined by you from your memory map and both applications' linker scripts. You are going to need to add a separate entry inside of the linker for your CRC and length for a comparison of the calculated versus stored as well. Whatever tool you use to append the CRC to the hex file and have it programmed to flash space, remember to note the location and make it known to the linker script so you can reference those addresses to check integrity later.
After you check everything and determine it is okay to go to the application, you can use some ARM assembly to jump to the application's start address. Before jumping, make sure to disable all peripherals and interrupts that were enabled in the bootloader. As Colin mentioned, the two will share RAM, so it is important you de-initialize everything you used; otherwise you'll end up in a hard fault.
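As a concrete illustration, the jump can also be done from C with CMSIS helpers; a minimal sketch, assuming the CMSIS device header is included and a hypothetical APP_ADDRESS taken from your own memory map:

#include <stdint.h>
/* assumes the CMSIS device header (e.g. stm32f1xx.h) is also included,
   providing SCB, __disable_irq() and __set_MSP() */

#define APP_ADDRESS 0x08008000UL   /* hypothetical app start from your memory map */

typedef void (*app_entry_t)(void);

void jump_to_app(void)
{
    uint32_t    app_sp = *(volatile uint32_t *)APP_ADDRESS;  /* word 0: initial SP */
    app_entry_t entry  = (app_entry_t)(*(volatile uint32_t *)(APP_ADDRESS + 4U)); /* word 1: reset handler */

    __disable_irq();             /* quiesce everything the bootloader enabled */
    SCB->VTOR = APP_ADDRESS;     /* point the vector table at the application */
    __set_MSP(app_sp);           /* load the application's initial stack pointer */
    entry();                     /* jump to the application's reset handler */
}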
At this point, the application you jump to was laid out by its own linker script, so it should begin executing as planned, as long as you have the correct vector table offset, which gets to the heart of your question.
As far as your question on the "flash vector address": I think what you really mean is your interrupt vector table address. An interrupt vector table is a data structure in memory that maps interrupt requests to the addresses of interrupt handlers. It is where the PC register grabs the next instruction address upon a hardware interrupt trigger, for example; you can see this by tracing the ARM pipeline through a few lines of assembly code. Each entry of this table is a handler's address.
This offset must be aligned with your application; otherwise you will never reach main, and the program will sit in the application space with nothing to do, since all the handler addresses are unknown. This is what SCB->VTOR is for: it is the vector table offset address register. Luckily, it is set inside of the STM-generated file "system_stm32(xx)xx.c" (xx is your microcontroller variant). There is a define called VECT_TAB_OFFSET, which is the offset of the vector table in the memory map and is assigned to SCB->VTOR. Your interrupt vector table will always lie at the starting address of your main application: for the bootloader the offset can be 0x00, but for the application it will be the application space's start address minus the first addressable flash address of the microcontroller.
/************************* Miscellaneous Configuration ************************/
/*!< Uncomment the following line if you need to relocate your vector Table in
     Internal SRAM. */
/* #define VECT_TAB_SRAM */
#define VECT_TAB_OFFSET 0x00 /*!< Vector Table base offset field.
                                  This value must be a multiple of 0x200. */
/******************************************************************************/
Make sure you understand what is expected on the micro side using the STM documentation before programming things; vector tables in this chip can only be placed at multiples of 0x200. But to answer your question: this address is determined by your memory map, and eventually you will have a hard-coded reference to it as a define, so you can figure it out from there.
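And to answer the literal question of reading the running program's start vector address at runtime: SCB->VTOR can simply be read back. A minimal sketch, assuming the CMSIS device header for your part:

#include "stm32f1xx.h"   /* assumption: CMSIS device header for the STM32F103 */

uint32_t current_image_base(void)
{
    /* The vector table base equals the start address of the running image,
       so comparing it against your two known flash locations tells you
       which program is executing. */
    return SCB->VTOR;
}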
Hope this helps and good luck to you on your application.

Trying to understand the load memory address (LMA) and the binary file offset in an ARM binary image

I'm working with an ARM Cortex-M4 (STM32F4xxxx) and I'm trying to understand how exactly the binaries (*.elf and *.bin) are built and flashed into memory, especially with regard to the memory locations. Specifically, what I don't understand is how the LMA gets 'translated' from the actual binary file offset. Let me explain with an example:
I have an *.elf file whose (relevant) sections are the following (obtained from objdump -h):
my_file.elf:     file format elf32-littlearm

Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         000001c4  08010000  08010000  00020000  2**0
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  1 .bootloader   00004000  08000000  08000000  00010000  2**0
                  CONTENTS, ALLOC, LOAD, DATA
According to that file, the VMA and LMA values are 0x8000000 and 0x8010000, which is perfectly fine since they are defined that way in the linker script file. In addition, according to that report, the file offsets of those sections are 0x10000 and 0x20000 respectively. Next, I execute the following command to dump the memory corresponding to the .bootloader section:
xxd -s 0x10000 -l 16 my_file.elf
00010000: b007 c0de b007 c0de b007 c0de b007 c0de ................
Now, create the binary file to be flashed into memory:
arm-none-eabi-objcopy -O binary --gap-fill 0xFF -S my_file.elf my_file.bin
According to the information provided above, and as far as I understand, the generated binary file should have the .bootloader section located at 0x8000000. I understand that this is not how it actually works, inasmuch as the file would get extremely big, so the bootloader is placed at the beginning of the file, at address 0x0 (note that both memory chunks are identical, even though they are at different addresses):
xxd -s 0x00000 -l 16 my_file.bin
00000000: b007 c0de b007 c0de b007 c0de b007 c0de ................
As far as I understand, when the mentioned binary file is flashed into memory, the bootloader will be at address 0x0, which is perfectly fine given that the MCU in question jumps to the address at 0x4 (after getting the SP from 0x0) when it starts working, as I have checked here (page 26): https://www.st.com/content/ccc/resource/technical/document/application_note/76/f9/c8/10/8a/33/4b/f0/DM00115714.pdf/files/DM00115714.pdf/jcr:content/translations/en.DM00115714.pdf
Finally, my questions are:
Will the bootloader actually be placed at 0x0? If so, what's the purpose of defining the memory sectors in the linker file?
Is this because 0x0 belongs to flash memory, and when the MCU starts, all the flash is copied into RAM at address 0x8000000? If so, will the bootloader be executed from flash memory and all the rest of the code from RAM?
Taking into account the above questions, if I have not understood anything, what's the relation/difference between the LMA and the File offset?
No, the bootloader will be at 0x08000000, as defined in the ELF file.
The image will be burned into flash at that address and executed directly from there (not copied somewhere else or so).
There is a somewhat undocumented behaviour: the uninitialized area before the actual data is skipped when producing the binary image. As a comment in the BFD library source states (https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=bfd/binary.c;h=37f5f9f7363e7349612cdfc8bc579369bbabbc0c;hb=HEAD#l238):
/* The lowest section LMA sets the virtual address of the start
of the file. We use this to set the file position of all the
sections. */
The lowest section LMA (.bootloader) is 0x08000000 in your .elf, so the binary file will start at that address.
You should take this address into account and add it to the file offset when determining an address in the image.
Sections:
Idx Name          Size      VMA       LMA       File off  Algn
  0 .text         000001c4  08010000  08010000  00020000  2**0
                  /*        ^^^^^^^^
                     this section will be at offset 10000 in the image */
                  CONTENTS, ALLOC, LOAD, READONLY, DATA
  1 .bootloader   00004000  08000000  08000000  00010000  2**0
                  /*        ^^^^^^^^
                     this is the lowest LMA; in your case it will be used
                     as the start of the image, and this section will be
                     placed directly at the start of the image */
                  CONTENTS, ALLOC, LOAD, DATA
Memory layout:                  Bin. image layout:

00000000   \
...        /  skipped
           ________________
08000000   .bootloader          0
...        ________________
08004000   <gap>                4000
...        ________________
08010000   .text                10000
...        ________________
080101C4                        101C4
That start address is defined in the ldscript, so the binary image starts at a fixed location. However, you should be aware of this behaviour when dealing with ldscripts and binary images.
To summarize the building and flashing process:
When linking, the start address is defined in the ldscript, and the first section in the ELF is located there.
When converting to binary, the start address is determined from the lowest LMA, and the binary image starts from that address.
When flashing the image, the same address is given to the flasher as a parameter, so the image is placed at the right location (the one defined in the ldscript).
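For example, with the open-source stlink tools the .bin produced above would be flashed like this (the tool choice is just an illustration; any flasher that takes a base address works the same way):
st-flash write my_file.bin 0x08000000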
Update: STM32F4xxx booting process.
The address region starting at address 0 is special to those MCUs. It can be configured to map other regions, which are flash, SRAM, or the system ROM; the mapping is selected by the BOOT pins.
From the CPU side it looks like a second copy of the flash (or SRAM, or system ROM) appears at address 0.
When the CPU starts, it first reads the initial SP from address 0 and the initial PC from address 4. The reads are actually performed from the flash memory.
If the code is linked to run from the actual flash location, then the initial PC will point there. In this case execution starts at the actual flash address.
----- Mapped area (mimics contents of flash) -----
      0:  (20001000)
      4:  (0800ABCD) ----.            ; CPU reads PC here
   ....                  |            ; (it points into flash)
----- FLASH -----        |
8000000:  20001000       |            ; initial stack pointer
8000004:  0800ABCD --.   |            ; address of _start in flash
   ....              |   |
800ABCD:  <_start:> movw ... <-'<-'   ; code execution starts here
(Note: this does not apply to hex images (like Intel HEX or S-record), as such formats define the loading address explicitly, and it is used as-is there.)
The documentation is pretty clear on where the address space for application code is on the STM32s, which is 0x08000000 (a competing vendor uses something like 0x01000000, and so on), and that when booting in a certain mode 0x08000000 is mapped to address 0x00000000, as can easily be seen with a debugger (in both spaces).
The address space at 0x00000000 mapped to 0x08000000 is smaller than the potential address space at 0x08000000 depending on the chip. So it is wise to build for and use 0x08000000 rather than 0x00000000 but for small programs you can choose either.
Because the Cortex-M is a vector table machine, when the logic reads address 0x00000004 (which is mapped to 0x08000004 in a normal boot mode) it sees 0x080xxxxx and then gets out of the 0x00000000 memory space, avoiding any limitations there.
When you use the boot0/boot1 strap pins you can instead cause 0x00000000 to map elsewhere, where the burned-in bootloader lives. That bootloader of course can easily read 0x08000000 and easily simulate a reset by branching, or it can change the logic and actually reset (if you ask it to, although I don't know if that bootloader actually supports running a program). Who knows; even if we had worked there, we couldn't necessarily say. It is quite possible it always boots into the bootloader and then changes the mapping depending on the straps.
This is similar to an MMU but much simpler; decoding addresses and aliasing them is pretty easy: if boot0 == 0 and address[31:16] == 0x0000, then address[31:16] = 0x0800 and the memory system decodes it at the different address. As easy as that is to write in C, it is that easy in the HDL, if not easier.
This is not uncommon in microcontrollers as well as elsewhere. Since microcontrollers generally boot from flash/ROM, but that same boot space on some architectures is also the vector or exception table that an RTOS might want to manipulate, you sometimes see that RAM can be swapped into that space, so the CPU "sees" RAM after a control register is changed where on boot it "saw" the vector table in flash. Either that, or you have the code in flash branch to somewhere in RAM for the non-reset vectors, and then the RTOS or any other application that cares can make runtime changes to what code actually runs for those exceptions or interrupts.
ARM imposes address space rules for where code can execute, where data can live, where you might want to start your peripheral address space, and what address space is reserved by ARM for resources within the core. So you will sometimes see RAM have an alias at a lower address, implying that if you want to run a program in RAM you should use the lower address for execution, but can use either address to copy the code there.
It is up to the chip designers how simple or complicated to make this. For ST it's pretty simple: there are one or more boot pins on the package that at least let you choose between your application and the on-chip bootloader. So far, on all the STM32s I have seen, the application flash space is considered to live at 0x08000000 and is mapped/aliased to 0x00000000 in one of those boot modes. When two boot pins are exposed, up to four boot conditions can exist, of which one is the application with 0x00000000 aliased to 0x08000000.
As for how to get the bits into the flash, that varies widely by tool. Toolchains like GNU will certainly build a .bin file where the first byte of the file is the first byte from the ELF that we want at 0x08000000 (if built that way; if you built for 0x02000000 it will still be the first byte, and that code probably won't work). There are tools (and you can certainly write your own) that, knowing this, can load a .bin file at the desired place of 0x08000000; or you can have your tool write to address 0x00000000 in the right mode, for a program that is not too big, and have it still land in the right place to execute on reset. Likewise, there are (or you can write) tools that parse .elf files, Intel HEX, Motorola S-record and others, and based on the information in those files have the data loaded into the address space you desire, assuming everything is bug-free.
You might be overcomplicating it. There isn't any magic to it: the tools need to do the sane thing, and the sane thing is to take the binary from the compiler and put it in the chip where we want it. We are responsible for the linker script and the bootstrap code/vector table, of course, but if we do that right, the tools (if designed right) will put the bits in the right place in the chip, and if the chip is designed as documented, it will boot and run.
Will the bootloader actually be placed at 0x0? If so, what's the
purpose of defining the memory sectors in the linker file?
Ideally you want your application or bootloader as you are calling it to be at address 0x08000000 in the processors address space. In certain boot modes (boot0/boot1) that address is also aliased to 0x00000000 so you can see that vector table at both places at the same time. If you are not in the right boot mode then only 0x08000000 will show your code.
Is this because 0x0 belongs to flash memory, and when the MCU starts,
all the flash is copied into RAM at address 0x8000000? If so, will the
bootloader be executed from flash memory and all the rest of the code
from RAM?
The logic in the chip is designed to take the address the processor puts on its address bus and have more than one address land on the application flash. The application flash itself is not "at" 0x08000000: if it's a 16 Kbyte flash, for example, it only has addresses from 0x0000 to 0xFFFF. When you access 0x08001234, the bus actually sends 0x1234 to the flash controller, or the flash controller chops the top off once it knows it is supposed to handle that request. 0x00000000 and 0x08000000 are the processor's view of the address space; in reality the upper bits are decoded to route the request to whoever it belongs to, and the final handler ultimately looks at the lower bits to determine what is being addressed.
It is like delivering a letter: it has a first and last name, a street address, city, state, and zip. Once it gets to the right post office in the right state, only the street address matters to the postal carrier. Once it gets to the right house, often the first name is all that matters; the rest can be ignored. No difference here: portions of the address can become don't-cares as the responsible logic inspecting the address aims the request at the correct party.
Taking into account the above questions, if I have not understood
anything, what's the relation/difference between the LMA and the File offset?
The ELF file format is generic, way overkill for microcontroller work, but it is well supported and easy to use, so why not. The load memory address is where we, the programmers, have decided that code should live with respect to the processor's view of the world. From a readelf perspective, the offset in the file is just the offset of that information within the ELF file; it is wherever the tool put it and has no other interesting relationship, or at least doesn't need to. objcopy will rip that data out of the file and, for -O binary, put it in a sort of memory-image file, with the lowest address being copied out becoming offset 0 in that file, and the size being determined by the total address space of all the loadable blocks (unless you use more command-line parameters).
And as you sort of implied: if you have a linker script bug where even a single instruction sits at 0x08000000 and a single byte of .data sits at 0x20000000, and you didn't do the AT > thing, then your file, despite having only a few relevant bytes, will be 0x20000001 - 0x08000000 bytes long after a -O binary. So it's a good idea not to put objcopy in your makefile until you have debugged your linker script. Imagine a target where flash is at 0x00000000 and memory at 0xE0000000: pretty big .bin files until you get the linker script sorted out.
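For reference, the "AT >" construct in a GNU ld script looks like the following sketch (the region names RAM and FLASH are assumptions from a typical MEMORY block): the section's VMA is in RAM but its LMA is in flash, so the -O binary output stays small:

.data : {
    *(.data*)
} > RAM AT > FLASH   /* executes/lives in RAM (VMA), loaded from FLASH (LMA) */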

Do any CPUs have hardware support for bounds checking?

It doesn't seem like it would be difficult to associate ranges with segments of memory. Then have an assembly instruction which treats 2 integers as "location" & "offset" (another for "data" if setting), and returns the data and error code. This would mean no longer having to make a choice between speed and security/safety when working with arrays.
Another example might be a function which verifies that instructions originating in a particular memory range cannot physically access memory outside that range. If all hardware connected to the motherboard had this capability (and were made to be compatible with each other), it would be trivial to make perfect virtual machines that run at nearly the same speed as the physical machine.
Yes.
Decades ago, Lisp machines performed simultaneous validation checks (e.g. type checks and bounds checks) as the program ran with the assumption the program and state were valid, jumping "back in time" if the check failed - unfortunately this ability to get "free" runtime validation was lost when conventional (i.e. x86) machines became dominant.
https://en.wikipedia.org/wiki/Lisp_machine
Lisp Machines ran the tests in parallel with the more conventional single instruction additions. If the simultaneous tests failed, then the result was discarded and recomputed; this meant in many cases a speed increase by several factors. This simultaneous checking approach was used as well in testing the bounds of arrays when referenced, and other memory management necessities (not merely garbage collection or arrays).
Fortunately we're finally learning from the past and slowly, and by piecemeal, reintroducing those innovations - Intel's "MPX" (Memory Protection eXtensions) for x86 were introduced in Skylake-generation processors for hardware bounds-checking - though it isn't perfect.
(x86 is a regression in other ways too: IBM's mainframes had true hardware-accelerated system virtualization in the 1980s - we didn't get it on x86 until 2005 with Intel's "VT-x" and AMD's "AMD-V" extensions).
x86 BOUND
Technically, x86 does have hardware bounds-checking: the BOUND instruction was introduced in 1982 with the Intel 80186/80188 (and is present on the Intel 286 and above, but not the Intel 8086 or 8088 processors).
While the BOUND instruction does provide hardware bounds-checking, I understand it indirectly caused performance issues because it defeats the hardware branch predictor (according to a Reddit thread, though I'm unsure why), and also because it requires the bounds to be specified as a tuple in memory, which is terrible for performance. I understand that at runtime it's no faster than manually emitting the instructions for "if index not in range [x,y] then signal the #BR exception to the program or OS" (so you might imagine the BOUND instruction was added for the convenience of people who coded assembly by hand, which was quite common in the 1980s).
The BOUND instruction is still present in today's processors, but it was not carried over to AMD64 (x64), likely for the performance reasons explained above, and also because very few people were using it (and compilers could trivially replace it with a manual bounds check that might have better performance anyway, as it could use registers).
Another disadvantage of storing the array bounds in memory is that code elsewhere (that wasn't subject to BOUND checking) could overwrite the previously-written bounds for another pointer and circumvent the check that way. This is mostly a problem with code that intentionally tries to disable safety features (i.e. malware), and if the bounds were stored on the stack, then given how easy it is to corrupt the stack, it has even less utility.
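For comparison, the manual check a compiler can emit instead of BOUND is just one unsigned comparison, since a negative index cast to unsigned becomes enormous. A minimal sketch (the function name is mine):

#include <stdlib.h>

/* One unsigned compare covers both bounds: i below 0 wraps to a huge value. */
static inline int checked_load(const int *arr, size_t len, size_t i)
{
    if (i >= len)
        abort();      /* analogous to BOUND raising the #BR exception */
    return arr[i];
}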
Intel MPX
Intel MPX was introduced in Skylake architecture in 2015 and should be present in all Skylake and subsequent processor models in the mainstream Intel Core family (including Xeon, and non-SoC versions of Celeron and Pentium). Intel also implemented MPX in the Goldmont architecture (Atom, and SoC versions of Celeron and Pentium) from 2016 onwards.
MPX is superior to BOUND in that it provides dedicated registers to store the bounds range, so the bounds check should be almost zero-cost compared to BOUND, which required a memory access. On the Intel 486 the BOUND instruction takes 7 cycles (compare to CMP, which takes only 2 cycles even if the operand is a memory address). In Skylake the MPX equivalents (BNDMK, BNDCL and BNDCU) are all 1-cycle instructions, and BNDMK can be amortized as it only needs to be executed once for each new pointer.
I cannot find any information on whether or not AMD has implemented its own version of MPX (as of June 2017).
Critical thoughts on MPX
Unfortunately the current state of MPX is not all that rosy. A recent paper by Oleksenko, Kuvaiskii, et al. from February 2017, "Intel MPX Explained" (PDF link; caution: not yet peer-reviewed), is a tad critical:
Our main conclusion is that Intel MPX is a promising technique that is not yet practical for widespread adoption. Intel MPX's performance overheads are still high (~50% on average), and the supporting infrastructure has bugs which may cause compilation or runtime errors. Moreover, we showcase the design limitations of Intel MPX: it cannot detect temporal errors, may have false positives and false negatives in multithreaded code, and its restrictions on memory layout require substantial code changes for some programs.
Also note that, compared to the Lisp Machines of yore, Intel MPX checks are still executed inline, whereas on Lisp Machines (if my understanding is correct) the bounds checks happened concurrently in hardware, with a retroactive jump backwards if a check failed. Thus, so long as a running program's pointers did not point to out-of-bounds locations, there would be absolutely zero runtime performance cost. So if you have this C code:
char arr[10];
arr[9] = 'a';
arr[8] = 'b';
Then under MPX this would be executed:
Time  Instruction         Notes
 1    BNDMK arr, arr+9    Set bounds 0 to 9.
 2    BNDCL arr           Check `arr` meets the lower bound.
 3    BNDCU arr           Check `arr` meets the upper bound.
 4    MOV 'a' arr+9       Assign 'a' to arr+9.
 5    MOV 'b' arr+8       Assign 'b' to arr+8.
But on a Lisp machine (if it were magically possible to compile C to Lisp...), then the program-reader-hardware in the computer has the ability to execute additional "side" instructions concurrently with the "actual" instructions, allowing the "side" instructions to instruct the computer to disregard the results from the "actual" instructions in the event of an error:
Time  Actual instruction   Side instruction
 1    MOV 'a' arr+9        ENSURE arr+9 BETWEEN arr, arr+9
 2    MOV 'b' arr+8        ENSURE arr+8 BETWEEN arr, arr+9
I understand the instructions-per-cycle for the "side" instructions are not the same as for the "actual" instructions - so the side-check for the instruction at Time=1 might only complete after the "actual" instructions have already progressed to Time=3 - but if the check failed then it would pass the instruction pointer of the failed instruction to the exception handler, which would direct the program to disregard the results of the instructions executed after Time=1. I don't know how they could achieve that without massive amounts of memory or some mandatory execution pauses, possibly memory-fencing too - that's outside the scope of my answer, but it is at least theoretically possible.
(Note in this contrived example I'm using constexpr index values that a compiler can prove will never be out-of-bounds so would omit the MPX checks entirely - so pretend they're user-supplied variables instead :) ).
I'm not an expert in x86 (nor do I have any experience in microprocessor design, save for a CS500-level course I took at UW and didn't do the homework for...), but I don't believe concurrent execution of bounds-checks or "time travel" is possible with x86's current design, despite the existing implementation of out-of-order execution - though I might be wrong. I speculate that if all pointer types were promoted to 3-tuples (struct BoundedPointer<T> { T* ptr, T* min, T* max } - which conceptually already happens with MPX and other software-based bounds checks, as every guarded pointer has its bounds defined when BNDMK is called), then the protection could be provided for free by the MMU - but now each pointer would consume 24 bytes of memory instead of the current 8 bytes (or compare to the measly 4 bytes under 32-bit x86). RAM is plentiful, but still a finite resource that shouldn't be wasted.
MPX in GCC
GCC supported MPX from version 5.0 to 9.1 ( https://gcc.gnu.org/wiki/Intel%20MPX%20support%20in%20the%20GCC%20compiler ), when it was removed due to its maintenance burden.
MPX in Visual Studio / Visual C++
Visual Studio 2015 Update 1 (2015.1) added "experimental" support for MPX with the /d2MPX switch ( https://blogs.msdn.microsoft.com/vcblog/2016/01/20/visual-studio-2015-update-1-new-experimental-feature-mpx/ ). Support is still present in Visual Studio 2017 but Microsoft has not announced if it's considered a mainstream (i.e. non-experimental) feature yet.
MPX in Clang / LLVM
Clang partially supported manual use of MPX in the past, but that support was fully removed in version 10.0.
As of July 2021, LLVM still seems capable of outputting MPX instructions, but I can't see any evidence of an MPX "pass".
MPX in Intel C/C++ Compiler
The Intel C/C++ Compiler has supported MPX since version 15.0.
The XL compilers available on the IBM POWER processors on the Little Endian Linux, Big Endian Linux or AIX operating systems have a different implementation of array bounds checking.
Using the -qcheck or its synonym -C option turns on various kinds of checking. -qcheck=bounds checks array bounds. When this is used, the compilers check that every array reference has a valid subscript.
The hardware instruction used is a conditional trap, comparing the subscript to the upper limit and trapping if the subscript is too large or too small. In C and C++ the lower limit is 0. In Fortran it defaults to 1 but can be any integer. When it is not zero, the lower limit is subtracted from the subscript being checked, and the check compares that to the upper limit minus the lower limit.
When the limit is known at compile time and small enough, a conditional trap immediate instruction is enough. When the limit is calculated at execution time or is greater than 65535, a conditional trap instruction comparing two registers is needed.
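For example, turning the checking on is just a compile flag (the file names here are illustrative):
xlc -qcheck=bounds myprog.c -o myprog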
The performance impact is small for several reasons:
1. The conditional trap instructions are fast.
2. They are executed in a standard integer pipeline. Since most POWER CPUs have 2 or 4 integer pipelines, there is usually an otherwise empty slot to put the trap in, so it is often essentially zero cost.
3. When it can the compiler optimizer moves the conditional trap out of loops so it is executed only once, checking all loop iterations at once.
4. When it can prove the actual subscript cannot exceed the limit, the optimizer discards the instruction.
5. When it can prove the subscript will always be invalid, the optimizer uses an unconditional trap.
6. If necessary, -qcheck can be used during testing and skipped for production builds, but the overhead is small enough that this is not usually necessary.
If my memory is correct, one long ago paper reported a 2% slowdown in one case and 0% in another. Since that CPU had only one integer pipeline, the slowdown should be significantly less with modern CPUs.
Other checking using the same mechanism is available to detect dereferencing NULL pointers, dividing an integer by zero, using an uninitialized auto variable, specially written asserts, etc.
This doesn't include all kinds of invalid memory usage, but it does handle the most common kind, does it very efficiently, and is very easy to use.
GCC supports -fbounds-check for similar purposes, but at this time it is only available for the Fortran front end (gfortran).

What are the alignment restrictions on the new Haswell AVX "gather" instructions?

I'm looking at the AVX programming reference. The new Haswell instructions include some eagerly awaited "gather" loads. However, I can't figure out what the alignment restrictions are on the indexed data items. Section 2.5 "Memory alignment" of the reference seems like it ought to list the various VGATHER* instructions in one of tables 2.4 or 2.5... but it doesn't.
Background: while gather instructions' supported data sizes are 4 and 8 bytes, my application could benefit from gather-loading adjacent 16-bit data value pairs to DWORDS. Odd indices with a 2-byte scale will produce 2-byte aligned 4-byte loads and it's not clear to me from the manual whether this will fault or otherwise fail to work as intended (I rather suspect I'm out of luck given all the instructions supporting unaligned accesses seem to have a 'U' in them).
This is the first time I've heard about AVX2, but I'm guessing the memory alignment restrictions won't differ from the current implementation of AVX on Sandy Bridge with the new VEX coding scheme, i.e. no alignment required unless explicitly using an aligned VMOV instruction with an 'A' in the name. Most instructions allow access at any byte-granularity alignment.
In fact, see section 2.5, page 35 of the Intel® Advanced Vector Extensions Programming Reference, which states exactly this.
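For what it's worth, the scenario from the question maps onto the AVX2 intrinsics as in this sketch (the function name is mine; assumes a Haswell-class CPU and compilation with AVX2 enabled):

#include <immintrin.h>
#include <stdint.h>

/* Gather eight DWORDs using 16-bit element indices: scale = 2 makes the
   byte address base + index*2, so odd indices produce 2-byte-aligned
   4-byte loads - exactly the case asked about. */
__m256i gather_word_pairs(const int16_t *base, __m256i vindex)
{
    return _mm256_i32gather_epi32((const int *)base, vindex, 2);
}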

How does a virtual machine work?

I've been looking into how programming languages work, and some of them have so-called virtual machines. I understand that this is some form of emulation of the programming language within another programming language, and that it works like how a compiled language would be executed, with a stack. Did I get that right?
With the proviso that I did, what bamboozles me is that many non-compiled languages allow variables with "liberal" type systems. In Python for example, I can write this:
x = "Hello world!"
x = 2**1000
Strings and big integers are completely unrelated and occupy different amounts of space in memory, so how can this code even be represented in a stack-based environment? What exactly happens here? Is x pointed to a new place on the stack and the old string data left unreferenced? Do these languages not use a stack? If not, how do they represent variables internally?
Probably, your question should be titled "How do dynamic languages work?"
That's simple: they store the variable's type information along with its value in memory. And this is not only done in interpreted or JIT-compiled languages but also in natively-compiled languages such as Objective-C.
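A minimal sketch in C of what "type information stored along with the value" can look like (the names are illustrative, not any particular runtime's):

typedef enum { TYPE_INT, TYPE_STR } ValueType;

typedef struct {
    ValueType type;    /* the type tag stored with the value */
    union {
        long  i;       /* integer payload */
        char *s;       /* pointer to heap-allocated string payload */
    } as;
} Value;

Rebinding x = "Hello world!" and then x = 2**1000 just swaps the tag and payload (with the big integer living in heap-allocated storage).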
In most VM languages, variables can be conceptualized as pointers (or references) to memory in the heap, even if the variable itself is on the stack. For languages that have primitive types (int and bool in Java, for example) those may be stored on the stack as well, but they can not be assigned new types dynamically.
Ignoring primitive types, all variables that exist on the stack have their actual values stored in the heap. Thus, if you dynamically reassign a value to them, the original value is abandoned (and the memory cleaned up via some garbage collection algorithm), and the new value is allocated in a new bit of memory.
The VM has nothing to do with the language. Any language can run on top of a VM (the Java VM has hundreds of languages already).
A VM enables a different kind of "assembly language" to be run, one that is a better fit as a compiler target. Everything done in a VM could be done in a CPU, so think of the VM as a CPU. (Some actually are implemented in hardware.)
It's extremely low level, and in many cases heavily stack-based: instead of registers, machine-level math is all done on locations relative to the current stack pointer.
With normal compiled languages, many instructions are required for a single step. A "+" might look like: grab an item from a point relative to the stack pointer into register a, grab another into register b, add registers a and b, and put register a back into a place relative to the stack pointer.
The VM does all this with a single, short instruction, possibly one or two bytes, instead of 4 or 8 bytes per instruction in machine language (depending on 32- or 64-bit architecture), which (guessing) should mean around 16 or 32 bytes of x86 for 1-2 bytes of bytecode. (I could be wrong; my last x86 coding was in the 80286 era.)
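To make the "single, short instruction" point concrete, here is a toy stack-VM dispatch loop in C (a sketch, not any real VM's bytecode):

enum { OP_PUSH, OP_ADD, OP_HALT };   /* one-byte opcodes */

long run(const unsigned char *code)
{
    long stack[64];
    int sp = 0, pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH: stack[sp++] = code[pc++]; break;         /* push immediate */
        case OP_ADD:  sp--; stack[sp - 1] += stack[sp]; break; /* one-byte "+" */
        case OP_HALT: return stack[sp - 1];
        }
    }
}

Running it on the six-byte program {OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT} yields 5; the equivalent x86 for the add alone would be several multi-byte instructions.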
Microsoft used (probably still uses) VMs in their office products to reduce the amount of code.
The procedure for creating the VM code is the same as creating machine language, just a different processor type essentially.
VMs can also implement their own security, error recovery and memory mechanisms that are very tightly related to the language.
Some of my description here is summary and from memory. If you want to explore the bytecode definition yourself, it's kinda fun:
http://java.sun.com/docs/books/jvms/second_edition/html/Instructions2.doc.html
The key to many of the 'how do VMs handle variables like this or that' really comes down to metadata... The meta information stored and then updated gives the VM a much better handle on how to allocate and then do the right thing with variables.
In many cases this is the type of overhead that can really get in the way of performance. However, modern day implementations, etc have come a long way in doing the right thing.
As for your specific questions - treating variables as vanilla objects / etc ... comes down to reassigning / reevaluating meta information on new assignments - that's why x can look one way and then the next.
To answer part of your question, I'd recommend a Google Tech Talk about Python, where some of your questions concerning dynamic languages are answered; for example, what a variable is (it is not a pointer, nor a reference, but in the case of Python a label).

Resources