MMU in ARM: how to map kernel code (bare metal) - arm64

In the linker script, the start address is 0xffffffff00000000. I can then load my bare-metal kernel (for AArch64) at an arbitrary physical address and use relative addressing. So when I turn on the MMU, how do I know memory won't be written where the kernel is loaded? I mean, if I load the kernel at 0x01000000 and map the virtual range 0xffffffff00000000 to 0xffffffffffffffff onto physical memory, it seems to me I will still run into problems if I combine relative with absolute addressing. And it seems like the only solution is to ensure the kernel is always loaded at the same physical address, and then to map that to 0xffffffff00000000... But somehow this defeats the purpose of an MMU. Am I correct in my thinking?

Actually, the realization that the ADR instruction on AArch64 yields a PC-relative address, which is the physical address while the MMU is still off, helped me achieve what I wanted. I hope I am right in my thinking, though:
ADR x0, label   // here we get the physical (runtime) address of label into x0
LDR x0, =label  // here we get the link-time address (linker start address + label's offset) into x0
So I don't really need to know at compile time the physical address at which the kernel was loaded in order to put a boundary around the kernel and set the MMU up properly.
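A minimal sketch of how this can be used when building the page tables (GNU as, AArch64; the symbol _kernel_start and the register choices are illustrative, and the image is assumed to be loaded 4 KiB-aligned at the very start of the linked region):
adrp x0, _kernel_start              // runtime page address of the symbol: physical, while the MMU is off
add  x0, x0, :lo12:_kernel_start    // x0 = physical address the kernel was actually loaded at
ldr  x1, =_kernel_start             // x1 = link-time virtual address (here 0xffffffff00000000)
sub  x2, x1, x0                     // x2 = virt-to-phys offset used when building the page tables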

Related

How do memory addresses in binary programs point to the right place in memory at runtime?

From what I understand, when you compile a program (let's say a C program, for example), the compiler takes your code and outputs an executable program in binary (i.e. machine code for the targeted arch) format.
Within this binary you're going to have instructions that point to addresses in memory to load data/instructions from other parts of the program.
Given this program will be loaded into memory at some arbitrary location, how does the program know what these memory addresses are? How are they set/calculated, and whose job is it to do this?
For example, does the binary just have placeholders for the memory locations that are replaced by the OS when it loads it into memory for the first time?
If it needs to dynamically load a shared library how does it work out where the memory location is for that?
How does 'virtual memory' come into play with this? (if at all)
how does the program know what these memory addresses are?
The program (and its author) does not know what the memory address will be when it is loaded into computer memory; it only knows where the placeholder is, relative to the start of its segment. That's why the compiler accompanies each such placeholder with a relocation record. A relocation is a piece of information which tells the OS or the linker
where the relocated address is (its offset in code or data segment)
which segment it is in
which segment or symbol it refers to
what kind of relocation should be applied to the address
Consider the following simple piece of source code for a Windows Portable Executable program:
[.text]
Main:NOP
LEA ESI,[Mem]
; more instructions
[.data]
DB "Some data"
Mem: DB "Other data"
which will be converted to machine instructions and memory data:
|[.text] |[.text]
|00000000:90 |Main:NOP
|00000001:8D35[09000000] | LEA ESI,[Mem]
|00000007: | ; more instructions
|[.data] |[.data]
|00000000:536F6D6520646174~| DB "Some data"
|00000009:4F74686572206461~|Mem: DB "Other data"
The compiler does not know the virtual address of Mem; it only knows that Mem is located 0x00000009 bytes from the start of the .data segment, so it puts this temporary number into the operation code of LEA ESI,[Mem] and creates a relocation for the placeholder (located in segment .text at offset 0x00000003) which is relative to segment .data.
At link time the linker decides that the .text segment will be loaded at virtual address 0x00401000 and the .data segment at VA 0x00402000. The linker then reads the relocation record and modifies the placeholder by adding 0x00402000. The instruction LEA ESI,[Mem] in the linked executable will then be 8D3509204000, which contains the final fixed-up virtual address of Mem. We will be able to see that address in a debugger at run time.
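To make the fix-up arithmetic explicit: the placeholder 0x00000009 plus the .data base 0x00402000 gives 0x00402009, which is stored little-endian as 09 20 40 00; hence the bytes 8D35 09204000.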
Relocations are present in linked executable files, too (16-bit DOS MZ or Windows PE), for the case where the file cannot be loaded at the image-base virtual address assumed at link time. Linking shared-object libraries on Linux is more complicated; see chapter 2, Dynamic Linking, in http://www.skyfree.org/linux/references/ELF_Format.pdf
MMUs allow the OS to create the same address space (think addresses zero to N) for each application such that each application can be compiled for a known address space. There isn't much need for relocation in this situation. Even in the DOS days you could/would have a fixed offset relative to some data segment so that the applications could have an assumed address space.
The Linux kernel bootstrap is a place where you will see relocation, but the kernel itself not so much, or perhaps that has changed in the last so many years.
Loadable modules and shared libraries are one place where you might see relocation required. For at least the popular processors running the popular operating systems (Linux, Windows, macOS; ARM, x86, MIPS), the code itself can be built to be relocatable without modification, so long as it is all relative to itself, which is what is assumed.
Data relative to code is different, though: if you want to be able to move the data, some form of table is typical, as sketched below. The table is fixed relative to the code (or reached through some other linked mechanism), but it contains information telling where the data starts, or where specific items/markers in the data start, so that other data references can be made relative to that.
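A minimal, self-contained sketch of that idea in C (a GOT-like indirection; the names and the static initialization are illustrative, and in a real system a loader would patch the table):
#include <stdio.h>

static int counter = 42;                    /* the "movable" data */

static struct {
    int *counter_addr;                      /* would be patched by the loader if the data moved */
} data_table = { &counter };                /* filled statically here, for the demo */

int main(void) {
    printf("%d\n", *data_table.counter_addr);   /* every data access goes via the table */
    return 0;
}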

aarch64 inline assembly stack pointer constraint memory address with offset for Clang 6+

I noticed that at different optimization levels, Clang 6 sometimes merges adjacent vld1 NEON load intrinsics into ldp (load register pair) instructions.
I am trying to use inline assembly to manually force more load-pair instructions. The source array is held on the stack, and when Clang itself produces ldp instructions it uses the stack pointer with an offset; however, when I pass the indexed array into inline assembly, it expands into an x register holding the address. This works but causes a performance regression. I believe this is because reading from the stack is faster, whereas an x register as a source address might point to the heap, which may in turn reference back to the stack, though I am not sure; or perhaps it is reading from duplicated data on the heap.
This is an example of what I am using now.
asm (
    "ldp %q[DST1], %q[DST2], [%[SRC]]" "\n"
    : [DST1] "=w" (TMP1), [DST2] "=w" (TMP2)
    : [SRC] "X" (&K2[8])
);
and this is what Clang expands it into
ldp q19, q4, [x11]
But I want to use a stack pointer with offset address, automatically resolved from the indexed K2 array variable. e.g.
ldp q19, q4, [sp,#32]
The offsets of the stack-pointer addresses in the disassembled code are not adjacent, so I cannot just hard-code the sp register and enter an offset to load sequential data. This is because Clang 6 is consolidating identical values from other arrays, used by other functions, onto the stack.
GCC has AArch64 machine constraints like k, which is for the stack pointer (sp) register, and Ump, which is meant for stp/ldp store/load-pair instruction addresses. I never got either to work, on GCC or on Clang, the latter having no equivalent constraints in its sparse documentation.
My preference is to use Clang 6, as it is producing code that is over 6% faster than GCC 8, because it arranges most of the instructions in a performance-critical loop to dual-issue properly.
Is there any way to pass an array with an index as input into inline assembly and have it automatically resolve to a stack-pointer-with-offset address in Clang 6?
Have you tried using a memory source operand like [SRC] "m" (K2[8])? Without that, you haven't even told the compiler that the memory contents are an input to the inline asm, so it might reorder your asm with respect to stores, or do dead-store elimination.
Letting the compiler pick the addressing mode is the entire point of "m" operands.
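A minimal sketch of that suggestion (the element type of K2 and the wrapper function are assumptions; the array-pointer cast tells the compiler that all 32 bytes starting at &K2[8] are read; note the compiler may still pick an addressing mode that ldp cannot encode, which is exactly the case GCC's Ump constraint exists for):
#include <arm_neon.h>
#include <stdint.h>

uint32_t K2[16];    /* assumed shape; the question does not show the declaration */

void load_pair(uint32x4_t *out1, uint32x4_t *out2) {
    uint32x4_t tmp1, tmp2;
    asm ("ldp %q0, %q1, %2"
         : "=w" (tmp1), "=w" (tmp2)
         : "m" (*(const uint32_t (*)[8]) &K2[8]));   /* memory operand: compiler picks the addressing mode */
    *out1 = tmp1;
    *out2 = tmp2;
}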

Trying to understand the load memory address (LMA) and the binary file offset in an ARM binary image

I'm working with an ARM Cortex-M4 (STM32F4xxxx) and I'm trying to understand how exactly the binaries (*.elf and *.bin) are built and flashed into memory, especially with regard to the memory locations. Specifically, what I don't understand is how the LMA gets 'translated' from the actual binary file offset. Let me explain with an example:
I have an *.elf file whose (relevant) sections are the following ones (obtained from objdump -h):
my_file.elf: file format elf32-littlearm
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 000001c4 08010000 08010000 00020000 2**0
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .bootloader 00004000 08000000 08000000 00010000 2**0
CONTENTS, ALLOC, LOAD, DATA
According to that output, the VMA/LMA values are 0x8000000 and 0x8010000, which is perfectly fine since they are defined that way in the linker script file. In addition, according to that report, the file offsets of those sections are 0x10000 and 0x20000 respectively. Next, I execute the following command to dump the memory corresponding to the .bootloader section:
xxd -s 0x10000 -l 16 my_file.elf
00010000: b007 c0de b007 c0de b007 c0de b007 c0de ................
Now, create the binary file to be flashed into memory:
arm-none-eabi-objcopy -O binary --gap-fill 0xFF -S my_file.elf my_file.bin
According to the information provided above, and as far as I understand, the generated binary file should have the .bootloader section located at 0x8000000. I understand that this is not how it actually works, inasmuch as the file would get extremely big, so the bootloader is placed at the beginning of the file, i.e. at offset 0x0 (check that both memory chunks are identical, even though they are at different offsets):
xxd -s 0x00000 -l 16 my_file.bin
00000000: b007 c0de b007 c0de b007 c0de b007 c0de ................
As far as I understand, when the mentioned binary file is flashed into memory, the bootloader will be at address 0x0, which is perfectly fine taking into account that the MCU in question fetches its initial PC from address 0x4 (after getting the SP from 0x0) when it starts, as I have checked here (page 26): https://www.st.com/content/ccc/resource/technical/document/application_note/76/f9/c8/10/8a/33/4b/f0/DM00115714.pdf/files/DM00115714.pdf/jcr:content/translations/en.DM00115714.pdf
Finally, my questions are:
Will the bootloader actually be placed at 0x0? If so, what's the purpose of defining the memory sectors in the linker file?
Is this because 0x0 belongs to flash memory, and when the MCU starts, all the flash is copied into RAM at address 0x8000000? If so, will the bootloader be executed from flash memory and all the rest of the code from RAM?
Taking into account the above questions, if I have not understood anything, what's the relation/difference between the LMA and the File offset?
No, the bootloader will be at 0x08000000, as defined in the ELF file.
The image will be burned into flash at that address and executed directly from there (not copied anywhere else).
There is a somewhat undocumented behaviour: the uninitialized area before the actual data is skipped when producing the binary image. As the comment in the BFD library source states (https://sourceware.org/git/gitweb.cgi?p=binutils-gdb.git;a=blob;f=bfd/binary.c;h=37f5f9f7363e7349612cdfc8bc579369bbabbc0c;hb=HEAD#l238)
/* The lowest section LMA sets the virtual address of the start
of the file. We use this to set the file position of all the
sections. */
The lowest section LMA (that of .bootloader) is 0x08000000 in your ELF, so the binary file will start at this address.
You should take this address into account, adding it to the file offset, when determining an address in the image.
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 000001c4 08010000 08010000 00020000 2**0
/* ^^^^^^^^ */
/* this section will be at offset 10000 in image */
CONTENTS, ALLOC, LOAD, READONLY, DATA
1 .bootloader 00004000 08000000 08000000 00010000 2**0
/* ^^^^^^^^ */
/* this is the lowest LMA; in your case it will be used */
/* as a start of an image, and this section will be placed */
/* directly at start of the image */
CONTENTS, ALLOC, LOAD, DATA
Memory layout:                    Bin. image layout:
00000000   \ skipped
...         ________________   /
08000000   .bootloader           0
...         ________________
08004000   <gap>                 4000
...         ________________
08010000   .text                 10000
...         ________________
080101C4                         101C4
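So for any loadable section: offset in the .bin image = section LMA - lowest LMA. Checking against the layout above: .text lands at 0x08010000 - 0x08000000 = 0x10000, and the total image size is 0x080101C4 - 0x08000000 = 0x101C4 bytes.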
That address is defined in the ldscript, so the binary image starts at a fixed location. However, you should be aware of this behaviour when dealing with ldscripts and binary images.
To summarize the building and flashing process:
When linking, the start address is defined in the ldscript, and the first section in the ELF is located there.
When converting to binary, the start address is determined from the lowest LMA, and the binary image starts from that address.
When flashing the image, the same address is given to the flasher as a parameter, so the image is placed at the right place (as defined in the ldscript).
Update: STM32F4xxx booting process.
The address region starting at address 0 is special in these MCUs. It can be configured to mirror other regions, which are flash, SRAM, or system ROM; the selection is made with the BOOT pins.
From the CPU's side it looks like a second copy of the flash (or SRAM, or system ROM) appears at address 0.
When the CPU starts, it first reads the initial SP from address 0 and the initial PC from address 4. Actually, the reads are performed from the flash memory.
If the code is linked to run from the actual flash location, then the initial PC will point there. In that case execution starts at the actual flash address.
----- Mapped area (mimics contents of flash) ---
      0: (20001000)              ;
      4: (0800ABCD) ----.        ; CPU reads PC here
   ....                 |        ; (it points into flash)
----- FLASH -----       |
8000000: 20001000       |        ; initial stack pointer
8000004: 0800ABCD --.   |        ; address of _start in flash
   ....             |   |
800ABCD: <_start:> movw ... <-'<-'   ; code execution starts here
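For illustration, a minimal sketch of what produces those two words at the start of flash (GNU as syntax; the stack value and handler name are made up):
    .cpu cortex-m4
    .thumb
    .word 0x20001000        /* offset 0x0: initial stack pointer */
    .word reset_handler     /* offset 0x4: reset vector; the assembler sets the Thumb bit */
    .thumb_func
reset_handler:
    b reset_handler         /* real code would continue with the application entry */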
(Note: this does not apply to hex images (like Intel HEX or S-record), as such formats define the loading address explicitly, and it is used as-is there.)
The documentation is pretty clear on where the address space for application code is on the STM32s, which is 0x08000000 (a competing vendor uses something like 0x01000000, and so on), and that when booting in a certain mode, 0x08000000 is mapped to address 0x00000000, as can easily be seen with a debugger (in both spaces).
The address space at 0x00000000 that is mapped to 0x08000000 is smaller than the potential address space at 0x08000000, depending on the chip. So it is wise to build for and use 0x08000000 rather than 0x00000000, but for small programs you can choose either.
Because the Cortex-M is a vector-table machine, when the logic reads address 0x00000004 (which is mapped to 0x08000004 in a normal boot mode) it sees 0x080xxxxx, and execution then leaves the 0x00000000 memory space, avoiding any limitations there.
When you use the boot0/boot1 strap pins, you can instead cause 0x00000000 to map elsewhere, where the burned-in bootloader lives. That bootloader can of course easily read 0x08000000 and simulate a reset by branching, or it can change the logic and actually reset (if you ask it to; I don't know whether that bootloader actually supports running a program, and if we had done work there we couldn't necessarily say). Quite possibly it always boots into the bootloader and then changes the mapping depending on the straps.
Similar to an MMU, but much simpler: decoding addresses and aliasing them is pretty easy. If boot0 == 0 and address[31:16] == 0x0000, then set address[31:16] = 0x0800 and the memory system decodes the access at the different address; as easy as that is to write in C (see the sketch below), it is that easy in the HDL, if not easier.
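In C, the decode just described might look like this (illustrative only, not the actual ST logic):
#include <stdint.h>

/* Alias a 0x0000xxxx access onto the flash at 0x0800xxxx when boot0 == 0. */
uint32_t decode(uint32_t address, int boot0)
{
    if (boot0 == 0 && (address >> 16) == 0x0000)
        address = (0x0800u << 16) | (address & 0xFFFFu);
    return address;
}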
This is not uncommon in microcontrollers as well as elsewhere. Microcontrollers generally boot from a flash/ROM, but on some architectures that same boot space also holds the vector or exception table that an RTOS might want to manipulate, so you sometimes see that RAM can be swapped into that space: after a control register is changed, the CPU "sees" RAM where on boot it "saw" the vector table in flash. Alternatively, the code in flash branches to somewhere in RAM for the non-reset vectors, and then the RTOS, or any other application that cares to do this, can make runtime changes to what code actually gets run for those exceptions or interrupts.
ARM imposes address-space rules for where code can execute and data can live, where you might want to start your peripheral address space, and what address space is reserved by ARM for resources within the core. So you will sometimes see RAM have an alias at a lower address, implying that if you want to run a program in RAM you should use the lower address for execution, but you can use either address to copy the code there.
It is up to the chip designers how simple or complicated to make this. For ST it is pretty simple: they have one or more boot pins on the package that at least let you choose between your application and the on-chip bootloader. On all the STM32s I have seen so far, the application flash space is considered to live at 0x08000000 and is mapped/aliased to 0x00000000 in one of those boot modes. When two boot pins are exposed, up to four boot conditions can exist, of which one is the application, with 0x00000000 aliased to 0x08000000.
As to how the bits get into the flash, that varies widely by tool. Toolchains like GNU will certainly build a .bin file where the first byte of the file is the first byte from the ELF that we want at 0x08000000 (if built that way; if you built for 0x02000000 it will still be the first byte, and that code probably won't work). There are tools, and you can certainly write your own, that, knowing this, load a .bin file at the desired place of 0x08000000; or you can have your tool write to address 0x00000000, in the right mode and for a program that is not too big, and have it still land in the right place to execute on reset. Likewise there are, or you can write, tools that parse ELF files, Intel HEX, Motorola S-record and others, and based on the information in those files load the data into the address space you desire, assuming everything is bug-free.
You might be trying to overcomplicate it. There isn't any magic to it: the tools need to do the sane thing, and the sane thing is to take the binary from the compiler and put it in the chip where we want it. We are responsible for the linker script and the bootstrap code/vector table, of course, but if we do that right, tools that are designed right will put the bits in the right place in the chip, and if the chip is designed right, as documented, it will boot and run.
Will the bootloader actually be placed at 0x0? If so, what's the purpose of defining the memory sectors in the linker file?
Ideally you want your application, or bootloader as you are calling it, to be at address 0x08000000 in the processor's address space. In certain boot modes (boot0/boot1) that address is also aliased to 0x00000000, so you can see the vector table at both places at the same time. If you are not in the right boot mode, then only 0x08000000 will show your code.
Is this because 0x0 belongs to flash memory, and when the MCU starts, all the flash is copied into RAM at address 0x8000000? If so, will the bootloader be executed from flash memory and all the rest of the code from RAM?
The logic in the chip is designed so that more than one address the processor puts on its bus can land on the application flash. The application flash is not "at" 0x08000000: if it is a 16-Kbyte flash, for example, it only has addresses from 0x0000 to 0xFFFF; when you access 0x08001234, 0x1234 is actually sent to the flash controller, and/or the flash controller chops the top off once it knows it is supposed to handle that request. 0x00000000 and 0x08000000 are the processor's view of the address space; in reality the upper bits are decoded to route the request to whomever it belongs, and the final handler ultimately looks at the lower bits to determine what is being addressed.
It is like delivering a letter: it has a first and last name, a street address, city, state, zip. Once it gets to the right post office in the right state, only the street address matters to the postal carrier. Once it gets to the right house, often the first name is all that matters; the rest can be ignored. No difference here: portions of the address (can often) become don't-cares as the responsible logic that inspects that address aims the request at the correct party.
Taking into account the above questions, if I have not understood anything, what's the relation/difference between the LMA and the file offset?
The ELF file format is generic, way overkill for microcontroller work, but being well supported and easy to use, why not. The load memory address is where we, the programmers, have desired that code to live with respect to the processor's view of the world. From a readelf perspective, the offset in the file is just the offset of that information within the ELF file; it is wherever the tool put it, and it has no other interesting relationship, or at least doesn't need to. Objcopy will rip that data out of the file and, for -O binary, put it in a sort of memory-image file, with the lowest address being copied out placed at offset 0 in that file, and the size determined by the total address span of all the loadable blocks (unless you use more command-line parameters).
And, as you sort of implied: if you have a linker script bug and were to have even a single instruction at 0x08000000 and a single byte of .data at 0x20000000, but didn't do the AT> thing, then your file, despite having only three relevant bytes, will be 0x20000001 - 0x08000000 bytes long (after a -O binary). So it is a good idea not to put objcopy in your makefile until you have debugged your linker script. Imagine, say, a target where flash is at 0x00000000 and memory at 0xE0000000: pretty big .bin files until you get the linker script sorted out.
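For reference, a minimal sketch of the AT> fix in GNU ld syntax (the region names and sizes are illustrative): .data gets its VMA in RAM but its LMA in flash, so the .bin stays small:
MEMORY
{
    FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 16K
    RAM  (rwx) : ORIGIN = 0x20000000, LENGTH = 8K
}
SECTIONS
{
    .text : { *(.text*) } > FLASH
    .data : { *(.data*) } > RAM AT> FLASH   /* VMA in RAM, LMA in flash */
}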

paging - virtual addresses, physical addresses, mapping - considerations

Paging on some processor makes it possible to map virtual address 0xA2345678 to physical address 0x823C5678. However, it is not possible to map virtual address 0x345678 to 0x2ABC678. What can we conclude about the size of a frame, of a page, of virtual memory, and of physical memory?
What I think about it:
(A2345678) -> (823C5678)
So the offset is at most 19 bits; we know that the page (and frame) size is at most 2^19, as in my previous question.
When it comes to the size of virtual memory and of physical memory, I can conclude nothing.
Similarly, I don't know what the impossibility of the second mapping tells me.
Can you try to explain it to me?
I do see something we can conclude after all:
If a mapping to physical address 0x823C5678 is possible, physical memory is at least that large. (Assuming there aren't any holes in physical address space; not a good assumption on real hardware, but whatever. We can tell that physical address space is at least that big, even if it doesn't all map to DRAM or MMIO).
Similarly, the valid virtual address 0xA2345678 gives us a lower bound on the virtual address size. Presumably all the virtual address bits can be 1, so the highest possible virtual address is at least 0xFFFFFFFF. i.e. virtual addresses are at least 32 bits, but could be any larger size.
This reasoning applies to the physical address space, but not to the size of physical memory. (e.g. in a computer with 19 GiB of RAM, the highest valid physical address isn't 2^n - 1.)
The fact that you can't map 0x345678 to 0x2ABC678 does tell us that the page size is greater than 2^12. The physical address is below one that was mappable, so we can rule out "too high" as the reason this mapping is impossible. I think too high and misaligned are the only possible reasons for a mapping not being possible.
(0xc = 0b1100, while 0x5 is 0b0101, so the common bits are only 0x678.)
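A quick check of that alignment argument: with 2^12 pages, the offsets 0x345678 mod 2^12 = 0x678 and 0x2ABC678 mod 2^12 = 0x678 match, so the mapping would have been possible; with 2^13 pages, 0x345678 mod 2^13 = 0x1678 while 0x2ABC678 mod 2^13 = 0x0678, which already differ. The impossibility therefore forces a page size of at least 2^13.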
We can assume that physical memory is a whole number of pages, so we can round up the lowest possible end of physical memory to the next multiple of 2^13.

ARM: Safe physical memory position (to reserve) for my ARM hypervisor in relation to a Linux/Android guest

I am developing a basic hypervisor on ARM (using the board Arndale Exynos 5250).
I want to load Linux(ubuntu or smth else)/Android as the guest. Currently I'm using a Linaro distribution.
I'm almost there, most of the big problems have already been dealt with, except for the last one: reserving memory for my hypervisor such that the kernel does not try to OVERWRITE it BEFORE parsing the FDT or the kernel command line.
The problem is that my Linaro distribution's U-Boot passes a FDT in R2 to the linux kernel, BUT the kernel tries to overwrite my hypervisor's memory before seeing that I reserved that memory region in the FDT (by decompiling the DTB, modifying the DTS and recompiling it). I've tried to change the kernel command-line parameters, but they are also parsed AFTER the kernel tries to overwrite my reserved portion of memory.
Thus, what I need is a safe location in physical RAM where I can put my hypervisor's code such that the Linux kernel won't try to access (r/w) it BEFORE parsing the FDT or its kernel command line.
Context details:
The system RAM layout on Exynos 5250 is: physical RAM starts at 0x4000_0000 (=1GB) and has the length 0x8000_0000 (=2GB).
The Linux kernel is loaded (by U-Boot) at 0x4000_7000, its size (uncompressed uImage) is less than 5 MB, and its entry point is set to be at 0x4000_8000;
uInitrd is loaded at 0x4200_0000 and is less than 2 MB in size
The FDT (board.dtb) is loaded at 0x41f0_0000 (passed in R2) and has the size less than 35KB
I currently load my hypervisor at 0x40C0_0000 and I want to reserve 200MB (0x0C80_0000) starting from that address, but the kernel tries to write there (a stage 2 HYP trap tells me that) before looking in the FDT or in the command line to see that the region is actually reserved. If instead I load my hypervisor at 0x5000_0000 (without even modifying the original DTB or the command line), it does not try to overwrite me!
The FDT is passed directly, not through ATAGs
Since, when loading my hypervisor at 0x5000_0000, the kernel does not try to overwrite it whatsoever, I assume there are memory regions that Linux does not touch before parsing the FDT/command line. I need to know whether this is true or not, and if true, some details regarding these memory regions.
Thanks!
RELATED QUESTION:
Does anyone happen to know what the priority is between the following: ATAGs / kernel command line / FDT? For instance, if I reserve memory through the kernel command line, but not in the FDT (.dtb), should it work, or is the command line overridden by the FDT? Is there some kind of concatenation between these three?
As per https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/Documentation/arm/Booting, safe locations start 128 MB from the start of RAM (assuming the kernel is loaded in that region, which it should be). If a zImage was loaded lower in memory than what is likely to be the end address of the decompressed image, it might relocate itself higher up before it starts decompressing. But in addition to this, the kernel has a .bss region beyond the end of the decompressed image in memory.
(Do also note that your FDT and initrd locations already violate this specification, and that the memory block you are wanting to reserve covers the locations of both of these.)
Effectively, your reserved area should go after the FDT and initrd in memory, which 0x50000000 is. But anything more than 0x08000000 from the start of RAM should work, portably, so long as it doesn't overwrite the FDT, initrd or U-Boot in memory.
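For reference, the reservation described in the question can be expressed at the top of the DTS with a /memreserve/ entry (the address and length here are the question's 0x5000_0000 and 200 MB):
/dts-v1/;
/memreserve/ 0x50000000 0x0C800000;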
The priority of the kernel/FDT/bootloader command line depends on the kernel configuration; do a menuconfig and check under "Boot options". You can combine ATAGs with the built-in command line, but not FDT; after all, the FDT chosen node is supposed to be generated by the bootloader. U-Boot's FDT support is OK, so you should let it do this rather than baking the command line into the .dts, if you want an FDT command line.
The kernel is pretty conservative before it has its memory map, since it has to blindly trust that the bootloader laid things out as specified. U-Boot, on the other hand, is copying bits of itself all over the place and is certainly the culprit for the top end of RAM; if you #define DEBUG in (I think) common/board_f.c, you'll get a dump of what it hits during relocation (not including the Exynos iRAM SPL/boot-code stuff, but that won't make a difference here anyway).
