Why is code relocation done in U-boot proper? - beagleboneblack

I am trying to understand the boot process of the BeagleBone Black by browsing through the source code. Assume that I am keeping the MLO and u-boot.img files on a micro-SD card and making the BeagleBone boot from the SD card.
To my understanding, the ROM code executes first and loads the MLO file from the MMC into the internal SRAM of the SoC. The MLO file contains the code for x-loader, a Second-stage Program Loader (SPL). The SPL then sets up the DRAM and copies the third-stage program loader (U-Boot proper) into the DRAM. U-Boot proper then starts executing directly from DRAM.
The architecture dependent portion of U-boot proper is located at arch/arm/ directory of the U-boot sources.
The code pertaining to the SPL is located in the spl/ directory (when I run make mrproper followed by make SPL, *.o files are created only in the spl/ directory).
For U-Boot proper, I am guessing this is the execution flow:
arch/arm/cpu/armv7/start.S is located at the reset vector (so it runs first), and after some initialization it calls the '_main' procedure located in arch/arm/lib/crt0.S.
In crt0.S, board_init_f() is called, which sets up the DRAM (among other things) and then returns to where it left off (in _main). It later calls the function relocate_code, which relocates it again to DRAM.
ENTRY(_main)
/*
* Set up initial C runtime environment and call board_init_f(0).
*/
#if defined(CONFIG_SPL_BUILD) && defined(CONFIG_SPL_STACK)
ldr r0, =(CONFIG_SPL_STACK)
#else
ldr r0, =(CONFIG_SYS_INIT_SP_ADDR)
#endif
bic r0, r0, #7 /* 8-byte alignment for ABI compliance */
mov sp, r0
bl board_init_f_alloc_reserve
mov sp, r0
/* set up gd here, outside any C code */
mov r9, r0
bl board_init_f_init_reserve
mov r0, #0
bl board_init_f
#if ! defined(CONFIG_SPL_BUILD)
/*
* Set up intermediate environment (new sp and gd) and call
* relocate_code(addr_moni). Trick here is that we'll return
* 'here' but relocated.
*/
ldr r0, [r9, #GD_START_ADDR_SP] /* sp = gd->start_addr_sp */
bic r0, r0, #7 /* 8-byte alignment for ABI compliance */
mov sp, r0
ldr r9, [r9, #GD_BD] /* r9 = gd->bd */
sub r9, r9, #GD_SIZE /* new GD is below bd */
adr lr, here
ldr r0, [r9, #GD_RELOC_OFF] /* r0 = gd->reloc_off */
add lr, lr, r0
#if defined(CONFIG_CPU_V7M)
orr lr, #1 /* As required by Thumb-only */
#endif
ldr r0, [r9, #GD_RELOCADDR] /* r0 = gd->relocaddr */
b relocate_code
here:
Why does U-Boot proper need to set up DRAM again and relocate itself again if this was already done by the SPL? Am I missing something here?

Why does U-Boot proper need to set up DRAM again and relocate itself again if this was already done by the SPL?
The relocation to the top of memory is to maximize contiguous free space.
You could build/link for and load U-Boot at this high-memory address (to save on the copy). But whenever you rebuild U-Boot with added features and the image gets larger, then that load address has to be recalculated.
Which also means that the SPL has to be changed.
BTW "setup DRAM" would typically be thought of as initializing a DRAM controller, something that is not done at this stage.
Besides relocating its code, U-Boot also has to move the stack and heap, aka the C runtime environment.
Am I missing something here?
It's more convenient to load U-Boot in the middle of main memory, and then let U-Boot relocate itself to high memory.
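To make the "relocate to the top of memory" idea concrete, here is a minimal C sketch of the address arithmetic. The function names and the 4 KiB alignment are my own assumptions for illustration, not U-Boot's actual code; the real board_init_f sequence also reserves space for the heap, global data and other regions before picking the address.

```c
#include <stdint.h>

/* Hypothetical helper: round an address down to a power-of-two alignment. */
uint32_t align_down(uint32_t addr, uint32_t align)
{
    return addr & ~(align - 1u);
}

/* Pick a destination just below the top of RAM, sized for the image.
 * The 4 KiB alignment is an assumption made for this sketch. */
uint32_t pick_reloc_addr(uint32_t ram_base, uint32_t ram_size,
                         uint32_t image_size)
{
    return align_down(ram_base + ram_size - image_size, 4096u);
}

/* Offset between where the image was loaded and where it will run.
 * This plays the role of gd->reloc_off in the crt0.S listing. */
uint32_t reloc_offset(uint32_t reloc_addr, uint32_t load_addr)
{
    return reloc_addr - load_addr;
}
```

Because the destination is computed at run time from the RAM size and the image size, the SPL can keep loading U-Boot at one fixed address no matter how large the image becomes.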

iOS ARM64 Syscalls

I am learning more about shellcode and making syscalls in arm64 on iOS devices. The device I am testing on is iPhone 6S.
I got the list of syscalls from this link (https://github.com/radare/radare2/blob/master/libr/include/sflib/darwin-arm-64/ios-syscalls.txt).
I learnt that x8 is used for putting the syscall number for arm64 from here (http://arm.ninja/2016/03/07/decoding-syscalls-in-arm64/).
I figured the various registers used to pass in parameters for arm64 should be the same as arm so I referred to this link (https://w3challs.com/syscalls/?arch=arm_strong), taken from https://azeria-labs.com/writing-arm-shellcode/.
I wrote inline assembly in Xcode and here are some snippets
//exit syscall
__asm__ volatile("mov x8, #1");
__asm__ volatile("mov x0, #0");
__asm__ volatile("svc 0x80");
However, the application does not terminate when I step over this code.
char write_buffer[]="console_text";
int write_buffer_size = sizeof(write_buffer);
__asm__ volatile("mov x8,#4;" //arm64 uses x8 for syscall number
"mov x0,#1;" //1 for stdout file descriptor
"mov x1,%0;" //the buffer to display
"mov x2,%1;" //buffer size
"svc 0x80;"
:
:"r"(write_buffer),"r"(write_buffer_size)
:"x0","x1","x2","x8"
);
If this syscall works, it should print out some text in Xcode's console output screen. However, nothing gets printed.
There are many online articles for ARM assembly, some use svc 0x80 and some use svc 0 etc and so there can be a few variations. I tried various methods but I could not get the two code snippets to work.
Can someone provide some guidance?
EDIT:
This is what Xcode shows in its Assembly view when I wrote a C function syscall int return_value=syscall(1,0);
mov x1, sp
mov x30, #0
str x30, [x1]
orr w8, wzr, #0x1
stur x0, [x29, #-32] ; 8-byte Folded Spill
mov x0, x8
bl _syscall
I am not sure why this code was emitted.
The registers used for syscalls are completely arbitrary, and the resources you've picked are certainly wrong for XNU.
As far as I'm aware, the XNU syscall ABI for arm64 is entirely private and subject to change without notice so there's no published standard that it follows, but you can scrape together how it works by getting a copy of the XNU source (as tarballs, or viewing it online if you prefer that), grep for the handle_svc function, and just following the code.
I'm not gonna go into detail on where exactly you find which bits, but the end result is:
The immediate passed to svc is ignored, but the standard library uses svc 0x80.
x16 holds the syscall number
x0 through x8 hold up to 9 arguments*
There are no arguments on the stack
x0 and x1 hold up to 2 return values (e.g. in the case of fork)
The carry bit is used to report an error, in which case x0 holds the error code
* This is used only in the case of an indirect syscall (x16 = 0) with 8 arguments.
* Comments in the XNU source also mention x9, but it seems the engineer who wrote that should brush up on off-by-one errors.
And then it comes to the actual syscall numbers available:
The canonical source for "UNIX syscalls" is the file bsd/kern/syscalls.master in the XNU source tree. Those take syscall numbers from 0 up to about 540 in the latest iOS 13 beta.
The canonical source for "Mach syscalls" is the file osfmk/kern/syscall_sw.c in the XNU source tree. Those syscalls are invoked with negative numbers between -10 and -100 (e.g. -28 would be task_self_trap).
Unrelated to the last point, two syscalls mach_absolute_time and mach_continuous_time can be invoked with syscall numbers -3 and -4 respectively.
A few low-level operations are available through platform_syscall with the syscall number 0x80000000.
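As a rough summary of the numbering scheme described above, here is a small C sketch that classifies the value placed in x16. It is only a model of this answer's description; the enum and function names are made up for illustration and this is not how the XNU source is organised.

```c
#include <stdint.h>

/* Hypothetical categories, named after the groups listed above. */
enum syscall_class {
    SC_INDIRECT,   /* x16 == 0: indirect syscall                      */
    SC_UNIX,       /* positive: entries from bsd/kern/syscalls.master */
    SC_MACH_TRAP,  /* negative: entries from osfmk/kern/syscall_sw.c  */
    SC_MACH_TIME,  /* -3 / -4: mach_absolute_time, mach_continuous_time */
    SC_PLATFORM    /* 0x80000000: platform_syscall                    */
};

/* Classify a syscall number, treating x16 as a signed 32-bit value
 * (an assumption of this sketch). */
enum syscall_class classify_syscall(int32_t x16)
{
    if (x16 == INT32_MIN)          /* bit pattern 0x80000000 */
        return SC_PLATFORM;
    if (x16 == 0)
        return SC_INDIRECT;
    if (x16 > 0)
        return SC_UNIX;
    if (x16 == -3 || x16 == -4)
        return SC_MACH_TIME;
    return SC_MACH_TRAP;
}
```

For example, write (number 4) falls in the UNIX group and task_self_trap (number -28) in the Mach trap group.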
This should get you going. As @Siguza mentioned, you must use x16, not x8, for the syscall number.
#import <sys/syscall.h>
char testStringGlobal[] = "helloWorld from global variable\n";
int main(int argc, char * argv[]) {
char testStringOnStack[] = "helloWorld from stack variable\n";
#if TARGET_CPU_ARM64
//VARIANT 1 suggested by @PeterCordes
//as an input it's a file descriptor set to STD_OUT 1 so the syscall write output appears in Xcode debug output
//as an output this will be used for returning syscall return value;
register long x0 asm("x0") = 1;
//as an input string to write
//as an output this will be used for returning syscall return value higher half (in this particular case 0)
register char *x1 asm("x1") = testStringOnStack;
//string length
register long x2 asm("x2") = strlen(testStringOnStack);
//syscall write is 4
register long x16 asm("x16") = SYS_write; //syscall write definition - see my footnote below
//full variant using stack local variables for register x0,x1,x2,x16 input
//syscall result collected in x0 & x1 using "semi" intrinsic assembler
asm volatile(//all args prepared, make the syscall
"svc #0x80"
:"=r"(x0),"=r"(x1) //mark x0 & x1 as syscall outputs
:"r"(x0), "r"(x1), "r"(x2), "r"(x16): //mark the inputs
//inform the compiler we read the memory
"memory",
//inform the compiler we clobber carry flag (during the syscall itself)
"cc");
//VARIANT 2
//syscall write for globals variable using "semi" intrinsic assembler
//args hardcoded
//output of syscall is ignored
asm volatile(//copy the string address (input operand %0, in a compiler-chosen register) into x1
"mov x1, %0 \t\n"
//set file descriptor to STD_OUT 1 so it appears in Xcode debug output
"mov x0, #1 \t\n"
//hardcoded length
"mov x2, #32 \t\n"
//syscall write is 4
"mov x16, #0x4 \t\n"
//all args prepared, make the syscall
"svc #0x80"
::"r"(testStringGlobal):
//clobbered registers list
"x1","x0","x2","x16",
//inform the compiler we read the memory
"memory",
//inform the compiler we clobber carry flag (during the syscall itself)
"cc");
//VARIANT 3 - only applicable to global variables using "page" address
//which is PC-relative addressing to load addresses at a fixed offset from the current location (PIC code).
//syscall write for global variable using "semi" intrinsic assembler
asm volatile(//set x1 on proper PAGE
"adrp x1,_testStringGlobal@PAGE \t\n" //notice the underscore preceding variable name by convention
//add the offset of the testStringGlobal variable
"add x1,x1,_testStringGlobal@PAGEOFF \t\n"
//set file descriptor to STD_OUT 1 so it appears in Xcode debug output
"mov x0, #1 \t\n"
//hardcoded length
"mov x2, #32 \t\n"
//syscall write is 4
"mov x16, #0x4 \t\n"
//all args prepared, make the syscall
"svc #0x80"
:::
//clobbered registers list
"x1","x0","x2","x16",
//inform the compiler we read the memory
"memory",
//inform the compiler we clobber carry flag (during the syscall itself)
"cc");
#endif
@autoreleasepool {
return UIApplicationMain(argc, argv, nil, NSStringFromClass([AppDelegate class]));
}
}
EDIT
In response to @PeterCordes' excellent comment: yes, there is a header with the syscall number definitions, <sys/syscall.h>, which I included in the snippet above for Variant 1. But it's important to mention that inside it, the numbers are defined by Apple like this:
#ifdef __APPLE_API_PRIVATE
#define SYS_syscall 0
#define SYS_exit 1
#define SYS_fork 2
#define SYS_read 3
#define SYS_write 4
I haven't yet heard of an iOS app being rejected from the App Store for making a system call directly through svc 0x80; nonetheless, it's definitely not public API.
As for the "=@ccc" output constraint suggested by @PeterCordes, i.e. the carry flag (set by the syscall on error) as an output: it is not supported as of the latest Xcode 11 beta / LLVM 8.0.0, even for x86, and definitely not for ARM.

Clarifications on iOS Assembly Language

I'm investigating a little bit how the Objective-C language is mapped onto assembly. I've started from a tutorial found at iOS Assembly Tutorial.
The code snippet under analysis is the following.
void fooFunction() {
int add = addFunction(12, 34);
printf("add = %i", add);
}
It is translated into
_fooFunction:
# 1:
push {r7, lr}
# 2:
movs r0, #12
movs r1, #34
# 3:
mov r7, sp
# 4:
bl _addFunction
# 5:
mov r1, r0
# 6:
movw r0, :lower16:(L_.str-(LPC1_0+4))
movt r0, :upper16:(L_.str-(LPC1_0+4))
LPC1_0:
add r0, pc
# 7:
blx _printf
# 8:
pop {r7, pc}
About the assembly code, I cannot understand the following two points
-> Comment #1
The author says that push decrements the stack pointer by 8 bytes, since r7 and lr are 4 bytes each. OK. But he also says that the two values are stored with the one instruction. What does that mean?
-> Comment #6
movw r0, :lower16:(L_.str-(LPC1_0+4))
movt r0, :upper16:(L_.str-(LPC1_0+4))
The author says that r0 will hold the address of "add = %i" (which can be found in the data segment), but I don't really get what the memory layout looks like. Why does he represent the difference L_.str-(LPC1_0+4) with the dotted black line and not with the red one (drawn by me)?
Any clarifications will be appreciated.
Edit
I'm missing the concept of pushing r7 onto the stack. What does it mean to push that value, and what does it contain?
But he also says that the two values are stored with the one
instruction. What does it mean?
That the single push instruction will put both values onto the stack.
Why does he represent the difference L_.str-(LPC1_0+4)
Because the add r0, pc implicitly adds 4 bytes more. To quote the instruction set reference:
Add an immediate constant to the value from sp or pc, and place the result into a low register.
Syntax: ADD Rd, Rp, #expr
where:
Rd is the destination register. Rd must be in the range r0-r7.
Rp is either sp or pc.
expr is an expression that evaluates (at assembly time) to a multiple of 4 in the range 0-1020.
If Rp is the pc, the value used is: (the address of the current instruction + 4) AND &FFFFFFFC.
For comment 1:
The two values pushed to the stack are the values stored in r7 and lr.
Two 4 byte values equals 8 bytes.
For comment 6:
The label LPC1_0 is followed by the instruction
add r0, pc
which adds another 4 bytes to the difference between the two addresses.
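The arithmetic can be checked with a small C sketch. The function names are made up for illustration, and the sketch assumes Thumb state, where reading pc yields the address of the current instruction plus 4.

```c
#include <stdint.h>

/* In Thumb state, reading pc yields the instruction's own address + 4. */
uint32_t thumb_pc_read(uint32_t insn_addr)
{
    return insn_addr + 4u;
}

/* The constant that movw/movt load into r0: L_.str - (LPC1_0 + 4). */
uint32_t movw_movt_constant(uint32_t str_addr, uint32_t lpc_addr)
{
    return str_addr - (lpc_addr + 4u);
}

/* r0 after "add r0, pc", where that instruction sits at label LPC1_0. */
uint32_t r0_after_add(uint32_t str_addr, uint32_t lpc_addr)
{
    return movw_movt_constant(str_addr, lpc_addr) + thumb_pc_read(lpc_addr);
}
```

With hypothetical addresses L_.str = 0x3f50 and LPC1_0 = 0x2a1e, the constant is 0x152e and the add produces 0x3f50 again. The "+4" in the constant cancels the "+4" that the pc read contributes, which is why the difference is drawn to LPC1_0+4 rather than to LPC1_0 itself.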

Address of interrupt handler in interrupt vector is +1 of actual address

I cross-compiled a program for a Cortex-M3. In the startup code, all interrupts are listed in g_pfnVectors. When I disassemble it, at address 0x0 I see the initial stack pointer value. Then at address 0x4 the address of the reset interrupt handler is given. This continues with the addresses of the following system interrupt handlers.
Here is my question: why are the handler addresses in the interrupt vector table one greater than the actual addresses? The address of the ResetISR handler is 0x184, but the table holds 0x185. The same is true of all the other interrupt handler addresses. What is the reason for this?
00000000 <g_pfnVectors>:
0: 10008000 andne r8, r0, r0
4: 00000185 andeq r0, r0, r5, lsl #3
8: 00000215 andeq r0, r0, r5, lsl r2
c: 0000021d andeq r0, r0, sp, lsl r2
00000184 <ResetISR>:
184: b580 push {r7, lr}
186: b084 sub sp, #16
188: af00 add r7, sp, #0
.......
00000214 <NMI_Handler>:
214: b480 push {r7}
216: af00 add r7, sp, #0
218: e7fe b.n 218 <NMI_Handler+0x4>
21a: bf00 nop
.......
0000021c <HardFault_Handler>:
21c: b480 push {r7}
21e: af00 add r7, sp, #0
220: e7fe b.n 220 <HardFault_Handler+0x4>
.......
I found the answer in ARMv7-M Architecture Reference Manual, section Vector Table Definition(B1.5.3):
The Vector table must be naturally aligned to a power of two whose
alignment value is greater than or equal to (Number of Exceptions
supported x 4), with a minimum alignment of 128 bytes. On power-on or
reset, the processor uses the entry at offset 0 as the initial value
for SP_main, see The SP registers on page B1-572. All other entries
must have bit[0] set to 1, because this bit defines the EPSR.T bit on
exception entry. See Reset behavior on page B1-586 and Exception
entry behavior on page B1-587 for more information. On exception
entry, if bit[0] of the associated vector table entry is set to 0,
execution of the first instruction causes an INVSTATE UsageFault, see
The special-purpose program status registers, xPSR on page B1-572 and
Fault behavior on page B1-608. If this happens on a reset, this
escalates to a HardFault, because UsageFault is disabled on reset, see
Priority escalation on page B1-585 for more information.
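In other words, bit 0 of each entry is a Thumb-state flag rather than part of the handler's address. A minimal C sketch of that encoding follows; the function names are my own and are not taken from any vendor header.

```c
#include <stdint.h>

/* Build a vector table entry: handler address with bit[0] set (Thumb). */
uint32_t vector_entry_for(uint32_t handler_addr)
{
    return handler_addr | 1u;
}

/* Recover the address where the handler's first instruction lives. */
uint32_t handler_addr_from(uint32_t entry)
{
    return entry & ~1u;
}

/* Non-zero if the entry would set EPSR.T on exception entry. */
int entry_selects_thumb(uint32_t entry)
{
    return (entry & 1u) != 0;
}
```

Applied to the disassembly in the question, ResetISR at 0x184 gives the entry 0x185, NMI_Handler at 0x214 gives 0x215, and HardFault_Handler at 0x21c gives 0x21d.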

ARM bare-metal with MMU: successive reads yield different values

Context (probably not needed):
As a learning exercise, I'm trying to implement a mini "OS" for the Raspberry Pi.
I'm currently implementing a very dumb memory management system. I already have the MMU enabled, and I'm in the process of getting a usable kmalloc.
It can already allocate chunks of memory from a pre-existing little kernel heap, mapped after the code and data segments. I'm trying to get it to grow as needed by mapping more pages. It must also be able to produce physically contiguous chunks.
The code is hosted at Github, there's a branch dedicated to this question with debug code. Note that it's not an example of well-organized, well-commented nor very clever code. :)
Actual question:
While trying to debug a data abort, I found something very strange.
This is a piece of code from my kmalloc:
next->prev_size = chunk->size;
next->size = -1;
term_printf(term, "A chunk->next_free = 0x%x\n", chunk->next_free);
term_printf(term, "B chunk->next_free = 0x%x\n", chunk->next_free);
*prev_list = next;
next->next_free = chunk->next_free;
term_printf(term, "next_free = 0x%x, chunk 0x%x\n", next->next_free, chunk->next_free);
term_printf(term, "next_free = 0x%x, chunk 0x%x\n", next->next_free, chunk->next_free);
I run it three times. Here are the results:
# 1st
A chunk->next_free = 0x0
B chunk->next_free = 0x0
next_free = 0x0, chunk 0x0
next_free = 0x0, chunk 0x0
# 2nd
A chunk->next_free = 0xffffffff
B chunk->next_free = 0x0
next_free = 0x0, chunk 0xffffffff
next_free = 0x0, chunk 0x0
# 3rd
A chunk->next_free = 0xffffffff
B chunk->next_free = 0xffffffff
next_free = 0xffffffff, chunk 0xffffffff
next_free = 0xffffffff, chunk 0xffffffff
The first and third iterations look normal (though next_free is supposed to have the value 0, the data abort occurs because it's got 0xffffffff). But what is my code doing during the second? O_o What kind of black sorcery can make my printf output two different values for chunk->next_free when read four times in a row? O_o
The data is well aligned, the pages are cacheable and non-bufferable (making them non-cacheable doesn't help), and I get the same result whether compiler optimizations are turned on or off. I tried throwing a data memory barrier in there but indeed it does nothing. I also checked the assembly produced, it looks ok.
I thought it could be caused by corrupted TLBs. I'm issuing "Invalidate unified single entry" (mcr p15, 0, %[addr], c8, c7, 1) after each new page mapping. Is it enough?
I tried debugging with qemu but it gets a data abort earlier when setting a bitmap of used physical pages, though this part works fine on the Pi.
I'm just looking for clues about what can cause this behavior. If you need more context please ask, though my code is for the moment rapidly changing and messy with lots of printf.
ETA:
The disassembly with -O0 for the first two printf:
c00025e4: e51b3018 ldr r3, [fp, #-24]
c00025e8: e5933008 ldr r3, [r3, #8]
c00025ec: e59b0004 ldr r0, [fp, #4]
c00025f0: e59f10a0 ldr r1, [pc, #160] ; c0002698 <kmalloc_wilderness+0x2c0>
c00025f4: e1a02003 mov r2, r3
c00025f8: eb000238 bl c0002ee0 <term_printf>
c00025fc: e51b3018 ldr r3, [fp, #-24]
c0002600: e5933008 ldr r3, [r3, #8]
c0002604: e59b0004 ldr r0, [fp, #4]
c0002608: e59f1088 ldr r1, [pc, #140] ; c000269c <kmalloc_wilderness+0x2c4>
c000260c: e1a02003 mov r2, r3
c0002610: eb000232 bl c0002ee0 <term_printf>
So it puts the address of chunk into r3, then performs an ldr to get next_free. It does this again before the second printf. There's only one core and the DMA is not running, so there's nothing changing the value in memory between the calls.
With -O2:
c0001c38: e1a00006 mov r0, r6
c0001c3c: e59f10d8 ldr r1, [pc, #216] ; c0001d1c <kmalloc_wilderness+0x1b8>
c0001c40: e5942008 ldr r2, [r4, #8]
c0001c44: eb000278 bl c000262c <term_printf>
c0001c48: e1a00006 mov r0, r6
c0001c4c: e59f10cc ldr r1, [pc, #204] ; c0001d20 <kmalloc_wilderness+0x1bc>
c0001c50: e5942008 ldr r2, [r4, #8]
c0001c54: eb000274 bl c000262c <term_printf>
So it still fetches the value with ldr. That's why I get the same thing with both optimization levels.
New edit: I added more printfs, and it seems that the singularity happens at this point:
next->size = -1;
After this line, chunk->next_free turns into Heisenberg's cat. Before, it reads as 0.
The structure is defined as such:
struct kheap_chunk {
size_t prev_size;
size_t size; // -1 for wilderness chunk, bit 0 high if free
struct kheap_chunk *next_free;
};
chunk and next don't overlap.
If I move the "singularity line" below the next->next_free = chunk->next_free, it stops alternating between two values, but it's still weird: chunk->next_free is 0 before *prev_list = next, 0xffffffff after that. But next->next_free is still set to 0.

EXC_BAD_ACCESS in Assembly Code in iOS App

I'm trying to debug an EXC_BAD_ACCESS crash in an iOS app I am working on. Basically, my code calls the function new_dyna_start(), which is implemented in assembly. Here's the relevant assembly code:
.align 4
42430:
.long _translation_cache_iphone
.align 2
.globl _new_dyna_start
// .type new_dyna_start, %function
_new_dyna_start:
ldr r12, .dlptr
mov r0, #0xa4000000
stmia r12, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
sub fp, r12, #28
add r0, r0, #0x40
bl _new_recompile_block
ldr r0, [fp, #64]
ldr r10, [fp, #400+36] /* Count */
str r0, [fp, #72]
sub r10, r10, r0
ldr r0, 42430b
ldr pc, [r0]
From my (limited) understanding, at line 6 of the method, it calls the C function new_recompile_block(). This method works fine, and I know it finishes because at the end of the function I have
printf("End of loop");
which then appears in the debugger. After the method completes, I'm not entirely sure I understand what happens, but it seems that the assembly method obtains a reference to the C variable translation_cache_iphone. However, at the final line the app crashes oddly. This message appears in Xcode: http://imgur.com/dqKo0
However, if I click on the side to the last method called, I see it is this: http://imgur.com/M5h84
This seems to support my idea that it is the translation_cache_iphone variable causing the crash, as the memory address of the EXC_BAD_ACCESS (0x401000) is the same as translation_cache_iphone. translation_cache_iphone is declared as:
unsigned char* translation_cache_iphone = NULL;
and is initialized by:
translation_cache_iphone = (unsigned char *)(((unsigned long) translation_cache_static_iphone + (4096)) & ~(4095));
Am I right in assuming that this is the problem? Is the problem in the assembly code, or in the C code? I've tried modifying both, but to no avail. The assembly code above is the original.
Here is a link to the full source on Github. Simply compile and run on an iDevice with Xcode and you'll see the exact issues I'm facing. It may be easier to debug that way.
The last two instructions form an indirect jump to the translation_cache_iphone which is thus expected to be executable code. Verify that is the case and that memory permissions are appropriate - in many systems data pages are not executable by default.
This seems to support my idea that it is the translation_cache_iphone variable causing the crash
Yes, I believe that this variable is the problem.
In the assembly code you posted, I can see two lines that could cause an invalid memory access:
ldr r0, 42430b
ldr pc, [r0]
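The first line loads the word stored at the label 42430, which is the address of translation_cache_iphone, into the register r0. The second line then loads PC (the program counter) from that address, so execution jumps to whatever translation_cache_iphone points to.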
In the beginning of the assembly code you have declared what is the label 42430:
42430:
.long _translation_cache_iphone
Then, when the processor tries to execute the memory at that address as code, it crashes.
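As a side note, the initialization expression in the question rounds the static buffer's address up to the next 4096-byte boundary, which is consistent with the page-aligned fault address 0x401000. Here is a minimal C sketch of that arithmetic; the function and macro names are made up for illustration.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE ((uintptr_t)4096)

/* Same arithmetic as (addr + 4096) & ~4095 in the question: move to the
 * next page boundary. An address that is already aligned still advances
 * by one full page. */
uintptr_t next_page_boundary(uintptr_t addr)
{
    return (addr + SKETCH_PAGE_SIZE) & ~(SKETCH_PAGE_SIZE - 1);
}
```

Aligning the pointer only changes where the buffer starts. It does not change the memory permissions of the page, so the jump can still fault if the page is not executable.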
