I am trying to understand the boot process of the BeagleBone Black by browsing through the source code. Assume that I am keeping the MLO and u-boot.img files on a micro-SD card and making the BeagleBone boot from that SD card.
To my understanding, the ROM code executes first, and it loads the MLO file from the MMC into the internal SRAM of the SoC. The MLO file contains the code for the x-loader, a Secondary Program Loader (SPL). The SPL then sets up the DRAM and copies the third-stage loader (U-Boot proper) into the DRAM. U-Boot proper then starts executing directly from DRAM.
The architecture-dependent portion of U-Boot proper is located in the arch/arm/ directory of the U-Boot sources.
The code pertaining to the SPL is located in the spl/ directory. (When executing make mrproper followed by make SPL, *.o files are created only in the spl/ directory.)
For U-Boot proper, I am guessing this is the execution flow:
arch/arm/cpu/armv7/start.S is located at the reset vector (so it runs first), and after some initialization it calls the _main routine located in arch/arm/lib/crt0.S.
In crt0.S, board_init_f() is called, which sets up the DRAM (among other things) and then returns to where it left off (in _main). _main later calls relocate_code, which relocates it to DRAM again.
ENTRY(_main)
/*
* Set up initial C runtime environment and call board_init_f(0).
*/
#if defined(CONFIG_SPL_BUILD) && defined(CONFIG_SPL_STACK)
ldr r0, =(CONFIG_SPL_STACK)
#else
ldr r0, =(CONFIG_SYS_INIT_SP_ADDR)
#endif
bic r0, r0, #7 /* 8-byte alignment for ABI compliance */
mov sp, r0
bl board_init_f_alloc_reserve
mov sp, r0
/* set up gd here, outside any C code */
mov r9, r0
bl board_init_f_init_reserve
mov r0, #0
bl board_init_f
#if ! defined(CONFIG_SPL_BUILD)
/*
* Set up intermediate environment (new sp and gd) and call
* relocate_code(addr_moni). Trick here is that we'll return
* 'here' but relocated.
*/
ldr r0, [r9, #GD_START_ADDR_SP] /* sp = gd->start_addr_sp */
bic r0, r0, #7 /* 8-byte alignment for ABI compliance */
mov sp, r0
ldr r9, [r9, #GD_BD] /* r9 = gd->bd */
sub r9, r9, #GD_SIZE /* new GD is below bd */
adr lr, here
ldr r0, [r9, #GD_RELOC_OFF] /* r0 = gd->reloc_off */
add lr, lr, r0
#if defined(CONFIG_CPU_V7M)
orr lr, #1 /* As required by Thumb-only */
#endif
ldr r0, [r9, #GD_RELOCADDR] /* r0 = gd->relocaddr */
b relocate_code
here:
Why does U-Boot proper need to set up DRAM again and relocate itself to DRAM if this was already done by the SPL? Am I missing something here?
Why does U-Boot proper need to set up DRAM again and relocate itself to DRAM if this was already done by the SPL?
The relocation to the top of memory is to maximize contiguous free space.
You could build/link for and load U-Boot at this high-memory address (to save on the copy). But whenever you rebuild U-Boot with added features and the image gets larger, that load address has to be recalculated.
Which also means that the SPL has to be changed.
BTW "setup DRAM" would typically be thought of as initializing a DRAM controller, something that is not done at this stage.
Besides relocating its code, U-Boot also has to move the stack and heap, aka the C runtime environment.
Am I missing something here?
It's more convenient to load U-Boot in the middle of main memory, and then let U-Boot relocate itself to high memory.
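As a rough sketch of the idea (this is conceptual C, not the actual U-Boot source; names like pick_reloc_addr, ram_top and uboot_size are just illustrative), the relocation destination is derived from the top of usable DRAM and the size of the freshly built image, so it never has to be hard-coded:
unsigned long pick_reloc_addr(unsigned long ram_top, unsigned long uboot_size)
{
    unsigned long addr = ram_top;   /* start from the top of DRAM              */
    addr -= uboot_size;             /* leave room for the U-Boot image itself  */
    addr &= ~(4096UL - 1);          /* align down to a 4 KiB boundary          */
    return addr;                    /* conceptually what ends up in gd->relocaddr,
                                       which crt0.S passes to relocate_code    */
}
Because this address is computed at run time from the detected RAM size and the size of the current image, a bigger U-Boot never requires changing the SPL or the link address.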
This code (very similar code; I haven't tried exactly this code) compiles with the Android NDK, but not with Xcode (armv7+arm64, iOS).
The errors are shown in the comments:
uint32_t *src;
uint32_t *dst;
#ifdef __ARM_NEON
__asm__ volatile(
"vld1.32 {d0, d1}, [%[src]] \n" // error: Vector register expected
"vrev32.8 q0, q0 \n" // error: Unrecognized instruction mnemonic
"vst1.32 {d0, d1}, [%[dst]] \n" // error: Vector register expected
:
: [src]"r"(src), [dst]"r"(dst)
: "d0", "d1"
);
#endif
What's wrong with this code?
EDIT1:
I rewrote the code using intrinsics:
uint8x16_t x = vreinterpretq_u8_u32(vld1q_u32(src));
uint8x16_t y = vrev32q_u8(x);
vst1q_u32(dst, vreinterpretq_u32_u8(y));
After disassembling, I get the following, which is a variation I have already tried:
vld1.32 {d16, d17}, [r0]!
vrev32.8 q8, q8
vst1.32 {d16, d17}, [r1]!
So my code looks like this now, but gives the exact same errors:
__asm__ volatile("vld1.32 {d0, d1}, [%0]! \n"
"vrev32.8 q0, q0 \n"
"vst1.32 {d0, d1}, [%1]! \n"
:
: "r"(src), "r"(dst)
: "d0", "d1"
);
EDIT2:
Reading through the disassembly, I actually found a second version of the function. It turns out that arm64 uses a slightly different instruction set. For example, the arm64 assembly uses rev32.16b v0, v0 instead. The whole function listing (which I can't make heads or tails of) is below:
_My_Function:
cmp w2, #0
add w9, w2, #3
csel w8, w9, w2, lt
cmp w9, #7
b.lo 0x3f4
asr w9, w8, #2
ldr x8, [x0]
mov w9, w9
lsl x9, x9, #2
ldr q0, [x8], #16
rev32.16b v0, v0
str q0, [x1], #16
sub x9, x9, #16
cbnz x9, 0x3e0
ret
I have successfully published several iOS apps which make use of ARM assembly language, and inline code is the most frustrating way to do it. Apple still requires apps to support both ARM32 and ARM64 devices. Since the code will be built as both ARM32 and ARM64 by default (unless you changed the compile options), you need to design code which will successfully compile in both modes. As you noticed, ARM64 has a completely different mnemonic format and register model. There are 2 simple ways around this:
1) Write your code using NEON intrinsics. ARM specified that the original ARM32 intrinsics would remain mostly unchanged for ARMv8 targets and therefore can be compiled to both ARM32 and ARM64 code. This is the safest/easiest option.
2) Write inline code or a separate '.S' module for your assembly language code. To deal with the 2 compile modes, use "#ifdef __arm64__" and "#ifdef __arm__" to distinguish between the two instruction sets.
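A minimal skeleton of option 2 might look like this (the __arm64__ and __arm__ macros are the ones Apple's clang predefines; the comment bodies are placeholders, not working code):
#if defined(__arm64__)
    // AArch64: ldr/str with q registers, rev32.16b v0, v0, etc.
#elif defined(__arm__)
    // ARM32: vld1.32/vst1.32 with d registers, vrev32.8 q0, q0, etc.
#else
#error "unexpected architecture"
#endif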
Intrinsics are apparently the only way to use the same code for NEON between ARM (32-bit) and AArch64.
There are many reasons not to use inline-assembly: https://gcc.gnu.org/wiki/DontUseInlineAsm
Unfortunately, current compilers often do a very poor job with ARM / AArch64 intrinsics, which is surprising because they do an excellent job optimizing x86 SSE/AVX intrinsics and PowerPC Altivec. They often do ok in simple cases, but can easily introduce extra store/reloads.
In theory with intrinsics, you should get good asm output, and it lets the compiler schedule instructions between the vector load and store, which will help most on an in-order core. (Or you could write a whole loop in inline asm that you schedule by hand.)
ARM's official documentation:
Although it is technically possible to optimize NEON assembly by hand, this can be very difficult because the pipeline and memory access timings have complex inter-dependencies. Instead of hand assembly, ARM strongly recommends the use of intrinsics
If you do use inline asm anyway, avoid future pain by getting it right.
It's easy to write inline asm that happens to work, but isn't safe wrt. future source changes (and sometimes to future compiler optimizations), because the constraints don't accurately describe what the asm does. The symptoms will be weird, and this kind of context-sensitive bug could even lead to unit tests passing but wrong code in the main program. (or vice versa).
A latent bug that doesn't cause any defects in the current build is still a bug, and is a really Bad Thing in a Stack Overflow answer that can be copied as an example into other contexts. #bitwise's code in the question and self-answer both have bugs like this.
The inline asm in the question isn't safe, because it modifies memory without telling the compiler about it. This probably only manifests in a loop that reads from dst in C both before and after the inline asm. However, it's easy to fix, and doing so lets us drop the volatile (and the "memory" clobber, which it's missing) so the compiler can optimize better (but still with significant limitations compared to intrinsics).
volatile should prevent reordering relative to memory accesses, so it may not happen outside of fairly contrived circumstances. But that's hard to prove.
The following compiles for ARM and AArch64 (it might fail if compiling for ILP32 on AArch64, though, I forgot about that possibility). Using -funroll-loops leads to gcc choosing different addressing modes, and not forcing the dst++; src++; to happen between every inline asm statement. (This maybe wouldn't be possible with asm volatile).
I used memory operands so the compiler knows that memory is an input and an output, and to give the compiler the option of using auto-increment / decrement addressing modes. This is better than anything you can do with a pointer in a register as an input operand, because it allows loop unrolling to work.
This still doesn't let the compiler schedule the store many instructions after the corresponding load to software pipeline the loop for in-order cores, so it's probably only going to perform decently on out-of-order ARM chips.
#include <stdint.h>
#include <stddef.h>

void bytereverse32(uint32_t *dst32, const uint32_t *src32, size_t len)
{
typedef struct { uint64_t low, high; } vec128_t;
const vec128_t *src = (const vec128_t*) src32;
vec128_t *dst = (vec128_t*) dst32;
// with old gcc, this gets gcc to use a pointer compare as the loop condition
// instead of incrementing a loop counter
const vec128_t *src_endp = src + len/(sizeof(vec128_t)/sizeof(uint32_t));
// len is in units of 4-byte chunks
while (src < src_endp) {
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#if __LP64__ // FIXME: doesn't account for ILP32 in 64-bit mode
// aarch64 registers: s0 and d0 are subsets of q0 (128bit), synonym for v0
asm ("ldr q0, %[src] \n\t"
"rev32.16b v0, v0 \n\t"
"str q0, %[dst] \n\t"
: [dst] "=<>m"(*dst) // auto-increment/decrement or "normal" memory operand
: [src] "<>m" (*src)
: "q0", "v0"
);
#else
// arm32 registers: 128bit q0 is made of d0:d1, or s0:s3
asm ("vld1.32 {d0, d1}, %[src] \n\t"
"vrev32.8 q0, q0 \n\t" // reverse 8 bit elements inside 32bit words
"vst1.32 {d0, d1}, %[dst] \n"
: [dst] "=<>m"(*dst)
: [src] "<>m"(*src)
: "d0", "d1"
);
#endif
#else
#error "no NEON"
#endif
// increment pointers by 16 bytes
src++; // The inline asm doesn't modify the pointers.
dst++; // of course, these increments may compile to a post-increment addressing mode
// this way has the advantage of letting the compiler unroll or whatever
}
}
This compiles (on the Godbolt compiler explorer with gcc 4.8), but I don't know if it assembles, let alone works correctly. Still, I'm confident these operand constraints are correct. Constraints are basically the same across all architectures, and I understand them much better than I know NEON.
Anyway, the inner loop on ARM (32bit) with gcc 4.8 -O3, without -funroll-loops is:
.L4:
vld1.32 {d0, d1}, [r1], #16 # MEM[(const struct vec128_t *)src32_17]
vrev32.8 q0, q0
vst1.32 {d0, d1}, [r0], #16 # MEM[(struct vec128_t *)dst32_18]
cmp r3, r1 # src_endp, src32
bhi .L4 #,
The register constraint bug
The code in the OP's self-answer has another bug: the input pointer operands use separate "r" constraints instead of being tied to the "=r" outputs. This leads to breakage if the compiler wants to keep the old value around, and chooses an input register for src that isn't the same as the output register.
If you want to take pointer inputs in registers and choose your own addressing modes, you can use "0" matching-constraints, or you can use "+r" read-write output operands.
You will also need a "memory" clobber or dummy memory input/output operands (i.e. that tell the compiler which bytes of memory are read and written, even if you don't use that operand number in the inline asm).
See Looping over arrays with inline assembly for a discussion of the advantages and disadvantages of using r constraints for looping over an array on x86. ARM has auto-increment addressing modes, which appear to produce better code than anything you could get with manual choice of addressing modes. It lets gcc use different addressing modes in different copies of the block when loop-unrolling. "r" (pointer) constraints appear to have no advantage, so I won't go into detail about how to use a dummy input / output constraint to avoid needing a "memory" clobber.
Test-case that generates wrong code with #bitwise's asm statement:
// return a value as a way to tell the compiler it's needed after
uint32_t* unsafe_asm(uint32_t *dst, const uint32_t *src)
{
uint32_t *orig_dst = dst;
uint32_t initial_dst0val = orig_dst[0];
#ifdef __ARM_NEON
#if __LP64__
asm volatile("ldr q0, [%0], #16 # unused src input was %2\n\t"
"rev32.16b v0, v0 \n\t"
"str q0, [%1], #16 # unused dst input was %3\n"
: "=r"(src), "=r"(dst)
: "r"(src), "r"(dst)
: "d0", "d1" // ,"memory"
// clobbers don't include v0?
);
#else
asm volatile("vld1.32 {d0, d1}, [%0]! # unused src input was %2\n\t"
"vrev32.8 q0, q0 \n\t"
"vst1.32 {d0, d1}, [%1]! # unused dst input was %3\n"
: "=r"(src), "=r"(dst)
: "r"(src), "r"(dst)
: "d0", "d1" // ,"memory"
);
#endif
#else
#error "No NEON/AdvSIMD"
#endif
uint32_t final_dst0val = orig_dst[0];
// gcc assumes the asm doesn't change orig_dst[0], so it only does one load (after the asm)
// and uses it for final and initial
// uncomment the memory clobber, or use a dummy output operand, to avoid this.
// pointer + initial+final compiles to LSL 3 to multiply by 8 = 2 * sizeof(uint32_t)
// using orig_dst after the inline asm makes the compiler choose different registers for the
// "=r"(dst) output operand and the "r"(dst) input operand, since the asm constraints
// advertise this non-destructive capability.
return orig_dst + final_dst0val + initial_dst0val;
}
This compiles to (AArch64 gcc4.8 -O3):
ldr q0, [x1], #16 # unused src input was x1 // src, src
rev32.16b v0, v0
str q0, [x2], #16 # unused dst input was x0 // dst, dst
ldr w1, [x0] // D.2576, *dst_1(D)
add x0, x0, x1, lsl 3 //, dst, D.2576,
ret
The store uses x2 (an uninitialized register, since this function only takes 2 args). The "=r"(dst) output (%1) picked x2, while the "r"(dst) input (%3 which is used only in a comment) picked x0.
final_dst0val + initial_dst0val compiles to 2x final_dst0val, because we lied to the compiler and told it that memory wasn't modified. So instead of reading the same memory before and after the inline asm statement, it just reads after and left-shifts by one extra position when adding to the pointer. (The return value exists only to use the values so they're not optimized away).
We can fix both problems by correcting the constraints: using "+r" for the pointers and adding a "memory" clobber. (A dummy output would also work, and might hurt optimization less.) I didn't bother since this appears to have no advantage over the memory-operand version above.
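A rough source-level sketch of those corrected constraints (untested, and volatile kept out of caution; not verbatim what produced the output below):
#if __LP64__
    asm volatile("ldr q0, [%0], #16 \n\t"
                 "rev32.16b v0, v0 \n\t"
                 "str q0, [%1], #16 \n"
                 : "+r"(src), "+r"(dst)   // read-write: the asm consumes and updates the pointers
                 :
                 : "q0", "v0", "memory"); // the asm writes memory the compiler can't otherwise see
#else
    asm volatile("vld1.32 {d0, d1}, [%0]! \n\t"
                 "vrev32.8 q0, q0 \n\t"
                 "vst1.32 {d0, d1}, [%1]! \n"
                 : "+r"(src), "+r"(dst)
                 :
                 : "d0", "d1", "memory");
#endif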
With those changes, we get
safe_register_pointer_asm:
ldr w3, [x0] //, *dst_1(D)
mov x2, x0 // dst, dst ### These 2 insns are new
ldr q0, [x1], #16 // src
rev32.16b v0, v0
str q0, [x2], #16 // dst
ldr w1, [x0] // D.2597, *dst_1(D)
add x3, x1, x3, uxtw // D.2597, D.2597, initial_dst0val ## And this is new, to add the before and after loads
add x0, x0, x3, lsl 2 //, dst, D.2597,
ret
As stated in the edits to the original question, it turned out that I needed a different assembly implementation for arm64 and armv7.
#ifdef __ARM_NEON
#if __LP64__
asm volatile("ldr q0, [%0], #16 \n"
"rev32.16b v0, v0 \n"
"str q0, [%1], #16 \n"
: "=r"(src), "=r"(dst)
: "r"(src), "r"(dst)
: "d0", "d1"
);
#else
asm volatile("vld1.32 {d0, d1}, [%0]! \n"
"vrev32.8 q0, q0 \n"
"vst1.32 {d0, d1}, [%1]! \n"
: "=r"(src), "=r"(dst)
: "r"(src), "r"(dst)
: "d0", "d1"
);
#endif
#else
The intrinsics code that I posted in the original post generated surprisingly good assembly though, and also generated the arm64 version for me, so it may be a better idea to use intrinsics instead in the future.
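For completeness, a self-contained sketch of that intrinsics approach (my own wrapper around the EDIT1 code; the function name and the assumption that len is a multiple of 4 are mine), which builds unchanged for both armv7 and arm64:
#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

// Reverse the bytes inside each 32-bit word, 16 bytes per iteration.
static void bswap32_buf(uint32_t *dst, const uint32_t *src, size_t len)
{
    for (size_t i = 0; i < len; i += 4) {
        uint8x16_t x = vreinterpretq_u8_u32(vld1q_u32(src + i));
        vst1q_u32(dst + i, vreinterpretq_u32_u8(vrev32q_u8(x)));
    }
}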
I am porting some 32-bit ARM code to 64-bit and am having trouble determining the 64-bit version of this instruction:
ldr r1, =_fns
Where _fns is a symbol defined in some C source file elsewhere in the project.
I've tried the below but both give errors:
adr x1, _fns <== "error: unknown AArch64 fixup kind!"
adrl x1, _fns <== "error: unrecognized instruction mnemonic"
The assembler is LLVM in the iOS SDK (Xcode 7.1).
I've noticed that if _fns is locally defined (i.e. in the same .S file) then "adr x1,_fns" works fine. However that's not a fix as _fns has to be in C code (i.e. in a different translation unit).
What is the right way to do this with LLVM ARM assembler?
If I feed
extern char ar[];
char *f()
{
return ar;
}
into the ELLCC (clang based) demo, I get:
Output from the compiler targeting ARM AArch64
.text
.file "/tmp/webcompile/_3793_0.c"
.globl f
.align 2
.type f,#function
f: // #f
// BB#0: // %entry
adrp x0, ar
add x0, x0, :lo12:ar
ret
.Lfunc_end0:
.size f, .Lfunc_end0-f
.ident "ecc 0.1.13 based on clang version 3.7.0 (trunk) (based on LLVM 3.7.0svn)"
.section ".note.GNU-stack","",#progbits
The adrp instruction gets the "page" address of ar into x0. The symbol argument to adrp translates into a 21 bit PC relative offset to the 4K page in which the symbol resides. That offset is added to the PC to get the actual start of the page. The add instruction adds the low 12 bits of the symbol address to get the actual symbol address.
This instruction sequence allows the address of a symbol within +/-4GB of the PC to be loaded.
As far as I can tell, there doesn't seem to be a way of getting functionality similar to 32-bit ARM's "=ar" in C. In assembly language, it looks like this will work:
.text
.file "atest.s"
.globl f
.align 2
f:
ldr x0, p
ret
.align 3
p:
.xword _fns
This is very similar to what the 32 bit ARM does under the hood.
The only reason I started out with the C version was to show how I usually attack a problem like this, especially if I'm not that familiar with the target assembly language.
This works well for me (Xcode 7.1, LLVM 7.0, iOS 9.1).
In 32-bit ARM:
ldr r9,=JumpTab
In 64-bit ARM, change it to:
adrp x9,JumpTab#PAGE
add x9,x9,JumpTab#PAGEOFF
By the way, be careful about which register numbers you use; some registers have specific uses in ARM64.
I get a linking problem when creating a library for iOS 7 on iPhone (ARM64).
The error message is:
ld: in /long_path/libHEVCCodec.a(inv_xforms_arm64.o), in section __TEXT,__text reloc 0:
ARM64_RELOC_SUBTRACTOR must have r_length of 2 or 3 for architecture arm64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
This error appears as a result of this code (it's some sort of switch):
adr addr, .L.dct_add_switch
ldrh offset, [addr, ta, lsl #1]
add addr, addr, offset, uxth
br addr
.L.dct_add_switch:
.hword .L.dct_add_4 - .L.dct_add_switch
.hword .L.dst_add_4 - .L.dct_add_switch
...
ta, addr, and offset are the general-purpose registers x3, x4, and w5, respectively.
Does anybody know how to handle this situation?
PS: there are no problems with GNU GCC & Android.
EDIT1:
It seems that the problem is not in the linker itself but in the compiler.
I checked the object file (with objdump), and instead of the difference constants there are just zeros.
.L.dct_add_switch:
0000000000000010 .long 0x00000000
0000000000000014 .long 0x00000000
0000000000000018 .long 0x00000000
000000000000001c nop
When I put manually calculated constants in place of the ".L.dct_add_4 - .L.dct_add_switch", etc. expressions, everything works fine.
Maybe there are some compiler flags that will make the compiler do its job correctly?
Thanks.
Well, it is a compiler & linker problem, and it depends on the size of the data used for the offsets. Clang is not very friendly to anything that is different from 4 bytes.
The discussion and possible solutions are in another topic: creating constant jump table; xcode; clang; asm
The problem is that the Mach-O object file format for ARM 64-bit targets doesn't support a relocation for the 16-bit difference between two symbols. It appears that the difference must be 32-bit or 64-bit. It doesn't seem to be a problem with the compiler or the linker. The assembly code you've quoted in your question looks like handcrafted assembly, not compiler output.
The solution would be to rewrite the assembly to use 32-bit difference values. Something like this:
adr addr, .L.dct_add_switch
ldr offset, [addr, ta, lsl #2]
add addr, addr, offset, uxtw
br addr
.L.dct_add_switch:
.word .L.dct_add_4 - .L.dct_add_switch
.word .L.dst_add_4 - .L.dct_add_switch
I'm investigating a little bit into how the Objective-C language is mapped to assembly. I've started from a tutorial found at iOS Assembly Tutorial.
The code snippet under analysis is the following.
void fooFunction() {
int add = addFunction(12, 34);
printf("add = %i", add);
}
It is translated into
_fooFunction:
# 1:
push {r7, lr}
# 2:
movs r0, #12
movs r1, #34
# 3:
mov r7, sp
# 4:
bl _addFunction
# 5:
mov r1, r0
# 6:
movw r0, :lower16:(L_.str-(LPC1_0+4))
movt r0, :upper16:(L_.str-(LPC1_0+4))
LPC1_0:
add r0, pc
# 7:
blx _printf
# 8:
pop {r7, pc}
About the assembly code, I cannot understand the following two points
-> Comment #1
The author says that push decrements the stack by 8 bytes since r7 and lr are 4 bytes each. OK. But he also says that the two values are stored with one instruction. What does that mean?
-> Comment #6
movw r0, :lower16:(L_.str-(LPC1_0+4))
movt r0, :upper16:(L_.str-(LPC1_0+4))
The author says that r0 will hold the address of the "add = %i" string (which can be found in the data segment), but I don't really get what the memory layout looks like. Why does he represent the difference L_.str-(LPC1_0+4) with the dotted black line and not with the red one (drawn by me)?
Any clarifications will be appreciated.
Edit
I'm missing the concept of pushing r7 onto the stack. What does it mean to push that value, and what does it contain?
But he also says that the two values are stored with one instruction. What does that mean?
That the single push instruction will put both values onto the stack.
Why does he represent the difference L_.str-(LPC1_0+4)
Because the add r0, pc implicitly adds 4 bytes more. To quote the instruction set reference:
Add an immediate constant to the value from sp or pc, and place the result into a low register.
Syntax: ADD Rd, Rp, #expr
where:
Rd is the destination register. Rd must be in the range r0-r7.
Rp is either sp or pc.
expr is an expression that evaluates (at assembly time) to a multiple of 4 in the range 0-1020.
If Rp is the pc, the value used is: (the address of the current instruction + 4) AND &FFFFFFFC.
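Putting the pieces together (a sketch of the arithmetic, assuming Thumb state, where reading pc gives the address of the current instruction plus 4): when add r0, pc executes at LPC1_0, r0 already holds L_.str - (LPC1_0 + 4) and pc reads as LPC1_0 + 4, so the result is (L_.str - (LPC1_0 + 4)) + (LPC1_0 + 4) = L_.str, i.e. the address of the "add = %i" string.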
For comment 1:
The two values pushed to the stack are the values stored in r7 and lr.
Two 4-byte values equal 8 bytes.
For comment 6:
The label LPC1_0 is followed by the instruction
add r0, pc
which adds the value of pc, at that point LPC1_0 + 4, to the difference between the two addresses, leaving the address of L_.str in r0.