RGBA to ABGR: Inline arm neon asm for iOS/Xcode - ios

This code(very similar code, haven't tried exactly this code) compiles using Android NDK, but not with Xcode/armv7+arm64/iOS
Errors in comments:
uint32_t *src;
uint32_t *dst;
#ifdef __ARM_NEON
__asm__ volatile(
"vld1.32 {d0, d1}, [%[src]] \n" // error: Vector register expected
"vrev32.8 q0, q0 \n" // error: Unrecognized instruction mnemonic
"vst1.32 {d0, d1}, [%[dst]] \n" // error: Vector register expected
:
: [src]"r"(src), [dst]"r"(dst)
: "d0", "d1"
);
#endif
What's wrong with this code?
EDIT1:
I rewrote the code using intrinsics:
uint8x16_t x = vreinterpretq_u8_u32(vld1q_u32(src));
uint8x16_t y = vrev32q_u8(x);
vst1q_u32(dst, vreinterpretq_u32_u8(y));
After disassembling, I get the following, which is a variation I have already tried:
vld1.32 {d16, d17}, [r0]!
vrev32.8 q8, q8
vst1.32 {d16, d17}, [r1]!
So my code looks like this now, but gives the exact same errors:
__asm__ volatile("vld1.32 {d0, d1}, [%0]! \n"
"vrev32.8 q0, q0 \n"
"vst1.32 {d0, d1}, [%1]! \n"
:
: "r"(src), "r"(dst)
: "d0", "d1"
);
EDIT2:
Reading through the disassembly, I actually found a second version of the function. It turns out that arm64 uses a slightly different instruction set. For example, the arm64 assembly uses rev32.16b v0, v0 instead. The whole function listing(which I can't make heads or tails of) is below:
_My_Function:
cmp w2, #0
add w9, w2, #3
csel w8, w9, w2, lt
cmp w9, #7
b.lo 0x3f4
asr w9, w8, #2
ldr x8, [x0]
mov w9, w9
lsl x9, x9, #2
ldr q0, [x8], #16
rev32.16b v0, v0
str q0, [x1], #16
sub x9, x9, #16
cbnz x9, 0x3e0
ret

I have successfully published several iOS apps which make use of ARM assembly language and inline code is the most frustrating way to do it. Apple still requires apps to support both ARM32 and ARM64 devices. Since the code will be built as both ARM32 and ARM64 by default (unless you changed the compile options), you need to design code which will successfully compile in both modes. As you noticed, ARM64 is a completely different mnemonic format and register model. There are 2 simple ways around this:
1) Write your code using NEON intrinsics. ARM specified that the original ARM32 intrinsics would remain mostly unchanged for ARMv8 targets and therefore can be compiled to both ARM32 and ARM64 code. This is the safest/easiest option.
2) Write inline code or a separate '.S' module for your assembly language code. To deal with the 2 compile modes, use "#ifdef __arm64__" and "#ifdef __arm__" to distinguish between the two instruction sets.

Intrinsics are apparently the only way to use the same code for NEON between ARM (32-bit) and AArch64.
There are many reasons not to use inline-assembly: https://gcc.gnu.org/wiki/DontUseInlineAsm
Unfortunately, current compilers often do a very poor job with ARM / AArch64 intrinsics, which is surprising because they do an excellent job optimizing x86 SSE/AVX intrinsics and PowerPC Altivec. They often do ok in simple cases, but can easily introduce extra store/reloads.
In theory with intrinsics, you should get good asm output, and it lets the compiler schedule instructions between the vector load and store, which will help most on an in-order core. (Or you could write a whole loop in inline asm that you schedule by hand.)
ARM's official documentation:
Although it is technically possible to optimize NEON assembly by hand, this can be very difficult because the pipeline and memory access timings have complex inter-dependencies. Instead of hand assembly, ARM strongly recommends the use of intrinsics
If you do use inline asm anyway, avoid future pain by getting it right.
It's easy to write inline asm that happens to work, but isn't safe wrt. future source changes (and sometimes to future compiler optimizations), because the constraints don't accurately describe what the asm does. The symptoms will be weird, and this kind of context-sensitive bug could even lead to unit tests passing but wrong code in the main program. (or vice versa).
A latent bug that doesn't cause any defects in the current build is still a bug, and is a really Bad Thing in a Stackoverflow answer that can be copied as an example into other contexts. #bitwise's code in the question and self-answer both have bugs like this.
The inline asm in the question isn't safe, because it modifies memory telling the compiler about it. This probably only manifests in a loop that reads from dst in C both before and after the inline asm. However, it's easy to fix, and doing so lets us drop the volatile (and the `"memory" clobber which it's missing) so the compiler can optimize better (but still with significant limitations compared to intrinsics).
volatile should prevent reordering relative to memory accesses, so it may not happen outside of fairly contrived circumstances. But that's hard to prove.
The following compiles for ARM and AArch64 (it might fail if compiling for ILP32 on AArch64, though, I forgot about that possibility). Using -funroll-loops leads to gcc choosing different addressing modes, and not forcing the dst++; src++; to happen between every inline asm statement. (This maybe wouldn't be possible with asm volatile).
I used memory operands so the compiler knows that memory is an input and an output, and giving the compiler the option to use auto-increment / decrement addressing modes. This is better than anything you can do with a pointer in a register as an input operand, because it allows loop unrolling to work.
This still doesn't let the compiler schedule the store many instructions after the corresponding load to software pipeline the loop for in-order cores, so it's probably only going to perform decently on out-of-order ARM chips.
void bytereverse32(uint32_t *dst32, const uint32_t *src32, size_t len)
{
typedef struct { uint64_t low, high; } vec128_t;
const vec128_t *src = (const vec128_t*) src32;
vec128_t *dst = (vec128_t*) dst32;
// with old gcc, this gets gcc to use a pointer compare as the loop condition
// instead of incrementing a loop counter
const vec128_t *src_endp = src + len/(sizeof(vec128_t)/sizeof(uint32_t));
// len is in units of 4-byte chunks
while (src < src_endp) {
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#if __LP64__ // FIXME: doesn't account for ILP32 in 64-bit mode
// aarch64 registers: s0 and d0 are subsets of q0 (128bit), synonym for v0
asm ("ldr q0, %[src] \n\t"
"rev32.16b v0, v0 \n\t"
"str q0, %[dst] \n\t"
: [dst] "=<>m"(*dst) // auto-increment/decrement or "normal" memory operand
: [src] "<>m" (*src)
: "q0", "v0"
);
#else
// arm32 registers: 128bit q0 is made of d0:d1, or s0:s3
asm ("vld1.32 {d0, d1}, %[src] \n\t"
"vrev32.8 q0, q0 \n\t" // reverse 8 bit elements inside 32bit words
"vst1.32 {d0, d1}, %[dst] \n"
: [dst] "=<>m"(*dst)
: [src] "<>m"(*src)
: "d0", "d1"
);
#endif
#else
#error "no NEON"
#endif
// increment pointers by 16 bytes
src++; // The inline asm doesn't modify the pointers.
dst++; // of course, these increments may compile to a post-increment addressing mode
// this way has the advantage of letting the compiler unroll or whatever
}
}
This compiles (on the Godbolt compiler explorer with gcc 4.8), but I don't know if it assembles, let alone works correctly. Still, I'm confident these operand constraints are correct. Constraints are basically the same across all architectures, and I understand them much better than I know NEON.
Anyway, the inner loop on ARM (32bit) with gcc 4.8 -O3, without -funroll-loops is:
.L4:
vld1.32 {d0, d1}, [r1], #16 # MEM[(const struct vec128_t *)src32_17]
vrev32.8 q0, q0
vst1.32 {d0, d1}, [r0], #16 # MEM[(struct vec128_t *)dst32_18]
cmp r3, r1 # src_endp, src32
bhi .L4 #,
The register constraint bug
The code in the OP's self-answer has another bug: The input pointer operands uses separate "r" constraints. This leads to breakage if the compiler wants to keep the old value around, and chooses an input register for src that isn't the same as the output register.
If you want to take pointer inputs in registers and choose your own addressing modes, you can use "0" matching-constraints, or you can use "+r" read-write output operands.
You will also need a "memory" clobber or dummy memory input/output operands (i.e. that tell the compiler which bytes of memory are read and written, even if you don't use that operand number in the inline asm).
See Looping over arrays with inline assembly for a discussion of the advantages and disadvantages of using r constraints for looping over an array on x86. ARM has auto-increment addressing modes, which appear to produce better code than anything you could get with manual choice of addressing modes. It lets gcc use different addressing modes in different copies of the block when loop-unrolling. "r" (pointer) constraints appear to have no advantage, so I won't go into detail about how to use a dummy input / output constraint to avoid needing a "memory" clobber.
Test-case that generates wrong code with #bitwise's asm statement:
// return a value as a way to tell the compiler it's needed after
uint32_t* unsafe_asm(uint32_t *dst, const uint32_t *src)
{
uint32_t *orig_dst = dst;
uint32_t initial_dst0val = orig_dst[0];
#ifdef __ARM_NEON
#if __LP64__
asm volatile("ldr q0, [%0], #16 # unused src input was %2\n\t"
"rev32.16b v0, v0 \n\t"
"str q0, [%1], #16 # unused dst input was %3\n"
: "=r"(src), "=r"(dst)
: "r"(src), "r"(dst)
: "d0", "d1" // ,"memory"
// clobbers don't include v0?
);
#else
asm volatile("vld1.32 {d0, d1}, [%0]! # unused src input was %2\n\t"
"vrev32.8 q0, q0 \n\t"
"vst1.32 {d0, d1}, [%1]! # unused dst input was %3\n"
: "=r"(src), "=r"(dst)
: "r"(src), "r"(dst)
: "d0", "d1" // ,"memory"
);
#endif
#else
#error "No NEON/AdvSIMD"
#endif
uint32_t final_dst0val = orig_dst[0];
// gcc assumes the asm doesn't change orig_dst[0], so it only does one load (after the asm)
// and uses it for final and initial
// uncomment the memory clobber, or use a dummy output operand, to avoid this.
// pointer + initial+final compiles to LSL 3 to multiply by 8 = 2 * sizeof(uint32_t)
// using orig_dst after the inline asm makes the compiler choose different registers for the
// "=r"(dst) output operand and the "r"(dst) input operand, since the asm constraints
// advertise this non-destructive capability.
return orig_dst + final_dst0val + initial_dst0val;
}
This compiles to (AArch64 gcc4.8 -O3):
ldr q0, [x1], #16 # unused src input was x1 // src, src
rev32.16b v0, v0
str q0, [x2], #16 # unused dst input was x0 // dst, dst
ldr w1, [x0] // D.2576, *dst_1(D)
add x0, x0, x1, lsl 3 //, dst, D.2576,
ret
The store uses x2 (an uninitialized register, since this function only takes 2 args). The "=r"(dst) output (%1) picked x2, while the "r"(dst) input (%3 which is used only in a comment) picked x0.
final_dst0val + initial_dst0val compiles to 2x final_dst0val, because we lied to the compiler and told it that memory wasn't modified. So instead of reading the same memory before and after the inline asm statement, it just reads after and left-shifts by one extra position when adding to the pointer. (The return value exists only to use the values so they're not optimized away).
We can fix both problems by correcting the constraints: using "+r" for the pointers and adding a "memory" clobber. (A dummy output would also work, and might hurt optimization less.) I didn't bother since this appears to have no advantage over the memory-operand version above.
With those changes, we get
safe_register_pointer_asm:
ldr w3, [x0] //, *dst_1(D)
mov x2, x0 // dst, dst ### These 2 insns are new
ldr q0, [x1], #16 // src
rev32.16b v0, v0
str q0, [x2], #16 // dst
ldr w1, [x0] // D.2597, *dst_1(D)
add x3, x1, x3, uxtw // D.2597, D.2597, initial_dst0val ## And this is new, to add the before and after loads
add x0, x0, x3, lsl 2 //, dst, D.2597,
ret

As stated in the edits to the original question, it turned out that I needed a different assembly implementation for arm64 and armv7.
#ifdef __ARM_NEON
#if __LP64__
asm volatile("ldr q0, [%0], #16 \n"
"rev32.16b v0, v0 \n"
"str q0, [%1], #16 \n"
: "=r"(src), "=r"(dst)
: "r"(src), "r"(dst)
: "d0", "d1"
);
#else
asm volatile("vld1.32 {d0, d1}, [%0]! \n"
"vrev32.8 q0, q0 \n"
"vst1.32 {d0, d1}, [%1]! \n"
: "=r"(src), "=r"(dst)
: "r"(src), "r"(dst)
: "d0", "d1"
);
#endif
#else
The intrinsics code that I posted in the original post generated surprisingly good assembly though, and also generated the arm64 version for me, so it may be a better idea to use intrinsics instead in the future.

Related

iOS ARM64 Syscalls

I am learning more about shellcode and making syscalls in arm64 on iOS devices. The device I am testing on is iPhone 6S.
I got the list of syscalls from this link (https://github.com/radare/radare2/blob/master/libr/include/sflib/darwin-arm-64/ios-syscalls.txt).
I learnt that x8 is used for putting the syscall number for arm64 from here (http://arm.ninja/2016/03/07/decoding-syscalls-in-arm64/).
I figured the various registers used to pass in parameters for arm64 should be the same as arm so I referred to this link (https://w3challs.com/syscalls/?arch=arm_strong), taken from https://azeria-labs.com/writing-arm-shellcode/.
I wrote inline assembly in Xcode and here are some snippets
//exit syscall
__asm__ volatile("mov x8, #1");
__asm__ volatile("mov x0, #0");
__asm__ volatile("svc 0x80");
However, the application does not terminate when I stepped over these codes.
char write_buffer[]="console_text";
int write_buffer_size = sizeof(write_buffer);
__asm__ volatile("mov x8,#4;" //arm64 uses x8 for syscall number
"mov x0,#1;" //1 for stdout file descriptor
"mov x1,%0;" //the buffer to display
"mov x2,%1;" //buffer size
"svc 0x80;"
:
:"r"(write_buffer),"r"(write_buffer_size)
:"x0","x1","x2","x8"
);
If this syscall works, it should print out some text in Xcode's console output screen. However, nothing gets printed.
There are many online articles for ARM assembly, some use svc 0x80 and some use svc 0 etc and so there can be a few variations. I tried various methods but I could not get the two code snippets to work.
Can someone provide some guidance?
EDIT:
This is what Xcode shows in its Assembly view when I wrote a C function syscall int return_value=syscall(1,0);
mov x1, sp
mov x30, #0
str x30, [x1]
orr w8, wzr, #0x1
stur x0, [x29, #-32] ; 8-byte Folded Spill
mov x0, x8
bl _syscall
I am not sure why this code was emitted.
The registers used for syscalls are completely arbitrary, and the resources you've picked are certainly wrong for XNU.
As far as I'm aware, the XNU syscall ABI for arm64 is entirely private and subject to change without notice so there's no published standard that it follows, but you can scrape together how it works by getting a copy of the XNU source (as tarballs, or viewing it online if you prefer that), grep for the handle_svc function, and just following the code.
I'm not gonna go into detail on where exactly you find which bits, but the end result is:
The immediate passed to svc is ignored, but the standard library uses svc 0x80.
x16 holds the syscall number
x0 through x8 hold up to 9 arguments*
There are no arguments on the stack
x0 and x1 hold up to 2 return values (e.g. in the case of fork)
The carry bit is used to report an error, in which case x0 holds the error code
* This is used only in the case of an indirect syscall (x16 = 0) with 8 arguments.
* Comments in the XNU source also mention x9, but it seems the engineer who wrote that should brush up on off-by-one errors.
And then it comes to the actual syscall numbers available:
The canonical source for "UNIX syscalls" is the file bsd/kern/syscalls.master in the XNU source tree. Those take syscall numbers from 0 up to about 540 in the latest iOS 13 beta.
The canonical source for "Mach syscalls" is the file osfmk/kern/syscall_sw.c in the XNU source tree. Those syscalls are invoked with negative numbers between -10 and -100 (e.g. -28 would be task_self_trap).
Unrelated to the last point, two syscalls mach_absolute_time and mach_continuous_time can be invoked with syscall numbers -3 and -4 respectively.
A few low-level operations are available through platform_syscall with the syscall number 0x80000000.
This should get you going. As #Siguza mentioned you must use x16 , not x8 for the syscall number.
#import <sys/syscall.h>
char testStringGlobal[] = "helloWorld from global variable\n";
int main(int argc, char * argv[]) {
char testStringOnStack[] = "helloWorld from stack variable\n";
#if TARGET_CPU_ARM64
//VARIANT 1 suggested by #PeterCordes
//an an input it's a file descriptor set to STD_OUT 1 so the syscall write output appears in Xcode debug output
//as an output this will be used for returning syscall return value;
register long x0 asm("x0") = 1;
//as an input string to write
//as an output this will be used for returning syscall return value higher half (in this particular case 0)
register char *x1 asm("x1") = testStringOnStack;
//string length
register long x2 asm("x2") = strlen(testStringOnStack);
//syscall write is 4
register long x16 asm("x16") = SYS_write; //syscall write definition - see my footnote below
//full variant using stack local variables for register x0,x1,x2,x16 input
//syscall result collected in x0 & x1 using "semi" intrinsic assembler
asm volatile(//all args prepared, make the syscall
"svc #0x80"
:"=r"(x0),"=r"(x1) //mark x0 & x1 as syscall outputs
:"r"(x0), "r"(x1), "r"(x2), "r"(x16): //mark the inputs
//inform the compiler we read the memory
"memory",
//inform the compiler we clobber carry flag (during the syscall itself)
"cc");
//VARIANT 2
//syscall write for globals variable using "semi" intrinsic assembler
//args hardcoded
//output of syscall is ignored
asm volatile(//prepare x1 with the help of x8 register
"mov x1, %0 \t\n"
//set file descriptor to STD_OUT 1 so it appears in Xcode debug output
"mov x0, #1 \t\n"
//hardcoded length
"mov x2, #32 \t\n"
//syscall write is 4
"mov x16, #0x4 \t\n"
//all args prepared, make the syscall
"svc #0x80"
::"r"(testStringGlobal):
//clobbered registers list
"x1","x0","x2","x16",
//inform the compiler we read the memory
"memory",
//inform the compiler we clobber carry flag (during the syscall itself)
"cc");
//VARIANT 3 - only applicable to global variables using "page" address
//which is PC-relative addressing to load addresses at a fixed offset from the current location (PIC code).
//syscall write for global variable using "semi" intrinsic assembler
asm volatile(//set x1 on proper PAGE
"adrp x1,_testStringGlobal#PAGE \t\n" //notice the underscore preceding variable name by convention
//add the offset of the testStringGlobal variable
"add x1,x1,_testStringGlobal#PAGEOFF \t\n"
//set file descriptor to STD_OUT 1 so it appears in Xcode debug output
"mov x0, #1 \t\n"
//hardcoded length
"mov x2, #32 \t\n"
//syscall write is 4
"mov x16, #0x4 \t\n"
//all args prepared, make the syscall
"svc #0x80"
:::
//clobbered registers list
"x1","x0","x2","x16",
//inform the compiler we read the memory
"memory",
//inform the compiler we clobber carry flag (during the syscall itself)
"cc");
#endif
#autoreleasepool {
return UIApplicationMain(argc, argv, nil, NSStringFromClass([AppDelegate class]));
}
}
EDIT
To #PeterCordes excellent comment, yes there is a syscall numbers definition header <sys/syscall.h> which I included in the above snippet^ in Variant 1. But it's important to mention inside it's defined by Apple like this:
#ifdef __APPLE_API_PRIVATE
#define SYS_syscall 0
#define SYS_exit 1
#define SYS_fork 2
#define SYS_read 3
#define SYS_write 4
I haven't heard of a case yet of an iOS app AppStore rejection due to using a system call directly through svc 0x80 nonetheless it's definitely not public API.
As for the suggested "=#ccc" by #PeterCordes i.e. carry flag (set by syscall upon error) as an output constraint that's not supported as of latest XCode11 beta / LLVM 8.0.0 even for x86 and definitely not for ARM.

Why is code relocation done in U-boot proper?

I am trying to understand the boot process of BeagleBone Black by browsing through the source code. Assume that Iam keeping MLO and u-boot.img files in micro-SD card, and making BeagleBone to boot from SD card.
To my understanding, ROM code executes first, and it loads an MLO file from MMC into internal SRAM of the SOC. MLO file contains the code for x-loader, a Second Stage Program Loader(SPL). The SPL then sets up the DRAM and copies Third stage Program Loader(U-boot proper) into the DRAM. U-boot proper directly starts its execution from DRAM.
The architecture dependent portion of U-boot proper is located at arch/arm/ directory of the U-boot sources.
The code pertaining to SPL is located in spl/ directory.(while executing make mrproper followed by make SPL, *.o files are getting created only in spl/ directory)
For U-boot proper, iam guessing this is the execution flow -
arch/arm/cpu/armv7/start.S is located at the reset vector(so it runs first), and after some initialization it calls '_main' procedure located at arch/arm/lib/crt0.S .
In crt0.S, the board_init_f() is called, which sets up the DRAM (and something else), and then returns back to where it left off(in main_). It later calls the function relocate_code which relocates it again to DRAM.
ENTRY(_main)
/*
* Set up initial C runtime environment and call board_init_f(0).
*/
#if defined(CONFIG_SPL_BUILD) && defined(CONFIG_SPL_STACK)
ldr r0, =(CONFIG_SPL_STACK)
#else
ldr r0, =(CONFIG_SYS_INIT_SP_ADDR)
#endif
bic r0, r0, #7 /* 8-byte alignment for ABI compliance */
mov sp, r0
bl board_init_f_alloc_reserve
mov sp, r0
/* set up gd here, outside any C code */
mov r9, r0
bl board_init_f_init_reserve
mov r0, #0
bl board_init_f
#if ! defined(CONFIG_SPL_BUILD)
/*
* Set up intermediate environment (new sp and gd) and call
* relocate_code(addr_moni). Trick here is that we'll return
* 'here' but relocated.
*/
ldr r0, [r9, #GD_START_ADDR_SP] /* sp = gd->start_addr_sp */
bic r0, r0, #7 /* 8-byte alignment for ABI compliance */
mov sp, r0
ldr r9, [r9, #GD_BD] /* r9 = gd->bd */
sub r9, r9, #GD_SIZE /* new GD is below bd */
adr lr, here
ldr r0, [r9, #GD_RELOC_OFF] /* r0 = gd->reloc_off */
add lr, lr, r0
#if defined(CONFIG_CPU_V7M)
orr lr, #1 /* As required by Thumb-only */
#endif
ldr r0, [r9, #GD_RELOCADDR] /* r0 = gd->relocaddr */
b relocate_code
here:
Why U-boot proper needs to setup DRAM again and relocate itself again if this was already done by SPL? Am i missing something here?
Why U-boot proper needs to setup DRAM again and relocate itself again if this was already done by SPL?
The relocation to the top of memory is to maximize contiguous free space.
You could build/link for and load U-Boot at this high-memory address (to save on the copy). But whenever you rebuild U-Boot with added features and the image gets larger, then that load address has to be recalculated.
Which also means that the SPL has to be changed.
BTW "setup DRAM" would typically be thought of as initializing a DRAM controller, something that is not done at this stage.
Besides relocating its code, U-Boot also has to move the stack and heap, aka the C runtime environment.
Am i missing something here?
It's more convenient to load U-Boot in the middle of main memory, and then let U-Boot relocate itself to high memory.

LLVM ARM64 assembly: Getting a symbol/label address?

I am porting some 32-bit ARM code to 64-bit and am having trouble determining the 64-bit version of this instruction:
ldr r1, =_fns
Where _fns is a symbol defined in some C source file elsewhere in the project.
I've tried the below but both give errors:
adr x1, _fns <== "error: unknown AArch64 fixup kind!"
adrl x1, _fns <== "error: unrecognized instruction mnemonic"
The assembler is LLVM in the iOS SDK (XCode 7.1).
I've noticed that if _fns is locally defined (i.e. in the same .S file) then "adr x1,_fns" works fine. However that's not a fix as _fns has to be in C code (i.e. in a different translation unit).
What is the right way to do this with LLVM ARM assembler?
If I feed
extern char ar[];
char *f()
{
return ar;
}
into the ELLCC (clang based) demo, I get:
Output from the compiler targeting ARM AArch64
.text
.file "/tmp/webcompile/_3793_0.c"
.globl f
.align 2
.type f,#function
f: // #f
// BB#0: // %entry
adrp x0, ar
add x0, x0, :lo12:ar
ret
.Lfunc_end0:
.size f, .Lfunc_end0-f
.ident "ecc 0.1.13 based on clang version 3.7.0 (trunk) (based on LLVM 3.7.0svn)"
.section ".note.GNU-stack","",#progbits
The adrp instruction gets the "page" address of ar into x0. The symbol argument to adrp translates into a 21 bit PC relative offset to the 4K page in which the symbol resides. That offset is added to the PC to get the actual start of the page. The add instruction adds the low 12 bits of the symbol address to get the actual symbol address.
This instruction sequence allows the address of a symbol within +/-4GB of the PC to be loaded.
As far as I can tell, there doesn't seem to be a way of getting functionality similar to 32 bit ARM's "=ar" in C. I assembly language, it looks like this will work:
.text
.file "atest.s"
.globl f
.align 2
f:
ldr x0, p
ret
.align 3
p:
.xword _fns
This is very similar to what the 32 bit ARM does under the hood.
The only reason I started out with the C version was to show how I usually attack a problem like this, especially if I'm not that familiar with the target assembly language.
this work well for me, xcode7.1 LLVM7.0 IOS9.1
In 32bit arm
ldr r9,=JumpTab
Change to 64bit arm
adrp x9,JumpTab#PAGE
add x9,x9,JumpTab#PAGEOFF
By the way,you need care registers' number, some registers have specific useful in arm64

Clarifications on iOS Assembly Language

I'm investigating a little bit on how Objective-C language is mapped into Assembl. I've started from a tutorial found at iOS Assembly Tutorial.
The code snippet under analysis is the following.
void fooFunction() {
int add = addFunction(12, 34);
printf("add = %i", add);
}
It is translated into
_fooFunction:
# 1:
push {r7, lr}
# 2:
movs r0, #12
movs r1, #34
# 3:
mov r7, sp
# 4:
bl _addFunction
# 5:
mov r1, r0
# 6:
movw r0, :lower16:(L_.str-(LPC1_0+4))
movt r0, :upper16:(L_.str-(LPC1_0+4))
LPC1_0:
add r0, pc
# 7:
blx _printf
# 8:
pop {r7, pc}
About the assembly code, I cannot understand the following two points
-> Comment #1
The author says that push decrements the stack by 8 byte since r7 and lr are of 4byte each. Ok. But he also says that the two values are stored with the one instruction. What does it mean?
-> Comment #6
movw r0, :lower16:(L_.str-(LPC1_0+4))
movt r0, :upper16:(L_.str-(LPC1_0+4))
The author says the that r0 will hold the address of the "add = %i" (that can be find in the data segment) but I don't really get how the memory layout looks like. Why does he represent the difference L_.str-(LPC1_0+4) with the dotted black line and not with red one (drawn by me).
Any clarifications will be appreciated.
Edit
I'm missing the concept of pushing r7 onto the stack. What does mean to push that value and what does it contain?
But he also says that the two values are stored with the one
instruction. What does it mean?
That the single push instruction will put both values onto the stack.
Why does he represent the difference L_.str-(LPC1_0+4)
Because the add r0, pc implicitly adds 4 bytes more. To quote the instruction set reference:
Add an immediate constant to the value from sp or pc, and place the result into a low register.
Syntax: ADD Rd, Rp, #expr
where:
Rd is the destination register. Rd mustbe in the range r0-r7.
Rp is either sp or pc.
expr is an expression that evaluates (at assembly time) to a multiple of 4 in the range 0-1020.
If Rp is the pc, the value used is: (the address of the current instruction + 4) AND &FFFFFFFC.
For comment 1:
The two values pushed to the stack are the values store in r7 and lr.
Two 4 byte values equals 8 bytes.
For comment 6:
The label LPC1_0 is followed by the instruction
add r0, pc
which adds another 4 bytes to the difference between the two addresses.

ARM bare-metal with MMU: successive reads yield different values

Context (probably not needed):
As a learning exercise, I'm trying to implement a mini "OS" for the Raspberry Pi.
I'm currently implementing a very dumb memory management system. I already have the MMU enabled, and I'm in the process of getting a usable kmalloc.
It can already allocate chunks of memory from a pre-existing little kernel heap, mapped after the code and data segments. I'm trying to get it to grow as needed by mapping more pages. It must also be able to produce physically contiguous chunks.
The code is hosted at Github, there's a branch dedicated to this question with debug code. Note that it's not an example of well-organized, well-commented nor very clever code. :)
Actual question:
While trying to debug a data abort, I found something very strange.
This is a piece of code from my kmalloc:
next->prev_size = chunk->size;
next->size = -1;
term_printf(term, "A chunk->next_free = 0x%x\n", chunk->next_free);
term_printf(term, "B chunk->next_free = 0x%x\n", chunk->next_free);
*prev_list = next;
next->next_free = chunk->next_free;
term_printf(term, "next_free = 0x%x, chunk 0x%x\n", next->next_free, chunk->next_free);
term_printf(term, "next_free = 0x%x, chunk 0x%x\n", next->next_free, chunk->next_free);
I run it three times. Here are the results:
# 1st
A chunk->next_free = 0x0
B chunk->next_free = 0x0
next_free = 0x0, chunk 0x0
next_free = 0x0, chunk 0x0
# 2nd
A chunk->next_free = 0xffffffff
B chunk->next_free = 0x0
next_free = 0x0, chunk 0xffffffff
next_free = 0x0, chunk 0x0
# 3rd
A chunk->next_free = 0xffffffff
B chunk->next_free = 0xffffffff
next_free = 0xffffffff, chunk 0xffffffff
next_free = 0xffffffff, chunk 0xffffffff
The first and third iterations look normal (though next_free is supposed to have the value 0, the data abort occurs because it's got 0xffffffff). But what is my code doing during the second? O_o What kind of black sorcery can make my printf output two different values for chunk->next_free when read four times in a row? O_o
The data is well aligned, the pages are cacheable and non-bufferable (making them non-cacheable doesn't help), and I get the same result whether compiler optimizations are turned on or off. I tried throwing a data memory barrier in there but indeed it does nothing. I also checked the assembly produced, it looks ok.
I thought it could be caused by corrupted TLBs. I'm issuing "Invalidate unified single entry" (mcr p15, 0, %[addr], c8, c7, 1) after each new page mapping. Is it enough?
I tried debugging with qemu but it gets a data abort earlier when setting a bitmap of used physical pages, though this part works fine on the Pi.
I'm just looking for clues about what can cause this behavior. If you need more context please ask, though my code is for the moment rapidly changing and messy with lots of printf.
ETA:
The disassembly with -O0 for the first two printf:
c00025e4: e51b3018 ldr r3, [fp, #-24]
c00025e8: e5933008 ldr r3, [r3, #8]
c00025ec: e59b0004 ldr r0, [fp, #4]
c00025f0: e59f10a0 ldr r1, [pc, #160] ; c0002698 <kmalloc_wilderness+0x2c0>
c00025f4: e1a02003 mov r2, r3
c00025f8: eb000238 bl c0002ee0 <term_printf>
c00025fc: e51b3018 ldr r3, [fp, #-24]
c0002600: e5933008 ldr r3, [r3, #8]
c0002604: e59b0004 ldr r0, [fp, #4]
c0002608: e59f1088 ldr r1, [pc, #140] ; c000269c <kmalloc_wilderness+0x2c4>
c000260c: e1a02003 mov r2, r3
c0002610: eb000232 bl c0002ee0 <term_printf>
So it puts the address of chunk into r3, then perform a ldr to get next_free. It does it again before the second prinf. There's only one core, the DMA is not running, so there's nothing changing the value in memory between the calls.
With -O2:
c0001c38: e1a00006 mov r0, r6
c0001c3c: e59f10d8 ldr r1, [pc, #216] ; c0001d1c <kmalloc_wilderness+0x1b8>
c0001c40: e5942008 ldr r2, [r4, #8]
c0001c44: eb000278 bl c000262c <term_printf>
c0001c48: e1a00006 mov r0, r6
c0001c4c: e59f10cc ldr r1, [pc, #204] ; c0001d20 <kmalloc_wilderness+0x1bc>
c0001c50: e5942008 ldr r2, [r4, #8]
c0001c54: eb000274 bl c000262c <term_printf>
So it still fetches the value with ldr. That's why I get the same thing with both optimization levels.
New edit: I added more printfs, and it seems that the singularity happens at this point:
next->size = -1;
After this line, chunk->next_free turns into Heisenberg's cat. Before, it reads as 0.
The structure is defined as such:
struct kheap_chunk {
size_t prev_size;
size_t size; // -1 for wilderness chunk, bit 0 high if free
struct kheap_chunk *next_free;
};
chunk and next don't overlap.
If I move the "singularity line" below the next->next_free = chunk->next_free, it stops alternating between two values, but it's still weird: chunk->next_free is 0 before *prev_list = next, 0xffffffff after that. But next->next_free is still set to 0.

Resources