ARM bare-metal with MMU: successive reads yield different values - memory

Context (probably not needed):
As a learning exercise, I'm trying to implement a mini "OS" for the Raspberry Pi.
I'm currently implementing a very dumb memory management system. I already have the MMU enabled, and I'm in the process of getting a usable kmalloc.
It can already allocate chunks of memory from a pre-existing little kernel heap, mapped after the code and data segments. I'm trying to get it to grow as needed by mapping more pages. It must also be able to produce physically contiguous chunks.
The code is hosted at Github, there's a branch dedicated to this question with debug code. Note that it's not an example of well-organized, well-commented nor very clever code. :)
Actual question:
While trying to debug a data abort, I found something very strange.
This is a piece of code from my kmalloc:
next->prev_size = chunk->size;
next->size = -1;
term_printf(term, "A chunk->next_free = 0x%x\n", chunk->next_free);
term_printf(term, "B chunk->next_free = 0x%x\n", chunk->next_free);
*prev_list = next;
next->next_free = chunk->next_free;
term_printf(term, "next_free = 0x%x, chunk 0x%x\n", next->next_free, chunk->next_free);
term_printf(term, "next_free = 0x%x, chunk 0x%x\n", next->next_free, chunk->next_free);
I run it three times. Here are the results:
# 1st
A chunk->next_free = 0x0
B chunk->next_free = 0x0
next_free = 0x0, chunk 0x0
next_free = 0x0, chunk 0x0
# 2nd
A chunk->next_free = 0xffffffff
B chunk->next_free = 0x0
next_free = 0x0, chunk 0xffffffff
next_free = 0x0, chunk 0x0
# 3rd
A chunk->next_free = 0xffffffff
B chunk->next_free = 0xffffffff
next_free = 0xffffffff, chunk 0xffffffff
next_free = 0xffffffff, chunk 0xffffffff
The first and third iterations look normal (though next_free is supposed to have the value 0, the data abort occurs because it's got 0xffffffff). But what is my code doing during the second? O_o What kind of black sorcery can make my printf output two different values for chunk->next_free when read four times in a row? O_o
The data is well aligned, the pages are cacheable and non-bufferable (making them non-cacheable doesn't help), and I get the same result whether compiler optimizations are turned on or off. I tried throwing a data memory barrier in there but indeed it does nothing. I also checked the assembly produced, it looks ok.
I thought it could be caused by corrupted TLBs. I'm issuing "Invalidate unified single entry" (mcr p15, 0, %[addr], c8, c7, 1) after each new page mapping. Is it enough?
I tried debugging with qemu but it gets a data abort earlier when setting a bitmap of used physical pages, though this part works fine on the Pi.
I'm just looking for clues about what can cause this behavior. If you need more context please ask, though my code is for the moment rapidly changing and messy with lots of printf.
ETA:
The disassembly with -O0 for the first two printf:
c00025e4: e51b3018 ldr r3, [fp, #-24]
c00025e8: e5933008 ldr r3, [r3, #8]
c00025ec: e59b0004 ldr r0, [fp, #4]
c00025f0: e59f10a0 ldr r1, [pc, #160] ; c0002698 <kmalloc_wilderness+0x2c0>
c00025f4: e1a02003 mov r2, r3
c00025f8: eb000238 bl c0002ee0 <term_printf>
c00025fc: e51b3018 ldr r3, [fp, #-24]
c0002600: e5933008 ldr r3, [r3, #8]
c0002604: e59b0004 ldr r0, [fp, #4]
c0002608: e59f1088 ldr r1, [pc, #140] ; c000269c <kmalloc_wilderness+0x2c4>
c000260c: e1a02003 mov r2, r3
c0002610: eb000232 bl c0002ee0 <term_printf>
So it puts the address of chunk into r3, then perform a ldr to get next_free. It does it again before the second prinf. There's only one core, the DMA is not running, so there's nothing changing the value in memory between the calls.
With -O2:
c0001c38: e1a00006 mov r0, r6
c0001c3c: e59f10d8 ldr r1, [pc, #216] ; c0001d1c <kmalloc_wilderness+0x1b8>
c0001c40: e5942008 ldr r2, [r4, #8]
c0001c44: eb000278 bl c000262c <term_printf>
c0001c48: e1a00006 mov r0, r6
c0001c4c: e59f10cc ldr r1, [pc, #204] ; c0001d20 <kmalloc_wilderness+0x1bc>
c0001c50: e5942008 ldr r2, [r4, #8]
c0001c54: eb000274 bl c000262c <term_printf>
So it still fetches the value with ldr. That's why I get the same thing with both optimization levels.
New edit: I added more printfs, and it seems that the singularity happens at this point:
next->size = -1;
After this line, chunk->next_free turns into Heisenberg's cat. Before, it reads as 0.
The structure is defined as such:
struct kheap_chunk {
size_t prev_size;
size_t size; // -1 for wilderness chunk, bit 0 high if free
struct kheap_chunk *next_free;
};
chunk and next don't overlap.
If I move the "singularity line" below the next->next_free = chunk->next_free, it stops alternating between two values, but it's still weird: chunk->next_free is 0 before *prev_list = next, 0xffffffff after that. But next->next_free is still set to 0.

Related

Stack overflow: Thread 1: EXC_BAD_ACCESS (code=2, address=0x16d09aa00)

Crash description
Recently I'm facing kinda really strange memory issues in one of my iOS/Swift projects. I'm really not sure what's going on and feel it's also not quite easy to describe, but I'll try my best anyway.
It basically behaves like follows:
On a certain code base, the crash always occurs in the same place (100% reproducible)
Changes to the code base, may resolve the issue, but it may also just pop up somewhere else
Crashes only occur on real devices, never inside simulators
Currently the app crashes with following error (results from 3 different runs):
Thread 1: EXC_BAD_ACCESS (code=2, address=0x16d09aa00)
Thread 1: EXC_BAD_ACCESS (code=2, address=0x16af46a00)
Thread 1: EXC_BAD_ACCESS (code=2, address=0x16d526a00)
Reasoning about memory addresses
WWDC session
I found an interesting session (Understanding Crashes and Crash Logs) from WWDC 2018, where one of the guys points out that it's sometimes possible to derive more information from the specific memory addresses, the crashes occur.
Unfortunately the addresses it crashes in my app are somewhat completely different, but maybe we can get clues from them anyway? At least it's interesting, that they're all quite similar, or isn't it?
Changes due to Diagnostic options enabled
Further investigation shows that the first 2 bytes (16) stay always the same, followed by 4 random bytes followd by 3 bytes (a00). When activating diagnositcs (e.g. ASan or Scribble), the last 3 bytes change (e.g. 3a0 or 9e0). But maybe this is only a kind of shift due to more "debug stuff" being added? I'm really not that "memory guy", but just want to provide anything I noticed.
Trying "Diagnostic options"
I tried different Diagnostic options (from schemes), but none of them really changed the crash in any way, or provided any more information.
1. Scribble
Crashes do not reference 0xAA or 0x55, so it's nothing to be catched using Scribble? (Xcode - scribble, guard edges and guard malloc)
2. Malloc Guard Edges
Didn't notice any difference using this either.
3. Zombies
Using this guide.
malloc_info --type 0x16b15e9c0
error: error: Trying to put the stack in unreadable memory at: 0x16b15e920.
4. ASan
Using ASan just puts following entry on top of the stack trace. Unfortunately I didn't find anything helpful related to that.
#0 0x0000000109efbf60 in __asan_alloca_poison ()
5. TSan
Not available on real devices (crashes only occur there)
Recursion / BOF?
Could it be a recursion that is too long, or another kind of stack/heap buffer overflow?
But it seems like the stack size on real devices as well as simulators is exactly the same with 524288 bytes (from Thread.main.stackSize).
So, as it doesn't crash in simulators, it's not a BOF? Or is the architecture difference too big, to make such conclusions here?
Disassembling
I also tried "disassembling".
disassemble -a 0x16d09aa00
error: Could not find function bounds for address 0x16d09aa00
Or disassemble -frame
But my assembler skills are really lacking behind, so currently there is nothing to get for me from that information.
Need help
As you can see I'm really running out of ideas. Either the crashes are really totally weird, or I just do not have enough knowledge/skills to use above tools, to get me any closer to the cause of those issues.
Either way... Any help, hints, ideas or whatever could point me in the right direction is highly appreciated!
Thanks in advance, guys.
Update May 19, 2020
I totally forgot to mention, that we're using ReSwift heavily in our app, and the crashes seem to be related to how we use the Middlewares there, I guess.
I'm also already in contact with the devs there: github.com/ReSwift/ReSwift/issues/271.
Here's finally some code. Unfortunately I can't share all the apps code (which may be necessary!?) and also don't want to overload you with way to much code.
Current issue
Thread 1: EXC_BAD_ACCESS (code=1, address=0x16ed82da0)
UserAccountMiddleware.swift
Note: Using those DispatchQueue.main.async actually makes the crashes go away. They indeed break the current cycle, so maybe there's some kind of recursion or timing issue happening?
func userAccountMiddleware() -> Middleware<AppState> {
return { dispatch, getState in
return { next in
return { action in
switch action {
case _ as ReSwiftInit:
// DispatchQueue.main.async {
dispatch(UserAccountSetAuthToken(authToken: Defaults.customerAuthToken))
dispatch(UserAccountSetAvatar(index: Defaults.avatarIndex))
// }
if let data = Defaults.customer,
let customer = try? JSONDecoder().decode(Customer.self, from: data) {
// DispatchQueue.main.async {
dispatch(UserAccountSetCustomerLoggedIn(customer: customer))
// }
}
// [...]
default:
break
}
next(action)
}
}
}
}
ReSwift Store.swift
// [...]
open func _defaultDispatch(action: Action) {
guard !isDispatching else {
raiseFatalError(
"ReSwift:ConcurrentMutationError- Action has been dispatched while" +
" a previous action is action is being processed. A reducer" +
" is dispatching an action, or ReSwift is used in a concurrent context" +
" (e.g. from multiple threads)."
)
}
isDispatching = true
let newState = reducer(action, state) // Thread 1: EXC_BAD_ACCESS (code=1, address=0x16ed82da0)
isDispatching = false
state = newState
}
// [...]
Xcode console:
(lldb) po state
error: warning: couldn't get required object pointer (substituting NULL): Couldn't load 'self' because its value couldn't be evaluated
error: Trying to put the stack in unreadable memory at: 0x16d95ad00.
Assembler (very last step of crash):
myapp`type metadata accessor for GlobalState:
0x101f6ac10 <+0>: sub sp, sp, #0x30 ; =0x30
-> 0x101f6ac14 <+4>: stp x29, x30, [sp, #0x20] // Thread 1: EXC_BAD_ACCESS (code=1, address=0x16ed82da0)
0x101f6ac18 <+8>: adrp x8, 3620
0x101f6ac1c <+12>: add x8, x8, #0x148 ; =0x148
0x101f6ac20 <+16>: ldr x8, [x8]
0x101f6ac24 <+20>: mov x9, #0x0
0x101f6ac28 <+24>: mov x1, x8
0x101f6ac2c <+28>: str x0, [sp, #0x18]
0x101f6ac30 <+32>: str x1, [sp, #0x10]
0x101f6ac34 <+36>: str x9, [sp, #0x8]
0x101f6ac38 <+40>: cbnz x8, 0x101f6ac54 ; <+68> at <compiler-generated>
0x101f6ac3c <+44>: adrp x1, 2122
0x101f6ac40 <+48>: add x1, x1, #0x1dc ; =0x1dc
0x101f6ac44 <+52>: ldr x0, [sp, #0x18]
0x101f6ac48 <+56>: bl 0x102775358 ; symbol stub for: swift_getSingletonMetadata
0x101f6ac4c <+60>: str x0, [sp, #0x10]
0x101f6ac50 <+64>: str x1, [sp, #0x8]
0x101f6ac54 <+68>: ldr x0, [sp, #0x8]
0x101f6ac58 <+72>: ldr x1, [sp, #0x10]
0x101f6ac5c <+76>: str x0, [sp]
0x101f6ac60 <+80>: mov x0, x1
0x101f6ac64 <+84>: ldr x1, [sp]
0x101f6ac68 <+88>: ldp x29, x30, [sp, #0x20]
0x101f6ac6c <+92>: add sp, sp, #0x30 ; =0x30
0x101f6ac70 <+96>: ret
TL;DR
Just move huge structs to the heap, by wrapping them inside arrays. Using #propertyWrappers, this can be an at least partly elegant solution.
#propertyWrapper
struct StoredOnHeap<T> {
private var value: [T]
init(wrappedValue: T) {
self.value = [wrappedValue]
}
var wrappedValue: T {
get {
return self.value[0]
}
set {
self.value[0] = newValue
}
}
}
// Usage:
#StoredOnHeap var hugeStruct: HugeStruct
https://gist.github.com/d4rkd3v1l/ab582a7cafd3a8b8c164c8541a3eef96
Long version
I'm almost 100% certain now, that this is a stack overflow, as I (finally) managed to reproduce this in a little demo project: https://github.com/d4rkd3v1l/ReSwift-StackOverflowDemo
Now I will just provide some more details and solutions for anyone else may running into this or similar issues.
The stack size on iOS (as of iOS 13) is 512kb and should apply to both, devices and simulators. Why did I say "should"? Because it almost certainly is somewhat different on simulators, as I did not see those crashes there. So maybe Thread.main.stackSize just tells 512kb but is in fact larger? IDK 🤷‍♂️
Here are some indicators, you may face the same issue:
You get EXC_BAD_ACCESS crashes with code 1 or 2**. And the crashes occur in high memory addresses, or at least completely out of where the rest of your app/stack normally "lives". Something like 0x16d95ad00 in my case.
Reducing the stuff you put on the stack (value types, e.g. very very large structs) or breaking the call stack down into smaller pieces (e.g. dispatch async) to give the stack some "time to breathe" prevents this crash.
And here at the latter we're already in the middle of the solution for that issue. As the stack size cannot (and probably even should not) increased, you must reduce the load you put there, like described in the 2nd point.
At least that's the solution we will probably go for. 🤞
*This is true at least for the main thread, other threads may be different.
**I think code 0 is kinda null pointer exceptionand therefore doesn't apply here. Please correct me if I'm wrong about this.

Why is code relocation done in U-boot proper?

I am trying to understand the boot process of BeagleBone Black by browsing through the source code. Assume that Iam keeping MLO and u-boot.img files in micro-SD card, and making BeagleBone to boot from SD card.
To my understanding, ROM code executes first, and it loads an MLO file from MMC into internal SRAM of the SOC. MLO file contains the code for x-loader, a Second Stage Program Loader(SPL). The SPL then sets up the DRAM and copies Third stage Program Loader(U-boot proper) into the DRAM. U-boot proper directly starts its execution from DRAM.
The architecture dependent portion of U-boot proper is located at arch/arm/ directory of the U-boot sources.
The code pertaining to SPL is located in spl/ directory.(while executing make mrproper followed by make SPL, *.o files are getting created only in spl/ directory)
For U-boot proper, iam guessing this is the execution flow -
arch/arm/cpu/armv7/start.S is located at the reset vector(so it runs first), and after some initialization it calls '_main' procedure located at arch/arm/lib/crt0.S .
In crt0.S, the board_init_f() is called, which sets up the DRAM (and something else), and then returns back to where it left off(in main_). It later calls the function relocate_code which relocates it again to DRAM.
ENTRY(_main)
/*
* Set up initial C runtime environment and call board_init_f(0).
*/
#if defined(CONFIG_SPL_BUILD) && defined(CONFIG_SPL_STACK)
ldr r0, =(CONFIG_SPL_STACK)
#else
ldr r0, =(CONFIG_SYS_INIT_SP_ADDR)
#endif
bic r0, r0, #7 /* 8-byte alignment for ABI compliance */
mov sp, r0
bl board_init_f_alloc_reserve
mov sp, r0
/* set up gd here, outside any C code */
mov r9, r0
bl board_init_f_init_reserve
mov r0, #0
bl board_init_f
#if ! defined(CONFIG_SPL_BUILD)
/*
* Set up intermediate environment (new sp and gd) and call
* relocate_code(addr_moni). Trick here is that we'll return
* 'here' but relocated.
*/
ldr r0, [r9, #GD_START_ADDR_SP] /* sp = gd->start_addr_sp */
bic r0, r0, #7 /* 8-byte alignment for ABI compliance */
mov sp, r0
ldr r9, [r9, #GD_BD] /* r9 = gd->bd */
sub r9, r9, #GD_SIZE /* new GD is below bd */
adr lr, here
ldr r0, [r9, #GD_RELOC_OFF] /* r0 = gd->reloc_off */
add lr, lr, r0
#if defined(CONFIG_CPU_V7M)
orr lr, #1 /* As required by Thumb-only */
#endif
ldr r0, [r9, #GD_RELOCADDR] /* r0 = gd->relocaddr */
b relocate_code
here:
Why U-boot proper needs to setup DRAM again and relocate itself again if this was already done by SPL? Am i missing something here?
Why U-boot proper needs to setup DRAM again and relocate itself again if this was already done by SPL?
The relocation to the top of memory is to maximize contiguous free space.
You could build/link for and load U-Boot at this high-memory address (to save on the copy). But whenever you rebuild U-Boot with added features and the image gets larger, then that load address has to be recalculated.
Which also means that the SPL has to be changed.
BTW "setup DRAM" would typically be thought of as initializing a DRAM controller, something that is not done at this stage.
Besides relocating its code, U-Boot also has to move the stack and heap, aka the C runtime environment.
Am i missing something here?
It's more convenient to load U-Boot in the middle of main memory, and then let U-Boot relocate itself to high memory.

Clarifications on iOS Assembly Language

I'm investigating a little bit on how Objective-C language is mapped into Assembl. I've started from a tutorial found at iOS Assembly Tutorial.
The code snippet under analysis is the following.
void fooFunction() {
int add = addFunction(12, 34);
printf("add = %i", add);
}
It is translated into
_fooFunction:
# 1:
push {r7, lr}
# 2:
movs r0, #12
movs r1, #34
# 3:
mov r7, sp
# 4:
bl _addFunction
# 5:
mov r1, r0
# 6:
movw r0, :lower16:(L_.str-(LPC1_0+4))
movt r0, :upper16:(L_.str-(LPC1_0+4))
LPC1_0:
add r0, pc
# 7:
blx _printf
# 8:
pop {r7, pc}
About the assembly code, I cannot understand the following two points
-> Comment #1
The author says that push decrements the stack by 8 byte since r7 and lr are of 4byte each. Ok. But he also says that the two values are stored with the one instruction. What does it mean?
-> Comment #6
movw r0, :lower16:(L_.str-(LPC1_0+4))
movt r0, :upper16:(L_.str-(LPC1_0+4))
The author says the that r0 will hold the address of the "add = %i" (that can be find in the data segment) but I don't really get how the memory layout looks like. Why does he represent the difference L_.str-(LPC1_0+4) with the dotted black line and not with red one (drawn by me).
Any clarifications will be appreciated.
Edit
I'm missing the concept of pushing r7 onto the stack. What does mean to push that value and what does it contain?
But he also says that the two values are stored with the one
instruction. What does it mean?
That the single push instruction will put both values onto the stack.
Why does he represent the difference L_.str-(LPC1_0+4)
Because the add r0, pc implicitly adds 4 bytes more. To quote the instruction set reference:
Add an immediate constant to the value from sp or pc, and place the result into a low register.
Syntax: ADD Rd, Rp, #expr
where:
Rd is the destination register. Rd mustbe in the range r0-r7.
Rp is either sp or pc.
expr is an expression that evaluates (at assembly time) to a multiple of 4 in the range 0-1020.
If Rp is the pc, the value used is: (the address of the current instruction + 4) AND &FFFFFFFC.
For comment 1:
The two values pushed to the stack are the values store in r7 and lr.
Two 4 byte values equals 8 bytes.
For comment 6:
The label LPC1_0 is followed by the instruction
add r0, pc
which adds another 4 bytes to the difference between the two addresses.

Address of interrupt handler in interrupt vector is +1 of actual address

I cross compiled a program for cortex-m3. In start up code all interrupts are given in g_pfnVectors. When I disassemble, at address 0x0, I see "stack pointer value". Than at address 0x4 reset interrupt handler address is given. This keeps going with following system interrupt addresses.
Here is my question : Why in interrupt vector, addresses of interrupt handlers are +1 of the actual address. Address of ResetISR handler is 0x184 but in interrupt table it is 0x185. This is the case for all other interrupt handler addresses. What is the reason of this?
00000000 <g_pfnVectors>:
0: 10008000 andne r8, r0, r0
4: 00000185 andeq r0, r0, r5, lsl #3
8: 00000215 andeq r0, r0, r5, lsl r2
c: 0000021d andeq r0, r0, sp, lsl r2
00000184 <ResetISR>:
184: b580 push {r7, lr}
186: b084 sub sp, #16
188: af00 add r7, sp, #0
.......
00000214 <NMI_Handler>:
214: b480 push {r7}
216: af00 add r7, sp, #0
218: e7fe b.n 218 <NMI_Handler+0x4>
21a: bf00 nop
.......
0000021c <HardFault_Handler>:
21c: b480 push {r7}
21e: af00 add r7, sp, #0
220: e7fe b.n 220 <HardFault_Handler+0x4>
.......
I found the answer in ARMv7-M Architecture Reference Manual, section Vector Table Definition(B1.5.3):
The Vector table must be naturally aligned to a power of two whose
alignment value is greater than or equal to (Number of Exceptions
supported x 4), with a minimum alignment of 128 bytes. On power-on or
reset, the processor uses the entry at offset 0 as the initial value
for SP_main, see The SP registers on page B1-572. All other entries
must have bit[0] set to 1, because this bit defines the EPSR.T bit on
exception entry. See Reset behavior on page B1-586 and Exception
entry behavior on page B1-587 for more information. On exception
entry, if bit[0] of the associated vector table entry is set to 0,
execution of the first instruction causes an INVSTATE UsageFault, see
The special-purpose program status registers, xPSR on page B1-572 and
Fault behavior on page B1-608. If this happens on a reset, this
escalates to a HardFault, because UsageFault is disabled on reset, see
Priority escalation on page B1-585 for more information.

EXC_BAD_ACCESS in Assembly Code in iOS App

I'm trying to debug an EXC_BAD_ACCESS crash in an iOS App I am working on. Basically, my code calls the function new_dyna_start() which corresponds to the a certain assembly method. Here's the relevant assembly code:
.align 4
42430:
.long _translation_cache_iphone
.align 2
.globl _new_dyna_start
// .type new_dyna_start, %function
_new_dyna_start:
ldr r12, .dlptr
mov r0, #0xa4000000
stmia r12, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
sub fp, r12, #28
add r0, r0, #0x40
bl _new_recompile_block
ldr r0, [fp, #64]
ldr r10, [fp, #400+36] /* Count */
str r0, [fp, #72]
sub r10, r10, r0
ldr r0, 42430b
ldr pc, [r0]
From my (limited) understanding, at line 6 of the method, it calls the C function new_recompile_block(). This method works fine, and I know it finishes because at the end of the function I have
printf("End of loop");
which then appears in the debugger. After the method completes, I'm not entirely sure I understand what happens, but it seems that the assembly method obtains a reference to the C variable translation_cache_iphone. However, at the final line the app crashes oddly. This message appears in Xcode: http://imgur.com/dqKo0
However, if I click on the side to the last method called, I see it is this: http://imgur.com/M5h84
This seems to support my idea that it is the translation_cache_iphone variable causing the crash, as the memory address of the EXC_BAD_ACCESS (0x401000) is the same as translation_cache_iphone. translation_cache_iphone is declared as:
unsigned char* translation_cache_iphone = NULL;
and is initialized by:
translation_cache_iphone = (unsigned char *)(((unsigned long) translation_cache_static_iphone + (4096)) & ~(4095));
Am I right in assuming that this is the problem? Is the problem in the assembly code, or in the C code? I've tried modifying both, but to no avail. The assembly code above is the original.
Here is a link to the full source on Github. Simply compile and run on an iDevice with Xcode and you'll see the exact issues I'm facing. It may be easier to debug that way.
The last two instructions form an indirect jump to the translation_cache_iphone which is thus expected to be executable code. Verify that is the case and that memory permissions are appropriate - in many systems data pages are not executable by default.
This seems to support my idea that it is the translation_cache_iphone variable causing the crash
Yes, I believe that this variable is the problem.
In the assembly code you posted I can see one line that could cause an invalid access to the memory, and it is:
ldr r0, 42430b
ldr pc, [r0]
The first line loads the data from the label 42430 to the register r0. Then, the second line points PC (Program Counter) to the content of r0.
In the beginning of the assembly code you have declared what is the label 42430:
42430:
.long _translation_cache_iphone
Then, when it tries to access this value and execute is as code, it crashes.

Resources