Stack overflow: Thread 1: EXC_BAD_ACCESS (code=2, address=0x16d09aa00) - ios

Crash description
Recently I've been facing some really strange memory issues in one of my iOS/Swift projects. I'm not sure what's going on, and it's not easy to describe, but I'll try my best anyway.
It basically behaves as follows:
On a given code base, the crash always occurs in the same place (100% reproducible)
Changes to the code base may resolve the issue, but it may also just pop up somewhere else
Crashes only occur on real devices, never inside simulators
Currently the app crashes with the following error (results from 3 different runs):
Thread 1: EXC_BAD_ACCESS (code=2, address=0x16d09aa00)
Thread 1: EXC_BAD_ACCESS (code=2, address=0x16af46a00)
Thread 1: EXC_BAD_ACCESS (code=2, address=0x16d526a00)
Reasoning about memory addresses
WWDC session
I found an interesting session (Understanding Crashes and Crash Logs) from WWDC 2018, where one of the presenters points out that it's sometimes possible to derive more information from the specific memory addresses at which the crashes occur.
Unfortunately, the addresses my app crashes at are completely different from the ones in the session, but maybe we can get clues from them anyway? At least it's interesting that they're all quite similar, isn't it?
Changes due to Diagnostic options enabled
Further investigation shows that the first 2 hex digits (16) always stay the same, followed by 4 seemingly random digits, followed by 3 digits (a00). When activating diagnostics (e.g. ASan or Scribble), the last 3 digits change (e.g. 3a0 or 9e0). But maybe this is only a kind of shift due to more "debug stuff" being added? I'm really not a "memory guy", but I just want to mention everything I noticed.
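To make the digit pattern described above concrete, here is a minimal sketch in C that splits the three crash addresses from the runs into their leading two hex digits, the middle part, and the last three hex digits. The function names are made up for illustration. The 12-bit mask is chosen only because it corresponds to three hex digits; it is not meant to say anything about the device's actual page size.

```c
#include <stdint.h>

/* Last three hex digits of an address (its low 12 bits). */
uint64_t low_three_hex_digits(uint64_t addr) {
    return addr & 0xFFFu;
}

/* Everything above the last three hex digits. */
uint64_t high_part(uint64_t addr) {
    return addr >> 12;
}

/* Leading two hex digits of a 9-digit (36-bit) address such as 0x16d09aa00. */
uint64_t top_two_of_nine_digits(uint64_t addr) {
    return (addr >> 28) & 0xFFu;
}
```

For all three addresses (0x16d09aa00, 0x16af46a00, 0x16d526a00) the low part is 0xa00 and the top two digits are 0x16; only the middle four digits differ between runs.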
Trying "Diagnostic options"
I tried different Diagnostic options (from schemes), but none of them really changed the crash in any way, or provided any more information.
1. Scribble
Crashes do not reference 0xAA or 0x55, so it's nothing that can be caught using Scribble? (Xcode - scribble, guard edges and guard malloc)
2. Malloc Guard Edges
Didn't notice any difference using this either.
3. Zombies
Using this guide.
malloc_info --type 0x16b15e9c0
error: error: Trying to put the stack in unreadable memory at: 0x16b15e920.
4. ASan
Using ASan just puts the following entry on top of the stack trace. Unfortunately I didn't find anything helpful related to that.
#0 0x0000000109efbf60 in __asan_alloca_poison ()
5. TSan
Not available on real devices (crashes only occur there)
Recursion / BOF?
Could it be a recursion that is too long, or another kind of stack/heap buffer overflow?
But it seems the stack size is exactly the same on real devices and simulators: 524288 bytes (from Thread.main.stackSize).
So, as it doesn't crash in simulators, it's not a BOF? Or is the architecture difference too big to draw such conclusions here?
Disassembling
I also tried "disassembling".
disassemble -a 0x16d09aa00
error: Could not find function bounds for address 0x16d09aa00
Or disassemble --frame
But my assembly skills are really lacking, so currently I can't get anything out of that information.
Need help
As you can see, I'm really running out of ideas. Either the crashes are really totally weird, or I just don't have enough knowledge/skills to use the above tools to get any closer to the cause of these issues.
Either way... Any help, hints, ideas or whatever could point me in the right direction is highly appreciated!
Thanks in advance, guys.
Update May 19, 2020
I totally forgot to mention that we're using ReSwift heavily in our app, and the crashes seem to be related to how we use the Middlewares there, I guess.
I'm also already in contact with the devs there: github.com/ReSwift/ReSwift/issues/271.
Here's finally some code. Unfortunately I can't share all of the app's code (which may be necessary!?) and I also don't want to overload you with way too much code.
Current issue
Thread 1: EXC_BAD_ACCESS (code=1, address=0x16ed82da0)
UserAccountMiddleware.swift
Note: Using those DispatchQueue.main.async actually makes the crashes go away. They indeed break the current cycle, so maybe there's some kind of recursion or timing issue happening?
func userAccountMiddleware() -> Middleware<AppState> {
    return { dispatch, getState in
        return { next in
            return { action in
                switch action {
                case _ as ReSwiftInit:
                    // DispatchQueue.main.async {
                    dispatch(UserAccountSetAuthToken(authToken: Defaults.customerAuthToken))
                    dispatch(UserAccountSetAvatar(index: Defaults.avatarIndex))
                    // }
                    if let data = Defaults.customer,
                       let customer = try? JSONDecoder().decode(Customer.self, from: data) {
                        // DispatchQueue.main.async {
                        dispatch(UserAccountSetCustomerLoggedIn(customer: customer))
                        // }
                    }
                    // [...]
                default:
                    break
                }
                next(action)
            }
        }
    }
}
ReSwift Store.swift
// [...]
open func _defaultDispatch(action: Action) {
    guard !isDispatching else {
        raiseFatalError(
            "ReSwift:ConcurrentMutationError- Action has been dispatched while" +
            " a previous action is action is being processed. A reducer" +
            " is dispatching an action, or ReSwift is used in a concurrent context" +
            " (e.g. from multiple threads)."
        )
    }
    isDispatching = true
    let newState = reducer(action, state) // Thread 1: EXC_BAD_ACCESS (code=1, address=0x16ed82da0)
    isDispatching = false
    state = newState
}
// [...]
Xcode console:
(lldb) po state
error: warning: couldn't get required object pointer (substituting NULL): Couldn't load 'self' because its value couldn't be evaluated
error: Trying to put the stack in unreadable memory at: 0x16d95ad00.
Assembler (very last step of crash):
myapp`type metadata accessor for GlobalState:
0x101f6ac10 <+0>: sub sp, sp, #0x30 ; =0x30
-> 0x101f6ac14 <+4>: stp x29, x30, [sp, #0x20] // Thread 1: EXC_BAD_ACCESS (code=1, address=0x16ed82da0)
0x101f6ac18 <+8>: adrp x8, 3620
0x101f6ac1c <+12>: add x8, x8, #0x148 ; =0x148
0x101f6ac20 <+16>: ldr x8, [x8]
0x101f6ac24 <+20>: mov x9, #0x0
0x101f6ac28 <+24>: mov x1, x8
0x101f6ac2c <+28>: str x0, [sp, #0x18]
0x101f6ac30 <+32>: str x1, [sp, #0x10]
0x101f6ac34 <+36>: str x9, [sp, #0x8]
0x101f6ac38 <+40>: cbnz x8, 0x101f6ac54 ; <+68> at <compiler-generated>
0x101f6ac3c <+44>: adrp x1, 2122
0x101f6ac40 <+48>: add x1, x1, #0x1dc ; =0x1dc
0x101f6ac44 <+52>: ldr x0, [sp, #0x18]
0x101f6ac48 <+56>: bl 0x102775358 ; symbol stub for: swift_getSingletonMetadata
0x101f6ac4c <+60>: str x0, [sp, #0x10]
0x101f6ac50 <+64>: str x1, [sp, #0x8]
0x101f6ac54 <+68>: ldr x0, [sp, #0x8]
0x101f6ac58 <+72>: ldr x1, [sp, #0x10]
0x101f6ac5c <+76>: str x0, [sp]
0x101f6ac60 <+80>: mov x0, x1
0x101f6ac64 <+84>: ldr x1, [sp]
0x101f6ac68 <+88>: ldp x29, x30, [sp, #0x20]
0x101f6ac6c <+92>: add sp, sp, #0x30 ; =0x30
0x101f6ac70 <+96>: ret

TL;DR
Just move huge structs to the heap by wrapping them in arrays. With a @propertyWrapper, this can be done at least somewhat elegantly.
@propertyWrapper
struct StoredOnHeap<T> {
    // A one-element array: the element lives in the array's heap buffer.
    private var value: [T]
    init(wrappedValue: T) {
        self.value = [wrappedValue]
    }
    var wrappedValue: T {
        get {
            return self.value[0]
        }
        set {
            self.value[0] = newValue
        }
    }
}
// Usage:
@StoredOnHeap var hugeStruct: HugeStruct
https://gist.github.com/d4rkd3v1l/ab582a7cafd3a8b8c164c8541a3eef96
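The wrapper works because a Swift array keeps its elements in a heap-allocated buffer, so only a small reference stays inside the enclosing value. The same idea in C, as a rough sketch: `HugeStruct`, `HeapBox` and the 64 KiB payload are hypothetical stand-ins, not taken from the app. Holding a pointer instead of the struct itself shrinks what each frame has to carry, and `copies_that_fit` shows how few by-value copies fit into a 524288-byte stack.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a very large value type (64 KiB payload). */
struct HugeStruct {
    unsigned char payload[64 * 1024];
};

/* Box that owns a heap copy; only this pointer lives in the caller's frame. */
struct HeapBox {
    struct HugeStruct *value;
};

/* Allocate a zero-initialised HugeStruct on the heap; value is NULL on failure. */
struct HeapBox heap_box_create(void) {
    struct HeapBox box;
    box.value = calloc(1, sizeof *box.value);
    return box;
}

/* Release the heap copy and clear the pointer. */
void heap_box_destroy(struct HeapBox *box) {
    free(box->value);
    box->value = NULL;
}

/* How many by-value copies of a given size fit into a stack budget. */
size_t copies_that_fit(size_t stack_bytes, size_t bytes_per_copy) {
    return bytes_per_copy == 0 ? 0 : stack_bytes / bytes_per_copy;
}
```

With these made-up sizes, only 8 by-value copies would fit into a 512 KiB stack even if nothing else were on it, while a `HeapBox` costs one pointer per frame.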
Long version
I'm almost 100% certain now that this is a stack overflow, as I (finally) managed to reproduce it in a little demo project: https://github.com/d4rkd3v1l/ReSwift-StackOverflowDemo
Now I'll just provide some more details and solutions for anyone else who may be running into this or similar issues.
The stack size on iOS (as of iOS 13) is 512 KB* and should apply to both devices and simulators. Why did I say "should"? Because it almost certainly is somewhat different on simulators, as I did not see those crashes there. So maybe Thread.main.stackSize just reports 512 KB but the stack is in fact larger? IDK 🤷‍♂️
Here are some indicators, you may face the same issue:
You get EXC_BAD_ACCESS crashes with code 1 or 2**. And the crashes occur at high memory addresses, or at least completely outside of where the rest of your app/stack normally "lives". Something like 0x16d95ad00 in my case.
Reducing the stuff you put on the stack (value types, e.g. very, very large structs) or breaking the call stack down into smaller pieces (e.g. dispatch async) to give the stack some "time to breathe" prevents this crash.
And with the latter we're already in the middle of the solution for this issue. As the stack size cannot (and probably even should not) be increased, you must reduce the load you put on it, as described in the 2nd point.
At least that's the solution we will probably go for. 🤞
*This is true at least for the main thread, other threads may be different.
**I think code 0 is kind of a null pointer exception and therefore doesn't apply here. Please correct me if I'm wrong about this.
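The first indicator above can be turned into a quick check. The disassembly in the question faults on the very first stack store of a function prologue (stp right after sub sp, sp, #0x30), which is what you would expect when a downward-growing stack runs past its end. A minimal sketch, assuming you know the stack's highest address and its size; the function name and the slack parameter are made up for illustration. On Darwin the two inputs can presumably be obtained via pthread_get_stackaddr_np and pthread_get_stacksize_np.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when `fault` lies below the lowest valid stack address but within
   `slack` bytes of it, i.e. in the region a downward-growing stack would
   run into. Assumes stack_top >= stack_size. */
bool fault_is_just_below_stack(uint64_t fault, uint64_t stack_top,
                               uint64_t stack_size, uint64_t slack) {
    uint64_t stack_bottom = stack_top - stack_size; /* lowest valid address */
    return fault < stack_bottom && stack_bottom - fault <= slack;
}
```

For example, with a hypothetical stack top of 0x16d9e0000 and a size of 0x80000 (524288 bytes), a fault at 0x16d95ad00 sits 0x5300 bytes below the stack's end and would be flagged, while an address inside the stack or far away from it would not.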

Related

How to solve LLDB error about N_SO in symbol with UID 1

When I launched lldb to debug an iOS application, I got an error that I never had before.
error: Veriff(0x00000001018cc000) N_SO in symbol with UID 1 has
invalid sibling in debug map, please file a bug and attach the binary
listed in this error
Below is the context of the error.
(lldb) process connect connect://localhost:6666
error: Veriff(0x00000001018cc000) N_SO in symbol with UID 1 has invalid sibling in debug map, please file a bug and attach the binary listed in this error
Process 3270 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
frame #0: 0x0000000187a1f6b0 libxpc.dylib` _xpc_dictionary_apply_node_f + 108
libxpc.dylib`_xpc_dictionary_apply_node_f:
-> 0x187a1f6b0 <+108>: mov x1, x20
0x187a1f6b4 <+112>: blr x21
0x187a1f6b8 <+116>: tbz w0, #0x0, 0x187a1f6f8 ; <+180>
0x187a1f6bc <+120>: mov x0, x26
0x187a1f6c0 <+124>: cbnz x26, 0x187a1f6a0 ; <+92>
0x187a1f6c4 <+128>: add x22, x22, #0x1 ; =0x1
0x187a1f6c8 <+132>: cmp x22, x23
0x187a1f6cc <+136>: b.lo 0x187a1f698 ; <+84>
Target 0: (Test app) stopped.
Has anyone been able to solve this error?
Does this impact any debugging?
I've never seen that error triggered before. If you can make this binary available to us, please file a bug either with http://bugs.llvm.org or http://bugreporter.apple.com and include the error message and the binary.
The error means lldb can't map symbols from some .o file that was included in your binary back to the .o file they came from (which is where the debug information actually resides.) So that code's debug information will not be available.

how to find variables location in memory without source code?

Basically I want to find the address/location of a variable in gdb?
I know normally the variables are stored at rbp but don't know how to locate them using gdb.
I want to find the address/location of a variable in gdb?
That is possible, but the approach is different depending on whether the variable is a global or a local.
I know normally the variables are stored at rbp
Local variables are stored at some offset from the frame pointer. %RBP is often used as a frame pointer in unoptimized binaries.
To find such variable, you'll need to know how to read machine code, and then you can find it. GDB will not help you with finding it in code that is compiled without debug info (it can't).
without source code
Source code has nothing to do with this -- GDB never looks at the source code, except to display it to you.
On to concrete example. Suppose you have the following source:
int foo(int *ip) { return *ip + 42; }
int main()
{
    int j = 1;
    return foo(&j);
}
Compiling this without debug info and without optimizations results in:
(gdb) disas main
Dump of assembler code for function main:
0x000000000000060d <+0>: push %rbp
0x000000000000060e <+1>: mov %rsp,%rbp
0x0000000000000611 <+4>: sub $0x10,%rsp
0x0000000000000615 <+8>: movl $0x1,-0x4(%rbp)
0x000000000000061c <+15>: lea -0x4(%rbp),%rax
0x0000000000000620 <+19>: mov %rax,%rdi
0x0000000000000623 <+22>: callq 0x5fa <foo>
0x0000000000000628 <+27>: leaveq
0x0000000000000629 <+28>: retq
End of assembler dump.
Here you can clearly see that j is being stored at negative offset 4 off %rbp.
You can set a breakpoint on foo, and use GDB to examine its value like so:
(gdb) b foo
Breakpoint 1 at 0x5fe
(gdb) run
Breakpoint 1, 0x00005555555545fe in foo ()
(gdb) up
#1 0x0000555555554628 in main ()
(gdb) x/x $rbp-4
0x7fffffffdbcc: 0x00000001 // indeed that is expected value of j
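What x/x $rbp-4 does can be mimicked in plain C. The following is only a model, not something GDB runs: the frame is represented by a byte buffer, `frame_base` plays the role of %rbp, and `read_u32_at` is a hypothetical helper name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value located `offset` bytes from `frame_base`,
   the way `x/x $rbp-4` reads the word 4 bytes below the frame pointer. */
uint32_t read_u32_at(const unsigned char *frame_base, ptrdiff_t offset) {
    uint32_t value;
    memcpy(&value, frame_base + offset, sizeof value);
    return value;
}
```

If a 16-byte buffer stands in for main's frame and the value 1 is stored in its last four bytes, then reading at offset -4 from the end of the buffer returns 1, just like j in the session above.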

Clarifications on iOS Assembly Language

I'm investigating a little bit how the Objective-C language is mapped to Assembly. I've started from a tutorial found at iOS Assembly Tutorial.
The code snippet under analysis is the following.
void fooFunction() {
    int add = addFunction(12, 34);
    printf("add = %i", add);
}
It is translated into
_fooFunction:
# 1:
push {r7, lr}
# 2:
movs r0, #12
movs r1, #34
# 3:
mov r7, sp
# 4:
bl _addFunction
# 5:
mov r1, r0
# 6:
movw r0, :lower16:(L_.str-(LPC1_0+4))
movt r0, :upper16:(L_.str-(LPC1_0+4))
LPC1_0:
add r0, pc
# 7:
blx _printf
# 8:
pop {r7, pc}
About the assembly code, I cannot understand the following two points
-> Comment #1
The author says that push decrements the stack by 8 bytes since r7 and lr are 4 bytes each. Ok. But he also says that the two values are stored with the one instruction. What does it mean?
-> Comment #6
movw r0, :lower16:(L_.str-(LPC1_0+4))
movt r0, :upper16:(L_.str-(LPC1_0+4))
The author says that r0 will hold the address of "add = %i" (which can be found in the data segment), but I don't really get what the memory layout looks like. Why does he represent the difference L_.str-(LPC1_0+4) with the dotted black line and not with the red one (drawn by me)?
Any clarifications will be appreciated.
Edit
I'm missing the concept of pushing r7 onto the stack. What does it mean to push that value and what does it contain?
But he also says that the two values are stored with the one instruction. What does it mean?
That the single push instruction will put both values onto the stack.
Why does he represent the difference L_.str-(LPC1_0+4)
Because the add r0, pc implicitly adds 4 bytes more. To quote the instruction set reference:
Add an immediate constant to the value from sp or pc, and place the result into a low register.
Syntax: ADD Rd, Rp, #expr
where:
Rd is the destination register. Rd must be in the range r0-r7.
Rp is either sp or pc.
expr is an expression that evaluates (at assembly time) to a multiple of 4 in the range 0-1020.
If Rp is the pc, the value used is: (the address of the current instruction + 4) AND &FFFFFFFC.
For comment 1:
The two values pushed to the stack are the values stored in r7 and lr.
Two 4 byte values equals 8 bytes.
For comment 6:
The label LPC1_0 is followed by the instruction
add r0, pc
which adds another 4 bytes to the difference between the two addresses.
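The arithmetic behind comment 6 can be checked with a few lines of C. This is just a sketch with made-up function names and example addresses; it assumes Thumb state, where reading pc yields the address of the current instruction plus 4, and it ignores any alignment masking.

```c
#include <stdint.h>

/* In Thumb state, reading pc yields the address of the current
   instruction plus 4 (alignment masking ignored here). */
uint32_t thumb_pc_value(uint32_t instruction_addr) {
    return instruction_addr + 4u;
}

/* Constant the assembler places in movw/movt: L_.str - (LPC1_0 + 4). */
uint32_t movw_movt_constant(uint32_t str_addr, uint32_t lpc_addr) {
    return str_addr - (lpc_addr + 4u);
}

/* Result of `add r0, pc` executed at LPC1_0 with r0 holding that constant. */
uint32_t r0_after_add_pc(uint32_t str_addr, uint32_t lpc_addr) {
    return movw_movt_constant(str_addr, lpc_addr) + thumb_pc_value(lpc_addr);
}
```

Whatever addresses you plug in, the +4 in the constant cancels the +4 that pc contributes, so r0 ends up holding exactly the address of L_.str. That is why the distance is measured from LPC1_0+4 and not from LPC1_0.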

ARM bare-metal with MMU: successive reads yield different values

Context (probably not needed):
As a learning exercise, I'm trying to implement a mini "OS" for the Raspberry Pi.
I'm currently implementing a very dumb memory management system. I already have the MMU enabled, and I'm in the process of getting a usable kmalloc.
It can already allocate chunks of memory from a pre-existing little kernel heap, mapped after the code and data segments. I'm trying to get it to grow as needed by mapping more pages. It must also be able to produce physically contiguous chunks.
The code is hosted at Github, there's a branch dedicated to this question with debug code. Note that it's not an example of well-organized, well-commented nor very clever code. :)
Actual question:
While trying to debug a data abort, I found something very strange.
This is a piece of code from my kmalloc:
next->prev_size = chunk->size;
next->size = -1;
term_printf(term, "A chunk->next_free = 0x%x\n", chunk->next_free);
term_printf(term, "B chunk->next_free = 0x%x\n", chunk->next_free);
*prev_list = next;
next->next_free = chunk->next_free;
term_printf(term, "next_free = 0x%x, chunk 0x%x\n", next->next_free, chunk->next_free);
term_printf(term, "next_free = 0x%x, chunk 0x%x\n", next->next_free, chunk->next_free);
I run it three times. Here are the results:
# 1st
A chunk->next_free = 0x0
B chunk->next_free = 0x0
next_free = 0x0, chunk 0x0
next_free = 0x0, chunk 0x0
# 2nd
A chunk->next_free = 0xffffffff
B chunk->next_free = 0x0
next_free = 0x0, chunk 0xffffffff
next_free = 0x0, chunk 0x0
# 3rd
A chunk->next_free = 0xffffffff
B chunk->next_free = 0xffffffff
next_free = 0xffffffff, chunk 0xffffffff
next_free = 0xffffffff, chunk 0xffffffff
The first and third iterations look normal (though next_free is supposed to have the value 0, the data abort occurs because it's got 0xffffffff). But what is my code doing during the second? O_o What kind of black sorcery can make my printf output two different values for chunk->next_free when read four times in a row? O_o
The data is well aligned, the pages are cacheable and non-bufferable (making them non-cacheable doesn't help), and I get the same result whether compiler optimizations are turned on or off. I tried throwing a data memory barrier in there but indeed it does nothing. I also checked the assembly produced, it looks ok.
I thought it could be caused by corrupted TLBs. I'm issuing "Invalidate unified single entry" (mcr p15, 0, %[addr], c8, c7, 1) after each new page mapping. Is it enough?
I tried debugging with qemu but it gets a data abort earlier when setting a bitmap of used physical pages, though this part works fine on the Pi.
I'm just looking for clues about what can cause this behavior. If you need more context please ask, though my code is for the moment rapidly changing and messy with lots of printf.
ETA:
The disassembly with -O0 for the first two printf:
c00025e4: e51b3018 ldr r3, [fp, #-24]
c00025e8: e5933008 ldr r3, [r3, #8]
c00025ec: e59b0004 ldr r0, [fp, #4]
c00025f0: e59f10a0 ldr r1, [pc, #160] ; c0002698 <kmalloc_wilderness+0x2c0>
c00025f4: e1a02003 mov r2, r3
c00025f8: eb000238 bl c0002ee0 <term_printf>
c00025fc: e51b3018 ldr r3, [fp, #-24]
c0002600: e5933008 ldr r3, [r3, #8]
c0002604: e59b0004 ldr r0, [fp, #4]
c0002608: e59f1088 ldr r1, [pc, #140] ; c000269c <kmalloc_wilderness+0x2c4>
c000260c: e1a02003 mov r2, r3
c0002610: eb000232 bl c0002ee0 <term_printf>
So it puts the address of chunk into r3, then performs an ldr to get next_free. It does it again before the second printf. There's only one core, the DMA is not running, so there's nothing changing the value in memory between the calls.
With -O2:
c0001c38: e1a00006 mov r0, r6
c0001c3c: e59f10d8 ldr r1, [pc, #216] ; c0001d1c <kmalloc_wilderness+0x1b8>
c0001c40: e5942008 ldr r2, [r4, #8]
c0001c44: eb000278 bl c000262c <term_printf>
c0001c48: e1a00006 mov r0, r6
c0001c4c: e59f10cc ldr r1, [pc, #204] ; c0001d20 <kmalloc_wilderness+0x1bc>
c0001c50: e5942008 ldr r2, [r4, #8]
c0001c54: eb000274 bl c000262c <term_printf>
So it still fetches the value with ldr. That's why I get the same thing with both optimization levels.
New edit: I added more printfs, and it seems that the singularity happens at this point:
next->size = -1;
After this line, chunk->next_free turns into Schrödinger's cat. Before, it reads as 0.
The structure is defined as such:
struct kheap_chunk {
    size_t prev_size;
    size_t size; // -1 for wilderness chunk, bit 0 high if free
    struct kheap_chunk *next_free;
};
chunk and next don't overlap.
If I move the "singularity line" below the next->next_free = chunk->next_free, it stops alternating between two values, but it's still weird: chunk->next_free is 0 before *prev_list = next, 0xffffffff after that. But next->next_free is still set to 0.
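For reference, here is a small sketch that ties the structure layout to the ldr r3, [r3, #8] in the -O0 listing and offers a way to double-check the "don't overlap" claim. It is a model, not the kernel's code: `kheap_chunk32` uses fixed 32-bit fields to reproduce the layout of a 32-bit target on any host, and `ranges_overlap` is a hypothetical helper.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of struct kheap_chunk on a 32-bit target: size_t and pointers
   are both 4 bytes there, so fixed-width fields reproduce the layout
   even when this is compiled on a 64-bit host. */
struct kheap_chunk32 {
    uint32_t prev_size;
    uint32_t size;       /* -1 for wilderness chunk, bit 0 high if free */
    uint32_t next_free;  /* stands in for struct kheap_chunk * */
};

/* Byte offset used by `ldr r3, [r3, #8]` in the -O0 listing. */
size_t next_free_offset(void) {
    return offsetof(struct kheap_chunk32, next_free);
}

/* True when byte ranges [a, a+a_len) and [b, b+b_len) share any byte. */
int ranges_overlap(uint32_t a, uint32_t a_len, uint32_t b, uint32_t b_len) {
    return (uint64_t)a < (uint64_t)b + b_len &&
           (uint64_t)b < (uint64_t)a + a_len;
}
```

Feeding the addresses of chunk and next (each with a length of 12 bytes) into ranges_overlap right before the suspicious store would confirm or refute that the two headers are really disjoint.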

EXC_BAD_ACCESS in Assembly Code in iOS App

I'm trying to debug an EXC_BAD_ACCESS crash in an iOS App I am working on. Basically, my code calls the function new_dyna_start(), which corresponds to a certain assembly method. Here's the relevant assembly code:
.align 4
42430:
.long _translation_cache_iphone
.align 2
.globl _new_dyna_start
// .type new_dyna_start, %function
_new_dyna_start:
ldr r12, .dlptr
mov r0, #0xa4000000
stmia r12, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
sub fp, r12, #28
add r0, r0, #0x40
bl _new_recompile_block
ldr r0, [fp, #64]
ldr r10, [fp, #400+36] /* Count */
str r0, [fp, #72]
sub r10, r10, r0
ldr r0, 42430b
ldr pc, [r0]
From my (limited) understanding, at line 6 of the method, it calls the C function new_recompile_block(). This method works fine, and I know it finishes because at the end of the function I have
printf("End of loop");
which then appears in the debugger. After the method completes, I'm not entirely sure I understand what happens, but it seems that the assembly method obtains a reference to the C variable translation_cache_iphone. However, at the final line the app crashes oddly. This message appears in Xcode: http://imgur.com/dqKo0
However, if I click on the side to the last method called, I see it is this: http://imgur.com/M5h84
This seems to support my idea that it is the translation_cache_iphone variable causing the crash, as the memory address of the EXC_BAD_ACCESS (0x401000) is the same as translation_cache_iphone. translation_cache_iphone is declared as:
unsigned char* translation_cache_iphone = NULL;
and is initialized by:
translation_cache_iphone = (unsigned char *)(((unsigned long) translation_cache_static_iphone + (4096)) & ~(4095));
Am I right in assuming that this is the problem? Is the problem in the assembly code, or in the C code? I've tried modifying both, but to no avail. The assembly code above is the original.
Here is a link to the full source on Github. Simply compile and run on an iDevice with Xcode and you'll see the exact issues I'm facing. It may be easier to debug that way.
The last two instructions form an indirect jump to the translation_cache_iphone which is thus expected to be executable code. Verify that is the case and that memory permissions are appropriate - in many systems data pages are not executable by default.
This seems to support my idea that it is the translation_cache_iphone variable causing the crash
Yes, I believe that this variable is the problem.
In the assembly code you posted I can see one line that could cause an invalid access to the memory, and it is:
ldr r0, 42430b
ldr pc, [r0]
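The first line loads the word stored at the label 42430 (the address of translation_cache_iphone) into the register r0. Then, the second line loads PC (Program Counter) with the value stored at that address, i.e. the current value of translation_cache_iphone, so execution jumps there.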
In the beginning of the assembly code you have declared what is the label 42430:
42430:
.long _translation_cache_iphone
Then, when it tries to access this value and execute it as code, it crashes.
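The initializer quoted in the question, (addr + 4096) & ~4095, is plain address arithmetic and can be tried in isolation. A minimal sketch, with a made-up function name:

```c
#include <stdint.h>

/* Same arithmetic as the initializer in the question:
   (addr + 4096) & ~4095. The result is always a multiple of 4096
   and always strictly greater than addr. */
uintptr_t next_4k_boundary(uintptr_t addr) {
    return (addr + (uintptr_t)4096) & ~(uintptr_t)4095;
}
```

Any input from 0x400000 to 0x400fff yields 0x401000, which has the same form as the faulting address reported in the question. Note that an already aligned input still advances by a full 4096 bytes, so the static buffer has to be at least that much larger than the space actually used. The result is only an address inside a data buffer; whether the memory there may be executed is a separate matter of page permissions.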

Resources