EXC_BAD_ACCESS but NSZombies never triggered, how to debug? - ios

We have some very random bugs happening and throwing EXC_BAD_ACCESS, or malloc_error_break, or abort. There is no consistency to the exceptions, and enabling NSZombies doesn't cause any zombies to be triggered ever.
In fact, running with zombies enabled causes the crashes never to occur. I believe there is a subtle memory error in this codebase, and after spending many hours cleaning up what could be minor issues, we still have not solved the issue.
It makes sense that a bad pointer may be overwriting a piece of memory, which is then later dereferenced and crashes the app. But what are other ways to isolate the underlying issue?
We have used all the diagnostic memory tools which will run with the device attached as well (this application uses a peripheral, so cannot be fully debugged in the simulator).

NSZombie is just a mechanism that poisons the space used by objects instead of freeing them, similar to Address Sanitizer's memory poisoning. Instrumentation such as NSZombie or ASan changes how your stack and heap allocations are laid out. Since the bug is undefined behavior, for which a crash is probably the best-case outcome, a different layout can make it show up differently or not at all.
EXC_BAD_ACCESS means you tried to access an invalid address, or tried to read or write a memory region you do not have permission for. The inconsistencies you're running into are likely consequences of a nasty stack or heap corruption, such as sometimes overwriting a variable that just holds data and sometimes overwriting a pointer being used by the program.
Data layout matters a lot for what happens, and heap layout is often randomized in non-debug builds, which adds even more room for inconsistent crashes. In addition, any change to program source code or build settings will almost inevitably change the data layout.
I would recommend:
Build in debug mode (-g compiler flag) and run with a debugger attached. When you get a crash, gdb or lldb (the latter being the default for Xcode tooling) will stop execution and let you inspect the program. From there, get a stack trace using bt; that may let you work out the deeper cause of the problem.
Use ASan. It is supported within Xcode tooling and is generally an excellent tool for dealing with memory issues. Beware that using it with shared libraries built without ASan support may cause anomalies, but it usually warns you about them and generally tries to hold your hand as much as it can.
"printf debugging" can help: defining something like #define TRACE printf("%s %s:%d\n", __func__, __FILE__, __LINE__); and scattering it around the likely problem point can actually be helpful.
In general, I would suggest using a debugger first, without NSZombie or anything, just do a bunch of runs to a point of a crash and get stack traces, register states etc. Having those samples across multiple crashes can help you narrow down the problem.
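A slightly more defensive variant of the TRACE macro suggested above, as a minimal sketch rather than a definitive implementation. It writes to stderr and flushes, so the last location printed before a crash is not lost in a buffer, and it uses a do/while wrapper so the macro behaves like a single statement. The helper name trace_format and the buffer size are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats "func file:line" into buf and returns the length reported by
   snprintf. trace_format is a hypothetical helper, not a library call. */
int trace_format(char *buf, size_t size,
                 const char *func, const char *file, int line)
{
    assert(buf != NULL && size > 0);
    return snprintf(buf, size, "%s %s:%d", func, file, line);
}

/* Prints the current location to stderr and flushes right away, so the
   message is already out if the next statement crashes. */
#define TRACE()                                                  \
    do {                                                         \
        char trace_buf_[256];                                    \
        trace_format(trace_buf_, sizeof trace_buf_,              \
                     __func__, __FILE__, __LINE__);              \
        fprintf(stderr, "%s\n", trace_buf_);                     \
        fflush(stderr);                                          \
    } while (0)
```

Placing TRACE(); at the entry of each suspect function gives a trail of the last locations reached before the crash. Keep in mind that adding such calls changes code and stack layout, which can itself make the bug move.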

Related

Symbolicating crash without crash log file

This is not a duplicate. My question is how to symbolicate a crash.
My live app is crashing, and I have crash reports in Xcode and Crashlytics, but I don't have a crash log because it's happening in the live app and it's random.
Is it possible to get some meaning out of a crash report without a crash log?
How do we find the file and line number from such reports?
Here is one example of such a crash:
crash_info_entry_0
abort() called
crash_info_entry_1
myapp(569,0x16df57000) malloc: *** error for object 0x10404ddae: pointer being freed was not allocated
Symbolication is the process of translating addresses into symbols (functions, methods, etc). Without a crash log, which contains those addresses, symbolication doesn't make sense. You cannot translate addresses you do not have. But, where did the output you listed come from? It looks like it could be part of a larger log. You've tagged the issue Crashlytics - did this report come from their service?
There's some helpful information in the logging you've included. The good news is that it is telling you that you've got heap corruption. malloc has called abort because it's detected an inconsistency with its internal structures. Further, it's extremely unlikely that a symbolicated stack trace would help you, because heap corruption is rarely, if ever, caused by functions further up the stack.
Keep in mind that the crash you are seeing here is an effect. To fix this issue, you need a cause, and a stack trace isn't going to get you that in this situation.
There's more bad news. It is hard, and often even impossible, to reason about heap corruption. Replicating the bug can also be impossible, as memory corruption is typically not deterministic. As you've noted, the crash appears random. That's because it probably is.
What I would recommend doing here is using the various tools that Apple provides to help debug this kind of issue.
Look for other crashes that look heap-corruption-related
Try out Zombies in Instruments
Try malloc scribble, or guardmalloc, two other good memory debugging tools
It is extremely common for one heap-corrupting bug to cause lots of different kinds of crashes. This could be an objc over-release, so I'd also keep my eye out for "unrecognized selector sent to instance" exceptions. Those crashes might give you more of a clue as to what kind of object is being over-released.
Good luck!

Crashed: com.apple.root.utility-qos

I'm facing a strange issue where my app crashes after a certain period of time. Attached is the screenshot from Crashlytics as well.
This occurred on an iPhone 6 Plus running iOS 11.4.1.
I'd like to see a full crash log to get more information. With anything related to concurrency, I like to take a look at what all the threads are doing. Sometimes there is a clue in a thread that did not crash.
I do not know what's going on. But I can make an educated guess here that you are seeing heap corruption of some kind. The function "os_unfair_lock_corruption_abort" strongly indicates that the OS's primitive locking mechanism has detected a corrupted data structure, and is killing the process.
Heap corruption is super-common, and can be extremely difficult to debug. One of the reasons is that what you are seeing here is a symptom of the corruption, not the source. The source could be completely unrelated to locking/OperationQueue internals.
My suggestions would be to try out the memory debugging tools at your disposal, and attempt to fix all issues you can find. You might never be able to know which, if any, cause this. But, that's pretty much all you can do.
Check out malloc scribble, guardmalloc, and even NSZombies. All could potentially turn up some heap corruption bugs that are in your code.

Using instruments to find memory leaks

I've tried to read almost every decent tutorial on the internet, but still can't understand what is really happening here:
I've enabled "Hide System Libraries" and "Invert the call tree", but I do not understand how to find the actual code responsible for, say, this leak. Any tips are appreciated. Maybe I am missing something obvious. I am getting hundreds of leaks, even though I am using weak in closures, I do not have classes referencing each other, etc. But it looks like I am missing something fundamental.
The problem shown in your screenshot is that Instruments can't find your app's debug symbols. Instruments is showing memory addresses instead of function names. You are not going to be able to find the source of your memory leaks in Instruments without function names, even if you invert the call tree and hide system libraries.
Make sure your project is generating debug symbols. Check that the Generate Debug Symbols build setting is set to Yes. If your project is generating debug symbols, Instruments may be unable to find the dSYM file that contains the debug symbols. In Instruments choose Instrument > Call Tree > Locate dSYM. The dSYM is usually in the same directory as the release version's application bundle. The following article has additional information:
Instruments: Locating dSYM Files
Memory leaks can be difficult to track down. This is likely going to be a time consuming process, so be prepared. In the end, there is usually a lot of trial and error with debugging memory leaks. The "Memory Leaks" instrument has actually only detected one leak for me in the past. I've always had to track them down myself using the "Allocations" instrument.
One of the things that has helped me in the past is to start by trying to figure out what objects are actually being leaked. Click on the allocations instrument (the row above "Leak Checks"). Now try sorting by number of objects released or amount of memory used. See if there are any objects that have a count of 0 released when they shouldn't be sticking around. See if there is an object type that is taking an abnormal amount of memory.
Memory leaks are always due to developer mistakes with memory management. There are some minor memory leaks that exist in some of the lower level private APIs in Foundation and UIKit. At those lower levels, they are dealing with a lot more manual memory management, so it's much easier to make tiny mistakes. You can't really do anything about those, but they are relatively rare.
If your application is working just fine, you may not need to worry about fixing these. There is some cost benefit analysis you want to do here. If this isn't impacting performance or stability, is the time investment in fixing these right now worth the minor benefits it will provide you and your users?
However, it is worth noting that memory leaks can add up, so if a user has your app open for a long time, the amount of leaked memory will eventually become a problem if you continue to leak more objects over time. At some point the application will crash and the user will have to re-open it. But if your memory leaks are small enough that this doesn't become an issue unless the app has been open for HOURS, is it really that much of a problem anyway? That's always a judgment call on your part.

Objective-C - How to force low memory on iOS simulator

I have an error that some users are having of EXC_BAD_ACCESS when their device is low on memory. The stack trace is pointing to the if line below, and I believe it's because of the UTF8String that's being deallocated and still being used:
dispatch_sync(dbQueue, ^{
if (sqlite3_bind_text(sql_stmt, 1, pid.UTF8String, -1, SQLITE_STATIC) != SQLITE_OK) {
...
I'm having a hard time reproducing the issue on my end, how can I force or simulate low-memory on the simulator or a device?
Update:
I've tried adding a breakpoint to the line above, and then using the option Simulator -> Simulate Memory Warning, but I still can't reproduce the EXC_BAD_ACCESS error.
In the simulator there is a menu item: Hardware > Simulate Memory Warning, or press Shift+Command+M.
In the simulator's menu: Hardware -> Simulate Memory Warning.
Update
If you're sure that your app crashed at sqlite3_bind_text, I suppose the most likely problem is that pid.UTF8String is sometimes NULL, which causes the crash. It's also unlikely that pid or pid.UTF8String is deallocated while in use. You can check the crash report (if you have one) for the address of the memory access that caused the EXC_BAD_ACCESS. For example, if you have EXC_BAD_ACCESS CODE=2 ADDRESS=0x00000000, it means pid.UTF8String is indeed a NULL pointer; if the address is not 0x0, then it's another problem (very unlikely in your case).
As a suggestion, add a nil check to your code:
if (pid) {
    if (sqlite3_bind_text(sql_stmt, 1, pid.UTF8String, -1, SQLITE_STATIC) != SQLITE_OK) {
        // do your stuff
    }
} else {
    sqlite3_bind_null(sql_stmt, 1);
}
Many times, these kinds of errors are the result of a "perfect storm" of circumstances (i.e. race conditions, infrequent tasks running at just the "right" time, etc.), and often the kind of circumstances that you just cannot anticipate; if you knew how to reproduce it reliably, you'd probably also know how to fix it. The next best thing you can hope for is to try and increase your statistical odds of reproducing it in an environment (the debugger) where you could hopefully make sense of what's happening.
See this post: iOS Development: How can I induce low memory warnings on device? By simulating memory warnings programmatically, you can (for example) use a repeating timer to cause a memory warning once per second, eliminating the need to do it by hand repeatedly. Firing much faster than that may cause other issues, which will have you chasing your tail more than solving your original problem.
Before actually running the test, you can also set breakpoints at the following locations:
Symbol                  Module
======                  ======
objc_exception_throw    libobjc.A.dylib
-[NSException raise]    CoreFoundation
Additionally, set breakpoints on all Objective-C exceptions. Setting a breakpoint will allow you to inspect the contents of memory before the exception actually gets thrown by the runtime, which will give you a much better chance of understanding the problem when it happens. When (and if) you capture the crash, inspect pid, pid.UTF8String and sql_stmt, as those look like the most likely culprits.
Run your app and start the timer firing. This will not necessarily or directly cause the crash you're looking for, but it will probably make it much more likely to occur over time without having to hand-hold; you can fire off the timer and wait (i.e. do something more productive) until you actually see the crash.
On the left side of your Xcode window there is a button to open the Debug Navigator, where you can see the amount of memory your app is currently using and the amount that is free.
If you analyze it, you'll realize that the memory available to your simulator is the same as your computer's, so I suggest running some memory-hungry application alongside the simulator.
If you have an iPad available it might be easier. What I usually do is go to this website and copy as much of the Unicode table as possible, so that it is stored on the pasteboard.

How to get the root cause of a memory corruption in a embedded environment?

I have detected a memory corruption in my embedded environment (my program runs on a set-top box with a proprietary OS), but I couldn't find its root cause.
The memory corruption itself is detected after a stress test that launches and exits an application multiple times. Given that I can't set a memory breakpoint, because the corrupted variable changes its address every time the application is launched, is there any way to catch the root cause of this corruption?
(A memory breakpoint is a breakpoint triggered when the environment changes the value at a given memory address.)
Note also that all my software is developed in C.
Thanks for your help.
These are always difficult problems on embedded systems and there is no easy answer. Some tips:
Look at the value the memory gets corrupted with. This can give a clear hint.
Look at the data structures next to your memory corruption.
See if there is a pattern in the memory corruption. Is it always at a similar address?
See if you can set up the memory breakpoint at run-time.
Does the embedded system allow memory areas to be sandboxed? If so, set up sandboxes to safeguard your data memory.
Good luck!
Where is the data stored and how is it accessed by the two processes involved?
If the structure was allocated off the heap, try allocating a much larger block and putting large guard areas before and after the structure. This should give you an idea of whether it is one of the surrounding heap allocations which has overrun into the same allocation as your structure. If you find that the memory surrounding your structure is untouched, and only the structure itself is corrupted then this indicates that the corruption is being caused by something which has some knowledge of your structure's location rather than a random memory stomp.
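The guard-area idea can be sketched in C as follows. This is an illustration under assumptions, not a drop-in allocator: the names guarded_alloc, guarded_check and guarded_free, the 32-byte guard size and the 0xA5 fill pattern are all invented here, and the sketch ignores size overflow. It allocates a larger block, fills the bytes before and after the payload with a known pattern, and lets you test later whether a neighbour overran into it.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_SIZE 32u    /* bytes of padding on each side (arbitrary)  */
#define GUARD_BYTE 0xA5u  /* fill pattern, chosen to be easy to spot    */

/* Allocates size bytes surrounded by two guard areas filled with
   GUARD_BYTE. Returns a pointer to the usable payload, or NULL. */
void *guarded_alloc(size_t size)
{
    unsigned char *raw = malloc(size + 2u * GUARD_SIZE);
    if (raw == NULL)
        return NULL;
    memset(raw, GUARD_BYTE, GUARD_SIZE);
    memset(raw + GUARD_SIZE + size, GUARD_BYTE, GUARD_SIZE);
    return raw + GUARD_SIZE;
}

/* Returns 0 if both guards are intact, -1 if the guard before the
   payload was modified, +1 if the guard after it was modified.
   size must be the value passed to guarded_alloc(). */
int guarded_check(const void *payload, size_t size)
{
    assert(payload != NULL);
    const unsigned char *front = (const unsigned char *)payload - GUARD_SIZE;
    const unsigned char *back = (const unsigned char *)payload + size;
    for (size_t i = 0; i < GUARD_SIZE; i++) {
        if (front[i] != GUARD_BYTE)
            return -1;
    }
    for (size_t i = 0; i < GUARD_SIZE; i++) {
        if (back[i] != GUARD_BYTE)
            return 1;
    }
    return 0;
}

/* Releases a block obtained from guarded_alloc(). */
void guarded_free(void *payload)
{
    if (payload != NULL)
        free((unsigned char *)payload - GUARD_SIZE);
}
```

Calling guarded_check at regular points (for example, from a periodic task) narrows down when the guards were first damaged. If the guards stay intact while the structure itself changes, that points toward code that knows the structure's address rather than a random stomp.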
If the structure is in a data section, check your linker map output to determine what other data exists in the vicinity of your structure. Check whether those have also been corrupted, introduce guard areas, and check whether the problem follows the structure if you force it to move to a different location. Again this indicates whether the corruption is caused by something with knowledge of your structure's location.
You can also test this by switching data from the heap into a data section or vice versa.
If you find that the structure is no longer corrupted after moving it elsewhere or introducing guard areas, you should check the linker map or track the heap to determine what other data is in the vicinity, and check accesses to those areas for buffer overflows.
You may find, though, that the problem does follow the structure wherever it is located. If this is the case then audit all of the code surrounding references to the structure. Check the contents before and after every access.
To check whether the corruption is being caused by another process or interrupt handler, add hooks to each task switch and before and after each ISR is called. The hook should check whether the contents have been corrupted. If they have, you will be able to identify which process or ISR was responsible.
If the structure is ever read onto a local process stack, try increasing the process stack and check that no array overruns etc have occurred. Even if not read onto the stack, it's likely that you will have a pointer to it on the stack at some point. Check all sub-functions called in the vicinity for stack issues or similar that could result in the pointer being used erroneously by unrelated blocks of code.
Also consider whether the compiler or RTOS may be at fault. Try turning off compiler optimisation, and failing that inspect the code generated. Similarly consider whether it could be due to a faulty context switch in your proprietary RTOS.
Finally, if you are sharing the memory with another hardware device or CPU and you have data cache enabled, make sure you take care of this through using uncached accesses or similar strategies.
Yes, these problems can be tough to track down with a debugger.
A few ideas:
Do regular code reviews (not fast at tracking down a specific bug, but valuable for catching such problems in general)
Comment-out or #if 0 out sections of code, then run the cut-down application. Try commenting-out different sections to try to narrow down in which section of the code the bug occurs.
If your architecture allows you to easily disable certain processes/tasks from running, by the process of elimination perhaps you can narrow down which process is causing the bug.
If your OS uses cooperative multitasking, e.g. round robin (I think this would be too hard with preemptive multitasking): add code to the end of the task that "owns" the structure to save a "check" of the structure. That check could be a memcpy (if you have the time and space), or a CRC. Then, after every other task runs, add some code to verify the structure against the saved check. This will detect any changes.
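A minimal sketch of the save-and-verify check described in the last tip, using a CRC. The function and type names (crc32_bytes, struct_check_t, check_save, check_verify) are made up for this example, and where exactly you call them depends on your scheduler, which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320). Slow but tiny,
   which is usually acceptable for a debugging aid. */
uint32_t crc32_bytes(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++) {
            uint32_t mask = (crc & 1u) ? 0xEDB88320u : 0u;
            crc = (crc >> 1) ^ mask;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

typedef struct {
    uint32_t crc;  /* checksum recorded by the owning task */
} struct_check_t;

/* Call at the end of the task that owns the structure. */
void check_save(struct_check_t *chk, const void *obj, size_t len)
{
    assert(chk != NULL && obj != NULL);
    chk->crc = crc32_bytes(obj, len);
}

/* Call after every other task has run.
   Returns 1 if the object is unchanged, 0 if it was modified. */
int check_verify(const struct_check_t *chk, const void *obj, size_t len)
{
    assert(chk != NULL && obj != NULL);
    return crc32_bytes(obj, len) == chk->crc;
}
```

The first task after which check_verify returns 0 is the prime suspect. If the structure contains padding bytes, zero it with memset when it is created so that the checksum covers deterministic values.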
I'm assuming by your question you mean that you suspect some part of the proprietary code is causing the problem.
I have dealt with a similar issue in the past using what a colleague so tastefully calls a "suicide note". I would allocate a buffer capable of storing a number of copies of the structure that is being corrupted. I would use this buffer like a circular list, storing a copy of the current state of the structure at regular intervals. If corruption was detected, the "suicide note" would be dumped to a file or to serial output. This would give me a good picture of what was changed and how, and by increasing the logging frequency I was able to narrow down the corrupting action.
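The snapshot buffer described above could look roughly like this. It is only a sketch: watched_t stands in for whatever structure is being corrupted, and the slot count, field names and note_* function names are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define NOTE_SLOTS 8u  /* how many snapshots to keep (arbitrary) */

/* Stand-in for the structure being corrupted (hypothetical layout). */
typedef struct {
    int state;
    int counter;
} watched_t;

typedef struct {
    watched_t slots[NOTE_SLOTS];
    size_t next;   /* index the next snapshot is written to     */
    size_t count;  /* number of valid snapshots, <= NOTE_SLOTS  */
} note_t;

void note_init(note_t *n)
{
    memset(n, 0, sizeof *n);
}

/* Stores a copy of *w, overwriting the oldest snapshot when full. */
void note_record(note_t *n, const watched_t *w)
{
    assert(n != NULL && w != NULL);
    n->slots[n->next] = *w;
    n->next = (n->next + 1u) % NOTE_SLOTS;
    if (n->count < NOTE_SLOTS)
        n->count++;
}

/* age 0 is the most recent snapshot. Returns NULL if age is out of range. */
const watched_t *note_get(const note_t *n, size_t age)
{
    if (age >= n->count)
        return NULL;
    return &n->slots[(n->next + NOTE_SLOTS - 1u - age) % NOTE_SLOTS];
}

/* Writes all snapshots to out, oldest first, once corruption is detected. */
void note_dump(const note_t *n, FILE *out)
{
    for (size_t age = n->count; age-- > 0; ) {
        const watched_t *w = note_get(n, age);
        fprintf(out, "-%zu: state=%d counter=%d\n",
                age, w->state, w->counter);
    }
}
```

Call note_record at regular intervals and note_dump (to a file or serial port) as soon as corruption is detected. Recording more often shrinks the window in which the corrupting write must have happened.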
Depending on your OS, you may be able to react to detected corruption by looking at all running processes and seeing which ones are currently holding a semaphore (you are using some kind of access control mechanism with shared memory, right?). By taking snapshots of this data too, you perhaps can log the culprit grabbing the lock before corrupting your data. Along the same lines, try holding the lock to the shared memory region for an absurd length of time and see if the offending program complains. Sometimes they will give an error message that has important information that can help your investigation (for example, line numbers, function names, or code offsets for the offending program).
If you feel up to doing a little linker kung fu, you can most likely specify the address of any statically-allocated data with respect to the program's starting address. This might give you a consistent-enough memory address to set a memory breakpoint.
Unfortunately, this sort of problem is not easy to debug, especially if you don't have the source for one or more of the programs involved. If you can get enough information to understand just how your data is being corrupted, you may be able to adjust your structure to anticipate and expect the corruption (sometimes needed when working with code that doesn't fully comply with a specification or a standard).
You detect memory corruption. Could you be more specific about how? Is it a crash with a core dump, for example?
Normally the OS will completely free all resources and handles your program has when the program exits, gracefully or otherwise. Even proprietary OSes manage to get this right, although it's not a given.
So an intermittent problem could seem to be triggered after stress but just be chance, or could be in the initialisation of drivers or other processes the program communicates with, or could be bad error handling around say memory allocations that fail when the OS itself is under stress e.g. lazy tidying up of the closed programs.
Printfs in custom malloc/realloc/free proxy functions, or even an Electric Fence-style custom allocator, might help if it's as simple as a buffer overflow.
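A minimal sketch of such proxy functions, assuming you can route your own allocations through macros. The names dbg_malloc, dbg_free, DBG_MALLOC and DBG_FREE are invented for this example; a realloc proxy would follow the same pattern. The counter is not thread-safe.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

static long live_blocks = 0;  /* allocations not yet freed */

/* Logs the request and its call site, then forwards to malloc(). */
void *dbg_malloc(size_t size, const char *file, int line)
{
    void *p = malloc(size);
    if (p != NULL)
        live_blocks++;
    fprintf(stderr, "malloc(%zu) = %p at %s:%d\n", size, p, file, line);
    return p;
}

/* Logs the call site, then forwards to free(). Freeing a non-NULL
   pointer while nothing is live suggests a double free or a pointer
   that never came from dbg_malloc(). */
void dbg_free(void *p, const char *file, int line)
{
    fprintf(stderr, "free(%p) at %s:%d\n", p, file, line);
    if (p != NULL) {
        assert(live_blocks > 0);
        live_blocks--;
    }
    free(p);
}

/* Number of blocks currently allocated through the proxies. */
long dbg_live_blocks(void)
{
    return live_blocks;
}

#define DBG_MALLOC(size) dbg_malloc((size), __FILE__, __LINE__)
#define DBG_FREE(ptr)    dbg_free((ptr), __FILE__, __LINE__)
```

Matching the logged addresses against the address that ends up corrupted shows which allocation sits next to it, and therefore which call site to audit for an overrun.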
Use memory-allocation debugging tools like ElectricFence, dmalloc, etc - at minimum they can catch simple errors and most moderately-complex ones (overruns, underruns, even in some cases write (or read) after free), etc. My personal favorite is dmalloc.
A proprietary OS might limit your options a bit. One thing you might be able to do is run the problem code on a desktop machine (assuming you can stub out the hardware-specific code), and use the more-sophisticated tools available there (i.e. guardmalloc, electric fence).
The C library that you're using may include some routines for detecting heap corruption (glibc does, for instance). Turn those on, along with whatever tracing facilities you have, so you can see what was happening when the heap was corrupted.
First, I am assuming you are on a bare-metal chip that isn't running Linux or some other POSIX-capable OS (if you are, there are much better tools, such as Valgrind and ASan).
Here are a couple of tips for tracking down embedded memory corruption:
Use JTAG or similar to set a memory watchpoint on the area of memory that is being corrupted. You might be able to catch the moment when memory is accidentally written there, as opposed to a correct write. Many JTAG debuggers include plugins for IDEs that let you get stack traces as well.
In your hard fault handler, try to generate a call stack that you can print, so you get a rough idea of where the code is crashing. Since memory corruption can occur some time before the crash, these stack traces are unlikely to be helpful on their own, but they become useful in combination with the techniques below. Generating a backtrace on bare metal can be very difficult, though. If you happen to be using a Cortex-M processor, check out https://github.com/armink/CmBacktrace, or search the web for advice on generating a backtrace for your particular chip.
If your compiler supports it, use stack canaries to detect a write over the stack and crash immediately. For details, search the web for "Stack Protector" for GCC or Clang.
If you are running on a chip that has an MPU, such as an ARM Cortex-M3, you can use the MPU to write-protect the region of memory that is being corrupted, or a small region right before it. This makes the chip fault at the moment of the corruption rather than much later.

Resources