Why would malloc/new return a pointer to the middle of a block?

General Background:
I am attempting to analyze a dump of a heap corruption. The corruption takes place in std::vector::push_back: when the vector's capacity is exceeded and more space is required, the call to "free" the old vector buffer fails.
Analysis details:
From the analysis of the dump, I've seen that the pointer being "freed" lies in the middle of an existing "HEAP_ENTRY" block. I found this by enumerating all the blocks of the relevant heap using "!heap -h " and seeing that the freed block resides between two existing blocks (the gap between them is significant, certainly more than the 8-16 bytes of metadata or anything of that sort).
Questions:
1. Can a previous heap corruption cause the heap manager to return an address in the middle of an existing block, thus inevitably causing a crash when I attempt to free it?
2. If 1. is true, using pageheap isn't very useful here, because the corruption seems to take place on data that is always writable, so I don't think pageheap (the gflags option) will be able to detect it. Do you have any suggestions for how I might catch the point at which this kind of corruption occurs?
Thanks a lot,
Amit

You might be best served by using tools to track down better clues about what is going wrong.
Valgrind is quite good.
Some operating systems have built in malloc() diagnostics that can be enabled on the fly via environment variables with no additional effort. Check the manual page for malloc() on your system.

I'd guess it to be a stale pointer: it sounds like the address was probably validly allocated in the past, but presumably also freed in the past and then allocated over by a larger block.
In that scenario the problem is a double free. If your code is multi-threaded, it could be a thread-safety problem leading to stale pointers.
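A minimal sketch of that scenario (illustrative only; whether the recycled address range actually overlaps the later allocation depends entirely on the allocator):

    #include <stdlib.h>
    #include <string.h>

    /* Illustration of the stale-pointer / double-free scenario described
     * above.  This is deliberately broken code - do not ship it. */
    int main(void)
    {
        char *p = malloc(32);      /* original allocation                  */
        free(p);                   /* freed correctly once...              */

        /* The allocator may now recycle that address range as part of a
         * larger block handed out to someone else.                        */
        char *big = malloc(4096);
        memset(big, 0, 4096);

        /* ...but a stale copy of p is still held somewhere and freed again.
         * If the old address now falls inside big, the heap manager sees a
         * "free" of an address in the middle of an existing HEAP_ENTRY.   */
        free(p);                   /* double free via a stale pointer      */

        free(big);
        return 0;
    }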

Related

Use of Memory Profiling

What is the meaning of memory profiling?
Does it give statistics about memory, such as how much memory is utilized?
And are there different kinds of memory profiling?
The problem is, you may be doing way too much new-ing, which, even in a language with a garbage collector, may unnecessarily dominate your execution time.
You may also have a memory leak meaning that the amount of dynamic memory you're not returning to the pool grows steadily over time.
If your app runs for a long time, that's equally bad.
I use the random-pausing method for performance diagnosis, but that is of no value for finding memory leaks.
That's what Memory Profiling should help with.
Here's how I've found memory leaks in the past, using MFC.
In a debug build, when I shut down the app, it prints a list of all the non-collected memory blocks, along with their class type.
Then I look to see where those blocks are created, and try to figure out why they weren't deleted or collected.
It would be more useful if I could capture a stack trace on each block, so I could tell which new statement made it, and the stack could tell me why.
The point is, I could allocate 100 blocks of class Foo, and delete 99 of them.
The one that I don't delete is the problem, so it would be useful to know more about where it came from.
I don't know if memory profilers can do this or not.
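For what it's worth, on Windows the MSVC CRT debug heap can get part of the way there: in a debug build it can record the file and line of each allocation and dump them for every block still live at exit (MFC's DEBUG_NEW does the equivalent for operator new). A rough sketch:

    /* Rough sketch of the MSVC CRT debug-heap leak report (debug builds only). */
    #define _CRTDBG_MAP_ALLOC   /* make malloc/free record __FILE__ / __LINE__ */
    #include <stdlib.h>
    #include <crtdbg.h>

    int main(void)
    {
        /* Dump every block still allocated when the process exits. */
        _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);

        void *leaked = malloc(100);   /* never freed: appears in the dump    */
        (void)leaked;                 /* report goes to the debugger output, */
        return 0;                     /* including the file and line of the  */
    }                                 /* allocation call                     */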

What's the malloc checksum logic in iOS?

What is the reasoning behind the checksum? How and when is it checked (e.g. before/after allocation, before/after deallocation)?
Why am I interested? Read on.
While porting a large project to arm64, I'm running into some tough-to-diagnose crashes having to do with that ever-so-popular malloc checksum failure. I've set watchpoints on the offending address, and it's always the same offset from the base address. This address is a member variable of a C++ class (and it's just a 32-bit integer). The project has some C and C++ mixed together with ObjC, which makes me lean towards alignment bugs.
The watchpoints seldom hit (only at the beginning of the object's use, after which the address is left alone), yet it still crashes at this same address.
I understand it's intended to identify writing to an invalid address, but knowing how/when it's performed could help shed some light on this bug.
Checksums in malloc functions are generally performed on the control information held for a block (not the data area), for example the sixteen bytes immediately preceding the address you're given, which hold information such as block size, next block, checksum, and so on.
And the most logical time for setting it is upon block allocation (or reallocation if it's done in-place) since the information tends not to change otherwise.
It's also generally checked at deallocation time, to catch the situation where an errant write has corrupted the control information.
I would suggest that, if you're writing to a positive offset from the allocated memory (your "member variable of a C++ class") and that's causing the issue, then you haven't allocated enough memory for it. In other words, you're writing over the control information for the next block (free or allocated; it probably doesn't matter to the checking code).
Keep in mind that's based on general knowledge of how memory arenas work, not specific details of the iOS one. But there's a fair amount of commonality in all I've seen. It makes sense to set the checksum on malloc/realloc and check it on free, as much sense as it makes to not bother checking it at any other time.
And, based on the operation you state is corrupting, it's likely it's a buffer overrun rather than underrun.
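As a concrete illustration of why an overrun trips the check (the header layout and the exact point of detection are allocator-specific, so treat this as a sketch):

    #include <stdlib.h>

    /* Illustration only: why an overrun trips the allocator's checksum.
     * Many allocators keep control information (size, links, checksum) in
     * the bytes adjacent to the area they hand back to you, so writing past
     * the end of your allocation lands in a neighbouring block's control
     * data.                                                                */
    int main(void)
    {
        int *values = malloc(4 * sizeof *values);   /* room for 4 ints      */

        for (int i = 0; i <= 4; i++)                /* off-by-one overrun   */
            values[i] = i;                          /* values[4] is past the*/
                                                    /* end of the block     */

        free(values);   /* a checksum/guard check typically fires here, or  */
                        /* when the neighbouring block is allocated/freed   */
        return 0;
    }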

Is EOutOfMemory exception recoverable?

Does it make sense to continue execution after catching an EOutOfMemory exception, or is the heap or stack now corrupted with high probability?
I don't mean the case of EOutOfMemory caused by previous memory corruption due to bugs like writing to a wild address; I mean correct code that calls GetMem and catches EOutOfMemory.
In my opinion, there's no point attempting to continue from EOutOfMemory. In my experience, chances are exceedingly high that the heap will be corrupted and future errors can be expected. Usually, the safest course of action is to terminate the process.
In general, I agree that there's no point trying to recover. But it can be useful in specific circumstances. For example, allocating large amounts of memory that depend on user choices, and if it fails you can back out cleanly and let them retry with different settings. I do this for converting point clouds to 3D meshes, which includes some steps where the memory requirements aren't known in advance. It just takes careful coding of the steps you want to be recoverable, with an immediate and clean backout path. For example, some of my data structures are bitmaps or buffers with each line allocated separately to minimize issues with fragmented memory. The constructors have try... except handling and throw an EOutOfMemory exception, and the destructors free any of the lines that were already allocated. I can't guarantee it will always work, but it has worked well enough to be worth doing.
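The original is Delphi, but the back-out pattern itself is language-neutral; a rough C sketch of the same idea (allocate line by line, and on failure free whatever was already allocated before reporting the error to the caller):

    #include <stdlib.h>

    /* Rough sketch of "recoverable allocation with a clean back-out": each
     * line of a large buffer is allocated separately, and a failure part way
     * through releases everything already allocated instead of leaving the
     * heap in an unknown state.  The caller can then retry with smaller
     * settings.                                                             */
    char **alloc_lines(size_t line_count, size_t line_bytes)
    {
        char **lines = calloc(line_count, sizeof *lines);
        if (!lines)
            return NULL;

        for (size_t i = 0; i < line_count; i++) {
            lines[i] = malloc(line_bytes);
            if (!lines[i]) {                    /* out of memory: back out  */
                while (i > 0)
                    free(lines[--i]);
                free(lines);
                return NULL;
            }
        }
        return lines;
    }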

Memory related errors

I mostly work in the C language for my work. I have faced many issues and spent a lot of time debugging problems related to dynamically allocated memory being corrupted or overwritten, for example calling malloc() for A bytes but then writing more than A bytes. While reading up on this, I came across the following:
1.) An approach wherein one allocates more memory than is needed and writes some known value/pattern into the extra locations. During program execution that pattern should remain untouched; if it changes, that indicates memory corruption/overwriting. But how does this approach work? Does it mean that for every write through a pointer obtained from malloc() I should read the additional sentinel pattern and check its sanity? That would make my whole program very slow.
And saying that we can simply remove these checks from the release version of the code is not fruitful either, since memory-related issues are more likely to happen in a 'real scenario'. So how can we handle this?
2.) I have heard that there is something called a HEAP WALKER, which enables programs to detect memory-related issues. How can one enable this?
thank you.
-AD.
If you're working under Linux or OSX, have a look at Valgrind (free, available on OSX via Macports). For Windows, we're using Rational PurifyPlus (needs a license).
You can also have a look at Dmalloc, or even at Paul Nettle's memory manager, which helps track memory-allocation-related bugs.
If you're on Mac OS X, there's an awesome library called libgmalloc. libgmalloc places each memory allocation on a separate page. Any memory access/write beyond the page will immediately trigger a bus error. Note however that running your program with libgmalloc will likely result in a significant slowdown.
Memory guards can catch some heap corruption. It is slower (especially deallocations) but it's just for debug purposes and your release build would not include this.
Heap walking is platform-specific, but not necessarily very useful. The simplest check is to wrap your allocations and log them to a file with the __LINE__ and __FILE__ information in your debug build; most leaks will become apparent very quickly when you exit the program and the numbers don't tally up.
Search Google for __LINE__ and I am sure lots of results will show up.
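A rough sketch of what such a wrapper can look like, combining the guard-byte idea from the question above with the __FILE__/__LINE__ logging suggested here. Note the guard pattern is only verified inside the free wrapper, not on every write, so the run-time cost stays small:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Rough sketch of a guarded allocator for debug builds.  Layout:
     *   [size_t size][GUARD][user data of 'size' bytes][GUARD]
     * The guard pattern is written at allocation time and verified when the
     * block is freed - not on every write - so the run-time cost is small. */
    #define GUARD_SIZE 16
    static const unsigned char GUARD_BYTE = 0xA5;

    void *dbg_malloc(size_t size, const char *file, int line)
    {
        unsigned char *raw = malloc(sizeof(size_t) + 2 * GUARD_SIZE + size);
        if (!raw)
            return NULL;
        memcpy(raw, &size, sizeof size);
        memset(raw + sizeof(size_t), GUARD_BYTE, GUARD_SIZE);
        memset(raw + sizeof(size_t) + GUARD_SIZE + size, GUARD_BYTE, GUARD_SIZE);
        fprintf(stderr, "alloc %zu bytes at %s:%d\n", size, file, line);
        return raw + sizeof(size_t) + GUARD_SIZE;
    }

    void dbg_free(void *ptr, const char *file, int line)
    {
        unsigned char *raw = (unsigned char *)ptr - GUARD_SIZE - sizeof(size_t);
        size_t size;
        memcpy(&size, raw, sizeof size);
        for (size_t i = 0; i < GUARD_SIZE; i++) {
            if (raw[sizeof(size_t) + i] != GUARD_BYTE ||
                raw[sizeof(size_t) + GUARD_SIZE + size + i] != GUARD_BYTE) {
                fprintf(stderr, "guard corrupted, freed at %s:%d\n", file, line);
                abort();
            }
        }
        free(raw);
    }

    /* In debug builds you would route allocations through these, e.g.:
     *   #define MY_MALLOC(n)  dbg_malloc((n), __FILE__, __LINE__)
     *   #define MY_FREE(p)    dbg_free((p), __FILE__, __LINE__)            */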

How to get the root cause of a memory corruption in an embedded environment?

I have detected a memory corruption in my embedded environment (my program is running on a set-top box with a proprietary OS), but I couldn't find the root cause of it.
The memory corruption itself is detected after a stress test of launching and exiting an application multiple times. Given that I couldn't set a memory breakpoint (because the corrupted variable changes its address every time the application is launched), is there any way to catch the root cause of this corruption?
(A memory breakpoint is a breakpoint triggered when the environment changes the value of a given memory address.)
Note also that all my software is developed in the C language.
Thanks for your help.
These are always difficult problems on embedded systems and there is no easy answer. Some tips:
Look at the value the memory gets corrupted with. This can give a clear hint.
Look at the data structures next to your memory corruption.
See if there is a pattern in the memory corruption. Is it always at a similar address?
See if you can set up the memory breakpoint at run-time.
Does the embedded system allow memory areas to be sandboxed? Set-up sandboxes to safeguard your data memory.
Good luck!
Where is the data stored and how is it accessed by the two processes involved?
If the structure was allocated off the heap, try allocating a much larger block and putting large guard areas before and after the structure. This should give you an idea of whether it is one of the surrounding heap allocations which has overrun into the same allocation as your structure. If you find that the memory surrounding your structure is untouched, and only the structure itself is corrupted then this indicates that the corruption is being caused by something which has some knowledge of your structure's location rather than a random memory stomp.
If the structure is in a data section, check your linker map output to determine what other data exists in the vicinity of your structure. Check whether those have also been corrupted, introduce guard areas, and check whether the problem follows the structure if you force it to move to a different location. Again this indicates whether the corruption is caused by something with knowledge of your structure's location.
You can also test this by switching data from the heap into a data section or vice versa.
If you find that the structure is no longer corrupted after moving it elsewhere or introducing guard areas, you should check the linker map or track the heap to determine what other data is in the vicinity, and check accesses to those areas for buffer overflows.
You may find, though, that the problem does follow the structure wherever it is located. If this is the case then audit all of the code surrounding references to the structure. Check the contents before and after every access.
To check whether the corruption is being caused by another process or interrupt handler, add hooks to each task switch and before and after each ISR is called. The hook should check whether the contents have been corrupted. If they have, you will be able to identify which process or ISR was responsible.
If the structure is ever read onto a local process stack, try increasing the process stack and check that no array overruns etc have occurred. Even if not read onto the stack, it's likely that you will have a pointer to it on the stack at some point. Check all sub-functions called in the vicinity for stack issues or similar that could result in the pointer being used erroneously by unrelated blocks of code.
Also consider whether the compiler or RTOS may be at fault. Try turning off compiler optimisation, and failing that inspect the code generated. Similarly consider whether it could be due to a faulty context switch in your proprietary RTOS.
Finally, if you are sharing the memory with another hardware device or CPU and you have data cache enabled, make sure you take care of this through using uncached accesses or similar strategies.
Yes these problems can be tough to track down with a debugger.
A few ideas:
Do regular code reviews (not fast at tracking down a specific bug, but valuable for catching such problems in general)
Comment-out or #if 0 out sections of code, then run the cut-down application. Try commenting-out different sections to try to narrow down in which section of the code the bug occurs.
If your architecture allows you to easily disable certain processes/tasks from running, by the process of elimination perhaps you can narrow down which process is causing the bug.
If your OS is cooperatively multitasking, e.g. round-robin (this would be too hard, I think, for preemptive multitasking): add code at the end of the task that "owns" the structure to save a "check" of the structure. That check could be a memcpy (if you have the time and space) or a CRC. Then, after every other task runs, add some code to verify the structure against the saved check. This will detect any changes, as sketched below.
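A rough sketch of that idea; shared_state, current_task_name() and report_corruption() are hypothetical placeholders for whatever your system actually provides:

    #include <stddef.h>
    #include <stdint.h>

    /* Rough sketch of the "save a check, verify after every other task" idea. */
    struct shared_state { int mode; uint32_t counters[8]; };
    extern struct shared_state shared_state;            /* hypothetical       */
    extern const char *current_task_name(void);         /* hypothetical       */
    extern void report_corruption(const char *task);    /* hypothetical       */

    static uint32_t saved_check;

    static uint32_t checksum_of(const void *data, size_t len)
    {
        /* Any checksum will do for this purpose; a trivial one is shown.     */
        const unsigned char *p = data;
        uint32_t sum = 0;
        while (len--)
            sum = (sum << 1) ^ (sum >> 31) ^ *p++;
        return sum;
    }

    /* Called at the end of the task that legitimately owns the structure.    */
    void owner_task_done(void)
    {
        saved_check = checksum_of(&shared_state, sizeof shared_state);
    }

    /* Called after every other task runs (scheduler hook).                   */
    void after_other_task(void)
    {
        if (checksum_of(&shared_state, sizeof shared_state) != saved_check)
            report_corruption(current_task_name());
    }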
I'm assuming by your question you mean that you suspect some part of the proprietary code is causing the problem.
I have dealt with a similar issue in the past using what a colleague so tastefully calls a "suicide note". I would allocate a buffer capable of storing a number of copies of the structure that is being corrupted. I would use this buffer like a circular list, storing a copy of the current state of the structure at regular intervals. If corruption was detected, the "suicide note" would be dumped to a file or to serial output. This would give me a good picture of what was changed and how, and by increasing the logging frequency I was able to narrow down the corrupting action.
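A rough sketch of such a "suicide note" buffer; struct watched and the dump format are placeholders:

    #include <stdio.h>

    /* Rough sketch of the circular "suicide note" buffer: snapshots of the
     * structure are taken at regular intervals and dumped when corruption is
     * detected.  struct watched is a hypothetical placeholder.               */
    #define NOTE_SLOTS 32

    struct watched { int state; unsigned counters[16]; };
    extern struct watched watched;          /* the structure being corrupted  */

    static struct watched note[NOTE_SLOTS];
    static unsigned note_head;

    void note_snapshot(void)                /* call this at regular intervals */
    {
        note[note_head % NOTE_SLOTS] = watched;
        note_head++;
    }

    void note_dump(void)                    /* call this when corruption hits */
    {
        for (unsigned i = 0; i < NOTE_SLOTS && i < note_head; i++) {
            const struct watched *w = &note[(note_head - 1 - i) % NOTE_SLOTS];
            printf("snapshot -%u: state=%d first counter=%u\n",
                   i, w->state, w->counters[0]);
        }
    }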
Depending on your OS, you may be able to react to detected corruption by looking at all running processes and seeing which ones are currently holding a semaphore (you are using some kind of access control mechanism with shared memory, right?). By taking snapshots of this data too, you perhaps can log the culprit grabbing the lock before corrupting your data. Along the same lines, try holding the lock to the shared memory region for an absurd length of time and see if the offending program complains. Sometimes they will give an error message that has important information that can help your investigation (for example, line numbers, function names, or code offsets for the offending program).
If you feel up to doing a little linker kung fu, you can most likely specify the address of any statically-allocated data with respect to the program's starting address. This might give you a consistent-enough memory address to set a memory breakpoint.
Unfortunately, this sort of problem is not easy to debug, especially if you don't have the source for one or more of the programs involved. If you can get enough information to understand just how your data is being corrupted, you may be able to adjust your structure to anticipate and expect the corruption (sometimes needed when working with code that doesn't fully comply with a specification or a standard).
You detect memory corruption. Could you be more specific about how? Is it a crash with a core dump, for example?
Normally the OS will completely free all resources and handles your program holds when the program exits, gracefully or otherwise. Even proprietary OSes manage to get this right, although it's not a given.
So an intermittent problem could seem to be triggered by stress but just be chance, or could lie in the initialisation of drivers or other processes the program communicates with, or could be bad error handling around, say, memory allocations that fail when the OS itself is under stress, e.g. lazy tidying-up of closed programs.
Printfs in custom malloc/realloc/free proxy functions, or even an Electric Fence-style custom allocator, might help if it's as simple as a buffer overflow.
Use memory-allocation debugging tools like Electric Fence, dmalloc, etc. At minimum they can catch simple errors and most moderately complex ones (overruns, underruns, and in some cases even writes or reads after free). My personal favorite is dmalloc.
A proprietary OS might limit your options a bit. One thing you might be able to do is run the problem code on a desktop machine (assuming you can stub out the hardware-specific code), and use the more-sophisticated tools available there (i.e. guardmalloc, electric fence).
The C library that you're using may include some routines for detecting heap corruption (glibc does, for instance). Turn those on, along with whatever tracing facilities you have, so you can see what was happening when the heap was corrupted.
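If the C library happens to be glibc, one such facility is mcheck()/mprobe() (the MALLOC_CHECK_ environment variable is a lighter-weight variant that needs no recompilation). A rough, glibc-specific sketch:

    #include <mcheck.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Rough sketch of glibc's built-in heap consistency checking.  mcheck()
     * must be called before the first malloc(); passing NULL installs the
     * default handler, which prints a diagnostic and aborts on corruption.   */
    int main(void)
    {
        mcheck(NULL);

        char *buf = malloc(16);
        buf[16] = 'X';                       /* one-byte overrun              */

        if (mprobe(buf) != MCHECK_OK)        /* explicit consistency check    */
            fprintf(stderr, "heap block is corrupted\n");

        free(buf);                           /* checks also run automatically */
        return 0;                            /* on free()                     */
    }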
First, I am assuming you are on a bare-metal chip that isn't running Linux or some other POSIX-capable OS (if you are, there are much better techniques available, such as Valgrind and ASan).
Here are a couple of tips for tracking down embedded memory corruption:
Use JTAG or similar to set a memory watchpoint on the area of memory that is being corrupted. You might be able to catch the moment when memory is accidentally written there, as opposed to a correct write. Many JTAG debuggers include IDE plugins that let you get stack traces as well.
In your hard fault handler, try to generate a call stack that you can print, so you can get a rough idea of where the code is crashing. Note that since memory corruption can occur some time before the crash actually happens, the stack traces you get are unlikely to be helpful on their own, but combined with the techniques below they will be. Generating a backtrace on bare metal can be a very difficult task, though; if you happen to be using a Cortex-M line processor, check out https://github.com/armink/CmBacktrace, or try searching the web for advice on generating a back/stack trace for your particular chip.
If your compiler supports it, use stack canaries to detect and immediately crash if something writes over the stack. For details, search the web for "Stack Protector" for GCC or Clang.
If you are running on a chip that has an MPU, such as an ARM Cortex-M3, you can use the MPU to write-protect the region of memory that is being corrupted, or a small region of memory right before it. This will cause the chip to crash at the moment of the corruption rather than much later; a rough sketch follows below.
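A rough sketch of that last MPU idea for an ARMv7-M part with CMSIS headers (assuming the device actually has an MPU); the region size/alignment rules and the RASR field encodings should be double-checked against the chip's reference manual:

    #include <stdint.h>
    #include "core_cm3.h"   /* normally pulled in via your device's CMSIS header */

    /* Rough sketch: use MPU region 7 to make a 256-byte, 256-byte-aligned
     * area read-only, so the write that corrupts it faults immediately.
     * Memory attribute bits (TEX/C/B/S) are left at zero here; set them to
     * match your RAM.                                                        */
    void protect_region(uint32_t base_addr)
    {
        __DMB();

        MPU->RNR  = 7;                               /* pick a free region    */
        MPU->RBAR = base_addr & ~0xFFu;              /* 256-byte aligned base */
        MPU->RASR = (0x6u << 24)    /* AP = read-only, privileged + user      */
                  | (7u   << 1)     /* SIZE: 2^(7+1) = 256 bytes              */
                  | 1u;             /* region enable                          */

        /* Keep the default map for everything else, enable MemManage faults. */
        MPU->CTRL   = MPU_CTRL_PRIVDEFENA_Msk | MPU_CTRL_ENABLE_Msk;
        SCB->SHCSR |= SCB_SHCSR_MEMFAULTENA_Msk;

        __DSB();
        __ISB();
    }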

Resources