Does it make sense to continue execution after catching an EOutOfMemory exception, or are the heap and stack corrupted with high probability by that point?
I don't mean EOutOfMemory caused by earlier memory corruption from bugs like writing through a wild pointer; I mean correct code that calls GetMem and catches EOutOfMemory.
In my opinion, there's no point attempting to continue from EOutOfMemory. In my experience, chances are exceedingly high that the heap will be corrupted and future errors can be expected. Usually, the safest course of action is to terminate the process.
In general, I agree that there's no point trying to recover. But it can be useful in specific circumstances, for example when allocating large amounts of memory that depend on user choices: if the allocation fails, you can back out cleanly and let the user retry with different settings. I do this for converting point clouds to 3D meshes, which includes some steps whose memory requirements aren't known in advance.
It just takes careful coding of the steps you want to be recoverable, with an immediate and clean backout path. For example, some of my data structures are bitmaps or buffers with each line allocated separately, to minimize problems with fragmented memory. The constructors have try..except handling and raise an EOutOfMemory exception, and the destructors free any of the lines that were already allocated. I can't guarantee it will always work, but it has worked well enough to be worth doing.
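For what it's worth, here is a minimal sketch of the same back-out pattern in C (the original is Delphi with try..except; all names here are hypothetical):

#include <stdlib.h>

typedef struct {
    size_t height, width;
    unsigned char **lines;           /* one separately allocated row per line */
} LineBuffer;

LineBuffer *linebuffer_create(size_t height, size_t width) {
    LineBuffer *b = calloc(1, sizeof *b);
    if (!b) return NULL;
    b->lines = calloc(height, sizeof *b->lines);
    if (!b->lines) { free(b); return NULL; }
    b->height = height;
    b->width = width;
    for (size_t i = 0; i < height; i++) {
        b->lines[i] = malloc(width);
        if (!b->lines[i]) {          /* out of memory: free the rows already made */
            while (i > 0) free(b->lines[--i]);
            free(b->lines);
            free(b);
            return NULL;             /* caller backs out and lets the user retry */
        }
    }
    return b;
}

void linebuffer_destroy(LineBuffer *b) {
    if (!b) return;
    for (size_t i = 0; i < b->height; i++) free(b->lines[i]);
    free(b->lines);
    free(b);
}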
General Background:
I am attempting to analyze a dump in which heap corruption occurs. The corruption takes place in std::vector::push_back: when the vector's capacity is exceeded and more space is required, the call to free the old buffer fails.
Analysis details:
From the analysis of the dump, I've seen that the pointer being freed points into the middle of an existing HEAP_ENTRY block. I found this by enumerating all the blocks of the relevant heap using "!heap -h " and observing that the freed block lies between two existing blocks (the gap between them is significant, certainly more than the 8-16 bytes of metadata or anything of that sort).
Questions:
1. Can a previous heap corruption cause the heap manager to return an address in the middle of an existing block, thus necessarily causing a crash when I attempt to free it?
2. If 1. is true, that means pageheap isn't very useful here, because the corruption seems to take place in data that is always writable, so I don't think pageheap (the gflags option) will be able to detect it. Do you have any suggestions for how I might catch the point at which this kind of corruption occurs?
Thanks a lot,
Amit
You would be best off using tools to track down better clues about what is going wrong.
Valgrind is quite good.
Some operating systems have built-in malloc() diagnostics that can be enabled on the fly via environment variables with no additional effort (glibc, for example, honors MALLOC_CHECK_). Check the manual page for malloc() on your system.
I'd guess it's a stale pointer: it sounds like it was probably validly allocated in the past, but presumably also freed in the past and then allocated over by a larger block.
In that scenario the problem is a double free. If your code is multi-threaded, it could be a thread-safety problem leading to stale pointers.
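To illustrate the suspected pattern, here is a deliberately broken C sketch of a stale pointer turning into a double free; in a dump it can look exactly like freeing the middle of a live HEAP_ENTRY:

#include <stdlib.h>
#include <string.h>

int main(void) {
    char *p = malloc(32);
    free(p);                    /* p is now stale */
    char *big = malloc(4096);   /* allocator may reuse and extend past p's old spot */
    if (big) memset(big, 0xCD, 4096);
    free(p);                    /* BUG: double free through the stale pointer; the
                                   address now sits inside big's live allocation   */
    free(big);
    return 0;
}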
Say, for instance, I write a program which allocates a bunch of large objects when it is initialized. Then the program runs for a while, perhaps indefinitely, and when it's time to terminate, each of the large initialized objects is freed.
So my question is: will it take longer to manually deallocate each block of memory separately at the end of the program's life, or would it be better to let the system unload the program and deallocate all of the virtual memory given to the program by the system at the same time?
Would it be safe and/or faster? Also, if it is safe, does the compiler do this anyway when set to optimise?
1) Not all systems will free memory for you when the application terminates. Of course, most modern desktop systems do, so if your program will only ever run on Linux, Mac, or Windows, you can leave the deallocation to the system.
2) Often you need to perform some operations on the data at termination, not just free the memory. So if you choose a program design that makes manual deallocation at the end hard, you may later find you need to run some code before exiting and face a difficult problem.
2') Even if you think your program will need some objects all the way until it dies, you may later want to turn your program into a library, or change the project to load and unload the big objects, and a poor design will make that hard or impossible.
3) Moreover, deallocation performance in your program depends on the allocator implementation you use, while system deallocation depends on the system's memory management, of which even a single system may have several implementations. So if you run into allocation/deallocation performance problems, you are better off developing a better allocator than hoping the system will save you.
4) So my opinion is: when you deallocate memory manually at the end, you are always on the right path. When you don't, you may get some dubious benefit in a few cases, but sooner or later you will likely just run into problems.
Well, most OSes will free the memory when the program exits, but the bigger question is why you would want it to have to.
Is it faster? Hard to say with memory sometimes. I would guess not really and definitely not worth breaking good coding practices anyway.
Is it safe? Define safe... Will your OS crash? Probably not. Will your code be susceptible to memory leaks or other problems? Absolutely, it will. In fact you are basically telling it you want memory leaks.
Best practice is to always free your memory when you are done with it. With C and C++, every malloced or new block of memory should have a corresponding free or delete.
It is a bad idea to rely on the OS to free your memory because it not only makes your code look bad and makes it less portable, but if the program was ever integrated into another program, then you will likely be tracking down memory leaks for hours.
So, short answer, always do it manually.
Programs with a short maintenance lifetime are good candidates for memory deallocation by "exit() and let the kernel sort it out." However, if the program will last more than a few months, you have to consider the maintenance burden.
For instance, consider that someone may realize that a subsequent stage is required in the program, and some of the data is not needed, or not needed in memory. They now have to go and find out how to deallocate the memory, properly removing stale references, etc.
Sometimes I want to use lua_pushstring after I have allocated some resources that I would need to clean up in case of failure. However, as the documentation seems to imply, lua_push* functions can always fail with an out-of-memory error. But that error immediately exits my C scope and doesn't allow me to clean up whatever I might have temporarily allocated and would have to free on error.
Example code to illustrate the situation:
void* blubb = malloc(20);
...some other things happening here...
lua_pushstring(L, "test"); //how to do this call safely so I can still take care of blubb?
...possibly more things going on here...
free(blubb);
Is there a way I can check beforehand whether such an error would happen, and then avoid pushing and trigger my own error once I have safely cleaned up my own resources? Or can I somehow simply deactivate the setjmp and check some "magic variable" after doing the push to see whether it actually worked or raised an error?
I considered pcall'ing my own function, but even just pushing onto the stack the function I want to call safely through pcall can itself run out of memory, can't it?
To clear things up, I am specifically asking this for combined use with custom memory allocators that will prevent Lua from allocating too much memory, so assume this is not a case where the whole system has run out of memory.
Unless you registered a user-defined memory handler with Lua when you created your Lua state, getting an out-of-memory error means that your entire application has run out of memory. Recovery from this state is generally not possible, or at least not feasible in a lot of cases. It could be, depending on your application, but probably not.
In short, if it ever comes up, you've got bigger things to be concerned about ;)
The only kind of cleanup that should affect you is for things external to your application: some process-global memory you need to free or some state you need to set, or interprocess communication through something like a memory-mapped file. Or something like that.
Otherwise, it's probably better to just kill your process.
You could build Lua as a C++ library. When you do that, errors become actual exceptions, which you can either catch or just use RAII objects to handle.
If you're stuck with C... well, there's not much you can do.
I am specifically interested in a custom allocator that will report out-of-memory much earlier, to keep Lua from eating too much memory.
Then you should handle it another way. To signal an out-of-memory error is basically to say, "I want Lua to terminate right now."
The way to stop Lua from eating memory is to periodically check the Lua state's memory, and garbage collect it if it's using too much. And if that doesn't free up enough memory, then you should terminate the Lua state manually, but only when it is safe to do so.
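A hedged sketch of that periodic check using the standard lua_gc interface (the limits are arbitrary examples):

#include <lua.h>

#define SOFT_LIMIT_KB (8 * 1024)     /* collect above 8 MB  */
#define HARD_LIMIT_KB (16 * 1024)    /* give up above 16 MB */

/* Call periodically from the host's main loop. Returns 0 if the
   state should be terminated at the next safe point. */
int check_lua_memory(lua_State *L) {
    if (lua_gc(L, LUA_GCCOUNT, 0) > SOFT_LIMIT_KB)
        lua_gc(L, LUA_GCCOLLECT, 0);            /* full collection */
    return lua_gc(L, LUA_GCCOUNT, 0) <= HARD_LIMIT_KB;
}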
lua_atpanic() may be one solution for you, depending on the kind of cleanup you need to do. It will never throw an error.
In your specific example you could also create blubb as a userdata. Then Lua would free it automatically once it becomes unreachable (for example, after it has left the stack and a collection runs).
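Using plain lua_newuserdata, the earlier example might become something like this sketch:

/* Let Lua own the 20-byte buffer. If any later push longjmps out,
   the GC reclaims blubb; nothing leaks. */
void *blubb = lua_newuserdata(L, 20);   /* replaces malloc(20) */
lua_pushstring(L, "test");              /* an error here can no longer leak blubb */
/* ... use blubb while the userdata stays on the stack (or is anchored
   in the registry) so the GC can't collect it out from under you ... */
lua_pop(L, 2);                          /* done: drop the string and the userdata */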
I have recently gotten into some more Lua sandboxing again, and now I think the answer I accepted previously is a bad idea. I have given this some more thought:
Why periodic checking is not enough
Periodically checking for large memory consumption and terminating Lua "only when it is safe to do so" seems like a bad idea once you consider that a single huge table can eat up a lot of your memory with one VM instruction, and you only find out after it has happened. By then your program might already be dying from it, when you could have avoided the whole situation by stopping that allocation in time in the first place.
Lua already has a nice out-of-memory error built in, so I would like to use it: it allows me to do the minimal required thing (preventing the script from allocating more, while possibly allowing it to recover) without my C code breaking.
Therefore my current plan for Lua sandboxing with memory limit is:
Use a custom allocator that returns NULL once a limit is reached (see the sketch after this list)
Design all C functions to be able to handle this without memory leak or other breakage
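Here is a hedged sketch of such a limiting allocator; the cap value and names are made up, and returning NULL is what makes Lua raise its usual out-of-memory error:

#include <stdlib.h>
#include <lua.h>

typedef struct { size_t used, limit; } MemCap;

static void *capped_alloc(void *ud, void *ptr, size_t osize, size_t nsize) {
    MemCap *cap = ud;
    size_t old = ptr ? osize : 0;   /* for fresh allocations, osize is only a type tag */
    if (nsize == 0) {               /* free requests must always succeed */
        cap->used -= old;
        free(ptr);
        return NULL;
    }
    if (cap->used - old + nsize > cap->limit)
        return NULL;                /* over budget: Lua raises its out-of-memory error */
    void *p = realloc(ptr, nsize);
    if (p) cap->used = cap->used - old + nsize;
    return p;
}

/* Usage: MemCap cap = { 0, 32u * 1024 * 1024 };
          lua_State *L = lua_newstate(capped_alloc, &cap); */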
But how to design the C functions safely?
How can I do that, given that lua_pushstring and the others can always setjmp away with an error without me knowing in advance whether that will happen? (This was originally my question.)
I think I found a working approach:
I added a facility to register pointers when I allocate them and unregister them when I am done with them. This means that if Lua suddenly longjmps me out of my C code without a chance to clean up, I have everything I need in a global list to clean up the mess later, once I'm back in control.
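A hedged sketch of that registry (fixed-size for brevity; all names are made up). The earlier example then becomes void *blubb = track(malloc(20)); ... untrack(blubb); free(blubb);, with free_pending() called from the error path:

#include <stdlib.h>

#define MAX_PENDING 64
static void *pending[MAX_PENDING];
static int npending = 0;

/* Wrap every temporary allocation made while Lua could longjmp. */
static void *track(void *p) {
    if (p && npending < MAX_PENDING) pending[npending++] = p;
    return p;
}

/* Call once the allocation has been freed (or ownership handed off). */
static void untrack(void *p) {
    for (int i = 0; i < npending; i++)
        if (pending[i] == p) { pending[i] = pending[--npending]; return; }
}

/* Call from the error path, back in controlled code after a longjmp. */
static void free_pending(void) {
    while (npending > 0) free(pending[--npending]);
}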
Is that ugly or what?
Yes, it is quite the hack. But it will most likely work, and unlike periodic checking it will actually give me a true hard limit and keep the application itself out of trouble in the face of an aggressive attack.
I have detected memory corruption in my embedded environment (my program runs on a set-top box with a proprietary OS), but I couldn't get at its root cause.
The memory corruption itself is detected after a stress test of launching and exiting an application multiple times. Given that I couldn't set a memory breakpoint (the corrupted variable changes its address every time the application is launched), is there any way to catch the root cause of this corruption?
(A memory breakpoint is a breakpoint triggered when the environment changes the value of a given memory address.)
Note also that all my software is written in C.
Thanks for your help.
These are always difficult problems on embedded systems and there is no easy answer. Some tips:
Look at the value the memory gets corrupted with. This can give a clear hint.
Look at the data structures next to your memory corruption.
See if there is a pattern to the memory corruption. Is it always at a similar address?
See if you can set up the memory breakpoint at run-time.
Does the embedded system allow memory areas to be sandboxed? Set up sandboxes to safeguard your data memory.
Good luck!
Where is the data stored and how is it accessed by the two processes involved?
If the structure was allocated off the heap, try allocating a much larger block and putting large guard areas before and after the structure. This should give you an idea of whether it is one of the surrounding heap allocations which has overrun into the same allocation as your structure. If you find that the memory surrounding your structure is untouched, and only the structure itself is corrupted then this indicates that the corruption is being caused by something which has some knowledge of your structure's location rather than a random memory stomp.
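A hedged sketch of the guard-area idea (the fill pattern and guard size are arbitrary; free with free((unsigned char *)p - GUARD)):

#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD 256                         /* guard bytes before and after */
#define FILL  0xAA                        /* recognizable fill pattern    */

void *guarded_alloc(size_t n) {
    unsigned char *raw = malloc(n + 2 * GUARD);
    if (!raw) return NULL;
    memset(raw, FILL, GUARD);             /* front guard */
    memset(raw + GUARD + n, FILL, GUARD); /* rear guard  */
    return raw + GUARD;
}

/* Call periodically: a trashed guard means a neighbor overran, while
   intact guards around corrupted contents point at targeted writes. */
void guarded_check(const void *p, size_t n) {
    const unsigned char *raw = (const unsigned char *)p - GUARD;
    for (size_t i = 0; i < GUARD; i++) {
        assert(raw[i] == FILL);
        assert(raw[GUARD + n + i] == FILL);
    }
}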
If the structure is in a data section, check your linker map output to determine what other data exists in the vicinity of your structure. Check whether those have also been corrupted, introduce guard areas, and check whether the problem follows the structure if you force it to move to a different location. Again this indicates whether the corruption is caused by something with knowledge of your structure's location.
You can also test this by moving data from the heap into a data section, or vice versa.
If you find that the structure is no longer corrupted after moving it elsewhere or introducing guard areas, you should check the linker map or track the heap to determine what other data is in the vicinity, and check accesses to those areas for buffer overflows.
You may find, though, that the problem does follow the structure wherever it is located. If this is the case then audit all of the code surrounding references to the structure. Check the contents before and after every access.
To check whether the corruption is being caused by another process or interrupt handler, add hooks to each task switch and before and after each ISR is called. The hook should check whether the contents have been corrupted. If they have, you will be able to identify which process or ISR was responsible.
If the structure is ever read onto a local process stack, try increasing the process stack and check that no array overruns etc have occurred. Even if not read onto the stack, it's likely that you will have a pointer to it on the stack at some point. Check all sub-functions called in the vicinity for stack issues or similar that could result in the pointer being used erroneously by unrelated blocks of code.
Also consider whether the compiler or RTOS may be at fault. Try turning off compiler optimisation, and failing that, inspect the generated code. Similarly, consider whether it could be due to a faulty context switch in your proprietary RTOS.
Finally, if you are sharing the memory with another hardware device or CPU and you have data cache enabled, make sure you take care of this through using uncached accesses or similar strategies.
Yes, these problems can be tough to track down with a debugger.
A few ideas:
Do regular code reviews (not fast at tracking down a specific bug, but valuable for catching such problems in general)
Comment out or #if 0 out sections of code, then run the cut-down application. Try commenting out different sections to narrow down the part of the code in which the bug occurs.
If your architecture allows you to easily disable certain processes/tasks from running, by the process of elimination perhaps you can narrow down which process is causing the bug.
If your OS uses cooperative multitasking, e.g. round-robin (I think this would be too hard with preemptive multitasking): add code to the end of the task that "owns" the structure to save a "check" of the structure. That check could be a full copy via memcpy (if you have the time and space) or a CRC. Then, after every other task runs, add code to verify the structure against the saved check. This will detect any changes, as the sketch below shows.
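A hedged sketch, with a simple rotate-xor checksum standing in for a real CRC and a made-up structure type:

#include <stddef.h>
#include <stdint.h>

typedef struct { int mode; char name[16]; } SharedState;  /* hypothetical */
static SharedState g_state;      /* stand-in for the structure being corrupted */
static uint32_t g_saved_sum;

static uint32_t checksum(const void *p, size_t n) {
    const unsigned char *b = p;
    uint32_t sum = 0;
    while (n--) sum = ((sum << 1) | (sum >> 31)) ^ *b++;  /* rotate-xor */
    return sum;
}

/* The owning task calls this when it finishes running. */
void save_check(void)  { g_saved_sum = checksum(&g_state, sizeof g_state); }

/* Called after every other task runs; nonzero means someone else
   modified the structure behind the owner's back. */
int verify_check(void) { return checksum(&g_state, sizeof g_state) != g_saved_sum; }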
I'm assuming by your question you mean that you suspect some part of the proprietary code is causing the problem.
I have dealt with a similar issue in the past using what a colleague so tastefully calls a "suicide note". I would allocate a buffer capable of storing a number of copies of the structure that is being corrupted. I would use this buffer like a circular list, storing a copy of the current state of the structure at regular intervals. If corruption was detected, the "suicide note" would be dumped to a file or to serial output. This would give me a good picture of what was changed and how, and by increasing the logging frequency I was able to narrow down the corrupting action.
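A hedged sketch of such a circular buffer, with a made-up structure type standing in for the real one:

#include <stdio.h>

typedef struct { int mode; char name[16]; } SharedState;  /* hypothetical */

#define HISTORY 32
static SharedState note[HISTORY];   /* circular list of snapshots */
static unsigned note_head;

/* Call at a regular interval while the structure is still healthy. */
void note_snapshot(const SharedState *s) {
    note[note_head++ % HISTORY] = *s;
}

/* Call when corruption is detected: dump the history, oldest first. */
void note_dump(FILE *out) {
    for (unsigned i = 0; i < HISTORY; i++) {
        const SharedState *s = &note[(note_head + i) % HISTORY];
        fprintf(out, "mode=%d name=%.16s\n", s->mode, s->name);
    }
}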
Depending on your OS, you may be able to react to detected corruption by looking at all running processes and seeing which ones are currently holding a semaphore (you are using some kind of access control mechanism with shared memory, right?). By taking snapshots of this data too, you perhaps can log the culprit grabbing the lock before corrupting your data. Along the same lines, try holding the lock to the shared memory region for an absurd length of time and see if the offending program complains. Sometimes they will give an error message that has important information that can help your investigation (for example, line numbers, function names, or code offsets for the offending program).
If you feel up to doing a little linker kung fu, you can most likely specify the address of any statically-allocated data with respect to the program's starting address. This might give you a consistent-enough memory address to set a memory breakpoint.
Unfortunately, this sort of problem is not easy to debug, especially if you don't have the source for one or more of the programs involved. If you can get enough information to understand just how your data is being corrupted, you may be able to adjust your structure to anticipate and expect the corruption (sometimes needed when working with code that doesn't fully comply with a specification or a standard).
You detected memory corruption. Could you be more specific about how? Is it a crash with a core dump, for example?
Normally the OS will completely free all resources and handles your program holds when the program exits, gracefully or otherwise. Even proprietary OSes manage to get this right, although it's not a given.
So an intermittent problem could merely appear to be triggered by stress and actually be chance; or it could lie in the initialisation of drivers or other processes the program communicates with; or it could be bad error handling around, say, memory allocations that fail when the OS itself is under stress, e.g. from lazy tidying-up of closed programs.
Printfs in custom malloc/realloc/free proxy functions, or even an Electric Fence-style custom allocator, might help if it's as simple as a buffer overflow.
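For instance, a hedged sketch of printf-instrumented proxies (the macro trick is standard; the names are made up). Define the macros after the functions so the proxies still call the real allocator:

#include <stdio.h>
#include <stdlib.h>

static void *log_malloc(size_t n, const char *file, int line) {
    void *p = malloc(n);
    printf("malloc(%zu) -> %p  [%s:%d]\n", n, p, file, line);
    return p;
}

static void log_free(void *p, const char *file, int line) {
    printf("free(%p)        [%s:%d]\n", p, file, line);
    free(p);
}

#define malloc(n) log_malloc((n), __FILE__, __LINE__)
#define free(p)   log_free((p), __FILE__, __LINE__)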
Use memory-allocation debugging tools like ElectricFence, dmalloc, etc - at minimum they can catch simple errors and most moderately-complex ones (overruns, underruns, even in some cases write (or read) after free), etc. My personal favorite is dmalloc.
A proprietary OS might limit your options a bit. One thing you might be able to do is run the problem code on a desktop machine (assuming you can stub out the hardware-specific code), and use the more-sophisticated tools available there (i.e. guardmalloc, electric fence).
The C library that you're using may include some routines for detecting heap corruption (glibc does, for instance). Turn those on, along with whatever tracing facilities you have, so you can see what was happening when the heap was corrupted.
First, I am assuming you are on a bare-metal chip that isn't running Linux or some other POSIX-capable OS (if you are, there are much better techniques available, such as Valgrind and ASan).
Here are a couple of tips for tracking down embedded memory corruption:
Use JTAG or similar to set a memory watchpoint on the area of memory that is being corrupted. You might be able to catch the moment the memory is written accidentally, as opposed to by a correct write. Many JTAG debuggers include IDE plugins that can give you stack traces as well.
In your hard-fault handler, try to generate a call stack that you can print, so you get a rough idea of where the code is crashing. Note that since the corruption can occur some time before the crash, the stack traces you get at that point are unlikely to be helpful on their own, but combined with the techniques below they will be. Generating a backtrace on bare metal can be a very difficult task, though; if you happen to be using a Cortex-M line processor, check out https://github.com/armink/CmBacktrace, or try searching the web for advice on generating a back/stack trace for your particular chip.
If your compiler supports it, use stack canaries to detect writes over the stack and crash immediately when one occurs; for details, search the web for "Stack Protector" for GCC or Clang (the -fstack-protector family of flags).
If you are running on a chip that has an MPU, such as an ARM Cortex-M3, you can use the MPU to write-protect the region of memory that is being corrupted, or a small region right before it. This will make the chip fault at the moment of the corruption rather than much later.
I have a VPS without very much memory (256 MB) which I am trying to use for Common Lisp development with SBCL+Hunchentoot, to write some simple web apps. A large amount of memory appears to be used without doing anything particularly complex, and after a while of serving pages it runs out of memory and either goes crazy using all swap or (if there is no swap) just dies.
So I need help to:
Find out what is using all the memory (if it's libraries or me, especially)
Limit the amount of memory which SBCL is allowed to use, to avoid massive quantities of swapping
Handle things cleanly when memory runs out, rather than crashing (since it's a web-app I want it to carry on and try to clean up).
I assume the first two are reasonably straightforward, but is the third even possible?
How do people handle out-of-memory or constrained memory conditions in Lisp?
(Also, I note that a 64-bit SBCL appears to use literally twice as much memory as 32-bit. Is this expected? I can run a 32-bit version if it will save a lot of memory)
To limit the memory usage of SBCL, use the --dynamic-space-size option (e.g., sbcl --dynamic-space-size 128 will limit memory usage to 128 MB).
To find out who is using the memory, you can call (room) (the function that reports how much memory is being used) at different times: at startup, after all libraries are loaded, and then during work. (Of course, call (sb-ext:gc :full t) before room so you don't measure garbage that has not yet been collected.)
It is also possible to use the SBCL profiler to measure memory allocation.
"Find out what is using all the memory (if it's libraries or me, especially)"
Attila Lendvai has some SBCL-specific code to find out where an allocated object comes from. Refer to http://article.gmane.org/gmane.lisp.steel-bank.devel/12903 and write him a private mail if needed.
Be sure to try another implementation, preferably with a precise GC (like Clozure CL) to ensure it's not an implementation-specific leak.
"Limit the amount of memory which SBCL is allowed to use, to avoid massive quantities of swapping"
Already answered by others.
"Handle things cleanly when memory runs out, rather than crashing (since it's a web-app I want it to carry on and try to clean up)."
256 MB is tight, but anyway: schedule a recurring (maybe once per second) timed thread that checks the remaining free space. If the free space drops below X, use exec() to replace the current SBCL process image with a fresh one.
If you don't have any type declarations, I would expect 64-bit Lisp to take twice the space of a 32-bit one. Even a plain (small) int will use a 64-bit chunk of memory. I don't think it'll use less than a machine word, unless you declare it.
I can't help with #2 and #3, but if you figure out #1, I suspect it won't be a problem. I've seen SBCL/Hunchentoot instances running for ages. If I'm using an outrageous amount of memory, it's usually my own fault. :-)
I would not be surprised by a 64-bit SBCL using twice the memory, as it will probably use a 64-bit cell rather than a 32-bit one, but I couldn't say for sure without actually checking.
Typical things that keep memory hanging around longer than expected are no-longer-useful references that still have a path to the root set (hash tables are, I find, a good way of letting these things linger). You could try interspersing explicit calls to the GC in your code and making sure (as far as possible) not to store things in global variables.