Trouble reading memory

When I run my code through the debugger, after a series of steps it eventually gets lost and executes commands out of order. I'm not sure if the stack is overflowing or what.
This is the error I usually get:
MSP430: Trouble Reading Memory Block at 0xffe2e on Page 0 of Length 0x1d2: Invalid parameter(s)
Any suggestions on what it could be? I read briefly about possible issues with not handling some interrupts.
Also, I'm trying to fill my RAM with a specific value so that I can tell if the stack is overflowing. Any suggestions on how to fill the entire RAM with, say, a value of 0x1234?
Thanks!

What debugger and compiler are you using? I've found that msp430-gcc and msp430-gdb/gdbproxy can get very confused with GCC optimizations turned on. However, broken code is sometimes emitted even without them turned on (it's a quality product, really).
The easiest way to fill memory is to modify your crt0.s startup file and link it yourself. Where memory is set to 0 at startup, you can change the pattern there.
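For msp430-gcc, the fill can also be done from a small C routine called from the startup code. A minimal sketch, assuming the common linker-script symbols _end and __stack (names vary by toolchain):

    /* Sketch for msp430-gcc: fill unused RAM with 0x1234 so the stack's
       high-water mark is visible in the debugger's memory view.
       _end and __stack are linker-script symbols; names may differ. */
    extern unsigned char _end;    /* first free address after .data/.bss */
    extern unsigned char __stack; /* initial stack pointer (top of RAM)  */

    void fill_ram_pattern(void)
    {
        volatile unsigned int *p = (volatile unsigned int *)&_end;
        /* Leave headroom so we don't clobber the stack frame this
           function itself is using. */
        while ((unsigned char *)p < &__stack - 64)
            *p++ = 0x1234u;
    }

Call it early, right after crt0's zero-init loop; later, scan memory for the first address where 0x1234 has been overwritten to find the stack's deepest extent.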
Which device are you using? On 16-bit devices, 0xffe2e is outside the address space of the processor; the access likely comes from an array index or similar that has gone negative.

I have seen this error as well when using code composer studio and TI's USBFET programmer although I have not been able to nail down a single, definite cause.
Assuming you are using CCS, here are some tips:
1) Catch the ACCV (UNMI) and VMA (SYSNMI) interrupts and set a breakpoint within the handlers (a sketch follows this list). If one of these trips, examine the stack for clues as to what triggered the interrupt.
2) If you have any interrupt handlers which re-enable interrupts (GIE bit), make sure they are not being retriggered repeatedly.
3) I have seen this error (inexplicably) when stepping through optimized code; so it may help to turn off optimizations.
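For tip 1, the handlers can be as simple as the following sketch. The vector names UNMI_VECTOR and SYSNMI_VECTOR come from the 5xx/6xx device headers and the #pragma syntax is the TI compiler's, so adjust both for your device and toolchain:

    #include <msp430.h>

    /* Trap unexpected NMIs so the debugger stops somewhere useful. */
    #pragma vector = UNMI_VECTOR
    __interrupt void unmi_isr(void)
    {
        for (;;);   /* breakpoint here: flash access violation, etc. */
    }

    #pragma vector = SYSNMI_VECTOR
    __interrupt void sysnmi_isr(void)
    {
        for (;;);   /* breakpoint here: vacant memory access, etc. */
    }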
If you are using code composer studio, as an alternative to initializing your RAM, you can set a breakpoint on stack overflow. Also, with a paused debug session, CCS gives you the option to fill a portion of memory with any value you choose via the "Memory" sub-window.

How to narrow down a stack overflow exception?

I have a deadline. I am googling, I am reading code, I need help ...
My application is throwing an EStackOverflow. It requires an overnight test to hit the error, so I need some good ideas or it will take a long time to track down.
I tried last night with madExcept, but that didn't catch it, presumably because there was no stack left for it to do so. I was running from the IDE, so I broke execution and looked at the call stack, but it was filled with madExcept details (I have contacted the author, but there is a big time difference between us).
There are no (deliberately) recursive routines. No OnChange handlers (which might accidentally change the component they monitor, thus calling themselves recursively). No large data structures (which might be passed on the stack as parameters).
My first thought is to turn off madExcept, but I can't wait another 12 or 16 hours for a crash.
Unattended, the program is doing some database access when timers expire every 30 seconds or every hour, so I have set those to 1 second, hoping to hasten the crash. Hmm, can I reduce the stack size in order to hasten the crash? If so, how?
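On shrinking the stack: in Delphi the $MAXSTACKSIZE compiler directive sets the maximum stack size. The per-thread equivalent at the Win32 level looks like this C sketch; the 64 KB figure is purely illustrative:

    #include <windows.h>

    DWORD WINAPI suspect_work(LPVOID arg)
    {
        /* ... run the code you suspect of runaway stack growth ... */
        return 0;
    }

    int main(void)
    {
        /* A deliberately small stack (64 KB here) makes a creeping
           overflow hit its limit much sooner. */
        HANDLE h = CreateThread(NULL, 64 * 1024, suspect_work, NULL,
                                STACK_SIZE_PARAM_IS_A_RESERVATION, NULL);
        if (h) {
            WaitForSingleObject(h, INFINITE);
            CloseHandle(h);
        }
        return 0;
    }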
What else can I do? I have wrapped my application's main file, where the forms are created and the application is run, in Try ... Except.
Is there some point, such as the message handling loop, where I can check the stack size and see if it is growing "too large"? (if so, can you give details?)
Any other suggestion? Thanks in advance
(p.s the code is far too large to post)
P.S. The profiler approach I noted above might give you a result by sheer luck (if there is some stray branch leading to a rare but instantly fatal leak; say, among 100 case variants there is only one leading to infinite recursion). However, if there is a steady cumulative leak that takes all night to accumulate, then the profiler results would not be distinguishable from normal work.
I thought about it a bit. My current hypothesis is that the stack tracers fail because there is no stack left. Let's hold onto that: then we should throw an exception before the stack is exhausted, yes? So I'd try this sequence:
1) I'd set the stack tracer to log a full stack trace including line numbers. This can be done in the JCL tracers used by the Delphi IDE, and I think madExcept can do it too.
2) I'd find out the current stack size; see, for example, Determining Stack Space with Visual Studio.
3) I'd keep regularly checking the used stack space (a sketch follows this list); see, for example, What is a safe Maximum Stack Size or How to measure use of stack?
Note: since we hardly know which thread caused this, if I had a few threads I'd instrument all of them. I just don't know how the application would react if some auxiliary thread failed with a stack overflow; can it be intercepted and logged, or would the whole app be blown up for that one thread?
4) Roughly every 5 minutes I'd log the current stack usage, just to see the pattern: is it steadily, slowly growing, or is it some rare but severe codepath?
If there are several threads, use several log files, one per thread.
This would also allow you to estimate the "normal" stack usage.
5) If stack usage rises above 80% (I think the "normal" usage logged above would not exceed that, would it?), I'd raise a manual exception of some special class that would not be caught until the thread/application top level, and at the top level maybe even do something complex, like halting the app into the debugger and waking you, so you can connect remotely and check the internal state of the sick but not yet deceased app.
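A sketch of the check in step 3, at the Win32 API level (the calls translate directly to Delphi). GetCurrentThreadStackLimits needs Windows 8 or later; on older systems, VirtualQuery on a local variable's address can recover similar bounds:

    #include <windows.h>
    #include <stdio.h>

    /* Rough estimate of the calling thread's current stack usage. */
    static void log_stack_usage(void)
    {
        ULONG_PTR low, high;
        char marker;                        /* lives near the stack top */
        GetCurrentThreadStackLimits(&low, &high);      /* Windows 8+ */
        SIZE_T reserved = (SIZE_T)(high - low);
        SIZE_T used = (SIZE_T)(high - (ULONG_PTR)&marker); /* grows down */
        printf("stack: %lu of %lu bytes used (%.0f%%)\n",
               (unsigned long)used, (unsigned long)reserved,
               used * 100.0 / reserved);
    }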
The most common stack overflow problem is a function calling itself; try to avoid that, or at least limit the recursion depth.
You say you use a timer. Are you calling Application.ProcessMessages inside your timer event? If you have a lot of messages in the queue, and one of them is another WM_TIMER, this can cause a stack overflow.

What are possible causes of IDirect3DVertexBuffer9::Lock failing?

In error reports from some end users of our game I have quite often seen the following behaviour: IDirect3DVertexBuffer9::Lock fails, and the returned error code is D3DERR_NOTAVAILABLE.
Once this happens, quite frequently (but not always) it is followed by a CreateTexture or CreateVertexBuffer call failing with error D3DERR_OUTOFVIDEOMEMORY.
What are possible reasons for a vertex buffer lock failure? Could the virtual memory address space be exhausted, or what?
Based on the DirectXDev response by Chuck Walbourn from Microsoft, besides "out of address space" another cause could be "out of paged pool":
Alternatively, on Windows XP this could indicate you have hit the limits of paged pool kernel memory. Typically this happens when you create a lot of Direct3D resources (textures, etc.)
We DO create a lot of Direct3D resources.
This is what I posted to DirectXDev: ;)
Have you checked how much memory your application is using? (Be sure to select the Virtual Memory column in Task Manager!) My guess would be memory fragmentation based issues causing you to, as you suggest, run out of address space.
It could, however, be a driver bug ...
Does the debug runtime provide any useful information?
Edit: The only other thing I can think of is that the aperture memory has run out. I don't know how this works with PCI Express, but on AGP you can set the aperture size. I've no idea how to check whether it is full, however; I suspect the error you are seeing is reporting that it is. Are you doing lots of locks with the Discard flag? If so, it's possible that these are creating tonnes of new allocations in the aperture and causing you to run out of memory there. This is pure guesswork, however.
I'd guess that if this is happening with only some of your users, it is those on the lower-end machines. If things run slowly, you can end up with a lot of data buffered in the command buffer. This will make control laggy and "could", at a guess, lead to the problem you are seeing. You may want to try making sure the command buffer never gets too long. If you make sure the first lock of every frame is done without the discard flag (i.e. flags set to 0), this will cause the pipeline to stall until the vertex buffer has been rendered, bringing the command buffer back in sync with you. This will cause a slowdown, as the command buffering will not be able to smooth out frame-rate spikes as easily ...
Anyway ... that's just a guess!
The raised issue about out of memory is valid. We need some details on the Lock() call to be sure, but, for example, if the buffer is in the DEFAULT pool and is dynamic (the D3DLOCK_DISCARD flag is passed), it's entirely possible that your driver tries to find an unused piece of memory to return (because it double- or triple-buffers internally) and fails because, as you discover yourself soon after, video memory is exhausted.
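To make that concrete, here is a hedged C sketch (using the C COM macros from d3d9.h) of a defensive lock that treats D3DERR_NOTAVAILABLE as resource pressure. EvictManagedResources only helps managed-pool resources, so the retry is best-effort, not a fix:

    #define COBJMACROS
    #include <d3d9.h>

    /* Lock a dynamic vertex buffer, retrying once after evicting
       managed resources if the driver reports resource pressure. */
    void *lock_dynamic_vb(IDirect3DDevice9 *dev,
                          IDirect3DVertexBuffer9 *vb, UINT bytes)
    {
        void *data = NULL;
        HRESULT hr = IDirect3DVertexBuffer9_Lock(vb, 0, bytes, &data,
                                                 D3DLOCK_DISCARD);
        if (hr == D3DERR_NOTAVAILABLE || hr == D3DERR_OUTOFVIDEOMEMORY) {
            IDirect3DDevice9_EvictManagedResources(dev);
            hr = IDirect3DVertexBuffer9_Lock(vb, 0, bytes, &data,
                                             D3DLOCK_DISCARD);
        }
        return SUCCEEDED(hr) ? data : NULL;  /* caller must Unlock on success */
    }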

Hunting down EOutOfResources

Question:
Is there an easy way to get a list of the types of resources that leak in a running application? In other words, by connecting to a running application?
I know MemProof can do it, but it slows things down so much that the application won't even last a minute. Most task-manager-like tools can show the number of resources, but not the type.
It is not a problem if the check itself is catastrophic (halts the app process), since I can check with a task manager whether I'm getting close (or at least I hope so).
Any other insights on resource-leak hunting (so, not memory) are also welcome.
Background:
I have a Delphi 7/2006/2009 app (it compiles with all three) and after about a few weeks it starts acting funny. However, this happens at only one of the places it runs; on several other systems it runs till the power goes out.
I've tried to put in some debug code to narrow the problem down, and found out that the exception is EOutOfResources on a save of a file (the file save can happen thousands of times a day).
I have tried to reason out memory leaks (with FastMM), but since the dataflow is quite high (60 MByte/s from gigabit industrial cameras), I can only rule out "creeping" memory leaks with FastMM, not quick flashes of memory leaks that exhaust memory around the time it happens. If something goes wrong, the app fills memory in under half a minute.
Main suspects are file handles that are somehow left open on some error, and TMetafiles (which are streamed to these files). Minor suspects are VST, popup menus and TFrames.
Updates:
Another possible tip: it ran fine for two years with D7, and now the problems are with Turbo Explorer (which I use for stable projects not converted to D2009).
Paul-Jan: Since it only happens once a week (and that can happen at night), information acquisition is slow, which is why I ask this question; I need to combine approaches for when I'm there Thursday. In short: no, I don't know for 100% sure. I intend to bring the entire system-tools collection to see if I can find something (because then it will be running for days). There is also a chance that I'll see open files (maybe I should try to find some MinGW lsof and schedule it).
But the app sees very little GUI action (it is a machine-vision inspection app), except a screen refresh of roughly 15/s, which is a TBitmap StretchDraw plus a TMetafile. Yet I get this error when saving to disk (TFileStream), so handles are probably really exhausted. However, in the same stream a TMetafile is also SaveToStream'ed, something which later apps no longer have, and those can run for months.
------------------- UPDATE
I've searched and searched and searched, and managed to reproduce the problems in vitro two or three times. The problems happened when memory usage was around 256MB (the systems have 2GB), with about 200 user objects and 500 GDI objects, and not one file more open than expected.
This is not really exceptional. I do notice that I leak small numbers of handles, probably due to reparenting frames (something in the VCL seems to leak HPalettes), but I suspect the core cause is a different problem. I reuse TMetafiles and .Clear them in between. I think clearing the metafile doesn't really (always?) shrink the resource, so eventually each metafile in the entire pool of TMetafiles ends up at maximum size, and with 20-40+ TMetafiles (which can be several hundred KB each) this will hit the desktop heap limit.
That's the theory; I'll try to verify it by setting the desktop heap limit to 10MB at the customer's, but it will be several weeks before I have confirmation of whether this changes anything. The theory would also explain why this machine is special (it's possible that this machine naturally has slightly larger metafiles on average). Occasionally freeing and recreating a TMetafile in the pool might also help.
Luckily all these problems (both TMetafile and reparenting) have already been designed out in newer generations of the apps.
Due to the special circumstances (and the fact that I have very limited test windows), this is going to take a while, but I decided to accept the desktop heap answer for now (though the GDILeaks material was also somewhat useful).
Another thing the audit revealed was GDI-type usage in a thread (though only the saving of TMetafiles, which weren't used or connected otherwise, to streams).
------------- Update 2.
Increasing the desktop heap limit only seemed to slightly increase the time until the problem occurred.
Unfortunately, I won't be able to follow up on this further, since the machines were updated to a newer version of the framework that doesn't have the problem.
In summary I can only state what the three core modifications were going from the old to the new framework:
I no longer change screens by reparenting frames; I now work with forms that I hide and show. I changed this since I also had very rare crashes or exceptions (which could be clicked away) due to it. Those crashes all happened while operating the GUI, though, not spontaneously like the main problem.
The routine where the crash happened dealt with TMetafile. TMetafile has been designed out and replaced by a simpler home-made format (basically arrays of OpenGL vertices).
Drawing no longer happens with a TBitmap with a TMetafile overlay stretch-drawn over it, but using OpenGL.
Of course it could also be something else that got changed in the rewrite of the above parts, fixing some very nasty detail bug. It would have to be an extremely bad one, since I analysed the above system as much as I could.
Updated Nov 2012 after some private mail discussion: in retrospect, the next step would have been adding a counter to the metafile objects and simply reinstantiating them every x * 1000 uses or so, to see if that changed anything. If you have similar problems, try to see if you can fairly regularly destroy and reinitialize long-lived, dynamically allocated resources.
There is a slim chance that the error is misleading. The VCL naively reports EOutOfResources if it is unable to obtain a DC for a window (see TWinControl.GetDeviceContext in Controls.pas).
I say "naively" because there are other reasons why GetDC() might return a NULL handle and the VCL should report the OS error, not assume an out of resources condition (there is a Windows version check required for this to be reliably possible, but the VCL could and should take of that too).
I had a situation where I was getting the EOutOfResources error as the result of a window handle becoming invalid. Once I'd discovered the true problem, finding the cause and fixing it was simple, but I wasted many, many hours trying to find a non-existent resource leak.
If possible I would examine the stack trace leading to this exception - if it is coming from TWinControl.GetDeviceContext then the problem may not be what you think (it's impossible to say what it might be of course, but eliminating the impossible is always the first step toward discovering the solution, no matter how improbable).
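A minimal C illustration of that check, outside the VCL: before blaming resource exhaustion, verify the window handle itself. Note that GetDC is not guaranteed to set a last error, hence the hedged log message:

    #include <windows.h>
    #include <stdio.h>

    HDC get_dc_checked(HWND wnd)
    {
        HDC dc = GetDC(wnd);
        if (dc == NULL) {
            if (!IsWindow(wnd))
                fprintf(stderr, "GetDC failed: stale window handle %p\n",
                        (void *)wnd);
            else
                fprintf(stderr, "GetDC failed: possibly out of resources "
                                "(GetLastError=%lu, may be meaningless)\n",
                        GetLastError());
        }
        return dc;
    }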
If they are GDI handle leaks, you can have a look at the MSDN Magazine January 2003 issue, which uses the tool GDILeaks. Other tools are GDIObj and GDIView.
Another source of EOutOfResources could be that the Desktop Heap is full. I've had that issue on busy terminal servers with large screens.
If you are leaking lots of file handles, you could check out Process Explorer, have a look at the open file handles of your process, and see if any are out of the ordinary. Or use WinDbg with the !htrace command.
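If attaching a tool is inconvenient, the process can log its own counts; a sketch using documented Win32 calls (GetProcessHandleCount needs XP SP1 or later):

    #include <windows.h>
    #include <stdio.h>

    /* Periodically log kernel handle, GDI and USER object counts;
       a steady climb points at a leak. */
    static void log_resource_counts(void)
    {
        DWORD handles = 0;
        HANDLE self = GetCurrentProcess();
        GetProcessHandleCount(self, &handles);
        printf("handles=%lu gdi=%lu user=%lu\n",
               handles,
               GetGuiResources(self, GR_GDIOBJECTS),
               GetGuiResources(self, GR_USEROBJECTS));
    }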
I've run into this problem before. From what I've been able to tell, Delphi may throw an EOutOfResources any time the Windows API returns ERROR_NOT_ENOUGH_MEMORY, and (as the other answers here discuss) Windows may return ERROR_NOT_ENOUGH_MEMORY for a variety of conditions.
In my case, EOutOfResources was being caused by a TBitmap; in particular, TBitmap's call to CreateCompatibleBitmap, which it uses with its default PixelFormat of pfDevice. Apparently Windows may enforce fairly strict system-wide limits on the memory available for device-dependent bitmaps (see, e.g., this discussion), even if your system otherwise has plenty of memory and plenty of GDI resources. (These system-wide limits apparently exist because Windows may allocate device-dependent bitmaps in the video card's memory.)
The solution is simply to use device-independent bitmaps (DIBs) instead (although these may not perform quite as well). To do this in Delphi, set TBitmap.PixelFormat to anything other than pfDevice. This KB article describes how to pick the optimal DIB format for a device, although I generally just use pf32Bit instead of trying to determine the optimal format for each of the monitors the application is displayed on.
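For readers outside Delphi, the same distinction at the Win32 level: CreateCompatibleBitmap yields a device-dependent bitmap subject to the limits above, while CreateDIBSection yields a device-independent one. A sketch:

    #include <windows.h>

    /* Allocate a 32-bit top-down DIB instead of a device-dependent
       bitmap; DIB pixel data lives in ordinary process memory. */
    HBITMAP create_dib_32(int width, int height, void **bits)
    {
        BITMAPINFO bmi = {0};
        bmi.bmiHeader.biSize = sizeof(bmi.bmiHeader);
        bmi.bmiHeader.biWidth = width;
        bmi.bmiHeader.biHeight = -height;   /* negative = top-down rows */
        bmi.bmiHeader.biPlanes = 1;
        bmi.bmiHeader.biBitCount = 32;
        bmi.bmiHeader.biCompression = BI_RGB;
        return CreateDIBSection(NULL, &bmi, DIB_RGB_COLORS, bits, NULL, 0);
    }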
Most of the times I saw EOutOfResources, it was some sort of handle leak.
Did you try something like MadExcept?
--jeroen
"I've tried to put in some debug code to narrow the problem down. and found out that the exception is EOutofResources on a save of a file. (the file save can happen thousands of times a day)."
I'm shooting in the dark here, but could it be that you're using the Windows API to (GetTempFileName) create a temp file and you're blowing out some file system indexes or forgetting to close a file handle?
Either way, I do agree that with your supposition about it being a file handle problem. That seems to be the most likely thing given your symptoms and diagnosis.
Also try checking the handle count for the application with Process Explorer from Sysinternals. Handle leaks can be very dangerous, and they build up slowly over time.
I am currently having this problem in software that is clearly not leaking any handles in my own code, so if there are leaks, they could be happening in a component's source code or the VCL source code itself.
The handle count and the GDI and user object counts are not increasing, nor is anything else obviously being created. Deltics' answer shows corner cases where the message is something of a red herring, and Allen suggests that even a file write can cause this error.
So far, the best strategy I have found for hunting these down is to use either JCL JclDebug stack tracebacks or the exception-report save features in MadExcept to generate the context information needed to find out what is actually failing.
Secondly, AQTime contains many tools to help you, including a resource profiler that can keep the links between the code that created the resources and how it was called, along with counts of the total numbers of handles. It can grab results mid-run, so it is not limited to detecting unfreed resources after you exit. So run AQTime, capture results mid-run, wait several hours, and capture again: you should then have two points in time to compare handle counts between. Just in case it is the obvious thing. But as Deltics wisely points out, this exception class is raised in cases where it probably shouldn't have been.
I spent all of today chasing this issue down. I found plenty of helpful resources pointing me in the direction of GDI, given that I'm using GDI+ to produce high-speed animations directly onto the main form via timer/invalidate/OnPaint (the animation is performed in a separate thread). I also have a panel on this form with some dynamically created controls that let the user make changes to the animation.
It was extremely random and spontaneous. It wouldn't break anywhere in my code, and when the error dialog appeared, the animation on the main form would continue to work. At one point, two of these errors popped up at the same time (as opposed to sequentially).
I carefully inspected my code and made sure I wasn't leaking any handles related to GDI. In fact, my entire application tends to keep fewer than 300 handles, according to Task Manager. Regardless, this error would randomly pop up, and it would always correspond with the simplest UI-related action, such as just moving the mouse over a standard VCL control.
Solution
I believe I have solved it by changing the logic to perform the drawing within a custom control, rather than directly on the main form as I had been doing before. I think that because I was rapidly drawing on the same form canvas that other controls shared, they somehow interfered. Now that the animation has its own dedicated canvas to draw on, it seems to be fixed.
That is with about 1 hour of vigorous testing at least.
[Fingers crossed]

How to get the root cause of a memory corruption in an embedded environment?

I have detected memory corruption in my embedded environment (my program is running on a set-top box with a proprietary OS), but I couldn't find the root cause of it.
The memory corruption itself is detected after a stress test of launching and exiting an application multiple times. Given that I couldn't set a memory breakpoint (because the corrupted variable changes its address every time the application is launched), is there any way to catch the root cause of this corruption?
(A memory breakpoint is a breakpoint triggered when the environment changes the value of a given memory address.)
Note also that all my software is developed in C.
Thanks for your help.
These are always difficult problems on embedded systems and there is no easy answer. Some tips:
Look at the value the memory gets corrupted with. This can give a clear hint.
Look at the data structures next to your memory corruption.
See if there is a pattern in the memory corruption. Is it always at a similar address?
See if you can set up the memory breakpoint at run-time.
Does the embedded system allow memory areas to be sandboxed? Set up sandboxes to safeguard your data memory.
Good luck!
Where is the data stored and how is it accessed by the two processes involved?
If the structure was allocated off the heap, try allocating a much larger block and putting large guard areas before and after the structure. This should give you an idea of whether it is one of the surrounding heap allocations that has overrun into the same allocation as your structure. If you find that the memory surrounding your structure is untouched, and only the structure itself is corrupted, then this indicates that the corruption is being caused by something which has some knowledge of your structure's location, rather than a random memory stomp.
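A minimal sketch of the guard-area idea; the guard size and fill byte here are arbitrary choices:

    #include <stdlib.h>
    #include <string.h>

    #define GUARD_BYTES 4096
    #define GUARD_FILL  0xAA

    /* Allocate 'size' bytes with pattern-filled guard zones around it. */
    void *guarded_alloc(size_t size)
    {
        unsigned char *raw = malloc(GUARD_BYTES + size + GUARD_BYTES);
        if (!raw) return NULL;
        memset(raw, GUARD_FILL, GUARD_BYTES);
        memset(raw + GUARD_BYTES + size, GUARD_FILL, GUARD_BYTES);
        return raw + GUARD_BYTES;
    }

    /* Returns nonzero if either guard zone has been disturbed. */
    int guards_corrupted(void *p, size_t size)
    {
        unsigned char *raw = (unsigned char *)p - GUARD_BYTES;
        for (size_t i = 0; i < GUARD_BYTES; i++)
            if (raw[i] != GUARD_FILL ||
                raw[GUARD_BYTES + size + i] != GUARD_FILL)
                return 1;
        return 0;
    }

If the guards stay clean while the structure still gets corrupted, the writer knows the structure's address, which is exactly the distinction drawn above.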
If the structure is in a data section, check your linker map output to determine what other data exists in the vicinity of your structure. Check whether those have also been corrupted, introduce guard areas, and check whether the problem follows the structure if you force it to move to a different location. Again this indicates whether the corruption is caused by something with knowledge of your structure's location.
You can also test this by switching data from the heap into a data section or vice versa.
If you find that the structure is no longer corrupted after moving it elsewhere or introducing guard areas, you should check the linker map or track the heap to determine what other data is in the vicinity, and check accesses to those areas for buffer overflows.
You may find, though, that the problem does follow the structure wherever it is located. If this is the case then audit all of the code surrounding references to the structure. Check the contents before and after every access.
To check whether the corruption is being caused by another process or interrupt handler, add hooks to each task switch and before and after each ISR is called. The hook should check whether the contents have been corrupted. If they have, you will be able to identify which process or ISR was responsible.
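A sketch of such hooks; g_watched, g_watched_len and report_and_halt are hypothetical names standing in for your structure and your system's reporting mechanism:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical: the structure being corrupted and a reporting hook. */
    extern const void *g_watched;
    extern size_t g_watched_len;
    extern void report_and_halt(const char *where);

    static uint32_t g_expected;

    static uint32_t checksum(const void *p, size_t n)  /* use a real CRC */
    {
        const uint8_t *b = (const uint8_t *)p;
        uint32_t sum = 0;
        while (n--) sum = sum * 31u + *b++;
        return sum;
    }

    void corruption_arm(void)    /* call after each legitimate update */
    {
        g_expected = checksum(g_watched, g_watched_len);
    }

    void corruption_check(const char *where)  /* call from switch/ISR hooks */
    {
        if (checksum(g_watched, g_watched_len) != g_expected)
            report_and_halt(where);
    }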
If the structure is ever read onto a local process stack, try increasing the process stack and check that no array overruns etc have occurred. Even if not read onto the stack, it's likely that you will have a pointer to it on the stack at some point. Check all sub-functions called in the vicinity for stack issues or similar that could result in the pointer being used erroneously by unrelated blocks of code.
Also consider whether the compiler or RTOS may be at fault. Try turning off compiler optimisation, and failing that inspect the code generated. Similarly consider whether it could be due to a faulty context switch in your proprietary RTOS.
Finally, if you are sharing the memory with another hardware device or CPU and you have data cache enabled, make sure you take care of this through using uncached accesses or similar strategies.
Yes, these problems can be tough to track down with a debugger.
A few ideas:
Do regular code reviews (not fast at tracking down a specific bug, but valuable for catching such problems in general)
Comment-out or #if 0 out sections of code, then run the cut-down application. Try commenting-out different sections to try to narrow down in which section of the code the bug occurs.
If your architecture allows you to easily disable certain processes/tasks from running, by the process of elimination perhaps you can narrow down which process is causing the bug.
If your OS is cooperatively multitasked, e.g. round-robin (this would be too hard, I think, for preemptive multitasking): add code to the end of the task that "owns" the structure to save a "check" of the structure. That check could be a memcpy'd copy (if you have the time and space) or a CRC. Then, after every other task runs, add some code to verify the structure against the saved check. This will detect any changes.
I'm assuming by your question you mean that you suspect some part of the proprietary code is causing the problem.
I have dealt with a similar issue in the past using what a colleague so tastefully calls a "suicide note". I would allocate a buffer capable of storing a number of copies of the structure that is being corrupted. I would use this buffer like a circular list, storing a copy of the current state of the structure at regular intervals. If corruption was detected, the "suicide note" would be dumped to a file or to serial output. This would give me a good picture of what was changed and how, and by increasing the logging frequency I was able to narrow down the corrupting action.
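A sketch of that circular buffer; SLOT_COUNT and SNAP_SIZE are placeholder choices:

    #include <stdio.h>
    #include <string.h>

    #define SLOT_COUNT 64
    #define SNAP_SIZE  256       /* size of the structure being watched */

    static unsigned char g_note[SLOT_COUNT][SNAP_SIZE];
    static unsigned g_slot;

    /* Record the watched structure's current state at regular intervals
       (timer tick, main loop, etc.). */
    void note_record(const void *watched)
    {
        memcpy(g_note[g_slot++ % SLOT_COUNT], watched, SNAP_SIZE);
    }

    /* On detected corruption, dump the history, oldest snapshot first. */
    void note_dump(FILE *out)
    {
        for (unsigned i = 0; i < SLOT_COUNT; i++)
            fwrite(g_note[(g_slot + i) % SLOT_COUNT], 1, SNAP_SIZE, out);
    }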
Depending on your OS, you may be able to react to detected corruption by looking at all running processes and seeing which ones are currently holding a semaphore (you are using some kind of access control mechanism with shared memory, right?). By taking snapshots of this data too, you perhaps can log the culprit grabbing the lock before corrupting your data. Along the same lines, try holding the lock to the shared memory region for an absurd length of time and see if the offending program complains. Sometimes they will give an error message that has important information that can help your investigation (for example, line numbers, function names, or code offsets for the offending program).
If you feel up to doing a little linker kung fu, you can most likely specify the address of any statically-allocated data with respect to the program's starting address. This might give you a consistent-enough memory address to set a memory breakpoint.
Unfortunately, this sort of problem is not easy to debug, especially if you don't have the source for one or more of the programs involved. If you can get enough information to understand just how your data is being corrupted, you may be able to adjust your structure to anticipate and expect the corruption (sometimes needed when working with code that doesn't fully comply with a specification or a standard).
You say you detect memory corruption; could you be more specific about how? Is it a crash with a core dump, for example?
Normally the OS will completely free all the resources and handles your program holds when the program exits, gracefully or otherwise. Even proprietary OSes manage to get this right, although it's not a given.
So an intermittent problem could seem to be triggered by stress but just be chance; or it could lie in the initialisation of drivers or other processes the program communicates with; or it could be bad error handling around, say, memory allocations that fail when the OS itself is under stress, e.g. lazy tidying-up of the closed programs.
Printfs in custom malloc/realloc/free proxy functions, or even an Electric Fence-style custom allocator, might help if it's as simple as a buffer overflow.
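A sketch of such proxies; the redirection macro goes in a shared header included everywhere except the file defining the proxies, so it doesn't expand inside them:

    #include <stdio.h>
    #include <stdlib.h>

    /* Logging proxies: compare the addresses printed here against the
       address that gets corrupted. */
    void *dbg_malloc(size_t n, const char *file, int line)
    {
        void *p = malloc(n);
        fprintf(stderr, "malloc(%lu) -> %p  %s:%d\n",
                (unsigned long)n, p, file, line);
        return p;
    }

    void dbg_free(void *p, const char *file, int line)
    {
        fprintf(stderr, "free(%p)  %s:%d\n", p, file, line);
        free(p);
    }

    /* In a shared header (not in this file):
       #define malloc(n)  dbg_malloc((n), __FILE__, __LINE__)
       #define free(p)    dbg_free((p), __FILE__, __LINE__)   */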
Use memory-allocation debugging tools like ElectricFence, dmalloc, etc - at minimum they can catch simple errors and most moderately-complex ones (overruns, underruns, even in some cases write (or read) after free), etc. My personal favorite is dmalloc.
A proprietary OS might limit your options a bit. One thing you might be able to do is run the problem code on a desktop machine (assuming you can stub out the hardware-specific code), and use the more-sophisticated tools available there (i.e. guardmalloc, electric fence).
The C library that you're using may include some routines for detecting heap corruption (glibc does, for instance). Turn those on, along with whatever tracing facilities you have, so you can see what was happening when the heap was corrupted.
First, I am assuming you are on a bare-metal chip that isn't running Linux or some other POSIX-capable OS (if you are, there are much better techniques available, such as Valgrind and ASan).
Here are a couple of tips for tracking down embedded memory corruption:
Use JTAG or similar to set a memory watchpoint on the area of memory that is being corrupted. You might be able to catch the moment that memory is accidentally written, as opposed to a correct write. Many JTAG debuggers include IDE plugins that let you get stack traces as well.
In your hard-fault handler, try to generate a call stack that you can print, so you can get a rough idea of where the code is crashing. Note that since the corruption can occur some time before the crash actually happens, the stack traces you get are unlikely to be helpful on their own, but combined with the better techniques mentioned below they will be. Generating a backtrace on bare metal can be a very difficult task, though; if you happen to be using a Cortex-M line processor, check out https://github.com/armink/CmBacktrace, or try searching the web for advice on generating a back/stack trace for your particular chip.
If your compiler supports it, use stack canaries to detect, and immediately crash on, anything that writes over the stack; for details, search the web for "stack protector" for GCC or Clang.
If you are running on a chip that has an MPU, such as an ARM Cortex-M3, you can use the MPU to write-protect the region of memory that is being corrupted, or a small region right before it. This will cause the chip to fault at the moment of the corruption rather than much later (a sketch follows this list).
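A heavily hedged sketch of the MPU trick for an ARMv7-M part, using CMSIS register names; region size and alignment constraints are simplified here, and the offending write then raises a MemManage fault (or an escalated hard fault) at the moment of corruption:

    #include "core_cm3.h"  /* CMSIS header pulled in by your device header */

    /* Write-protect one 256-byte, 256-byte-aligned region covering the
       corrupted data so the chip faults on the bad write itself. */
    void mpu_write_protect(uint32_t addr)
    {
        MPU->RNR  = 0;                           /* use MPU region 0 */
        MPU->RBAR = addr & ~0xFFu;               /* base, 256-byte aligned */
        MPU->RASR = (0x6u << MPU_RASR_AP_Pos)    /* AP=0b110: read-only */
                  | (7u   << MPU_RASR_SIZE_Pos)  /* 2^(7+1) = 256 bytes */
                  | MPU_RASR_ENABLE_Msk;
        MPU->CTRL = MPU_CTRL_PRIVDEFENA_Msk      /* default map elsewhere */
                  | MPU_CTRL_ENABLE_Msk;
        __DSB();                                 /* ensure MPU settings apply */
        __ISB();
    }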

How do I go about diagnosing memory corruption errors occurring in a COM-DLL after porting it from Delphi 2007 to Delphi 2009?

I have just ported several of our home-made Outlook COM-addins from Delphi 2007 to Delphi 2009 and am now experiencing some really weird errors (before you ask: none of which appear to have any obvious relationship to string-handling), for example modal dialogs that hang Outlook when one tries to invoke them a second time (the first time around everything appears to be fine) but only when they're invoked from one specific event handler and not when doing the same thing somewhere else. When I trace the error to a specific line of code and comment out that line or replace it with different code to the same effect (e.g. by copying code that would otherwise be called via a function directly to the calling site), the error will appear to go away - typically only to reoccur a couple of (equally inconspicuous looking) statements later.
When running this inside the Delphi debugger, I can see that the freezes are often preceded by Access Violations in GetMem.inc. At least all of these issues are 100% reproducible...
Needless to say we had none of these issues when compiling these addins in Delphi 2007.
Now, I'm quite at a loss. I know I have just been lucky but even though I consider myself a fairly experienced programmer (though mostly in niche areas) I never really had to deal with this class of error before. As the title of this question says, I don't even really know where to start. I can step through the code as much as I like but the endless assembler statements mean nothing to me and neither am I proficient in effectively using the CPU view.
Furthermore, I don't even know for sure yet whether this is an issue with my own code to begin with (I actually tend to doubt it in this case). We are making massive use of a number of third-party libraries (e.g. JCL, ADX, Redemption). ADX in particular still labels its Delphi 2009 support "beta".
I also tried using FastMM's FullDebugMode, and indeed I did uncover a number of errors in ADX that way (e.g. blocks that were modified after having been freed), but all of these also occur when I compile with Delphi 2007, so it doesn't necessarily follow that they are ultimately the cause of the observed regression.
So, how do I deal with this? - or better yet: Where can I find some good resources on learning how to deal with this? e.g. tutorials on using the CPU view or effectively interpreting and acting upon the reports put out by FastMM? Are these the correct tools at all? Where else should I look?
Addendum:
What types of code should I be suspicious of in this context? What kind of code even has the potential to wreak such havoc in memory? The only places I can think of where my code performs anything remotely approaching explicit memory manipulation is when reserving some buffer space in preparation of a WinAPI call. Also keep in mind that all of my code is identical between the Delphi 2007 and Delphi 2009 versions and the Delphi 2007 version exhibits no such problems.
Update:
With some probability the issue that prompted me to post this question has now been solved. See my own answer below.
The best tool for getting to a solution is probably memory breakpoints.
Debugging memory corruption is painful, so try to make your life as simple as possible first: find an exact, guaranteed-reproducible set of steps that work every time. If necessary, mock up the Outlook host so that you don't need to rely on Outlook timing issues or address space layout issues etc.
It's imperative that you get a reliably reproducible set of steps that results in an AV or other error at a predictable address.
What you then do is restart the process, create a memory breakpoint set for whatever referred to that address, and get familiar with the lifecycle of that chunk of memory. Minimizing and rationalizing your reproduction steps helps here. It may help to add other breakpoints and only enable the memory breakpoint later in the run; or use the logging features of D2009 breakpoints to log memory values / call stacks etc., rather than actually breaking into the debuggee.
Not exactly an answer to the question which was more general, but very probably the solution to the specific problem that prompted it:
I am 95% sure to have identified the problem now! :)
Here's what I did:
I enabled RangeChecking and OverflowChecking in the compiler
I tracked down and fixed all problems that caused ERangeError or EIntOverflow exceptions
(there was one of each)
I ran the program again with FastMM and FullDebugMode enabled
I was finally able to identify the cause of the problem in all cases to be a call to the JCL function GetWindowCaption
It seems that GetWindowCaption has not yet been checked for Unicode compatibility: it was using the value returned by the API function GetWindowTextLength (which returns a number of characters) as input for ReallocMem (which expects a number of bytes) to allocate the buffer for GetWindowText (which in Delphi 2009 fills a buffer of WideChars). Boom! The function was allocating too little memory for the buffer, and GetWindowText simply overwrote the following memory, corrupting the block footer.
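The fix pattern, shown as a Win32 C sketch of what the corrected Delphi code amounts to: GetWindowTextLength counts characters, so the allocation must multiply by the character size:

    #include <windows.h>
    #include <stdlib.h>

    /* Allocate (chars + 1) * sizeof(WCHAR) BYTES for a caption of
       'chars' CHARACTERS; confusing the two is exactly the bug above. */
    WCHAR *get_window_caption(HWND wnd)
    {
        int chars = GetWindowTextLengthW(wnd);
        WCHAR *buf = malloc(((size_t)chars + 1) * sizeof(WCHAR));
        if (buf != NULL)
            GetWindowTextW(wnd, buf, chars + 1);  /* size in characters */
        return buf;                               /* caller frees */
    }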
I have now filed this in the JCL bug tracker as item #4648
The bottom line I took away from this is: always be sure to fix all reported errors, including (seemingly) non-critical ones like range and overflow errors. If nothing else, it will make debugging that much more predictable.
The fact that you catch double-free bugs in D2007, even though the app appears to work fine in that version, means that you NEED to fix those: you are merely lucky that the D2007 version does not recycle the memory as aggressively as the D2009 version, so the bugs do not show up, thanks to "shadow persistence" in memory.
I would use FastMM's FullDebugMode to find the bad code and fix it as much as possible, then follow Barry's advice to troubleshoot memory usage.
For how to use the features of the Integrated Debugger, and how to log info from non breaking breakpoints, you may want to look at this CodeRage 3 session: Delphi Debugging for Dummies
I'd look in the direction of the full page-heap support built into the system.
Look at this post for how to configure it. Provided your memory usage is not too extensive, this is the easiest way to find the problem.
It gets tricky when memory consumption is heavier, but like I said, try full pageheap first.
