How to hunt a Heisenbug - delphi

Recently, we received a bug report from one of our users: something on the screen was displayed incorrectly in our software. Somehow, we could not reproduce this in our development environment (Delphi 2007).
After some further study, it appears that this bug only manifests itself when "Code optimization" is turned on.
Are there any people here with experience in hunting down such a Heisenbug? Any specific constructs or coding bugs that commonly cause such an issue in Delphi software? Any places you would start looking?
I'll also just start debugging the whole thing in the usual way, but any tips specific to Optimization-related bugs (*) would be more than welcome!
(*) Note: I don't mean to say that the bug is caused by the optimizer; I think it's much more likely some wonky construct in the code is somehow pushed "over the edge" by the optimizer.
Update
It seems the bug boils down to a record being fully initialized with zeros when there's no code optimization, and the same record containing some random data when there is optimization. In this case, the random data seems to cause an enum type to contain invalid data (to my great surprise!).
Solution
The solution turned out to involve an unitialized local record variable somewhere deep in the code. Apparently, without optimization the record was reset (heap?), and with optimization turned on, the record was filled with the usual garbage. Thanks to you all for your contributions --- I learned a lot along the way!

Typically bugs of this form are caused by invalid memory access (reading uninitialised data, reading off the end of a buffer...) or thread race conditions.
The former will be affected by optimisations causing data layout to be rearranged in memory, and/or possibly by debug code that initialises newly allocated memory to some value; causing the incorrect code to "accidentally work".
The latter will be affected due to timings changing between optimisation levels. The former is generally much more likely.
If you have some automated way of making freshly allocated memory be filled with some constant value before it is passed to the program, and this makes the crash go away or become reproducible in the debug build, that'll provide a good point to start chasing things.

Could very well be a memory vs register issue: you programm running fine relying on memory persistence after a free.
I would recommend running your application with FastMM4 in full debug mode to be sure of your memory management.
Another (not free) tool which can be very useful in a case like this is Eurekalog.
Another thing that I've seen: a crash with the FPU registers being botched when calling some outside code (DLL, COM...) while with the debugger everything was OK.

A record that contains different data according to different compiler settings tells me one thing: That the record is not being explicitly initialised.
You may find that the setting of the compiler optimization flag is only one factor that might affect the content of that record - with any uninitialised data structures the one thing that you can rely on is that you can't rely on the initial content of the structure.
In simple terms:
class member data is initialised (to zero's) for new instances of the class
local variables (in functions and procedures) and unit variables are NOT initialised except in a few specific cases: interface references, dynamic arrays and strings and I think (but would need to check) records if they contain one or more fields of those types that would be initialised (strings, interface references etc).
The question as stated is now a little misleading because it seems you found your "Heisenberg" fairly easily enough. Now the issue is how to deal with it, and the answer is simply to explicitly initialise your record so that you aren't reliant on whatever behaviour or side-effect of the compiler is sometimes taking care of that for you and sometimes not.

Especially in purely native languages, like Delphi, you should be more than careful not to abuse the freedom to be able to cast anything to anything.
IOW: One thing, I have seen is that someone copies the definition of a class (e.g. from the implementation section in RTL or VCL) into his own code and then cast instances of the original class to his copy.
Now, after upgrading the library where the original class came from, you might experience all kinds of weird stuff. Like jumping into the wrong methods or bufferoverflows.
There's also the habit of using signed integer as pointers and vice-versa. (Instead of cardinal)
this works perfectly fine as long as your process has only 2GB of address space. But boot with the /3GB switch and you will see a lot of apps that start acting crazy. Those made the assumption of "pointer=signed integer" at least somewhere.
Your customer uses a 64Bit Windows? Chances are, he might have a larger address space for 32Bit apps. Pretty tough to debug w/o having such a test system available.
Then, there's race conditions.
Like having 2 threads, where one is very, very slow. So that you instinctively assume it will always be the last one and so there's no code that handles the scenario where "Captn slow" finishes first.
Changes in the underlying technologies can make these assumptions very wrong, very fast indeed.
Take a look at the upcoming breed of Flash-based super-mega-fast server storage.
Systems that can read and write Gigabytes per second. Applications that assume the IO stuff to be significantly slower than some calculations on in-memory values will easily fail on this kind of fast storage.
I could go on and on, but I gotta run right now...
Cheers

Code optimization does not mean necessarily that debug symbols have to be left out. Do a debug build with code optimization, then you can still debug the program and maybe the error occurs now.

One easy thing to do is Turn on compiler warning and hint, rebuild project and then fix all warnings/hints
Cheers

If it Delphi businesscode, with dataaware components etc, the follow might not apply.
I'm however writing machine vision code which is a bit computational. Most of the unittests are console based. I also am involved with FPC, and over the years have tested a lot with FPC. Partially out of hobby, partially in desperate situations where I wanted any hunch.
Some standard tricks that I tried (decreasing usefulness)
use -gv and valgrind the code (practically this means applications are required to run on Linux/FreeBSD. But for computational code and unittests that can be doable)
compile using fpc param -gt (=trash local vars, randomize local vars on procedure init)
modify heapmanager to randomize data of blocks it puts out (also applyable to Delphi code)
Try FPC's range/overflow checking and compiler hints.
run on a Mac Mini (powerpc) or win64. Due to totally different rules and memory layouts it can catch pretty funky things.
The 2 and 3 together nearly allow you to find most, if not all initialization problems.
Try to find any clues, and then go back to Delphi and search more focussed, debug etc.
I do realize this is not easy. I have a lot of FPC experience, and didn't have to find everything out from scratch for these cases. Still it might be worth a try, and might be a motivation to start setting up non-visual systems and unittests FPC compatible and platform independant. Most of this work will be needed anyway, seeing the Delphi roadmap.

In such problems i always advice to use logfiles.
Question: Can you somehow determine the incorrect display in the sourcecode?
If not, my answer wont help you.
If yes, check for the incorrectness, and as soon as you find it, dump the stack to a logfile. (see post mortem debugging for details about dumping and resymbolizing the stack).
If you see that some data has been corrupted, but you dont know how and then this happend, extract a function that does such a test for validity (with logging if failed), and call this function from more and more places over program execution (i.e. after each menu call). If you reiterate such a approach a few times you have good chances to find the problem.

Is this a local variable inside a procedure or function?
If so, then it lives on the stack, and will contain garbage. Depending on the execution path and compiler settings the garbage will change, potentially pushing your logic 'over the edge'.
--jeroen

Given your description of the problem I think you had uninitialized data that you got away with without the optimizer but which blew up with the optimization on.

Related

Suppress DLL messages or callback function

I plan to use https://github.com/risoflora/brookframework as an embedded static server in a Delphi application. It requires https://github.com/risoflora/libsagui and seems very fast. But I can't find an option to suppress messages coming from the libsagui DLL.
A common scenario is when the desired port can't be bound, I prefer to get an exception or some error handler callback, rather than get a MessageBox which can't be controlled by my application.
Any info or suggestion?
I'm going to take a flying leap here and just toss out what occurs to me at the moment. First, DLLs are not supposed to have ANY sort of UI activity. If someone does that deliberately and not because they just don't know, then it usually signals there's a serious problem afoot in the run-time activity.
Second, assuming that this signals a serious problem, then there's likely to be some mechanism in place to avoid it. This is for an embedded situation, so one might ask "why can't a desired port be bound" if the person doing the embedding is also responsible for configuring the overall embedded environment? One does not leave things undefined or unconfigured in embedded situations or the application just blows up at some point. So the obvious question is, why are YOU (as the person who configured the embedded environment) writing code that's trying to connect to unavailable ports? And why are YOU also complaining that the underlying library isn't throwing an exception rather than throwing up its virtual hands saying it doesn't know how to handle a request YOUR CODE should have known better than to make?
In other words, if you feel the need to add a layer to your code between your app and the platform you configured to support it, in order to validate your own requests, then you're free to do so. Otherwise, I'm guessing the embedded library designer expects you to know the limits you configured into it. (Or there may be some he did not anticipate.)
I've done a lot of embedded programming in my past, and if I was using a configurable kernal that I configured to only support 10 tasks (because I knew my application didn't need more than that), then in my client code I'd do a check to ensure that said limit wasn't reached BEFORE creating a new task and expecting to get an exception back if I was too stupid to follow my own design guidelines. In embedded situations, "exceptions" can lead to auto-shutdown or even self-destruct sequences. There is no effective way to handle "exceptions" in real-time situations, and often even just embedded situations.
Imagine pushing on a brake pedal in a car and an exception is thrown -- the user is saying "slow down" and the library says, "Sorry, I can't do that right now". It's shades of "2001 A Space Odyssy" with HAL telling Dave "No". That's what "exceptions" in embedded systems can result in.
What's different between that and your app saying, "Connect to this URL via this port" and the underlying library saying, "Sorry I can't do that right now". What is your exception handling code going to do? You need to detect that stuff FIRST rather than wait for an exception to be thrown by the embedded library.
Finally, this is more of a rant than anything else ... but one thing I've found sorely lacking in the vast majority of open-source projects on github I've looked at is any clear explanation of one or more USE-CASE MODELS. In this situation, WHY did the author choose to design this embedded DLL so it has situations where it simply throws up its hands by displaying an error message in a low-level MessageBox rather than providing some mechanism for handling the issue more elegantly? Obviously, he had SOMETHING in mind when he did that. Unfortunately for us, he did not explain it anywhere. Here we are trying to figure out how to deal with something that, on its face, makes no sense -- a DLL throwing up its hands when faced with an exceptional situation.
So my answer is, in the "spirit" of open-source code repos, you're free to dig into the code and try to figure out the USE-CASE MODEL that the author had in mind when he created this thing, and adapt your code to fit that model. Maybe start by looking at examples of code he posted that use this library, although if they're your typical trival examples people often post, then they won't offer a lot of insight into the overall USE-CASE MODEL that is leading to your situation.
Said another way, assume the library is behaving "as designed". Then WHAT IS THIS LIBRARY EXPECTING YOU TO DO IN YOUR CODE IN ORDER TO AVOID THAT SITUATION FROM ARISING?
Hopefully someone will see this who has a much shorter definitive answer to that.

Which of the three mutually exclusive Clang sanitizers should I default to?

Clang has a number of sanitizers that enable runtime checks for questionable behavior. Unfortunately, they can't all be enabled at once.
It is not possible to combine more than one of the -fsanitize=address, -fsanitize=thread, and -fsanitize=memory checkers in the same program.
To make things worse, each of those three seems too useful to leave out. AddressSanitizer checks for memory errors, ThreadSanitizer checks for race conditions and MemorySanitizer checks for uninitialized reads. I'm worried about all of those things!
Obviously, if I have a hunch about where a bug lies, I can choose a sanitizer according to that. But what if I don't? Going further, what if I want to use the sanitizers as a preventative tool rather than a diagnostic one, to point out bugs that I didn't even know about?
In other words, given that I'm not looking for anything in particular, which sanitizer should I compile with by default? Am I just expected to compile and test the whole program three times, once for each sanitizer?
As you pointed out, sanitizers are typically mutually exclusive (you can combine only Asan+UBsan+Lsan, via -fsanitize=address,undefined,leak, maybe also add Isan via -fsanitize=...,integer if your program does not contain intentional unsigned overflows) so the only way to ensure complete coverage is to do separate QA runs with each of them (which implies rebuilding SW for every run). BTW doing yet another run with Valgrind is also recommended.
Using Asan in production has two aspects. On one hand common experience is that some bugs can only be detected in production so you do want to occasionally run sanitized builds there, to increase test coverage [*]. On the other hand Asan has been reported to increase attack surface in some cases (see e.g. this oss-security report) so using it as hardening solution (to prevent bugs rather than detect them) has been discouraged.
[*] As a side note, Asan developers also strongly suggest using fuzzing to increase coverage (see e.g. Cppcon15 and CppCon17 talks).
[**] See Asan FAQ for ways to make AddressSanitizer more strict (look for "aggressive diagnostics")

How to patch TControlCanvas.CreateHandle (FreeDeviceContexts issue)?

The question is related to my previous question:
access violation at address in module ntdll.dll - RtlEnterCriticalSection with TCanvas.Lock
Apparently there is a bug in Delphi's code (see QC 64898: Access violation in FreeDeviceContexts). This bug goes all the way until D2010, AFAIK.
The suggested workaround worked fine so far. Now I have a dilemma.
I don't like the idea of using a private copy of Controls.pas in my project - I'm not sure it is safe. The Controls unit is a very low level unit, and I really feel it's a drastic move, considering that my huge application works fine, except for the mentioned problem. I'm also not sure if/how to rebuild all components/units that rely on the Controls unit in my project.
Is it possible to patch TControlCanvas.CreateHandle(), which uses an internal CanvasList and private members?
NOTE: I will be using the patch for this project only (Delphi 5). I don't mind hard-coding the offsets. AFAIK, patching privates always uses hard-coded offsets, based on the compiler version. I might be able to deal with privates myself (without class helpers), but I have no clue how to handle CanvasList and FreeDeviceContext(), which are declared in the implementation section of the Controls unit.
As discussed in the comments, it is possible to access the private and protected members of classes, even in older versions of Delphi without "class helpers".
However, the problem in this case revolves around the details of a particular method implementation, not just being able to access or modify private member variables. Further, the implementation of a particular method which makes use of an implementation variable in the unit involved. Specifically the CanvasList variable that you have noted.
Even with the benefit of class helpers, there is no simple way to access that implementation variable.
Your current solution is the simplest and safest approach: Using a copy of the entire unit with a modification applied to the specific method required to solve the issue.
Rest assured, this is not an uncommon practise. :)
Your only problem with this approach is to be sure to manage the fact that you are relying on this "privatised" copy of the unit when standing up new development environments or upgrading to new versions of the IDE.
In the case of new development environments, careful project configuration should take care of things (and of course, your modified Controls.pas unit is part of your version controlled project).
In the case of upgrading to newer Delphi versions, you simply have to remember to revisit the modified Controls unit in each new version, updating the private copy in your project and re-applying the modifications you have made as appropriate. In most if not all cases this should be straightforward.
But I Really Want to Access the CanvasList Variable
As I say above, there is no simple way to access the implementation variable used in that unit (which will be necessary if you were to somehow contrive to "patch" the code at runtime, rather than replacing it with a modified copy at compile time).
But that implies that there is a **non-**simple way. And there is.
Like any data in your application, that variable resides at some memory address in your process. It's only the compiler scoping rules which prevent you from addressing it directly in source. There is nothing stopping you figuring out how to find that location at runtime and addressing that memory location via a pointer as you would any other "raw" memory address to which you have access.
I don't have a worked up demonstration of how to do that and strongly recommend that trying to implement such a solution is a waste of time and effort, given that an easier solution exists (copying and modifying the unit).
Apart from anything else, depending upon how reliable the method is for determining the memory location involved, direct access to that memory location could prove potentially vulnerable not only to differences between compiler versions but even to changes arising from compiler settings.
In terms of the end result, it is no better than copying the unit but is certainly far harder and far less reliable.

Why does the IDE error parsers represent the Size property of NativeInt as error?

I have recently started using the built-in helper classes for basic data types, makes them look like C#. The IDE has a really unusual behavior associated only with NativeInt and NativeUInt helpers, as such it interprets the Size property to be undefined.
Its a nuisance to see a line of errors which are actually not there, and then sniffing through them for the real errors. Other mistakes made by the IDE error parsers can almost always be mitigated with a successful compile but this one never goes away.
Does somebody know a solution to this aside from not using the property and switching back to SizeOf ()? A hack solution is also welcome.
Disable "Error Insight" in the IDE settings. Seriously. It never works right, reports false errors that are not real errors, etc. It gets its info from a separate source then the actual code, and easily gets out of sync. Best to just not use it at all.

Low level programming: How to find data in a memory of another running process?

I am trying to write a statistics tool for a game by extracting values from game's process memory (as there is no other way). The biggest challenge is to find out required addresses that store data I am interested. What makes it even more harder is dynamic memory allocation - I need to find not only addresses that store data but also pointers to those memory blocks, because addresses are changing every time game restarts.
For now I am just manually searching game memory using memory editor (ArtMoney), and looking for addresses that change their values as data changes (or don't change). After address is found I am looking for a pointer that points to this memory block in a similar way.
I wonder what techniques/tools exist for such tasks? Maybe there are some articles I can read? Is mastering disassembler the only way to go? For example game trainers are solving similar tasks, but they make them in days and I am struggling already for weeks.
Thanks.
PS. It's all under windows.
Is mastering disassembler the only way to go?
Yes; go download WinDbg from http://www.microsoft.com/whdc/devtools/debugging/default.mspx, or if you've got some money to blow, IDA Pro is probably the best tool for doing this
If you know how to code in C, it is easy to search for memory values. If you don't know C, this page might point you to your solution if you can code in C#. It will not be hard to port the C# they have to Java.
You might take a look at DynInst (Dynamic Instrumentation). In particular, look at the Dynamic Probe Class Library (DPCL). These tools will let you attach to running processes via the debugger interface and insert your own instrumentation (via special probe classes) into them while they're running. You could probably use this to instrument the routines that access your data structures and trace when the values you're interested in are created or modified.
You might have an easier time doing it this way than doing everything manually. There are a bunch of papers on those pages you can look at to see how other people built similar tools, too.
I believe the Windows support is maintained, but I have not used it myself.

Resources