How to improve accuracy of profiling - delphi

I want to improve the running time of some code.
In order to that I first time the running time of all relevant code, using code like this:
before:= rdtsc;
myobject.run;
after:= rdtsc;
Then I zoom in and time a relevant part, like so:
procedure myobject.part;
begin
StartTime:= rdtsc;
...
EndTime:= rdtsc;
inc(TotalTime, (EndTime- StartTime));
end;
I have some code to copy paste the timings into Excel, a typical outcome would look like:
(the 89.8% and 10.2% adding up to 100% is a coincidence and has nothing to do with the data or the question)
(when the data shows 1 it means 0 to avoid divide by zero errors)
Note the difference between run A and run B.
I have not changed anything yet so run A and B should give the same running time.
Further note that I know that on both runs procedure part was invoked exactly the same number of times (the data is the same and the algorithm is deterministic).
The running time of procedure part is very short (it is just called many times).
If there was some way to block out other processes during these short bursts of runtime (less than 700 CPU cycles) my timings would be much more accurate.
How do I get these timings to be more reliable?
Is there a way to monopolize the CPU to only run my task when timing and nothing else?
Note that I'm not looking for obvious answers like:
- Close other running programs
- Disable the virusscanner etc...
I've tagged the question Delphi because I'm using Delphi right now (and there may be some Delphi specific option to achieve this result).
I've also tagged it language-agnostic because there may be some more general way.
Update
Because I'm using the CPU instruction RDTSC I'm not affected by CPU throttling. If the CPU slows down, the number of cycles stays the same.
Update2
I have 2 answers, but neither answers the question...
The question is how do I prevent these changes in running time?
Do I have to run the code 20x and always compare the lowest running time out of the 20 runs?
Or to I set my program priority to realtime?
Or is there some other trick to use so my code sample does not get interrupted?

To want to improve the running time of some code.
In order to that I first time the running time of all relevant code, ...
OK, I'm a bit of a stuck record on this subject, but lots of people think that to improve running time requires first measuring it accurately.
Not So.
Improving running time requires finding out what's taking a large fraction of time (the exact fraction does not matter) and doing it differently or maybe not at all.
What it's doing is often not revealed by timing individual routines.
Here's the method I use,
and here's a very amateur video of it.

The problem with profiling your code like that, by sticking special statements into it, is that those special statements themselves take time to run. And since the things taking the most time are likely to be things happening in tight loops, the more they run, the more they distort your timings. What you need for good information is something that will observe your program from outside, without modifying the executing code.
In other words, you need a sampling profiler. And there just happens to be a very good one for Delphi available for free, by the rather descriptive name of Sampling Profiler. It runs your program and watches what it's doing, then correlates that against the map file (make sure to set up your project options to generate a Detailed map file) to give you an intelligible readout on what your program is spending its time on.
And if you want to narrow things down, you can use OutputDebugString to output profiling commands to make it only pay attention to specific parts of your code. It's got instructions in the help file.
I've used a lot of different methods, and this is the most useful way I've found to figure out what Delphi programs are spending their time on. And it's free. Give it a try.

Related

Alternatives to D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS?

This is a followon to this question about using the DX11VideoRenderer sample (a replacement for EVR that uses DirectX11 instead of EVR's DirectX9).
I've been trying to track down why it uses so much more CPU than the EVR. Task Manager shows me that most of that time is kernel mode.
Using profiling tools, I see that a LOT of time is being spent in numerous calls to NtDelayExecution (aka Sleep). How many calls? ~100,000 over the course of ~12 seconds. Ok, yeah, I'm sending a lot of frames in those 12 seconds, but that's still a lot of calls, every one of which requires a kernel mode transition.
The callstack shows the last call in "my" code is to IDXGISwapChain1::Present(0, 0). The actual call seems be Sleep(0) and comes from nvwgf2umx.dll (which is why this question is tagged NVidia: hopefully someone there can call up the code and see what the logic is behind such frequent calls).
I couldn't quite figure out why it would need to do /any/ Sleeping during Present. It's not like we wait for vertical retrace anymore, is it? But the other reason to use Sleep has to do with yielding to other threads. Which led me to a serious clue:
If I use D3D11_CREATE_DEVICE_PREVENT_INTERNAL_THREADING_OPTIMIZATIONS, the CPU utilization drops. Along with some other fixes, the DX11 version is now faster and uses less CPU time than the DX9 version (which is what I would hope/expect). Profiling shows that Sleep has dropped from >30% to <1%.
Unfortunately, this page tells me:
This flag is not recommended for general use.
Oh.
So, any ideas on how to get decent performance without using debug flags?

Is "Running Time", "CPU Usage" a useful metric under Instruments to draw any conclusions?

Have profiled an app on an iPhone 4 using "Time Profiler" and "CPU Monitor" and trying to make sense of it.
Given that execution time is 8 minutes, CPU "Running Time" is around 2 minutes.
About 67% of that is on the main thread, out of which 52% is coming from "own code".
Now, I can see the majority of time being spent in enumerating over arrays (and associated work), UIKit operations, etc.
The problem is, how do I draw any meaningful conclusions out of this data? i.e. there is something wrong going on here that needs fixing.
I can see a lot of CPU load over that running time (median at 70%) that isn't "justifiable" given the nature of the app.
Having said that, there are some things that do stand out. Parsing HTTP responses on the main thread, creating objects eagerly (backed up by memory profiling as well).
However, what I am looking for here is offending code along with useful conclusions solely based on CPU running time. i.e. spending "too much" time here.
Update
Let me try and elaborate in order to give a better picture.
Based on the functional requirements of this app, I can't see why it shouldn't be able to run on an iPhone 3G. A median CPU usage of around 70%, with a peak of 97% only looks like a red flag on an iPhone 4.
The most obvious response to this is to investigate the code and draw conclusions from that.
What I am hoping for is a categorical answer of the following form
if you spend anywhere between 25% - 50% of your time on CA, there is something wrong with your animations
if you spend 1000ms on anything related to UIKit, better check your processing
Then again, maybe there aren't any answers only indications of things being off when it comes to running time and CPU usage.
Answer for question "is there something wrong going on here that needs fixing" is simple: do you see the problem while using application? If yes (you see glitches in animation, or app hang for a while), you probably want to fix it. If not, you may be looking for premature optimization.
Nonetheless, parsing http responses in main thread, may be a bad idea.
In dev presentations Apple have pointed out that whilst CPU usage is not an accurate indicator in the simulator it is something to hold stock of when profiling on device. Personally I would consider any thread that takes significant CPU time without good reason a problem that needs to be resolved.
Find the time sinks, prioritise by percentage, and start working through them. These may not be visible problems now but they will begin to, if they have not already, degrade the user's experience of the app and potentially the device too.
Check out their documentation on how to effectively use CPU profiling for some handy hints.
If enumeration of arrays is taking a lot of time then I would suggest that dictionaries or other more effective caches could be appropriate, assuming you can spare some memory to ease CPU.
An effective approach may be to remove all business logic from the main thread (a given) and make a good boundary layer between the app and the parsing / business logic. From here you can better hook in some test suites that could better tell you if the code is at fault or if it's simply the significant requirements of the app UI itself...
Eight minutes?
Without beating around the bush, you want to make your application faster, right?
Forget looking at CPU load and wondering if it's the right amount.
Forget guessing if it's HTTP parsing. Maybe it is, but guessing won't tell you.
Forget rummaging around in the code timing things in hopes that you will find the problem(s).
You can find out directly why it is spending so much time.
Here's the method I use,
and here's an (amateurish) video of it.
Here's what will happen if you do that.
First you will find something you would never have guessed, and when you fix it you will lop a big chunk off that 8 minutes, like maybe down to 6 minutes.
Then you do it again, and lop off another big chunk.
You repeat until you can't find anything to fix, and then it will be much faster than your 8 minutes.
OK, now the ball is in your court.

Sampling performance profiler user interface to work with my custom domain specific language (DSL)

I have a custom Domain Specific Language (DSL) and a sampling performance profiler that generates a data feed similar to other profilers (current stack trace, inclusive samples, exclusive samples, etc). I am looking to leverage an existing GUI to better visualize the profiler results but haven't been able to find one that is pluggable/extendable in this way.
Also, the DSL runs in a computing cluster so any a GUI that can slice and dice data for a single node or multiple nodes at the same time is ideal. Eventually I would like to integrate it in visual studio but I'd be happy for anything at this point. Thanks for your suggestions!
Rob
I doubt if you're going to find such a UI that you can leverage for your own purposes.
It shouldn't be too hard to write one though, not unlike the one in Zoom,
or like this one that I did years ago, typically called a butterfly view, but for lines of code, not functions:
It's easy to program. You have some number of stack traces, and each stack trace consists of a sequence of lines of code, where each line of code represents a call to a function, except for the "bottom" line. Let's assume you have 100 stack traces.
At any point in time, you have a particular line of code which is the "focus", in this case foo.cpp: 326. The 33% is the fraction of stack traces containing that line (even if the line appears more than once on some of the traces).
That percent is roughly how much overall time could be saved if that line could be removed, so that is what it is responsible for.
Of those 33 stack traces, it says line gzorn.cpp: 99 appears above the focus line on 25 of them, and foo.cpp:105 on 8 of them. (That's a case of recursion. Notice if there's recursion the ancestors or descendents don't have to add up to the same percent as the focus line.)
It also says of those 33 stack traces going through that line, on 16 of them bar.cpp:45 is the next line down, 10 of them have bar.cpp:10, and 7 of them have bar.cpp:17.
Alternatively, you could show on each line the fraction of traces going through that line.
Then the user can hunt for costly lines of code by clicking on any one of those "neighbors" to change the focus.
Notice how easy this is to program. You don't need any data structure other than the stack traces themselves, and you don't have to calculate any statistics.
What's more you don't have to have a large number of stack traces, because all that would do is make the percents more precise, which is not particularly important.
P.S. I almost forgot to mention. Your stack traces should be taken not just during CPU time but also during I/O or other blocked time. That is critical, because often programs can be slow due to I/O, sleeps, resource waits, etc., and you need to know that. If you're worried about slowness simply due to competition with other processes, that's not really a problem because it does not throw off the percents very much.

memory not freed in matlab?

I am running a script that animates a plot (simulation of a water flow). After a while, I kill the loop by doing ctrl-c.
After doing this several times I get the error:
??? Error: Out of memory.
And after I start receiving that error, every call to my script will generate it.
Now, it happens before anything inside the function that I am calling is executed, i.e even if I add the line a=1 as the first line of the function I am calling, I still get the error and no printout, so the code inside the function doesn't even get executed.
What could be causing this?
There are several possible reasons.
Most likely your script creates some variables that are filling up the memory. Run
clear all
before restarting the script, so that all the variables are cleared, or change your script to a function (which will automatically erase all temporary variables after the function returns). Note that this also clears all loaded functions, so your next execution of the script has to load them again which will slow down the next execution by a (usually tiny) bit. It may be sufficient to call clear only.
Maybe you're animating by plotting several plots over one another (without clearing the axes first). Thus you might run out of Java heap space. You can close the open figures individually, or run
close all
You can also increase the amount of Java Memory Matlab uses on your system (see instructions here) - note that the limit is generally rather low, annoyingly so if you want to tons of figures.
Especially if you're running an older version of Windows, you may get your memory fragmented. Matlab needs contiguous blocks of free space to assign variables. To check for memory fragmentation, run
memory
and look at the number for the maximum possible variable size. If this is much smaller than the size available for all arrays, it's time to restart Matlab (I guess if you use a Windows version that would require a reboot to fix the problem, you may want to look into getting a new computer with Win7).
You can also try the pack command, eg:
close all;
clear all;
pack;
to clear memory. Although after a recent mathworks seminar I asked one of the mathworks guru's and he also conformed #Andrew Janke's comment regarding memory fragmentation. Usually quitting and restarting matlab sorts this out for me (on XP).
clear all close all are straight-forward ways to free memory, which are known by all non-beginners.
The main issue is that when you have done some data large data processing, and cleared/closed everything off - there is still significant memory used by matlab.
This is a currently major problem with matlab, and to my knowledge there is no solution rather than restarting matlab, which is a pity.
It sounds like you are not clearing any of your variables. You should either provide a way to stop the loop without hitting ctrl-c (write a simple GUI with a "Stop" button and your display) and then clean up your workspace in the script or clear your variables at the start of the script.
Are you intentionally storing all the data (or some large component) on each iteration of your loop?

Hunting down EOutOfResources

Question:
Is there an easy way to get a list of types of resources that leak in a running application? IOW by connecting to an application ?
I know memproof can do it, but it slows down so much that the application won't even last a minute. Most taskmanager likes can show the number, but not the type.
It is not a problem that the check itself is catastrophic (halts the app process), since I can check with a taskmgr if I'm getting close (or at least I hope)
Any other insights on resource leak hunting (so not memory) is also welcomed.
Background:
I've an Delphi 7/2006/2009 app (compiles with all three) and after about a few week it starts acting funny. However only on one of the places it runs, on several other systems it runs till the power goes out.
I've tried to put in some debug code to narrow the problem down. and found out that the exception is EOutofResources on a save of a file. (the file save can happen thousands of times a day).
I have tried to reason out memory leaks (with fastmm), but since the dataflow is quite high (60MByte/s from gigabit industrial camera's), I can only rule out "creeping" memory leaks with fastmm, not quick flashes of memoryleaks that exhaust memory around the time it happens. If something goes wrong, the app fills memory in under half a minute.,
Main suspects are filehandles that are somehow left on some error and TMetafiles (which are streamed to these files). Minor suspects are VST, popupmenu and tframes
Updates:
Another possible tip: It ran fine for two years with D7, and now the problems are with Turbo Explorer (which I use for stable projects not converted to D2009 ).
Paul-Jan: Since it only happens once a week (and that can happen at night), information acquisition is slow. Which is why I ask this question, need to combine stuff for when I'm there thursday. In short: no I don't know 100% sure. I intend to bring the entire Systemtools collection to see if I can find something (because then it will be running for days). There is also a chance that I see open files. (maybe should try to find some mingw lsof and schedule it)
But the app sees very little GUI action (it is an machine vision inspection app), except screen refresh +/- 15/s which is tbitmap stretchdraw + tmetafile, but I get this error when saving to disk (TFileStream) handles are probably really exhausted. However in the same stream, TMetafile is also savetostreamed, something which later apps don't have anymore, and they can run from months.
------------------- UPDATE
I've searched and searched and searched, and managed to reproduce the problems in-vitro two or three times. The problems happened when memusage was +/- 256MB (the systems have 2GB), user objects 200, gdi objects 500, not one file more open than expected ).
This is not really exceptional. I do notice that I leak small amounts of handles, probably due to reparenting frames (something in the VCL seems to leak HPalette's), but I suspect the core cause is a different problem. I reuse TMetafile, and .clear it inbetween. I think clearing the metafile doesn't really (always?) resize the resource, eventually each metafile in the entire pool of tmetafile at maximum size, and with 20-40+ tmetafiles (which can be several 100ks each) this will hit the desktop heap limit.
That's theory, but I'll try to verify this by setting the desktop limit to 10MB at the customers, but it will be several weeks before I have confirmation if this changes anything. This theory also confirms why this machine is special (it's possible that this machine naturally has slightly larger metafiles on average). Occasionally freeing and recreating a tmetafile in the pool might also help.
Luckily all these problems (both tmetafile and reparenting) have already been designed out in newer generations of the apps.
Due to the special circumstances (and the fact that I have very limited test windows), this is going to be a while, but I decided to accept the desktop heap as an example for now (though the GDILeaks stuff was also somewhat useful).
Another thing that the audit revealed GDI-types usage in a thread (though only saving tmetafiles (that weren't used or connected otherwise) to streams.
------------- Update 2.
Increasing the desktop limit only seemed to minorly increase the time till the problem occurred.
Unfortunately, I won't be able to follow up on this further, since the machines were updated to a newer version of the framework that doesn't have the problem.
In summary I can only state what the three core modifications were going from the old to the new framework:
I no longer change screens by reparenting frames. I now work with forms that I hide and show. I changed this since I also had very rare crashes or exceptions (that could be clicked away) due to this. The crashes were all while operating the GUI though, not spontaneously like the main problem
The routine where the crash happened dealt with TMetafile. TMetafile has been designed out, and replace by a simpler own made format. (basically arrays with Opengl vertices)
Drawing no longer happened with tbitmap with a tmetafile overlay strechdrawn over it, but using OpenGL.
Of course it could be something else too, that got changed in the rewrite of the above parts, fixing some very nasty detail bug. It would have to be an extremely bad one, since I analysed the above system as much as I could.
Updated nov 2012 after some private mail discussion: In retrospect, the next step would have been adding a counter to the metafiles objects, and simply reinstantiate them every x * 1000 uses or so, and see if that changes anything. If you have similar problems, try to see if you can somewhat regularly destroy and reinitialize long living resources that are dynamically allocated.
There is a slim chance that the error is misleading. The VCL naively reports EOutOfResources if it is unable to obtain a DC for a window (see TWinControl.GetDeviceContext in Controls.pas).
I say "naively" because there are other reasons why GetDC() might return a NULL handle and the VCL should report the OS error, not assume an out of resources condition (there is a Windows version check required for this to be reliably possible, but the VCL could and should take of that too).
I had a situation where I was getting the EOutOfResources error as the result of a window handle becoming invalid. Once I'd discovered the true problem, finding the cause and fixing it was simple, but I wasted many, many hours trying to find a non-existent resource leak.
If possible I would examine the stack trace leading to this exception - if it is coming from TWinControl.GetDeviceContext then the problem may not be what you think (it's impossible to say what it might be of course, but eliminating the impossible is always the first step toward discovering the solution, no matter how improbable).
If they are GDI handle leaks you can have a look at MSDN Magazine January 2003 which uses the tool GDILeaks. Other tools are GDIObj or GDIView. Also see here.
Another source of EOutOfResources could be that the Desktop Heap is full. I've had that issue on busy terminal servers with large screens.
If there are lots of file handles you are leaking you could check out Process Explorer and have a look at the open file handles of your process and see any out of the ordinary. Or use WinDbg with the !htrace command.
I've run into this problem before. From what I've been able to tell, Delphi may throw an EOutOfResources any time the Windows API returns ERROR_NOT_ENOUGH_MEMORY, and (as the other answers here discuss) Windows may return ERROR_NOT_ENOUGH_MEMORY for a variety of conditions.
In my case, EOutOfResources was being caused by a TBitmap - in particular, TBitmap's call to CreateCompatibleBitmap, which it uses with its default PixelFormat of pfDevice. Apparently Windows may enforce fairly strict systemwide limits on the memory available for device-dependent bitmaps (see, e.g, this discussion), even if your system otherwise has plenty of memory and plenty of GDI resources. (These systemwide limits are apparently because Windows may allocate device-dependent bitmaps in the video card's memory.)
The solution is simply to use device-independent bitmaps (DIBs) instead (although these may not offer quite as good of a performance). To do this in Delphi, set TBitmap.PixelFormat to anything other than pfDevice. This KB article describes how to pick the optimal DIB format for a device, although I generally just use pf32Bit instead of trying to determine the optimal format for each of the monitors the application is displayed on.
Most of the times I saw EOutOfResources, it was some sort of handle leak.
Did you try something like MadExcept?
--jeroen
"I've tried to put in some debug code to narrow the problem down. and found out that the exception is EOutofResources on a save of a file. (the file save can happen thousands of times a day)."
I'm shooting in the dark here, but could it be that you're using the Windows API to (GetTempFileName) create a temp file and you're blowing out some file system indexes or forgetting to close a file handle?
Either way, I do agree that with your supposition about it being a file handle problem. That seems to be the most likely thing given your symptoms and diagnosis.
Also try to check handle count for the application with Process Explorer from SysInternals. Handle leaks can be very dangerous and they build slowly through time.
I am currently having this problem, in software that is clearly not leaking any handles in my own code, so if there are leaks they could be happening in a component's source code or the VCL sourcecode itself.
The handle count and GDI and user object counts are not increasing, nor is anything being created. Deltic's answer shows corner cases where the message is kind of a red-herring, and Allen suggests that even a file write can cause this error.
So far, The best strategy I have found for hunting them down is to use either JCL JCLDEBUG stack tracebacks, or the exception report save features in MadExcept to generate the context information to find out what is actually failing.
Secondly, AQTime contains many tools to help you, including a resource profiler that can keep the links between where the code that created the resources is, and how it was called, along with counts of the total numbers of handles. It can grab results MID RUN and so it is not limited to detecting unfreed resources after you exit. So, run AQTime, do a results capture in mid run, wait several hours, and capture again, and you should have two points in time to compare handle counts. Just in case it is the obvious thing. But as Deltics wisely points out, this exception class is raised in cases where it probably shouldn't have been.
I spent all of today chasing this issue down. I found plenty of helpful resources pointing me in the direction of GDI, with the fact that I'm using GDI+ to produce high-speed animations directly onto the main form via timer/invalidate/onpaint (animation performed in separate thread). I also have a panel in this form with some dynamically created controls for the user to make changes to the animation.
It was extremely random and spontaneous. It wouldn't break anywhere in my code, and when the error dialog appeared, the animation on the main form would continue to work. At one point, two of these errors popped up at the same time (as opposed to sequential).
I carefully observed my code and made sure I wasn't leaking any handles related to GDI. In fact, my entire application tends to keep less than 300 handles, according to Task Manager. Regardless, this error would randomly pop up. And it would always correspond with the simplest UI related action, such as just moving the mouse over a standard VCL control.
Solution
I believe I have solved it by changing the logic to performing the drawing within a custom control, rather than directly to the main form as I had been doing before. I think the fact that I was rapidly drawing on the same form canvas which shared other controls, somehow they interfered. Now that it has its own dedicated canvas to draw on, it seems to be perfectly fixed.
That is with about 1 hour of vigorous testing at least.
[Fingers crossed]

Resources