I am having some performance problems with my Delphi 2006 app.
Can you suggest any profiling tools that will help me find the bottleneck, i.e. a tool like Turbo Profiler?
I asked the same question not too long ago.
I've downloaded and tried AQtime. It does seem comprehensive, but it is not an easy-to-use tool and is VERY expensive for an individual programmer (i.e. $600 US). I loved the fact that it was non-invasive (it did not change your code), and that it could do line-by-line profiling, until I found that because it is an instrumenting profiler, it can lead to improper optimizations, as in: Why is CharInSet faster than Case statement?
I tried a demo of ProDelphi, which is much less expensive (about $80, I think), but it was much too clunky for me - I didn't like the user interface at all, and it is invasive - it changes your code to add the instrumentation, which you have to be careful about.
I used GpProfile with Delphi 4 for many years. I loved it. It also was invasive, but it worked so well I learned to trust it and it never gave me a problem in 10 years. But when I upgraded to Delphi 2009, I didn't think it best to try using it, since it hasn't been upgraded and by GP's admission, won't work without modifications. I expect you won't be able to use it either with Delphi 2006.
ProDelphi and GpProfile will only profile at the procedure level. If you want to time individual lines (which I sometimes had to), you have to call PROC1, PROC2, PROC3 for each line and put the one line in each PROC (see the sketch below). It was a bit of an annoyance to have to do that, but it gave me good results (at least I was happy with the results of GpProfile doing that).
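Here is a minimal sketch of that workaround (the procedure names and the work inside them are made up, not from any real project): each statement you want timed gets wrapped in its own tiny procedure, so a procedure-level profiler reports a separate figure for each "line".

var
  Data: array of Integer;
  Total: Integer;

procedure Step1_Allocate;
begin
  SetLength(Data, 1000000);        // "line" 1, now timed as its own procedure
end;

procedure Step2_Fill;
var
  i: Integer;
begin
  for i := 0 to High(Data) do      // "line" 2, now timed as its own procedure
    Data[i] := i;
end;

procedure Step3_Sum;
var
  i: Integer;
begin
  Total := 0;
  for i := 0 to High(Data) do      // "line" 3, now timed as its own procedure
    Inc(Total, Data[i]);
end;

procedure ProcessEverything;
begin
  Step1_Allocate;                  // the profiler now shows a separate timing
  Step2_Fill;                      // for each of these three calls instead of
  Step3_Sum;                       // one lump figure for ProcessEverything
end;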
The answer I accepted in my CharInSet question said that "Sampling profilers, which periodically check the location of the CPU, are usually better for measuring code time." and a later answer gave Eric Grange's free sampling profiler for Delphi that now supports Delphi 2009. I haven't tried it yet, but I've heard good things about it, and it is the next one I'm going to try.
By the way, you might be best off saving your $600 by NOT buying AQtime, and instead using it to upgrade your Delphi 2006 to Delphi 2009. The stability, speed and extra features (especially Unicode) will be worth your while. See: What are major incentives to upgrade to D2009 (Unicode excluded)?
Also AQtime does not integrate into Delphi 2009 yet.
One other free one with source that I found out about, but haven't tried yet, is TProfiler. If anyone has tried that one, I'd like to know what they think.
Note: The Addendum I added afterwards to question 291631 seems like it may be the answer. See Andre's open source program: asmprofiler
Feb 2010 followup. I bit the bullet and purchased AQTime. A few months ago they finally integrated it into Delphi 2009 which is what I use (but they still have to do Delphi 2010). The viewing of source lines and their individual times and counts is invaluable to me, and AQTime does a superb job of this.
I have just found a very nice free sampling profiler and it supports Delphi 2009
I've used ProDelphi, mostly to determine which routines are eating the most time. It's an instrumenting profiler, meaning it adds a bit of code to the beginning and end of each routine. You control which routines it profiles with directives inside comments. You can also profile sections of a routine, but the sections must start and stop at the same block level, with no entry into or exit out of the section. Optimization must be off where ProDelphi inserts its code (where you put the directives), but you can turn it on anywhere else.
The interface is kinda clunky, but very fast once you get the hang of it. You can do useful work with the free version (limited to 10 routines or sections). ProDelphi can quickly tell you which routines you should examine - but not why, or which lines.
Recently, I've started using Intel's VTune Performance Analyzer. 'WOW' doesn't begin to sum it up. I am impressed. I simply had no idea all this was built into modern Intel processors. Did you know it can tell you exactly how often a single instruction needed to wait for the L1 Data Cache to look sideways at another core before reloading a word from a higher cache? If I keep writing, I'll just sound like a breathless advert for the product.
Go to Intel and download the full-working timed demo. Dig around the net and find a couple of videos on how to get started. (Otherwise, you run the risk of being stymied by all the options.) It works with any compiler. Just point it to a .exe. It'll show you source lines if your .exe includes debug info & you point it to the source code.
I was stuck trying to optimize an inner loop that called a function I wrote. There were no external calls except length(str). This inner loop ran billions of times per run, and ate up about half the cpu time -- a perfect candidate for optimization. I tried all sorts of standard optimizations, with little to no effect. VTune shows hot-spots. I just drilled down till it showed me the ASM my code generated, and how much time each instruction took.
Here's what VTune told me:
line nnnn [line of delphi code] ...
addr hhhh   cmp byte ptr [edx+ecx],0x14         3 cycles
addr hhhh   ja  label_x                     10302 cycles
The absolute values mean nothing. (I think I was measuring cycles per instruction retired.) The relative values make it kinda clear where all the time went. The great thing was the Advice Window. It told me the code stalled waiting for data to load into the L1 data cache, and actually gave me good advice on how to avoid stalls.
My mistake was in thinking of the Core2 Quad as just a really fast 8086 CPU. No^3. The code was spending 99% of its time waiting for data to load from memory because I was jumping around too much. My algorithm assumed that memory was RAM (Random Access). That's not how modern CPUs work. Data in L1 cache might be accessed in 1 or 2 cycles, but accessing the L2 or L3 cache costs tens to hundreds of cycles, and going to RAM costs thousands. However, all that latency is avoided when you access your data sequentially -- because the processor will pre-load the cache with the data following the first byte you ask for.
Net result is that I rewrote the algorithm to access the data more sequentially, and got a 10x speedup, which was good enough. When I have the time, I'm certain I can get another 10x out of it. But that's just the Geek in me. Good Enough is good enough.
I already knew that you get the most bang by optimizing your algorithm, not your code. I thought I only needed the profiler to tell me what needed optimizing. But I also needed it to find the reason for the bottleneck so I could design a faster algorithm.
The new algorithm isn't radically different from the old. It just stores the data such that it can be accessed sequentially. For example, in one place I moved a field from an array of records into its own array of integers, because the inner loop didn't need the rest of the data in each record (sketched below). I also had a rectangular matrix stored as a dynamic array of dynamic arrays. The code used this to randomly access megabytes of data (and the poor L1 data cache is only 64Kb). I figured out how to store it in a linear array as diagonals of the matrix, which is the order in which I use the data. (OK, maybe that part is radical.)
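To illustrate the first of those changes (with a made-up record layout, not the actual code): pulling the one field the inner loop needs out of an array of records into its own array makes the loop walk memory sequentially in small strides, which is exactly what the prefetcher rewards.

type
  TItem = record
    Key: Integer;
    Name: string[32];
    Payload: array[0..63] of Byte;
  end;

var
  Items: array of TItem;      // original layout: reading only Key still drags
                              // the whole ~100-byte record through the cache
  Keys: array of Integer;     // new layout: the keys are packed contiguously

function SumKeysFromRecords: Int64;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to High(Items) do
    Inc(Result, Items[i].Key);    // strides ~100 bytes per element
end;

function SumKeysFromArray: Int64;
var
  i: Integer;
begin
  Result := 0;
  for i := 0 to High(Keys) do
    Inc(Result, Keys[i]);         // strides 4 bytes per element, so far more
                                  // useful data fits in each cache line
end;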
Anyway, I'm sold on VTune.
I have used http://www.prodelphi.de with success on a Delphi 7 project in the past. Cheap and it works. Don't let the bush-league web site scare you off.
www.AutomatedQA.com has the best choice for Delphi profiling (AQTime)
I use and recommend Sampling Profiler; I think you can get it from the embarcadero.public.attachments newsgroup.
Here's another choice, I haven't used this one before: http://www.prodelphi.de
Final choice that I know of for Delphi, http://gp.17slon.com/gpprofile/index.htm
Final note, www.torry.net is a great place for Delphi component/tool search
For a study on genetic programming, I would like to implement an evolutionary system on the basis of LLVM and apply code mutations (possibly at the IR level).
I found llvm-mutate, which is quite useful for executing point mutations.
As far as I have understood, the instructions get counted/numbered, and one can then e.g. delete a numbered instruction.
However, introducing new instructions only seems to be possible by reusing one of the statements already available in the code.
Real mutation, however, would allow inserting any of the allowed IR instructions, irrespective of whether it is already used in the code to be mutated.
In addition, it should be possible to insert calls to functions from linked libraries (not used in the current code, but possibly available because the lib has been linked in by clang).
Did I overlook this in the llvm-mutate or is it really not possible so far?
Are there any projects trying to implement, or that have already implemented, such mutations for LLVM?
LLVM has lots of code analysis tools which should allow the implementation of the aforementioned approach. LLVM is huge, so I'm a bit disoriented. Any hints on which tools could be helpful (e.g. for getting a list of available library functions etc.)?
Thanks
Alex
Very interesting question. I have been intrigued by the possibility of doing binary-level genetic programming for a while. With respect to what you ask:
It is apparent from their documentation that LLVM-mutate can't do what you are asking. However, I think it is wise for it not to. My reasoning is that any machine-language genetic program would inevitably face the "Halting Problem", i.e. it would be impossible to know if a randomly generated instruction would completely crash the whole computer (for example, by assigning a value to an OS-reserved pointer), or whether it might run forever and take all of your CPU cycles. Turing's theorem tells us that it is impossible to know in advance if a given program would do that. Mind you, LLVM-mutate can still cause a perfectly harmless program to crash or run forever, but I think their approach makes it less likely by only reusing existing instructions.
However, such a thing as "impossibility" only deters scientists, not engineers :-)...
What I have been thinking is this: in nature, real mutations work a lot more like LLVM-mutate than like what we do in normal Genetic Programming. In other words, they simply swap letters out of a very limited set (A, T, C, G), and every possible variation comes out of this. We could have a program or set of programs with an initial set of instructions, plus a set of "possible functions" either linked or defined in the program. Most of these functions would not actually be used, but they would be there to provide "raw DNA" for mutations, just like in our DNA. This set of functions would have the complete (or semi-complete) set of possible functions for a problem space. Then, we simply use basic operations like the ones in LLVM-mutate.
Some possible problems though:
Given the amount of possible variability, the only way to have acceptable execution times would be to have massive amounts of computing power, possibly achievable in the Cloud or with GPUs.
You would still have to contend with Mr. Turing's Halting Problem. However, I think this could be resolved by running the solutions in a "Sandbox" that doesn't take you down if the solution blows up: something like a single-use virtual machine or a Docker-like container, with a time limitation (to get out of infinite loops). A solution that crashes or times out would get the worst possible fitness, so that the programs would tend to diverge away from those paths.
As to why do this at all, I can see a number of interesting applications: self-healing programs, programs that self-optimize for a specific environment, program "vaccination" against vulnerabilities, mutating viruses, quality assurance, etc.
I think there's a potential open source project here. It would be insane, dangerous and a time-sucking vortex: just my kind of project. Count me in if someone starts doing it.
Why is decompiling a Delphi exe so easy, compared to executables built with other programming languages/compilers?
There are a few things that help with reversing delphi programs:
You get the full form data including the name of event handler methods
All members with published visibility have metadata used with RTTI
The compiler is pretty bad at optimizing. It does no whole program optimization and the assembly is usually a straight forward translation of the original source with only minor optimizations. (At least it was in the versions I used, might have improved since then)
All classes, even those compiled with RTTI off have some level of metadata available. In particular it's possible to get the name and inheritance structure of classes. And for any instance of a class you happen to see in the debugger you can get its VMT and thus its class name.
Delphi uses text files describing the content of your form and hooks up event handlers by name. This approach obviously needs enough metadata to deserialize that textual representation of a form and hook up the event handlers by name.
An alternative that some other GUI toolkits use is auto-generating code that initializes the form and hooks up the event handlers in code. Since this code directly uses pointers to the event handlers and directly assigns to properties/calls setters, it doesn't need any metadata. This has the side effect that reversing becomes a bit harder.
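As a minimal illustration (a made-up form, not taken from any real project), this is roughly the metadata-driven hookup being described: the text DFM refers to the handler by name, so enough RTTI must survive in the exe to resolve 'Button1Click' when the form is streamed in - and a decompiler gets to read that same name.

object Form1: TForm1
  Caption = 'Demo'
  object Button1: TButton
    Caption = 'Run'
    OnClick = Button1Click
  end
end

type
  TForm1 = class(TForm)
    Button1: TButton;
    procedure Button1Click(Sender: TObject);   // declared in the default
                                               // (published) section, so its
                                               // name is kept in the binary
  end;

procedure TForm1.Button1Click(Sender: TObject);
begin
  Caption := 'clicked';
end;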
It shouldn't be too hard to create a program that transforms a dfm file into a series of hardcoded instructions that creates the form instead. So a tool like DeDe won't work that well anymore. But that doesn't gain you much in practice.
But figuring out which event handler corresponds to which control/event is still rather easy, especially since stuff like FLIRT identifies most library functions. So you just need to breakpoint the one you're interested in and then step into the user code.
The statement you make is false. Delphi is not particularly more easy to decompile than code produced by other mainstream compilers.
For .NET languages there is Reflector.
C++ is covered in this Stack Overflow question.
Python/Perl/Ruby etc. are interpreted.
If you were able to prove that the results of decompiling a Delphi executable were of significantly higher quality than in other widely used languages then your question would carry more weight.
Story from the trenches: Decompiling a tiny Delphi DLL
I've been through a Delphi decompiling session myself. It was one of those fake-sounding "I lost my sources" things - I really did lose the sources for a tiny Firebird UDF library. I do know better now, and I didn't jump right into decompiling, because the library was so small and I knew a rewrite would be much faster.
This DLL exports a function that looks like this:
function udf_do_some_math(Number1, Number2:Currency): Currency;
After doing the sane thing and rewriting the function, and doing some regression tests, I discovered some obscure corner cases where the new function's result wasn't the same as the old function's result! The trouble was, the new function's result was the correct result - the old DLL contained a BUG, and I had to reproduce the BUG: with this function, consistency is more important than accuracy.
Again, I did the sane thing and tried to "guess" at the BUG. I knew it was a rounding issue but simply couldn't figure out what it was. Finally I decided to give decompilers a try. After all, this was a small library, the entry point was straightforward and I didn't really need re-compilable code, nor 100% decompilation: I only needed enough to figure out the old BUG so I could reproduce it!
Decompiling failed! I tried lots of different decompilers, including a couple of "commercial" ones. Most produced what on the surface looked like good data, but not enough to figure out the old bug. The most promising one, the one with version-specific knowledge of the VCL and RTL, gave the worst failure: sure, it figured out the RTL calls and gave them names, but it failed to locate the exported function! The one function I was interested in wasn't shown in the list of entry points, and it should have been straightforward since it's an exported function.
This decompiling attempt should have been easy because:
The code was fairly simple, and there was not a lot of it.
It was a DLL with an exported function, none of the complexity you'd expect from an event-driven exe.
I wasn't interested in re-compilable code, I simply wanted to find an old bug so I can reproduce it.
I didn't ask for Pascal code, assembler would've been good enough.
I knew precisely what the code was doing and how it was doing it. It wasn't cryptic 3rd party code.
My solution
After decompilers failed me, I turned to my own trusty Delphi IDE for debugging. I wrote a small Delphi program that directly imports the function from the DLL, created a fake Firebird memory manager DLL so my DLL could load, called my old function with the parameters I knew would give bad results, stepped into the code using the debugger and closely watched the FPU registers. After a few failed attempts I finally noticed a value was popped from the FPU stack as an Integer where it shouldn't have been, so I had my BUG: I had mistakenly defined an Integer local variable where I should have used Currency. Armed with that knowledge I was able to reproduce the bug.
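For what it's worth, a minimal sketch of that class of bug (the arithmetic is made up, not the original UDF): an intermediate stored in an Integer local silently drops the fractional part, so the result only differs from the all-Currency version in certain corner cases.

function udf_do_some_math_buggy(Number1, Number2: Currency): Currency;
var
  Temp: Integer;                       // BUG: should have been Currency
begin
  Temp := Trunc(Number1 * Number2);    // fractional part of the product is lost here
  Result := Temp + Number2;
end;

function udf_do_some_math_fixed(Number1, Number2: Currency): Currency;
var
  Temp: Currency;
begin
  Temp := Number1 * Number2;           // keeps the four decimal digits of Currency
  Result := Temp + Number2;
end;

// The two versions agree when Number1 * Number2 is a whole number, but diverge
// as soon as the product has a fractional part - exactly the kind of obscure
// corner case described above.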
The only thing that is easier in Delphi is retrieving the VCL forms.
After using decompilers like DeDe you will get the application's user interface, but without any logic.
So if you want to retrieve only forms and buttons, Delphi is easier than other compilers; but if you want to know what is going on after clicking a button, you'll need to use OllyDbg or another debugger/disassembler, just as for other languages that produce executables.
There are pros and cons. I am not sure what angle you're referring to as being easier. There is also a huge difference between a one-form simple application and a very in-depth application that has many forms and tons of classes and functions. It's like Notepad versus Office 2013 (assuming both were coded in Delphi - just an example comparing complexity, not language).
In a small app, having the extra information that Delphi apps "usually" contain can make it a breeze. However, in a large application it may "help", but you have a million calls to dig through. They may help you get into the near vicinity, but calls inside of calls inside of calls, then multiple returns used as jumps... makes you dizzy. Then if the app "was" packed or protected, some things can still be a garbled mess. While it may work programming-wise, reading it can be a lot harder. I was in one the other day where all of the strings were encrypted, so "referenced text strings" were no help, and the encryption was not a simple MD5 or Base64, it was some custom algorithm. Maybe an MD5 with a salt, then Base64 encoded? I never could get to the exact method on the strings. I knew what some of them were supposed to be, but couldn't reproduce the method; even though it looked like Base64, it was the Base64 of the string already encrypted somehow... I don't rely on text strings, but in a large, large app, every little bit helps.
Of course, my interpretation of this question, was looking at a Delphi exe in OllyDbg. I could be off base on where you guys were going with this topic, but I feel in regards to Olly and reversing, I am on point (if that was what you were talking about) lol.
I'm looking for some tools to improve my Delphi development.
One tool for which I could not find any free project is a benchmarking tool.
Does anyone have a hint about a project I could use?
Today, to check where I must focus my optimizations, I use sampling profiling, but it's not enough:
I need to find the functions that spend the most time on average, not just the most-called functions.
Thanks
I think the acknowledged leader in this field is AQtime.
If you have no money then you can try Sampling Profiler.
I'm sure others will be along in due course to offer yet more suggestions!
Check out my question on Profiler and Memory Analysis Tools for Delphi. In my Addendum 4, I mention André's open source profiler for Delphi, AsmProfiler. See his answer to that question, which led me to it.
I have downloaded it and tried it, and it is quite good. It is an instrumenting profiler like AQTime, so it may be better than a sampling profiler for certain optimizations. It does procedure-based timings, so the one thing it can't do that AQTime can is line-by-line timings. But for a free program that works well, procedure-based timings are most often good enough. I used GpProfile very productively for many years, which was very similar, but it is no longer available for current versions of Delphi.
I have an image processing routine that I believe could be made very parallel very quickly. Each pixel needs to have roughly 2k operations done on it in a way that doesn't depend on the operations done on neighbors, so splitting the work up into different units is fairly straightforward.
My question is, what's the best way to approach this change such that I get the quickest speedup bang-for-the-buck?
Ideally, the library/approach I'm looking for should meet these criteria:
Still be around in 5 years. Something like CUDA or ATI's variant may get replaced with a less hardware-specific solution in the not-too-distant future, so I'd like something a bit more robust to time. If my impression of CUDA is wrong, I welcome the correction.
Be fast to implement. I've already written this code and it works in a serial mode, albeit very slowly. Ideally, I'd just take my code and recompile it to be parallel, but I think that that might be a fantasy. If I just rewrite it using a different paradigm (ie, as shaders or something), then that would be fine too.
Not require too much knowledge of the hardware. I'd like to be able to not have to specify the number of threads or operational units, but rather to have something automatically figure all of that out for me based on the machine being used.
Be runnable on cheap hardware. That may mean a $150 graphics card, or whatever.
Be runnable on Windows. Something like GCD might be the right call, but the customer base I'm targeting won't switch to Mac or Linux any time soon. Note that this does make the response to the question a bit different than to this other question.
What libraries/approaches/languages should I be looking at? I've looked at things like OpenMP, CUDA, GCD, and so forth, but I'm wondering if there are other things I'm missing.
I'm leaning right now toward something like shaders and OpenGL 2.0, but that may not be the right call, since I'm not sure how many memory accesses I can get that way - those 2k operations require accessing all the neighboring pixels in a lot of ways.
The easiest way is probably to divide your picture into the number of parts that you can process in parallel (4, 8, 16, depending on cores), then just run a different process for each part (a rough sketch using threads follows below).
In terms of doing this specifically, take a look at OpenCL. It will hopefully be around for longer since it's not vendor-specific and both NVidia and ATI want to support it.
In general, since you don't need to share too much data, the process is really pretty straightforward.
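As a rough sketch of that idea in Delphi (using threads rather than separate processes, with a made-up 8-bit grayscale buffer and a stand-in per-pixel operation): split the image into horizontal bands and give each band to one thread; since no pixel depends on its neighbours here, no locking is needed.

uses
  Classes, SysUtils;

type
  TGrayImage = array of Byte;   // Width * Height bytes, row-major

  TBandThread = class(TThread)
  private
    FImage: TGrayImage;
    FWidth, FFirstRow, FLastRow: Integer;
  protected
    procedure Execute; override;
  public
    constructor Create(const AImage: TGrayImage; AWidth, AFirstRow, ALastRow: Integer);
  end;

constructor TBandThread.Create(const AImage: TGrayImage; AWidth, AFirstRow, ALastRow: Integer);
begin
  // assign the fields before the inherited constructor so the thread
  // cannot start running with uninitialized state
  FImage := AImage;
  FWidth := AWidth;
  FFirstRow := AFirstRow;
  FLastRow := ALastRow;
  inherited Create(False);
end;

procedure TBandThread.Execute;
var
  Row, Col, Base: Integer;
begin
  for Row := FFirstRow to FLastRow do
  begin
    Base := Row * FWidth;
    for Col := 0 to FWidth - 1 do
      FImage[Base + Col] := 255 - FImage[Base + Col];   // stand-in for the real 2k operations
  end;
end;

procedure ProcessInBands(const Image: TGrayImage; Width, Height, BandCount: Integer);
var
  Threads: array of TBandThread;
  i, RowsPerBand, FirstRow, LastRow: Integer;
begin
  SetLength(Threads, BandCount);
  RowsPerBand := (Height + BandCount - 1) div BandCount;
  for i := 0 to BandCount - 1 do
  begin
    FirstRow := i * RowsPerBand;
    LastRow := FirstRow + RowsPerBand - 1;
    if LastRow > Height - 1 then
      LastRow := Height - 1;
    Threads[i] := TBandThread.Create(Image, Width, FirstRow, LastRow);
  end;
  for i := 0 to BandCount - 1 do
  begin
    Threads[i].WaitFor;           // wait for every band to finish
    Threads[i].Free;
  end;
end;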
I would also recommend Threading Building Blocks. We use this with the Intel® Integrated Performance Primitives for the image analysis at the company I work for.
Threading Building Blocks (TBB) is similar to both OpenMP and Cilk. It uses OpenMP to do the multithreading; it is just wrapped in a simpler interface. With it you don't have to worry about how many threads to make - you just define tasks. It will split the tasks, if it can, to keep everything busy, and it does the load balancing for you.
Intel Integrated Performance Primitives (IPP) has optimized libraries for vision, most of which are multithreaded. For the functions we need that aren't in IPP, we thread them ourselves using TBB.
Using these, we obtain the best results when we use the IPP method for creating the images. What it does is pad each row so that any given cache line is entirely contained in one row. Then we don't divvy up a row of the image across threads. That way we don't have false sharing from two threads trying to write to the same cache line.
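A minimal sketch of that row-padding arithmetic (the constant and function name are mine, purely illustrative): round each row's byte width up to a multiple of the cache-line size, so a cache line never straddles two rows and two threads working on different rows can never write to the same line.

const
  CacheLineSize = 64;   // typical for current x86 CPUs

function PaddedStride(WidthInBytes: Integer): Integer;
begin
  // round up to the next multiple of CacheLineSize
  Result := ((WidthInBytes + CacheLineSize - 1) div CacheLineSize) * CacheLineSize;
end;

// Example: a 1000-byte-wide row gets a stride of 1024 bytes, so pixel (Row, Col)
// lives at Row * PaddedStride(1000) + Col in the flat buffer, and the last
// 24 bytes of each row are unused padding.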
Have you seen Intel's (Open Source) Threading Building Blocks?
I haven't used it, but take a look at Cilk. One of the big wigs on their team is Charles E. Leiserson; he is the "L" in CLRS, the most widely used and respected algorithms book on the planet.
I think it caters well to your requirements.
From my brief readings, all you have to do is "tag" your existing code and then run it through their compiler, which will automatically/seamlessly parallelize the code. This is their big selling point, so you don't need to start from scratch with parallelism in mind, unlike other options (like OpenMP).
If you already have a working serial code in one of C, C++ or Fortran, you should give serious consideration to OpenMP. One of its big advantages over a lot of other parallelisation libraries / languages / systems / whatever, is that you can parallelise a loop at a time which means that you can get useful speed-up without having to re-write or, worse, re-design, your program.
In terms of your requirements:
OpenMP is much used in high-performance computing, there's a lot of 'weight' behind it and an active development community -- www.openmp.org.
Fast enough to implement if you're lucky enough to have chosen C, C++ or Fortran.
OpenMP implements a shared-memory approach to parallel computing, so a big plus in the 'don't need to understand hardware' argument. You can leave the program to figure out how many processors it has at run time, then distribute the computation across whatever is available, another plus.
Runs on the hardware you already have, no need for expensive, or cheap, additional graphics cards.
Yep, there are implementations for Windows systems.
Of course, if you were unwise enough to have not chosen C, C++ or Fortran in the beginning a lot of this advice will only apply after you have re-written it into one of those languages !
Regards
Mark
I recently upgraded from Delphi 4 to Delphi 2009. With Delphi 4 I had been using GpProfile by Primoz Gabrijelcic as a profiler and Memory Sleuth by Turbo Power for memory analysis and leak debugging. Both worked well for me. But I now need new tools that will work with Delphi 2009.
The leader in Profiling/Analysis tools for Delphi by a wide margin is obviously AQTime by AutomatedQA. They recently even gobbled up Memproof by Atanas Soyanov, which I understood was an excellent and free memory analysis tool, and incorporated its functionality into AQTime. But AQTime is very expensive for an individual programmer. It actually costs more than the upgrade to Delphi 2009 cost!
So my question is: Are there other less expensive options to do profiling and memory analysis in current versions of Delphi that you are happy with and recommend, or should I bite the bullet and pay the big bucks for AQTime?
Addendum: It seems the early answerers are indicating that the FastMM manager already included in Delphi is very good for finding memory leaks.
So then, are there any good alternatives for source code profiling?
One I'm curious about is ProDelphi by Michael Adolph which is less than one sixth the cost of AQTime. Do you use it? Is AQTime worth paying six times as much?
Addendum 2: I downloaded trial versions of both AQTime and ProDelphi.
AQTime was a bit overwhelming and a little confusing at first. It took a few hours to find some of the tricks needed to hook it up.
ProDelphi was very much like the GpProfile that I was used to. But its windows are cluttered and confusing and it's not quite as nice as GpProfile.
To me the big differences seem to be:
ProDelphi changes your code. AQTime does not. Changing code may corrupt your data if something goes wrong, but my experience with GpProfile was that it never happened to me. Plus one for AQTime.
ProDelphi requires you turn optimization off. But what you want to profile is your program with optimization on, the way it will be run. Plus one for AQTime.
ProDelphi only can profile down to the function or procedure. AQTime can go down to individual lines. Plus 2 for AQTime.
ProDelphi has a free version that will profile 20 routines, and its pro version costs less than $100 USD. AQTime is $600 USD. Plus 4 for ProDelphi.
The score is now 4-4. What do you think?
Addendum 3: Primoz Gabrijelcic is planning to get GpProfile working again. See his comments on some of the responses below. He is on StackOverflow as Gabr.
Addendum 4: It seems like there may be a profiler solution after all. See Andre's open source asmprofiler, described below.
For the price, you cannot beat FastMM4 as a memory tracker. It's simple to use yet powerful and well integrated with Delphi.
I guess that you know that, without downloading, installing or changing anything else, just putting this line
ReportMemoryLeaksOnShutDown := True;
anywhere in your code, will enable basic reporting of memory leaks.
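For instance (the project and form names here are just placeholders), it is typically dropped into the .dpr so it runs before anything else:

program MyApp;

uses
  Forms,
  MainFormUnit in 'MainFormUnit.pas' {MainForm};

begin
  ReportMemoryLeaksOnShutDown := True;   // FastMM reports leaked blocks when the app closes
  Application.Initialize;
  Application.CreateForm(TMainForm, MainForm);
  Application.Run;
end.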
If you need more like crash information, EurekaLog is a very good product that we use. MadExcept also has a good reputation...
For profiling specifically, we have AQTime.
As for gpProfile, you can try and bug gabr on SO for an update... or go and update gpProfile yourself as it is open source. ;-)
I've made an open source profiler for Delphi:
http://code.google.com/p/asmprofiler/
It's not perfect, but it's free and open source :-).
The main reason I made it was that I missed an exact call tree. For example, ProDelphi only stores a summary and the total counts of all calls; you cannot see which calls a specific procedure made at a specific time (or their duration).
And it has a time chart, so you can see how the call duration changed over time.
Also take a look at Eric Grange's Sampling Profiler.
I've been very happy with AQtime for profiling.
Having used both GpProfile and AQTime, I have found both to be effective at finding which method call is causing a bottleneck.
However, AQTime can also tell me what line of code is causing this, without making any changes to my source code (although it works best with TD32 debugging and debug dcus).
I recently used it to speed up a routine by about 30x (due to bad use of an internal library function).
However I didn't have to pay for it myself!
We use AQTime Pro and are happy with it. SmartBear have recently released a completely free AQTime Standard edition. Most of the features are still there, but they have of course removed a bit.
I agree with you about the interface of ProDelphi, but it does a good enough job that we're happy to stay with it. We only need to profile very occasionally when we have a significant performance issue, and it's always helped us find the problem pretty quickly. Very good value for money, and Michael seems pretty good about keeping it updated for new versions.
One thing I would suggest is that because it does require code to be inserted, having all the relevant code in some kind of VCS is invaluable. When we need to profile, we:
Check all relevant files in
Check them all out
Do the profiling we need, then
Cancel all checkouts, effectively rolling back to where we were.
Has anyone tried the Profiler component at Delphi Area? It is freeware with source, and its writeup says:
If you are looking for an easy and accurate way to measure execution time of your code for free, TProfiler is what you need. TProfiler is a non-visual debugging component that enables you to create named timers in your code.
Each timer of TProfiler provides the following information:
The number of times that the timer was activated (hit count)
The total execution time
The average execution time on each hit
Execution time on the first hit
Execution time on the last hit
The hit with minimum execution time
The hit with maximum execution time
It's true, for profiling I miss Primoz' GpProfile, and haven't found a good replacement. I once tried AQTime, but wasn't too happy with it for the price.
For tracking of memory leaks and dodgy memory accesses however I couldn't be happier than I am with FastMM4.
I've been using ProDelphi for a long time & find it meets my needs.
I've been able to achieve stunning results in system performance improvements by using the data it provides.
For small projects the free version is fine.
For larger projects, you'll need the (Paid) pro version.
For a profiler you might try SmartInspect from Gurock Software. I never used GpProfile, but quickly glancing at its feature set reminded me of SmartInspect. Interestingly, it doesn't claim to be a profiler, but it seems to be as much of one as GpProfile (unless I am missing something). It supports Delphi 2009, has a free trial and is a little cheaper than AQTime.
Note: SmartInspect is a logger rather than a profiler.
The FastMM4 memory manager mentioned in this older answer ("How to monitor or visualize memory fragmentation of a delphi application") keeps a list of all allocations, which can be queried at run time (and displayed in a grid using the included demo application). It does not show exactly which object leaks, as the statistics are per block size. But it can be useful for long-term monitoring of applications in production, for example servers or services. I am currently integrating it into a (commercial) web application server framework as the 'VisualMM' add-on.