Reduce CPU time spent in VCL

It is important for a MIDI player to play back the notes as precisely as possible. I never quite succeeded in that, always blaming the timer (see a previous thread: How to prevent hints interrupting a timer). Recently I acquired ProDelphi and started measuring what exactly consumed that much time. The result was quite surprising, see the example code below.
procedure TClip_View.doMove (Sender: TObject; note, time, dure, max_time: Int32);
var x: Int32;
begin
{$IFDEF PROFILE}Profint.ProfStop; Try; Profint.ProfEnter(@self,1572 or $58B20000); {$ENDIF}
Image.Picture.Bitmap.Canvas.Pen.Mode := pmNot;
Image.Picture.Bitmap.Canvas.MoveTo (FPPos, 0);
Image.Picture.Bitmap.Canvas.LineTo (FPPos, Image.Height);
x := time * GPF.PpM div MIDI_Resolution;
Image.Picture.Bitmap.Canvas.Pen.Mode := pmNot;
Image.Picture.Bitmap.Canvas.MoveTo (x, 0);
Image.Picture.Bitmap.Canvas.LineTo (x, Image.Height);
FPPos := x;
// Bevel.Left := time * GPF.PpM div MIDI_Resolution;
{$IFDEF PROFILE}finally; Profint.ProfExit(1572); end;{$ENDIF}
end; // doMove //
The measurements are (without debug code on an Intel i7-920, 2.7 GHz):
95 microseconds for the code as shown
5.609 milliseconds when all is commented out except for the now commented out statement (Bevel.Left :=)
0.056 microseconds when all code is replaced by x := time * GPF.PpM div MIDI_Resolution;
Just moving a Bevel costs 60 times as much CPU as drawing on a Canvas. That surprised me. The results of measurement 1 are very audible (there is more going on than just this), but 2 and 3 not. I need some form of feedback to the user about what the player is currently processing; a line moving over a piano roll is the accepted way. In my never-ending quest to reduce CPU cycles in the timed-event loop I have some questions:
Why does moving around a bevel cost that much time?
Is there a way to reduce more CPU cycles than in drawing on a bitmap?
Is there a way to reduce the flicker when drawing?

You won't be able to change the world, nor the VCL, nor Windows. I suspect you are asking too much of them...
IMHO you should rather change your architecture a little:
Sound processing shall be in one (or more) separated thread(s), and should not be at all linked to the UI (e.g. do not send GDI messages from it);
UI refresh shall be made using a timer with a 500 ms resolution (half a second refresh sounds reactive enough), not every time there is a change.
That is, the sequencer won't refresh the UI; instead the UI will periodically ask the sequencer for its current status. This will be IMHO much smoother.
To answer your exact questions:
"Moving a bevel" is in fact sending several GDI messages, and the rendering will be made by the GDI stack (gdi32.dll) using a temporary bitmap;
Try to use a smaller bitmap, or try using a Direct X buffer mapping;
Try DoubleBuffered := true in your TForm.OnCreate event, or use a dedicated component (TPaintBox) with a global bitmap for the whole component content, with something like the handler below for the WM_ERASEBKGND message.
Some code:
procedure TMyPaintBox.WMEraseBkgnd(var Message: TWmEraseBkgnd);
begin
Message.Result := 1; // no erasing is necessary after this method call
end;

I have the feeling that your bitmap buffering is wrong. When you move your clip it shouldn't have to be redrawn at all. You could try this clip component structure:
TMidiClip = Class(TControl)
Private
FBuffer: TBitmap;
FGridPos: TPoint;
FHasToRepaint: Boolean;
Public
Procedure Paint; Override; // you only draw the bitmap on the control canvas
Procedure Refresh; // you recompute the FBuffer.canvas
End;
When you change some properties such as "clip tick length" you set "FHasToRepaint" to true but not when changing "FGridPos" (position on the grid). So most of the time, in your Paint event, you only have a copy of your FBuffer.
Actually, this is very dependent on the design of your grid and its children (clips).
I might be wrong, but it seems that your design is not decomposed enough into controls: the master grid should be a TControl, a clip should be a TControl, even the events on a clip should be TControls... Only this way can you define a strongly optimized bitmap-buffering system (aka "double-buffering").
About the Timer: you should use a musical clock which processes per audio sample otherwise you can't have a good enough resolution. Such a clock can be implemented using the "Windows Audio driver" (mmsystem.pas) or an "Asio" driver (you have an interface in BASS, The Delphi Asio Vst project, for example).

By far the best way to tackle this is to stop the GUI message queue interfering with your MIDI player. Put the MIDI player on a background thread so that it can do its work without interruption from the main thread. Naturally this relies on you running on a machine with more than a single processor but it's not unreasonable to take that for granted nowadays.
Judging from your comments it looks like your audio processing is being blocked by the UI thread. Don't let that happen and your audio problems will disappear. If you are using something like TThread.Synchronize to fire VCL events from the audio thread then that will block on the UI thread. Use an asynchronous communication method instead.
Your proposed alternative of speeding up the VCL is not really viable. You can't modify the VCL and even if you could, the bottleneck could easily be the underlying Windows code.
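As a rough illustration of the "asynchronous communication" advice above, here is a minimal console sketch that uses TThread.Queue instead of TThread.Synchronize. TPlayerThread, ReportPosition and LastReported are hypothetical names invented for the example, and it assumes a Delphi version (or Free Pascal) recent enough to provide TThread.Queue and TThread.Finished. It is not the asker's player, only the shape of the hand-off.

```pascal
program QueueDemo;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  {$IFDEF UNIX}cthreads,{$ENDIF}
  SysUtils, Classes;

type
  // Hypothetical stand-in for the MIDI player thread.
  TPlayerThread = class(TThread)
  private
    FTick: Integer;
    procedure ReportPosition; // executed later, in the main thread
  protected
    procedure Execute; override;
  end;

var
  LastReported: Integer = 0; // stands in for the cursor position on screen

procedure TPlayerThread.ReportPosition;
begin
  // In a VCL application the cursor line would be moved here.
  // FTick may have advanced since the call was queued; a position
  // display only needs the most recent value anyway.
  LastReported := FTick;
end;

procedure TPlayerThread.Execute;
var
  I: Integer;
begin
  for I := 1 to 100 do
  begin
    FTick := I;            // the time-critical MIDI output would happen here
    Queue(ReportPosition); // returns at once; Synchronize would block here
  end;
end;

var
  Player: TPlayerThread;
begin
  Player := TPlayerThread.Create(False);
  try
    // A console program has no message loop, so the queue is pumped by hand.
    // In a VCL application the main thread does this automatically.
    while not Player.Finished do
      CheckSynchronize(10);
    CheckSynchronize; // run whatever is still queued
    Writeln('last reported tick: ', LastReported);
  finally
    Player.Free;
  end;
end.
```

The point of the design is that the player thread never waits for the main thread: a slow repaint delays only the display, not the notes.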

Related

Finding a string from a text file, from bottom

I need to find a certain string, in a text file, from bottom (the end of the line).
Once the string has been found, the function exit.
Here is my code, which is working fine. But, it is kind of slow.
I meant, I run this code every 5 seconds. And it consumes about 0.5% to 1% CPU time.
The text file is about 10 MB.
How to speed this up? Like, really fast and it doesn't consume much CPU time.
function TMainForm.GetVMem: string;
var
TS: TStrings;
sm: string;
i: integer;
begin
TS := TStringList.Create;
TS.LoadFromFile(LogFileName);
for i := TS.Count-1 downto 0 do
begin
Application.ProcessMessages;
sm := Trim(TS[i]);
if Pos('Virtual Memory Total =', sm) > 0 then
begin
Result := sm;
TS.Free;
exit;
end;
end;
Result := '';
TS.Free;
end;

You can use a TMemoryStream and use LoadFromFile to load the complete file content.
Then you can cast the Memory property to PChar, PAnsiChar or another character pointer type, depending on the file content.
When you have the pointer, you can use it to check for content. I would avoid using string handling because it is much slower than pointer operation.
You can move the pointer from the end of memory (use Stream.Size) and search backward for the CR/LF pair (or whatever is used as line delimiter). Then from that point check for the searched string. If found, you are done, if not, loop searching the previous CR/LF.
That is more complex than the method you used but - if done correctly - will be faster.
If the file is too big to fit in memory, especially in a 32-bit application, you'll have to resort to reading the file line by line from the start to the end, keeping only one line in memory. Each time you find the searched string, save its position. At the end of the file, this saved position, if any, will be the last occurrence of the searched string.
If the file is really very large and the searched string is probably near the end, you may read the file backward, block by block (setting TStream.Position for direct access). Then, in each block, use the previous algorithm. Pay attention that the searched string may be split across two blocks, depending on the block size.
Again, depending on the file size, you may split the search into several parallel searches using multi-threading. Do not create too many threads either. Pay attention that the searched string may be split across two blocks assigned to different threads, depending on the block size.
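The first variant described above (load the whole file, then walk backward over the line breaks with a pointer) could look roughly like the sketch below. FindLastLineContaining is a name made up for this example, and it assumes a single-byte (ANSI/ASCII) log file with LF or CR/LF line ends that fits in memory. Unlike the code in the question it strips only the trailing CR, not other whitespace.

```pascal
program FindLastLine;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Returns the last line of the file that contains Needle, or '' if none does.
function FindLastLineContaining(const FileName: string;
  const Needle: AnsiString): AnsiString;
var
  MS: TMemoryStream;
  Buf: PAnsiChar;
  LineStart, LineEnd, Len: Integer; // current line is Buf[LineStart..LineEnd-1]
  Line: AnsiString;
begin
  Result := '';
  MS := TMemoryStream.Create;
  try
    MS.LoadFromFile(FileName);
    Buf := PAnsiChar(MS.Memory);
    LineEnd := MS.Size;
    while LineEnd > 0 do
    begin
      // Walk back to the byte after the previous LF (or to the file start).
      LineStart := LineEnd;
      while (LineStart > 0) and (Buf[LineStart - 1] <> #10) do
        Dec(LineStart);
      Len := LineEnd - LineStart;
      if (Len > 0) and (Buf[LineStart + Len - 1] = #13) then
        Dec(Len); // drop the CR of a CR/LF pair
      SetString(Line, Buf + LineStart, Len);
      if Pos(Needle, Line) > 0 then
      begin
        Result := Line;
        Exit;
      end;
      LineEnd := LineStart - 1; // step over the LF that precedes this line
    end;
  finally
    MS.Free;
  end;
end;

var
  F: TextFile;
begin
  // Build a small throw-away log file to try the function on.
  AssignFile(F, 'demo.log');
  Rewrite(F);
  Writeln(F, 'Virtual Memory Total = 1');
  Writeln(F, 'something else');
  Writeln(F, 'Virtual Memory Total = 2');
  Writeln(F, 'last line');
  CloseFile(F);
  Writeln(FindLastLineContaining('demo.log', 'Virtual Memory Total ='));
  DeleteFile('demo.log');
end.
```

Because the scan stops at the first match from the end, only the tail of the buffer is examined when the string sits near the end of the file; the cost of loading the whole 10 MB remains, which is what the block-by-block variant would avoid.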

Pascal: How to randomize two images on Turbo Delphi

I have a task of having two images (Two TImage), one a head and the other a tail (Coins), on my screen with a TButton to randomize both of them.
It is to be that when you press the button, the two images go random choosing Heads or Tails.
I know it's kind of an easy question, but I am just learning. I just don't know what to use!

You need to sample from a discrete uniform distribution with two possible values. So like this:
function IsHead: Boolean;
begin
Result := Random()<0.5;
end;
Or like this:
function IsHead: Boolean;
begin
Result := Random(2)=0;
end;
You'll want to call Randomize somewhere in the startup of your program to make sure that you don't get the same sequence of pseudo-random numbers each time you run the program.
I'm assuming that you already know how to write button OnClick event handlers, and switch visibility of TImage controls.

VirtualStringTree updating data using cache system

Well, I'm using VirtualStringTree to create kind of a process manager...
I ran into trouble when updating the tree with a timer set to 1000 ms: CPU usage is too high because my application retrieves a lot of data (filling about 20 columns).
So I wonder how one would build a kind of cache system, so that I update the tree only when something has changed. I guess that is the key to reducing my application's CPU usage a lot?
Snip:
type
TProcessNodeType = (ntParent, ntDummy);
PProcessData = ^TProcessData;
TProcessData = record
pProcessName : String;
pProcessID,
pPrivMemory,
pWorkingSet,
pPeakWorkingSet,
pVirtualSize,
pPeakVirtualSize,
pPageFileUsage,
pPeakPageFileUsage,
pPageFaults : Cardinal;
pCpuUsageStr: string;
pIOTotal: Cardinal;
...
end;
When my application starts, I fill the tree with all running processes.
Remember this is called only once; later, while the application runs, I get notified via WMI of new or terminated processes, so I don't need to call the following procedure in the timer to update the tree...
procedure FillTree;
var
NodeData: PProcessData;
Node: PVirtualNode;
ParentNode: PVirtualNode;
ChildNode: PVirtualNode;
Process: TProcessItem;
I : Integer;
begin
ProcessTree.BeginUpdate;
for I := 0 to FRunningProcesses.Count - 1 do
begin
Process := FRunningProcesses[i];
NodeData^.pProcessID := Process.ProcessID;
NodeData^.pProcessName := Process.ProcessName;
...
I have a Class which will retrieve all the data I want and store it into the tree like:
var
FRunningProcesses: TProcessRunningProcesses;
So if I want to enumerate all running processes I just give it a call like:
// clears all data inside the class and refills everything with the new data...
FRunningProcesses.UpdateProcesses;
The problem starts here while I enumerate everything and not only data which had changed which is quite cpu intensive:
procedure TMainForm.UpdateTimerTimer(Sender: TObject);
var
NodeData: PProcessData;
Node : PVirtualNode;
Process: TProcessItem;
I: Integer;
begin
for I := 0 to FRunningProcesses.Count - 1 do
begin
Application.ProcessMessages;
Process := FRunningProcesses[I];
// returns PVirtualNode if the node is found inside the tree
Node := FindNodeByPID(Process.ProcessID);
if not(assigned(Node)) then
exit;
NodeData := ProcessVst.GetNodeData(Node);
if not(assigned(NodeData)) then
exit;
// now starting updating the tree
// NodeData^.pWorkingsSet := Process.WorkingsSet;
....
Basically the timer is only needed for cpu usage and all memory informations I can retrieve from a process like:
Priv.Memory
Working Set
Peak Working Set
Virtual Size
PageFile Usage
Peak PageFile Usage
Page Faults
Cpu Usage
Thread Count
Handle Count
GDI Handle Count
User Handle Count
Total Cpu Time
User Cpu Time
Kernel Cpu Time
So I think the above data must be cached and compared somehow to see whether it has changed. I just wonder how, and what would be most efficient?

You only need to update the data in nodes which are currently visible.
You can use vst.GetFirstVisible and vst.GetNextVisible to iterate through these nodes.
The second way is also easy.
Use objects instead of the record (sample code of object usage).
Use getters for the different values.
Those getters query the processes for the values.
You may need a limit here: refresh the data only once every second.
Now you only need to put the VST into an invalidated state every second.
vst.Invalidate
This forces the VST to repaint the visible area.
But all this works only if your data is not sorted by any changing values.
If that is necessary, you need to update all records, and this is your bottleneck, I think.
Remember that COM and WMI are much slower than the pure API.
Avoid (slow) loops and use a profiler to find the slow parts.
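The "cache and compare" idea from the question can be combined with the repaint advice above. The following is only a sketch: TProcessSnapshot is a hypothetical two-field subset of the asker's TProcessData, and UpdateIfChanged is a helper invented for the example. In the timer handler, a True result would be the cue to repaint that one node (for instance with the tree's InvalidateNode) instead of touching every node each second.

```pascal
program ChangeCache;
{$APPTYPE CONSOLE}

type
  // Hypothetical subset of the asker's TProcessData record.
  TProcessSnapshot = record
    WorkingSet: Cardinal;
    PageFaults: Cardinal;
  end;

// Stores Fresh in Cached and returns True only if a displayed value changed.
function UpdateIfChanged(var Cached: TProcessSnapshot;
  const Fresh: TProcessSnapshot): Boolean;
begin
  Result := (Cached.WorkingSet <> Fresh.WorkingSet) or
            (Cached.PageFaults <> Fresh.PageFaults);
  if Result then
    Cached := Fresh;
end;

var
  Cached, Fresh: TProcessSnapshot;
begin
  Cached.WorkingSet := 4096;
  Cached.PageFaults := 10;

  Fresh := Cached;
  Writeln(UpdateIfChanged(Cached, Fresh)); // FALSE: nothing to repaint

  Fresh.PageFaults := 11;
  Writeln(UpdateIfChanged(Cached, Fresh)); // TRUE: repaint this node only
end.
```

Comparing field by field keeps the check cheap and lets string fields such as the CPU usage text be compared the same way if they are added to the record.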

I'd recommend you to have your VT's node data point directly to TProcessItem.
Pro's:
Get rid of FindNodeByPID. Just update all the items from
FRunningProcesses and then call VT.Refresh. When the process is
terminated, delete corresponding item from FRunningProcesses.
Currently you have quite expensive search in FindNodeByPID where
you loop through all VT nodes, retrieve their data and check for
PID.
Get rid of Process := FRunningProcesses[I] where you have
unnecessary data copy of the whole TProcessData record (btw, that
should be done anyway, use pointers instead).
Get rid of the whole // now starting updating the tree block.
In general, this change removes superfluous entities, which is very good for application maintenance and debugging.
Con's:
You'll have to keep VT and FRunningProcesses in sync. But that's quite trivial.

Need multi-threading memory manager

I will have to create a multi-threaded project soon. I have seen experiments ( delphitools.info/2011/10/13/memory-manager-investigations ) showing that the default Delphi memory manager has problems with multi-threading.
So, I have found this SynScaleMM. Anybody can give some feedback on it or on a similar memory manager?
Thanks

Our SynScaleMM is still experimental.
EDIT: Take a look at the more stable ScaleMM2 and the brand new SAPMM. But my remarks below are still worth following: the less allocation you do, the better you scale!
But it worked as expected in a multi-threaded server environment. Scaling is much better than FastMM4, for some critical tests.
But the Memory Manager is perhaps not the biggest bottleneck in multi-threaded applications. FastMM4 can work well if you don't stress it.
Here is some advice (not dogmatic, just from experiment and knowledge of the low-level Delphi RTL) if you want to write FAST multi-threaded applications in Delphi:
Always use const for string or dynamic array parameters like in MyFunc(const aString: String) to avoid allocating a temporary string per each call;
Avoid using string concatenation (s := s+'Blabla'+IntToStr(i)) , but rely on a buffered writing such as TStringBuilder available in latest versions of Delphi;
TStringBuilder is not perfect either: for instance, it will create a lot of temporary strings for appending some numerical data, and will use the awfully slow SysUtils.IntToStr() function when you add some integer value - I had to rewrite a lot of low-level functions to avoid most string allocation in our TTextWriter class as defined in SynCommons.pas;
Don't abuse critical sections: keep them as small as possible, and rely on atomic modifiers if you need concurrent access - see e.g. InterlockedIncrement / InterlockedExchangeAdd;
InterlockedExchange (from SysUtils.pas) is a good way of updating a buffer or a shared object. You create an updated version of some content in your thread, then you exchange a shared pointer to the data (e.g. a TObject instance) in one low-level CPU operation. It will notify the change to the other threads, with very good multi-thread scaling. You'll have to take care of the data integrity, but it works very well in practice.
Don't share data between threads, but rather make your own private copy or rely on some read-only buffers (the RCU pattern is the best for scaling);
Don't use indexed access to string characters, but rely on some optimized functions like PosEx() for instance;
Don't mix AnsiString/UnicodeString kind of variables/functions, and check the generated asm code via Alt-F2 to track any hidden unwanted conversion (e.g. call UStrFromPCharLen);
Rather use var parameters in a procedure instead of function returning a string (a function returning a string will add an UStrAsg/LStrAsg call which has a LOCK which will flush all CPU cores);
If you can, for your data or text parsing, use pointers and some static stack-allocated buffers instead of temporary strings or dynamic arrays;
Don't create a TMemoryStream each time you need one, but rely on a private instance in your class, already sized in enough memory, in which you will write data using Position to retrieve the end of data and not changing its Size (which will be the memory block allocated by the MM);
Limit the number of class instances you create: try to reuse the same instance, and if you can, use some record/object pointers on already allocated memory buffers, mapping the data without copying it into temporary memory;
Always use test-driven development, with dedicated multi-threaded tests, trying to reach the worst-case limit (increase number of threads, data content, add some incoherent data, pause at random, try to stress network or disk access, benchmark with timing on real data...);
Never trust your instinct, but use accurate timing on real data and process.
I tried to follow those rules in our Open Source framework, and if you take a look at our code, you'll find out a lot of real-world sample code.
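To make the "atomic modifiers instead of critical sections" item concrete, here is a minimal sketch of a shared counter updated with InterlockedIncrement from several threads. TCounterThread and SharedCount are names made up for the example, and the thread and iteration counts are arbitrary. The conditional uses clause is an assumption about the target: the Windows unit on Windows, cthreads for Free Pascal on Unix.

```pascal
program AtomicCounter;
{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  {$IFDEF UNIX}cthreads,{$ELSE}Windows,{$ENDIF}
  Classes;

const
  ThreadCount = 4;
  Iterations = 100000;

type
  TCounterThread = class(TThread)
  protected
    procedure Execute; override;
  end;

var
  SharedCount: Integer = 0; // written by all threads, without any lock object

procedure TCounterThread.Execute;
var
  I: Integer;
begin
  for I := 1 to Iterations do
    InterlockedIncrement(SharedCount); // one atomic read-modify-write
end;

var
  Threads: array[0..ThreadCount - 1] of TCounterThread;
  I: Integer;
begin
  for I := Low(Threads) to High(Threads) do
    Threads[I] := TCounterThread.Create(False);
  for I := Low(Threads) to High(Threads) do
  begin
    Threads[I].WaitFor;
    Threads[I].Free;
  end;
  Writeln(SharedCount); // ThreadCount * Iterations = 400000
end.
```

A plain Inc(SharedCount) in the same loop would lose updates under contention. The atomic version stays correct, although each increment still carries the cost of a LOCKed instruction, so a per-thread private counter summed at the end scales better still, in line with the "don't share data" advice.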

If your app can accommodate GPL licensed code, then I'd recommend Hoard. You'll have to write your own wrapper to it but that is very easy. In my tests, I found nothing that matched this code. If your code cannot accommodate the GPL then you can obtain a commercial licence of Hoard, for a significant fee.
Even if you can't use Hoard in an external release of your code you could compare its performance with that of FastMM to determine whether or not your app has problems with heap allocation scalability.
I have also found that the memory allocators in the versions of msvcrt.dll distributed with Windows Vista and later scale quite well under thread contention, certainly much better than FastMM does. I use these routines via the following Delphi MM.
unit msvcrtMM;
interface
implementation
type
size_t = Cardinal;
const
msvcrtDLL = 'msvcrt.dll';
function malloc(Size: size_t): Pointer; cdecl; external msvcrtDLL;
function realloc(P: Pointer; Size: size_t): Pointer; cdecl; external msvcrtDLL;
procedure free(P: Pointer); cdecl; external msvcrtDLL;
function GetMem(Size: Integer): Pointer;
begin
Result := malloc(size);
end;
function FreeMem(P: Pointer): Integer;
begin
free(P);
Result := 0;
end;
function ReallocMem(P: Pointer; Size: Integer): Pointer;
begin
Result := realloc(P, Size);
end;
function AllocMem(Size: Cardinal): Pointer;
begin
Result := GetMem(Size);
if Assigned(Result) then begin
FillChar(Result^, Size, 0);
end;
end;
function RegisterUnregisterExpectedMemoryLeak(P: Pointer): Boolean;
begin
Result := False;
end;
const
MemoryManager: TMemoryManagerEx = (
GetMem: GetMem;
FreeMem: FreeMem;
ReallocMem: ReallocMem;
AllocMem: AllocMem;
RegisterExpectedMemoryLeak: RegisterUnregisterExpectedMemoryLeak;
UnregisterExpectedMemoryLeak: RegisterUnregisterExpectedMemoryLeak
);
initialization
SetMemoryManager(MemoryManager);
end.
It is worth pointing out that your app has to be hammering the heap allocator quite hard before thread contention in FastMM becomes a hindrance to performance. Typically in my experience this happens when your app does a lot of string processing.
My main piece of advice for anyone suffering from thread contention on heap allocation is to re-work the code to avoid hitting the heap. Not only do you avoid the contention, but you also avoid the expense of heap allocation – a classic twofer!

It is locking that makes the difference!
There are two issues to be aware of:
Use of the LOCK prefix by the Delphi itself (System.dcu);
How does FastMM4 handles thread contention and what it does after it failed to acquire a lock.
Use of the LOCK prefix by the Delphi itself
Borland Delphi 5, released in 1999, was the one that introduced the lock prefix in string operations. As you know, when you assign one string to another, it does not copy the whole string but merely increases the reference counter inside the string. If you modify the string, it is dereferenced: the reference counter is decreased and separate space is allocated for the modified string.
In Delphi 4 and earlier, the operations to increase and decrease the reference counter were normal memory operations. Programmers who used Delphi knew about that and, if they were using strings across threads (i.e. passing a string from one thread to another), used their own locking mechanism only for the relevant strings. Programmers also used a read-only string copy that did not modify the source string in any way and did not require locking, for example:
function AssignStringThreadSafe(const Src: string): string;
var
L: Integer;
begin
L := Length(Src);
if L <= 0 then Result := '' else
begin
SetString(Result, nil, L);
Move(PChar(Src)^, PChar(Result)^, L*SizeOf(Src[1]));
end;
end;
But in Delphi 5, Borland added the LOCK prefix to the string operations and they became very slow compared to Delphi 4, even for single-threaded applications.
To overcome this slowness, programmers began to use "single-threaded" SYSTEM.PAS patch files with the locks commented out.
Please see https://synopse.info/forum/viewtopic.php?id=57&p=1 for more information.
FastMM4 Thread Contention
You can modify FastMM4 source code for a better locking mechanism, or use any existing FastMM4 fork, for example https://github.com/maximmasiutin/FastMM4
FastMM4 is not the fastest one for multicore operation, especially when the number of threads is more than the number of physical sockets. This is because, by default, on thread contention (i.e. when one thread cannot acquire access to data locked by another thread) it calls the Windows API function Sleep(0), and then, if the lock is still not available, enters a loop that calls Sleep(1) after each check of the lock.
Each call to Sleep(0) incurs the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles. As for Sleep(1), besides the costs associated with Sleep(0), it also delays execution by at least 1 millisecond, ceding control to other threads, and, if there are no threads waiting to be executed by a physical CPU core, puts the core to sleep, effectively reducing CPU usage and power consumption.
That’s why, in multithreaded work with FastMM, CPU use never reached 100%: because of the Sleep(1) issued by FastMM4. This way of acquiring locks is not optimal. A better way would have been a spin-lock of about 5000 pause instructions, followed, if the lock was still busy, by a SwitchToThread() API call. If pause is not available (on very old processors with no SSE2 support) or the SwitchToThread() API call is not available (on very old Windows versions, prior to Windows 2000), the best solution would be to use EnterCriticalSection/LeaveCriticalSection, which don’t have the latency associated with Sleep(1), and which also very effectively cede control of the CPU core to other threads.
The fork that I've mentioned uses a new approach to waiting for a lock, recommended by Intel in its Optimization Manual for developers - a spinloop of pause + SwitchToThread(), and, if any of these are not available: CriticalSections instead of Sleep(). With these options, the Sleep() will never be used but EnterCriticalSection/LeaveCriticalSection will be used instead. Testing has shown that the approach of using CriticalSections instead of Sleep (which was used by default before in FastMM4) provides significant gain in situations when the number of threads working with the memory manager is the same or higher than the number of physical cores. The gain is even more evident on computers with multiple physical CPUs and Non-Uniform Memory Access (NUMA). I have implemented compile-time options to take away the original FastMM4 approach of using Sleep(InitialSleepTime) and then Sleep(AdditionalSleepTime) (or Sleep(0) and Sleep(1)) and replace them with EnterCriticalSection/LeaveCriticalSection to save valuable CPU cycles wasted by Sleep(0) and to improve speed (reduce latency) that was affected each time by at least 1 millisecond by Sleep(1), because the Critical Sections are much more CPU-friendly and have definitely lower latency than Sleep(1).
When these options are enabled, FastMM4-AVX checks: (1) whether the CPU supports SSE2 and thus the "pause" instruction, and (2) whether the operating system has the SwitchToThread() API call. If both conditions are met, it uses a "pause" spin-loop for 5000 iterations and then SwitchToThread() instead of critical sections; if the CPU doesn't have the "pause" instruction or Windows doesn't have the SwitchToThread() API function, it uses EnterCriticalSection/LeaveCriticalSection.
You can see the test results, including those made on a computer with multiple physical CPUs (sockets), in that fork.
See also the Long Duration Spin-wait Loops on Hyper-Threading Technology Enabled Intel Processors article. Here is what Intel writes about this issue - and it applies to FastMM4 very well:
The long duration spin-wait loop in this threading model seldom causes a performance problem on conventional multiprocessor systems. But it may introduce a severe penalty on a system with Hyper-Threading Technology because processor resources can be consumed by the master thread while it is waiting on the worker threads. Sleep(0) in the loop may suspend the execution of the master thread, but only when all available processors have been taken by worker threads during the entire waiting period. This condition requires all worker threads to complete their work at the same time. In other words, the workloads assigned to worker threads must be balanced. If one of the worker threads completes its work sooner than others and releases the processor, the master thread can still run on one processor.
On a conventional multiprocessor system this doesn't cause performance problems because no other thread uses the processor. But on a system with Hyper-Threading Technology the processor the master thread runs on is a logical one that shares processor resources with one of the other worker threads.
The nature of many applications makes it difficult to guarantee that workloads assigned to worker threads are balanced. A multithreaded 3D application, for example, may assign the tasks for transformation of a block of vertices from world coordinates to viewing coordinates to a team of worker threads. The amount of work for a worker thread is determined not only by the number of vertices but also by the clipped status of the vertex, which is not predictable when the master thread divides the workload for working threads.
A non-zero argument in the Sleep function forces the waiting thread to sleep N milliseconds, regardless of the processor availability. It may effectively block the waiting thread from consuming processor resources if the waiting period is set properly. But if the waiting period is unpredictable from workload to workload, then a large value of N may make the waiting thread sleep too long, and a smaller value of N may cause it to wake up too quickly.
Therefore the preferred solution to avoid wasting processor resources in a long duration spin-wait loop is to replace the loop with an operating system thread-blocking API, such as the Microsoft Windows* threading API, WaitForMultipleObjects. This call causes the operating system to block the waiting thread from consuming processor resources.
It refers to Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon Processor application note.
You can also find a very good spin-loop implementation here at stackoverflow.
It also uses normal loads to check the lock before issuing a locked store, so as not to flood the CPU with locked operations in a loop, which would lock the bus.
FastMM4 per se is very good. Just improve the locking and you will get an excellent multi-threaded memory manager.
Please also be aware that each small block type is locked separately in FastMM4.
You can put padding between the small block control areas, to make each area have own cache line, not shared with other block sizes, and to make sure it begins at a cache line size boundary. You can use CPUID to determine the size of the CPU cache line.
So, with locking correctly implemented to suit your needs (i.e. whether you need NUMA or not, whether to use locked releases, etc.), you may find that the memory allocation routines are several times faster and do not suffer so severely from thread contention.
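The waiting strategy described in this answer (a plain read before the locked operation, a bounded "pause" spin, then SwitchToThread) can be sketched as follows. This is not the code of FastMM4 or of the fork; AcquireLock, ReleaseLock, CpuPause and the worker class are invented for illustration, and the sketch assumes Delphi 2009 or later on Windows, where InterlockedCompareExchange takes Integer arguments.

```pascal
program SpinLockSketch;
{$APPTYPE CONSOLE}

uses
  Windows, Classes;

const
  MaxSpins = 5000; // spin count quoted in the text above

var
  LockFlag: Integer = 0; // 0 = free, 1 = held
  Total: Integer = 0;    // only touched while LockFlag is held

procedure CpuPause;
asm
  db $F3, $90 // machine encoding of PAUSE (executes as NOP on older CPUs)
end;

procedure AcquireLock;
var
  Spins: Integer;
begin
  Spins := 0;
  // Plain read first: the LOCKed compare-exchange is attempted only when
  // the flag looks free.
  while (LockFlag <> 0) or
        (InterlockedCompareExchange(LockFlag, 1, 0) <> 0) do
  begin
    if Spins < MaxSpins then
    begin
      Inc(Spins);
      CpuPause;
    end
    else
    begin
      Spins := 0;
      SwitchToThread; // give up the rest of the time slice
    end;
  end;
end;

procedure ReleaseLock;
begin
  InterlockedExchange(LockFlag, 0);
end;

type
  TWorker = class(TThread)
  protected
    procedure Execute; override;
  end;

procedure TWorker.Execute;
var
  I: Integer;
begin
  for I := 1 to 50000 do
  begin
    AcquireLock;
    Inc(Total); // the protected work
    ReleaseLock;
  end;
end;

var
  Workers: array[0..3] of TWorker;
  I: Integer;
begin
  for I := Low(Workers) to High(Workers) do
    Workers[I] := TWorker.Create(False);
  for I := Low(Workers) to High(Workers) do
  begin
    Workers[I].WaitFor;
    Workers[I].Free;
  end;
  Writeln(Total); // 4 workers * 50000 = 200000 if the lock works
end.
```

Unlike Sleep(1), neither branch of the wait loop imposes a fixed minimum delay, which is the latency argument made above; whether it actually helps in a given application still has to be measured.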

FastMM deals with multi-threading just fine. It is the default memory manager for Delphi 2006 and up.
If you are using an older version of Delphi (Delphi 5 and up), you can still use FastMM. It's available on SourceForge.

You could use TopMM:
http://www.topsoftwaresite.nl/
You could also try ScaleMM2 (SynScaleMM is based on ScaleMM1), but I have to fix a bug regarding interthread memory, so it is not production ready yet :-(
http://code.google.com/p/scalemm/

The Delphi 6 memory manager is outdated and outright bad. We were using RecyclerMM both on a high-load production server and on a multi-threaded desktop application, and we had no issues with it: it's fast, reliable and doesn't cause excess fragmentation. (Fragmentation was the worst issue of Delphi's stock memory manager.)
The only drawback of RecyclerMM is that it isn't compatible with MemCheck out of the box. However, a small source alteration was enough to render it compatible.

How to listen to microphone and detect sound loudness in Delphi 7

I need a program to catch an event when microphone input gets louder than certain threshold value. So probably I need to constantly listen to mic, and somehow measure sound amplitude? Is it possible to do that in Delphi 7?

I recommend you to use the BASS Audio Library http://www.un4seen.com/bass.html
BASS is an audio library .. to provide developers with powerful stream (MP3.. OGG.. ) functions. All in a tiny DLL, under 100KB in size.
It's very easy to use, as this simple minimalistic program illustrates. It is based on the BASS Record Test for Delphi, included in the samples that come with BASS. See that sample for complete saving and playback of the recorded audio.
Just compile it and run it.
program rec;
uses Windows, Bass;
(* This function called while recording audio *)
function RecordingCallback(h:HRECORD; b:Pointer; l,u: DWord): boolean; stdcall;
var level:dword;
begin
level:=BASS_ChannelGetLevel(h);
write(''#13,LoWord(level),'-',HiWord(level),' ');
Result := True;
end;
begin
BASS_RecordInit(-1);
BASS_RecordStart(44100, 2, 0, @RecordingCallback, nil);
Readln;
BASS_RecordFree;
end.

Yes of course. Wave sound is just about that, the amplitude of the sound wave at each moment. Volume is afaik the RMS (root mean square) of the samples.
Just get whatever audio library you use, obtain the wave data and calculate this value. Maybe even simply having a moving average is already enough (sparing you the RMS thing).
Delphi 7 would do fine for this, and comes with mmsystem headers. More advanced components are available (I used the lakeofsoft lib for a while), but that might be overkill, if this is your only audio operation.
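The RMS calculation mentioned above is short enough to sketch. RmsLevel is a name made up for this example, and it assumes the audio library hands over 16-bit signed PCM samples; the threshold value is arbitrary.

```pascal
program RmsDemo;
{$APPTYPE CONSOLE}

// Root mean square of a block of 16-bit signed PCM samples.
function RmsLevel(const Samples: array of SmallInt): Double;
var
  I: Integer;
  Sum: Double;
begin
  Result := 0;
  if Length(Samples) = 0 then
    Exit;
  Sum := 0;
  for I := Low(Samples) to High(Samples) do
    Sum := Sum + Sqr(Samples[I] * 1.0); // square as floating point, no overflow
  Result := Sqrt(Sum / Length(Samples));
end;

const
  Threshold = 2000.0; // arbitrary loudness threshold for the demo

begin
  // A quiet block and a loud block of samples.
  if RmsLevel([30, -30, 30, -30]) > Threshold then
    Writeln('quiet block: too loud')
  else
    Writeln('quiet block: below threshold');

  if RmsLevel([8000, -8000, 8000, -8000]) > Threshold then
    Writeln('loud block: too loud')
  else
    Writeln('loud block: below threshold');
end.
```

Called once per recorded buffer, a function like this yields one loudness value per block, which can then be compared against the threshold or smoothed with a moving average.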

I recommend you look at AudioLab.
