vhdl signal generated faster before it is needed

vhdl signal generated faster before it is needed - buffer

I have a question about VHDL.
I"m driving an RGB LED matrix using an FPGA.
I have two main entities. The DRIVER and the COLLECTOR.
The DRIVER is used to just send the signals to the LED matrix.
The COLLECTOR is used to collect the incoming signals (from an Android device) and split them accordingly before sending these signals to the DIRVER.
The problem is, the signals generated in the COLLECTOR to drive the LED matrix, is generated faster than the Driver can accept it.
How do you solve a problem at which a signal is generated before it is needed?

Your question does not mention if you are running in a clocked process or not. If you are using combinatorial logic (not clocked), then you're at the mercy of routing delays and you're stuck.
I'm going to assume that you are using clocked processes. In that case, you should try using a shift-register to delay one of your signals by 'n' clock cycles.
-- <...usual entity/architecture preamble...>
constant CLOCK_DELAYS : natural := 10;
signal sreg : std_logic_vector(CLOCK_DELAYS - 1 downto 0);
begin -- start of RTL
process(clk)
begin
if rising_edge(clk) then
sreg <= sreg(CLOCK_DELAYS - 2 downto 0) & i_signal;
end if;
end process;
o_signal <= sreg(CLOCK_DELAYS - 1);

Related

What instruction set would be easiest to implement on a homemade ALU?

I'm designing a basic 8 or 16 bit computer (haven't really decided yet) using eeprom chips, sram, and an ALU made (mostly) out of individual transistors on a PCB using cmos logic that I already have partially designed and tested. And I thought it would be cool to use an already existing instruction set so I can compile C++ code for it instead of writing everything in machine code.
I looked at the AVR gcc compiler on Compiler Explorer and the machine code it produces, it looks very simple and I think it is only 8-bits. Or should I go for 32-bits and try to use x86? That would make the ALU a lot bigger. Are there compilers that let you use limited instructions so I don't have to make every single one? Or would it even be easier to just write an interpreter for a custom instruction set? Any advice is welcome, thank you.

After a bit of research it has become apparent that trying to recreate modern ALUs and instructions would be very complicated and time consuming, and I should definitely make my own simplistic architecture and if I really want to compile C code for it I could probably just interpret x86 or AVR assembly from gcc.
I would also love some feedback on my design, I came up with a really weird ISA last night that is focused mainly on being easy to engineer the hardware.
There are two registers in the ALU, all other registers perform functions based off those two numbers all at the same time. For instance, there is a register that holds the added result of A and B, one that holds the result of A shifted right B times, a "jump if A > B" branch, and so on.
And so to add a number, it would take 3 clock cycles, you would move two values from ram into A and B, then copy the data back to ram afterwards. It would look like this:
setA addressInRam1 (6-bit opcode, 18-bit address/value)
setB addressInRam2
copyAddedResult addressInRam1
And program code is executed directly from EEPROM memory. I don't know if I should think of it as having two general purpose registers or it having 2^18 registers. Either way, it makes it much easier and simpler to build when you're executing instructions one at a time like that. Again any advice is welcome, I am somewhat of a noob in this field, thank you!
Oh and then an additional C register to hold a value to be stored in RAM the next clock cycle specified in the set register. This is what the Fibonacci sequence would look like:
1: setC 1; // setting C reg to 1
2: set 0; // setting address 0 in ram to the C register
3: setA 0; // copying value in address 0 of ram into A reg
// repeat for B reg
4: set 1; // setting this to the same as the other
5: setB 1;
6: jumpIf> 9; // jump to line 9 if A > B
7: getSum 0; // put sum of A and B into address 0 of ram
8: setA 0; // set the A register to address 0 of ram
9: getSum 1; // "else" put the sum into the second variable
10: setB 1;
11: jump 6; // loop back to line 6 forever
I made a C++ equivalent and put it through compiler explorer and despite the many drawbacks of this architecture it uses the same amount of clock cycles as x64 in the loop and two more in total. But I think this function in particular works pretty well with it as I don't have to reassign A and B often.

Ada. Building "main" file takes forever when tasking

I have this simple tasking (threads) program that I'd like to run but building it takes forever (30 seconds or more). It makes it exhausting to have to wait for the build to be built before running the program every-time, especially when I all want to do is change something insignificant, like adding a Put sentence here or there.
This is the program I have been running for reference. I am using GPS 2016. I am a beginner in Ada.
with Ada.Text_IO, Ada.Integer_Text_IO;
use Ada.Text_IO, Ada.Integer_Text_IO;
procedure Main is
task First_Task;
task body First_Task is
begin
for Index in 1..4 loop
delay 2.0;
Put("This is in First_Task, pass number ");
Put(Index, 3);
New_Line;
end loop;
end First_Task;
task Second_Task;
task body Second_Task is
begin
for Index in 1..7 loop
delay 1.0;
Put("This is in Second_Task, pass number");
Put(Index, 3);
New_Line;
end loop;
end Second_Task;
task Third_Task;
task body Third_Task is
begin
for Index in 1..5 loop
delay 0.1;
Put("This is in Third_Task, pass number ");
Put(Index, 3);
New_Line;
end loop;
end Third_Task;
begin
for Index in 1..5 loop
delay 0.7;
Put_Line("This is in the main program.");
end loop;
end Main;

Posting an answer to help searchability for future users. If you find the full solution, exactly why your AV software does this and a clean solution, don't hesitate to post and accept your own answer.
First, the MCVE enabled a quick test, revealing nothing wrong with either the code or at least one Gnat compiler (Linux x86-64, Debian Jessie, gcc4.9.3) pointing at an installation-specific problem.
The installation in question is Gnat GPL-2016 (32 bit) on Windows-10, with GPS as the IDE, and AVAST anti-virus software.
Previous problem reports and rumour pointed at two possible candidates,
unusual Python installations - GPS depends on Python, and finding an unexpected Python version is rumoured to cause some troubles
Anti-virus software interacting with the IDE in unexpected ways.
Of these, the latter is confirmed to be the problem, and disabling AV during program build restores acceptable build times. (This isn't specific to Ada or Gnat, I've seen it on FPGA development tools too)
So we have a temporary workaround.
The next step might be to identify why AVAST is allergic to the build process, and disable its reaction to false positives, to maintain AV protection during programming sessions.
Possible candidates may be the intermediate .o and .ali files (Object and Ada Linker), or intermediate "binding" files b~whatever.ads/b which stitch the Ada code to runtime system and OS.
Most likely, the b~whatever.o object files spark an allergic reaction when they link to unusual OS primitives for process manipulation, to implement Ada tasking. Possibly this resembles virus behaviour closely enough to attract attention.
One answer may be to teach Avast not to scan your Ada project's build folder, or to filter what it scans by file type. But I can be no further help, and I encourage a better answer from anyone who finds one.

Interrupt during network I/O == crash?

It seems that when an I/O pin interrupt occurs while network I/O is being performed, the system resets -- even if the interrupt function only declares a local variable and assigns it (essentially a do-nothing routine.) So I'm fairly certain it isn't to do with spending too much time in the interrupt function. (My actual working interrupt functions are pretty spartan, strictly increment and assign, not even any conditional logic.)
Is this a known constraint? My workaround is to disconnect the interrupt while using the network, but of course this introduces potential for data loss.
function fnCbUp(level)
lastTrig = rtctime.get()
gpio.trig(pin, "down", fnCbDown)
end
function fnCbDown(level)
local spin = rtcmem.read32(20)
spin = spin + 1
rtcmem.write32(20, spin)
lastTrig = rtctime.get()
gpio.trig(pin, "up", fnCbUp)
end
gpio.trig(pin, "down", fnCbDown)
gpio.mode(pin, gpio.INT, gpio.FLOAT)
branch: master
build built on: 2016-03-15 10:39
powered by Lua 5.1.4 on SDK 1.4.0
modules: adc,bit,file,gpio,i2c,net,node,pwm,rtcfifo,rtcmem,rtctime,sntp,tmr,uart,wifi

Not sure if this should be an answer or a comment. May be a bit long for a comment though.
So, the question is "Is this a known constraint?" and the short but unsatisfactory answer is "no". Can't leave it like that...
Is the code excerpt enough for you to conclude the reset must occur due to something within those few lines? I doubt it.
What you seem to be doing is a simple "global" increment of each GPIO 'down' with some debounce logic. However, I don't see any debounce, what am I missing? You get the time into the global lastTrig but you don't do anything with it. Just for debouncing you won't need rtctime IMO but I doubt it's got anything to do with the problem.
I have a gist of a tmr.delay-based debounce as well as one with tmr.now that is more like a throttle. You could use the first like so:
GPIO14 = 5
spin
function down()
spin = spin + 1
tmr.delay(50) -- time delay for switch debounce
gpio.trig(GPIO14, "up", up) -- change trigger on falling edge
end
function up()
tmr.delay(50)
gpio.trig(GPIO14, "down", down) -- trigger on rising edge
end
gpio.mode(GPIO14, gpio.INT) -- gpio.FLOAT by default
gpio.trig(GPIO14, "down", down)
I also suggest running this against the dev branch because you said it be related to network I/O during interrupts.

I have nearly the same problem.
Running ESP8266Webserver, using GPIO14 Interrupt, with too fast Impulses as input ,
the system stopps recording the interrupts.
Please see here for more details.
http://www.esp8266.com/viewtopic.php?f=28&t=9702
I'm using ARDUINO IDE 1.69 but the Problem seems to be the same.
I used an ESP8266-07 as generator & counter (without Webserver)
to generate the Pulses, wired to my ESP8266-Watersystem.
The generator works very well, with much more than 240 puls / sec,
generating and counting on the same ESP.
But the ESP-Watersystem, stops recording interrupts here at impuls > 50/ second:
/*************************************************/
/* ISR Water pulse counter */
/*************************************************/
/**
* Invoked by interrupt14 once per rotation of the hall-effect sensor. Interrupt
* handlers should be kept as small as possible so they return quickly.
*/
void ICACHE_RAM_ATTR pulseCounter()
{
// Increment the pulse Counter
cli();
G_pulseCount++;
Serial.println ( "!" );
sei();
}
The serial output is here only for showing whats happening.
It shows the correct counted Impuls, until the webserver interacts with the network.
Than is seams the Interrupt is blocked.(no serial output from here)
By stressing the System, when I several times refresh the Website in an short time,
the interrupt counting starts for an short time, but it stops short time again.
The problem is anywhere along Interrupt handling and Webservices.
I hope I could help to find this issues.
Interessted in getting some solutions.
Who can help?
Thanks from Mickbaer
Berlin Germany
Email: michael.lorenz#web.de

How to parallel code a for-down-to loop in delphi, active over a list, deleting items as you go?

Take for example the following code:
for i := (myStringList.Count - 1) DownTo 0 do begin
dataList := SplitString(myStringList[i], #9);
x := StrToFloat(dataList[0]);
y := StrToFloat(dataList[1]);
z := StrToFloat(dataList[2]);
//Do something with these variables
myOutputRecordArray[i] := {SomeFunctionOf}(x,y,z)
//Free Used List Item
myStringList.Delete(i);
end;
//Free Memory
myStringList.Free;
How would you parallelise this using, for example, the OmniThreadLibrary? Is it possible? Or does it need to be restructured?
I'm calling myStringList.Delete(i); at each iteration as the StringList is large and freeing items after use at each iteration is important to minimise memory usage.

Simple answer: You wouldn't.
More involved answer: The last thing you want to do in a parallelized operation is modify shared state, such as this delete call. Since it's not guaranteed that each individual task will finish "in order"--and in fact it's highly likely that they won't at least once, with that probability approaching 100% very quickly the more tasks you add to the total workload--trying to do something like that is playing with fire.
You can either destroy the items as you go and do it serialized, or do it in parallel, finish faster, and destroy the whole list. But I don't think there's any way to have it both ways.

You can cheat. Setting the string value to an empty string will free most of the memory and will be thread safe. At the end of the processing you can then clear the list.
Parallel.ForEach(0, myStringList.Count - 1).Execute(
procedure (const index: integer)
var
dataList: TStringDynArray;
x, y, z: Single;
begin
dataList := SplitString(myStringList[index], #9);
x := StrToFloat(dataList[0]);
y := StrToFloat(dataList[1]);
z := StrToFloat(dataList[2]);
//Do something with these variables
myOutputRecordArray[index] := {SomeFunctionOf}(x,y,z);
//Free Used List Item
myStringList[index] := '';
end);
myStringList.Clear;
This code is safe because we are never writing to a shared object from multiple threads. You need to make sure that all of the variables you use that would normally be local are declared in the threaded block.

I'm not going to attempt to show how to do what you originally asked because it is a bad idea that will not lead to improved performance. Not even assuming that you deal with the many and various data races in your proposed parallel implementation.
The bottleneck here is the disk I/O. Reading the entire file into memory, and then processing the contents is the design choice that is leading to your memory problems. The correct way to solve this problem is to use a pipeline.
Step 1 of the pipeline takes as input the file on disk. The code here reads chunks of the file and then breaks those chunks into lines. These lines are the output of this step. The entire file is never in memory at one time. You'll have to tune the size of the chunks that you read.
Step 2 takes as input the strings the step 1 produced. Step 2 consumes those strings and produces vectors. Those vectors are added to your vector list.
Step 2 will be faster than step 1 because I/0 is so expensive. Therefore there's nothing to be gained by trying to optimise either of the steps with parallel algorithms. Even on a uniprocessor machine this pipelined implementation could be faster than non-pipelined.

Need multi-threading memory manager

I will have to create a multi-threading project soon I have seen experiments ( delphitools.info/2011/10/13/memory-manager-investigations ) showing that the default Delphi memory manager has problems with multi-threading.
So, I have found this SynScaleMM. Anybody can give some feedback on it or on a similar memory manager?
Thanks

Our SynScaleMM is still experimental.
EDIT: Take a look at the more stable ScaleMM2 and the brand new SAPMM. But my remarks below are still worth following: the less allocation you do, the better you scale!
But it worked as expected in a multi-threaded server environment. Scaling is much better than FastMM4, for some critical tests.
But the Memory Manager is perhaps not the bigger bottleneck in Multi-Threaded applications. FastMM4 could work well, if you don't stress it.
Here are some (not dogmatic, just from experiment and knowledge of low-level Delphi RTL) advice if you want to write FAST multi-threaded application in Delphi:
Always use const for string or dynamic array parameters like in MyFunc(const aString: String) to avoid allocating a temporary string per each call;
Avoid using string concatenation (s := s+'Blabla'+IntToStr(i)) , but rely on a buffered writing such as TStringBuilder available in latest versions of Delphi;
TStringBuilder is not perfect either: for instance, it will create a lot of temporary strings for appending some numerical data, and will use the awfully slow SysUtils.IntToStr() function when you add some integer value - I had to rewrite a lot of low-level functions to avoid most string allocation in our TTextWriter class as defined in SynCommons.pas;
Don't abuse on critical sections, let them be as small as possible, but rely on some atomic modifiers if you need some concurrent access - see e.g. InterlockedIncrement / InterlockedExchangeAdd;
InterlockedExchange (from SysUtils.pas) is a good way of updating a buffer or a shared object. You create an updated version of of some content in your thread, then you exchange a shared pointer to the data (e.g. a TObject instance) in one low-level CPU operation. It will notify the change to the other threads, with very good multi-thread scaling. You'll have to take care of the data integrity, but it works very well in practice.
Don't share data between threads, but rather make your own private copy or rely on some read-only buffers (the RCU pattern is the better for scaling);
Don't use indexed access to string characters, but rely on some optimized functions like PosEx() for instance;
Don't mix AnsiString/UnicodeString kind of variables/functions, and check the generated asm code via Alt-F2 to track any hidden unwanted conversion (e.g. call UStrFromPCharLen);
Rather use var parameters in a procedure instead of function returning a string (a function returning a string will add an UStrAsg/LStrAsg call which has a LOCK which will flush all CPU cores);
If you can, for your data or text parsing, use pointers and some static stack-allocated buffers instead of temporary strings or dynamic arrays;
Don't create a TMemoryStream each time you need one, but rely on a private instance in your class, already sized in enough memory, in which you will write data using Position to retrieve the end of data and not changing its Size (which will be the memory block allocated by the MM);
Limit the number of class instances you create: try to reuse the same instance, and if you can, use some record/object pointers on already allocated memory buffers, mapping the data without copying it into temporary memory;
Always use test-driven development, with dedicated multi-threaded test, trying to reach the worse-case limit (increase number of threads, data content, add some incoherent data, pause at random, try to stress network or disk access, benchmark with timing on real data...);
Never trust your instinct, but use accurate timing on real data and process.
I tried to follow those rules in our Open Source framework, and if you take a look at our code, you'll find out a lot of real-world sample code.

If your app can accommodate GPL licensed code, then I'd recommend Hoard. You'll have to write your own wrapper to it but that is very easy. In my tests, I found nothing that matched this code. If your code cannot accommodate the GPL then you can obtain a commercial licence of Hoard, for a significant fee.
Even if you can't use Hoard in an external release of your code you could compare its performance with that of FastMM to determine whether or not your app has problems with heap allocation scalability.
I have also found that the memory allocators in the versions of msvcrt.dll distributed with Windows Vista and later scale quite well under thread contention, certainly much better than FastMM does. I use these routines via the following Delphi MM.
unit msvcrtMM;
interface
implementation
type
size_t = Cardinal;
const
msvcrtDLL = 'msvcrt.dll';
function malloc(Size: size_t): Pointer; cdecl; external msvcrtDLL;
function realloc(P: Pointer; Size: size_t): Pointer; cdecl; external msvcrtDLL;
procedure free(P: Pointer); cdecl; external msvcrtDLL;
function GetMem(Size: Integer): Pointer;
begin
Result := malloc(size);
end;
function FreeMem(P: Pointer): Integer;
begin
free(P);
Result := 0;
end;
function ReallocMem(P: Pointer; Size: Integer): Pointer;
begin
Result := realloc(P, Size);
end;
function AllocMem(Size: Cardinal): Pointer;
begin
Result := GetMem(Size);
if Assigned(Result) then begin
FillChar(Result^, Size, 0);
end;
end;
function RegisterUnregisterExpectedMemoryLeak(P: Pointer): Boolean;
begin
Result := False;
end;
const
MemoryManager: TMemoryManagerEx = (
GetMem: GetMem;
FreeMem: FreeMem;
ReallocMem: ReallocMem;
AllocMem: AllocMem;
RegisterExpectedMemoryLeak: RegisterUnregisterExpectedMemoryLeak;
UnregisterExpectedMemoryLeak: RegisterUnregisterExpectedMemoryLeak
);
initialization
SetMemoryManager(MemoryManager);
end.
It is worth pointing out that your app has to be hammering the heap allocator quite hard before thread contention in FastMM becomes a hindrance to performance. Typically in my experience this happens when your app does a lot of string processing.
My main piece of advice for anyone suffering from thread contention on heap allocation is to re-work the code to avoid hitting the heap. Not only do you avoid the contention, but you also avoid the expense of heap allocation – a classic twofer!

It is locking that makes the difference!
There are two issues to be aware of:
Use of the LOCK prefix by the Delphi itself (System.dcu);
How does FastMM4 handles thread contention and what it does after it failed to acquire a lock.
Use of the LOCK prefix by the Delphi itself
Borland Delphi 5, released in 1999, was the one that introduced the lock prefix in string operations. As you know, when you assign one string to another, it does not copy the whole string but merely increases the reference counter inside the string. If you modify the string, it is de-references, decreasing the reference counter and allocating separate space for the modified string.
In Delphi 4 and earlier, the operations to increase and decrease the reference counter were normal memory operations. The programmers that have used Delphi knew about and, and, if they were using strings across threads, i.e. pass a string from one thread to another, have used their own locking mechanism only for the relevant strings. Programmers did also use read-only string copy that did not modify in any way the source string and did not require locking, for example:
function AssignStringThreadSafe(const Src: string): string;
var
L: Integer;
begin
L := Length(Src);
if L <= 0 then Result := '' else
begin
SetString(Result, nil, L);
Move(PChar(Src)^, PChar(Result)^, L*SizeOf(Src[1]));
end;
end;
But in Delphi 5, Borland have added the LOCK prefix to the string operations and they became very slow, compared to Delphi 4, even for single-threaded applications.
To overcome this slowness, programmers became to use "single threaded" SYSTEM.PAS patch files with lock's commented.
Please see https://synopse.info/forum/viewtopic.php?id=57&p=1 for more information.
FastMM4 Thread Contention
You can modify FastMM4 source code for a better locking mechanism, or use any existing FastMM4 fork, for example https://github.com/maximmasiutin/FastMM4
FastMM4 is not the fastest one for multicore operation, especially when the number of threads is more than the number of physical sockets is because it, by default, on thread contention (i.e. when one thread cannot acquire access to data, locked by another thread) calls Windows API function Sleep(0), and then, if the lock is still not available enters a loop by calling Sleep(1) after each check of the lock.
Each call to Sleep(0) experiences the expensive cost of a context switch, which can be 10000+ cycles; it also suffers the cost of ring 3 to ring 0 transitions, which can be 1000+ cycles. As about Sleep(1) – besides the costs associated with Sleep(0) – it also delays execution by at least 1 millisecond, ceding control to other threads, and, if there are no threads waiting to be executed by a physical CPU core, puts the core into sleep, effectively reducing CPU usage and power consumption.
That’s why, on multithreded wotk with FastMM, CPU use never reached 100% - because of the Sleep(1) issued by FastMM4. This way of acquiring locks is not optimal. A better way would have been a spin-lock of about 5000 pause instructions, and, if the lock was still busy, calling SwitchToThread() API call. If pause is not available (on very old processors with no SSE2 support) or SwitchToThread() API call was not available (on very old Windows versions, prior to Windows 2000), the best solution would be to utilize EnterCriticalSection/LeaveCriticalSection, that don’t have latency associated by Sleep(1), and which also very effectively cedes control of the CPU core to other threads.
The fork that I've mentioned uses a new approach to waiting for a lock, recommended by Intel in its Optimization Manual for developers - a spinloop of pause + SwitchToThread(), and, if any of these are not available: CriticalSections instead of Sleep(). With these options, the Sleep() will never be used but EnterCriticalSection/LeaveCriticalSection will be used instead. Testing has shown that the approach of using CriticalSections instead of Sleep (which was used by default before in FastMM4) provides significant gain in situations when the number of threads working with the memory manager is the same or higher than the number of physical cores. The gain is even more evident on computers with multiple physical CPUs and Non-Uniform Memory Access (NUMA). I have implemented compile-time options to take away the original FastMM4 approach of using Sleep(InitialSleepTime) and then Sleep(AdditionalSleepTime) (or Sleep(0) and Sleep(1)) and replace them with EnterCriticalSection/LeaveCriticalSection to save valuable CPU cycles wasted by Sleep(0) and to improve speed (reduce latency) that was affected each time by at least 1 millisecond by Sleep(1), because the Critical Sections are much more CPU-friendly and have definitely lower latency than Sleep(1).
When these options are enabled, FastMM4-AVX it checks: (1) whether the CPU supports SSE2 and thus the "pause" instruction, and (2) whether the operating system has the SwitchToThread() API call, and, if both conditions are met, uses "pause" spin-loop for 5000 iterations and then SwitchToThread() instead of critical sections; If a CPU doesn't have the "pause" instrcution or Windows doesn't have the SwitchToThread() API function, it will use EnterCriticalSection/LeaveCriticalSection.
You can see the test results, including made on a computer with multiple physical CPUs (sockets) in that fork.
See also the Long Duration Spin-wait Loops on Hyper-Threading Technology Enabled Intel Processors article. Here is what Intel writes about this issue - and it applies to FastMM4 very well:
The long duration spin-wait loop in this threading model seldom causes a performance problem on conventional multiprocessor systems. But it may introduce a severe penalty on a system with Hyper-Threading Technology because processor resources can be consumed by the master thread while it is waiting on the worker threads. Sleep(0) in the loop may suspend the execution of the master thread, but only when all available processors have been taken by worker threads during the entire waiting period. This condition requires all worker threads to complete their work at the same time. In other words, the workloads assigned to worker threads must be balanced. If one of the worker threads completes its work sooner than others and releases the processor, the master thread can still run on one processor.
On a conventional multiprocessor system this doesn't cause performance problems because no other thread uses the processor. But on a system with Hyper-Threading Technology the processor the master thread runs on is a logical one that shares processor resources with one of the other worker threads.
The nature of many applications makes it difficult to guarantee that workloads assigned to worker threads are balanced. A multithreaded 3D application, for example, may assign the tasks for transformation of a block of vertices from world coordinates to viewing coordinates to a team of worker threads. The amount of work for a worker thread is determined not only by the number of vertices but also by the clipped status of the vertex, which is not predictable when the master thread divides the workload for working threads.
A non-zero argument in the Sleep function forces the waiting thread to sleep N milliseconds, regardless of the processor availability. It may effectively block the waiting thread from consuming processor resources if the waiting period is set properly. But if the waiting period is unpredictable from workload to workload, then a large value of N may make the waiting thread sleep too long, and a smaller value of N may cause it to wake up too quickly.
Therefore the preferred solution to avoid wasting processor resources in a long duration spin-wait loop is to replace the loop with an operating system thread-blocking API, such as the Microsoft Windows* threading API,
WaitForMultipleObjects. This call causes the operating system to block the waiting thread from consuming processor resources.
It refers to Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon Processor application note.
You can also find a very good spin-loop implementation here at stackoverflow.
It also loads normal loads just to check before issuing a lock-ed store, just to not flood the CPU with locked operations in a loop, that would lock the bus.
FastMM4 per se is very good. Just improve the locking and you will get an excelling multi-threaded memory manager.
Please also be aware that each small block type is locked separately in FastMM4.
You can put padding between the small block control areas, to make each area have own cache line, not shared with other block sizes, and to make sure it begins at a cache line size boundary. You can use CPUID to determine the size of the CPU cache line.
So, with locking correctly implemented to suit your needs (i.e. whether you need NUMA or not, whether to use lock-ing releases, etc., you may obtain the results that the memory allocation routines would be several times faster and would not suffer so severely from thread contention.

FastMM deals with multi-threading just fine. It is the default memory manager for Delphi 2006 and up.
If you are using an older version of Delphi (Delphi 5 and up), you can still use FastMM. It's available on SourceForge.

You could use TopMM:
http://www.topsoftwaresite.nl/
You could also try ScaleMM2 (SynScaleMM is based on ScaleMM1) but I have to fix a bug regarding to interthread memory, so not production ready yet :-(
http://code.google.com/p/scalemm/

Deplhi 6 memory manager is outdated and outright bad. We were using RecyclerMM both on a high-load production server and on a multi-threaded desktop application, and we had no issues with it: it's fast, reliable and doesn't cause excess fragmentation. (Fragmentation was Delphi's stock memory manager worst issue).
The only drawback of RecyclerMM is that it isn't compatible with MemCheck out of the box. However, a small source alteration was enough to render it compatible.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart