Using SSE to round in Delphi

Using SSE to round in Delphi - delphi

I wrote this function to round singles to integers:
function Round(const Val: Single): Integer;
begin
asm
cvtss2si eax,Val
mov Result,eax
end;
end;
It works, but I need to change the rounding mode. Apparently, per this, I need to set the MXCSR register.
How do I do this in Delphi?
The reason I am doing this in the first place is I need "away-from-zero" rounding (like in C#), which is not possible even via SetRoundingMode.

On modern Delphi, to set MXCSR you can call SetMXCSR from the System unit. To read the current value use GetMXCSR.
Do beware that SetMXCSR, just like Set8087CW is not thread-safe. Despite my efforts to persuade Embarcadero to change this, it seems that this particular design flaw will remain with us forever.
On older versions of Delphi you use the LDMXCSR and STMXCSR opcodes. You might write your own versions like this:
function GetMXCSR: LongWord;
asm
PUSH EAX
STMXCSR [ESP].DWord
POP EAX
end;
procedure SetMXCSR(NewMXCSR: LongWord);
//thread-safe version that does not abuse the global variable DefaultMXCSR
var
MXCSR: LongWord;
asm
AND EAX, $FFC0 // Remove flag bits
MOV MXCSR, EAX
LDMXCSR MXCSR
end;
These versions are thread-safe and I hope will compile and work on older Delphi versions.
Do note that using the name Round for your function is likely to cause a lot of confusion. I would advise that you do not do that.
Finally, I checked in the Intel documentation and both of the Intel floating point units (x87, SSE) offer just the rounding modes specified by the IEEE754 standard. They are:
Round to nearest (even)
Round down (toward −∞)
Round up (toward +∞)
Round toward zero (Truncate)
So, your desired rounding mode is not available.

Related

How to use 16-bits Assembly inline on Delphi on Windows98?

Today I was playing with my old computer and trying to use 16-bits Assembly inside Delphi. It's works fine with 32-bits but I always had problem when I used interrupts. Blue Screen or Freezing, that was making me believe that's not possible to do it. I'm on Windows 98 and using Delphi 7, using this simple code.
program Project1;
{$APPTYPE CONSOLE}
uses
SysUtils, Windows;
begin
asm
mov ax,$0301
mov bx,$0200
mov cx,$0001
xor dx,dx
int $13
int $20
end;
MessageBox(0,'Okay','Okay',MB_OK);
end.
To "format" a diskkete on the Floppy drive. There's a way to use it on Delphi 7 without freezing and blues screens? Or Delphi only allows to use 32-bits Assembly? Am I doing something wrong?

As long as your application is built as "32-bit Windows" application, the interrupts cannot work since these interrupts are simply not mapped.
You could try to compile your application as a "16-bit Console" application. I don't know if Delphi supports this, but that's my best guess for getting the emulation of int 0x13 and int 0x10.
By the way, shouldn't your assembly code use hexadecimal numbers, like this:?
mov ax, $0301
mov bx, $0200
mov cx, $0001
xor dx, dx
int $13
int $20
As it is now, you are probably calling interrupt $0d, which according to Ralf Brown's Interrupt List means:
INT 0D C - IRQ5 - FIXED DISK (PC,XT), LPT2 (AT), reserved (PS/2)

Delphi 7 produces 32 bit executables. Your 16 bit assembly code is therefore not compatible with the compiler you use.
You might have some luck with a 16 bit compiler, e.g. Turbo Pascal or Delphi 1. But it would make more sense, I suspect, to use the Win32 API to achieve your goals.

What does CLD do here?

I have seen code
procedure FillDWord(var Dest; Count, What: dword); assembler ;
asm
PUSH EDI
MOV EDI, Dest
MOV EAX, What
MOV ECX, Count
CLD
REP STOSD
POP EDI
end;
I googled CLD and it says it clears the direction flag... so is it important here? after I removed it, the function seems working fine.

The direction flag controls if - during the execution of REP STOSD the EDI register will be incremented or decremented.
In case of a cleared direction flag (e.g. after execution of CLD) the pointer will be incremented, so the function does a memory fill.
The CLD is in this code because the programmer probably was not able to guarantee that the direction flag was cleared. Therefore he made sure that it is cleared before executing REP STOSD.
If the code works when CLD is removed, then the direction flag was clear at the entry of the function. Since the direction flag is not part of the calling conventions that was just by luck. It could be the other way next time, and in this case your program will very likely crash.
Clearing/setting the flag is a very fast operation, so it's good practice to add them to the assembler code. This also makes it easy for other programmers to understand your function because the state of the direction flag is explicitly defined.

The stosd command can either work down the memory, incrementing EDI, or up the memory, decrementing it. This depends on the value of the direction ("D") flag. If the flag is set to 1 upon function entrance and never explicitly cleared, it'll misbehave wildly. There's no convention on the default value of that flag; so the function plays it safe.
EDIT: Egor says Delphi has a convention :) Still, better safe than sorry.

Declaring block level variables for branches in delphi

In Delphi prism we can declare variables that is only needed in special occasions.
eg: In prism
If acondition then
begin
var a :Integer;
end;
a := 3; //this line will produce error. because a will be created only when the condition is true
Here 'a' cannot be assigned with 3 because it is nested inside a branch.
How can we declare a variable which can be used only inside a branch in delphi win32. So i can reduce memory usage as it is only created if a certain condition is true;
If reduced memory usage is not a problem what are the draw backs we have (or we don't have)

The premise of your question is faulty. You're assuming that in languages where block-level variables are allowed, the program allocates and releases memory for those variable when control enters or leaves those variables' scopes. So, for example, you think that when acondition is true, the program adjusts the stack to make room for the a variable as it enters that block. But you're wrong.
Compilers calculate the maximum space required for all declared variables and temporary variables, and then they reserve that much space upon entry to the function. Allocating that space is as simple as adjusting the stack pointer; the time required usually has nothing to do with the amount of space being reserved. The bottom line is that your idea won't actually save any space.
The real advantage to having block-level variables is that their scopes are limited.
If you really need certain variables to be valid in only one branch of code, then factor that branch out to a separate function and put your variables there.

The concept of Local Variable Declaration Statements like in Java is not supported in Delphi, but you could declare a sub-procedure:
procedure foo(const acondition: boolean);
procedure subFoo;
var
a: integer;
begin
a := 3;
end;
begin
If acondition then
begin
subFoo;
end;
end;

There is no way in Delphi to limit scope of an variable to less than entire routine. And in case of a single integer variable it doesn't make sense to worry about it... But in case of large data structure you should allocate it dynamically, not statically, ie instead of
var integers: array[1..10000]of Integer;
use
type TIntArray: array of Integer;
var integers: TIntArray;
If acondition then
begin
SetLength(integers, 10000);
...
end;

Beware that it could only be "syntactic sugar". The compiler may ensure you don't use the variable outside the inner scope, but that doesn't mean it could save memory. The variable may be allocated on the stack in the procedure entry code anyway, regardless if it is actually used or not. AFAIK most ABI initialize the stack on entry and clean it on exit. Manipulating the stack in a much more complex way while the function is executing including taking care of different execution paths may be even less performant - instead of a single instruction to reserve stack space you need several instruction scattered along code, and ensure the stack is restored correctly adding more, epecially stack unwinding due to an exception may become far more complex.
If the aim is to write "better" code because of better scope handling to ensure the wrong variable is not used in the wrong place it could be useful, but if you need it as a way to save memory it could not be the right way.

You can emulate block-level variables with the (dreaded) with statement plus a function returning a record. Here's a bit of sample code, written in the browser:
type TIntegerA = record
A: Integer;
end;
function varAInteger: TIntegerA;
begin
Result.A := 0;
end;
// Code using this pseudo-local-variable
if Condition then
with varAInteger do
begin
A := 7; // Works.
end
else
begin
A := 3; // Error, the compiler doesn't know who A is
end;
Edit to clarify this proposition
Please note this kind of wizardry is no actual replacement for true block-level variables: Even those they're likely allocated on stack, just like most other local variables, the compiler is not geared to treat them as such. It's not going to do the same optimizations: a returned record will always be stored in an actual memory location, while a true local variable might be associated with a CPU register. The compiler will also not let you use such variables for "for" statements, and that's a big problem.

Having commented all that - there is a party trick that Delphi has that has far more uses than a simple local variable and may achieve your aim:
function Something: Integer;
begin
// don't want any too long local variables...
If acondition then
asm
// now I have lots of 'local' variables available in the registers
mov EAX, #AnotherVariable //you can use pascal local variables too!
// do something with the number 3
Add EAX, 3
mov #Result, EAX
jmp #next
#AnotherVariable: dd 10
#next:
end;
end;
end;
:)) bit of a pointless example...

How to ensure 16byte code alignment of Delphi routines?

Background:
I have a unit of optimised Delphi/BASM routines, mostly for heavy computations. Some of these routines contain inner loops for which I can achieve a significant speed-up if the loop start is aligned to a DQWORD (16-byte) boundary. I can ensure that the loops in question are aligned as desired IF I know the alignment at the routine entry point.
As far as I can see, the Delphi compiler aligns procedures/functions to DWORD boundaries, and e.g. adding functions to the unit may change the alignment of subsequent ones. However, as long as I pad the end of routines to multiples of 16, I can ensure that subsequent routines are likewise aligned -- or misaligned, depending on the alignment of the first routine. I therefore tried to place the critical routines at the beginning of the unit's implementation section, and put a bit of padding code before them so that the first procedure would be DQWORD aligned.
This looks something like below:
interface
procedure FirstProcInUnit;
implementation
procedure __PadFirstProcTo16;
asm
// variable number of NOP instructions here to get the desired code length
end;
procedure FirstProcInUnit;
asm //should start at DQWORD boundary
//do something
//padding to align the following label to DQWORD boundary
#Some16BAlignedLabel:
//code, looping back to #Some16BAlignedLabel
//do something else
ret #params
//padding to get code length to multiple of 16
end;
initialization
__PadFirstProcTo16; //call this here so that it isn't optimised out
ASSERT ((NativeUInt(Pointer(#FirstProcInUnit)) AND $0F) = 0, 'FirstProcInUnit not DQWORD aligned');
end.
This is a bit of a pain in the neck, but I can get this sort of thing to work when necessary. The problem is that when I use such a unit in different projects, or make some changes to other units in the same project, this may still break the alignment of __PadFirstProcTo16 itself. Likewise, recompiling the same project with different compiler versions (e.g. D2009 vs. D2010) typically also breaks the alignment. So, the only way of doing this sort of thing I found was by hand as the pretty much last thing to be done when all the rest of the project is in its final form.
Question 1:
Is there any other way to achieve the desired effect of ensuring that (at least some specific) routines are DQWORD-aligned?
Question 2:
Which are the exact factors that affect the compiler's alignment of code and (how) could I use such specific knowledge to overcome the problem outlined here?
Assume that for the sake of this question "don't worry about code alignment/the associated presumably small speed benefits" is not a permissible answer.

As of Delphi XE, the problem of code alignment is now easily solved using the $CODEALIGN compiler directive (see this Delphi documentation page):
{$CODEALIGN 16}
procedure MyAlignedProc;
begin
..
end;

One thing that you could do, is to add a 'magic' signature at the end of each routine, after an explicit ret instruction:
asm
...
ret
db <magic signature bytes>
end;
Now you could create an array containing pointers to each routine, scan the routines at run-time once for the magic signature to find the end of each routine and therefore its length. Then, you can copy them to a new block of memory that you allocate with VirtualAlloc using PAGE_EXECUTE_READWRITE, ensuring this time that each routine starts on a 16-byte boundary.

Most Efficient Unicode Hash Function for Delphi 2009

I am in need of the fastest hash function possible in Delphi 2009 that will create hashed values from a Unicode string that will distribute fairly randomly into buckets.
I originally started with Gabr's HashOf function from GpStringHash:
function HashOf(const key: string): cardinal;
asm
xor edx,edx { result := 0 }
and eax,eax { test if 0 }
jz #End { skip if nil }
mov ecx,[eax-4] { ecx := string length }
jecxz #End { skip if length = 0 }
#loop: { repeat }
rol edx,2 { edx := (edx shl 2) or (edx shr 30)... }
xor dl,[eax] { ... xor Ord(key[eax]) }
inc eax { inc(eax) }
loop #loop { until ecx = 0 }
#End:
mov eax,edx { result := eax }
end; { HashOf }
But I found that this did not produce good numbers from Unicode strings. I noted that Gabr's routines have not been updated to Delphi 2009.
Then I discovered HashNameMBCS in SysUtils of Delphi 2009 and translated it to this simple function (where "string" is a Delphi 2009 Unicode string):
function HashOf(const key: string): cardinal;
var
I: integer;
begin
Result := 0;
for I := 1 to length(key) do
begin
Result := (Result shl 5) or (Result shr 27);
Result := Result xor Cardinal(key[I]);
end;
end; { HashOf }
I thought this was pretty good until I looked at the CPU window and saw the assembler code it generated:
Process.pas.1649: Result := 0;
0048DEA8 33DB xor ebx,ebx
Process.pas.1650: for I := 1 to length(key) do begin
0048DEAA 8BC6 mov eax,esi
0048DEAC E89734F7FF call $00401348
0048DEB1 85C0 test eax,eax
0048DEB3 7E1C jle $0048ded1
0048DEB5 BA01000000 mov edx,$00000001
Process.pas.1651: Result := (Result shl 5) or (Result shr 27);
0048DEBA 8BCB mov ecx,ebx
0048DEBC C1E105 shl ecx,$05
0048DEBF C1EB1B shr ebx,$1b
0048DEC2 0BCB or ecx,ebx
0048DEC4 8BD9 mov ebx,ecx
Process.pas.1652: Result := Result xor Cardinal(key[I]);
0048DEC6 0FB74C56FE movzx ecx,[esi+edx*2-$02]
0048DECB 33D9 xor ebx,ecx
Process.pas.1653: end;
0048DECD 42 inc edx
Process.pas.1650: for I := 1 to length(key) do begin
0048DECE 48 dec eax
0048DECF 75E9 jnz $0048deba
Process.pas.1654: end; { HashOf }
0048DED1 8BC3 mov eax,ebx
This seems to contain quite a bit more assembler code than Gabr's code.
Speed is of the essence. Is there anything I can do to improve either the pascal code I wrote or the assembler that my code generated?
Followup.
I finally went with the HashOf function based on SysUtils.HashNameMBCS. It seems to give a good hash distribution for Unicode strings, and appears to be quite fast.
Yes, there is a lot of assembler code generated, but the Delphi code that generates it is so simple and uses only bit-shift operations, so it's hard to believe it wouldn't be fast.

ASM output is not a good indication of algorithm speed. Also, from what I can see, the two pieces of code are doing almost the identical work. The biggest difference seem to be the memory access strategy and the first is using roll-left instead of the equivalent set of instructions (shl | shr -- most higher-level programming languages leave out the "roll" operators). The latter may pipeline better than the former.
ASM optimization is black magic and sometimes more instructions execute faster than fewer.
To be sure, benchmark both and pick the winner. If you like the output of the second but the first is faster, plug the second's values into the first.
rol edx,5 { edx := (edx shl 5) or (edx shr 27)... }
Note that different machines will run the code in different ways, so if speed is REALLY of the essence then benchmark it on the hardware that you plan to run the final application on. I'm willing to bet that over megabytes of data the difference will be a matter of milliseconds -- which is far less than the operating system is taking away from you.
PS. I'm not convinced this algorithm creates even distribution, something you explicitly called out (have you run the histograms?). You may look at porting this hash function to Delphi. It may not be as fast as the above algorithm but it appears to be quite fast and also gives good distribution. Again, we're probably talking on the order of milliseconds of difference over megabytes of data.

We held a nice little contest a while back, improving on a hash called "MurmurHash"; Quoting Wikipedia :
It is noted for being exceptionally
fast, often two to four times faster
than comparable algorithms such as
FNV, Jenkins' lookup3 and Hsieh's
SuperFastHash, with excellent
distribution, avalanche behavior and
overall collision resistance.
You can download the submissions for that contest here.
One thing we learned was, that sometimes optimizations don't improve results on every CPU. My contribution was tweaked to run good on AMD, but performed not-so-good on Intel. The other way around happened too (Intel optimizations running sub-optimal on AMD).
So, as Talljoe said : measure your optimizations, as they might actually be detrimental to your performance!
As a side-note: I don't agree with Lee; Delphi is a nice compiler and all, but sometimes I see it generating code that just isn't optimal (even when compiling with all optimizations turned on). For example, I regularly see it clearing registers that had already been cleared just two or three statements before. Or EAX is put into EBX, only to have it shifted and put back into EAX. That sort of thing. I'm just guessing here, but hand-optimizing that sort of code will surely help in tight spots.
Above all though; First analyze your bottleneck, then see if a better algorithm or datastructure can be used, then try to optimize the pascal code (like: reduce memory-allocations, avoid reference counting, finalization, try/finally, try/except blocks, etc), and then, only as a final resort, optimize the assembly code.

I've written two assembly "optimized" functions in Delphi, or more implemented known fast hash algorithms in both fine-tuned Pascal and Borland Assembler. The first was a implementation of SuperFastHash, and the second was a MurmurHash2 implementation triggered by a request from Tommi Prami on my blog to translate my c# version to a Pascal implementation. This spawned a discussion continued on the Embarcadero Discussion BASM Forums, that in the end resulted in about 20 implementations (check the latest benchmark suite) which ultimately showed that it would be difficult to select the best implementation due to the big differences in cycle times per instruction between Intel and AMD.
So, try one of those, but remember, getting the fastest every time would probably mean changing the algorithm to a simpler one which would hurt your distribution. Fine-tuning an implementation takes lots of time and better create a good validation and benchmarking suite to make check your implementations.

There has been a bit of discussion in the Delphi/BASM forum that may be of interest to you. Have a look at the following:
http://forums.embarcadero.com/thread.jspa?threadID=13902&tstart=0

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart