Does Delphi support all MMX/SSE instructions?

I have this snippet of code:
@combinerows:
mov esi,eax
and edi,Row1Mask
and ebx,Row2Mask
or ebx,edi
//NewQ:= (Row1 and Row1Mask) or (Row2 and Row2Mask);
//Result:= NewQ xor q;
PUNPCKDQ mm4,mm5 <-- I get an error here
//mov eax,[eax].q
movd eax,mm4
//q:= NewQ;
mov [esi].q,ebx
xor eax,ebx //Return difference.
I get this error:
[Pascal Error] SDIMAIN.pas(718): E2003 Undeclared identifier: 'PUNPCKDQ'
Am I doing something wrong, or does Delphi 2007 not support a full set of MMX/SSE instructions?

Delphi 2007 supports the MMX and SSE instruction sets. Certainly, Delphi 2010 and XE support up to the SSE4.2 instruction sets (but so far no support for AVX).
However, Delphi is correct to complain about your "PUNPCKDQ" instruction: If you search the Intel® 64 and IA-32 Architectures Software Developer’s Manual (especially Volumes 2A and 2B would be relevant), you will NOT find an instruction by that name. I.e., it is your mistake, not Delphi's lack of support for this instruction.

A quick Google search turns up PUNPCKLDQ rather than PUNPCKDQ. Delphi 2007 accepts PUNPCKLDQ, and, even better, it also supports PUNPCKHDQ, which moves the high-order dword into the low dword so that you can then load it into a general-purpose register.
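To make the difference between the two unpack instructions concrete, here is a small Python model of what they do to 64-bit MMX registers. It is only a sketch: the function names mirror the mnemonics, and the register values are made up for illustration.

```python
MASK32 = 0xFFFFFFFF

def punpckldq(dest: int, src: int) -> int:
    """Model of PUNPCKLDQ dest, src: low dword of dest stays in bits 0..31,
    low dword of src goes to bits 32..63."""
    return (dest & MASK32) | ((src & MASK32) << 32)

def punpckhdq(dest: int, src: int) -> int:
    """Model of PUNPCKHDQ dest, src: high dword of dest moves to bits 0..31,
    high dword of src goes to bits 32..63."""
    return ((dest >> 32) & MASK32) | (((src >> 32) & MASK32) << 32)

def movd(mm: int) -> int:
    """Model of MOVD r32, mm: only the low dword reaches the 32-bit register."""
    return mm & MASK32

# Hypothetical register contents, chosen so each dword is easy to spot.
mm4 = 0x1111111122222222
mm5 = 0x3333333344444444

print(hex(punpckldq(mm4, mm5)))        # 0x4444444422222222
print(hex(movd(punpckhdq(mm4, mm5))))  # 0x11111111, the old high dword of mm4
```

The second line shows the trick from the answer: after PUNPCKHDQ the former high dword sits in the low half, where MOVD can reach it.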

Related

How to access full 128 bits in NEON instructions?

I recently wrote a program that does some floating point calculations in Arm64 Assembly.
Since the numbers I'm dealing with can become really tiny, I now want to optimise the code so that it uses as much precision as possible.
I found out the NEON engine has 128-bit floating point registers instead of the 64 bits I'm currently working with, so I searched for a way to use these for calculations. Every website I looked at tells me this should be possible, but when I try to do something like
fmul v0, v1, v2
I just get "error: invalid operand for instruction".
I'm using the M1 chip that should be capable of working with NEON instructions, and when I change it to
fmul v0.2d, v1.2d, v2.2d
there's no problem at all.
Does anyone have an idea what I'm doing wrong? Or is it just impossible to use all the 128 bits of these registers at once?
You can't.
True, the NEON registers are 128 bits wide, but the widest data type is 64 bits.
No consumer architecture known to me is capable of handling a 128-bit data type.
PS: Is there a quad data type to begin with? I'm curious.
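As an illustration of why the `.2d` form assembles, the following Python sketch models a 128-bit register as 16 bytes holding two independent 64-bit doubles, multiplied lane by lane. The helper name `fmul_2d` and the sample values are assumptions made for this example; it models the data layout only, not the hardware.

```python
import struct

def fmul_2d(v1: bytes, v2: bytes) -> bytes:
    """Model of 'fmul v0.2d, v1.2d, v2.2d': multiply two pairs of doubles."""
    a = struct.unpack("<2d", v1)  # two 64-bit lanes from the first register
    b = struct.unpack("<2d", v2)  # two 64-bit lanes from the second register
    return struct.pack("<2d", a[0] * b[0], a[1] * b[1])

v1 = struct.pack("<2d", 1.5, -2.0)
v2 = struct.pack("<2d", 4.0, 0.25)
v0 = fmul_2d(v1, v2)

print(len(v0) * 8)               # 128 bits in the register
print(struct.unpack("<2d", v0))  # (6.0, -0.5)
```

The full 128 bits are used, but as two separate 64-bit results; neither lane gains extra precision.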

How to use 16-bits Assembly inline on Delphi on Windows98?

Today I was playing with my old computer and trying to use 16-bit assembly inside Delphi. It works fine with 32-bit code, but I always had problems when I used interrupts: blue screens or freezes, which made me believe that it's not possible. I'm on Windows 98 with Delphi 7, using this simple code.
program Project1;
{$APPTYPE CONSOLE}
uses
SysUtils, Windows;
begin
asm
mov ax,$0301
mov bx,$0200
mov cx,$0001
xor dx,dx
int $13
int $20
end;
MessageBox(0,'Okay','Okay',MB_OK);
end.
The code is meant to "format" a diskette in the floppy drive. Is there a way to do this in Delphi 7 without freezes and blue screens? Or does Delphi only allow 32-bit assembly? Am I doing something wrong?
As long as your application is built as "32-bit Windows" application, the interrupts cannot work since these interrupts are simply not mapped.
You could try to compile your application as a "16-bit Console" application. I don't know if Delphi supports this, but that's my best guess for getting the emulation of int 0x13 and int 0x10.
By the way, shouldn't your assembly code use hexadecimal numbers, like this?
mov ax, $0301
mov bx, $0200
mov cx, $0001
xor dx, dx
int $13
int $20
Without the $ prefix the number is read as decimal, so int 13 would call interrupt $0d, which according to Ralf Brown's Interrupt List means:
INT 0D C - IRQ5 - FIXED DISK (PC,XT), LPT2 (AT), reserved (PS/2)
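The $0d remark comes down to how the literal is read. A tiny Python sketch of that rule, assuming a simplified parser that only knows the `$` prefix (the helper `parse_basm_int` is hypothetical and ignores other literal forms the assembler may accept):

```python
def parse_basm_int(text: str) -> int:
    """Read '$13' as hexadecimal and a bare '13' as decimal (simplified model)."""
    text = text.strip()
    if text.startswith("$"):
        return int(text[1:], 16)
    return int(text, 10)

print(hex(parse_basm_int("$13")))  # 0x13, the BIOS disk services interrupt
print(hex(parse_basm_int("13")))   # 0xd, a different interrupt altogether
```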
Delphi 7 produces 32-bit executables. Your 16-bit assembly code is therefore not compatible with the compiler you use.
You might have some luck with a 16-bit compiler, e.g. Turbo Pascal or Delphi 1. But it would make more sense, I suspect, to use the Win32 API to achieve your goals.

Using SSE to round in Delphi

I wrote this function to round singles to integers:
function Round(const Val: Single): Integer;
begin
asm
cvtss2si eax,Val
mov Result,eax
end;
end;
It works, but I need to change the rounding mode. Apparently, per this, I need to set the MXCSR register.
How do I do this in Delphi?
The reason I am doing this in the first place is I need "away-from-zero" rounding (like in C#), which is not possible even via SetRoundingMode.
On modern Delphi, to set MXCSR you can call SetMXCSR from the System unit. To read the current value use GetMXCSR.
Do beware that SetMXCSR, just like Set8087CW, is not thread-safe. Despite my efforts to persuade Embarcadero to change this, it seems that this particular design flaw will remain with us forever.
On older versions of Delphi you use the LDMXCSR and STMXCSR opcodes. You might write your own versions like this:
function GetMXCSR: LongWord;
asm
PUSH EAX
STMXCSR [ESP].DWord
POP EAX
end;
procedure SetMXCSR(NewMXCSR: LongWord);
//thread-safe version that does not abuse the global variable DefaultMXCSR
var
MXCSR: LongWord;
asm
AND EAX, $FFC0 // Remove flag bits
MOV MXCSR, EAX
LDMXCSR MXCSR
end;
These versions are thread-safe and I hope will compile and work on older Delphi versions.
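For reference, the value passed to such a SetMXCSR routine selects the rounding mode through the two rounding-control bits (bits 13 and 14) of MXCSR. The Python sketch below only does the bit arithmetic; the constant and function names are made up for this example, and $1F80 is the usual power-on default of the register.

```python
# Rounding-control (RC) field encodings of MXCSR
ROUND_NEAREST = 0b00      # round to nearest (even)
ROUND_DOWN = 0b01         # toward negative infinity
ROUND_UP = 0b10           # toward positive infinity
ROUND_TOWARD_ZERO = 0b11  # truncate

RC_SHIFT = 13
RC_MASK = 0b11 << RC_SHIFT  # 0x6000
FLAG_MASK = 0xFFC0          # same mask as the AND above: clears exception flags

def with_rounding_mode(mxcsr: int, mode: int) -> int:
    """Return an MXCSR value with the RC field replaced by mode."""
    if mode not in (ROUND_NEAREST, ROUND_DOWN, ROUND_UP, ROUND_TOWARD_ZERO):
        raise ValueError("mode must be a 2-bit rounding-control value")
    return (mxcsr & FLAG_MASK & ~RC_MASK) | (mode << RC_SHIFT)

def rounding_mode(mxcsr: int) -> int:
    """Extract the RC field from an MXCSR value."""
    return (mxcsr & RC_MASK) >> RC_SHIFT

default_mxcsr = 0x1F80
print(hex(with_rounding_mode(default_mxcsr, ROUND_TOWARD_ZERO)))  # 0x7f80
print(hex(with_rounding_mode(default_mxcsr, ROUND_DOWN)))         # 0x3f80
```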
Do note that using the name Round for your function is likely to cause a lot of confusion. I would advise that you do not do that.
Finally, I checked in the Intel documentation and both of the Intel floating point units (x87, SSE) offer just the rounding modes specified by the IEEE 754 standard. They are:
Round to nearest (even)
Round down (toward −∞)
Round up (toward +∞)
Round toward zero (Truncate)
So, your desired rounding mode is not available.
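Since no hardware mode provides it, away-from-zero rounding of midpoints has to be done in software. The following Python sketch shows one way to do that, assuming the goal is the "round half away from zero" behaviour mentioned in the question; the function name is made up, and NaN and infinities are not handled.

```python
import math

def round_half_away_from_zero(x: float) -> int:
    """Round to the nearest integer; exact halves go away from zero."""
    truncated = math.trunc(x)     # integer part, rounded toward zero
    fraction = abs(x - truncated) # distance from the truncated value
    if fraction >= 0.5:
        truncated += 1 if x > 0 else -1
    return truncated

print(round_half_away_from_zero(2.5))   # 3
print(round_half_away_from_zero(-2.5))  # -3
print(round(2.5))                       # 2, Python's built-in rounds half to even
```

Comparing the fractional part instead of adding 0.5 first avoids a wrong result for inputs just below one half, such as 0.49999999999999994.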

Why does NASM on Windows invert my word-sized value when copying it?

When I copy a value like this:
mov word[esi+edi],0x7FFF
In the file I write it to, it is copied as FF 7F.
Why does it do this, or how can I invert it?
NASM didn't do this. The processor did, because x86 is Little Endian (see endianness).
You could write mov word[esi+edi],0xFF7F if you wanted, but I suspect that the code was correct to begin with, only you didn't take the endianness into account.
The byte order of an Intel machine is least significant byte first, which is why the bytes appear as FF 7F.
See http://en.wikipedia.org/wiki/Endianness
I do not think you want to invert this.
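The byte order is easy to reproduce outside of assembly. This short Python sketch, with a made-up helper name, stores the same 16-bit value the way an x86 `mov word` would and reads it back:

```python
def store_word_le(value: int) -> bytes:
    """Return the two bytes a little-endian CPU writes for a 16-bit store."""
    return (value & 0xFFFF).to_bytes(2, "little")

stored = store_word_le(0x7FFF)
print(stored.hex(" "))                       # ff 7f
print(hex(int.from_bytes(stored, "little"))) # 0x7fff, reading it back is lossless
print((0x7FFF).to_bytes(2, "big").hex(" "))  # 7f ff, the big-endian order
```

The value is unchanged; only a byte-by-byte dump of the file shows the low byte first.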

Most Efficient Unicode Hash Function for Delphi 2009

I am in need of the fastest hash function possible in Delphi 2009 that will create hashed values from a Unicode string that will distribute fairly randomly into buckets.
I originally started with Gabr's HashOf function from GpStringHash:
function HashOf(const key: string): cardinal;
asm
xor edx,edx { result := 0 }
and eax,eax { test if 0 }
jz @End { skip if nil }
mov ecx,[eax-4] { ecx := string length }
jecxz @End { skip if length = 0 }
@loop: { repeat }
rol edx,2 { edx := (edx shl 2) or (edx shr 30)... }
xor dl,[eax] { ... xor Ord(key[eax]) }
inc eax { inc(eax) }
loop @loop { until ecx = 0 }
@End:
mov eax,edx { result := eax }
end; { HashOf }
But I found that this did not produce good numbers from Unicode strings. I noted that Gabr's routines have not been updated to Delphi 2009.
Then I discovered HashNameMBCS in SysUtils of Delphi 2009 and translated it to this simple function (where "string" is a Delphi 2009 Unicode string):
function HashOf(const key: string): cardinal;
var
I: integer;
begin
Result := 0;
for I := 1 to length(key) do
begin
Result := (Result shl 5) or (Result shr 27);
Result := Result xor Cardinal(key[I]);
end;
end; { HashOf }
I thought this was pretty good until I looked at the CPU window and saw the assembler code it generated:
Process.pas.1649: Result := 0;
0048DEA8 33DB xor ebx,ebx
Process.pas.1650: for I := 1 to length(key) do begin
0048DEAA 8BC6 mov eax,esi
0048DEAC E89734F7FF call $00401348
0048DEB1 85C0 test eax,eax
0048DEB3 7E1C jle $0048ded1
0048DEB5 BA01000000 mov edx,$00000001
Process.pas.1651: Result := (Result shl 5) or (Result shr 27);
0048DEBA 8BCB mov ecx,ebx
0048DEBC C1E105 shl ecx,$05
0048DEBF C1EB1B shr ebx,$1b
0048DEC2 0BCB or ecx,ebx
0048DEC4 8BD9 mov ebx,ecx
Process.pas.1652: Result := Result xor Cardinal(key[I]);
0048DEC6 0FB74C56FE movzx ecx,[esi+edx*2-$02]
0048DECB 33D9 xor ebx,ecx
Process.pas.1653: end;
0048DECD 42 inc edx
Process.pas.1650: for I := 1 to length(key) do begin
0048DECE 48 dec eax
0048DECF 75E9 jnz $0048deba
Process.pas.1654: end; { HashOf }
0048DED1 8BC3 mov eax,ebx
This seems to contain quite a bit more assembler code than Gabr's code.
Speed is of the essence. Is there anything I can do to improve either the pascal code I wrote or the assembler that my code generated?
Followup.
I finally went with the HashOf function based on SysUtils.HashNameMBCS. It seems to give a good hash distribution for Unicode strings, and appears to be quite fast.
Yes, there is a lot of assembler code generated, but the Delphi code that generates it is so simple and uses only bit-shift operations, so it's hard to believe it wouldn't be fast.
ASM output is not a good indication of algorithm speed. Also, from what I can see, the two pieces of code are doing almost identical work. The biggest differences seem to be the memory access strategy and that the first uses rotate-left (rol) instead of the equivalent set of instructions (shl | shr; most higher-level programming languages leave out the rotate operators). The latter may pipeline better than the former.
ASM optimization is black magic and sometimes more instructions execute faster than fewer.
To be sure, benchmark both and pick the winner. If you like the output of the second but the first is faster, plug the second's rotate count into the first:
rol edx,5 { edx := (edx shl 5) or (edx shr 27)... }
Note that different machines will run the code in different ways, so if speed is REALLY of the essence then benchmark it on the hardware that you plan to run the final application on. I'm willing to bet that over megabytes of data the difference will be a matter of milliseconds -- which is far less than the operating system is taking away from you.
PS. I'm not convinced this algorithm creates even distribution, something you explicitly called out (have you run the histograms?). You may look at porting this hash function to Delphi. It may not be as fast as the above algorithm but it appears to be quite fast and also gives good distribution. Again, we're probably talking on the order of milliseconds of difference over megabytes of data.
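Running the histograms is cheap to try. The Python sketch below models the rotate-by-5 hash from the question over UTF-16 code units (which is what a Delphi 2009 string holds) and counts how many keys land in each bucket. The key set and bucket count are made-up test data, so the output only says something about this particular sample.

```python
from collections import Counter

def rol32(value: int, count: int) -> int:
    """32-bit rotate left, the equivalent of (x shl n) or (x shr (32 - n))."""
    value &= 0xFFFFFFFF
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF

def hash_of(key: str) -> int:
    """Model of the Pascal HashOf: rotate by 5, then XOR each UTF-16 unit."""
    data = key.encode("utf-16-le")
    result = 0
    for i in range(0, len(data), 2):
        unit = data[i] | (data[i + 1] << 8)  # one 16-bit code unit
        result = rol32(result, 5) ^ unit
    return result

def bucket_histogram(keys, bucket_count: int) -> Counter:
    """Count how many keys fall into each bucket."""
    return Counter(hash_of(key) % bucket_count for key in keys)

keys = [f"key{i}" for i in range(1000)]  # hypothetical sample keys
histogram = bucket_histogram(keys, 16)
for bucket in range(16):
    print(f"{bucket:2d}: {histogram[bucket]}")
```

Feeding it the real key set and the real bucket count is what matters; short, similar keys are usually the hardest case for a rotate-and-XOR hash.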
We held a nice little contest a while back, improving on a hash called "MurmurHash". Quoting Wikipedia:
"It is noted for being exceptionally fast, often two to four times faster than comparable algorithms such as FNV, Jenkins' lookup3 and Hsieh's SuperFastHash, with excellent distribution, avalanche behavior and overall collision resistance."
You can download the submissions for that contest here.
One thing we learned was that sometimes optimizations don't improve results on every CPU. My contribution was tweaked to run well on AMD, but performed not so well on Intel. The other way around happened too (Intel optimizations running sub-optimally on AMD).
So, as Talljoe said : measure your optimizations, as they might actually be detrimental to your performance!
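A measurement harness does not need to be elaborate. The Python sketch below is a stand-in for the Delphi routines: it first checks that two candidate implementations agree, then times each on the same keys and keeps the best of several runs. The two candidates, the key set and the repeat counts are all assumptions made for this example.

```python
import timeit
from functools import reduce

MASK32 = 0xFFFFFFFF

def step(acc: int, unit: int) -> int:
    """One hash round: rotate the accumulator left by 5, XOR in the unit."""
    return (((acc << 5) | (acc >> 27)) & MASK32) ^ unit

def code_units(key: str) -> list:
    """Split a string into 16-bit UTF-16 code units."""
    data = key.encode("utf-16-le")
    return [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]

def hash_loop(key: str) -> int:
    """Candidate 1: explicit loop."""
    result = 0
    for unit in code_units(key):
        result = step(result, unit)
    return result

def hash_reduce(key: str) -> int:
    """Candidate 2: same computation expressed with functools.reduce."""
    return reduce(step, code_units(key), 0)

def best_time(func, keys, repeat: int = 5, number: int = 20) -> float:
    """Best-of-repeat wall time for hashing all keys number times."""
    return min(timeit.repeat(lambda: [func(k) for k in keys],
                             repeat=repeat, number=number))

keys = [f"sample-{i}" for i in range(500)]  # hypothetical sample keys
# Correctness first: a faster routine that hashes differently is a different hash.
assert all(hash_loop(k) == hash_reduce(k) for k in keys)
for func in (hash_loop, hash_reduce):
    print(f"{func.__name__}: {best_time(func, keys):.6f} s")
```

Taking the minimum of several runs reduces noise from the operating system; the numbers still only apply to the machine they were measured on.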
As a side-note: I don't agree with Lee; Delphi is a nice compiler and all, but sometimes I see it generating code that just isn't optimal (even when compiling with all optimizations turned on). For example, I regularly see it clearing registers that had already been cleared just two or three statements before. Or EAX is put into EBX, only to have it shifted and put back into EAX. That sort of thing. I'm just guessing here, but hand-optimizing that sort of code will surely help in tight spots.
Above all, though: first analyze your bottleneck, then see if a better algorithm or data structure can be used, then try to optimize the Pascal code (for example: reduce memory allocations, avoid reference counting, finalization, try/finally and try/except blocks, etc.), and then, only as a last resort, optimize the assembly code.
I've written two assembly "optimized" functions in Delphi, or rather, implemented known fast hash algorithms in both fine-tuned Pascal and Borland Assembler. The first was an implementation of SuperFastHash, and the second was a MurmurHash2 implementation triggered by a request from Tommi Prami on my blog to translate my C# version to Pascal. This spawned a discussion, continued on the Embarcadero Discussion BASM Forums, that in the end resulted in about 20 implementations (check the latest benchmark suite), which ultimately showed that it would be difficult to select the best implementation due to the big differences in cycle times per instruction between Intel and AMD.
So, try one of those, but remember that getting the fastest every time would probably mean changing the algorithm to a simpler one, which would hurt your distribution. Fine-tuning an implementation takes lots of time, and you had better create a good validation and benchmarking suite to check your implementations.
There has been a bit of discussion in the Delphi/BASM forum that may be of interest to you. Have a look at the following:
http://forums.embarcadero.com/thread.jspa?threadID=13902&tstart=0

Resources