Efficient conversion of an array of singles to an array of doubles in Delphi 2010 - delphi

I need to implement a wrapper layer between a high level application and a low level sub-system using slightly different typing:
The application produces an array of single vectors:
unit unApplication;

interface

type
  TVector = record
    x, y, z: Single;
  end;
  TVectorArray = array of TVector;

function SomeFunc(): TVectorArray;
[...]
while the subsystem expects an array of double vectors. I also implemented an implicit conversion from TVector to TVectorD:
unit unSubSystem;

interface

uses
  unApplication;

type
  TVectorD = record
    x, y, z: Double;
    class operator Implicit(value: TVector): TVectorD; inline;
  end;
  TVectorDArray = array of TVectorD;

procedure OtherFunc(points: TVectorDArray);

implementation

class operator TVectorD.Implicit(value: TVector): TVectorD;
begin
  result.x := value.x;
  result.y := value.y;
  result.z := value.z;
end;
What I am currently doing is like this:
uses unApplication, unSubSystem, ...;

procedure ConvertValues;
var
  i: Integer;
  singleVecArr: TVectorArray;
  doubleVecArr: TVectorDArray;
begin
  singleVecArr := SomeFunc;
  SetLength(doubleVecArr, Length(singleVecArr));
  for i := 0 to Length(singleVecArr) - 1 do
    doubleVecArr[i] := singleVecArr[i];
end;
Is there a more efficient way to perform these kinds of conversion?

First of all I would say that you should not attempt any optimisation without timing first. In this case I don't mean timing alternative algorithms, I mean timing the code in question and assessing what proportion of the total run time is spent there.
My instinct is that the code you show will account for a tiny proportion of the overall time, and so optimising it will yield no discernible benefit. I think that must be true if you do anything meaningful with each element of this array, since the cost of converting from single to double will be small compared with the floating-point operations you then perform on the data.
Finally, if perchance this code is a bottleneck, you should consider not converting at all. My assumption is that you are using standard Delphi floating-point operations, which map to the 8087 FPU. All such floating-point operations happen inside the 8087 floating-point stack. Values are converted on entry to either 64-bit or, more usually, 80-bit precision. I don't think it would be any slower to load a single than to load a double; in fact it may even be faster due to memory read performance.

Assuming that the conversion indeed is the bottleneck, one way of speeding it up may be to use SSE2 instead of the FPU, provided the necessary instruction sets can be assumed to be present on the computers on which this code will run.
For instance, the following would convert one single Vector into one double Vector:
procedure SingleToDoubleVector(var S: TVector; var D: TVectorD);
// @S in EAX
// @D in EDX
asm
  movups   xmm0, [eax]      // load S into xmm0 (reads 16 bytes: x, y, z plus 4 bytes past the record)
  movhlps  xmm1, xmm0       // copy high 2 singles of xmm0 into xmm1
  cvtps2pd xmm2, xmm0       // convert low two singles of xmm0 into doubles in xmm2
  cvtss2sd xmm3, xmm1       // convert lowest single in xmm1 into a double in xmm3
  movupd   [edx], xmm2      // move two doubles from xmm2 into D (.x and .y)
  movsd    [edx+16], xmm3   // move one double from xmm3 into D.z
end;
I am not saying that this bit of code is the most efficient way to do it and there are many caveats with using assembly code in general and this code in particular. Note that this code makes assumptions about the alignment of the fields in your records. (It does not make assumptions regarding the alignment of the record as a whole.)
Also, for best results, you would control the alignment of your array/record elements in memory and write the entire conversion loop in assembly, to reduce overheads. Whether this is what you want/can do is another question.
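For reference, a small Pascal driver that applies the routine above element by element might look like the sketch below (ConvertArray is a made-up name; the TVector/TVectorD types are the ones from the question):

procedure ConvertArray(const Src: TVectorArray; var Dst: TVectorDArray);
var
  i: Integer;
begin
  SetLength(Dst, Length(Src));
  for i := 0 to High(Src) do
    SingleToDoubleVector(Src[i], Dst[i]);   // one SSE2 conversion per element
end;

Moving the loop itself into assembly, as suggested above, would remove the per-element call overhead.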

If modifying the source to produce doubles rather than singles is not possible, you can try threading out the conversion. Divide the array into two or four equal-sized chunks (depending on the processor count) and have each thread convert its own chunk; this can come close to a two- or four-fold speed-up (see the sketch below).
Also, is the Length call evaluated on every iteration of the loop? If so, place it in a variable to avoid the repeated calculation.
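A minimal sketch of that threaded split, assuming the TVector/TVectorD types and the Implicit operator from the question (the class and routine names are illustrative):

uses
  Classes;

type
  TConvertThread = class(TThread)
  private
    FSrc: TVectorArray;
    FDst: TVectorDArray;
    FFirst, FLast: Integer;
  protected
    procedure Execute; override;
  public
    constructor Create(const Src: TVectorArray; const Dst: TVectorDArray;
      AFirst, ALast: Integer);
  end;

constructor TConvertThread.Create(const Src: TVectorArray;
  const Dst: TVectorDArray; AFirst, ALast: Integer);
begin
  FSrc := Src;                 // dynamic arrays are references, so no copy is made
  FDst := Dst;
  FFirst := AFirst;
  FLast := ALast;
  FreeOnTerminate := False;    // the caller will WaitFor and Free
  inherited Create(False);     // start running
end;

procedure TConvertThread.Execute;
var
  i: Integer;
begin
  for i := FFirst to FLast do
    FDst[i] := FSrc[i];        // uses the Implicit TVector -> TVectorD operator
end;

procedure ConvertValuesThreaded(const Src: TVectorArray; var Dst: TVectorDArray);
var
  T1, T2: TConvertThread;
  Half: Integer;
begin
  SetLength(Dst, Length(Src));
  Half := Length(Src) div 2;
  T1 := TConvertThread.Create(Src, Dst, 0, Half - 1);
  T2 := TConvertThread.Create(Src, Dst, Half, High(Src));
  try
    T1.WaitFor;
    T2.WaitFor;
  finally
    T1.Free;
    T2.Free;
  end;
end;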

Related

SIMD zero vector test

Does there exist a quick way to check whether a SIMD vector is a zero vector (all components equal +-zero)? I am currently using an algorithm, using shifts, that runs in log2(N) time, where N is the dimension of the vector. Does there exist anything faster? Note that my question is broader (see the tags) than the proposed answer: it refers to vectors of all types (integer, float, double, ...).
How about this straightforward AVX code? I think it's O(N), and I don't know how you could possibly do better without making assumptions about the input data: you have to actually read every value to know whether it's 0, so it's about doing as much of that as possible per cycle.
You should be able to massage the code to your needs. It treats both +0 and -0 as zero. It will work for unaligned memory addresses, but aligning to 32-byte addresses will make the loads faster. You may need to add something to deal with the remaining elements if size isn't a multiple of 8.
#include <stdint.h>
#include <immintrin.h>

uint64_t num_non_zero_floats(float *mem_address, int size) {
    uint64_t num_non_zero = 0;
    __m256 zeros = _mm256_setzero_ps();
    for (int i = 0; i != size; i += 8) {
        __m256 vec = _mm256_loadu_ps(mem_address + i);
        __m256 comparison_out = _mm256_cmp_ps(zeros, vec, _CMP_NEQ_UQ); // mark non-zero lanes; 3 cycles latency, throughput 1
        uint64_t bits_non_zero = _mm256_movemask_ps(comparison_out);    // 2-3 cycles latency
        num_non_zero += __builtin_popcountll(bits_non_zero);
    }
    return num_non_zero;
}
If you want to test floats for +/- 0.0, then you can check for all the bits being zero, except the sign bit. Any set-bits anywhere except the sign bit mean the float is non-zero. (http://www.h-schmidt.net/FloatConverter/IEEE754.html)
Agner Fog's asm optimization guide points out that you can test a float or double for zero using integer instructions:
; Example 17.4b
mov eax, [rsi]
add eax, eax ; shift out the sign bit
jz IsZero
For vectors, though, using ptest with a sign-bit mask is better than using paddd to get rid of the sign bit. Actually, test dword [rsi], 0x7FFFFFFF may be more efficient than Agner Fog's load/add sequence, but the 32-bit immediate probably stops the load from micro-fusing on Intel, and it may have a larger code size.
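The same mask-off-the-sign-bit idea for a single scalar value can be written as a short Delphi sketch (the function name is illustrative; it assumes the IEEE-754 layout of Single):

function IsZeroSingle(const S: Single): Boolean;
var
  Bits: Cardinal;
begin
  Move(S, Bits, SizeOf(Bits));           // reinterpret the 4 bytes of the Single
  Result := (Bits and $7FFFFFFF) = 0;    // clear the sign bit; the rest must be zero for +/-0
end;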
x86 PTEST (SSE4.1) does a bitwise AND and sets flags based on the result.
    movdqa  xmm0, [mask]
.loop:
    ptest   xmm0, [rsi+rcx]
    jnz     nonzero
    add     rcx, 16        ; count up towards zero
    jl      .loop          ; with rsi pointing to past the end of the array
    ...
nonzero:
Or cmov could be useful to consume the flags set by ptest.
IDK if it'd be possible to use a loop-counter instruction that didn't set the zero flag, so you could do both tests with one jump instruction or something. Probably not. And the extra uop to merge the flags (or the partial-flags stall on earlier CPUs) would cancel out the benefit.
@Iwillnotexist Idonotexist: re one of your comments on the OP: you can't just movemask without doing a pcmpeq first, or a cmpps. The non-zero bit might not be in the high bit! You probably knew that, but one of your comments seemed to leave it out.
I do like the idea of ORing together multiple values before actually testing. You're right that sign-bits would OR with other sign-bits, and then you ignore them the same way you would if you were testing one at a time. A loop that PORs 4 or 8 vectors before each PTEST would probably be faster. (PTEST is 2 uops, and can't macro-fuse with a jcc.)

creating a fast case statement (assembly)

I have a project that makes intensive use of a case statement with many procedures hanging off it. I know you can arrange case statements in two tiers: a first case statement dividing the range into blocks of 10, and a second case statement selecting the individual procedure. But I have a better idea, if I can pull it off.
I want to call it an "assembly case".
Prolist: array [1..500] of Pointer =
  (@Procedure1, @Procedure2, @Procedure3, @Procedure4, @Procedure5);

Procedure ASMCase(Prolist: array of Pointer; No: Word; Var InRange: Boolean);
var
  Count: DWord;
  PTR: Pointer;
  Pro: Procedure;
begin
  Count := No * 4;
  InRange := boolean(Count <= SizeOf(Prolist));
  If not InRange then Exit;
  PTR := Pointer(DWord(@Prolist[1]) + Count);
  If PTR <> nil then Pro := @PTR else Exit;
  Pro; //run procedure
end;
The point is I'm creating a direct jump to the procedure.
In my case the procedures can all have an identical header, and global data can be accessed for any odd information. Writing it in assembly would be faster, I think, but what I'm not sure about is running the procedures. Please do not ask why I am doing this: I have 500 procedures with many calls through the case statement, and time is of the essence on a fast processor.
It's expensive to pass that array by value. Pass it by const.
I can't see the point of the InRange flag and test. Don't pass out of range indices. And if you have to test, do it right. Don't use SizeOf which measures byte size. Use high or perhaps Length, if you have to. Which I doubt.
The pointer assignment test (PTR <> nil) is bogus. That condition always evaluates true. And the array indexing is very weird. What's wrong with []?
On top of that, your array is 1-based (usually a bad choice) but open arrays are always 0-based. Likely that's going to trip you up.
In short, I'd throw away all of that code. It's both wrong and needless. I'd just write it like this:
ProList[No]();
In order for this to compile your array would need to be defined as an array of procedural type rather than array of Pointer. Adding some type safety would be a good move.
It's pretty hard to see asm making much difference here. The compiler is going to emit optimal code.
If you are concerned with out of bound access, enable range checking in debug mode. Disable it for release if performance is paramount.
Bear in mind that global data structures don't tend to scale well as you add complexity. Most experienced programmers go to some length to avoid global state. Are you sure that global state is the right choice for you?
If you do need to improve performance, first identify opportunity for improvement. Reading from an array and calling a function are not likely candidates. Look at the procedures that you call. The bottlenecks are surely there.
One final point. Try to forget that you ever learnt to use @ with function pointers. Doing so yields an untyped pointer, of type Pointer, that can be assigned to any pointer type. And thus you completely abandon type checking. Your procedure could have the wrong signature altogether and the compiler is not able to tell you. Declare your array of procedures with a type-safe procedure type.
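A minimal sketch of that type-safe version, assuming all the procedures share a parameterless signature (the names and array bounds are illustrative):

type
  TCaseProc = procedure;   // the shared, parameterless signature

const
  ProList: array [0..2] of TCaseProc = (Procedure1, Procedure2, Procedure3);

procedure RunCase(No: Integer);
begin
  // with range checking ({$R+}) enabled in debug builds, an out-of-bounds No is caught here
  ProList[No]();
end;

The compiler now verifies each entry's signature at compile time, which is exactly the checking the untyped Pointer version gives up.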

How to ensure 16byte code alignment of Delphi routines?

Background:
I have a unit of optimised Delphi/BASM routines, mostly for heavy computations. Some of these routines contain inner loops for which I can achieve a significant speed-up if the loop start is aligned to a DQWORD (16-byte) boundary. I can ensure that the loops in question are aligned as desired IF I know the alignment at the routine entry point.
As far as I can see, the Delphi compiler aligns procedures/functions to DWORD boundaries, and e.g. adding functions to the unit may change the alignment of subsequent ones. However, as long as I pad the end of routines to multiples of 16, I can ensure that subsequent routines are likewise aligned -- or misaligned, depending on the alignment of the first routine. I therefore tried to place the critical routines at the beginning of the unit's implementation section, and put a bit of padding code before them so that the first procedure would be DQWORD aligned.
This looks something like below:
interface

procedure FirstProcInUnit;

implementation

procedure __PadFirstProcTo16;
asm
  // variable number of NOP instructions here to get the desired code length
end;

procedure FirstProcInUnit;
asm //should start at DQWORD boundary
  //do something
  //padding to align the following label to DQWORD boundary
@Some16BAlignedLabel:
  //code, looping back to @Some16BAlignedLabel
  //do something else
  ret #params
  //padding to get code length to multiple of 16
end;

initialization
  __PadFirstProcTo16; //call this here so that it isn't optimised out
  Assert((NativeUInt(Pointer(@FirstProcInUnit)) and $0F) = 0, 'FirstProcInUnit not DQWORD aligned');
end.
This is a bit of a pain in the neck, but I can get this sort of thing to work when necessary. The problem is that when I use such a unit in different projects, or make some changes to other units in the same project, this may still break the alignment of __PadFirstProcTo16 itself. Likewise, recompiling the same project with different compiler versions (e.g. D2009 vs. D2010) typically also breaks the alignment. So, the only way I found of doing this sort of thing was by hand, pretty much as the last step once all the rest of the project is in its final form.
Question 1:
Is there any other way to achieve the desired effect of ensuring that (at least some specific) routines are DQWORD-aligned?
Question 2:
Which are the exact factors that affect the compiler's alignment of code and (how) could I use such specific knowledge to overcome the problem outlined here?
Assume that for the sake of this question "don't worry about code alignment/the associated presumably small speed benefits" is not a permissible answer.
As of Delphi XE, the problem of code alignment is now easily solved using the $CODEALIGN compiler directive (see this Delphi documentation page):
{$CODEALIGN 16}
procedure MyAlignedProc;
begin
..
end;
One thing that you could do is to add a 'magic' signature at the end of each routine, after an explicit ret instruction:
asm
...
ret
db <magic signature bytes>
end;
Now you could create an array containing pointers to each routine, scan the routines at run-time once for the magic signature to find the end of each routine and therefore its length. Then, you can copy them to a new block of memory that you allocate with VirtualAlloc using PAGE_EXECUTE_READWRITE, ensuring this time that each routine starts on a 16-byte boundary.
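A rough sketch of that run-time relocation, assuming a hypothetical 4-byte signature and position-independent routines (all names are illustrative; relative calls or jumps that leave a routine would break after the copy):

uses
  Windows, SysUtils;

const
  // hypothetical signature appended after the final ret of each routine
  MagicSig: array[0..3] of Byte = ($DE, $AD, $BE, $EF);

// Copies the routine starting at Src into freshly allocated executable memory.
// VirtualAlloc returns page-aligned memory, so the copy starts 16-byte aligned.
function CopyRoutineAligned(Src: Pointer): Pointer;
var
  P: PByte;
  Len: Integer;
begin
  // scan forward until the magic signature marks the end of the routine
  P := Src;
  while not CompareMem(P, @MagicSig, SizeOf(MagicSig)) do
    Inc(P);
  Len := NativeInt(P) - NativeInt(Src);
  Result := VirtualAlloc(nil, Len, MEM_COMMIT or MEM_RESERVE, PAGE_EXECUTE_READWRITE);
  Move(Src^, Result^, Len);
end;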

Most Efficient Unicode Hash Function for Delphi 2009

I am in need of the fastest hash function possible in Delphi 2009 that will create hashed values from a Unicode string that will distribute fairly randomly into buckets.
I originally started with Gabr's HashOf function from GpStringHash:
function HashOf(const key: string): cardinal;
asm
  xor    edx,edx      { result := 0 }
  and    eax,eax      { test if 0 }
  jz     @End         { skip if nil }
  mov    ecx,[eax-4]  { ecx := string length }
  jecxz  @End         { skip if length = 0 }
@loop:                { repeat }
  rol    edx,2        { edx := (edx shl 2) or (edx shr 30)... }
  xor    dl,[eax]     { ... xor Ord(key[eax]) }
  inc    eax          { inc(eax) }
  loop   @loop        { until ecx = 0 }
@End:
  mov    eax,edx      { result := edx }
end; { HashOf }
But I found that this did not produce good numbers from Unicode strings. I noted that Gabr's routines have not been updated to Delphi 2009.
Then I discovered HashNameMBCS in SysUtils of Delphi 2009 and translated it to this simple function (where "string" is a Delphi 2009 Unicode string):
function HashOf(const key: string): cardinal;
var
  I: integer;
begin
  Result := 0;
  for I := 1 to length(key) do
  begin
    Result := (Result shl 5) or (Result shr 27);
    Result := Result xor Cardinal(key[I]);
  end;
end; { HashOf }
I thought this was pretty good until I looked at the CPU window and saw the assembler code it generated:
Process.pas.1649: Result := 0;
0048DEA8 33DB xor ebx,ebx
Process.pas.1650: for I := 1 to length(key) do begin
0048DEAA 8BC6 mov eax,esi
0048DEAC E89734F7FF call $00401348
0048DEB1 85C0 test eax,eax
0048DEB3 7E1C jle $0048ded1
0048DEB5 BA01000000 mov edx,$00000001
Process.pas.1651: Result := (Result shl 5) or (Result shr 27);
0048DEBA 8BCB mov ecx,ebx
0048DEBC C1E105 shl ecx,$05
0048DEBF C1EB1B shr ebx,$1b
0048DEC2 0BCB or ecx,ebx
0048DEC4 8BD9 mov ebx,ecx
Process.pas.1652: Result := Result xor Cardinal(key[I]);
0048DEC6 0FB74C56FE movzx ecx,[esi+edx*2-$02]
0048DECB 33D9 xor ebx,ecx
Process.pas.1653: end;
0048DECD 42 inc edx
Process.pas.1650: for I := 1 to length(key) do begin
0048DECE 48 dec eax
0048DECF 75E9 jnz $0048deba
Process.pas.1654: end; { HashOf }
0048DED1 8BC3 mov eax,ebx
This seems to contain quite a bit more assembler code than Gabr's code.
Speed is of the essence. Is there anything I can do to improve either the pascal code I wrote or the assembler that my code generated?
Followup.
I finally went with the HashOf function based on SysUtils.HashNameMBCS. It seems to give a good hash distribution for Unicode strings, and appears to be quite fast.
Yes, there is a lot of assembler code generated, but the Delphi code that generates it is simple and uses only bit-shift operations, so it's hard to believe it wouldn't be fast.
ASM output is not a good indication of algorithm speed. Also, from what I can see, the two pieces of code are doing almost identical work. The biggest difference seems to be the memory access strategy, and that the first uses a rotate-left (rol) instead of the equivalent pair of shifts (shl plus shr -- most higher-level programming languages leave out the rotate operators). The latter may pipeline better than the former.
ASM optimization is black magic and sometimes more instructions execute faster than fewer.
To be sure, benchmark both and pick the winner. If you like the output of the second but the first is faster, plug the second's values into the first.
rol edx,5 { edx := (edx shl 5) or (edx shr 27)... }
Note that different machines will run the code in different ways, so if speed is REALLY of the essence then benchmark it on the hardware that you plan to run the final application on. I'm willing to bet that over megabytes of data the difference will be a matter of milliseconds -- which is far less than the operating system is taking away from you.
PS. I'm not convinced this algorithm creates even distribution, something you explicitly called out (have you run the histograms?). You may look at porting this hash function to Delphi. It may not be as fast as the above algorithm but it appears to be quite fast and also gives good distribution. Again, we're probably talking on the order of milliseconds of difference over megabytes of data.
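To make the "benchmark both and pick the winner" advice concrete, a rough timing harness might look like the sketch below (QueryPerformanceCounter is from the Windows unit; the sample data and the HashOf routine under test are placeholders):

uses
  Windows, SysUtils;

procedure BenchmarkHash(const Samples: array of string);
var
  Freq, T0, T1: Int64;
  i, j: Integer;
  H: Cardinal;
begin
  H := 0;
  QueryPerformanceFrequency(Freq);
  QueryPerformanceCounter(T0);
  for j := 1 to 1000 do                    // repeat to get measurable times
    for i := 0 to High(Samples) do
      H := H xor HashOf(Samples[i]);       // swap in the implementation under test
  QueryPerformanceCounter(T1);
  Writeln(Format('%.3f ms (checksum %u)', [(T1 - T0) * 1000.0 / Freq, H]));
end;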
We held a nice little contest a while back, improving on a hash called "MurmurHash". Quoting Wikipedia:
"It is noted for being exceptionally fast, often two to four times faster than comparable algorithms such as FNV, Jenkins' lookup3 and Hsieh's SuperFastHash, with excellent distribution, avalanche behavior and overall collision resistance."
You can download the submissions for that contest here.
One thing we learned was that optimizations sometimes don't improve results on every CPU. My contribution was tweaked to run well on AMD, but performed not so well on Intel. The other way around happened too (Intel optimizations running sub-optimally on AMD).
So, as Talljoe said: measure your optimizations, as they might actually be detrimental to your performance!
As a side-note: I don't agree with Lee; Delphi is a nice compiler and all, but sometimes I see it generating code that just isn't optimal (even when compiling with all optimizations turned on). For example, I regularly see it clearing registers that had already been cleared just two or three statements before. Or EAX is put into EBX, only to have it shifted and put back into EAX. That sort of thing. I'm just guessing here, but hand-optimizing that sort of code will surely help in tight spots.
Above all though: first analyze your bottleneck, then see if a better algorithm or data structure can be used, then try to optimize the Pascal code (e.g. reduce memory allocations; avoid reference counting, finalization, try/finally and try/except blocks), and only then, as a final resort, optimize the assembly code.
I've written two assembly-"optimized" functions in Delphi, or rather implemented known fast hash algorithms in both fine-tuned Pascal and Borland Assembler. The first was an implementation of SuperFastHash, and the second a MurmurHash2 implementation triggered by a request from Tommi Prami on my blog to translate my C# version to Pascal. This spawned a discussion, continued on the Embarcadero BASM forums, that in the end resulted in about 20 implementations (check the latest benchmark suite), which ultimately showed that it would be difficult to select the best implementation because of the big differences in cycle times per instruction between Intel and AMD.
So, try one of those, but remember: getting the fastest result every time would probably mean changing the algorithm to a simpler one, which would hurt your distribution. Fine-tuning an implementation takes a lot of time, so it's better to create a good validation and benchmarking suite to check your implementations.
There has been a bit of discussion in the Delphi/BASM forum that may be of interest to you. Have a look at the following:
http://forums.embarcadero.com/thread.jspa?threadID=13902&tstart=0

How to avoid rounding problems when comparing currency values in Delphi?

AFAIK, the Currency type in Delphi Win32 depends on the processor's floating-point precision. Because of this I'm having rounding problems when comparing two Currency values, getting different results depending on the machine.
For now I'm using the SameValue function with an Epsilon parameter of 0.009, because I only need 2 decimal digits of precision.
Is there any better way to avoid this problem?
The Currency type in Delphi is a 64-bit integer scaled by 1/10,000; in other words, its smallest increment is equivalent to 0.0001. It is not susceptible to precision issues in the same way that floating point code is.
However, if you are multiplying your Currency numbers by floating-point types, or dividing your Currency values, the rounding does need to be worked out one way or the other. The FPU controls this mechanism (it's called the "control word"). The Math unit contains some procedures which control this mechanism: SetRoundMode in particular. You can see the effects in this program:
{$APPTYPE CONSOLE}
uses Math;
var
  x: Currency;
  y: Currency;
begin
  SetRoundMode(rmTruncate);
  x := 1;
  x := x / 6;
  SetRoundMode(rmNearest);
  y := 1;
  y := y / 6;
  Writeln(x = y); // false
  Writeln(x - y); // 0.0001; i.e. 0.1666 vs 0.1667
end.
It is possible that a third-party library you are using is setting the control word to a different value. You may want to set the control word (i.e. rounding mode) explicitly at the starting point of your important calculations.
Also, if your calculations ever transfer into plain floating point and then back into Currency, all bets are off - too hard to audit. Make sure all your calculations are in Currency.
No, Currency is not a floating point type. It is a fixed-precision decimal, implemented with integer storage. It can be compared exactly, and does not have the rounding issues of, say, Double. Therefore, if you are seeing inexact values in your Currency variables, the problem is not the Currency type itself, but what you are putting into it. Most likely, you have a floating-point calculation somewhere else in your code. Since you do not show that code, it's hard to be of more help on this question. But the solution, generally speaking, will be to round your floating point numbers to the correct precision before storing in the Currency variable, rather than doing an inexact comparison on the Currency variables.
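For example (a sketch; RoundTo is in the Math unit and the variable names are illustrative), round the float explicitly before it ever reaches the Currency variable:

uses
  Math;

var
  Calculated: Double;
  Amount: Currency;
begin
  Calculated := 19.99 * 0.175;          // some intermediate floating-point calculation
  Amount := RoundTo(Calculated, -2);    // round to 2 decimal places before storing in Currency
end.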
A faster and safer way of comparing two currency values is to map the variables onto their internal Int64 representation:
function CompCurrency(var A, B: currency): Int64;
var
  A64: Int64 absolute A;
  B64: Int64 absolute B;
begin
  result := A64 - B64;
end;
This will avoid any rounding error during comparison (working with *10000 integer values), and will be faster than the default FPU-based implementation (especially under 64 bit XE2 compiler).
See this article for additional information.
If your situation is like mine, you might find this approach helpful. I work mostly in payroll. If a business has, say, 3 departments and wants to charge the cost of an employee evenly among those three departments, there are a lot of times when there will be rounding issues.
What I have been doing is loop through the departments, charging each one a third of the total cost and adding the charged amount to a subtotal (Currency) variable. But when the loop variable reaches its limit, rather than multiplying by the fraction, I subtract the subtotal from the total cost and put that in the last department (see the sketch below). Since the journal entries that result from this process always have to balance, I believe it has always worked.
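A minimal sketch of that allocation pattern (the routine name is illustrative):

procedure AllocateEvenly(TotalCost: Currency; var Charges: array of Currency);
var
  i: Integer;
  Subtotal: Currency;
begin
  if Length(Charges) = 0 then
    Exit;
  Subtotal := 0;
  for i := 0 to High(Charges) - 1 do
  begin
    Charges[i] := TotalCost / Length(Charges);    // rounded on assignment to Currency
    Subtotal := Subtotal + Charges[i];
  end;
  Charges[High(Charges)] := TotalCost - Subtotal; // the last department absorbs the rounding difference
end;

Because the last share is computed by subtraction, the parts always sum exactly to TotalCost, so the resulting journal entries balance.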
See thread:
D7 / DUnit: all CheckEquals(Currency, Currency) tests suddenly fail ...
https://forums.codegear.com/thread.jspa?threadID=16288
It looks like a change on our development workstations caused Currency comparisons to fail. We have not found the root cause, but on two computers running Windows 2000 SP4, and independent of the version of gds32.dll (InterBase 7.5.1 or 2007) and Delphi (7 and 2009), this line
TIBDataBase.Create(nil);
now changes the value of the 8087 control word from $1372 to $1272.
And all Currency comparisons in unit tests will fail with funny messages like
Expected: <12.34> - Found: <12.34>
The gds32.dll has not been modified, so I guess that there is a dependency in this library on a third-party DLL which modifies the control word.
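One defensive option (a sketch; Get8087CW and Set8087CW are in the System unit, and the variable names are illustrative) is to save and restore the control word around the call that is suspected of changing it:

var
  SavedCW: Word;
  DB: TIBDataBase;
begin
  SavedCW := Get8087CW;            // remember the current FPU control word
  try
    DB := TIBDataBase.Create(nil); // the call that appears to change it
  finally
    Set8087CW(SavedCW);            // restore rounding/precision before any Currency comparisons
  end;
end;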
To avoid possible issues with currency rounding in Delphi, use 4 decimal places.
This will ensure that you never have rounding issues when doing calculations with very small amounts.
"Been there. Done That. Written the unit tests."
