Optimizing SIMD histogram calculation

I wrote code that computes a histogram from an OpenCV IplImage * struct and writes it into an unsigned int * buffer. I'm still new to SIMD, so I might not be taking advantage of the full potential the instruction set provides.
histogramASM:
xor rdx, rdx
xor rax, rax
mov eax, dword [imgPtr + imgWidthOffset]
mov edx, dword [imgPtr + imgHeightOffset]
mul rdx
mov rdx, rax ; rdx = Image Size
mov r10, qword [imgPtr + imgDataOffset] ; r10 = ImgData
NextPacket:
mov rax, rdx
movdqu xmm0, [r10 + rax - 16]
mov rcx,16 ; 16 pixels/paq
PacketLoop:
pextrb rbx, xmm0, 0 ; saving the pixel value on rbx
shl rbx,2
inc dword [rbx + Hist]
psrldq xmm0,1
loop PacketLoop
sub rdx,16
cmp rdx,0
jnz NextPacket
ret
In C, I would run this piece of code to obtain the same result.
imgSize = (img->width)*(img->height);
pixelData = (unsigned char *) img->imageData;
for(i = 0; i < imgSize; i++)
{
pixel = *pixelData;
hist[pixel]++;
pixelData++;
}
But when I time both with rdtsc() on my computer, the SIMD assembly version is only 1.5 times faster. Is there a way to optimize the code above and fill the histogram vector quickly with SIMD?
Thanks in advance

Like Jester I'm surprised that your SIMD code had any significant improvement. Did you compile the C code with optimization turned on?
The one additional suggestion I can make is to unroll your Packetloop loop. This is a fairly simple optimization and reduces the number of instructions per "iteration" to just two:
pextrb ebx, xmm0, 0 ; writing ebx zero-extends into rbx
inc dword [rbx * 4 + Hist]
pextrb ebx, xmm0, 1
inc dword [rbx * 4 + Hist]
pextrb ebx, xmm0, 2
inc dword [rbx * 4 + Hist]
...
pextrb ebx, xmm0, 15
inc dword [rbx * 4 + Hist]
If you're using NASM you can use the %rep directive to save some typing:
%assign pixel 0
%rep 16
pextrb rbx, xmm0, pixel
inc dword [rbx * 4 + Hist]
%assign pixel pixel + 1
%endrep
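For comparison, the same unrolling idea can be sketched in C. This is only an illustration under stated assumptions, not code from the question or the answer: the function name histogram_unrolled4 and its parameters are hypothetical, the caller is assumed to have zeroed the histogram, and a scalar tail loop handles pixel counts that are not a multiple of four.

```c
#include <stddef.h>

/* Hypothetical helper: counts 8-bit pixel values into hist[256].
   The caller is assumed to have zeroed hist beforehand. */
void histogram_unrolled4(const unsigned char *pixels, size_t n,
                         unsigned int hist[256])
{
    size_t i = 0;

    /* Main loop, unrolled by four: one load and one increment per pixel,
       with the loop overhead paid once per four pixels. */
    for (; i + 4 <= n; i += 4) {
        hist[pixels[i]]++;
        hist[pixels[i + 1]]++;
        hist[pixels[i + 2]]++;
        hist[pixels[i + 3]]++;
    }

    /* Tail: the remaining 0..3 pixels. */
    for (; i < n; i++)
        hist[pixels[i]]++;
}
```

With optimization enabled, a compiler may already generate something close to this from the plain loop, so it is worth timing the optimized C build before comparing it with hand-written assembly.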

Related

Using FPC .o files in Delphi 2010

I've written quite a big library for matrix operations for Delphi and FPC.
There is now an extension to this library that uses the Intel AVX instruction set, but
I could only get it to compile in FPC. My idea was to create
.o files in FPC that contain the AVX assembler code and include these
files in Delphi. I tried to follow this question:
Linking FPC .o files into Delphi
but without success. I was able to dump the function names and tried to import
them in the Delphi unit. The problem is that I always get an error saying that
the .o file is in the wrong format.
I use CodeTyphon for compilation, which internally uses FPC 3.1.1, and
Delphi 2010 as a first try.
The code is compiled once in FPC and once in Delphi using the appropriate
ifdefs.
My base code looks like this (just an excerpt):
// ###################################################################
// #### This file is part of the mathematics library project, and is
// #### offered under the licence agreement described on
// #### http://www.mrsoft.org/
// ####
// #### Copyright:(c) 2011, Michael R. . All rights reserved.
// ####
// #### Unless required by applicable law or agreed to in writing, software
// #### distributed under the License is distributed on an "AS IS" BASIS,
// #### WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// #### See the License for the specific language governing permissions and
// #### limitations under the License.
// ###################################################################
unit AVXMatrixMultOperations;
interface
{$IFDEF CPUX64}
{$DEFINE x64}
{$ENDIF}
{$IFDEF cpux86_64}
{$DEFINE x64}
{$ENDIF}
{$IFNDEF x64}
uses MatrixConst;
{$IFNDEF FPC}
// this fails -> wrong object format
{$L '.\AVXPrecompiled\win32\AVXMatrixMultOperations.o'}
{$ENDIF}
// full matrix operations
procedure AVXMatrixMultAligned(dest : PDouble; const destLineWidth : TASMNativeInt; mt1, mt2 : PDouble; width1, height1, width2, height2 : TASMNativeInt; const LineWidth1, LineWidth2 : TASMNativeInt);
{$IFNDEF FPC} external '' name 'AVXMATRIXMULTOPERATIONS_$$_AVXMATRIXMULTALIGNED$crc2A67AB04'; {$ENDIF}
{$ENDIF}
implementation
{$IFDEF FPC} {$ASMMODE intel} {$ENDIF}
{$IFNDEF x64}
{$IFDEF FPC}
procedure AVXMatrixMultAligned(dest : PDouble; const destLineWidth : TASMNativeInt; mt1, mt2 : PDouble; width1, height1, width2, height2 : TASMNativeInt; const LineWidth1, LineWidth2 : TASMNativeInt);
var bytesWidth2, destOffset : TASMNativeInt;
iter : TASMNativeInt;
{$IFDEF FPC}
begin
{$ENDIF}
asm
// prolog - simulate stack
push ebx;
push edi;
push esi;
mov ecx, dest;
mov edi, width1;
imul edi, -8;
mov iter, edi;
sub mt1, edi;
//destOffset := destLineWidth - Width2*sizeof(double);
mov ebx, Width2;
shl ebx, 3;
mov eax, destLineWidth;
sub eax, ebx;
mov destOffset, eax;
//bytesWidth2 := width2*sizeof(double);
mov bytesWidth2, ebx;
// for y := 0 to height1 - 1 do
@@foryloop:
// r12 -> counter to width2
mov esi, width2;
sub esi, 2;
jl @LastXColumn;
@@forxloop:
// for x := 0 to width2 div 2 - 1
// esi: mt1 - width1*sizeof(double)
// mt2: mt2
mov edx, mt1;
mov ebx, mt2;
mov eax, iter;
mov edi, LineWidth2;
vxorpd ymm0, ymm0, ymm0;
vxorpd ymm1, ymm1, ymm1;
cmp eax, -32;
jg @@Innerloop2Begin;
// for z := 0 to width1 - 1 do
// AVX part:
@@InnerLoop1:
// 4x4 block
vmovapd xmm2, [ebx];
add ebx, edi;
vmovapd xmm4, xmm2;
vmovapd xmm3, [ebx];
add ebx, edi;
// shuffle so we can multiply
// swap such that we can immediately multiply
vmovlhps xmm2, xmm2, xmm3;
vmovhlps xmm3, xmm3, xmm4;
// next 4 elements
vmovapd xmm4, [ebx];
add ebx, edi;
vmovapd xmm6, xmm4;
vmovapd xmm5, [ebx];
add ebx, edi;
vmovapd ymm7, [edx + eax]
vmovlhps xmm4, xmm4, xmm5;
vmovhlps xmm5, xmm5, xmm6;
vinsertf128 ymm2, ymm2, xmm4, 1;
vinsertf128 ymm3, ymm3, xmm5, 1;
// now multiply and add
vmulpd ymm2, ymm2, ymm7;
vmulpd ymm3, ymm3, ymm7;
vaddpd ymm0, ymm0, ymm2;
vaddpd ymm1, ymm1, ymm3;
add eax, 32;
jl @@InnerLoop1;
vextractf128 xmm2, ymm0, 1;
vextractf128 xmm3, ymm1, 1;
vhaddpd xmm0, xmm0, xmm2;
vhaddpd xmm1, xmm1, xmm3;
test eax, eax;
jz @@InnerLoopEnd2;
@@Innerloop2Begin:
// rest in single elements
@@InnerLoop2:
vmovapd xmm2, [ebx];
add ebx, edi;
vmovddup xmm3, [edx + eax];
vmulpd xmm2, xmm2, xmm3;
vmovhlps xmm4, xmm4, xmm2;
vaddsd xmm0, xmm0, xmm2;
vaddsd xmm1, xmm1, xmm4;
add eax, 8;
jnz @@InnerLoop2;
@@InnerLoopEnd2:
// final horizontal addition
vhaddpd xmm0, xmm0, xmm1;
vmovapd [ecx], xmm0;
// increment the pointers
// inc(mt2), inc(dest);
//add dword ptr [mt2], 8;
add mt2, 16;
add ecx, 16;
// end for x := 0 to width2 div 2 - 1
sub esi, 2;
jge @@forxloop;
@LastXColumn:
cmp esi, -1;
jne @NextLine;
// last column of mt2
mov eax, iter;
mov ebx, mt2;
vxorpd xmm0, xmm0, xmm0;
@InnerLoop2:
vmovsd xmm1, [edx + eax];
vmovsd xmm2, [ebx];
vmulsd xmm1, xmm1, xmm2;
vaddsd xmm0, xmm0, xmm1;
add ebx, edi;
add eax, 8;
jnz @InnerLoop2;
vmovsd [ecx], xmm0;
add ecx, 8;
add mt2, 8;
@NextLine:
// dec(mt2, Width2);
// inc(PByte(mt1), LineWidth1);
// inc(PByte(dest), destOffset);
//mov ebx, bytesWidth2;
//sub dword ptr [mt2], ebx;
mov eax, bytesWidth2;
sub mt2, eax;
mov eax, LineWidth1;
add mt1, eax;
add ecx, destOffset;
// end for y := 0 to height1 - 1
//dec eax;
dec height1;
jnz @@foryloop;
// epilog
vzeroupper;
pop esi;
pop edi;
pop ebx;
end;
{$IFDEF FPC}
end;
{$ENDIF}
{$ENDIF}
{$ENDIF}
end.

Since there is a single function involved here, the easiest is IMHO to convert the FPC AVXMatrixMultOperations.o file directly.
Use the great Object file converter tool.
You may try to convert from one binary format to another, accepted by Delphi.
But I guess that the cleanest way is to convert it to asm:
objconv -fasm AVXMatrixMultOperations.o
It will create an AVXMatrixMultOperations.asm file, which you can use to replace the unknown AVX instructions with simple db ..,..,..,.. bytes. Typically, the generated .asm file has the assembler on the left side and the raw hexadecimal bytes on the right side.
This is how I dealt with old Delphi compilers in my libraries, for instance:
function crc32csse42(crc: cardinal; buf: PAnsiChar; len: cardinal): cardinal;
asm // eax=crc, edx=buf, ecx=len
not eax
test ecx, ecx
jz @0
test edx, edx
jz @0
@3: test edx, 3
jz @8 // align to 4 bytes boundary
{$ifdef ISDELPHI2010}
crc32 eax, byte ptr[edx]
{$else}
db $F2, $0F, $38, $F0, $02
{$endif}
inc edx
....
So in your case, something like
{$ifdef FPC}
vinsertf128 ymm2, ymm2, xmm4, 1;
vinsertf128 ymm3, ymm3, xmm5, 1;
{$else}
db $xx,$yy,$zz
db $xx,$yy,$zz
{$endif}

Combining ASM with non-asm code (or SwapInt64 ASM function needed)

I need to process a file that comes from the old Mac era (old Motorola CPU). The bytes are big endian, so I have a function that swaps an Int64 to Intel little endian. The function is ASM and works on a 32 bit CPU but not on 64 bit. For 64 bit I have a different function that is not ASM. I want to combine the functions using IFDEF. Can I do this? Will it be a problem?
interface
function SwapInt64(Value: Int64): Int64; assembler;
implementation
{$IFDEF CPUx86}
function SwapInt64(Value: Int64): Int64; assembler; { Does not work on 64 bit }
asm
MOV EDX,[DWORD PTR EBP + 12]
MOV EAX,[DWORD PTR EBP + 8]
BSWAP EAX
XCHG EAX,EDX
BSWAP EAX
end;
{$else}
function SwapInt64 (Value: Int64): Int64;
var P: PInteger;
begin
Result := (Value shl 32) or (Value shr 32);
P := @Result;
P^ := (Swap(P^) shl 16) or (Swap(P^ shr 16));
Inc(P);
P^ := (Swap(P^) shl 16) or (Swap(P^ shr 16));
end;
{$ENDIF}
I think the compiler will correctly compile/call the appropriate function no matter if one is ASM and the other is Pascal.

What you are proposing is perfectly fine. It is a quite reasonable approach.
If you want a 64 bit swap in asm, for x64, that's quite simple:
function SwapInt64(Value: Int64): Int64;
asm
MOV RAX,RCX
BSWAP RAX
end;
Combine this with the 32 bit version using conditional compilation, as you have done in the question.
function SwapInt64(Value: Int64): Int64;
{$IF Defined(CPUX86)}
asm
MOV EDX,[DWORD PTR EBP + 12]
MOV EAX,[DWORD PTR EBP + 8]
BSWAP EAX
XCHG EAX,EDX
BSWAP EAX
end;
{$ELSEIF Defined(CPUX64)}
asm
MOV RAX,RCX
BSWAP RAX
end;
{$ELSE}
{$Message Fatal 'Unsupported architecture'}
{$ENDIF}
Or include a Pascal implementation in the {$ELSE} block.
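A portable fallback of that kind needs no assembly at all. As a rough sketch in C (the function name swap_u64 is hypothetical and not part of any code above), the byte order of a 64-bit value can be reversed with three rounds of shifts and masks: swap the two 32-bit halves, then the 16-bit words inside each half, then the bytes inside each word.

```c
#include <stdint.h>

/* Hypothetical helper: reverses the byte order of a 64-bit value using
   shifts and masks only, so it does not depend on the target CPU. */
uint64_t swap_u64(uint64_t v)
{
    /* Swap the two 32-bit halves. */
    v = (v << 32) | (v >> 32);
    /* Swap the 16-bit words inside each 32-bit half. */
    v = ((v & 0x0000FFFF0000FFFFULL) << 16) |
        ((v >> 16) & 0x0000FFFF0000FFFFULL);
    /* Swap the bytes inside each 16-bit word. */
    v = ((v & 0x00FF00FF00FF00FFULL) << 8) |
        ((v >> 8) & 0x00FF00FF00FF00FFULL);
    return v;
}
```

Applying the function twice returns the original value, which makes a convenient sanity check.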

The approach of swapping the bytes in a separate routine that cannot be inlined is a bit silly if performance is what you're after.
A better way is to assume you've got a block of data in which all dwords/qwords need to have their endianness changed.
This would look something like this.
For dwords
function SwapDWords(var Data; size: cardinal): boolean;
{$ifdef CPUX64}
asm
//Data in RCX, Size in EDX
xor EAX,EAX //failure
test EDX,3
jz @MultipleOf4
@error:
ret
@MultipleOf4:
mov EDX,EDX //zero-extend Size into RDX
neg RDX //Count up instead of down
jz @done
sub RCX,RDX //RCX now points just past the end of the data
@loop:
mov R8d, [RCX+RDX]
bswap R8d
mov [RCX+RDX],R8d
add RDX,4 //add is faster than inc on modern processors
jnz @loop
@done:
inc EAX //success
ret
end;
{$endif}
For qwords
function SwapQWords(var Data; size: cardinal): boolean;
{$ifdef CPUX64}
asm
//Data in RCX, Size in EDX
xor EAX,EAX //failure
test EDX,7
jz @MultipleOf8
@error:
ret
@MultipleOf8:
mov EDX,EDX //zero-extend Size into RDX
neg RDX //Count up instead of down
jz @done
sub RCX,RDX //RCX now points just past the end of the data
@loop:
mov R8, [RCX+RDX]
bswap R8
mov [RCX+RDX],R8
add RDX,8 //add is faster than inc on modern processors
jnz @loop
@done:
inc EAX //success
ret
end;
{$endif}
If you're already on 64 bit, then you have SSE2, and can use the 128-bit SSE registers.
Now you can process 4 dwords at a time, effectively unrolling the loop 4 times.
See: http://www.asmcommunity.net/forums/topic/?id=29743
movdqu xmm5,[RCX+RDX] //unaligned load (MOVNTPD has no load form)
movdqu xmm0, xmm5
movdqu xmm1, xmm5
pxor xmm5, xmm5
punpckhbw xmm0, xmm5 //interleave '0' with bytes of original
punpcklbw xmm1, xmm5 //so they become words
pshuflw xmm0, xmm0, 27 //swap the words by shuffling
pshufhw xmm0, xmm0, 27 //27 = B00_01_10_11
pshuflw xmm1, xmm1, 27
pshufhw xmm1, xmm1, 27
packuswb xmm1, xmm0 //make the words back into bytes.
movntpd [RCX+RDX], xmm1 //non-temporal store to keep the cache clean; needs a 16-byte aligned address
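The block-oriented contract used above (reject a size that is not a multiple of the element width, otherwise swap every element in place) can also be sketched in plain C. This is an illustrative counterpart under stated assumptions, not a translation of the assembly: the name swap_dwords and its signature are hypothetical, and memcpy is used so the sketch makes no assumption about the alignment of the buffer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C counterpart of SwapDWords: reverses the byte order of every
   32-bit word in data. Returns false (and changes nothing) when size is not
   a multiple of 4. */
bool swap_dwords(void *data, size_t size)
{
    unsigned char *p = data;

    if (size % 4 != 0)
        return false;

    for (size_t i = 0; i < size; i += 4) {
        uint32_t v;
        memcpy(&v, p + i, sizeof v);            /* alignment-safe load */
        v = (v >> 24) | ((v >> 8) & 0x0000FF00u)
          | ((v << 8) & 0x00FF0000u) | (v << 24);
        memcpy(p + i, &v, sizeof v);            /* alignment-safe store */
    }
    return true;
}
```

Because the whole loop lives in one function, the per-element swap is not a separate call, which is the point the answer makes about avoiding a routine that cannot be inlined.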

Simply use either LEtoN() or BEtoN().
Use the LE variant if the source data is little endian (e.g. written on 32- or 64-bit x86 Macs or modern ARM), and the BE variant if the source data (e.g. a file from disk) is in big-endian format.
Depending on the target architecture, either a swap or nothing at all will be inlined, usually a fairly optimal one for single conversions. For block-oriented solutions, see the posted SSE code (or Agner Fog's).
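The idea behind such a conversion, that the caller states the byte order of the source data and gets a native value back regardless of the host, can be illustrated in C. This is a sketch of the concept only, not how LEtoN or BEtoN are implemented; the name read_be64 is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: assembles a 64-bit value from 8 bytes stored in
   big-endian order. The result is the same on little-endian and
   big-endian hosts because only shifts and ORs on values are used. */
uint64_t read_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];    /* most significant byte comes first */
    return v;
}
```

For example, the eight bytes 00 00 00 00 00 00 01 02 read this way give 258 on any host.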

faster alternative to InttoStr/StrToInt?

I wonder if there is a faster alternative to System.IntToStr / System.StrToInt. There is a fast version, Int32ToUTF8 from SynCommons.pas, but it is UTF8 only, and the string conversions it then requires are bound to make it slow. The purepascal RTL versions are really slow for 64 bit.

This routine is approximately 40% faster than the routine in the RTL. It could be much faster if you worked with WideChar[] buffers because the string allocation is taking up 75% of the time used by the conversion routine:
IntS32ToWide: 5,50 ns/item (PWideChar)
IntToStr: 34,51 ns/item (RTL)
IntS32ToStr: 24,77 ns/item (RTL replacement)
Please note that the routine below uses SSE2 and only x86 and x64 versions are fully implemented and tested.
In the interface:
function IntU32ToWide( X: Longword; P: PWideChar ): PWideChar; register;
function IntS32ToWide( X: Integer; P: PWideChar ): PWideChar; register;
function IntS32ToStr ( X: Longword ): UnicodeString; register; inline;
In the implementation:
{$CODEALIGN 16}
{$ALIGN 16}
const
DigitsClippedW: array [ 0..99 ] of LongWord = (
$000030, $000031, $000032, $000033, $000034, $000035, $000036, $000037, $000038, $000039,
$300031, $310031, $320031, $330031, $340031, $350031, $360031, $370031, $380031, $390031,
$300032, $310032, $320032, $330032, $340032, $350032, $360032, $370032, $380032, $390032,
$300033, $310033, $320033, $330033, $340033, $350033, $360033, $370033, $380033, $390033,
$300034, $310034, $320034, $330034, $340034, $350034, $360034, $370034, $380034, $390034,
$300035, $310035, $320035, $330035, $340035, $350035, $360035, $370035, $380035, $390035,
$300036, $310036, $320036, $330036, $340036, $350036, $360036, $370036, $380036, $390036,
$300037, $310037, $320037, $330037, $340037, $350037, $360037, $370037, $380037, $390037,
$300038, $310038, $320038, $330038, $340038, $350038, $360038, $370038, $380038, $390038,
$300039, $310039, $320039, $330039, $340039, $350039, $360039, $370039, $380039, $390039 );
// Delphi XE3 has no working alignment for 16 bytes for data but it has alignment for 16 bytes for code!
// So we encode our constants as a procedure and use constant offsets to the data.
const
Div10000_Shl45d = $00;
Shl16_minus_10000d = $10;
Div_1000_100_10_1w = $20;
Shl_1000_100_10_1w = $30;
Mul_10w = $40;
To_Asciiw = $50;
Mul_10000d = $60;
Div100_Shl19w = $70;
Mul100w = $80;
Div10_shl16w = $90;
To_Asciib = $A0;
procedure IntUToStrConsts();
asm
{$if defined( CPUX64 )}.NOFRAME{$ifend}
dd $d1b71759, $d1b71759, $d1b71759, $d1b71759; // RoundUp( 2^45 / 10000 )
dd $10000 - 10000, $10000 - 10000, $10000 - 10000, $10000 - 10000; // 1 shl 16 - 1e4
dw 8389, 5243, 13108, $8000, 8389, 5243, 13108, $8000; // 1000 100 10 1 div
dw 1 shl 7, 1 shl 11, 1 shl 13, 1 shl 15, 1 shl 7, 1 shl 11, 1 shl 13, 1 shl 15; // 1000 100 10 1 shr
dw 10, 10, 10, 10, 10, 10, 10, 10; // 10
dw $30, $30, $30, $30, $30, $30, $30, $30; // To Unicode / ASCII
dd 10000, 10000, 10000, 10000; // 10000
dw $147b, $147b, $147b, $147b, $147b, $147b, $147b, $147b // RoundUp( 2^19 / 100 )
dw 100, 100, 100, 100, 100, 100, 100, 100 // 100
dw $199a, $199a, $199a, $199a, $199a, $199a, $199a, $199a // RoundUp( 2^16 / 10 )
dd $30303030, $30303030, $30303030, $30303030 // To bytewise / ASCII
end;
function IntS32ToStr( X: Longword ): UnicodeString; register;
var
P, Q: PWideChar;
begin
SetLength( Result, 11 );
P := PWideChar( Pointer( Result ) );
// Full string buffer and set the length of the string with no resizing!
PLongword( ( NativeInt( Result ) - sizeof( Longword ) ) )^ := IntS32ToWide( X, P ) - P;
end;
function IntS32ToWide( X: Integer; P: PWideChar ): PWideChar;
{$if defined( CPUX86 )}
asm // eax = X, edx = P
cmp eax, 0
jge IntU32ToWide
mov word ptr [ edx ], Word( '-' )
neg eax
lea edx, [ edx + 2 ]
jmp IntU32ToWide
end;
{$else if defined( CPUX64 )}
asm // ecx = X, rdx = P
.NOFRAME
cmp ecx, 0
jge IntU32ToWide
mov word ptr [ rdx ], Word( '-' )
neg ecx
lea rdx, [ rdx + 2 ]
jmp IntU32ToWide
end;
{$else}
begin
if X >= 0 then begin
Result := IntU32ToWide( Longword( X ), P );
end else begin
P^ := '-';
Result := IntU32ToWide( Longword( -X ), P + 1 );
end;
end;
{$ifend}
function IntU32ToWide( X: Longword; P: PWideChar ): PWideChar; register;
{$if defined( CPUX86 )}
asm
cmp eax, 100000000
jb @Medium
@Large:
push edx
xor edx, edx
mov ecx, 100000000
div ecx
pop ecx
// eax = high one or two digit value, edx = 8 digit value, ecx = pointer
// Emit the first 2 digits
mov eax, dword ptr [ DigitsClippedW + eax * 4 ]
mov [ ecx ], eax
cmp eax, $10000
setae al
movzx eax, al
lea eax, [ eax * 2 + ecx + 18 ]
// edx = 8 digit value, ecx = pointer
// Emit 8 follow digits
movd xmm1, edx // xmm1 = Value
movdqa xmm0, dqword ptr [ IntUToStrConsts + Div10000_Shl45d ]
pmuludq xmm0, xmm1
psrlq xmm0, 45 // xmm0 = xmm1 div 10000
pmuludq xmm0, dqword ptr [ IntUToStrConsts + Shl16_minus_10000d ]
paddd xmm0, xmm1 // xmm0 = word( lo digits ), word( hi digit ), 0 (6x)
psllq xmm0, 2
punpcklwd xmm0, xmm0
punpckldq xmm0, xmm0 // xmm0 *= 4 (lo, lo, lo, lo, hi, hi, hi, hi)W (LSW, MSW)
pmulhuw xmm0, dqword ptr [ IntUToStrConsts + Div_1000_100_10_1w ]
pmulhuw xmm0, dqword ptr [ IntUToStrConsts + Shl_1000_100_10_1w ] // xmm0 = ( lo, lo div 10, lo div 100, lo div 100, (same with hi) )W
movdqa xmm2, dqword ptr [ IntUToStrConsts + Mul_10w ] // xmm2 := xmm0 * 10; shift to left one word.
pmullw xmm2, xmm0
psllq xmm2, 16
psubw xmm0, xmm2 // Extract digits
por xmm0, dqword ptr [ IntUToStrConsts + To_ASCIIw ] // Digits to ASCII
shufps xmm0, xmm0, $4E
movdqu [ eax - 16 ], xmm0 // And save 8 digits at once
ret
@Medium:
cmp eax, 100
jb @Small
// eax 2..8 digits, edx = pointer
// Emit 2..8 digits
movd xmm1, eax // xmm1 = Value
movdqa xmm0, dqword ptr [ IntUToStrConsts + Div10000_Shl45d ]
pmuludq xmm0, xmm1
psrlq xmm0, 45 // xmm0 = xmm1 div 10000
pmuludq xmm0, dqword ptr [ IntUToStrConsts + Shl16_minus_10000d ]
paddd xmm0, xmm1 // xmm0 = word( lo digits ), word( hi digit ), 0 (6x)
psllq xmm0, 2
punpcklwd xmm0, xmm0
punpckldq xmm0, xmm0 // xmm0 *= 4 (lo, lo, lo, lo, hi, hi, hi, hi)W (LSW, MSW)
pmulhuw xmm0, dqword ptr [ IntUToStrConsts + Div_1000_100_10_1w ]
pmulhuw xmm0, dqword ptr [ IntUToStrConsts + Shl_1000_100_10_1w ] // xmm0 = ( lo, lo div 10, lo div 100, lo div 100, (same with hi) )W
movdqa xmm2, dqword ptr [ IntUToStrConsts + Mul_10w ] // xmm2 := xmm0 * 10; shift to left one word.
pmullw xmm2, xmm0
psllq xmm2, 16
psubw xmm0, xmm2 // Extract digits
movdqa xmm1, dqword ptr [ IntUToStrConsts + To_ASCIIw ] // Digits to ASCII
por xmm0, xmm1
shufps xmm0, xmm0, $4E
// Now we have 8 Unicode characters in the xmm0 register in the correct order.
pcmpeqw xmm1, xmm0 // scan for zeroes.
pmovmskb eax, xmm1
packuswb xmm0, xmm0 // convert to bytes
xor eax, $FFFF // change polarity
bsf eax, eax // amount to shift in bytes.
lea ecx, [ eax * 4 ]
movd xmm1, ecx
psrlq xmm0, xmm1 // bytes shifted.
pxor xmm2, xmm2
punpcklbw xmm0, xmm2
neg eax
movdqu dqword ptr [ edx ], xmm0
lea eax, [ edx + 16 + eax ]
ret
@Small:
// eax 1..2 digits, edx = pointer
// Emit one or two digits
mov eax, dword ptr [ DigitsClippedW + eax * 4 ]
mov [ edx ], eax
cmp eax, $10000
setae al
movzx eax, al
lea eax, [ edx + eax * 2 + 2 ]
end;
{$else if defined( CPUX64 )}
asm
cmp ecx, 100000000
jb @Medium
@Large:
mov r8, rdx // r8 = pointer
// Split up low 8 digits from high 1 or 2 digits..
mov eax, ecx
mov r9, 12379400392853802749 // RoundUp( 2^(64+26) / 1e8 )
mul rax, r9
shr rdx, 26
mov r10, rdx // r10 = eax div 1e8
mov rax, rdx
mov r9, 100000000
mul rax, r9
sub ecx, eax // ecx = eax mod 1e8
// Emit the first 2 digits
lea r9, [ DigitsClippedW ]
mov eax, dword ptr [ r9 + r10 * 4 ]
mov dword ptr [ r8 ], eax
// advance pointer ( also for the next 8 bytes)
cmp eax, $10000
setae al
movzx rax, al
lea rax, [ rax * 2 + r8 + 2 + 16 ]
// ecx = 8 digit value, r8 = pointer + 8
movd xmm1, ecx // xmm1 = Value
movdqa xmm0, dqword ptr [ IntUToStrConsts + Div10000_Shl45d ]
pmuludq xmm0, xmm1
psrlq xmm0, 45 // xmm0 = xmm1 div 10000
pmuludq xmm0, dqword ptr [ IntUToStrConsts + Shl16_minus_10000d ]
paddd xmm0, xmm1 // xmm0 = word( lo digits ), word( hi digit ), 0 (6x)
psllq xmm0, 2
punpcklwd xmm0, xmm0
punpckldq xmm0, xmm0 // xmm0 *= 4 (lo, lo, lo, lo, hi, hi, hi, hi)W (LSW, MSW)
pmulhuw xmm0, dqword ptr [ IntUToStrConsts + Div_1000_100_10_1w ]
pmulhuw xmm0, dqword ptr [ IntUToStrConsts + Shl_1000_100_10_1w ] // xmm0 = ( lo, lo div 10, lo div 100, lo div 100, (same with hi) )W
movdqa xmm2, dqword ptr [ IntUToStrConsts + Mul_10w ] // xmm2 := xmm0 * 10; shift to left one word.
pmullw xmm2, xmm0
psllq xmm2, 16
psubw xmm0, xmm2 // Extract digits
por xmm0, dqword ptr [ IntUToStrConsts + To_ASCIIw ] // Digits to ASCII
shufps xmm0, xmm0, $4E
movdqu [ rax - 16 ], xmm0 // And save 8 digits at once
ret
@Medium:
cmp ecx, 100
jb @Small
// eax 2..8 digits, rdx = pointer
// Emit 2..8 digits
movd xmm1, ecx // xmm1 = Value
movdqa xmm0, dqword ptr [ IntUToStrConsts + Div10000_Shl45d ]
pmuludq xmm0, xmm1
psrlq xmm0, 45 // xmm0 = xmm1 div 10000
pmuludq xmm0, dqword ptr [ IntUToStrConsts + Shl16_minus_10000d ]
paddd xmm0, xmm1 // xmm0 = word( lo digits ), word( hi digit ), 0 (6x)
psllq xmm0, 2
punpcklwd xmm0, xmm0
punpckldq xmm0, xmm0 // xmm0 *= 4 (lo, lo, lo, lo, hi, hi, hi, hi)W (LSW, MSW)
pmulhuw xmm0, dqword ptr [ IntUToStrConsts + Div_1000_100_10_1w ]
pmulhuw xmm0, dqword ptr [ IntUToStrConsts + Shl_1000_100_10_1w ] // xmm0 = ( lo, lo div 10, lo div 100, lo div 100, (same with hi) )W
movdqa xmm2, dqword ptr [ IntUToStrConsts + Mul_10w ] // xmm2 := xmm0 * 10; shift to left one word.
pmullw xmm2, xmm0
psllq xmm2, 16
psubw xmm0, xmm2 // Extract digits
movdqa xmm1, dqword ptr [ IntUToStrConsts + To_ASCIIw ] // Digits to ASCII
por xmm0, xmm1
shufps xmm0, xmm0, $4E
// Now we have 8 Unicode characters in the xmm0 register in the correct order.
pcmpeqw xmm1, xmm0 // scan for zeroes.
pmovmskb eax, xmm1
packuswb xmm0, xmm0 // convert to bytes
xor eax, $FFFF // change polarity
bsf eax, eax // amount to shift in bytes.
lea ecx, [ eax * 4 ]
movd xmm1, ecx
psrlq xmm0, xmm1 // bytes shifted.
pxor xmm2, xmm2
punpcklbw xmm0, xmm2
neg rax
movdqu dqword ptr [ rdx ], xmm0
lea rax, [ rdx + 16 + rax ]
ret
@Small:
// ecx 1..2 digits, rdx = pointer
// Emit one or two digits
lea r9, [ DigitsClippedW ]
mov eax, dword ptr [ r9 + rcx * 4 ]
mov [ rdx ], eax
cmp eax, $10000
setae al
movzx rax, al
lea rax, [ rdx + rax * 2 + 2 ]
end;
{$else}
begin
Assert( False, 'Not implemented.' );
end;
{$ifend}
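The DigitsClippedW table above holds one or two characters per entry so that a single lookup emits up to two digits. The same table-driven idea, without the SSE2 part, can be sketched in C. This is an assumption-laden illustration rather than a port of the routine: the names two_digits and u32_to_dec are hypothetical, and the output is plain char rather than WideChar.

```c
#include <stddef.h>
#include <stdint.h>

/* "00", "01", ..., "99" packed back to back (200 characters). */
static const char two_digits[201] =
    "00010203040506070809"
    "10111213141516171819"
    "20212223242526272829"
    "30313233343536373839"
    "40414243444546474849"
    "50515253545556575859"
    "60616263646566676869"
    "70717273747576777879"
    "80818283848586878889"
    "90919293949596979899";

/* Hypothetical helper: writes the decimal digits of x into out (no
   terminator) and returns the number of characters written (1..10). */
size_t u32_to_dec(uint32_t x, char *out)
{
    char tmp[10];
    size_t pos = sizeof tmp;

    /* Peel off two digits per division while more than two remain. */
    while (x >= 100) {
        uint32_t r = x % 100;
        x /= 100;
        pos -= 2;
        tmp[pos]     = two_digits[2 * r];
        tmp[pos + 1] = two_digits[2 * r + 1];
    }
    /* One or two leading digits are left. */
    if (x >= 10) {
        pos -= 2;
        tmp[pos]     = two_digits[2 * x];
        tmp[pos + 1] = two_digits[2 * x + 1];
    } else {
        tmp[--pos] = (char)('0' + x);
    }

    size_t len = sizeof tmp - pos;
    for (size_t i = 0; i < len; i++)
        out[i] = tmp[pos + i];
    return len;
}
```

Writing into a caller-supplied buffer mirrors the PWideChar variant above, which is the one the timings show to be fastest because no string allocation is involved.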

In SynCommons.pas, you have also the following function:
function IntToString(Value: integer): string;
var tmp: array[0..15] of AnsiChar;
P: PAnsiChar;
begin
P := StrInt32(@tmp[15],Value);
Ansi7ToString(PWinAnsiChar(P),@tmp[15]-P,result);
end;
I suspect it will also be fast, even on the Win64 platform. Slower than asm, but fast enough for small numbers (which tend to be most of the integers in the wild).
There will be only one memory allocation in this function, which is pretty fast even on Win64, thanks to the updated version of FastMM4, which has its own optimized x64 asm.

In my opinion, the key way to improve performance is to avoid heap allocations. The time spent by IntToStr doing the allocations is greater than the time spent doing the decimal conversion. And if you are wanting to use multiple threads then this is even more important because the default Delphi memory manager does not scale well under thread contention.
It's true that the decimal conversion can also be optimised, but I always try to optimise by picking off the low-hanging fruit first.
So, for the sake of completeness, in case these functions prove useful to others, here are my routines for heap allocation free integer to string conversion:
procedure DivMod(Dividend, Divisor: Cardinal; out Quotient, Remainder: Cardinal);
{$IFDEF CPUX86}
asm
PUSH EBX
MOV EBX,EDX
XOR EDX,EDX
DIV EBX
MOV [ECX],EAX
MOV EBX,Remainder
MOV [EBX],EDX
POP EBX
end;
{$ELSE IF Defined(CPUX64)}
asm
.NOFRAME
MOV EAX,ECX
MOV ECX,EDX
XOR EDX,EDX
DIV ECX
MOV [R8],EAX
MOV [R9],EDX
end;
{$ELSE}
{$Message Error 'Unrecognised platform.'}
{$ENDIF}
{$IFOPT R+}
{$DEFINE RANGECHECKSON}
{$R-}
{$ENDIF}
{$IFOPT Q+}
{$DEFINE OVERFLOWCHECKSON}
{$Q-}
{$ENDIF}
// disable range checks and overflow checks so that abs() functions in case Value = low(Value)
function CopyIntegerToAnsiBuffer(const Value: Integer; var Buffer: array of AnsiChar): Integer;
var
i, j: Integer;
val, remainder: Cardinal;
negative: Boolean;
tmp: array [0..15] of AnsiChar;
begin
negative := Value<0;
val := abs(Value);
Result := 0;
repeat
DivMod(val, 10, val, remainder);
tmp[Result] := AnsiChar(remainder + ord('0'));
inc(Result);
until val=0;
if negative then begin
tmp[Result] := '-';
inc(Result);
end;
Assert(Result<=Length(Buffer));
i := 0;
j := Result-1;
while i<Result do begin
Buffer[i] := tmp[j];
inc(i);
dec(j);
end;
end;
function CopyInt64ToAnsiBuffer(const Value: Int64; var Buffer: array of AnsiChar): Integer;
var
i, j: Integer;
val, remainder: UInt64;
negative: Boolean;
tmp: array [0..23] of AnsiChar;
begin
negative := Value<0;
val := abs(Value);
Result := 0;
repeat
DivMod(val, 10, val, remainder);
tmp[Result] := AnsiChar(remainder + ord('0'));
inc(Result);
until val=0;
if negative then begin
tmp[Result] := '-';
inc(Result);
end;
Assert(Result<=Length(Buffer));
i := 0;
j := Result-1;
while i<Result do begin
Buffer[i] := tmp[j];
inc(i);
dec(j);
end;
end;
{$IFDEF RANGECHECKSON}
{$R+}
{$UNDEF RANGECHECKSON}
{$ENDIF}
{$IFDEF OVERFLOWCHECKSON}
{$Q+}
{$UNDEF OVERFLOWCHECKSON}
{$ENDIF}
My use case requires an array of AnsiChar, but it is of course simple to amend these functions to populate WideChar arrays.
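For readers who want the same allocation-free pattern outside Delphi, here is a rough C sketch. It is not a port of the routines above: the name copy_int32_to_buffer and its capacity parameter are hypothetical, and instead of relying on disabled overflow checks it negates in unsigned arithmetic so that the most negative value is handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C counterpart of CopyIntegerToAnsiBuffer: writes the decimal
   text of value into buf (no terminator) and returns its length, or 0 when
   cap is too small. No heap allocation takes place. */
size_t copy_int32_to_buffer(int32_t value, char *buf, size_t cap)
{
    char tmp[11];                   /* 10 digits plus an optional sign */
    size_t n = 0;
    /* Negate in unsigned arithmetic so that INT32_MIN is handled too. */
    uint32_t v = value < 0 ? 0u - (uint32_t)value : (uint32_t)value;

    do {                            /* digits come out least significant first */
        tmp[n++] = (char)('0' + v % 10);
        v /= 10;
    } while (v != 0);
    if (value < 0)
        tmp[n++] = '-';

    if (n > cap)
        return 0;
    for (size_t i = 0; i < n; i++)  /* reverse into the caller's buffer */
        buf[i] = tmp[n - 1 - i];
    return n;
}
```

Returning 0 for a buffer that is too small is a design choice of this sketch; the Delphi version above uses an Assert for the same condition.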

Delphi - Detect Int64 Overflow Error

In Delphi how can I detect overflow errors for Int64?
For Integers we could do:
type
MyInt = Integer; //Int64
function TryMaxTimes10(out Res: MyInt): boolean;
var
a, b: MyInt;
begin
{$Q+}
try
a := High(MyInt);
b := 10;
Res := a * b; //REF1
Result := True;
except
Result := False;
end;
{$Q-}
end;
For MyInt = Integer, line REF1 gives an exception and so TryMaxTimes10 returns false.
But if we change MyInt to MyInt = Int64, then REF1 does not give an exception and TryMaxTimes10 returns true!
I understand that the help for {$Q+} does not specifically mention Int64: ... {$Q+} state, certain integer arithmetic operations ... are checked for overflow.
QUESTION: So my question is, how can we detect overflow errors for Int64?
(I'm using Delphi 7. Does the same thing happen in newer versions of Delphi?)

This is a known issue. See http://qc.embarcadero.com/wc/qcmain.aspx?d=10185, and the comments Andy wrote at the bottom.
My suggestion would be to create a function (I did not compile nor test this - just an example):
function Foo(A, B : Int64) : Int64;
var bNeg : boolean;
begin
// Do we expect a negative result?
bNeg := ((a < 0) xor (b < 0));
// Get the real result
Result := a * b;
// If the result is wrong, raise an error
if ((Result < 0) xor bNeg) then begin
// Raise EOverFlow
end;
end;
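A sign comparison on its own can miss cases where the product wraps around to a value that still has the expected sign, or to zero, so a check made before multiplying is more robust. The following C sketch shows one such pre-check based on division. The name try_mul_i64 is hypothetical and the code is an illustration of the general technique, not what Delphi does internally.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: stores a * b in *result and returns true when the
   product fits in int64_t; returns false (leaving *result untouched)
   otherwise. All checks happen before multiplying, so the multiplication
   itself never overflows. */
bool try_mul_i64(int64_t a, int64_t b, int64_t *result)
{
    if (a > 0) {
        if (b > 0) {
            if (a > INT64_MAX / b) return false;   /* positive * positive */
        } else {
            if (b < INT64_MIN / a) return false;   /* positive * non-positive */
        }
    } else {
        if (b > 0) {
            if (a < INT64_MIN / b) return false;   /* non-positive * positive */
        } else {
            if (a != 0 && b < INT64_MAX / a) return false; /* both non-positive */
        }
    }
    *result = a * b;
    return true;
}
```

With this sketch, multiplying the largest Int64 value by 10, as in the question, is reported as an overflow, and so is 2^32 times 2^32, whose wrapped result would be zero.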

This bug has been fixed in RAD Studio 10.2 Tokyo.
The issue can be found here (but one has to log in with an Embarcadero account to see it).
Here is the correct version of __llmulo by John O'Harrow (licensed under MPL 1.1) shipped with Delphi versions 10.2 and above:
// Param 1(edx:eax), Param 2([esp+8]:[esp+4])
// Result is stored in edx:eax
// O-flag set on exit => result is invalid
// O-flag clear on exit => result is valid
procedure __llmulo();
asm
test edx, edx {Param1-Hi = 0?}
jne @@Large {No, More than one multiply may be needed}
cmp edx, [esp+8] {Param2-Hi = 0?}
jne @@Large {No, More than one multiply may be needed}
mul dword ptr [esp+4] {Only one multiply needed, Set Result}
and eax, eax {Clear Overflow Flag}
ret 8
@@Large:
sub esp, 28 {allocate local storage}
mov [esp], ebx {save used registers}
mov [esp+4], esi
mov [esp+8], edi
mov [esp+12], ebp
mov ebx, [esp+32] {Param2-Lo}
mov ecx, [esp+36] {Param2-Hi}
mov esi, edx
mov edi, ecx
sar esi, 31
sar edi, 31
xor eax, esi
xor edx, esi
sub eax, esi
sbb edx, esi {edx:eax (a1:a0) = abs(Param1)}
xor ebx, edi
xor ecx, edi
sub ebx, edi
sbb ecx, edi {ecx:ebx (b1:b0) = abs(Param2)}
xor esi, edi {Sign Flag, 0 if Params have same sign else -1}
mov [esp+16], eax {a0}
mov [esp+20], edx {a1}
mov [esp+24], ecx {b1}
mul ebx {edx:eax (c1:c0) = a0*b0}
xchg ebx, edx {ebx = c1, edx = b0}
mov edi, eax {abs(Result-Lo) = c0}
xor ecx, ecx {Upper 32 bits of 128 bit result}
xor ebp, ebp {Second 32 bits of 128 bit result}
mov eax, [esp+20] {a1}
mul edx {edx:eax (d1:d0) = a1*b0}
add ebx, eax {c1 + d0}
adc ebp, edx {d1 + carry}
adc ecx, 0 {Possible carry into Upper 32 bits}
mov eax, [esp+16] {a0}
mov edx, [esp+24] {b1}
mul edx {edx:eax (e1:e0) = a0*b1}
add ebx, eax {abs(Result-Hi) = c1 + d0 + e0}
adc ebp, edx {d1 + e1 + carry}
adc ecx, 0 {Possible carry into Upper 32 bits}
mov eax, [esp+20] {a1}
mov edx, [esp+24] {b1}
mul edx {edx:eax (f1:f0) = a1*b1}
add ebp, eax {d1 + e1 + f0 + carry}
adc ecx, edx {f1 + carry}
or ecx, ebp {Overflow if ecx <> 0 or ebp <> 0}
jnz @@Overflow
mov edx, ebx {Set abs(Result-Hi)}
mov eax, edi {Set abs(Result-Lo)}
cmp edx, $80000000
jae @@CheckRange {Possible Overflow if edx>=$80000000}
@@SetSign:
xor eax, esi {Correct Sign of Result}
xor edx, esi
sub eax, esi
sbb edx, esi
mov ebx, [esp] {restore used registers}
mov esi, [esp+4]
mov edi, [esp+8]
mov ebp, [esp+12]
add esp, 28 {Clears Overflow flag}
ret 8
@@CheckRange:
jne @@Overflow {Overflow if edx>$80000000}
test esi, esi {edx=$80000000, Is Sign Flag=0?}
jnz @@SetSign {No, Result is Ok (-MaxInt64)}
@@Overflow:
mov ebx, [esp] {restore used registers}
mov esi, [esp+4]
mov edi, [esp+8]
mov ebp, [esp+12]
add esp, 28
mov ecx, $80000000
dec ecx {Set Overflow Flag}
ret 8
end;

Intel x86 assembly optimization techniques for expanding 8 bits to 8 boolean bytes of 0 or 1

I have been learning assembler for quite a while, and I am trying to rewrite some simple procedures and functions in it to see the performance benefits (if any). My main development tool is Delphi 2007 and the first examples will be in that language, but they can be easily translated to other languages as well.
The problem is stated as follows:
We are given an unsigned byte value in which each of the eight bits represents a pixel in one row of a screen. Each pixel can be solid (1) or transparent (0). In other words, we have 8 pixels packed into one byte value.
I want to unpack those pixels into an eight-byte array in such a way that the least significant pixel (bit) lands at the lowest index of the array, and so on. Here is an example:
One byte value -----------> eight byte array
10011011 -----------------> [1][1][0][1][1][0][0][1]
Array index number -------> 0 1 2 3 4 5 6 7
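Stated as code, the mapping in the example is simply bit i of the input going to index i of the output. The following C sketch is only a reference definition of the task, not one of the timed methods; the name decode_pixels is hypothetical.

```c
#include <stdint.h>

/* Hypothetical reference helper: expands the 8 bits of enc into 8 bytes of
   0 or 1, with the least significant bit landing at index 0. */
void decode_pixels(uint8_t enc, uint8_t dec[8])
{
    for (int i = 0; i < 8; i++)
        dec[i] = (uint8_t)((enc >> i) & 1u);
}
```

Called with the byte 10011011 from the example, it fills the array with 1, 1, 0, 1, 1, 0, 0, 1.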
Below I present five methods that solve the problem. Next I will show their time comparison and how I measured those times.
My questions consist of two parts:
1.
I am asking for a detailed answer concerning methods DecodePixels4a and DecodePixels4b. Why is method 4b somewhat slower than 4a?
If, for example, it is slower because my code is not aligned correctly, then show me which instructions in a given method could be better aligned and how to do this without breaking the method.
I would like to see real examples behind the theory. Please bear in mind that I am learning assembly and I want to gain knowledge from your answers that will allow me to write better optimized code in the future.
2.
Can you write a faster routine than DecodePixels4a? If so, please present it and describe the optimization steps that you have taken.
By a faster routine I mean the routine that runs in the shortest period of time in your test environment among all the routines presented here.
All Intel family processors are allowed, as well as those compatible with them.
Below you will find routines written by me:
procedure DecodePixels1(EncPixels: Byte; var DecPixels: TDecodedPixels);
var
i3: Integer;
begin
DecPixels[0] := EncPixels and $01;
for i3 := 1 to 7 do
begin
EncPixels := EncPixels shr 1;
DecPixels[i3] := EncPixels and $01;
//DecPixels[i3] := (EncPixels shr i3) and $01; //this is even slower if you replace above 2 lines with it
end;
end;
//Lets unroll the loop and see if it will be faster.
procedure DecodePixels2(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
DecPixels[0] := EncPixels and $01;
EncPixels := EncPixels shr 1;
DecPixels[1] := EncPixels and $01;
EncPixels := EncPixels shr 1;
DecPixels[2] := EncPixels and $01;
EncPixels := EncPixels shr 1;
DecPixels[3] := EncPixels and $01;
EncPixels := EncPixels shr 1;
DecPixels[4] := EncPixels and $01;
EncPixels := EncPixels shr 1;
DecPixels[5] := EncPixels and $01;
EncPixels := EncPixels shr 1;
DecPixels[6] := EncPixels and $01;
EncPixels := EncPixels shr 1;
DecPixels[7] := EncPixels and $01;
end;
procedure DecodePixels3(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
asm
push eax;
push ebx;
push ecx;
mov bl, al;
and bl, $01;
mov [edx], bl;
mov ecx, $00;
@@Decode:
inc ecx;
shr al, $01;
mov bl, al;
and bl, $01;
mov [edx + ecx], bl;
cmp ecx, $07;
jnz @@Decode;
pop ecx;
pop ebx;
pop eax;
end;
end;
//Unrolled assembly loop
procedure DecodePixels4a(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
asm
push eax;
push ebx;
mov bl, al;
and bl, $01;
mov [edx], bl;
shr al, $01;
mov bl, al;
and bl, $01;
mov [edx + $01], bl;
shr al, $01;
mov bl, al;
and bl, $01;
mov [edx + $02], bl;
shr al, $01;
mov bl, al;
and bl, $01;
mov [edx + $03], bl;
shr al, $01;
mov bl, al;
and bl, $01;
mov [edx + $04], bl;
shr al, $01;
mov bl, al;
and bl, $01;
mov [edx + $05], bl;
shr al, $01;
mov bl, al;
and bl, $01;
mov [edx + $06], bl;
shr al, $01;
mov bl, al;
and bl, $01;
mov [edx + $07], bl;
pop ebx;
pop eax;
end;
end;
// it differs compared to 4a only in switching two instructions (but seven times)
procedure DecodePixels4b(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
asm
push eax;
push ebx;
mov bl, al;
and bl, $01;
shr al, $01; //
mov [edx], bl; //
mov bl, al;
and bl, $01;
shr al, $01; //
mov [edx + $01], bl; //
mov bl, al;
and bl, $01;
shr al, $01; //
mov [edx + $02], bl; //
mov bl, al;
and bl, $01;
shr al, $01; //
mov [edx + $03], bl; //
mov bl, al;
and bl, $01;
shr al, $01; //
mov [edx + $04], bl; //
mov bl, al;
and bl, $01;
shr al, $01; //
mov [edx + $05], bl; //
mov bl, al;
and bl, $01;
shr al, $01; //
mov [edx + $06], bl; //
mov bl, al;
and bl, $01;
mov [edx + $07], bl;
pop ebx;
pop eax;
end;
end;
And here is how I test them:
program Test;
{$APPTYPE CONSOLE}
uses
SysUtils, Windows;
type
TDecodedPixels = array[0..7] of Byte;
var
Pixels: TDecodedPixels;
Freq, TimeStart, TimeEnd :Int64;
Time1, Time2, Time3, Time4a, Time4b: Extended;
i, i2: Integer;
begin
if QueryPerformanceFrequency(Freq) then
begin
for i2 := 1 to 100 do
begin
QueryPerformanceCounter(TimeStart);
for i := 1 to 100000 do
DecodePixels1(155, Pixels);
QueryPerformanceCounter(TimeEnd);
Time1 := Time1 + ((TimeEnd - TimeStart) / Freq * 1000);
QueryPerformanceCounter(TimeStart);
for i := 1 to 100000 do
DecodePixels2(155, Pixels);
QueryPerformanceCounter(TimeEnd);
Time2 := Time2 + ((TimeEnd - TimeStart) / Freq * 1000);
QueryPerformanceCounter(TimeStart);
for i := 1 to 100000 do
DecodePixels3(155, Pixels);
QueryPerformanceCounter(TimeEnd);
Time3 := Time3 + ((TimeEnd - TimeStart) / Freq * 1000);
QueryPerformanceCounter(TimeStart);
for i := 1 to 100000 do
DecodePixels4a(155, Pixels);
QueryPerformanceCounter(TimeEnd);
Time4a := Time4a + ((TimeEnd - TimeStart) / Freq * 1000);
QueryPerformanceCounter(TimeStart);
for i := 1 to 100000 do
DecodePixels4b(155, Pixels);
QueryPerformanceCounter(TimeEnd);
Time4b := Time4b + ((TimeEnd - TimeStart) / Freq * 1000);
end;
Writeln('Time1 : ' + FloatToStr(Time1 / 100) + ' ms. <- Delphi loop.');
Writeln('Time2 : ' + FloatToStr(Time2 / 100) + ' ms. <- Delphi unrolled loop.');
Writeln('Time3 : ' + FloatToStr(Time3/ 100) + ' ms. <- BASM loop.');
Writeln('Time4a : ' + FloatToStr(Time4a / 100) + ' ms. <- BASM unrolled loop.');
Writeln('Time4b : ' + FloatToStr(Time4b / 100) + ' ms. <- BASM unrolled loop instruction switch.');
end;
Readln;
end.
Here are the results from my machine ( Intel® Pentium® E2180 on Win32 XP) :
Time1 : 1,68443549919493 ms. <- Delphi loop.
Time2 : 1,33773024572211 ms. <- Delphi unrolled loop.
Time3 : 1,37015271374424 ms. <- BASM loop.
Time4a : 0,822916962526627 ms. <- BASM unrolled loop.
Time4b : 0,862914462301607 ms. <- BASM unrolled loop instruction switch.
The results are pretty stable: times vary by only a few percent between the tests I've made. And this was always true: Time1 > Time3 > Time2 > Time4b > Time4a
So I think that the difference between Time4a and Time4b is due to the instruction switch in the method DecodePixels4b. Sometimes it is 4%, sometimes up to 10%, but 4b is always slower than 4a.
I was thinking about another method that uses MMX instructions to write eight bytes to memory at once, but I can't figure out a fast way to unpack the byte into a 64-bit register.
Thank you for your time.
Thank you guys for your valuable input. I wish I could answer all of you at the same time; unfortunately, compared to modern CPUs, I have only one "pipe" and can execute only one instruction ("reply") at a time ;-)
So, I will try to sum up some things here and write additional comments under your answers.
First of all, I wanted to say that before posting my question I came up with the solution presented by Wouter van Nifterick, and it was actually way slower than my assembly code.
So I decided not to post that routine here, but you may see that I took the same approach in the loop version of my Delphi routine. It is commented out there because it gave me worse results.
This is a mystery to me. I've run my code once again with Wouter's and PhiS's routines and here are the results:
Time1 : 1,66535493194387 ms. <- Delphi loop.
Time2 : 1,29115785420688 ms. <- Delphi unrolled loop.
Time3 : 1,33716934524107 ms. <- BASM loop.
Time4a : 0,795041753757838 ms. <- BASM unrolled loop.
Time4b : 0,843520166815013 ms. <- BASM unrolled loop instruction switch.
Time5 : 1,49457681191307 ms. <- Wouter van Nifterick, Delphi unrolled
Time6 : 0,400587402866258 ms. <- PhiS, table lookup Delphi
Time7 : 0,325472442519827 ms. <- PhiS, table lookup Delphi inline
Time8 : 0,37350491544239 ms. <- PhiS, table lookup BASM
Look at the Time5 result; quite strange, isn't it?
I guess I have a different Delphi version, since my generated assembly code differs from that provided by Wouter.
Second major edit:
I know why routine 5 was slower on my machine. I had "Range checking" and "Overflow checking" enabled in my compiler options. I've added the assembler directive to routine 9 to see if it helps. It seems that with this directive the assembly procedure is as good as the Delphi inline variant, or even slightly better.
Here are the final results:
Time1 : 1,22508325749317 ms. <- Delphi loop.
Time2 : 1,33004145373084 ms. <- Delphi unrolled loop.
Time3 : 1,1473583622526 ms. <- BASM loop.
Time4a : 0,77322594033463 ms. <- BASM unrolled loop.
Time4b : 0,846033593023372 ms. <- BASM unrolled loop instruction switch.
Time5 : 0,688689382044384 ms. <- Wouter van Nifterick, Delphi unrolled
Time6 : 0,503233741036693 ms. <- PhiS, table lookup Delphi
Time7 : 0,385254722925063 ms. <- PhiS, table lookup Delphi inline
Time8 : 0,432993919452751 ms. <- PhiS, table lookup BASM
Time9 : 0,362680491244212 ms. <- PhiS, table lookup BASM with assembler directive
Third major edit:
In the opinion of @Pascal Cuoq and @j_random_hacker, the difference in execution times between routines 4a, 4b and 5 is caused by data dependency. However, I have to disagree with that opinion, based on further tests that I've made.
I've also come up with a new routine, 4c, based on 4a. Here it is:
procedure DecodePixels4c(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
asm
push ebx;
mov bl, al;
and bl, 1;
mov [edx], bl;
mov bl, al;
shr bl, 1;
and bl, 1;
mov [edx + $01], bl;
mov bl, al;
shr bl, 2;
and bl, 1;
mov [edx + $02], bl;
mov bl, al;
shr bl, 3;
and bl, 1;
mov [edx + $03], bl;
mov bl, al;
shr bl, 4;
and bl, 1;
mov [edx + $04], bl;
mov bl, al;
shr bl, 5;
and bl, 1;
mov [edx + $05], bl;
mov bl, al;
shr bl, 6;
and bl, 1;
mov [edx + $06], bl;
shr al, 7;
and al, 1;
mov [edx + $07], al;
pop ebx;
end;
end;
I would say it is pretty data dependent.
And here are the tests and results. I've made four tests to make sure there is no accident.
I've also added new times for the routines proposed by GJ (Time10a, Time10b).
Test1 Test2 Test3 Test4
Time1 : 1,211 1,210 1,220 1,213
Time2 : 1,280 1,258 1,253 1,332
Time3 : 1,129 1,138 1,130 1,160
Time4a : 0,690 0,682 0,617 0,635
Time4b : 0,707 0,698 0,706 0,659
Time4c : 0,679 0,685 0,626 0,625
Time5 : 0,715 0,682 0,686 0,679
Time6 : 0,490 0,485 0,522 0,514
Time7 : 0,323 0,333 0,336 0,318
Time8 : 0,407 0,403 0,373 0,354
Time9 : 0,352 0,378 0,355 0,355
Time10a : 1,823 1,812 1,807 1,813
Time10b : 1,113 1,120 1,115 1,118
Time10c : 0,652 0,630 0,653 0,633
Time10d : 0,156 0,155 0,172 0,160 <-- current winner!
As you may see, the results of 4a, 4b, 4c and 5 are very close to each other.
Why is that? Because I've removed two instructions from 4a and 4b (4c already lacks them): push eax and pop eax. Since I know I won't use the value in eax anywhere else in my code, I do not have to preserve it.
Now my code has only one push/pop pair, the same as routine 5.
Routine 5 preserves the value of eax because it first makes a copy of it in ecx, but it doesn't preserve ecx.
So my conclusion is that the difference in execution time between 5 and 4a and 4b (before the third edit) was not due to data dependency but was caused by the additional pair of push/pop instructions.
I am very interested in your comments.
After a few days GJ came up with an even faster routine (Time10d) than PhiS's. Nice work GJ!

In general, I'd personally stay away from trying to optimize code by using tricks on assembler level, unless you really need that extra 2 or 3% of speed, and you're willing to pay the price of code that is harder to read, maintain and port.
To squeeze out that last 1%, you might even have to maintain several versions optimized per processor, and if newer processors and an improved Pascal compiler come along, you're not going to benefit from them.
This Delphi code is faster than your fastest assembler code:
procedure DecodePixels5(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
DecPixels[0] := (EncPixels shr 0) and $01;
DecPixels[1] := (EncPixels shr 1) and $01;
DecPixels[2] := (EncPixels shr 2) and $01;
DecPixels[3] := (EncPixels shr 3) and $01;
DecPixels[4] := (EncPixels shr 4) and $01;
DecPixels[5] := (EncPixels shr 5) and $01;
DecPixels[6] := (EncPixels shr 6) and $01;
DecPixels[7] := (EncPixels shr 7) and $01;
end;
Results:
Time1 : 1,03096806151283 ms. <- Delphi loop.
Time2 : 0,740308641141395 ms. <- Delphi unrolled loop.
Time3 : 0,996602425688886 ms. <- BASM loop.
Time4a : 0,608267951561275 ms. <- BASM unrolled loop.
Time4b : 0,574162510648039 ms. <- BASM unrolled loop instruction switch.
Time5 : 0,499628206138524 ms. !!! <- Delphi unrolled loop 5.
It's fast because the operations can be done with registers only, instead of needing to store and fetch memory. Modern processors execute this partly in parallel (a new operation can be started before the previous finished), because the results of the consecutive instructions are independent of each other.
The machine code looks like this:
push ebx;
// DecPixels[0] := (EncPixels shr 0) and 1;
movzx ecx,al
mov ebx,ecx
// shr ebx,$00
and bl,$01
mov [edx],bl
// DecPixels[1] := (EncPixels shr 1) and 1;
mov ebx,ecx
shr ebx,1
and bl,$01
mov [edx+$01],bl
// DecPixels[2] := (EncPixels shr 2) and 1;
mov ebx,ecx
shr ebx,$02
and bl,$01
mov [edx+$02],bl
// DecPixels[3] := (EncPixels shr 3) and 1;
mov ebx,ecx
shr ebx,$03
and bl,$01
mov [edx+$03],bl
// DecPixels[4] := (EncPixels shr 4) and 1;
mov ebx,ecx
shr ebx,$04
and bl,$01
mov [edx+$04],bl
// DecPixels[5] := (EncPixels shr 5) and 1;
mov ebx,ecx
shr ebx,$05
and bl,$01
mov [edx+$05],bl
// DecPixels[6] := (EncPixels shr 6) and 1;
mov ebx,ecx
shr ebx,$06
and bl,$01
mov [edx+$06],bl
// DecPixels[7] := (EncPixels shr 7) and 1;
shr ecx,$07
and cl,$01
mov [edx+$07],cl
pop ebx;
Edit: As suggested, a table lookup is indeed faster.
var
PixelLookup:Array[byte] of TDecodedPixels;
// You could precalculate, but the performance gain would hardly be worth it because you call this once only.
for I := 0 to 255 do
DecodePixels5b(I, PixelLookup[I]);
procedure DecodePixels7(EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
DecPixels := PixelLookup[EncPixels];
end;
Results:
Time1 : 1,03096806151283 ms. <- Delphi loop.
Time2 : 0,740308641141395 ms. <- Delphi unrolled loop.
Time3 : 0,996602425688886 ms. <- BASM loop.
Time4a : 0,608267951561275 ms. <- BASM unrolled loop.
Time4b : 0,574162510648039 ms. <- BASM unrolled loop instruction switch.
Time5 : 0,499628206138524 ms. !!! <- Delphi unrolled loop 5.
Time7 : 0,251533475182096 ms. <- simple table lookup

Your asm code is relatively slow because it uses the stack and writes to memory 8 times.
Check this one...
procedure DecodePixels(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
xor ecx, ecx
add al, al
rcl ecx, 8
add al, al
rcl ecx, 8
add al, al
rcl ecx, 8
add al, al
rcl ecx, 1
mov [DecPixels + 4], ecx
xor ecx, ecx
add al, al
rcl ecx, 8
add al, al
rcl ecx, 8
add al, al
rcl ecx, 8
add al, al
rcl ecx, 1
mov [DecPixels], ecx
end;
Maybe it is even faster than the code with the lookup table!
Improved version:
procedure DecodePixelsI(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
mov ecx, 0 //Faster than: xor ecx, ecx
add al, al
rcl ch, 1
add al, al
rcl cl, 1
ror ecx, 16
add al, al
rcl ch, 1
add al, al
rcl cl, 1
mov [DecPixels + 4], ecx
mov ecx, 0 //Faster than: xor ecx, ecx
add al, al
rcl ch, 1
add al, al
rcl cl, 1
ror ecx, 16
add al, al
rcl ch, 1
add al, al
rcl cl, 1
mov [DecPixels], ecx
end;
Version 3:
procedure DecodePixelsX(EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
add al, al
setc byte ptr[DecPixels + 7]
add al, al
setc byte ptr[DecPixels + 6]
add al, al
setc byte ptr[DecPixels + 5]
add al, al
setc byte ptr[DecPixels + 4]
add al, al
setc byte ptr[DecPixels + 3]
add al, al
setc byte ptr[DecPixels + 2]
add al, al
setc byte ptr[DecPixels + 1]
setnz byte ptr[DecPixels]
end;
Version 4:
const Uint32DecPix : array [0..15] of cardinal = (
$00000000, $00000001, $00000100, $00000101,
$00010000, $00010001, $00010100, $00010101,
$01000000, $01000001, $01000100, $01000101,
$01010000, $01010001, $01010100, $01010101
);
procedure DecodePixelsY(EncPixels: byte; var DecPixels: TDecodedPixels); inline;
begin
pcardinal(@DecPixels)^ := Uint32DecPix[EncPixels and $0F];
pcardinal(cardinal(@DecPixels) + 4)^ := Uint32DecPix[(EncPixels and $F0) shr 4];
end;
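The nibble-table idea of Version 4 carries over to other languages almost unchanged. Below is a rough C++ sketch of it, not GJ's code: the names kNibbleTable and decode_pixels_nibbles are invented here, and the byte order of the result assumes a little-endian host such as x86.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using DecodedPixels = std::array<std::uint8_t, 8>;

// Same 16 constants as Uint32DecPix: bit k of the index sets byte k of the entry.
static const std::uint32_t kNibbleTable[16] = {
    0x00000000u, 0x00000001u, 0x00000100u, 0x00000101u,
    0x00010000u, 0x00010001u, 0x00010100u, 0x00010101u,
    0x01000000u, 0x01000001u, 0x01000100u, 0x01000101u,
    0x01010000u, 0x01010001u, 0x01010100u, 0x01010101u,
};

DecodedPixels decode_pixels_nibbles(std::uint8_t enc) {
    const std::uint32_t lo = kNibbleTable[enc & 0x0F];  // pixels 0..3
    const std::uint32_t hi = kNibbleTable[enc >> 4];    // pixels 4..7
    DecodedPixels out{};
    // memcpy replaces the pointer casts of the Pascal version. The element
    // order is only correct on a little-endian host (assumption).
    std::memcpy(out.data(), &lo, 4);
    std::memcpy(out.data() + 4, &hi, 4);
    return out;
}
```

The table is only 64 bytes, so it fits in a single cache line, at the cost of two lookups and two stores per byte instead of one.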

Expanding on Nick D's answer, I tried the following table-lookup based versions, all of which are faster than the implementations you give (and faster than Wouter van Nifterick's code).
Given the following packed array:
const Uint64DecPix : PACKED ARRAY [0..255] OF UINT64 =
( $0000000000000000, $0000000000000001, $0000000000000100, $0000000000000101, $0000000000010000, $0000000000010001, $0000000000010100, $0000000000010101, $0000000001000000, $0000000001000001, $0000000001000100, $0000000001000101, $0000000001010000, $0000000001010001, $0000000001010100, $0000000001010101,
$0000000100000000, $0000000100000001, $0000000100000100, $0000000100000101, $0000000100010000, $0000000100010001, $0000000100010100, $0000000100010101, $0000000101000000, $0000000101000001, $0000000101000100, $0000000101000101, $0000000101010000, $0000000101010001, $0000000101010100, $0000000101010101,
$0000010000000000, $0000010000000001, $0000010000000100, $0000010000000101, $0000010000010000, $0000010000010001, $0000010000010100, $0000010000010101, $0000010001000000, $0000010001000001, $0000010001000100, $0000010001000101, $0000010001010000, $0000010001010001, $0000010001010100, $0000010001010101,
$0000010100000000, $0000010100000001, $0000010100000100, $0000010100000101, $0000010100010000, $0000010100010001, $0000010100010100, $0000010100010101, $0000010101000000, $0000010101000001, $0000010101000100, $0000010101000101, $0000010101010000, $0000010101010001, $0000010101010100, $0000010101010101,
$0001000000000000, $0001000000000001, $0001000000000100, $0001000000000101, $0001000000010000, $0001000000010001, $0001000000010100, $0001000000010101, $0001000001000000, $0001000001000001, $0001000001000100, $0001000001000101, $0001000001010000, $0001000001010001, $0001000001010100, $0001000001010101,
$0001000100000000, $0001000100000001, $0001000100000100, $0001000100000101, $0001000100010000, $0001000100010001, $0001000100010100, $0001000100010101, $0001000101000000, $0001000101000001, $0001000101000100, $0001000101000101, $0001000101010000, $0001000101010001, $0001000101010100, $0001000101010101,
$0001010000000000, $0001010000000001, $0001010000000100, $0001010000000101, $0001010000010000, $0001010000010001, $0001010000010100, $0001010000010101, $0001010001000000, $0001010001000001, $0001010001000100, $0001010001000101, $0001010001010000, $0001010001010001, $0001010001010100, $0001010001010101,
$0001010100000000, $0001010100000001, $0001010100000100, $0001010100000101, $0001010100010000, $0001010100010001, $0001010100010100, $0001010100010101, $0001010101000000, $0001010101000001, $0001010101000100, $0001010101000101, $0001010101010000, $0001010101010001, $0001010101010100, $0001010101010101,
$0100000000000000, $0100000000000001, $0100000000000100, $0100000000000101, $0100000000010000, $0100000000010001, $0100000000010100, $0100000000010101, $0100000001000000, $0100000001000001, $0100000001000100, $0100000001000101, $0100000001010000, $0100000001010001, $0100000001010100, $0100000001010101,
$0100000100000000, $0100000100000001, $0100000100000100, $0100000100000101, $0100000100010000, $0100000100010001, $0100000100010100, $0100000100010101, $0100000101000000, $0100000101000001, $0100000101000100, $0100000101000101, $0100000101010000, $0100000101010001, $0100000101010100, $0100000101010101,
$0100010000000000, $0100010000000001, $0100010000000100, $0100010000000101, $0100010000010000, $0100010000010001, $0100010000010100, $0100010000010101, $0100010001000000, $0100010001000001, $0100010001000100, $0100010001000101, $0100010001010000, $0100010001010001, $0100010001010100, $0100010001010101,
$0100010100000000, $0100010100000001, $0100010100000100, $0100010100000101, $0100010100010000, $0100010100010001, $0100010100010100, $0100010100010101, $0100010101000000, $0100010101000001, $0100010101000100, $0100010101000101, $0100010101010000, $0100010101010001, $0100010101010100, $0100010101010101,
$0101000000000000, $0101000000000001, $0101000000000100, $0101000000000101, $0101000000010000, $0101000000010001, $0101000000010100, $0101000000010101, $0101000001000000, $0101000001000001, $0101000001000100, $0101000001000101, $0101000001010000, $0101000001010001, $0101000001010100, $0101000001010101,
$0101000100000000, $0101000100000001, $0101000100000100, $0101000100000101, $0101000100010000, $0101000100010001, $0101000100010100, $0101000100010101, $0101000101000000, $0101000101000001, $0101000101000100, $0101000101000101, $0101000101010000, $0101000101010001, $0101000101010100, $0101000101010101,
$0101010000000000, $0101010000000001, $0101010000000100, $0101010000000101, $0101010000010000, $0101010000010001, $0101010000010100, $0101010000010101, $0101010001000000, $0101010001000001, $0101010001000100, $0101010001000101, $0101010001010000, $0101010001010001, $0101010001010100, $0101010001010101,
$0101010100000000, $0101010100000001, $0101010100000100, $0101010100000101, $0101010100010000, $0101010100010001, $0101010100010100, $0101010100010101, $0101010101000000, $0101010101000001, $0101010101000100, $0101010101000101, $0101010101010000, $0101010101010001, $0101010101010100, $0101010101010101);
PUint64DecPix : pointer = @Uint64DecPix;
you can write the following:
procedure DecodePixelsPS1Pas (EncPixels: Byte; var DecPixels: TDecodedPixels);
begin
DecPixels := TDecodedPixels(Uint64DecPix[EncPixels]);
end;
procedure DecodePixelsPS1PasInline (EncPixels: Byte; var DecPixels: TDecodedPixels);
inline;
begin
DecPixels := TDecodedPixels(Uint64DecPix[EncPixels]);
end;
procedure DecodePixelsPS1Asm (EncPixels: Byte; var DecPixels: TDecodedPixels);
asm
lea ecx, Uint64DecPix //[<-Added in EDIT 3]
//mov ecx, dword ptr PUint64DecPix - alternative to the above line (slower for me)
movzx eax, al
movq xmm0, [8*eax+ecx] //Using XMM rather than MMX so we don't have to issue emms at the end
movq [edx], xmm0 //use MOVQ because it doesn't need mem alignment
end;
The standard PAS and ASM implementations are fairly similar speed-wise, but the PAS implementation marked with "INLINE" is the fastest because it gets rid of all the call/ret involved in calling the routine.
--EDIT--: I forgot to say: since you are implicitly assuming something about the memory layout of your TDecodedPixels structure, it would be better if you declare it as
PACKED ARRAY [0..7] of byte
--EDIT2--:
Here are my results for comparison:
Time1 : 2.51638266874701 ms. <- Delphi loop.
Time2 : 2.11277620479698 ms. <- Delphi unrolled loop.
Time3 : 2.21972066282167 ms. <- BASM loop.
Time4a : 1.34093090043567 ms. <- BASM unrolled loop.
Time4b : 1.52222070123437 ms. <- BASM unrolled loop instruction switch.
Time5 : 1.17106364076999 ms. <- Wouter van Nifterick
TimePS1 : 0.633099318488802 ms. <- PS.Pas
TimePS2 : 0.551617593856202 ms. <- PS.Pas Inline
TimePS3 : 0.70921094720139 ms. <- PS.Asm (speed for version before 3rd EDIT)

Compilers do a very good job of optimizing small routines.
I would optimize your code by using a lookup table.
Since you decode a single byte (256 different states), you can precalculate 256 arrays with the unpacked values.
Edit: Note that Pentium processors can execute specific instructions in parallel (superscalar architecture); this is called pairing.
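A minimal sketch of that lookup-table idea, written in C++ for illustration. The table is filled once at program start rather than typed in by hand; the names build_table, kPixelTable and decode_pixels_lookup are hypothetical.

```cpp
#include <array>
#include <cstdint>

using DecodedPixels = std::array<std::uint8_t, 8>;

// Build all 256 decoded rows once; entry v holds the unpacked bits of v.
static std::array<DecodedPixels, 256> build_table() {
    std::array<DecodedPixels, 256> table{};
    for (int v = 0; v < 256; ++v) {
        for (int bit = 0; bit < 8; ++bit) {
            table[v][bit] = static_cast<std::uint8_t>((v >> bit) & 1);
        }
    }
    return table;
}

static const std::array<DecodedPixels, 256> kPixelTable = build_table();

// Decoding is then a single indexed read of 8 bytes.
inline const DecodedPixels& decode_pixels_lookup(std::uint8_t enc) {
    return kPixelTable[enc];
}
```

The table takes 2 KB, so whether it beats a purely arithmetic version probably depends on how much else is competing for the cache in the real program.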

Pure software solution
Using the beautiful technique from this question, which was again inspired by this question we'll have a great solution like this with only one line of code (excluding declarations)
type TPackedDecodedPixels = record
case integer of
0: (a: TDecodedPixels);
1: (v: Int64);
end;
procedure DecodePixels(EncPixels: byte; var DecPixels: TDecodedPixels); inline;
const
magic = $8040201008040201;
mask = $8080808080808080;
begin
TPackedDecodedPixels(DecPixels).v := SwapEndian(((EncPixels*magic) and mask) shr 7);
end;
Of course you need to make sure that DecPixels is properly 8-byte aligned, or you may suffer some slowdown (or even segfaults on other architectures). You can also easily vectorize the function to make it faster.
Explanation
Assume we have the following bit pattern as abcdefgh. We'll want the output array to contain
0000000a 0000000b 0000000c 0000000d 0000000e 0000000f 0000000g 0000000h (1)
Reading that in little endian as a 64-bit integer we'll get %0000000h0000000g0000000f0000000e0000000d0000000c0000000b0000000a. We have to find a magic number that shifts the original bits to the positions that we can extract the necessary bits
Let's multiply the value with the magic number
  |  b7  ||  b6  ||  b5  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
                                                          abcdefgh (1-byte value)
x 1000000001000000001000000001000000001000000001000000001000000001
  ────────────────────────────────────────────────────────────────
= h0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh
At this point all the pixels' bits have been moved to the most significant bits of the corresponding bytes. As they already lie in the right place, we just need to strip out the remaining bits with and:
  |  b7  ||  b6  ||  b5  ||  b4  ||  b3  ||  b2  ||  b1  ||  b0  |
  h0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh0abcdefgh
& 1000000010000000100000001000000010000000100000001000000010000000
  ────────────────────────────────────────────────────────────────
= h0000000g0000000f0000000e0000000d0000000c0000000b0000000a0000000 (8-byte array)
Now that the pixels' bits are in the most significant bits of the corresponding bytes, we need to do a logical right shift by 7 to move them to the least significant position. Because the OP wants the value in reversed order, we need SwapEndian() to convert the bytes to big endian. If you just want little endian you can stop at this step.
So the magic number is %1000000001000000001000000001000000001000000001000000001000000001 = $8040201008040201 and the mask is %1000000010000000100000001000000010000000100000001000000010000000 = $8080808080808080. Of course, in reality, to solve the problem and get those values we need to work backwards: final result → multiplied result → magic number.
But why did I put the bytes in little endian at (1) and then have to convert back to big endian? Why not just arrange the bytes in big endian order and find the magic number for that? In case you're wondering, it's because that way it only works for at most 7 bits at a time. I did it that way in my old answer and had to split a bit off and then combine it back later:
                                                          0abcdefg
x 0000000000000010000001000000100000010000001000000100000010000001
  ────────────────────────────────────────────────────────────────
= 00000000abcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefgabcdefg
& 0000000000000001000000010000000100000001000000010000000100000001
  ────────────────────────────────────────────────────────────────
= 000000000000000a0000000b0000000c0000000d0000000e0000000f0000000g
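The 8-bit multiply, mask, shift and byte-swap sequence explained above can be tried out in C++ roughly as follows. This is a sketch under assumptions: byte_swap64 is a hand-written stand-in for SwapEndian/BSWAP, decode_pixels_mul is an invented name, and the final memcpy assumes a little-endian host.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using DecodedPixels = std::array<std::uint8_t, 8>;

// Portable byte reversal, standing in for SwapEndian / BSWAP.
static std::uint64_t byte_swap64(std::uint64_t v) {
    std::uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r = (r << 8) | (v & 0xFF);
        v >>= 8;
    }
    return r;
}

DecodedPixels decode_pixels_mul(std::uint8_t enc) {
    const std::uint64_t magic = 0x8040201008040201ULL;
    const std::uint64_t mask  = 0x8080808080808080ULL;
    // The copies of enc are 9 bits apart, so they never overlap and no
    // carries occur; the top copy is cut off after its lowest bit.
    std::uint64_t v = ((enc * magic) & mask) >> 7;
    v = byte_swap64(v);  // put the least significant pixel into byte 0
    DecodedPixels out{};
    std::memcpy(out.data(), &v, 8);  // little-endian host assumed
    return out;
}
```

For the input 0x9B this should produce 1, 1, 0, 1, 1, 0, 0, 1, the same as the shift-and-mask routines in the question.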
Hardware support
This is actually a special case of bit expand with a constant mask. Intel introduced the pdep instruction for that purpose in the BMI2 instruction set (which first shipped in Haswell, alongside AVX2), so you just need a single instruction to get the result. In other languages you can use it through the intrinsic function _pdep_u64. Unfortunately, AFAIK Free Pascal doesn't support it and you have to use assembly directly. However, the expression will look like this:
TPackedDecodedPixels(DecPixels).v := _pdep_u64(EncPixels, $0101010101010101);
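To make it clearer what pdep (parallel bit deposit) computes, here is a small software model of it in C++. It is an illustration only: pdep_soft is a hypothetical helper written for this note, not the hardware instruction or the compiler intrinsic, and it is far slower than the real thing.

```cpp
#include <cstdint>

// Software model of PDEP: consecutive low bits of `src` are deposited at
// the positions of the set bits of `mask`, lowest mask bit first.
std::uint64_t pdep_soft(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1);  // isolate lowest set bit of mask
        if (src & bit) {
            result |= lowest;
        }
        mask &= mask - 1;  // clear that mask bit and move on
    }
    return result;
}
```

With the mask 0x0101010101010101, bit i of the input becomes byte i of the result, so the 64-bit value can be stored directly into the 8-byte array on a little-endian machine and no byte swap is needed.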
Correctness check
I tried comparing the OP's version with both of my versions and haven't found any problem so far. The compiler output looks like this:
mov al, dil
mov rbx, rsi
movzx edi, al
movabs rax, 0x8040201008040201
imul rdi, rax
movabs rax, 0x8080808080808080
and rdi, rax
shr rdi, 0x7
call 4016a0 <SYSTEM_$$_SWAPENDIAN$INT64$$INT64>
mov QWORD PTR [rbx], rax
The FPC output is still pretty much sub-optimal because the compiler doesn't know to replace the call to SwapEndian with BSWAP, and it copies data unnecessarily. Why mov al, dil; movzx edi, al instead of just movzx edi, dil? As you can see, outputs from C and C++ compilers are a lot better
See How to create a byte out of 8 bool values (and vice versa)?

I was about to give the same algorithm as Wouter van Nifterick.
In addition, I would explain the better performance in terms of dependency chains.
In each of the versions that you proposed, when you unrolled your basic loop, you kept a dependency between two successive iterations: each of your shr al, $01; requires the previous value of al to have been computed.
If you organize your unrolled iterations such that they can be executed in parallel, they will actually be on a modern processor. Don't be fooled by false dependencies that can be suppressed by register renaming.
Someone pointed out that the Pentium can execute two instructions at once. That's true, but modern processors (since the Pentium Pro, PII,..., Core, Core 2) are executing much more than two instructions at the same time, when they have the chance -- that is, when there is no dependency between the instructions being executed. Note how in Wouter van Nifterick's version each line can be executed independently from the others.
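The two shapes of code being contrasted can be sketched in C++ like this. It only illustrates the dependency structure: the function names are invented, and an optimizing compiler is free to turn both into the same machine code, so no speed difference is claimed for this exact source.

```cpp
#include <array>
#include <cstdint>

using DecodedPixels = std::array<std::uint8_t, 8>;

// Chained: each step consumes the value shifted by the previous step,
// so the shifts form one long dependency chain.
DecodedPixels decode_chained(std::uint8_t enc) {
    DecodedPixels out{};
    for (int i = 0; i < 8; ++i) {
        out[i] = enc & 1u;
        enc >>= 1;  // the next iteration has to wait for this result
    }
    return out;
}

// Independent: every element is computed from the original input only,
// so the eight computations do not depend on each other.
DecodedPixels decode_independent(std::uint8_t enc) {
    DecodedPixels out{};
    out[0] = (enc >> 0) & 1u;
    out[1] = (enc >> 1) & 1u;
    out[2] = (enc >> 2) & 1u;
    out[3] = (enc >> 3) & 1u;
    out[4] = (enc >> 4) & 1u;
    out[5] = (enc >> 5) & 1u;
    out[6] = (enc >> 6) & 1u;
    out[7] = (enc >> 7) & 1u;
    return out;
}
```

Both functions return the same array for every input; only the order in which results become available differs.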
http://www.agner.org/optimize/ has all the information you could ever need to understand the architecture of modern processors and how to take advantage of them.

If you only support the 80386 and above, you can use the BT and SETcc instructions in this manner:
BT eax,0
SETC byte ptr [edx]
inc edx
BT eax,1
SETC byte ptr [edx]
inc edx
etc

How about something like:
/* input byte in eax, address to store result in edx */
and eax, 0xff /* may not be needed */
mov ebx, eax
shl ebx, 7
or eax, ebx
mov ebx, eax
shl ebx, 14
or eax, ebx
mov ebx, eax
and eax, 0x01010101
mov [edx], eax
shr ebx, 4
and ebx, 0x01010101
mov [edx+4], ebx
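The same shift-and-or idea can be written out in C++ to check it against a plain bit loop. This is a sketch, with decode_pixels_shift_or as an invented name and a little-endian host assumed for the two stores.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

using DecodedPixels = std::array<std::uint8_t, 8>;

DecodedPixels decode_pixels_shift_or(std::uint8_t enc) {
    std::uint32_t x = enc;
    x |= x << 7;   // bit k is now also at position k + 7
    x |= x << 14;  // ... and at k + 14 and k + 21
    // Bits 0, 8, 16, 24 now hold input bits 0..3;
    // bits 4, 12, 20, 28 hold input bits 4..7.
    const std::uint32_t lo = x & 0x01010101u;
    const std::uint32_t hi = (x >> 4) & 0x01010101u;
    DecodedPixels out{};
    std::memcpy(out.data(), &lo, 4);      // little-endian host assumed
    std::memcpy(out.data() + 4, &hi, 4);
    return out;
}
```

Some copies overlap after the shifts (for example at bit 7), but none of the overlapping positions survive the 0x01010101 masks, so the result is unaffected.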

The likely reason that 4b is faster than 4a is that it parallelizes better. From 4a:
mov bl, al;
and bl, $01; // data dep (bl)
mov [edx], bl; // data dep (bl)
shr al, $01;
mov bl, al; // data dep (al)
and bl, $01; // data dep (bl)
mov [edx + $01], bl; // data dep (bl)
Instructions marked "data dep" cannot begin executing until the previous instruction has finished, and I've written the registers that cause this data dependency. Modern CPUs are capable of starting an instruction before the last one has completed, if there is no dependency. But the way you've ordered these operations prevents this.
In 4b, you have fewer data dependencies:
mov bl, al;
and bl, $01; // data dep (bl)
shr al, $01;
mov [edx], bl;
mov bl, al;
and bl, $01; // data dep (bl)
shr al, $01;
mov [edx + $01], bl;
With this instruction ordering, fewer of the instructions depend on the previous instruction, so there is more opportunity for parallelism.
I can't guarantee that this is the reason for the speed difference, but it is a likely candidate. Unfortunately it is hard to come across answers as absolute as the ones you are looking for; modern processors have branch predictors, multi-level caches, hardware pre-fetchers, and all sorts of other complexities that can make it difficult to isolate the reasons for performance differences. The best you can do is read a lot, perform experiments, and get familiar with the tools for taking good measurements.

I guess it's that writing to memory (actually, cache memory) is slower than working with registers.
So,
mov [edx+...], bl
shr al, $01;
mov bl, al;
gives the processor some time to write bl to memory before the bl register is needed again, while
shr al, $01;
mov [edx], bl;
mov bl, al;
needs bl immediately so the processor has to stop and wait for the memory write to complete.
This is surprising to me. Modern Intel processors do crazy pipelining and register renaming so in my opinion, if anything, DecodePixels4b should be faster, since the dependencies of each instruction are further back. The above is all the explanation I can offer, apart from this:
x86 is a terrible instruction set, and Intel does amazing and very advanced hocus-pocus to make it efficient. If I were you, I would look into something else. There's very little demand for megaMcOptimised software for PCs today. My friendly suggestion is to look into processors for mobile devices (mainly ARM), because in mobile devices, processor speed, power consumption and battery life concerns mean that micro-optimised software is more important. And ARM has a superior instruction set to x86.

SIMD
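If you extend the algorithm to processing arrays, then SIMD becomes an optimisation option. Here's a SIMD version that takes about 1/3 of the time of an optimised C equivalent:
#include <windows.h> // LARGE_INTEGER, QueryPerformanceCounter
#include <cstdlib> // rand
#include <cstring> // memcmp
#include <iostream> // cout, endl
using namespace std;
int main ()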
{
const int
size = 0x100000;
unsigned char
*source = new unsigned char [size],
*dest,
*dest1 = new unsigned char [size * 32],
*dest2 = new unsigned char [size * 32];
for (int i = 0 ; i < size ; ++i)
{
source [i] = rand () & 0xff;
}
LARGE_INTEGER
start,
middle,
end;
QueryPerformanceCounter (&start);
dest = dest1;
for (int i = 0 ; i < size ; ++i)
{
unsigned char
v = source [i];
for (int b = 0 ; b < 8 ; ++b)
{
*(dest++) = (v >> b) & 1;
}
}
unsigned char
bits [] = {1,2,4,8,16,32,64,128,1,2,4,8,16,32,64,128},
zero [] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0},
ones [] = {1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1};
QueryPerformanceCounter (&middle);
__asm
{
movdqu xmm1,bits
movdqu xmm2,zero
movdqu xmm3,ones
mov ecx,0x100000/4
mov esi,source
mov edi,dest2
l1:
lodsd
movd xmm0,eax
movd xmm4,eax
punpcklbw xmm0,xmm0
punpcklbw xmm4,xmm4
punpcklwd xmm0,xmm0
punpcklwd xmm4,xmm4
punpckldq xmm0,xmm0
punpckhdq xmm4,xmm4
pand xmm0,xmm1
pand xmm4,xmm1
pcmpeqb xmm0,xmm2
pcmpeqb xmm4,xmm2
paddb xmm0,xmm3
paddb xmm4,xmm3
movdqu [edi],xmm0
movdqu [edi+16],xmm4
add edi,32
dec ecx
jnz l1
}
QueryPerformanceCounter (&end);
cout << "C time taken = " << (middle.QuadPart - start.QuadPart) << endl;
cout << "SIMD time taken = " << (end.QuadPart - middle.QuadPart) << endl;
// Only size * 8 bytes of each buffer are written, so compare just that much.
cout << "memcmp = " << memcmp (dest1, dest2, size * 8) << endl;
return 0;
}
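For compilers without MSVC-style inline assembly, the same unpack / mask / compare / add sequence can be written with SSE2 intrinsics. The following is only a sketch under stated assumptions: the function names expand_bits_scalar and expand_bits_sse2 are made up for illustration, it handles two source bytes per step instead of four, and it assumes a little-endian x86 target and an even byte count.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>
#include <cstring>

// Scalar reference: each source byte becomes 8 bytes of 0 or 1, bit 0 first.
void expand_bits_scalar(const std::uint8_t* src, std::uint8_t* dst, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
        for (int b = 0; b < 8; ++b)
            dst[i * 8 + b] = (src[i] >> b) & 1;
}

// SSE2 sketch (hypothetical name): 2 source bytes -> 16 output bytes per step.
// Assumes count is even.
void expand_bits_sse2(const std::uint8_t* src, std::uint8_t* dst, std::size_t count)
{
    const __m128i bits = _mm_setr_epi8(1, 2, 4, 8, 16, 32, 64, static_cast<char>(0x80),
                                       1, 2, 4, 8, 16, 32, 64, static_cast<char>(0x80));
    const __m128i zero = _mm_setzero_si128();
    const __m128i ones = _mm_set1_epi8(1);

    for (std::size_t i = 0; i < count; i += 2)
    {
        std::uint16_t pair;
        std::memcpy(&pair, src + i, sizeof pair);   // unaligned-safe 2-byte load
        __m128i v = _mm_cvtsi32_si128(pair);        // b0 b1 0 0 ...
        v = _mm_unpacklo_epi8(v, v);                // b0 b0 b1 b1 0 ...
        v = _mm_unpacklo_epi16(v, v);               // b0 x4, b1 x4, 0 ...
        v = _mm_unpacklo_epi32(v, v);               // b0 x8, b1 x8
        v = _mm_and_si128(v, bits);                 // isolate one bit per byte
        v = _mm_cmpeq_epi8(v, zero);                // 0xFF where bit clear, 0 where set
        v = _mm_add_epi8(v, ones);                  // 0 where bit clear, 1 where set
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i * 8), v);
    }
}
```

Comparing the output of the two functions with memcmp over count * 8 bytes is a quick way to check the SIMD path, in the same spirit as the memcmp at the end of the program above. No timing claim is made for this variant.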

Incredibly smart solution, Chris. What would you do with the inverse problem: making a byte from an array of 8 bytes?
A non-optimized solution for the inverse problem:
BtBld PROC USES ebx Array:DWORD, Pixels:DWORD
mov eax, [Array]
add eax, 7 ; start at the last byte, which becomes bit 7
mov edx, [Pixels]
xor ebx, ebx ; clear the accumulator
mov ecx, 8
rpt: shl bl, 1 ; make room for the next bit
or bl, [eax] ; merge one 0/1 byte (byte-sized read, stays inside the array)
dec eax
loop rpt
mov [edx], bl
ret
BtBld ENDP
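With SSE2 the inverse operation can probably be done without a loop: shift bit 0 of every byte up to bit 7, then collect the top bits with PMOVMSKB. This is a sketch rather than a tuned implementation. The names pack_bits_scalar and pack_bits_sse2 are hypothetical, and only bit 0 of each input byte is used, so the inputs are assumed to be 0 or 1 as in the question.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Scalar reference: bytes[0] becomes bit 0, bytes[7] becomes bit 7.
std::uint8_t pack_bits_scalar(const std::uint8_t* bytes)
{
    std::uint8_t result = 0;
    for (int b = 7; b >= 0; --b)
        result = static_cast<std::uint8_t>((result << 1) | (bytes[b] & 1));
    return result;
}

// SSE2 sketch (hypothetical name): reads exactly 8 bytes.
std::uint8_t pack_bits_sse2(const std::uint8_t* bytes)
{
    // Load 8 bytes into the low half of the register; the high half is zeroed.
    __m128i v = _mm_loadl_epi64(reinterpret_cast<const __m128i*>(bytes));
    // Shifting each 16-bit lane left by 7 moves bit 0 of both of its bytes to bit 7.
    v = _mm_slli_epi16(v, 7);
    // PMOVMSKB gathers bit 7 of every byte; the upper 8 mask bits are 0 here.
    return static_cast<std::uint8_t>(_mm_movemask_epi8(v));
}
```

For example, the byte array {1,0,1,0,0,1,0,1} packs to 0xA5 with either function.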

As you noticed, the speed difference between the 4a and 4b implementations comes from CPU optimization (executing multiple instructions in parallel, i.e. instruction pipelining). The deciding factor is not the operands, though, but the nature of the operators themselves.
4a Instruction Sequence:
AND - MOV - SHR
4b Instruction Sequence:
AND - SHR - MOV
Both AND and SHR use the Flags register, so these two instructions incur a wait state in the pipeline when one directly follows the other.
Read them as follows:
4a: AND (piped) MOV (piped) SHR
4b: AND (WAIT) SHR (piped) MOV
Conclusion: 4b has 7 more wait states in its pipeline than 4a, thus it's slower.
Josh mentioned that there are data dependencies, i.e.:
mov bl, al;
and bl, $01; // data dep (bl)
but that's not entirely true, since those two instructions can partially be executed in parallel at the CPU level:
mov bl, al -> (A:) read al (B:) write bl => (2 clocks in i386)
and bl, 01 -> (C:) read 01 (D:) write bl => idem
Sequentially they take 4 clocks, but pipelined they take only 3 "clocks" (strictly speaking, "clock" is not the right term from a pipeline perspective, but I use it here for simplicity):
[--A--][--B--]
[--C--]<wait>[---D--]
