Broadcast a byte value to all 16 XMM slots in Delphi ASM - delphi

This is easy in AVX with the VBROADCASTSS/VBROADCASTSD instructions, or in SSE if the value were a double or a float.
How do I broadcast a single 8-bit value to every slot in an XMM register in Delphi ASM?

Michael's answer will work. As an alternative, if you can assume the SSSE3 instruction set, then using Packed Shuffle Bytes pshufb would also work.
Assuming (1) an 8-bit value in AL (for example) and (2) the desired broadcast destination to be XMM1, and (3) that another register, say XMM0, is available, this will do the trick:
movd xmm1, eax ;// move value in AL (part of EAX) into XMM1
pxor xmm0, xmm0 ;// clear xmm0 to create the appropriate mask for pshufb
pshufb xmm1, xmm0 ;// broadcast lowest value into all slots of xmm1
And yes, Delphi's BASM understands SSSE3.

You mean you have a byte in the LSB of an XMM register and want to duplicate it across all lanes of that register? I don't know Delphi's inline assembly syntax, but in Intel/MASM syntax it could be done something like this:
punpcklbw xmm0,xmm0 ; xxxxxxxxABCDEFGH -> xxxxxxxxEEFFGGHH
punpcklwd xmm0,xmm0 ; xxxxxxxxEEFFGGHH -> xxxxxxxxGGGGHHHH
punpckldq xmm0,xmm0 ; xxxxxxxxGGGGHHHH -> xxxxxxxxHHHHHHHH
punpcklqdq xmm0,xmm0 ; xxxxxxxxHHHHHHHH -> HHHHHHHHHHHHHHHH

The fastest option is pshufb, if SSSE3 is available.
; SSSE3
pshufb xmm0, xmm1 ; where xmm1 is zeroed, e.g. with pxor xmm1,xmm1
Otherwise you should usually use this:
; SSE2 only
punpcklbw xmm0, xmm0 ; xxxxxxxxABCDEFGH -> xxxxxxxxEEFFGGHH
pshuflw xmm0, xmm0, 0 ; xxxxxxxxEEFFGGHH -> xxxxxxxxHHHHHHHH
punpcklqdq xmm0, xmm0 ; xxxxxxxxHHHHHHHH -> HHHHHHHHHHHHHHHH
This is better than punpckl bw / wd -> pshufd xmm0, xmm0, 0 because some CPUs (including Merom and K8) have only 64-bit shuffle units. On such CPUs, pshuflw is fast, and so is punpcklqdq, but pshufd and punpck with granularity finer than 64 bits are slow. So this sequence uses only one "slow shuffle" instruction, vs. 3 for bw / wd / pshufd.
On all later CPUs, there's no difference between those two 3-instruction sequences, so it doesn't cost us anything to tune for old CPUs in this case. See also http://agner.org/optimize/ for instruction tables.
This is the sequence from Michael's answer with the middle two instructions replaced by pshuflw.
If your byte is in an integer register to start with, you can use a multiply by 0x01010101 to broadcast it to 4 bytes. e.g.
; movzx eax, whatever
imul edx, eax, 0x01010101 ; edx = al repeated 4 times
movd xmm0, edx ; the broadcast result is in edx, not eax
pshufd xmm0, xmm0, 0
Note that imul's non-immediate source operand can be memory, but it has to be a 32-bit memory location with your byte zero-extended to 32 bits.
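To see why the multiply works, here is a minimal C sketch of just the integer step (not the SSE shuffle). The function name broadcast_byte32 is made up for illustration:

```c
#include <stdint.h>

/* Replicates one byte into all four bytes of a 32-bit value, the same
 * arithmetic that `imul edx, eax, 0x01010101` performs on a zero-extended
 * byte. broadcast_byte32 is a hypothetical helper name. */
uint32_t broadcast_byte32(uint8_t b)
{
    /* b * 0x01010101 = b + (b << 8) + (b << 16) + (b << 24). The partial
     * products never overlap because b < 256, so no carries occur. */
    return (uint32_t)b * UINT32_C(0x01010101);
}
```

This is also why the byte has to be zero-extended first: any garbage in the upper bits of the source would spill into the other byte positions.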
If your data starts in memory, loading into an integer register first is probably not worth it. Just movd to an xmm register. (Or possibly pinsrb if you need to avoid a wider load to avoid crossing a page or maybe a cache line. But that has a false dependency on the old value of the register where movd doesn't.)
If instruction throughput is more of an issue than latency, it can be worth considering pmuludq if you can't use pshufb, even though it has 5 cycle latency on most CPUs.
; low 32 bits of xmm0 = your byte, **zero extended**
pmuludq xmm0, xmm7 ; xmm7 = 0x01010101 in the low 32 bits
pshufd xmm0, xmm0, 0

Related

How can I store a 16 byte number into memory in assembly?

section .bss
length equ 16
number: resb length
This works only with 64 bits:
mov qword[number], 43271
This stores only 8 bytes (64 bits). I need to store a 128-bit value in memory, but I have no idea how to do that.
Thanks
The standard approach is to use two store instructions, one for each half:
mov qword [number], 1234
mov qword [number+8], 4567
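The same two-halves idea, as a rough C sketch (store_u128 is a hypothetical name, and the 16-byte buffer stands in for the number label):

```c
#include <stdint.h>
#include <string.h>

/* Stores a 128-bit value, given as two 64-bit halves, into a 16-byte
 * buffer. The low half goes first; on a little-endian machine such as
 * x86 this gives the same byte layout as the two mov stores above.
 * store_u128 is a hypothetical name. */
void store_u128(unsigned char dst[16], uint64_t lo, uint64_t hi)
{
    memcpy(dst, &lo, sizeof lo);       /* bytes 0..7  */
    memcpy(dst + 8, &hi, sizeof hi);   /* bytes 8..15 */
}
```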

Checking parameters of multiplication by constant in 64 bit

For my BigInteger code, output turned out to be slow for very large BigIntegers. So now I use a recursive divide-and-conquer algorithm, which still needs 2'30" to convert the currently largest known prime to a decimal string of more than 22 million digits (but only 135 ms to turn it into a hexadecimal string).
I still want to reduce the time, so I need a routine that can divide a NativeUInt (i.e. UInt32 on 32 bit platforms, UInt64 on 64 bit platforms) by 100 very fast. So I use multiplication by constant. This works fine in 32 bit code, but I am not 100% sure for 64 bit.
So my question: is there a way to check the reliability of the results of multiplication by constant for unsigned 64 bit values? I checked the 32 bit values by simply trying with all values of UInt32 (0..$FFFFFFFF). This took approx. 3 minutes. Checking all UInt64s would take much longer than my lifetime. Is there a way to check if the parameters used (constant, post-shift) are reliable?
I noticed that DivMod100() always failed for a value like $4000004B if the chosen parameters were wrong (but close). Are there special values or ranges to check for 64 bit, so I don't have to check all values?
My current code:
const
{$IF DEFINED(WIN32)}
// Checked
Div100Const = UInt32(UInt64($1FFFFFFFFF) div 100 + 1);
Div100PostShift = 5;
{$ELSEIF DEFINED(WIN64)}
// Unchecked!!
Div100Const = $A3D70A3D70A3D71;
// UInt64(UInt128($3 FFFF FFFF FFFF FFFF) div 100 + 1);
// UInt128 is fictive type.
Div100PostShift = 2;
{$IFEND}
// Calculates X div 100 using multiplication by a constant, taking the
// high part of the 64 bit (or 128 bit) result and shifting
// right. The remainder is calculated as X - quotient * 100;
// This was tested to work safely and quickly for all values of UInt32.
function DivMod100(var X: NativeUInt): NativeUInt;
{$IFDEF WIN32}
asm
// EAX = address of X, X is UInt32 here.
PUSH EBX
MOV EDX,Div100Const
MOV ECX,EAX
MOV EAX,[ECX]
MOV EBX,EAX
MUL EDX
SHR EDX,Div100PostShift
MOV [ECX],EDX // Quotient
// Slightly faster than MUL
LEA EDX,[EDX + 4*EDX] // EDX := EDX * 5;
LEA EDX,[EDX + 4*EDX] // EDX := EDX * 5;
SHL EDX,2 // EDX := EDX * 4; 5*5*4 = 100.
MOV EAX,EBX
SUB EAX,EDX // Remainder
POP EBX
end;
{$ELSE WIN64}
asm
.NOFRAME
// RCX is address of X, X is UInt64 here.
MOV RAX,[RCX]
MOV R8,RAX
XOR RDX,RDX
MOV R9,Div100Const
MUL R9
SHR RDX,Div100PostShift
MOV [RCX],RDX // Quotient
// Faster than LEA and SHL
MOV RAX,RDX
MOV R9D,100
MUL R9
SUB R8,RAX
MOV RAX,R8 // Remainder
end;
{$ENDIF WIN32}
As usual when writing optimized code, use compiler output for hints / starting points. It's safe to assume any optimization it makes is safe in the general case. Wrong-code compiler bugs are rare.
gcc implements unsigned 64bit divmod with a constant of 0x28f5c28f5c28f5c3. I haven't looked in detail into generating constants for division, but there are algorithms for generating them that will give known-good results (so exhaustive testing isn't needed).
The code actually has a few important differences: it uses the constant differently from the OP's constant.
See the comments for an analysis of what this is actually doing: divide by 4 first, so it can use a constant that only works for dividing by 25 when the dividend is small enough. This also avoids needing an add at all, later on.
#include <stdint.h>
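// Members are in (remainder, quotient) order so the initializer below matches
// the field names: the asm returns the remainder in RAX and the quotient in RDX.
// The quot, rem ordering takes one extra instruction.
struct divmod { uint64_t remainder, quotient; }
div_by_100(uint64_t x) {
struct divmod retval = { x%100, x/100 };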
return retval;
}
compiles to (gcc 5.3 -O3 -mtune=haswell):
movabs rdx, 2951479051793528259
mov rax, rdi ; Function arg starts in RDI (SysV ABI)
shr rax, 2
mul rdx
shr rdx, 2
lea rax, [rdx+rdx*4] ; multiply by 5
lea rax, [rax+rax*4] ; multiply by another 5
sal rax, 2 ; imul rax, rdx, 100 is better here (Intel SnB).
sub rdi, rax
mov rax, rdi
ret
; return values in rdx:rax
Use the "binary" option to see the constant in hex, since disassembler output does it that way, unlike gcc's asm source output.
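On the question of checking parameters without exhaustive testing: there is a well-known sufficient condition from Granlund and Montgomery's division-by-invariant-integers method. The sketch below is only an illustration of that check, not code from the question or from gcc. The names mul_shift and magic_is_safe are made up, and unsigned __int128 is assumed to be available (a GCC/Clang extension):

```c
#include <stdbool.h>
#include <stdint.h>

typedef unsigned __int128 u128;  /* GCC/Clang extension, assumed available */

/* Computes (x * m) >> n, i.e. "take the high part and shift".
 * n is the total shift: register width plus the post-shift. */
uint64_t mul_shift(uint64_t x, uint64_t m, unsigned n)
{
    return (uint64_t)(((u128)x * m) >> n);
}

/* Sufficient (not necessary) condition for
 *     mul_shift(x, m, n) == x / d   for every x below 2^w:
 *     2^n <= m * d <= 2^n + 2^(n - w)
 * Assumes w <= n and n < 127 so nothing overflows 128 bits. */
bool magic_is_safe(uint64_t m, uint64_t d, unsigned n, unsigned w)
{
    u128 pow_n = (u128)1 << n;
    u128 prod = (u128)m * d;
    return prod >= pow_n && prod - pow_n <= ((u128)1 << (n - w));
}
```

Under these assumptions, the Win32 parameters ($51EB851F, total shift 32 + 5 = 37, w = 32) pass. gcc's constant passes for d = 25, n = 66 and a 62-bit dividend (it shifts right by 2 first). The unchecked Win64 constant $A3D70A3D70A3D71 with n = 66 and w = 64 does not pass. A failed check does not by itself prove that some input goes wrong, but here one does: x = 18446744073709551599 gives a quotient that is one too large.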
The multiply-by-100 part.
gcc uses the above sequence of lea/lea/shl, the same as in your question. Your answer is using a mov imm/mul sequence.
Your comments each say the version they chose is faster. If so, it's because of some subtle instruction alignment or other secondary effect: On Intel SnB-family, it's the same number of uops (3), and the same critical-path latency (mov imm is off the critical path, and mul is 3 cycles).
clang uses what I think is the best bet (imul rax, rdx, 100). I thought of it before I saw that clang chose it, not that that matters. That's 1 fused-domain uop (which can execute on p0 only), still with 3c latency. So if you're latency-bound using this routine for multi-precision, it probably won't help, but it is the best choice. (If you're latency-bound, inlining your code into a loop instead of passing one of the parameters through memory could save a lot of cycles.)
imul works because you're only using the low 64b of the result. There's no 2 or 3 operand form of mul because the low half of the result is the same regardless of signed or unsigned interpretation of the inputs.
BTW, clang with -march=native uses mulx for the 64x64->128, instead of mul, but doesn't gain anything by it. According to Agner Fog's tables, it's one cycle worse latency than mul.
AMD has worse than 3c latency for imul r,r,i (esp. the 64b version), which is maybe why gcc avoids it. IDK how much work gcc maintainers put into tweaking costs so settings like -mtune=haswell work well, but a lot of code isn't compiled with any -mtune setting (even one implied by -march), so I'm not surprised when gcc makes choices that were optimal for older CPUs, or for AMD.
clang still uses imul r64, r64, imm with -mtune=bdver1 (Bulldozer), which saves m-ops but at a cost of 1c latency more than using lea/lea/shl. (lea with a scale>1 is 2c latency on Bulldozer).
I found the solution with libdivide.h. Here is the slightly more complicated part for Win64:
{$ELSE WIN64}
asm
.NOFRAME
MOV RAX,[RCX]
MOV R8,RAX
XOR RDX,RDX
MOV R9,Div100Const // New: $47AE147AE147AE15
MUL R9 // Preliminary result Q in RDX
// Additional part: add/shift
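ADD RDX,R8 // Q := Q + X; the 65th bit of the sum lands in the carry flag
RCR RDX,1 // Q := Q shr 1, rotating the carry back in as the top bit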
SHR RDX,Div100PostShift // Q := Q shr 6;
MOV [RCX],RDX // X := Q;
// Faster than LEA and SHL
MOV RAX,RDX
MOV R9D,100
MUL R9
SUB R8,RAX
MOV RAX,R8 // Remainder
end;
{$ENDIF WIN32}
The code in @Rudy's answer results from the following steps:
Write 1/100 in binary form: 0.000000(10100011110101110000);
Count the leading zeroes after the binary point: S = 6;
The first 72 significant bits are:
1010 0011 1101 0111 0000 1010 0011 1101 0111 0000 1010 0011 1101 0111 0000 1010 0011 1101
Round to 65 bits; there is some kind of magic in how this rounding is performed. Reverse-engineering the constant from Rudy's answer shows that the correct rounding (which amounts to rounding up) is:
1010 0011 1101 0111 0000 1010 0011 1101 0111 0000 1010 0011 1101 0111 0000 1010 1
Remove the leading 1 bit:
0100 0111 1010 1110 0001 0100 0111 1010 1110 0001 0100 0111 1010 1110 0001 0101
Write in hexadecimal form (getting back the reverse-engineered constant):
A = 47 AE 14 7A E1 47 AE 15
X div 100 = (((uint128(X) * uint128(A)) shr 64) + X) shr 7 (7 = 1 + S)
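The formula can be tried out directly in C. This is a minimal sketch, assuming unsigned __int128 is available (a GCC/Clang extension); div100_add_shift is a made-up name:

```c
#include <stdint.h>

typedef unsigned __int128 u128;  /* GCC/Clang extension, assumed available */

/* X div 100 via the 65-bit constant 2^64 + A, following the formula above:
 * ((high 64 bits of X * A) + X) shr 7. The sum can need 65 bits, so it is
 * formed in 128-bit arithmetic here; the asm version keeps the extra bit
 * in the carry flag and rotates it back in with RCR. */
uint64_t div100_add_shift(uint64_t x)
{
    const uint64_t A = UINT64_C(0x47AE147AE147AE15);
    uint64_t hi = (uint64_t)(((u128)x * A) >> 64);
    return (uint64_t)(((u128)hi + x) >> 7);
}
```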

Problems with ADC/SBB and INC/DEC in tight loops on some CPUs

I am writing a simple BigInteger type in Delphi. It mainly consists of a dynamic array of TLimb, where a TLimb is a 32 bit unsigned integer, and a 32 bit size field, which also holds the sign bit for the BigInteger.
To add two BigIntegers, I create a new BigInteger of the appropriate size and then, after some bookkeeping, call the following procedure, passing it three pointers to the respective starts of the arrays for the left and right operand and the result, as well as the numbers of limbs for left and right, respectively.
Plain code:
class procedure BigInteger.PlainAdd(Left, Right, Result: PLimb; LSize, RSize: Integer);
asm
// EAX = Left, EDX = Right, ECX = Result
PUSH ESI
PUSH EDI
PUSH EBX
MOV ESI,EAX // Left
MOV EDI,EDX // Right
MOV EBX,ECX // Result
MOV ECX,RSize // Number of limbs at Right
MOV EDX,LSize // Number of limbs at Left
CMP EDX,ECX
JAE @SkipSwap
XCHG ECX,EDX // Left and LSize should be largest
XCHG ESI,EDI // so swap
@SkipSwap:
SUB EDX,ECX // EDX contains rest
PUSH EDX // ECX contains smaller size
XOR EDX,EDX
@MainLoop:
MOV EAX,[ESI + CLimbSize*EDX] // CLimbSize = SizeOf(TLimb) = 4.
ADC EAX,[EDI + CLimbSize*EDX]
MOV [EBX + CLimbSize*EDX],EAX
INC EDX
DEC ECX
JNE @MainLoop
POP EDI
INC EDI // Do not change Carry Flag
DEC EDI
JE @LastLimb
@RestLoop:
MOV EAX,[ESI + CLimbSize*EDX]
ADC EAX,ECX
MOV [EBX + CLimbSize*EDX],EAX
INC EDX
DEC EDI
JNE @RestLoop
@LastLimb:
ADC ECX,ECX // Add in final carry
MOV [EBX + CLimbSize*EDX],ECX
@Exit:
POP EBX
POP EDI
POP ESI
end;
// RET is inserted by Delphi compiler.
This code worked well, and I was pretty satisfied with it, until I noticed that, on my development setup (Win7 in a Parallels VM on an iMac) a simple PURE PASCAL addition routine, doing the same while emulating the carry with a variable and a few if clauses, was faster than my plain, straightforward handcrafted assembler routine.
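For readers who don't read assembler, the idea of a non-assembler routine that carries through a variable can be sketched roughly as follows. This is C rather than Pascal and is not the routine from the question: add_limbs is a made-up name, and it uses a 64-bit intermediate sum instead of if clauses:

```c
#include <stddef.h>
#include <stdint.h>

/* Adds two little-endian limb arrays. result must have room for
 * max(lsize, rsize) + 1 limbs. add_limbs is a hypothetical name. */
void add_limbs(const uint32_t *left, const uint32_t *right, uint32_t *result,
               size_t lsize, size_t rsize)
{
    if (lsize < rsize) {            /* make left the longer operand */
        const uint32_t *tp = left; left = right; right = tp;
        size_t ts = lsize; lsize = rsize; rsize = ts;
    }
    uint32_t carry = 0;
    size_t i = 0;
    for (; i < rsize; i++) {        /* main loop: both operands have limbs */
        uint64_t sum = (uint64_t)left[i] + right[i] + carry;
        result[i] = (uint32_t)sum;
        carry = (uint32_t)(sum >> 32);
    }
    for (; i < lsize; i++) {        /* rest loop: only the longer operand */
        uint64_t sum = (uint64_t)left[i] + carry;
        result[i] = (uint32_t)sum;
        carry = (uint32_t)(sum >> 32);
    }
    result[i] = carry;              /* final carry becomes the top limb */
}
```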
It took me a while to find out that on certain CPUs (including my iMac and an older laptop), the combination of DEC or INC and ADC or SBB could be extremely slow. But on most of my others (I have five other PCs to test it on, although four of these are exactly the same), it was quite fast.
So I wrote a new version, emulating INC and DEC using LEA and JECXZ instead, like so:
Part of emulating code:
@MainLoop:
MOV EAX,[ESI + EDX*CLimbSize]
LEA ECX,[ECX - 1] // Avoid INC and DEC, see above.
ADC EAX,[EDI + EDX*CLimbSize]
MOV [EBX + EDX*CLimbSize],EAX
LEA EDX,[EDX + 1]
JECXZ @DoRestLoop // LEA does not modify Zero flag, so JECXZ is used.
JMP @MainLoop
@DoRestLoop:
// similar code for the rest loop
That made my code on the "slow" machines almost three times as fast, but some 20% slower on the "faster" machines. So now, as initialization code, I do a simple timing loop and use that to decide if I'll set up the unit to call the plain or the emulated routine(s). This is almost always correct, but sometimes it chooses the (slower) plain routines when it should have chosen the emulating routines.
But I don't know if this is the best way to do this.
Question
I gave my solution, but do the asm gurus here perhaps know a better way to avoid the slowness on certain CPUs?
Update
Peter and Nils' answers helped me a lot to get on the right track. This is the main part of my final solution for the DEC version:
Plain code:
class procedure BigInteger.PlainAdd(Left, Right, Result: PLimb; LSize, RSize: Integer);
asm
PUSH ESI
PUSH EDI
PUSH EBX
MOV ESI,EAX // Left
MOV EDI,EDX // Right
MOV EBX,ECX // Result
MOV ECX,RSize
MOV EDX,LSize
CMP EDX,ECX
JAE @SkipSwap
XCHG ECX,EDX
XCHG ESI,EDI
@SkipSwap:
SUB EDX,ECX
PUSH EDX
XOR EDX,EDX
XOR EAX,EAX
MOV EDX,ECX
AND EDX,$00000003
SHR ECX,2
CLC
JE @MainTail
@MainLoop:
// Unrolled 4 times. More times will not improve speed anymore.
MOV EAX,[ESI]
ADC EAX,[EDI]
MOV [EBX],EAX
MOV EAX,[ESI + CLimbSize]
ADC EAX,[EDI + CLimbSize]
MOV [EBX + CLimbSize],EAX
MOV EAX,[ESI + 2*CLimbSize]
ADC EAX,[EDI + 2*CLimbSize]
MOV [EBX + 2*CLimbSize],EAX
MOV EAX,[ESI + 3*CLimbSize]
ADC EAX,[EDI + 3*CLimbSize]
MOV [EBX + 3*CLimbSize],EAX
// Update pointers.
LEA ESI,[ESI + 4*CLimbSize]
LEA EDI,[EDI + 4*CLimbSize]
LEA EBX,[EBX + 4*CLimbSize]
// Update counter and loop if required.
DEC ECX
JNE @MainLoop
@MainTail:
// Add index*CLimbSize so @MainX branches can fall through.
LEA ESI,[ESI + EDX*CLimbSize]
LEA EDI,[EDI + EDX*CLimbSize]
LEA EBX,[EBX + EDX*CLimbSize]
// Indexed jump.
LEA ECX,[@JumpsMain]
JMP [ECX + EDX*TYPE Pointer]
// Align jump table manually, with NOPs. Update if necessary.
NOP
// Jump table.
@JumpsMain:
DD @DoRestLoop
DD @Main1
DD @Main2
DD @Main3
@Main3:
MOV EAX,[ESI - 3*CLimbSize]
ADC EAX,[EDI - 3*CLimbSize]
MOV [EBX - 3*CLimbSize],EAX
@Main2:
MOV EAX,[ESI - 2*CLimbSize]
ADC EAX,[EDI - 2*CLimbSize]
MOV [EBX - 2*CLimbSize],EAX
@Main1:
MOV EAX,[ESI - CLimbSize]
ADC EAX,[EDI - CLimbSize]
MOV [EBX - CLimbSize],EAX
@DoRestLoop:
// etc...
I removed a lot of white space, and I guess the reader can get the rest of the routine. It is similar to the main loop. A speed improvement of approx. 20% for larger BigIntegers, and some 10% for small ones (only a few limbs).
The 64 bit version now uses 64 bit addition where possible (in the main loop and in Main3 and Main2, which are not "fall-through" like above). Before, 64 bit was quite a lot slower than 32 bit; now it is 30% faster than 32 bit and twice as fast as the original simple 64 bit loop.
Update 2
Intel proposes, in its Intel 64 and IA-32 Architectures Optimization Reference Manual, 3.5.2.6 Partial Flag Register Stalls -- Example 3-29:
XOR EAX,EAX
.ALIGN 16
@MainLoop:
ADD EAX,[ESI] // Sets all flags, so no partial flag register stall
ADC EAX,[EDI] // ADD added in previous carry, so its result might have carry
MOV [EBX],EAX
MOV EAX,[ESI + CLimbSize]
ADC EAX,[EDI + CLimbSize]
MOV [EBX + CLimbSize],EAX
MOV EAX,[ESI + 2*CLimbSize]
ADC EAX,[EDI + 2*CLimbSize]
MOV [EBX + 2*CLimbSize],EAX
MOV EAX,[ESI + 3*CLimbSize]
ADC EAX,[EDI + 3*CLimbSize]
MOV [EBX + 3*CLimbSize],EAX
SETC AL // Save carry for next iteration
MOVZX EAX,AL
ADD ESI,CUnrollIncrement*CLimbSize // LEA has slightly worse latency
ADD EDI,CUnrollIncrement*CLimbSize
ADD EBX,CUnrollIncrement*CLimbSize
DEC ECX
JNZ @MainLoop
The flag is saved in AL, and through MOVZX in EAX. It is added in through the first ADD in the loop. Then an ADC is needed, because the ADD might generate a carry. Also see comments.
Because the carry is saved in EAX, I can also use ADD to update the pointers. The first ADD in the loop also updates all flags, so ADC won't suffer from a partial flag register stall.
What you're seeing on old P6-family CPUs is a partial-flag stall.
Early Sandybridge-family handles merging more efficiently, and later SnB-family (e.g. Skylake) has no merging cost at all: uops that need both CF and some flags from the SPAZO group read them as 2 separate inputs.
Intel CPUs (other than P4) rename each flag bit separately, so JNE only depends on the last instruction that set all the flags it uses (in this case, just the Z flag). In fact, recent Intel CPUs can even internally combine an inc/jne into a single inc-and-branch uop (macro-fusion). However, the trouble comes when reading a flag bit that was left unmodified by the last instruction that updated any flags.
Agner Fog says Intel CPUs (even PPro / PII) don't stall on inc / jnz. It's not actually the inc/jnz that's stalling, it's the adc in the next iteration that has to read the CF flag after inc wrote other flags but left CF unmodified.
; Example 5.21. Partial flags stall when reading unmodified flag bits
cmp eax, ebx
inc ecx
jc xx
; Partial flags stall (P6 / PIII / PM / Core2 / Nehalem)
Agner Fog also says more generally: "Avoid code that relies on the fact that INC or DEC leaves the carry flag unchanged." (for Pentium M/Core2/Nehalem). The suggestion to avoid inc/dec entirely is obsolete, and only applied to P4. Other CPUs rename different parts of EFLAGS separately, and only have trouble when merging is required (reading a flag that was unmodified by the last insn to write any flags).
On the machines where it's fast (Sandybridge and later), they're inserting an extra uop to merge the flags register when you read bits that weren't written by the last instruction that modified it. This is much faster than stalling for 7 cycles, but still not ideal.
P4 always tracks whole registers, instead of renaming partial registers, not even EFLAGS. So inc/jz has a "false" dependency on whatever wrote the flags before it. This means that the loop condition can't detect the end of the loop until execution of the adc dep chain gets there, so the branch mispredict that can happen when the loop-branch stops being taken can't be detected early. It does prevent any partial-flags stalls, though.
Your lea / jecxz avoids the problem nicely. It's slower on SnB and later because you didn't unroll your loop at all. Your LEA version is 11 uops (can issue one iteration per 3 cycles), while the inc version is 7 uops (can issue one iter per 2 cycles), not counting the flag-merging uop it inserts instead of stalling.
If the loop instruction wasn't slow, it would be perfect for this. It's actually fast on AMD Bulldozer-family (1 m-op, same cost as a fused compare-and-branch), and Via Nano3000. It's bad on all Intel CPUs, though (7 uops on SnB-family).
Unrolling
When you unroll, you can get another small gain from using pointers instead of indexed addressing modes, because 2-reg addressing modes can't micro-fuse on SnB and later. A group of load/adc/store instructions is 6 uops without micro-fusion, but only 4 with micro-fusion. CPUs can issue 4 fused-domain uops/clock. (See Agner Fog's CPU microarch doc, and instruction tables, for details on this level.)
Save uops when you can to make sure the CPU can issue instructions faster than execute, to make sure it can see far enough ahead in the instruction stream to absorb any bubbles in insn fetch (e.g. branch mispredict). Fitting in the 28uop loop buffer also means power savings (and on Nehalem, avoiding instruction-decoding bottlenecks.) There are things like instruction alignment and crossing uop cache-line boundaries that make it hard to sustain a full 4 uops / clock without the loop buffer, too.
Another trick is to keep pointers to the end of your buffers, and count up towards zero. (So at the start of your loop, you get the first item as end[-idx].)
; pure loads are always one uop, so we can still index it
; with no perf hit on SnB
add esi, ecx ; point to end of src1
neg ecx
UNROLL equ 4
@MainLoop:
MOV EAX, [ESI + 0*CLimbSize + ECX*CLimbSize]
ADC EAX, [EDI + 0*CLimbSize]
MOV [EBX + 0*CLimbSize], EAX
MOV EAX, [ESI + 1*CLimbSize + ECX*CLimbSize]
ADC EAX, [EDI + 1*CLimbSize]
MOV [EBX + 1*CLimbSize], EAX
; ... repeated UNROLL times. Use an assembler macro to repeat these 3 instructions with increasing offsets
LEA ECX, [ECX+UNROLL] ; loop counter
LEA EDI, [EDI+CLimbSize*UNROLL] ; Unrolling makes it worth doing
LEA EBX, [EBX+CLimbSize*UNROLL] ; a separate increment to save a uop for every ADC and store on SnB & later.
JECXZ @DoRestLoop // LEA does not modify Zero flag, so JECXZ is used.
JMP @MainLoop
@DoRestLoop:
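The "pointer to the end, count up towards zero" idiom may be easier to see in C. This is only a sketch of the indexing pattern on a made-up example (sum_limbs), not of the adc chain itself:

```c
#include <stddef.h>
#include <stdint.h>

/* Sums n limbs while counting a negative index up towards zero, so the
 * loop condition is just "index != 0" with no compare against n.
 * sum_limbs is a hypothetical example function. */
uint64_t sum_limbs(const uint32_t *src, size_t n)
{
    const uint32_t *end = src + n;      /* one past the last element */
    uint64_t total = 0;
    for (ptrdiff_t i = -(ptrdiff_t)n; i != 0; i++)
        total += end[i];                /* first iteration reads end[-n] == src[0] */
    return total;
}
```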
An unroll of 4 should be good. No need to overdo it, since you're prob. going to be able to saturate the load/store ports of pre-Haswell with an unroll of just 3 or 4, maybe even 2.
An unroll of 2 will make the above loop exactly 14 fused-domain uops for Intel CPUs. adc is 2 ALU (+1 fused memory), jecxz is 2, the rest (including LEA) are all 1. In the unfused domain, 10 ALU/branch and 6 memory (well, 8 memory if you really count store-address and store-data separately).
14 fused-domain uops per iteration: issue one iteration per 4 clocks. (The odd 2 uops at the end have to issue as a group of 2, even from the loop buffer.)
10 ALU & branch uops: Takes 3.33c to execute them all on pre-Haswell. I don't think any one port will be a bottleneck, either: adc's uops can run on any port, and lea can run on p0/p1. The jumps use port5 (and jecxz also uses one of p0/p1).
6 memory operations: Takes 3c to execute on pre-Haswell CPUs, which can handle 2 per clock. Haswell added a dedicated AGU for stores so it can sustain 2load+1store/clock.
So for pre-haswell CPUs, using LEA/JECXZ, an unroll of 2 won't quite saturate either the ALU or the load/store ports. An unroll of 4 will bring it up to 22 fused uops (6 cycles to issue). 14 ALU&branch: 4.66c to execute. 12 memory: 6 cycles to execute. So an unroll of 4 will saturate pre-Haswell CPUs, but only just barely. The CPU won't have any buffer of instructions to churn through on a branch mispredict.
Haswell and later will always be bottlenecked on the frontend (4 uops per clock limit), because the load/adc/store combo takes 4 uops, and can be sustained at one per clock. So there's never any "room" for loop overhead without cutting into adc throughput. This is where you have to know not to overdo it and unroll too much.
On Broadwell / Skylake, adc is only a single uop with 1c latency, and load / adc r, m / store appears to be the best sequence. adc m, r/i is 4 uops. This should sustain one adc per clock, like AMD.
On AMD CPUs, adc is only one macro-op, so if the CPU can sustain an issue rate of 4 (i.e. no decoding bottlenecks), then they can also use their 2 load / 1 store port to beat Haswell. Also, jecxz on AMD is as efficient as any other branch: only one macro-op. Multi-precision math is one of the few things AMD CPUs are good at. Lower latencies on some integer instructions give them an advantage in some GMP routines.
An unroll of more than 5 might hurt performance on Nehalem, because that would make the loop bigger than the 28uop loop buffer. Instruction decoding would then limit you to less than 4 uops per clock. On even earlier (Core2), there's a 64B x86-instruction loop buffer (64B of x86 code, not uops), which helps some with decode.
Unless this adc routine is the only bottleneck in your app, I'd keep the unroll factor down to maybe 2. Or maybe even not unroll, if that saves a lot of prologue / epilogue code, and your BigInts aren't too big. You don't want to bloat the code too much and create cache misses when callers call lots of different BigInteger functions, like add, sub, mul, and do other things in between. Unrolling too much to win at microbenchmarks can shoot yourself in the foot if your program doesn't spend a long time in your inner loop on each call.
If your BigInt values aren't usually gigantic, then it's not just the loop you have to tune. A smaller unroll could be good to simplify the prologue/epilogue logic. Make sure you check lengths so ECX doesn't cross zero without ever being zero, of course. This is the trouble with unrolling and vectors. :/
Saving / restoring CF for old CPUs, instead of flag-less looping:
This might be the most efficient way:
lahf
# clobber flags
sahf ; cheap on AMD and Intel. This doesn't restore OF, but we only care about CF
# or
setc al
# clobber flags
add al, 255 ; generate a carry if al is non-zero
Using the same register as the adc dep chain isn't actually a problem: eax will always be ready at the same time as the CF output from the last adc. (On AMD and P4/Silvermont partial-reg writes have a false dep on the full reg. They don't rename partial regs separately). The save/restore is part of the adc dep chain, not the loop condition dep chain.
The loop condition only checks flags written by cmp, sub, or dec. Saving/restoring flags around it doesn't make it part of the adc dep chain, so the branch mispredict at the end of the loop can be detected before adc execution gets there. (A previous version of this answer got this wrong.)
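If the setc / add al, 255 pair looks like magic, this tiny C model shows the arithmetic. It is just an illustration; restore_carry is a made-up name:

```c
#include <stdint.h>

/* Models `setc al` followed later by `add al, 255`: the 8-bit add
 * overflows whenever al is non-zero, which recreates the saved carry.
 * restore_carry is a hypothetical name. */
unsigned restore_carry(uint8_t saved_al)
{
    unsigned wide = (unsigned)saved_al + 255u;  /* 9-bit result of the add */
    return wide >> 8;                           /* bit 8 is the carry out */
}
```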
There's almost certainly some room to shave off instructions in the setup code, maybe by using registers where values start. You don't have to use edi and esi for pointers, although I know it makes initial development easier when you're using registers in ways consistent with their "traditional" use. (e.g. destination pointer in EDI).
Does Delphi let you use ebp? It's nice to have a 7th register.
Obviously 64bit code would make your BigInt code run about twice as fast, even though you'd have to worry about doing a single 32b adc at the end of a loop of 64bit adc. It would also give you 2x the amount of registers.
There are so many x86 chips with vastly different timing in use that you can't realistically have optimal code for all of them. Your approach to have two known good functions and benchmark before use is already pretty advanced.
However, depending on the size of your BigIntegers you can likely improve your code by simple loop-unrolling. That will remove the loop overhead drastically.
E.g. you could execute a specialized block that does the addition of eight integers like this:
@AddEight:
MOV EAX,[ESI + EDX*CLimbSize + 0*CLimbSize]
ADC EAX,[EDI + EDX*CLimbSize + 0*CLimbSize]
MOV [EBX + EDX*CLimbSize + 0*CLimbSize],EAX
MOV EAX,[ESI + EDX*CLimbSize + 1*CLimbSize]
ADC EAX,[EDI + EDX*CLimbSize + 1*CLimbSize]
MOV [EBX + EDX*CLimbSize + 1*CLimbSize],EAX
MOV EAX,[ESI + EDX*CLimbSize + 2*CLimbSize]
ADC EAX,[EDI + EDX*CLimbSize + 2*CLimbSize]
MOV [EBX + EDX*CLimbSize + 2*CLimbSize],EAX
MOV EAX,[ESI + EDX*CLimbSize + 3*CLimbSize]
ADC EAX,[EDI + EDX*CLimbSize + 3*CLimbSize]
MOV [EBX + EDX*CLimbSize + 3*CLimbSize],EAX
MOV EAX,[ESI + EDX*CLimbSize + 4*CLimbSize]
ADC EAX,[EDI + EDX*CLimbSize + 4*CLimbSize]
MOV [EBX + EDX*CLimbSize + 4*CLimbSize],EAX
MOV EAX,[ESI + EDX*CLimbSize + 5*CLimbSize]
ADC EAX,[EDI + EDX*CLimbSize + 5*CLimbSize]
MOV [EBX + EDX*CLimbSize + 5*CLimbSize],EAX
MOV EAX,[ESI + EDX*CLimbSize + 6*CLimbSize]
ADC EAX,[EDI + EDX*CLimbSize + 6*CLimbSize]
MOV [EBX + EDX*CLimbSize + 6*CLimbSize],EAX
MOV EAX,[ESI + EDX*CLimbSize + 7*CLimbSize]
ADC EAX,[EDI + EDX*CLimbSize + 7*CLimbSize]
MOV [EBX + EDX*CLimbSize + 7*CLimbSize],EAX
LEA ECX,[ECX - 8]
Now you rebuild your loop, execute the above block as long as you have more than 8 elements to process and do the remaining few elements using the single element addition loop that you already have.
For large BigIntegers you'll spend most of the time in the unrolled part which should execute a lot faster now.
If you want it even faster, then write seven additional blocks that are specialized to the remaining element counts and branch to them based on the element count. This can best be done by storing the seven addresses in a lookup table, loading up the address from it and directly jumping into the specialized code.
For small element counts this completely removes the entire loop and for large elements you'll get the full benefit of the unrolled loop.
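The overall structure (unrolled main loop plus a branch into specialized tail code) looks roughly like this in C. It is a sketch with an unroll factor of 4 instead of 8 to keep it short; add_limb and add_unrolled4 are made-up names, and a compiler will not necessarily turn this into an adc chain:

```c
#include <stddef.h>
#include <stdint.h>

/* One limb of add-with-carry; returns the carry out (0 or 1). */
static uint32_t add_limb(uint32_t a, uint32_t b, uint32_t carry, uint32_t *out)
{
    uint64_t sum = (uint64_t)a + b + carry;
    *out = (uint32_t)sum;
    return (uint32_t)(sum >> 32);
}

/* Adds n limbs of a and b into r and returns the final carry.
 * The main loop is unrolled 4 times; the switch handles the 0..3
 * leftover limbs by falling through, like a jump table into
 * specialized tail code. */
uint32_t add_unrolled4(const uint32_t *a, const uint32_t *b, uint32_t *r, size_t n)
{
    uint32_t c = 0;
    size_t i = 0;
    for (; n - i >= 4; i += 4) {
        c = add_limb(a[i],     b[i],     c, &r[i]);
        c = add_limb(a[i + 1], b[i + 1], c, &r[i + 1]);
        c = add_limb(a[i + 2], b[i + 2], c, &r[i + 2]);
        c = add_limb(a[i + 3], b[i + 3], c, &r[i + 3]);
    }
    switch (n - i) {                    /* 0..3 limbs remain */
    case 3: c = add_limb(a[i], b[i], c, &r[i]); i++;  /* fall through */
    case 2: c = add_limb(a[i], b[i], c, &r[i]); i++;  /* fall through */
    case 1: c = add_limb(a[i], b[i], c, &r[i]);
    default: break;
    }
    return c;
}
```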

Experimental OS in assembly - can't show a character on the screen (pmode)

I hope there's some experienced assembly/os developer here, even if my problem is not a huge one.
I am trying to play with assembly and create a small operating system. In fact, what I want is a boot-loader and a second boot-loader that activates pmode and displays a single char on the screen, using the video memory (not with interrupts, evidently).
I am using VirtualBox to emulate the code, which I paste manually inside a VHD disk (two sectors of code)
First, my code:
boot.asm
This is the first boot-loader
bits 16
org 0
mov al, dl
jmp 07c0h:Start
Start:
cli
push ax
mov ax, cs
mov ds, ax
mov es, ax
pop ax
sti
jmp ReadDisk
ReadDisk:
call ResetDisk
mov bx, 0x1000
mov es, bx
mov bx, 0x0000
mov dl, al
mov ah, 0x02
mov al, 0x01
mov ch, 0x00
mov cl, 0x02
mov dh, 0x00
int 0x13
jc ReadDisk
jmp 0x1000:0x0000
ResetDisk:
mov ah, 0x00
mov dl, al
int 0x13
jc ResetDisk
ret
times 510 - ($ - $$) db 0
dw 0xAA55
boot2.asm
This is the second boot-loader, pasted on the second sector (next 512 bytes)
bits 16
org 0
jmp 0x1000:Start
InstallGDT:
cli
pusha
lgdt [GDT]
sti
popa
ret
StartGDT:
dd 0
dd 0
dw 0ffffh
dw 0
db 0
db 10011010b
db 11001111b
db 0
dw 0ffffh
dw 0
db 0
db 10010010b
db 11001111b
db 0
StopGDT:
GDT:
dw StopGDT - StartGDT - 1
dd StartGDT + 10000h
OpenA20:
cli
pusha
call WaitInput
mov al, 0xad
out 0x64, al
call WaitInput
mov al, 0xd0
out 0x64, al
call WaitInput
in al, 0x60
push eax
call WaitInput
mov al, 0xd1
out 0x64, al
call WaitInput
pop eax
or al, 2
out 0x60, al
call WaitInput
mov al, 0xae
out 0x64, al
call WaitInput
popa
sti
ret
WaitInput:
in al, 0x64
test al, 2
jnz WaitInput
ret
WaitOutput:
in al, 0x64
test al, 1
jz WaitOutput
ret
Start:
cli
xor ax, ax
mov ds, ax
mov es, ax
mov ax, 0x9000
mov ss, ax
mov sp, 0xffff
sti
call InstallGDT
call OpenA20
ProtectedMode:
cli
mov eax, cr0
or eax, 1
mov cr0, eax
jmp 08h:ShowChar
bits 32
ShowChar:
mov ax, 0x10
mov ds, ax
mov ss, ax
mov es, ax
mov esp, 90000h
pusha ; save registers
mov edi, 0xB8000
mov bl, '.'
mov dl, bl ; Get character
mov dh, 63 ; the character attribute
mov word [edi], dx ; write to video display
popa
cli
hlt
So, I compile this code and paste the binary into the VHD, then run the system in VirtualBox. I can see that it enters pmode correctly, the A20 gate is enabled and the GDTR contains a memory address (I have no idea whether it is the correct one). This is the part of the log file that may be of interest:
00:00:07.852082 ****************** Guest state at power off ******************
00:00:07.852088 Guest CPUM (VCPU 0) state:
00:00:07.852096 eax=00000011 ebx=00000000 ecx=00010002 edx=00000080 esi=0000f4a0 edi=0000fff0
00:00:07.852102 eip=0000016d esp=0000ffff ebp=00000000 iopl=0 nv up di pl zr na po nc
00:00:07.852108 cs={1000 base=0000000000010000 limit=0000ffff flags=0000009b} dr0=00000000 dr1=00000000
00:00:07.852118 ds={0000 base=0000000000000000 limit=0000ffff flags=00000093} dr2=00000000 dr3=00000000
00:00:07.852124 es={0000 base=0000000000000000 limit=0000ffff flags=00000093} dr4=00000000 dr5=00000000
00:00:07.852129 fs={0000 base=0000000000000000 limit=0000ffff flags=00000093} dr6=ffff0ff0 dr7=00000400
00:00:07.852136 gs={0000 base=0000000000000000 limit=0000ffff flags=00000093} cr0=00000011 cr2=00000000
00:00:07.852141 ss={9000 base=0000000000090000 limit=0000ffff flags=00000093} cr3=00000000 cr4=00000000
00:00:07.852148 gdtr=0000000000539fc0:003d idtr=0000000000000000:ffff eflags=00000006
00:00:07.852155 ldtr={0000 base=00000000 limit=0000ffff flags=00000082}
00:00:07.852158 tr ={0000 base=00000000 limit=0000ffff flags=0000008b}
00:00:07.852162 SysEnter={cs=0000 eip=00000000 esp=00000000}
00:00:07.852166 FCW=037f FSW=0000 FTW=0000 FOP=0000 MXCSR=00001f80 MXCSR_MASK=0000ffff
00:00:07.852172 FPUIP=00000000 CS=0000 Rsrvd1=0000 FPUDP=00000000 DS=0000 Rsvrd2=0000
00:00:07.852177 ST(0)=FPR0={0000'00000000'00000000} t0 +0.0000000000000000000000 ^ 0
00:00:07.852185 ST(1)=FPR1={0000'00000000'00000000} t0 +0.0000000000000000000000 ^ 0
00:00:07.852193 ST(2)=FPR2={0000'00000000'00000000} t0 +0.0000000000000000000000 ^ 0
00:00:07.852201 ST(3)=FPR3={0000'00000000'00000000} t0 +0.0000000000000000000000 ^ 0
00:00:07.852209 ST(4)=FPR4={0000'00000000'00000000} t0 +0.0000000000000000000000 ^ 0
00:00:07.852222 ST(5)=FPR5={0000'00000000'00000000} t0 +0.0000000000000000000000 ^ 0
00:00:07.852229 ST(6)=FPR6={0000'00000000'00000000} t0 +0.0000000000000000000000 ^ 0
00:00:07.852236 ST(7)=FPR7={0000'00000000'00000000} t0 +0.0000000000000000000000 ^ 0
00:00:07.852244 XMM0 =00000000'00000000'00000000'00000000 XMM1 =00000000'00000000'00000000'00000000
00:00:07.852253 XMM2 =00000000'00000000'00000000'00000000 XMM3 =00000000'00000000'00000000'00000000
00:00:07.852262 XMM4 =00000000'00000000'00000000'00000000 XMM5 =00000000'00000000'00000000'00000000
00:00:07.852270 XMM6 =00000000'00000000'00000000'00000000 XMM7 =00000000'00000000'00000000'00000000
00:00:07.852280 XMM8 =00000000'00000000'00000000'00000000 XMM9 =00000000'00000000'00000000'00000000
00:00:07.852287 XMM10=00000000'00000000'00000000'00000000 XMM11=00000000'00000000'00000000'00000000
00:00:07.852295 XMM12=00000000'00000000'00000000'00000000 XMM13=00000000'00000000'00000000'00000000
00:00:07.852302 XMM14=00000000'00000000'00000000'00000000 XMM15=00000000'00000000'00000000'00000000
00:00:07.852310 EFER =0000000000000000
00:00:07.852312 PAT =0007040600070406
00:00:07.852316 STAR =0000000000000000
00:00:07.852318 CSTAR =0000000000000000
00:00:07.852320 LSTAR =0000000000000000
00:00:07.852322 SFMASK =0000000000000000
00:00:07.852324 KERNELGSBASE =0000000000000000
00:00:07.852327 ***
00:00:07.852334 Guest paging mode: Protected (changed 5 times), A20 enabled (changed 2 times)
So, this is the status of the processor at the end of the test.
The problem is that I cannot see the character on the screen. It could be a memory problem (I must admit I'm not very good at memory addressing), such as wrong contents in a segment register, or it could be related to the way I am trying to use video memory to show the character, or it could be something else. What do you think is wrong? Thanks so much!
Update
The problem is related to memory addressing. The ShowChar instructions are not executed; I verified this in the log file. What I know is that everything is executed correctly up to this line:
jmp 08h:ShowChar
So, this might be related to wrong segment registers, wrong GDTR or something else related to memory addressing.
Update
I changed the GDT pointer to be a linear address instead of a segment:offset one, but I still don't see the character. The problem is that I can't figure out the origin of the problem, because I can't verify whether the GDT is correct. I can see the contents of all the registers, but how can I know that the GDTR (which at the moment is 0000000000ff53f0:00e9) is correct? I'm supposing that the ShowChar function is not executed because of a wrong GDT, but that is only a supposition.
The problem is that, despite all your work to put the character and attribute into DX:
mov bl, '.'
mov dl, bl ; Get character
mov dh, CHAR_ATTRIB ; the character attribute
you end up writing word 63 into the screen buffer:
mov word [edi], 63 ; write to video display
which is a question mark with a zero attribute byte, i.e. a black question mark on a black background.
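To make the point above concrete, here is a minimal Python sketch of how a 16-bit text-mode cell splits into a character and an attribute. The function name `decode_vga_cell` is made up for illustration; the layout assumed is the standard one (low byte is the character, high byte is the attribute).

```python
def decode_vga_cell(word):
    """Split a 16-bit VGA text-mode cell into (character, foreground, background)."""
    char = chr(word & 0xFF)          # low byte: character code
    attr = (word >> 8) & 0xFF        # high byte: attribute
    foreground = attr & 0x0F         # low nibble of the attribute
    background = (attr >> 4) & 0x0F  # high nibble (its top bit can mean blink, depending on the mode)
    return char, foreground, background

# Writing the bare word 63 (0x003F): character '?', attribute 0.
print(decode_vga_cell(63))                     # ('?', 0, 0)

# Character '.' in the low byte and 63 in the high byte, as DX was set up.
print(decode_vga_cell((63 << 8) | ord('.')))   # ('.', 15, 3)
```

With the attribute in the high byte, 63 (0x3F) gives foreground 15 on background 3, which is a visible cell.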
I'm not very experienced with this, but...
GDT:
dw StopGDT - StartGDT - 1
dd StartGDT
Doesn't this need to be an "absolute" (not seg:offs) address? Since you've loaded it at segment 1000h, I would expect dd StartGDT + 10000h to be right here. No?
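As a rough sketch of that arithmetic in Python: a real-mode segment:offset pair maps to the linear address segment * 16 + offset. The offset `0x0011` below is a hypothetical value for StartGDT, chosen only for illustration; the real value depends on the assembled binary.

```python
def linear_address(segment, offset):
    """Real-mode translation: linear = segment * 16 + offset."""
    return (segment << 4) + offset

LOAD_SEGMENT = 0x1000        # the second stage is loaded at 0x1000:0x0000
start_gdt_offset = 0x0011    # hypothetical offset of StartGDT within the binary

gdt_base = linear_address(LOAD_SEGMENT, start_gdt_offset)
print(hex(gdt_base))                        # 0x10011
print(hex(linear_address(0x07C0, 0x0000)))  # 0x7c00, where the first stage runs
```

This is why adding 10000h to an offset assembled with org 0 gives the address that lgdt needs.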
Here is a working minimalist bootloader that switches to protected mode and prints an "X" to the VGA text buffer, using QEMU (so there is no need to read the disk).
[org 0x7C00]
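cli
; lgdt reads its operand through DS, so make sure DS is 0 to match org 0x7C00
xor ax, ax
mov ds, ax
lgdt [gdt_descriptor]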
; Enter PM
mov eax, cr0
or eax, 0x1
mov cr0, eax
; 1 GDT entry is 8B, code segment is 2nd entry (after null entry), so
; jump to code segment at 0x08 and load init_pm from there
jmp 0x8:init_pm
[bits 32]
init_pm :
; Data segment is the 3rd entry (index 2) in the GDT, so its selector is 2*8B = 0x10
mov ax, 0x10
mov ds, ax
mov ss, ax
mov es, ax
mov fs, ax
mov gs, ax
;Print a cyan X
;Note that this is printed over the previously printed Qemu screen
mov al, 'X'
mov ah, 3 ; cyan
mov edx, 0xb8004
mov [edx], ax
jmp $
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
[bits 16]
GDT:
;null :
dd 0x0
dd 0x0
;code :
dw 0xffff ;Limit
dw 0x0 ;Base
db 0x0 ;Base
db 0b10011010 ;1st flag, Type flag
db 0b11001111 ;2nd flag, Limit
db 0x0 ;Base
;data :
dw 0xffff
dw 0x0
db 0x0
db 0b10010010
db 0b11001111
db 0x0
gdt_descriptor :
dw $ - GDT - 1 ;16-bit size
dd GDT ;32-bit start address
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Bootsector padding
times 510-($-$$) db 0
dw 0xaa55
Then, do:
nasm boot.asm
qemu boot
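To check what the GDT entries in the bootloader above actually describe, here is a small Python sketch that decodes the same 8 bytes the `dw`/`db` lines emit. The helper name `decode_descriptor` and the returned field names are made up for illustration; the bit layout follows the standard segment descriptor format.

```python
import struct

def decode_descriptor(raw):
    """Decode an 8-byte segment descriptor into its main fields."""
    limit_low, base_low, base_mid, access, flags_limit, base_high = struct.unpack("<HHBBBB", raw)
    base = base_low | (base_mid << 16) | (base_high << 24)
    limit = limit_low | ((flags_limit & 0x0F) << 16)
    granular_4k = bool(flags_limit & 0x80)
    # With 4 KiB granularity the limit counts pages, otherwise bytes.
    size = (limit + 1) << 12 if granular_4k else limit + 1
    return {
        "base": base,
        "limit": limit,
        "size": size,
        "present": bool(access & 0x80),
        "executable": bool(access & 0x08),
        "default_32bit": bool(flags_limit & 0x40),
    }

# Same field order as the listing: limit, base low, base mid, access, flags+limit, base high.
code = struct.pack("<HHBBBB", 0xFFFF, 0x0, 0x0, 0b10011010, 0b11001111, 0x0)
data = struct.pack("<HHBBBB", 0xFFFF, 0x0, 0x0, 0b10010010, 0b11001111, 0x0)

code_fields = decode_descriptor(code)
data_fields = decode_descriptor(data)
print(hex(code_fields["base"]), hex(code_fields["size"]))        # 0x0 0x100000000
print(code_fields["executable"], data_fields["executable"])      # True False
```

Both entries are flat 4 GiB segments with base 0, so an offset used in protected mode is the same number as the linear address.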
To write to video memory in standard VGA text mode, you write data starting at linear address 0xb8000 (segment 0xb800, offset 0 in real mode). Each cell is 16 bits: the character in the low byte and the attribute in the high byte. For a simple white-on-black character, OR the character value with 15 << 8 to get a 16-bit unsigned short, then write those 16 bits to the cell's memory location to draw the character.
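A minimal Python sketch of that calculation follows. It assumes the usual 80-column text mode; the names `cell_address` and `make_cell` are invented for the example.

```python
VGA_TEXT_BASE = 0xB8000   # start of the text buffer
COLUMNS = 80              # assumption: standard 80x25 text mode

def cell_address(row, col):
    """Linear address of the 2-byte cell at (row, col)."""
    return VGA_TEXT_BASE + 2 * (row * COLUMNS + col)

def make_cell(char, attribute=15):
    """Character in the low byte, attribute in the high byte (15 = white on black)."""
    return (attribute << 8) | ord(char)

print(hex(make_cell('A')))        # 0xf41
print(hex(cell_address(0, 2)))    # 0xb8004, third cell of the first row
print(hex(cell_address(1, 0)))    # 0xb80a0, first cell of the second row
```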
The problem is your use of the ORG directive and mixing up real mode and protected mode addressing schemes. You are right about your 32 bit code not being executed. When the CPU executes this code:
jmp 08h:ShowChar
It jumps to somewhere in the currently loaded Interrupt Vector Table, at the beginning of memory instead of your 32 bit code. Why? Because the base of your defined code segment is 0, and you told your assembler to resolve addresses relative to 0:
Org 0
Thus the CPU is actually jumping to an address that is numerically equal to (0 + the offset of the first instruction of your ShowChar code) (i.e. Code Segment Base + Offset)
To rectify this issue, change:
Org 0
Into
Org 0x10000
Normally you would then need to change your segment registers to match. In this case, though, the segment registers you set were incorrect for the origin directive you originally specified but are valid once the origin is changed as above, so no further changes are needed. As a side note, the incorrect origin directive also explains why your GDT address appeared to be garbage: your pointer to the GDT parameters (the 'GDT' label) actually pointed to somewhere near the beginning of the Interrupt Vector Table, so that is what your lgdt instruction loaded.
Anyway, simply changing the origin directive as shown above should fix the problem.
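The jump-target arithmetic described above can be sketched in Python. The offset `0x0172` for ShowChar is hypothetical (the real value depends on the assembled binary), and the function names are made up for the example.

```python
def decode_selector(selector):
    """Split a segment selector into GDT/LDT index, table indicator and privilege level."""
    return {
        "index": selector >> 3,
        "table": "LDT" if selector & 0x4 else "GDT",
        "rpl": selector & 0x3,
    }

def far_jump_target(segment_base, offset):
    """Protected-mode linear address: descriptor base + offset."""
    return (segment_base + offset) & 0xFFFFFFFF

LOAD_ADDRESS = 0x10000       # second stage is loaded at 0x1000:0x0000
CODE_SEGMENT_BASE = 0        # base of the code descriptor in the GDT
show_char_offset = 0x0172    # hypothetical offset of ShowChar in the binary
IVT_END = 0x400              # the real-mode IVT occupies 0x000-0x3FF

print(decode_selector(0x08))   # {'index': 1, 'table': 'GDT', 'rpl': 0}

with_org_0 = far_jump_target(CODE_SEGMENT_BASE, show_char_offset)
with_org_load = far_jump_target(CODE_SEGMENT_BASE, LOAD_ADDRESS + show_char_offset)
print(hex(with_org_0), with_org_0 < IVT_END)   # 0x172 True
print(hex(with_org_load))                      # 0x10172
```

With a base of 0, the label value has to be the full linear address for the jump to land in the loaded code.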
By the way, your code looks awfully similar to the code over at http://www.brokenthorn.com/Resources/OSDev8.html
Especially the demo code provided at the bottom of the page of
http://www.brokenthorn.com/Resources/OSDev10.html
Interesting..

Memory and data assembly

I can't figure out how data in tables is addressed in different situations in assembly.
I have the following simple program:
section .data
Digits: db "0123456789ABCDEF"
Sums: dd 15,12,6,0,21,14,4,0,0,19
Sums2: db 15,12,6,0,21,14,4,0,0,19
section .text
global _start
_start:
nop ; Put your experiments between the two nops...
mov ecx,2
mov al, byte [Sums+ecx*2]
mov bl, byte [Sums2+ecx*2]
mov dl, byte [Digits+ecx*2]
nop
Now, when I debug it instruction by instruction I look at the registers and can't understand what is happening.
rcx --> as expected it contains the decimal 2
rdx --> as expected it contains the hexadecimal 34, which is the ASCII code for the character 4
rax --> has 0xc, which is the form feed (new page) character
rbx --> has 0x15, which is the negative acknowledge (NAK) character
I expected to find 6 in rax and 1 in rbx. I can't figure out why that is not happening. I'm on a little-endian architecture. Thanks
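One way to reason about these values is to lay the three tables out as raw bytes and index them the way the effective address `[table + ecx*2]` does. The Python sketch below does that; it assumes `dd` emits little-endian 32-bit values and `db` emits single bytes, which is how NASM behaves on x86.

```python
import struct

digits = b"0123456789ABCDEF"                                   # db string: 1 byte per element
sums = struct.pack("<10I", 15, 12, 6, 0, 21, 14, 4, 0, 0, 19)  # dd: 4 bytes per element
sums2 = bytes([15, 12, 6, 0, 21, 14, 4, 0, 0, 19])             # db: 1 byte per element

ecx = 2
offset = ecx * 2          # the scale factor in the address is 2, so the byte offset is 4

al = sums[offset]         # byte 4 of Sums is the low byte of the second dword (12)
bl = sums2[offset]        # byte 4 of Sums2 is its fifth element (21)
dl = digits[offset]       # byte 4 of Digits is the character '4'
print(hex(al), hex(bl), hex(dl))   # 0xc 0x15 0x34

# To reach element number ecx, the scale has to equal the element size.
print(sums[ecx * 4], sums2[ecx * 1])   # 6 6
```

The scale factor is a plain byte multiplier; the assembler does not adjust it for the element size of the table.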

Resources