How do you get the ICC compiler to generate SSE instructions within an inner loop?

I have an inner loop such as this
for(i=0 ;i<n;i++){
x[0] += A[i] * z[0];
x[1] += A[i] * z[1];
x[2] += A[i] * z[2];
x[3] += A[i] * z[3];
}
The four statements in the loop body can easily be converted to SSE instructions by a compiler. Do current compilers do this? If they do, what do I have to do to force the compiler to do it?

From what you've provided, this can't be vectorized, because the pointers could alias each other, i.e. the x array could overlap with A or z.
A simple way to help the compiler out would be to declare x as __restrict. Another way would be to rewrite it like so:
for(i=0 ;i<n;i++)
{
float Ai=A[i];
float z0=z[0], z1=z[1], z2=z[2], z3=z[3];
x[0] += Ai * z0;
x[1] += Ai * z1;
x[2] += Ai * z2;
x[3] += Ai * z3;
}
I've never actually tried to get a compiler to auto-vectorize code, so I don't know if that will do it or not. Even if it doesn't get vectorized, it should be faster since the loads and stores can be ordered more efficiently and without causing a load-hit-store.
If you have more information than the compiler does (e.g. whether or not your pointers are 16-byte aligned), you should be able to use that to your advantage (e.g. by using aligned loads). Note that I'm not saying you should always try to beat the compiler, only when you know more than it does.
Further reading:
Load-hit-stores and the __restrict keyword
Memory Optimization (aliasing starts around slide 35)
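If the compiler will not vectorize the loop for you, the same computation can also be written by hand with SSE intrinsics. The following is only a sketch under assumptions: the function name axpy4_sse is made up for illustration, x and z are assumed to hold at least four floats each and not to overlap A, and unaligned loads are used because nothing in the question guarantees 16-byte alignment.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: x[k] += A[i] * z[k] for k in 0..3 and i in [0, n).
   Assumes x and z do not alias A, so both can be loaded once up front. */
static void axpy4_sse(float *x, const float *A, const float *z, int n)
{
    assert(n >= 0);
    __m128 acc = _mm_loadu_ps(x);        /* x[0..3] kept in one register */
    const __m128 zv = _mm_loadu_ps(z);   /* z[0..3] loaded once */
    for (int i = 0; i < n; i++) {
        __m128 a = _mm_set1_ps(A[i]);    /* broadcast A[i] to all 4 lanes */
        acc = _mm_add_ps(acc, _mm_mul_ps(a, zv));
    }
    _mm_storeu_ps(x, acc);               /* write the result back once */
}
```

If x and z are known to be 16-byte aligned, _mm_load_ps and _mm_store_ps could replace the unaligned variants.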

ICC auto-vectorizes the below code snippet for SSE2 by default:
void foo(float *__restrict__ x, float *__restrict__ A, float *__restrict__ z, int n){
for(int i=0;i<n;i++){
x[0] += A[i] * z[0];
x[1] += A[i] * z[1];
x[2] += A[i] * z[2];
x[3] += A[i] * z[3];
}
return;
}
With the restrict qualifier, the compiler is allowed to assume that the pointers do not alias each other. The vectorization report generated is:
$ icpc test.cc -c -vec-report2 -S
test.cc(2): (col. 1) remark: PERMUTED LOOP WAS VECTORIZED
test.cc(3): (col. 2) remark: loop was not vectorized: not inner loop
To confirm if SSE instructions are generated, open the ASM generated (test.s) and you will find the following instructions:
..B1.13: # Preds ..B1.13 ..B1.12
movaps (%rsi,%r15,4), %xmm10 #3.10
movaps 16(%rsi,%r15,4), %xmm11 #3.10
mulps %xmm0, %xmm10 #3.17
mulps %xmm0, %xmm11 #3.17
addps %xmm10, %xmm9 #3.2
addps %xmm11, %xmm6 #3.2
movaps 32(%rsi,%r15,4), %xmm12 #3.10
movaps 48(%rsi,%r15,4), %xmm13 #3.10
movaps 64(%rsi,%r15,4), %xmm14 #3.10
movaps 80(%rsi,%r15,4), %xmm15 #3.10
movaps 96(%rsi,%r15,4), %xmm10 #3.10
movaps 112(%rsi,%r15,4), %xmm11 #3.10
addq $32, %r15 #2.1
mulps %xmm0, %xmm12 #3.17
cmpq %r13, %r15 #2.1
mulps %xmm0, %xmm13 #3.17
mulps %xmm0, %xmm14 #3.17
addps %xmm12, %xmm5 #3.2
mulps %xmm0, %xmm15 #3.17
addps %xmm13, %xmm4 #3.2
mulps %xmm0, %xmm10 #3.17
addps %xmm14, %xmm7 #3.2
mulps %xmm0, %xmm11 #3.17
addps %xmm15, %xmm3 #3.2
addps %xmm10, %xmm2 #3.2
addps %xmm11, %xmm1 #3.2
jb ..B1.13 # Prob 75% #2.1
# LOE rax rdx rsi r8 r9 r10 r13 r15 ecx ebp edi r11d r14d bl xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9
..B1.14: # Preds ..B1.13
addps %xmm6, %xmm9 #3.2
addps %xmm4, %xmm5 #3.2
addps %xmm3, %xmm7 #3.2
addps %xmm1, %xmm2 #3.2
addps %xmm5, %xmm9 #3.2
addps %xmm2, %xmm7 #3.2
lea 1(%r14), %r12d #2.1
cmpl %r12d, %ecx #2.1
addps %xmm7, %xmm9 #3.2
jb ..B1.25 # Prob 50% #2.1

Related

Julia massively outperforms Delphi. Obsolete asm code by Delphi compiler?

I wrote a simple for loop in Delphi.
The same program is 7.6 times faster in Julia 1.6.
procedure TfrmTester.btnForLoopClick(Sender: TObject);
VAR
i, Total, Big, Small: Integer;
s: string;
begin
TimerStart;
Total:= 0;
Big := 0;
Small:= 0;
for i:= 1 to 1000000000 DO //1 billion
begin
Total:= Total+1;
if Total > 500000
then Big:= Big+1
else Small:= Small+1;
end;
s:= TimerElapsedS;
//here code to show Big/Small on the screen
end;
The ASM code seems decent to me:
TesterForm.pas.111: TimerStart;
007BB91D E8DE7CF9FF call TimerStart
TesterForm.pas.113: Total:= 0;
007BB922 33C0 xor eax,eax
007BB924 8945F4 mov [ebp-$0c],eax
TesterForm.pas.114: Big := 0;
007BB927 33C0 xor eax,eax
007BB929 8945F0 mov [ebp-$10],eax
TesterForm.pas.115: Small:= 0;
007BB92C 33C0 xor eax,eax
007BB92E 8945EC mov [ebp-$14],eax
TesterForm.pas.116: for i:= 1 to 1000000000 DO //1 billion
007BB931 C745F801000000 mov [ebp-$08],$00000001
TesterForm.pas.118: Total:= Total+1;
007BB938 FF45F4 inc dword ptr [ebp-$0c]
TesterForm.pas.119: if Total > 500000
007BB93B 817DF420A10700 cmp [ebp-$0c],$0007a120
007BB942 7E05 jle $007bb949
TesterForm.pas.120: then Big:= Big+1
007BB944 FF45F0 inc dword ptr [ebp-$10]
007BB947 EB03 jmp $007bb94c
TesterForm.pas.121: else Small:= Small+1;
007BB949 FF45EC inc dword ptr [ebp-$14]
TesterForm.pas.122: end;
007BB94C FF45F8 inc dword ptr [ebp-$08]
TesterForm.pas.116: for i:= 1 to 1000000000 DO //1 billion
007BB94F 817DF801CA9A3B cmp [ebp-$08],$3b9aca01
007BB956 75E0 jnz $007bb938
TesterForm.pas.124: s:= TimerElapsedS;
007BB958 8D45E8 lea eax,[ebp-$18]
How can it be that Delphi has such a pathetic score compared with Julia?
Can I do anything to improve the code generated by the compiler?
Info
My Delphi 10.4.2 program is Win32 bit. Of course, I run in "Release" mode :)
But the ASM code above is for the "Debug" version because I don't know how to pause the execution of the program when I run an optimized EXE file. But the difference between a Release and a Debug exe is pretty small (1.8 vs 1.5 sec). Julia does it in 195ms.
More discussions
I do have to mention that when you run the code in Julia for the first time, its time is ridiculously high, because Julia is JIT-compiled, so it has to compile the code first. The compilation time (since it is "one-time") was not included in the measurement.
Also, as AmigoJack commented, Delphi code will run pretty much everywhere, while Julia code will probably only run on computers that have a modern CPU supporting all those new/fancy instructions. I do have small tools that I produced back in 2004 and that still run today.
Whatever code Julia produces cannot be delivered to "customers" unless they have Julia installed.
Anyway, all this being said, it is sad that the Delphi compiler is so outdated.
I ran other tests: finding the shortest and longest string in a list of strings is 10x faster in Delphi than in Julia. Allocating small blocks of memory (10000x10000x4 bytes) has the same speed.
As AhnLab mentioned, I ran pretty "dry" tests. I guess a full program that performs more complex/realistic tasks needs to be written to see whether Julia still outperforms Delphi 7x.
Update
Ok, the Julia code seems totally alien to me. Seems to use more modern ops:
; ┌ # Julia_vs_Delphi.jl:4 within `for_fun`
pushq %rbp
movq %rsp, %rbp
subq $96, %rsp
vmovdqa %xmm11, -16(%rbp)
vmovdqa %xmm10, -32(%rbp)
vmovdqa %xmm9, -48(%rbp)
vmovdqa %xmm8, -64(%rbp)
vmovdqa %xmm7, -80(%rbp)
vmovdqa %xmm6, -96(%rbp)
movq %rcx, %rax
; │ # Julia_vs_Delphi.jl:8 within `for_fun`
; │┌ # range.jl:5 within `Colon`
; ││┌ # range.jl:354 within `UnitRange`
; │││┌ # range.jl:359 within `unitrange_last`
testq %rdx, %rdx
; │└└└
jle L80
; │ # Julia_vs_Delphi.jl within `for_fun`
movq %rdx, %rcx
sarq $63, %rcx
andnq %rdx, %rcx, %r9
; │ # Julia_vs_Delphi.jl:13 within `for_fun`
cmpq $8, %r9
jae L93
; │ # Julia_vs_Delphi.jl within `for_fun`
movl $1, %r10d
xorl %edx, %edx
xorl %r11d, %r11d
jmp L346
L80:
xorl %edx, %edx
xorl %r11d, %r11d
xorl %r9d, %r9d
jmp L386
L93: movabsq $9223372036854775800, %r8 # imm = 0x7FFFFFFFFFFFFFF8
; │ # Julia_vs_Delphi.jl:13 within `for_fun`
andq %r9, %r8
leaq 1(%r8), %r10
movabsq $.rodata.cst32, %rcx
vmovdqa (%rcx), %ymm1
vpxor %xmm0, %xmm0, %xmm0
movabsq $.rodata.cst8, %rcx
vpbroadcastq (%rcx), %ymm2
movabsq $1023787240, %rcx # imm = 0x3D05C0E8
vpbroadcastq (%rcx), %ymm3
movabsq $1023787248, %rcx # imm = 0x3D05C0F0
vpbroadcastq (%rcx), %ymm5
vpcmpeqd %ymm6, %ymm6, %ymm6
movabsq $1023787256, %rcx # imm = 0x3D05C0F8
vpbroadcastq (%rcx), %ymm7
movq %r8, %rcx
vpxor %xmm4, %xmm4, %xmm4
vpxor %xmm8, %xmm8, %xmm8
vpxor %xmm9, %xmm9, %xmm9
nopw %cs:(%rax,%rax)
; │ # Julia_vs_Delphi.jl within `for_fun`
L224:
vpaddq %ymm2, %ymm1, %ymm10
; │ # Julia_vs_Delphi.jl:10 within `for_fun`
vpxor %ymm3, %ymm1, %ymm11
vpcmpgtq %ymm11, %ymm5, %ymm11
vpxor %ymm3, %ymm10, %ymm10
vpcmpgtq %ymm10, %ymm5, %ymm10
vpsubq %ymm11, %ymm0, %ymm0
vpsubq %ymm10, %ymm4, %ymm4
vpaddq %ymm11, %ymm8, %ymm8
vpsubq %ymm6, %ymm8, %ymm8
vpaddq %ymm10, %ymm9, %ymm9
vpsubq %ymm6, %ymm9, %ymm9
vpaddq %ymm7, %ymm1, %ymm1
addq $-8, %rcx
jne L224
; │ # Julia_vs_Delphi.jl:13 within `for_fun`
vpaddq %ymm8, %ymm9, %ymm1
vextracti128 $1, %ymm1, %xmm2
vpaddq %xmm2, %xmm1, %xmm1
vpshufd $238, %xmm1, %xmm2 # xmm2 = xmm1[2,3,2,3]
vpaddq %xmm2, %xmm1, %xmm1
vmovq %xmm1, %r11
vpaddq %ymm0, %ymm4, %ymm0
vextracti128 $1, %ymm0, %xmm1
vpaddq %xmm1, %xmm0, %xmm0
vpshufd $238, %xmm0, %xmm1 # xmm1 = xmm0[2,3,2,3]
vpaddq %xmm1, %xmm0, %xmm0
vmovq %xmm0, %rdx
cmpq %r8, %r9
je L386
L346:
leaq 1(%r9), %r8
nop
; │ # Julia_vs_Delphi.jl:10 within `for_fun`
; │┌ # operators.jl:378 within `>`
; ││┌ # int.jl:83 within `<`
L352:
xorl %ecx, %ecx
cmpq $500000, %r10 # imm = 0x7A120
seta %cl
cmpq $500001, %r10 # imm = 0x7A121
; │└└
adcq $0, %rdx
addq %rcx, %r11
; │ # Julia_vs_Delphi.jl:13 within `for_fun`
; │┌ # range.jl:837 within `iterate`
incq %r10
; ││┌ # promotion.jl:468 within `==`
cmpq %r10, %r8
; │└└
jne L352
; │ # Julia_vs_Delphi.jl:17 within `for_fun`
L386:
movq %r9, (%rax)
movq %rdx, 8(%rax)
movq %r11, 16(%rax)
vmovaps -96(%rbp), %xmm6
vmovaps -80(%rbp), %xmm7
vmovaps -64(%rbp), %xmm8
vmovaps -48(%rbp), %xmm9
vmovaps -32(%rbp), %xmm10
vmovaps -16(%rbp), %xmm11
addq $96, %rsp
popq %rbp
vzeroupper
retq
nopw %cs:(%rax,%rax)
Let's start by noting that there is no reason for an optimizing compiler to actually perform the loop. At present, Delphi and Julia both emit assembly that really runs through the loop, but a future compiler could skip the loop and simply assign the final values. Microbenchmarks are tricky.
The difference seems to be that Julia makes use of SIMD instructions which makes perfect sense for such loop (~8x speedup makes perfect sense depending on your CPU).
You could have a look at this blog post for thoughts on SIMD in Delphi.
Although this is not the main point of the answer, I'll expand a bit on the possibility to remove the loop altogether. I don't know for sure what the Delphi specification says but in many compiled languages, including Julia ("just-ahead-of-time"), the compiler could simply figure out the state of the variables after the loop and replace the loop with that state. Have a look at the following C++ code (compiler explorer):
#include <cstdio>
void loop() {
long total = 0, big = 0, small = 0;
for (long i = 0; i < 100; ++i) {
total++;
if (total > 50) {
big++;
} else {
small++;
}
}
std::printf("%ld %ld %ld", total, big, small);
}
this is the assembler clang trunk outputs:
loop(): # @loop()
lea rdi, [rip + .L.str]
mov esi, 100
mov edx, 50
mov ecx, 50
xor eax, eax
jmp printf@PLT # TAILCALL
.L.str:
.asciz "%ld %ld %ld"
as you can see, no loop, just the result. For longer loops clang stops doing this optimization but that's just a limitation of the compiler, other compilers could do it differently and I'm sure there is a heavily optimizing compiler out there that handles much more complex situations.
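To make the "replace the loop with its final state" idea concrete for the loop in the question, here is a rough C sketch. The names counts_t, count_loop and count_closed_form are invented for this example, and the threshold is passed as a parameter instead of being hard-coded. It illustrates what an optimizer could derive, not what Delphi or Julia actually emit.

```c
#include <assert.h>

typedef struct { long total, big, small; } counts_t;

/* The loop from the question with a branch-free body: the comparison
   result (0 or 1) is added directly, similar in spirit to the seta/adc
   sequence in the scalar tail of the Julia listing. */
static counts_t count_loop(long n, long threshold)
{
    counts_t c = {0, 0, 0};
    assert(n >= 0);
    for (long i = 1; i <= n; i++) {
        c.total += 1;
        long is_big = c.total > threshold;  /* 1 if big, else 0 */
        c.big   += is_big;
        c.small += 1 - is_big;
    }
    return c;
}

/* What the loop computes, without running it. */
static counts_t count_closed_form(long n, long threshold)
{
    counts_t c;
    assert(n >= 0 && threshold >= 0);
    c.total = n;
    c.small = n < threshold ? n : threshold;  /* iterations with total <= threshold */
    c.big   = n - c.small;                    /* all remaining iterations */
    return c;
}
```

For the values in the question (one billion iterations, threshold 500000) the closed form gives Small = 500000 and Big = 999500000 without iterating at all.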

What do these 2 lines of assembly code do?

I am in the middle of phase 2 for bomb lab and I can't seem to figure out how these two lines of assembly affect the code overall and how they play a role in the loop going on.
Here is the 2 lines of code:
add -0x24(%ebp,%ebx,4),%eax
cmp %eax,-0x20(%ebp,%ebx,4)
and here is the entire code:
Dump of assembler code for function phase_2:
0x08048ba4 <+0>: push %ebp
0x08048ba5 <+1>: mov %esp,%ebp
0x08048ba7 <+3>: push %ebx
0x08048ba8 <+4>: sub $0x34,%esp
0x08048bab <+7>: lea -0x20(%ebp),%eax
0x08048bae <+10>: mov %eax,0x4(%esp)
0x08048bb2 <+14>: mov 0x8(%ebp),%eax
0x08048bb5 <+17>: mov %eax,(%esp)
0x08048bb8 <+20>: call 0x804922f <read_six_numbers>
0x08048bbd <+25>: cmpl $0x0,-0x20(%ebp)
0x08048bc1 <+29>: jns 0x8048be3 <phase_2+63>
0x08048bc3 <+31>: call 0x80491ed <explode_bomb>
0x08048bc8 <+36>: jmp 0x8048be3 <phase_2+63>
0x08048bca <+38>: mov %ebx,%eax
0x08048bcc <+40>: add -0x24(%ebp,%ebx,4),%eax
0x08048bd0 <+44>: cmp %eax,-0x20(%ebp,%ebx,4)
0x08048bd4 <+48>: je 0x8048bdb <phase_2+55>
0x08048bd6 <+50>: call 0x80491ed <explode_bomb>
0x08048bdb <+55>: inc %ebx
0x08048bdc <+56>: cmp $0x6,%ebx
0x08048bdf <+59>: jne 0x8048bca <phase_2+38>
0x08048be1 <+61>: jmp 0x8048bea <phase_2+70>
0x08048be3 <+63>: mov $0x1,%ebx
0x08048be8 <+68>: jmp 0x8048bca <phase_2+38>
0x08048bea <+70>: add $0x34,%esp
0x08048bed <+73>: pop %ebx
0x08048bee <+74>: pop %ebp
0x08048bef <+75>: ret
I noticed the inc instruction that increments %ebx by 1, and that %ebx is then used as %eax in the loop. But the add and cmp trip me up every time. If %eax is 1 going into the add and cmp, what is in %eax when it comes out? Thanks! I also know that once %ebx gets to 5 the loop is over and the function ends.
You got a list of 6 numbers. This means you can compare at most 5 pairs of numbers. So the loop that uses %ebx does 5 iterations.
In each iteration the value at the lower address is added to the current loop count, and then compared with the value at the next higher address. As long as they match the bomb won't explode!
This loops 5 times:
add -0x24(%ebp,%ebx,4),%eax
cmp %eax,-0x20(%ebp,%ebx,4)
These numbers are used:
with %ebx=1 numbers are at -0x20(%ebp) and -0x1C(%ebp)
with %ebx=2 numbers are at -0x1C(%ebp) and -0x18(%ebp)
with %ebx=3 numbers are at -0x18(%ebp) and -0x14(%ebp)
with %ebx=4 numbers are at -0x14(%ebp) and -0x10(%ebp)
with %ebx=5 numbers are at -0x10(%ebp) and -0x0C(%ebp)
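Putting the address table above together with the mov %ebx,%eax that precedes the add, the check can be modelled in C roughly as follows. This is a sketch, not the bomb's source: phase_2_passes is a made-up name, the array nums stands for the six integers stored from -0x20(%ebp) upward, and the function returns false where the real code would call explode_bomb.

```c
#include <stdbool.h>

/* Hypothetical model of phase_2: true when no explode_bomb call is reached. */
static bool phase_2_passes(const int nums[6])
{
    if (nums[0] < 0)                  /* cmpl $0x0,-0x20(%ebp) ; jns */
        return false;
    for (int ebx = 1; ebx != 6; ebx++) {
        int eax = ebx;                /* mov %ebx,%eax */
        eax += nums[ebx - 1];         /* add -0x24(%ebp,%ebx,4),%eax */
        if (nums[ebx] != eax)         /* cmp %eax,-0x20(%ebp,%ebx,4) */
            return false;
    }
    return true;
}
```

So if %eax is 1 going into the add, it comes out as 1 plus the first number, and the second number has to equal that sum. Under this model a sequence such as 1 2 4 7 11 16 passes.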
Those two instructions deal with memory at two locations, indexed by ebp and ebx. In particular, the add instruction adds the previously examined number to the loop counter (the preceding mov copies %ebx into %eax), and the cmp instruction checks whether that sum is equal to the next number. So something like:
for (i = 1; i < 6; i++) {
if (array[i] != array[i - 1] + i)
explode_bomb();
}

Displaying environment variables in assembly language

I am trying to understand how assembly works by making a basic program that displays environment variables, like this
C code :
int main(int ac, char **av, char **env)
{
int x;
int y;
y = -1;
while (env[++y])
{
x = -1;
while (env[y][++x])
{
write(1, &(env[y][x]), 1);
}
}
return (0);
}
I compiled that with gcc -S (on cygwin64) to see how it is done, and wrote it my own way (similar but not the same), but it did not work...
$>gcc my_av.s && ./a.exe
HOMEPATH=\Users\hadrien▒2▒p
My assembly code :
.file "test.c"
.LC0:
.ascii "\n\0"
.LC1:
.ascii "\033[1;31m.\033[0m\0"
.LC2:
.ascii "\033[1;31m#\033[0m\0"
.LCtest0:
.ascii "\033[1;32mdebug\033[0m\0"
.LCtest1:
.ascii "\033[1;31mdebug\033[0m\0"
.LCtest2:
.ascii "\033[1;34mdebug\033[0m\0"
.def main; .scl 2; .type 32; .endef
main:
/* main prologue */
pushq %rbp
movq %rsp, %rbp
subq $48, %rsp
movl %ecx, 16(%rbp) /* int argc */
movq %rdx, 24(%rbp) /* char **argv */
movq %r8, 32(%rbp) /* char **env */
/* newline */
/* write setup */
movl $1, %r8d /* write size */
movl $1, %ecx /* standard output */
leaq .LC0(%rip), %rdx
/* write */
call write
/* start of the code */
movl $-1, -8(%rbp) /* y = -1 */
jmp .Loop_1_condition
.Loop_1_body:
movl $-1, -4(%rbp)
jmp .Loop_2_condition
.Loop_2_body:
/* print the character */
movl $1, %r8d
movl $1, %ecx
call write
.Loop_2_condition:
addl $1, -4(%rbp) /* ++x */
movl -8(%rbp), %eax
cltq
addq 32(%rbp), %rax
movq (%rax), %rax
movq %rax, %rdx
movl -4(%rbp), %eax
cltq
addq %rdx, %rax
movq %rax, %rdx
movq (%rax), %rax
cmpq $0, %rax
jne .Loop_2_body
/* newline */
movl $1, %r8d /* write size */
movl $1, %ecx /* standard output */
leaq .LC0(%rip), %rdx
call write
.Loop_1_condition:
addl $1, -8(%rbp) /* ++y */
movl -8(%rbp), %eax
cltq /* sign-extend eax to 64 bits */
addq 32(%rbp), %rax
movq (%rax), %rax
cmpq $0, %rax
jne .Loop_1_body
movl $1, %r8d /* write size */
movl $1, %ecx /* standard output */
leaq .LC0(%rip), %rdx
call write
/* end of program */
movl $0, %eax /* return (0) */
addq $48, %rsp
popq %rbp
ret
.def write; .scl 2; .type 32; .endef
Could someone please explain what is wrong with this code?
Also, while trying to solve the problem I tried replacing $0 with $97 in the cmpq operation, thinking it would stop on the 'a' character, but it didn't... Why?
You have a few issues. In this code (loop2) you have:
addq %rdx, %rax
movq %rax, %rdx
movq (%rax), %rax
cmpq $0, %rax
movq (%rax), %rax has read the next 8 characters in %rax. You are only interested in the first character. One way to achieve this is to compare the least significant byte in %rax with 0. You can use cmpb and use the %al register:
cmpb $0, %al
The biggest issue, though, is understanding that char **env is a pointer to an array of char *. You first need to get the base pointer for the array, then that base pointer is indexed with y. The indexing looks something like basepointer + (y * 8). You need to multiply y by 8 because each pointer is 8 bytes wide. The pointer at that location will be the char * for a particular environment string. Then you can index each character in the string array until you find a NUL (0) terminating character.
I've amended the code slightly and added comments on the few lines I changed:
.file "test.c"
.LC0:
.ascii "\x0a\0"
.LC1:
.ascii "\033[1;31m.\033[0m\0"
.LC2:
.ascii "\033[1;31m#\033[0m\0"
.LCtest0:
.ascii "\033[1;32mdebug\033[0m\0"
.LCtest1:
.ascii "\033[1;31mdebug\033[0m\0"
.LCtest2:
.ascii "\033[1;34mdebug\033[0m\0"
.def main; .scl 2; .type 32; .endef
main:
/* main prologue */
pushq %rbp
movq %rsp, %rbp
subq $48, %rsp
movl %ecx, 16(%rbp) /* int argc */
movq %rdx, 24(%rbp) /* char **argv */
movq %r8, 32(%rbp) /* char **env */
/* newline */
/* write setup */
movl $1, %r8d /* write size */
movl $1, %ecx /* standard output */
leaq .LC0(%rip), %rdx
/* write */
call write
/* start of the code */
movl $-1, -8(%rbp) /* y = -1 */
jmp .Loop_1_condition
.Loop_1_body:
movl $-1, -4(%rbp)
jmp .Loop_2_condition
.Loop_2_body:
/* print the character */
movl $1, %r8d
movl $1, %ecx
call write
.Loop_2_condition:
addl $1, -4(%rbp) /* ++x */
movl -8(%rbp), %eax /* get y index */
cltq
movq 32(%rbp), %rbx /* get envp (pointer to element 0 of char * array) */
movq (%rbx,%rax,8), %rdx /* get pointer at envp+y*8
pointers are 8 bytes wide */
movl -4(%rbp), %eax /* get x */
cltq
leaq (%rdx, %rax), %rdx /* Get current character's address */
cmpb $0, (%rdx) /* Compare current byte to char 0
using cmpq will compare the next 8 bytes */
jne .Loop_2_body
/* newline */
movl $1, %r8d /* write size */
movl $1, %ecx /* standard output */
leaq .LC0(%rip), %rdx
call write
.Loop_1_condition:
addl $1, -8(%rbp) /* ++y */
movl -8(%rbp), %eax
cltq /* sign-extend eax to 64 bits */
movq 32(%rbp), %rbx /* get envp (pointer to element 0 of char * array) */
movq (%rbx,%rax,8), %rax /* get pointer at envp+y*8
pointers are 8 bytes wide */
cmpq $0, %rax /* Compare to NULL ptr */
jne .Loop_1_body
movl $1, %r8d /* write size */
movl $1, %ecx /* standard output */
leaq .LC0(%rip), %rdx
call write
/* end of program */
movl $0, %eax /* return (0) */
addq $48, %rsp
popq %rbp
ret
.def write; .scl 2; .type 32; .endef
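The indexing described in this answer (base pointer plus y * 8, then a byte-wise scan for the NUL terminator) can be sketched in C as follows. The helper names env_string_at and env_total_length are made up for this illustration, and sizeof(char *) stands in for the hard-coded 8 so the sketch does not depend on the pointer width.

```c
#include <assert.h>
#include <stddef.h>

/* Fetch env[y] the way the assembly does: load the pointer stored at
   base + y * sizeof(char *). Equivalent to writing env[y]. */
static char *env_string_at(char **env, size_t y)
{
    assert(env != NULL);
    char *base = (char *)env;                       /* envp as a byte address */
    return *(char **)(base + y * sizeof(char *));   /* movq (%rbx,%rax,8), ... */
}

/* Walk every string until the NULL pointer that ends the array, and
   every character until the NUL byte that ends each string. */
static size_t env_total_length(char **env)
{
    size_t total = 0;
    for (size_t y = 0; env_string_at(env, y) != NULL; y++) {  /* pointer compare */
        const char *s = env_string_at(env, y);
        for (size_t x = 0; s[x] != '\0'; x++)                 /* byte compare */
            total++;
    }
    return total;
}
```

The outer test compares a whole pointer against NULL (8 bytes on x86-64), while the inner test compares a single byte, which is the distinction between cmpq and cmpb in the amended assembly.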

Memory transfer intel assembly AT&T

I have a problem moving a string bytewise from one memory address to another. I've been at this for hours and tried several different strategies. I'm new to Intel assembly, so I need some tips and insight to help me solve the problem.
The getText routine is supposed to transfer n (found in %rsi) bytes from ibuf to the address in %rdi. counterI is the offset used to indicate where to start the transfer, and after the routine is over it should point to the next byte that wasn't transferred. If there aren't n bytes, it should cancel the transfer and return the actual number of bytes transferred in %rax.
getText:
movq $ibuf, %r10
#rsi holds the number of bytes to be transferred
#rdi contains the memory address of the memory space to transfer to
movq $0, %r8 #start with offset 0
movq $0, %rax #zero return register
movq (counterI), %r11
cmpb $0, (%r10, %r11, 1) #check if ibuf+counterI=NULL
jne MOVE #if so call and read to ibuf
call inImage
MOVE:
cmpq $0,%rsi #if number of bytes to read is 0
je EXIT #exit
movq counterI, %r9
movq $0, %r9 #used for debugging only, should not be 0
movb (%r10, %r9, 1), %bl #loads one byte to rdi from ibuf
movb %bl, (%rdi, %r8, 1)
incq counterI #increase pointer offset
decq %rsi #dec number of bytes to read
incq %r8 #inc offset in write buffer
movq %r8, %rax #returns number of bytes written to buf
movq (counterI), %r9
cmpb $0, (%r10, %r9,1) #check if ibuf+offset is NULL
je EXIT #if so exit
cmpq $0, %rsi #can be cleaned up later
jne MOVE
EXIT:
movb $0, (%rdi, %r8, 1) #move NULL to buf+%r8?
ret
movq counterI, %r9
movq $0, %r9 #used for debugging only, should not be 0
The second instruction makes the first useless, but given the remark I understand you will remove it. Better still, you can remove both if you change every occurrence of %r9 into %r11.
movzbq (%r10, %r9, 1), %r10 #loads one byte+zeroes to rdi from ibuf
movq %r10, (%rdi, %r8, 1) #HERE IS THE PROBLEM I THINK
This is a dangerous construct. You first use %r10 as an address, but then put a zero-extended data byte in it. Later in the code you use %r10 as an address again, but sadly the address is no longer in there! The solution is to move the byte into a different register and not bother with the zero extension.
movb (%r10, %r9, 1), %bl #loads one byte to rdi from ibuf
movb %bl, (%rdi, %r8, 1)
The following code can be shortened
cmpb $0, (%r10, %r9,1) #check if ibuf+offset is NULL
je EXIT #if so exit
cmpq $0, %rsi #can be cleaned up later
jne MOVE
EXIT:
as follows, because the code at MOVE already starts by testing whether %rsi is 0:
cmpb $0, (%r10, %r9, 1) #check if ibuf+offset is NULL
jne MOVE
EXIT:
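For reference, the behavior the question describes for getText can be sketched in C. This is only an approximation under stated assumptions: get_text, its parameter list, and passing the offset through a pointer are inventions for this sketch, the refill call to inImage is left out, and dst is assumed to have room for n + 1 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C model of getText: copy at most n bytes from ibuf,
   starting at offset *counter, stopping early at a NUL byte.
   Returns the number of bytes copied and advances *counter past them. */
static size_t get_text(const char *ibuf, size_t *counter, char *dst, size_t n)
{
    assert(ibuf != NULL && counter != NULL && dst != NULL);
    size_t copied = 0;
    while (copied < n && ibuf[*counter] != '\0') {
        dst[copied] = ibuf[*counter];   /* one byte through a scratch value */
        copied++;
        (*counter)++;
    }
    dst[copied] = '\0';                 /* terminate the output, as at EXIT */
    return copied;
}
```

Keeping the source address, the offset, and the byte being copied in three separate variables mirrors the advice above: the register that holds the base address of ibuf should never be reused for data.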

SSE2: How to reduce a _m128 to a word

What's the best way ( sse2 ) to reduce a _m128 ( 4 words a b c d) to one word?
I want the low part of each _m128 components:
int result = ( _m128.a & 0x000000ff ) << 24
| ( _m128.b & 0x000000ff ) << 16
| ( _m128.c & 0x000000ff ) << 8
| ( _m128.d & 0x000000ff ) << 0
Is there an intrinsic for that? Thanks!
FYI, the SSSE3 intrinsic _mm_shuffle_epi8 does the job (with the mask 0x0004080c in this case).
The SSE2 answer takes more than one instruction:
unsigned benoit(__m128i x)
{
__m128i zero = _mm_setzero_si128(), mask = _mm_set1_epi32(255);
return _mm_cvtsi128_si32(
_mm_packus_epi16(
_mm_packus_epi16(
_mm_and_si128(x, mask), zero), zero));
}
The above amounts to 5 machine ops, given the input in %xmm1 and output in %eax:
pxor %xmm0, %xmm0
pand MASK, %xmm1
packuswb %xmm0, %xmm1
packuswb %xmm0, %xmm1
movd %xmm1, %eax
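To see which byte ends up where, the same mask-and-pack sequence can be written out step by step in C. This is a sketch: pack_low_bytes is a hypothetical name, and the comments describe the lane layout as I understand the intrinsics, with lane 0 being the lowest 32 bits of the register.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Gather the low byte of each 32-bit lane into one 32-bit word.
   Lane 0 lands in bits 0..7, lane 3 in bits 24..31. */
static uint32_t pack_low_bytes(__m128i x)
{
    const __m128i zero = _mm_setzero_si128();
    const __m128i mask = _mm_set1_epi32(0xff);

    __m128i lo  = _mm_and_si128(x, mask);       /* each lane is now 0..255 */
    __m128i w16 = _mm_packus_epi16(lo, zero);   /* 16-bit values become bytes */
    __m128i w8  = _mm_packus_epi16(w16, zero);  /* low 4 bytes hold the result */
    return (uint32_t)_mm_cvtsi128_si32(w8);
}
```

With a vector built by _mm_set_epi32(a, b, c, d), which places d in the lowest lane, this gives a in the top byte as in the question. If the four words were instead loaded from memory in the order a, b, c, d, the bytes would come out in the opposite order and a byte swap would still be needed.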
If you want to see some unusual uses of SSE2, including high-speed bit-matrix transpose, string search and bitonic (GPGPU-style) sort, you might want to check my blog, Coding on the edges.
Anyway, hope that helps.
