AVX instructions generated when -xSSE4.1 specified

I have compiled a piece of code with the option -xSSE4.1 using the Intel compiler. When I look at the generated assembly file, I see that AVX instructions such as 'vpmovzxbw' have been inserted. But the executable still seems to run on machines that don't support the AVX instruction set. What explains this?
Here's the particular code snippet -
C -> src0_8x16b = _mm_cvtepu8_epi16 (src0_8x16b);
Assembly -> vpmovzxbw xmm4, QWORD PTR [rcx]
Binary -> 00066 c4 62 79 30 29
Here's another snippet where the assembly instruction uses 3 operands -
C -> src0_8x16b = _mm_sub_epi16 (src0_8x16b, src1_8x16b);
Assembly -> vpsubw xmm1, xmm13, xmm11
Binary -> 000bc c4 c1 11 f9 cb
For comparison, here's the disassembly generated by icc for the function 'foo' (the only difference between the function foo and the code snippet above is that the snippet was written using intrinsics) -
Compiler commands used -
icc -S -xSSE4.1 -axavx -O3 foo.c
Function foo -
void foo(float *x, int n)
{
int i;
for(i=0; i<n; i++) x[i] *= 2.0;
}
Autodispatch code -
testl $-131072, __intel_cpu_indicator(%rip) #1.27
jne foo.R #1.27
testl $-1, __intel_cpu_indicator(%rip) #1.27
jne foo.A
Loop in foo.R (AVX variant) -
vmulps (%rdi,%rcx,4), %ymm0, %ymm1 #3.24
vmulps 32(%rdi,%rcx,4), %ymm0, %ymm2 #3.24
vmovups %ymm1, (%rdi,%rcx,4) #3.24
vmovups %ymm2, 32(%rdi,%rcx,4) #3.24
addq $16, %rcx #3.5
cmpq %rdx, %rcx #3.5
jb ..B2.12 # Prob 82% #3.5
Loop in foo.A (SSE variant) -
movaps (%rdi,%r8,4), %xmm1 #3.24
movaps 16(%rdi,%r8,4), %xmm2 #3.24
mulps %xmm0, %xmm1 #3.24
mulps %xmm0, %xmm2 #3.24
movaps %xmm1, (%rdi,%r8,4) #3.24
movaps %xmm2, 16(%rdi,%r8,4) #3.24
addq $8, %r8 #3.5
cmpq %rsi, %r8 #3.5
jb ..B3.12 # Prob 82% #3.5

I tried to replicate the results with two other compilers, viz. gcc and Microsoft Visual Studio's v100 compiler. I was unable to do so, i.e., gcc and v100 seem to generate the correct disassemblies. As a further step, I looked closely at the differences, if any, between the compiler arguments I had specified in each case. It turns out that while using the icc compiler, I had enabled the option to inherit project defaults for compiling this particular file. The project settings were configured such that this option was included -
-xavx
As a result, when this file was compiled, the settings I had provided -
-xSSE4.1 -axavx
were overridden by the former. This was the cause of the behavior I have detailed in my question.
I am sorry for this error, but I shall not delete this question, since @Zboson's answer is exceptional.
PS - I had mentioned in one of my comments that I was able to run this code on an SSE4.2 machine. That was because the exe I had run on that machine was indeed SSE4.1-compliant: I had apparently used an exe generated with the gcc compiler. When I ran the icc-generated exe, it did indeed crash with an illegal instruction error on the SSE4.2 machine.

The Intel compiler can generate a single executable with multiple levels of vectorization using the -ax flag. For example, to generate code which is compatible with AVX, SSE4.2, and SSE2, use -axAVX -axSSE4.2 -xSSE2.
Since you compiled with -axAVX -xSSE4.1, Intel generated an AVX branch and an SSE4.1 branch; at runtime it determines which instruction set is available and chooses that.
Agner Fog has a good description of Intel's CPU dispatcher in his Optimizing C++ manual; see section "13.7 CPU dispatching in Intel compiler". Intel's CPU dispatcher is not ideal for several reasons, one of which is that it plays badly on AMD CPUs, which Agner describes in detail. Personally, I would make my own dispatcher.
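For what it's worth, a minimal hand-rolled dispatcher might look like this (a sketch, assuming GCC/Clang's __builtin_cpu_supports; the foo_avx/foo_sse2 names are illustrative, and in practice each variant would be compiled in its own translation unit with -mavx or -msse2 so the compiler can vectorize it):

#include <stdio.h>

// Illustrative per-ISA variants of foo; real builds would compile each
// in a separate file with the matching -m flag.
static void foo_avx(float *x, int n)  { for (int i = 0; i < n; i++) x[i] *= 2.0f; }
static void foo_sse2(float *x, int n) { for (int i = 0; i < n; i++) x[i] *= 2.0f; }

static void (*foo_ptr)(float *, int);

static void foo(float *x, int n)
{
    if (!foo_ptr)  // resolve the variant once, on first use
        foo_ptr = __builtin_cpu_supports("avx") ? foo_avx : foo_sse2;
    foo_ptr(x, n);
}

int main(void)
{
    float v[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    foo(v, 4);
    printf("%f\n", v[0]);  // prints 2.000000
    return 0;
}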
I compiled the following code with ICC 13.0 with options -O3 -axavx -xsse2
void foo(float *x, int n) {
for(int i=0; i<n; i++) x[i] *= 2.0;
}
and the start of the assembly is
test DWORD PTR __intel_cpu_indicator[rip], -131072 #1.27
jne _Z3fooPfi.R #1.27
test DWORD PTR __intel_cpu_indicator[rip], -1 #1.27
jne _Z3fooPfi.A
Going to the _Z3fooPfi.R branch, we find the main AVX loop
..B2.12: # Preds ..B2.12 ..B2.11
vmulps ymm1, ymm0, YMMWORD PTR [rdi+rcx*4] #2.25
vmulps ymm2, ymm0, YMMWORD PTR [32+rdi+rcx*4] #2.25
vmovups YMMWORD PTR [rdi+rcx*4], ymm1 #2.25
vmovups YMMWORD PTR [32+rdi+rcx*4], ymm2 #2.25
add rcx, 16 #2.2
cmp rcx, rdx #2.2
jb ..B2.12 # Prob 82% #2.2
Going to the _Z3fooPfi.A branch, we find the main SSE loop
movaps xmm1, XMMWORD PTR [rdi+r8*4] #2.25
movaps xmm2, XMMWORD PTR [16+rdi+r8*4] #2.25
mulps xmm1, xmm0 #2.25
mulps xmm2, xmm0 #2.25
movaps XMMWORD PTR [rdi+r8*4], xmm1 #2.25
movaps XMMWORD PTR [16+rdi+r8*4], xmm2 #2.25
add r8, 8 #2.2
cmp r8, rsi #2.2
jb ..B3.12 # Prob 82% #2.2

Related

Do the glibc implementations of pthread_spin_lock() and pthread_spin_unlock() have memory fence instructions?

Do the glibc implementations of pthread_spin_lock() and pthread_spin_unlock() have memory fence instructions? (I could not find any fence instructions.)
A similar question has answers here: Does pthread_mutex_lock contain a memory fence instruction?
There is no single implementation: there is a separate implementation for each supported processor.
The x86_64 implementation does not use memory fences; it uses the lock prefix instead:
gdb -q /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) disas pthread_spin_lock
Dump of assembler code for function pthread_spin_lock:
0x00000000000108c0 <+0>: lock decl (%rdi)
0x00000000000108c3 <+3>: jne 0x108d0 <pthread_spin_lock+16>
0x00000000000108c5 <+5>: xor %eax,%eax
0x00000000000108c7 <+7>: retq
0x00000000000108c8 <+8>: nopl 0x0(%rax,%rax,1)
0x00000000000108d0 <+16>: pause
0x00000000000108d2 <+18>: cmpl $0x0,(%rdi)
0x00000000000108d5 <+21>: jg 0x108c0 <pthread_spin_lock>
0x00000000000108d7 <+23>: jmp 0x108d0 <pthread_spin_lock+16>
Since lock-prefixed instructions are already a memory barrier on x86_64 (and i386), no additional memory barriers are necessary.
But the powerpc implementation uses lwarx and stwcx. instructions, which are closer to a "memory fence", and the sparc64 implementation uses a full membar (memory barrier) instruction.
You can see the various implementations in sysdeps/.../pthread_spin_lock.* files in GLIBC sources.
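To illustrate the point, here is a sketch in C11 atomics (not glibc's actual source) of the same x86_64 idea: the atomic read-modify-write below compiles to a lock-prefixed instruction, which already implies a full barrier, so no separate fence instruction appears.

#include <stdatomic.h>

typedef atomic_int my_spinlock_t;  // 1 = unlocked, <= 0 = locked

static void my_spin_lock(my_spinlock_t *lock)
{
    // like glibc's "lock decl": atomically decrement; we own the lock
    // if the old value was 1
    while (atomic_fetch_sub(lock, 1) != 1) {
        // spin read-only until the lock looks free
        // (glibc additionally executes pause in this loop)
        while (atomic_load_explicit(lock, memory_order_relaxed) < 1)
            ;
    }
}

static void my_spin_unlock(my_spinlock_t *lock)
{
    // a plain release store; no fence needed on x86_64
    atomic_store_explicit(lock, 1, memory_order_release);
}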

iOS ARM64 Syscalls

I am learning more about shellcode and making syscalls in arm64 on iOS devices. The device I am testing on is iPhone 6S.
I got the list of syscalls from this link (https://github.com/radare/radare2/blob/master/libr/include/sflib/darwin-arm-64/ios-syscalls.txt).
I learnt that arm64 uses x8 for the syscall number from here (http://arm.ninja/2016/03/07/decoding-syscalls-in-arm64/).
I figured the registers used to pass parameters on arm64 should be the same as on arm, so I referred to this link (https://w3challs.com/syscalls/?arch=arm_strong), taken from https://azeria-labs.com/writing-arm-shellcode/.
I wrote inline assembly in Xcode and here are some snippets
//exit syscall
__asm__ volatile("mov x8, #1");
__asm__ volatile("mov x0, #0");
__asm__ volatile("svc 0x80");
However, the application does not terminate when I step over this code.
char write_buffer[]="console_text";
int write_buffer_size = sizeof(write_buffer);
__asm__ volatile("mov x8,#4;" //arm64 uses x8 for syscall number
"mov x0,#1;" //1 for stdout file descriptor
"mov x1,%0;" //the buffer to display
"mov x2,%1;" //buffer size
"svc 0x80;"
:
:"r"(write_buffer),"r"(write_buffer_size)
:"x0","x1","x2","x8"
);
If this syscall works, it should print some text in Xcode's console output screen. However, nothing gets printed.
There are many online articles on ARM assembly; some use svc 0x80 and some use svc 0, etc., so there can be a few variations. I tried various methods, but I could not get the two code snippets to work.
Can someone provide some guidance?
EDIT:
This is what Xcode shows in its Assembly view when I wrote the C function call int return_value = syscall(1, 0);
mov x1, sp
mov x30, #0
str x30, [x1]
orr w8, wzr, #0x1
stur x0, [x29, #-32] ; 8-byte Folded Spill
mov x0, x8
bl _syscall
I am not sure why this code was emitted.
The registers used for syscalls are completely arbitrary, and the resources you've picked are certainly wrong for XNU.
As far as I'm aware, the XNU syscall ABI for arm64 is entirely private and subject to change without notice, so there's no published standard that it follows, but you can piece together how it works by getting a copy of the XNU source (as tarballs, or viewing it online if you prefer that), grepping for the handle_svc function, and just following the code.
I'm not gonna go into detail on where exactly you find which bits, but the end result is:
The immediate passed to svc is ignored, but the standard library uses svc 0x80.
x16 holds the syscall number
x0 through x8 hold up to 9 arguments*
There are no arguments on the stack
x0 and x1 hold up to 2 return values (e.g. in the case of fork)
The carry bit is used to report an error, in which case x0 holds the error code
* This is used only in the case of an indirect syscall (x16 = 0) with 8 arguments.
* Comments in the XNU source also mention x9, but it seems the engineer who wrote that should brush up on off-by-one errors.
And then it comes to the actual syscall numbers available:
The canonical source for "UNIX syscalls" is the file bsd/kern/syscalls.master in the XNU source tree. Those take syscall numbers from 0 up to about 540 in the latest iOS 13 beta.
The canonical source for "Mach syscalls" is the file osfmk/kern/syscall_sw.c in the XNU source tree. Those syscalls are invoked with negative numbers between -10 and -100 (e.g. -28 would be task_self_trap).
Unrelated to the last point, two syscalls mach_absolute_time and mach_continuous_time can be invoked with syscall numbers -3 and -4 respectively.
A few low-level operations are available through platform_syscall with the syscall number 0x80000000.
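Putting that together, a minimal corrected version of the exit snippet from the question might look like this (a sketch based on the rules above, with SYS_exit = 1 from bsd/kern/syscalls.master; I haven't run it):

//exit(0) on arm64 XNU: syscall number in x16, arguments in x0 and up
__asm__ volatile("mov x16, #1 \n"  // SYS_exit
                 "mov x0, #0  \n"  // exit status
                 "svc #0x80"       // the immediate is ignored; 0x80 by convention
                 ::: "x0", "x16");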
This should get you going. As @Siguza mentioned, you must use x16, not x8, for the syscall number.
#import <sys/syscall.h>
char testStringGlobal[] = "helloWorld from global variable\n";
int main(int argc, char * argv[]) {
char testStringOnStack[] = "helloWorld from stack variable\n";
#if TARGET_CPU_ARM64
//VARIANT 1 suggested by @PeterCordes
//as an input it's a file descriptor set to STD_OUT 1 so the syscall write output appears in Xcode debug output
//as an output this will be used for returning syscall return value;
register long x0 asm("x0") = 1;
//as an input string to write
//as an output this will be used for returning syscall return value higher half (in this particular case 0)
register char *x1 asm("x1") = testStringOnStack;
//string length
register long x2 asm("x2") = strlen(testStringOnStack);
//syscall write is 4
register long x16 asm("x16") = SYS_write; //syscall write definition - see my footnote below
//full variant using stack local variables for register x0,x1,x2,x16 input
//syscall result collected in x0 & x1 using "semi" intrinsic assembler
asm volatile(//all args prepared, make the syscall
"svc #0x80"
:"=r"(x0),"=r"(x1) //mark x0 & x1 as syscall outputs
:"r"(x0), "r"(x1), "r"(x2), "r"(x16): //mark the inputs
//inform the compiler we read the memory
"memory",
//inform the compiler we clobber carry flag (during the syscall itself)
"cc");
//VARIANT 2
//syscall write for globals variable using "semi" intrinsic assembler
//args hardcoded
//output of syscall is ignored
asm volatile(//prepare x1 with the string address (the compiler picks the input register for %0)
"mov x1, %0 \t\n"
//set file descriptor to STD_OUT 1 so it appears in Xcode debug output
"mov x0, #1 \t\n"
//hardcoded length
"mov x2, #32 \t\n"
//syscall write is 4
"mov x16, #0x4 \t\n"
//all args prepared, make the syscall
"svc #0x80"
::"r"(testStringGlobal):
//clobbered registers list
"x1","x0","x2","x16",
//inform the compiler we read the memory
"memory",
//inform the compiler we clobber carry flag (during the syscall itself)
"cc");
//VARIANT 3 - only applicable to global variables using "page" address
//which is PC-relative addressing to load addresses at a fixed offset from the current location (PIC code).
//syscall write for global variable using "semi" intrinsic assembler
asm volatile(//set x1 on proper PAGE
"adrp x1,_testStringGlobal#PAGE \t\n" //notice the underscore preceding variable name by convention
//add the offset of the testStringGlobal variable
"add x1,x1,_testStringGlobal#PAGEOFF \t\n"
//set file descriptor to STD_OUT 1 so it appears in Xcode debug output
"mov x0, #1 \t\n"
//hardcoded length
"mov x2, #32 \t\n"
//syscall write is 4
"mov x16, #0x4 \t\n"
//all args prepared, make the syscall
"svc #0x80"
:::
//clobbered registers list
"x1","x0","x2","x16",
//inform the compiler we read the memory
"memory",
//inform the compiler we clobber carry flag (during the syscall itself)
"cc");
#endif
#autoreleasepool {
return UIApplicationMain(argc, argv, nil, NSStringFromClass([AppDelegate class]));
}
}
EDIT
To @PeterCordes' excellent comment: yes, there is a syscall numbers definition header, <sys/syscall.h>, which I included in the snippet above in Variant 1. But it's important to mention that inside it, Apple defines the numbers like this:
#ifdef __APPLE_API_PRIVATE
#define SYS_syscall 0
#define SYS_exit 1
#define SYS_fork 2
#define SYS_read 3
#define SYS_write 4
I haven't heard of a case yet of an iOS app being rejected from the App Store for using a system call directly through svc 0x80, but it's definitely not public API.
As for the "=@ccc" suggested by @PeterCordes, i.e. the carry flag (set by the syscall upon error) as an output constraint: that's not supported as of the latest Xcode 11 beta / LLVM 8.0.0, even for x86, and definitely not for ARM.

RGBA to ABGR: Inline arm neon asm for iOS/Xcode

This code (very similar code; I haven't tried exactly this code) compiles using the Android NDK, but not with Xcode for armv7+arm64 iOS.
Errors are in the comments:
uint32_t *src;
uint32_t *dst;
#ifdef __ARM_NEON
__asm__ volatile(
"vld1.32 {d0, d1}, [%[src]] \n" // error: Vector register expected
"vrev32.8 q0, q0 \n" // error: Unrecognized instruction mnemonic
"vst1.32 {d0, d1}, [%[dst]] \n" // error: Vector register expected
:
: [src]"r"(src), [dst]"r"(dst)
: "d0", "d1"
);
#endif
What's wrong with this code?
EDIT1:
I rewrote the code using intrinsics:
uint8x16_t x = vreinterpretq_u8_u32(vld1q_u32(src));
uint8x16_t y = vrev32q_u8(x);
vst1q_u32(dst, vreinterpretq_u32_u8(y));
After disassembling, I get the following, which is a variation I have already tried:
vld1.32 {d16, d17}, [r0]!
vrev32.8 q8, q8
vst1.32 {d16, d17}, [r1]!
So my code looks like this now, but gives the exact same errors:
__asm__ volatile("vld1.32 {d0, d1}, [%0]! \n"
"vrev32.8 q0, q0 \n"
"vst1.32 {d0, d1}, [%1]! \n"
:
: "r"(src), "r"(dst)
: "d0", "d1"
);
EDIT2:
Reading through the disassembly, I actually found a second version of the function. It turns out that arm64 uses a slightly different instruction set; for example, the arm64 assembly uses rev32.16b v0, v0 instead. The whole function listing (which I can't make heads or tails of) is below:
_My_Function:
cmp w2, #0
add w9, w2, #3
csel w8, w9, w2, lt
cmp w9, #7
b.lo 0x3f4
asr w9, w8, #2
ldr x8, [x0]
mov w9, w9
lsl x9, x9, #2
ldr q0, [x8], #16
rev32.16b v0, v0
str q0, [x1], #16
sub x9, x9, #16
cbnz x9, 0x3e0
ret
I have successfully published several iOS apps which make use of ARM assembly language, and inline code is the most frustrating way to do it. Apple still requires apps to support both ARM32 and ARM64 devices. Since the code will be built as both ARM32 and ARM64 by default (unless you change the compile options), you need to design code which will successfully compile in both modes. As you noticed, ARM64 is a completely different mnemonic format and register model. There are 2 simple ways around this:
1) Write your code using NEON intrinsics. ARM specified that the original ARM32 intrinsics would remain mostly unchanged for ARMv8 targets and therefore can be compiled to both ARM32 and ARM64 code. This is the safest/easiest option.
2) Write inline code or a separate '.S' module for your assembly language code. To deal with the 2 compile modes, use "#ifdef __arm64__" and "#ifdef __arm__" to distinguish between the two instruction sets.
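A skeleton for option 2 might look like the following (a sketch; the instruction bodies are placeholders for real code):

#if defined(__arm64__)
    // AArch64 mnemonics and register names
    __asm__ volatile("rev32.16b v0, v0" ::: "v0");
#elif defined(__arm__)
    // ARM32 NEON mnemonics
    __asm__ volatile("vrev32.8 q0, q0" ::: "d0", "d1");
#endif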
Intrinsics are apparently the only way to use the same code for NEON between ARM (32-bit) and AArch64.
There are many reasons not to use inline-assembly: https://gcc.gnu.org/wiki/DontUseInlineAsm
Unfortunately, current compilers often do a very poor job with ARM / AArch64 intrinsics, which is surprising because they do an excellent job optimizing x86 SSE/AVX intrinsics and PowerPC Altivec. They often do ok in simple cases, but can easily introduce extra store/reloads.
In theory with intrinsics, you should get good asm output, and it lets the compiler schedule instructions between the vector load and store, which will help most on an in-order core. (Or you could write a whole loop in inline asm that you schedule by hand.)
ARM's official documentation:
Although it is technically possible to optimize NEON assembly by hand, this can be very difficult because the pipeline and memory access timings have complex inter-dependencies. Instead of hand assembly, ARM strongly recommends the use of intrinsics
If you do use inline asm anyway, avoid future pain by getting it right.
It's easy to write inline asm that happens to work, but isn't safe wrt. future source changes (and sometimes to future compiler optimizations), because the constraints don't accurately describe what the asm does. The symptoms will be weird, and this kind of context-sensitive bug could even lead to unit tests passing but wrong code in the main program. (or vice versa).
A latent bug that doesn't cause any defects in the current build is still a bug, and is a really Bad Thing in a Stack Overflow answer that can be copied as an example into other contexts. @bitwise's code in the question and self-answer both have bugs like this.
The inline asm in the question isn't safe, because it modifies memory without telling the compiler about it. This probably only manifests in a loop that reads from dst in C both before and after the inline asm. However, it's easy to fix, and doing so lets us drop the volatile (and avoid the "memory" clobber it's currently missing) so the compiler can optimize better (but still with significant limitations compared to intrinsics).
volatile should prevent reordering relative to memory accesses, so it may not happen outside of fairly contrived circumstances. But that's hard to prove.
The following compiles for ARM and AArch64 (it might fail if compiling for ILP32 on AArch64, though, I forgot about that possibility). Using -funroll-loops leads to gcc choosing different addressing modes, and not forcing the dst++; src++; to happen between every inline asm statement. (This maybe wouldn't be possible with asm volatile).
I used memory operands so the compiler knows that memory is an input and an output, and giving the compiler the option to use auto-increment / decrement addressing modes. This is better than anything you can do with a pointer in a register as an input operand, because it allows loop unrolling to work.
This still doesn't let the compiler schedule the store many instructions after the corresponding load to software pipeline the loop for in-order cores, so it's probably only going to perform decently on out-of-order ARM chips.
void bytereverse32(uint32_t *dst32, const uint32_t *src32, size_t len)
{
typedef struct { uint64_t low, high; } vec128_t;
const vec128_t *src = (const vec128_t*) src32;
vec128_t *dst = (vec128_t*) dst32;
// with old gcc, this gets gcc to use a pointer compare as the loop condition
// instead of incrementing a loop counter
const vec128_t *src_endp = src + len/(sizeof(vec128_t)/sizeof(uint32_t));
// len is in units of 4-byte chunks
while (src < src_endp) {
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#if __LP64__ // FIXME: doesn't account for ILP32 in 64-bit mode
// aarch64 registers: s0 and d0 are subsets of q0 (128bit), synonym for v0
asm ("ldr q0, %[src] \n\t"
"rev32.16b v0, v0 \n\t"
"str q0, %[dst] \n\t"
: [dst] "=<>m"(*dst) // auto-increment/decrement or "normal" memory operand
: [src] "<>m" (*src)
: "q0", "v0"
);
#else
// arm32 registers: 128bit q0 is made of d0:d1, or s0:s3
asm ("vld1.32 {d0, d1}, %[src] \n\t"
"vrev32.8 q0, q0 \n\t" // reverse 8 bit elements inside 32bit words
"vst1.32 {d0, d1}, %[dst] \n"
: [dst] "=<>m"(*dst)
: [src] "<>m"(*src)
: "d0", "d1"
);
#endif
#else
#error "no NEON"
#endif
// increment pointers by 16 bytes
src++; // The inline asm doesn't modify the pointers.
dst++; // of course, these increments may compile to a post-increment addressing mode
// this way has the advantage of letting the compiler unroll or whatever
}
}
This compiles (on the Godbolt compiler explorer with gcc 4.8), but I don't know if it assembles, let alone works correctly. Still, I'm confident these operand constraints are correct. Constraints are basically the same across all architectures, and I understand them much better than I know NEON.
Anyway, the inner loop on ARM (32bit) with gcc 4.8 -O3, without -funroll-loops is:
.L4:
vld1.32 {d0, d1}, [r1], #16 # MEM[(const struct vec128_t *)src32_17]
vrev32.8 q0, q0
vst1.32 {d0, d1}, [r0], #16 # MEM[(struct vec128_t *)dst32_18]
cmp r3, r1 # src_endp, src32
bhi .L4 #,
The register constraint bug
The code in the OP's self-answer has another bug: the input pointer operands use separate "r" constraints. This leads to breakage if the compiler wants to keep the old value around and chooses an input register for src that isn't the same as the output register.
If you want to take pointer inputs in registers and choose your own addressing modes, you can use "0" matching-constraints, or you can use "+r" read-write output operands.
You will also need a "memory" clobber or dummy memory input/output operands (i.e. that tell the compiler which bytes of memory are read and written, even if you don't use that operand number in the inline asm).
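In sketch form, the two spellings look like this (a trivial placeholder instruction rather than NEON; a real NEON version would also need the "memory" clobber just described):

// read-write constraint: the compiler knows %0 is consumed and replaced
static inline void *bump_rw(void *p)
{
    asm("add %0, %0, #16" : "+r"(p));
    return p;
}

// equivalent, using a "0" matching constraint to tie the input to output 0
static inline void *bump_matching(void *p)
{
    asm("add %0, %0, #16" : "=r"(p) : "0"(p));
    return p;
}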
See Looping over arrays with inline assembly for a discussion of the advantages and disadvantages of using r constraints for looping over an array on x86. ARM has auto-increment addressing modes, which appear to produce better code than anything you could get with manual choice of addressing modes. It lets gcc use different addressing modes in different copies of the block when loop-unrolling. "r" (pointer) constraints appear to have no advantage, so I won't go into detail about how to use a dummy input / output constraint to avoid needing a "memory" clobber.
Test-case that generates wrong code with @bitwise's asm statement:
// return a value as a way to tell the compiler it's needed after
uint32_t* unsafe_asm(uint32_t *dst, const uint32_t *src)
{
uint32_t *orig_dst = dst;
uint32_t initial_dst0val = orig_dst[0];
#ifdef __ARM_NEON
#if __LP64__
asm volatile("ldr q0, [%0], #16 # unused src input was %2\n\t"
"rev32.16b v0, v0 \n\t"
"str q0, [%1], #16 # unused dst input was %3\n"
: "=r"(src), "=r"(dst)
: "r"(src), "r"(dst)
: "d0", "d1" // ,"memory"
// clobbers don't include v0?
);
#else
asm volatile("vld1.32 {d0, d1}, [%0]! # unused src input was %2\n\t"
"vrev32.8 q0, q0 \n\t"
"vst1.32 {d0, d1}, [%1]! # unused dst input was %3\n"
: "=r"(src), "=r"(dst)
: "r"(src), "r"(dst)
: "d0", "d1" // ,"memory"
);
#endif
#else
#error "No NEON/AdvSIMD"
#endif
uint32_t final_dst0val = orig_dst[0];
// gcc assumes the asm doesn't change orig_dst[0], so it only does one load (after the asm)
// and uses it for final and initial
// uncomment the memory clobber, or use a dummy output operand, to avoid this.
// pointer + initial+final compiles to LSL 3 to multiply by 8 = 2 * sizeof(uint32_t)
// using orig_dst after the inline asm makes the compiler choose different registers for the
// "=r"(dst) output operand and the "r"(dst) input operand, since the asm constraints
// advertise this non-destructive capability.
return orig_dst + final_dst0val + initial_dst0val;
}
This compiles to (AArch64 gcc4.8 -O3):
ldr q0, [x1], #16 # unused src input was x1 // src, src
rev32.16b v0, v0
str q0, [x2], #16 # unused dst input was x0 // dst, dst
ldr w1, [x0] // D.2576, *dst_1(D)
add x0, x0, x1, lsl 3 //, dst, D.2576,
ret
The store uses x2 (an uninitialized register, since this function only takes 2 args). The "=r"(dst) output (%1) picked x2, while the "r"(dst) input (%3 which is used only in a comment) picked x0.
final_dst0val + initial_dst0val compiles to 2x final_dst0val, because we lied to the compiler and told it that memory wasn't modified. So instead of reading the same memory before and after the inline asm statement, it just reads it after, and left-shifts by one extra position when adding to the pointer. (The return value exists only to use the values so they're not optimized away.)
We can fix both problems by correcting the constraints: using "+r" for the pointers and adding a "memory" clobber. (A dummy output would also work, and might hurt optimization less.) I didn't bother since this appears to have no advantage over the memory-operand version above.
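The corrected statement might read like this (a sketch of the AArch64 variant only; the answer shows just its compiled output below):

#include <stdint.h>

uint32_t* safe_register_pointer_asm(uint32_t *dst, const uint32_t *src)
{
    uint32_t *orig_dst = dst;
    uint32_t initial_dst0val = orig_dst[0];
    asm volatile("ldr q0, [%0], #16 \n\t"
                 "rev32.16b v0, v0 \n\t"
                 "str q0, [%1], #16 \n"
                 : "+r"(src), "+r"(dst)  // pointers are read and updated
                 :
                 : "v0", "memory");      // the asm reads and writes memory
    uint32_t final_dst0val = orig_dst[0];
    return orig_dst + final_dst0val + initial_dst0val;
}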
With those changes, we get
safe_register_pointer_asm:
ldr w3, [x0] //, *dst_1(D)
mov x2, x0 // dst, dst ### These 2 insns are new
ldr q0, [x1], #16 // src
rev32.16b v0, v0
str q0, [x2], #16 // dst
ldr w1, [x0] // D.2597, *dst_1(D)
add x3, x1, x3, uxtw // D.2597, D.2597, initial_dst0val ## And this is new, to add the before and after loads
add x0, x0, x3, lsl 2 //, dst, D.2597,
ret
As stated in the edits to the original question, it turned out that I needed a different assembly implementation for arm64 and armv7.
#ifdef __ARM_NEON
#if __LP64__
asm volatile("ldr q0, [%0], #16 \n"
"rev32.16b v0, v0 \n"
"str q0, [%1], #16 \n"
: "=r"(src), "=r"(dst)
: "r"(src), "r"(dst)
: "d0", "d1"
);
#else
asm volatile("vld1.32 {d0, d1}, [%0]! \n"
"vrev32.8 q0, q0 \n"
"vst1.32 {d0, d1}, [%1]! \n"
: "=r"(src), "=r"(dst)
: "r"(src), "r"(dst)
: "d0", "d1"
);
#endif
#else
The intrinsics code that I posted in the original post generated surprisingly good assembly, though, and also generated the arm64 version for me, so it may be a better idea to use intrinsics in the future.

Getting calling conventions from DWARF info

I am trying to get information about calling conventions from DWARF info. More specifically, I want to know which registers / stack locations are used to pass arguments to functions. My problem is that in some cases I get seemingly wrong information from the DWARF dump. The example I am using is the following C code:
int __attribute__ ((fastcall)) __attribute__ ((noinline)) mult (int x, int y) {
return x*y;
}
I compile this example using the following command:
gcc -c -g -m32 test.c -o test.o
Now when I use the following command to get the dwarf dump:
dwarfdump test.o
I am getting the following information about this function:
< 2><0x00000042> DW_TAG_formal_parameter
DW_AT_name "x"
DW_AT_decl_file 0x00000001 /home/khaled/Repo_current/trunk/test.c
DW_AT_decl_line 0x00000001
DW_AT_type <0x0000005b>
DW_AT_location DW_OP_fbreg -12
< 2><0x0000004e> DW_TAG_formal_parameter
DW_AT_name "y"
DW_AT_decl_file 0x00000001 /home/khaled/Repo_current/trunk/test.c
DW_AT_decl_line 0x00000001
DW_AT_type <0x0000005b>
DW_AT_location DW_OP_fbreg -16
Looking at the DW_AT_location entries, each is an offset from the frame base. This implies they are memory arguments, but the actual calling convention, fastcall, forces them to be passed in registers. Looking at the disassembly of the produced object file, I can see that they are copied from registers to stack locations at the entry point of the function. Is there a way to know, from the DWARF dump or in any other way, where the arguments are initially passed at the call site?
Thanks,
That is because you are using gcc -c -g -m32 test.c -o test.o, i.e. compiling without optimization. Although it is a fastcall function, GCC still generates code that saves the argument values from registers into the stack frame at the beginning of the function, and the debug info points at those stack slots. Without that, a debugger such as gdb could not inspect the arguments, or would report them as optimized out; it would make debugging impossible.
On x86_64, the compiler also uses registers to pass some arguments by default, even without specifying the fastcall attribute for a function. You can see those registers being copied to the stack as well.
// x86_64 assembly code
_mult:
Leh_func_begin1:
pushq %rbp
Ltmp0:
movq %rsp, %rbp
Ltmp1:
movl %edi, -4(%rbp)
movl %esi, -8(%rbp)
movl -4(%rbp), %eax
movl -8(%rbp), %ecx
imull %ecx, %eax
If you turn on an optimization flag (-O, -O2, -O3; with or without -g), you can disassemble the result and find that nothing is copied to the stack frame. And when you gdb the optimized executable and stop at the beginning of the function to show local variables, gdb will tell you those arguments have been optimized out.
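For example (an assumed command line, mirroring the question's but with optimization turned on):
gcc -c -g -O2 -m32 test.c -o test.o
dwarfdump test.o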
The dwarfdump output for the 32-bit program would then look like:
0x00000083: TAG_formal_parameter [4]
AT_name( "x" )
AT_decl_file( "test.c" )
AT_decl_line( 1 )
AT_type( {0x0000005f} ( int ) )
AT_location( 0x00000000
0x00000000 - 0x00000003: ecx
0x00000003 - 0x00000018: ecx )
0x00000090: TAG_formal_parameter [4]
AT_name( "y" )
AT_decl_file( "test.c" )
AT_decl_line( 1 )
AT_type( {0x0000005f} ( int ) )
AT_location( 0x0000001e
0x00000000 - 0x00000003: edx
0x00000003 - 0x00000018: edx )
And you can see that the generated assembly code is much simpler and cleaner:
_mult:
pushl %ebp
movl %esp, %ebp
movl %ecx, %eax
imull %edx, %eax
popl %ebp
ret $12

How slow is NaN arithmetic in the Intel x64 FPU?

Hints and allegations abound that arithmetic with NaNs can be 'slow' in hardware FPUs. Specifically on a modern x64 FPU, e.g. on a Nehalem i7, is that still true? Do FPU multiplies get churned out at the same speed regardless of the values of the operands?
I have some interpolation code that can wander off the edge of our defined data, and I'm trying to determine whether it's faster to check for NaNs (or some other sentinel value) here, there and everywhere, or just at convenient points.
Yes, I will benchmark my particular case (it could be dominated by something else entirely, like memory bandwidth), but I was surprised not to see a concise summary somewhere to help with my intuition.
I'll be doing this from the CLR, if it makes a difference as to the flavor of NaNs generated.
For what it's worth, using the SSE instruction mulsd with NaN is pretty much exactly as fast as with the constant 4.0 (chosen by a fair dice roll, guaranteed to be random).
This code:
for (unsigned i = 0; i < 2000000000; i++)
{
double j = doubleValue * i;
}
generates this machine code (inside the loop) with clang (I assume the .NET virtual machine uses SSE instructions when it can too):
movsd -16(%rbp), %xmm0 ; gets the constant (NaN or 4.0) into xmm0
movl -20(%rbp), %eax ; puts i into a register
cvtsi2sdq %rax, %xmm1 ; converts i to a double and puts it in xmm1
mulsd %xmm0, %xmm1 ; multiplies xmm0 (the constant) with xmm1 (i)
movsd %xmm1, -32(%rbp) ; puts the result somewhere on the stack
And over two billion iterations, the NaN version (as defined by the C macro NAN from <math.h>) took about 0.017 seconds less to execute on my i7. The difference was probably caused by the task scheduler.
So to be fair, they're exactly as fast.
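For anyone who wants to reproduce this, a self-contained version of the micro-benchmark might look like the following (a sketch; it must be built without optimization, since the loop result is unused and the whole loop would otherwise be optimized away):

#include <math.h>

int main(void)
{
    volatile double doubleValue = NAN;  // swap in 4.0 for the comparison run
    for (unsigned i = 0; i < 2000000000; i++)
    {
        double j = doubleValue * i;     // the multiply under test
        (void)j;                        // result intentionally unused
    }
    return 0;
}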
