Do the glibc implementations of pthread_spin_lock() and pthread_spin_unlock() have memory fence instructions?

Do the glibc implementations of pthread_spin_lock() and pthread_spin_unlock() have memory fence instructions? (I could not find any fence instructions.)
A similar question has answers here:
Does pthread_mutex_lock contains memory fence instruction?

Do the glibc implementations of pthread_spin_lock() and pthread_spin_unlock() have memory fence instructions?
There is no single implementation; there is one for each supported processor.
The x86_64 implementation does not use memory fences; it uses the lock prefix instead:
gdb -q /lib/x86_64-linux-gnu/libpthread.so.0
(gdb) disas pthread_spin_lock
Dump of assembler code for function pthread_spin_lock:
0x00000000000108c0 <+0>: lock decl (%rdi)
0x00000000000108c3 <+3>: jne 0x108d0 <pthread_spin_lock+16>
0x00000000000108c5 <+5>: xor %eax,%eax
0x00000000000108c7 <+7>: retq
0x00000000000108c8 <+8>: nopl 0x0(%rax,%rax,1)
0x00000000000108d0 <+16>: pause
0x00000000000108d2 <+18>: cmpl $0x0,(%rdi)
0x00000000000108d5 <+21>: jg 0x108c0 <pthread_spin_lock>
0x00000000000108d7 <+23>: jmp 0x108d0 <pthread_spin_lock+16>
Since lock-prefixed instructions already act as a memory barrier on x86_64 (and i386), no additional memory barriers are necessary.
The powerpc implementation, by contrast, uses the lwarx and stwcx. instructions (load-reserve and store-conditional), which are closer to a "memory fence", and the sparc64 implementation uses a full membar (memory barrier) instruction.
You can see the various implementations in the sysdeps/.../pthread_spin_lock.* files in the glibc sources.

Related

How to find a variable's location in memory without source code?

Basically, I want to find the address/location of a variable in gdb.
I know that variables are normally stored at rbp, but I don't know how to locate them using gdb.
I want to find the address/location of a variable in gdb
That is possible, but the approach differs depending on whether the variable is a global or a local.
I know that variables are normally stored at rbp
Local variables are stored at some offset from the frame pointer. %rbp is often used as the frame pointer in unoptimized binaries.
To find such a variable in code compiled without debug info, you need to read the machine code and work out the offset yourself. GDB cannot do that for you, because the information it would need is not there.
without source code
Source code has nothing to do with this; GDB never looks at the source code, except to display it to you.
On to a concrete example. Suppose you have the following source:
int foo(int *ip) { return *ip + 42; }
int main()
{
int j = 1;
return foo(&j);
}
Compiling this without debug info and without optimizations results in:
(gdb) disas main
Dump of assembler code for function main:
0x000000000000060d <+0>: push %rbp
0x000000000000060e <+1>: mov %rsp,%rbp
0x0000000000000611 <+4>: sub $0x10,%rsp
0x0000000000000615 <+8>: movl $0x1,-0x4(%rbp)
0x000000000000061c <+15>: lea -0x4(%rbp),%rax
0x0000000000000620 <+19>: mov %rax,%rdi
0x0000000000000623 <+22>: callq 0x5fa <foo>
0x0000000000000628 <+27>: leaveq
0x0000000000000629 <+28>: retq
End of assembler dump.
Here you can clearly see that j is stored at offset -4 from %rbp.
You can set a breakpoint on foo and use GDB to examine the value of j like so:
(gdb) b foo
Breakpoint 1 at 0x5fe
(gdb) run
Breakpoint 1, 0x00005555555545fe in foo ()
(gdb) up
#1 0x0000555555554628 in main ()
(gdb) x/x $rbp-4
0x7fffffffdbcc: 0x00000001 // indeed, that is the expected value of j

AVX instructions generated when -xSSE4.1 specified

I have compiled a piece of code with the option -xSSE4.1 using the Intel compiler. When I looked at the generated assembly file, I saw that AVX instructions such as 'vpmovzxbw' had been inserted. But the executable still seems to run on machines that don't support the AVX instruction set. What explains this?
Here's the particular code snippet -
C -> src0_8x16b = _mm_cvtepu8_epi16 (src0_8x16b);
Assembly -> vpmovzxbw xmm4, QWORD PTR [rcx]
Binary -> 00066 c4 62 79 30 29
Here's another snippet where the assembly instruction uses 3 operands -
C -> src0_8x16b = _mm_sub_epi16 (src0_8x16b, src1_8x16b);
Assembly -> vpsubw xmm1, xmm13, xmm11
Binary -> 000bc c4 c1 11 f9 cb
For comparison, here's the disassembly generated by icc for the function 'foo' (The only difference between the function foo and the code snippet above is that the code snippet was coded using intrinsics) -
Compiler commands used -
icc -S -xSSE4.1 -axavx -O3 foo.c
Function foo -
void foo(float *x, int n)
{
int i;
for(i=0; i<n; i++) x[i] *= 2.0;
}
Autodispatch code -
testl $-131072, __intel_cpu_indicator(%rip) #1.27
jne foo.R #1.27
testl $-1, __intel_cpu_indicator(%rip) #1.27
jne foo.A
Loop in foo.R (AVX variant) -
vmulps (%rdi,%rcx,4), %ymm0, %ymm1 #3.24
vmulps 32(%rdi,%rcx,4), %ymm0, %ymm2 #3.24
vmovups %ymm1, (%rdi,%rcx,4) #3.24
vmovups %ymm2, 32(%rdi,%rcx,4) #3.24
addq $16, %rcx #3.5
cmpq %rdx, %rcx #3.5
jb ..B2.12 # Prob 82% #3.5
Loop in foo.A (SSE variant) -
movaps (%rdi,%r8,4), %xmm1 #3.24
movaps 16(%rdi,%r8,4), %xmm2 #3.24
mulps %xmm0, %xmm1 #3.24
mulps %xmm0, %xmm2 #3.24
movaps %xmm1, (%rdi,%r8,4) #3.24
movaps %xmm2, 16(%rdi,%r8,4) #3.24
addq $8, %r8 #3.5
cmpq %rsi, %r8 #3.5
jb ..B3.12 # Prob 82% #3.5
I have tried to replicate the results on two other compilers, viz., gcc and Microsoft Visual Studio's v100 compilers. I was unable to do so, i.e., gcc and v100 compilers seem to be generating the correct disassemblies. As a further step, I looked closely at the differences, if any, that existed between the compiler arguments that I had specified in each case. It turns out that whilst using the icc compiler, I had enabled the option to inherit project defaults for compiling this particular file. The project settings were configured such that this option was included -
-xavx
As a result when this file was being compiled, the settings I had provided -
-xSSE4.1 -axavx
were overridden by the former. This was the cause of the behavior I have detailed in my question.
I am sorry for this error, but I shall not delete this question since @Zboson's answer is exceptional.
PS - I had mentioned in one of my comments that I was able to run this code on an SSE42 machine. That was because the exe I had run on that machine was indeed SSE41 compliant since I had apparently used an exe generated using the gcc compiler. I ran the icc generated exe and it was indeed crashing with an illegal instruction error on the SSE42 machine.
The Intel compiler can generate a single executable with multiple levels of vectorization with the -ax flag. For example, to generate code that is compatible with AVX, SSE4.1 and SSE2, use -axAVX -axSSE4.2 -xSSE2.
Since you compiled with -axAVX -xSSE4.1, the Intel compiler generated an AVX branch and an SSE4.1 branch; at runtime it determines which instruction set is available and chooses the matching branch.
Agner Fog has a good description of Intel's CPU dispatcher in his Optimizing C++ manual. See section "13.7 CPU dispatching in Intel compiler". Intel's CPU dispatcher is not ideal for several reasons, one of which is that it behaves badly on AMD processors, which Agner describes in detail. Personally, I would make my own dispatcher.
I compiled the following code with ICC 13.0 with options -O3 -axavx -xsse2
void foo(float *x, int n) {
for(int i=0; i<n; i++) x[i] *= 2.0;
}
and the start of the assembly is
test DWORD PTR __intel_cpu_indicator[rip], -131072 #1.27
jne _Z3fooPfi.R #1.27
test DWORD PTR __intel_cpu_indicator[rip], -1 #1.27
jne _Z3fooPfi.A
Following the _Z3fooPfi.R branch, we find the main AVX loop:
..B2.12: # Preds ..B2.12 ..B2.11
vmulps ymm1, ymm0, YMMWORD PTR [rdi+rcx*4] #2.25
vmulps ymm2, ymm0, YMMWORD PTR [32+rdi+rcx*4] #2.25
vmovups YMMWORD PTR [rdi+rcx*4], ymm1 #2.25
vmovups YMMWORD PTR [32+rdi+rcx*4], ymm2 #2.25
add rcx, 16 #2.2
cmp rcx, rdx #2.2
jb ..B2.12 # Prob 82% #2.2
The _Z3fooPfi.A branch contains the main SSE loop:
movaps xmm1, XMMWORD PTR [rdi+r8*4] #2.25
movaps xmm2, XMMWORD PTR [16+rdi+r8*4] #2.25
mulps xmm1, xmm0 #2.25
mulps xmm2, xmm0 #2.25
movaps XMMWORD PTR [rdi+r8*4], xmm1 #2.25
movaps XMMWORD PTR [16+rdi+r8*4], xmm2 #2.25
add r8, 8 #2.2
cmp r8, rsi #2.2
jb ..B3.12 # Prob 82% #2.2

asm usage of memory location operands

I am having trouble with the definition of 'memory location'. According to the 'Intel 64 and IA-32 Software Developer's Manual', many instructions can take a memory location as an operand.
For example MOVBE (move data after swapping bytes):
Instruction: MOVBE m32, r32
The question is now how a memory location is defined;
I tried to use variables defined in the .bss section:
section .bss
memory: resb 4 ;reserve 4 byte
memorylen: equ $-memory
section .text
global _start
_start:
MOV R9D, 0x6162630A
MOV [memory], R9D
SHR [memory], 1
MOVBE [memory], R9D
; EDIT: exit system call added
MOV EAX, 0x01
MOV EBX, 0x00
int 0x80
; end of EDIT
If SHR is commented out, yasm (yasm -f elf64 .asm) assembles the file without problems, but running the program prints: Illegal Instruction
And if MOVBE is commented out, the following error occurs when assembling: error: invalid size for operand 1
How do I have to allocate memory to use the 'm' operand form shown in the instruction set reference?
[CPU=x64, Compiler=yasm]
If that is all your code, execution falls off the end into an uninitialized region, so you will get a fault. That has nothing to do with allocating memory, which you did correctly. You need to add code that terminates your program with an exit system call, or at least add an endless loop to avoid the fault (and then kill the program with ctrl+c or equivalent).
Update: While the above is true, the illegal instruction here is more likely caused by your CPU simply not supporting the MOVBE instruction, because not all CPUs do. The reference says #UD If CPUID.01H:ECX.MOVBE[bit 22] = 0. That tells you that bit 22 of the ECX register returned by leaf 01H of the CPUID instruction indicates support for this instruction. If you are on Linux, you can conveniently check /proc/cpuinfo for the movbe flag.
As for the invalid operand size: you should generally specify the operand size whenever it cannot be deduced from the operands. SHR accepts all sizes (byte, word, dword, qword), so arguably you should not get that error at all, but you might get an operation of an unexpected default size. Use SHR dword [memory], 1 in this case, which also makes yasm happy.
Oh, and +1 for reading the Intel manual ;)

Virtual memory without hardware support

While reading this question and its answer, I couldn't help wondering why hardware support is required for virtual memory.
For example, can't I simulate this behavior in software only (e.g. the OS would represent all the memory as some table, intercept all memory-related actions, and do the mapping itself)?
Is there any OS that implements such techniques?
As far as I know, no.
Intercepting all memory-related actions doesn't look impossible, but it must be very, very slow.
For example, consider this code:
int f(int *a1, int *b1, int *c1, int *d1)
{
const int n=100000;
for(int j=0;j<n;j++){
a1[j] += b1[j];
c1[j] += d1[j];
}
}
(From here >o<)
This simple loop is compiled into the following by gcc -std=c99 -O3 with gcc 4.8.3:
push %edi ; 1
xor %eax,%eax
push %esi ; 2
push %ebx ; 3
mov 0x10(%esp),%ecx ; 4
mov 0x14(%esp),%esi ; 5
mov 0x18(%esp),%edx ; 6
mov 0x1c(%esp),%ebx ; 7
mov (%esi,%eax,4),%edi ; 8
add %edi,(%ecx,%eax,4) ; 9
mov (%ebx,%eax,4),%edi ; 10
add %edi,(%edx,%eax,4) ; 11
add $0x1,%eax
cmp $0x186a0,%eax
jne 15 <_f+0x15> ; 12
pop %ebx ; 13
pop %esi ; 14
pop %edi ; 15
ret ; 16
Even this very simple function has 16 instructions that access memory. The OS's simulation code for a single access would probably run to hundreds of instructions, so we can guess that every memory-accessing instruction would become at least hundreds of times slower.
Moreover, that estimate assumes you can trap only the memory-accessing instructions. Your processor probably doesn't have such a feature, so you would have to single-step, for example with x86's Trap Flag, and check every instruction every time.
It gets worse: checking data accesses is not enough. You would want the IP (instruction pointer) to follow your OS's virtual memory rules as well, so you must check whether the IP has crossed a page boundary after each instruction runs. You must also deal very carefully with instructions that can change the IP, such as jmp, call, ret, ...
It cannot be implemented efficiently. Speed is one of the most important properties of an operating system. If the OS becomes a bit slow, the whole system is affected. In this case it's not just a bit: your computer would become dramatically slower. Moreover, implementing this is very difficult, as I said above. I'd rather write an emulator of a processor that has hardware-supported virtual memory than do this crazy job!

How to Switch Xcode 4.2's iOS disassembly from Thumb to ARM?

My iOS App is built with the Apple LLVM 3.0 compiler in Thumb mode. For armv7, I'm pretty sure that's actually Thumb-2.
I'm reimplementing my two most time-consuming functions in ARM assembly code. The callers of these functions are Thumb, so I use Thumb to ARM interworking instructions to switch to ARM in the prologue of my functions so I have access to ARM's richer instruction set and greater number of registers. At function exit I use ARM to Thumb interworking to return to Thumb mode.
GDB's disassembly is correct for the Thumb code, but when I am in ARM mode, it disassembles the ARM instructions as if each one were a pair of completely nonsensical Thumb instructions. Is there some way I can tell GDB to switch to ARM disassembly, then upon returning to Thumb code, use the Thumb disassembler?
Google is no help. There are apparently other forks of GDB that can do that, but I haven't figured out a way to do it with GDB.
LLDB apparently supports ARM debugging, but it does not yet work on iOS devices in Xcode 4.2. When I choose the LLDB debugger in Product -> Edit Scheme, then set a breakpoint in my code, my App hangs before hitting the breakpoint.
It's been a long time since I've done any assembly of any sort, so I am brushing up on the ARM calling conventions by implementing functions that take various parameters and return various types of results in both C and ARM assembly. The lower_case functions are C and the CamelCase functions are assembly. I call abiTest the very first thing from main(), and use assert() to ensure that it returns YES:
BOOL abiTest( void )
{
void_no_args();
VoidNoArgs();
if ( 42 != int_no_args() )
return NO;
if ( 42 != IntNoArgs() )
return NO;
return YES;
}
Here is the source for IntNoArgs. .thumb_func is a directive for the linker. My research seems to indicate that you want it even for ARM functions, if one mixes the two types of code:
.globl _IntNoArgs
.align 1
.code 16
.thumb_func _IntNoArgs
_IntNoArgs:
# int IntNoArgs( void );
.loc 1 __LINE__ 0
adr r0, Larm1 # Larm1 is a PC-relative address. r0's low bit will be cleared
bx r0 # Switch to ARM mode then branch to Larm1. That's the next instruction
.align 2
.code 32
Larm1:
stmfd sp!, { r7, lr }
mov r0, #42
ldmfd sp!, { r7, lr }
bx lr
Here is how GDB disassembles _IntNoArgs. The first two lines are correct; the remainder are completely wrong:
0x000172c8 <+0000> add r0, pc, #0 (adr r0, 0x172cc <VoidNoArgs+4>)
0x000172ca <+0002> bx r0
0x000172cc <+0004> lsls r0, r0
0x000172ce <+0006> stmdb sp!, {r1, r3, r5}
0x000172d2 <+0010> b.n 0x17a16
0x000172d4 <+0012> lsls r0, r0
0x000172d6 <+0014> ldmia.w sp!, {r1, r2, r3, r4, r8, r9, r10, r11, r12, sp, lr, pc}
The disassembly stops here because the ldmia.w instruction appears to be putting a new value into the program counter after taking it from the stack, thereby returning from the subroutine. After I step over this instruction with "si", the disassembly pane shows:
0x000172d8 <+0016> vrhadd.u16 d14, d14, d31
The si instruction always does the right thing, by advancing just one instruction whether we are in Thumb or ARM mode. So GDB must know the current instruction set architecture, it's just that the disassembler is not getting that information.
There is a bit in one of the ARM's registers that indicates the current mode. Some forks of GDB have the ability to use the value of that bit when determining which ISA to disassemble as, but this is apparently not the case with Xcode 4.2's GDB.
Xcode's GDB has a command "set arm disassembler" and its corresponding "show arm disassembler" that look like they would help, but they don't. They are apparently meant to support other kinds of ARM variants than what the iOS devices use.
"set fallback-mode" can be set to arm, thumb or auto in other forks of GDB, but not in Xcode's. Ditto for "set disassembler-flavor".
What I would REALLY REALLY REALLY like is a machine debugger that worked just like MacsBug did on the Classic Mac OS. While GDB is generally capable of doing assembler debugging, it totally sucks for that purpose. That's not anyone's fault, really, because it is designed for source debugging. A good assembly debugger is designed to do it that way from the ground up.
The ABI function call guide states that switching between ARM and Thumb mode can be done only at function boundaries in iOS. Make sure your functions are either ARM or Thumb-only, and the debugger will work fine.
