I'm learning to code in x86 assembly (32-bit at the moment) and I'm struggling to understand the memory model completely. Particularly confusing are the semantics of labels, the LEA instruction, and the layout of the executable. I wrote this sample program so I could inspect it running in gdb.
; mem.s
SECTION .data
msg: db "labeled string\n"
db "unlabeled-string\n"
nls: db 10,10,10,10,10,10,10,10
SECTION .text
global _start
_start:
; inspect msg label, LEA instruction
mov eax, msg
mov eax, &msg
mov eax, [msg]
; lea eax, msg (invalid instruction)
lea eax, &msg
lea eax, [msg]
; populate array in BSS section
mov [arr], DWORD 1
mov [arr+4], DWORD 2
mov [arr+8], DWORD 3
mov [arr+12], DWORD 4
; trying to print the unlabeled string
mov eax, 4
mov ebx, 1
lea ecx, [msg+15]
int 80H
mov eax, 1 ; exit syscall
mov ebx, 0 ; return value
int 80H
SECTION .bss
arr: resw 16
I've assembled and linked with:
nasm -f elf -g -F stabs mem.s
ld -m elf_i386 -o mem mem.o
GDB session:
(gdb) disas *_start
Dump of assembler code for function _start:
0x08048080 <+0>: mov $0x80490e4,%eax
0x08048085 <+5>: mov 0x80490e4,%eax
0x0804808a <+10>: mov 0x80490e4,%eax
0x0804808f <+15>: lea 0x80490e4,%eax
0x08048095 <+21>: lea 0x80490e4,%eax
0x0804809b <+27>: movl $0x1,0x8049110
0x080480a5 <+37>: movl $0x2,0x8049114
0x080480af <+47>: movl $0x3,0x8049118
0x080480b9 <+57>: movl $0x4,0x804911c
0x080480c3 <+67>: mov $0x4,%eax
0x080480c8 <+72>: mov $0x1,%ebx
0x080480cd <+77>: lea 0x80490f3,%ecx
0x080480d3 <+83>: int $0x80
0x080480d5 <+85>: mov $0x1,%eax
0x080480da <+90>: mov $0x0,%ebx
0x080480df <+95>: int $0x80
inspecting "msg" value:
(gdb) b _start
Breakpoint 1 at 0x8048080
(gdb) run
Starting program: /home/jh/workspace/x86/mem_addr/mem
(gdb) p msg
# what does this value represent?
$1 = 1700946284
(gdb) p &msg
$2 = (<data variable, no debug info> *) 0x80490e4
# this is the address where "labeled string" is stored
(gdb) p *0x80490e4
# same value as above (eg: p msg)
$3 = 1700946284
(gdb) p *msg
Cannot access memory at address 0x6562616c
# NOTE: 0x6562616c is ASCII values of 'e','b','a','l'
# the first 4 bytes from msg: db "labeled string"... little-endian
(gdb) x msg
0x6562616c: Cannot access memory at address 0x6562616c
(gdb) x &msg
0x80490e4 <msg>: 0x6562616c
(gdb) x *msg
Cannot access memory at address 0x6562616c
Stepping through one instruction at a time:
(gdb) p $eax
$4 = 0
(gdb) stepi
0x08048085 in _start ()
(gdb) p $eax
$5 = 134516964
(gdb) stepi
0x0804808a in _start ()
(gdb) p $eax
$6 = 1700946284
(gdb) stepi
0x0804808f in _start ()
(gdb) p $eax
$7 = 1700946284
(gdb) stepi
0x08048095 in _start ()
(gdb) p $eax
$8 = 134516964
The array was populated with the values 1,2,3,4 as expected:
# before program execution:
(gdb) x/16w &arr
0x8049104 <arr>: 0 0 0 0
0x8049114: 0 0 0 0
0x8049124: 0 0 0 0
0x8049134: 0 0 0 0
# after program execution
(gdb) x/16w &arr
0x8049104 <arr>: 1 2 3 4
0x8049114: 0 0 0 0
0x8049124: 0 0 0 0
0x8049134: 0 0 0 0
I don't understand why printing a label in gdb results in those two values. Also, how can I print the unlabeled string.
Thanks in advance
It's somewhat confusing because gdb doesn't understand the concept of labels, really -- it's designed to debug a program written in a higher-level language (C or C++, generally) and compiled by a compiler. So it tries to map what it sees in the binary to high-level language concepts -- variables and types -- based on its best guess as to what is going on (in the absence of debug info from the compiler that tells it what is going on).
what nasm does
To the assembler, a label is a value that hasn't been set yet -- it actually gets its final value when the linker runs. Generally, labels are used to refer to addresses in sections of memory -- the actual address will get defined when the linker lays out the final executable image. The assembler generates relocation records so that uses of the label can be set properly by the linker.
So when the assembler sees
mov eax, msg
it knows that msg is a label corresponding to an address in the data segment, so it generates an instruction to load that address into eax. When it sees
mov eax, [msg]
it generates an instruction to load 32 bits (the size of register eax) from memory at the address of msg. In both cases, there will be a relocation generated so that the linker can plug in the final address msg ends up with.
(aside -- I have no idea what & means to nasm -- it doesn't appear anywhere in the documentation I can see, so I'm surprised it doesn't give an error. But it looks like it treats it as an alias for [])
Now LEA is a funny instruction -- it has basically the same format as a move from memory, but instead of reading memory, it stores the address it would have read from into the destination register. So
lea eax, msg
makes no sense -- the source is the label (address) msg, which is a (link time) constant and is not in memory anywhere.
lea eax, [msg]
works, as the source is in memory, so it sticks the address of the source into eax. This is the same effect as mov eax, msg. Most commonly, you only see lea used with more complex addressing modes, so that you can leverage the x86 AGU to do useful work other than just computing addresses. Eg:
lea eax, [ebx+4*ecx+32]
which does a shift and two adds in the AGU and puts the result into eax rather than loading from that address.
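The address arithmetic that lea performs can be sketched outside the CPU. The following is a minimal illustration in Python (used here only so it can be run anywhere); the helper name effective_address and the example register values are assumptions made for the sketch, not anything from the original program:

```python
MASK32 = 0xFFFFFFFF  # results wrap around at 32 bits, like a 32-bit register


def effective_address(base, index, scale, disp):
    """Compute base + index*scale + disp the way a 32-bit address calculation would."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 only encodes scale factors 1, 2, 4 and 8")
    return (base + index * scale + disp) & MASK32


# lea eax, [ebx+4*ecx+32] with example register values ebx=0x1000, ecx=3
eax = effective_address(base=0x1000, index=3, scale=4, disp=32)
print(hex(eax))
```

No memory is touched at any point: the result is only the number that a load from that operand would have used as its address.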
what gdb does
In gdb, when you type p <expression> it tries to evaluate <expression> to the best of its understanding of what the C/C++ compiler means for that expression. So when you say
(gdb) p msg
it looks at msg and says "that looks like a variable, so let's get the current value of that variable and print that". Now it knows that compilers like to put global variables into the .data segment, and that they create symbols for those variables with the same name as the variable. Since it sees msg in the symbol table as a symbol in the .data segment, it assumes that is what is going on, and fetches the memory at that symbol and prints it. Now it has no idea what TYPE that variable is (no debug info), so it guesses that it is a 32-bit int and prints it as that.
So the output
$1 = 1700946284
is the first 4 bytes of msg, treated as an integer.
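That arithmetic can be checked outside gdb. The following is a minimal sketch in Python (chosen only so it runs anywhere; the variable names are illustrative and not part of the original program) that reinterprets the first four bytes of the string as a little-endian 32-bit integer:

```python
# First four bytes stored at msg, in memory order.
first_four = b"labe"

# x86 is little-endian: the byte at the lowest address is the least significant.
as_int = int.from_bytes(first_four, byteorder="little")

print(as_int)       # decimal value gdb shows for `p msg`
print(hex(as_int))  # same value in hex, as shown by `x &msg`

# The other number seen while stepping is just the address of msg in decimal.
msg_address = 0x80490E4
print(msg_address)
```

This is why the hex form reads as 'e','b','a','l' from left to right: the most significant byte is printed first, but it lives at the highest address.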
For p &msg it understands you want to take the address of the variable msg, so it gives the address from the symbol directly. When printing addresses, gdb prints the type information it has about those addresses, thus the "data variable, no debug info" that comes out with it.
If you want, you can use a cast to specify the type of something to gdb, and it will use that type instead of what it has guessed:
(gdb) p (char)msg
$6 = 108 'l'
(gdb) p (char [10])msg
$7 = "labeled st"
(gdb) p (char *)&msg
$8 = 0x80490e4 "labeled string\\nunlabeled-string\\n\n\n\n\n\n\n\n\n" <Address 0x804910e out of bounds>
Note in the latter case here, there's no NUL terminator on the string, so it prints out the entire data segment...
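To print the unlabeled string with sys_write, you need the address and length of the string, which you almost have. The offset and length below (15 and 17) assume each \n is a single newline byte, which in NASM requires backquoted strings; with the double-quoted strings in the question, \n is stored as two literal characters, so the offset becomes 16 and the length 18. For completeness you should also check the return value: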
mov ebx, 1 ; fd 1 (stdout)
lea ecx, [msg+15] ; address
mov edx, 17 ; length
write_more:
mov eax, 4 ; sys_write
int 80H ; write(1, &msg[15], 17)
test eax, eax ; check for error
js error ; error, eax = -ERRNO
add ecx, eax
sub edx, eax
jg write_more ; only part of the string was written
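The same retry-until-done pattern, sketched in Python for comparison. This is only an illustration of the loop's logic, not a translation of the syscall interface: the function name write_all is made up for the sketch, and Python's os.write reports failure by raising OSError rather than returning a negative value.

```python
import os


def write_all(fd, data):
    """Keep calling write until every byte of data has been accepted."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)  # may write fewer bytes than requested
        view = view[written:]         # advance past what was written (add ecx / sub edx)


write_all(1, b"unlabeled-string\n")  # fd 1 is stdout
```

Advancing the buffer by the returned count corresponds to the add ecx, eax and sub edx, eax pair in the assembly version.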
Chris Dodd sez...
(aside -- I have no idea what & means to nasm -- it doesn't appear anywhere in the documentation I can see, so I'm surprised it doesn't give an error. But it looks like it treats it as an alias for [])
Oh, oh! You've discovered the secret syntax! "&" was added to Nasm (as an alias for "[]") per user request, a long time ago. It was never documented. Never removed, either. I'd stick with "[]". Being "undocumented", it might just disappear. Note that the meaning is almost "opposite" from what it means to gdb!
Might try "-F dwarf" instead of "-F stabs". It's supposed to be the "native" debugging info format used by gdb. (never noticed much difference, myself)
Best,
Frank
http://www.nasm.us
Related
I am trying to assemble the code below using yasm. I have put 'here' comments where yasm reports the error "error: invalid size for operand 2". Why is this error happening?
segment .data
a db 25
b dw 0xffff
c dd 3456
d dq -14
segment .bss
res resq 1
segment .text
global _start
_start:
movsx rax, [a] ; here
movsx rbx, [b] ; here
movsxd rcx, [c] ; here
mov rdx, [d]
add rcx, rdx
add rbx, rcx
add rax, rbx
mov [res], rax
ret
For most instructions, the width of the register operand implies the width of the memory operand, because both operands have to be the same size. e.g. mov rdx, [d] implies mov rdx, qword [d] because you used a 64-bit register.
But the same movsx / movzx mnemonics are used for the byte-source and word-source opcodes, so it's ambiguous unless the source is a register (like movzx eax, cl). Another example is crc32 r32, r/m8 vs. r/m16 vs. r/m32. (Unlike movsx/zx, its source size can be as wide as the operand-size.)
movsx / movzx with a memory source always need the width of the memory operand specified explicitly.
The movsxd mnemonic is supposed to imply a 32-bit source size. movsxd rcx, [c] assembles with NASM, but apparently not with YASM. YASM requires you to write dword, even though it doesn't accept byte, word, or qword there, and it doesn't accept movsx rcx, dword [c] either (i.e. it requires the movsxd mnemonic for 32-bit source operands).
In NASM, movsx rcx, dword [c] assembles to movsxd, but movsxd rcx, word [c] is still rejected. i.e. in NASM, plain movsx is fully flexible, but movsxd is still rigid. I'd still recommend using dword to make the width of the load explicit, for the benefit of humans.
movsx rax, byte [a]
movsx rbx, word [b]
movsxd rcx, dword [c]
Note that the "operand size" of the instruction (as determined by the operand-size prefix to make it 16-bit, or REX.W=1 to make it 64-bit) is the destination width for movsx / movzx. Different source sizes use different opcodes.
In case it's not obvious, there's no movzxd because 32-bit mov already zero-extends to 64-bit implicitly. movsxd eax, ecx is encodeable, but not recommended (use mov instead).
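To see what the corrected loads actually produce for the values in the question, here is a small sketch in Python. It only models the arithmetic of sign and zero extension; the helper names sign_extend and zero_extend are made up for the illustration.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement signed number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def zero_extend(value, bits):
    """Keep the low `bits` bits and treat them as unsigned."""
    return value & ((1 << bits) - 1)


a = sign_extend(25, 8)        # movsx rax, byte [a]
b = sign_extend(0xFFFF, 16)   # movsx rbx, word [b]
c = sign_extend(3456, 32)     # movsxd rcx, dword [c]
d = -14                       # mov rdx, [d] (already 64-bit)

total = a + b + c + d
print(a, b, c, d, total)
print(zero_extend(0xFFFF, 16))  # what movzx would have produced for b
```

The word 0xffff is the interesting case: sign-extended it is -1, zero-extended it is 65535, which is why the source width has to be stated explicitly.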
In AT&T syntax, you need to explicitly specify both the source and destination width in the mnemonic, like movsbq (%rsi), %rax. GAS won't let you write movsb (%rsi), %eax to infer a destination width (operand-size) because movsb/movsw/etc are the mnemonics for string-move instructions with implicit (%rsi), (%rdi) operands.
Fun fact: GAS and clang do allow it for things like movzb (%rsi), %eax as movzbl, but GAS only has extra logic to allow disambiguation (not just inferring size) based on operands when it's necessary, like movsd (%rsi), %xmm0 vs. movsd. (Clang12.0.1 actually does accept movsb (%rcx), %eax as movsbl, but GAS 2.36.1 doesn't, so for portability it's best to be explicit with sign-extension, and not a bad idea for zero-extension too.)
Other stuff about your source code:
NASM/YASM allow you to use the segment keyword instead of section, but really you're giving ELF section names, not executable segment names. Also, you can put read-only data in section .rodata (which is linked as part of the text segment). See also the related question "What's the difference of section and segment in ELF file format".
You can't ret from _start. It's not a function, it's your ELF entry point. The first thing on the stack is argc, not a valid return address. Use this to exit cleanly:
xor edi,edi
mov eax, 231
syscall ; sys_exit_group(0)
See the x86 tag wiki for links to more useful guides (and debugging tips at the bottom).
Let's say for example I have four specific memory addresses that each hold a 32-bit integer. How would you use assembly language to take the address and assign it to register %eax?
Would it be movl 0x12AED567, %eax?
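Almost. In AT&T syntax, movl 0x12AED567, %eax loads the 32-bit value stored at that address; to put the address itself into eax, make it an immediate with movl $0x12AED567, %eax. In Intel syntax (destination first) the immediate form is:
mov eax, 12AED567h
But if you want to get the addresses dynamically, you have to use the lea instruction. The next little program shows how: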
.model small
.386
.stack 100h
.data
my_number dd 0A01Ch
.code
start:
;INITIALIZE DATA SEGMENT.
mov ax,@data
mov ds,ax
;GET THE MEMORY ADDRESS OF MY_NUMBER, NOT THE NUMBER ITSELF.
lea eax, my_number
;FINISH THE PROGRAM PROPERLY.
mov ax,4c00h
int 21h
end start
Is this what you were looking for?
By the way, this is 8086 assembler with Intel's syntax.
reader of this question.
I'm not new to assembly. But I'm new to MASM. (In fact, I was using that hardcore clean tasm stuff for about 8 years without even a single minute of using a single macro, he-he.)
Now, I've got to make a simple program. I already did its main logic. But there is some trouble with output.
When I use
output <some-variable-name>
it makes the thing - it outputs characters.
But now I want to begin output not from the very beginning of some variable but from a specific address in memory. Now I do:
lea eax, <some-variable-name>
mov esi, eax
... manipulations with address in esi, like 'add esi, ebx' and so on...
output esi
But that won't work.
Compiler says 'error A2070: invalid instruction operands'.
I use Microsoft Macro Assembler version 6.11.
Thanks in advance.
Sorry for my broken English.
UPD: definition of the 'output' macro, taken from the included 'io.h' file:
output MACRO string,xtra ;; display string
IFB <string>
.ERR <missing operand in OUTPUT>
EXITM
ENDIF
IFNB <xtra>
.ERR <extra operand(s) in OUTPUT>
EXITM
ENDIF
push eax ;; save EAX
lea eax,string ;; string address
push eax ;; string parameter on stack
call outproc ;; call outproc(string)
pop eax ;; restore EAX
ENDM
The solution in this situation is to use the following:
lea eax, <some-variable-name>
mov esi, eax
... manipulations with pointer, like 'add esi, edx' and so on ...
push esi
call outproc
I have been trying to create a simple chunk of shell code that allows me to modify a string by doing something simple like changing a letter, then print it out.
_start:
jmp short ender
starter:
xor ecx, ecx ;clear out registers
xor eax, eax
xor ebx, ebx
xor edx, edx
pop esi ;pop address of string into esi register
mov byte [esi+1], 41 ;try and put an ASCII 'A' into the second letter of the string
;at the address ESI+1
mov ecx, esi ;move our string into the ecx register for the write syscall
mov al, 4 ;write syscall number
mov bl, 1 ;write to STDOUT
mov dl, 11 ;string length
int 0x80 ;interrupt call
ender:
call starter
db 'hello world'
It's supposed to print out "hallo world". The problem occurs (segfault) when I try and modify a byte of memory with a mov command like so.
mov byte [esi+1], 41
I ran the program though GDB and the pop esi command works correctly, esi is loaded with the address of the string and everything is valid. I can't understand why I cant modify a byte value at the valid address though. I am testing this "shellcode" by just running the executable generated by NASM and ld, I am not putting it in a C program or anything yet so the bug exists in the assembly.
Extra information
I am using x64 Linux with the following build commands:
nasm -f elf64 shellcode.asm -o shellcode.o
ld -o shellcode shellcode.o
./shellcode
I have pasted the full code here.
If I had to guess, I'd say db 'hello world' is being assembled in as code, rather than data, and as such has read and execute permissions, but not write ones. So you're actually falling foul of page protection.
To change this, you need to place the string in a .data section and use nasm's variable syntax to find it. In order to modify the data as is, you're going to need to make an mprotect call to modify the permissions on your pages, and write to them. Note: these won't persist back to the executable file - mmap()'s MAP_PRIVATE ensures that's the case.
I'm using the NASM assembler.
The value returned to the eax register is supposed to be a character, but when I attempt to print the integer representation it's a value that looks like a memory address. I was expecting the decimal representation of the letter. For example, if character 'a' was moved to eax I should see 97 being printed (the decimal representation of 'a'). But this is not the case.
section .data
int_format db "%d", 0
;-----------------------------------------------------------
mov eax, dword[ebx + edx]
push eax
push dword int_format
call _printf ;prints a strange number
add esp, 8
xor eax, eax
mov eax, dword[ebx + edx]
push eax
call _putchar ;prints the correct character!
add esp, 4
So what gives here? Ultimately I want to compare the character, so it is important that eax gets the correct decimal representation of the character.
mov eax, dword[ebx + edx]
You are loading a dword (32 bits) from the address pointed to by ebx+edx. If you want a single character, you need to load a byte. For that, you can use the movzx instruction:
movzx eax, byte[ebx + edx]
This will load a single byte to the low byte of eax (i.e. al) and zero out the rest of the register.
Another option would be to mask out the extra bytes after loading the dword, e.g.:
and eax, 0ffh
or
movzx eax, al
As for putchar, it works because it interprets the passed value as char, i.e. it ignores the high three bytes present in the register and considers only the low byte.
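The difference between the two loads can be reproduced with a few bytes of made-up data. The sketch below is in Python and assumes, purely for illustration, that the four bytes at [ebx + edx] are "abcd"; the real contents after the first character are unknown.

```python
import struct

memory = b"abcd"  # hypothetical bytes at [ebx + edx]; only the first is the wanted character

dword_load = struct.unpack_from("<I", memory, 0)[0]  # mov eax, dword [ebx + edx]
byte_load = memory[0]                                 # movzx eax, byte [ebx + edx]
masked = dword_load & 0xFF                            # and eax, 0ffh after the dword load

print(dword_load)  # large number: four characters packed together
print(byte_load)   # 97, the code for 'a'
print(masked)
```

The dword load packs the three following bytes into the upper part of the register, which is why printf with %d shows a large, address-like number while putchar, which looks only at the low byte, prints the right character.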