Is the stack only preserved above the stack pointer?

I sometimes see disassembled programs which have instructions like:
mov %eax, -4(%esp)
which stores eax to stack at esp-4, without changing esp.
I'd like to know whether, in general, you can put data on the stack beyond the stack pointer and have that data be preserved (not altered unless I alter it specifically).
Also, does this depend on which OS I use?

It matters which OS you use, because different OSes have different ABIs. (See the x86 tag wiki if you don't know what that means).
There are two ways I can see that mov %eax, -4(%esp) could be sane:
In the Linux x32 ABI (long mode with 32bit pointers), where there's a 128B red zone like in the normal x86-64 ABI. Compilers frequently generate code using the address-size prefix when they can't prove that e.g. 4(%rdi) would be the same as 4(%edi) in every case (e.g. wraparound). Unfortunately gcc 5.3 still uses 32bit addressing for locals on the stack, which could only wrap if %rsp == 0 (since the ABI requires it to be 16B-aligned).
Anyway, void foo(void) { volatile int x = 10; } compiles to
movl $10, -4(%esp) / ret with gcc 5.3 -O3 -mx32 on the Godbolt Compiler Explorer.
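For comparison, the same function under the plain x86-64 System V ABI also keeps x below the stack pointer, relying on the same 128B red zone. A sketch of typical gcc -O3 output (illustrative, not verified against any particular gcc version):

void foo(void) { volatile int x = 10; }

    movl  $10, -4(%rsp)   # x lives in the red zone; no sub/add of %rsp needed
    ret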
In (kernel) code that runs with interrupts disabled. Since nothing asynchronous other than DMA can happen, nothing can clobber your stack memory. (Although x86 has NMIs: Non-maskable interrupts. Depending on the handler for NMIs, and whether they can be blocked at all, NMIs could clobber memory below the stack pointer, I think.)
In user-space, your signal handlers aren't the only thing that can asynchronously clobber memory below the stack pointer:
As Jester points out in comments on dwelch's answer, pages below the stack pointer can be discarded (asynchronously, of course), so a process that temporarily uses a lot of stack isn't wasting all those pages forever. If %esp happens to be at a page boundary, -4(%esp) is in a different page, and instead of faulting in a newly-allocated page of stack memory, accesses to unmapped pages below the stack pointer turn into segfaults on Linux.
Unless you have a guarantee otherwise (e.g. the red zone), you must assume that everything below %esp is scribbled over between every instruction. None of the standard 32bit ABIs have a red zone, and the Windows 64bit ABI also lacks one. Asynchronous use of the stack (usually by signal handlers in Linux) is a whole-program thing, not something the compiler could determine from the current compilation unit alone (even in cases where the compiler could prove that -4(%esp) was in the same page as (%esp)).
Note that the Linux x32 ABI is a 64bit ABI for AMD64 aka x86-64, not i386 aka IA32 aka x86-32. It's much more like the usual AMD64 ABI, since it was designed after.

EDIT
I'm not sure what you mean by above and below, since some folks "see" addresses increasing up and others increasing down.
But it doesn't matter. If the stack was initialized at address X and is currently at Y, then the data between X and Y must be preserved (one end not inclusive). The memory on either side is fair game.
The compiler, not the operating system, makes this happen: it moves the stack pointer to cover whatever it needs for that function and moves it back when done. Each nested function consumes more and more stack, and each return gives a little back.
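As a sketch, that stack-pointer movement looks something like this for a typical 32-bit function (illustrative compiler output; exact sizes and offsets vary):

    push %ebp            # prologue: save the caller's frame pointer
    mov  %esp, %ebp
    sub  $16, %esp       # claim 16 bytes: now between X and Y, so preserved
    ...                  # locals live here
    mov  %ebp, %esp      # epilogue: give the 16 bytes back
    pop  %ebp
    ret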

Related

aarch64 inline assembly stack pointer constraint memory address with offset for Clang 6+

I noticed that at different optimization levels, Clang 6 sometimes merges vld1 NEON load intrinsics on adjacent memory addresses into ldp (load NEON register pair) instructions.
I am trying to use inline assembly to manually force more load-pair instructions. The source array is held on the stack, and when Clang itself produces ldp instructions it uses the stack pointer with an offset; however, when I pass the array with its index into inline assembly, it expands into an x register holding the address. This works, but it causes a performance regression. I believe this is because reading from the stack is faster, while an x register used as a source address might point to the heap, which may in turn reference back to the stack (though I am not sure), or perhaps it is reading duplicate data from the heap.
This is an example of what I am using now:
asm (
    "ldp %q[DST1], %q[DST2], [%[SRC]]" "\n"
    : [DST1] "=w" (TMP1), [DST2] "=w" (TMP2)
    : [SRC] "X" (&K2[8])
);
and this is what Clang expands it into
ldp q19, q4, [x11]
But I want to use a stack pointer with offset address, automatically resolved from the indexed K2 array variable. e.g.
ldp q19, q4, [sp,#32]
The offsets of the stack pointer addresses in the disassembled code are not adjacent, so I cannot just hard-code the sp register and enter an offset to load sequential data. This is because Clang 6 consolidates identical values from other arrays used by other functions onto the stack.
GCC has aarch64 machine constraints like k, which is for the stack pointer (sp) register, and Ump, which is meant for stp and ldp store/load pair instruction addresses. I never got either to work on GCC or Clang; the latter has no equivalent constraints in its sparse documentation.
My preference is to use Clang 6, as it produces code that is over 6% faster than GCC 8 because it arranges most of the instructions in a performance-critical loop to dual-issue properly.
Is there any way to enter an array with an index as input into inline assembly and have it automatically resolve to a stack pointer with offset address in Clang 6?
Have you tried using a memory source operand like [SRC] "m" (K2[8])? Without that, you haven't even told the compiler that the memory contents are also an input to the inline asm, so it might reorder your asm wrt. stores, or do dead-store elimination.
Letting the compiler pick the addressing mode is the entire point of "m" operands.
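For example, a minimal sketch (untested; it assumes TMP1/TMP2 are 128-bit NEON vector variables and that K2 is an array of uint32_t, since the question doesn't show its declaration; the array-pointer cast tells the compiler the asm reads all 32 bytes, not just K2[8]):

asm (
    "ldp %q[DST1], %q[DST2], %[SRC]"
    : [DST1] "=w" (TMP1), [DST2] "=w" (TMP2)
    : [SRC] "m" (*(const uint32_t (*)[8]) &K2[8])
);

With "m", the compiler substitutes whatever addressing mode it would have chosen itself, which can be [sp, #32] when the data lives on the stack.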

Do any CPUs have hardware support for bounds checking?

It doesn't seem like it would be difficult to associate ranges with segments of memory. Then have an assembly instruction which treats two integers as "location" and "offset" (and another for "data", if setting) and returns the data and an error code. This would mean no longer having to choose between speed and security/safety when working with arrays.
Another example might be a function which verifies that instructions originating in a particular memory range cannot physically access memory outside that range. If all hardware connected to the motherboard had this capability (and were made to be compatible with each other), it would be trivial to make perfect virtual machines that run at nearly the same speed as the physical machine.
Dustin Soodak
Yes.
Decades ago, Lisp machines performed validation checks (e.g. type checks and bounds checks) simultaneously as the program ran, assuming the program and state were valid and jumping "back in time" if a check failed. Unfortunately, this ability to get "free" runtime validation was lost when conventional (i.e. x86) machines became dominant.
https://en.wikipedia.org/wiki/Lisp_machine
Lisp Machines ran the tests in parallel with the more conventional single instruction additions. If the simultaneous tests failed, then the result was discarded and recomputed; this meant in many cases a speed increase by several factors. This simultaneous checking approach was used as well in testing the bounds of arrays when referenced, and other memory management necessities (not merely garbage collection or arrays).
Fortunately we're finally learning from the past and slowly, piecemeal, reintroducing those innovations: Intel's MPX (Memory Protection Extensions) for x86, introduced in Skylake-generation processors, provides hardware bounds-checking, though it isn't perfect.
(x86 is a regression in other ways too: IBM's mainframes had true hardware-accelerated system virtualization in the 1980s; we didn't get it on x86 until 2005, with Intel's "VT-x" and AMD's "AMD-V" extensions.)
x86 BOUND
Technically, x86 does have hardware bounds-checking: the BOUND instruction was introduced in 1982 with the Intel 80186/80188 (and is present in the Intel 286 and above, but not the Intel 8086 or 8088 processors).
While the BOUND instruction does provide hardware bounds-checking, I understand it indirectly caused performance issues because it breaks the hardware branch predictor (according to a Reddit thread, though I'm unsure why), and also because it requires the bounds to be specified in a tuple in memory, which is terrible for performance. I understand that at runtime it's no faster than manually writing the instructions for "if index not in range [x,y], then signal the BR exception to the program or OS" (so you might imagine the BOUND instruction was added for the convenience of people who coded assembly by hand, which was quite common in the 1980s).
The BOUND instruction is still present in today's processors, but it was not included in AMD64 (x64), likely for the performance reasons explained above, and also because very few people were using it (and compilers could trivially replace it with a manual bounds check, which might have better performance anyway, as it can keep the bounds in registers).
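For illustration, here's the kind of manual replacement a compiler can emit, with the bounds kept in registers rather than in a memory tuple (a sketch, not any particular compiler's output):

#include <stdlib.h>
#include <stddef.h>

/* One unsigned compare covers both ends when the lower bound is 0:
   this compiles to roughly  cmp index, len / jae out_of_range  */
static inline int checked_load(const int *arr, size_t len, size_t i)
{
    if (i >= len)    /* bounds check entirely in registers */
        abort();     /* stand-in for raising the BR exception */
    return arr[i];
}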
Another disadvantage of storing the array bounds in memory is that code elsewhere (which wasn't subject to BOUND checking) could overwrite the previously written bounds for another pointer and circumvent the check that way. This is mostly a problem with code that intentionally tries to disable safety features (i.e. malware), but if the bounds were stored on the stack, then given how easy it is to corrupt the stack, it has even less utility.
Intel MPX
Intel MPX was introduced in Skylake architecture in 2015 and should be present in all Skylake and subsequent processor models in the mainstream Intel Core family (including Xeon, and non-SoC versions of Celeron and Pentium). Intel also implemented MPX in the Goldmont architecture (Atom, and SoC versions of Celeron and Pentium) from 2016 onwards.
MPX is superior to BOUND in that it provides dedicated registers to store the bounds range, so the bounds check should be almost zero-cost compared to BOUND, which required a memory access. On the Intel 486 the BOUND instruction takes 7 cycles (compare to CMP, which takes only 2 cycles even if the operand is a memory address). In Skylake the MPX equivalents (BNDMK, BNDCL and BNDCU) are all 1-cycle instructions, and BNDMK can be amortized as it only needs to be called once for each new pointer.
I cannot find any information on whether or not AMD has implemented its own version of MPX yet (as of June 2017).
Critical thoughts on MPX
Unfortunately, the current state of MPX is not all that rosy. A recent paper by Oleksenko, Kuvaiskii, et al. from February 2017, "Intel MPX Explained" (PDF link; caution: not yet peer-reviewed), is a tad critical:
Our main conclusion is that Intel MPX is a promising technique that is not yet practical for widespread adoption. Intel MPX's performance overheads are still high (~50% on average), and the supporting infrastructure has bugs which may cause compilation or runtime errors. Moreover, we showcase the design limitations of Intel MPX: it cannot detect temporal errors, may have false positives and false negatives in multithreaded code, and its restrictions on memory layout require substantial code changes for some programs.
Also note that, compared to the Lisp machines of yore, Intel MPX checks are still executed inline, whereas in Lisp machines (if my understanding is correct) bounds checks happened concurrently in hardware, with a retroactive jump backwards if a check failed; thus, so long as a running program's pointers do not point to out-of-bounds locations, there would be absolutely zero runtime performance cost. So if you have this C code:
char arr[10];
arr[9] = 'a';
arr[8] = 'b';
Then under MPX this would be executed:
Time  Instruction        Notes
1     BNDMK arr, arr+9   Set bounds 0 to 9.
2     BNDCL arr          Check `arr` meets the lower bound.
3     BNDCU arr+9        Check `arr+9` meets the upper bound.
4     MOV 'a' arr+9      Assign 'a' to arr+9.
5     MOV 'b' arr+8      Assign 'b' to arr+8.
But on a Lisp machine (if it were magically possible to compile C to Lisp...), then the program-reader-hardware in the computer has the ability to execute additional "side" instructions concurrently with the "actual" instructions, allowing the "side" instructions to instruct the computer to disregard the results from the "actual" instructions in the event of an error:
Time  Actual instruction  Side instruction
1     MOV 'a' arr+9       ENSURE arr+9 BETWEEN arr, arr+9
2     MOV 'b' arr+8       ENSURE arr+8 BETWEEN arr, arr+9
I understand the instructions-per-cycle of the "side" instructions are not the same as the "actual" instructions, so the side-check for the instruction at Time=1 might only complete after the "actual" instructions have already progressed to Time=3; but if the check failed, it would pass the instruction pointer of the failed instruction to the exception handler, which would direct the program to disregard the results of the instructions executed after Time=1. I don't know how they could achieve that without massive amounts of memory or some mandatory execution pauses, possibly memory-fencing too; that's outside the scope of my answer, but it is at least theoretically possible.
(Note that in this contrived example I'm using constexpr index values that a compiler can prove will never be out of bounds, so it would omit the MPX checks entirely - so pretend they're user-supplied variables instead :) ).
I'm not an expert in x86 (nor do I have any experience in microprocessor design, spare a CS500-level course I took at UW and didn't do the homework for...), but I don't believe concurrent execution of bounds checks or "time travel" is possible with x86's current design, despite the extant implementation of out-of-order execution - I might be wrong, however. I speculate that if all pointer types were promoted to 3-tuples ( struct BoundedPointer<T> { T* ptr; T* min; T* max; } - which technically already happens with MPX and other software-based bounds checks, as every guarded pointer has its bounds defined when BNDMK is called), then the protection could be provided for free by the MMU - but now each pointer consumes 24 bytes of memory instead of the current 8 bytes (or compare to the measly 4 bytes under 32-bit x86). RAM is plentiful, but still a finite resource that shouldn't be wasted.
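Here is that 3-tuple idea as a compilable C sketch (the names are illustrative, not any real API; __builtin_trap() is a GCC/Clang builtin standing in for the hardware exception):

typedef struct {
    int *ptr;   /* the pointer itself */
    int *min;   /* lowest valid address */
    int *max;   /* one past the highest valid address */
} BoundedIntPtr;    /* 24 bytes per pointer instead of 8 */

static inline int bp_load(BoundedIntPtr p)
{
    if (p.ptr < p.min || p.ptr >= p.max)
        __builtin_trap();   /* analogous to the BR exception */
    return *p.ptr;
}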
MPX in GCC
GCC supported MPX from version 5.0 to 9.1 ( https://gcc.gnu.org/wiki/Intel%20MPX%20support%20in%20the%20GCC%20compiler ), when it was removed due to its maintenance burden.
MPX in Visual Studio / Visual C++
Visual Studio 2015 Update 1 (2015.1) added "experimental" support for MPX with the /d2MPX switch ( https://blogs.msdn.microsoft.com/vcblog/2016/01/20/visual-studio-2015-update-1-new-experimental-feature-mpx/ ). Support is still present in Visual Studio 2017 but Microsoft has not announced if it's considered a mainstream (i.e. non-experimental) feature yet.
MPX in Clang / LLVM
Clang partially supported manual use of MPX in the past, but that support was fully removed in version 10.0.
As of July 2021, LLVM still seems capable of outputting MPX instructions, but I can't see any evidence of an MPX "pass".
MPX in Intel C/C++ Compiler
The Intel C/C++ Compiler has supported MPX since version 15.0.
The XL compilers, available on IBM POWER processors running Little Endian Linux, Big Endian Linux, or AIX, have a different implementation of array bounds checking.
Using the -qcheck option or its synonym -C turns on various kinds of checking; -qcheck=bounds checks array bounds. When this is used, the compilers check that every array reference has a valid subscript.
The hardware instruction used is a conditional trap, comparing the subscript to the upper limit and trapping if the subscript is too large or too small. In C and C++ the lower limit is 0. In Fortran it defaults to 1 but can be any integer. When it is not zero, the lower limit is subtracted from the subscript being checked, and the check compares that to the upper limit minus the lower limit.
When the limit is known at compile time and small enough, a conditional trap immediate instruction is enough. When the limit is calculated at execution time or is greater than 65535, a conditional trap instruction comparing two registers is needed.
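As a hedged sketch (using the Power ISA extended trap mnemonics; the exact code generated is the compiler's business), a check against a compile-time limit of 100 elements can be a single instruction:

    twlgti  r5, 99    # trap word if the (unsigned) subscript in r5 > 99

which is semantically just the following C, with the compare-and-branch folded into one trap instruction:

    if (subscript > 99)
        __builtin_trap();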
The performance impact is small for several reasons:
1. The conditional trap instructions are fast.
2. They are executed in a standard integer pipeline. Since most POWER CPUs have 2 or 4 integer pipelines, there is usually an otherwise empty slot to put the trap in, so it is often essentially zero cost.
3. When it can, the optimizer moves the conditional trap out of loops so it is executed only once, checking all loop iterations at once.
4. When it can prove the actual subscript cannot exceed the limit, the optimizer discards the instruction.
5. When it can prove the subscript will always be invalid, the optimizer uses an unconditional trap.
6. If necessary, -qcheck can be used during testing and skipped for production builds, but the overhead is small enough that this is not usually necessary.
If my memory is correct, one long ago paper reported a 2% slowdown in one case and 0% in another. Since that CPU had only one integer pipeline, the slowdown should be significantly less with modern CPUs.
Other checking using the same mechanism is available to detect dereferencing NULL pointers, dividing an integer by zero, using an uninitialized auto variable, specially written asserts, etc.
This doesn't include all kinds of invalid memory usage, but it does handle the most common kind, does it very efficiently, and is very easy to use.
GCC supports -fbounds-check for similar purposes, but at this time it is only available for the Fortran front end (gfortran).

What do the contents of the general purpose registers contain?

I included the iOS tag, but I'm running in the simulator on a Core i7 MacBook Pro (x86-64, right?), so I think that's immaterial.
I'm currently debugging a crash in Flurry's video ads. I have a breakpoint set on Objective-C exceptions. When the breakpoint is hit I am in objc_msgSend. The callstack contains a mix of private Flurry and iOS methods, nothing public and nothing that I've written. Calling register read from the objc_msgSend stack frame outputs the following:
(lldb) register read
General Purpose Registers:
eax = 0x1ac082d0
ebx = 0x009600b5 "spaceWillDismiss:interstitial:"
ecx = 0x03e2cddb "makeKeyAndVisible"
edx = 0x0000003f
edi = 0x0097c6f3 "removeWindow"
esi = 0x00781e65 App`-[FlurryAdViewController removeWindow] + 12
ebp = 0xbfffd608
esp = 0xbfffd5e8
ss = 0x00000023
eflags = 0x00010202 App`-[FeedTableCell setupVisibleCommentAndLike] + 1778 at FeedTableCell.m:424
eip = 0x049bd09b libobjc.A.dylib`objc_msgSend + 15
cs = 0x0000001b
ds = 0x00000023
es = 0x00000023
fs = 0x00000000
gs = 0x0000000f
I've got a few questions about this output.
I assumed $ebx contains the selector that caused the crash and $edi is the last executing method. Is that the case?
$eip is where I crashed. Is that usually the case?
$eflags references an instance method that, as far as I know, has nothing to do with this crash. What is that?
Is there any other information I can pry out of these registers?
I can't speak to iOS/Objective-C frame layouts specifically, so I can't answer your question about EBX and EDI. But I can help you regarding EIP and EFLAGS and give you some general hints about ESP/EBP and the selector registers. (By the way, the simulator is simulating a 32-bit x86 environment; you can tell because your registers are 32 bits long.)
EIP is the instruction pointer register, also known as the program counter, which contains the address of the currently executing machine instruction. Thus it will point to where your program crashed, or more generally, where your program is when it hits a breakpoint, dumps core etc.
EIP is saved and restored to implement function calls (at the machine code level -- inlining may result in high-level language calls not performing actual calls). In memory-unsafe languages, a stack buffer overflow can overwrite the saved value of the instruction pointer, causing the return instruction to return to the wrong place. If you're lucky, the overwritten value will trigger a segfault on the next memory fetch, but the value of EIP will be arbitrary and unhelpful in debugging the problem. If you're unlucky, an attacker crafted the new EIP to point to useful code, so many environments use "stack cookies" or "canaries" to detect these overwrites before restoring the saved/overwritten EIP, in which case the EIP value may be useful.
EFLAGS isn't a memory address, and arguably isn't a general purpose register. Each bit of EFLAGS is a flag that can be set or tested by various instructions. The most important flags are the carry, zero and sign flags, which are set by arithmetic instructions and used for conditional branching. Your debugger is misinterpreting it as a memory address and displaying it as the closest function, but that isn't actually related to your crash. (The + 1778 is the giveaway: this means EFLAGS points 1778 bytes into the function, but the function is unlikely to actually be 1778 bytes long.)
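If you're curious, the bits are easy to decode by hand: the 0x00010202 above is just bit 1 (a reserved bit that always reads as 1), IF (interrupts enabled) and RF (the resume flag). A quick sketch:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t eflags = 0x00010202;    /* value from the register dump */
    printf("CF=%u ZF=%u SF=%u IF=%u RF=%u\n",
           (unsigned)(eflags & 1),          /* bit 0: carry */
           (unsigned)((eflags >> 6) & 1),   /* bit 6: zero */
           (unsigned)((eflags >> 7) & 1),   /* bit 7: sign */
           (unsigned)((eflags >> 9) & 1),   /* bit 9: interrupt enable */
           (unsigned)((eflags >> 16) & 1)); /* bit 16: resume flag */
    return 0;
}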
ESP is the stack pointer and EBP is (usually) the frame pointer (also called the base pointer). These registers bound the current frame on the call stack. Your debugger usually can show you the values of stack variables and the current call stack based on these pointers. In case of corruption, sometimes you can manually inspect the stack to recover EBP and manually unwind the call stack. Note that code can be compiled without frame pointers (frame pointer omission), freeing EBP for other uses; this is common on x86 because there are so few general-purpose registers.
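To illustrate the manual unwinding: with frame pointers present, each frame begins with the saved EBP, immediately followed by the return address, so the chain can be walked like a linked list. A sketch (a real debugger would validate every pointer before following it):

#include <stdio.h>

struct frame {
    struct frame *saved_ebp;   /* the caller's frame pointer */
    void *return_addr;         /* where this frame returns to */
};

void unwind(struct frame *ebp)
{
    while (ebp != NULL) {
        printf("return address: %p\n", ebp->return_addr);
        ebp = ebp->saved_ebp;  /* step up to the caller's frame */
    }
}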
SS, CS, DS, ES, FS and GS hold segment selectors, used in the bad old days before paging to implement segmentation. Today FS and GS are commonly used by operating systems for process and thread state blocks; they are the only selector registers still put to meaningful use in x86-64. The selector registers are generally not helpful for debugging.

Heap overflow exploit

I understand that overflow exploitation requires three steps:
1. Injecting arbitrary code (shellcode) into the target process's memory space.
2. Taking control of eip.
3. Setting eip to execute the arbitrary code.
I read Ben Hawkens' articles about heap exploitation and understood a few tactics for ultimately overwriting a function pointer to point to my code.
In other words, I understand step 2.
I do not understand steps 1 and 3.
How do I inject my code into the process's memory space?
During step 3 I overwrite a function pointer with a pointer to my shellcode. How can I calculate/know what address my injected code was injected at? (In stack overflows this problem is solved by using "jmp esp".)
In a heap overflow, supposing that the system does not have ASLR activated, you will know the address of the memory chunks (aka, the buffers) you use in the overflow.
One option is to place the shellcode where the buffer is, given that you can control the contents of the buffer (as the application user). Once you have placed the shellcode bytes in the buffer, you only have to jump to that buffer address.
One way to perform that jump is by, for example, overwriting a .dtors entry. Once the vulnerable program finishes, the shellcode - placed in the buffer - will be executed. The complicated part is the .dtors overwriting. For that you will have to use the published heap exploiting techniques.
The prerequisites are that ASLR is deactivated (to know the address of the buffer before executing the vulnerable program) and that the memory region where the buffer is placed must be executable.
One more thing: steps 2 and 3 are essentially the same. If you control eip, it's logical that you will point it at the shellcode (the arbitrary code).
P.S.: Bypassing ASLR is more complex.
Step 1 requires a vulnerability in the attacked code.
Common vulnerabilities include:
buffer overflow (common in C code; happens if the program reads an arbitrarily long string into a fixed-size buffer; see the sketch after this list)
evaluation of unsanitized data (common in SQL and script languages, but can occur in other languages as well)
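The fixed-buffer case from the first bullet, as a minimal sketch (gets() was removed in C11 precisely because it cannot be used safely, which makes it the canonical example):

#include <stdio.h>

int main(void)
{
    char buf[64];
    gets(buf);   /* unbounded read: input past 64 bytes overwrites
                    whatever follows buf on the stack */
    return 0;
}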
Step 3 requires detailed knowledge of the target architecture.
How do I inject my code into process space?
This is quite a statement/question. It requires an 'exploitable' region of code in said process space. For example, Windows is currently rewriting most strcpy() calls to strncpy() wherever possible. I say wherever possible
because not all code that uses strcpy can successfully be changed over to strncpy. Why? Because of the crux of the difference shown below:
strcpy(buffer, copied);
or
strncpy(buffer, copied, sizeof(copied));
This is what makes strncpy so difficult to use in real-world scenarios: a 'magic number', the length limit, has to be supplied to most strncpy operations (the sizeof() operator creates this magic number).
As coders, we are taught that hard-coded values such as char buffer[1024]; are really bad coding practice.
BUT, in comparison, using buffer[]=""; or buffer[1024]=""; is the heart of the exploit. HOWEVER, if for example we change this code to the latter, we get another exploit introduced into the system...
char * buffer;
char * copied;
strcpy(buffer, copied);  // overflow this right here...
OR THIS:
int size = 1024;
char buffer[size];
char copied[size];
strncpy(buffer, copied, size);
This will stop overflows, but introduces an exploitable region in RAM due to the size being predictable and structured into 1024-byte blocks of code/data.
Therefore, original poster: finding strcpy, for example, in a program's address space will show the program to be exploitable if strcpy is present.
There are many reasons why strcpy is favoured by programmers over strncpy: magic numbers, variable input/output data sizes, programming styles, etc.
HOW DO I FIND MYSELF IN MY CODE (MY LOCATION)
Check various hacker books for examples of this ~
BUT, try;
label:
    pop eax
    pop eax
    call pointer
    jmp label
pointer:
    mov esp, eax
    jmp $
This is an example that is non-working, due to the fact that I do NOT want to be held responsible for writing the next Morris Worm! But any decent programmer will get the gist of this code and know immediately what I am talking about here.
I hope your overflow techniques work in the future, my son!

Does the system allocate memory from high->low or the reverse?

IIRC it should be high->low, but according to this image, it's low->high.
I'm now confused; which is the case?
It seems the code is also executed from low->high:
0x0000000000400498 <main+0>: push %rbp
0x0000000000400499 <main+1>: mov %rsp,%rbp
0x000000000040049c <main+4>: sub $0x10,%rsp
0x00000000004004a0 <main+8>: movl $0x6,-0x4(%rbp)
On Intel x86/x64, which are the most popular architectures that run Windows, the stack "grows" towards the lower addresses. I.e., pushing onto the stack involves subtracting from the stack pointer (ESP), and popping from the stack involves adding to the stack pointer.
The stack grows from the top to the bottom in your example. This is the function's prologue, and it uses the SUB instruction to allocate stack space for local variables. You might be confusing the stack with the memory in which your program is stored -- in that area, the CPU executes instructions sequentially, from low to high addresses, until a branch (e.g. JMP) instruction is encountered.
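You can watch the stack grow downward empirically with a quick test (a demonstration, not a guarantee: the addresses are ABI- and compiler-dependent, and optimization may inline or flatten the calls):

#include <stdio.h>

void probe(int depth)
{
    int local;
    printf("depth %d: local at %p\n", depth, (void *)&local);
    if (depth < 3)
        probe(depth + 1);  /* each nested frame sits at a lower address on x86/x64 */
}

int main(void)
{
    probe(0);
    return 0;
}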
