C code to paint an embedded stack with a pattern (say 0xABABABAB) just after main begins?

I am working on dynamic memory analysis using the stack painting (footprint analysis) method.
dynamic-stack-depth-determination-using-footprint-analysis
Basically, the idea is to fill the entire memory allocated to the stack with a dedicated fill value, for example 0xABABABAB, before the application starts executing. Whenever execution stops, the stack memory can be searched upwards from the end of the stack until a value that is not 0xABABABAB is found, which is assumed to be how far the stack has been used. If the dedicated value cannot be found, the stack has consumed all stack space and has most likely overflowed.
I want C code to fill the stack from top to bottom with a pattern.
void FillSystemStack()
{
extern char __stack_start,_Stack_bottom;
}
NOTE
I am using an STM32F407VG board emulated with QEMU in Eclipse.
The stack grows from higher addresses to lower addresses.
The start of the stack is 0x20020000.
The bottom of the stack is 0x2001fc00.

You shouldn't completely fill the stack after main() begins, because the stack is in use once main() begins. Completely filling the stack would overwrite the bit of stack that has already been used and could lead to undefined behavior. I suppose you could fill a portion of the stack soon after main() begins as long as you're careful not to overwrite the portion that has been used already.
But a better plan is to fill the stack with a pattern before main() is called. Review the startup code for your tool chain. The startup code initializes variable values and sets the stack pointer before calling main(). The startup code may be in assembly depending on your tool chain. The code that initializes variables is probably a simple loop that copies bytes or words from the appropriate ROM to RAM sections. You can probably use this code as an example to write a new loop that will fill the stack memory range with a pattern.
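A minimal sketch of such a fill loop, in C rather than assembly. The function name stack_fill and the idea of passing the bounds as arguments are assumptions for illustration; on a real target the bounds would come from your linker-script symbols, and the upper bound must stay below whatever the running code has already pushed (or the loop must run before any C function uses the stack).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_PATTERN 0xABABABABu  /* fill value taken from the question */

/*
 * Fill the word-aligned range [low, high) with the pattern.
 * low  : lowest address of the region to paint (hypothetical parameter)
 * high : one past the last word to paint; on a target this must not
 *        reach into the part of the stack that is already in use
 */
void stack_fill(uint32_t *low, uint32_t *high)
{
    assert(low <= high);
    for (uint32_t *p = low; p < high; ++p) {
        *p = STACK_FILL_PATTERN;
    }
}
```

On a host you can try it on an ordinary array standing in for the stack region, for example stack_fill(region, region + 8) on a uint32_t region[8].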

This is a Cortex M so it gets the stack pointer set out of reset. Meaning it's pretty much instantly ready to go for C code. If your reset vector is written in C and performs stacking/C function calls, it will be too late to fill the stack at a very early stage. Meaning you shouldn't do it from application C code.
The normal way to do the trick you describe is through an in-circuit debugger. Download the program, hit reset, fill the stack with the help of the debugger. There will be some convenient debugger command available to do that. Execute the program, try to use as much of it as possible, observe the stack in the memory map of your debugger.

With the insights from @kkrambo's answer, I tried to paint the stack just after the start of main, taking care not to overwrite the portion that has already been used. My stack paint and stack count functions are given below:
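StackPaint
uint32_t sp;
#define STACK_CANARY 0xc5
#define WaterMark 0xc9
void StackPaint(void)
{
    char *p = &__StackLimit; // __StackLimit symbol defined in linker script (lowest stack address)
    sp = __get_MSP(); // current main stack pointer; everything at or above it is in use
    PRINTF("stack pointer %08x \r\n\n", (unsigned int)sp);
    // Paint only the unused part, from the limit up to the current stack pointer.
    // sp is always below __StackTop here, so p never reaches __StackTop and no watermark is written.
    while((uint32_t)p < sp)
    {
        *p = STACK_CANARY;
        p++;
    }
}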
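StackCount
uint16_t StackCount(void)
{
    PRINTF("In the check address function in main file \r\n\n");
    const char *p = &__StackLimit;
    uint16_t c = 0;
    // Check the bound first so that *p is never read at or beyond __StackTop.
    while(p < &__StackTop && (*p == STACK_CANARY || *p == WaterMark))
    {
        p++;
        c++;
    }
    // Total stack size comes from the linker symbols instead of a hard-coded 1024.
    PRINTF("stack used:%d bytes \n", (int)(&__StackTop - &__StackLimit) - c);
    PRINTF("remaining stack :%d bytes\n", c);
    return c;
}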
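For anyone who wants to check the counting logic off-target, here is a small host-side sketch of the same scan. The function names and the limit/top parameters are made up for illustration (they stand in for the linker symbols), and the canary byte is the same 0xc5 value used above. It assumes a stack that grows downward, so the scan starts at the lowest address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_CANARY 0xC5u  /* same byte value as the canary painted above */

/*
 * Count bytes that still hold the canary, scanning upward from the
 * lowest stack address. The first non-canary byte marks the deepest
 * point the stack has reached.
 */
size_t stack_unused_bytes(const uint8_t *limit, const uint8_t *top)
{
    assert(limit <= top);
    const uint8_t *p = limit;
    while (p < top && *p == STACK_CANARY) {
        ++p;
    }
    return (size_t)(p - limit);
}

/* Bytes between the high-water mark and the top of the region. */
size_t stack_used_bytes(const uint8_t *limit, const uint8_t *top)
{
    return (size_t)(top - limit) - stack_unused_bytes(limit, top);
}
```

Note that this only gives a lower bound on usage: a frame that reserves space without writing to it, or that happens to write the canary value, will not be noticed.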

Related

How to avoid crash during stack buffer overflow exploit?

void deal_msg(unsigned char * buf, int len)
{
unsigned char msg[1024];
strcpy(msg,buf);
//memcpy(msg, buf, len);
puts(msg);
}
void main()
{
// network operation
sock = create_server(port);
len = receive_data(sock, buf);
deal_msg(buf, len);
}
As the pseudocode shows above, the compile environment is vc6 and running environment is windows xp sp3 en. No other protection mechanisms are applied, that is stack can be executed, no ASLR.
The send data is 'A' * 1024 + addr_of_jmp_esp + shellcode.
My question is:
if strcpy is used, the shellcode is generated by msfvenom, msfvenom -p windows/exec cmd=calc.exe -a x86 -b "\x00" -f python,
msfvenom attempts to encode payload with 1 iterations of x86/shikata_ga_nai
after data is sent, no calc pops up, the exploit won't work.
But if memcpy is used, shellcode generated by msfvenom -p windows/exec cmd=calc.exe -a x86 -f python without encoding works.
How to avoid the original program's crash after calc pops up, how to keep stack balance to avoid crash?
Hard to say. I'd use a custom payload (just copy the windows/exec cmd=calc.exe one), put a 0xcc at the start, and debug it (or use something else that is easily recognizable under the debugger, like a ud2 or \xeb\xfe). If your payload is executed, you'll see it. Bypass the added instruction (just NOP it) and try to see what can possibly go wrong with the remainder of the payload.
You'll need a custom payload. Since you're on XP SP3, you don't need to do crazy things.
Don't try to do the overflow and smash the whole stack (your overflow seems to be perfect: just enough to control EIP).
See how the target function (deal_msg in your example) behaves under normal conditions. Note the stack address when the ret is executed (and whether registers need to hold certain values; this depends on the caller).
Try to replicate that in your shellcode: you'll most probably need to adjust the stack pointer a bit at the end of your shellcode.
Make sure the caller's (main) stack hasn't been affected when executing the payload. This might happen; in that case, reserve enough room on the stack (going to lower addresses), so that the caller's stack is far from the stack space needed by the payload and doesn't get affected by the payload execution.
Finally, return to the ret of the target or directly after the call to the deal_msg function (or anywhere you see fit, e.g. returning directly to ExitProcess(), but it might be more interesting to return close to the previous "normal" execution path).
All in all, returning somewhere after the payload execution is easy: just push <addr> and ret. But you'll need to ensure that the stack is in good shape to continue execution and that most of the registers are correctly set.

Tracking stack size during application execution

I am running an application on an embedded (PowerPC 32 bit) system where there is a stack size limitation of 64K. I am experiencing some occasional crashes because of stack overflow.
I can also build the application for a normal Linux system (with some minor changes in the code), so I can run an emulation in my development environment.
I was wondering which is the best way to find the methods that exceed the stack size limitation and which is the stack frame when this happens (in order to perform some code refactoring).
I've already tried Callgrind (a Valgrind tool), but it seems not to be the right tool.
I'm looking more for a tool than changes in the code (since it's a 200K LOC and 100 files project).
The application is entirely written in C++03.
While it seems that there should be an existing tool for this, I would approach it by writing a small macro and adding it to the top of suspected functions:
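/* Address of a local in the outermost frame, recorded by STACK_ROOT */
char *__stack_root__;
#define GUARD_SIZE (64 * 1024 - 1024)
/* Record a reference address near the base of the stack */
#define STACK_ROOT \
char __stack_frame__; \
__stack_root__ = &__stack_frame__;
/* Warn when the distance from the reference address exceeds GUARD_SIZE.
   The pointer difference is a ptrdiff_t, so it is converted to long for labs() and %ld. */
#define STACK_GUARD \
char __stack_frame__; \
if (labs((long)(&__stack_frame__ - __stack_root__)) > GUARD_SIZE) { \
printf("stack is about to overflow in %s: at %ld bytes\n", __FUNCTION__, labs((long)(&__stack_frame__ - __stack_root__))); \
}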
And here how to use it:
#include <stdio.h>
#include <stdlib.h>
void foo(int);
int main(int argc, char** argv) {
STACK_ROOT; // this macro records an address near the base of the thread's stack
foo(10000);
return 0;
}
void foo(int x) {
STACK_GUARD; // this macro checks if we're approaching the end of memory available for stack
if (x > 0) {
foo(x - 1);
}
}
A couple of notes:
This code assumes a single thread. If you have multiple threads, you need to keep track of a separate __stack_root__ variable, one per thread. Use thread-local storage for this.
The absolute value is taken so that the macro works whether the PowerPC stack grows up or down (it can be either, depending on your setup).
Adjust GUARD_SIZE to your liking, but keep it smaller than the maximum size of your stack on the target.
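The thread-local variant mentioned in the notes could look roughly like the sketch below. It is only an illustration: the stack_track_* names are invented here, and it assumes a C11 compiler for _Thread_local (with an older GCC or a C++03 build, the __thread extension plays the same role). Each thread passes in the address of one of its own locals, so every thread gets its own reference point and its own maximum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One copy of each variable per thread (C11 thread-local storage). */
static _Thread_local uintptr_t stack_root_addr;  /* reference address near the stack base */
static _Thread_local size_t    stack_max_depth;  /* largest distance seen so far */

/* Call once at the start of each thread with the address of a local. */
void stack_track_init(const volatile void *root)
{
    assert(root != NULL);
    stack_root_addr = (uintptr_t)root;
    stack_max_depth = 0;
}

/*
 * Call from suspect functions with the address of a local.
 * Returns the distance from the root; the direction of stack growth
 * does not matter because the absolute difference is used.
 */
size_t stack_track_probe(const volatile void *here)
{
    uintptr_t addr = (uintptr_t)here;
    size_t depth = (addr > stack_root_addr)
                       ? (size_t)(addr - stack_root_addr)
                       : (size_t)(stack_root_addr - addr);
    if (depth > stack_max_depth) {
        stack_max_depth = depth;
    }
    return depth;
}

/* Deepest distance recorded for the calling thread. */
size_t stack_track_max(void)
{
    return stack_max_depth;
}
```

Typical use would be something like char base; stack_track_init(&base); at the top of the thread function, then char here; stack_track_probe(&here); in the functions you suspect, and a comparison of stack_track_max() against your guard size.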

Understanding dispatch_block_t [duplicate]

In most managed languages (that is, the ones with a GC), local variables that go out of scope are inaccessible and have a higher GC-priority (hence, they'll be freed first).
Now, C is not a managed language, what happens to variables that go out of scope here?
I created a small test-case in C:
#include <stdio.h>
int main(void){
int *ptr;
{
// New scope
int tmp = 17;
ptr = &tmp; // Just to see if the memory is cleared
}
//printf("tmp = %d", tmp); // Compile-time error (as expected)
printf("ptr = %d\n", *ptr);
return 0;
}
I'm using GCC 4.7.3 to compile and the program above prints 17, why? And when/under what circumstances will the local variables be freed?
The actual behavior of your code sample is determined by two primary factors: 1) the behavior is undefined by the language, 2) an optimizing compiler will generate machine code that does not physically match your C code.
For example, despite the fact that the behavior is undefined, GCC can (and will) easily optimize your code to a mere
printf("ptr = %d\n", 17);
which means that the output you see has very little to do with what happens to any variables in your code.
If you want the behavior of your code to better reflect what happens physically, you should declare your pointers volatile. The behavior will still be undefined, but at least it will restrict some optimizations.
Now, as to what happens to local variables when they go out of scope. Nothing physical happens. A typical implementation will allocate enough space in the program stack to store all variables at the deepest level of block nesting in the current function. This space is typically allocated in the stack in one shot at the function startup and released back at the function exit.
That means that the memory formerly occupied by tmp continues to remain reserved in the stack until the function exits. That also means that the same stack space can (and will) be reused by different variables having approximately the same level of "locality depth" in sibling blocks. The space will hold the value of the last variable until some other variable declared in a sibling block overwrites it. In your example nothing overwrites the space formerly occupied by tmp, so you will typically see the value 17 survive intact in that memory.
However, if you do this
#include <stdio.h>
int main(void) {
    volatile int *ptr;
    volatile int *ptrd;
    { // Block
        int tmp = 17;
        ptr = &tmp; // Just to see if the memory is cleared
    }
    { // Sibling block
        int d = 5;
        ptrd = &d;
    }
    // Still undefined behavior: both pointees are past their lifetime here
    printf("ptr = %d %d\n", *ptr, *ptrd);
    printf("%p %p\n", (void *)ptr, (void *)ptrd);
    return 0;
}
you will see that the space formerly occupied by tmp has been reused for d and its former value has been overwritten. The second printf will typically output the same pointer value for both pointers.
The lifetime of an automatic object ends at the end of the block where it is declared.
Accessing an object outside of its lifetime is undefined behavior in C.
(C99, 6.2.4p2) "If an object is referred to outside of its lifetime, the behavior is undefined. The value of a pointer becomes indeterminate when the object it points to reaches the end of its lifetime."
Local variables are allocated on the stack. They are not "freed" in the sense you think about in GC languages, or like memory allocated on the heap. They simply go out of scope; for built-in types the code won't do anything, and for objects (in C++) the destructor is called.
Accessing them beyond their scope is Undefined Behaviour. You were just lucky, as no other code has overwritten that memory area...yet.
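For contrast, here is a small sketch of two ways to get the value out of the inner block without relying on undefined behavior. The function names are made up for illustration; the point is only that either the value is copied while tmp is still alive, or the pointer refers to a variable declared in the enclosing scope.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the value out before the inner block ends. */
int copy_out_of_block(void)
{
    int kept;
    {
        int tmp = 17;
        kept = tmp;   /* copies the value; no pointer to tmp escapes the block */
    }
    return kept;      /* tmp's lifetime has ended, but kept is still alive */
}

/* Point at a variable whose lifetime covers every use of the pointer. */
int point_to_outer(void)
{
    int outer = 0;
    int *ptr = &outer;
    {
        *ptr = 17;    /* outer is declared in the enclosing block, so it is still alive */
    }
    assert(ptr != NULL);
    return *ptr;      /* well defined: outer lives until the function returns */
}
```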

CUDA kernels and memory access (one kernel doesn't execute entirely and the next doesn't get launched)

I'm having trouble here. I launch two kernels, check whether some value is the expected one (memcpy to the host); if it is, I stop, and if it isn't, I launch the two kernels again.
the first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
int obj = threadIdx.x;
int ant = blockIdx.x;
int id = threadIdx.x + blockIdx.x * blockDim.x;
*(data->added) = 1;
while(*(data->added) == 1)
{
*(data->added) = 0;
//check if obj fits
int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));
if(obj == 0)
printf("ant %d going..\n", ant);
__syncthreads();
...
The code goes on after this. But that printf never gets printed, that syncthreads is there just for debugging purposes.
The "added" variable was shared, but since shared memory is a PITA and usually throws bugs in the code, I just removed it for now. This "added" variable isn't the smartest thing to do, but it's faster than the alternative, which is checking on the host whether any element of an array has some value and then deciding whether to keep iterating.
The getElement, simply does the matrix memory calculation with the pitch to access the right position and returns the element there:
int* el = (int*) ((char*)mat + row * pitch) + col;
return *el;
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right on the beginning, and it doesn't get printed either and after the printf it sets a variable on the device memory, and this memory is copied to the CPU after the kernel finished, and it isn't the right value when I print it in the CPU code. So I think this kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances: when I launch 8 blocks of 512 threads, it runs OK. 32 blocks of 512 threads, OK. But with 8 blocks of 1024 threads this happens and the kernel doesn't work; the same goes for 32 blocks of 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: tried removing the "added" variable and the while loop, so it should execute just once. It still doesn't work: nothing gets printed, even if the printf is right after the three initial lines, and the next kernel doesn't print anything either.
edit: another thing, I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with 512 maximum or check on how higher I can put this value.
__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers a race condition and is nondeterministic, so it most probably evaluates to different results for different threads.
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't, due to the problem mentioned above, so the output never shows up. You could have figured this out by testing the return codes of all CUDA function calls for errors.

The art of exploitation - exploit_notesearch.c

I've got a question regarding the exploit_notesearch program.
This program is only used to create a command string we finally call with the system() function to exploit the notesearch program that contains a buffer overflow vulnerability.
The commandstr looks like this:
./notesearch Nop-block|shellcode|repeated ret(will jump in nop block).
Now the actual question:
The ret address is calculated in the exploit_notesearch program by the line:
ret = (unsigned int) &i-offset;
So why can we use the address of the variable i, which sits near the bottom of main's stack frame in the exploit_notesearch program, to calculate the ret address? That address is saved in an overflowing buffer in the notesearch program itself, so in a completely different stack frame, and it has to contain an address inside the NOP block (which is in the same buffer).
that will be saved in an overflowing buffer in the notesearch program itself ,so in an completely different stackframe
As long as the system uses virtual memory, system() will create another process for the vulnerable program, and, assuming that there is no stack randomization, both processes will have almost identical values of esp (as well as offset) when their main() functions start, given that the exploit was compiled on the attacked machine (i.e. the one with the vulnerable notesearch).
The address of variable i was chosen just to give an idea about where the frame base is. We could use this instead:
unsigned long sp(void) // This is just a little function
{ __asm__("movl %esp, %eax");} // used to return the stack pointer
int main(){
esp = sp();
ret = esp - offset;
//the rest part of main()
}
Because the variable i is located at a relatively constant distance from esp, we can use &i instead of esp; it doesn't matter much.
It would be much more difficult to get an approximate value for ret if the system did not use virtual memory.
The stack is allocated in a first-in, last-out fashion. The variable i is located somewhere near the top; let's assume that it is at 0x200 and that the return address is located at a lower address, 0x180. To determine whereabouts to put the return address while still leaving some space for the shellcode, the attacker must get the difference, which is 0x200 - 0x180 = 0x80 (128), so he will break that down as follows, ++, the return address is 4 bytes so, we have only 48 bytes left before reaching the segmentation. That is how it is calculated, and the location of i gives an approximate reference point.
