How to avoid a crash during a stack buffer overflow exploit?

void deal_msg(unsigned char * buf, int len)
{
unsigned char msg[1024];
strcpy(msg,buf);
//memcpy(msg, buf, len);
puts(msg);
}
void main()
{
// network operation
sock = create_server(port);
len = receive_data(sock, buf);
deal_msg(buf, len);
}
As the pseudocode above shows, the program is compiled with VC6 and runs on Windows XP SP3 (English). No other protection mechanisms are applied: the stack is executable and there is no ASLR.
The data sent is 'A' * 1024 + addr_of_jmp_esp + shellcode.
My question is:
If strcpy is used, I generate the shellcode with msfvenom -p windows/exec cmd=calc.exe -a x86 -b "\x00" -f python.
msfvenom attempts to encode the payload with 1 iteration of x86/shikata_ga_nai.
After the data is sent, no calc pops up; the exploit does not work.
But if memcpy is used, shellcode generated by msfvenom -p windows/exec cmd=calc.exe -a x86 -f python (without encoding) works.
How can I keep the original program from crashing after calc pops up? How do I keep the stack balanced to avoid the crash?
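For context on the -b "\x00" option mentioned above: strcpy stops copying at the first zero byte, while memcpy copies exactly the number of bytes it is given. The sketch below is only an illustration with made-up bytes (the helper name copied_prefix and the sample data are hypothetical, not part of the original program); it counts how many source bytes actually arrive in the destination with each function.

```c
#include <stddef.h>
#include <string.h>

/* Copy a 5-byte sample containing an embedded 0x00 and report how many
   leading bytes of the destination match the source afterwards.
   use_strcpy != 0 selects strcpy, otherwise memcpy is used. */
static size_t copied_prefix(int use_strcpy)
{
    /* Hypothetical sample data: two bytes, a zero byte, two more bytes. */
    const unsigned char data[5] = {0x41, 0x42, 0x00, 0x43, 0x44};
    unsigned char dst[5];
    size_t n = 0;

    memset(dst, 0xFF, sizeof dst);          /* make uncopied bytes visible */

    if (use_strcpy)
        strcpy((char *)dst, (const char *)data);  /* stops after the 0x00 */
    else
        memcpy(dst, data, sizeof data);           /* copies all 5 bytes */

    while (n < sizeof dst && dst[n] == data[n])
        n++;
    return n;
}
```

With strcpy only the bytes up to and including the zero terminator are copied, so everything after the first 0x00 in the input never reaches msg; with memcpy the whole buffer is copied regardless of its contents.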

Hard to say. I'd use a custom payload (just copy the windows/exec cmd=calc.exe payload), put a 0xcc at the start, and debug it (or use something else that is easily recognizable under the debugger, like a ud2 or \xeb\xfe). If your payload is executed, you'll see it. Then bypass the added instruction (just NOP it) and see what goes wrong with the remainder of the payload.
You'll need a custom payload. Since you're on XP SP3, you don't need to do anything crazy.
Don't smash the whole stack with the overflow (yours seems about right: just enough overflow to control EIP).
See how the target function (deal_msg in your example) behaves under normal conditions. Note the stack address when the ret is executed (and whether registers need to hold certain values; this depends on the caller).
Try to replicate that in your shellcode: you'll most probably need to adjust the stack pointer a bit at the end of your shellcode.
Make sure the caller's (main) stack hasn't been affected by executing the payload. This can happen; in that case, reserve enough room on the stack (going toward lower addresses) so that the caller's stack is far from the stack space the payload needs and is not affected by its execution.
Finally, return to the ret of the target, or directly after the call to deal_msg (or anywhere you see fit, e.g. directly to ExitProcess(), although it may be more interesting to return close to the previous "normal" execution path).
All in all, returning somewhere after the payload executes is easy (just push <addr> and ret), but you'll need to ensure that the stack is in good shape to continue execution and that most of the registers are set correctly.

Related

c code to paint an embedded stack with a pattern say (0xABABABAB) just after main begins?

I am working on dynamic memory analysis using stack painting/foot print analysis method.
dynamic-stack-depth-determination-using-footprint-analysis
Basically, the idea is to fill the entire memory area allocated to the stack with a dedicated fill value, for example 0xABABABAB, before the application starts executing. Whenever execution stops, the stack memory can be searched upwards from the end of the stack until a value that is not 0xABABABAB is found, which is assumed to be how far the stack has been used. If the dedicated value cannot be found, the stack has consumed all of its space and has most likely overflowed.
I want C code to fill the stack from top to bottom with a pattern.
void FillSystemStack()
{
extern char __stack_start,_Stack_bottom;
}
NOTE
I am using an STM32F407VG board emulated with QEMU in Eclipse.
The stack grows from higher addresses to lower addresses.
The start of the stack is 0x20020000.
The bottom of the stack is 0x2001fc00.
You shouldn't completely fill the stack after main() begins, because the stack is in use once main() begins. Completely filling the stack would overwrite the bit of stack that has already been used and could lead to undefined behavior. I suppose you could fill a portion of the stack soon after main() begins as long as you're careful not to overwrite the portion that has been used already.
But a better plan is to fill the stack with a pattern before main() is called. Review the startup code for your tool chain. The startup code initializes variable values and sets the stack pointer before calling main(). The startup code may be in assembly depending on your tool chain. The code that initializes variables is probably a simple loop that copies bytes or words from the appropriate ROM to RAM sections. You can probably use this code as an example to write a new loop that will fill the stack memory range with a pattern.
This is a Cortex M so it gets the stack pointer set out of reset. Meaning it's pretty much instantly ready to go for C code. If your reset vector is written in C and performs stacking/C function calls, it will be too late to fill the stack at a very early stage. Meaning you shouldn't do it from application C code.
The normal way to do the trick you describe is through an in-circuit debugger. Download the program, hit reset, fill the stack with the help of the debugger. There will be some convenient debugger command available to do that. Execute the program, try to use as much of it as possible, observe the stack in the memory map of your debugger.
With the insights from @kkrambo's answer, I tried to paint the stack just after the start of main, taking care not to overwrite the portion that has already been used. My stack paint and stack count functions are given below:
StackPaint
uint32_t sp;
#define STACK_CANARY 0xc5
#define WaterMark 0xc9
void StackPaint(void)
{
char *p = &__StackLimit; // __StackLimit symbol defined in linker script
sp=__get_MSP();
PRINTF("stack pointer %08x \r\n\n",sp);
while((uint32_t)p < sp)
{
if(p==&__StackTop){ // __StackTop symbol defined in linker script
*p = WaterMark;
p++;
}
*p = STACK_CANARY;
p++;
}
}
StackCount
uint16_t StackCount(void)
{
PRINTF("In the check address function in main file \r\n\n");
const char *p = &__StackLimit;
uint16_t c = 0;
while(*p == WaterMark || (*p == STACK_CANARY && p <= &__StackTop))
{
p++;
c++;
}
PRINTF("stack used:%d bytes \n",1024-c);
PRINTF("remaining stack :%d bytes\n",c);
return c;
}
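
The paint-and-count idea can also be tried on a host machine before running it on the target. The following is a minimal sketch under stated assumptions, not the code used on the board: a 1024-byte array stands in for the stack region, stack usage is modelled by overwriting the top of that array, and the names paint_region, count_unused and simulate_usage are hypothetical. Unlike the loop above, the scan is bounded by the region size rather than by the contents of memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAINT_BYTE 0xC5u   /* same fill value as STACK_CANARY above */

/* Fill the region [limit, limit + bytes) with the paint value. */
static void paint_region(uint8_t *limit, size_t bytes)
{
    memset(limit, PAINT_BYTE, bytes);
}

/* Count untouched bytes, scanning upward from the lowest address.
   The scan stops at the first non-paint byte or at the end of the region. */
static size_t count_unused(const uint8_t *limit, size_t bytes)
{
    size_t n = 0;
    while (n < bytes && limit[n] == PAINT_BYTE)
        n++;
    return n;
}

/* Host-side simulation: paint a fake 1024-byte stack, then overwrite the
   top `used` bytes, since the stack grows from high to low addresses.
   Returns the number of bytes that still carry the paint value. */
static size_t simulate_usage(size_t used)
{
    static uint8_t fake_stack[1024];

    assert(used <= sizeof fake_stack);
    paint_region(fake_stack, sizeof fake_stack);
    memset(fake_stack + sizeof fake_stack - used, 0x00, used);
    return count_unused(fake_stack, sizeof fake_stack);
}
```

Using uint8_t for the comparison avoids depending on whether plain char is signed, which matters when the fill value is above 0x7F.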

DTrace built-in variable stackdepth always returns 0

I have recently been using DTrace to analyze my iOS app.
Everything goes well except when I try to use the built-in variable stackdepth.
I read the documentation, which introduces the built-in variable stackdepth.
So I wrote some D code:
pid$target:::entry
{
self->entry_times[probefunc] = timestamp;
}
pid$target:::return
{
printf ("-----------------------------------\n");
this->delta_time = timestamp - self->entry_times[probefunc];
printf ("%s\n", probefunc);
printf ("stackDepth %d\n", stackdepth);
printf ("%d---%d\n", this->delta_time, epid);
ustack();
printf ("-----------------------------------\n");
}
I run it with sudo dtrace -s temp.d -c ./simple.out. The ustack() function works very well, but stackdepth always appears to be 0.
I tried it both on my iOS app and on a simple C program.
Does anybody know what's going on?
And how can I get the stack depth when the probe fires?
You want to use ustackdepth -- the user-land stack depth.
The stackdepth variable refers to the kernel thread stack depth; the ustackdepth variable refers to the user-land thread stack depth. When the traced program is executing in user-land, stackdepth will (should!) always be 0.
ustackdepth is calculated using the same logic as is used to walk the user-land stack as with ustack() (just as stackdepth and stack() use similar logic for the kernel stack).
This seems like a bug in the Mac / iOS implementation of DTrace to me.
However, since you're already probing every function entry and return, you could just keep a new variable self->depth and do ++ in the :::entry probe and -- in the :::return probe. This doesn't work quite right if you run it against optimized code, because any tail-call-optimized functions may look like they enter but never return. To solve that, you can turn off optimizations.
Also, because what you're doing looks a lot like this, I thought maybe you would be interested in the -F option:
Coalesce trace output by identifying function entry and return.
Function entry probe reports are indented and their output is prefixed
with ->. Function return probe reports are unindented and their output
is prefixed with <-.
The normal script to use with -F is something like:
pid$target::some_function:entry { self->trace = 1 }
pid$target:::entry /self->trace/ {}
pid$target:::return /self->trace/ {}
pid$target::some_function:return { self->trace = 0 }
Where some_function is the function whose execution you want to be printed. The output shows a textual call graph for that execution:
-> some_function
-> another_function
-> malloc
<- malloc
<- another_function
-> yet_another_function
-> strcmp
<- strcmp
-> malloc
<- malloc
<- yet_another_function
<- some_function

How to get this sqrt inline assembly working for iOS

I am trying to follow another SO post and implement sqrt14 within my iOS app:
double inline __declspec (naked) __fastcall sqrt14(double n)
{
_asm fld qword ptr [esp+4]
_asm fsqrt
_asm ret 8
}
I have modified this to the following in my code:
double inline __declspec (naked) sqrt14(double n)
{
__asm__("fld qword ptr [esp+4]");
__asm__("fsqrt");
__asm__("ret 8");
}
Above, I have removed the "__fastcall" keyword from the method definition since my understanding is that it is for x86 only. The above gives the following errors for each assembly line respectively:
Unexpected token in argument list
Invalid instruction
Invalid instruction
I have attempted to read through a few inline ASM guides and other posts on how to do this, but I am generally just unfamiliar with the language. I know MIPS quite well, but these commands/registers seem to be very different. For example, I don't understand why the original author never uses the passed in "n" value anywhere in the assembly code.
Any help getting this to work would be greatly appreciated! I am trying to do this because I am building an app where I need to calculate sqrt (ok, yes, I could do a lookup table, but for right now I care a lot about precision) on every pixel of a live-video feed. I am currently using the standard sqrt, and in addition to the rest of the computation, I'm running at around 8fps. Hoping to bump that up a frame or two with this change.
If it matters: I'm building the app to ideally be compatible with any current iOS device that can run iOS 7.1. Again, many thanks for any help.
The compiler is perfectly capable of generating an fsqrt instruction; you don't need inline asm for that. You might get some extra speed if you use -ffast-math.
For completeness' sake, here is the inline asm version:
__asm__ __volatile__ ("fsqrt" : "=t" (n) : "0" (n));
The fsqrt instruction has no explicit operands, it uses the top of the stack implicitly. The =t constraint tells the compiler to expect the output on the top of the fpu stack and the 0 constraint instructs the compiler to place the input in the same place as output #0 (ie. the top of the fpu stack again).
Note that fsqrt is of course x86-only, meaning it won't work on ARM CPUs, for example.

CUDA kernels and memory access (one kernel doesn't execute entirely and the next doesn't get launched)

I'm having trouble here. I launch two kernels and check whether some value is the one expected (memcpy to the host). If it is, I stop; if it isn't, I launch the two kernels again.
the first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
int obj = threadIdx.x;
int ant = blockIdx.x;
int id = threadIdx.x + blockIdx.x * blockDim.x;
*(data->added) = 1;
while(*(data->added) == 1)
{
*(data->added) = 0;
//check if obj fits
int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));
if(obj == 0)
printf("ant %d going..\n", ant);
__syncthreads();
...
The code goes on after this. But that printf never gets printed; the __syncthreads() is there just for debugging purposes.
The "added" variable was shared, but since shared memory is a PITA and usually introduces bugs in the code, I just removed it for now. This "added" variable isn't the smartest thing to do, but it's faster than the alternative, which is checking on the host whether any element of an array has some value and deciding whether to keep iterating.
getElement simply does the pitched-matrix address calculation to reach the right position and returns the element there:
int* el = (int*) ((char*)mat + row * pitch) + col;
return *el;
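The address arithmetic in getElement can be checked on the host with ordinary memory. The sketch below is an illustration under assumptions: the 32-byte pitch, the 3x5 matrix, 4-byte ints, and the names get_element and pitched_demo are all made up for the example and do not come from the original program. Each row starts pitch bytes after the previous one, and the column index is applied only after the row start has been converted back to an int pointer.

```c
#include <stddef.h>
#include <stdlib.h>

/* Return element (row, col) of a pitched int matrix; pitch is in bytes. */
static int get_element(const int *mat, size_t pitch, int row, int col)
{
    const int *row_start =
        (const int *)((const char *)mat + (size_t)row * pitch);
    return row_start[col];
}

/* Build a 3x5 matrix whose rows are padded to 32 bytes, store
   10 * r + c in each element, and read one element back.
   Returns -1 if the allocation fails. */
static int pitched_demo(int row, int col)
{
    enum { ROWS = 3, COLS = 5 };
    const size_t pitch = 32;   /* assumed: 20 data bytes padded to 32 */
    char *buf = calloc(ROWS, pitch);
    int value;

    if (buf == NULL)
        return -1;

    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++)
            ((int *)(buf + (size_t)r * pitch))[c] = 10 * r + c;

    value = get_element((const int *)buf, pitch, row, col);
    free(buf);
    return value;
}
```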
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right at the beginning, and it doesn't get printed either. After the printf it sets a variable in device memory, and this memory is copied to the CPU after the kernel finishes, but it doesn't hold the right value when I print it in the CPU code. So I think this kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances. When I launch 8 blocks and 512 threads, it runs OK. 32 blocks, 512 threads, OK. But with 8 blocks and 1024 threads the kernel doesn't work, and neither does 32 blocks and 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: I tried removing the "added" variable and the while loop, so it should execute just once. It still doesn't work; nothing gets printed, even if the printf is right after the three initial lines, and the next kernel also doesn't print anything.
edit: another thing, I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with a maximum of 512 or check how high I can set this value.
__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers a race condition and is nondeterministic, so it most probably evaluates to different results for different threads.
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't, due to the problem mentioned above, so the output never shows up. You could have figured this out by checking the return codes of all CUDA function calls for errors.

The art of exploitation - exploit_notesearch.c

I've got a question regarding the exploit_notesearch program.
This program is only used to create a command string we finally call with the system() function to exploit the notesearch program that contains a buffer overflow vulnerability.
The commandstr looks like this:
./notesearch Nop-block|shellcode|repeated ret(will jump in nop block).
Now the actual question:
The ret address is calculated in the exploit_notesearch program by the line:
ret = (unsigned int) &i-offset;
So why can we use the address of the variable i, which sits near the bottom of the main stack frame of the exploit_notesearch program, to calculate the ret address? That address will be saved in an overflowing buffer in the notesearch program itself, so in a completely different stack frame, and it has to contain an address inside the NOP block (which is in the same buffer).
"that will be saved in an overflowing buffer in the notesearch program itself, so in a completely different stack frame"
As long as the system uses virtual memory, system() will create another process for the vulnerable program. Assuming that there is no stack randomization,
both processes will have almost identical values of esp (as well as offset) when their main() functions start, given that the exploit was compiled on the attacked machine (i.e. the one with the vulnerable notesearch).
The address of variable i was chosen just to give an idea about where the frame base is. We could use this instead:
unsigned long sp(void) // This is just a little function
{ __asm__("movl %esp, %eax");} // used to return the stack pointer
int main(){
esp = sp();
ret = esp - offset;
//the rest part of main()
}
Because the variable i will be located on relatively constant distance from esp, we can use &i instead of esp, it doesn't matter much.
It would be much more difficult to get an approximate value for ret if the system did not use virtual memory.
The stack is allocated in a first-in, last-out manner. The variable i is located somewhere near the top; let's assume it is at 0x200, and that the return address is located at a lower address, 0x180. To determine where to put the return address while still leaving some space for the shellcode, the attacker must get the difference, which is 0x200 - 0x180 = 0x80 (128). He then breaks that down: the return address is 4 bytes, so only 48 bytes are left before reaching the segmentation fault. That is how it is calculated, and the location of i gives an approximate reference point.
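The subtraction described here can be written out as a small worked example. This is only a sketch of the arithmetic: the helper name guess_ret is hypothetical, and the addresses 0x200 and 0x180 are the illustrative values from the paragraph above, not real stack addresses.

```c
#include <stdint.h>

/* Estimate a target address by stepping `offset` bytes below a known
   reference address, mirroring ret = (unsigned int) &i - offset. */
static uintptr_t guess_ret(uintptr_t reference, uintptr_t offset)
{
    return reference - offset;
}
```

With the example numbers, guess_ret(0x200, 0x80) yields 0x180. In the program from the question the reference would be (uintptr_t)&i, and the offset is found by trial, which is why the NOP block is there to tolerate an imprecise guess.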

Resources