Part of pthread stack seems to be already used

I've set the stack size of a pthread in Linux to 16 KB. If I then push an array bigger than 8 KB onto the stack, the application stops with a segmentation fault. It seems to me that I am trying to access memory below the bottom of the stack, which is probably unmapped memory and hence the segfault.
Here is the sample code:
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>
#include <string.h>

void *start_routine(void *arg)
{
    /* A 9 KB array on a 16 KB stack is enough to trigger the segfault. */
    size_t size = 9*1024;
    unsigned char arr[size];
    memset(arr, 0, size);
    return NULL;
}
int main()
{
    pthread_attr_t threadAttr;
    size_t stacksize;
    pthread_t thread;

    pthread_attr_init(&threadAttr);
    pthread_attr_setstacksize(&threadAttr, 16*1024);
    pthread_attr_getstacksize(&threadAttr, &stacksize);
    printf("stacksize: %zu\n", stacksize);   /* %zu: stacksize is a size_t */
    pthread_create(&thread, &threadAttr, start_routine, NULL);
    pthread_join(thread, NULL);
    return 0;
}
It seems strange that I lose around 8 KB of stack. I also tried slightly larger stack sizes, and how much of the stack I can actually use seems to vary.
I know that for today's systems (except some embedded systems) these few kilobytes are not really important, but I'm just curious why I cannot use most of the defined stack. I don't expect to be able to use the whole stack, but losing around 8 KB seems like quite a lot.
What information is put on the thread's stack before the entry routine is called?
Thanks
Philip

After some investigation in the glibc NPTL source code I came to the conclusion that the pthread struct of the thread that owns the stack is placed at the bottom of the stack, likely together with some other variables depending on the glibc configuration; together they use around 3 KB. The top of the stack is taken up by a guard page, which is typically 4 KB. So around 7-8 KB are already used. I am a bit surprised that the memory for the guard page, at least, is not allocated separately. Off the top of my head I thought I remembered that being the case, but it isn't.
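You can see this for yourself from inside the thread. Here is a minimal sketch (assuming glibc on Linux, since pthread_getattr_np is a GNU extension; compile with -pthread) that queries the stack region the thread was actually given:

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>

/* Sketch: report the stack region glibc actually handed to this thread. */
static void *report_stack(void *arg)
{
    pthread_attr_t attr;
    void *stack_addr;
    size_t stack_size, guard_size;

    pthread_getattr_np(pthread_self(), &attr);
    pthread_attr_getstack(&attr, &stack_addr, &stack_size);
    pthread_attr_getguardsize(&attr, &guard_size);
    printf("stack base: %p, size: %zu, guard: %zu\n",
           stack_addr, stack_size, guard_size);

    /* The distance from the highest stack address down to a local
       variable approximates how much the pthread struct and TLS data
       have already consumed. */
    printf("bytes used above this frame: %zu\n",
           (size_t)((char *)stack_addr + stack_size - (char *)&attr));
    pthread_attr_destroy(&attr);
    return NULL;
}

int main(void)
{
    pthread_attr_t threadAttr;
    pthread_t thread;

    pthread_attr_init(&threadAttr);
    pthread_attr_setstacksize(&threadAttr, 16*1024);
    pthread_create(&thread, &threadAttr, report_stack, NULL);
    pthread_join(thread, NULL);
    return 0;
}

Comparing the reported size and guard size against the usable distance down to a local frame should make the few kilobytes of overhead visible directly.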

Related

CUDA "out of memory" with plenty of memory in the VRAM

Seems like there are a lot of questions on here about moving double (or int, or float, etc.) 2D arrays from host to device. This is NOT my question.
I have already moved all of the data onto the GPU, and the __global__ kernel calls several __device__ functions.
In these device kernels, I have tried the following:
To allocate:
__device__ double** matrixCreate(int rows, int cols, double initialValue)
{
    // Device-side malloc returns NULL when the device heap is
    // exhausted, so the results really ought to be checked.
    double** temp = (double**)malloc(rows*sizeof(double*));
    for (int j = 0; j < rows; j++)
        temp[j] = (double*)malloc(cols*sizeof(double));
    // Set initial values
    for (int i = 0; i < rows; i++)
    {
        for (int j = 0; j < cols; j++)
        {
            temp[i][j] = initialValue;
        }
    }
    return temp;
}
To deallocate:
__device__ void matrixDestroy(double** temp, int rows)
{
    for (int j = 0; j < rows; j++) { free(temp[j]); }
    free(temp);
}
For single-dimension arrays the __device__ mallocs work great, but I can't seem to keep things stable in the multidimensional case. By the way, the variables are sometimes used like this:
double** z=matrixCreate(2,2,0);
double* x=z[0];
However, care is always taken to ensure no calls to free are made on active data. The code is actually an adaptation of CPU-only code, so I know nothing funny is going on with the pointers or memory. Basically I'm just redefining the allocators and throwing a __device__ on the serial portions. I just want to run the whole serial bit 10000 times, and the GPU seems like a good way to do it.
++++++++++++++ UPDATE +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Problem solved by Vyas. As per the CUDA specifications, the device heap size is initially set to 8 MB; if your mallocs exceed this, NSIGHT will not launch and the kernel crashes. Use the following in host code.
// size[0] here is the total byte count of the allocations elsewhere in the host code
size_t increaseHeap = 10;
cudaDeviceSetLimit(cudaLimitMallocHeapSize, size[0]*increaseHeap);
Worked for me!
The GPU-side malloc() is a suballocator from a limited heap. Depending on the number of allocations, it is possible the heap is being exhausted. You can change the size of the backing heap using cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size). For more info see: CUDA programming guide.
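For completeness, here is a minimal host-side sketch of that fix (the 64 MB figure is just an assumed example value, not something from the question). The limit must be raised before the first kernel that calls device-side malloc is launched, and cudaDeviceGetLimit can read it back:

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    size_t heapSize = 0;
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("default device malloc heap: %zu bytes\n", heapSize);

    // Raise the heap to 64 MB (example value) BEFORE any kernel
    // that calls device-side malloc is launched.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("device malloc heap now: %zu bytes\n", heapSize);
    return 0;
}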

Binary Exploitation - ASLR

This is more of a general question about how ASLR actually prevents buffer overflow exploits. The statement I keep seeing is that it randomises the address space of the stack and the executable. It then goes on to say that for the exploit to work, the locations of the executable and the stack are needed. That's the part I am getting confused on: all of the examples I have seen of buffer overflows don't trouble themselves with finding the location of these things.
This is one of the examples I looked at, and all the other ones are pretty much the same. It doesn't mention or do anything to do with the location of the stack or executables.
Here is the link to the example in case the answer is there and I am not understanding something:
https://www.coengoedegebure.com/buffer-overflow-attacks-explained/#:~:text=A%20buffer%20overflow%20occurs%20when,possibly%20taking%20over%20the%20machine.
Sorry if this is a dumb question
#include <stdio.h>
#include <string.h>

void func(char *name)
{
    char buf[100];
    strcpy(buf, name);   /* no bounds check: overflows if name > 99 chars */
    printf("Welcome %s\n", buf);
}

int main(int argc, char *argv[])
{
    func(argv[1]);
    return 0;
}
Okay, so once we do a buffer overflow, we want to redirect control flow to something that we know.
Let's make this really simple and assume that the stack itself is executable. We could pop a reverse shell using this, for example. But, how do we run this shellcode, exactly?
So, what we want to do is overflow the buffer and change the return address on the stack so that it points back into the stack itself (in particular, at the offset where our shellcode sits; we could find these offsets using gdb, for example).
But, how do we know where the stack is exactly?
A long time ago, before ASLR, the stack was always in a particular spot, so we could just use gdb (for example) to find where the stack was, and just change the return pointer to point to the stack (in particular, where our shellcode is).
But, with ASLR, we either need to brute force the stack position, along with NOP sleds (assuming we have enough room), or achieve address disclosure of the stack through another bug.
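A toy program (a sketch, not an exploit) makes the randomisation visible: run it a few times and compare the printed addresses. The stack address changes whenever ASLR is on; the code address changes only if the binary is also built as position-independent (PIE, e.g. gcc -pie -fPIE).

#include <stdio.h>

int main(void)
{
    int local = 0;
    /* With ASLR on, the stack address differs from run to run.
       The address of main() moves too, but only for PIE binaries.
       Classic exploits relied on these addresses being fixed. */
    printf("stack local at: %p\n", (void *)&local);
    printf("main at:        %p\n", (void *)main);
    return 0;
}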

Process memory mapping in C++

#include <iostream>

int main(int argc, char** argv) {
    int* heap_var = new int[1];
    /*
     * Page size 4KB == 4*1024 == 4096
     */
    heap_var[1025] = 1;   // out of bounds: 4100 bytes from the start of a 4-byte allocation
    std::cout << heap_var[1025] << std::endl;
    return 0;
}
// Output: 1
In the above code, I allocated 4 bytes of space on the heap. Since the OS maps virtual memory to system memory in pages (4 KB each), a 4 KB block of my virtual memory's heap would get mapped to system memory. For testing, I decided to try accessing other addresses in my allocated page/heap block, and it worked. However, I shouldn't have been allowed to access more than 4096 bytes from the start (index 1025 is 4100 bytes in, since each int is 4 bytes).
I'm confused why I am able to access 4*1025 bytes (more than the size of the page that has been allocated) from the start of the heap block and not get a segfault.
Thanks.
The platform allocator likely allocated far more than a page, since it plans to use that memory "bucket" for other allocations or keeps some internal state there; in release builds there is likely far more than just a page-sized chunk of virtual memory there. You also don't know where within that particular page the memory was allocated (you can find out by masking some bits), and without knowing the platform/arch (I'm assuming x86_64) there is no telling that the page is even 4 KB; it could be a 2 MB "huge" page or something similar.
But by accessing outside the array bounds you're triggering undefined behavior: crashes in the case of reads, or data corruption in the case of writes.
Don't use memory that you don't own.
I should also mention that this is likely unrelated to C++, since the new[] operator usually just invokes malloc/calloc behind the scenes in the core platform library (be that libSystem on OS X, or glibc or musl or whatever else on Linux, or even an intercepting allocator). The segfaults you experience usually come from guard pages around heap blocks or, in the absence of guard pages, from simply touching unmapped memory.
NB: Don't try this at home. There are cases where you may intentionally trigger what would in general be considered undefined behavior, but where on that specific platform you know exactly what is there (a good example is abusing the opaque pthread_t on Linux to get the tid without the overhead of an extra syscall, but you have to make sure you're using the right libc, the right build type of that libc, the right version of that libc, the right compiler that it was built with, etc.).
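To illustrate the "allocator hands out more than you asked for" point, here is a small sketch assuming glibc on Linux, where malloc_usable_size is declared in <malloc.h> (on OS X the equivalent is malloc_size from <malloc/malloc.h>):

#include <stdio.h>
#include <stdlib.h>
#include <malloc.h>   /* glibc-specific: malloc_usable_size() */

int main(void)
{
    int *p = malloc(sizeof(int));  /* ask for 4 bytes... */
    /* ...but the allocator's chunk is larger (typically 24+ bytes
       on glibc x86_64), and the pages backing it larger still. */
    printf("requested: %zu, usable: %zu\n",
           sizeof(int), malloc_usable_size(p));
    free(p);
    return 0;
}

Note that even the usable size only reflects the allocator's bookkeeping; it says nothing about how much of the surrounding page is mapped, which is why the out-of-bounds write above happened to "work".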

Using inactive memory to my advantage. What is this code storing in RAM or inactive memory?

I'm developing on OS X 10.8.3. The following code is simple. It can perform two operations. If the read function is uncommented, the program will open the file at "address" and transfer all of its contents into data. If instead the memcpy function is uncommented, the program will copy the mmapped contents into data. I am developing on a Mac, which caches commonly used files in the inactive memory of RAM for faster future access. I have turned off caching in both the file control and the mmap because I am working with large files of 1 GB or greater. If I did not set the NOCACHE option, the entire 1 GB would be stored in inactive memory.
If the read function is uncommented, the program behaves as expected. Nothing is cached, and every time the program is run it takes about 20 seconds to read the entire 1 GB.
But if instead the memcpy function is uncommented, something changes. Still I see no increase in memory, and it still takes 20 seconds to copy on the first run. But every execution after the previous one copies in under a second. This is very analogous to the behavior of caching the entire file in inactive memory, but I never see an increase in memory. Even if I then do not mmap the file and only perform a read, it performs in the same time, under a second.
Something must be getting stored in inactive memory, but what and how do I track it? I would like to find what is being stored and use it to my advantage.
I am using activity monitor to see a general memory size. I am using Xcode Instruments to compare the initial memcpy execution to an execution where both read and memcpy are commented. I see no difference in the Allocations, File Activity, Reads/Writes, VM Tracker, or Shared Memory tools.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main(int argc, const char * argv[])
{
    unsigned char *data;
    unsigned char *mmapdata;
    size_t length;

    int file = open("address", O_RDONLY);
    fcntl(file, F_NOCACHE, 1);          /* OS X: bypass the buffer cache */

    struct stat st;
    stat("address", &st);
    length = st.st_size;

    data = malloc(length);
    memset(data, 0, length);

    mmapdata = mmap(NULL, length, PROT_READ, MAP_SHARED|MAP_NOCACHE, file, 0);
    if (mmapdata == MAP_FAILED)
        fprintf(stderr, "failure\n");

    // read(file, data, length);
    close(file);
    // memcpy(data, mmapdata, length);
    munmap(mmapdata, length);
    free(data);
    return 0;
}
UPDATE:
Sorry if I was unclear. During program execution, the active memory portion of my RAM increases according to the data I malloc and the size of the mmapped file. This is surely where the pages are residing. After cleanup, the amount of available memory returns to what it was before. Inactive memory is never increased. It makes sense that the OS wouldn't really throw that active memory away, since free memory is useless, but this process is not identical to caching, for the following reason.
I've tested two scenarios. In both I load a number of files whose sizes total more than my available RAM. In one scenario I cache the files and in one I do not. With caching, my inactive memory increases, and once I fill my RAM everything slows down tremendously. Loading a new file will replace another file's allocated inactive memory, but this takes exceptionally longer than in the next scenario. The next scenario is with caching off. I again run the program several times, loading enough files to fill my RAM, but inactive memory never increases and active memory always returns to normal, so it appears I've done nothing. The files I've mmapped still load fast, just as before, and mmapping new files takes a normal amount of time, replacing other files. My system never slows down with this method. Why is the second scenario faster?
How could the OS possibly make a memcpy on an mmap'd file work if the file's pages weren't resident in memory? The OS takes your hint that you don't want the data cached, but it still will cache it if it has no choice, or if it has nothing better to do with the memory.
Your pages have the lowest priority, because the OS believes you when you say you won't access them again. But they had to be resident for the memcpy to work, and the OS won't throw them away just to have free memory (which is 100% useless). Inactive memory is better than free memory because there's at least some chance it might save I/O operations.
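If you want to track this yourself, one option is mincore, which reports per-page residency for a mapping. A sketch (assuming the address passed in is the page-aligned pointer returned by mmap; note the vec element type is char on OS X but unsigned char on Linux):

#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

/* Count how many pages of an mmap'd region are resident in RAM. */
static size_t resident_pages(void *addr, size_t length)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t npages = (length + page - 1) / page;
    char *vec = malloc(npages);
    size_t resident = 0;

    if (mincore(addr, length, vec) == 0) {
        for (size_t i = 0; i < npages; i++)
            if (vec[i] & 1)          /* low bit set => page is resident */
                resident++;
    }
    free(vec);
    return resident;
}

Calling resident_pages(mmapdata, length) before and after the memcpy should show the mapping's pages becoming resident, which is where the "invisible" cache lives.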

How do I tell if two addresses are in the same page file?

What math is involved and how do I tell if two addresses are in the same 4 kilobyte page?
Well, assuming 4 KiB pages,
#include <stdbool.h>
#include <stdint.h>

bool same_page(const void *x, const void *y)
{
    /* Clear the low 12 bits (the offset within a 4 KiB page); two
       addresses are in the same page iff the remaining bits match. */
    uintptr_t mask = ~(uintptr_t) 4095;
    return ((uintptr_t) x & mask) == ((uintptr_t) y & mask);
}
This can get ugly quickly, since pages have a variable size on common architectures, and the page size of a particular region of memory can and will be changed by the operating system on the fly depending on application memory usage patterns.
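If you only want to avoid hardcoding 4095, a variant of the sketch above can ask the OS for the base page size at runtime (this still won't account for regions the OS has transparently promoted to huge pages):

#include <stdbool.h>
#include <stdint.h>
#include <unistd.h>

/* Same check, but using the base page size reported by the OS.
   sysconf() returns only the default page size, so huge-page
   regions are still not handled. */
bool same_page_rt(const void *x, const void *y)
{
    uintptr_t mask = ~((uintptr_t)sysconf(_SC_PAGESIZE) - 1);
    return ((uintptr_t)x & mask) == ((uintptr_t)y & mask);
}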
(Note that memory pages are virtual memory, not physical memory. Strictly speaking, it does not make sense to talk about physical pages, although we usually understand that when someone says "physical page" they mean "the physical memory corresponding to a page".)
