Process memory mapping in C++ - memory

#include <iostream>
int main(int argc, char** argv) {
int* heap_var = new int[1];
/*
* Page size 4KB == 4*1024 == 4096
*/
heap_var[1025] = 1;
std::cout << heap_var[1025] << std::endl;
return 0;
}
// Output: 1
In the above code, I allocated 4 bytes of space in the heap. Now as the OS maps the virtual memory to system memory in pages (which are 4KB each), A block of 4KB in my virtual mems heap would get mapped to the system mem. For testing I decided I would try to access other addresses in my allocated page/heap-block and it worked, however I shouldn't have been allowed to access more than 4096 bytes from the start (which means index 1025 as an int variable is 4 bytes).
I'm confused why I am able to access 4*1025 bytes (More than the size of the page that has been allocated) from the start of the heap block and not get a seg fault.
Thanks.

The platform allocator likely allocated far more than the page size is since it is planning to use that memory "bucket" for other allocation or is likely keeping some internal state there, it is likely that in release builds there is far more than just a page sized virtual memory chunk there. You also don't know where within that particular page the memory has been allocated (you can find out by masking some bits) and without mentioning the platform/arch (I'm assuming x86_64) there is no telling that this page is even 4kb, it could be a 2MB "huge" page or anything alike.
But by accessing outside array bounds you're triggering undefined behavior like crashes in case of reads or data corruption in case of writes.
Don't use memory that you don't own.
I should also mention that this is likely unrelated to C++ since the new[] operator usually just invokes malloc/calloc behind the scenes in the core platform library (be that libSystem on OSX or glibc or musl or whatever else on Linux, or even an intercepting allocator). The segfaults you experience are usually from guard pages around heap blocks or in absence of guard pages there simply using unmapped memory.
NB: Don't try this at home: There are cases where you may intentionally trigger what would be considered undefined behavior in general, but on that specific platform you may know exactly what's there (a good example is abusing pthread_t opaque on Linux to get tid without an overhead of an extra syscall, but you have to make sure you're using the right libc, the right build type of that libc, the right version of that libc, the right compiler that it was built with etc).

Related

Anonymous mmap zero-filled?

In Linux, the mmap(2) man page explains that an anonymous mapping
. . . is not backed by any file; its contents are initialized to zero.
The FreeBSD mmap(2) man page does not make a similar guarantee about zero-filling, though it does promise that bytes after the end of a file in a non-anonymous mapping are zero-filled.
Which flavors of Unix promise to return zero-initialized memory from anonymous mmaps? Which ones return zero-initialized memory in practice, but make no such promise on their man pages?
It is my impression that zero-filling is partially for security reasons. I wonder if any mmap implementations skip the zero-filling for a page that was mmapped, munmapped, then mmapped again by a single process, or if any implementations fill a newly mapped page with pseudorandom bits, or some non-zero constant.
P.S. Apparently, even brk and sbrk used to guarantee zero-filled pages. My experiments on Linux seem to indicate that, even if full pages are zero-filled upon page fault after a sbrk call allocates them, partial pages are not:
#include <unistd.h>
#include <stdio.h>
int main() {
const intptr_t many = 100;
char * start = sbrk(0);
sbrk(many);
for (intptr_t i = 0; i < many; ++i) {
start[i] = 0xff;
}
printf("%d\n",(int)start[many/2]);
sbrk(many/-2);
sbrk(many/2);
printf("%d\n",(int)start[many/2]);
sbrk(-1 * many);
sbrk(many/2);
printf("%d\n",(int)start[0]);
}
Which flavors of Unix promise to return zero-initialized memory from anonymous mmaps?
GNU/Linux
As you said in your question, the Linux version of mmap promises to zero-fill anonymous mappings:
MAP_ANONYMOUS
The mapping is not backed by any file; its contents are initialized to zero.
NetBSD
The NetBSD version of mmap promises to zero-fill anonymous mappings:
MAP_ANON
Map anonymous memory not associated with any specific file. The file descriptor is not used for creating MAP_ANON regions, and must be specified as -1. The mapped memory will be zero filled.
OpenBSD
The OpenBSD manpage of mmap does not promise to zero-fill anonymous mappings. However, Theo de Raadt (prominent OpenBSD developer), declared in November 2019 on the OpenBSD mailing list:
Of course it is zero filled. What else would it be? There are no plausible alternatives.
I think it detracts from the rest of the message to say something so obvious.
And the other OpenBSD developers did not contradict him.
IBM AIX
The AIX version of mmap promises to zero-fill anonymous mappings:
MAP_ANONYMOUS
Specifies the creation of a new, anonymous memory region that is initialized to all zeros.
HP-UX
According to nixdoc.net, the HP-UX version of mmap promises to zero-fill anonymous mappings:
If MAP_ANONYMOUS is set in flags, a new memory region is created and initialized to all zeros.
Solaris
The Solaris version of mmap promises to zero-fill anonymous mappings:
When MAP_ANON is set in flags, and fildes is set to -1, mmap() provides a direct path to return anonymous pages to the caller. This operation is equivalent to passing mmap() an open file descriptor on /dev/zero with MAP_ANON elided from the flags argument.
This Solaris man page gives us a way to get zero-filled memory pages without relying on the behavior of mmap used with the MAP_ANONYMOUS flag: do not use the MAP_ANONYMOUS flag, and create a mapping backed by the /dev/zero file. It would be useful to know the list of Unix-like operating systems providing the /dev/zero file, to see if this approach is more portable than using the MAP_ANONYMOUS flag (neither /dev/zero nor MAP_ANONYMOUS are POSIX).
Interestingly, the Wikipedia article about /dev/zero claims that MAP_ANONYMOUS was introduced to remove the need of opening /dev/zero when creating an anonymous mapping.
It's hard to say which ones promise what without simply exhaustively enumerating all man pages or other release documentation, but the underlying code that handles MAP_ANON is (usually? always?) also used to map in bss space for executables, and bss space needs to be zero-filled. So it's pretty darn likely.
As for "giving you back your old values" (or some non-zero values but most likely, your old ones) if you unmap and re-map, it certainly seems possible, if some system were to be "lazy" about deallocation. I have only used a few systems that support mmap in the first place (BSD and Linux derivatives) and neither one is lazy that way, at least, not in the kernel code handling mmap.
The reason sbrk might or might not zero-fill a "regrown" page is probably tied to history, or lack thereof. The current FreeBSD code matches with what I recall from the old, pre-mmap days: there are two semi-secret variables, minbrk and curbrk, and both brk and sbrk will only invoke SYS_break (the real system call) if they are moving curbrk to a value that is at least minbrk. (Actually, this looks slightly broken: brk has the at-least behavior but sbrk just adds its argument to curbrk and invokes SYS_break. Seems harmless since the kernel checks, in sys_obreak() in /sys/vm/vm_unix.c, so a too-negative sbrk() will fail with EINVAL.)
I'd have to look at the Linux C library (and then perhaps kernel code too) but it may simply ignore attempts to "lower the break", and merely record a "logical break" value in libc. If you have mmap() and no backwards compatibility requirements, you can implement brk() and sbrk() entirely in libc, using anonymous mappings, and it would be trivial to implement both of them as "grow-only", as it were.

Using inactive memory to my advantage. What is this code storing in RAM or inactive memory?

I'm developing on OS X 10.8.3. The following code is simple. It can perform two operations. If the read function is uncommented then the program will open the file at "address" and transfer all of its contents into data. If instead, the memcpy function is uncommented the program will copy the mmapped contents into data. I am developing on a mac which caches commonly used files in inactive memory of RAM for faster future access. I have turned off caching in both the file control and mmap becuase I am working with large files of 1 GB or greater. If i did not setup the NOCACHE option, the entire 1 GB would be stored in inactive memory.
If the read function is uncommented, the program behaves as expected. Nothing is cached and everytime the program is ran it takes about 20 seconds to read the entire 1 GB.
But if instead the memcpy function is uncommented something changes. Still i see no increase in memory and it still takes 20 seconds to copy on the first run. But every execution after the previous one, copies in under a second. This is very analogous to the behavior of caching the entire file in inactive memory, but I never see an increase in memory. Even if I then do not mmap the file and only perform a read, it performs in the same time, under a second.
Something must be getting stored in inactive memory, but what and how do I track it? I would like to find what is being stored and use it to my advantage.
I am using activity monitor to see a general memory size. I am using Xcode Instruments to compare the initial memcpy execution to an execution where both read and memcpy are commented. I see no difference in the Allocations, File Activity, Reads/Writes, VM Tracker, or Shared Memory tools.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/mman.h>
int main(int argc, const char * argv[])
{
unsigned char *data;
unsigned char *mmapdata;
size_t length;
int file = open("address", O_RDONLY);
fcntl(file, F_NOCACHE, 1);
struct stat st;
stat("address", &st);
length = st.st_size;
data = malloc(length);
memset(data,0,length);
mmapdata = mmap(NULL, length, PROT_READ,MAP_SHARED|MAP_NOCACHE, file, 0);
if (mmapdata == MAP_FAILED)
fprintf(stderr, "failure");
// read(file,data,length);
close(file);
// memcpy(data,mmapdata,length);
munmap(mmapdata,length);
free(data);
return 0;
}
UPDATE:
Sorry if I was unclear. During program execution, the active memory portion of my RAM increases according to the data I malloc and the size of the mmapped file. This is surely where the pages are residing. After cleanup, the amount of available memory returns to as it was before. Inactive memory is never increased. It makes sense that the OS wouldn't really throw that active memory away since free memory is useless, but this process is not identical to caching, because of the following reason. I've tested two scenarios. In both I load a number of files who's size totals more than my available ram. One scenario I cache the files and one I do not. With caching, my inactive memory increases and once I fill my ram everything slows down tremendously. Loading a new file will replace another file's allocated inactive memory but this process will take an exceptionally longer time than the next scenario. The next scenario is with caching off. I again run the program several times loading enough files to fill my ram, but inactive memory never increases and active memory always returns to normal so it appears I've done nothing. The files I've mmapped still load fast, just as before, but mmapping new files load in a normal amount of time, replacing other files. My system never slows down with this method. Why is the second scenario faster?
How could the OS possibly make a memcpy on an mmap'd file work if the file's pages weren't resident in memory? The OS takes your hint that you don't want the data cached, but if still will if it has no choice or it has nothing better to do with the memory.
Your pages have the lowest priority, because the OS believes you that you won't access them again. But they had to be resident for the memcpy to work, and the OS won't throw them away just to have free memory (which is 100% useless). Inactive memory is better than free memory because there's at least some chance it might save I/O operations.

OS memory allocation addresses

Quick curious question, memory allocation addresses are choosed by the language compiler or is it the OS which chooses the addresses for the memory asked?
This is from a doubt about virtual memory, where it could be quickly explained as "let the process think he owns all the memory", but what happens on 64 bits architectures where only 48 bits are used for memory addresses if the process wants a higher address?
Lets say you do a int a = malloc(sizeof(int)); and you have no memory left from the previous system call so you need to ask the OS for more memory, is the compiler the one who determines the memory address to allocate this variable, or does it just ask the OS for memory and it allocates it on the address returned by it?
It would not be the compiler, especially since this is dynamic memory allocation. Compilation is done well before you actually execute your program.
Memory reservation for static variables happens at compile time. But the static memory allocation will happen at start-up, before the user defined Main.
Static variables can be given space in the executable file itself, this would then be memory mapped into the process address space. This is only one of few times(?) I can image the compiler actually "deciding" on an address.
During dynamic memory allocation your program would ask the OS for some memory and it is the OS that returns a memory address. This address is then stored in a pointer for example.
The dynamic memory allocation in C/C++ is simply done by runtime library functions. Those functions can do pretty much as they please as long as their behavior is standards-compliant. A trivial implementation of compliant but useless malloc() looks like this:
void * malloc(size_t size) {
return NULL;
}
The requirements are fairly relaxed -- the pointer has to be suitably aligned and the pointers must be unique unless they've been previously free()d. You could have a rather silly but somewhat portable and absolutely not thread-safe memory allocator done the way below. There, the addresses come from a pool that was decided upon by the compiler.
#include "stdint.h"
// 1/4 of available address space, but at most 2^30.
#define HEAPSIZE (1UL << ( ((sizeof(void*)>4) ? 4 : sizeof(void*)) * 2 ))
// A pseudo-portable alignment size for pointerĊšbwitary types. Breaks
// when faced with SIMD data types.
#define ALIGNMENT (sizeof(intptr_t) > sizeof(double) ? sizeof(intptr_t) : siE 1Azeof(double))
void * malloc(size_t size)
{
static char buffer[HEAPSIZE];
static char * next = NULL;
void * result;
if (next == NULL) {
uintptr_t ptr = (uintptr_t)buffer;
ptr += ptr % ALIGNMENT;
next = (char*)ptr;
}
if (size == 0) return NULL;
if (next-buffer > HEAPSIZE-size) return NULL;
result = next;
next += size;
next += size % ALIGNMENT;
return result;
}
void free(void * ptr)
{}
Practical memory allocators don't depend upon such static memory pools, but rather call the OS to provide them with newly mapped memory.
The proper way of thinking about it is: you don't know what particular pointer you are going to get from malloc(). You can only know that it's unique and points to properly aligned memory if you've called malloc() with a non-zero argument. That's all.

Is it possible to use cudaMemcpy with src and dest as different types?

I'm using a Tesla, and for the first time, I'm running low on CPU memory instead of GPU memory! Hence, I thought I could cut the size of my host memory by switching all integers to short (all my values are below 255).
However, I want my device memory to use integers, since the memory access is faster. So is there a way to copy my host memory (in short) to my device global memory (in int)? I guess this won't work:
short *buf_h = new short[100];
int *buf_d = NULL;
cudaMalloc((void **)&buf_d, 100*sizeof(int));
cudaMemcpy( buf_d, buf_h, 100*sizeof(short), cudaMemcpyHostToDevice );
Any ideas? Thanks!
There isn't really a way to do what you are asking directly. The CUDA API doesn't support "smart copying" with padding or alignment, or "deep copying" of nested pointers, or anything like that. Memory transfers require linear host and device memory, and alignment must be the same between source and destination memory.
Having said that, one approach to circumvent this restriction would be to copy the host short data to an allocation of short2 on the device. Your device code can retrieve a short2 containing two packed shorts, extract the value it needs and then cast the value to int. This will give the code 32 bit memory transactions per thread, allowing for memory coalescing, and (if you are using Fermi GPUs) good L1 cache hit rates, because adjacent threads within a block would be reading the same 32 bit word. On non Fermi GPUs, you could probably use a shared memory scheme to efficiently retrieve all the values for a block using coalesced reads.

Memory access after ioremap very slow

I'm working on a Linux kernel driver that makes a chunk of physical memory available to user space. I have a working version of the driver, but it's currently very slow. So, I've gone back a few steps and tried making a small, simple driver to recreate the problem.
I reserve the memory at boot time using the kernel parameter memmap=2G$1G. Then, in the driver's __init function, I ioremap some of this memory, and initialize it to a known value. I put in some code to measure the timing as well:
#define RESERVED_REGION_SIZE (1 * 1024 * 1024 * 1024) // 1GB
#define RESERVED_REGION_OFFSET (1 * 1024 * 1024 * 1024) // 1GB
static int __init memdrv_init(void)
{
struct timeval t1, t2;
printk(KERN_INFO "[memdriver] init\n");
// Remap reserved physical memory (that we grabbed at boot time)
do_gettimeofday( &t1 );
reservedBlock = ioremap( RESERVED_REGION_OFFSET, RESERVED_REGION_SIZE );
do_gettimeofday( &t2 );
printk( KERN_ERR "[memdriver] ioremap() took %d usec\n", usec_diff( &t2, &t1 ) );
// Set the memory to a known value
do_gettimeofday( &t1 );
memset( reservedBlock, 0xAB, RESERVED_REGION_SIZE );
do_gettimeofday( &t2 );
printk( KERN_ERR "[memdriver] memset() took %d usec\n", usec_diff( &t2, &t1 ) );
// Register the character device
...
return 0;
}
I load the driver, and check dmesg. It reports:
[memdriver] init
[memdriver] ioremap() took 76268 usec
[memdriver] memset() took 12622779 usec
That's 12.6 seconds for the memset. That means the memset is running at 81 MB/sec. Why on earth is it so slow?
This is kernel 2.6.34 on Fedora 13, and it's an x86_64 system.
EDIT:
The goal behind this scheme is to take a chunk of physical memory and make it available to both a PCI device (via the memory's bus/physical address) and a user space application (via a call to mmap, supported by the driver). The PCI device will then continually fill this memory with data, and the user-space app will read it out. If ioremap is a bad way to do this (as Ben suggested below), I'm open to other suggestions that'll allow me to get any large chunk of memory that can be directly accessed by both hardware and software. I can probably make do with a smaller buffer also.
See my eventual solution below.
ioremap allocates uncacheable pages, as you'd desire for access to a memory-mapped-io device. That would explain your poor performance.
You probably want kmalloc or vmalloc. The usual reference materials will explain the capabilities of each.
I don't think ioremap() is what you want there. You should only access the result (what you call reservedBlock) with readb, readl, writeb, memcpy_toio etc. It is not even guaranteed that the return is virtually mapped (although it apparently is on your platform). I'd guess that the region is being mapped uncached (suitable for IO registers) leading to the terrible performance.
It's been a while, but I'm updating since I did eventually find a workaround for this ioremap problem.
Since we had custom hardware writing directly to the memory, it was probably more correct to mark it uncacheable, but it was unbearably slow and wasn't working for our application. Our solution was to only read from that memory (a ring buffer) once there was enough new data to fill a whole cache line on our architecture (I think that was 256 bytes). This guaranteed we never got stale data, and it was plenty fast.
I have tried out doing a huge memory chunk reservations with the memmap
The ioremapping of this chunk gave me a mapped memory address space which in beyond few tera bytes.
when you ask to reserve 128GB memory starting at 64 GB. you see the following in /proc/vmallocinfo
0xffffc9001f3a8000-0xffffc9201f3a9000 137438957568 0xffffffffa00831c9 phys=1000000000 ioremap
Thus the address space starts at 0xffffc9001f3a8000 (which is waay too large).
Secondly, Your observation is correct. even the memset_io results in a extremely large delays (in tens of minutes) to touch all this memory.
So, the time taken has to do mainly with address space conversion and non cacheable page loading.

Resources