Using inactive memory to my advantage. What is this code storing in RAM or inactive memory?

I'm developing on OS X 10.8.3. The following code is simple. It can perform two operations. If the read call is uncommented, the program opens the file at "address" and transfers all of its contents into data. If instead the memcpy call is uncommented, the program copies the mmapped contents into data. I am developing on a Mac, which caches commonly used files in the inactive memory of RAM for faster future access. I have turned off caching in both the file control and the mmap because I am working with large files of 1 GB or greater. If I did not set the NOCACHE option, the entire 1 GB would be stored in inactive memory.
If the read call is uncommented, the program behaves as expected. Nothing is cached, and every time the program is run it takes about 20 seconds to read the entire 1 GB.
But if instead the memcpy call is uncommented, something changes. I still see no increase in memory, and the first run still takes about 20 seconds to copy. But every execution after that copies in under a second. This is very analogous to the behavior of caching the entire file in inactive memory, but I never see an increase in memory. Even if I then do not mmap the file and only perform a read, it completes in the same time, under a second.
Something must be getting stored in inactive memory, but what and how do I track it? I would like to find what is being stored and use it to my advantage.
I am using Activity Monitor to see a general memory size. I am using Xcode Instruments to compare the initial memcpy execution to an execution where both read and memcpy are commented out. I see no difference in the Allocations, File Activity, Reads/Writes, VM Tracker, or Shared Memory tools.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main(int argc, const char * argv[])
{
    unsigned char *data;
    unsigned char *mmapdata;
    size_t length;

    int file = open("address", O_RDONLY);
    fcntl(file, F_NOCACHE, 1);          /* ask the kernel not to cache this file */

    struct stat st;
    stat("address", &st);
    length = st.st_size;

    data = malloc(length);
    memset(data, 0, length);

    mmapdata = mmap(NULL, length, PROT_READ, MAP_SHARED | MAP_NOCACHE, file, 0);
    if (mmapdata == MAP_FAILED)
        fprintf(stderr, "failure");

    // read(file, data, length);        /* variant 1: plain read into the buffer */
    close(file);
    // memcpy(data, mmapdata, length);  /* variant 2: copy out of the mapping */
    munmap(mmapdata, length);
    free(data);
    return 0;
}
UPDATE:
Sorry if I was unclear. During program execution, the active memory portion of my RAM increases according to the data I malloc and the size of the mmapped file. This is surely where the pages are residing. After cleanup, the amount of available memory returns to what it was before. Inactive memory never increases. It makes sense that the OS wouldn't really throw that active memory away, since free memory is useless, but this process is not identical to caching, for the following reason. I've tested two scenarios. In both I load a number of files whose sizes total more than my available RAM. In one scenario I cache the files and in the other I do not. With caching, my inactive memory increases, and once I fill my RAM everything slows down tremendously. Loading a new file will replace another file's allocated inactive memory, but this takes exceptionally longer than in the next scenario. The next scenario is with caching off. I again run the program several times, loading enough files to fill my RAM, but inactive memory never increases and active memory always returns to normal, so it appears I've done nothing. The files I've mmapped still load fast, just as before, and mmapping new files takes a normal amount of time, replacing other files. My system never slows down with this method. Why is the second scenario faster?

How could the OS possibly make a memcpy on an mmapped file work if the file's pages weren't resident in memory? The OS takes your hint that you don't want the data cached, but it still will if it has no choice or if it has nothing better to do with the memory.
Your pages have the lowest priority, because the OS believes you when you say you won't access them again. But they had to be resident for the memcpy to work, and the OS won't throw them away just to have free memory (which is 100% useless). Inactive memory is better than free memory because there's at least some chance it might save I/O operations.
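If you want to watch this directly instead of inferring it from Activity Monitor, mincore() reports per-page residency for a mapping. Here is a minimal sketch, assuming the same "address" file as in the question (error handling omitted); run it right after the memcpy experiment and the resident count should reflect the pages the previous run left behind:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main(void)
{
    int file = open("address", O_RDONLY);
    struct stat st;
    fstat(file, &st);

    unsigned char *map = mmap(NULL, st.st_size, PROT_READ,
                              MAP_SHARED | MAP_NOCACHE, file, 0);

    /* one status byte per page; bit 0 set means the page is resident in RAM */
    size_t pagesz = (size_t)getpagesize();
    size_t pages = ((size_t)st.st_size + pagesz - 1) / pagesz;
    char *vec = malloc(pages);
    mincore(map, st.st_size, vec);

    size_t resident = 0;
    for (size_t i = 0; i < pages; i++)
        if (vec[i] & 1)
            resident++;
    printf("%zu of %zu pages resident\n", resident, pages);

    munmap(map, st.st_size);
    free(vec);
    close(file);
    return 0;
}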

Related

Where does the increment value for sbrk come from when using newlib nano malloc?

I have a target (STM32F030R8) that I am using with FreeRTOS and the newlib reentrant heap implementation (http://www.nadler.com/embedded/newlibAndFreeRTOS.html). This shim defines sbrk in addition to implementing the actual FreeRTOS memory allocation shim. The project is built with GNU ARM GCC, using --specs=nosys.specs.
For the following example, FreeRTOS has not been started yet. malloc has not been called before. The mcu is fresh off boot and has only just copied initialized data from flash into ram and configured clocks and basic peripherals.
Simply having the lib in my project, I am seeing that sbrk is being called with very large increment values for seemingly small malloc calls.
The target has 8K of memory, of which I have 0x12b8 (~4 KB) bytes between the start of the heap and the end of RAM (the top of the stack).
I am seeing that if I allocate 1000 bytes with str = (char*) malloc(1000);, sbrk gets called twice: first with an increment value of 0x07e8 and then again with an increment value of 0x0c60, the result being a total increment of 0x1448 (5192 bytes!), which of course overflows not just the stack but all available RAM.
What on earth is going on here? Why are these huge increment values being used by malloc for such a relatively small desired buffer allocation?
I think it may not be possible to answer this definitively; I can only advise on debugging. The simplest approach is to step through the allocation code to determine where and why the allocation size request is being corrupted (as it appears to be). You will of course need to build the library from source, or at least include mallocr.c in your code to override any static library implementation.
In Newlib Nano the call-stack to _sbrk_r is rather simple (compared to regular Newlib). The increment is determined from the allocation size s in nano_malloc() in nano-mallocr.c as follows:
malloc_size_t alloc_size;
alloc_size = ALIGN_TO(s, CHUNK_ALIGN); /* size of aligned data load */
alloc_size += MALLOC_PADDING; /* padding */
alloc_size += CHUNK_OFFSET; /* size of chunk head */
alloc_size = MAX(alloc_size, MALLOC_MINCHUNK);
Then, if a chunk of alloc_size is not found in the free-list (which is always the case when free() has not been called), sbrk_aligned( alloc_size ) is called, which calls _sbrk_r(). None of the padding, alignment, or minimum-allocation macros adds such a large amount.
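As a sanity check, the same computation can be reproduced standalone. The macro values below are my assumptions for a typical 32-bit ARM build (verify against your copy of nano-mallocr.c); for s = 1000 the result is nowhere near the 0x1448 you observed:
/* Hypothetical standalone re-run of nano_malloc()'s size computation,
 * using assumed macro values for a 32-bit ARM build of newlib-nano. */
#include <stdio.h>

#define ALIGN_TO(x, a)   (((x) + (a) - 1) & ~((a) - 1))
#define CHUNK_ALIGN      8    /* assumption */
#define MALLOC_PADDING   0    /* assumption */
#define CHUNK_OFFSET     8    /* assumption: size of the chunk header */
#define MALLOC_MINCHUNK  16   /* assumption */
#define MAX(a, b)        ((a) > (b) ? (a) : (b))

int main(void)
{
    size_t s = 1000;
    size_t alloc_size = ALIGN_TO(s, CHUNK_ALIGN);   /* 1000 -> 1000 */
    alloc_size += MALLOC_PADDING;                   /* 1000 */
    alloc_size += CHUNK_OFFSET;                     /* 1008 */
    alloc_size = MAX(alloc_size, MALLOC_MINCHUNK);  /* 1008 = 0x3f0 */
    printf("alloc_size = 0x%zx\n", alloc_size);     /* nowhere near 0x1448 */
    return 0;
}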
sbrk_aligned( alloc_size ) calls _sbrk_r again if the request is not aligned. The second call should never be larger than CHUNK_ALIGN - sizeof(void*).
In your debugger you should be able to inspect the call stack, or step through the call to see where the parameter becomes incorrect.

Process memory mapping in C++

#include <iostream>

int main(int argc, char** argv) {
    int* heap_var = new int[1];
    /*
     * Page size 4KB == 4*1024 == 4096
     */
    heap_var[1025] = 1;
    std::cout << heap_var[1025] << std::endl;
    return 0;
}
// Output: 1
In the above code, I allocated 4 bytes of space on the heap. Now, as the OS maps virtual memory to system memory in pages (which are 4 KB each), a 4 KB block of my virtual memory's heap would get mapped to system memory. For testing, I decided to try accessing other addresses in my allocated page/heap block, and it worked. However, I shouldn't have been allowed to access more than 4096 bytes from the start (and index 1025 of an int array is 4*1025 bytes in).
I'm confused why I am able to access 4*1025 bytes (more than the size of the page that has been allocated) from the start of the heap block and not get a segfault.
Thanks.
The platform allocator likely allocated far more than the page size, since it is planning to use that memory "bucket" for other allocations, or it is keeping some internal state there; in release builds there is likely far more than just a page-sized virtual memory chunk there. You also don't know where within that particular page the memory has been allocated (you can find out by masking some bits), and without the platform/arch being mentioned (I'm assuming x86_64) there is no telling that this page is even 4 KB; it could be a 2 MB "huge" page or something alike.
But by accessing outside the array bounds you're triggering undefined behavior: crashes in the case of reads, or data corruption in the case of writes.
Don't use memory that you don't own.
I should also mention that this is likely unrelated to C++, since the new[] operator usually just invokes malloc/calloc behind the scenes in the core platform library (be that libSystem on OS X, or glibc or musl or whatever else on Linux, or even an intercepting allocator). The segfaults you experience usually come from guard pages around heap blocks or, in the absence of guard pages, from simply hitting unmapped memory.
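To make the guard-page point concrete, here is a minimal POSIX sketch (my own construction, not what any particular allocator does; error handling omitted) that builds a guard page by hand. The first write succeeds; the second faults deterministically:
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    /* two anonymous pages: the first usable, the second a guard page */
    unsigned char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    mprotect(p + page, page, PROT_NONE);   /* revoke all access to page 2 */

    p[page - 1] = 1;                       /* fine: last byte of page 1 */
    printf("within bounds ok\n");

    p[page] = 1;                           /* SIGSEGV: first byte of the guard page */
    return 0;
}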
NB: Don't try this at home: There are cases where you may intentionally trigger what would be considered undefined behavior in general, but on that specific platform you may know exactly what's there (a good example is abusing pthread_t opaque on Linux to get tid without an overhead of an extra syscall, but you have to make sure you're using the right libc, the right build type of that libc, the right version of that libc, the right compiler that it was built with etc).

ATmega32 SRAM and EEPROM difference

So from what I have read, SRAM is volatile and EEPROM is non-volatile. If SRAM is volatile, how come I sometimes get values (random and garbage, but still values) when I use *ptr?
For example, for ptr = &x, *ptr might give me a value. Shouldn't I get NULL, because SRAM is volatile and is wiped out every time the power is off?
Volatile, in terms of memory, means that values won't be preserved after a power cycle. Given the nature of RAM, it may contain any garbage value at power-up. There is nothing in the hardware that initializes RAM to zero.
So you will have to initialize RAM to zero manually if this is needed.
The C standard actually mandates that such initialization is done on all variables with static storage duration, but only those. That "zero-out" initialization is carried out by some startup code before main() is executed. But local C variables never get initialized automatically.
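A small illustration of that rule (plain C, nothing target-specific assumed):
#include <stdio.h>

int zeroed_by_startup;           /* static storage duration: guaranteed 0 */

int main(void)
{
    static int also_zeroed;      /* static storage duration: guaranteed 0 */
    int garbage;                 /* automatic: indeterminate until assigned */

    printf("%d %d\n", zeroed_by_startup, also_zeroed);   /* always "0 0" */
    /* printf("%d\n", garbage);     reading this is undefined behavior */
    (void)garbage;
    return 0;
}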
Please note that the volatile keyword in C has little to do with volatile memories. Don't confuse those two different terms.
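For instance, volatile in C is about the compiler, not the memory technology. A sketch with a hypothetical register address:
#include <stdint.h>

/* The C keyword 'volatile' tells the compiler that reads/writes may have
   side effects and must not be optimized away; it says nothing about
   whether the underlying memory survives a power cycle. The address
   below is hypothetical. */
#define STATUS_REG (*(volatile uint8_t *)0x0025u)

int main(void)
{
    while (STATUS_REG == 0) {
        /* without 'volatile', the compiler could hoist the read out of
           the loop and spin forever on a stale value */
    }
    return 0;
}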
No, you are mixing contexts. One thing is memory volatility, which concerns the memory's physical construction. The other is your code reading a random memory address.
Sometimes hardware can wipe SRAM on power-up, sometimes not; you cannot count on it.
If you read an unoccupied address of the RAM in your code, you will read garbage: either bits generated by the power-up process, or old data that was discarded and is no longer used within the same power cycle.

How do non temporal instructions work?

I'm reading the What Every Programmer Should Know About Memory PDF by Ulrich Drepper. At the beginning of part 6 there's a code fragment:
#include <emmintrin.h>

void setbytes(char *p, int c)
{
    __m128i i = _mm_set_epi8(c, c, c, c,
                             c, c, c, c,
                             c, c, c, c,
                             c, c, c, c);
    _mm_stream_si128((__m128i *)&p[0],  i);
    _mm_stream_si128((__m128i *)&p[16], i);
    _mm_stream_si128((__m128i *)&p[32], i);
    _mm_stream_si128((__m128i *)&p[48], i);
}
With such a comment right below it:
Assuming the pointer p is appropriately aligned, a call to this
function will set all bytes of the addressed cache line to c. The
write-combining logic will see the four generated movntdq instructions
and only issue the write command for the memory once the last
instruction has been executed. To summarize, this code sequence not
only avoids reading the cache line before it is written, it also
avoids polluting the cache with data which might not be needed soon.
What bugs me is that the comment on the function says it "will set all bytes of the addressed cache line to c", but from what I understand of stream intrinsics, they bypass the caches: there is neither cache reading nor cache writing. How would this code access any cache line? The second bolded fragment says something similar, that the function "avoids reading the cache line before it is written". As stated above, I don't see how and when the caches are written to. Also, does every write to memory need to be preceded by a cache read? Could someone clarify this issue for me?
When you write to memory, the cache line where you write must first be loaded into the caches in case you only write the cache line partially.
When you write to memory, stores are grouped into store buffers. Typically, once a buffer is full, it is flushed to the caches/memory. Note that the number of store buffers is typically small (~4). Consecutive writes to adjacent addresses will use the same store buffer.
Streaming reads/writes with non-temporal hints are typically used to reduce cache pollution (often with WC memory). The idea is that a small set of cache lines is reserved on the CPU for these instructions to use. Instead of loading a cache line into the main caches, it is loaded into this smaller cache.
The comment supposes the following behavior (but I cannot find any references that the hardware actually does this; one would need to measure it or find a solid source, and it could vary from hardware to hardware):
- Once the CPU sees that the store buffer is full and that it is aligned to a cache line, it will flush it directly to memory since the non-temporal write bypasses the main cache.
The only way this would work is if the merging of the store buffer with the actual cache line written happens once it is flushed. This is a fair assumption.
Note that if the cache line written is already in the main caches, the above method will also update them.
If regular memory writes were used instead of non-temporal writes, the store buffer flushing would also update the main caches. It is entirely possible that this scenario would also avoid reading the original cache line in memory.
If a partial cache line is written with a non-temporal write, presumably the cache line will need to be fetched from main memory (or the main cache if present), which could be terribly slow if we have not read the cache line ahead of time with a regular read or a non-temporal read (which would place it into the separate cache).
Typically the non-temporal cache size is on the order of 4-8 cache lines.
To summarize, the last instruction kicks off the write because it also happens to fill up the store buffer. The store buffer flush can avoid reading the cache line written to, because the hardware knows the store buffer is contiguous and aligned to a cache line. The non-temporal hint only serves to avoid populating the main cache with the written cache line, if and only if it wasn't already in the main caches.
I think this is partly a terminology question: The passage you quote from Ulrich Drepper's article isn't talking about cached data. It's just using the term "cache line" for an aligned 64B block.
This is normal, and especially useful when talking about a range of hardware with different cache-line sizes. (Earlier x86 CPUs, as recently as PIII, had 32B cache lines, so using this terminology avoids hard-coding that microarch design decision into the discussion.)
A cache-line of data is still a cache-line even if it's not currently hot in any caches.
I don't have references at my fingertips to prove what I am saying, but my understanding is this: the only unit of transfer over the memory bus is the cache line, whether it goes into the cache or into some special registers. So indeed, the code you pasted fills a cache line, but it is a special cache line that does not reside in the cache. Once all bytes of this cache line have been modified, it is sent directly to memory, without passing through the cache.
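For completeness, here is a minimal caller for the quoted fragment. The 64-byte aligned_alloc and the trailing _mm_sfence() are my additions, not part of Drepper's example (non-temporal stores are weakly ordered, so a fence is needed before other code relies on the data being visible):
#include <emmintrin.h>
#include <stdio.h>
#include <stdlib.h>

void setbytes(char *p, int c);   /* the fragment quoted above */

int main(void)
{
    /* 64-byte aligned buffer, so the four movntdq stores cover exactly
       one cache line on current x86 CPUs */
    char *buf = aligned_alloc(64, 64);
    setbytes(buf, 0xAB);

    /* drain the write-combining buffer before reading the data back */
    _mm_sfence();

    printf("%02x\n", (unsigned char)buf[63]);
    free(buf);
    return 0;
}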

Memory access after ioremap very slow

I'm working on a Linux kernel driver that makes a chunk of physical memory available to user space. I have a working version of the driver, but it's currently very slow. So, I've gone back a few steps and tried making a small, simple driver to recreate the problem.
I reserve the memory at boot time using the kernel parameter memmap=2G$1G. Then, in the driver's __init function, I ioremap some of this memory, and initialize it to a known value. I put in some code to measure the timing as well:
#define RESERVED_REGION_SIZE   (1 * 1024 * 1024 * 1024)   // 1GB
#define RESERVED_REGION_OFFSET (1 * 1024 * 1024 * 1024)   // 1GB

static int __init memdrv_init(void)
{
    struct timeval t1, t2;

    printk(KERN_INFO "[memdriver] init\n");

    // Remap reserved physical memory (that we grabbed at boot time)
    do_gettimeofday( &t1 );
    reservedBlock = ioremap( RESERVED_REGION_OFFSET, RESERVED_REGION_SIZE );
    do_gettimeofday( &t2 );
    printk( KERN_ERR "[memdriver] ioremap() took %d usec\n", usec_diff( &t2, &t1 ) );

    // Set the memory to a known value
    do_gettimeofday( &t1 );
    memset( reservedBlock, 0xAB, RESERVED_REGION_SIZE );
    do_gettimeofday( &t2 );
    printk( KERN_ERR "[memdriver] memset() took %d usec\n", usec_diff( &t2, &t1 ) );

    // Register the character device
    ...

    return 0;
}
I load the driver, and check dmesg. It reports:
[memdriver] init
[memdriver] ioremap() took 76268 usec
[memdriver] memset() took 12622779 usec
That's 12.6 seconds for the memset. That means the memset is running at 81 MB/sec. Why on earth is it so slow?
This is kernel 2.6.34 on Fedora 13, and it's an x86_64 system.
EDIT:
The goal behind this scheme is to take a chunk of physical memory and make it available to both a PCI device (via the memory's bus/physical address) and a user space application (via a call to mmap, supported by the driver). The PCI device will then continually fill this memory with data, and the user-space app will read it out. If ioremap is a bad way to do this (as Ben suggested below), I'm open to other suggestions that'll allow me to get any large chunk of memory that can be directly accessed by both hardware and software. I can probably make do with a smaller buffer also.
See my eventual solution below.
ioremap maps the region with uncacheable pages, as you'd desire for access to a memory-mapped I/O device. That would explain your poor performance.
You probably want kmalloc or vmalloc. The usual reference materials will explain the capabilities of each.
I don't think ioremap() is what you want there. You should only access the result (what you call reservedBlock) with readb, readl, writeb, memcpy_toio, etc. It is not even guaranteed that the returned region is virtually mapped (although it apparently is on your platform). I'd guess that the region is being mapped uncached (suitable for I/O registers), leading to the terrible performance.
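A minimal sketch of that advice applied to the init function above; memset_io is the io-accessor counterpart of memset. Note the mapping is still uncached, so this is about correctness more than speed:
#include <linux/init.h>
#include <linux/io.h>        /* ioremap, memset_io */

static void __iomem *reservedBlock;

static int __init memdrv_init(void)
{
    reservedBlock = ioremap(RESERVED_REGION_OFFSET, RESERVED_REGION_SIZE);
    if (!reservedBlock)
        return -ENOMEM;

    /* io accessor instead of plain memset on an __iomem mapping */
    memset_io(reservedBlock, 0xAB, RESERVED_REGION_SIZE);
    return 0;
}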
It's been a while, but I'm updating since I did eventually find a workaround for this ioremap problem.
Since we had custom hardware writing directly to the memory, it was probably more correct to mark it uncacheable, but it was unbearably slow and wasn't working for our application. Our solution was to read from that memory (a ring buffer) only once there was enough new data to fill a whole cache line on our architecture (I think that was 256 bytes). This guaranteed we never got stale data, and it was plenty fast.
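Roughly, the consumer side of that workaround looks like the sketch below. Everything here is hypothetical (CACHE_LINE, the ring layout, and how the hardware publishes its write offset all depend on the device); the point is only ever reading in whole cache-line units:
#define CACHE_LINE 256   /* assumption, from the answer above */

/* Drain an uncached ring buffer in whole cache-line chunks, so every
   chunk handed to the consumer is fully written, never stale. */
static size_t read_pos;   /* consumer offset into the ring */

void drain_ring(const volatile unsigned char *ring, size_t ring_size,
                size_t hw_write_pos,   /* producer offset published by the device */
                void (*consume)(const unsigned char *chunk))
{
    /* bytes available, accounting for wrap-around */
    size_t avail = (hw_write_pos + ring_size - read_pos) % ring_size;

    while (avail >= CACHE_LINE) {
        consume((const unsigned char *)&ring[read_pos]);
        read_pos = (read_pos + CACHE_LINE) % ring_size;
        avail -= CACHE_LINE;
    }
}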
I have tried doing huge memory chunk reservations with memmap.
The ioremapping of this chunk gave me a mapped memory address space beyond a few terabytes.
When you ask to reserve 128 GB of memory starting at 64 GB, you see the following in /proc/vmallocinfo:
0xffffc9001f3a8000-0xffffc9201f3a9000 137438957568 0xffffffffa00831c9 phys=1000000000 ioremap
Thus the address space starts at 0xffffc9001f3a8000 (which is way too large).
Secondly, your observation is correct: even memset_io results in extremely large delays (tens of minutes) to touch all this memory.
So the time taken has to do mainly with address space conversion and non-cacheable page loading.
