Matlab: Free memory is lost after calling a function

I am having some trouble with memory management in Matlab. Eventually it leads to not enough free memory and an error. I tried to pinpoint the problem and found one interesting "feature": somehow I lose free memory in Matlab.
I do the following:
1) Start Matlab
2) Typing "memory" I get: Maximum possible array: 1293 MB, Memory available for all arrays: 1456 MB
3) I call a function. The function is rather long, so it's hard to paste it here, but basically it loads five ca. 300 MB MAT files (sequentially), picks out a few values and returns them. The returned matrix is ca. 1.2 MB (4650x35 double).
4) I clear all variables in the workspace ("clear all")
5) Typing "memory" I get: Maximum possible array: 759 MB, Memory available for all arrays: 1029 MB
If I repeat steps 3) to 5) the memory numbers are constant.
So what is wrong here? Where do I lose the 400 MB of free space? The memory used by Matlab stays constant at around 330 MB.
Does anyone have any idea what is wrong here? Or is this something completely natural that I am just missing?
Thanks
Thomas
PS: I use Matlab 2010a and Win 7 Pro 32-bit.

A good part of this "lost" memory is probably due to memory fragmentation. As Matlab allocates and frees arrays over the course of a session, the memory gets broken up into smaller areas, and some is lost to overhead in the memory manager, at both the Matlab and the underlying C levels. The overhead is not counted as "used" by Matlab because it's not being used to hold M-code array values. Some memory may also be consumed by Matlab loading additional M-files and libraries, allocating internal buffers or structures, or by expansion of the Java heap in Matlab's embedded JVM. This is normal. After doing some work, Matlab won't have as much memory available as it did in a fresh session.
AFAIK, once low-level fragmentation occurs, there's nothing you can do to eliminate it aside from restarting Matlab. Allocating lots of small arrays can accelerate fragmentation. This sometimes happens if you use large cellstrs or large arrays of objects. So if you are having problems, you may need to reduce your peak memory usage in the function by breaking the work into smaller chunks, reducing cell usage, and so on. And if you have big cellstr arrays in the MAT files, convert them to char. The "high water mark" of allocation is what governs fragmentation, so if you can break your data set into smaller chunks, you can fit it in less memory.
Inside your function, clear as much as you can from one MAT file before moving on to the next. One way to do this implicitly is to move the per-file processing into a subfunction if it's currently sitting in a loop in your main function.
To help debug, do a "dbstop if all error", which will get triggered by the OOM. From there, you can use whos and the debugger to find out where the space is being taken up when you exhaust memory. That might reveal temp variables that need to be cleaned up, or suggest ways of chunking the work.
If you'd like to experiment to see what fragmentation looks like and how it affects memory()'s output, here's a function that will just create some fragmentation.
function fragmem(nbytes, chunksize)
%FRAGMEM Fragment the Matlab session's memory
if nargin < 2; chunksize = 1*2^10; end
nbytes = nbytes - rem(nbytes, chunksize);
nsteps = 100; % to make initial input relatively small
c = cell([1 nsteps]);
stepsize = nbytes / nsteps;
chunksperstep = ceil(stepsize / chunksize);
fprintf('Fragmenting %d MB memory into %d KB chunks (%d steps of %d chunks)\n',...
    round(nbytes/2^20), round(chunksize/2^10), nsteps, chunksperstep);
x = zeros([1 chunksperstep * chunksize], 'uint8');
colsizes = repmat(chunksize, [1 chunksperstep]);
for i = 1:nsteps
    c{i} = mat2cell(x, 1, colsizes);
end
Fragmenting 300 MB into 1 KB chunks on my win32 machine reproduces a "loss" of about the size you're seeing.
>> memory
Maximum possible array: 1384 MB (1.451e+009 bytes) *
Memory available for all arrays: 1552 MB (1.627e+009 bytes) **
Memory used by MATLAB: 235 MB (2.463e+008 bytes)
Physical Memory (RAM): 3311 MB (3.472e+009 bytes)
>> fragmem(300*2^20)
Fragmenting 300 MB memory into 1 KB chunks (100 steps of 3072 chunks)
>> memory
Maximum possible array: 1009 MB (1.059e+009 bytes) *
Memory available for all arrays: 1175 MB (1.232e+009 bytes) **
Memory used by MATLAB: 257 MB (2.691e+008 bytes)
Physical Memory (RAM): 3311 MB (3.472e+009 bytes)
>>

Related

Why CPU accesses aligned memory

Good people of the Internet!
For the past couple of days I've been reading about how the CPU accesses memory and how it can be slower than desired if the accessed object is spread over different chunks that the CPU accesses.
In very generalized and abstract terms: if I, say, have an address space from 0x0 to 0xF with a cell of one byte, and the CPU reads memory in chunks of 4 bytes (that is, has a four-byte memory access granularity), then, if I need to read an object of 4 bytes residing in cells 0x0 - 0x3, the CPU can do it in one operation, while if the same object occupies cells 0x1 - 0x4, the CPU needs to perform two read operations (read memory at 0x0 - 0x3 first, then at 0x4 - 0x7), shift bytes and combine the two parts (or fault, if it cannot do unaligned access). This happens, once again, because the CPU can only read memory in 4-byte chunks (in our abstract case). Let's also assume that the CPU makes these reads inside one cache line and there is no need to change the contents of the cache between reads.
So, in this case, the beginning of each chunk the CPU can read resides at a memory address that is a multiple of 4 (right?). OK, I don't have any questions about why the CPU reads in chunks, but why exactly is the beginning of each chunk aligned this way? Referring to the example in the previous paragraph, why exactly can't the CPU read a chunk of 4 bytes starting from 0x1?
As far as I understand, the CPU is perfectly aware that 0x1 exists. So is all the fuss happening because the memory controller cannot access a chunk of memory starting at 0x1? Or is it because a couple of LSBs in a processor word are reserved on some architectures? Or is the fact that they are reserved a consequence of aligned access, and not its cause (that seems like a second question already, but I'll leave it in, since while writing this I have a feeling the two are related)?
There are a bunch of answers here touching on this topic (like this and this) and articles online (like this and this), but all of these resources give good explanations of the phenomenon itself and its consequences, and no explanation of why exactly the CPU cannot read a chunk of memory starting "in between" byte boundaries (or maybe I just couldn't see it).
Consider a simple CPU. It has 32 RAM chips. Each chip supplies one bit of memory. The CPU produces one address, passes it to the 32 RAM chips, and 32 bits come back. The first RAM chip holds bit 0 of bytes 0, 4, 8, 12, 16 etc. The second RAM chip holds bit 1 of bytes 0, 4, 8, 12, 16 etc. The ninth RAM chip holds bit 0 of bytes 1, 5, 9, 13, 17 etc.
So you see that the 32 RAM chips between them can produce bits 0 to 7 of bytes 0 to 3, or bytes 4 to 7, or bytes 8 to 11 etc. They are incapable of producing bytes 1 to 4.
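To see the same idea from the software side, here is a small C sketch. It is purely illustrative: it models the question's abstract example (16 byte-addressable cells, 4-byte access granularity, little-endian byte order assumed) and reads a 4-byte value starting at 0x1 by doing the two aligned reads plus the shift-and-combine step described above.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* The question's abstract memory: 16 one-byte cells, addresses 0x0..0xF. */
static uint8_t ram[16] = {0x00, 0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77,
                          0x88, 0x99, 0xAA, 0xBB, 0xCC, 0xDD, 0xEE, 0xFF};

/* Read a 32-bit value starting at 'addr' using only aligned 4-byte reads:
   fetch the aligned word(s) the value touches, then shift and combine.
   (Little-endian; the demo only handles addr <= 11 so the second fetch
   stays inside the array.) */
static uint32_t read32_via_aligned(unsigned addr)
{
    unsigned base  = addr & ~3u;        /* round down to a multiple of 4 */
    unsigned shift = (addr & 3u) * 8;   /* byte offset within that word  */

    uint32_t lo;
    memcpy(&lo, &ram[base], 4);         /* first aligned read            */
    if (shift == 0)
        return lo;                      /* aligned: one read is enough   */

    uint32_t hi;
    memcpy(&hi, &ram[base + 4], 4);     /* second aligned read           */
    return (lo >> shift) | (hi << (32 - shift));
}

int main(void)
{
    printf("read at 0x0: %08x (one aligned access)\n",   read32_via_aligned(0));
    printf("read at 0x1: %08x (two aligned accesses)\n", read32_via_aligned(1));
    return 0;
}
The unaligned read ends up doing exactly what the question describes: two aligned fetches, a shift, and a combine.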

kbmmemtable EOutOfMemory error after LoadFromDataset

I am using Delphi 7 Enterprise under Windows 7 64 bit.
My computer has 16 GB of RAM.
I try to use kbmMemTable 7.70.00 Professional Edition (http://news.components4developers.com/products_kbmMemTable.html) .
My table has 150,000 records, but when I try to copy the data from the Dataset to the kbmMemTable it only copies 29,000 records and I get this error: EOutOfMemory
I saw this message:
https://groups.yahoo.com/neo/groups/memtable/conversations/topics/5769,
but it didn't solve my problem.
An out of memory error can happen for various reasons:
Your application uses too much memory in general. A 32-bit application typically runs out of memory when it has allocated around 1.4 GB using the FastMM memory manager. Other memory managers may have better or worse limits.
Memory fragmentation. There may not be enough contiguous space in memory for a single large allocation that is requested. kbmMemTable will attempt to allocate roughly 200000 x 4 bytes as its own largest single allocation; that shouldn't be a problem.
Too many small allocations leading to the memory fragmentation above. kbmMemTable will allocate from 1 to n blocks of memory per record, depending on the setting of the Performance property.
If Performance is set to fast, then 1 block will be allocated (unless blob fields exist, in which case an additional allocation will be made per non-null blob field).
If Performance is balanced or small, then each string field will allocate another block of memory per record.
best regards
Kim/C4D

CUDA Local memory register spilling overhead

I have a kernel which uses a lot of registers and spills them into local memory heavily.
4688 bytes stack frame, 4688 bytes spill stores, 11068 bytes spill loads
ptxas info : Used 255 registers, 348 bytes cmem[0], 56 bytes cmem[2]
Since the spillage seems quite high, I believe it spills past the L1 or even the L2 cache. Since local memory is private to each thread, how are accesses to local memory coalesced by the compiler? Is this memory read in 128-byte transactions like global memory? With this amount of spillage I am getting low memory bandwidth utilisation (50%). I have similar kernels without the spillage that obtain up to 80% of the peak memory bandwidth.
EDIT
I've extracted some more metrics with the nvprof tool. If I understand the technique mentioned here correctly, then I have a significant amount of memory traffic due to register spilling (4 * L1 local load hits and misses / sum of the read sector queries across the 4 L2 sub-partitions = (4 * (45936 + 4278911)) / (5425005 + 5430832 + 5442361 + 5429185) = 79.6%). Could somebody verify whether I am right here?
Invocations Event Name Min Max Avg
Device "Tesla K40c (0)"
Kernel: mulgg(double const *, double*, int, int, int)
30 l2_subp0_total_read_sector_queries 5419871 5429821 5425005
30 l2_subp1_total_read_sector_queries 5426715 5435344 5430832
30 l2_subp2_total_read_sector_queries 5438339 5446012 5442361
30 l2_subp3_total_read_sector_queries 5425556 5434009 5429185
30 l2_subp0_total_write_sector_queries 2748989 2749159 2749093
30 l2_subp1_total_write_sector_queries 2748424 2748562 2748487
30 l2_subp2_total_write_sector_queries 2750131 2750287 2750205
30 l2_subp3_total_write_sector_queries 2749187 2749389 2749278
30 l1_local_load_hit 45718 46097 45936
30 l1_local_load_miss 4278748 4279071 4278911
30 l1_local_store_hit 0 1 0
30 l1_local_store_miss 1830664 1830664 1830664
EDIT
I've realised that the transactions are 128 bytes, not 128 bits as I was thinking.
According to Local Memory and Register Spilling, the impact of register spills on performance entails more than just the coalescing decided at compile time; more importantly, reads and writes to and from the L2 cache are already quite expensive and you want to avoid them.
The presentation suggests that, using a profiler, you can count at run time the number of L2 queries due to local memory (LMEM) accesses, see whether they have a major impact on the total number of L2 queries, and then optimize the shared memory to L1 ratio in favour of the latter through a single host call, for example
cudaDeviceSetCacheConfig( cudaFuncCachePreferL1 );
Hope this helps.
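As a side note, the ratio from the question's first EDIT is easy to re-check. The C snippet below is only a reproduction of the poster's arithmetic using the average counter values from the nvprof table above; whether L1 local load hits should be counted at all is exactly what the question asks, so this verifies the numbers, not the method.
#include <stdio.h>

int main(void)
{
    /* Average nvprof counters from the question (Tesla K40c run). */
    double l1_local_load_hit  = 45936.0;
    double l1_local_load_miss = 4278911.0;
    double l2_read_sectors    = 5425005.0 + 5430832.0
                              + 5442361.0 + 5429185.0;

    /* The poster's assumption: one 128-byte L1 line = four 32-byte L2 sectors. */
    double lmem_share = 4.0 * (l1_local_load_hit + l1_local_load_miss)
                            / l2_read_sectors;

    printf("LMEM share of L2 read traffic: %.1f%%\n", lmem_share * 100.0);
    return 0;
}
This prints 79.6%, matching the figure in the question.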

Allocation of memory by zalloc

I am going through the perf source in the Linux kernel tree to find out how user-space probing is implemented. In many places I encountered this:
zalloc(sizeof(struct __event_package) * npevs);
I think it's located in the zlib library (on Fedora 18). Can anybody tell me how this zalloc helps in allocating memory?
Thanks in advance...
You can refer to this link:
The allocation is the same as any other heap allocation. In the kernel space, the heap is divided into many freelists, and each freelist has blocks of the same size connected in a linked list.
For example:
Freelist1 - 4 bytes/block x 10 blocks
Freelist2 - 8 bytes/block x 10 blocks
Freelist3 - 16 bytes/block x 10 blocks
....
Freelist10 - 1024 bytes/block x 10 blocks
Each freelist represents slabs (slab allocator) and makes use of the buddy system.
So, when you do a zalloc, it first decides which size freelist can fulfill the request and then finds a free block in it.
In some custom kernel implementations, the heap is divided amongst the kernel and other services. In that case, *alloc needs to know which heap to access to fulfill the request.
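As for what zalloc itself does in the perf tools: despite the name, it is almost certainly not zlib's; the name just reads as "zeroed alloc". The sketch below is only a guess at the common shape of such a helper, not the exact definition in the perf source, and the struct body is a placeholder so the example compiles.
#include <stdlib.h>
#include <string.h>

/* A zalloc-style helper: allocate 'size' bytes and hand them back zeroed,
   so every field of the allocated struct(s) starts out cleared.
   calloc(1, size) gives the same guarantee; this version spells out the intent. */
static void *zalloc(size_t size)
{
    void *p = malloc(size);
    if (p)
        memset(p, 0, size);
    return p;
}

/* Placeholder definition; the real struct lives in the perf source. */
struct __event_package { int placeholder; };

int main(void)
{
    int npevs = 4;

    /* Mirrors the call from the question: an array of npevs packages,
       all fields already zero, no separate initialization loop needed. */
    struct __event_package *pkgs =
        zalloc(sizeof(struct __event_package) * npevs);
    if (!pkgs)
        return 1;

    free(pkgs);
    return 0;
}
The practical benefit is simply that callers never have to remember to clear the freshly allocated structure before filling it in.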

Reading a bit from memory

I'm looking into reading single bits from memory (RAM, hard disk). My understanding was that one cannot read less than a byte.
However, I read someone saying it can be done with assembly.
I want the bandwidth usage to be as low as possible, and the data to be retrieved is not sequential, so I cannot simply read a byte and convert it to 8 bits.
I don't think the CPU will read less than the size of a cache line from RAM (64 bytes on recent Intel chips). From disk, the minimum is typically 4 kiB.
Reading a single bit at a time is neither possible nor necessary, since the data bus is much wider than that.
You cannot read less than a byte from any PC or hard disk that I know of. Even if you could, it would be extremely inefficient.
Some machines have memory-mapped port I/O that can read/write less than a byte to the port, but it still shows up as at least a byte when you get it.
Use the bitwise operators to pick off specific bits as in:
char someByte = 0x3D;      // 0011 1101 in binary
bool flag = someByte & 1;  // Get the first bit (bit 0): 1
flag = someByte & 2;       // Get the second bit (bit 1): 0
// And so on. The value after the & operator is a power of 2 if you want to isolate one bit.
// You can also pick off several bits at once:
int value = someByte & 3;  // Assume the lower 2 bits are interesting for some reason
It used to be, back in the 386/486 days, that a memory chip was one bit wide, say 1 meg by 1 bit, but you would have 8 or some multiple of chips, one for each bit lane on the bus, and you could only read in widths of the bus. Today the memories are a byte wide and you can only read in units of 32 or 64 bits or multiples of those. Even when you read a byte, most designs move the full bus width; it adds unnecessary complication and cost to isolate the bus all the way down to the memory. A byte read looks to most of the system like a 32- or 64-bit read; only as it approaches the edge of the processor (sometimes the physical pins, sometimes the edge of the core inside the chip) is the individual byte lane separated out and the other bits discarded. Having the cache in the path changes the smallest divisible read size from the memory: you will see a burst or block of reads.
It is possible to design a memory system that is 8 bits wide and reads 8 bits at a time, but why would you, unless it is an 8-bit processor, and even then you probably couldn't take advantage of an 8-bit by 2 gig memory? DRAM is pretty slow anyway, something like 133 MHz (even your 1600 MHz memory only runs in short bursts as you read from the slow parts; memory has not gotten much faster in over 10 years).
Hard disks are similar but different. Sectors are the smallest divisible unit; you have to read or write in those units. So when reading, you have a memory cycle on the processor, no different from going to memory, and depending on the controller, either before you do the read or as a result of it, a sector is read off the disk into a buffer, not unlike a cache line read. Then your memory cycle to the buffer in the disk controller either causes a bus-width read that the processor divides up, or, if the bus adds the complexity to isolate byte lanes, you isolate a byte, but nobody isolates bit lanes. (I say "nobody" and someone will come back with an exception...)
Most of this is well documented and not hard to find. For ARM platforms, look for the AMBA and/or AXI specifications, which are freely downloadable; bridge, PCIe controller, and disk controller documents are likewise available for PCs and other platforms. It still boils down to an address and data bus, or one goes-outa and one goes-inta data bus, plus some control signals that indicate the access type. Some buses have byte lane enables, which are generally for a write, not a read. If I want to write only a byte to a DRAM in a modern 64-bit system, I DO have to tell everyone almost all the way out to the DRAM what I want to write. To write a byte on a memory module which must be accessed 64 bits at a time, at a minimum a 64-bit read happens into a temporary place, either the cache or the memory controller; then the byte to be written modifies the specific byte within the 64-bit word; then that 64-bit quantity is eventually written back to the memory module itself. You can do this using a combination of the address bits and a few control signals, or you can just provide 8 byte lane enables and ignore the lower address bits. Hard disk, same deal: you have to read a sector, modify one byte, then eventually write the whole sector back. With flash and EEPROM you can only write zeros (from the programmer's perspective); you erase to ones (from the programmer's perspective; it's actually a zero in the logic, there is an inversion), and a write has to be a sector at a time; sectors can typically be 64, 128, or 256 bytes.
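The read-modify-write described above is easy to model in software. The C sketch below is just that, a model: it treats a 64-bit word as the smallest unit the "module" can transfer and updates one byte inside it the way a controller without byte lane enables would have to.
#include <stdint.h>
#include <stdio.h>

/* Model: the "memory module" can only be transferred 64 bits at a time. */
static uint64_t module_word = 0x1122334455667788ULL;

static uint64_t read64(void)        { return module_word; }
static void     write64(uint64_t v) { module_word = v;    }

/* Write one byte at byte lane 0..7 using a read-modify-write:
   read the full word, splice in the byte, write the full word back. */
static void write_byte(unsigned lane, uint8_t value)
{
    unsigned shift = lane * 8;
    uint64_t word  = read64();                   /* full-width read   */
    word &= ~((uint64_t)0xFF  << shift);         /* clear target lane */
    word |=  ((uint64_t)value << shift);         /* insert the byte   */
    write64(word);                               /* full-width write  */
}

int main(void)
{
    write_byte(2, 0xAB);   /* only byte lane 2 changes */
    printf("%016llx\n", (unsigned long long)read64());  /* 1122334455ab7788 */
    return 0;
}
A real controller does the same dance in hardware, which is why a "single byte write" still costs a full-width read and a full-width write unless byte lane enables exist.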
