Reducing memory usage with large models

I'm trying to set up a model that calculates proximity values for large environments (more than 500x500 patches) with many targets (around 100). The proximity values from a patch to all the targets are kept in a list on each patch.
The script below (which fills the lists with random numbers) has inexplicably high memory usage. 500x500x100 integer values (4 bytes each) should take up about 100 MB of space, yet when I run the script the NetLogo application's memory use grows from roughly 200 MB (at program start) to about 1500 MB.
What could I do to reduce the memory usage? Is the problem in my implementation, or in the overhead that NetLogo has for lists?
patches-own [ proximity_list ]   ; the list that holds the proximity values

to setup
  let num_targets 100        ; number of targets to calculate proximity values for
  let largest_value 100000   ; the largest value the list can contain
  ask patches [
    set proximity_list []
    let i 0
    while [ i < num_targets ] [
      set proximity_list (lput (random largest_value) proximity_list)
      set i i + 1
    ]
  ]
end
P.S. I tried the same thing with the Arrays extension; it reduces memory usage somewhat (to roughly 1300 MB) but still does not solve the problem.

Related

OpenCL kernel out of resources based on number of loop iterations INSIDE the kernel. Can the compiled kernel be too large to fit on the GPU?

TL;DR I have an OpenCL kernel that loops for a large number of iterations and calls several user-made functions. The kernel works fine for a few iterations, but increasing the number of iterations causes a CL_OUT_OF_RESOURCES (-5) error. If the same kernel is executed on a better GPU, it is able to loop for more iterations without the error. What can be causing this error based on the number of iterations? Is it possible that the loops are being unrolled and generating code larger than the GPU can hold?
I am developing an OpenCL kernel to run on a GPU that computes a very complex function. To keep things organized, I have a "kernel.cl" file with the main kernel (the __kernel void function) and an "aux_functions.cl" file with ~20 auxiliary functions (they return int, int2 or int16 and are not __kernel functions) that are called several times by the kernel and by each other.
The problem specification is roughly as follows (which justifies so many loops):
I have two arrays representing full HD images (1920x1080 integers)
For each 128x128 patch of one image, I must find the values of 4 parameters that optimize a given function (the second image is used to evaluate how good the fit is)
For the same 128x128 patch and the same 4 parameters, each 4x4 sub-patch is transformed slightly differently based on its position inside the larger 128x128 patch
And I tried to model the execution as follows:
Each workgroup will compute the kernel for one 128x128 patch (I started by processing only the first 10 patches -- 10 workgroups)
Each workgroup is composed of 256 workitems
Each workitem will test a distinct set of values (a fraction of a predefined set) for the 4 parameters, based on its IDs
The main structure of the kernel is as follows:
__kernel void funct(__global int *referenceFrameSamples, __global int *currentFrameSamples, const int frameWidth, const int frameHeight, __global int *result){
    // Initialize some variables and get global and local IDs
    for(executed X times){ // These 4 outer loops test different combinations of parameters to be applied by a function in the inner loops
        for(executed Y times){
            for(executed Z times){
                for(executed W times){
                    // Simple assignments based on the outer for loops
                    for(executed 32 times){ // Each execution of the inner loops applies a function to a 4x4 patch
                        for(executed 32 times){
                            // The relevant computation is performed here
                            // Calls a couple of lightweight functions using only private variables
                            // Calls a complex function that uses the __global int *referenceFrameSamples variable
                        }
                    }
                    // Compute something and use select() to update the best value
                }
            }
        }
    }
    // Write the best value to the __global *result buffer
}
The problem is that when the 4 outer loops run for a few iterations the kernel works fine, but if I increase the number of iterations the kernel crashes with the error ERROR! clWaitForEvents returned CL_OUT_OF_RESOURCES (-5). I am testing it on a notebook with a GeForce 940MX GPU with 2 GB, and the kernel starts crashing when X * Y * Z * W = 1024.
The clEnqueueNDRangeKernel() call returns no error; only the clWaitForEvents() called after it returns an error. I am using CLIntercept to profile the errors and the running time of the kernel. Also, when the kernel runs smoothly I can measure the execution time correctly (shown below), but when it crashes, the "measured" execution time is ridiculously wrong (billions of milliseconds) even though it crashes within the first seconds.
cl_ulong time_start;
cl_ulong time_end;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
double nanoSeconds = time_end - time_start;
printf("OpenCL execution time is: %0.3f milliseconds\n", nanoSeconds / 1000000.0);
What I tested:
Improve the complex auxiliary function that used the __global variable: instead of passing the __global pointer, I read the relevant part of the array into a private array and passed that as an argument (a sketch of this change follows the list). Outcome: improved running time in the successful cases, but it still fails in the same case
Reduce workgroups and workitems: even using 1 workgroup and 1 workitem (the absolute minimum) with the same number of iterations yields the same error. For a smaller number of iterations, running time decreases with fewer workitems/workgroups
Running the same kernel on a better GPU: after the previous two modifications (improved function and reduced workitems), I launched the kernel on a desktop equipped with a Titan V GPU with 12 GB. It is able to compute the kernel with a much larger number of iterations (I tried up to 1 million) without the CL_OUT_OF_RESOURCES error, and the running time seems to increase linearly with the iterations. Although this is the computer that will actually run the kernel over a dataset to solve my problem, it is a server that must be accessed remotely; I would prefer to do the development on my notebook and deploy the final code on the server.
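A minimal sketch of the first change in the list above, to be spliced into the kernel body (illustrative only: patchX, patchY and complexAuxFunction are hypothetical names, not taken from the original code):
// Copy the 4x4 block this work-item needs from __global memory into a private
// buffer once, then pass the private buffer to the auxiliary function instead
// of the __global pointer.
int privPatch[4 * 4]; // private copy; ends up in registers or the stack frame
for (int row = 0; row < 4; row++) {
    for (int col = 0; col < 4; col++) {
        privPatch[row * 4 + col] = referenceFrameSamples[(patchY + row) * frameWidth + (patchX + col)];
    }
}
// best = complexAuxFunction(privPatch, ...); // hypothetical auxiliary-function call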
My guess: I know that function calls are inlined on the GPU. Since the program crashes based on the number of iterations, my only guess is that the for loops are being unrolled and, together with the inlined functions, the compiled kernel becomes too big to fit on the GPU (even with a single workitem). This would also explain why a better GPU allows a larger number of iterations.
Question: What could be causing this CL_OUT_OF_RESOURCES error based on the number of iterations?
Of course I could reduce the number of iterations in each workitem, but then I would need multiple workgroups to process the same set of data (the same 128x128 patch) and would have to go through global memory to select the best result across the workgroups of the same patch. I may end up going in this direction, but I would really like to know what is happening with the kernel at the moment.
Update after #doqtor comment:
Using -cl-nv-verbose when building the program reports the following resource usage. Strangely, these values do not change with the number of iterations, whether the program runs successfully or crashes.
ptxas info : 0 bytes gmem, 532 bytes cmem[3]
ptxas info : Compiling entry function 'naive_affine_2CPs_CU_128x128_SR_64x64' for 'sm_50'
ptxas info : Function properties for naive_affine_2CPs_CU_128x128_SR_64x64
    66032 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 140 registers, 360 bytes cmem[0]
Running clinfo reports that my GPU has
Registers per block (NV) 65536
Global memory size 2101870592 (1.958GiB)
Max memory allocation 525467648 (501.1MiB)
Local memory size 49152 (48KiB)
It seems that I am not using too many registers, but I don't know how the stack frame, cmem[0] and cmem[3] figures relate to the memory information reported by clinfo.
Is it possible that the loops are being unrolled and generating code larger than the GPU can hold?
Yes, that is part of the problem. The compiler sees that you have loops with a fixed, small trip count and automatically unrolls them. This happens for the six nested loops, and the assembly then blows up. That causes register spilling into global memory, which makes the application very slow.
However, even if the compiler does not unroll the loops, every thread does X*Y*Z*W*32*32 iterations of "the relevant computation", which takes an eternity. The system thinks the GPU has frozen up and you get CL_OUT_OF_RESOURCES.
Can you really not parallelize any of these six nested loops? The best solution would be to parallelize them all: include them in the kernel range and launch a few hundred million threads that do "the relevant computation" without any loops. You should have as many independent threads / workgroups as possible to get the best performance (to saturate the GPU).
Remember, your GPU has thousands of cores grouped into warps of 32 and SMs of 2 or 4 warps; if you launch only a single workgroup, it runs on a single SM with 64 or 128 cores and the remaining cores stay idle.
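A rough, untested sketch of the flattening suggested above, assuming the parameter counts X, Y, Z and W are passed to the kernel as arguments (names follow the question's pseudocode, not the author's actual code):
// One work-item per (patch, x, y, z, w) combination: the four outer loops are
// replaced by the NDRange, and only the two inner 32x32 loops remain in the body.
__kernel void funct_flat(__global const int *referenceFrameSamples, __global const int *currentFrameSamples,
                         const int frameWidth, const int frameHeight,
                         const int X, const int Y, const int Z, const int W,
                         __global int *result){
    const size_t gid   = get_global_id(0);
    const size_t combs = (size_t)X * Y * Z * W;
    const size_t patch = gid / combs; // which 128x128 patch
    const size_t comb  = gid % combs; // which parameter combination
    const int x =  comb % X;
    const int y = (comb / X) % Y;
    const int z = (comb / (X * Y)) % Z;
    const int w =  comb / (X * Y * Z);
    // Only the two inner 32x32 loops over the 4x4 sub-patches would run here;
    // each work-item writes its candidate value, and a second kernel (or atomics
    // on *result) reduces to the best value per patch.
}
The host would then enqueue numPatches * X * Y * Z * W work-items in one dimension, which gives the GPU far more independent work to schedule than 10 workgroups of 256 workitems.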

How to identify the exact memory size of an ETS table?

Given an ETS table with data, the info/1 function returns various properties of the table, including a size value, which is the number of rows rather than the physical size.
Is there any way to calculate the amount of memory, in bytes, occupied by an ETS table?
ets:new(mytable, [bag, named_table, compressed]),
ets:insert(mytable, {Key, Value}),
....
ets:info(mytable).
TL;DR:
ETS table allocated memory size in bytes:
ets:info(Table,memory) * erlang:system_info(wordsize).
To elaborate a bit, ets:info(Table, memory) gives you the number of words allocated to data in an ETS table
(the same applies to Mnesia; you can view all that info in the TV application. The same attribute for DETS tables is in bytes).
A word is nothing more than the 'natural' data unit of a particular CPU architecture, so what it represents depends on your architecture, 32-bit or 64-bit (or use erlang:system_info(wordsize) to get the correct word size immediately):
On a 32-bit system, a word is 4 bytes (32 bits).
On a 64-bit system, a word is 8 bytes (64 bits).
Also note that an ETS table initially spans 768 words, to which you must add the size of each element: 6 words + the size of the Erlang data. It's not really clear whether those are the words "allocated to data" that ets:info reports.
Calculating the exact size is a bit of a hassle: ETS tables have their own independent memory management system, which is optimized and garbage collected, and can vary depending on table type (set, bag, duplicate_bag). As an experiment, an empty table in my environment reports 300 words "allocated to data". If I add 11 tuples, the size increases to 366 words. I have no idea how those numbers relate to the initial 768 words, or why the size increases by only 11*6 words when, according to the definition, it should have been 11*6 + 11*1 (11 atoms).
Still, a naive estimate on a 64-bit system, taking the initial table size plus the words allocated to data (say 22,086 words), gives 768*8 + 22,086*8 = 182,832 bytes (about 178.5 KiB).
Of course, the bigger the data, the less those "structural" words matter, so you could use just the "words allocated to data" number returned by ets:info to estimate your table's size in memory.
Edit: There are two other functions that let you audit ETS memory usage:
erlang:memory/1: erlang:memory(ets) returns the memory size, in bytes, allocated to ETS.
ets:i/0: an overview of all active ETS tables (a bit like viewing system tables in TV, but with type and memory data).
As a small test, a newly created empty table increased memory use by 312 words (2.44 KiB), a lot less than the 768 words in the manual (perhaps it's CPU-architecture related, I have no idea), while ETS itself reported 299 words (2.33 KiB) allocated to the data.
That's only 13 words (104 bytes) of structural overhead away (or so it seems, it remains nebulous) from the increase erlang:memory/1 reported, so ets:info/2 is fairly accurate after all.
After inserting a simple tuple consisting of 2 atoms, erlang:memory/1 reported an 8-word increase in memory allocation, just as the documentation says it should (a new ETS record costs 6 words + the size of the data, 2 words in this case: 1 word per atom).
You can read the documentation about ets.
You can use this to get the memory allocated to the table:
ets:info(mytable, memory).
The documentation describes the memory item as {memory, integer() >= 0}, the number of words allocated to the table.

Redis Data Structure Space Requirements

What is the difference in space usage between sorted sets and lists in Redis? My guess is that sorted sets are some kind of balanced binary tree, and lists are a linked list. That would mean that, on top of the three values I'm encoding for each entry (key, score, value; though I'll munge score and value together for the linked list), the overhead is that the linked list needs to keep track of one other node and the binary tree needs to keep track of two, so the space overhead of using a sorted set is O(N).
If my value and score are both longs, and the pointers to the other nodes are also long-sized, the space for a single node seems to go from 3 longs to 4 longs on a 64-bit computer, which is a 33% increase in space.
Is this true?
It is much more than your estimate. Let's suppose ziplists are not used (i.e. you have a significant number of items).
A Redis list is a classical doubly linked list: 3 pointers (prev, next, value) per item.
A sorted set is a dictionary plus a skip list. In the dictionary, items are stored with 3 pointers as well (key, value, next). The skip list's memory footprint is more complex to evaluate: each node takes 1 double (score), 2 pointers (obj, backward), plus n couples (pointer, span value) with n between 1 and 32. Most items take only 1 or 2 couples.
In other words, when it is not represented as a ziplist, a sorted set is by far the Redis data structure with the most overhead. Compared to a list, the memory overhead is more than 200% (i.e. 3 times).
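For intuition, here are simplified C sketches of the structures involved, paraphrased from the Redis source (exact field names and layouts vary between Redis versions):
/* Plain list: a doubly linked list node, 3 pointers per item. */
typedef struct listNode {
    struct listNode *prev;
    struct listNode *next;
    void *value;
} listNode;

/* Sorted set, hash table part: 3 pointer-sized fields per entry. */
typedef struct dictEntry {
    void *key;
    void *val;                 /* a union in the real code */
    struct dictEntry *next;
} dictEntry;

/* Sorted set, skip list part: score, 2 pointers, plus 1 to 32 (pointer, span) couples. */
typedef struct zskiplistNode {
    void *obj;                 /* member */
    double score;
    struct zskiplistNode *backward;
    struct zskiplistLevel {
        struct zskiplistNode *forward;
        unsigned int span;
    } level[];
} zskiplistNode;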
Note: the best way to evaluate memory consumption with Redis is to try to build a big list or sorted set with pseudo-data and use INFO to get the memory footprint.

How data is stored in memory beyond 0s and 1s?

I am working as a software engineer. As far as I know, the data stored in memory (either hard disk or RAM) is 0s and 1s.
I am sure that beyond 0s and 1s there are different ways data is stored in memory devices, depending on the device type.
Please share your ideas about it, or: where can I study how data is stored in memory devices?
Logically, a bit is the smallest piece of data that can be stored. A bit is either 0 or 1; nothing else. It's like asking "what is between these 2 quantum states?"
In electronics, 0 and 1 can be distinguished by separate voltage levels, separate directions of magnetization, etc. Some implementations may use multiple levels to store more than one bit in a given space, but logically 0 and 1 are the only values.

Matlab: Free memory is lost after calling a function

I am having some trouble with memory management in Matlab; eventually it leads to not enough free memory and an error. I tried to pinpoint the problem and found one interesting "feature": somehow I lose free memory in Matlab.
I do the following:
1) Start Matlab
2) typing "memory" I get: Maximum possible array: 1293 mb, Memory available for all arrays: 1456 mb
3) I'll call a function. The function is rather long, so it's hard to paste it here. But basically it loads 5 ca. 300mb mat files (sequentially), picks some few values and returns them. The returned matrix is ca. 1,2mb (4650x35 double)
4) I clear all variables in workspace ("clear all")
5) typing "memory" I get: Maximum possible array: 759 mb, Memory available for all arrays: 1029 mb
If I repeat steps 3) to 5) the memory numbers are constant.
So what is wrong here? Where do I loose the 400mb of free space? The memory used by Matlab is constant at around 330mb.
Does anyone have an idea what is wrong here? Or is this completely normal and I am just missing something?
Thanks
Thomas
PS: I use Matlab 2010a and Win 7 Pro 32-bit.
A good part of this "lost" memory is probably due to memory fragmentation. As Matlab allocates and frees arrays over the course of a session, the memory gets broken up into smaller areas, and some is lost to overhead in the memory manager, at both the Matlab and the underlying C levels. The overhead is not counted as "used" by Matlab because it's not being used to hold M-code array values. Some memory may also be consumed by Matlab loading additional M-files and libraries, allocating internal buffers or structures, or by expansion of the Java heap in Matlab's embedded JVM. This is normal. After doing some work, Matlab won't have as much memory available as it did in a fresh session.
AFAIK, once low-level fragmentation occurs, there's nothing you can do to eliminate it aside from restarting Matlab. Allocating lots of small arrays can accelerate fragmentation. This sometimes happens if you use large cellstrs or large arrays of objects. So if you are having problems, you may need to reduce your peak memory usage in the function by breaking the work into smaller chunks, reducing cell usage, and so on. And if you have big cellstr arrays in the MAT files, convert them to char. The "high water mark" of allocation is what governs fragmentation, so if you can break your data set into smaller chunks, you can fit it in less memory.
Inside your function, clear as much as you can from one MAT file before moving on to the next. One way to do this implicitly is to move the per-file processing into a subfunction if it's currently sitting in a loop in your main function.
To help debug, do a "dbstop if all error", which will get triggered by the OOM. From there, you can use whos and the debugger to find out where the space is being taken up when you exhaust memory. That might reveal temp variables that need to be cleaned up, or suggest ways of chunking the work.
If you'd like to experiment to see what fragmentation looks like and how it affects memory()'s output, here's a function that will just create some fragmentation.
function fragmem(nbytes, chunksize)
%FRAGMEM Fragment the Matlab session's memory
if nargin < 2; chunksize = 1*2^10; end

nbytes = nbytes - rem(nbytes, chunksize);
nsteps = 100; % to make initial input relatively small
c = cell([1 nsteps]);
stepsize = nbytes / nsteps;
chunksperstep = ceil(stepsize / chunksize);
fprintf('Fragmenting %d MB memory into %d KB chunks (%d steps of %d chunks)\n',...
    round(nbytes/2^20), round(chunksize/2^10), nsteps, chunksperstep);

x = zeros([1 chunksperstep * chunksize], 'uint8');
colsizes = repmat(chunksize, [1 chunksperstep]);
for i = 1:nsteps
    c{i} = mat2cell(x, 1, colsizes);
end
Fragmenting 300 MB into 1 KB chunks on my win32 machine reproduces a "loss" of about the size you're seeing.
>> memory
Maximum possible array: 1384 MB (1.451e+009 bytes) *
Memory available for all arrays: 1552 MB (1.627e+009 bytes) **
Memory used by MATLAB: 235 MB (2.463e+008 bytes)
Physical Memory (RAM): 3311 MB (3.472e+009 bytes)
>> fragmem(300*2^20)
Fragmenting 300 MB memory into 1 KB chunks (100 steps of 3072 chunks)
>> memory
Maximum possible array: 1009 MB (1.059e+009 bytes) *
Memory available for all arrays: 1175 MB (1.232e+009 bytes) **
Memory used by MATLAB: 257 MB (2.691e+008 bytes)
Physical Memory (RAM): 3311 MB (3.472e+009 bytes)
>>
