I am running SBCL 1.0.51 on a Linux (Fedora 15) 32-bit system (kernel 3.6.5) with 1GB Ram and 256MB swap space.
I fire up sbcl --dynamic-space-size 125 and start calling a function that makes ~10000 http-requests (using drakma) to an http (couchDB) server and I just format to the standard-output the results of an operation on the returned data.
After each call I do a (sb-ext:gc :full t) and then (room). The results are not growing. No matter how many times I run the function, (room) reports the same used space (with some ups and downs, but around the same average which does not grow).
BUT: After every time I call the function, top reports that the VIRT and RES amount of the sbcl process keeps growing ,even beyond the 125MB space I told sbcl to ask for itself. So I have the following questions:
Why top -reported memory keeps growing, while (room) says it does not? The only thing I can think of is some leakage through ffi. I am not directly calling out with ffi but maybe some drakma dep does and forgets to free its C garbage. Anyway I dont know if this could even be an explanation. Could it be something else? Any insights?
Why isnt --dynamic-space-size honoured?
Related
I have a code that creates a global array and when I unset the array the memory is still busy.
I have tried in Windows with TCL 8.4 and 8.6
console show
puts "allocating memory..."
update
for {set i 0} {$i < 10000} {incr i} {
set a($i) $i
}
after 10000
puts "deallocating memory..."
update
foreach v [array names a] {
unset a($v)
}
after 10000
exit
In a lot of programs, both written in Tcl and in other languages, past memory usage is a pretty good indicator of future memory usage. Thus, as a general heuristic, Tcl's implementation does not try to return memory to the OS (it can always page it out if it wants; the OS is always in charge). Indeed, each thread actually has its own memory pool (allowing memory handling to be largely lock-free), but this doesn't make much difference here where there's only one main thread (and a few workers behind the scenes that you can normally ignore). Also, the memory pools will tend to overallocate because it is much faster to work that way.
Whatever you are measuring with, if it is with a tool external to Tcl at all, it will not provide particularly good real memory usage tracking because of the way the pooling works. Tcl's internal tools for this (the memory command) provide much more accurate information but aren't there by default: they're a compile-time option when building the Tcl library, and are usually switched off because they have a lot of overhead. Also, on Windows some of their features only work at all if you build a console application (a consequence of how they're implemented).
I use recon_alloc:memory(allocated_types) and get info like below.
34> recon_alloc:memory(allocated_types).
[{binary_alloc,1546650440},
{driver_alloc,21504840},
{eheap_alloc,28704768840},
{ets_alloc,526938952},
{fix_alloc,145359688},
{ll_alloc,403701800},
{sl_alloc,688968},
{std_alloc,67633992},
{temp_alloc,21504840}]
The eheap_alloc is using 28G. But sum up with heap_size of all process
>lists:sum([begin {_, X}=process_info(P, heap_size), X end || P <- processes()]).
683197586
Only 683M !Any idea where is the 28G ?
You are not comparing the right values. From erlang:process_info
{heap_size, Size}
Size is the size in words of youngest heap generation of the
process. This generation currently include the stack of the process.
This information is highly implementation dependent, and may change if
the implementation change.
recon_alloc:memory(allocated_types) is in bytes by default. You can change it using set_unit. It is not the memory that is currently used but it is the memory reserved by the VM grouped into different allocators. You can use recon_alloc:memory(used) instead. More details in allocator() - Recon Library
Searching through the Erlang source code for the eheap_alloc keyword I didn't come up with much. The most relevant piece of code was this XML code from erts_alloc.xml (https://github.com/erlang/otp/blob/172e812c491680fbb175f56f7604d4098cdc9de4/erts/doc/src/erts_alloc.xml#L46):
<tag><c>eheap_alloc</c></tag>
<item>Allocator used for Erlang heap data, such as Erlang process heaps.</item>
This says that process heaps are stored in eheap_alloc but it doesn't say what else is stored in eheap_alloc. The eheap_alloc stores everything your application needs to run along with some extra memory along with some additional space, so the VM doesn't have to request more memory from the OS every time something needs to be added. There are things the VM must keep in memory that aren't associated with a specific process. For example, large binaries, even though they may used within a process, are not stored inside that processes heap. They are stored in a shared process binary heap called binary_alloc. The binary heap, along with the process heaps and some extra memory, are what make up eheap_alloc.
In your case it looks like you have a lot of memory in your binary_alloc. binary_alloc is probably using a significant portion of your eheap_alloc.
For more details on binary handling checkout these pages:
http://blog.bugsense.com/post/74179424069/erlang-binary-garbage-collection-a-love-hate
http://www.erlang.org/doc/efficiency_guide/binaryhandling.html#id65224
When I run my WebSocket test, I found the following interesting memory usage results:
Server stated, no connection
[{total,573263528},
{processes,17375688},
{processes_used,17360240},
{system,555887840},
{atom,472297},
{atom_used,451576},
{binary,28944},
{code,3774097},
{ets,271016}]
44 processes,
System:705M,
Erlang Residence:519M
100K Connections
[{total,762564512},
{processes,130105104},
{processes_used,130089656},
{system,632459408},
{atom,476337},
{atom_used,456484},
{binary,50160},
{code,3925064},
{ets,7589160}]
100044 processes,
System: 1814M,
Erlang Residence: 950M
200K Connections
( restart server and create from 0 connection, not continue from case 2)
[{total,952040232},
{processes,243161192},
{processes_used,243139984},
{system,708879040},
{atom,476337},
{atom_used,456484},
{binary,70856},
{code,3925064},
{ets,14904760}]
200044 processes,
System:3383M,
Erlang: 1837M
The figures with "System:" and "Erlang:" are provided htop, others are output of memory() call from erlang shell. Please look at the total and erlang residence memory. When there is no connection, these two are roughly same, with 100K connections, residence memory is a little larger than total, with 200K connections, residence memory is almost double the total.
Can anybody explain?
The most probable answer for your quersion is memory fragmentation.
Allocating OS memory is expensive, so Erlang tries to manage memory for you.
When Erlang allocates memory, it creates an entity called "carrier", which consists of many "blocks". Erlang memory(total) reports the sum of all the block sizes (memory actually used). OS reports the sum of all carriers sizes (sum of memory used and preallocated). Both sum of blocks sizes and carrier sizes can be read from Erlang VM. If (block sizes)/(carrier sizes) << 1, than VM has hard time with freeing the carriers. There might be many big carriers with only couple of blocks used. You can read it with: erlang:system_info({allocator,Type}). but there is an easier way. You can check it using Recon library:
http://ferd.github.io/recon/recon_alloc.html
Firstly check:
recon_alloc:fragmentation(current).
and next:
recon_alloc:fragmentation(max).
This should explain the difference between total memory reported by Erlang VM and OS. If you are sending many small messages over websockets, you can decrease the fragmentation by running Erlang with 2 options:
erl +MBas aobf +MBlmbcs 512
First option will change the block allocation strategy from best fit to address order best fit, which could help squeeze more blocks into first carriers and second one decreases maximum multiblock carrier size, which makes carriers smaller (this should make freeing them easier).
It seems like 2 million floats should be no big deal, only 8MBs of 1GB of GPU RAM. I am able to allocate that much at times and sometimes more than that with no trouble. I get CL_OUT_OF_RESOURCES when I do a clEnqueueReadBuffer, which seems odd. Am I able to sniff out where the trouble really started? OpenCL shouldn't be failing like this at clEnqueueReadBuffer right? It should be when I allocated the data right? Is there some way to get more details than just the error code? It would be cool if I could see how much VRAM was allocated when OpenCL declared CL_OUT_OF_RESOURCES.
I just had the same problem you had (took me a whole day to fix).
I'm sure people with the same problem will stumble upon this, that's why I'm posting to this old question.
You propably didn't check for the maximum work group size of the kernel.
This is how you do it:
size_t kernel_work_group_size;
clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &kernel_work_group_size, NULL);
My devices (2x NVIDIA GTX 460 & Intel i7 CPU) support a maximum work group size of 1024, but the above code returns something around 500 when I pass my Path Tracing kernel.
When I used a workgroup size of 1024 it obviously failed and gave me the CL_OUT_OF_RESOURCES error.
The more complex your kernel becomes, the smaller the maximum workgroup size for it will become (or that's at least what I experienced).
Edit:
I just realized you said "clEnqueueReadBuffer" instead of "clEnqueueNDRangeKernel"...
My answer was related to clEnqueueNDRangeKernel.
Sorry for the mistake.
I hope this is still useful to other people.
From another source:
- calling clFinish() gets you the error status for the calculation (rather than getting it when you try to read data).
- the "out of resources" error can also be caused by a 5s timeout if the (NVidia) card is also being used as a display
- it can also appear when you have pointer errors in your kernel.
A follow-up suggests running the kernel first on the CPU to ensure you're not making out-of-bounds memory accesses.
Not all available memory can necessarily be supplied to a single acquisition request. Read up on heap fragmentation 1, 2, 3 to learn more about why the largest allocation that can succeed is for the largest contiguous block of memory and how blocks get divided up into smaller pieces as a result of using the memory.
It's not that the resource is exhausted... It just can't find a single piece big enough to satisfy your request...
Out of bounds acesses in a kernel are typically silent (since there is still no error at the kernel queueing call).
However, if you try to read the kernel result later with a clEnqueueReadBuffer(). This error will show up. It indicates something went wrong during kernel execution.
Check your kernel code for out-of-bounds read/writes.
I am using R on some relatively big data and am hitting some memory issues. This is on Linux. I have significantly less data than the available memory on the system so it's an issue of managing transient allocation.
When I run gc(), I get the following listing
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 2147186 114.7 3215540 171.8 2945794 157.4
Vcells 251427223 1918.3 592488509 4520.4 592482377 4520.3
yet R appears to have 4gb allocated in resident memory and 2gb in swap. I'm assuming this is OS-allocated memory that R's memory management system will allocate and GC as needed. However, lets say that I don't want to let R OS-allocate more than 4gb, to prevent swap thrashing. I could always ulimit, but then it would just crash instead of working within the reduced space and GCing more often. Is there a way to specify an arbitrary maximum for the gc trigger and make sure that R never os-allocates more? Or is there something else I could do to manage memory usage?
In short: no. I found that you simply cannot micromanage memory management and gc().
On the other hand, you could try to keep your data in memory, but 'outside' of R. The bigmemory makes that fairly easy. Of course, using a 64bit version of R and ample ram may make the problem go away too.