I am using R on some relatively big data and am hitting some memory issues. This is on Linux. I have significantly less data than the available memory on the system so it's an issue of managing transient allocation.
When I run gc(), I get the following listing
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 2147186 114.7 3215540 171.8 2945794 157.4
Vcells 251427223 1918.3 592488509 4520.4 592482377 4520.3
yet R appears to have 4gb allocated in resident memory and 2gb in swap. I'm assuming this is OS-allocated memory that R's memory management system will allocate and GC as needed. However, lets say that I don't want to let R OS-allocate more than 4gb, to prevent swap thrashing. I could always ulimit, but then it would just crash instead of working within the reduced space and GCing more often. Is there a way to specify an arbitrary maximum for the gc trigger and make sure that R never os-allocates more? Or is there something else I could do to manage memory usage?
In short: no. I found that you simply cannot micromanage memory management and gc().
On the other hand, you could try to keep your data in memory, but 'outside' of R. The bigmemory makes that fairly easy. Of course, using a 64bit version of R and ample ram may make the problem go away too.
Related
I am writing an algorithm which all blocks are reading a same address. Such as we have a list=[1, 2, 3, 4], and all blocks are reading it and store it to their own shared memory...My test shows the more blocks reading it, the slower it will be...I guess no broadcast happen here? Any idea I can make it faster? Thank you!!!
I learnt from previous post that this can be broadcast in one wrap, seems can not happen in different wrap....(Actually in my case, the threads in one wrap are not reading a same location...)
Once list element is accessed by first warp of a SM unit, the second warp in same SM unit gets it from cache and broadcasts to all simt lanes. But another SM unit's warp may not have it in L1 cache so it fetches from L2 to L1 first.
It is similar in __constant__ memory but it requires same address to be accessed by all threads. Its latency is closer to register access. __constant__ memory is like instruction cache, you get more performance when all threads do same thing.
For example, if you have a Gaussian-filter that iterates over same coefficient-list of filter on all threads, it is better to use constant memory. Using shared memory does not have much advantage as the filter array is not scanned randomly. Shared memory is better when the filter array content is different per block or if it needs random access.
You can also combine constant memory and shared memory. Get half of list from constant memory, then the other half from shared memory. This should let 1024 threads hide latency of one memory type hidden behind the other.
If list is small enough, you can use registers directly (has to be compile-time known indices). But it increases register pressure and may decrease occupancy so be careful about this.
Some old cuda architectures (in case of fma operation) required one operand fetched from constant memory and the other operand from a register to achieve better performance in compute-bottlenecked algorithms.
In a test with 12000 floats as filter to be applied on all threads inputs, shared memory version with 128 threads-per-block completed work in 330 milliseconds while constant-memory version completed in 260 milliseconds and the L1 access performance was the real bottleneck in both versions so the real constant-memory performance is even better, as long as it is similar-index for all threads.
I am using Delphi 7 Enterprise under Windows 7 64 bit.
My computer had 16 GB of RAM.
I try to use kbmMemTable 7.70.00 Professional Edition (http://news.components4developers.com/products_kbmMemTable.html) .
My table has 150,000 records, but when I try to copy the data from Dataset to the kbmMemTable it only copies 29000 records and I get this error: EOutOfMemory
I saw this message:
https://groups.yahoo.com/neo/groups/memtable/conversations/topics/5769,
but it didn't solve my problem.
An out of memory can happen of various reasons:
Your application uses too much memory in general. A 32 bit application typically runs out of memory when it has allocated 1.4GB using FastMM memory manager. Other memory managers may have worse or better ranges.
Memory fragementation. There may not be enough space in memory for a single large allocation that is requested. kbmMemTable will attempt to allocate roughly 200000 x 4 bytes as one single large allocation. As its own largest single allocation. That shouldnt be a problem.
Too many small allocations leading to the above memory fragmentation. kbmMemTable will allocate from 1 to n blocks of memory per record depending on the setting of the Performance property .
If Performance is set to fast, then 1 block will be allocated (unless blobs fields exists, in which case an additional allocation will be made per not null blob field).
If Performance is balanced or small, then each string field will allocate another block of memory per record.
best regards
Kim/C4D
How do I determine the lower bound for the JVM option Xmx or otherwise economize on memory without a trial and error process? I happen to set Xms and Xmx to be the same amount, which I assume helps to economize on execution time. If I set Xmx to 7G, and likewise Xms, it will happily report that all of it is being used. I use the following query:
Runtime.getRuntime().totalMemory()
If I set it to less than that, say 5GB, likewise all of it will be used. It is not until I provide very much less, say 1GB will there be an out-of-heap exception. Since my execution times are typically 10 hours or more, I need to avoid trial and error processes.
I'd execute the program with plenty of heap while monitoring heap usage with JConsole. Take note of the highest memory use after a major garbage collection, and set about maximum heap size 50% to 100% higher than that amount to avoid frequent garbage collection.
As an aside, totalMemory reports the size of the heap, not how much of it is presently used. If you set minimum and maximum heap size to the same number, totalMemory will be the same irrespective of what your program does ...
Using Xms256M and Xmx512M, and a trivial program, freeMemory is 244M and totalMemory is 245M and maxMemory is 455M. Using Xms512M and Xmx512M, the amounts are 488M, 490M, and 490M. This suggests that totalMemory is a variable amount that can vary if Xms is less than Xmx. That suggests the answer to the question is to set Xms to a small amount and monitor the highwater mark of totalMemory. It also suggests maxMemory is the ultimate heap size that cannot be exceed by the total of current and future objects.
Once the highwater mark is known, set Xmx to be somewhat more than that to be prudent -- but not excessively more because this is an economization effort -- and set Xms to be the same amount to get the time efficiency that is evidently preferred.
I have read that there's a limit to the maximum memory allocation to around 60% of device memory, and these can be changed by modifying the GPU_MAX_HEAP_SIZE and GPU_MAX_ALLOC_SIZE environment variables for GPU.
I am wonder if the AMD SDK has something similar for the CPU if I want to raise the limit of memory allocation?
For my current configuration, it returns the following:
CL_DEVICE_MAX_MEM_ALLOC_SIZE = 2973.37MB
CL_DEVI_CEGLOBAL_MEM_SIZE = 11893.5MB
Thanks.
I was able to change this on my system. I don't know if this method was possible when you originally asked the question.
set the environment variable 'CPU_MAX_ALLOC_PERCENT' to the percentage of total memory you want to be able to allocate for a single global buffer. I have 8GB system memory, and after setting CPU_MAX_ALLOC_PERCENT to 80, clinfo reports the following:
Max memory allocation: 6871207116
Success! 6.399GB
You can also use GPU_MAX_ALLOC_PERCENT in the same way for your GPU devices.
I know I can reserve virtual memory using VirtualAlloc.
e.g. I can claim 1GB of virtual memory and then call in the first MB of that to put my a growing array into.
When the array grows beyond 1MB I call in the 2nd MB and so on.
This way I don't need to move the array around in memory when it grows, it just stays in place and the Intel/AMD virtual memory manager takes care of my problems.
However does FastMM support this structure, so I don't have to do my own memory management?
Pseudo code:
type
PBigarray = ^TBigarray;
TBigArray = array[0..0] of SomeRecord;
....
begin
VirtualMem:= FastMM.ReserveVirtualMemory(1GB);
PBigArray:= FastMM.ClaimPhysicalMemory(VirtualMem, 1MB);
....
procedure GrowBigArray
begin
FastMM.ClaimMorePhysicalMemory(PBigArray, 1MB {extra});
//will generate OOM exception when claim exceeds 1GB
Does FastMM support this?
No, FastMM4 (as of the latest version I looked at) does not explicitly support this. It's really not a functionality you would expect in a general purpose memory manager as it's trivially simple to do with VirtualAlloc calls.
NexusMM4 (which is part of NexusDB) does something that gives you a similar result, but without wasting all the address space before it is needed in the background.
If you make an initial large allocation (directly via GetMem, or indirectly via a dynamic array or such) the memory is allocated in just the size needed, via VirtualAlloc.
But if that allocation is then resized to a larger size, NexusMM will use a different way to allocate memory which allows it to simply unmap the allocation from the address space an remap it again, at a larger size, when further reallocs takes place.
This prevents the 2 major problems that most general purpose memory managers have when reallocating:
during a normal realloc the existing and new allocation need to be present in the address space at the same time, temporarily doubling the address space and physical memory requirements
during a normal realloc, the whole contents of the existing allocation needs to be copied
So with NexusMM you would get all the advantages of what you showed in your pseudo code (with the exception that the first realloc will involve a copy, and that growing your array might change it's address) by simply using normal GetMem/ReallocMem/FreeMem calls.