Maximum memory allocation on OpenCL CPU

I have read that the maximum size of a single memory allocation is limited to around 60% of device memory, and that for GPUs this can be changed by setting the GPU_MAX_HEAP_SIZE and GPU_MAX_ALLOC_SIZE environment variables.
I am wondering whether the AMD SDK has something similar for the CPU, in case I want to raise the memory allocation limit.
For my current configuration, it returns the following:
CL_DEVICE_MAX_MEM_ALLOC_SIZE = 2973.37MB
CL_DEVICE_GLOBAL_MEM_SIZE = 11893.5MB
Thanks.

I was able to change this on my system. I don't know if this method was possible when you originally asked the question.
Set the environment variable CPU_MAX_ALLOC_PERCENT to the percentage of total memory you want to be able to allocate for a single global buffer. I have 8GB of system memory, and after setting CPU_MAX_ALLOC_PERCENT to 80, clinfo reports the following:
Max memory allocation: 6871207116
Success! 6.399GB
You can also use GPU_MAX_ALLOC_PERCENT in the same way for your GPU devices.

Related

Why interfaced DDR3 memory working slow as compared to on-chip memory in FPGA?

I am using a MAX10 FPGA and have interfaced DDR3 memory. I have noticed that the DDR3 memory is slow compared to on-chip memory. I discovered this with a blinking-LED program: with the same delay function, it runs faster from on-chip memory than from DDR3. What can I do to increase the speed, and what might be wrong? My system clock runs at 50MHz.
P.S. There are no Instruction or Data Caches in my system.
First, as you describe it, your function is not pipelined: you access memory and then blink the LED, so everything runs in sequence.
In that case, you should estimate the response time and throughput of your memory. For example, suppose you read a value from memory and then perform an addition, and repeat this 10 times. If every addition has to wait for the preceding read, the total time is roughly 10 × response time + 10 × addition time.
The difference is the memory response time. On-chip RAM can respond in one cycle at 50MHz, but DDR3 latency is around 80 ns. That is the difference.
You can, however, restructure your module as a pipeline: read/write data in parallel with your other logic, and issue DDR reads/writes ahead of time, much like a cache in a PC. This can save a lot of time.
Note also that DDR throughput depends heavily on your access pattern: if you read or write sequential addresses, you will get much higher throughput.
In the end, external memory can never beat internal memory in throughput or response time.

How can I use Hugepage memory from kernel space?

I need to be able to allocate 2MB- or 4MB-sized blocks of memory in a kernel module.
In Linux Kernel for allocation of continuous memory you can use function:
__get_free_pages(flags, page_rate);
where flags are the usual GFP allocation flags and page_rate (called the order in the kernel sources) sets the number of allocated pages: number of pages = 2 ^ page_rate. You can use this function as a proxy between the kernel and your calling code.
Another approach is to allocate a huge page directly, if that is possible on your system.

How to economize on memory use using the Xmx JVM option

How do I determine the lower bound for the JVM option Xmx, or otherwise economize on memory, without a trial-and-error process? I happen to set Xms and Xmx to the same amount, which I assume helps to economize on execution time. If I set Xmx to 7G, and likewise Xms, it will happily report that all of it is being used. I use the following query:
Runtime.getRuntime().totalMemory()
If I set it to less than that, say 5GB, likewise all of it will be used. It is not until I provide very much less, say 1GB, that I get an out-of-memory error. Since my execution times are typically 10 hours or more, I need to avoid trial-and-error processes.
I'd execute the program with plenty of heap while monitoring heap usage with JConsole. Take note of the highest memory use after a major garbage collection, and set the maximum heap size 50% to 100% higher than that amount to avoid frequent garbage collections.
As an aside, totalMemory reports the size of the heap, not how much of it is presently used. If you set minimum and maximum heap size to the same number, totalMemory will be the same irrespective of what your program does ...
Using Xms256M and Xmx512M with a trivial program, freeMemory is 244M, totalMemory is 245M, and maxMemory is 455M. Using Xms512M and Xmx512M, the amounts are 488M, 490M, and 490M. This suggests that totalMemory is a variable amount that can grow when Xms is less than Xmx, which in turn suggests that the answer to the question is to set Xms to a small amount and monitor the high-water mark of totalMemory. It also suggests that maxMemory is the ultimate heap size that cannot be exceeded by the total of current and future objects.
Once the high-water mark is known, set Xmx to be somewhat more than that to be prudent -- but not excessively more, because this is an economization effort -- and set Xms to the same amount to get the time efficiency that is evidently preferred.

How to allocate all of the available shared memory to a single block in CUDA?

I want to allocate all the available shared memory of an SM to one block. I am doing this because I don't want multiple blocks to be assigned to the same SM.
My GPU card has 64KB (Shared+L1) memory. In my current configuration, 48KB is assigned to the Shared memory and 16KB to the L1.
I wrote the following code to use up all of the available Shared memory.
__global__ void foo()
{
__shared__ char array[49152];
...
}
I have two questions:
How can I make sure that all of the shared memory space is used up?
I can increase the 48K in the code to a much higher value without getting any error or warning. Can anyone explain why?
Thanks in advance,
Iman
You can read the size of the available per-block shared memory from cudaDeviceProp::sharedMemPerBlock, which you can obtain by calling cudaGetDeviceProperties.
You do not have to specify the size of your array statically. Instead, you can pass the shared memory size dynamically as the third kernel launch configuration parameter.
The "clock" CUDA SDK sample illustrates how you can specify shared memory size at launch time.
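A minimal sketch of that dynamic approach, reusing the kernel name from the question; the block size of 128 threads and the write into array[0] (to keep the buffer from being optimized away) are illustrative choices:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void foo()
{
    // Unsized declaration: the size comes from the third
    // launch-configuration parameter at kernel launch time.
    extern __shared__ char array[];
    if (threadIdx.x == 0)
        array[0] = 0;  // touch the buffer so it is actually used
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    size_t shmem = prop.sharedMemPerBlock;  // e.g. 49152 bytes here
    printf("sharedMemPerBlock = %zu bytes\n", shmem);

    // Request the whole per-block shared memory for this launch.
    foo<<<1, 128, shmem>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

Requesting more than sharedMemPerBlock this way produces a launch failure at run time rather than a compile-time error, which is one reason an oversized static array can slip through without a warning.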

GC and memory limit issues with R

I am using R on some relatively big data and am hitting some memory issues. This is on Linux. I have significantly less data than the available memory on the system so it's an issue of managing transient allocation.
When I run gc(), I get the following listing
used (Mb) gc trigger (Mb) max used (Mb)
Ncells 2147186 114.7 3215540 171.8 2945794 157.4
Vcells 251427223 1918.3 592488509 4520.4 592482377 4520.3
yet R appears to have 4GB allocated in resident memory and 2GB in swap. I'm assuming this is OS-allocated memory that R's memory management system will allocate and GC as needed. However, let's say that I don't want R to OS-allocate more than 4GB, to prevent swap thrashing. I could always use ulimit, but then it would just crash instead of working within the reduced space and GCing more often. Is there a way to specify an arbitrary maximum for the gc trigger and make sure that R never OS-allocates more? Or is there something else I could do to manage memory usage?
In short: no. I found that you simply cannot micromanage memory management and gc().
On the other hand, you could try to keep your data in memory, but 'outside' of R. The bigmemory package makes that fairly easy. Of course, using a 64-bit version of R and ample RAM may make the problem go away too.