I have 3 questions about GPU memory:
Why does my application take a different amount of GPU memory on different machines (with different graphics cards)?
What happens when there is not enough memory on the GPU for my application? Can RAM be used instead? Who is responsible for this memory management?
I saw some strange GPU memory behavior:
My application starts with 2.5/4 GB of GPU memory in use. When running a certain function, GPU memory usage reaches the maximum (4 GB) and then immediately drops to illogical values (less than was allocated before this function ran).
How can this be explained?
Why does my application take a different amount of GPU memory on different machines (with different graphics cards)?
Because the GPUs are different. Code sizes, minimum runtime resource requirements, page sizes, etc. can differ between GPUs, driver versions, and toolkit versions.
What happens when there is not enough memory on the GPU for my application?
That would depend entirely on your application and how it handles runtime errors. But the CUDA runtime will simply return errors.
Can RAM memory be used instead?
Possibly, if you have designed your application to use it. But automatically, no.
Who is responsible for this memory management?
You are.
I saw some strange GPU memory behavior: My application starts with 2.5/4 GB of GPU memory in use. When running a certain function, GPU memory usage reaches the maximum (4 GB) and then immediately drops to illogical values (less than was allocated before this function ran). How can this be explained?
The runtime detected an unrecoverable error (such as a kernel trying to access invalid memory as the result of a prior memory allocation failure) and destroyed the CUDA context held by your application, which releases all resources on the GPU associated with your application.
So I installed the GPU version of TensorFlow on a Windows 10 machine with a GeForce GTX 980 graphics card on it.
Admittedly, I know very little about graphics cards, but according to dxdiag it does have:
4060MB of dedicated memory (VRAM) and;
8163MB of shared memory
for a total of about 12224MB.
What I noticed, though, is that this "shared" memory seems to be pretty much useless. When I start training a model, the VRAM will fill up and if the memory requirement exceeds these 4GB, TensorFlow will crash with a "resource exhausted" error message.
I CAN, of course, prevent reaching that point by choosing the batch size suitably low, but I do wonder if there's a way to make use of these "extra" 8GB of RAM, or if that's it and TensorFlow requires the memory to be dedicated.
Shared memory is an area of the main system RAM reserved for graphics. References:
https://en.wikipedia.org/wiki/Shared_graphics_memory
https://www.makeuseof.com/tag/can-shared-graphics-finally-compete-with-a-dedicated-graphics-card/
https://youtube.com/watch?v=E5WyJY1zwcQ
This type of memory is what integrated graphics, e.g. the Intel HD series, typically use.
It is not on your NVIDIA GPU, and CUDA can't use it. TensorFlow can't use it when running on the GPU because CUDA can't use it, and it can't use it when running on the CPU either, because it's reserved for graphics.
Even if CUDA could use it somehow, it wouldn't be useful, because system RAM bandwidth is around 10x lower than GPU memory bandwidth, and you would still have to get the data to and from the GPU over the slow (and high-latency) PCIe bus.
Bandwidth numbers for reference:
GeForce GTX 980: 224 GB/s
DDR4 on desktop motherboard: approx. 25 GB/s
PCIe 16x: 16 GB/s
This doesn't take into account latency. In practice, running a GPU compute task on data which is too big to fit in GPU memory and has to be transferred over PCIe every time it is accessed is so slow for most types of compute that doing the same calculation on CPU would be much faster.
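To put the figures above in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the three bandwidth numbers quoted above and ignores latency entirely. The 4 GB working-set size and the function name are assumptions chosen for illustration.

```python
# Peak bandwidths quoted above, in GB/s. Real-world throughput is lower.
BANDWIDTH_GB_PER_S = {
    "GTX 980 VRAM": 224.0,
    "DDR4 system RAM": 25.0,
    "PCIe x16": 16.0,
}


def seconds_per_pass(size_gb, bandwidth_gb_per_s):
    """Time to stream size_gb once at the given bandwidth, ignoring latency."""
    return size_gb / bandwidth_gb_per_s


if __name__ == "__main__":
    size_gb = 4.0  # assumed working set, roughly the card's VRAM size
    for name, bandwidth in BANDWIDTH_GB_PER_S.items():
        ms = seconds_per_pass(size_gb, bandwidth) * 1000
        print(f"{name}: {ms:.1f} ms per pass")
```

With these figures, one pass over the data through PCIe takes 14 times longer than one pass through VRAM (224 / 16), before any latency is counted.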
Why do you see that kind of memory being allocated when you have an NVIDIA card in your machine? Good question. I can think of a couple of possibilities:
(a) You have both NVIDIA and Intel graphics drivers active (e.g. as happens when running different displays on both). Uninstall the Intel drivers and/or disable Intel HD graphics in the BIOS and the shared memory will disappear.
(b) NVIDIA is using it. This may be, e.g., extra texture memory. It could also not be real memory at all, but just a memory-mapped area that corresponds to GPU memory. Look in the advanced settings of the NVIDIA driver for a setting that controls this.
In any case, no, there isn't anything that TensorFlow can use.
CUDA can make use of system RAM as well. In CUDA, memory shared between VRAM and RAM is called unified memory. However, TensorFlow does not allow it, for performance reasons.
I had the same problem. My VRAM is 6 GB but only 4 GB was detected. I read some code about limiting GPU memory in TensorFlow, tried it, and it works:
import tensorflow as tf

# Set a memory limit for the GPU
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 6 GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=6144)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
Note: if you have 10 GB of VRAM, then try a memory limit of 10240.
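The values 6144 and 10240 above are simply the VRAM size in GB multiplied by 1024. A small helper makes that conversion explicit. The function name and the optional headroom parameter are hypothetical and not part of TensorFlow; this is only a sketch of the arithmetic.

```python
def memory_limit_mb(vram_gb, headroom_mb=0):
    """Convert a VRAM size in GB to a memory_limit value in MB.

    headroom_mb is a hypothetical knob for leaving some memory to the
    display or other processes; it defaults to no headroom.
    """
    limit = int(vram_gb * 1024) - headroom_mb
    if limit <= 0:
        raise ValueError("memory limit must be positive")
    return limit


if __name__ == "__main__":
    print(memory_limit_mb(6))   # value used in the snippet above
    print(memory_limit_mb(10))  # value suggested in the note
```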
Well, that's not entirely true. You're right about lowering the batch size, but it depends on what type of model you are training. If you train Xseg, it won't use the shared memory, but when you get into SAEHD training, you can place your model optimizers on the CPU (instead of the GPU), as well as your learning dropout rate, which lets you take advantage of that shared memory for those optimizations while saving the dedicated GPU memory for your model resolution and batch size. So it may seem like the shared memory is useless, but play with your settings and you'll see that, for certain settings, it's just a matter of redistributing the right tasks. You'll have increased iteration times, but you'll be utilizing that shared memory in one way or another. I had to experiment a lot to find what worked with my GPU, and there were some surprising revelations. This is an old post, so I bet you've figured it out by now, hopefully.
I'm getting an 'out of memory' error while using the cudaMallocPitch API with GeForce GTX 1080 Ti and/or GeForce GTX 1080 GPUs, which are part of a PC server that includes 4 GPUs (one 1080 Ti and three 1080s) and two CPUs.
Each GPU is controlled by a dedicated CPU thread, which calls cudaSetDevice with the right device index when it starts running.
Based on information in a configuration file, the application knows how many CPU threads to create.
I can also run my application several times as separate processes, each of which controls a different GPU.
I'm using OpenCV version 3.2 to perform image background subtraction.
First, you create the BackgroundSubtractorMOG2 object using cv::cuda::createBackgroundSubtractorMOG2, and after that you call its apply method.
The first time the apply method is called, all required memory is allocated once.
My image size is 10000 cols by 7096 rows. Each pixel is 1 byte (grayscale).
When I run my application as one process with several threads (one per GPU), everything works fine, but when I run it 4 times as separate processes (one per GPU), the OpenCV apply function starts to fail due to a cudaMallocPitch 'not enough memory' failure.
For all GPUs, I verified that I had enough available memory before apply was called for the first time. The 1080 reports ~5.5 GB free and the 1080 Ti ~8.3 GB, and the requested size is: width 120000 bytes, height 21288 rows, about 2.4 GB.
Please advise.
The source of the problem was found:
cudaMallocPitch returned cudaErrorMemoryAllocation because there was no available OS virtual memory, which the OS uses when the process performs read/write accesses to GPU physical memory.
Because of that, the CUDA driver fails any kind of GPU physical memory allocation.
The difficulty here was figuring out why this API fails while enough GPU physical memory exists (as checked with cudaMemGetInfo).
I analyzed two points:
Why don't I have enough virtual memory on my PC?
By following the instructions at the link below, I changed its size and the problem disappeared:
https://www.online-tech-tips.com/computer-tips/simple-ways-to-increase-your-computers-performace-configuring-the-paging-file/
Why does my process consume so much OS virtual memory?
In the past I found that, for better performance during processing, I should allocate all required GPU physical memory only once at the beginning, because an allocation takes a long time, depending on the requested size.
Because I'm working with frames of ~70 MB each and my processing logic requires a huge number of auxiliary buffers, massive GPU and CPU memory areas had to be allocated, which exhausted the available OS virtual memory.
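The request size quoted in the question can be sanity-checked with a few lines of Python. This is only arithmetic, not a call into CUDA: cudaMallocPitch chooses the real pitch itself, so the 512-byte alignment below is an assumption used for illustration, as are the function names.

```python
GIB = 1024 ** 3


def round_up(value, multiple):
    """Round value up to the next multiple."""
    return -(-value // multiple) * multiple


def pitched_bytes(width_bytes, rows, pitch_alignment=512):
    """Estimate the size of a pitched 2D allocation.

    pitch_alignment is an assumed value; the driver picks the real pitch.
    """
    return round_up(width_bytes, pitch_alignment) * rows


def fits(request_bytes, free_bytes):
    """True if the request is no larger than the reported free memory."""
    return request_bytes <= free_bytes


if __name__ == "__main__":
    request = pitched_bytes(120000, 21288)
    print(f"request: {request / GIB:.2f} GiB")
    print("fits in 5.5 GiB free:", fits(request, int(5.5 * GIB)))
    print("fits in 8.3 GiB free:", fits(request, int(8.3 * GIB)))
```

The check passes for both cards, which is consistent with the finding above: the free GPU memory reported by cudaMemGetInfo was sufficient, and the failure came from exhausted OS virtual memory instead.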
Alright, so I have a question regarding the memory segments of a JVM.
I know every JVM may implement this a little differently, yet the overall concept should remain the same across all JVMs.
A standard C/C++ program, which does not use a virtual machine to execute, has four memory segments at runtime:
The Code / Stack / Heap / Data
All of these memory segments are automatically allocated by the operating system at runtime.
However, when a JVM executes a compiled Java program, at runtime it has 5 memory segments:
The Method area / Heap / Java Stacks / PC Registers / Native Stacks
My question is this: who allocates and manages those memory segments?
The operating system is NOT aware of a Java program running and thinks it is part of the JVM running as a regular program on the computer. JIT compilation and Java stack usage require run-time memory allocation, and what I'm failing to understand is how a JVM divides its memory into those memory segments.
It is definitely not done by the operating system, and those memory segments (for example the Java stacks) must be contiguous in order to work. So if the JVM simply used a call such as malloc to receive the maximum size of heap memory and divided that memory into segments, we would have no promise of contiguous memory. I would love it if someone could help me get this straight in my head, it's all mixed up...
When the JVM starts, it has hundreds if not thousands of memory regions. For example, there is a stack for every thread, as well as a thread state region. There is a memory mapping for every shared library and jar. Note: 64-bit Java doesn't use segments the way a 16-bit application would.
who allocates and manages those memory segments?
All memory mappings/regions are allocated by the OS.
The operating system is NOT aware of a java program running and thinks it is a part of the JVM running as a regular program on the computer,
The JVM does run as a regular program, and its memory allocation uses the same mechanisms as any other program. The only difference is that Java object allocation is managed by the JVM, and the heap is the only region that works this way.
JIT compilation, Java stacks usage,
JIT compilation occurs in a normal OS thread and each Java stack is a normal thread stack.
these operations require run-time memory allocation,
They do, and for the most part the JVM uses malloc and free, and mmap and munmap.
And what I'm failing to understand is how a JVM divides its memory into those memory segments
It doesn't. The heap is for Java Objects only. The maximum heap for example is NOT the maximum memory usage, only the maximum amount of objects you can have at once.
It is definitely not done by the Operating System, and those memory segments (for example the java stacks) must be contiguous in order to work
You are right that they need to be contiguous in virtual memory, but the OS takes care of this. On Linux at least, no segments are used, only a single flat 32-bit or 64-bit address space.
so if the JVM program would simply use a command such as malloc in order to receive the maximum size of heap memory and divide that memory into segments,
The heap is divided either into generations or, in G1, into multiple memory chunks, but this is for objects only.
we have no promise for contiguous memory
The garbage collectors either defragment memory by copying objects around or take steps to reduce fragmentation, so that there is enough contiguous memory for any object you allocate.
would love it if someone could help me get this straight in my head, it's all mixed up...
In short, the JVM runs like any other program, except that when Java code runs, its objects are allocated in a managed region of memory. All other memory regions act just as they would in a C program, because the JVM is a C/C++ program.
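As a loose analogy only (this is not how any particular JVM is implemented), the Python sketch below asks the OS for one anonymous mapping, which is contiguous in virtual memory, and then carves it into page-aligned sub-regions in user code. The function names and region sizes are made up for illustration.

```python
import mmap


def round_up_to_page(size):
    """Round size up to a whole number of OS pages."""
    page = mmap.PAGESIZE
    return -(-size // page) * page


def carve_regions(sizes):
    """Map one anonymous region and split it into page-aligned sub-regions."""
    rounded = [round_up_to_page(s) for s in sizes]
    region = mmap.mmap(-1, sum(rounded))  # anonymous mapping provided by the OS
    view = memoryview(region)
    parts, offset = [], 0
    for size in rounded:
        parts.append(view[offset:offset + size])
        offset += size
    return region, parts


if __name__ == "__main__":
    region, (stack_like, heap_like) = carve_regions([64 * 1024, 1024 * 1024])
    heap_like[0] = 42  # write through the sub-region
    # The same byte is visible through the whole mapping at the right offset.
    print(region[len(stack_like)])
```

The OS guarantees that the mapping is contiguous in virtual address space; how the process subdivides it afterwards is entirely up to the process.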
I observe substantial speedups in data transfer when I use pinned memory for CUDA data transfers. On Linux, the underlying system call for achieving this is mlock. The mlock man page states that locking a page prevents it from being swapped out:
mlock() locks pages in the address range starting at addr and continuing for len bytes. All pages that contain a part of the specified address range are guaranteed to be resident in RAM when the call returns successfully;
In my tests, I had a few gigabytes of free memory on my system, so there was never any risk that the memory pages could have been swapped out, yet I still observed the speedup. Can anyone explain what's really going on here? Any insight or info is much appreciated.
The CUDA driver checks whether the memory range is locked and uses a different code path accordingly. Locked memory is guaranteed to be resident in physical memory (RAM), so the device can fetch it without help from the CPU (DMA, i.e. an asynchronous copy; the device only needs the list of physical pages). Non-locked memory can generate a page fault on access and may not be resident in RAM (e.g. it can be in swap), so the driver has to access every page of non-locked memory, copy it into a pinned buffer, and pass that to DMA (a synchronous, page-by-page copy).
As described here http://forums.nvidia.com/index.php?showtopic=164661
host memory used by the asynchronous mem copy call needs to be page locked through cudaMallocHost or cudaHostAlloc.
I also recommend checking the cudaMemcpyAsync and cudaHostAlloc manuals at developer.download.nvidia.com. The cudaHostAlloc documentation says that the CUDA driver can detect pinned memory:
The driver tracks the virtual memory ranges allocated with this (cudaHostAlloc) function and automatically accelerates calls to functions such as cudaMemcpy().
CUDA uses DMA to transfer pinned memory to the GPU. Pageable host memory cannot be used with DMA because it may reside on disk.
If the memory is not pinned (i.e. not page-locked), it is first copied to a page-locked "staging" buffer and then copied to the GPU through DMA.
So by using pinned memory you save the time needed to copy from pageable host memory to page-locked host memory.
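The difference between the two paths can be modeled with a toy simulation in plain Python. Nothing here touches CUDA: bytearrays stand in for host, staging, and device memory, a slice assignment stands in for a DMA transfer, and the function names and the 4096-byte staging size are assumptions for illustration.

```python
def copy_pinned(src, device):
    """Pinned path: one direct transfer (stands in for DMA)."""
    device[:len(src)] = src
    return {"host_copies": 0, "transfers": 1}


def copy_pageable(src, device, staging_size=4096):
    """Pageable path: stage each chunk in a fixed buffer, then transfer it."""
    staging = bytearray(staging_size)  # stands in for the driver's pinned buffer
    host_copies = 0
    transfers = 0
    for start in range(0, len(src), staging_size):
        chunk = src[start:start + staging_size]
        staging[:len(chunk)] = chunk  # extra host-to-host copy
        host_copies += 1
        device[start:start + len(chunk)] = staging[:len(chunk)]  # "DMA"
        transfers += 1
    return {"host_copies": host_copies, "transfers": transfers}


if __name__ == "__main__":
    data = bytes(range(256)) * 40  # 10240 bytes of sample data
    print(copy_pinned(data, bytearray(len(data))))
    print(copy_pageable(data, bytearray(len(data))))
```

Both paths deliver the same bytes; the pageable path simply pays for one extra host-side copy per staged chunk, which is the time that pinned memory saves.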
If the memory pages had not been accessed yet, they were probably never swapped in to begin with. In particular, newly allocated pages will be virtual copies of the universal "zero page" and don't have a physical instantiation until they're written to. New maps of files on disk will likewise remain purely on disk until they're read or written.
A verbose note on copying non-locked pages to locked pages.
It can be extremely expensive if the non-locked pages have been swapped out by the OS on a busy system with limited CPU RAM. Page faults are then triggered to load the pages back into CPU RAM through expensive disk I/O operations.
Pinning pages can also cause virtual memory thrashing on a system where CPU RAM is precious. If thrashing happens, the throughput of CPU can be degraded a lot.
I understand that cudaMallocManaged simplifies memory access by eliminating the need for explicit memory allocations on host and device. Consider a scenario where the host memory is significantly larger than the device memory, say 16 GB host and 2 GB device, which is fairly common these days. Suppose I am dealing with large input data, say 4-5 GB, which is read from an external data source. Am I forced to resort to explicit host and device memory allocation (as device memory is insufficient to hold everything at once), or does the CUDA unified memory model have a way to get around this (something like automatic allocation/deallocation on an as-needed basis)?
Am I forced to resort to explicit host and device memory allocation?
You are not forced to resort to explicit host and device memory allocation, but you will be forced to handle the amount of allocated memory manually. This is because, on current hardware at least, the CUDA unified virtual memory doesn't allow you to oversubscribe GPU memory. In other words, cudaMallocManaged will fail once you allocate more memory than what is available on the device. But that doesn't mean you can't use cudaMallocManaged, it merely means you have to keep track of the amount of memory allocated and never exceed what the device could support, by "streaming" your data instead of allocating everything at once.
Pure speculation as I can't speak for NVIDIA, but I believe this could be one of the future improvements on upcoming hardware.
And indeed, one year and a half after the above prediction, as of CUDA 8, Pascal GPUs are now enhanced with a page-faulting capability that allows memory pages to migrate between the host and the device without explicit intervention from the programmer.
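On hardware without page-faulting support, the "streaming" approach described in the answer comes down to splitting the input into pieces that each fit within a device memory budget. Here is a minimal sketch of that bookkeeping in Python. The function name and the 1.5 GiB budget are assumptions; the actual allocation and copy calls are deliberately left out.

```python
GIB = 1024 ** 3


def plan_chunks(total_bytes, device_budget_bytes):
    """Yield (offset, length) pairs covering total_bytes within the budget."""
    if device_budget_bytes <= 0:
        raise ValueError("device budget must be positive")
    offset = 0
    while offset < total_bytes:
        length = min(device_budget_bytes, total_bytes - offset)
        yield offset, length
        offset += length


if __name__ == "__main__":
    # 5 GiB of input and an assumed 1.5 GiB budget on a 2 GiB device
    for offset, length in plan_chunks(5 * GIB, int(1.5 * GIB)):
        print(f"process bytes [{offset}, {offset + length})")
```

Each pair describes one slice of the input to load, process, and release before moving on to the next, so the amount allocated on the device never exceeds the chosen budget.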