How to allocate GpuMat in shared (not dedicated) GPU memory? [duplicate] - opencv

So I installed the GPU version of TensorFlow on a Windows 10 machine with a GeForce GTX 980 graphics card on it.
Admittedly, I know very little about graphics cards, but according to dxdiag it does have:
4060 MB of dedicated memory (VRAM), and
8163 MB of shared memory,
for a total of about 12224 MB.
What I noticed, though, is that this "shared" memory seems to be pretty much useless. When I start training a model, the VRAM will fill up and if the memory requirement exceeds these 4GB, TensorFlow will crash with a "resource exhausted" error message.
I CAN, of course, prevent reaching that point by choosing the batch size suitably low, but I do wonder if there's a way to make use of these "extra" 8GB of RAM, or if that's it and TensorFlow requires the memory to be dedicated.

Shared memory is an area of the main system RAM reserved for graphics. References:
https://en.wikipedia.org/wiki/Shared_graphics_memory
https://www.makeuseof.com/tag/can-shared-graphics-finally-compete-with-a-dedicated-graphics-card/
https://youtube.com/watch?v=E5WyJY1zwcQ
This type of memory is what integrated graphics (e.g. the Intel HD series) typically use.
It is not on your NVIDIA GPU, and CUDA can't use it. TensorFlow can't use it when running on the GPU, because CUDA can't, and can't use it when running on the CPU either, because it's reserved for graphics.
Even if CUDA could somehow use it, it wouldn't be useful: system RAM bandwidth is around 10x lower than GPU memory bandwidth, and you would still have to move the data to and from the GPU over the slow (and high-latency) PCIe bus.
Bandwidth numbers for reference:
GeForce GTX 980: 224 GB/s
DDR4 on a desktop motherboard: approx. 25 GB/s
PCIe x16: 16 GB/s
This doesn't take into account latency. In practice, running a GPU compute task on data which is too big to fit in GPU memory and has to be transferred over PCIe every time it is accessed is so slow for most types of compute that doing the same calculation on CPU would be much faster.
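For a rough sense of scale, here is a back-of-the-envelope sketch using the figures above (approximate numbers, latency and overlap ignored): the time to stream a 4 GB working set once at each bandwidth.

# Back-of-the-envelope: time to stream a 4 GB working set once at the
# bandwidths quoted above (approximate figures, latency ignored).
working_set_gb = 4.0
bandwidth_gb_s = {
    "GTX 980 GDDR5 (on-card)": 224.0,
    "DDR4 system RAM": 25.0,
    "PCIe 3.0 x16 (host <-> device)": 16.0,
}
for name, bw in bandwidth_gb_s.items():
    print(f"{name:>32}: {working_set_gb / bw * 1000:6.1f} ms per pass")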
Why do you see that kind of memory being allocated when you have an NVIDIA card in your machine? Good question. I can think of a couple of possibilities:
(a) You have both NVIDIA and Intel graphics drivers active (e.g. when running different displays from both). Uninstall the Intel drivers and/or disable Intel HD graphics in the BIOS and the shared memory will disappear.
(b) NVIDIA is using it, e.g. as extra texture memory. It could also not be real memory, but just a memory-mapped area that corresponds to GPU memory. Look in the advanced settings of the NVIDIA driver for a setting that controls this.
In any case, no, there isn't anything there that TensorFlow can use.

CUDA can make use of system RAM as well. In CUDA, memory shared between VRAM and RAM is called unified memory. However, TensorFlow does not allow it, for performance reasons.
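TensorFlow doesn't expose this, but as an illustration of what unified (managed) memory looks like from Python, here is a minimal sketch using CuPy (my substitution; the answer above doesn't mention CuPy). Whether managed allocations can actually exceed dedicated VRAM depends on the GPU, driver, and OS, and it is noticeably slower than staying in VRAM.

import cupy as cp

# Route CuPy allocations through cudaMallocManaged (unified memory).
# Pages then migrate between GPU memory and system RAM on demand.
cp.cuda.set_allocator(cp.cuda.malloc_managed)

# Allocate ~6 GB, more than a 4 GB card could hold outright.
x = cp.zeros((6 * 1024, 1024, 256), dtype=cp.float32)
x += 1.0
print(float(x.sum()))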

I had the same problem. My VRAM is 6 GB, but only 4 GB was detected. I read about code for limiting TensorFlow's GPU memory, tried the snippet below, and it worked:
# Setting a GPU memory limit
import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to only allocate 6 GB of memory on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=6144)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)
Note: if you have 10 GB of VRAM, then try a memory limit of 10240.
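A related, simpler knob (not what the snippet above does, just the other documented option) is memory growth, which makes TensorFlow grab VRAM incrementally instead of reserving it up front:

import tensorflow as tf

# Alternative: allocate GPU memory on demand rather than reserving
# (almost) all of it at startup.
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    try:
        tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before the GPU has been initialized
        print(e)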

Well, that's not entirely true. You're right in terms of lowering the batch size, but it will depend on what model type you are training. If you train XSeg, it won't use the shared memory, but when you get into SAEHD training you can set your model optimizers on the CPU (instead of the GPU), as well as your learning dropout rate, which lets you take advantage of that shared memory for those optimizations while saving the dedicated GPU memory for your model resolution and batch size. So it may seem like that shared memory is useless, but play with your settings and you'll see that, for certain settings, it's just a matter of redistributing the right tasks. You'll have increased iteration times, but you'll be utilizing that shared memory in one way or another. I had to experiment a lot to find what worked with my GPU, and there were some surprising revelations. This is an old post, but I bet you've figured it out by now, hopefully.

Related

cudaMallocPitch fails when multiple GPUs are controlled by separate CPU processes, despite enough memory being available

I'm getting an 'out of memory' error when using the cudaMallocPitch API with GeForce GTX 1080 Ti and/or GeForce GTX 1080 GPUs, which are part of a PC server that contains 4 GPUs (one 1080 Ti and three 1080s) and two CPUs.
Each GPU is controlled by a dedicated CPU thread, which calls cudaSetDevice with the right device index at the beginning of its run.
Based on information in a configuration file, the application knows how many CPU threads to create.
I can also run my application several times as separate processes, each of which controls a different GPU.
I'm using OpenCV version 3.2 to perform image background subtraction.
First, you create the BackgroundSubtractorMOG2 object using cv::cuda::createBackgroundSubtractorMOG2, and after that you call its apply method.
The first time the apply method is called, all required memory is allocated once.
My image size is 10000 columns by 7096 rows. Each pixel is 1 byte (grayscale).
When I run my application as one process with several threads (one per GPU), everything works fine, but when I run it 4 times as separate processes (one per GPU), the OpenCV apply function starts to fail with a cudaMallocPitch 'not enough memory' error.
For all GPUs I verified that there was enough available memory before apply was called for the first time: the 1080 reports ~5.5 GB free and the 1080 Ti ~8.3 GB free, while the requested allocation is width = 120000 bytes, height = 21288 rows, i.e. about 2.4 GB.
Please advise.
The source of the problem was found:
cudaMallocPitch returned cudaErrorMemoryAllocation because there was no OS virtual memory available, which the OS uses when the process performs read/write accesses to GPU physical memory.
Because of that, the CUDA driver fails any kind of GPU physical memory allocation.
The tricky part was figuring out why this API failed while enough GPU physical memory existed (checked with the cudaMemGetInfo API).
I started to analyze two points:
Why don't I have enough virtual memory on my PC?
By following the instructions in the link below, I increased the page file size and the problem disappeared:
https://www.online-tech-tips.com/computer-tips/simple-ways-to-increase-your-computers-performace-configuring-the-paging-file/
Why does my process consume so much OS virtual memory?
I had previously found that, to get better performance during processing, I should allocate all required GPU physical memory only once at the beginning, because an allocation takes a long time, depending on the requested size.
Since I'm working with a frame size of ~70 MB and my processing logic requires a huge number of auxiliary buffers, massive GPU and CPU memory areas had to be allocated, which exhausted the available OS virtual memory.
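For reference, the check described above (comparing the expected pitched allocation against what cudaMemGetInfo reports) can be sketched from Python roughly like this. CuPy is only used here as a convenient wrapper around cudaMemGetInfo, and the 512-byte pitch alignment is an assumption; the real pitch is chosen by the driver.

import cupy as cp

# Estimate the pitched allocation and compare it with free GPU memory.
width_bytes = 120000   # requested row width in bytes
height_rows = 21288    # number of rows
pitch_align = 512      # assumed alignment; the driver decides the real pitch

pitch = -(-width_bytes // pitch_align) * pitch_align   # round width up
needed = pitch * height_rows

free, total = cp.cuda.runtime.memGetInfo()
print(f"need ~{needed / 2**30:.2f} GiB, "
      f"free {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB")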

Memory management when using GPU in TensorFlow

I have some doubts about using GPU in Tensorflow. I was following convolutional neural network tutorial here (tensorflow/models/image/cifar10/cifar10_train.py). As in the tutorial, all parameters (e.g., weights) are stored and updated in CPU memory and GPUs are only used to compute gradients or inference.
Since the weights are stored on the CPU, they have to be synchronized every iteration, and the GPU seems underutilized (about 60% according to nvidia-smi). In the case of multiple GPUs, I understand that weights should be stored in CPU memory to synchronize between the GPUs. However, why does this tutorial store all weights on the CPU even in the single-GPU case? Is there any way to store and update them in GPU memory?
In the case of inference, are the weights copied to the GPU once and reused, or are they copied every time they are used?
How about the image data? It seems that this data resides on the GPU (I'm not sure). When is it transferred to the GPU? When it is loaded from disk, or when it is required by the GPU?
If it is copied to the GPU right after it is loaded from disk, what happens if the image data is too large to fit in GPU memory? In that case, is there any way to copy the data piecewise (something like prefetching)?
If it is copied to the GPU on demand, is there any way to prefetch it before it is actually used by the GPU, to avoid idle time?
EDIT: It would be helpful if there is any way to check where the send/recv nodes are inserted between CPU and GPU (as in the white paper).
Those tutorials are meant to show off the API, so they don't optimize for performance. It's faster to keep variables on the GPU for a single-tower model, and also faster for a multi-tower model when you have p2p communication enabled between the GPUs. To pin variables to the GPU, use the same tf.device('/gpu:0') approach as for any other op; a sketch follows below.
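For example, a minimal TF1-style sketch of pinning a variable (and its training op) to the GPU; the variable name, shape, and loss here are placeholders:

import tensorflow as tf  # TF1-style graph code, matching the answer's era

with tf.device('/gpu:0'):
    # The variable lives in GPU memory, so its update does not round-trip
    # through the CPU every iteration.
    w = tf.get_variable('w', shape=[1024, 1024],
                        initializer=tf.truncated_normal_initializer())
    loss = tf.reduce_sum(tf.matmul(w, w, transpose_b=True))
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# allow_soft_placement lets ops without a GPU kernel fall back to the CPU.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)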
You can see all the memory copies between GPUs if you enable partition graphs, i.e. do something like this:
import tensorflow as tf
from tensorflow.python.client.timeline import Timeline

# Run the step with full tracing and partition graphs enabled, then dump a
# Chrome trace plus the raw metadata (including step_stats).
metadata = tf.RunMetadata()
sess.run(x, options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE,
                                  output_partition_graphs=True),
         run_metadata=metadata)
timeline = Timeline(metadata.step_stats)
with open("dynamic_stitch_gpu_profile.json", "w") as f:
    f.write(timeline.generate_chrome_trace_format())
with open("dynamic_stitch_gpu_profile.pbtxt", "w") as f:
    f.write(str(metadata))
See this issue for an example of using this technique to track down copies:
https://github.com/tensorflow/tensorflow/issues/7251#issuecomment-277385212
For prefetching to GPU, see this issue
There are new stage_op ops that have been added that allow prefetching to the GPU and are dramatically faster than the Python queue-runner approach. They are in the process of being documented.
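The stage ops themselves aren't shown here. As a point of comparison only (my substitution, not the stage_op API referenced above), the later tf.data pipeline exposes GPU prefetching like this:

import tensorflow as tf

# Stage batches onto the GPU one step ahead, so the host->device copy of
# step N+1 overlaps the compute of step N.
dataset = (tf.data.Dataset.from_tensor_slices(tf.random.uniform([256, 224, 224, 3]))
           .batch(32)
           .prefetch(1)                                              # host-side prefetch
           .apply(tf.data.experimental.prefetch_to_device('/gpu:0')))

for batch in dataset:
    _ = tf.reduce_mean(batch)  # stand-in for the real training step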

GPU memory questions

I have 3 questions about GPU memory:
Why does my application take a different amount of GPU memory on different machines (with different graphics cards)?
What happens when there is not enough memory on the GPU for my application? Can RAM be used instead? Who is responsible for this memory management?
I saw some strange behavior of GPU memory:
My application starts at 2.5/4 GB of GPU memory. When running some function, the GPU memory usage reaches the maximum (4 GB) and then immediately falls to illogical values (less than was allocated before this function).
How can this be explained?
Why does my application take a different amount of GPU memory on different machines (with different graphics cards)?
Because the GPUs are different. Code sizes, minimum runtime resource requirements, page sizes, etc can be different between GPUs, driver versions, and toolkit versions.
What happens when there is not enough memory on the GPU for my application?
That would depend entirely on your application and how it handles runtime errors. But the CUDA runtime will simply return errors.
Can RAM memory be used instead?
Possibly, if you have designed your application to use it. But automatically, no.
Who is responsible for this memory management?
You are.
I saw some strange behavior of GPU memory: My application starts at 2.5/4 GB of GPU memory. When running some function, the GPU memory usage reaches the maximum (4 GB) and then immediately falls to illogical values (less than was allocated before this function). How can this be explained?
The runtime detected an irrecoverable error (like a kernel trying to access invalid memory as the result of a prior memory allocation failure) and destroyed the CUDA context held by your application, which releases all resources on the GPU associated with your application.
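To make "possibly, if you have designed your application to use it" concrete, here is a hedged sketch (using CuPy/NumPy rather than raw CUDA) of an application-level fallback from device memory to host RAM; nothing in CUDA does this for you.

import cupy as cp
import numpy as np

def allocate_buffer(shape, dtype=np.float32):
    """Try to allocate on the GPU; fall back to host RAM if that fails."""
    try:
        return cp.zeros(shape, dtype=dtype)       # device memory
    except cp.cuda.memory.OutOfMemoryError:
        print("GPU allocation failed, falling back to system RAM")
        return np.zeros(shape, dtype=dtype)       # host memory

buf = allocate_buffer((8192, 8192))
print(type(buf))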

What is the maximum size of the texture memory on a modern GPU?

We believe that texture memory is part of global memory. Is this true? If so, how much can you allocate? (Indirectly, how much is there?)
And is it true that all multiprocessors can read from the texture memory at the same time?
Texture data is contained in CUDA arrays, and CUDA arrays are allocated out of global memory; so however much global memory is still free (you can call cuMemGetInfo() to see how much free memory is left) is available for allocation as textures.
It's impossible to know how much memory is consumed by a given CUDA array - obviously it has to be at least Width*Height*Depth*sizeof(Texel), but it may take more because the driver has to do an allocation that conforms to the hardware's alignment requirements.
Texture limits for different compute capabilities can be found in the CUDA Programming Guide, available at the NVIDIA CUDA website.
For a given device, one can query device capabilities including texture limits using the cudaGetDeviceProperties function.
Allocation depends on the amount of available global memory and its fragmentation, so there is no easy way to tell whether a given allocation will be successful or not, especially when working with large textures.
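As a rough sketch of that sizing logic: compare the lower bound Width*Height*Depth*sizeof(Texel) against the free memory reported by cuMemGetInfo. CuPy stands in for the CUDA call here, the array dimensions are made up for illustration, and the driver's alignment padding is unknown, so this is only a lower bound.

import cupy as cp
import numpy as np

# Lower bound on the footprint of a 3D CUDA array of float4 texels.
width, height, depth = 2048, 2048, 64
texel_bytes = 4 * np.dtype(np.float32).itemsize    # float4 texel

min_bytes = width * height * depth * texel_bytes
free, total = cp.cuda.runtime.memGetInfo()

print(f"minimum array footprint: {min_bytes / 2**20:.0f} MiB")
print(f"free global memory:      {free / 2**20:.0f} MiB of {total / 2**20:.0f} MiB")
# Even if min_bytes < free, the allocation can still fail because of
# alignment padding or fragmentation.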

How many memory latency cycles per memory access type in OpenCL/CUDA?

I looked through the programming guide and best practices guide and it mentioned that Global Memory access takes 400-600 cycles. I did not see much on the other memory types like texture cache, constant cache, shared memory. Registers have 0 memory latency.
I think the constant cache is the same as registers if all threads use the same address in the constant cache. I'm not so sure about the worst case.
Shared memory is the same as registers so long as there are no bank conflicts? If there are then how does the latency unfold?
What about texture cache?
For (Kepler) Tesla K20 the latencies are as follows:
Global memory: 440 clocks
Constant memory
L1: 48 clocks
L2: 120 clocks
Shared memory: 48 clocks
Texture memory
L1: 108 clocks
L2: 240 clocks
How do I know? I ran the microbenchmarks described by the authors of Demystifying GPU Microarchitecture through Microbenchmarking. They provide similar results for the older GTX 280.
This was measured on a Linux cluster; the compute node where I ran the benchmarks was not used by any other users and was not running any other processes. It is BULLX Linux with a pair of 8-core Xeons and 64 GB of RAM, using nvcc 6.5.12. I changed sm_20 to sm_35 for compiling.
There is also an operand cost chapter in the PTX ISA document, although it is not very helpful; it just reiterates what you already expect, without giving precise figures.
The latency to the shared/constant/texture memories is small and depends on which device you have. In general though GPUs are designed as a throughput architecture which means that by creating enough threads the latency to the memories, including the global memory, is hidden.
The reason the guides talk about the latency to global memory is that the latency is orders of magnitude higher than that of other memories, meaning that it is the dominant latency to be considered for optimization.
You mentioned constant cache in particular. You are quite correct that if all threads within a warp (i.e. group of 32 threads) access the same address then there is no penalty, i.e. the value is read from the cache and broadcast to all threads simultaneously. However, if threads access different addresses then the accesses must serialize since the cache can only provide one value at a time. If you're using the CUDA Profiler, then this will show up under the serialization counter.
Shared memory, unlike constant cache, can provide much higher bandwidth. Check out the CUDA Optimization talk for more details and an explanation of bank conflicts and their impact.
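As an aside on the bank-conflict point, here is a sketch of the classic padding trick in a tiled transpose, written with Numba's CUDA support rather than CUDA C (my substitution, not something from the talk above):

import numpy as np
from numba import cuda, float32

TILE = 32

@cuda.jit
def transpose_tiled(dst, src):
    # The extra +1 column of padding shifts each row into a different bank,
    # so the column-wise reads in the second phase do not serialize on
    # shared-memory bank conflicts.
    tile = cuda.shared.array(shape=(TILE, TILE + 1), dtype=float32)

    x = cuda.blockIdx.x * TILE + cuda.threadIdx.x
    y = cuda.blockIdx.y * TILE + cuda.threadIdx.y
    if x < src.shape[1] and y < src.shape[0]:
        tile[cuda.threadIdx.y, cuda.threadIdx.x] = src[y, x]
    cuda.syncthreads()

    x = cuda.blockIdx.y * TILE + cuda.threadIdx.x
    y = cuda.blockIdx.x * TILE + cuda.threadIdx.y
    if x < dst.shape[1] and y < dst.shape[0]:
        dst[y, x] = tile[cuda.threadIdx.x, cuda.threadIdx.y]

a = np.random.rand(1024, 1024).astype(np.float32)
out = np.zeros_like(a)
transpose_tiled[(32, 32), (TILE, TILE)](out, a)
print(np.allclose(out, a.T))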
