I'm currently running a CNN on 3D medical images with TensorFlow-GPU. Whenever I run the code, a resource exhausted error appears in the command prompt. I have already tried running the code with a batch size of 1 (one patient at a time).
My GPU is an NVIDIA GeForce GTX 960. Looking at its specifications, I'm not sure which component limits the available memory. Is it the Standard Memory Config (2 GB)?
The command prompt returns the following:
2017-06-11 16:23:37.095587: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\framework\op_kernel.cc:1152] Resource exhausted: OOM when allocating tensor with shape[3,3,3,8,16]
2017-06-11 16:23:47.096178: W c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 365.63MiB. Current allocation summary follows.
2017-06-11 16:23:47.096349: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:643] Bin (256): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
......
2017-06-11 16:23:47.144036: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:696] 3 Chunks of size 13824 totalling 40.5KiB
2017-06-11 16:23:47.144745: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:696] 3 Chunks of size 383385600 totalling 1.07GiB
2017-06-11 16:23:47.145486: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:696] 1 Chunks of size 398141184 totalling 379.70MiB
2017-06-11 16:23:47.146771: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:700] Sum Total of in-use chunks: 1.44GiB
2017-06-11 16:23:47.146796: I c:\tf_jenkins\home\workspace\release-win\device\gpu\os\windows\tensorflow\core\common_runtime\bfc_allocator.cc:702]
Stats:
Limit: 1548396134
InUse: 1548374272
MaxInUse: 1548396032
NumAllocs: 35
MaxAllocSize: 398141184
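The Limit of 1548396134 bytes (about 1.44 GiB) in the stats above is roughly what the 2 GB card has left after the display driver and other processes take their share, so yes, the Standard Memory Config (2 GB) is the limiting specification here, not the bus width or clocks. As a minimal sketch (assuming TensorFlow 1.x, which matches the 2017 log paths), you can let the allocator grow on demand so the OOM points at the op that actually overflows the card:

import tensorflow as tf  # assumes TensorFlow 1.x, matching the log above

# Ask the BFC allocator to grow on demand instead of reserving most of the card up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
# Alternatively, cap the fraction of GPU memory TensorFlow may claim:
# config.gpu_options.per_process_gpu_memory_fraction = 0.7

# Tiny stand-in graph just to show the session setup; substitute the real model here.
a = tf.random_normal([1024, 1024])
b = tf.matmul(a, a)

with tf.Session(config=config) as sess:
    print(sess.run(tf.reduce_sum(b)))

This does not create memory that isn't there: a network whose weights and activations need more than roughly 1.5 GiB will still OOM, but the failure will point at the layer that tips it over, which is where to shrink the 3D volume size, the number of filters, or move to patch-based training.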
Related
NVIDIA specifies that their GeForce RTX 3090 (Ti) has 24 GB of memory. How are you supposed to know how much data you can fit on it, when some sources use 1 GB = 1,024^3 bytes, while other sources use 1 GB = 1,000^3 bytes? Can you assume that hardware manufacturers always use base 1,000, since that lets them write a higher number in the specifications, or do some hardware manufacturers still use base 1,024?
I would expect 1 GB = 1,024^3 bytes, as I have only encountered 1 GB = 1,000^3 bytes in the context of storage, such as SSDs.
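For what it's worth, graphics memory is built from power-of-two-sized chips, so in practice the RTX 3090's 24 GB is 24 GiB (nvidia-smi reports it as 24576 MiB), while drive vendors use decimal GB. Either way, the gap between the two readings of "24 GB" is easy to quantify with plain arithmetic (no assumptions beyond the two definitions above):

# Compare the two interpretations of "24 GB" of GPU memory.
GB_DECIMAL = 1000 ** 3   # 1 GB  = 1,000^3 bytes (SI, used for SSD/HDD capacities)
GIB_BINARY = 1024 ** 3   # 1 GiB = 1,024^3 bytes (binary, used by tools like nvidia-smi)

capacity = 24
decimal_bytes = capacity * GB_DECIMAL
binary_bytes = capacity * GIB_BINARY

print(f"24 GB  (decimal) = {decimal_bytes:>14,} bytes = {decimal_bytes / GIB_BINARY:.2f} GiB")
print(f"24 GiB (binary)  = {binary_bytes:>14,} bytes")
print(f"difference       = {(binary_bytes - decimal_bytes) / GIB_BINARY:.2f} GiB")

The difference works out to about 1.65 GiB, which is worth knowing when you are budgeting model and activation memory close to the card's limit.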
I am trying to train a CNN for video frame prediction. My images are large (10 * 480 * 1440 * 3). I want to know whether the number of samples I use for training affects GPU memory use, or whether only the batch size (plus the network parameters) needs to fit into GPU memory.
The problem is that when I load 100 samples for training with batch_size = 1, I can train the model. However, when I increase the number of samples to 200, I run out of GPU memory.
My machine configuration is:
GPU: NVIDIA A100 with 40 GB of memory
System memory: 1008 GB
I would appreciate any suggestions for solving this issue.
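In general only the current batch, the parameters, the activations, and the optimizer state have to fit in GPU memory; the number of samples matters only if the whole dataset is copied to the GPU up front. A back-of-the-envelope check, assuming float32 inputs, shows how quickly that adds up:

# Rough memory estimate if the whole training set is staged on the GPU as float32.
# The per-sample shape is taken from the question: 10 x 480 x 1440 x 3.
frames, height, width, channels = 10, 480, 1440, 3
bytes_per_sample = frames * height * width * channels * 4  # float32 = 4 bytes/element

for n_samples in (100, 200):
    print(f"{n_samples} samples -> "
          f"{n_samples * bytes_per_sample / 1024**3:.1f} GiB for the raw inputs alone")

# Prints roughly 7.7 GiB for 100 samples and 15.4 GiB for 200 samples, before the
# model parameters, activations, and optimizer state are counted.

If the samples are instead streamed from the 1 TB of system RAM, for example through a generator, tf.data, or a PyTorch DataLoader, only one batch at a time lives on the GPU and the total sample count stops affecting GPU memory.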
I'm trying to train a model meant for long text classification. I'm using this repo: https://github.com/franbvalero/BERT-long-sentence-classification/tree/optimizer_lstm
But I modified it to work with a different dataset (i.e. I changed the inputs, not anything directly related to the training). I'm also using a different BERT model trained for another language.
The problem I'm facing is that going lower in batch size increases the amount of memory PyTorch reserves. Using a batch size of 64 results in the following error message:
RuntimeError: CUDA out of memory. Tried to allocate 20.95 GiB (GPU 0; 8.00 GiB total capacity; 700.89 MiB already allocated; 5.68 GiB free; 756.00 MiB reserved in total by PyTorch)
Definitely too high for the capacity available to me; I try 32 instead.
RuntimeError: CUDA out of memory. Tried to allocate 10.47 GiB (GPU 0; 8.00 GiB total capacity; 602.89 MiB already allocated; 5.77 GiB free; 658.00 MiB reserved in total by PyTorch)
Next, a batch size of 16:
RuntimeError: CUDA out of memory. Tried to allocate 5.24 GiB (GPU 0; 8.00 GiB total capacity; 5.78 GiB already allocated; 596.27 MiB free; 5.83 GiB reserved in total by PyTorch)
At this point the total reserved by PyTorch has skyrocketed, and this trend continues all the way down to a batch size of 1, where it tries to allocate only a few hundred MiB but PyTorch reserves more than 6 GiB.
RuntimeError: CUDA out of memory. Tried to allocate 336.00 MiB (GPU 0; 8.00 GiB total capacity; 6.20 GiB already allocated; 138.27 MiB free; 6.26 GiB reserved in total by PyTorch)
I'm a little confused as to why this happens. Lowering the batch size does reduce the size of the allocation that is attempted, but it also increases the amount that PyTorch reserves.
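Part of the answer is visible in the numbers themselves: the failing allocation is almost exactly 0.33 GiB per sample (20.95/64, 10.47/32, 5.24/16, and 0.33/1 all agree), so one tensor in the forward pass scales linearly with batch size and, at batch size 1, still does not fit once everything else is resident on the 8 GiB card. The "reserved" figure grows at small batch sizes plausibly because the forward pass gets further before that allocation fails, so more activations are already held by the caching allocator at the moment of the error. A small diagnostic sketch (assuming a recent PyTorch; the variable names in the commented usage are placeholders, not the repo's actual names):

import torch

def report(tag):
    """Print what the caching allocator has actually handed out vs. set aside."""
    alloc = torch.cuda.memory_allocated() / 1024 ** 2
    reserved = torch.cuda.memory_reserved() / 1024 ** 2
    print(f"{tag}: allocated={alloc:.0f} MiB, reserved={reserved:.0f} MiB")

# Example usage around the suspect step:
# report("before forward")
# logits = model(input_ids, attention_mask)
# report("after forward")
# loss = criterion(logits, labels)
# loss.backward()
# report("after backward")
#
# torch.cuda.memory_summary() gives a full per-size breakdown if more detail is needed.

If the roughly 0.33 GiB-per-sample tensor is what keeps failing, shrinking the batch alone will never be enough; the per-sample activation for these long sequences is what has to get smaller (shorter or chunked inputs, or a smaller model).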
When I run docker stats, I see that CPU usage is greater than 100% most of the time. I have a machine with 8 cores. So does the output below mean that 100% CPU corresponds to one core being fully occupied, and that 690% means close to 7 cores are fully occupied?
d99e067cfffc 690.00% 5.517 GiB / 12.7 GiB 43.46% 1.47 GB / 1.03 GB 9.15 MB / 0 B 338
Exactly as you stated. You can have up to N * 100% CPU usage, where N is the number of cores you have.
By the way, you can run the container with the --cpus <your_num> flag to limit how many CPU cores it may use, if you like.
More details in the official docs.
I have an NVIDIA GTX 570 graphics card running on an Ubuntu 10.10 system with CUDA 4.0.
I know that for performance we need to access memory efficiently, and use registers and shared memory on the device cleverly.
However, I don't understand how to calculate the number of registers available per thread, or how much shared memory a single block can use, or other such simple but important quantities for a particular kernel configuration.
I want to understand this through an explicit example.
Incidentally, I am currently trying to write a particle code, in which one of the kernels should look like this.
Each block is a 1-D collection of threads, and each grid is a 1-D collection of blocks.
Number of blocks : 16384
Number of threads per block : 32 ( => total threads 32*16384 = 524288)
Each thread block is given a 32 x 32 two-dimensional integer array of shared memory to work with.
Within a thread I would like to store some numbers of type double, but I am not sure how many such doubles I can store without register spilling into local memory (which resides in device memory). Can someone tell me how many doubles can be stored per thread for this kernel configuration?
Also, is the above-mentioned shared-memory configuration valid for each of my blocks?
A sample computation showing how one would go about deducing these things would be very illustrative and helpful.
Here is the information about my GTX 570 (using deviceQuery from the CUDA SDK):
[deviceQuery] starting...
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "GeForce GTX 570"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 1279 MBytes (1341325312 bytes)
(15) Multiprocessors x (32) CUDA Cores/MP: 480 CUDA Cores
GPU Clock Speed: 1.46 GHz
Memory Clock rate: 1900.00 Mhz
Memory Bus Width: 320-bit
L2 Cache Size: 655360 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 2 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = GeForce GTX 570
[deviceQuery] test results...
PASSED
Press ENTER to exit...
So, choosing a kernel configuration is a little involved. You should use the CUDA Occupancy Calculator, and you also have to study how warps work. Once a block is assigned to an SM, it is further divided into 32-thread units called warps; the warp is the unit of thread scheduling in an SM. You can calculate the number of warps that reside in an SM for a given block size and a given number of blocks assigned to each SM. In your case a warp consists of 32 threads, so a block with 256 threads would contain 8 warps, while your 32-thread blocks contain only one warp each. Choosing the right kernel settings depends on your data and operations; remember that you want to occupy each SM as fully as possible, that is, reach the full thread capacity of each SM and keep the maximum number of warps available for scheduling around long-latency operations. Another important point is not to exceed the hardware limits, such as the maximum of 1024 threads per block in your case.
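To make that concrete for the configuration in the question, here is a rough worked computation in plain Python. The limits are the compute capability 2.x figures (consistent with the deviceQuery output above); the real per-thread register usage of a compiled kernel has to be read from nvcc --ptxas-options=-v, so treat the register number as a budget, not a measurement.

# Back-of-the-envelope resource check for the configuration in the question,
# using compute capability 2.0 (GTX 570) limits.

# Hardware limits for compute capability 2.x
REGS_PER_SM          = 32768   # 32-bit registers per SM (deviceQuery reports this as "per block")
MAX_REGS_PER_THREAD  = 63      # compiler cap on compute 2.x
SHARED_MEM_PER_BLOCK = 49152   # bytes (48 KiB configuration)
MAX_BLOCKS_PER_SM    = 8       # maximum resident blocks per SM
MAX_THREADS_PER_SM   = 1536    # maximum resident threads per SM

# Kernel configuration from the question
threads_per_block = 32
shared_bytes = 32 * 32 * 4     # 32 x 32 int array, 4 bytes per int

print(f"shared memory per block: {shared_bytes} B "
      f"({'OK' if shared_bytes <= SHARED_MEM_PER_BLOCK else 'too large'}, "
      f"limit {SHARED_MEM_PER_BLOCK} B)")

# Resident blocks per SM: here the 8-block cap binds long before shared memory does.
blocks_per_sm = min(MAX_BLOCKS_PER_SM,
                    SHARED_MEM_PER_BLOCK // shared_bytes,
                    MAX_THREADS_PER_SM // threads_per_block)
threads_per_sm = blocks_per_sm * threads_per_block

regs_per_thread = min(REGS_PER_SM // threads_per_sm, MAX_REGS_PER_THREAD)
doubles_per_thread = regs_per_thread // 2   # a double occupies two 32-bit registers

print(f"resident blocks/SM: {blocks_per_sm}, threads/SM: {threads_per_sm} "
      f"(occupancy {threads_per_sm / MAX_THREADS_PER_SM:.0%})")
print(f"register budget: ~{regs_per_thread} registers/thread "
      f"-> roughly {doubles_per_thread} doubles before spilling")

So the 4 KiB shared array is comfortably valid, and the register budget allows roughly 31 doubles per thread before the compute 2.x cap of 63 registers per thread forces spilling; in practice the compiler also needs registers for indices and temporaries, so the usable count is lower, which is why the --ptxas-options=-v output is the number to trust. The bigger issue is occupancy: 32-thread blocks leave each SM at about 17% of its thread capacity, so larger blocks (128 or 256 threads) are usually worth trying in the occupancy calculator.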