GPU out of memory error message on Google Colab

I'm using a GPU on Google Colab to run some deep learning code.
I have got 70% of the way through the training, but now I keep getting the following error:
RuntimeError: CUDA out of memory. Tried to allocate 2.56 GiB (GPU 0; 15.90 GiB total capacity; 10.38 GiB already allocated; 1.83 GiB free; 2.99 GiB cached)
I'm trying to understand what this means. Is it talking about RAM? If so, the code should just run the same as it has been doing, shouldn't it? When I try to restart it, the memory message appears immediately. Why would it be using more RAM when I start it today than it did when I started it yesterday or the day before?
Or is this message about hard disk space? I could understand that because the code saves things as it goes on and so the hard disk usage would be cumulative.
Any help would be much appreciated.
So if it's just the GPU running out of memory, could someone explain why the error message says 10.38 GiB already allocated? How can there be memory already allocated when I start to run something? Could it be being used by someone else? Do I just need to wait and try again later?
Here is a screenshot of the GPU usage when I run the code, just before it runs out of memory:
I found this post in which people seem to be having similar problems. When I run the code suggested in that thread I see:
Gen RAM Free: 12.6 GB | Proc size: 188.8 MB
GPU RAM Free: 16280MB | Used: 0MB | Util 0% | Total 16280MB
which seems to suggest there is 16 GB of GPU RAM free.
I'm confused.

You are running out of memory on the GPU. If you are running Python code, try running this snippet before yours. It will show the amount of memory you have. Note that if you try to load images bigger than the total GPU memory, it will fail.
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import os

import GPUtil as GPU
import humanize
import psutil

GPUs = GPU.getGPUs()
# XXX: only one GPU on Colab, and even that isn't guaranteed
gpu = GPUs[0]

def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
          " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(
        gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil * 100, gpu.memoryTotal))

printm()

Try reducing your batch size to 8 or 16. It worked for me.
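For reference, here is a minimal sketch of what that change looks like with a PyTorch DataLoader; the toy dataset and the exact batch size are placeholders, not taken from the question:

import torch
from torch.utils.data import DataLoader, TensorDataset

# toy dataset standing in for the real training data
train_dataset = TensorDataset(torch.randn(256, 3, 64, 64),
                              torch.randint(0, 10, (256,)))

# batch_size is the knob to turn down: try 32 -> 16 -> 8 until the OOM error goes away
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)

for images, labels in train_loader:
    pass  # the usual training step runs here, one (smaller) batch at a time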

Google Colab resource allocation is dynamic and based on users' past usage. If one user has been consuming a lot of resources recently and another user uses Colab less frequently, the less frequent user will be given relatively more preference in resource allocation.
Hence, to get the most out of Colab, close all your other Colab tabs and all other active sessions, and restart the runtime for the one you want to use. You'll definitely get a better GPU allocation.
If you are training a neural network and still face the same issue, try reducing the batch size too.

Just as an answer for other people using Google Colab: I had this problem often when I used it for my deep learning class. I started paying for Google Colab and it immediately started allowing me to run my code. That does not stop the problem completely, however. I started using Google Colab for my research and hit this error again! Researching on Google Colab's website, I found that there are GPU usage limits even for people who pay for Colab. To test this I tried using a secondary Gmail account I rarely use. Sure enough, it ran perfectly...
So in short: share your code with a secondary email or set up a new email account, and sign in to Colab with the secondary account. If that works for any of you, comment below so people are aware of this. I found it super frustrating and lost a lot of time to this error.

I was attempting to use the trained model to predict on the test dataset (~17,000 entries) when the CUDA out of memory error appeared.
Reducing the batch size from 32 to 4 didn't work for me; I could see that the memory required to run the operation was not decreasing with the change in batch size.
What worked for me was splitting the test dataset into smaller chunks, and then merging the predicted output back into a combined dataframe afterwards.
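For anyone who wants to try the same approach, below is a rough sketch of chunked prediction, assuming a PyTorch model and pandas for the combined output; the stand-in model, test tensor and chunk_size are placeholders, not the original code:

import torch
import torch.nn as nn
import pandas as pd

# stand-ins for the real trained model and test set
model = nn.Linear(10, 2).cuda()
test_tensor = torch.randn(17000, 10)

chunk_size = 1000  # tune so that each chunk fits in GPU memory
predictions = []

model.eval()
with torch.no_grad():
    for start in range(0, len(test_tensor), chunk_size):
        chunk = test_tensor[start:start + chunk_size].cuda()
        out = model(chunk)
        # move results off the GPU immediately so they don't accumulate there
        predictions.append(pd.DataFrame(out.cpu().numpy()))

combined = pd.concat(predictions, ignore_index=True)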

There are a few techniques to tackle this problem:
reduce the batch size; for example, if you have 1000, reduce it to 700 or 500, then restart the runtime
go to Runtime -> Factory reset runtime
reduce num_workers in your data loader

I got this after running a few training sessions on my notebook, so I assumed something's staying too long in memory.
import gc
gc.collect()
That solved it, although for some reason I sometimes had to wait a few seconds after running GC.

My woes were caused by retaining my loss on the GPU and appending it to a list. (That probably caused torch to keep the whole graph intact, and it took only a few batches to consume all available GPU RAM.) For example, when you save the loss of the model, make sure to do:
epoch_losses.append(loss.item())
rather than
epoch_losses.append(loss)
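In a full training loop the difference looks roughly like the sketch below; the tiny model, loss, optimizer and data are placeholders, and only the last two lines matter:

import torch
import torch.nn as nn

# tiny stand-in model and data
model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
batches = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(3)]

epoch_losses = []
for inputs, targets in batches:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    # .item() extracts a plain Python float, so the graph behind `loss`
    # can be freed after each batch
    epoch_losses.append(loss.item())
    # epoch_losses.append(loss)  # would keep every batch's graph alive and grow memory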

Related

CUDA out of memory runtime error, any way to delete PyTorch "reserved memory"

Like many others, I'm getting a runtime error of CUDA out of memory, but for some reason PyTorch has reserved a large amount of it.
RuntimeError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 6.00 GiB total capacity; 4.31 GiB already allocated; 844.80 KiB free; 4.71 GiB reserved in total by PyTorch)
I've tried torch.cuda.empty_cache(), but this isn't working, and none of the other CUDA out of memory posts have helped me either.
When I check my GPU usage (nvidia-smi) before running my Python program, it is plenty free.
From the given description it seems that the problem is not memory allocated by PyTorch before execution, but that CUDA ran out of memory while allocating the data: the 4.31 GiB was already allocated (not cached), and the final 2 MiB block failed to allocate.
A possible solution that already worked for me is to decrease the batch size. Hope that helps!
You should find processes by typing,
!ps -aux|grep python
and then when you find the process you should kill it by typing,
!kill -9 652
This will kill process number 652. In your case the number will be different: use the PID of the process you want to get rid of.
NOTE: Remember you will have to start your code over if you kill a process that you should not have ended. But this is the easiest manual way to do it.
Another note: you can always decrease the batch size if the problem happens again after successfully emptying the GPU cache.
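If the memory is held by your own notebook rather than a stray process, a gentler option than killing it is to drop the references and clear PyTorch's cache, combining the gc.collect() and torch.cuda.empty_cache() calls mentioned elsewhere on this page. A rough sketch, where model and batch are placeholder names for whatever is occupying the GPU (assumes a CUDA runtime is available):

import gc
import torch

# placeholder objects standing in for whatever is holding GPU memory
model = torch.nn.Linear(10, 10).cuda()
batch = torch.randn(256, 10, device="cuda")

del model, batch          # drop the Python references first
gc.collect()              # let the garbage collector reclaim the tensors
torch.cuda.empty_cache()  # return PyTorch's cached blocks to the driver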

cudaMallocPitch fails when multiple GPUs are controlled by separate CPU processes, despite enough memory being available

I'm getting an 'out of memory' error from the cudaMallocPitch API with GeForce GTX 1080 Ti and/or GeForce GTX 1080 GPUs, which are part of a PC server that includes 4 GPUs (one 1080 Ti and three 1080s) and two CPUs.
Each GPU is controlled by a dedicated CPU thread, which calls cudaSetDevice with the right device index at the beginning of its run.
Based on information in a configuration file, the application knows how many CPU threads to create.
I can also run my application several times as separate processes, each one controlling a different GPU.
I'm using OpenCV version 3.2 to perform image background subtraction.
First, you create the BackgroundSubtractorMOG2 object using cv::cuda::createBackgroundSubtractorMOG2, and after that you call its apply method.
The first time the apply method is called, all required memory is allocated once.
My image size is 10000 columns by 7096 rows. Each pixel is 1 byte (grayscale).
When I run my application as one process with several threads (one per GPU), everything works fine, but when I run it 4 times as separate processes (one per GPU), the OpenCV apply function starts to fail due to a cudaMallocPitch 'not enough memory' failure.
For all GPUs I verified that I have enough available memory before apply was called for the first time. For the 1080 about 5.5 GB is reported free, for the 1080 Ti about 8.3 GB, and the requested size is: width 120000 bytes, height 21288 rows, i.e. ~2.4 GB.
Please advise.
The source of the problem was found:
The cudaMallocPitch API returned cudaErrorMemoryAllocation because there was no available OS virtual memory, which is used by the OS when the process performs read/write accesses to GPU physical memory.
Because of that, the CUDA driver fails any kind of GPU physical memory allocation.
The tricky part was to figure out why this API fails while enough GPU physical memory exists (checked with the cudaMemGetInfo API).
I started to analyze two points:
Why don't I have enough virtual memory on my PC?
By following the instructions in the link below, I changed its size and the problem disappeared:
https://www.online-tech-tips.com/computer-tips/simple-ways-to-increase-your-computers-performace-configuring-the-paging-file/
Why does my process consume so much OS virtual memory?
In the past I figured out that, to get better performance during processing, I should allocate all required GPU physical memory only once at the beginning, because an allocation operation takes a lot of time depending on the required memory size.
Because I'm working with a frame resolution of ~70 MB and my processing logic requires a huge number of auxiliary buffers, massive GPU and CPU memory areas had to be allocated, which exhausted the available OS virtual memory.

Limiting Dr. Racket's memory

I increased Dr. Racket's memory a week ago, and now I want to reduce it to the same amount as before. So I limited it back to 128 MB. But that has no effect... it is always consuming much more than 128 MB...
It's really a problem because it causes my computer to overheat.
Does someone know how I can limit Dr. Racket so that it doesn't exceed 128 MB?
Here's a screenshot of the problem:
There is a difference between the memory used by a program and the memory used in total by DrRacket. When I start up DrRacket, before entering or running any program, I see that DrRacket uses 250 MB. The interactions window states that I have limited memory to 128 MB too, so that particular program cannot go beyond those bounds, but there are features of DrRacket that use a lot more memory on your machine than on mine.
I went into the settings and removed some features I don't use (like Algol 60). After restarting, DrRacket used 50 MB less memory, which confirms that the memory is used by DrRacket itself and not by programs.
For a particularly complex program I guess background expansion might use a lot of memory. Perhaps you can turn that off as well to see if the currently used memory goes down.
About heat
As Óscar mentioned, memory usage has little to do with heat as long as swap isn't being used (heavy disk usage). Heat has to do with CPU usage. When doing calculations, the OS will make resources available and perhaps increase the CPU frequency, which increases the heat.
If you are making a threaded application that has loops waiting for tasks, make sure you are not busy-waiting in an active loop. Sleeping might reduce activity, and perhaps Racket has better approaches (I've never done threaded apps in Racket).
If you are calculating something, the increase in CPU usage is natural; it's so that you get the answer sooner. Computer settings can be changed to favor battery life. Check both the OS and the BIOS. (That makes this not a Racket issue.)
The memory shown in the Dr Racket status bar is N/A.
Experiment:
Choose Racket | Limit Memory and specify 8 MB (the minimum).
Choose File | New Tab.
In the Interactions pane allocate 8 MB of memory. For example enter (define x (make-bytes (* 8 1024 1024))). (I recommend assigning the result to a variable, like this, because I doubt you want Dr Racket to print 8 MB of bytes.)
The result I get:
Welcome to DrRacket, version 6.1.1.6--2014-12-21(aabe9d7/a) [3m].
Language: racket [custom]; memory limit: 8 MB.
> (define x (make-bytes (* 8 1024 1024)))
out of memory
>
Assuming you get the same result, there is some other reason your computer is running hotter.
I don't think that the extra memory being consumed is the cause of your computer overheating. More likely, it's because some function is consuming the CPU. Try to optimize the code instead.
In fact, by limiting the available memory you might end up causing more disk paging, hence slowing things down and potentially consuming more CPU … and causing more overheating.

cudaSetDevice() allocates more than 580 MB of global memory

I have a sophisticated CUDA-based Linux application. It runs on an i7 machine with one NVIDIA GTX 560 Ti card (1 GB memory), using Ubuntu 12.04 (x86_64) and NVIDIA driver 295.41 + CUDA 4.2 Toolkit.
The application requires about 600-700 MB of global memory in GPU, and it fails to run due to "out of memory" error on calls to cudaMalloc().
After some debugging, I found that the first call to cudaSetDevice() at the very beginning of the application allocates about 580 MB of global memory at once, and the available memory for the rest of application is only 433 MB.
The CUDA reference manual says that it initializes a "primary context" for the device and allocates various resources such as CUDA kernels (called "module" in the driver API) and constant variables. The application has some __device__ __constant__ variables but the total amount of them is just a few KB. There are about 20-30 kernels and device functions.
I have no idea why CUDA allocates such a large amount of GPU memory during initialization.
A separate minimal program that does only cudaSetDevice(0); cudaMemGetInfo(&a, &t); printf("%ld, %ld\n", a, t); shows about 980 MB of available memory. So the problem should reside in my application, but I could not figure out what causes such a large memory allocation, because the implementation details of cudaSetDevice() are completely proprietary.
Could I get some other ideas?
I presume that cudaSetDevice is the first CUDA call you are making in your application; as a CUDA developer you should know that the first CUDA call is very expensive, because CUDA first allocates its components on the graphics card, which is around 500 MB.
Try starting your program with another CUDA command, e.g. cudaMalloc; you'll see the same amount of allocation by CUDA. You can also run deviceQuery from the CUDA Samples to see how much memory is in use.
It sounds like an issue; would you like to file a bug with NVIDIA? The steps are:
1. Open page http://developer.nvidia.com/cuda/join-cuda-registered-developer-program;
2. If not registered, please click "Join Now", otherwise click "Login Now";
3. Input e-mail and password to login;
4. On the left panel, there is a "Bug Report" item in Home section, click it to file a bug;
5. Fill in the required items; other items are optional, but detailed information will help a lot in targeting and fixing the issue;
6. If necessary, an attachment should be uploaded;
7. For Linux system, it is better to attach an nvidia-bug-report;
8. If an issue is related to specific code pattern, a sample code and instructions to compile it are desired for reproduction.
I had a similar problem, where the first call to any cudaXXX() function caused the reported VmData (UNIX) to spike massively, sometimes to tens of GB. This is not a bug, and the reason is given here:
Why does the Cuda runtime reserve 80 GiB virtual memory upon initialization?

Huge jump in memory use after calling clGetPlatformIDs

I'm in the process of learning something about OpenCL, and am having what I hope is not a unique problem (I found nothing on Google, but...). When I call:
clGetPlatformIDs
from my host program, I see a sudden increase in the 'VIRT' memory usage reported by 'top', to about 45 GB. The values for resident and shared memory don't change noticeably, and I'm not completely sure what top is reporting here. However, if I repeatedly call a function that runs OpenCL commands, I see some fluctuation in the 'VIRT' memory usage, until OpenCL calls fail with CL_OUT_OF_HOST_MEMORY. I have 32 GB of memory, so this seems a bit absurd.
I see this in some code (C++) that performs maximum intensity projections on image stacks, but I see exactly the same behaviour in code I took from Erik Smistad's blog.
http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/
Running that example through GDB, the first call to OpenCL functions has the same effect as in my code:
cl_platform_id platform_id = NULL;
cl_uint ret_num_platforms;
cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
VIRT memory jumps massively (again to about 45 GB).
Since I haven't seen anything like this anywhere, I suspect that there may be something funny about my setup:
openSUSE 12.1
GeForce GTX 560Ti 1024 MB
nvidia-computeG02-295.49-17.1.x86_64
but... the CUDA toolkit for openSUSE 11.2 was downloaded from NVIDIA, and it may expect driver version 295.41 rather than the 295.49 installed with openSUSE.
I'm hoping someone here has seen a similar problem and has some idea as to what's going on, or some idea as to where to look. I'd very much like to work this out as apart from this issue it's working pretty nicely.
