Huge jump in memory use after calling clGetPlatformIDs - memory

I'm in the process of learning OpenCL, and am having what I hope is not a unique problem (Google turned up nothing, but...). When I call:
clGetPlatformIDs
from my host program, I see a sudden increase in the 'VIRT' memory usage reported by top, to about 45 GB. The values for resident and shared memory don't change noticeably, and I'm not completely sure what top is reporting here. However, if I repeatedly call a function that runs OpenCL commands, I see some fluctuation in the 'VIRT' usage until OpenCL calls fail with CL_OUT_OF_HOST_MEMORY. I have 32 GB of memory, so this seems a bit absurd.
I see this in some C++ code of mine that performs maximum intensity projections on image stacks, but I see exactly the same behaviour in code I took from Erik Smistad's blog:
http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/
Running that example through GDB, the first call to OpenCL functions has the same effect as in my code:
cl_platform_id platform_id = NULL;
cl_uint ret_num_platforms;
cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
VIRT memory jumps massively (again to about 45 GB).
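(Editorial note: top's VIRT column counts reserved virtual address space, not resident RAM, so a huge VIRT value on its own is usually harmless. A hedged sketch, not part of the original question, for comparing the two on Linux by reading /proc:)

```python
# Compare virtual address space (VmSize) with resident memory (VmRSS)
# for the current process on Linux. A driver's large address-space
# reservation shows up in VmSize only; VmRSS stays near actual usage.
def mem_status(pid="self"):
    fields = {}
    with open("/proc/%s/status" % pid) as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split(":", 1)
                fields[key] = value.strip()
    return fields

print(mem_status())  # e.g. {'VmSize': '... kB', 'VmRSS': '... kB'}
```

Calling this before and after clGetPlatformIDs should show VmSize jumping while VmRSS barely moves.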
Since I haven't seen anything like this anywhere, I suspect there may be something funny about my setup:
openSUSE 12.1
GeForce GTX 560Ti 1024 MB
nvidia-computeG02-295.49-17.1.x86_64
but... the CUDA toolkit for openSUSE 11.2 downloaded from NVIDIA may expect driver version 295.41 rather than the 295.49 installed with openSUSE.
I'm hoping someone here has seen a similar problem and has some idea of what's going on, or of where to look. I'd very much like to work this out, as apart from this issue it's working pretty nicely.

Related

GPU out of memory error message on Google Colab

I'm using a GPU on Google Colab to run some deep learning code.
I have got 70% of the way through the training, but now I keep getting the following error:
RuntimeError: CUDA out of memory. Tried to allocate 2.56 GiB (GPU 0; 15.90 GiB total capacity; 10.38 GiB already allocated; 1.83 GiB free; 2.99 GiB cached)
I'm trying to understand what this means. Is it talking about RAM? If so, the code should just run the same as it has been doing, shouldn't it? When I try to restart it, the memory message appears immediately. Why would it be using more RAM when I start it today than it did when I started it yesterday or the day before?
Or is this message about hard disk space? I could understand that because the code saves things as it goes on and so the hard disk usage would be cumulative.
Any help would be much appreciated.
So if it's just the GPU running out of memory: could someone explain why the error message says 10.38 GiB already allocated? How can there be memory already allocated when I start to run something? Could it be being used by someone else? Do I just need to wait and try again later?
Here is a screenshot of the GPU usage when I run the code, just before it runs out of memory:
I found this post in which people seem to be having similar problems. When I run a code suggested on that thread I see:
Gen RAM Free: 12.6 GB | Proc size: 188.8 MB
GPU RAM Free: 16280MB | Used: 0MB | Util 0% | Total 16280MB
which seems to suggest there is 16 GB of RAM free.
I'm confused.
You are running out of memory on the GPU. If you are running Python code, try running this code before yours; it will show the amount of memory you have. Note that if you try to load images bigger than the total memory, it will fail.
# memory footprint support libraries/code
!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU

GPUs = GPU.getGPUs()
# XXX: assumes a single GPU on Colab, which isn't guaranteed
gpu = GPUs[0]

def printm():
    process = psutil.Process(os.getpid())
    print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available),
          " | Proc size: " + humanize.naturalsize(process.memory_info().rss))
    print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(
        gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil * 100, gpu.memoryTotal))

printm()
Try reducing your batch size to 8 or 16; it worked for me.
Google Colab resource allocation is dynamic, based on users' past usage. If one user has been consuming more resources recently while another uses Colab less frequently, the latter will be given relatively more preference in resource allocation.
Hence, to get the most out of Colab, close all your Colab tabs and all other active sessions, then restart the runtime for the one you want to use. You'll definitely get a better GPU allocation.
If you are training a NN and still face the same issue, try reducing the batch size too.
Just as an answer to other people using Google Colab: I had this problem often when I used it for my deep learning class. I started paying for Google Colab and it immediately started allowing me to run my code. This, however, does not stop the problem completely. I started using Google Colab for my research and hit this error again! I did some research on Google Colab's website and found that there are GPU usage limits even for people who pay for Colab. To test this, I tried using a secondary Gmail account I rarely use. Sure enough, it ran perfectly...
So in short. Share your code with a secondary email or set up a new email account. Sign into Colab with the secondary account. If that works for any of you, comment below so people are aware of this. I found it super frustrating and lost a lot of time to this error.
I was attempting to use the trained model to predict the test dataset (~17,000 entries) when CUDA out of memory error appeared.
Reducing the batch size from 32 to 4 didn't work for me; I could see that the memory required to run the operation was not decreasing with the change in batch size.
What worked for me was splitting the test dataset into smaller chunks, then merging the predicted outputs back into a combined dataframe afterwards.
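The chunking approach can be sketched like this (a hedged illustration: predict and the toy dataset are placeholders, not the poster's actual model or data):

```python
# Run inference in fixed-size chunks so each chunk fits in GPU memory,
# then merge the per-chunk outputs into one combined result.
def predict_in_chunks(predict, dataset, chunk_size):
    results = []
    for start in range(0, len(dataset), chunk_size):
        chunk = dataset[start:start + chunk_size]
        results.extend(predict(chunk))  # only one chunk is in flight at a time
    return results

# usage: a stand-in "model" (doubling) over ~17,000 entries, as in the post
preds = predict_in_chunks(lambda xs: [x * 2 for x in xs], list(range(17000)), 4000)
```

With real data the merge step would typically be pandas.concat over per-chunk dataframes instead of list.extend.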
There are a few techniques to tackle this problem:
reduce the batch size; say, if you have 1000, reduce it to 700 or 500, then restart the runtime
go to Runtime -> Factory reset runtime
reduce num_workers
I got this after running a few training sessions in my notebook, so I assumed something was staying in memory too long.
import gc
gc.collect()
That solved it, although for some reason I sometimes had to wait a few seconds after running GC.
My woes were caused by retaining my loss on the GPU and appending it to a list. (That probably caused torch to keep the whole graph intact, and it took only a few batches to consume all available GPU RAM.) For example, when you save the loss of the model, make sure to do:
epoch_losses.append(loss.item())
rather than
epoch_losses.append(loss)
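Why this matters can be shown without torch at all (a hedged analogy: FakeLoss is a stand-in for a loss tensor whose backing computation graph is large):

```python
# Storing the loss object keeps its whole "graph" alive; storing .item()
# keeps only a plain Python float, so the graph can be garbage-collected.
class FakeLoss:
    def __init__(self):
        self.graph = bytearray(10**6)  # pretend autograd graph (~1 MB)
    def item(self):
        return 0.5  # the scalar value, detached from the graph

bad = [FakeLoss() for _ in range(5)]          # ~5 MB stays retained
good = [FakeLoss().item() for _ in range(5)]  # only five floats retained
```

In real torch code the retained graph lives on the GPU, which is why a few batches of appended losses can exhaust GPU RAM.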

Limiting Dr. Racket's memory

I increased DrRacket's memory limit a week ago, and now I want to reduce it to the same amount as before. So I limited it back to 128 MB, but that has no effect... it always consumes much more than 128 MB.
It's really a problem because it causes my computer to overheat.
Does someone know how I can limit DrRacket so that it doesn't exceed 128 MB?
Here's a screenshot of the problem:
There is a difference between the memory used by a program and the memory used in total by DrRacket. When I start up DrRacket, before entering or running any program, I see that DrRacket uses 250 MB. The interactions window states I have limited memory to 128 MB too, so that particular program cannot go beyond those bounds, but there are features of DrRacket that use a lot more memory on your machine than on mine.
I went into the settings and removed some features I don't use (like Algol 60). When restarting after that, DrRacket used 50 MB less memory, which indeed confirms the memory is used by DrRacket itself and not by programs.
For a particularly complex program, I guess background expansion might use a lot of memory. Perhaps you can turn that off as well to see if the currently used memory goes down.
About heat
As Óscar mentioned, memory usage has little to do with heat, as long as you don't hear the swap being used (heavy disk usage). Heat has to do with CPU usage. When doing calculations, the OS will make resources available and perhaps increase the frequency of the CPU, which increases the heat.
If you are making a threaded application that has loops waiting for tasks, make sure you are not busy-waiting in an active loop. Sleeping might reduce the activity, and perhaps Racket has better approaches (I've never done threaded apps in Racket).
If you are calculating something, the increase in CPU usage is natural; it's so that you get the answer earlier. Computer settings can be changed to favor battery time; check both the OS and the BIOS. (That makes this not a Racket issue.)
The memory shown in the Dr Racket status bar is N/A.
Experiment:
Choose Racket | Limit Memory and specify 8 MB (the minimum).
Choose File | New Tab.
In the Interactions pane, allocate 8 MB of memory, for example by entering (define x (make-bytes (* 8 1024 1024))). (I recommend assigning the result to a variable, like this, because I doubt you want DrRacket to print 8 MB of bytes.)
The result I get:
Welcome to DrRacket, version 6.1.1.6--2014-12-21(aabe9d7/a) [3m].
Language: racket [custom]; memory limit: 8 MB.
> (define x (make-bytes (* 8 1024 1024)))
out of memory
>
Assuming you get the same result, there is some other reason your computer is running hotter.
I don't think that the extra memory being consumed is the cause for your computer overheating. More likely, it's because some function is consuming the CPU. Try to optimize the code, instead.
In fact, by limiting the available memory you might end up causing more disk paging, hence slowing things down and potentially consuming more CPU … and causing more overheating.

Why it is not possible to overlap memHtoD with GPU kernel with GTX 590

I tested my GTX 590 and GTX 680 with the CUDA SDK sample "simpleStreams". The timeline results are shown in the pictures. Can anyone explain why on the GTX 590 the memcpyDtoH cannot overlap with the previous kernel computation, while it does on the GTX 680?
I get similar behavior with my GTX 480. I suspect something is wrong with Fermi.
Maybe related to WDDM? (Using Windows 7 x64 here.)
I have tried many, many different drivers and all of them show the same wrong behavior. You have now tested GK104 and proven it works, and I have already tested an old 8800 GTS and it indeed works. It seems the Fermi cards don't work :/
edit:
see this also
How can I overlap memory transfers and kernel execution in a CUDA application?

cudaSetDevice() allocates more than 580 MB of global memory

I have a sophisticated CUDA-based Linux application. It runs on an i7 machine with one NVIDIA GTX 560 Ti card (1 GB memory), using Ubuntu 12.04 (x86_64) and NVIDIA driver 295.41 + CUDA 4.2 Toolkit.
The application requires about 600-700 MB of global memory on the GPU, and it fails to run due to an "out of memory" error on calls to cudaMalloc().
After some debugging, I found that the first call to cudaSetDevice() at the very beginning of the application allocates about 580 MB of global memory at once, and the available memory for the rest of application is only 433 MB.
The CUDA reference manual says that it initializes a "primary context" for the device and allocates various resources such as CUDA kernels (called "module" in the driver API) and constant variables. The application has some __device__ __constant__ variables but the total amount of them is just a few KB. There are about 20-30 kernels and device functions.
I have no idea why CUDA allocates such a large amount of GPU memory during initialization.
A separate minimal program that does only cudaSetDevice(0); cudaMemGetInfo(&a, &t); printf("%ld, %ld\n", a, t); shows about 980 MB of available memory. So the problem should reside in my application, but I could not figure out what causes such a large memory allocation, because the implementation details of cudaSetDevice() are completely proprietary.
Could I get some other ideas?
I presume cudaSetDevice is the first CUDA call you make in your application; as a CUDA developer you should know that the first CUDA call is very expensive, because CUDA first allocates its components on the graphics card, which is around 500 MB.
Try starting your program with another CUDA command, e.g. cudaMalloc; you'll experience the same amount of allocation by CUDA. You can also run deviceQuery from the CUDA Samples to see how much memory is in use.
It sounds like an issue; would you like to file a bug with NVIDIA? The steps are:
1. Open page http://developer.nvidia.com/cuda/join-cuda-registered-developer-program;
2. If not registered, please click "Join Now", otherwise click "Login Now";
3. Input e-mail and password to login;
4. On the left panel, there is a "Bug Report" item in Home section, click it to file a bug;
5. Fill in the required items; the other items are optional, but detailed information will help us target and fix the issue;
6. If necessary, an attachment should be uploaded;
7. For Linux system, it is better to attach an nvidia-bug-report;
8. If an issue is related to specific code pattern, a sample code and instructions to compile it are desired for reproduction.
I had a similar problem when the first call to any cudaXXX() function caused the reported VmData (UNIX) to spike massively, sometimes to tens of GB. This is not a bug and the reason is given here:
Why does the Cuda runtime reserve 80 GiB virtual memory upon initialization?

Super slow Image processing on Android tablet

I am trying to implement the SLIC superpixel algorithm on an Android tablet (SLIC).
I ported the code, which is in C++, to work in the Android environment using the STL and so on. What the application does is take an image from the camera and send the data to be processed in native code.
I got the app running, but the problem is that it takes 20-30 seconds to process a single frame (640 x 400), while on my notebook the Visual Studio application finishes almost instantly!
I checked for memory leaks; there aren't any... is there anything that might make computation far more expensive than with VS2010 on a notebook?
I know this question might be very open and not really specific but I'm really in the dark too. Hope you guys can help.
Thanks
PS. I checked the running time for each step; I think the execution time of every line of code just went up. I don't see any specific function that takes way longer than usual.
PPS. Do you think any of the following may cause the slowness?
Memory size: investigated; during native execution, not much pause time shows up from GC
The STL: not investigated yet; is it possible that things like vector, max, and min in the STL cause a significant slowdown?
The Android environment itself?
The lower hardware specification of the Android tablet (Acer Iconia Tab: 1 GHz Nvidia Tegra 250 dual-core processor and 1 GB of RAM)?
Would it be better to run it in Java?
PPPS. If you have time, please check out the code.
I've taken a look at your code and can make the following recommendations:
First of all, you need to add the line APP_ABI := armeabi-v7a to your Application.mk file. Otherwise your code is compiled for the old armv5te architecture, where you have no FPU (all floating-point arithmetic is emulated), fewer registers available, and so on.
Your SLIC implementation intensively uses double floating-point values for computation. You should replace them with float wherever possible, because ARM still lacks hardware support for the double type.
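The Application.mk change above might look like this (a minimal sketch; the APP_OPTIM line is an extra assumption, not something the answer requires):

```makefile
# Application.mk: build for armv7 so the hardware FPU is used
APP_ABI := armeabi-v7a
APP_OPTIM := release
```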
