cudaSetDevice() allocates more than 580 MB of global memory

I have a sophisticated CUDA-based Linux application. It runs on an i7 machine with one NVIDIA GTX 560 Ti card (1 GB memory), using Ubuntu 12.04 (x86_64) and NVIDIA driver 295.41 + CUDA 4.2 Toolkit.
The application requires about 600-700 MB of GPU global memory, and it fails with an "out of memory" error on calls to cudaMalloc().
After some debugging, I found that the first call to cudaSetDevice() at the very beginning of the application allocates about 580 MB of global memory at once, so only 433 MB remain available for the rest of the application.
The CUDA reference manual says that this call initializes a "primary context" for the device and allocates various resources such as CUDA kernels (called "modules" in the driver API) and constant variables. The application has some __device__ __constant__ variables, but their total size is only a few KB. There are about 20-30 kernels and device functions.
I have no idea why CUDA allocates such a large amount of GPU memory during initialization.
A separate minimal program that does only cudaSetDevice(0); cudaMemGetInfo(&a, &t); printf("%ld, %ld\n", a, t); reports about 980 MB of available memory. So the problem must lie in my application, but I could not figure out what causes such a large allocation, because the implementation details of cudaSetDevice() are completely proprietary.
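For reference, the minimal program fleshed out into something complete looks roughly like this (a sketch with error checking added; the output formatting differs slightly from the snippet above):

// Minimal repro: the only CUDA work is selecting the device and querying memory.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t freeMem = 0, totalMem = 0;

    // First CUDA call; the primary context is created here (or lazily on the next call).
    cudaError_t err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Free/total device memory in bytes, as seen after context creation.
    err = cudaMemGetInfo(&freeMem, &totalMem);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("%zu free, %zu total\n", freeMem, totalMem);
    return 0;
}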
Could I get some other ideas?

I presume that cudaSetDevice() is the first CUDA call your application makes. As a CUDA developer you should know that the first CUDA call is very expensive, because CUDA first loads its own components onto the graphics card, which takes around 500 MB.
Try starting your program with another CUDA command, e.g. cudaMalloc(); you will see the same amount being allocated by CUDA. You can also run deviceQuery from the CUDA Samples to see how much memory is in use.
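For example, something along these lines (a rough sketch, not tested on your exact setup) makes cudaMalloc() the very first runtime call and then reports what is left:

// Sketch: make cudaMalloc() the first runtime call; the context-creation
// overhead shows up the same way as with cudaSetDevice().
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    void *p = NULL;
    size_t freeMem = 0, totalMem = 0;

    cudaMalloc(&p, 1);                   // first CUDA call: creates the context
    cudaMemGetInfo(&freeMem, &totalMem); // what remains after initialization
    printf("free: %zu MB, total: %zu MB\n", freeMem >> 20, totalMem >> 20);

    cudaFree(p);
    return 0;
}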

It sounds like an issue, would you like to file a bug to Nvidia? The step are:
1. Open page http://developer.nvidia.com/cuda/join-cuda-registered-developer-program;
2. If not registered, please click "Join Now", otherwise click "Login Now";
3. Input e-mail and password to login;
4. On the left panel, there is a "Bug Report" item in Home section, click it to file a bug;
5. Fill the required itmes, other items are optional, but detailed information will help us to target and fix the issue a lot;
6. If necessary, an attachment should be uploaded;
7. For Linux system, it is better to attach an nvidia-bug-report;
8. If an issue is related to specific code pattern, a sample code and instructions to compile it are desired for reproduction.

I had a similar problem, where the first call to any cudaXXX() function caused the reported VmData (on UNIX) to spike massively, sometimes to tens of GB. This is not a bug, and the reason is given here:
Why does the Cuda runtime reserve 80 GiB virtual memory upon initialization?
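If you want to see the reservation yourself, a rough Linux-only sketch is to print the VmData line from /proc/self/status before and after the first CUDA call (the helper print_vmdata below is just an illustrative name):

// Sketch (Linux only): show the process's VmData before and after the first
// CUDA call, to observe the large virtual address-space reservation.
#include <stdio.h>
#include <string.h>
#include <cuda_runtime.h>

static void print_vmdata(const char *label)
{
    char line[256];
    FILE *f = fopen("/proc/self/status", "r");
    if (!f) return;
    while (fgets(line, sizeof(line), f))
        if (strncmp(line, "VmData:", 7) == 0)
            printf("%s %s", label, line);
    fclose(f);
}

int main(void)
{
    print_vmdata("before:");
    cudaFree(0);                 // forces CUDA context creation
    print_vmdata("after: ");
    return 0;
}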

Related

Why Erlang / Elixir observer memory usage numbers do not add up?

I am starting out with Elixir and observing some strange behavior when connecting to my remote production node using iex.
As in the screenshot below, the observer reports that a total of 92 MB of memory is in use. However, when you sum up the memory consumption of processes, atoms, binaries, code and ETS, it comes to ~69 MB:
Processes 19.00 MB
Atoms 0.97 MB (969 kB)
Binaries 13.00 MB
Code 28.00 MB
ETS 7.69 MB (7685 kB)
-------------------
Total 68.66 MB
So, my first question is: where is this extra 23 MB of memory coming from? I am pretty sure it's not just a reporting issue, because when I look at my Kubernetes pod's memory consumption it is ~102 MB, which is in line with the numbers the observer is showing.
The only thing I can think of is that those 23 MB have not been garbage collected yet. Is that assumption valid? If so, it has been 6 hours since this container started, and I have been monitoring the memory consumption from the very beginning. Shouldn't this have been garbage collected by now?
And second question: are there any Erlang VM / Elixir configuration tweaks I can make to optimize the memory footprint?
I have also been trying to solve memory-management issues in OTP applications, and one tool that has been particularly useful for me is recon, a library written by Fred Hebert. The recon_alloc module in particular provides very useful information on memory usage in the Erlang VM.
The missing megabytes
The following quote is taken directly from the documentation of the recon_alloc:memory() function and might give you some insight into what's going on:
The memory reported by `allocated' should roughly match what the OS
reports. If this amount is different by a large margin, it may be the
sign that someone is allocating memory in C directly, outside of
Erlang's own allocator -- a big warning sign. There are currently
three sources of memory allocation that are not counted towards this
value: The cached segments in the mseg allocator, any memory allocated
as a super carrier, and small pieces of memory allocated during
startup before the memory allocators are initialized. Also note that
low memory usages can be the sign of fragmentation in memory, in which
case exploring which specific allocator is at fault is recommended.
So I think that the extra 23 MB of memory usage might be caused by some undesired allocations, or perhaps due to fragmentation.
Tweaking (with great caution /!\)
As for your second question, the erts_alloc documentation in Erlang describes manual configuration of the memory allocators. It is done by passing command-line flags to the emulator, for example:
erl +SOMEFLAG +SOMEOTHERFLAG
But there's a big red warning in the documentation that strongly suggests that messing with these flags can result in much worse behaviour than with the default configuration.
So my advice would be to resort to these modifications only if they are really the only way to solve the problem. In that case, there is a book about the Erlang Runtime System that has helped me understand some of these aspects, so I would also recommend giving it a read beforehand.
NOTE: A wild shot in the dark here, and not answering your question directly, but it might be useful to double-check what is going on with your binaries, as I see 13 MB reported by the observer. Depending on their size (smaller or larger than 64 bytes), they are stored on process heaps or accessed by reference. I have faced case #1, with lots of small binaries piling up and ultimately crashing my system.
There are a few other helpful resources I found while trying to fix those problems:
This specific snippet from a blog post, also authored by Fred Hebert:
[erlang:garbage_collect(Pid) || Pid <- processes()].
It will trigger a GC on all running processes immediately. In my case it has done wonders. You can also pass an option to call it asynchronously, so you don't have to block until it's all done:
[erlang:garbage_collect(Pid, [{async, RequestId}]) || Pid <- processes()].
This article about GC in Erlang
Efficiency guidelines in the Erlang docs for binaries, that provide useful implementation details.
Stuff Goes Bad: Erlang in Anger, another free ebook written by... yes, it is Fred Hebert.
Hope this helps :)

cudaMallocPitch fails when multiple GPUs are controlled by separate CPU processes, even though enough memory exists

I'm getting an 'out of memory' error from the cudaMallocPitch API on GeForce GTX 1080 Ti and/or GeForce GTX 1080 GPUs that are part of a server containing 4 GPUs (one 1080 Ti and three 1080s) and two CPUs.
Each GPU is controlled by a dedicated CPU thread, which calls cudaSetDevice with the right device index at the beginning of its run.
Based on information in a configuration file, the application knows how many CPU threads to create.
I can also run my application several times as separate processes, each controlling a different GPU.
I'm using OpenCV version 3.2 to perform image background subtraction.
First, you create the BackgroundSubtractorMOG2 object with cv::cuda::createBackgroundSubtractorMOG2, and after that you call its apply method.
The first time the apply method is called, all required memory is allocated once.
My image size is 10000 columns by 7096 rows, and each pixel is 1 byte (grayscale).
When I run my application as a single process with several threads (one per GPU), everything works fine, but when I run it 4 times as separate processes (one per GPU), the OpenCV apply function starts to fail with a cudaMallocPitch 'not enough memory' failure.
For all GPUs I verified that there is enough available memory before apply is called for the first time: the 1080s report ~5.5 GB free, the 1080 Ti reports ~8.3 GB free, and the requested size is width = 120000 bytes, height = 21288 rows, i.e. ~2.4 GB.
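Roughly, the per-GPU call sequence looks like this (a simplified sketch rather than my exact code; the function name processOnGpu is just for illustration):

// Simplified sketch of the per-GPU work: one MOG2 instance per device/thread.
#include <opencv2/core/cuda.hpp>
#include <opencv2/cudabgsegm.hpp>

void processOnGpu(int deviceIndex, const cv::Mat& grayFrame /* 7096 x 10000, CV_8UC1 */)
{
    cv::cuda::setDevice(deviceIndex);                 // bind this thread/process to one GPU

    cv::Ptr<cv::cuda::BackgroundSubtractorMOG2> mog2 =
        cv::cuda::createBackgroundSubtractorMOG2();

    cv::cuda::GpuMat d_frame, d_fgMask;
    d_frame.upload(grayFrame);                        // host -> device copy

    // The first apply() allocates all of MOG2's internal buffers.
    mog2->apply(d_frame, d_fgMask);
}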
Please advise.
The source of the problem was found:
The cudaMallocPitch API returned cudaErrorMemoryAllocation because there was no OS virtual memory left; the OS uses it when the process performs read/write accesses to GPU physical memory.
Because of that, the CUDA driver fails any kind of GPU physical memory allocation.
The tricky part was figuring out why this API fails while enough GPU physical memory exists (as checked with the cudaMemGetInfo API).
I started to analyze two points:
Why don't I have enough virtual memory on my PC?
By following the instructions in the link below I increased its size, and the problem disappeared:
https://www.online-tech-tips.com/computer-tips/simple-ways-to-increase-your-computers-performace-configuring-the-paging-file/
Why does my process consume so much OS virtual memory?
In the past I found that, to get better performance during processing, I should allocate all required GPU physical memory only once at the beginning, because an allocation operation takes a lot of time, depending on the requested size.
Since I'm working with a frame resolution of ~70 MB and my processing logic requires a huge number of auxiliary buffers, massive GPU and CPU memory areas had to be allocated, which exhausted the available OS virtual memory.
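A sketch of the kind of check that surfaces this situation (plenty of free device memory, yet the allocation still fails), using the sizes mentioned in the question:

// Sketch: report free GPU memory next to a failing cudaMallocPitch(), so that
// "plenty of device memory, allocation still fails" points at a host-side cause.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeMem = 0, totalMem = 0;
    cudaMemGetInfo(&freeMem, &totalMem);
    std::printf("device memory: %zu MB free of %zu MB\n", freeMem >> 20, totalMem >> 20);

    void*  devPtr = nullptr;
    size_t pitch  = 0;
    // Sizes taken from the question: width 120000 bytes, height 21288 rows (~2.4 GB).
    cudaError_t err = cudaMallocPitch(&devPtr, &pitch, 120000, 21288);
    if (err != cudaSuccess) {
        // cudaErrorMemoryAllocation here, despite ample free device memory,
        // suggests the host-side virtual memory / paging-file issue described above.
        std::printf("cudaMallocPitch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("allocated with pitch = %zu bytes\n", pitch);
    cudaFree(devPtr);
    return 0;
}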

Get available memory for a process

I use Delphi 2007, so there is a 32-bit limit of available memory.
Using the IMAGE_FILE_LARGE_ADDRESS_AWARE PE flag, there should be a 3 GB limit instead of 2 GB:
{$SetPEFlags IMAGE_FILE_LARGE_ADDRESS_AWARE} // Allows usage of more than 2GB memory
This is the method I use to get the current memory usage of the process:
function MemoryUsed: Int64;
var
  PMC: _PROCESS_MEMORY_COUNTERS_EX; // declared in the PsAPI unit (or your own translation)
begin
  PMC.cb := SizeOf(PMC);
  Win32Check(GetProcessMemoryInfo(GetCurrentProcess, @PMC, SizeOf(PMC)));
  Result := PMC.PrivateUsage; // private (committed) bytes of this process
end;
Now I want a way to get the total amount of memory available to the process. It should be around 3 GB. But I don't want to hardcode it, as in the future we will move to a newer Delphi and 64-bit.
What Win32 API function should I use?
Available memory - the computer's available memory. Maybe 8 GB of RAM is installed; if more is required, the OS starts to swap memory to disk.
Process available memory - a limitation of the executable and of Windows. Most Windows installations are 64-bit now, so that is not the problem. But if the executable is compiled as 32-bit with IMAGE_FILE_LARGE_ADDRESS_AWARE, the limit should be 3 GB, right? When the executable is 64-bit it will be much larger, maybe 64 GB (but then swapping may happen if less RAM is installed...).
So my question is, how can I get the process's available memory?
There are a couple of obvious things you can do. Call GetSystemInfo and subtract lpMinimumApplicationAddress from lpMaximumApplicationAddress to find the amount of address space available to your process.
The amount of physical memory available to you is much harder to obtain, and is not a fixed quantity. You are competing with all the other processes for that, and so this is a very fluid and dynamic concept. You can find out how much physical memory is available on the system by calling GlobalMemoryStatusEx. That returns other information too but it's very easy to misinterpret it. In fact this API will also tell you how much virtual memory is available to your process which would give you the same information as in the first paragraph.
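As a rough illustration of the two calls above (a sketch in C++ for brevity; the same functions and struct fields are exposed by Delphi's Windows unit):

// Sketch: total address space for this process, plus the system's current
// physical/virtual memory availability.
#include <windows.h>
#include <cstdio>

int main()
{
    // Address space theoretically available to this process.
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    SIZE_T addressSpace =
        static_cast<char*>(si.lpMaximumApplicationAddress) -
        static_cast<char*>(si.lpMinimumApplicationAddress);

    // Physical and virtual memory currently available.
    MEMORYSTATUSEX ms;
    ms.dwLength = sizeof(ms);
    GlobalMemoryStatusEx(&ms);

    std::printf("address space:  %llu MB\n", (unsigned long long)(addressSpace >> 20));
    std::printf("avail physical: %llu MB\n", (unsigned long long)(ms.ullAvailPhys >> 20));
    std::printf("avail virtual:  %llu MB\n", (unsigned long long)(ms.ullAvailVirtual >> 20));
    return 0;
}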
Perhaps what you want is the minimum of the total physical and total virtual memory. But I would not like to say. I've seen many examples of code that needlessly limits its ability to perform by taking bad decisions based on misinterpreted memory statistics.

AMD 7970 reporting incorrect DEVICE_GLOBAL_MEM_SIZE

I'm testing some OpenCL image processing on an AMD HD 7970 (Sapphire GHz edition). This particular card has 6 GB of RAM on board; however, this call:
clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(buf_ulong), &buf_ulong, NULL);
returns a value of 2,147,483,648.
Is there an issue with this OpenCL call for returning the actual memory size of a card? Is there some sort of setting for this card that limits the amount of OpenCL memory that can be used?
Any insight would be helpful!
My HD 7970 (3 GB version, Sapphire GHz edition) reports 2 GB as well. It is quite normal for the reported size to be less than the total amount (the OS and driver have to reserve some memory), but your value looks far too low for a 6 GB card.
On older AMD drivers it was possible to set the amount of memory available to OpenCL via a couple of environment variables. However, that feature was never officially supported, and I'm afraid it is no longer available in the latest drivers.
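For what it's worth, this is roughly how to print what the runtime exposes as global memory next to the largest single allocation it allows (a minimal sketch, error handling omitted):

// Sketch: query the global memory the OpenCL runtime exposes for the first GPU,
// alongside the maximum size of a single buffer allocation.
#include <cstdio>
#include <CL/cl.h>

int main()
{
    cl_platform_id platform;
    cl_device_id device;
    cl_ulong globalMem = 0, maxAlloc = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(globalMem), &globalMem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_MEM_ALLOC_SIZE, sizeof(maxAlloc), &maxAlloc, NULL);

    std::printf("global memory exposed to OpenCL: %llu MB\n",
                (unsigned long long)(globalMem >> 20));
    std::printf("largest single allocation:       %llu MB\n",
                (unsigned long long)(maxAlloc >> 20));
    return 0;
}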

Huge jump in memory use after calling clGetPlatformIDs

I'm in the process of learning something about OpenCL, and am having what I hope is not a unique problem (I found nothing on Google, but...). When I call:
clGetPlatformIDs
from my host program, I see a sudden increase in the 'VIRT' memory usage reported by 'top', to about 45 GB. The values for resident and shared memory don't change noticeably, and I'm not completely sure what top is reporting here. However, if I repeatedly call a function that runs OpenCL commands, I see some fluctuation in the 'VIRT' memory usage until OpenCL calls fail with CL_OUT_OF_HOST_MEMORY. I have 32 GB of memory, so this seems a bit absurd.
I see this in some code (C++) that performs maximum intensity projections on image stacks, but I see exactly the same behaviour in code I took from Erik Smistad's blog.
http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/
Running that example through GDB, the first call to an OpenCL function has the same effect as in my code:
cl_platform_id platform_id = NULL;
cl_uint ret_num_platforms;
cl_int ret = clGetPlatformIDs(1, &platform_id, &ret_num_platforms);
VIRT memory jumps massively (again to about 45 GB).
Since I haven't seen anything like this anywhere, I suspect that there may be something funny about my setup:
openSUSE 12.1
GeForce GTX 560Ti 1024 MB
nvidia-computeG02-295.49-17.1.x86_64
but... the CUDA toolkit for openSUSE 11.2 downloaded from NVIDIA, which may expect driver version 295.41 rather than the 295.49 installed with openSUSE.
I'm hoping someone here has seen a similar problem and has some idea as to what's going on, or some idea as to where to look. I'd very much like to work this out as apart from this issue it's working pretty nicely.
