I am running an LSTM in test mode. Every time the LSTM is called and produces an output, memory usage increases (the output is a single float or just a few floats, so the output itself shouldn't be the issue) until all of my memory is used up and the program is terminated for lack of memory. Is this observation correct, is this expected LSTM behavior, and if so, what can be done about it?
The increase in memory usage is caused by the autograd engine, which keeps the computation graph of every forward pass alive. Turning it off with the with torch.no_grad(): context manager solves the memory issue.
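A minimal sketch of the pattern (the model, sizes and loop are illustrative, not taken from the original code):

    import torch
    import torch.nn as nn

    lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
    lstm.eval()                        # test mode: disables dropout etc.

    hidden = None
    with torch.no_grad():              # autograd no longer records a graph
        for _ in range(10_000):        # repeated calls no longer accumulate memory
            x = torch.randn(1, 1, 8)   # (batch, seq_len, input_size)
            out, hidden = lstm(x, hidden)

If you cannot wrap the whole loop in no_grad (for example because part of the model is still training), detaching the carried hidden state, hidden = tuple(h.detach() for h in hidden), also prevents the graph from growing across calls.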
I recently started using the wandb module with my PyTorch script to check that the GPUs are operating efficiently. However, I am unsure what exactly the charts indicate.
I have been following the tutorial at https://lambdalabs.com/blog/weights-and-bias-gpu-cpu-utilization/ and was confused by one of its plots.
I am uncertain about the GPU % and the GPU Memory Access % charts. The descriptions in the blog are as follows:
GPU %: This graph is probably the most important one. It tracks the percentage of time over the past sample period during which one or more kernels were executing on the GPU. Basically, you want this to be close to 100%, which means the GPU is busy crunching data all the time. The above diagram has two curves because there are two GPUs and only one of them (blue) is used for the experiment. The blue GPU is about 90% busy, which is not too bad but still leaves some room for improvement. The reason for this suboptimal utilization is the small batch size (4) used in this experiment: the GPU fetches a small amount of data from its memory very often and cannot saturate the memory bus or the CUDA cores. Later we will see that it is possible to bump up this number merely by increasing the batch size.
GPU Memory Access %: This is an interesting one. It measures the percentage of time over the past sample period during which GPU memory was being read or written. We want to keep this percentage low, because the GPU should spend most of its time computing instead of fetching data from its memory. In the above figure, the busy GPU spends around 85% of the time accessing memory. This is very high and causes some performance problems. One way to lower this percentage is to increase the batch size, so that data fetching becomes more efficient.
I had the following questions:
The aforementioned values do not sum to 100%. It seems as though our GPU can either be spending time on computation or spending time on reading/writing memory. How can the sum of these two values be greater than 100%?
Why does increasing batch size decrease the time spent accessing GPU Memory?
The claim that GPU utilization and GPU memory access should add up to 100% would only hold if the hardware performed the two activities sequentially. Modern hardware doesn't work that way: the GPU can be computing and accessing memory at the same time, so the two percentages are measured independently and can overlap.
GPU % is really the GPU utilization %. We want this to be close to 100%, meaning the GPU is doing the desired computation nearly all of the time.
GPU Memory Access % is the fraction of time the GPU spends reading from or writing to its memory. We want this number to be low: if it is high, the GPU may stall waiting for data before it can compute on it. That does not mean the two activities happen sequentially. Increasing the batch size lowers this percentage because each memory transfer then moves more useful data, so the GPU does more computation per byte fetched and spends relatively less time waiting on memory.
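These counters correspond to NVML's utilization rates, and you can sample them yourself to see that they are reported independently (which is why they don't have to sum to 100%). A small sketch using the pynvml package:

    # pip install nvidia-ml-py   (provides the pynvml module)
    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)       # first GPU

    for _ in range(5):
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        # util.gpu    -> % of the sample period with at least one kernel running
        # util.memory -> % of the sample period with memory being read or written
        print(f"GPU %: {util.gpu}   GPU Memory Access %: {util.memory}")
        time.sleep(1)

    pynvml.nvmlShutdown()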
W&B lets you monitor both metrics and make decisions based on them. Recently I implemented a data pipeline using tf.data.Dataset. GPU utilization was close to 0%, and memory access was close to 0% as well. I was reading three different image files and stacking them on the fly, so the CPU was the bottleneck. To counter this I built a dataset with the images already stacked, and the ETA went from 1 hour per epoch to 3 minutes.
From the plot, you can see that GPU memory access increased and GPU utilization rose to close to 100%, while CPU utilization, which had been the bottleneck, decreased.
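For reference, a rough sketch of that kind of pipeline change (the file lists, shapes and dataset layout are illustrative, not the original code):

    import numpy as np
    import tensorflow as tf

    # Placeholder file lists; in the real pipeline these point at actual images.
    paths_a = ["a_0.png", "a_1.png"]
    paths_b = ["b_0.png", "b_1.png"]
    paths_c = ["c_0.png", "c_1.png"]

    # Slow version: three reads + decodes + a stack per sample, all done by the
    # CPU at training time.
    def load_sample(pa, pb, pc):
        imgs = [tf.io.decode_png(tf.io.read_file(p)) for p in (pa, pb, pc)]
        return tf.stack(imgs, axis=0)

    slow_ds = (tf.data.Dataset.from_tensor_slices((paths_a, paths_b, paths_c))
               .map(load_sample, num_parallel_calls=tf.data.AUTOTUNE)
               .batch(2)
               .prefetch(tf.data.AUTOTUNE))

    # Faster version: stack the images once, offline, and feed the pre-stacked
    # arrays directly, so almost no per-step CPU work is left.
    pre_stacked = np.zeros((2, 3, 64, 64, 3), dtype=np.float32)   # placeholder data
    fast_ds = (tf.data.Dataset.from_tensor_slices(pre_stacked)
               .batch(2)
               .prefetch(tf.data.AUTOTUNE))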
Here's a nice article by Lukas answering this question.
I am dealing with a Fortran code with MPI parallelization where I'm running out of memory on intensive runs. I'm careful to allocate nearly all of the required memory at the beginning of my simulation. The static memory allocated inside subroutines is typically small, and if I were going to run out of memory because of it, that would happen early in the simulation, since those allocations don't grow as time progresses. My issue is that far into the simulation I run into memory errors such as:
Insufficient memory to allocate Fortran RTL message buffer, message #174 = hex 000000ae.
The only thing that I can think of is that my MPI calls are using memory that I cannot preallocate at the beginning of the simulation. I'm using mostly MPI_Allreduce, MPI_Alltoall, and MPI_Alltoallv while the simulation is running and sometimes I am passing large amounts of data. Could the memory issues be a result of internal buffers created by MPI? How can I prevent a surprise memory issue like this? Can this internal buffer grow during the simulation?
I've looked at Valgrind and besides the annoying MPI warnings, I'm not seeing any other memory issues.
It's a bit hard to tell whether MPI is at fault here without knowing more details. You can try Massif (one of the Valgrind tools) to find out where the memory is being allocated.
Be sure you don't introduce any resource leaks: If you create new MPI resources (communicators, groups, requests etc.), make sure to release them properly.
In general, be aware of the buffer sizes required for all-to-all communication, especially at large scale. Use MPI_IN_PLACE if possible, or send data in small chunks rather than as single large blocks.
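A small sketch of the in-place pattern (shown here with mpi4py rather than Fortran, but MPI_IN_PLACE and the resource-freeing calls work the same way in the Fortran bindings):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD

    # In-place Allreduce: no separate receive buffer of the same size is needed,
    # which halves the application-side memory for this call.
    data = np.ones(1_000_000, dtype=np.float64)
    comm.Allreduce(MPI.IN_PLACE, data, op=MPI.SUM)

    # If communicators (or groups, requests, ...) are created during the run,
    # free them when done, otherwise they accumulate for the whole simulation.
    sub = comm.Split(color=comm.rank % 2, key=comm.rank)
    # ... use sub ...
    sub.Free()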
Lately I've been trying to understand and track down a nasty memory leak in my software. To do this, I started to monitor memory usage over long periods of time to try to figure out if there is any pattern that would serve as a clue to understand this issue.
In the graph below, virtual memory is drawn in purple and %MEM (the percentage of physical memory used by the process) in green; the x-axis represents time in seconds.
There are some big spikes that occur when a video streaming feature is activated, but this doesn't seem to be an issue since the software seems to be able to clear these.
Around second 7500 there is a big drop because a stand-by feature of the system was activated for a few seconds. After the system returns to normal, it clears some memory that had accumulated before, which so far makes sense. The thing I can't understand is this: if the amount of stored memory decreases, why doesn't %MEM also decrease? In this case, it is actually increasing.
There is no clear correlation between %MEM and virtual memory usage. Can anyone help me understand this?
I realized that virtual memory is not directly tied to %MEM: the virtual size is the address space a process has reserved, much of which may be swapped out or never backed by RAM at all. The figure that %MEM reflects is the RSS (Resident Set Size), i.e. the part of the process's memory that is actually held in RAM.
In the next graphic I monitored RSS instead, and the correlation is obvious.
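A quick way to watch both numbers for a process and see the difference, using Python's psutil (the PID here is a placeholder):

    import psutil

    pid = 1234                                  # placeholder: PID of the monitored process
    proc = psutil.Process(pid)

    mem = proc.memory_info()
    total = psutil.virtual_memory().total

    print(f"VSZ (virtual size): {mem.vms / 2**20:.1f} MiB")   # address space reserved
    print(f"RSS (resident set): {mem.rss / 2**20:.1f} MiB")   # actually resident in RAM
    print(f"%MEM: {100 * mem.rss / total:.1f} %")             # what top/ps report as %MEM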
When I run my app in the simulator and analyze its memory allocations using Instruments, the app runs very slowly, at less than 1/30 of its normal speed.
The app uses about 50 MB of RAM and has approximately 900,000 live objects (according to Instruments).
Could this be the reason for the slow performance?
When running the app on the device, or in the simulator without Instruments attached, it performs well (apart from the memory issue I am trying to debug).
Do you have any idea on how to solve this issue?
Did you encounter slow performance when using the Memory Allocation instrument?
Would you consider having more than 900,000 live objects "concerning"?
Considering your Analyzer performance issue
In your specific case monitoring the app over a long period of time will not be necessary, as you reach the state of high memory consumption very soon. You could simply stop recording at this point. Then you won't have problems navigating through the different views and statistics to find the cause of the memory issue.
Analyzing the memory issue
Slowing down is normal. 1/30 sounds quite alarming.
You probably should track how the number of live objects and the memory usage change while you use the app.
It is difficult to say whether a certain number of live objects at a specific point in time is critical (though 900,000 seems very high).
In general: if the number of live objects and the memory usage grow continuously and never shrink, that is a bad sign.
If you take a look at Statistics -> Object Summary (see screenshot), Live Bytes should be a lot smaller than Overall Bytes, and the number of # Living objects should be a lot smaller than the number of # Transitory objects.
The second thing you can look at is the Call Tree view.
It gives you a nice overview of which parts of the application are responsible for reserving large amounts of memory.
Possible solutions
Once you have identified the parts of your code that reserve the large amounts of memory, you can look for retain cycles there, or try using additional autorelease pools in those spots.
Check that you have enough free disk space. I had 8 GB left, and that was apparently too little: Instruments was extremely slow, took a minute just to start, and barely got anywhere at all.
I cleared out more disk space and it suddenly went back to being as fast as before.
Hi folks and thanks for your time in advance.
I'm currently extending our C# test framework to monitor the memory consumed by our application. The intention is that a bug can be raised if memory consumption jumps significantly on a new build, since resources are always tight.
I'm using System.Diagnostics.Process.GetProcessesByName and then checking the PrivateMemorySize64 value.
While developing the new test, using the same build of the application for consistency, I've seen it consume differing amounts of memory despite supposedly executing exactly the same code.
So my question is: once an application has launched, fully loaded and (in this case) sitting in its idle state, and is therefore in a supposedly identical state from run to run, can I expect the private bytes consumed to be identical from run to run?
I should clarify why I need memory usage to be consistent: any degree of variance reduces the effectiveness of the test, because a tolerance would have to be introduced, something I'd like to avoid.
So...
1) Should the memory usage be 100% consistent, presuming the application is behaving consistently? This was my expectation.
or
2) Is there any degree of variance in the private bytes reported by Windows, or in the memory it allocates when requested by an app?
Currently, if the answer is that the memory consumed should be consistent, as I was expecting, then the issue lies in our app actually requesting differing amounts of memory.
Many thanks
H
Almost everything in .NET uses the runtime's garbage collector, and when exactly it runs and how much memory it frees depends on a lot of factors, many of which are out of your hands. For example, when another program needs a lot of memory, and you have a lot of collectable memory at hand, the GC might decide to free it now, whereas when your program is the only one running, the GC heuristics might decide it's more efficient to let collectable memory accumulate a bit longer. So, short answer: No, memory usage is not going to be 100% consistent.
OTOH, if you have really big differences between runs (say, a few megabytes on one run vs. half a gigabyte on another), you should get suspicious.
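In practice that means the test needs a tolerance band and a few samples rather than a single exact comparison. A minimal sketch of the idea (shown in Python with psutil for brevity; the same logic applies to PrivateMemorySize64 in C#, and the process name, baseline and threshold are made up):

    import statistics
    import psutil

    BASELINE_MB = 150.0          # made-up baseline from a known-good build
    TOLERANCE = 0.10             # allow +/-10% run-to-run noise

    def idle_memory_mb(name="MyApp"):            # hypothetical process name
        proc = next(p for p in psutil.process_iter(["name"])
                    if p.info["name"] == name)
        return proc.memory_info().rss / 2**20    # stand-in for PrivateMemorySize64

    samples = [idle_memory_mb() for _ in range(5)]   # average out GC noise
    current = statistics.mean(samples)

    if current > BASELINE_MB * (1 + TOLERANCE):
        print(f"Possible memory regression: {current:.1f} MB vs baseline {BASELINE_MB} MB")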
If the program is deterministic (like all embedded programs should be), then yes. In an OS environment you are very unlikely to get the same figures due to memory fragmentation and numerous other factors.
Update:
Just noticed this is a C# app, so no, but the numbers should be relatively close (+/- 10% or less).