GPU Usage While Training LSTM Model Using Darts - time-series

I am currently using the Darts library to do some time series forecasting using an LSTM.
In my arguments I set the model to utilize my GPU and the output from the dashboard during training shows that my GPU is indeed being used.
Looking at the documentation as well confirms that what I am seeing means the GPU is being used for training. https://unit8co.github.io/darts/userguide/gpu_and_tpu_usage.html.
However, looking at my GPU in task manager it says that my GPU usage is at 0%. That does not make any sense to me? Does anyone know what goes on behind the scenes and might be able to explain why the GPU usage would be at 0% during training?

It might be that your GPU is under-used, e.g. because your CPU is not feeding it fast enough. Try playing with the num_loader_workers parameter, increasing the batch size, and reading the recommendations provided here.

Related

Multi GPU training for Transformers with different GPUs

I want to fine tune a GPT-2 model using Huggingface’s Transformers. Preferably the medium model but large if possible. Currently, I have a RTX 2080 Ti with 11GB of memory and I can train the small model just fine.
My question is: will I run into any issues if I added an old Tesla K80 (24GB) to my machine and distributed the training? I cannot find information about using different capacity GPUs during training and issues I could run into.
Will my model size limit essentially be sum of all available GPU memory? (35GB?)
I’m not interested in doing this in AWS.
You already solved your problem. That's great. I would like to point out a different approach and address a few questions.
Will my model size limit essentially be sum of all available GPU
memory? (35GB?)
This depends on the training technique you use. The standard data parallelism replicates the model, gradients and optimiser states to each of the GPUs. So each GPU must have enough memory to hold all these. The data is splitted across the GPUs. However, the bottleneck is usually the optimiser states and the model not the data.
The state-of-the-art approach in training is ZeRO. Not only the dataset, but also the model parameters, the gradients and the optimizer states are splitted across the GPUs. This allows you to train huge models without hitting OOM. See the nice illustration below from the paper. The baseline is the standard case that I mentioned. They gradually split optimizer states, gradients and model parameter accross the GPU's and compare the memory usage per GPU.
The authors of the paper created a library called DeepSpeed and it is very easy to integrate it with huggingface. With that I was able to increase my model size from 260 Million to 11 Billion :)
If you want to understand in detail how it works, here is the paper:
https://arxiv.org/pdf/1910.02054.pdf
More information on integrating DeepSpeed with Huggingface can be found here:
https://huggingface.co/docs/transformers/main_classes/deepspeed
PS: There is a the model parallelism technique in which each GPU trains different layers of the model but it lost its popularity and is not being actively used.

Efficient inference of 3D deep learning model (pytorch)

I am trying to use a Pytorch 3D UNet for inference (from here: https://github.com/wolny/pytorch-3dunet) which receives images of size (96, 96, 96). I would like to use it on CPU instances, but I am getting very high memory usages (~18 GB). After researching on the subject I found out that this was due to the way convolutions are implemented on CPU (see https://discuss.pytorch.org/t/pytorch-high-memory-demand/2798/5). I thus have the following questions:
Is there a way to use a more memory-efficient implementation of the convolution in Pytorch?
How can I optimize my model for CPU inference? I saw that some tools like AWS Neo, Intel OpenVINO, etc. exist; could they solve my problem?
Does Tensorflow have a similar problem for using convolutions on CPU?
Any other tip, link on how to deploy such models in an efficient way is welcome!
Thanks!
You could benchmark your model's performance with DNN-Bench and choose the best inference engine for your application and your hardware. You might need to convert your model to ONNX first.

If I train a neural network with CUDA do I need to run the outputted algorithm with CUDA?

Let's say I used CUDA to train an object tracking program. Could I then put that program on another computer that didn't have a powerful gpu and run the object tracking program? Or is gpu support required to run the outputted algorithm as well as train it?
No, it does not matter how you trained your model. You can execute it in completely different scenario, using CPU, GPU, cloud or whatever you want. Since execution is usually much cheaper than training - you will usually need much less powerful hardware.

Would it work and be faster if I call function in OpenCV GPU module in my kernel function?

OpenCV has a gpu. GPU-accelerated Computer Vision module (http://docs.opencv.org/modules/gpu/doc/gpu.html). There are many functions which is already use GPU techniques. So I can directly use the function OpenCV applies. But I wonder whether it would be faster if I write my own kernel and in each kernel I call function of OpenCV GPU module. This is in the case I have many images. To handle each image I call OpenCV funtion in GPU module. Then it would be parallel-nested-parallel.
Your question is not entirely clear to me, but I would like to say this: it's impossible to say which would be faster, unless somebody already implemented that same algorithm using the approach you have in mind, and then shared a report about the benchmark tests.
There's a number of factors involved:
It depends on the type of operation you are trying to implement: techniques that have a high arithmetic intensity are better fit for GPUs for sure, however, not all problems can be modeled for GPUs.
The size of the input images matter: wasting time sending data from RAM to the GPU might not compensate in the end, so running the algorithm on the CPU can be faster for small images.
The model/power of the CPU/GPU: if the computer has a really crappy GPU, then it's probably better to run the algorithms on the CPU.
What I'm saying is: don't assume OpenCV GPU's module will always run it's algorithms faster than the CPU you got. Test it, measure it! The only way to know for sure is through experimentation and benchmark.

Fastest method to compute convolution

I have to apply a convolution filter on each row of many images. The classic is 360 images of 1024x1024 pixels. In my use case it is 720 images 560x600 pixels.
The problem is that my code is much slower than what is advertised in articles.
I have implemented the naive convolution, and it takes 2m 30s. I then switched to FFT using fftw. I used complex 2 complex, filtering two rows in each transform. I'm now around 20s.
The thing is that articles advertise around 10s and even less for the classic condition.
So I'd like to ask the experts here if there could be a faster way to compute the convolution.
Numerical recipes suggest to avoid the sorting done in the dft and adapt the frequency domain filter function accordingly. But there is no code example how this could be done.
Maybe I lose time in copying data. With real 2 real transform I wouldn't have to copy the data into the complexe values. But I have to pad with 0 anyway.
EDIT: see my own answer below for progress feedback and further information on solving this issue.
Question (precise reformulation):
I'm looking for an algorithm or piece of code to apply a very fast convolution to a discrete non periodic function (512 to 2048 values). Apparently the discrete time Fourier transform is the way to go. Though, I'd like to avoid data copy and conversion to complex, and avoid the butterfly reordering.
FFT is the fastest technique known for convolving signals, and FFTW is the fastest free library available for computing the FFT.
The key for you to get maximum performance (outside of hardware ... the GPU is a good suggestion) will be to pad your signals to a power of two. When using FFTW use the 'patient' setting when creating your plan to get the best performance. It's highly unlikely that you will hand-roll a faster implementation than what FFTW provides (forget about N.R.). Also be sure to be using the Real version of the forward 1D FFT and not the Complex version; and only use single (floating point) precision if you can.
If FFTW is not cutting it for you, then I would look at Intel's (very affordable) IPP library. The have hand tuned FFT's for Intel processors that have been optimized for images with various bit depths.
Paul
CenterSpace Software
You may want to add image processing as a tag.
But, this article may be of interest, esp with the assumption the image is a power or 2. You can also see where they optimize the FFT. I expect that the articles you are looking at made some assumptions and then optimized the equations for those.
http://www.gamasutra.com/view/feature/3993/sponsored_feature_implementation_.php
If you want to go faster you may want to use the GPU to actually do the work.
This book may be helpful for you, if you go with the GPU:
http://www.springerlink.com/content/kd6qm361pq8mmlx2/
This answer is to collect progress report feedback on this issue.
Edit 11 oct.:
The execution time I measured doesn't reflect the effective time of the FFT. I noticed that when my program ends, the CPU is still busy in system time up to 42% for 10s. When I wait until the CPU is back to 0%, before restarting my program I then get the 15.35s execution time which comes from the GPU processing. I get the same time if I comment out the FFT filtering.
So the FFT is in fact currently faster then the GPU and was simply hindered by a competing system task. I don't know yet what this system task is. I suspect it results from the allocation of a huge heap block where I copy the processing result before writing it to disk. For the input data I use a memory map.
I'll now change my code to get an accurate measurement of the FFT processing time. Making it faster is still actuality because there is room to optimize the GPU processing like for instance by pipelining the transfer of data to process.

Resources