Colab unable to load cache - machine-learning

I am trying to train a YOLOv5 neural network for recognizing vehicles. However, when I train it on Google Colab, it always stops here:
train: Scanning 'MyDataset/train/labels.cache' for images and labels... 26559 found, 0 missing, 0 empty, 0 corrupted: 100% 26559/26559 [00:00<?, ?it/s]
train: Caching images (8.5GB): 62% 16425/26559 [00:46<00:30, 330.41it/s]C
CPU times: user 850 ms, sys: 162 ms, total: 1.01 s
Wall time: 1min 26s
I followed the tutorial from Roboflow. When I switched to the smaller dataset provided by Roboflow, the training was able to proceed. I'm a Colab Pro+ user, so it shouldn't be a matter of not having enough memory.

I switched to a smaller dataset and now it loads without any problems.
train: Caching images (4.6GB): 100% 8853/8853 [00:18<00:00, 483.20it/s]
Then it started training smoothly.
I think it is indeed a matter of too much data. However, Colab is not giving me any indication of running out of memory.
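One way to check whether the image cache can fit is to compare the reported cache size with the RAM that is actually free before training starts. The sketch below is only an illustration: it parses text in the format of Linux's /proc/meminfo, and the helper names, the sample numbers, and the 80% headroom factor are assumptions, not part of YOLOv5.

```python
def available_ram_gb(meminfo_text):
    """Return MemAvailable from /proc/meminfo-style text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the value is reported in kB (KiB)
            return kib / (1024 ** 2)
    raise ValueError("MemAvailable not found")


def cache_fits(cache_gb, meminfo_text, headroom=0.8):
    """True if the cache fits in a fraction (headroom) of the free RAM."""
    return cache_gb <= available_ram_gb(meminfo_text) * headroom


# Hypothetical snapshot; on a real machine use open("/proc/meminfo").read()
sample = "MemTotal:       13298580 kB\nMemAvailable:    8388608 kB\n"

print(cache_fits(8.5, sample))  # cache size of the larger dataset in the log
print(cache_fits(4.6, sample))  # cache size of the smaller dataset in the log
```

If the check fails, training without the RAM cache is worth a try. As far as I know, recent versions of YOLOv5's train.py also accept `--cache disk`, but verify that flag against your own checkout.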

Related

Does number of samples affect the GPU memory?

I am trying to train a CNN network for video frame prediction. My images are large (10 * 480 * 1440 * 3). I want to know if the number of samples that I am using for training is going to affect the GPU memory use, or only the batch size (and also network parameters) need to fit into the GPU memory?
The problem is when I load 100 samples for training with batch_size = 1, I can train the model. However, when I increase the number of samples to 200 I run out of GPU memory.
My machine configuration is:
GPU: A100 NVIDIA 40 GB memory
System memory: 1008 GB
I would appreciate any suggestion to solve this issue.
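As a rough sanity check, the raw size of the samples can be computed from the shape given above. This sketch assumes float32 storage (4 bytes per value), which the question does not state, and the helper names are made up for illustration. If the whole dataset is moved to the GPU at once, its footprint grows with the number of samples; if only one batch at a time is moved, it does not.

```python
from math import prod

SAMPLE_SHAPE = (10, 480, 1440, 3)  # shape of one sample, from the question
BYTES_PER_VALUE = 4                # assumption: float32


def dataset_bytes(num_samples, shape=SAMPLE_SHAPE, itemsize=BYTES_PER_VALUE):
    """Raw bytes needed to hold num_samples samples of the given shape."""
    return num_samples * prod(shape) * itemsize


def batch_indices(num_samples, batch_size):
    """Yield index lists so that only one batch has to be on the GPU at a time."""
    for start in range(0, num_samples, batch_size):
        yield list(range(start, min(start + batch_size, num_samples)))


for n in (1, 100, 200):
    print(n, "samples:", round(dataset_bytes(n) / 1e9, 2), "GB")
```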

Does TensorFlow Serving run inference with a cache?

When I serve my TF model with TensorFlow Serving (version 2.1.0) through Docker, I run a stress test with JMeter. There is a problem: TPS reaches 4400 when testing with a single data item, but only 1700 when testing with multiple data items read from a txt file. The model is a BiLSTM that I trained without any cache setting. All experiments were performed on a local server rather than over the network.
Metrics:
In the single-data task, I send HTTP requests with identical data, without interval, from 30 request threads for 10 minutes.
TPS: 4491
CPU occupied: 2100%
99% Latency Line (ms): 17
error rate: 0
In the multiple-data task, I send HTTP requests whose data is read from a txt file, a dataset with 9740000 different examples, from 30 request threads.
TPS: 1711
CPU occupied: 2300%
99% Latency Line (ms): 42
error rate: 0
Hardware:
CPU cores:12
processor: 24
Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
Is there a cache in TensorFlow Serving?
Why is TPS about three times higher when testing with a single data item than with varied data in the stress test?
I've solved the problem. Request threads reading the same file have to wait for one another, and that waiting costs CPU time on the machine running JMeter.

Turi Create: slow training performance on Blackmagic eGPU

On my MacBook Pro 13" I have the Blackmagic eGPU (AMD Radeon Pro 580) connected via USB-C. This should theoretically speed up my model training with Turi Create enormously.
For a small model, in my case 15 labeled images (4k x 3k) and 500 iterations, training takes about 2 hours with the eGPU. CPU only takes 4 hours, so the GPU speeds things up, but not dramatically.
The Turi Create guide says that an object detection model with ~700 images and 4000 iterations is trained in 1 hour, so way faster.
When using CreateML, I observe a performance increase of at least 5x for transfer learning during the feature detection phase with the eGPU.
Is this a problem of the framework itself?
Can I optimize the data or training parameters for better usage of the eGPU?
Is the data too small or the resolution too big to have optimal GPU usage over USB-C?
Class : ObjectDetector
Schema
------
Model : darknet-yolo
Number of classes : 4
Non-maximum suppression threshold : 0.45
Input image shape : (3, 416, 416)
Training summary
----------------
Training time : 1h 29m 8s
Training epochs : 1066
Training iterations : 500
Number of examples (images) : 15
Number of bounding boxes (instances) : 49
Final loss (specific to model) : 1.808
It is the image size/resolution (4k x 3k) that creates the bottleneck for the GPU. Scaling the images down (and adjusting the labels accordingly) gets the full speed of the eGPU (100x vs CPU).
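Adjusting the labels when downscaling can be sketched as below. This assumes the annotations use Turi Create's object detection format, where each box is a dict with a 'label' and center-based 'coordinates' ('x', 'y', 'width', 'height') in pixels; the function name and the sample values are made up for illustration. The images themselves would have to be resized by the same factor with an image library.

```python
def scale_annotations(annotations, factor):
    """Return a copy of the bounding boxes scaled by the same factor as the image."""
    scaled = []
    for ann in annotations:
        coords = ann["coordinates"]
        scaled.append({
            "label": ann["label"],
            "coordinates": {key: coords[key] * factor
                            for key in ("x", "y", "width", "height")},
        })
    return scaled


# Hypothetical box on a 4000 x 3000 image, downscaled to 1000 x 750
boxes = [{"label": "car",
          "coordinates": {"x": 2000, "y": 1500, "width": 400, "height": 200}}]
print(scale_annotations(boxes, 0.25))
```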

Why is my GPU slower than CPU when training LSTM/RNN models?

My machine has the following spec:
CPU: Xeon E5-1620 v4
GPU: Titan X (Pascal)
Ubuntu 16.04
Nvidia driver 375.26
CUDA toolkit 8.0
cuDNN 5.1
I've benchmarked the following Keras examples with TensorFlow as the backend, for reference:
SCRIPT NAME                   GPU      CPU
stated_lstm.py                5sec     5sec
babi_rnn.py                   10sec    12sec
imdb_bidirectional_lstm.py    240sec   116sec
imdb_lstm.py                  113sec   106sec
My GPU is clearly outperforming my CPU in non-LSTM models.
SCRIPT NAME       GPU      CPU
cifar10_cnn.py    12sec    123sec
imdb_cnn.py       5sec     119sec
mnist_cnn.py      3sec     47sec
Has anyone else experienced this?
If you use Keras, use CuDNNLSTM in place of LSTM or CuDNNGRU in place of GRU. In my case (2 Tesla M60), I am seeing a 10x performance boost. By the way, I am using batch size 128 as suggested by @Alexey Golyshev.
The batch size is too small. Try increasing it.
Results for my GTX1050Ti:
imdb_bidirectional_lstm.py
batch_size      time
32 (default)    252
64              131
96              87
128             66
imdb_lstm.py
batch_size      time
32 (default)    108
64              50
96              34
128             25
It's just a tip.
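To make the effect easier to compare, the timings above can be converted into speedups relative to the default batch size. The numbers are copied from the two tables; the unit does not matter for the ratio, and the function name is just for illustration.

```python
timings = {
    "imdb_bidirectional_lstm.py": {32: 252, 64: 131, 96: 87, 128: 66},
    "imdb_lstm.py": {32: 108, 64: 50, 96: 34, 128: 25},
}


def speedups(times, baseline=32):
    """Speedup of each batch size relative to the baseline batch size."""
    return {bs: round(times[baseline] / t, 2) for bs, t in times.items()}


for script, times in timings.items():
    print(script, speedups(times))
```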
A GPU pays off when
1. your neural network model is big.
2. the batch size is big.
It's what I found from googling.
I have got similar issues here:
Test 1
CPU: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
Ubuntu 14.04
imdb_bidirectional_lstm.py: 155s
Test 2
GPU: GTX 860m
Nvidia Driver: 369.30
CUDA Toolkit: v8.0
cuDNN: v6.0
imdb_bidirectional_lstm.py:450s
Analysis
When I observed the GPU load curve, I found one interesting thing:
for LSTM, the GPU load jumps quickly between ~80% and ~10%
[Figure: GPU load]
This is mainly due to the sequential computation in the LSTM layer. Remember that an LSTM processes its input sequentially and computes the hidden state iteratively; in other words, you must wait for the hidden state at time t-1 to calculate the hidden state at time t.
That's not a good fit for GPU cores, since they are many small cores that like doing computations in parallel; sequential computation can't fully utilize their computing power. That's why we are seeing GPU load around 10% - 20% most of the time.
But in the backpropagation phase, the GPU can run derivative computations in parallel, so we see GPU load peak around 80%.
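The sequential dependency can be seen in a stripped-down recurrence. This is a plain scalar RNN step, not a full LSTM (no gates or cell state), and the weights are arbitrary illustration values; the point is only that each step needs the previous step's result before it can start.

```python
import math


def run_recurrence(inputs, w_x=0.5, w_h=0.9, h0=0.0):
    """Compute h_t = tanh(w_x * x_t + w_h * h_{t-1}) for each time step."""
    h = h0
    states = []
    for x in inputs:  # the loop over t cannot run in parallel: h depends on the previous h
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states


print(run_recurrence([1.0, 0.0, -1.0]))
```

A larger batch helps because the work inside each time step (one row per sequence in the batch) can still run in parallel, even though the steps themselves cannot.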

I have started a training Cascade but it is VERY slow... is swap memory the issue?

I have started training a cascade with ~600 negative images and ~120 positives (distorted and transformed to make ~1500 positive). I am using opencv_traincascade and I have set the parameters as such:
numPos: 1000
numNeg: 609
numStages: 20
preCalcValBufSize: 4096 (mb)
preCalcIdxBufSize: 4096 (mb)
stageType: BOOST
featureType: Haar
sampleWidth: 80
sampleHeight: 80
maxFalseAlarmRate: 0.5
weightTrimRate: 0.95
maxDepth: 1
maxWeakCount: 100
mode: All
My computer is a Mac mini with 16 GB of memory and a quad-core i7. It also has a hard drive, not an SSD. It has been running for about 1 day 8 hours and it is on training stage 3.
I am wondering if there's any reason that the training is taking so long. At this rate it will take 6-7 days for the training to complete. One thing that I have noticed is that I am typically using 1-2 GB of swap memory and it occasionally says there is "pressure" on my memory. I don't know much about swap memory but I think it might be slowing my training down. How does this work? Also, should I restart the training with the memory usage lowered to 2048 MB for both buffer sizes for the sake of time?
I know I'm late, but maybe someone will still need an answer to this question.
Most likely your problem is using quite big images. I remember running some training (and it took about a day or two) with sampleWidth and sampleHeight set to 20. In the article mentioned above, width is 80 and height is 40, but the number of samples is much smaller and the shape of the object is quite simple.
The parameters you mentioned (preCalcValBufSize and preCalcIdxBufSize) set the maximum RAM usage, so it's like saying "you can use x MB of my RAM for preCalcValBufSize and y MB for preCalcIdxBufSize and don't even look at the rest of my memory - I need this for something else." As long as preCalcValBufSize + preCalcIdxBufSize < available RAM size, there is no need to use swap (so the OS will not use it).
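The rule of thumb above can be written as a quick check. The function name is made up, and the "available" figures are assumptions: the question only states 16 GB of total memory, and the OS and other programs take part of that.

```python
def buffers_fit(val_buf_mb, idx_buf_mb, available_ram_mb):
    """True if both opencv_traincascade precalc buffers fit in the available RAM."""
    return val_buf_mb + idx_buf_mb < available_ram_mb


# Settings from the question: 4096 MB each, on a 16 GB machine
print(buffers_fit(4096, 4096, 16 * 1024))  # optimistic: all 16 GB free
print(buffers_fit(4096, 4096, 6 * 1024))   # hypothetical: only 6 GB actually free
print(buffers_fit(2048, 2048, 6 * 1024))   # the lower setting proposed in the question
```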
