YOLOv5 training keeps running on local system - machine-learning

I recently bought a GPU (RTX 3060 Ti); before that I used to work on Google Colab (free version). I downloaded YOLOv5 to my local machine, set up an environment variable for it, and installed the required dependency libraries. To test my GPU, I ran training for 3 epochs with the same dataset I use on Colab, where it takes only around 30 seconds to complete (on a Tesla T4, which has around 2000 fewer CUDA cores than the RTX 3060 Ti). On my GPU, however, the training kept running for around 3 hours without finishing, so I interrupted it.
[Screenshot of YOLOv5 training in VS Code]
The code I ran on my local machine is:
# !git clone https://github.com/ultralytics/yolov5 # clone
# %cd yolov5
%pip install -qr requirements.txt # install
import torch
import utils
display = utils.notebook_init() # checks
# Train YOLOv5s on COCO128 for 3 epochs
!python train.py --img 412 --batch 16 --epochs 3 --data train_data/data.yaml --weights yolov5s.pt
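A quick sanity check (my addition, not part of the original post) is to confirm that PyTorch actually sees the GPU; a CPU-only PyTorch build is a common cause of a "GPU" run being orders of magnitude slower:
import torch
# If this prints the CPU message, train.py will fall back to the CPU
# (assumption: a CUDA-enabled PyTorch wheel should be installed here).
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
else:
    print("CUDA not available - training will run on the CPU")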

Related

tensorflow export model in docker gives error in operative_config.gin for t5 models

I came across a peculiar difference when exporting a t5 model after fine-tuning.
When the code runs in Docker using the original TensorFlow image with Python and exports to a folder shared between the container and the host (using 'volumes'), it throws an error about the mixture not being found. Updating operative_config.gin for the mixture used in fine-tuning allows the export to proceed, albeit with some other non-critical TensorFlow errors.
Versions: docker 20.10.16, docker-compose 1.28.2, tensorflow:latest-gpu, tensorflow 2.6.2, python 3.6.7, t5 0.9.3
When the code runs on the host with Python and Tensorflow, there are no export issues.
Versions: tensorflow 2.3.1, python 3.8.5, t5 0.7.1
The only differences I noticed are that 1) Docker changes the shared folder's user:group ownership to root:root, where 2) the folder's group permissions lack 'w' (i.e. write), and 3) the software/package versions differ.
Has anyone encountered and solved this?
Update 1
No change when running docker-compose as the current user instead of root - see this article
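For reference, a minimal sketch of what running docker-compose as the current user looks like (the service name t5-export is hypothetical, not from the original post):
docker-compose run -u "$(id -u):$(id -g)" t5-export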
Update 2
Following the .gin update in Docker, the exported model does not seem to have any defects in its application.

GPU Dask Cuda cluster: client.submit

I am quite familiar with Dask distributed for CPUs. I'd like to explore a transition to running my code on GPU cores. When I submit a task to the LocalCUDACluster I get this error:
ValueError: tuple is not allowed for map key
This is my test case:
import cupy as cp
import numpy as np
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# One worker is created per GPU
cluster = LocalCUDACluster()
c = Client(cluster)

def test_function(x):
    return x + 1

sample_np = np.array([0, 1])
sample_cp = cp.asarray(sample_np)  # copy the array to the GPU

test_1 = test_function(sample_cp)            # runs locally
test_2 = c.submit(test_function, sample_cp)  # runs on the cluster
test_2 = test_2.result()
test_1 output:
array([1, 2])
test_2 output:
distributed.protocol.core - CRITICAL - Failed to deserialize
.....
ValueError: tuple is not allowed for map key
How do I correctly distribute tasks on CUDA cores?
UPDATE:
I managed to get it working by first installing the Dask Distributed and Dask-CUDA releases.
However, I noticed that only 1 worker is available, even though I have 600 CUDA cores. How do I distribute individual tasks across these 600 CUDA cores? I'd like to parallelize tasks over them.
Versions:
dask 2.17.2
dask-cuda 0.13.0
cupy 7.5.0
cudf 0.13.0
msgpack-python 1.0.0
It looks like this question has an answer in the comments, so I'm going to copy a response from Nick Becker:
Dask's distributed scheduler is single threaded (CPU and GPU), and Dask-CUDA uses a one worker per GPU model. This means that each task assigned to a given GPU will run serially, but that the task itself will use the GPU for parallelized computation. You may want to look at the Dask documentation and explore Dask.Array (which also supports GPU arrays).
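Following that suggestion, here is a minimal sketch (my addition, not from the thread) of a CuPy-backed Dask.Array; each chunk becomes one task, and CuPy parallelizes the per-chunk computation across the GPU's CUDA cores:
import cupy as cp
import dask.array as da

# Keep the chunks as CuPy arrays on the GPU
# (asarray=False avoids converting them to NumPy)
x = da.from_array(cp.arange(1_000_000), chunks=100_000, asarray=False)
y = (x + 1).sum()
print(y.compute())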

Keras with TensorFlow backend not using GPU

I built the GPU version of the docker image https://github.com/floydhub/dl-docker with keras version 2.0.0 and tensorflow version 0.12.1. I then ran the mnist tutorial https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py but realized that Keras is not using the GPU. Below is the output that I have:
root@b79b8a57fb1f:~/sharedfolder# python test.py
Using TensorFlow backend.
Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
x_train shape: (60000, 28, 28, 1)
60000 train samples
10000 test samples
Train on 60000 samples, validate on 10000 samples
Epoch 1/12
2017-09-06 16:26:54.866833: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 16:26:54.866855: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 16:26:54.866863: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 16:26:54.866870: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-06 16:26:54.866876: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Can anyone let me know if there are some settings that need to be made before Keras uses the GPU? I am very new to all this, so do let me know if I need to provide more information.
I have installed the prerequisites as mentioned on the page:
Install Docker following the installation guide for your platform: https://docs.docker.com/engine/installation/
I am able to launch the docker image
docker run -it -p 8888:8888 -p 6006:6006 -v /sharedfolder:/root/sharedfolder floydhub/dl-docker:cpu bash
GPU Version Only: Install Nvidia drivers on your machine either from Nvidia directly or follow the instructions here. Note that you don't have to install CUDA or cuDNN. These are included in the Docker container.
I am able to run the last step
cv@cv-P15SM:~$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 375.66 Mon May 1 15:29:16 PDT 2017
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
GPU Version Only: Install nvidia-docker: https://github.com/NVIDIA/nvidia-docker, following the instructions here. This will install a replacement for the docker CLI. It takes care of setting up the Nvidia host driver environment inside the Docker containers and a few other things.
I am able to run the step here
# Test nvidia-smi
cv@cv-P15SM:~$ nvidia-docker run --rm nvidia/cuda nvidia-smi
Thu Sep 7 00:33:06 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66 Driver Version: 375.66 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 780M Off | 0000:01:00.0 N/A | N/A |
| N/A 55C P0 N/A / N/A | 310MiB / 4036MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 Not Supported |
+-----------------------------------------------------------------------------+
I am also able to run the nvidia-docker command to launch a gpu supported image.
What I have tried
I have tried the following suggestions:
Check if you have completed step 9 of this tutorial ( https://github.com/ignaciorlando/skinner/wiki/Keras-and-TensorFlow-installation ). Note: Your file paths may be completely different inside that docker image, you'll have to locate them somehow.
I appended the suggested lines to my bashrc and have verified that the bashrc file is updated.
echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-8.0/lib64:/usr/local/cuda-8.0/extras/CUPTI/lib64' >> ~/.bashrc
echo 'export CUDA_HOME=/usr/local/cuda-8.0' >> ~/.bashrc
And adding the following commands to my Python file:
import os
os.environ["CUDA_DEVICE_ORDER"]="PCI_BUS_ID" # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"]="0"
Both steps, done separately or together, unfortunately did not solve the issue. Keras is still running with the CPU version of TensorFlow as its backend. However, I might have found the possible issue. I checked my TensorFlow installation via the following commands and found two packages.
This is the CPU version
root@08b5fff06800:~# pip show tensorflow
Name: tensorflow
Version: 1.3.0
Summary: TensorFlow helps the tensors flow
Home-page: http://tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /usr/local/lib/python2.7/dist-packages
Requires: tensorflow-tensorboard, six, protobuf, mock, numpy, backports.weakref, wheel
And this is the GPU version
root@08b5fff06800:~# pip show tensorflow-gpu
Name: tensorflow-gpu
Version: 0.12.1
Summary: TensorFlow helps the tensors flow
Home-page: http://tensorflow.org/
Author: Google Inc.
Author-email: opensource@google.com
License: Apache 2.0
Location: /usr/local/lib/python2.7/dist-packages
Requires: mock, numpy, protobuf, wheel, six
Interestingly, the output shows that Keras is using TensorFlow version 1.3.0, which is the CPU version, and not 0.12.1, the GPU version:
import keras
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D
from keras import backend as K
import tensorflow as tf
print('Tensorflow: ', tf.__version__)
Output
root@08b5fff06800:~/sharedfolder# python test.py
Using TensorFlow backend.
Tensorflow: 1.3.0
I guess now I need to figure out how to have keras use the gpu version of tensorflow.
It is never a good idea to have both tensorflow and tensorflow-gpu packages installed side by side (the one single time it happened to me accidentally, Keras was using the CPU version).
"I guess now I need to figure out how to have keras use the gpu version of tensorflow."
You should simply remove both packages from your system, and then re-install tensorflow-gpu [UPDATED after comment]:
pip uninstall tensorflow tensorflow-gpu
pip install tensorflow-gpu
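After reinstalling, one way to verify that TensorFlow can actually see the GPU on the TF 1.x versions discussed here (a check I'm adding for completeness, not part of the original answer):
from tensorflow.python.client import device_lib
# A GPU-enabled install should list a device like '/gpu:0' next to '/cpu:0'
print(device_lib.list_local_devices())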
Moreover, it is puzzling why you seem to use the floydhub/dl-docker:cpu container, while according to the instructions you should be using the floydhub/dl-docker:gpu one...
I had a similar kind of issue: Keras didn't use my GPU. I had tensorflow-gpu installed in conda according to the instructions, but after installing keras it simply did not list the GPU as an available device. I realized that installing keras adds the tensorflow package! So I had both the tensorflow and tensorflow-gpu packages. I found that there is a keras-gpu package available. After completely uninstalling keras, tensorflow, and tensorflow-gpu, then installing tensorflow-gpu and keras-gpu, the problem was solved.
In the future, you can try using virtual environments to separate tensorflow CPU and GPU, for example:
conda create --name tensorflow python=3.5
activate tensorflow
pip install tensorflow
AND
conda create --name tensorflow-gpu python=3.5
activate tensorflow-gpu
pip install tensorflow-gpu
This worked for me:
Install tensorflow v2.2.0
pip install tensorflow==2.2.0
Also remove tensorflow-gpu (if it's present)
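On TF 2.x you can confirm that the GPU is visible with a one-liner (my addition, valid for TensorFlow 2.1+):
import tensorflow as tf
# Returns a non-empty list when a GPU build and driver are set up correctly
print(tf.config.list_physical_devices('GPU'))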

gcloud ml-engine predict is very slow on inference

I'm testing a segmentation model on gcloud and the inference is incredibly slow. It takes 3 min to get the result (averaged over 5 runs). The same model runs in ~2.5 s on my laptop when served through tf-serving.
Is this normal? I didn't find any mention in the documentation of how to define the instance type, and it seems impossible to run inference on a GPU.
The steps I'm using are fairly straightforward and follow the examples and tutorials:
gcloud ml-engine models create "seg_model"
gcloud ml-engine versions create v1 \
--model "seg_model" \
--origin $DEPLOYMENT_SOURCE \
--runtime-version 1.2 \
--staging-bucket gs://$BUCKET_NAME
gcloud ml-engine predict --model ${MODEL_NAME} --version v1 --json-instances request.json
Update: after running more experiments I found that redirecting output to a file gets the inference time down to 27 s. The model output size is 512x512, which probably causes some delay on the client side. Although 27 s is much less than 3 min, it is still an order of magnitude slower than tf-serving.
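For anyone reproducing the measurement, the redirected variant described above can be timed like this (response.json is a hypothetical file name):
time gcloud ml-engine predict --model ${MODEL_NAME} --version v1 --json-instances request.json > response.json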

XGB via Scikit learn API doesn't seem to be running in GPU although compiled to run for GPU

It appears that although XGBoost is compiled to run on the GPU, when it is called/executed via the Scikit-learn API it doesn't seem to be running on the GPU.
Please advise if this is expected behaviour.
As far as I can tell, the Scikit-learn API does not currently support GPU. You need to use the learning API (e.g. xgboost.train(...)). This also requires you to first convert your data into an xgboost DMatrix.
Example:
import xgboost

params = {"updater": "grow_gpu"}  # grow_gpu selects the GPU tree updater
train = xgboost.DMatrix(x_train, label=y_train)  # x_train/y_train: your data
clf = xgboost.train(params, train, num_boost_round=10)
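For completeness (my addition), inference with the resulting booster also goes through DMatrix; x_test here is hypothetical:
preds = clf.predict(xgboost.DMatrix(x_test))  # x_test: your test data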
UPDATE:
The Scikit Learn API now supports GPU via the **kwargs argument:
http://xgboost.readthedocs.io/en/latest/python/python_api.html#id1
I couldn't get this working from the pip-installed XGBoost, but I pulled the most recent XGBoost from GitHub (git clone --recursive https://github.com/dmlc/xgboost) and compiled it with the PLUGIN_UPDATER_GPU flag, which allowed me to use the GPU with the sklearn API. This also required me to change some NVCC flags to work on my GTX 960, which was causing some build errors and then some runtime errors due to architecture mismatch. After it built, I installed it with pip install -e python-package/ within the repo directory. To use the Scikit-learn API (using either grow_gpu or grow_hist_gpu):
import xgboost as xgb

model = xgb.XGBClassifier(
    max_depth=5,
    objective='binary:logistic',
    **{"updater": "grow_gpu"}  # GPU updater passed through **kwargs
)
model.fit(train_x, train_y)
If anyone is interested in the process to fix the build with the GPU flag, here is the process that I went through on Ubuntu 14.04.
i) git clone --recursive https://github.com/dmlc/xgboost
ii) cd into xgboost and run make -j4 to create a multi-threaded build, if no GPU support is desired
iii) to build with GPU support, edit make/config.mk to use PLUGIN_UPDATER_GPU
iv) Edit the Makefile, in the NVCC section, to use the flag --gpu-architecture=sm_xx for your GPU version (sm_52 for the GTX 960) on line 101, changing
#CODE = $(foreach ver,$(COMPUTE),-gencode arch=compute_$(ver),code=sm_$(ver))
to
CODE = --gpu-architecture=sm_52
v) Run ./build.sh; it should say it completed in multi-threaded mode, otherwise the NVCC build probably failed (or another error occurred; look above for the error)
vi) In the virtualenv (if desired) in the same directory run pip install -e python-package/
These are some things that caused some nvcc errors for me:
i) Installing/updating the CUDA Toolkit by downloading the CUDA Toolkit .deb from Nvidia (version 8.0 worked for me, and seems to be required in some cases).
ii) Install/update cuda
sudo apt-get update
sudo apt-get install cuda
iii) Add nvcc to your PATH. Mine was in /usr/local/cuda/bin/ - see the quick check after this list.
iv) A restart may be required if running nvidia-smi does not work due to some of the cuda/driver/toolkit updates.
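A quick check (my addition) that nvcc is on the PATH after step iii, assuming the usual default install location:
export PATH=/usr/local/cuda/bin:$PATH
nvcc --version  # should print the CUDA release, e.g. 8.0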
