Any doable approach to use multiple GPUs, multiple processes with TensorFlow in Docker?

I am using a Docker container to run my experiment. I have multiple GPUs available and I want to use all of them for a single program. To do so, I used tf.distribute.MirroredStrategy as suggested on the TensorFlow site, but it is not working; the full error message is in the linked gist.
Here is the available GPU info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:6A:00.0 Off | 0 |
| N/A 31C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:6B:00.0 Off | 0 |
| N/A 31C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:6C:00.0 Off | 0 |
| N/A 34C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:6D:00.0 Off | 0 |
| N/A 34C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
My current attempt
Here is my attempt using tf.distribute.MirroredStrategy:
import tensorflow as tf

device_type = "GPU"
devices = tf.config.experimental.list_physical_devices(device_type)
# Strip the "/physical_device:" prefix so the names look like "GPU:0", "GPU:1", ...
devices_names = [d.name.split("e:")[1] for d in devices]

strategy = tf.distribute.MirroredStrategy(devices=devices_names[:3])
with strategy.scope():
    model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
The above attempt does not work and gives the error listed in the gist above. I have not found another way of using multiple GPUs for a single experiment.
Does anyone have a workable approach to make this happen? Any thoughts?

Is MirroredStrategy the proper way to distribute the workload?
The approach is correct, as long as the GPUs are on the same host. The TensorFlow manual has examples of how tf.distribute.MirroredStrategy can be used with Keras to train the MNIST dataset.
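For reference, here is a minimal sketch of that pattern, loosely following the Keras/MNIST example in the TF documentation; the model architecture and hyperparameters are illustrative placeholders, not taken from the question:

import tensorflow as tf

# Mirror the model across all GPUs visible to this process
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

with strategy.scope():
    # Build and compile inside the scope so the variables are mirrored
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

model.fit(x_train, y_train, batch_size=64 * strategy.num_replicas_in_sync, epochs=1)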
Is MirroredStrategy the only strategy?
No, there are multiple strategies that can be used to achieve the workload distribution. For example, tf.distribute.MultiWorkerMirroredStrategy can be used to distribute the work across multiple devices through multiple workers.
The TF documentation explains the strategies and their limitations, and provides some examples to help kick-start the work.
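As a rough sketch of the multi-worker case: each worker process sets TF_CONFIG before creating the strategy and then builds the model inside the scope. The host names, ports, and tiny model below are placeholders, not values from the question:

import json
import os

import tensorflow as tf

# Each worker runs the same script with its own task index; hosts/ports are placeholders
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["host1:12345", "host2:12345"]},
    "task": {"type": "worker", "index": 0},  # use index 1 on the second worker
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")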
The strategy is throwing an error
According to this GitHub issue, the ValueError: SyncOnReadVariable does not support 'assign_add' ... is a bug in TensorFlow that was fixed in TF 2.4.
You can try to upgrade the TensorFlow libraries with:
pip install --ignore-installed --upgrade tensorflow
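After upgrading, it is worth confirming that the running interpreter actually picks up a version that contains the fix (2.4 or later):

import tensorflow as tf

# The SyncOnReadVariable assign_add error is reported as fixed in TF 2.4
print(tf.__version__)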
Implementing variables that are not aware of the distributed strategy
If the standard example from the documentation works fine but your model does not, you might have variables that are incorrectly set up, or you might be using distributed variables that do not support the aggregation functions required by the distributed strategy.
As per the TF documentation:
..."
A distributed variable is variables created on multiple devices. As discussed in the glossary, mirrored variable and SyncOnRead variable are two examples.
"...
To better understand how to implement custom support for distributed variables, check the corresponding page in the documentation.
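As a minimal sketch (not code from the documentation), a SyncOnRead variable can be created explicitly by passing the synchronization and aggregation arguments to tf.Variable inside the strategy scope; the variable name here is made up for illustration:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # A SyncOnRead variable: each replica keeps its own copy, and the copies are
    # aggregated (summed here) when the variable is read outside the replica context
    batch_counter = tf.Variable(
        0.0,
        trainable=False,
        synchronization=tf.VariableSynchronization.ON_READ,
        aggregation=tf.VariableAggregation.SUM,
    )

def replica_step():
    batch_counter.assign_add(1.0)

strategy.run(replica_step)
print(batch_counter.read_value())  # sum of the per-replica values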

Related

ROS services are not building properly

I am trying to build my custom ROS services. They are inside another parent package.
The structure is as follows:
|--catkin_ws
| |--src
| | |--Parent
| | | |--CMakeLists.txt
| | | |--package.xml
| | | |--ChildA
| | | | |--CMakeLists.txt
| | | | |--package.xml
| | | | |--srv
| | | | | |--SomeService.srv
| | | |--ChildB
The packages are building correctly and I am able to use them in other nodes and packages.
However, when I try to use rossrv list, the custom services do not appear. I think this is causing issues when I try to build my Simulink controller, since it cannot find the service message definition.
Does anyone have any idea what is going on?
I was able to fix the problem; while not obvious, the solution was rather simple. I had to slightly change the structure of the package by making the parent package a metapackage, then do some handling to make sure that the sub-packages still had access to the CMake files needed to locate my external packages.

Even Easier Introduction to CUDA - not printing after memory initialization

I'm following the Even Easier Introduction to CUDA tutorial. I have literally copied and pasted the complete code. add.cu compiles; however, when I run it, it doesn't print anything. I put in some more print statements and narrowed it down:
printf("Hi\n");
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
printf("Bye");
It prints "Hi", but never prints "Bye". So something seems to be wrong with the memory initialization. What is going wrong here?
I solved the problem myself. Basically, my device drivers were screwed up. To check if you have the same problem, run Command Prompt as administrator and run nvidia-smi. If you have the same problem, it will give you an error saying it failed to communicate because your device drivers are out of date or incorrect.
Download the latest NVIDIA driver for your computer (I found mine on Dell Drivers & Downloads) and install it. Now, when you run nvidia-smi as admin in Command Prompt, it should give you a whole bunch of details about your setup (driver version, CUDA version, etc.) like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 441.14 Driver Version: 441.14 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 30C P8 N/A / N/A | 78MiB / 4096MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
You should now be able to compile and run CUDA programs with unified memory.

Dynamically decide which GPU to run on - TF on NVIDIA docker

I have a queue of models, of which I allow only 2 to be executed in parallel, since I have 2 GPUs.
For that, at the beginning of my code I try to determine which GPU is available using GPUtil. Maybe it's relevant: this code runs inside a Docker container that was launched using the --runtime=nvidia flag.
The code that determines which GPU to run on looks like this:
import os
import GPUtil
gpu1, gpu2 = GPUtil.getGPUs()
available_gpu = gpu1 if gpu1.memoryFree > gpu2.memoryFree else gpu2
os.environ['CUDA_VISIBLE_DEVICES'] = str(available_gpu.id)
import tensorflow as tf
Now, I launched two scripts this way (with a slight delay until the first one occupied a GPU) but both of them tried to use the same GPU!
I went further to examine the problem - I manually set the os.environ['CUDA_VISIBLE_DEVICES'] = '1' and let the model run.
As it was training, I checked the output of nvidia-smi and saw the following
user@server:~$ docker exec awesome_gpu_container nvidia-smi
Mon Mar 12 06:59:27 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Ti Off | 00000000:01:00.0 On | N/A |
| 0% 50C P2 131W / 280W | 5846MiB / 6075MiB | 81% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:03:00.0 Off | N/A |
| 0% 39C P8 14W / 200W | 2MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
And I noticed that while I had set the visible device to 1, it was actually running on GPU 0.
I stress again that my goal is that, while queuing multiple models, each one that starts running decides for itself which GPU to use.
I explored allow_soft_placement=True, but that allocated memory on both GPUs, so I stopped the process.
Bottom line, how can I make sure my training scripts only use one GPU, and make them choose the free one?
As described in the CUDA programming guide, the default device enumeration used by CUDA is "fastest first":
CUDA_DEVICE_ORDER
FASTEST_FIRST, PCI_BUS_ID (default is FASTEST_FIRST)
FASTEST_FIRST causes CUDA to guess which device is fastest using a simple heuristic, and make that device 0, leaving the order of the rest of the devices unspecified.
PCI_BUS_ID orders devices by PCI bus ID in ascending order.
If you set CUDA_DEVICE_ORDER=PCI_BUS_ID, the CUDA ordering will match the device ordering shown by nvidia-smi.
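Applied to the selection code from the question, that means setting the ordering variable before choosing a device and before importing TensorFlow. A sketch under that assumption (the selection is written with max() over free memory, which is equivalent to the original two-GPU comparison):

import os

import GPUtil

# Make CUDA enumerate GPUs in the same (PCI bus) order that nvidia-smi reports
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"

# Pick the GPU with the most free memory, as in the original snippet
gpus = GPUtil.getGPUs()
available_gpu = max(gpus, key=lambda g: g.memoryFree)
os.environ["CUDA_VISIBLE_DEVICES"] = str(available_gpu.id)

import tensorflow as tf  # imported only after the environment is configured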
Since you are using Docker, you can also enforce stronger isolation with our runtime:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 ...
But that's at container startup time.

Does Release Management 2013 roll back across tags?

We're heavy users of tags and I'm confused about how tags and rollbacks interact.
I understand that rollbacks cascade (at least within a sequence) from this article:
http://incyclesoftware.com/2014/03/understanding-rollbacks-release-management/
But I'm not clear how this interacts with tags: we tag servers by what features are installed on them (web, database, service) and vary the mix of features depending on the environment (e.g. DEV might have web & services running on the same machine, but UAT & PROD would have separate machines).
So does the rollback go back across tag boundaries? For example, if your sequence looked like this:
+--Database tag --+
| Backup DB |
| | |
| Update DB |
| | | <- Runs against SQL server
| +--Rollback--+ |
| | Restore DB | |
| +------------+ |
+-----------------+
|
+---Web Tag-------+
| Do Stuff | <- Runs against WEB server
+-----------------+
|
+---Service tag----+
| Backup |
| | |
| Install new ver | <- Runs against Service server
| | |
| Smoke test |
| | |
| +--Rollback----+ |
| | Replace with | |
| | backup | |
| +--------------+ |
+------------------+
Would a rollback inside the service tag cause the database tag to execute its rollback? Do rollbacks cascade across sequences?
I haven't had time to set this up yet and test so I thought I'd ask the question instead.
By accident I managed to test this out with a suitable release, and rollback does roll back across the tags, as #joerage says.
It appears I was wrong... faulty memory and all that. Rollbacks work across tag boundaries.
I generally recommend against using rollback blocks, since their behavior is generally backwards, unpredictable, and not immediately obvious. The current best practice is actually to not use agent-based releases at all, as they will not be portable to the forthcoming Release Management Service.

Obtain Bacula status in parseable format

Is it possible to obtain the status of the Bacula backup system's Director in some parseable format?
It looks like the human-readable representation (the one you can see when using bacula-console) is formed on the Director side over the TCP control connection.
In what language? The easiest way would be to invoke bconsole, send commands on stdin, then parse stdout and stderr.
bconsole has an interactive mode, but if you know the commands in advance, this is not an issue.
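A minimal sketch of that approach, assuming bconsole is on the PATH and already configured to reach the Director; the command string is just an example:

import subprocess

# Feed non-interactive commands to bconsole on stdin and capture the text output
result = subprocess.run(
    ["bconsole"],
    input="status director\nquit\n",
    capture_output=True,
    text=True,
    check=True,
)

# The output is the same human-readable text the console shows; parse it as needed
print(result.stdout)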
You can also pull directly from the database, depending on your needs.
Example:
mysql> select JobId, Name, JobStatus from Job ORDER BY JobId DESC Limit 10;
+--------+-------------------------------------+-----------+
| JobId | Name | JobStatus |
+--------+-------------------------------------+-----------+
| 231215 | dbs16 Daily MysqlC XBM Snapshot | T |
| 231214 | dbs09 Daily MysqlS XBM Snapshot | T |
| 231213 | dbs10 Daily MysqlQ XBM Snapshot | T |
| 231212 | dbs11 Daily MysqlT XBM Snapshot | T |
| 231211 | dbs16 Daily MysqlI XBM Snapshot | T |
| 231210 | dbs19 Daily MysqlE XBM Snapshot | T |
| 231209 | dbs18 Daily MysqlB XBM Snapshot | R |
| 231208 | dbs17 Daily MysqlG XBM Snapshot | R |
| 231207 | Daily Catalog Backup | C |
| 231206 | adm6 svnops SVN Backup | R |
+--------+-------------------------------------+-----------+
