Even Easier Introduction to CUDA - not printing after memory initialization

I'm following the Even Easier Introduction to CUDA tutorial. I literally copied and pasted the complete code. add.cu compiles; however, when I run it, it doesn't print anything. I added some more print statements and narrowed it down:
printf("Hi\n");
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
printf("Bye");
It prints "Hi", but never prints "Bye". So something seems to be wrong with the memory initialization. What is going wrong here?

I solved the problem myself. Basically, my device drivers were screwed up. To check if you have the same problem, open Command Prompt as administrator and run nvidia-smi. If you have the same problem, it will give you an error saying it failed to communicate with the NVIDIA driver because your device drivers are out of date or wrong, or something of the sort.
Download the latest NVIDIA driver for your computer (I found mine on Dell Drivers & Downloads) and install it. Now, when you run nvidia-smi as admin in Command Prompt, it should give you a whole bunch of details about your setup (driver version, CUDA version, etc.) like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 441.14 Driver Version: 441.14 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... WDDM | 00000000:01:00.0 Off | N/A |
| N/A 30C P8 N/A / N/A | 78MiB / 4096MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
You should now be able to compile and run CUDA programs that use unified memory.

Related

OpenCV Python on WSL 2 error: "(-217:Gpu API call) CUDA driver version is insufficient for CUDA runtime version"

I am trying to run YOLO v4 on OpenCV 4.6 Python with CUDA on WSL 2 with Ubuntu 20.04 (Focal Fossa):
import cv2 as cv
import time
Conf_threshold = 0.4
NMS_threshold = 0.4
COLORS = [(0, 255, 0), (0, 0, 255), (255, 0, 0),
          (255, 255, 0), (255, 0, 255), (0, 255, 255)]
class_name = []
with open('classes.txt', 'r') as f:
    class_name = [cname.strip() for cname in f.readlines()]
# print(class_name)
net = cv.dnn.readNet('yolov4-tiny.weights', 'yolov4-tiny.cfg')
net.setPreferableBackend(cv.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv.dnn.DNN_TARGET_CUDA_FP16)
model = cv.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1/255, swapRB=True)
cap = cv.VideoCapture('output.avi')
starting_time = time.time()
frame_counter = 0
while True:
    ret, frame = cap.read()
    frame_counter += 1
    if ret == False:
        break
    classes, scores, boxes = model.detect(frame, Conf_threshold, NMS_threshold)
    ...
I got the following error:
Traceback (most recent call last):
  File "yolov4.py", line 28, in <module>
    classes, scores, boxes = model.detect(frame, Conf_threshold, NMS_threshold)
cv2.error: OpenCV(4.6.0-dev) /home/tong/source/opencv/modules/dnn/src/cuda4dnn/csl/memory.hpp:54: error: (-217:Gpu API call) CUDA driver version is insufficient for CUDA runtime version in function 'ManagedPtr'
I have checked the Nvidia driver by using nvidia-smi.exe:
NVIDIA-SMI 516.59 Driver Version: 516.59 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... WDDM | 00000000:0D:00.0 On | N/A |
| 24% 51C P0 44W / 215W | 2711MiB / 8192MiB | 6% Default |
| | | N/A
As well as the CUDA toolkit packages installed, via dpkg -l | grep cuda-toolkit:
ii cuda-toolkit-11-6 11.6.2-1 amd64 CUDA Toolkit 11.6 meta-package
ii cuda-toolkit-11-6-config-common 11.6.55-1 all Common config package for CUDA Toolkit 11.6.
ii cuda-toolkit-11-7 11.7.0-1 amd64 CUDA Toolkit 11.7 meta-package
ii cuda-toolkit-11-7-config-common 11.7.60-1 all Common config package for CUDA Toolkit 11.7.
ii cuda-toolkit-11-config-common 11.7.60-1 all Common config package for CUDA Toolkit 11.
ii cuda-toolkit-config-common 11.7.60-1 all Common config package for CUDA Toolkit.
OpenCV was compiled against CUDA 11.7, according to CMakeCache.txt:
// Version of CUDA as computed from nvcc.
CUDA_VERSION:STRING=11.7
nvcc also points to 11.7:
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Tue_May__3_18:49:52_PDT_2022
Cuda compilation tools, release 11.7, V11.7.64
Build cuda_11.7.r11.7/compiler.31294372_0
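As an extra diagnostic (not part of the original post), you can also ask the installed OpenCV itself what it sees from Python; both calls below are standard OpenCV API:

import cv2 as cv

# Number of CUDA devices this OpenCV build can use; 0 usually means the
# driver/runtime pairing (or the build itself) is broken from OpenCV's view.
print(cv.cuda.getCudaEnabledDeviceCount())

# The build information includes the CUDA version OpenCV was compiled against.
print(cv.getBuildInformation())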
How can I fix this problem?

Any doable approach to use multiple GPUs, multiple processes with TensorFlow?

I am using a Docker container to run my experiment. I have multiple GPUs available and I want to use all of them for my experiment, i.e. I want to utilize all GPUs for one program. To do so, I used the tf.distribute.MirroredStrategy suggested on the TensorFlow site, but it is not working. The full error messages are posted in a gist.
Here is the available GPU info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06 Driver Version: 450.51.06 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:6A:00.0 Off | 0 |
| N/A 31C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla T4 Off | 00000000:6B:00.0 Off | 0 |
| N/A 31C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla T4 Off | 00000000:6C:00.0 Off | 0 |
| N/A 34C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla T4 Off | 00000000:6D:00.0 Off | 0 |
| N/A 34C P8 15W / 70W | 0MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
My current attempt
Here is my attempt using tf.distribute.MirroredStrategy:
device_type = "GPU"
devices = tf.config.experimental.list_physical_devices(device_type)
devices_names = [d.name.split("e:")[1] for d in devices]
strategy = tf.distribute.MirroredStrategy(devices=devices_names[:3])
with strategy.scope():
    model.compile(optimizer=opt, loss="categorical_crossentropy", metrics=["accuracy"])
The above attempt is not working and gives the error listed in the gist above. I haven't found another way of using multiple GPUs for a single experiment.
Does anyone have a workable approach to make this happen? Any thoughts?
Is MirroredStrategy the proper way to distribute the workload?
The approach is correct, as long as the GPUs are on the same host. The TensorFlow manual has examples of how tf.distribute.MirroredStrategy can be used with Keras to train on the MNIST dataset.
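For reference, a minimal sketch of that pattern (the model, dataset and batch size here are illustrative, not taken from the question):

import tensorflow as tf

# Mirror the model across all GPUs visible on this host.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train[..., None].astype("float32") / 255.0

# Model and optimizer must be created inside the strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Scale the global batch size with the number of replicas.
model.fit(x_train, y_train,
          batch_size=64 * strategy.num_replicas_in_sync, epochs=2)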
Is MirroredStrategy the only strategy?
No, there are multiple strategies that can be used to achieve the workload distribution. For example, tf.distribute.MultiWorkerMirroredStrategy can also be used to distribute the work across multiple devices through multiple workers.
The TF documentation explains the strategies and their limitations, and provides some examples to help kick-start the work.
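A minimal single-worker sketch of that strategy, assuming a one-machine cluster; in a real setup TF_CONFIG would list every worker's address and each worker process would use its own task index:

import json
import os
import tensorflow as tf

# Each worker process needs a TF_CONFIG describing the cluster. The address
# below is a placeholder for illustration.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["localhost:12345"]},
    "task": {"type": "worker", "index": 0},
})

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Illustrative model; build and compile it inside the strategy scope.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
    model.compile(optimizer="adam", loss="mse")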
The strategy is throwing an error
According to the issue on GitHub, the ValueError: SyncOnReadVariable does not support 'assign_add' ... is a bug in TensorFlow which is fixed in TF 2.4.
You can try to upgrade the TensorFlow libraries with:
pip install --ignore-installed --upgrade tensorflow
Implementing variables that are not aware of the distributed strategy
If you have tried the standard example from the documentation and it works fine, but your model does not, you might have variables that are incorrectly set up, or you might be using distributed variables that do not support the aggregation functions required by the distributed strategy.
As per the TF documentation:
"A distributed variable is a variable created on multiple devices. As discussed in the glossary, mirrored variable and SyncOnRead variable are two examples."
To better understand how to implement custom support for distributed variables, check the following page in the documentation.
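As an illustration (not from the original answer), this is roughly what a distribution-aware variable declaration looks like; the SUM aggregation is just an example:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # A SyncOnRead variable: each replica keeps its own local copy, and the
    # copies are aggregated (summed here) only when the variable is read.
    total_loss = tf.Variable(
        0.0,
        trainable=False,
        synchronization=tf.VariableSynchronization.ON_READ,
        aggregation=tf.VariableAggregation.SUM,
    )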

Dynamically decide which GPU to run on - TF on NVIDIA docker

I have a queue of models, of which I allow only 2 to be executed in parallel, since I have 2 GPUs.
For that, at the beginning of my code I try to determine which GPU is available using GPUtil. Maybe it's relevant: this code runs inside a Docker container that was launched using the --runtime=nvidia flag.
The code that determines which GPU to run on, looks like this:
import os
import GPUtil
gpu1, gpu2 = GPUtil.getGPUs()
available_gpu = gpu1 if gpu1.memoryFree > gpu2.memoryFree else gpu2
os.environ['CUDA_VISIBLE_DEVICES'] = str(available_gpu.id)
import tensorflow as tf
Now, I launched two scripts this way (with a slight delay until the first one occupied a GPU) but both of them tried to use the same GPU!
I went further to examine the problem - I manually set the os.environ['CUDA_VISIBLE_DEVICES'] = '1' and let the model run.
As it was training, I checked the output of nvidia-smi and saw the following
user#server:~$ docker exec awesome_gpu_container nvidia-smi
Mon Mar 12 06:59:27 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.111 Driver Version: 384.111 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Ti Off | 00000000:01:00.0 On | N/A |
| 0% 50C P2 131W / 280W | 5846MiB / 6075MiB | 81% Default |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX 1080 Off | 00000000:03:00.0 Off | N/A |
| 0% 39C P8 14W / 200W | 2MiB / 8114MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
And I noticed that while I had set the visible device to 1, it was actually running on GPU 0.
I stress again that my goal is, while queuing multiple models, for each one that starts running to decide for itself which GPU to use.
I explored allow_soft_placement=True, but that allocated the memory on both GPUs so I stopped the process.
Bottom line: how can I make sure my training scripts only use one GPU, and make them choose the free one?
As described in the CUDA programming guide, the default device enumeration used by CUDA is "fastest first":
CUDA_DEVICE_ORDER
FASTEST_FIRST, PCI_BUS_ID (default is FASTEST_FIRST)
FASTEST_FIRST causes CUDA to guess which device is fastest using a simple heuristic, and make that device 0, leaving the order of the rest of the devices unspecified.
PCI_BUS_ID orders devices by PCI bus ID in ascending order.
If you set CUDA_DEVICE_ORDER=PCI_BUS_ID, the CUDA device ordering will match the device ordering shown by nvidia-smi.
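Applied to the snippet from the question, a minimal sketch (the rest of the training code is assumed unchanged):

import os
import GPUtil

# Make CUDA's device numbering match nvidia-smi (PCI bus order). This must be
# set before any CUDA context is created, i.e. before importing TensorFlow.
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'

# Pick whichever GPU currently has more free memory and pin the process to it.
gpu1, gpu2 = GPUtil.getGPUs()
available_gpu = gpu1 if gpu1.memoryFree > gpu2.memoryFree else gpu2
os.environ['CUDA_VISIBLE_DEVICES'] = str(available_gpu.id)

import tensorflow as tf  # imported only after the environment is set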
Since you are using docker, you can also enforce a stronger isolation with our runtime:
docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=0 ...
But that's at container startup time.

Torque jobs cannot find GPU when CUDA_VISIBLE_DEVICES not equal 0

I'm facing a strange problem with Torque assignment of GPUs.
I'm running Torque 6.1.0 on a single machine that has two NVIDIA GTX Titan X GPUs. I'm using pbs_sched for scheduling. nvidia-smi output at rest is as follows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.39 Driver Version: 375.39 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... Off | 0000:03:00.0 On | N/A |
| 22% 40C P8 15W / 250W | 0MiB / 12204MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
| 1 GeForce GTX TIT... Off | 0000:04:00.0 Off | N/A |
| 22% 33C P8 14W / 250W | 0MiB / 12207MiB | 0% E. Process |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
I have a simple test script to assess GPU assignment as follows:
#PBS -S /bin/bash
#PBS -l nodes=1:ppn=1:gpus=1:reseterr:exclusive_process
echo "CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES"
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
deviceQuery is the utility that comes with CUDA. When I run it from the command line, it correctly finds both GPUs. When I restrict to one device from the command-line like this...
CUDA_VISIBLE_DEVICES=0 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
#or
CUDA_VISIBLE_DEVICES=1 ~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery
... it also correctly finds one or the other GPU.
When I submit test.sh to the queue with qsub, and when no other jobs are running, it again works correctly. Here's the output:
CUDA_VISIBLE_DEVICES: 0
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX TITAN X" CUDA Driver Version / Runtime Version 8.0 / 8.0 CUDA Capability Major/Minor version number: 5.2 Total amount of global memory: 12204 MBytes (12796887040 bytes) (24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores GPU Max Clock rate: 1076 MHz (1.08 GHz) Memory Clock rate: 3505 Mhz Memory Bus Width: 384-bit L2 Cache Size: 3145728 bytes Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096) Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 49152 bytes Total number of registers available per block: 65536 Warp size: 32 Maximum number of threads per multiprocessor: 2048 Maximum number of threads per block: 1024 Max dimension size of a thread block (x,y,z): (1024, 1024, 64) Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535) Maximum memory pitch: 2147483647 bytes Texture alignment: 512 bytes Concurrent copy and kernel execution: Yes with 2 copy engine(s) Run time limit on kernels: No Integrated GPU sharing Host Memory: No Support host page-locked memory mapping: Yes Alignment requirement for Surfaces: Yes Device has ECC support: Disabled Device supports Unified Addressing (UVA): Yes Device PCI Domain ID / Bus ID / location ID: 0 / 3 / 0 Compute Mode:
< Exclusive Process (many threads in one process is able to use ::cudaSetDevice() with this device) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX TITAN X Result = PASS
However, if a job is already running on gpu0 (i.e. the new job is assigned CUDA_VISIBLE_DEVICES=1), the new job cannot find any GPUs. Output:
CUDA_VISIBLE_DEVICES: 1
~/test/NVIDIA_CUDA-8.0_Samples/bin/x86_64/linux/release/deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
cudaGetDeviceCount returned 38
-> no CUDA-capable device is detected
Result = FAIL
Anyone know what is going on here?
I think I've solved my own problem, but unfortunately I tried two things at once. I don't want to go back and confirm which solved the issue. It's one of the following:
Remove the --enable-cgroups option from Torque's configure script before building.
Running these steps in the Torque install process:
make packages
sh torque-package-server-linux-x86_64.sh --install
sh torque-package-mom-linux-x86_64.sh --install
sh torque-package-clients-linux-x86_64.sh --install
For the second option, I know that these steps are properly documented in the Torque install instructions. However, I have a simple setup where I just have a single node (the compute node and the server are the same machine). I thought that 'make install' should do everything that the package installs do for that single node, but maybe I was mistaken.

Cypher load CSV eager and long action duration

I'm loading a file with 85K lines (19 MB).
The server has 2 cores and 14 GB RAM, running CentOS 7.1 and Oracle JDK 8,
and the load can take 5-10 minutes with the following server config:
dbms.pagecache.memory=8g
cypher_parser_version=2.0
wrapper.java.initmemory=4096
wrapper.java.maxmemory=4096
disk mounted in /etc/fstab:
UUID=fc21456b-afab-4ff0-9ead-fdb31c14151a /mnt/neodata ext4 defaults,noatime,barrier=0 1 2
added this to /etc/security/limits.conf:
* soft memlock unlimited
* hard memlock unlimited
* soft nofile 40000
* hard nofile 40000
added this to /etc/pam.d/su
session required pam_limits.so
added this to /etc/sysctl.conf:
vm.dirty_background_ratio = 50
vm.dirty_ratio = 80
disabled journal by running:
sudo e2fsck /dev/sdc1
sudo tune2fs /dev/sdc1
sudo tune2fs -o journal_data_writeback /dev/sdc1
sudo tune2fs -O ^has_journal /dev/sdc1
sudo e2fsck -f /dev/sdc1
sudo dumpe2fs /dev/sdc1
Besides that, when running the profiler, I get lots of "Eager" operators, and I really can't understand why:
PROFILE LOAD CSV WITH HEADERS FROM 'file:///home/csv10.csv' AS line
FIELDTERMINATOR '|'
WITH line limit 0
MERGE (session :Session { wz_session:line.wz_session })
MERGE (page :Page { page_key:line.domain+line.page })
ON CREATE SET page.name=line.page, page.domain=line.domain,
page.protocol=line.protocol,page.file=line.file
Compiler CYPHER 2.3
Planner RULE
Runtime INTERPRETED
+---------------+------+---------+---------------------+--------------------------------------------------------+
| Operator | Rows | DB Hits | Identifiers | Other |
+---------------+------+---------+---------------------+--------------------------------------------------------+
| +EmptyResult | 0 | 0 | | |
| | +------+---------+---------------------+--------------------------------------------------------+
| +UpdateGraph | 9 | 9 | line, page, session | MergeNode; Add(line.domain,line.page); :Page(page_key) |
| | +------+---------+---------------------+--------------------------------------------------------+
| +Eager | 9 | 0 | line, session | |
| | +------+---------+---------------------+--------------------------------------------------------+
| +UpdateGraph | 9 | 9 | line, session | MergeNode; line.wz_session; :Session(wz_session) |
| | +------+---------+---------------------+--------------------------------------------------------+
| +ColumnFilter | 9 | 0 | line | keep columns line |
| | +------+---------+---------------------+--------------------------------------------------------+
| +Filter | 9 | 0 | anon[181], line | anon[181] |
| | +------+---------+---------------------+--------------------------------------------------------+
| +Extract | 9 | 0 | anon[181], line | anon[181] |
| | +------+---------+---------------------+--------------------------------------------------------+
| +LoadCSV | 9 | 0 | line | |
+---------------+------+---------+---------------------+--------------------------------------------------------+
All the labels and properties have indexes/constraints.
Thanks for the help,
Lior
Hey Lior,
we tried to explain the Eager loading here:
And Mark's original blog post is here: http://www.markhneedham.com/blog/2014/10/23/neo4j-cypher-avoiding-the-eager/
Rik tried to explain it in easier terms:
http://blog.bruggen.com/2015/07/loading-belgian-corporate-registry-into_20.html
Trying to understand the "Eager Operation"
I had read about this before, but did not really understand it until Andres explained it to me again: in all normal operations, Cypher loads data lazily. See for example this page in the manual - it basically just loads as little as possible into memory when doing an operation. This laziness is usually a really good thing. But it can get you into a lot of trouble as well - as Michael explained it to me:
"Cypher tries to honor the contract that the different operations
within a statement are not affecting each other. Otherwise you might
up with non-deterministic behavior or endless loops. Imagine a
statement like this:
MATCH (n:Foo) WHERE n.value > 100 CREATE (m:Foo {m.value = n.value + 100});
If the two statements would not be
isolated, then each node the CREATE generates would cause the MATCH to
match again etc. an endless loop. That's why in such cases, Cypher
eagerly runs all MATCH statements to exhaustion so that all the
intermediate results are accumulated and kept (in memory).
Usually
with most operations that's not an issue as we mostly match only a few
hundred thousand elements max.
With data imports using LOAD CSV,
however, this operation will pull in ALL the rows of the CSV (which
might be millions), execute all operations eagerly (which might be
millions of creates/merges/matches) and also keeps the intermediate
results in memory to feed the next operations in line.
This also
disables PERIODIC COMMIT effectively because when we get to the end of
the statement execution all create operations will already have
happened and the gigantic tx-state has accumulated."
So that's what's going on in my LOAD CSV queries: MATCH/MERGE/CREATE caused an Eager pipe to be added to the execution plan, and it effectively disables the batching of my operations with USING PERIODIC COMMIT. Apparently quite a few users run into this issue even with seemingly simple LOAD CSV statements. Very often you can avoid it, but sometimes you can't.
