I have a TensorFlow app that runs fine on Ubuntu 16.04, but when I attempt to run it in the tensorflow/tensorflow Docker image with nvidia-docker, it gets to this point and then freezes:
2017-07-12 22:06:10.917255: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-12 22:06:10.917289: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-12 22:06:11.023765: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-07-12 22:06:11.024133: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Quadro M4000
major: 5 minor: 2 memoryClockRate (GHz) 0.7725
pciBusID 0000:00:05.0
Total memory: 7.93GiB
Free memory: 7.87GiB
2017-07-12 22:06:11.024159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-07-12 22:06:11.024168: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-07-12 22:06:11.024190: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro M4000, pci bus id: 0000:00:05.0)
Since it's not outputting an error message, I don't know where to start; any suggestions for something I might be missing or steps to troubleshoot this further?
I verified that my nvidia-docker installation is functioning correctly.
Turns out the application was running; it only appeared to be frozen because output from Python apps running in Docker containers tends to get stuck in the buffer and never shows up in docker logs. To fix the problem I passed -u to python, and now I can see my application's output in docker logs and all is well.
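For completeness, the same effect can be had from inside the script or the image; a minimal Python sketch (any one of these is enough):
import sys

# Ways to stop output from sitting in the buffer inside a container:
print("training step finished", flush=True)  # flush on each print (Python 3)
sys.stdout.flush()                           # or flush manually after a batch of writes

# Alternatively, set PYTHONUNBUFFERED=1 before Python starts,
# e.g. `ENV PYTHONUNBUFFERED=1` in the Dockerfile, which behaves
# like passing -u on the command line.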
Well, I'm having some problems and this may be the root of them. I just want to know whether the following output is normal.
~$ echo $DISPLAY
:1
Is there any problem? According to my research, it should be :0.
From the related question "Why is $DISPLAY sometimes :0 and sometimes :1?":
:0 is usually the local display (i.e. the main display of the computer when you sit in front of it).
:1 is often used by services like SSH when you enable display forwarding and log into a remote computer.
Although it is my local display, $DISPLAY is :1. How can I fix this (if it is a problem)? A quick check of which X displays actually exist is sketched after the system info below.
Additional information: while I log in, the screen flickers.
System Info:
~$ neofetch
OS: Ubuntu 20.04.4 LTS x86_64
Host: 20URS0BG00 ThinkPad T15g Gen 1
Kernel: 5.13.0-30-generic
Uptime: 2 hours, 32 mins
Packages: 3895 (dpkg), 26 (snap)
Shell: bash 5.0.17
Resolution: 1920x1080, 1920x1080, 1920x1080
DE: GNOME
WM: Mutter
WM Theme: Adwaita
Theme: Yaru-dark [GTK2/3]
Icons: Yaru [GTK2/3]
Terminal: x-terminal-emul
CPU: Intel i7-10750H (12) @ 5.000GHz
GPU: NVIDIA 01:00.0 NVIDIA Corporation Device 1e91
GPU: Intel UHD Graphics
Memory: 6676MiB / 31894MiB
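In case it is useful, here is the quick check mentioned above (a small sketch; each running X server owns a socket under /tmp/.X11-unix, and $DISPLAY just has to name one of them):
import glob
import os

# Which display this session points at, and which X sockets actually exist.
print("DISPLAY =", os.environ.get("DISPLAY"))
# Each /tmp/.X11-unix/X<n> socket corresponds to a running display :<n>.
# On Ubuntu 20.04 with GDM the greeter often holds :0, so a user session
# on :1 is not by itself an error.
print("X sockets:", sorted(glob.glob("/tmp/.X11-unix/X*")))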
Hi, I'm having issues with dockerized TF Serving seeing but not using my GPU.
It adds the GPU as device 0, allocates memory on it, but then loads the ML model into CPU device memory and runs inference using only the CPUs. GPU-util on nvidia-smi never leaves 0%.
Could anyone help me figure out why this is happening, and what should be changed?
The setup:
OS: Amazon/Deep Learning AMI (Ubuntu 18.04) on EC2 g4dn.xlarge
GPU: Tesla T4
Model: pretrained gpt2-xl tensorflow from huggingface, which I froze into a SavedModel and uploaded to S3.
Docker: came stock with Deep Learning AMI. I've already checked and confirmed that nvidia-smi runs containerized, so it's not a nvidia+docker issue.
TF Serving: I use the Dockerfile below to pull the latest-gpu image and download the model directly into it at build time:
FROM tensorflow/serving:latest-gpu
ENV TZ=America
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone
RUN apt-get update && apt-get install -y awscli
ENV AWS_ACCESS_KEY_ID=...
ENV AWS_SECRET_ACCESS_KEY=...
ARG model_name
ENV MODEL_NAME=$model_name
# Use AWS CLI to download the SavedModel into the docker container from S3 bucket
RUN aws s3 cp s3://v3-models/models/pretrained_tf_serving/${MODEL_NAME} /models/${MODEL_NAME} --recursive
EXPOSE 8500
I build and run the above Dockerfile with these commands:
#!/bin/bash
# first build the image with the model_name arg, and tag it as xl-serving
docker build -t xl-serving --build-arg model_name=gpt2-xl ../../model_server
# then run it with gpus, exposing gRPC port
docker run -it --rm --gpus all --runtime nvidia -p 8500:8500 xl-serving
Running the serving container prints this output. Notice that the GPU is added.
2020-11-06 05:25:34.671071: I tensorflow_serving/model_servers/server.cc:87] Building single TensorFlow model file config: model_name: gpt2-xl model_base_path: /models/gpt2-xl
2020-11-06 05:25:34.671274: I tensorflow_serving/model_servers/server_core.cc:464] Adding/updating models.
2020-11-06 05:25:34.671295: I tensorflow_serving/model_servers/server_core.cc:575] (Re-)adding model: gpt2-xl
2020-11-06 05:25:34.771644: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: gpt2-xl version: 1}
2020-11-06 05:25:34.771673: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: gpt2-xl version: 1}
2020-11-06 05:25:34.771687: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: gpt2-xl version: 1}
2020-11-06 05:25:34.771724: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/gpt2-xl/1
2020-11-06 05:25:35.222512: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2020-11-06 05:25:35.222545: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:234] Reading SavedModel debug info (if present) from: /models/gpt2-xl/1
2020-11-06 05:25:35.222672: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-11-06 05:25:35.223994: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-11-06 05:25:35.262238: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:35.263132: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties:
pciBusID: 0000:00:1e.0 name: Tesla T4 computeCapability: 7.5
coreClock: 1.59GHz coreCount: 40 deviceMemorySize: 14.75GiB deviceMemoryBandwidth: 298.08GiB/s
2020-11-06 05:25:35.263149: I external/org_tensorflow/tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2020-11-06 05:25:35.263236: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:35.264122: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:35.264948: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0
2020-11-06 05:25:36.185140: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-11-06 05:25:36.185165: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1263] 0
2020-11-06 05:25:36.185171: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0: N
2020-11-06 05:25:36.185334: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:36.186222: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:36.187046: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-11-06 05:25:36.187852: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 13896 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-11-06 05:25:37.279837: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:199] Restoring SavedModel bundle.
2020-11-06 05:25:56.154008: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:183] Running initialization op on SavedModel bundle at path: /models/gpt2-xl/1
2020-11-06 05:25:57.551535: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:303] SavedModel load for tags { serve }; Status: success: OK. Took 22777844 microseconds.
2020-11-06 05:25:57.832736: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:59] No warmup data file found at /models/gpt2-xl/1/assets.extra/tf_serving_warmup_requests
2020-11-06 05:25:57.835030: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: gpt2-xl version: 1}
2020-11-06 05:25:57.838329: I tensorflow_serving/model_servers/server.cc:367] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2020-11-06 05:25:57.840415: I tensorflow_serving/model_servers/server.cc:387] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 238] NET_LOG: Entering the event loop ...
I then hit this server with a single, non-batched gRPC call. It successfully runs and returns a correct GPT2 output. However, it takes as long as the same setup takes on a CPU. htop shows 8 GB of RAM (the size of the gpt2-xl model) loaded into host memory, and the TF Serving process maxing out one or two CPU cores. It appears to run only on the CPU.
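For context, the call looks roughly like the sketch below, using the tensorflow-serving-api package. The input key "input_ids" is a placeholder (the real key comes from saved_model_cli show --dir /models/gpt2-xl/1 --all); the model name and the "forward"/"output_0" names match the local script further down.
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "gpt2-xl"
request.model_spec.signature_name = "forward"   # same signature as the local script
request.inputs["input_ids"].CopyFrom(           # "input_ids" is a placeholder name
    tf.make_tensor_proto([[198, 15667, 6530, 25, 29437]], dtype=tf.int32)
)

response = stub.Predict(request, 60.0)          # 60 second timeout
print(response.outputs["output_0"])             # same output key as the local script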
This is what nvidia-smi looks like while the call is running. Notice the allocated memory, and 0% GPU-Util:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02 Driver Version: 450.80.02 CUDA Version: 11.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:00:1E.0 Off | 0 |
| N/A 36C P0 26W / 70W | 14240MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 13357 C tensorflow_model_server 14221MiB |
+-----------------------------------------------------------------------------+
I've scoured the web and can't find any advice for this. The closest thing I found was this GitHub issue: GPU utilization with TF serving #1440, but the fixes there did not work for me. They were dealing with low GPU utilization; I'm dealing with 0%.
Any advice on what the issue is?
Thank you very much. I've been banging my head against the wall for days on this, so I very much appreciate your help :)
Update #1:
I've written a python script (below) to use tensorflow==2.3.0 to load the model and run it. It's running in a conda env with CUDA=11.0. It successfully runs inference on the GPU, and it's a good 15x faster than what I'm getting on TF-serving.
import tensorflow as tf
import numpy as np

model = tf.saved_model.load('/home/ubuntu/models/gpt2-xl/1/')
servable = model.signatures["forward"]

# Create input tensor
tensor_in = tf.constant([[198, 15667, 6530, 25, 29437, 1706, 1610, 977, 948, 33611]])

# Run a loop of 10 inferences on the model, to predict the next 10 tokens.
for i in range(10):
    pred = servable(tensor_in)
    logits = pred['output_0']
    logits = logits[:, -1, :] / 0.8
    next_id = tf.random.categorical(tf.nn.log_softmax(logits, axis=-1), num_samples=1)
    next_id = tf.dtypes.cast(next_id, tf.int32).numpy()
    tensor_in = np.concatenate((tensor_in, next_id), axis=1)
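As a sanity check that the script above really lands on the GPU (not part of the original script, just standard TF 2.x calls), the following can be run before the loop:
import tensorflow as tf

# Confirm TF sees the T4, and log the device every op is placed on.
print(tf.config.list_physical_devices("GPU"))   # expect one GPU entry here
tf.debugging.set_log_device_placement(True)     # subsequent ops log GPU:0 or CPU:0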
Up next: I will try running tf-serving outside of the container. Update to come...
How did you save your model? Add clear_devices=True when saving the model and try again.
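For illustration only, here is roughly where clear_devices fits if the model was exported with the TF1-style SavedModelBuilder API. The paths and the empty signature_def_map are placeholders, and a Hugging Face export may have gone through a different route entirely:
import tensorflow.compat.v1 as tf1

tf1.disable_eager_execution()   # TF1-style graph/session workflow

graph = tf1.Graph()
with graph.as_default(), tf1.Session(graph=graph) as sess:
    # Re-import the graph, stripping any /device:... placements baked into it.
    saver = tf1.train.import_meta_graph("/path/to/model.meta", clear_devices=True)  # placeholder path
    saver.restore(sess, "/path/to/model.ckpt")                                      # placeholder path

    builder = tf1.saved_model.builder.SavedModelBuilder("/models/gpt2-xl/1")
    builder.add_meta_graph_and_variables(
        sess,
        [tf1.saved_model.tag_constants.SERVING],
        signature_def_map={},    # fill in the real "forward" signature here
        clear_devices=True,      # also drop device pins when writing the SavedModel
    )
    builder.save()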
Spark version is 2.4.0; my cluster has four nodes, each with 16 CPUs and 128 GB of RAM.
I am using a Jupyter notebook connected to PySpark. The workflow is to read Kudu data with Spark and then do the calculation with a pandas UDF. On the terminal I start pyspark with:
PYSPARK_DRIVER_PYTHON="/opt/anaconda2/envs/py3/bin/jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark2 \
    --jars kudu-spark2_2.11-1.7.0-cdh5.15.0.jar \
    --conf spark.executor.memory=40g --conf spark.executor.memoryOverhead=5g \
    --num-executors=4 --executor-cores=8 --conf yarn.nodemanager.vmem-check-enabled=false
My dataset is only 6 GB with 32 partitions. When running, I can see that each node has one executor containing 8 Python workers, and each Python worker uses 6 GB of memory! The container is killed by YARN because of the memory limit:
Container killed by YARN for exceeding memory limits. 45.1 GB of 45 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead
I'm confused: why does it take up so much memory? The data size for each partition is only ~200 MB. Isn't pandas_udf supposed to avoid serialization and deserialization overhead by using PyArrow? Maybe Jupyter is causing the problem?
I would be very grateful if anyone could help me.
This is my code.
from pyspark.sql import functions as F          # needed for F.array/F.lit/F.rand below
from pyspark.sql.functions import rand, randn   # needed for the select below

df = spark.range(0, 800000000)
df = df.select("id", rand(seed=10).alias("uniform"), randn(seed=27).alias("normal"),
               randn(seed=27).alias("normal1"), randn(seed=1).alias("normal3"))
df = df.withColumn(
    "flag",
    F.array(
        F.lit("0"),
        F.lit("1"),
        F.lit("2"),
        F.lit("3"),
        F.lit("4"),
        F.lit("5"),
        F.lit("6"),
        F.lit("7"),
        F.lit("8"),
        F.lit("9"),
        F.lit("10"),
        F.lit("11"),
        F.lit("12"),
        F.lit("13"),
        F.lit("14"),
        F.lit("15"),
        F.lit("16"),
        F.lit("17"),
        F.lit("18"),
        F.lit("19"),
        F.lit("20"),
        F.lit("21"),
        F.lit("22"),
    ).getItem(
        (F.rand() * 23).cast("int")
    )
)

from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

schema = StructType([
    StructField("flag", IntegerType()),
])

@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def del_data(data):
    import os
    os.environ["ARROW_PRE_0_15_IPC_FORMAT"] = "1"
    return data[["flag"]]

df.groupBy('flag').apply(del_data).write.csv('/tmp/')
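One thing I noticed while digging (adding it here in case it is relevant, not a verified diagnosis): with PandasUDFType.GROUPED_MAP, each call to del_data receives an entire group as one pandas DataFrame, so a Python worker's memory is driven by the largest 'flag' group rather than by the ~200 MB partition size, and converting from Arrow to pandas typically adds another copy. A quick way to gauge the group sizes before running the UDF:
# Rough size check for the groups a GROUPED_MAP pandas UDF will receive.
group_sizes = df.groupBy("flag").count()
group_sizes.orderBy("count", ascending=False).show()
# With ~800M rows over 23 flags, each group is roughly 35M rows; at five
# numeric columns (40+ bytes per row) that is already on the order of
# gigabytes per group before pandas/Arrow overhead.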
In the same hardware/software environment, with the same net and solver, the only difference is the command line.
When the command line is:
caffe-master/build/tools/caffe train --solver=solver_base.prototxt --gpu=6
it takes about 50 seconds per 100 iterations.
When the command is:
caffe-master/build/tools/caffe train --solver=solver_base.prototxt --gpu=4,5,6,7
it takes about 48 seconds per 100 iterations.
Normally, I would expect multi-GPU training to cost more time per iteration than single-GPU because of overhead such as weight replication. Can anyone tell me why this happens? Thanks very much!
Env:
2 * Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
8 * Nvidia Tesla V100 PCIE 16GB
Caffe 1.0.0 / use_cudnn on
Cuda 9.0.176
Cudnn 6.0.21
I have a Java-based application with a huge amount of source code (~1M lines). I am using Jenkins with sonar-runner-2.4 to run analysis with code coverage and test-case counts. I have upgraded the SonarQube server from 5.4 to 6.3.1. Before the upgrade this job took 9 hours to complete the whole analysis (still a very long time, but fine); after upgrading to SonarQube 6.3.1 the same job takes 13 hours to complete the same analysis.
How do I improve the analysis time, at least back to my earlier 9 hours?
EDIT
Here are my JAVA_OPTS for the sonarqube-6.3.1 instance:
sonar.web.javaOpts=-Xmx6G -Xms2G -XX:MaxPermSize=1G -XX:+HeapDumpOnOutOfMemoryError -Djava.net.preferIPv4Stack=true
Available Hardware :
$lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 26
Stepping: 5
CPU MHz: 1596.000
BogoMIPS: 3999.44
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-3
NUMA node1 CPU(s): 4-7
Available Memory :
$free -m
total used free shared buff/cache available
Mem: 128714 58945 66232 430 3535 68298
Swap: 32767 957 31810
sonar-project.properties for the long running job:
sonar-project.properties
As you haven't really given many details, I can't really give many details in the answer, but the simple answer is that you have to make the scan do less work.
Look at your codebase. Is your scan processing generated classes? Is it scanning test classes? Is it scanning classes that have little real business logic? If you answer "yes" to any of those, consider excluding those classes.
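For example, exclusions go in the same sonar-project.properties you already pass to the scanner; the patterns below are placeholders for wherever your generated and test sources actually live:
# Illustrative exclusions - adjust the patterns to your project layout.
sonar.exclusions=**/generated/**,**/*_jsp.java,target/**
sonar.test.exclusions=src/test/**
# If generated sources must stay in the scan, at least skip duplication detection on them:
sonar.cpd.exclusions=**/generated/**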
Look at the SonarQube plugins you're using. Are you running every possible plugin you can run? Are there some heuristics you don't need to run, or perhaps you could run less frequently?