onnxruntime inference is way slower than pytorch on GPU - machine-learning

I was comparing inference times for the same input using PyTorch and onnxruntime, and I find that onnxruntime is actually slower on GPU while being significantly faster on CPU.
I was trying this on Windows 10.
ONNX Runtime installed from source - ONNX Runtime version: 1.11.0 (onnx version 1.10.1)
Python version - 3.8.12
CUDA/cuDNN version - cuda version 11.5, cudnn version 8.2
GPU model and memory - Quadro M2000M, 4 GB
Relevant code -
import torch
from torchvision import models
import onnxruntime # to inference ONNX models, we use the ONNX Runtime
import onnx
import os
import time
batch_size = 1
total_samples = 1000
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
def convert_to_onnx(resnet):
    resnet.eval()
    dummy_input = torch.randn(batch_size, 3, 224, 224, device=device)
    input_names = ['input']
    output_names = ['output']
    torch.onnx.export(resnet,
                      dummy_input,
                      "resnet18.onnx",
                      verbose=True,
                      opset_version=13,
                      input_names=input_names,
                      output_names=output_names,
                      export_params=True,
                      do_constant_folding=True,
                      dynamic_axes={
                          'input': {0: 'batch_size'},  # variable length axes
                          'output': {0: 'batch_size'}}
                      )
def infer_pytorch(resnet):
    print('Pytorch Inference')
    print('==========================')
    print()
    x = torch.randn((batch_size, 3, 224, 224))
    x = x.to(device=device)
    latency = []
    for i in range(total_samples):
        t0 = time.time()
        resnet.eval()
        with torch.no_grad():
            out = resnet(x)
        latency.append(time.time() - t0)
    print('Number of runs:', len(latency))
    print("Average PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))

def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()
def infer_onnxruntime():
    print('Onnxruntime Inference')
    print('==========================')
    print()
    onnx_model = onnx.load("resnet18.onnx")
    onnx.checker.check_model(onnx_model)
    # Input
    x = torch.randn((batch_size, 3, 224, 224))
    x = x.to(device=device)
    x = to_numpy(x)
    so = onnxruntime.SessionOptions()
    so.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
    so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
    exproviders = ['CUDAExecutionProvider', 'CPUExecutionProvider']
    model_onnx_path = os.path.join(".", "resnet18.onnx")
    ort_session = onnxruntime.InferenceSession(model_onnx_path, so, providers=exproviders)
    options = ort_session.get_provider_options()
    cuda_options = options['CUDAExecutionProvider']
    cuda_options['cudnn_conv_use_max_workspace'] = '1'
    ort_session.set_providers(['CUDAExecutionProvider'], [cuda_options])
    # IOBinding
    input_names = ort_session.get_inputs()[0].name
    output_names = ort_session.get_outputs()[0].name
    io_binding = ort_session.io_binding()
    io_binding.bind_cpu_input(input_names, x)
    io_binding.bind_output(output_names, device)
    # warm up run
    ort_session.run_with_iobinding(io_binding)
    ort_outs = io_binding.copy_outputs_to_cpu()
    latency = []
    for i in range(total_samples):
        t0 = time.time()
        ort_session.run_with_iobinding(io_binding)
        latency.append(time.time() - t0)
        ort_outs = io_binding.copy_outputs_to_cpu()
    print('Number of runs:', len(latency))
    print("Average onnxruntime {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))
if __name__ == '__main__':
    torch.cuda.empty_cache()
    resnet = (models.resnet18(pretrained=True)).to(device=device)
    convert_to_onnx(resnet)
    infer_onnxruntime()
    infer_pytorch(resnet)
Output
If run on CPU,
Average onnxruntime cpu Inference time = 18.48 ms
Average PyTorch cpu Inference time = 51.74 ms
but, if run on GPU, I see
Average onnxruntime cuda Inference time = 47.89 ms
Average PyTorch cuda Inference time = 8.94 ms
If I change the graph optimization level to onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL, I see some improvement in inference time on GPU, but it is still slower than PyTorch.
I use IO binding for the input tensor (a NumPy array), and the nodes of the model are on the GPU.
Further, during the processing for onnxruntime, I print device usage stats and I see this -
Using device: cuda:0
GPU Device name: Quadro M2000M
Memory Usage:
Allocated: 0.1 GB
Cached: 0.1 GB
So, GPU device is being used.
Further, I have also tried the resnet18.onnx model from the ModelZoo to see whether this is an issue with the converted model, but I get the same results.
What am I doing wrong or missing here?

When measuring inference time, move everything that only needs to run once, such as resnet.eval(), out of the timed loop.
Please also include the imports in your example:
import torch
from torchvision import models
import onnxruntime # to inference ONNX models, we use the ONNX Runtime
import onnx
import os
import time
After running your example on GPU only, I found that the times differ by only about 2x, so the speed difference may come down to framework characteristics. For more details, explore ONNX conversion optimization.
Onnxruntime Inference
==========================
Number of runs: 1000
Average onnxruntime cuda Inference time = 4.76 ms
Pytorch Inference
==========================
Number of runs: 1000
Average PyTorch cuda Inference time = 2.27 ms
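The advice above (keep one-time setup out of the timed loop) can be sketched with a small timing helper. This is only a sketch, not code from the original posts: `benchmark`, `warmup` and `sync` are hypothetical names. GPU work is typically queued asynchronously, so on a GPU one would usually pass a blocking call such as `torch.cuda.synchronize` as `sync`; here a dummy workload stands in for the model.

```python
import time

def benchmark(fn, n_runs=1000, warmup=10, sync=None):
    """Return the mean latency of fn() in milliseconds.

    fn     -- zero-argument callable doing one inference step
    warmup -- number of untimed calls made first (lazy initialization, caches)
    sync   -- optional zero-argument callable that blocks until queued work
              has finished (assumption: e.g. torch.cuda.synchronize on GPU)
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    total = 0.0
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the work to finish before stopping the clock
        total += time.perf_counter() - t0
    return total * 1000.0 / n_runs

# Dummy workload standing in for resnet(x) or ort_session.run_with_iobinding(...).
calls = []
mean_ms = benchmark(lambda: calls.append(sum(range(1000))), n_runs=50, warmup=5)
print(f"mean latency: {mean_ms:.4f} ms over 50 timed runs")
```

Anything that only has to happen once (building the session, `resnet.eval()`, moving the input to the device) would go before the call to `benchmark`, so that only the forward pass is timed.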

For CPU you don't need io-binding; it is only needed for GPU.
Also, don't change the session options, as onnxruntime selects the best options by default.
The following things may help speed up GPU inference:
Make sure to install onnxruntime-gpu, which comes with the prebuilt CUDA EP and TensorRT EP.
You are currently binding the inputs and outputs to the CPU. When using onnxruntime with the CUDA EP, you should bind them to the GPU to avoid copying inputs and outputs between CPU and GPU.
I suggest using the io_binding.bind_input() method instead of io_binding.bind_cpu_input().
for i in range(total_samples):
    t0 = time.time()
    ort_session.run_with_iobinding(io_binding)
    latency.append(time.time() - t0)
->  ort_outs = io_binding.copy_outputs_to_cpu()
Copying the output from GPU to CPU on every one of the 1000 iterations hurts performance.
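The suggestion to keep inputs and outputs on the GPU can be sketched as follows. This is a sketch under assumptions, not the answerer's code: `run_with_gpu_binding` is a hypothetical helper, `session` is assumed to be an `onnxruntime.InferenceSession` created with the CUDA execution provider, and `x` a NumPy array matching the model input. The input is copied to the device once, the output is left on the device, and the result is copied back to the CPU only once, after the loop.

```python
def run_with_gpu_binding(session, x, n_runs=1000, device_type="cuda", device_id=0):
    """Run `session` n_runs times with input and output bound to the GPU.

    Hypothetical helper: `session` is an onnxruntime.InferenceSession using the
    CUDA execution provider, `x` is a NumPy array with the model's input shape.
    """
    import onnxruntime as ort  # imported lazily; requires the onnxruntime-gpu package

    input_name = session.get_inputs()[0].name
    output_name = session.get_outputs()[0].name

    # Copy the input to the device once, outside the loop.
    x_gpu = ort.OrtValue.ortvalue_from_numpy(x, device_type, device_id)

    binding = session.io_binding()
    binding.bind_ortvalue_input(input_name, x_gpu)
    # Without a buffer pointer, ONNX Runtime allocates the output on the given device.
    binding.bind_output(output_name, device_type, device_id)

    for _ in range(n_runs):
        session.run_with_iobinding(binding)

    # Single device-to-host copy after all runs.
    return binding.copy_outputs_to_cpu()[0]
```

With the session from the question this would be called as `run_with_gpu_binding(ort_session, x)`, where `x` is the NumPy input array.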

Related

Using Bloom AI Model on Mac M1 for continuing prompts (Pytorch)

I am trying to run the BigScience BLOOM model on my MacBook M1 Max (64 GB), with a freshly installed PyTorch build for M1 chips and Python 3.10.6.
I'm not able to get any output at all.
I have the same issue with other AI models, and I really don't know how to fix it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
device = "mps" if torch.backends.mps.is_available() else "cpu"
if device == "cpu" and torch.cuda.is_available():
    device = "cuda"  # if the device is cpu and cuda is available, set the device to cuda
print(f"Using {device} device")  # print the device
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom").to(device)
input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
I've tried it with other Models (smaller bert Models) and also tried letting it just run on CPU without using the mps device at all.
Maybe anyone could help
It might be taking too long to get the output. Do you want to break it down to serial calls involving
a) embedding layer b) the 70 bloom blocks c) then the output layer norm and d) the token decoding?
An example to run this code is available at https://nbviewer.org/urls/arteagac.github.io/blog/bloom_local.ipynb .
It basically boils down to:
def forward(input_ids):
    # 1. Create attention mask and position encodings
    attention_mask = torch.ones(len(input_ids)).unsqueeze(0).bfloat16().to(device)
    alibi = build_alibi_tensor(input_ids.shape[1], config.num_attention_heads,
                               torch.bfloat16).to(device)
    # 2. Load and use word embeddings
    embeddings, lnorm = load_embeddings()
    hidden_states = lnorm(embeddings(input_ids))
    del embeddings, lnorm
    # 3. Load and use the BLOOM blocks sequentially
    for block_num in range(70):
        load_block(block, block_num)
        hidden_states = block(hidden_states, attention_mask=attention_mask, alibi=alibi)[0]
        print(".", end='')
    hidden_states = final_lnorm(hidden_states)
    # 4. Load and use language model head
    lm_head = load_causal_lm_head()
    logits = lm_head(hidden_states)
    # 5. Compute next token
    return torch.argmax(logits[:, -1, :], dim=-1)
Please refer to the linked notebook for the implementation of the functions used in the forward call.

How to parallelize a training loop over the samples of a batch when only CPUs are available in PyTorch?

I want to parallelize over single examples or batches of examples (in my situation I only have CPUs, up to 112 of them). I tried it, but I get an error saying that losses carrying gradients cannot be sent back from separate processes, which entirely ruins my attempt. I still want to do it, and it is essential that I can do an optimizer step after the multiprocessing happens. How do I get around this? I made a totally self-contained example:
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import StepLR
from torch.utils.data import Dataset, DataLoader
from torch.multiprocessing import Pool
class SimpleDataSet(Dataset):
    def __init__(self, Din, num_examples=23):
        self.x_dataset = [torch.randn(Din) for _ in range(num_examples)]
        # target function is x*x
        self.y_dataset = [x**2 for x in self.x_dataset]

    def __len__(self):
        return len(self.x_dataset)

    def __getitem__(self, idx):
        return self.x_dataset[idx], self.y_dataset[idx]

def get_loss(args):
    x, y, model = args
    y_pred = model(x)
    criterion = nn.MSELoss()
    loss = criterion(y_pred, y)
    return loss

def get_dataloader(D, num_workers, batch_size):
    ds = SimpleDataSet(D)
    dl = DataLoader(ds, batch_size=batch_size, num_workers=num_workers)
    return dl

def train_fake_data():
    num_workers = 2
    Din, Dout = 3, 1
    model = nn.Linear(Din, Dout).share_memory()
    optimizer = torch.optim.Adam(model.parameters(), lr=0.1)
    batch_size = 2
    num_epochs = 10
    # num_batches = 5
    num_procs = 5
    dataloader = get_dataloader(Din, num_workers, batch_size)
    scheduler = StepLR(optimizer, step_size=1, gamma=0.7)
    for epoch in range(num_epochs):
        for _, batch in enumerate(dataloader):
            batch = [(torch.randn(Din), torch.randn(Dout), model) for _ in batch]
            with Pool(num_procs) as pool:
                optimizer.zero_grad()
                losses = pool.map(get_loss, batch)
                loss = torch.mean(losses)
                loss.backward()
                optimizer.step()
        # scheduler
        scheduler.step()

if __name__ == '__main__':
    # start = time.time()
    # train()
    train_fake_data()
    # print(f'execution time: {time.time() - start}')
Error:
Traceback (most recent call last):
File "/Users/brando/anaconda3/envs/coq_gym/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3427, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-2-ea57e03ba088>", line 1, in <module>
runfile('/Users/brando/ML4Coq/playground/multiprocessing_playground/multiprocessing_cpu_pytorch.py', wdir='/Users/brando/ML4Coq/playground/multiprocessing_playground')
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_bundle/pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "/Applications/PyCharm.app/Contents/plugins/python/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "/Users/brando/ML4Coq/playground/multiprocessing_playground/multiprocessing_cpu_pytorch.py", line 95, in <module>
train_fake_data()
File "/Users/brando/ML4Coq/playground/multiprocessing_playground/multiprocessing_cpu_pytorch.py", line 83, in train_fake_data
losses = pool.map(get_loss, batch)
File "/Users/brando/anaconda3/envs/coq_gym/lib/python3.7/multiprocessing/pool.py", line 290, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/Users/brando/anaconda3/envs/coq_gym/lib/python3.7/multiprocessing/pool.py", line 683, in get
raise self._value
multiprocessing.pool.MaybeEncodingError: Error sending result: '[tensor(0.5237, grad_fn=<MseLossBackward>)]'. Reason: 'RuntimeError('Cowardly refusing to serialize non-leaf tensor which requires_grad, since autograd does not support crossing process boundaries. If you just want to transfer the data, call detach() on the tensor before serializing (e.g., putting it on the queue).')'
I am sure I want to do this. How should I be doing this?
New attempt using DDP
"""
Based on: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
Note: as opposed to the multiprocessing (torch.multiprocessing) package, processes can use
different communication backends and are not restricted to being executed on the same machine.
"""
import torch
from torch import nn, optim
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
import os
num_epochs = 5
batch_size = 8
Din, Dout = 10, 5
data_x = torch.randn(batch_size, Din)
data_y = torch.randn(batch_size, Dout)
data = [(i*data_x, i*data_y) for i in range(num_epochs)]
class OneDeviceModel(nn.Module):
    """
    Toy example for a model run in parallel but not distributed across gpus
    (only processes with their own gpu or hardware)
    """
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(Din, Din)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(Din, Dout)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))
def setup_process(rank, world_size, backend='gloo'):
    """
    Initialize the distributed environment (for each process).
    gloo: is a collective communications library (https://github.com/facebookincubator/gloo). My understanding is that
    it's a library/API for process to communicate/coordinate with each other/master. It's a backend library.
    """
    # set up the master's ip address so this child process can coordinate
    # os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # - use NCCL if you are using gpus: https://pytorch.org/tutorials/intermediate/dist_tuto.html#communication-backends
    if torch.cuda.is_available():
        backend = 'nccl'
    # Initializes the default distributed process group, and this will also initialize the distributed package.
    dist.init_process_group(backend, rank=rank, world_size=world_size)

def cleanup():
    """ Destroy a given process group, and deinitialize the distributed package """
    dist.destroy_process_group()
def run_parallel_training_loop(rank, world_size):
    """
    Distributed function to be implemented later.
    This is the function that is actually ran in each distributed process.
    Note: as DDP broadcasts model states from rank 0 process to all other processes in the DDP constructor,
    you don’t need to worry about different DDP processes start from different model parameter initial values.
    """
    print()
    print(f"Start running DDP with model parallel example on rank: {rank}.")
    print(f'current process: {mp.current_process()}')
    print(f'pid: {os.getpid()}')
    setup_process(rank, world_size)
    # create model and move it to GPU with id rank
    model = OneDeviceModel().to(rank) if torch.cuda.is_available() else OneDeviceModel().share_memory()
    # ddp_model = DDP(model, device_ids=[rank])
    ddp_model = DDP(model)
    for batch_idx, batch in enumerate(data):
        x, y = batch
        loss_fn = nn.MSELoss()
        optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
        optimizer.zero_grad()
        outputs = ddp_model(x)
        labels = y.to(rank) if torch.cuda.is_available() else y
        # Gradient synchronization communications take place during the backward pass and overlap with the backward computation.
        loss_fn(outputs, labels).backward()  # When the backward() returns, param.grad already contains the synchronized gradient tensor.
        optimizer.step()  # TODO how does the optimizer know to do the gradient step only once?
    print()
    print(f"Start running DDP with model parallel example on rank: {rank}.")
    print(f'current process: {mp.current_process()}')
    print(f'pid: {os.getpid()}')
    # Destroy a given process group, and deinitialize the distributed package
    cleanup()
def main():
    print()
    print('running main()')
    print(f'current process: {mp.current_process()}')
    print(f'pid: {os.getpid()}')
    # args
    world_size = mp.cpu_count()
    mp.spawn(run_parallel_training_loop, args=(world_size,), nprocs=world_size)

if __name__ == "__main__":
    print('starting __main__')
    main()
    print('Done!\a\n')
It seems to work, but my question is about line 74: do I need to write
model = OneDeviceModel().to(rank) if torch.cuda.is_available() else OneDeviceModel().share_memory()
or
model = OneDeviceModel().to(rank) if torch.cuda.is_available() else OneDeviceModel()
for it to work properly on multiple CPUs?
Serial is faster than parallel even though I have 112 CPU cores?
My current issue is that the parallel CPU job is slower than the serial one when only CPUs are available.
I want to know how to set up Python and parallel CPUs. E.g. if I have X CPUs, how many processes should I be running: X, or something else? How do I choose this number, even if only by a rough heuristic?
related links from research:
https://discuss.pytorch.org/t/multiprocessing-for-loop-on-cpu/59836
How to use multiprocessing in PyTorch?
https://discuss.pytorch.org/t/how-to-parallelize-a-loop-over-the-samples-of-a-batch/32698/7
https://www.reddit.com/r/pytorch/comments/sm073v/how_to_parallelize_a_training_loop_ever_samples/
PyTorch already uses multiple CPU cores to parallelize individual operations, so your serial version may well be using multi-core vectorization.
Take this simple example:
import torch
c = 0
for i in range(10000):
    A = torch.randn(1000, 1000, device='cpu')
    B = torch.randn(1000, 1000, device='cpu')
    c += torch.sum(A @ B)
No code is used to parallelize, yet this uses about 80% of 12 CPUs with the default configuration.
You can use torch.set_num_threads to set the intra-op parallelism on CPU. In particular, if you are running multiple processes and you want each process to use a single CPU, you may want to set the intra-op parallelism to 1 in each process.
However, parallelizing the operations has a cost. I am unable to go into the implementation details, but we can run a quick experiment that shows the overhead of using multiple threads.
import matplotlib.pyplot as plt
import numpy as np
import torch
import time

A = torch.randn(1000, 1000, device='cpu')
B = torch.randn(1000, 1000, device='cpu')
funcs = {
    'sin': lambda a, b: torch.sin(a),
    'tanh': lambda a, b: torch.tanh(a),
    'log': lambda a, b: torch.log(a),
    'matmul': lambda a, b: a @ b.T
}
t = np.zeros(20)
for k, f in funcs.items():
    for i in range(1, len(t) + 1):
        torch.set_num_threads(i)
        c = 0
        t0 = time.time()
        for _ in range(100):
            f(A, B)
        tf = time.time()
        # elapsed time multiplied by the number of threads ("core x time")
        t[i-1] = (tf - t0)*i
    plt.plot(np.arange(1, len(t)+1), t, '-o', label=k)
plt.xlabel('Number of threads')
plt.legend()
plt.ylabel('Core x time')
The operations tend to run faster with parallelism.
But if we look at the total CPU time, obtained by multiplying the elapsed time by the number of threads, we see that the single-thread version is more efficient.
If you are able to parallelize your experiment at a higher level, by running independent processes, you should try that with a single core for each process, otherwise each process will try to use all the CPUs and all of them will run very slowly because your system is overloaded.
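The rule of thumb above (give each process its own share of the cores instead of letting every process use all of them) can be sketched as below. `threads_per_process` is a hypothetical helper, not a PyTorch API; the value it returns would be passed to `torch.set_num_threads` inside each worker process.

```python
import os

def threads_per_process(num_processes, total_cores=None):
    """Split the available cores evenly between processes (at least 1 thread each).

    total_cores defaults to os.cpu_count(); pass it explicitly when the job is
    restricted to fewer cores than the machine has.
    """
    if num_processes < 1:
        raise ValueError("num_processes must be at least 1")
    if total_cores is None:
        total_cores = os.cpu_count() or 1
    return max(1, total_cores // num_processes)

# Example on a hypothetical 12-core machine.
for procs in (1, 4, 12, 24):
    print(f"{procs} processes -> {threads_per_process(procs, total_cores=12)} threads each")
```

When there are more processes than cores the helper still returns 1, in which case the machine is oversubscribed regardless of the thread setting.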
Tweaking DDP example
I intentionally modified the hyperparameters of your example script in a way that favors multiple processes:
comparatively less initialization overhead
comparatively less communication between processes
"""
Based on: https://pytorch.org/tutorials/intermediate/ddp_tutorial.html
Note: as opposed to the multiprocessing (torch.multiprocessing) package, processes can use
different communication backends and are not restricted to being executed on the same machine.
"""
import torch
from torch import nn, optim
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
import argparse
import os
# More than one epoch so that the initialization is less significant
# than compared to the model processing time
num_epochs = 10
# for the experiment select a number that has a lot of divisors
# as I want to test with equal number of batches
num_batches = 16*9*5
# Uses a larger batch so that more work is done in each process
# between two gradient synchronizations
# apparently the intraop optimization is not helping
# (at least not too much) in the batch dimension
batch_size = 10000
# Use smaller dimensions, so that the intraop parallelization becomes less
# helpful
Din, Dout = 3, 5
data_x = torch.randn(batch_size, Din)
data_y = torch.randn(batch_size, Dout)
data = [(i*data_x, i*data_y) for i in range(num_batches)]
class OneDeviceModel(nn.Module):
    """
    Toy example for a model run in parallel but not distributed across gpus
    (only processes with their own gpu or hardware)
    """
    def __init__(self):
        super().__init__()
        # -- Use more layers
        # NOTE: layers held in a plain Python list are not registered as submodules,
        # so their parameters do not appear in model.parameters();
        # nn.ModuleList would register them.
        self.net = [nn.Linear(Din, Din) for _ in range(10)]
        # -- Bob: use more complex activation
        self.tanh = nn.Tanh()
        self.sigmoid = nn.Sigmoid()
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(Din, Dout)

    def forward(self, x):
        # apply the 10 layers sequentially
        for i in range(10):
            x = self.net[i](x)
            x = self.sigmoid(x)
            x = self.tanh(x)
            x = self.relu(x)
        return self.net2(x)
def setup_process(rank, world_size, backend='gloo'):
    """
    Initialize the distributed environment (for each process).
    gloo: is a collective communications library (https://github.com/facebookincubator/gloo). My understanding is that
    it's a library/API for process to communicate/coordinate with each other/master. It's a backend library.
    """
    # set up the master's ip address so this child process can coordinate
    # os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'
    # - use NCCL if you are using gpus: https://pytorch.org/tutorials/intermediate/dist_tuto.html#communication-backends
    if torch.cuda.is_available():
        backend = 'nccl'
    # Initializes the default distributed process group, and this will also initialize the distributed package.
    dist.init_process_group(backend, rank=rank, world_size=world_size)

def cleanup():
    """ Destroy a given process group, and deinitialize the distributed package """
    dist.destroy_process_group()
def run_parallel_training_loop(rank, world_size):
    """
    Distributed function to be implemented later.
    This is the function that is actually ran in each distributed process.
    Note: as DDP broadcasts model states from rank 0 process to all other processes in the DDP constructor,
    you don’t need to worry about different DDP processes start from different model parameter initial values.
    """
    print()
    print(f"Start running DDP with model parallel example on rank: {rank}.")
    print(f'current process: {mp.current_process()}')
    print(f'pid: {os.getpid()}')
    setup_process(rank, world_size)
    torch.set_num_threads(mp.cpu_count() // world_size)
    # create model and move it to GPU with id rank
    model = OneDeviceModel().to(rank) if torch.cuda.is_available() else OneDeviceModel().share_memory()
    # ddp_model = DDP(model, device_ids=[rank])
    ddp_model = DDP(model)
    for _ in range(num_epochs):
        for batch_idx, batch in enumerate(data[rank::world_size]):
            x, y = batch
            loss_fn = nn.MSELoss()
            optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)
            optimizer.zero_grad()
            outputs = ddp_model(x)
            labels = y.to(rank) if torch.cuda.is_available() else y
            # Gradient synchronization communications take place during the backward pass and overlap with the backward computation.
            loss_fn(outputs, labels).backward()  # When the backward() returns, param.grad already contains the synchronized gradient tensor.
            optimizer.step()  # TODO how does the optimizer know to do the gradient step only once?
    print()
    print(f"Start running DDP with model parallel example on rank: {rank}.")
    print(f'current process: {mp.current_process()}')
    print(f'pid: {os.getpid()}')
    # Destroy a given process group, and deinitialize the distributed package
    cleanup()
def main():
    print()
    print('running main()')
    print(f'current process: {mp.current_process()}')
    print(f'pid: {os.getpid()}')
    parser = argparse.ArgumentParser()
    parser.add_argument('--world-size', default=1, type=int)
    args = parser.parse_args()
    assert num_batches % args.world_size == 0
    mp.spawn(run_parallel_training_loop, args=(args.world_size,), nprocs=args.world_size)

if __name__ == "__main__":
    print('starting __main__')
    main()
    print('Done!\a\n')
$ time python3 ddp.py --world-size 1 > /dev/null
real 0m59.092s
user 8m46.589s
sys 0m7.320s
$ time python3 ddp.py --world-size 1 > /dev/null
real 1m11.124s
user 10m54.209s
sys 0m9.595s
$ time python3 ddp.py --world-size 6 > /dev/null
real 0m18.348s
user 2m28.799s
sys 0m18.068s
$ time python3 ddp.py --world-size 12 > /dev/null
real 0m26.352s
user 4m3.074s
sys 0m39.179s
$ time python3 ddp.py --world-size 3 > /dev/null
real 0m23.047s
user 3m51.172s
sys 0m11.483s
$ time python3 ddp.py --world-size 4 > /dev/null
real 0m18.195s
user 2m55.241s
sys 0m12.841s
$ time python3 ddp.py --world-size 2 > /dev/null
real 0m26.955s
user 4m15.837s
sys 0m7.127s
If I remove the line
torch.set_num_threads(mp.cpu_count() // world_size)
$ time python3 ddp.py --world-size 4 > /dev/null
real 0m40.574s
user 6m39.176s
sys 0m19.025s
$ time python3 ddp.py --world-size 2 > /dev/null
real 0m28.066s
user 3m17.775s
sys 0m8.410s
$ time python3 ddp.py --world-size 1 > /dev/null
real 0m37.114s
user 2m19.743s
sys 0m4.866s
Using
torch.set_num_threads(mp.cpu_count() // world_size // 2)
$ time python3 ddp.py --world-size 6 > /dev/null
real 0m16.399s
user 1m38.915s
sys 0m20.780s
$ time python3 ddp.py --world-size 4 > /dev/null
real 0m15.649s
user 1m1.821s
sys 0m13.589s
$ time python3 ddp.py --world-size 3 > /dev/null
real 0m16.947s
user 1m29.696s
sys 0m10.069s
$ time python3 ddp.py --world-size 2 > /dev/null
real 0m21.851s
user 2m4.564s
sys 0m7.486s
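To compare the runs above, the wall-clock (`real`) times can be converted to seconds and expressed as a speed-up over a single process. The sketch below uses only the figures quoted for the first set of runs (with `torch.set_num_threads(mp.cpu_count() // world_size)`). `parse_time` is a hypothetical helper, and the first of the two world-size 1 measurements is taken as the baseline, which is an arbitrary choice.

```python
import re

def parse_time(text):
    """Convert a field printed by `time`, such as '1m11.124s', to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognized time format: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

# `real` times from the first set of runs, keyed by --world-size.
real = {
    1: "0m59.092s",
    2: "0m26.955s",
    3: "0m23.047s",
    4: "0m18.195s",
    6: "0m18.348s",
    12: "0m26.352s",
}

baseline = parse_time(real[1])
speedup = {}
for world_size in sorted(real):
    seconds = parse_time(real[world_size])
    speedup[world_size] = baseline / seconds
    print(f"world size {world_size:2d}: {seconds:7.3f} s, speed-up {speedup[world_size]:.2f}x")
```

Dividing each speed-up by its world size gives the parallel efficiency, which shows how quickly the returns diminish as processes are added.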
My Opinion
DDP on a single node does not seem particularly advantageous, unless you have a model that does a lot of work that is not handled well by PyTorch intra-op parallelism, large batches, and preferably a model with fewer parameters and more operations, meaning fewer gradients to synchronize, e.g. a convolutional model on a very large input.
Another scenario where DDP might help is when your model relies on a lot of Python code instead of vectorized operations.

With Dask, TensorFlow detects only 1 GPU on a machine that has 2 GPUs installed

Our HPC node has 2 K80 GPUs. When I run the following code on the HPC node using plain Python, it detects 2 GPUs and displays "gpu device types:['TeslaK80', 'TeslaK80']".
But when I run the same code with Dask, it only detects 1 GPU and displays "gpu device types:['TeslaK80']".
The following is the code to detect the GPUs:
import tensorflow as tf

def init_gpu():
    print("\n\n\n ... tensorflow version = ", tf.__version__)
    from tensorflow.python.client import device_lib
    local_device_protos = device_lib.list_local_devices()
    print("local device protos:{0}".format(local_device_protos))
    # keep (name, description) for every device reported as a GPU
    _gpu_raw_info = [(x.name, x.physical_device_desc) for x in local_device_protos if x.device_type == 'GPU']
    print("gpu raw info:{0}".format(_gpu_raw_info))
    _gpu_names = [x[0] for x in _gpu_raw_info]
    _gpu_devices = [x[1] for x in _gpu_raw_info]
    # extract the device model name from the description string
    _gpu_device_types = [x.split(':')[2].split(',')[0].replace(' ', '') for x in _gpu_devices]
    print("gpu device types:{0}".format(_gpu_device_types))
The following is the Dask LSF cluster code that launches the job on the cluster:
import dask
from dask.distributed import Client
from dask_jobqueue import LSFCluster

cluster = LSFCluster(queue=queue_name, project=hpc_project, walltime='80:00', cores=1, processes=1, local_directory='dask-worker-space', memory='250GB', job_extra=['-gpu "num=2"'], log_directory='scheduler_log', dashboard_address=':8787')
cluster.scale(1 * 1)
client = Client(cluster.scheduler_address, timeout=60)
wbsd_results = []
r = dask.delayed(init_gpu)()
wbsd_results.append(r)
client.compute(wbsd_results, sync=True)
Please help. Thanks.

Slow training of RL algorithm and low CPU usage

I have a Python 3.6.4 script, which is a simple reinforcement learning script based on a 4-layer MLP (input, hidden-128, hidden-256, output). I can't publish my whole code here because it is too big, so I will post only a small part of it.
def train(model, epochs):
    total = 0
    start = time.time()
    entire_hist = []
    profits = []
    for i in range(epochs):
        loss = 0
        accuracy = 0
        hist = []
        env.reset()
        game_over = False
        input_t = env.observe()
        while not game_over:
            input_tm1 = input_t
            if np.random.random() <= epsilon:
                action = np.random.randint(0, num_actions, size=None)
            else:
                q = model.predict(input_tm1)
                action = np.argmax(q[0])
            input_t, reward, game_over = env.act(action)
            exp_replay.remember([input_tm1, action, reward, input_t], game_over)
            inputs, targets = exp_replay.get_batch(model, batch_size=batch_size)
            loss, accuracy = model.train_on_batch(inputs, targets)
            hist.append([loss, accuracy, env.main])
            print(f'counter: {env.counter}, action taken: {action}, reward: {round(reward, 2)}, main: {round(env.main)}, secondary: {env.secondary}')
            if game_over:
                print('GAME OVER!')
        entire_hist.append(hist)
        profits.append(total)
        print(f'total profit: {env.total_profit}')
        print(f'epoch: {i}, loss: {loss}, accuracy: {accuracy}')
        print('\n')
        print('*'*20)
    end = int(time.time() - start)
    print(f'training time: {end} seconds')
    return entire_hist, total
The problem is that when I run it, CPU usage is only about 20-30% and GPU usage is about 5%. I tried running it on different machines and I get similar results: the more powerful the CPU, the smaller the percentage of it the script uses.
That would be fine, except that it takes a few days to train such a small network for 1000-5000 epochs. Can someone help me make it train faster and increase the CPU usage?
I tried running it on both the CPU and GPU versions of TensorFlow.
My set up:
Latest keras with tensorflow backend
Tensorflow-gpu==1.4.0
Actually, it was quite simple: I was not using all of my cores. Since my system was distributing the load from one core across all 4, I didn't notice that.

Allocator runs out of memory even on very low batch sizes

This problem never used to occur but since today Tensorflow always tries to allocate a huge amount of memory, even when using very small batch sizes.
I followed this tutorial:
https://blog.keras.io/building-powerful-image-classification-models-using-very-little-data.html
"Using the bottleneck features of a pre-trained network: 90% accuracy in a minute"
This is my code:
import numpy as np
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Dropout, Flatten, Dense
from keras import applications
img_width, img_height = 150, 150
top_model_weights_path = 'bottleneck_fc_model.h5'
train_data_dir = 'C:\\ImageData\\Augmented\\Train'
validation_data_dir = 'C:\\ImageData\\Augmented\\Validate'
#train_data_dir = 'C:\\Users\\NSA\\flower_photos\\Train'
#validation_data_dir = 'C:\\Users\\NSA\\flower_photos\\Validate'
nb_train_samples = 25
nb_validation_samples = 5
epochs = 10
my_batch_size = 10
def save_bottleneck_features():
    datagen = ImageDataGenerator(rescale=1./255)
    # build the VGG16 network
    model = applications.VGG16(include_top=False, weights='imagenet')
    generator = datagen.flow_from_directory(
        train_data_dir,
        target_size=(img_width, img_height),
        batch_size=my_batch_size,
        class_mode=None,
        shuffle=False)
    bottleneck_features_train = model.predict_generator(
        generator,
        steps=nb_train_samples // my_batch_size,
        max_queue_size=10,
        workers=1,
        use_multiprocessing=False,
        verbose=1)
    # np.save writes binary data, so the file is opened in binary mode
    np.save(open('bottleneck_features_train.npy', 'wb'),
            bottleneck_features_train)
    generator = datagen.flow_from_directory(
        validation_data_dir,
        target_size=(img_width, img_height),
        batch_size=my_batch_size,
        class_mode=None,
        shuffle=False)
    bottleneck_features_validation = model.predict_generator(
        generator, nb_validation_samples // my_batch_size)
    np.save(open('bottleneck_features_validation.npy', 'wb'),
            bottleneck_features_validation)

def train_top_model():
    train_data = np.load(open('bottleneck_features_train.npy', 'rb'))
    # integer division: list repetition needs an int count in Python 3
    train_labels = np.array(
        [0] * (nb_train_samples // 2) + [1] * (nb_train_samples // 2))
    validation_data = np.load(open('bottleneck_features_validation.npy', 'rb'))
    validation_labels = np.array(
        [0] * (nb_validation_samples // 2) + [1] * (nb_validation_samples // 2))
    model = Sequential()
    model.add(Flatten(input_shape=train_data.shape[1:]))
    model.add(Dense(256, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy', metrics=['accuracy'])
    model.fit(train_data, train_labels,
              epochs=epochs,
              batch_size=my_batch_size,
              validation_data=(validation_data, validation_labels))
    model.save_weights(top_model_weights_path)

save_bottleneck_features()
train_top_model()
And this is the error I get:
PS C:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts> cd 'c:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts'; ${env:PYTHONIOENCODING}='UTF-8'; ${env:PYTHONUNBUFFERED}='1'; & 'C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\python.exe' 'C:\Users\NSA\.vscode\extensions\ms-python.python-2018.3.1\pythonFiles\PythonTools\visualstudio_py_launcher.py' 'c:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts' '50490' '34806ad9-833a-4524-8cd6-18ca4aa74f14' 'RedirectOutput,RedirectOutput' 'c:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts\first_try_real_transfer_learning_keras_vgg16.py'
C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\h5py\__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will
be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
Bottleneck Features saven
2018-04-09 16:02:08.772206: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-04-09 16:02:09.345010: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1212] Found device 0 with properties:
name: GeForce 940MX major: 5 minor: 0 memoryClockRate(GHz): 1.189
pciBusID: 0000:02:00.0
totalMemory: 2.00GiB freeMemory: 1.66GiB
2018-04-09 16:02:09.356147: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:1312] Adding visible gpu devices: 0
2018-04-09 16:02:10.108947: I C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\gpu\gpu_device.cc:993] Creating TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1429 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:02:00.0, compute capability: 5.0)
Found 109 images belonging to 2 classes.
2018-04-09 16:02:16.979539: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.33GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-04-09 16:02:17.441196: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.19GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-04-09 16:02:17.792983: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2018-04-09 16:02:18.122577: W C:\tf_jenkins\workspace\rel-win\M\windows-gpu\PY\36\tensorflow\core\common_runtime\bfc_allocator.cc:219] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.17GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2/2 [==============================] - 4s 2s/step
Traceback (most recent call last):
File "c:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts\first_try_real_transfer_learning_keras_vgg16.py", line 94, in <module>
save_bottleneck_features()
File "c:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts\first_try_real_transfer_learning_keras_vgg16.py", line 56, in save_bottleneck_features
bottleneck_features_train)
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\numpy\lib\npyio.py", line 511, in save
pickle_kwargs=pickle_kwargs)
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\numpy\lib\format.py", line 565, in write_array
version)
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\numpy\lib\format.py", line 335, in _write_array_header
fp.write(header_prefix)
TypeError: write() argument must be str, not bytes
PS C:\Users\NSA\ownCloud\Documents\Tensorflow\Skripts>
The error occurs specifically when calling model.predict_generator().
At first I thought it was running out of memory because my batch size was too large, but even with a batch size of 1 it requires over 2 GiB of memory. I have installed CUDA 9.0, cuDNN 7.0, TensorFlow 1.6.0, and Keras 2.1.5 with the TensorFlow backend. This used to work without issue, but it suddenly started giving me this error. I'm using an NVIDIA GeForce 940MX.
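Your problem has nothing to do with memory or TensorFlow. The traceback ends inside np.save: NumPy writes the array as bytes, but the file was opened in text mode, which only accepts str. That is exactly what "TypeError: write() argument must be str, not bytes" says.
Instead of opening the file as text:
open('bottleneck_features_train.npy', 'w')
open it as bytes:
open('bottleneck_features_train.npy', 'wb')
This applies to every call to open in your script: the np.save calls need 'wb', and the np.load calls in train_top_model need 'rb'.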
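The text-versus-binary distinction can be reproduced without NumPy or Keras. The following is a minimal sketch using only the standard library; the file name and the payload are made up for illustration, and the payload simply stands in for the header bytes that np.save passes to fp.write in the traceback above.

```python
import os
import tempfile

# Hypothetical file name in a temp directory, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'demo.npy')
payload = b'\x93NUMPY'  # the magic bytes that start every .npy file

# Text mode ('w'): write() only accepts str, so bytes are rejected.
try:
    with open(path, 'w') as f:
        f.write(payload)
    text_mode_error = None
except TypeError as exc:
    text_mode_error = str(exc)

# Binary mode ('wb'): the same bytes are written without complaint.
with open(path, 'wb') as f:
    f.write(payload)

# Reading the data back needs binary mode as well ('rb').
with open(path, 'rb') as f:
    restored = f.read()

print(text_mode_error)
print(restored == payload)
```

In the script from the question, an alternative is to skip open entirely and pass the file name to np.save and np.load, in which case NumPy opens the file in binary mode itself.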

Resources