run tensorflow textsum model on gpu

On cpu, the textsum training code (https://github.com/tensorflow/models/tree/master/research/textsum) runs flawlessly.
To run the code on the GPU for faster training, I have changed "cpu" to "gpu" in the following 4 lines:
https://github.com/tensorflow/models/blob/master/research/textsum/seq2seq_attention.py#L84
https://github.com/tensorflow/models/blob/master/research/textsum/seq2seq_attention_model.py#L149
https://github.com/tensorflow/models/blob/master/research/textsum/seq2seq_attention_model.py#L217
https://github.com/tensorflow/models/blob/master/research/textsum/seq2seq_attention_model.py#L228
The model is still being trained, but again on the CPU and not on the GPU (I checked using nvidia-smi).
Please let me know if anything else needs to be changed to train it on the GPU.
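For reference, the kind of edit I made is switching the device string passed to tf.device; a generic sketch (not the exact textsum source) looks like this:

import tensorflow as tf

with tf.device('/cpu:0'):    # before: variables/ops pinned to the CPU
    w_cpu = tf.get_variable('w_cpu', shape=[256, 256])

with tf.device('/gpu:0'):    # after: variables/ops pinned to the first GPU
    w_gpu = tf.get_variable('w_gpu', shape=[256, 256])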

Changing these lines of code will let you train the model on the GPU.
textsum/seq2seq_attention.py
# change this line
# tf.app.flags.DEFINE_integer('num_gpus', 0, 'Number of gpus used.')
tf.app.flags.DEFINE_integer('num_gpus', 1, 'Number of gpus used.')
textsum/seq2seq_attention_model.py
# change these lines
# if self._num_gpus > 1:
#     self._cur_gpu = (self._cur_gpu + 1) % (self._num_gpus-1)
self._cur_gpu = (self._cur_gpu + 1) % (self._num_gpus-1 if self._num_gpus > 1 else self._num_gpus)
reference issue link
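It can also help to confirm where TensorFlow is actually placing the ops after the change. A minimal sketch using the TF 1.x API (if the training script builds its session through a tf.train.Supervisor, the same ConfigProto can be passed there):

import tensorflow as tf

# Log "op -> device" assignments when the session is created, so you can see
# whether the seq2seq ops actually land on /gpu:0 or silently fall back to the CPU.
config = tf.ConfigProto(allow_soft_placement=True,    # fall back only when a GPU kernel is missing
                        log_device_placement=True)
sess = tf.Session(config=config)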

Related

Using Bloom AI Model on Mac M1 for continuing prompts (Pytorch)

I am trying to run the bigscience BLOOM model on my MacBook M1 Max with 64 GB of RAM, with PyTorch freshly installed for Apple M1 chips and Python 3.10.6.
I am not able to get any output at all.
I have the same issue with other models and I really don't know how to fix it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "mps" if torch.backends.mps.is_available() else "cpu"
if device == "cpu" and torch.cuda.is_available():
    device = "cuda"  # if the device is cpu and cuda is available, set the device to cuda
print(f"Using {device} device")  # print the device

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom").to(device)

input_text = "translate English to German: How old are you?"
input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to(device)

outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0]))
I've tried it with other models (smaller BERT models) and also tried letting it run just on the CPU without using the mps device at all.
Maybe someone can help.
It might be taking too long to get the output. Do you want to break it down into serial calls involving a) the embedding layer, b) the 70 BLOOM blocks, c) the output layer norm, and d) the token decoding?
An example to run this code is available at https://nbviewer.org/urls/arteagac.github.io/blog/bloom_local.ipynb .
It basically boils down to:
def forward(input_ids):
    # 1. Create attention mask and position encodings
    attention_mask = torch.ones(len(input_ids)).unsqueeze(0).bfloat16().to(device)
    alibi = build_alibi_tensor(input_ids.shape[1], config.num_attention_heads,
                               torch.bfloat16).to(device)
    # 2. Load and use word embeddings
    embeddings, lnorm = load_embeddings()
    hidden_states = lnorm(embeddings(input_ids))
    del embeddings, lnorm
    # 3. Load and use the BLOOM blocks sequentially
    for block_num in range(70):
        load_block(block, block_num)
        hidden_states = block(hidden_states, attention_mask=attention_mask, alibi=alibi)[0]
        print(".", end='')
    hidden_states = final_lnorm(hidden_states)
    # 4. Load and use the language model head
    lm_head = load_causal_lm_head()
    logits = lm_head(hidden_states)
    # 5. Compute next token
    return torch.argmax(logits[:, -1, :], dim=-1)
Please refer to the linked notebook for the implementation of the helper functions used in the forward call.
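For context, a minimal sketch of how the forward helper above could be driven for greedy generation; it assumes the tokenizer from the question, and max_new_tokens is just an illustrative value:

input_ids = tokenizer("translate English to German: How old are you?",
                      return_tensors="pt").input_ids

max_new_tokens = 10    # illustrative; every step re-runs the full 70-block forward pass
for _ in range(max_new_tokens):
    next_token = forward(input_ids)                                # greedy argmax at the last position
    input_ids = torch.cat([input_ids, next_token.unsqueeze(0)], dim=-1)

print(tokenizer.decode(input_ids[0]))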

grid of mtry values while training random forests with ranger

I am working with a subset of the 'Ames Housing' dataset and originally have 17 features. Using the 'recipes' package, I have engineered the original set of features and created dummy variables for the nominal predictors with the following code. That has resulted in 35 features in the 'baked_train' dataset below.
blueprint <- recipe(Sale_Price ~ ., data = ames_train) %>%
  step_nzv(Street, Utilities, Pool_Area, Screen_Porch, Misc_Val) %>%
  step_impute_knn(Gr_Liv_Area) %>%
  step_integer(Overall_Qual) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_other(Neighborhood, threshold = 0.01, other = "other") %>%
  step_dummy(all_nominal_predictors(), one_hot = FALSE)

prepare <- prep(blueprint, data = ames_train)

baked_train <- bake(prepare, new_data = ames_train)
baked_test <- bake(prepare, new_data = ames_test)
Now, I am trying to train random forests with the 'ranger' package using the following code.
cv_specs <- trainControl(method = "repeatedcv", number = 5, repeats = 5)

param_grid_rf <- expand.grid(mtry = seq(1, 35, 1),
                             splitrule = "variance",
                             min.node.size = 2)

rf_cv <- train(blueprint,
               data = ames_train,
               method = "ranger",
               trControl = cv_specs,
               tuneGrid = param_grid_rf,
               metric = "RMSE")
I have set the grid of 'mtry' values based on the number of features in the 'baked_train' data. It is my understanding that 'caret' will apply the blueprint within each resample of 'ames_train' creating a baked version at each CV step.
The text Hands-On Machine Learning with R by Boehmke & Greenwell says in section 3.8.3:
Consequently, the goal is to develop our blueprint, then within each resample iteration we want to apply prep() and bake() to our resample training and validation data. Luckily, the caret package simplifies this process. We only need to specify the blueprint and caret will automatically prepare and bake within each resample.
However, when I run the code above I get an error:
mtry can not be larger than number of variables in data. Ranger will EXIT now.
I get the same error when I specify 'tuneLength = 20' instead of the 'tuneGrid', although the code works fine when the grid of 'mtry' values is specified to be from 1 to 17 (the number of features in the original training data 'ames_train').
When I specify a grid of 'mtry' values from 1 to 17, info about the final model after CV is shown below. Notice that it mentions Number of independent variables: 35 which corresponds to the 'baked_train' data, although specifying a grid from 1 to 35 throws an error.
Type: Regression
Number of trees: 500
Sample size: 618
Number of independent variables: 35
Mtry: 15
Target node size: 2
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 995351989
R squared (OOB): 0.8412147
What am I missing here? Specifically, why do I have to specify the number of features in 'ames_train' instead of 'baked_train' when essentially 'caret' is supposed to create a baked version before fitting and evaluating the model for each resample?
Thanks.

onnxruntime inference is way slower than pytorch on GPU

I was comparing the inference times for an input using PyTorch and onnxruntime, and I find that onnxruntime is actually slower on GPU while being significantly faster on CPU.
I was trying this on Windows 10.
ONNX Runtime installed from source - ONNX Runtime version: 1.11.0 (onnx version 1.10.1)
Python version - 3.8.12
CUDA/cuDNN version - cuda version 11.5, cudnn version 8.2
GPU model and memory - Quadro M2000M, 4 GB
Relevant code -
import torch
from torchvision import models
import onnxruntime # to inference ONNX models, we use the ONNX Runtime
import onnx
import os
import time
batch_size = 1
total_samples = 1000
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
def convert_to_onnx(resnet):
    resnet.eval()
    dummy_input = (torch.randn(batch_size, 3, 224, 224, device=device)).to(device=device)
    input_names = ['input']
    output_names = ['output']
    torch.onnx.export(resnet,
                      dummy_input,
                      "resnet18.onnx",
                      verbose=True,
                      opset_version=13,
                      input_names=input_names,
                      output_names=output_names,
                      export_params=True,
                      do_constant_folding=True,
                      dynamic_axes={
                          'input': {0: 'batch_size'},    # variable length axes
                          'output': {0: 'batch_size'}}
                      )
def infer_pytorch(resnet):
    print('Pytorch Inference')
    print('==========================')
    print()
    x = torch.randn((batch_size, 3, 224, 224))
    x = x.to(device=device)
    latency = []
    for i in range(total_samples):
        t0 = time.time()
        resnet.eval()
        with torch.no_grad():
            out = resnet(x)
        latency.append(time.time() - t0)
    print('Number of runs:', len(latency))
    print("Average PyTorch {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))
def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()
def infer_onnxruntime():
    print('Onnxruntime Inference')
    print('==========================')
    print()
    onnx_model = onnx.load("resnet18.onnx")
    onnx.checker.check_model(onnx_model)

    # Input
    x = torch.randn((batch_size, 3, 224, 224))
    x = x.to(device=device)
    x = to_numpy(x)

    so = onnxruntime.SessionOptions()
    so.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL
    so.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL

    exproviders = ['CUDAExecutionProvider', 'CPUExecutionProvider']

    model_onnx_path = os.path.join(".", "resnet18.onnx")
    ort_session = onnxruntime.InferenceSession(model_onnx_path, so, providers=exproviders)

    options = ort_session.get_provider_options()
    cuda_options = options['CUDAExecutionProvider']
    cuda_options['cudnn_conv_use_max_workspace'] = '1'
    ort_session.set_providers(['CUDAExecutionProvider'], [cuda_options])

    # IOBinding
    input_names = ort_session.get_inputs()[0].name
    output_names = ort_session.get_outputs()[0].name
    io_binding = ort_session.io_binding()
    io_binding.bind_cpu_input(input_names, x)
    io_binding.bind_output(output_names, device)

    # warm up run
    ort_session.run_with_iobinding(io_binding)
    ort_outs = io_binding.copy_outputs_to_cpu()

    latency = []
    for i in range(total_samples):
        t0 = time.time()
        ort_session.run_with_iobinding(io_binding)
        latency.append(time.time() - t0)
        ort_outs = io_binding.copy_outputs_to_cpu()
    print('Number of runs:', len(latency))
    print("Average onnxruntime {} Inference time = {} ms".format(device.type, format(sum(latency) * 1000 / len(latency), '.2f')))
if __name__ == '__main__':
    torch.cuda.empty_cache()
    resnet = (models.resnet18(pretrained=True)).to(device=device)
    convert_to_onnx(resnet)
    infer_onnxruntime()
    infer_pytorch(resnet)
Output
If run on CPU,
Average onnxruntime cpu Inference time = 18.48 ms
Average PyTorch cpu Inference time = 51.74 ms
but, if run on GPU, I see
Average onnxruntime cuda Inference time = 47.89 ms
Average PyTorch cuda Inference time = 8.94 ms
If I change the graph optimization level to onnxruntime.GraphOptimizationLevel.ORT_DISABLE_ALL, I see some improvement in inference time on the GPU, but it's still slower than PyTorch.
I use IO binding for the input tensor (a numpy array), and the nodes of the model are on the GPU.
Further, during the processing for onnxruntime, I print device usage stats and I see this -
Using device: cuda:0
GPU Device name: Quadro M2000M
Memory Usage:
Allocated: 0.1 GB
Cached: 0.1 GB
So, GPU device is being used.
Further, I have used the resnet18.onnx model from the ModelZoo to see if it is a converted model issue, but I get the same results.
What am I doing wrong or missing here?
When calculating inference time, exclude all code that should run only once, like resnet.eval(), from the loop.
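A rough sketch of what the adjusted PyTorch timing could look like, reusing batch_size, total_samples and device from the question's script; torch.cuda.synchronize() is my addition here (it is not in the original code) so the measured time includes the finished GPU computation:

import time
import torch

def infer_pytorch_timed(resnet):
    resnet.eval()                                  # one-time setup, outside the timed loop
    x = torch.randn((batch_size, 3, 224, 224)).to(device=device)
    latency = []
    with torch.no_grad():
        for _ in range(total_samples):
            t0 = time.time()
            out = resnet(x)
            if device.type == 'cuda':
                torch.cuda.synchronize()           # wait for the GPU before reading the clock
            latency.append(time.time() - t0)
    print('Average PyTorch {} inference time = {:.2f} ms'.format(
        device.type, sum(latency) * 1000 / len(latency)))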
Please include the imports in your example:
import torch
from torchvision import models
import onnxruntime # to inference ONNX models, we use the ONNX Runtime
import onnx
import os
import time
After running your example on GPU only, I found that the times differ by only about 2x, so the speed difference may be caused by framework characteristics. For more details, explore ONNX conversion optimization.
Onnxruntime Inference
==========================
Number of runs: 1000
Average onnxruntime cuda Inference time = 4.76 ms
Pytorch Inference
==========================
Number of runs: 1000
Average PyTorch cuda Inference time = 2.27 ms
For CPU you don't need to use IO binding; it is required only for GPU.
Also, don't change the session options, as onnxruntime selects the best options by default.
The following things may help to speed up the GPU path:
Make sure to install onnxruntime-gpu, which comes with the prebuilt CUDA EP and TensorRT EP.
You are currently binding the inputs and outputs to the CPU. When using onnxruntime with the CUDA EP you should bind them to the GPU (to avoid copying inputs/outputs between CPU and GPU); refer here.
I suggest you use the io_binding.bind_input() method instead of io_binding.bind_cpu_input().
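A minimal sketch of what that GPU-side binding could look like with onnxruntime.OrtValue, reusing ort_session, input_names, output_names and the numpy input x from the question's code:

import numpy as np
import onnxruntime

# Copy the input to the GPU once, so run_with_iobinding() does not re-copy it on every call.
x_ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(x, 'cuda', 0)

io_binding = ort_session.io_binding()
io_binding.bind_input(name=input_names,
                      device_type=x_ortvalue.device_name(),    # 'cuda'
                      device_id=0,
                      element_type=np.float32,
                      shape=x_ortvalue.shape(),
                      buffer_ptr=x_ortvalue.data_ptr())
io_binding.bind_output(output_names, 'cuda')                   # keep the output on the GPU as well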
for i in range(total_samples):
    t0 = time.time()
    ort_session.run_with_iobinding(io_binding)
    latency.append(time.time() - t0)
    -> ort_outs = io_binding.copy_outputs_to_cpu()
Copying the output from GPU to CPU on every one of the 1000 iterations drops the performance.
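A sketch of the adjusted loop (same names as in the question's code), with the device-to-host copy moved outside the timed section:

latency = []
for _ in range(total_samples):
    t0 = time.time()
    ort_session.run_with_iobinding(io_binding)    # timed: inference only
    latency.append(time.time() - t0)

ort_outs = io_binding.copy_outputs_to_cpu()       # copy the result back to the CPU once, after timing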

Slow training of RL algorithm and low CPU usage

I have a Python 3.6.4 script, which is a simple reinforcement learning script based on a 4-layer MLP (input, hidden-128, hidden-256, output). I can't publish my whole code here because it is too big, so I will post only a small part of it.
def train(model, epochs):
    total = 0
    start = time.time()
    entire_hist = []
    profits = []
    for i in range(epochs):
        loss = 0
        accuracy = 0
        hist = []
        env.reset()
        game_over = False
        input_t = env.observe()
        while not game_over:
            input_tm1 = input_t
            if np.random.random() <= epsilon:
                action = np.random.randint(0, num_actions, size=None)
            else:
                q = model.predict(input_tm1)
                action = np.argmax(q[0])
            input_t, reward, game_over = env.act(action)
            exp_replay.remember([input_tm1, action, reward, input_t], game_over)
            inputs, targets = exp_replay.get_batch(model, batch_size=batch_size)
            loss, accuracy = model.train_on_batch(inputs, targets)
            hist.append([loss, accuracy, env.main])
            print(f'counter: {env.counter}, action taken: {action}, reward: {round(reward, 2)}, main: {round(env.main)}, secondary: {env.secondary}')
            if game_over:
                print('GAME OVER!')
        entire_hist.append(hist)
        profits.append(total)
        print(f'total profit: {env.total_profit}')
        print(f'epoch: {i}, loss: {loss}, accuracy: {accuracy}')
        print('\n')
        print('*'*20)
    end = int(time.time() - start)
    print(f'training time: {end} seconds')
    return entire_hist, total
The problem is that when I run it, CPU usage is only about 20-30% and GPU usage is about 5%. I tried running it on different machines and got similar results: the more powerful the CPU, the lower the percentage of it the script uses.
That would be acceptable if it did not take a few days to train such a small network when running it for 1000-5000 epochs. Can someone help me make it train faster and increase the CPU usage?
I tried running on both the CPU and GPU versions of TensorFlow.
My set up:
Latest keras with tensorflow backend
Tensorflow-gpu==1.4.0
Actually, it was quite simple: I was just not using all of my cores. Since my system was distributing the load from one core across all 4, I didn't notice that.
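The change itself is not shown above; a minimal sketch of one way to control how many cores the TF 1.x backend uses under Keras would be (the thread counts are illustrative):

import tensorflow as tf
from keras import backend as K

# Let TensorFlow 1.x use several threads both inside a single op (intra_op)
# and across independent ops (inter_op); tune the counts to your CPU.
config = tf.ConfigProto(intra_op_parallelism_threads=4,
                        inter_op_parallelism_threads=4)
K.set_session(tf.Session(config=config))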

How to do proper testing in Weka and how to get desired results?

I am currently working on an application of ANN, SVM and linear regression methods for prediction of the fruit yield of a region based on meteorological factors (13 factors).
The total data set size is 36.
While implementing those methods in WEKA I am getting bad results.
For example, in the case of MultilayerPerceptron my results are:
(I divided the dataset into 28 instances for training and 8 for testing)
=== Run information ===
Scheme: weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -G -R
Relation: apr6_data
Instances: 28
Attributes: 15
Time taken to build model: 3.69 seconds
=== Predictions on test set ===
inst# actual predicted error
1 2.551 2.36 -0.191
2 2.126 3.079 0.953
3 2.6 1.319 -1.281
4 1.901 3.539 1.638
5 2.146 3.635 1.489
6 2.533 2.917 0.384
7 2.54 2.744 0.204
8 2.82 3.473 0.653
=== Evaluation on test set ===
=== Summary ===
Correlation coefficient -0.4415
Mean absolute error 0.8493
Root mean squared error 1.0065
Relative absolute error 144.2248 %
Root relative squared error 153.5097 %
Total Number of Instances 8
In the case of SVM for regression:
inst# actual predicted error
1 2.551 2.538 -0.013
2 2.126 2.568 0.442
3 2.6 2.335 -0.265
4 1.901 2.556 0.655
5 2.146 2.632 0.486
6 2.533 2.24 -0.293
7 2.54 2.766 0.226
8 2.82 3.175 0.355
=== Evaluation on test set ===
=== Summary ===
Correlation coefficient 0.2888
Mean absolute error 0.3417
Root mean squared error 0.3862
Relative absolute error 58.0331 %
Root relative squared error 58.9028 %
Total Number of Instances 8
What could be the possible error in my application? Please let me know!
Thanks
Do I need to normalize the data? I guess it is being done by the WEKA classifiers.
If you want to normalize the data, you have to do it yourself: Preprocess tab -> Filter (Choose) -> find Normalize -> click Apply.
If you want to discretize your data, you have to follow the same process.
You might have better luck with discretising the prediction, e.g. into low/medium/high yield.
Whether you need to normalize or discretize cannot be decided based on your data or on your single run. For instance, discretization brings better results for naive Bayes classifiers; for SVM, I am not sure.
I did not see precision, recall or F-score for your data. But since you say you have bad results on the test set, it is very possible that your classifier is overfitting. Try to increase the number of training instances (36 is too few, I guess). Keep us posted on what happens when you increase the training instances.
