Does TensorFlow Serving run inference with a cache? - docker

When I serve my TF model with TensorFlow Serving (version 2.1.0) through Docker and run a stress test with JMeter, there is a problem: TPS hits about 4400 when testing with a single data point, but only reaches about 1700 when testing with multiple data points read from a txt file. The model is a BiLSTM that I trained without any cache settings. All experiments run on the local server rather than over the network.
Metrics:
In the single-data task, I run HTTP requests with identical data, with no interval, from 30 request threads for 10 minutes.
TPS: 4491
CPU usage: 2100%
99% latency line (ms): 17
Error rate: 0
In the multiple-data task, I run HTTP requests that read from a txt file, a dataset with 9,740,000 different examples, from 30 request threads.
TPS: 1711
CPU usage: 2300%
99% latency line (ms): 42
Error rate: 0
Hardware:
CPU cores: 12
Logical processors: 24
Intel(R) Xeon(R) Silver 4214 CPU @ 2.20GHz
Is there a cache in TensorFlow Serving?
Why is TPS roughly three times higher with single-data testing than with varied-data testing in the stress test?

I've solved the problem. Request threads reading from the same file need to wait for one another, which costs CPU on the machine running JMeter.

Related

Colab unable to load cache

I am trying to train a YOLOv5 neural network for recognizing vehicles. However, when it is trained on Google Colab, it always stops here:
train: Scanning 'MyDataset/train/labels.cache' for images and labels... 26559 found, 0 missing, 0 empty, 0 corrupted: 100% 26559/26559 [00:00<?, ?it/s]
train: Caching images (8.5GB): 62% 16425/26559 [00:46<00:30, 330.41it/s]C
CPU times: user 850 ms, sys: 162 ms, total: 1.01 s
Wall time: 1min 26s
I followed the tutorial from Roboflow. When I switched to the smaller dataset provided by Roboflow, the training was able to proceed. I'm a Colab Pro+ user, so it shouldn't be a matter of not having enough memory.
I switched to a smaller dataset and now it loads without any problems.
train: Caching images (4.6GB): 100% 8853/8853 [00:18<00:00, 483.20it/s]
Then it started training smoothly.
I think it is indeed a matter of too much data. However, Colab gives me no indication that it is running out of memory.

Why is my CPU doing matrix operations faster than my GPU?

When I tried to verify that the GPU does matrix operations faster than the CPU, I got unexpected results: the CPU performs better than the GPU in my experiment, which confuses me. I used the CPU and the GPU to do matrix multiplication respectively. The programming environment is MXNet with CUDA 10.1.
with gpu:
import mxnet as mx
from mxnet import nd
x = nd.random.normal(shape=(100000,100000),ctx=mx.gpu())
y = nd.random.normal(shape=(100000,100000),ctx=mx.gpu())
%timeit nd.dot(x,y)
50.8 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
with cpu:
x1 = nd.random.normal(shape=(100000,100000),ctx=mx.cpu())
y1 = nd.random.normal(shape=(100000,100000),ctx=mx.cpu())
%timeit nd.dot(x1,y1)
33.4 µs ± 1.54 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Why is the CPU faster? My CPU is an i5-6300HQ and my GPU is an NVIDIA GTX 950M.
TLDR: Your matrix multiplication is actually not running :)
MXNet is an asynchronous framework that piles work requests into a queue, which its execution engine processes asynchronously on a need-to-run basis. So what you're measuring is only the time it took to send the request, not the time to execute it. That's why it is so small (microseconds for a 100k*100k matrix would be surprisingly fast) and roughly equal for both CPU and GPU. To force execution, you need to add a call that forces production of a result, for example a print or nd.dot(x, y).wait_to_read(). See a code example very similar to your benchmark here: https://github.com/ThomasDelteil/MXNetParisWorkshop/blob/master/FromNDArrayToTrainedModel.ipynb
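Here is a minimal sketch of what a synchronous benchmark could look like, assuming the same MXNet nd API as above; the matrix size is deliberately reduced so it fits in memory, and wait_to_read() forces the engine to finish the multiplication before the timer stops:
import mxnet as mx
from mxnet import nd
from timeit import timeit
ctx = mx.gpu()  # switch to mx.cpu() for the CPU run
# 4096 x 4096 instead of 100000 x 100000, which would not fit in the GTX 950M's memory
x = nd.random.normal(shape=(4096, 4096), ctx=ctx)
y = nd.random.normal(shape=(4096, 4096), ctx=ctx)
nd.waitall()  # make sure the random initialization itself has finished
def matmul_sync():
    z = nd.dot(x, y)
    z.wait_to_read()  # block until the result is actually computed
print(timeit(matmul_sync, number=10) / 10, "seconds per multiplication")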
Extra comments:
The gain of using a GPU vs a CPU comes with the size of the parallelism opportunity. On simple tasks, that gain can be small to non-existent. CPU core frequencies are actually 2 to 3 times higher than GPU frequencies (your i5-6300HQ does 2.3GHz with 3.2GHz boost ability, while your GTX 950M does 0.9GHz with 1.1GHz boost ability).
MXNet ndarray is very fast at matrix algebra on CPU, because (1) its asynchronous paradigm optimizes the order of computation, (2) its C++ backend runs things in parallel, and (3) I believe the default MXNet build comes with Intel MKL, which significantly boosts the linear-algebra performance of Intel CPUs (https://medium.com/apache-mxnet/mxnet-boosts-cpu-performance-with-mkl-dnn-b4b7c8400f98). Its ability to run compute on GPU within the same API is also a big strength over NumPy, for example.
I don't think your test will even run on the GPU: instantiating such a big matrix on an NVIDIA Tesla V100 (16GB of memory, 4x more than a GTX 950M) results in a "large tensor size" error.
I don't know the module you're using, but your CPU can access your memory much more quickly and also keeps a lot of data in cache. Your GPU needs more time to load the data into GPU memory, and it also takes longer to be called from your CPU.
That's always the downside of GPU computation. When you can load a large batch of data into GPU memory, there's a good chance of it being faster.
By the way, that's why deep learning frameworks work in batches. When you can't work with batches, I'd always use the CPU. You also have some potential for performance improvements with multiprocessing.

Dask: many small workers vs a big worker

I am trying to understand this simple example from the dask-jobqueue documentation:
from dask_jobqueue import PBSCluster
cluster = PBSCluster(cores=36,
                     memory='100GB',
                     project='P48500028',
                     queue='premium',
                     walltime='02:00:00')
cluster.start_workers(100)  # Start 100 jobs that match the description above
from dask.distributed import Client
client = Client(cluster)  # Connect to that cluster
I think it means that there will be 100 jobs each using 36 cores.
Let's say I can use 48 cores on a cluster.
Should I use 1 worker with 48 cores, or 48 workers with 1 core each?
If your computations mostly release the GIL, then you'll probably want several threads per process. This is true if you're doing mostly Numpy, Pandas, Scikit-Learn, Numba/Cython programming on numeric data. I might do something like six processes with eight cores each.
If your computations are mostly pure Python code, for example if you process text data or iterate heavily with Python for loops over dicts/lists/etc., then you'll want fewer threads per process, maybe two.
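As a minimal sketch of how that split is expressed in dask-jobqueue (the memory, queue, and walltime values are just carried over from the example above), the processes argument cuts each job into that many worker processes, and the threads per worker follow as cores / processes:
from dask_jobqueue import PBSCluster
from dask.distributed import Client
# 48 cores per job, split into 6 worker processes -> 8 threads each,
# a reasonable default for mostly NumPy/Pandas workloads as described above
cluster = PBSCluster(cores=48,
                     processes=6,
                     memory='100GB',
                     queue='premium',
                     walltime='02:00:00')
client = Client(cluster)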

Why is my GPU slower than CPU when training LSTM/RNN models?

My machine has the following spec:
CPU: Xeon E5-1620 v4
GPU: Titan X (Pascal)
Ubuntu 16.04
Nvidia driver 375.26
CUDA toolkit 8.0
cuDNN 5.1
I've benchmarked the following Keras examples with TensorFlow as the backend, for reference:
SCRIPT NAME                  GPU      CPU
stated_lstm.py               5sec     5sec
babi_rnn.py                  10sec    12sec
imdb_bidirectional_lstm.py   240sec   116sec
imdb_lstm.py                 113sec   106sec
My GPU clearly outperforms my CPU on non-LSTM models.
SCRIPT NAME      GPU      CPU
cifar10_cnn.py   12sec    123sec
imdb_cnn.py      5sec     119sec
mnist_cnn.py     3sec     47sec
Has anyone else experienced this?
If you use Keras, use CuDNNLSTM in place of LSTM or CuDNNGRU in place of GRU. In my case (2x Tesla M60), I am seeing a 10x performance boost. By the way, I am using batch size 128 as suggested by @Alexey Golyshev.
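A minimal sketch of that swap, assuming Keras 2.x with the TensorFlow backend; the model is only a loose stand-in for the IMDB example scripts, and note that CuDNNLSTM does not accept the activation or recurrent_dropout arguments that LSTM does:
from keras.models import Sequential
from keras.layers import Embedding, CuDNNLSTM, Dense
model = Sequential()
model.add(Embedding(20000, 128))
model.add(CuDNNLSTM(128))  # drop-in replacement for LSTM(128), runs the fused cuDNN kernel
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
# model.fit(x_train, y_train, batch_size=128, epochs=4)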
The batch size is too small. Try increasing it.
Results for my GTX1050Ti:
imdb_bidirectional_lstm.py
batch_size time
32 (default) 252
64 131
96 87
128 66
imdb_lstm.py
batch_size time
32 (default) 108
64 50
96 34
128 25
Just a tip: using a GPU is powerful when
1. your neural network model is big.
2. the batch size is big.
That's what I found from googling.
I have run into similar issues:
Test 1
CPU: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
Ubuntu 14.04
imdb_bidirectional_lstm.py: 155s
Test 2
GPU: GTX 860m
Nvidia driver: 369.30
CUDA toolkit: v8.0
cuDNN: v6.0
imdb_bidirectional_lstm.py: 450s
Analysis
When I observed the GPU load curve, I found one interesting thing:
for LSTM, the GPU load jumps rapidly between ~80% and ~10%
[GPU load chart]
This is mainly due to the sequential computation in the LSTM layer. Remember that an LSTM requires sequential input to calculate the hidden-layer weights iteratively; in other words, you must wait for the hidden state at time t-1 to calculate the hidden state at time t.
That's not a good fit for GPU cores: they are many small cores that like doing computations in parallel, and sequential computation can't fully utilize their computing power. That's why we see GPU load around 10% - 20% most of the time.
But in the backpropagation phase, the GPU can run the derivative computations in parallel, so we see the GPU load peak at around 80%.
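A toy NumPy sketch (not from the answer above) of why the forward pass is inherently sequential: each hidden state depends on the previous one, so the loop over timesteps cannot be parallelised, only the work inside each step can.
import numpy as np
def rnn_forward(x_seq, W_xh, W_hh, h0):
    h = h0
    states = []
    for x_t in x_seq:  # must run in order: h_t needs h_{t-1}
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        states.append(h)
    return states
T, D, H = 5, 8, 16  # timesteps, input size, hidden size
states = rnn_forward(np.random.randn(T, D), np.random.randn(D, H),
                     np.random.randn(H, H), np.zeros(H))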

IOPS versus Throughput

What is the key difference between IOPS and Throughput in large data storage?
Does file size have an effect on IOPS? Why?
IOPS measures the number of read and write operations per second, while throughput measures the number of bits read or written per second.
Although they measure different things, they generally follow each other as IO operations have about the same size.
If you have large files, you simply need more IO operations to read the entire file. The file size has no effect on the IOPS as it measures the number of clusters read or written, not the number of files.
If you have small files, there will be more overhead, so while the IOPS and throughput look good, you may experience a lower actual performance.
This is the analogy I came up with when talking about Throughput and IOPS.
Think of it as:
You have 4 buckets (Disk blocks) of the same size that you want to fill or empty with water.
You'll be using a jug to transfer the water into the buckets. Now your question will be:
At a given time (per second), how many jugs of water can you pour (write) or withdraw (read)? This is IOPS.
At a given time (per second) what's the amount (bit, kb, mb, etc) of water the jug can transfer into/out of the bucket continuously? This is throughput.
Additionally, there is a delay in the process of you pouring and/or withdrawing the water. This is Latency.
There are 3 things to consider when talking about IOPS and throughput:
Size (file size/block size)
Patterns (Random/Sequential)
Mix (Read/Write) percentage
Disk IOPS describes the count of input/output operations on the disk per second, regardless of block size.
Disk throughput describes how much data can be transferred per second, so the block size plays a huge role in calculating the throughput required by an app.
Let's take as an example 3000 IOPS and a SQL database engine. The block size, in database-engine terms, is called the page size, and for SQL Server it equals 8 KB. If the IOPS are defined and you wish to calculate the actual throughput, you end up with the formula below:
throughput = IOPS * block size = 3000 * 8 KB = 24,000 KB/s = 24 MB/s
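A trivial helper reproducing that arithmetic (a hypothetical function, not part of any storage API; 1 MB is taken as 1000 KB to match the numbers above):
def iops_to_throughput_mb_s(iops, block_size_kb):
    # throughput = IOPS * block size
    return iops * block_size_kb / 1000
print(iops_to_throughput_mb_s(3000, 8))  # SQL Server page size of 8 KB -> 24.0 MB/s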
IOPS - the number of read/write operations per second; mostly relevant for OLTP transactions, and used in AWS for databases like Cassandra.
Throughput - the number of bits transferred per second, i.e. data transferred per second.
It is mainly a measure for high-data-transfer applications like big data Hadoop and Kafka streaming.
IOPS - the number of input/output operations a storage system can perform per second, each measured from start to finish.
Throughput - the data transfer speed, usually expressed in megabytes per second. Earlier it was measured in kilobytes, but now megabytes have become the standard.
For more about this, see: What is the difference between IOPS and throughput?
