How to explain TensorFlow conv2d memory consumption? - memory

output = tf.nn.conv2d(input, weights, strides = [1,3,3,1], padding = 'VALID')
My input has shape 200x225x225x1 and my weights have shape 15x15x1x64. Hence the output has shape 200x71x71x64, since (225 - 15)/3 + 1 = 71.
TensorBoard shows that this operation consumes 768 MB in total (see pic below). Even accounting for the size of the input (38.6 MB), the weights (0.06 MB) and the output (246.2 MB), the total memory consumption should not exceed 300 MB. So where does the rest of the memory consumption come from?
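
For reference, the ~300 MB estimate follows directly from the tensor shapes, assuming float32 (4 bytes per element); this quick sanity check is not part of the original question:

def mb(shape, bytes_per_elem=4):
    # Size in MiB of a dense float32 tensor with the given shape.
    n = 1
    for d in shape:
        n *= d
    return n * bytes_per_elem / 2 ** 20

print(mb((200, 225, 225, 1)))   # input   ~38.6 MB
print(mb((15, 15, 1, 64)))      # weights ~0.05 MB
print(mb((200, 71, 71, 64)))    # output  ~246.2 MB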

Although I'm not able to reproduce your graph and values from the information provided, it's possible that you're seeing additional memory usage due to intermediate values materialized during the computation of Conv2D. It's also possible that the instrumentation is incorrect (e.g. reshape operations that do not result in a copy of tensor memory end up duplicating the "memory usage" in the TF node stats instrumentation). Without a reproducible test case, it's hard to say more. If you do feel this is a bug in TensorFlow, please raise an issue on GitHub!

Related

How to check bus utilization / bus load for GPU during ML inference?

I am running ML inference for image recognition on the GPU using onnxruntime, and I am seeing an upper limit on how much performance improvement batching of images gives me: there is a reduction in inference time up to around a batch_size of 8, beyond which the time remains constant. I assume this must be because of some maximum utilization of the GPU resources, as I don't see any such limitation mentioned in the onnx documentation.
I tried using the package pynvml.smi to get nvidia_smi and printed some utilization factors during inference like so:
utilization_percent = nvidia_smi.getInstance().DeviceQuery()['gpu'][0]['utilization']
gpu_util.append(utilization_percent['gpu_util'])
mem_util.append(utilization_percent['memory_util'])
What I do see is that gpu_util and memory_util stay within 25% for the entire run of my inference, even at batch sizes like 32 or 64, so these are unlikely to be the cause of the bottleneck.
I assume, then, that it must be a bus load limitation that is causing this. I did not find any option within nvidia-smi to print the GPU bus load.
How can I find the bus load during inference?
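
One way to sample PCIe bus throughput directly is through the lower-level pynvml bindings, which expose a per-device throughput counter; the snippet below is a sketch under that assumption and is not from the original question:

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0; adjust as needed

# Throughput counters are reported in KB/s over a short sampling window.
tx_kb_s = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_TX_BYTES)
rx_kb_s = pynvml.nvmlDeviceGetPcieThroughput(handle, pynvml.NVML_PCIE_UTIL_RX_BYTES)
print(f"PCIe TX: {tx_kb_s} KB/s, RX: {rx_kb_s} KB/s")

pynvml.nvmlShutdown()

Calling this in a loop alongside the inference would show whether host-to-device transfers are anywhere near the PCIe link's capacity.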

Dask-ml ParallelPostFit not using distributed and causing memory error on local machine

I want to do Random Forest predictions on a large dataset and save the result as a dataframe. I read https://examples.dask.org/machine-learning/parallel-prediction.html and it says "Workers can write the predicted values to a shared file system, without ever having to collect the data on a single machine", but I can't figure out how to do this. I tried it by connecting to a distributed cluster and doing:
x = da.from_array(i, 100000)
t = model.predict(x)
t = client.persist(t)
df = dd.from_array(t)
df.to_parquet("xy.parquet")
However, this does not trigger any computation on the cluster (observed with the dashboard), and it runs my 1 TB RAM machine into a memory error when to_parquet computes, even for a test where the NumPy size of x and t is 7 GB. Anything else I submit to the cluster is computed there.
So how do I save the results of the prediction?
EDIT:
This seems to be an issue with the size of the input x. It has the shape (24507731, 8). If I instead just throw in random data with the shape (24507, 8), the computation finishes. This is quite surprising, as ParallelPostFit is supposed to make prediction on large data possible in the first place.
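
For reference, a minimal sketch of the pattern the linked dask example describes, assuming model is the fitted estimator from the question and a dask.distributed client is already connected (names such as wrapped are illustrative):

import dask.array as da
import dask.dataframe as dd
from dask_ml.wrappers import ParallelPostFit

wrapped = ParallelPostFit(estimator=model)

x = da.from_array(i, chunks=(100_000, 8))  # chunk along the rows only
t = wrapped.predict(x)                     # lazy dask array of predictions

# Convert to a dask DataFrame and let the workers write their own partitions.
df = dd.from_dask_array(t, columns="prediction").to_frame()
df.to_parquet("xy.parquet")

The key point is that to_parquet is written partition by partition on the workers, so the predictions never have to be collected on the local machine.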

SelectKBest gives memory error even with sparse matrix

I have some spreadsheet data that is over a GB and want to use random forest. Following some other questions on here I was able to tune the algorithm to work with my data, but unfortunately, to get the best performance I needed to do one-hot encoding of a categorical feature, and now my input matrix has over 3000 features, resulting in a memory error.
I'm trying to reduce these features, so I'm using SelectKBest with chi2, which according to the docs will deal with my sparse matrix, but I'm still getting a memory error.
I tried using to_sparse with fill_value=0, which seems to reduce the memory footprint, but when I call fit_transform I get a memory error:
MemoryError                               Traceback (most recent call last)
in ()
      4 Y_sparse = df_processed.loc[:,'Purchase'].to_sparse(fill_value=0)
      5
----> 6 X_new = kbest.fit_transform(X_sparse, Y_sparse)
The code that produces this is:
kbest = SelectKBest(mutual_info_regression, k=5)
X_sparse = df_processed.loc[:, df_processed.columns != 'Purchase'].to_sparse(fill_value=0)
Y_sparse = df_processed.loc[:, 'Purchase'].to_sparse(fill_value=0)
X_new = kbest.fit_transform(X_sparse, Y_sparse)
I simply want to reduce the 3000 features to something more manageable, say 20, that correlate well with my Y values (a continuous response).
The reason you are getting a memory error for everything is that to do anything in Pandas or sklearn, the entire dataset has to be loaded into memory, along with all the other data from intermediate steps.
Instead of doing one-hot encoding, try binary encoding or hashing encoding. One-hot encoding grows linearly in n, where n is the number of categories in a categorical feature, whereas binary encoding grows as log_2(n), so you should be able to avoid the memory error. If not, try hashing encoding.
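
A minimal sketch of those two encodings using the category_encoders package (the package choice is an assumption; the answer names the techniques but not a library):

import pandas as pd
import category_encoders as ce

# Hypothetical toy frame standing in for the spreadsheet data.
df = pd.DataFrame({
    "category": ["a", "b", "c", "d", "a"],
    "Purchase": [1.0, 3.5, 2.2, 0.7, 1.9],
})

# Binary encoding: roughly log2(n) columns instead of n one-hot columns.
binary = ce.BinaryEncoder(cols=["category"]).fit_transform(df)

# Hashing encoding: a fixed number of columns regardless of cardinality.
hashed = ce.HashingEncoder(cols=["category"], n_components=8).fit_transform(df)

print(binary.shape, hashed.shape)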

How to put datasets created by torchvision.datasets in GPU in one operation?

I'm dealing with CIFAR10 and I use torchvision.datasets to create it. I need a GPU to accelerate the computation, but I can't find a way to put the whole dataset onto the GPU at one time. My model needs to use mini-batches, and it is really time-consuming to deal with each batch separately.
I've tried to put each mini-batch onto the GPU separately, but it seems really time-consuming.
TL;DR
You won't save time by moving the entire dataset at once.
I don't think you'd necessarily want to do that even if you have the GPU memory to handle the entire dataset (of course, CIFAR10 is tiny by today's standards).
I tried various batch sizes and timed the transfer to GPU as follows:
from time import time
from matplotlib.pyplot import plot
from torch.utils.data import DataLoader

num_workers = 1  # Set this as needed

def time_gpu_cast(batch_size=1):
    start_time = time()
    for x, y in DataLoader(dataset, batch_size, num_workers=num_workers):
        x.cuda(); y.cuda()
    return time() - start_time

# Try various batch sizes
cast_times = [(2 ** bs, time_gpu_cast(2 ** bs)) for bs in range(15)]
# Try the entire dataset like you want to do
cast_times.append((len(dataset), time_gpu_cast(len(dataset))))
plot(*zip(*cast_times))  # Plot the time taken
For num_workers = 1, this is what I got:
And if we try parallel loading (num_workers = 8), it becomes even clearer:
I've got an answer and I'm going to try it later. It seems promising.
You can write a dataset class where, in the __init__ method, you read the entire dataset, apply all the transformations you need, and convert the result to tensor format. Then send this tensor to the GPU (assuming there is enough memory). Then, in the __getitem__ method, you can simply use the index to retrieve elements of that tensor, which is already on the GPU.
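
A minimal sketch of that idea for CIFAR10, assuming the whole dataset fits in GPU memory (class and variable names here are illustrative, not from the original answer):

import torch
from torch.utils.data import Dataset
from torchvision import datasets, transforms

class GPUCachedCIFAR10(Dataset):
    def __init__(self, root, train=True, device="cuda"):
        base = datasets.CIFAR10(root, train=train, download=True,
                                transform=transforms.ToTensor())
        # Apply the transforms once for every image, then move everything to the GPU.
        self.x = torch.stack([img for img, _ in base]).to(device)
        self.y = torch.tensor(base.targets).to(device)

    def __len__(self):
        return len(self.y)

    def __getitem__(self, idx):
        # Indexing just slices tensors that are already resident on the GPU.
        return self.x[idx], self.y[idx]

With a dataset like this you would normally keep num_workers=0 in the DataLoader, since the batches are already on the device and CUDA tensors do not play well with worker processes.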

How does one calculate the GPU memory required to run a model in TensorFlow?

Is there a straightforward way to find the GPU memory consumed by, say, an inception-resnet-v2 model that is initialized in TensorFlow? This includes the memory required for inference and for backprop.
You can explicitly calculate the memory needed to store the parameters, but I am afraid it would be difficult to compute the size of all the buffers needed for training. Probably a cleverer way would be to make TF do it for you. Set the gpu_options.allow_growth config option to True and see how much it consumes. Another option is to try smaller values for gpu_options.per_process_gpu_memory_fraction until it fails with out of memory.
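A minimal sketch of those two options using the TF 1.x API (matching the rest of this thread); the toy op is only there to make the snippet self-contained:

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True                       # grow the allocation as needed
# config.gpu_options.per_process_gpu_memory_fraction = 0.4   # or cap it and bisect downwards

a = tf.constant(1.0)
with tf.Session(config=config) as sess:
    print(sess.run(a))  # run your real training step here and watch nvidia-smi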
Since using gpu_options.allow_growth and gpu_options.per_process_gpu_memory_fraction for model size estimation is currently a trial-and-error and tedious solution, I suggest using tf.RunMetadata() in combination with TensorBoard.
Example:
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
# Run the training step with full tracing so memory stats are recorded.
sess.run(train_step, feed_dict=feed_dict, options=run_options, run_metadata=run_metadata)
train_writer.add_run_metadata(run_metadata, 'step%d' % i)
Run your model and TensorBoard, navigate to the desired part of your graph, and read the node statistics.
Source: https://www.tensorflow.org/get_started/graph_viz
