Number of MAC operations for softmax activation function - machine-learning

I am analysing the number of MAC operations required in a softmax layer of 10 units. I used STM32 X-CUBE-AI and it reports 150 MAC operations for this layer. I don't understand where this number comes from.
Can you show me why a softmax layer of 10 units uses 150 MACs?

Related

Backpropagation with Lots of Inputs

I'm a beginner and I'm trying to implement backpropagation in C# for school purposes (so no TensorFlow for now, we have to learn it manually). I have 64 nodes in the input layer and 64 nodes in the output layer, somewhat like an autoencoder structure, because we will be discussing MLPs later on.
I'm calculating Delta Output as:
delta_out = (y_out) * (1 - y_out) * (desired - y_out)
I have tested my program on an XOR input/output scenario and it guesses correctly there, but if I use all 64 input and output nodes, it does not give me correct predictions (roughly 0% accuracy).
I am also summing the absolute values of the output deltas, abs(delta_out). For the XOR scenario, this absolute sum approaches zero as training progresses. But for the 64-input/64-output test, the absolute sum of all delta_out starts at a very small number and stays there.
For the XOR that works properly (I have also tried OR and AND tests, which work fine), I used the following structure: 2 input nodes, 4 hidden nodes, and 1 output node.
For the 64 inputs and outputs, I have tested various numbers of hidden nodes, from 8 up to 128. If I use 64 or more hidden nodes, the absolute sum of all the delta_out is near 0 even at the start and changes too slowly.
I have also tested various learning rates (different learning rates for the hidden and output layers), from 0.1 to 0.75, but it doesn't seem to help for the 64-input/output task I'm supposed to accomplish. I have also increased the number of epochs from 100k to 500k, but nothing seems to help.
Maybe I don't understand the Backpropagation concept well?
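For reference, here is a minimal NumPy sketch (not the poster's C# code) of how the delta_out formula above fits into one full update, assuming sigmoid activations everywhere and a single hidden layer; all names and sizes are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, desired, W_hidden, W_out, lr=0.1):
    # forward pass (sigmoid everywhere, as the delta formula assumes)
    y_hidden = sigmoid(x @ W_hidden)
    y_out = sigmoid(y_hidden @ W_out)

    # output delta, exactly the formula from the question
    delta_out = y_out * (1.0 - y_out) * (desired - y_out)

    # hidden delta: propagate delta_out back through W_out, then apply
    # the sigmoid derivative of the hidden activations
    delta_hidden = y_hidden * (1.0 - y_hidden) * (delta_out @ W_out.T)

    # weight updates (in place)
    W_out += lr * np.outer(y_hidden, delta_out)
    W_hidden += lr * np.outer(x, delta_hidden)

    return np.abs(delta_out).sum()

# illustrative 64-32-64 network with small random weights
rng = np.random.default_rng(0)
W_hidden = rng.normal(0.0, 0.1, size=(64, 32))
W_out = rng.normal(0.0, 0.1, size=(32, 64))
x = rng.random(64)
train_step(x, x, W_hidden, W_out)   # autoencoder-style target: reproduce the input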

How to calculate optimal batch size

Sometimes I run into a problem:
OOM when allocating tensor with shape
e.g.
OOM when allocating tensor with shape (1024, 100, 160)
Where 1024 is my batch size and I don't know what the rest is. If I reduce the batch size or the number of neurons in the model, it runs fine.
Is there a generic way to calculate optimal batch size based on model and GPU memory, so the program doesn't crash?
In short: I want the largest batch size possible in terms of my model, which will fit into my GPU memory and won't crash the program.
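For what it's worth, the memory taken by that one tensor alone is easy to work out, assuming float32 (4 bytes per element):

# size of a single float32 tensor of shape (1024, 100, 160)
batch, dim1, dim2 = 1024, 100, 160
size_bytes = batch * dim1 * dim2 * 4      # 4 bytes per float32 element
print(size_bytes / 2**20, "MiB")          # 62.5 MiB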
From the recent Deep Learning book by Goodfellow et al., chapter 8:
Minibatch sizes are generally driven by the following factors:
- Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
- Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
- If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
- Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
- Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.
Which in practice usually means "in powers of 2 and the larger the better, provided that the batch fits into your (GPU) memory".
You might want also to consult several good posts here in Stack Exchange:
Tradeoff batch size vs. number of iterations to train a neural network
Selection of Mini-batch Size for Neural Network Regression
How large should the batch size be for stochastic gradient descent?
Just keep in mind that the paper by Keskar et al., 'On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima', quoted by several of the posts above, has received some objections from other respectable researchers in the deep learning community.
Hope this helps...
UPDATE (Dec 2017):
There is a new paper by Yoshua Bengio & team, Three Factors Influencing Minima in SGD (Nov 2017); it is worth reading in the sense that it reports new theoretical & experimental results on the interplay between learning rate and batch size.
UPDATE (Mar 2021):
Of interest here is also another paper from 2018, Revisiting Small Batch Training for Deep Neural Networks (h/t to Nicolas Gervais), which runs contrary to the "larger is better" advice; quoting from the abstract:
The best performance has been consistently obtained for mini-batch sizes between m=2 and m=32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.
You can estimate the largest batch size using:
Max batch size= available GPU memory bytes / 4 / (size of tensors + trainable parameters)
Use the summaries provided by torchsummary (pip install torchsummary) or Keras (built in).
E.g.
from torchsummary import summary
summary(model)
.....
.....
================================================================
Total params: 1,127,495
Trainable params: 1,127,495
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.02
Forward/backward pass size (MB): 13.93
Params size (MB): 4.30
Estimated Total Size (MB): 18.25
----------------------------------------------------------------
Each instance you put in the batch requires a full forward/backward pass in memory; the model itself you only need once. People seem to prefer batch sizes that are powers of two, probably because of automatic layout optimization on the GPU.
Don't forget to linearly increase your learning rate when increasing the batch size.
Let's assume we have a Tesla P100 at hand with 16 GB memory.
(16000 - model_size) / (forward_backward_size)
(16000 - 4.3) / 13.93 ≈ 1148.29
rounded down to the nearest power of 2 gives a batch size of 1024
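For concreteness, here is the same estimate as a small Python sketch, using the MB figures from the torchsummary output above and, as this answer does, treating the forward/backward pass size as per instance (the 16 GB Tesla P100 is the assumed card from the example):

import math

gpu_memory_mb = 16000       # Tesla P100 from the example above
params_mb = 4.30            # "Params size (MB)" from the summary
fwd_bwd_mb = 13.93          # "Forward/backward pass size (MB)", treated as per instance

max_batch = (gpu_memory_mb - params_mb) / fwd_bwd_mb    # ~1148.3
batch_size = 2 ** int(math.log2(max_batch))             # round down to a power of two
print(batch_size)                                       # 1024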
Here is a function to find batch size for training the model:
def FindBatchSize(model):
    """model: model architecture that is yet to be trained"""
    import os, sys, psutil, gc, tensorflow, keras
    import numpy as np
    from keras import backend as K

    BatchFound = 16
    try:
        total_params = int(model.count_params())
        GCPU = "CPU"
        # find whether a GPU is available
        try:
            if K.tensorflow_backend._get_available_gpus() == []:
                GCPU = "CPU"  # CPU and Cuda9 GPU
            else:
                GCPU = "GPU"
        except:
            from tensorflow.python.client import device_lib  # Cuda8 GPU

            def get_available_gpus():
                local_device_protos = device_lib.list_local_devices()
                return [x.name for x in local_device_protos if x.device_type == 'GPU']

            if "gpu" not in str(get_available_gpus()).lower():
                GCPU = "CPU"
            else:
                GCPU = "GPU"

        # decide batch size on the basis of GPU availability and model complexity
        if (GCPU == "GPU") and (os.cpu_count() > 15) and (total_params < 1000000):
            BatchFound = 64
        if (os.cpu_count() < 16) and (total_params < 500000):
            BatchFound = 64
        if (GCPU == "GPU") and (os.cpu_count() > 15) and (total_params < 2000000) and (total_params >= 1000000):
            BatchFound = 32
        if (GCPU == "GPU") and (os.cpu_count() > 15) and (total_params >= 2000000) and (total_params < 10000000):
            BatchFound = 16
        if (GCPU == "GPU") and (os.cpu_count() > 15) and (total_params >= 10000000):
            BatchFound = 8
        if (os.cpu_count() < 16) and (total_params > 5000000):
            BatchFound = 8
        if total_params > 100000000:
            BatchFound = 1
    except:
        pass

    try:
        # shrink the batch further if host memory is already under pressure
        memoryused = psutil.virtual_memory().percent
        if memoryused > 75.0:
            BatchFound = 8
        if memoryused > 85.0:
            BatchFound = 4
        if memoryused > 90.0:
            BatchFound = 2
        if total_params > 100000000:
            BatchFound = 1
        print("Batch Size: " + str(BatchFound))
        gc.collect()
    except:
        pass

    memoryused = []; total_params = []; GCPU = ""
    del memoryused, total_params, GCPU
    gc.collect()
    return BatchFound
I ran into a similar GPU mem error which was solved by configuring the tensorflow session with the following:
# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
see: google colaboratory `ResourceExhaustedError` with GPU
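If you are on TensorFlow 2.x, where ConfigProto and Session are gone, the rough equivalent of the snippet above is the following sketch (assuming at least one visible GPU):

import tensorflow as tf

# allocate GPU memory on demand instead of reserving it all up front
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)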

TensorFlow seq2seq model with low number of target_vocab_size

I am experimenting with the tensorflow seq2seq_model.py model.
The target vocab size I have is around 200.
The documentation says:
For vocabularies smaller than 512, it might be a better idea to just use a standard softmax loss.
The source-code also has the check:
if num_samples > 0 and num_samples < self.target_vocab_size:
Running the model with a target vocabulary of only 200 means the body of this if statement is never entered.
Do I need to write a "standard" softmax loss function to ensure a good training, or can I just let the model run as it comes?
Thanks for the help!
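To make the check from the question concrete: with the tutorial's default of num_samples = 512 (an assumption here; use whatever value you actually pass) and a target vocabulary of 200, the sampled-softmax branch is simply skipped:

num_samples = 512           # assumed default from the tutorial's translate.py
target_vocab_size = 200

use_sampled_softmax = num_samples > 0 and num_samples < target_vocab_size
print(use_sampled_softmax)  # False -> the sampled-softmax branch is not entered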
I am doing the same thing. Just to get my feet wet with different kinds of structures in the training data, I am working in an artificial test world with just 117 words in the (source and) target vocabulary.
I asked myself the same question and decided to not go through that hassle. My models train well even though I didn't touch the loss, thus still using the sampled_softmax_loss.
Further experiences with those small vocab sizes:
- batch size 32 is best in my case (smaller ones make training really unstable, and I run into NaN issues quickly)
- I am using AdaGrad as the optimizer and it works like magic
- I am working with model_with_buckets (addressed through translate.py), and size 512 with num_layers 2 produces the desired outcomes in many cases.

frequency sampling limit for beaglebone adc

I intend to use the BeagleBone to sample a shaped signal on the order of 1 microsecond. I need to fit the signal afterwards, so I would like a sampling rate of, say, 10 MHz. That seems feasible with the PRU and libpruio. The point is, looking at the ADC specifications, there seems to be a limit at 200 kHz. Is my reasoning correct?
thanks
You'll need additional hardware for a sampling rate of 10 MHz! Neither libpruio nor the BBB hardware is designed to work at that speed.
The ADC subsystem in the AM335x CPU is clocked at 24 MHz and needs 15 cycles per sample (14 in continuous mode). This leads to a maximum sample rate of 1.6 (1.74) MSamples/s. See the SRM, chapter 12, for details.
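As a quick sanity check on those figures (the 24 MHz clock and the cycle counts are the ones quoted above):

adc_clock_hz = 24_000_000    # AM335x ADC clock
print(adc_clock_hz / 15)     # one-shot mode:   1.6 MSamples/s
print(adc_clock_hz / 14)     # continuous mode: ~1.71 MSamples/s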
The problem is getting the samples into host memory. I couldn't get this working faster than ~250 kSamples/s (by CPU access; I didn't try DMA).
As long as you don't need more values than the FIFO can hold, you can sample a single line at up to 1.7 MHz.
BR

about CUFFT input sizes

It's written that the cuFFT library supports algorithms that are highly optimized for input sizes that can be written in the form 2^a × 3^b × 5^c × 7^d.
How did they manage to do that?
As far as I know, the FFT delivers its best performance only for input sizes of 2^a.
This means that input sizes with prime factors larger than 7 would run slower.
The Cooley-Tukey algorithm can operate on a variety of DFT lengths that can be expressed as N = N_1*N_2. The algorithm recursively re-expresses a DFT of length N as N_1 smaller DFTs of length N_2.
As you note, the fastest is generally the radix-2 factorization, which recursively breaks a DFT of length N into 2 smaller DFTs of length N/2, running in O(N log N).
However, the actual performance will depend on hardware and implementation. For example, if we consider cuFFT with a thread warp size of 32, then DFTs whose length is a multiple of 32 would be optimal (note: just an example, I'm not aware of the actual optimizations that exist under the hood of cuFFT).
Short answer: the underlying code is optimized for any prime factorization up to 7, based on the Cooley-Tukey radix-n algorithm.
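For illustration, here is a minimal recursive radix-2 Cooley-Tukey sketch in NumPy (power-of-two lengths only; cuFFT's mixed-radix kernels are of course far more elaborate):

import numpy as np

def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey DFT; len(x) must be a power of two."""
    x = np.asarray(x, dtype=complex)
    n = len(x)
    if n == 1:
        return x
    even = fft_radix2(x[0::2])                             # DFT of even-indexed samples
    odd = fft_radix2(x[1::2])                              # DFT of odd-indexed samples
    twiddle = np.exp(-2j * np.pi * np.arange(n // 2) / n)  # twiddle factors
    return np.concatenate([even + twiddle * odd,
                           even - twiddle * odd])

# agrees with NumPy's own FFT on a power-of-two length
x = np.random.rand(16)
assert np.allclose(fft_radix2(x), np.fft.fft(x))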
http://mathworld.wolfram.com/FastFourierTransform.html
https://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm
