Currently I am reading the following paper: "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size".
In section 4.2.3 (Activation function layer), there is the following statement:
The ramifications of the activation function is almost entirely
constrained to the training phase, and it has little impact on the
computational requirements during inference.
I understand the influence of activation function as follows.
An activation function (ReLU, etc.) is applied to each unit of the feature map after the convolution operation. I think this processing is the same in both training mode and inference mode. Why can we say that it has a big influence on training but not much influence on inference?
Can someone please explain this?
I think that processing at this time is the same processing in both the training mode and the inference mode.
You are right, the processing time of the activation function is the same.
But there is still a big difference between training time and test time:
Training involves applying the forward pass for a number of epochs, where each epoch usually covers the whole training dataset. Even for a small dataset such as MNIST (60,000 training images), this accounts for tens of thousands of invocations. The exact runtime impact depends on a number of factors (e.g. GPUs allow a lot of the computation to run in parallel), but in any case it is several orders of magnitude larger than the number of invocations at test time, where each batch is usually processed exactly once.
On top of that, you shouldn't forget about the backward pass, in which the derivative of the activation is also applied for the same number of epochs. For some activations the derivative is significantly more expensive to compute, e.g. ELU vs. ReLU (ELU's derivative involves an exponential, while ReLU's is a simple threshold; parametric variants such as PReLU additionally have learnable parameters that need to be updated).
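To make the forward-only vs. forward-plus-backward distinction concrete, here is a minimal timing sketch (my own illustration, not from the paper), assuming PyTorch on CPU; absolute numbers will vary with hardware and sizes.

```python
import time
import torch

x = torch.randn(256, 4096, requires_grad=True)

for act in (torch.nn.ReLU(), torch.nn.ELU()):
    # "Inference": forward passes only, no gradient bookkeeping.
    with torch.no_grad():
        t0 = time.perf_counter()
        for _ in range(1000):
            act(x)
        fwd_only = time.perf_counter() - t0

    # "Training": forward plus backward, repeated for every batch of every epoch.
    t0 = time.perf_counter()
    for _ in range(1000):
        act(x).sum().backward()
        x.grad = None                       # reset the accumulated gradient
    fwd_bwd = time.perf_counter() - t0

    name = act.__class__.__name__
    print(f"{name}: forward-only {fwd_only:.3f}s, forward+backward {fwd_bwd:.3f}s")
```

The gap between the two measurements, multiplied by the number of batches and epochs, is what dominates during training but is never paid at inference time.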
In the end, you are likely to ignore a 5% slowdown at inference time, because inference of a neural network is blazingly fast anyway. But you might care about extra minutes to hours of training for a single architecture, especially if you need to do cross-validation or hyper-parameter tuning across a number of models.
Related
I'm using Keras for a task that has one main target and several optional targets. When I train against all targets (and so have multiple outputs), it takes the same amount of time per epoch as when training with the single target/output. Shouldn't we expect it to take somewhat longer in the multi-target scenario?
This generally depends on how big the output layer is. In theory, more neurons in the output layer mean a bigger weight matrix (assuming an MLP), which means that backpropagation has to calculate derivatives with respect to more values; but when these output layers are relatively small to begin with, I presume there wouldn't be much difference.
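As a rough illustration (a toy sketch with made-up layer sizes, assuming tf.keras), an extra output head adds only a handful of parameters on top of the shared body, so the per-epoch work barely changes:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(128,))
body = tf.keras.layers.Dense(512, activation="relu")(inputs)
body = tf.keras.layers.Dense(512, activation="relu")(body)

main_out = tf.keras.layers.Dense(1, name="main_target")(body)
aux_out = tf.keras.layers.Dense(1, name="optional_target")(body)

single = tf.keras.Model(inputs, main_out)
multi = tf.keras.Model(inputs, [main_out, aux_out])

print(single.count_params())   # 329,217
print(multi.count_params())    # 329,730 -- only ~513 more (one extra Dense(1) head)
```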
I tried to implement a model in Keras with GRUs and LSTMs. The model architecture is the same for both implementations. As I read in many blog posts, inference is supposed to be faster for a GRU than for an LSTM. But in my case the GRU is not faster, and in fact comparatively slower than the LSTM. Can anybody find a reason for this? Is it something to do with GRUs in Keras, or am I going wrong somewhere?
Any help is highly appreciated.
Thanks in advance.
I would first check whether the LSTM you use is CuDNNLSTM or the simple LSTM. The former is a GPU-accelerated variant that runs much faster than the simple LSTM, even though training runs on the GPU in both cases.
Yes, the papers do not lie; in fact, there are fewer computations for a GRU-cell than an LSTM-cell.
Ensure that you do not compare a simple GRU with CuDNNLSTM.
For a fair benchmark, compare LSTM with GRU and CuDNNLSTM with CuDNNGRU.
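A minimal benchmarking sketch along these lines (assuming tf.keras / TensorFlow 2.x, where the plain LSTM/GRU layers pick the cuDNN kernel automatically when their arguments allow it; in TF 1.x, CuDNNLSTM and CuDNNGRU were separate layers):

```python
import time
import numpy as np
import tensorflow as tf

def build(cell):
    # Same sequence length, feature width, and output head for both models.
    return tf.keras.Sequential([
        cell(128, input_shape=(100, 32)),
        tf.keras.layers.Dense(1),
    ])

x = np.random.rand(512, 100, 32).astype("float32")

for cell in (tf.keras.layers.LSTM, tf.keras.layers.GRU):
    model = build(cell)
    model.predict(x, batch_size=64)                 # warm-up run
    t0 = time.perf_counter()
    model.predict(x, batch_size=64)                 # timed inference
    print(cell.__name__, f"{time.perf_counter() - t0:.3f}s")
```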
LSTM (Long Short-Term Memory): an LSTM cell has three gates (input, output, and forget gates).
GRU (Gated Recurrent Unit): a GRU cell has two gates (reset and update gates).
GRUs use fewer trainable parameters and therefore less memory; they execute faster and train faster than LSTMs, whereas LSTMs tend to be more accurate on datasets with longer sequences.
In short, if the sequences are long or accuracy is critical, go for LSTM; for lower memory consumption and faster operation, go for GRU. It all depends on your training-time/accuracy trade-off.
If in your case both architectures are the same, there might be an issue with the batch size. Make sure that the batch size and sequence length are also the same for both models.
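To double-check the "GRU uses fewer parameters" claim above for a given layer width, here is a quick sketch (illustrative, assuming tf.keras with its default settings):

```python
import tensorflow as tf

units, features = 128, 64

lstm = tf.keras.layers.LSTM(units)
gru = tf.keras.layers.GRU(units)            # TF 2.x default: reset_after=True
lstm.build(input_shape=(None, None, features))
gru.build(input_shape=(None, None, features))

# LSTM: 4 * ((features + units) * units + units)     = 98,816
# GRU : 3 * ((features + units) * units + 2 * units) = 74,496 (with reset_after)
print("LSTM parameters:", lstm.count_params())
print("GRU  parameters:", gru.count_params())
```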
For PyTorch's tutorial on performing transfer learning for computer vision (https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html), we can see that there is a higher validation accuracy than training accuracy. Applying the same steps to my own dataset, I see similar results. Why is this the case? Does it have something to do with ResNet 18's architecture?
Assuming there aren't bugs in your code and the train and validation data are in the same domain, then there are a couple reasons why this may occur.
Training loss/acc is computed as the average across an entire training epoch. The network begins the epoch with one set of weights and ends the epoch with a different (hopefully better!) set of weights. During validation you're evaluating everything using only the most recent weights. This means that the comparison between validation and train accuracy is misleading since training accuracy/loss was computed with samples from potentially much worse states of your model. This is usually most noticeable at the start of training or right after the learning rate is adjusted since the network often starts the epoch in a much worse state than it ends. It's also often noticeable when the training data is relatively small (as is the case in your example).
Another difference is the data augmentation used during training that isn't used during validation. During training you randomly crop and flip the training images. While these random augmentations are useful for increasing the ability of your network to generalize, they aren't performed during validation because they would diminish performance.
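Concretely, the tutorial's pipeline applies random augmentation on the training side only, roughly along these lines (a sketch assuming torchvision; the normalization constants are the usual ImageNet ones):

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),       # random: a different crop every epoch
    transforms.RandomHorizontalFlip(),       # random: flipped half the time
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
val_tf = transforms.Compose([
    transforms.Resize(256),                  # deterministic: same view of each image every time
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])
```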
If you were really motivated and didn't mind spending the extra computational power you could get a more meaningful comparison by running the training data back through your network at the end of each epoch using the same data transforms used for validation.
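A hedged sketch of that suggestion in PyTorch (`model`, `device`, and `train_dataset_eval_tf` — the training images wrapped with the validation transforms — are assumed to exist already):

```python
import torch

def evaluate(model, loader, device):
    model.eval()                      # disable dropout, use running BN statistics
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Training images with the *validation* transforms, evaluated once per epoch:
eval_train_loader = torch.utils.data.DataLoader(train_dataset_eval_tf, batch_size=64)
# at the end of each epoch:
# train_acc_comparable = evaluate(model, eval_train_loader, device)
```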
The short answer is that the train and validation data come from different distributions, and it's "easier" for the model to predict the target on the validation data than on the training data.
The likely reason in this particular case, as indicated by the answer above, is data augmentation during training. This is a way to regularize your model by increasing variability in the training data.
Other architectures may use Dropout (or its variants), which deliberately "hurts" training performance in order to reduce the potential for overfitting.
Notice that you're using a pretrained model, which already contains some information about how to solve the classification problem. If your domain is not that different from the data it was trained on, you can expect good performance off the shelf.
When training a net, does it matter if the number of samples in an epoch is not an exact multiple of the batch size? My training code doesn't seem to mind if this is the case, though my loss curve is pretty noisy at the moment (in case that is a related issue).
This would be useful to know, because if it is not an issue it saves messing around with the dataset to make its size an exact multiple of the batch size. It may also be less wasteful of captured data.
does it matter if the number of samples in the epoch is not an exact multiple of the batch size
No, it does not. Your number of samples can be, say, 1000, and your batch size 400.
You can decide the total number of iterations (where each iteration = sampling a batch and doing one gradient-descent update) based on the overall number of epochs you want to cover. Say you want roughly 5 epochs; then your number of iterations is at least 5 * 1000 / 400 = 12.5, i.e. 13 after rounding up. So you sample a random batch 13 times to get roughly 5 epochs.
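Here is a small sketch of that bookkeeping with the same numbers (1000 samples, batch size 400, roughly 5 epochs); the actual model and weight update are elided:

```python
import math
import numpy as np

n_samples, batch_size, target_epochs = 1000, 400, 5
n_iterations = math.ceil(target_epochs * n_samples / batch_size)   # -> 13

rng = np.random.default_rng(0)
for it in range(n_iterations):
    batch_idx = rng.choice(n_samples, size=batch_size, replace=False)
    # ... forward pass, loss, backward pass, weight update on batch_idx ...
```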
In the context of Convolutional Neural Networks (CNNs), the batch size is the number of examples fed to the algorithm at a time. This is normally some small power of 2, like 32, 64, or 128. During training, the optimization algorithm computes the average cost over a batch and then runs backpropagation to update the weights. In a single epoch the algorithm runs $n_{\text{batches}} = \frac{n_{\text{examples}}}{\text{batch size}}$ times (rounded up when the division isn't exact). Generally the algorithm needs to train for several epochs to achieve convergence of the weight values. Every batch is normally sampled randomly from the whole example set.
The idea is this: a mini-batch optimization step w.r.t. $(x_1, \dots, x_n)$ is equivalent to consecutive optimization steps w.r.t. the individual inputs $x_1, \dots, x_n$, because the gradient is a linear operator. This means that the mini-batch update equals the sum of its individual updates. An important note: I assume that the NN doesn't apply batch norm or any other layer that makes the computation depend on the batch itself (in that case the math is a bit hairier).
So the batch size can be seen as a purely computational device that speeds up optimization through vectorization and parallel computing. Assuming one can afford arbitrarily long training and the data are properly shuffled, the batch size can be set to almost any value. But this isn't automatically true for all hyperparameters; for example, a very high learning rate can easily make the optimization diverge, so don't conclude that hyperparameter tuning is unimportant in general.
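A quick numerical check of the linearity argument above (a toy sketch in PyTorch with a plain summed squared-error loss and no batch norm): the gradient of the mini-batch loss equals the sum of the per-sample gradients.

```python
import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
x = torch.randn(4, 3)           # a mini-batch of 4 samples
y = torch.randn(4)

def loss_fn(xb, yb):
    return ((xb @ w - yb) ** 2).sum()

# Gradient of the whole mini-batch at once
loss_fn(x, y).backward()
batch_grad = w.grad.clone(); w.grad = None

# Sum of the four per-sample gradients
per_sample = torch.zeros_like(w)
for i in range(4):
    loss_fn(x[i:i+1], y[i:i+1]).backward()
    per_sample += w.grad; w.grad = None

print(torch.allclose(batch_grad, per_sample))   # True
```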
I have some confusion related to kernel SVMs. I read that with a kernel SVM the number of support vectors retained is large, and that this is why it is difficult to train and time-consuming. I didn't get this part: why is it difficult to optimize? OK, I can accept that noisy data requires a large number of support vectors, but what does that have to do with the training time?
I was also reading another article where they were trying to convert a non-linear SVM kernel into a linear one. With a linear kernel it is just the dot product of the original features themselves, whereas with a non-linear one it is RBF or something similar. I didn't get what they mean by "manipulating the kernel matrix imposes significant computational bottleneck". As far as I know, the kernel matrix is static, isn't it? For a linear kernel it is just the dot product of the original features; for RBF it uses the Gaussian kernel. So I only need to calculate it once and then I am done, right? So what is the point about manipulating it, and where is the bottleneck?
Support Vector Machine (SVM) (Cortes and Vapnik, 1995) as the state-of-the-art classification algorithm has been widely applied in various scientific domains. The use of kernels allows the input samples to be mapped to a Reproducing Kernel Hilbert Space (RKHS), which is crucial to solving linearly non-separable problems. While kernel SVMs deliver the state-of-the-art results, the need to manipulate the kernel matrix imposes significant computational bottleneck, making it difficult to scale up on large data.
It's because the kernel matrix is N rows by N columns in size, where N is the number of training samples. So imagine you have 500,000 training samples; that would mean the matrix needs 500,000 * 500,000 * 8 bytes (about 2 terabytes, or 1.8 TiB) of RAM. This is huge and would require some kind of parallel computing cluster to deal with in any reasonable way. Not to mention the time it takes to compute each element: if it took your computer 1 microsecond per kernel evaluation, it would take 69.4 hours to compute the entire kernel matrix. For comparison, a good linear solver can handle a problem of this size in a few minutes or an hour on a regular desktop workstation. So that's why linear SVMs are preferred.
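For reference, the arithmetic behind those numbers, spelled out in a few lines of Python:

```python
n = 500_000                         # number of training samples
entries = n * n                     # kernel matrix is N x N
ram = entries * 8                   # 8 bytes per double-precision entry
print(ram / 2**40, "TiB")           # ~1.8 TiB of RAM

seconds = entries * 1e-6            # 1 microsecond per kernel evaluation
print(seconds / 3600, "hours")      # ~69.4 hours to fill the matrix
```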
To understand why they are so much faster you have to take a step back and think about how these optimizers work. At the highest level you can think of them as searching for a function that gives the correct outputs on all the training samples. Moreover, most solvers are iterative in the sense that they have a current best guess at what this function should be and in each iteration they test it on the training data and see how good it is. Then they update the function in some way to improve it. They keep doing this until they find the best function.
Keeping this in mind, the main reason why linear solvers are so fast is that the function they are learning is just a dot product between a fixed-size weight vector and a training sample. So in each iteration of the optimization it just needs to compute the dot product between the current weight vector and all the samples. This takes O(N) time; moreover, good solvers converge in just a few iterations regardless of how many training samples you have. So the working memory for the solver is just the memory required to store the single weight vector and all the training samples, which means the entire process takes only O(N) time and O(N) bytes of RAM.
A non-linear solver on the other hand is learning a function that is not just a dot product between a weight vector and a training sample. In this case, it is a function that is the sum of a bunch of kernel evaluations between a test sample and all the other training samples. So in this case, just evaluating the function you are learning against one training sample takes O(N) time. Therefore, to evaluate it against all training samples takes O(N^2) time. There have been all manner of clever tricks devised to try and keep the non-linear function compact to speed this up. But all of them are at least a little bit heuristic or approximate in some sense while good linear solvers find exact solutions. So that's part of the reason for the popularity of linear solvers.
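If you want to see the gap yourself, a small scikit-learn comparison can serve as a sketch (synthetic data, sizes kept modest so it finishes quickly; actual timings depend heavily on the data and the solver parameters):

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)

for clf in (LinearSVC(dual=False), SVC(kernel="rbf")):
    t0 = time.perf_counter()
    clf.fit(X, y)                   # the kernel solver slows down sharply as n_samples grows
    print(type(clf).__name__, f"{time.perf_counter() - t0:.1f}s")
```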