Freezing some layers in Keras leads to slower training time - machine-learning

I have two identically structured Keras models. One is a standard DNN and the other is the same architecture with some layers set to trainable=False. Both achieve comparably accurate results.
I assumed the network with frozen weights would be faster to train because there are fewer parameters to tune, but the opposite is true: the frozen network takes 116 sec while the unfrozen one takes 105 sec. Why would this be? Is it overhead from the Keras model?
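For reference, a minimal sketch of the setup I mean, assuming tf.keras and made-up layer sizes; the only difference between the two models is which layers get trainable=False before compiling.

from tensorflow import keras
from tensorflow.keras import layers

def build_model(freeze_hidden_layers=False):
    # Hypothetical architecture; the real models just need identical layers.
    model = keras.Sequential([
        layers.Dense(256, activation="relu", input_shape=(100,)),
        layers.Dense(256, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    if freeze_hidden_layers:
        for layer in model.layers[:-1]:
            layer.trainable = False  # exclude these weights from gradient updates
    # trainable must be set before compile() for it to take effect
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    return model

unfrozen = build_model(freeze_hidden_layers=False)
frozen = build_model(freeze_hidden_layers=True)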

Related

About the impact of activation functions in CNN on computation time

Currently I am reading the following paper: "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size".
In section 4.2.3 (Activation function layer), there is the following statement:
The ramifications of the activation function is almost entirely
constrained to the training phase, and it has little impact on the
computational requirements during inference.
I understand the influence of the activation function as follows.
An activation function (ReLU, etc.) is applied to each unit of the feature map after the convolution operation. I think this processing is the same in both training mode and inference mode. Why can we say that it has a big influence on training but not much influence on inference?
Can someone please explain this?
I think that processing at this time is the same processing in both the training mode and the inference mode.
You are right, the processing time of the activation function is the same.
But there is still a big difference between training time and test time:
Training involves applying the forward pass for a number of epochs, where each epoch usually covers the whole training dataset. Even for a small dataset such as MNIST (60,000 training images), this amounts to tens of thousands of invocations. The exact runtime impact depends on a number of factors, e.g. GPUs allow a lot of computation in parallel, but in any case it is several orders of magnitude more than the number of invocations at test time, where usually a single batch is processed exactly once.
On top of that, you shouldn't forget the backward pass, in which the derivative of the activation is applied for the same number of epochs. For some activations the derivative is significantly more expensive, e.g. ELU vs. ReLU (ELU's derivative involves an exponential, and parametric variants such as PReLU additionally have learnable parameters that need to be updated).
In the end, you are likely to ignore a 5% slowdown at inference time, because inference of a neural network is blazingly fast anyway. But you might care about extra minutes to hours of training for a single architecture, especially if you need to do cross-validation or hyperparameter tuning across a number of models.
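To make the order-of-magnitude argument concrete, here is a rough count with illustrative numbers (MNIST-sized training set; batch size and epoch count are made up):

train_images = 60_000        # e.g. MNIST
batch_size = 128
epochs = 50

batches_per_epoch = train_images // batch_size     # ~468
train_forward_calls = batches_per_epoch * epochs   # ~23,400 activation passes
train_backward_calls = train_forward_calls         # derivative applied just as often
inference_calls = 1                                 # a single batch, processed once

print(train_forward_calls + train_backward_calls, "vs", inference_calls)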

Shuffling batches of data in training neural networks

I have 6.5 GB worth of training data to be used for my GRU network. I intend to split up the training time, i.e. pause and resume training, since I am using a laptop computer. I'm assuming it will take me days to train my neural net using the whole 6.5 GB, so I'll be pausing the training and then resuming it at some other time.
Here's my question. If I will shuffle the batches of training data, will the neural net remember which data has been used already for training or not?
Please note that I'm using the global_step parameter of tf.train.Saver().save.
Thank you very much in advance!
I would advise you to save your model at certain epochs; say you have 80 epochs, it would be wise to save your model every 20 epochs (20, 40, 60), though again this will depend on the capacity of your laptop. The reason is that after one epoch your network will have seen the whole training set. If your whole dataset can't be processed in a single epoch, I would advise you to randomly sample from the whole dataset to form the training set. The whole point of shuffling is to let the network generalize over the whole dataset, and it is usually done either per batch, when selecting the training dataset, or when starting a new training epoch. As for your main question: it is definitely fine to shuffle batches when training and resuming. Shuffling batches ensures that the gradients are calculated over the batch instead of over one image.
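A minimal sketch of the save/resume pattern with the TF1-style tf.train.Saver API mentioned in the question. A toy scalar model stands in for the GRU and the checkpoint directory is a made-up path; the point is that the checkpoint stores only the variables and global_step, not which examples have been seen, so shuffling freshly after resuming is fine.

import os
import numpy as np
import tensorflow as tf  # TF1-style graph API, as in the question

CKPT_DIR = "./checkpoints"  # hypothetical path
os.makedirs(CKPT_DIR, exist_ok=True)

# Toy model: fit a scalar weight w so that w * x approximates y.
x = tf.placeholder(tf.float32, shape=[None, 1])
y = tf.placeholder(tf.float32, shape=[None, 1])
w = tf.Variable(0.0)
loss = tf.reduce_mean(tf.square(w * x - y))
global_step = tf.train.get_or_create_global_step()
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss, global_step=global_step)
saver = tf.train.Saver()

data_x = np.random.rand(1000, 1).astype(np.float32)
data_y = 3.0 * data_x

with tf.Session() as sess:
    ckpt = tf.train.latest_checkpoint(CKPT_DIR)
    if ckpt:
        saver.restore(sess, ckpt)  # resume: weights and global_step come back
    else:
        sess.run(tf.global_variables_initializer())

    for epoch in range(5):
        order = np.random.permutation(len(data_x))  # fresh shuffle each run/epoch
        for start in range(0, len(order), 100):
            idx = order[start:start + 100]
            sess.run(train_op, feed_dict={x: data_x[idx], y: data_y[idx]})
        saver.save(sess, os.path.join(CKPT_DIR, "model"), global_step=global_step)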

Relationship Between Training Set Size and Training Epochs

I'm currently training a convolutional network on the Cifar10 dataset. Let's say I have 50,000 training images and 5,000 validation images.
To start testing my model, let's say I start with just 10,000 images to get an idea of how successful the model will be.
After 40 epochs of training with a batch size of 128, every epoch I'm running my optimizer to minimize the loss 10,000 / 128 ≈ 78 times on a batch of 128 images (SGD).
Now, let's say I found a model that achieves 70% accuracy on the validation set. Satisfied, I move on to train on the full training set.
This time, for every epoch I run the optimizer to minimize the loss 5 * 10,000 / 128 ≈ 391 times.
This makes me think that my accuracy at each epoch should be higher than on the limited set of 10,000. Much to my dismay, the accuracy on the limited training set increases much more quickly. At the end of the 40 epochs with the full training set, my accuracy is 30%.
Thinking the data may be corrupt, I perform limited runs on training images 10-20k, 20-30k, 30-40k, and 40-50k. Surprisingly, each of these runs resulted in an accuracy ~70%, close to the accuracy for images 0-10k.
Thus arise two questions:
Why would validation accuracy go down when the dataset is larger, even though I've confirmed that each segment of the data provides decent results on its own?
For the larger training set, would I need to train through more epochs, even though each epoch represents a larger number of training steps (391 vs. 78)?
It turns out my intuition was right, but my code was not.
Basically, I had been running my validation accuracy with training data (data used to train the model), not with validation data (data the model had not yet seen).
After correcting this error, the validation accuracy did indeed improve with a bigger training dataset, as expected.
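For what it's worth, a sketch of the corrected setup in Keras (a deliberately tiny model, since the exact architecture doesn't matter here): the key is that validation accuracy comes from a held-out split passed as validation_data, not from the training data itself.

from tensorflow import keras
from tensorflow.keras import layers

# CIFAR-10: 50,000 training images; hold out 5,000 for validation, as in the question.
(x_train, y_train), _ = keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
x_val, y_val = x_train[45000:], y_train[45000:]
x_train, y_train = x_train[:45000], y_train[:45000]

model = keras.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Wrong: evaluating on (x_train, y_train) reports training accuracy, not validation accuracy.
# Right: pass the held-out split so val_accuracy reflects data the model has not seen.
model.fit(x_train, y_train, batch_size=128, epochs=40,
          validation_data=(x_val, y_val))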

Effect of Data Parallelism on Training Result

I'm currently trying to implement multi-GPU training with a Tensorflow network. One solution would be to run one model per GPU, each with its own data batches, and combine their weights after each training iteration; in other words, "data parallelism".
So for example if I use 2 GPUs, train with them in parallel, and combine their weights afterwards, then shouldn't the resulting weights be different compared to training with those two data batches in sequence on one GPU? Because both GPUs have the same input weights, whereas the single GPU has modified weights for the second batch.
Is this difference just marginal, and therefore not relevant for the end result after many iterations?
The order of the batches fed into training makes some difference, but the difference may be small if you have a large number of batches. Each batch pulls the variables of the model a bit towards the minimum of the loss. A different order may make the path towards the minimum a bit different, but as long as the loss is decreasing, your model is training and its evaluation keeps getting better.
Sometimes, to avoid the same batches "pulling" the model in the same direction and to avoid it becoming too good only for some of the input data, the input for each model replica is randomly shuffled before being fed into the training program.
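A toy illustration of the difference (plain NumPy with a scalar linear model, not actual TensorFlow multi-GPU code): the parallel update averages gradients taken at the same starting weight, while the sequential update lets the second batch see an already-updated weight. The results are close but not identical.

import numpy as np

def grad(w, x, y):
    # gradient of the mean squared error for the scalar model y_hat = w * x
    return np.mean(2 * x * (w * x - y))

rng = np.random.RandomState(0)
x1, x2 = rng.rand(64), rng.rand(64)
y1, y2 = 3.0 * x1, 3.0 * x2       # true weight is 3.0
w0, lr = 0.0, 0.5

# Data parallel: both "GPUs" start from w0, gradients are averaged, one update.
w_parallel = w0 - lr * 0.5 * (grad(w0, x1, y1) + grad(w0, x2, y2))

# Sequential: the second batch sees the weight already updated by the first.
w_seq = w0 - lr * grad(w0, x1, y1)
w_seq = w_seq - lr * grad(w_seq, x2, y2)

print(w_parallel, w_seq)   # close, but not identical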

System freezes while running Tensorflow model training code

I am trying to train a model for handwritten digits. My data consists of images, each of size 80x80, so I get 6400 features as input.
While trying to train the model as per the code used in https://www.kaggle.com/kakauandme/digit-recognizer/tensorflow-deep-nn, my system hangs after 200 iterations with a training accuracy of 0.06.
Why is this happening? I don't get any errors. My system just freezes.
Also, how do I set the pooling and convolution layer parameters? Is the problem related to these parameters?
PS: I'm not using GPU.
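For reference, one way to spell out convolution and pooling parameters for 80x80 grayscale inputs, as a hypothetical Keras sketch (filter counts, kernel sizes, and the small batch size are generic starting values, not taken from the linked kernel; the small batch size is only a guess at easing memory pressure on a CPU-only machine):

from tensorflow import keras
from tensorflow.keras import layers

# 6400 input features reshaped to 80x80x1 images.
model = keras.Sequential([
    layers.Conv2D(32, kernel_size=5, activation="relu", input_shape=(80, 80, 1)),
    layers.MaxPooling2D(pool_size=2),   # 80 -> 76 after conv, -> 38 after pooling
    layers.Conv2D(64, kernel_size=5, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=32, epochs=5)   # small batches on CPU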
