Effect of Data Parallelism on Training Result - machine-learning

I'm currently trying to implement multi-GPU training with TensorFlow. One solution would be to run one model per GPU, each with its own data batches, and combine their weights after each training iteration. In other words, "data parallelism".
So, for example, if I use 2 GPUs, train with them in parallel, and combine their weights afterwards, shouldn't the resulting weights be different from training on those two data batches in sequence on one GPU? After all, both GPUs start from the same input weights, whereas the single GPU has already modified its weights before seeing the second batch.
Is this difference just marginal, and therefore not relevant for the end result after many iterations?

The order of the batches fed into training makes some difference, but the difference may be small if you have a large number of batches. Each batch pulls the variables in the model a bit towards the minimum of the loss, and a different order may make the path towards the minimum slightly different. But as long as the loss is decreasing, your model is training and its evaluation keeps getting better.
Sometimes, to prevent the same batches from "pulling" the model in the same direction and to avoid it becoming good only for some of the input data, the input for each model replica is randomly shuffled before being fed into the training program.
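As a toy illustration of why the two procedures differ only slightly (this is not TensorFlow multi-GPU code; the linear model, learning rate and batch sizes are made up for the example), one can compare a single averaged "data parallel" update against two sequential SGD updates on the same pair of batches:

```python
import numpy as np

# Toy setup: one linear weight trained with SGD on a squared loss.
rng = np.random.default_rng(0)
true_w = 3.0
x1, x2 = rng.normal(size=8), rng.normal(size=8)   # two data batches
y1, y2 = true_w * x1, true_w * x2
lr = 0.1

def grad(w, x, y):
    # gradient of 0.5 * mean((w*x - y)^2) with respect to w
    return np.mean((w * x - y) * x)

w0 = 0.0

# "Data parallel": both replicas start from w0 and their updates are averaged.
w_parallel = w0 - lr * 0.5 * (grad(w0, x1, y1) + grad(w0, x2, y2))

# Sequential: the second batch sees weights already updated by the first.
w_seq = w0 - lr * grad(w0, x1, y1)
w_seq = w_seq - lr * grad(w_seq, x2, y2)

print(w_parallel, w_seq)  # close, but not identical -- the paths differ slightly
```

Both runs move towards the true weight; the averaged update is equivalent to one step on the combined batch, while the sequential run takes two smaller steps along a slightly different path.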

Related

Determining the splitting ratio when augmenting image data

I have an image dataset that is quite imbalanced, with one class having 2873 images and another having only 115. The rest of the classes have ~250 images each. For reducing the imbalance, I decided to split the dataset into Train-Valid-Test components, with the major class having less proportion of images in the training set compared to the minor classes.
Then I'll be augmenting the data in the training set. I intend to perform an 80-10-10 split on the dataset.
Which outcome should be considered an 80-10-10 split?
Splitting the dataset in the proportion 80-10-10, and THEN augmenting the training images (which would eventually result in >80% proportion for the training set after augmentation).
Splitting the dataset in a proportion such that it eventually results in an 80-10-10 split AFTER augmentation.
Also, is it acceptable to have an 85-7.5-7.5 split, provided it reduces imbalance in the dataset?
First, the splitting of the dataset should be stratified, which means that each class should appear in the training, validation and test sets in the defined proportions.
Data augmentation should only be applied to the training dataset: its main goal is to improve the performance of the model by creating additional, varied examples for training, so there is no need to apply it to the validation and test sets.
For the last question: you can use whatever proportions you want for validation and test, as long as they give you enough data to evaluate the model.
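A minimal sketch of a stratified 80/10/10 split done before augmentation, assuming scikit-learn and Keras are available (the array names, class counts and augmentation parameters are placeholders, not taken from the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholder data: indices standing in for images, plus their class labels.
X = np.arange(1000).reshape(-1, 1)
y = np.repeat(np.arange(5), 200)

# Stratified 80/10/10 split: each class keeps its proportion in every subset.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

# Augmentation is configured for the training generator only; the
# validation/test generator applies no augmentation.
train_gen = ImageDataGenerator(rotation_range=15, horizontal_flip=True)
eval_gen = ImageDataGenerator()
```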

How do neural networks learn functions instead of memorize them?

For a class project, I designed a neural network to approximate sin(x), but ended up with a NN that just memorized my function over the data points I gave it. My NN took in x-values with a batch size of 200. Each x-value was multiplied by 200 different weights, mapping to 200 different neurons in my first layer. My first hidden layer contained 200 neurons, each one a linear combination of the x-values in the batch. My second hidden layer also contained 200 neurons, and my loss function was computed between the 200 neurons in my second layer and the 200 values of sin(x) that the input mapped to.
The problem is, my NN perfectly "approximated" sin(x) with 0 loss, but I know it wouldn't generalize to other data points.
What did I do wrong in designing this neural network, and how can I avoid memorization and instead design my NNs to "learn" the patterns in my data?
It is the same with any machine learning algorithm. You have a dataset from which you try to learn "the" function f(x) that actually generated the data. With real-life datasets it is impossible to recover the original function from the data, so we approximate it with some function g(x).
The main goal of any machine learning algorithm is to predict unseen data as best as possible using the function g(x).
Given a dataset D you can always train a model which perfectly classifies all the datapoints (you could use a hashmap to get 0 error on the train set), but that is overfitting, or memorization.
To avoid such things, you yourself have to make sure that the model does not memorise but learns the function. There are a few things which can be done; I will try to write them down in an informal way.
Train, Validation, Test
If you have a large enough dataset, use Train, Validation and Test splits. Split the dataset into three parts, typically 60%, 20% and 20% for Training, Validation and Test respectively. (These numbers can vary based on need; in the case of imbalanced data, check how to get stratified partitions which preserve the class ratios in every split.) Next, forget about the Test partition: keep it somewhere safe and don't touch it. Your model will be trained using the Training partition. Once you have trained the model, evaluate its performance using the Validation set. Then select another hyper-parameter configuration for your model (e.g. number of hidden layers, learning algorithm, other parameters, etc.), train again, and evaluate on the Validation set. Keep doing this for several such models, then select the model which got you the best validation score.
The role of the Validation set here is to check what the model has learned. If the model has overfit, the validation scores will be very bad, so in the above process you will discard those overfit models. But keep in mind that although you did not use the Validation set to train the model directly, it was still used indirectly to select the model.
Once you have selected a final model based on the Validation set, take out your Test set, as if you had just received a new dataset from real life which no one has ever seen. The model's predictions on this Test set will indicate how well it has "learned", as it is now predicting datapoints which it has never seen, directly or indirectly.
It is key not to go back and tune your model based on the Test score, because once you do that, the Test set starts contributing to your model selection.
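A minimal sketch of this workflow, using scikit-learn and toy sin(x) data for brevity (the hyper-parameter grid, sizes and names are illustrative, not from the answer):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

# Toy sin(x) data; 60/20/20 split as described above.
rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(2000, 1))
y = np.sin(X).ravel()

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Try a few hyper-parameter configurations, select the one with the best
# validation score, and only then evaluate the chosen model on the test set.
best_model, best_val_mse = None, np.inf
for hidden in [(16,), (64,), (64, 64)]:
    model = MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000, random_state=0)
    model.fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val_mse:
        best_model, best_val_mse = model, val_mse

print("validation MSE of selected model:", best_val_mse)
print("test MSE (used once, at the very end):",
      mean_squared_error(y_test, best_model.predict(X_test)))
```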
Cross-validation and bootstrap sampling
On the other hand, if your dataset is small, you can use bootstrap sampling or k-fold cross-validation; the ideas are similar. For example, for k-fold cross-validation with k=5, split the dataset into 5 parts (again being careful about stratified sampling). Let's name the parts a, b, c, d, e. Use the partitions [a,b,c,d] to train and compute the prediction score on [e] only. Next, use the partitions [a,b,c,e] to train and score on [d] only, and continue 5 times, each time keeping one partition aside and training the model on the other 4. Then take the average of these scores; it indicates how your model might perform if it sees new data. It is also good practice to repeat this several times and average. For example, for smaller datasets, perform 10-fold cross-validation 10 times, which will give a fairly stable score (depending on the dataset) that is indicative of the prediction performance.
Bootstrap sampling is similar, but you sample the same number of datapoints (it depends) with replacement from the dataset and use this sample to train. The sample will have some datapoints repeated (as it was drawn with replacement). Then use the datapoints missing from the training sample to evaluate the model. Do this multiple times and average the performance.
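A sketch of repeated k-fold cross-validation with scikit-learn (RepeatedKFold here; RepeatedStratifiedKFold would be the stratified counterpart for classification; the model and data are placeholders):

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(500, 1))
y = np.sin(X).ravel()

# 10 repeats of 5-fold cross-validation: each datapoint is held out once per
# repeat, and the averaged score estimates how the model might do on new data.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
print("mean MSE over all folds:", -scores.mean())
```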
Others
Other ways are to incorporate regularisation techniques in the classifier's cost function itself. For example, in Support Vector Machines the cost function enforces a "margin", a gap between the two class regions, around the decision boundary. In neural networks one can do similar things (although not in exactly the same way as in SVMs).
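For neural networks, one common way to build such a penalty into the cost function is an L2 weight penalty, often combined with dropout; a Keras sketch with made-up layer sizes and coefficients:

```python
import tensorflow as tf

# L2 weight penalty and dropout are two common regularisation techniques.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(1,),
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```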
In neural networks you can also use early stopping. This trains on the Train dataset but evaluates performance on the Validation dataset at each epoch. If the model starts to overfit from a specific epoch, the Training error will keep decreasing while the Validation error starts increasing, indicating that your model is overfitting; based on this, you can stop training.
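A minimal Keras sketch of early stopping on a validation split (the architecture, patience value and data are arbitrary choices for the example):

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
X = rng.uniform(-np.pi, np.pi, size=(2000, 1)).astype("float32")
y = np.sin(X)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Stop when the validation loss has not improved for 10 epochs and restore
# the weights from the best epoch.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=500, callbacks=[early_stop], verbose=0)
```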
A large dataset from the real world tends not to overfit too much (citation needed). Also, if your model has too many parameters (too many hidden units and layers) and is unnecessarily complex, it will tend to overfit. A model with fewer parameters is much less likely to overfit (though it can underfit if the number of parameters is too low).
In the case of your sin function task, the neural net is bound to overfit, as the data is ... exactly the sin function. Such tests can really help you debug and experiment with your code.
Another important note: if you try to do a Train/Validation/Test split or k-fold cross-validation on data generated by the sin function, splitting it in the "usual" way will not work, as in this case we are dealing with a time series; for such cases one can use the techniques mentioned here.
First of all, I think it's a great project to approximate sin(x). It would be great if you could share the snippet or some additional details so that we could pinpoint the exact problem.
However, I think that the problem is that you are overfitting the data hence you are not able to generalize well to other data points.
A few tricks that might work:
Get more training points
Go for regularization
Add a test set so that you know whether you are overfitting or not.
Keep in mind that 0 loss or 100% accuracy on the training set is usually not a good sign.

LSTM: choice of batch size when State reused for generation later

I am building an LSTM model that generates symbols step-by-step. The task is to train the model up to some point of the data sequence and then to use the trained model to process the remaining pieces of the sequence in the test phase -- these remaining pieces weren't seen during Training.
For this task, I am attempting to re-use the latest state from the Training phase for the subsequent Prediction phase (i.e. not to start predicting with clean zero-state, but to sort-of continue where things were left off during training).
In this context, I am wondering how to best choose the Batch size for training.
My Training data is one long sequence of time-ordered observations. If that sequence is chopped up into N batches for Training, then my understanding is that the State tensor will be of shape [N, Network_Size] during Training, and [1, Network_Size] during Prediction. So for Prediction, I simply take the last element of the [N, Network_Size] tensor, which is of shape [1, Network_Size].
That seems to work in terms of mechanics, but this means that the value of N determines how many observations that last vector of the original State has seen during Training.
Is there a best practice for determining how to choose N? The network trains much faster with a larger batch size, but I am concerned that this way the last part of the State tensor may not have seen enough. Obviously I'm trying various combinations, but I'm curious how others have dealt with it.
Also, I have seen a few examples where parameters like this (or Cell size/etc.) are set as powers-of-2 (i.e. 64, 128, etc.). Is there any theoretical reason behind that vs simple 50/100/etc.? Or just a quirky choice?
First, for your last question: for computers, powers of two are simpler to handle than powers of 10 (memory sizes and alignment constraints, for example, are likelier to be powers of two).
It is unclear from your question what you mean by training: updating parameters, or just computing RNN forward steps. Updating parameters doesn't make much sense, because for RNNs (including LSTMs) you'd ideally update parameters only after seeing an entire batch of sequences (and you usually need many updates until the model is at all reasonable). Similarly, RNN forward steps don't make much sense to me, because the state for each example is independent of the batch size (ignoring any batch normalization you might be doing).
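For the mechanics described in the question, a small Keras sketch (shapes and sizes are hypothetical) of how the [N, units] state produced by training-time batches can be sliced down to a [1, units] state for continued single-sequence prediction:

```python
import numpy as np
import tensorflow as tf

# Hypothetical shapes: N parallel sub-sequences during training, one
# continuing sequence at prediction time.
units, N, timesteps = 128, 8, 20
lstm = tf.keras.layers.LSTM(units, return_state=True)

x_train = np.random.randn(N, timesteps, 1).astype("float32")
_, h, c = lstm(x_train)                                     # h, c: [N, units]

x_next = np.random.randn(1, 1, 1).astype("float32")
_, h1, c1 = lstm(x_next, initial_state=[h[-1:], c[-1:]])    # [1, units]
print(h1.shape, c1.shape)
```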

Time Series Prediction using LSTM

I am using Jason Brownlee's tutorial (mirror) to apply an LSTM network to some syslog/network log data. He's a master!
I have syslog data (a specific event) for each day for the last 1 year, so I am using an LSTM network for time series analysis. I am using the LSTM from the Keras deep learning library.
As I understand -
About Batch_size
A batch of data is a fixed-sized number of rows from the training dataset that defines how many patterns to process before updating the weights of the network. Based on the batch_size, the model takes random samples from the data for the analysis. For time series this is not desirable, hence the batch_size should always be 1.
About setting value for shuffle value
By default, the samples within an epoch are shuffled prior to being exposed to the network. This is undesirable for the LSTM because we want the network to build up state as it learns across the sequence of observations. We can disable the shuffling of samples by setting "shuffle" to "False".
Scenario 1 -
Using the above two rules/guidelines, I ran several trials with different numbers of neurons, epoch sizes and different layers, and got better results than the baseline model (persistence model).
Scenario 2 -
Without using the above guidelines/rules, I ran several trials with different numbers of neurons, epoch sizes and different layers, and got even better results than in Scenario 1.
Query - are these settings (batch_size of 1 and no shuffling) for time series a rule or a guideline?
It seems logical reading the tutorial that the data for time series should not be shuffled as we do not want to change the sequence of data, but for my data the results are better if I let the data be shuffled.
In the end, what I think matters is whether I get better predictions from my runs.
I think I should set "theory" aside in favour of concrete evidence, such as metrics, elbows, RMSEs, etc.
Kindly enlighten.
It depends a lot on the size of your data and also on the number of variables. In my experience, decreasing the batch size gives better results since the updates are more frequent, but on huge datasets it is very expensive, so you have to play with this trade-off (training time vs result).
About your shuffling: it may be that your data is not that correlated with the past. If that is the case, shuffling the data helps the network learn and generalize (for instance when the data is ordered by label); check reason 7 of "37 Reasons why your Neural Network is not working".
Regarding batch size: the larger it is, the harder it is to generalize (reason 11). When the data clearly depends on the past, you can declare your LSTM in Keras as stateful, which means "that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch" according to the Keras API. Hope this helps.
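A minimal sketch of a stateful LSTM in Keras (batch size 1, arbitrary layer sizes, random placeholder data): with stateful=True the final states of one batch become the initial states of the next, so shuffling must be off and the states are reset manually between epochs.

```python
import numpy as np
import tensorflow as tf

batch_size, timesteps, features = 1, 10, 1
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, stateful=True,
                         batch_input_shape=(batch_size, timesteps, features)),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

X = np.random.randn(40, timesteps, features).astype("float32")
y = np.random.randn(40, 1).astype("float32")

for epoch in range(3):                      # manual loop so state can be reset
    model.fit(X, y, batch_size=batch_size, epochs=1, shuffle=False, verbose=0)
    model.reset_states()                    # reset between epochs, not batches
```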

How to predict using Keras?

I'm using model.predict() to make predictions batch by batch.
By doing so, I find that Keras seems to normalize the input on a per-batch basis, which means that if I combine the results they become meaningless, because they are not normalized together and the probabilities vary a lot from batch to batch.
So what should I do to tackle this? Loading all the data into one batch is not an option, as the dataset is too large. Should I pass in one sample at a time, at the cost of performance?
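For reference, a hypothetical sketch of the batch-by-batch prediction loop described above and how its outputs can be concatenated (`model` and `data` are assumed to exist; predict() also accepts a batch_size argument directly):

```python
import numpy as np

def predict_in_batches(model, data, batch_size=256):
    # Run predict() on consecutive slices and stitch the results back together.
    parts = []
    for start in range(0, len(data), batch_size):
        parts.append(model.predict(data[start:start + batch_size], verbose=0))
    return np.concatenate(parts, axis=0)
```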
