Shuffling batches of data in training neural networks - machine-learning

I have 6.5 GB worth of training data to be used for my GRU network. I intend to split the training time, i.e. pause and resume training, since I am using a laptop computer. I'm assuming it will take days to train my neural net on the whole 6.5 GB, so I'll be pausing the training and resuming it at some other time.
Here's my question: if I shuffle the batches of training data, will the neural net keep track of which data has already been used for training?
Please note that I'm using the global_step parameter of tf.train.Saver().save.
Thank you very much in advance!

I would advise you to save your model at certain epochs. Say you have 80 epochs; it would be wise to save your model every 20 epochs (20, 40, 60), but again this will depend on the capacity of your laptop. The reason is that after one epoch your network will have seen the whole training set. If your whole dataset can't be processed in a single epoch, I would advise you to randomly sample from it what will be the training set. The whole point of shuffling is to let the network generalize over the whole dataset, and it is usually done per batch, when selecting the training set, or when starting a new training epoch. As for your main question: it is definitely OK to shuffle batches when pausing and resuming training. Shuffling batches ensures that the gradients are calculated over the batch instead of over a single image.
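A minimal sketch of this pause/resume-plus-shuffle workflow, assuming the TF1-style tf.train.Saver API the question mentions; the tiny model, checkpoint directory, and random arrays below are stand-ins, not the asker's actual setup:

    import numpy as np
    import tensorflow as tf

    # Toy graph standing in for the GRU network.
    x = tf.placeholder(tf.float32, [None, 10])
    y = tf.placeholder(tf.float32, [None, 1])
    pred = tf.layers.dense(x, 1)
    loss = tf.reduce_mean(tf.square(pred - y))
    global_step = tf.train.get_or_create_global_step()
    train_op = tf.train.AdamOptimizer(1e-3).minimize(loss, global_step=global_step)

    saver = tf.train.Saver()
    data_x = np.random.rand(1000, 10).astype(np.float32)  # stand-in for the 6.5 GB set
    data_y = np.random.rand(1000, 1).astype(np.float32)
    batch_size = 32

    with tf.Session() as sess:
        ckpt = tf.train.latest_checkpoint("./ckpt")
        if ckpt:
            saver.restore(sess, ckpt)          # resume from the last pause point
        else:
            sess.run(tf.global_variables_initializer())

        for epoch in range(5):
            order = np.random.permutation(len(data_x))   # fresh shuffle every epoch
            for start in range(0, len(order), batch_size):
                idx = order[start:start + batch_size]
                sess.run(train_op, {x: data_x[idx], y: data_y[idx]})
            # A safe point to pause; global_step is baked into the checkpoint name.
            saver.save(sess, "./ckpt/model", global_step=global_step)

The checkpoint stores the weights and the step counter, not which examples have been seen, so shuffling between pauses does not interfere with restoring.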

Related

Should I use the same training set from one epoch to another (convolutional neural network)

From what I know about convolutional neural networks, you must feed the same training examples each epoch, but shuffled (so the network won't remember some particular order while training).
However, in this article, they're feeding the network 640,000 random samples each epoch (so only some of the training examples were "seen" before):
Each training instance was a uniformly sampled set of 3 images, 2 of which are of the same class (x and x+), and the third (x−) of a different class. Each training epoch consisted of 640000 such instances (randomly chosen each epoch), and a fixed set of 64000 instances used for test.
So, do I have to use the same training examples each epoch, and why?
Experimental results are poor when I use random samples - the accuracy varies a lot. But I want to know why.
Most of the time you might want to use as much data as you can. However, in the paper you cite they train a triplet loss, which uses triples of images, and there could be billions of such triples.
You might wonder why we introduce the idea of an epoch in the first place if we're likely to obtain a different training set each time. The answer is technical: we'd like to evaluate the network on the validation data once in a while, and you might also want to decay the learning rate based on the number of completed epochs.
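A rough sketch of that per-epoch resampling idea, using NumPy only; the label array and the triplet counts here are made up for illustration, not taken from the paper:

    import numpy as np

    def sample_triplets(labels, n_triplets, rng):
        """Return (anchor, positive, negative) index triples sampled uniformly."""
        triplets = []
        by_class = {c: np.where(labels == c)[0] for c in np.unique(labels)}
        classes = list(by_class)
        while len(triplets) < n_triplets:
            c_pos, c_neg = rng.choice(classes, 2, replace=False)
            if len(by_class[c_pos]) < 2:
                continue                        # need two distinct same-class images
            a, p = rng.choice(by_class[c_pos], 2, replace=False)
            n = rng.choice(by_class[c_neg])
            triplets.append((a, p, n))
        return np.array(triplets)

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 10, size=5000)     # stand-in class labels
    for epoch in range(3):
        epoch_triplets = sample_triplets(labels, n_triplets=640, rng=rng)
        # ... train on epoch_triplets, then evaluate / decay the learning rate ...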

Do trained weights depend on the order in which trained data has been input?

Suppose one makes a neural network using Keras. Do the trained weights depend on the order in which the training data has been fed into the system? Is it OK to feed data belonging to one category first and then data belonging to another category, or should the order be random?
Since training is done in batches, which means optimizing the weights on the data chunk by chunk, the main assumption is that each batch is somewhat representative of the dataset. To keep the batches representative, it is better to sample the data randomly.
Bottom line: the network will, in theory, learn better if you feed it the data in random order. I strongly advise you to shuffle your dataset when feeding it in training mode (there is a shuffle option in the .fit() function).
In inference mode, if you only want to make a forward pass through the network, then the order doesn't matter at all, since you don't change the weights.
I hope this clarifies things a bit for you :-)
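To illustrate the .fit() option mentioned above, here is a small sketch with a throwaway Keras model and random data (both are placeholders, not anything from the question):

    import numpy as np
    from tensorflow import keras

    x = np.random.rand(1000, 20).astype("float32")       # stand-in features
    y = np.random.randint(0, 2, size=(1000, 1))          # stand-in labels

    model = keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(20,)),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    # shuffle=True (the default) reorders the samples before each epoch,
    # so the network never sees one class block followed by another.
    model.fit(x, y, epochs=3, batch_size=32, shuffle=True)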
Nassim's answer is believed to hold for small networks and datasets, but recent articles (e.g. this one) suggest that for deeper networks (more than 4 layers), not shuffling your dataset might act as a kind of regularization, since poor minima are expected to be deep but narrow, while good minima are expected to be wide and hard to leave.
At inference time, the only way this might hurt is if you use the training distribution of your data in a tightly coupled manner, e.g. running BatchNormalization or Dropout as in the training phase (this is sometimes done for some kinds of Bayesian deep learning).
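For the Dropout-at-inference case mentioned in passing (Monte Carlo dropout, as used in some Bayesian deep learning setups), a hedged sketch with a made-up model might look like this:

    import numpy as np
    from tensorflow import keras

    inputs = keras.Input(shape=(20,))
    h = keras.layers.Dense(64, activation="relu")(inputs)
    # Passing training=True here keeps dropout active even at prediction time.
    h = keras.layers.Dropout(0.5)(h, training=True)
    outputs = keras.layers.Dense(1)(h)
    mc_model = keras.Model(inputs, outputs)

    x = np.random.rand(5, 20).astype("float32")   # stand-in inputs
    # Repeated forward passes now differ; their spread gives an uncertainty estimate.
    preds = np.stack([mc_model(x).numpy() for _ in range(20)])
    print(preds.mean(axis=0).ravel())
    print(preds.std(axis=0).ravel())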

Time Series Prediction using LSTM

I am using Jason Brownlee's tutorial (mirror) to apply an LSTM network to some syslog/network log data. He's a master!
I have syslog data (a specific event) for each day of the last year, so I am using an LSTM network for time series analysis. I am using the LSTM from the Keras deep learning library.
As I understand -
About Batch_size
A batch of data is a fixed-sized number of rows from the training dataset that defines how many patterns to process before updating the weights of the network. Based on the batch_size the model takes random samples from the data for the analysis. For time series this is not desirable, hence the batch_size should always be 1.
About setting the shuffle value
By default, the samples within an epoch are shuffled prior to being exposed to the network. This is undesirable for the LSTM because we want the network to build up state as it learns across the sequence of observations. We can disable the shuffling of samples by setting "shuffle" to "False".
Scenario 1 - Following the above two rules/guidelines, I ran several trials with different numbers of neurons, epoch sizes, and numbers of layers, and got better results than the baseline model (persistence model).
Scenario 2 - Without following the above guidelines/rules, I ran several trials with different numbers of neurons, epoch sizes, and numbers of layers, and got even better results than in Scenario 1.
Query - the two guidelines above (disabling shuffling and setting batch_size to 1 for time series): are they a rule or just a guideline?
Reading the tutorial, it seems logical that time series data should not be shuffled, since we do not want to change the sequence of the data, but for my data the results are better if I let it be shuffled.
In the end, I think what matters is how I get better predictions from my runs.
I think I should set "theory" aside in favour of concrete evidence, such as metrics, elbows, RMSEs, etc.
Kindly enlighten.
It depends a lot on the size of your data and on the number of variables. In my experience, decreasing the batch size gives better results, since the updates are more frequent, but on huge datasets it is very expensive, so you have to play with this trade-off (training time vs. result).
As for shuffling: it may be that your data is not that correlated with the past; if that is the case, shuffling helps the network learn and generalize (for example when the data is ordered by label). Check reason 7 of the following: 37 reasons your neural network is not working.
Regarding batch size: the larger it is, the harder it is to generalize (reason 11). When the data clearly depends on the past, you can declare your LSTM in Keras as stateful, which means "that the states computed for the samples in one batch will be reused as initial states for the samples in the next batch", according to the Keras API. Hope this helps.
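A hedged sketch of that stateful setup, with made-up window shapes and data (your real series and sizes will differ):

    import numpy as np
    from tensorflow import keras

    batch_size, timesteps, features = 1, 10, 1
    x = np.random.rand(200, timesteps, features).astype("float32")  # stand-in windows
    y = np.random.rand(200, 1).astype("float32")

    model = keras.Sequential([
        keras.layers.LSTM(32, stateful=True,
                          batch_input_shape=(batch_size, timesteps, features)),
        keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

    for epoch in range(5):
        # shuffle=False preserves the order of the windows within each epoch,
        # so the carried state stays meaningful.
        model.fit(x, y, epochs=1, batch_size=batch_size, shuffle=False, verbose=0)
        model.reset_states()   # clear the carried state between epochs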

Effect of Data Parallelism on Training Result

I'm currently trying to implement multi-GPU training of my network in TensorFlow. One solution would be to run one model per GPU, each with its own data batches, and combine their weights after each training iteration; in other words, "data parallelism".
So for example if I use 2 GPUs, train with them in parallel, and combine their weights afterwards, then shouldn't the resulting weights be different compared to training with those two data batches in sequence on one GPU? Because both GPUs have the same input weights, whereas the single GPU has modified weights for the second batch.
Is this difference just marginal, and therefore not relevant for the end result after many iterations?
The order of the batches fed into training makes some difference, but the difference may be small if you have a large number of batches. Each batch pulls the variables in the model a bit towards the minimum of the loss. A different order may make the path towards the minimum a bit different, but as long as the loss is decreasing, your model is training and its evaluation keeps getting better.
Sometimes, to avoid the same batches always "pulling" the model in the same direction, and to avoid the model becoming good only on some of the input data, the input for each model replica is randomly shuffled before being fed into the training program.
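As a concrete (if hedged) illustration, TensorFlow's tf.distribute.MirroredStrategy implements a closely related form of data parallelism in which per-replica gradients (rather than the weights themselves) are combined each step; the toy model and random data below are placeholders:

    import numpy as np
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()      # one replica per visible GPU
    with strategy.scope():
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
            tf.keras.layers.Dense(1),
        ])
        model.compile(optimizer="adam", loss="mse")

    x = np.random.rand(1024, 10).astype("float32")   # stand-in data
    y = np.random.rand(1024, 1).astype("float32")

    # Each global batch of 64 is split across the replicas; every replica computes
    # gradients on its shard, the gradients are reduced, and one update is applied,
    # so all replicas keep identical weights.
    model.fit(x, y, epochs=2, batch_size=64)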

Meaning of an Epoch in Neural Networks Training

While I was reading about how to build an ANN in pybrain, they say:
Train the network for some epochs. Usually you would set something like 5 here,
trainer.trainEpochs( 1 )
I looked up what that means, and I concluded that we use an epoch of data to update the weights: if I choose to train the data with 5 epochs as pybrain advises, the dataset will be divided into 5 subsets, and the weights will be updated at most 5 times.
I'm familiar with online training, where the weights are updated after each data sample or feature vector. My question is: how can I be sure that 5 epochs will be enough to build a model and set the weights properly? What is the advantage of this approach over online training? Also, the term "epoch" is used in online training; does it mean one feature vector?
One epoch consists of one full training cycle on the training set. Once every sample in the set is seen, you start again - marking the beginning of the 2nd epoch.
This has nothing to do with batch or online training per se. Batch means that you update once at the end of the epoch (after every sample has been seen, i.e. #epoch updates), while online means that you update after each sample (#samples * #epoch updates).
You can't be sure if 5 epochs or 500 is enough for convergence since it will vary from data to data. You can stop training when the error converges or gets lower than a certain threshold. This also goes into the territory of preventing overfitting. You can read up on early stopping and cross-validation regarding that.
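To make the update-count difference concrete, here is a toy NumPy sketch (linear model, squared loss, invented data) contrasting one full-batch update per epoch with one online update per sample:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                      # stand-in features
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    def grad(w, xb, yb):
        return 2 * xb.T @ (xb @ w - yb) / len(yb)      # gradient of mean squared error

    w_batch, w_online = np.zeros(3), np.zeros(3)
    lr, epochs = 0.05, 5

    for epoch in range(epochs):
        # Full batch: one update at the end of the epoch -> `epochs` updates in total.
        w_batch -= lr * grad(w_batch, X, y)

        # Online: one update per sample -> len(X) * epochs updates in total.
        for i in rng.permutation(len(X)):
            w_online -= lr * grad(w_online, X[i:i+1], y[i:i+1])

    print(w_batch, w_online)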
Sorry for reactivating this thread. I'm new to neural nets and I'm investigating the impact of 'mini-batch' training.
So far, as I understand it, an epoch (as runDOSrun says) is one full pass through everything in the TrainingSet (not the DataSet, because DataSet = TrainingSet + ValidationSet). In mini-batch training, you can subdivide the TrainingSet into small sets and update the weights inside an epoch. 'Hopefully' this would make the network 'converge' faster.
Some definitions of neural networks are outdated and, I guess, must be redefined.
The number of epochs is a hyperparameter that defines the number of times the learning algorithm will work through the entire training dataset. One epoch means that each sample in the training dataset has had an opportunity to update the internal model parameters.
