Are epochs and training steps the same thing?

import numpy as np
import tensorflow as tf

features = [tf.contrib.layers.real_valued_column("x", dimension=1)]
estimator = tf.contrib.learn.LinearRegressor(feature_columns=features)
x = np.array([1., 2., 3., 4.])
y = np.array([0., -1., -2., -3.])
input_fn = tf.contrib.learn.io.numpy_input_fn({"x": x}, y, batch_size=4,
                                              num_epochs=1000)
estimator.fit(input_fn=input_fn, steps=1000)
For example, do "steps=1000" and "num_epochs=1000" mean exactly the same thing? If yes, why is the value duplicated? If not, can I set these two parameters differently?

Here is the basic difference between an epoch and a step in any machine learning algorithm or framework:
Once the framework has gone through every data point in its training set to update its parameters, that is one epoch. A step is one update of the parameters (e.g. the weights, if you are training a neural network). This update can be computed from a single data point, from a mini-batch of data points (e.g. 100 points drawn randomly, with or without replacement), or from all the points. If all your data points are used in one step (one parameter update), that step is also one epoch, i.e. one step = one epoch.
Typically, frameworks use mini-batching: in one step they batch 100 (or some other number of) data points together and do one update. If you have 1 million data points (10^6) in total and a batch of 100, then one epoch consists of 10,000 steps.
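A quick sanity check of that arithmetic in Python (the numbers mirror the example above):

dataset_size = 1_000_000  # 10**6 data points, as above
batch_size = 100          # data points per step (one parameter update)

steps_per_epoch = dataset_size // batch_size
print(steps_per_epoch)    # 10000 steps make up one epoch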

No, they are not the same. As with most (all?) frameworks, TensorFlow has some commands that specify epochs and some that work on steps, a.k.a. iterations. A step is one batch, whose size is governed by the batch size specified in the model's input.
For instance, if you are using AlexNet with its default batch size of 256 and the ILSVRC 2012 data set of roughly 1.28M images, then you have about 5,000 steps per epoch (1,280,000 / 256).
Batch size is the number of images processed in parallel. If there are 1.28M images in the data set, then you have to process 1.28M images per epoch: that's the definition of an epoch -- process each input once. Is the arithmetic clear now?

Related

Validation and training loss per batch and epoch

I am using PyTorch to run some deep learning models. I currently keep track of training and validation loss per epoch, which is pretty standard. However, what is the best way to keep track of training and validation loss per batch/iteration?
For training loss, I could just keep a list of the loss after each training step. But validation loss is calculated after a whole epoch, so I'm not sure how to get a validation loss per batch. The only thing I can think of is to run the whole validation step after each training batch and keep track of those results, but that seems like overkill and a lot of computation.
For example, the training is like this:
for epoch in range(2):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
And for validation loss:
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

        # validation loss
        batch_loss = error(outputs.float(), labels.long()).item()
        loss_test += batch_loss

loss_test /= len(testloader)
The validation loss/test part is done per epoch. I’m looking for a way to get the validation loss per batch, which is my point above.
Any tips?
Well, you're right: the way to do it is to "run the whole validation step after each training batch and keep track of those", and, as you suspected, that is time-consuming and overkill. However, if it's something you really need, there is a way to make it cheaper. Say you have 1000 batches in your data. Instead of running the validation step for every one of them (you'd have to do it 1000 times!), run it for a small subset of those batches, say 50 or 100 (choose whatever is feasible). With a little statistics, the estimate from 50-100 batches becomes very close to the value over all 1000 batches, provided you introduce some randomness into the batch selection.
This means you randomly select, say, 100 of your 1000 batches and run the validation step only for those.
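A minimal PyTorch-style sketch of that idea; net, criterion, and testloader follow the question's code, and the subset size of 100 batches is an arbitrary choice:

import random
import torch

def partial_val_loss(net, criterion, testloader, num_batches=100):
    # Cache the validation batches once so we can sample from them.
    # (Fine for small sets; for large ones, sample dataset indices instead.)
    val_batches = list(testloader)
    chosen = random.sample(val_batches, min(num_batches, len(val_batches)))

    net.eval()
    total = 0.0
    with torch.no_grad():
        for images, labels in chosen:
            total += criterion(net(images), labels).item()
    net.train()
    return total / len(chosen)

Call this after every training batch (or every few batches) and log the returned estimate.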
An epoch is one pass of the model over the entire training set, which is generally divided into batches and usually shuffled. The validation set, on the other hand, is used to tune the hyper-parameters of your training and to find out how your model behaves on new data. In that respect, to me, evaluating halfway through an epoch doesn't make much sense: whatever the performance on the evaluation set at that point, what can you do about it? Since you don't know which data the model has seen in the first half of the epoch, there's no way to take advantage of 'the first half being better'. And remember, your data will likely be shuffled into batches.
Therefore, I would stick with the classic approach: train on the entire set and then, and only then, evaluate on another set. In some cases you won't even allow yourself to evaluate once per epoch because of the computation time; instead you would evaluate every n epochs. It all depends on your dataset size, your sampling from that dataset, the batch size, and the computation cost.
For the training loss, you can keep track of its value per update step as well as per epoch. This gives you much finer control over whether your model is learning, independently of the validation phase.
Edit - As an alternative to running the entire evaluation set per train batch, you could do the following: shuffle your validation set and use the same batch size as for your train set. Then:
len(trainset)//batch_size is the number of updates per epoch
len(validset)//batch_size is the number of allowed evaluations per epoch
every len(trainset)//len(validset) train updates you can evaluate on one batch
This gives you feedback len(validset)//batch_size times per epoch, using each validation batch exactly once. If len(validset) = 0.1 * len(trainset), that is one partial evaluation every 10 training updates.
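A rough sketch of that interleaving, assuming net, criterion, optimizer, and the loaders from the question (with the validation loader sharing the train batch size):

import itertools
import torch

eval_every = max(1, len(trainloader) // len(validloader))  # updates between evals
valid_iter = itertools.cycle(validloader)  # restart the validation set when exhausted

for step, (inputs, labels) in enumerate(trainloader, 1):
    optimizer.zero_grad()
    loss = criterion(net(inputs), labels)
    loss.backward()
    optimizer.step()

    if step % eval_every == 0:
        net.eval()
        with torch.no_grad():
            v_inputs, v_labels = next(valid_iter)  # evaluate on exactly one batch
            val_loss = criterion(net(v_inputs), v_labels).item()
        net.train()
        print(f"step {step}: train {loss.item():.4f}  val {val_loss:.4f}")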

Why is max_batches independent of the size of the dataset?

I am wondering why the number of images has no influence on the number of iterations when training. Here is an example to make my question clearer:
Suppose we have 6400 images for a training run to recognize 4 classes. Based on AlexeyAB's explanations, we keep batch = 64 and subdivisions = 16, and write max_batches = 8000, since max_batches is determined by #classes x 2000.
Since we have 6400 images, a complete epoch requires 100 iterations. Therefore this training ends after 80 epochs.
Now, suppose that we have 12800 images. In that case, an epoch needs 200 iterations. Therefore the training ends after 40 epochs.
Since an epoch refers to one cycle through the full training dataset, I'm wondering why we don't increase the number of iterations when our dataset increases, in order to keep the number of epochs constant.
Said differently, I'm asking for a simple explanation as to why the number of epochs seems to be irrelevant to the quality of the training. I feel that it's a consequence of YOLO's construction, but I am not knowledgeable enough to understand how.
Why does the number of images have no influence on the number of iterations when training?
In darknet YOLO, the number of iterations depends on the max_batches parameter in the .cfg file. After running for max_batches iterations, darknet saves the final weights.
In each epoch, all the data samples are passed through the network, so if you have many images, the training time for one epoch will be higher; you can test that by increasing the number of images in your data.
The subdivisions parameter accounts for the number of mini-batches. Let's say you have 100 images in your dataset, your batch size is 10, subdivisions is 2, and max_batches is 20.
Then in each iteration, 10 images are passed to the network in two mini-batches (each holding 5 samples), and once you have done 20 batches (20 * 10 data samples), the training is complete. (The details can differ a little; I'm using a slightly modified darknet from the original author, pjreddie.)
The instructions have since been updated: max_batches is equal to classes*2000, but not less than the number of training images and not less than 6000. Please find it at this link.
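To make the arithmetic from this thread explicit, a minimal sketch in plain Python (not darknet code):

def epochs_completed(num_images, max_batches=8000, batch=64):
    # iterations (weight updates) needed for one full pass over the data
    iterations_per_epoch = num_images // batch
    return max_batches / iterations_per_epoch

print(epochs_completed(6400))   # 80.0 epochs (100 iterations per epoch)
print(epochs_completed(12800))  # 40.0 epochs (200 iterations per epoch)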

Tensorflow RNN example limited to fixed batch size?

When looking at the RNN example in the TensorFlow documentation, I'm having an issue with how the initial state is constructed. At graph build time we limit the graph to handle input of only one batch size. This is an issue for me since I want to be able to feed in a single example and get a prediction for that single example.
The part of the code that restricts this is:
initial_state = state = tf.zeros([batch_size, lstm.state_size])
So my question is: how can I change the example to use a variable batch size, so that I can train with batches and then feed a single example for predictions?
This is how I'm doing it. You can pass the batch_size as a placeholder, like this:
batch_size = tf.placeholder(tf.int32)
init_state = cell.zero_state(batch_size, tf.float32)
where cell is one of the RNN cells (BasicLSTMCell, BasicGRUCell, MultiRNNCell, etc.). However, if you're preserving the state across multiple batches, this won't work, since the state's size has to be constant.
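For illustration, a minimal TF 1.x-style sketch of feeding such a placeholder; the cell type and sizes are arbitrary choices:

import tensorflow as tf

batch_size = tf.placeholder(tf.int32, [], name="batch_size")
cell = tf.nn.rnn_cell.BasicLSTMCell(num_units=128)
init_state = cell.zero_state(batch_size, tf.float32)  # shape follows the fed value

with tf.Session() as sess:
    # the same graph serves both cases: full batches for training ...
    train_state = sess.run(init_state, feed_dict={batch_size: 64})
    # ... and a single example for prediction
    predict_state = sess.run(init_state, feed_dict={batch_size: 1})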
The TensorFlow text generation tutorial explains how to do this (now for TF 2.0). It seems the batch_size becomes part of the built model, so you have to rebuild the model and reload it from the saved weights with the new batch size:
https://www.tensorflow.org/tutorials/text/text_generation#restore_the_latest_checkpoint
To keep this prediction step simple, use a batch size of 1. Because of the way the RNN state is passed from timestep to timestep, the model only accepts a fixed batch size once built. To run the model with a different batch_size, we need to rebuild the model and restore the weights from the checkpoint.
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
model.build(tf.TensorShape([1, None]))
model.summary()
I don't know for sure why this is required, but I always assumed it's because batching for recurrent layers requires managing multiple parallel hidden-state pipelines, so the model preallocates them.

What is the difference between steps and epochs in TensorFlow?

In most models there is a steps parameter indicating the number of steps to run over the data. Yet in most practical usage, we also execute the fit function for N epochs.
What is the difference between running 1000 steps with 1 epoch and running 100 steps with 10 epochs? Which one is better in practice? Does any logic change between consecutive epochs, e.g. data shuffling?
A training step is one gradient update. In one step batch_size examples are processed.
An epoch consists of one full cycle through the training data. This is usually many steps. As an example, if you have 2,000 images and use a batch size of 10 an epoch consists of:
2,000 images / (10 images / step) = 200 steps.
If you choose your training images randomly (and independently) in each step, you normally do not call it an epoch. [This is where my answer differs from the previous one. Also see my comment.]
An epoch usually means one iteration over all of the training data. For instance if you have 20,000 images and a batch size of 100 then the epoch should contain 20,000 / 100 = 200 steps. However I usually just set a fixed number of steps like 1000 per epoch even though I have a much larger data set. At the end of the epoch I check the average cost and if it improved I save a checkpoint. There is no difference between steps from one epoch to another. I just treat them as checkpoints.
People often shuffle the data set between epochs. I prefer to use the random.sample function to choose the data to process in my epochs. Say I want to do 1000 steps with a batch size of 32; then I just randomly pick 32,000 samples from the pool of training data.
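A minimal sketch of that sampling scheme; train_data is assumed to be an in-memory list of examples:

import random

steps = 1000
batch_size = 32

# one "epoch" = 1000 randomly drawn batches (32,000 samples, without replacement)
epoch_pool = random.sample(train_data, steps * batch_size)
for step in range(steps):
    batch = epoch_pool[step * batch_size:(step + 1) * batch_size]
    # ... run one gradient update on `batch` ...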
As I am currently experimenting with the tf.estimator API, I would like to add my own findings here, too. I don't know yet whether the usage of the steps and epochs parameters is consistent throughout TensorFlow, therefore I am just relating to tf.estimator (specifically tf.estimator.LinearRegressor) for now.
Training steps defined by num_epochs: steps not explicitly defined
estimator = tf.estimator.LinearRegressor(feature_columns=ft_cols)
train_input = tf.estimator.inputs.numpy_input_fn({'x':x_train},y_train,batch_size=4,num_epochs=1,shuffle=True)
estimator.train(input_fn=train_input)
Comment: I have set num_epochs=1 for the training input, and the doc entry for numpy_input_fn tells me "num_epochs: Integer, number of epochs to iterate over data. If None will run forever.". With num_epochs=1 in the above example the training runs exactly x_train.size / batch_size steps (in my case this was 175,000 steps, as x_train had a size of 700,000 and batch_size was 4).
Training steps defined by num_epochs: steps explicitly defined higher than number of steps implicitly defined by num_epochs=1
estimator = tf.estimator.LinearRegressor(feature_columns=ft_cols)
train_input = tf.estimator.inputs.numpy_input_fn({'x':x_train},y_train,batch_size=4,num_epochs=1,shuffle=True)
estimator.train(input_fn=train_input, steps=200000)
Comment: num_epochs=1 in my case means 175,000 steps (x_train.size / batch_size with x_train.size = 700,000 and batch_size = 4), and this is exactly the number of steps estimator.train performs, even though the steps parameter was set to 200,000 in estimator.train(input_fn=train_input, steps=200000).
Training steps defined by steps
estimator = tf.estimator.LinearRegressor(feature_columns=ft_cols)
train_input = tf.estimator.inputs.numpy_input_fn({'x':x_train},y_train,batch_size=4,num_epochs=1,shuffle=True)
estimator.train(input_fn=train_input, steps=1000)
Comment: Although I have set num_epochs=1 when calling numpy_input_fn, the training stops after 1000 steps. This is because steps=1000 in estimator.train(input_fn=train_input, steps=1000) cuts training short before the num_epochs=1 from tf.estimator.inputs.numpy_input_fn({'x':x_train},y_train,batch_size=4,num_epochs=1,shuffle=True) is exhausted.
Conclusion:
Whatever values you set for the num_epochs parameter of tf.estimator.inputs.numpy_input_fn and the steps parameter of estimator.train, the lower bound determines the number of steps that will actually be run.
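That conclusion can be condensed into a small (hypothetical) helper:

def actual_steps(train_size, batch_size, num_epochs, steps=None):
    # steps implied by the input function's num_epochs
    steps_from_epochs = (train_size // batch_size) * num_epochs
    # estimator.train stops at whichever bound is reached first
    return steps_from_epochs if steps is None else min(steps_from_epochs, steps)

print(actual_steps(700000, 4, num_epochs=1))                # 175000
print(actual_steps(700000, 4, num_epochs=1, steps=200000))  # 175000
print(actual_steps(700000, 4, num_epochs=1, steps=1000))    # 1000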
In easy words:
Epoch: one pass over the entire dataset.
Steps: in TensorFlow, one step is one gradient update; the total number of steps equals the number of epochs multiplied by the number of examples, divided by the batch size:
steps = (epochs * examples) / batch_size
For instance, with epochs = 100, examples = 1000 and batch_size = 1000:
steps = 100
Epoch: a training epoch represents one complete use of all the training data for gradient calculation and optimization (training the model).
Step: A training step means using one batch size of training data to train the model.
Number of training steps per epoch: total_number_of_training_examples / batch_size.
Total number of training steps: number_of_epochs x Number of training steps per epoch.
According to Google's Machine Learning Glossary, an epoch is defined as
"A full training pass over the entire dataset such that each example has been seen once. Thus, an epoch represents N/batch_size training iterations, where N is the total number of examples."
If you train a model for 10 epochs with batch size 6, given 12 samples in total, that means:
the model will see the whole dataset in 2 iterations (12 / 6 = 2), i.e. in a single epoch.
overall, the model will run 2 x 10 = 20 iterations (iterations-per-epoch x number-of-epochs).
the loss is re-evaluated and the model parameters are updated after each iteration!
Since there's no accepted answer yet: by default an epoch runs over all your training data. In this case you have n steps, with n = training_length / batch_size.
If your training data is too big, you can decide to limit the number of steps during an epoch [https://www.tensorflow.org/tutorials/structured_data/time_series].
When the number of steps reaches the limit you've set, the process starts over, beginning the next epoch.
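In Keras this corresponds to the steps_per_epoch argument of model.fit; a minimal sketch, assuming model is a compiled Keras model and dataset is an existing tf.data.Dataset of (features, labels):

# .repeat() lets each "epoch" resume where the previous one stopped
train_ds = dataset.batch(32).repeat()

model.fit(train_ds,
          steps_per_epoch=1000,  # cap each epoch at 1000 steps
          epochs=10)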
When working in TF, your data is usually transformed first into a list of batches that will be fed to the model for training. At each step you process one batch.
As to whether it’s better to set 1000 steps for 1 epoch or 100 steps with 10 epochs I don’t know if there’s a straight answer.
But here are results from training a CNN with both approaches, using the TensorFlow time-series data tutorial:
In this case, both approaches lead to very similar predictions; only the training profiles differ. (Training-profile plots not reproduced here: one run with steps = 20 / epochs = 100, the other with steps = 200 / epochs = 10.)
Divide the length of x_train by the batch size with
steps_per_epoch = x_train.shape[0] // batch_size
We split the training set into many batches. When we run the algorithm, one full pass over the training set is one epoch. An epoch is composed of many iterations (or batches).
Iterations: the number of batches needed to complete one Epoch.
Batch Size: The number of training samples used in one iteration.
Epoch: one full cycle through the training dataset. A cycle is composed of many iterations.
Number of Steps per Epoch = (Total Number of Training Samples) / (Batch Size)
Example
Training Set = 2,000 images
Batch Size = 10
Number of Steps per Epoch = 2,000 / 10 = 200 steps
Hope this helps.

Parameter tuning/model selection using resampling

I have been trying to get into more detail on resampling methods and implemented them on a small data set of 1000 rows. The data was split into an 800-row training set and a 200-row validation set. I used k-fold cross-validation and repeated k-fold cross-validation to train a KNN on the training set. Based on my understanding, I have made some interpretations of the results; however, I have certain doubts about them (see questions below):
Results :
10 Fold Cv
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 720, 720, 720, 720, 720, 720, ...
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.6600 0.07010791
7 0.6775 0.09432414
9 0.6800 0.07054371
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
Repeated 10 fold with 10 repeats
Resampling results across tuning parameters:
k Accuracy Kappa
5 0.670250 0.10436607
7 0.676875 0.09288219
9 0.683125 0.08062622
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold, 1000 repeats
k Accuracy Kappa
5 0.6680438 0.09473128
7 0.6753375 0.08810406
9 0.6831800 0.07907891
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 9.
10 fold with 2000 repeats
k Accuracy Kappa
5 0.6677981 0.09467347
7 0.6750369 0.08713170
9 0.6826894 0.07772184
Doubts:
When selecting the parameter, k = 9 is the optimal value for the highest accuracy. However, I don't understand how to take Kappa into consideration when finally choosing the parameter value.
The number of repeats has to be increased until we get a stabilised result; the accuracy changes when the repeats are increased from 10 to 1000. However, the results are similar for 1000 and 2000 repeats. Would it be right to consider the results at 1000/2000 repeats a stabilised performance estimate?
Is there any rule of thumb for the number of repeats?
Finally, should I now train the model on my complete training data (800 rows) and test the accuracy on the validation set?
Accuracy and Kappa are just different classification performance metrics. In a nutshell, the difference is that Accuracy does not take possible class imbalance into account, while Kappa does. Therefore, with imbalanced classes, you might be better off using Kappa. With R caret you can do so via the train::metric parameter.
You would see a similar effect of slightly different performance results when running, e.g., the 10-fold CV with 10 repeats multiple times - you will just get slightly different results each time as well. Something you should look out for is the variance of classification performance over your partitions and repeats. If you obtain a small variance, you can conclude that by training on all your data you will likely obtain a model that gives similar (hence stable) results on new data. But if you obtain a huge variance, you might instead, just by chance (being lucky or unlucky), obtain a model that gives you either rather good or rather bad performance on new data. BTW: the variance of prediction performance is something e.g. R caret::train will give you automatically, hence I'd advise using it.
See above: look at the variance and increase the repeats until you can, e.g., repeat the whole process and obtain a similar average performance and a similar variance of performance.
Yes, CV and resampling methods exist to give you information about how well your model will perform on new data. So after performing CV and resampling and obtaining this information, you will usually use all of your data (this includes both the train and the test partition!) to train the final model that you use in your application scenario.
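The results above come from R caret, but the same experiment can be sketched in Python with scikit-learn (a rough equivalent of repeated 10-fold CV over k, not the original code):

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

# stand-in for the 800-row training set from the question
X, y = make_classification(n_samples=800, random_state=0)

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [5, 7, 9]},
                    scoring="accuracy", cv=cv)
grid.fit(X, y)

# mean and spread of accuracy across all 100 resamples, per candidate k
results = grid.cv_results_
for params, mean, std in zip(results["params"],
                             results["mean_test_score"],
                             results["std_test_score"]):
    print(params, f"accuracy = {mean:.4f} +/- {std:.4f}")

The std_test_score column is the variance-of-performance signal discussed above.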
