I'm trying to train a small CNN from scratch to classify images of 10 different animal species. The images have different dimensions, but I'd say around 300x300. Anyway, every image is resized to 224x224 before going into the model.
Here is the network I'm training:
# Convolution 1
self.cnn1 = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=0)
self.relu1 = nn.ReLU()
# Max pool 1
self.maxpool1 = nn.MaxPool2d(kernel_size=2)
# Convolution 2
self.cnn2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=0)
self.relu2 = nn.ReLU()
# Max pool 2
self.maxpool2 = nn.MaxPool2d(kernel_size=2)
# Fully connected 1
self.fc1 = nn.Linear(32 * 54 * 54, 10)
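As a sanity check on the 32 * 54 * 54 input size of fc1, the spatial size after each layer can be worked out by hand. The sketch below assumes the 224x224 input mentioned above and the default MaxPool2d behaviour (stride equal to the kernel size, floor rounding); the helper names conv_out and pool_out are hypothetical.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    # Output size of a convolution along one spatial dimension.
    return (size + 2 * padding - kernel_size) // stride + 1


def pool_out(size, kernel_size):
    # Max pooling with stride == kernel_size and floor rounding (assumed defaults).
    return (size - kernel_size) // kernel_size + 1


size = 224
size = conv_out(size, kernel_size=3)  # 222 after cnn1
size = pool_out(size, kernel_size=2)  # 111 after maxpool1
size = conv_out(size, kernel_size=3)  # 109 after cnn2
size = pool_out(size, kernel_size=2)  # 54 after maxpool2

flat_features = 32 * size * size
print(size, flat_features)  # 54 93312
```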
I'm using an SGD optimizer with a fixed learning rate of 0.005 and a weight decay of 0.01. The loss is cross entropy.
The accuracy of the model is good (around 99% after the 43rd epoch). However:
in some epochs I get 'nan' as the training loss
in some other epochs the accuracy drops significantly (sometimes the two happen in the same epoch). However, in the next epoch the accuracy comes back to a normal level.
If I understood correctly, a nan in the training loss is most often caused by gradient values getting too small (underflow) or too big (overflow). Could this be the case?
Should I try increasing the weight decay to 0.05? Or should I do gradient clipping to avoid exploding gradients? If so, what would be a reasonable bound?
Still, I don't understand the second issue.
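For reference, the gradient clipping mentioned above usually means rescaling all gradients when their combined norm exceeds a bound. The following is only a plain-Python illustration with made-up gradient values and a hypothetical helper name; in PyTorch the same idea is available as torch.nn.utils.clip_grad_norm_, called between loss.backward() and optimizer.step(). The bound of 1.0 is a placeholder, not a recommendation.

```python
import math


def clip_by_global_norm(grads, max_norm):
    # grads: list of per-parameter gradients, each a flat list of floats.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total_norm <= max_norm:
        # Already within the bound: leave gradients unchanged.
        return grads, total_norm
    scale = max_norm / total_norm
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm


grads = [[3.0, 4.0], [12.0]]  # made-up values, global norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(norm_before, norm_after)
```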
I have a Caffe prototxt as follows:
stepsize: 20000
iter_size: 4
batch_size: 10
gamma: 0.1
The dataset has 40,000 images. This means that after 20000 iterations the learning rate is divided by 10. In PyTorch, I want to compute the number of epochs that gives the same learning-rate behavior as in Caffe. After how many epochs should I decrease the learning rate by a factor of 10 (note that iter_size=4 and batch_size=10)? Thanks
Ref: Epoch vs Iteration when training neural networks
My answer: Example: if you have 40000 training examples and the batch size is 10, then it takes 40000/10 = 4000 iterations to complete 1 epoch. Hence, 20000 iterations before reducing the learning rate in Caffe would be the same as 5 epochs in PyTorch.
You did not take iter_size: 4 into account: when a batch is too large to fit into memory, you can "split" it into several iterations.
In your example, the actual batch size is batch_size x iter_size = 10 * 4 = 40. Therefore, an epoch takes only 1,000 iterations, and you need to decrease the learning rate after 20 epochs.
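The arithmetic above can be written out as a small helper. This is a sketch with a hypothetical function name; it only restates the numbers from the prototxt. On the PyTorch side, the result would typically be used as the step_size of torch.optim.lr_scheduler.StepLR (with gamma=0.1), assuming the scheduler is stepped once per epoch.

```python
def caffe_stepsize_to_epochs(stepsize, batch_size, iter_size, dataset_size):
    # Caffe accumulates gradients over iter_size forward/backward passes,
    # so one solver iteration consumes batch_size * iter_size images.
    effective_batch = batch_size * iter_size
    iters_per_epoch = dataset_size / effective_batch
    return stepsize / iters_per_epoch


epochs = caffe_stepsize_to_epochs(
    stepsize=20000, batch_size=10, iter_size=4, dataset_size=40000
)
print(epochs)  # 20.0
```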
I am trying to predict the hygrothermal response of a wall, given the interior and exterior climate. Based on literature research, I believe this should be possible with RNN but I have not been able to get good accuracy.
The dataset has 12 input features (time-series of exterior and interior climate data) and 10 output features (time-series of hygrothermal response), both containing hourly values for 10 years. This data was created with hygrothermal simulation software, so there is no missing data.
Dataset features:
Dataset targets:
Unlike most time-series prediction problems, I want to predict the response at each time-step for the full length of the input time-series, rather than the subsequent values of a time-series (e.g. financial time-series prediction). I have not been able to find similar prediction problems (in similar or other fields), so if you know of one, references are very welcome.
I think this should be possible with RNN, so I am currently using LSTM from Keras. Before training, I preprocess my data the following way:
Discard the first year of data, as the first time steps of the hygrothermal response of the wall are influenced by the initial temperature and relative humidity.
Split into training and testing set. Training set contains the first 8 years of data, the test set contains the remaining 2 years.
Normalise training set (zero mean, unit variance) using StandardScaler from Sklearn. Normalise test set analogously using mean and variance from training set.
This results in: X_train.shape = (1, 61320, 12), y_train.shape = (1, 61320, 10), X_test.shape = (1, 17520, 12), y_test.shape = (1, 17520, 10)
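The normalisation step can be sketched in plain Python for a single feature column. This is not the actual preprocessing code: the values are made up and the helper names are hypothetical. It only illustrates that the mean and standard deviation are estimated on the training set and then reused unchanged on the test set (StandardScaler uses the population standard deviation, which is what pstdev computes).

```python
import statistics


def fit_standardizer(train_column):
    # Estimate statistics on the training data only.
    mean = statistics.fmean(train_column)
    std = statistics.pstdev(train_column)  # population std (ddof=0)
    return mean, std


def standardize(column, mean, std):
    # Apply previously fitted statistics to any split.
    return [(x - mean) / std for x in column]


train = [10.0, 12.0, 14.0, 16.0]  # made-up training values
test = [13.0, 18.0]               # made-up test values

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)  # reuses the training mean/std
```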
As these are long time-series, I use stateful LSTM and cut the time-series as explained here, using the stateful_cut() function. I only have 1 sample, so batch_size is 1. For T_after_cut I have tried 24 and 120 (24*5); 24 appears to give better results. This results in X_train.shape = (2555, 24, 12), y_train.shape = (2555, 24, 10), X_test.shape = (730, 24, 12), y_test.shape = (730, 24, 10).
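To make the cutting step concrete, here is a plain-Python sketch of the ordering a stateful LSTM expects. It is not the linked stateful_cut() function; the name cut_for_stateful and the list-based representation are assumptions for illustration. The key point is that, with shuffle=False and batch_size equal to the number of samples, position i in each batch must continue position i of the previous batch.

```python
def cut_for_stateful(series, t_after_cut):
    # series: list of N sequences, each a list of T time steps.
    # Returns N * nb_cuts windows ordered cut-major, so that
    # windows[c * N + i] is cut c of sample i.
    total_steps = len(series[0])
    if total_steps % t_after_cut != 0:
        raise ValueError("T must be divisible by t_after_cut")
    nb_cuts = total_steps // t_after_cut
    windows = []
    for c in range(nb_cuts):
        for sample in series:
            windows.append(sample[c * t_after_cut:(c + 1) * t_after_cut])
    return windows


# One sample of 61320 hourly steps, cut into windows of 24 steps.
one_sample = [list(range(61320))]
windows = cut_for_stateful(one_sample, t_after_cut=24)
print(len(windows), len(windows[0]))  # 2555 24
```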
Next, I build and train the LSTM model as follows:
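from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed
from keras.optimizers import Adam

model = Sequential()
model.add(LSTM(128,
               batch_input_shape=(batch_size, T_after_cut, features),
               return_sequences=True,
               stateful=True,
               ))
# apply the same Dense layer to every time step
model.add(TimeDistributed(Dense(targets)))
model.compile(loss='mean_squared_error', optimizer=Adam())
model.fit(X_train, y_train, epochs=100, batch_size=batch_size, verbose=2, shuffle=False)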
Unfortunately, I don't get accurate prediction results; not even for the training set, thus the model has high bias.
The prediction results of the LSTM model for all targets
How can I improve my model? I have already tried the following:
Not discarding the first year of the dataset -> no significant difference
Differencing the input features time-series (subtract previous value from current value) -> slightly worse results
Up to four stacked LSTM layers, all with the same hyperparameters -> no significant difference in results but longer training time
Dropout layer after LSTM layer (though this is usually used to reduce variance and my model has high bias) -> slightly better results, but difference might not be statistically significant
Am I doing something wrong with the stateful LSTM? Do I need to try different RNN models? Should I preprocess the data differently?
Furthermore, training is very slow: about 4 hours for the model above. Hence I am reluctant to do an extensive hyperparameter gridsearch...
In the end, I managed to solve this the following way:
Using more samples to train instead of only 1 (I used 18 samples to train and 6 to test)
Keep the first year of data, as the output time-series for all samples have the same 'starting point' and the model needs this information to learn
Standardise both input and output features (zero mean, unit variance). I found this improved prediction accuracy and training speed
Use stateful LSTM as described here, but add reset states after epoch (see below for code). I used batch_size = 6 and T_after_cut = 1460. If T_after_cut is longer, training is slower; if T_after_cut is shorter, accuracy decreases slightly. If more samples are available, I think using a larger batch_size will be faster.
Use CuDNNLSTM instead of LSTM; this sped up training by a factor of 4!
I found that more units resulted in higher accuracy and faster convergence (shorter training time). I also found that the GRU is as accurate as the LSTM, though it converged faster for the same number of units.
Monitor validation loss during training and use early stopping
The LSTM model is built and trained as follows:
from keras.callbacks import Callback

def define_reset_states_batch(nb_cuts):
    class ResetStatesCallback(Callback):
        def __init__(self):
            super().__init__()
            self.counter = 0

        def on_batch_begin(self, batch, logs=None):
            # reset states when nb_cuts batches are completed
            if self.counter % nb_cuts == 0:
                self.model.reset_states()
            self.counter += 1

        def on_epoch_end(self, epoch, logs=None):
            # reset states after each epoch
            self.model.reset_states()
    return ResetStatesCallback
model = Sequential()
model.add(layers.CuDNNLSTM(256,
                           batch_input_shape=(batch_size, T_after_cut, features),
                           return_sequences=True,
                           stateful=True))
model.add(layers.TimeDistributed(layers.Dense(targets, activation='linear')))
optimizer = RMSprop(lr=0.002)
model.compile(loss='mean_squared_error', optimizer=optimizer)
earlyStopping = EarlyStopping(monitor='val_loss', min_delta=0.005, patience=15, verbose=1, mode='auto')
ResetStatesCallback = define_reset_states_batch(nb_cuts)
model.fit(X_dev, y_dev, epochs=n_epochs, batch_size=n_batch, verbose=1, shuffle=False,
          validation_data=(X_eval, y_eval), callbacks=[ResetStatesCallback(), earlyStopping])
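To see when the callback above resets the states, its counter logic can be simulated without Keras. The sketch below uses hypothetical numbers (nb_cuts = 4 and 12 batches per epoch, i.e. three groups of samples) and a hypothetical function name. It also shows that the counter is never reset between epochs, so the number of batches per epoch is assumed to be a multiple of nb_cuts.

```python
def reset_batches(nb_cuts, batches_per_epoch, epochs=1):
    # Mimics the counter in on_batch_begin: a reset happens every
    # nb_cuts batches, i.e. whenever a new group of samples starts.
    counter = 0
    resets = []
    for epoch in range(epochs):
        for batch in range(batches_per_epoch):
            if counter % nb_cuts == 0:
                resets.append((epoch, batch))
            counter += 1
    return resets


print(reset_batches(nb_cuts=4, batches_per_epoch=12))
# [(0, 0), (0, 4), (0, 8)]
```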
This gave me very satisfying accuracy (R² over 0.98):
This figure shows the temperature (left) and relative humidity (right) in the wall over 2 years (data not used in training), prediction in red and true output in black. The residuals show that the error is very small and that the LSTM learns to capture the long-term dependencies to predict the relative humidity.
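For readers unfamiliar with the metric, R² can be computed as sketched below. This assumes the usual coefficient of determination (1 minus the ratio of residual to total sum of squares); the values are made up for illustration and are not the results reported above.

```python
def r2_score(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot.
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]  # made-up targets
y_pred = [1.1, 1.9, 3.2, 3.9]  # made-up predictions
r2 = r2_score(y_true, y_pred)
print(round(r2, 3))  # 0.986
```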
I have two classes with 3 images each. I tried this code in Keras.
trainingDataGenerator = ImageDataGenerator()
trainGenerator = trainingDataGenerator.flow_from_directory(
    trainingDataDir,
    target_size=(28, 28),
    batch_size=1,
    seed=7,
    class_mode='binary',
)
FilterSize = (3, 3)
inputShape = (imageWidth, imageHeight, 3)
model = Sequential()
model.add(Conv2D(32, FilterSize, input_shape=inputShape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.fit_generator(
    trainGenerator,
    steps_per_epoch=3,
    epochs=epochs)
My Output:
When I train this model, I get this output:
Using TensorFlow backend.
Found 2 images belonging to 2 classes.
Epoch 1/1
3/3 [==============================] - 0s - loss: 5.3142 - acc: 0.6667
My Question:
I wonder how it determines the loss and accuracy, and on what basis (i.e. loss: 5.3142 - acc: 0.6667)? I have not given any validation images to validate the model and find the accuracy and loss. Are this loss and accuracy computed against the input images themselves?
In short, can we say something like this: "This model has an accuracy of % and a loss of % without validation images"?
The training loss and accuracy are calculated not by comparing to validation data, but by comparing your neural network's prediction for a sample x with the label y that you provide for that sample in your training set.
You initialize your neural network and (usually) set all weights to a random value with a certain deviation. After that you feed the features of your training dataset into your network, and let it "guess" the outcome aka the label that you have (if you do supervised learning like in your case).
Then your framework compares that guess with the actual label and calculates the error which it then backpropagates through your network thereby adjusting and improving all weights.
This works perfectly well without any validation data.
Validation data lets you see the quality of your model (loss, accuracy etc.) by letting the model predict on unseen data. This gives you the so-called validation loss / accuracy, and with this information you tune your hyperparameters.
In a last step you use your test data to evaluate the final quality of your training.
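To illustrate that the training metrics need nothing but the training labels, here is a plain-Python sketch of binary cross-entropy and accuracy. The predicted probabilities and labels are made up, the helper names are hypothetical, and the clipping constant eps is an assumption; this is not Keras's internal code.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    # Mean negative log-likelihood of the true labels.
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    # A prediction counts as class 1 when its probability exceeds the threshold.
    correct = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) == (y == 1))
    return correct / len(y_true)


labels = [1, 0, 1]       # training labels (made up)
probs = [0.9, 0.2, 0.3]  # network outputs for the same samples (made up)

loss = binary_crossentropy(labels, probs)
acc = accuracy(labels, probs)
print(loss, acc)
```

Both numbers are computed from the training samples and their labels alone; validation data never enters the calculation.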
How do you compute the training accuracy for SGD? Do you compute it using the batch data you trained your network with? Or using the entire dataset? (for each batch optimization iteration)
I tried computing the training accuracy for each iteration using the batch data I trained my network with. And it almost always gives me 100% training accuracy (sometimes 100%, 90%, 80%, always multiples of 10%, but the very first iteration gave me 100%). Is this because I am computing the accuracy on the same batch data I trained it with for that iteration? Or is my model overfitting that it gave me 100% instantly, but the validation accuracy is low? (this is the main question here, if this is acceptable, or there is something wrong with the model)
Here are the hyperparameters I used.
batch_size = 64
kernel_size = 60 #from 60 #optimal 2
depth = 15 #from 60 #optimal 15
num_hidden = 1000 #from 1000 #optimal 80
learning_rate = 0.0001
training_epochs = 8
total_batches = train_x.shape[0] // batch_size
Calculating the training accuracy on the batch data during the training process is correct. If the accuracy is always a multiple of 10%, then most likely it is because your batch size is 10. For example, if 8 of the 10 training outputs match the labels, then your training accuracy will be 80%. If the training accuracy goes up and down, there are two main possibilities:
1. If you print out the accuracy multiple times over one epoch, this is normal, especially at the early stage of training, because the model is predicting over different data samples;
2. If you print out the accuracy once per epoch and you see the training accuracy go up and down during the later stage of training, that means your learning rate is too big. You need to decrease it over time during training.
If these do not answer your question, please provide more details so that we can help.
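The link between batch size and the possible accuracy values can be checked with a short sketch. The predictions and labels below are made up and the helper names are hypothetical; the point is only that per-batch accuracy moves in steps of 1 / batch_size.

```python
def batch_accuracy(predictions, labels):
    # Fraction of predictions in the batch that match the labels.
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


def possible_accuracies(batch_size):
    # Every value the per-batch accuracy can take for this batch size.
    return [k / batch_size for k in range(batch_size + 1)]


preds = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]   # made-up predictions, batch of 10
labels = [0, 1, 1, 0, 1, 0, 0, 1, 0, 1]  # made-up labels
print(batch_accuracy(preds, labels))  # 0.8

# A batch of 10 gives steps of 10%; a batch of 64 gives steps of 1/64.
step_10 = possible_accuracies(10)[1]
step_64 = possible_accuracies(64)[1]
```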