I was training CIFAR-10 on CPU with TensorFlow. During the first rounds the loss seemed all right, but after step 10210 the loss fluctuates and eventually becomes NaN.
My network is the CIFAR-10 CNN model from the TensorFlow website. Here are my settings:
image_size = 32
num_channels = 3
num_classes = 10
num_batches_to_run = 50000
batch_size = 128
eval_batch_size = 64
initial_learning_rate = 0.1
learning_rate_decay_factor = 0.1
num_epochs_per_decay = 350.0
moving_average_decay = 0.9999
The results are shown below:
2017-05-12 21:53:05.125242: step 10210, loss = 4.99 (124.9 examples/sec; 1.025 sec/batch)
2017-05-12 21:53:13.960001: step 10220, loss = 7.55 (139.5 examples/sec; 0.918 sec/batch)
2017-05-12 21:53:23.491228: step 10230, loss = 6.63 (149.5 examples/sec; 0.856 sec/batch)
2017-05-12 21:53:33.355805: step 10240, loss = 8.08 (113.3 examples/sec; 1.129 sec/batch)
2017-05-12 21:53:43.007007: step 10250, loss = 7.18 (126.7 examples/sec; 1.010 sec/batch)
2017-05-12 21:53:52.650118: step 10260, loss = 16.61 (138.0 examples/sec; 0.928 sec/batch)
2017-05-12 21:54:02.537279: step 10270, loss = 9.60 (137.6 examples/sec; 0.930 sec/batch)
2017-05-12 21:54:12.390117: step 10280, loss = 46526.25 (145.5 examples/sec; 0.880 sec/batch)
2017-05-12 21:54:22.060741: step 10290, loss = 133479743509972411931057146822656.00 (130.4 examples/sec; 0.982 sec/batch)
2017-05-12 21:54:31.691058: step 10300, loss = nan (115.8 examples/sec; 1.105 sec/batch)
Any idea about the NaN loss?
This happens a lot in practice when your learning rate is too high. I tend to start at 0.001 and move from there; 0.1 is on the very high side for most datasets, especially if you aren't dividing your loss by your batch size.
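For reference, here is a minimal sketch of what dividing the loss by the batch size looks like in TF1-style graph code (logits and labels below are placeholders standing in for your model's tensors, not names from the CIFAR-10 example):
import tensorflow as tf

# Per-example cross-entropy; reduce_mean averages over the batch,
# i.e. divides the summed loss by the batch size.
cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels, logits=logits)
loss = tf.reduce_mean(cross_entropy)      # scales sensibly with batch size
# loss = tf.reduce_sum(cross_entropy)     # NOT divided by batch size; grows with the batch

train_op = tf.train.GradientDescentOptimizer(0.001).minimize(loss)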
You can also clip the gradients. If you are using Keras with the TensorFlow backend, you could do it as follows.
The parameters clipnorm and clipvalue can be used with all optimizers to control gradient clipping:
from keras import optimizers
# All parameter gradients will be clipped to
# a maximum norm of 1.
sgd = optimizers.SGD(lr=0.01, clipnorm=1.)
or
from keras import optimizers
# All parameter gradients will be clipped to
# a maximum value of 0.5 and
# a minimum value of -0.5.
sgd = optimizers.SGD(lr=0.01, clipvalue=0.5)
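If you are on plain TensorFlow rather than Keras, a rough equivalent is to clip between computing and applying the gradients. This is only a sketch: loss stands for your model's loss tensor, and the optimizer and clip values are arbitrary choices.
import tensorflow as tf

optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
grads_and_vars = optimizer.compute_gradients(loss)           # loss = your loss tensor
grads, variables = zip(*grads_and_vars)
clipped, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)    # cap the global gradient norm at 1
train_op = optimizer.apply_gradients(zip(clipped, variables))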
Your cross-entropy loss may be taking log(0). Just add a small constant inside the log.
(You might also want to look into gradient clipping.)
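For example, if you compute the cross entropy by hand, it could look like this. Sketch only: y is assumed to be the softmax output and y_ the one-hot labels.
import tensorflow as tf

eps = 1e-10
cross_entropy = -tf.reduce_mean(
    tf.reduce_sum(y_ * tf.log(y + eps), axis=1))   # eps keeps log() away from 0

# Or clip instead of adding a constant:
# cross_entropy = -tf.reduce_mean(
#     tf.reduce_sum(y_ * tf.log(tf.clip_by_value(y, 1e-10, 1.0)), axis=1))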
This is a time-series regression problem with battery capacity as the output and a single input variable, voltage; the relationship is non-linear.
The LSTM model's prediction on the test data always comes out as a semi-flat line, probably around the mean of the output variable in the training data.
This is an example of predicted vs. test-set output values, with the following model parameters:
(Window size: 10, batch size: 256, LSTM nodes: 16)
Prediction of the test data
The data have been normalized and down-sampled to 1 s and later to 3 s; the original sampling rate was 10 Hz.
I suspected voltage fluctuation was the problem, but sampling at 3 seconds didn't yield a noticeable improvement.
Here are the data after being down-sampled to 3 seconds:
Normalized Training Data ; Y:SOC, X: Voltage
Normalized Test Data ; Y:SOC, X: Voltage
I've tried many changes to the model and learning parameters, as listed below, but the behavior stays the same.
That's why I think it's not a parameter tuning issue; rather, the model is not learning at all.
LSTM layer: always a single LSTM layer, followed by a Dense layer with no options.
LSTM nodes: [4, 8, 16, 32]
Epochs: [16, 32, 64, 128]
Window size (input vector depth): [8, 32, 64, 128]
Batch size: [32, 64, 128, 256]
Learning rate: [0.0005, 0.0001, 0.001]
Optimizer: Adam, options: [none, clipnorm=1, clipvalue=0.5]
Model specification Code:
from tensorflow.keras import backend, layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM

backend.clear_session()
model1 = Sequential()
model1.add(LSTM(16, input_shape=(win_sz, features_cnt)))  # stateless
model1.add(layers.Dense(1))
model1.summary()
Model training and validation Code:
import tensorflow as tf

n_epochs = 12
iterations = tr_samples_sh_cnt // batch_sz_tr
loss = tf.keras.losses.MeanAbsoluteError()
optimizer = tf.optimizers.Adam(learning_rate=0.001)
loss_history = []

#tf.function
def train_model_on_batch():
    start = epoch * batch_sz_tr
    X_batch = df_feat_tr_3D[start:start+batch_sz_tr, :, :]
    y_batch = df_SOC_tr_2D[start:start+batch_sz_tr, :]
    with tf.GradientTape() as tape:
        current_loss = loss(model1(X_batch), y_batch)
    gradients = tape.gradient(current_loss, model1.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model1.trainable_variables))
    return current_loss

for epoch in range(n_epochs+1):
    for iteration in range(iterations):
        current_loss = train_model_on_batch()
    if epoch % 1 == 0:
        loss_history.append(current_loss.numpy())
        print("{}. \t\tLoss: {}".format(epoch, loss_history[-1]))

print('\nTraining complete.')
P_test = model1.predict(df_feat_test_3D)
After adding a sigmoid activation function to both the LSTM and Dense layers, a very small change was observed, but it is still far from a reasonable fit.
Prediction of the test data after adding activation function
The problem was the activation function, as @Dr. Snoopy recommended.
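For completeness, this is where the activation is specified in the model. The concrete values here (tanh in the LSTM, the default linear output of Dense, which is the usual choice for a regression target) are my assumption, not a quote from that recommendation.
from tensorflow.keras import backend
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

backend.clear_session()
model1 = Sequential()
model1.add(LSTM(16, activation='tanh', input_shape=(win_sz, features_cnt)))  # explicit activation
model1.add(Dense(1))  # linear activation by default, suitable for regression
model1.summary()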
I'm using a Random Forest Regressor to fit a 10-dimensional regression problem with around 300 thousand samples. Although it isn't necessary for Random Forests, I started by putting the data on the same scale (using sklearn's preprocessing module) and then did a randomized search over the following parameter space:
n_estimators: [int(x) for x in np.linspace(start=100, stop=2000, num=11)]
max_features: 'auto', 'sqrt'
max_depth: from 1 to 150 with step 11
min_samples_split: 2, 5, 10, 12
min_samples_leaf: 1, 2, 4, 6
bootstrap: True or False
Moreover, after getting the best parameters, I did a second, narrower search.
Even though I am using a 10-fold cross-validation scheme with the random search, I'm still getting a serious overfitting problem!
I have also tried using the DBSCAN algorithm to check for outliers; after excluding some parts of the dataset I got even worse results!
Should I include other parameters of the Random Forest in the randomized search? Or should I apply some more preprocessing techniques to the dataset before fitting?
For convenience, here is the implementation I wrote:
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start=1, stop=15, num=15)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 12]
min_samples_leaf = [1, 2, 4, 6]
bootstrap = [True, False]

cv = ShuffleSplit(n_splits=10, test_size=0.01, random_state=0)

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=50, cv=cv, verbose=2, random_state=42,
                               n_jobs=32)
rf_random.fit(x_train, y_train)
The best parameters returned by the randomized search were:
bootstrap: False, min_samples_leaf: 2, n_estimators: 1647, max_features: sqrt, min_samples_split: 3, max_depth: None.
The range of the target is from 0 to 10000 [unit]. This model yields an RMSE of 6.98 [unit] on the training set and an average RMSE of 67.54 [unit] on the test sets.
That line:
max_depth: from 1 to 150 with step 11
For a 10-feature problem the optimum depth is under 10, so you are overfitting like crazy because of that. Consider searching max_depth from 1 to 15 with step 1.
min_samples_split: 2, 5, 10, 12
min_samples_leaf: 1, 2, 4, 6
These should help reduce the variance; however, the step of 11 for max_depth is killing all the effort you could possibly make. A narrower grid is sketched below.
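As a concrete sketch (variable names mirror the question's code; the exact ranges are only a suggestion):
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Shallow, finely stepped depth grid: 1..15 with step 1 instead of step 11.
random_grid = {
    'n_estimators': [int(x) for x in np.linspace(100, 2000, num=11)],
    'max_features': ['sqrt', None],        # None = use all features
    'max_depth': list(range(1, 16)),
    'min_samples_split': [2, 5, 10, 12],
    'min_samples_leaf': [1, 2, 4, 6],
    'bootstrap': [True, False],
}

rf_random = RandomizedSearchCV(estimator=RandomForestRegressor(),
                               param_distributions=random_grid,
                               n_iter=50, cv=10, verbose=2,
                               random_state=42, n_jobs=-1)
# rf_random.fit(x_train, y_train)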
I am trying to compare multiple classifiers on a dataset that I have. To get reliable accuracy scores I am performing 10-fold cross-validation for each classifier. This works well for all of them except SVM (both linear and rbf kernels). The data is loaded like this:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dataset = pd.read_csv("data/distance_annotated_indels.txt", delimiter="\t", header=None)
X = dataset.iloc[:, [5,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26]].values
y = dataset.iloc[:, 4].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Cross-validation for a Random Forest, for example, works fine:
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.metrics import classification_report

start = time.time()
classifier = RandomForestClassifier(n_estimators=100, criterion='entropy')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
cv = ShuffleSplit(n_splits=10, test_size=0.2)
scores = cross_val_score(classifier, X, y, cv=10)
print(classification_report(y_test, y_pred))
print("Random Forest accuracy after 10 fold CV: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2) + ", " + str(round(time.time() - start, 3)) + "s")
Output:
precision recall f1-score support
0 0.97 0.95 0.96 3427
1 0.95 0.97 0.96 3417
avg / total 0.96 0.96 0.96 6844
Random Forest accuracy after 10 fold CV: 0.92 (+/- 0.06), 90.842s
However, for SVM this process takes ages (I waited 2 hours, still nothing). The sklearn website does not make me any wiser. Is there something I should be doing differently for SVM classifiers? The SVM code is as follows:
from sklearn.svm import SVC

start = time.time()
classifier = SVC(kernel='linear')
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
scores = cross_val_score(classifier, X, y, cv=10)
print(classification_report(y_test, y_pred))
print("Linear SVM accuracy after 10 fold CV: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2) + ", " + str(round(time.time() - start, 3)) + "s")
If you have a lot of samples, the computational complexity of the problem gets in the way; see Training complexity of Linear SVM.
Consider playing with the verbose flag of cross_val_score to see more logs about progress. Also, setting n_jobs to a value > 1 (or even using all CPUs with n_jobs=-1, if memory allows) can speed up computation via parallelization. http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html can be useful to evaluate these options.
If performance is poor, I'd consider reducing the value of cv (see https://stats.stackexchange.com/questions/27730/choice-of-k-in-k-fold-cross-validation for a discussion of this).
Also, you can control the run time by changing max_iter. If it is set to -1, it can run for a very long time depending on the solution space. Set some integer value, say 10000, as a stopping criterion (see the sketch below).
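Putting those suggestions together, something like the following (sketch only; 10000 iterations and cv=5 are arbitrary starting points):
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Cap the solver iterations so a single fit cannot run forever,
# and parallelize the CV folds over all CPUs.
classifier = SVC(kernel='linear', max_iter=10000)
scores = cross_val_score(classifier, X, y, cv=5, n_jobs=-1, verbose=2)
print("Linear SVM accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))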
Alternatively, you can try an optimized SVM implementation, for example scikit-learn-intelex: https://github.com/intel/scikit-learn-intelex
First install the package:
pip install scikit-learn-intelex
Then add the following to your Python script:
from sklearnex import patch_sklearn
patch_sklearn()
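As far as I know, the patch has to be applied before the scikit-learn estimators are imported, so the order would be roughly:
from sklearnex import patch_sklearn
patch_sklearn()  # patch first ...

from sklearn.svm import SVC                       # ... then import the estimators
from sklearn.model_selection import cross_val_score

classifier = SVC(kernel='linear')
scores = cross_val_score(classifier, X, y, cv=10, n_jobs=-1)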
I'm training several CNNs to do image classification in TensorFlow. The training loss decreases normally. However, the test accuracy never changes throughout the whole training procedure, and it is very low (0.014), whereas random guessing would give about 0.003 (there are around 300 classes). One thing I've noticed is that only the models to which I applied batch norm show this weird behavior. What could be causing this issue? The training set has 80000 samples, in case you suspect overfitting. Below is part of the code for evaluation:
Accuracy function:
correct_prediction = tf.equal(tf.argmax(Model(test_image), 1), tf.argmax(test_image_label, 0))
accuracy = tf.cast(correct_prediction, tf.float32)
Here test_image is a batch containing only one sample, while test_image_label is a scalar.
Session:
with tf.Session() as sess:
    sess.run(tf.local_variables_initializer())
    sess.run(tf.global_variables_initializer())
    saver = tf.train.Saver()
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord, start=True)
    print('variables initialized')

    step = 0
    for epoch in range(epochs):
        sess.run(enqueue_train)
        print('epoch: %d' % epoch)
        if epoch % 5 == 0:
            save_path = saver.save(sess, savedir + "/Model")
        for batch in range(num_batch):
            if step % 400 == 0:
                summary_str = cost_summary.eval(feed_dict={phase: True})
                file_writer.add_summary(summary_str, step)
            else:
                sess.run(train_step, feed_dict={phase: True})
            step += 1
        sess.run(train_close)

        sess.run(enqueue_test)
        accuracy_vector = []
        for num in range(len(testnames)):
            accuracy_vector.append(sess.run(accuracy, feed_dict={phase: False}))
        mean_accuracy = sess.run(tf.divide(tf.add_n(accuracy_vector), len(testnames)))
        print("test accuracy %g" % mean_accuracy)
        sess.run(test_close)

    save_path = saver.save(sess, savedir + "/Model_final")
    coord.request_stop()
    coord.join(threads)
    file_writer.close()
The phase placeholder above indicates whether the batch norm layers are in training or testing mode.
Note that I also tried calculating the accuracy on the training set, which reaches the minimal loss, and it gives the same poor accuracy. Please help me, I'd really appreciate it!
I built an RNN using Keras. The RNN is used to solve a regression problem:
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed, BatchNormalization
from keras.optimizers import RMSprop

def RNN_keras(feat_num, timestep_num=100):
    model = Sequential()
    model.add(BatchNormalization(input_shape=(timestep_num, feat_num)))
    model.add(LSTM(input_shape=(timestep_num, feat_num), output_dim=512,
                   activation='relu', return_sequences=True))
    model.add(BatchNormalization())
    model.add(LSTM(output_dim=128, activation='relu', return_sequences=True))
    model.add(BatchNormalization())
    model.add(TimeDistributed(Dense(output_dim=1, activation='relu')))  # sequence labeling
    rmsprop = RMSprop(lr=0.00001, rho=0.9, epsilon=1e-08)
    model.compile(loss='mean_squared_error',
                  optimizer=rmsprop,
                  metrics=['mean_squared_error'])
    return model
The whole process looks fine, but the loss stays exactly the same over epochs.
61267 in the training set
6808 in the test set
Building training input vectors ...
888 unique feature names
The length of each vector will be 888
Using TensorFlow backend.
Build model...
# Each batch has 1280 examples
# The training data are shuffled at the beginning of each epoch.
****** Iterating over each batch of the training data ******
Epoch 1/3 : Batch 1/48 | loss = 11011073.000000 | root_mean_squared_error = 3318.232910
Epoch 1/3 : Batch 2/48 | loss = 620.271667 | root_mean_squared_error = 24.904161
Epoch 1/3 : Batch 3/48 | loss = 620.068665 | root_mean_squared_error = 24.900017
......
Epoch 1/3 : Batch 47/48 | loss = 618.046448 | root_mean_squared_error = 24.859678
Epoch 1/3 : Batch 48/48 | loss = 652.977051 | root_mean_squared_error = 25.552946
****** Epoch 1: RMSD(training) = 24.897174
Epoch 2/3 : Batch 1/48 | loss = 607.372620 | root_mean_squared_error = 24.644049
Epoch 2/3 : Batch 2/48 | loss = 599.667786 | root_mean_squared_error = 24.487448
Epoch 2/3 : Batch 3/48 | loss = 621.368103 | root_mean_squared_error = 24.926300
......
Epoch 2/3 : Batch 47/48 | loss = 620.133667 | root_mean_squared_error = 24.901398
Epoch 2/3 : Batch 48/48 | loss = 639.971924 | root_mean_squared_error = 25.297264
****** Epoch 2: RMSD(training) = 24.897174
Epoch 3/3 : Batch 1/48 | loss = 651.519836 | root_mean_squared_error = 25.523636
Epoch 3/3 : Batch 2/48 | loss = 673.582581 | root_mean_squared_error = 25.952084
Epoch 3/3 : Batch 3/48 | loss = 613.930054 | root_mean_squared_error = 24.776562
......
Epoch 3/3 : Batch 47/48 | loss = 624.460327 | root_mean_squared_error = 24.988203
Epoch 3/3 : Batch 48/48 | loss = 629.544250 | root_mean_squared_error = 25.090448
****** Epoch 3: RMSD(training) = 24.897174
I do NOT think this is normal. Am I missing something?
UPDATE:
I found that all predictions are always zero after all epochs. This is why all the RMSDs are the same: the predictions are all identical, i.e. 0. I checked the training y; it contains only a few zeros, so this is not due to data imbalance.
So now I am wondering if it is because of the layers and activations that I am using.
Your RNN function seems to be OK.
How quickly the loss decreases depends on the optimizer and the learning rate.
In any case, you are using a decay rate of 0.9; try a bigger learning rate, since it is going to decrease at that 0.9 rate anyway.
Try out other optimizers with different learning rates; a sketch follows below.
The other optimizers available in Keras are listed at https://keras.io/optimizers/
Often, some optimizers work well on a given dataset while others fail.
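For instance, something along these lines (a sketch that reuses the question's RNN_keras function and the 888 features mentioned above; the learning rates are just starting points to try):
from keras.optimizers import Adam, RMSprop, SGD

# Rebuild the same model and recompile it with a different optimizer /
# learning rate, then compare the training curves.
for opt in [Adam(lr=0.001), RMSprop(lr=0.001), SGD(lr=0.01, momentum=0.9)]:
    model = RNN_keras(feat_num=888, timestep_num=100)
    model.compile(loss='mean_squared_error', optimizer=opt,
                  metrics=['mean_squared_error'])
    # model.fit(X_train, y_train, ...)   # compare the loss curves per optimizer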
Have you tried changing the activation function from relu to softmax?
Relu activation has a tendency to diverge. However, initializing the weights with an eigenmatrix may result in better convergence.
Since you are using an RNN for a regression problem (not for classification), you should use a 'linear' activation in the last layer.
In your code,
model.add(TimeDistributed(Dense(output_dim=1, activation='relu'))) # sequence labeling
change activation='relu' to activation='linear'.
If that doesn't work, also remove activation='relu' from the second LSTM layer.
Also, the learning rate for rmsprop usually ranges from 0.0001 to 0.1. A sketch of the suggested change follows below.
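Putting the two suggestions together, the modified model could look roughly like this (a sketch only; the new function name and the lr=0.001 value are my own choices):
from keras.models import Sequential
from keras.layers import LSTM, Dense, TimeDistributed, BatchNormalization
from keras.optimizers import RMSprop

def RNN_keras_linear_out(feat_num, timestep_num=100):
    model = Sequential()
    model.add(BatchNormalization(input_shape=(timestep_num, feat_num)))
    model.add(LSTM(input_shape=(timestep_num, feat_num), output_dim=512,
                   activation='relu', return_sequences=True))
    model.add(BatchNormalization())
    model.add(LSTM(output_dim=128, activation='relu', return_sequences=True))
    model.add(BatchNormalization())
    # linear output for a regression target instead of relu
    model.add(TimeDistributed(Dense(output_dim=1, activation='linear')))
    rmsprop = RMSprop(lr=0.001, rho=0.9, epsilon=1e-08)  # lr raised into the usual range
    model.compile(loss='mean_squared_error', optimizer=rmsprop,
                  metrics=['mean_squared_error'])
    return model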