How to visualize this kind of information - machine-learning

I am training a logistic regression algorithm, and it reports the following information for each iteration. I am collecting these values as arrays over the entire training run.
Can you suggest some ways to visualize them? For example, is it appropriate to plot loss vs. accuracy? Or what kind of plot should I use?
***** Iteration #74 *****
Loss: 170.07
Feature L2-norm: 12.5714
Learning rate (eta): 0.00778819
Total number of feature updates: 236800
Loss variance: 5.01839
Seconds required for this iteration: 0.01
Accuracy: 0.9800 (784/800)
Micro P, R, F1: 0.9771 (384/393), 0.9821 (384/391), 0.9796
***** Iteration #75 *****
Loss: 166.81
Feature L2-norm: 12.4385
Learning rate (eta): 0.00769234
Total number of feature updates: 240000
Loss variance: 4.68113
Seconds required for this iteration: 0.01
Accuracy: 0.9800 (784/800)
Micro P, R, F1: 0.9771 (384/393), 0.9821 (384/391), 0.9796

I don't think that you should visualize this information. All you would see is that the L2 norm decreases over time (since it is part of the minimisation target) and that accuracy increases. But since F1 is so high, I think these metrics are evaluated on the training data.
So I would recommend producing the same kind of report (Micro P, R, F1: 0.9771 (384/393), 0.9821 (384/391), 0.9796) on test data (data which is not used for training) and plotting iteration vs. F1. The peak of that plot shows where you actually start overfitting the data.
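A minimal sketch of collecting the per-iteration values into lists so they can be plotted against the iteration number. The log format is copied from the output in the question; the function name `parse_training_log` and the dictionary keys are illustrative choices, not part of the training tool.

```python
import re

# Two iterations copied from the log in the question.
SAMPLE_LOG = """\
***** Iteration #74 *****
Loss: 170.07
Feature L2-norm: 12.5714
Accuracy: 0.9800 (784/800)
Micro P, R, F1: 0.9771 (384/393), 0.9821 (384/391), 0.9796
***** Iteration #75 *****
Loss: 166.81
Feature L2-norm: 12.4385
Accuracy: 0.9800 (784/800)
Micro P, R, F1: 0.9771 (384/393), 0.9821 (384/391), 0.9796
"""


def parse_training_log(text):
    """Return one dict per iteration with iteration number, loss, accuracy and F1."""
    records = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = re.match(r"\*+\s*Iteration #(\d+)\s*\*+", line)
        if header:
            current = {"iteration": int(header.group(1))}
            records.append(current)
        elif current is not None:
            if line.startswith("Loss:"):
                current["loss"] = float(line.split(":", 1)[1])
            elif line.startswith("Accuracy:"):
                # Keep the ratio, drop the "(784/800)" counts.
                current["accuracy"] = float(line.split(":", 1)[1].split()[0])
            elif line.startswith("Micro P, R, F1:"):
                # F1 is the last comma-separated value on the line.
                current["f1"] = float(line.rsplit(",", 1)[1])
    return records


records = parse_training_log(SAMPLE_LOG)
iterations = [r["iteration"] for r in records]
losses = [r["loss"] for r in records]
f1_scores = [r["f1"] for r in records]
```

With the values in lists, a line plot of iteration vs. F1 computed on held-out data (for example with matplotlib's `plt.plot(iterations, f1_scores)`) would show the peak described in the answer.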

For your own analysis you should plot accuracy vs. time, so you know when you start to overfit.
For publication, you can pick the metrics others have reported, so you can compare to them.

Related

Why is train loss and validate loss both in a straight line?

I am training a Conv-LSTM. The input features have been shown to be effective in several papers, and I can extract features and classify them with a CNN+FC network. Here I changed the task to regression, and the Conv+FC model still converges. I then tried a Conv-LSTM to take the temporal characteristics of the data into account: the network returns the output for the current moment based on several historical inputs plus the current input. The Conv-LSTM code I used: https://github.com/ndrplz/ConvLSTM_pytorch. My loss is L1 loss and the optimizer is Adam.
A loss curve is below:
Example loss value:
Epoch:1/500 AVG Training Loss:16.40108 AVG Valid Loss:22.40100
Best validation loss: 22.400997797648113
Saving best model for epoch 1
Epoch:2/500 AVG Training Loss:16.42522 AVG Valid Loss:22.40100
Epoch:3/500 AVG Training Loss:16.40599 AVG Valid Loss:22.40100
Epoch:4/500 AVG Training Loss:16.40175 AVG Valid Loss:22.40100
Epoch:5/500 AVG Training Loss:16.42198 AVG Valid Loss:22.40101
Epoch:6/500 AVG Training Loss:16.41907 AVG Valid Loss:22.40101
Epoch:7/500 AVG Training Loss:16.42531 AVG Valid Loss:22.40101
My attempt:
Reduced the dataset to only a few samples and verified that the network can overfit them, so the network code should be fine.
Adjusted the learning rate: I tried 1e-3, 1e-4, 1e-5 and 1e-6, but the loss curve stays as flat as before, and the loss values barely change.
Replaced the optimizer with SGD; the training result shows the same problem.
My data is wireless (I-Q) data, which is neither a CV nor an NLP input type, so I would like to ask some questions about deep learning training here.
After some testing, I finally found that my initial learning rate was too small. Based on my earlier single-point training, I had assumed that a learning rate of 1e-3 was large enough, so I only tuned downward from 1e-3. In fact, 1e-3 was too small, and the network did not learn at all. After I raised the learning rate to 1e-2, both the train loss and the validation loss dropped rapidly (the optimizer is Adam). When tuning the learning rate, start from a large value such as 1 and work downward; do not assume a range in advance.
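The advice to sweep the learning rate from large to small can be sketched on a toy problem. This is only an illustration under stated assumptions: it uses plain gradient descent on a one-parameter linear regression, not the Conv-LSTM or Adam from the question, and the data, step count and function names are made up for the example.

```python
def final_loss(learning_rate, steps=50):
    """Run plain gradient descent on y = w * x and return the final MSE loss."""
    xs = [0.0, 0.25, 0.5, 0.75, 1.0]   # toy inputs (assumption)
    ys = [3.0 * x for x in xs]         # toy targets, true weight is 3
    w = 0.0
    for _ in range(steps):
        # Gradient of mean((w*x - y)^2) with respect to w.
        grad = sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def sweep_learning_rates(rates):
    """Evaluate each rate, largest first, and return {rate: final loss}."""
    return {rate: final_loss(rate) for rate in sorted(rates, reverse=True)}


sweep = sweep_learning_rates([1.0, 1e-1, 1e-2, 1e-3, 1e-4])
for rate, loss in sweep.items():
    print(f"lr={rate:g}  final loss={loss:.5f}")
```

In this toy setting the smallest rates leave the loss almost at its starting value after 50 steps, which looks like the flat curve in the question, while the larger rates converge. Which rates are "too small" depends entirely on the model, data scaling and optimizer.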

High AUC and 100% recall, but precision and F1 are low

I have an imbalanced dataset with 43323 rows, 9 of which belong to the 'failure' class; the rest belong to the 'normal' class. I trained a classifier that reaches 100% recall and 94.89% AUC on test data (0.75/0.25 split with stratify = y). However, the classifier has only 0.18% precision and a 0.37% F1 score. I assumed I could find a better F1 score by changing the threshold, but I failed (I checked thresholds from 0 to 1 with step = 0.01). It also seems weird to me, because with an imbalanced dataset it is usually hard to get a high recall. The goal is a better F1 score. What can I do as a next step? Thanks!
(To be clear, I used SMOTE to upsample the failure samples in training dataset)
Getting 100% recall is trivial in fact: just classify everything as 1.
Is the precision/recall curve any good? Perhaps a more thorough scan could yield a better result:
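import numpy as np
from sklearn.metrics import precision_recall_curve
# predict_proba returns one column per class; column 1 is assumed to be the positive ('failure') class
probabilities = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probabilities)
# precision and recall have one more entry than thresholds, so drop the final point
p, r = precision[:-1], recall[:-1]
f1_scores = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against division by zero
best_f1 = np.max(f1_scores)
best_thresh = thresholds[np.argmax(f1_scores)]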
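As a quick sanity check on the numbers in the question, F1 is the harmonic mean of precision and recall, so a precision near zero drags F1 down even at 100% recall. The sketch below only uses figures stated in the question (0.18% precision, 100% recall, 9 failures in 43323 rows); the function name is an illustrative choice.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the question.
reported_f1 = f1_from_precision_recall(0.0018, 1.0)

# Precision of the trivial "predict failure for every row" classifier
# equals the share of failures in the data.
all_positive_precision = 9 / 43323

print(f"F1 implied by reported precision/recall: {reported_f1:.4%}")
print(f"Precision when everything is flagged:    {all_positive_precision:.4%}")
```

The implied F1 is of the same order as the 0.37% reported in the question (the small gap is consistent with rounding of the precision figure), so the low F1 follows directly from the low precision rather than from a bug in the metric.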

Is this LSTM underfitting?

I am trying to create a model that predicts whether it will rain in the next 5 days (multi-step), so I don't need the precipitation value, just a "yes" or "no". I've been testing different tools/algorithms, and I guess the big challenge here is dealing with the zero-skewed data.
The dataset consists of hourly data with columns such as precipitation, temperature, pressure, wind speed and humidity. It has around 1 million rows. There is no requirement to use a multivariate approach.
Rain occurs mostly in months 1, 2, 3, 11 and 12.
So I tried a univariate LSTM on the data, and I got the best results with hourly samples. I used the following architecture:
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(150, return_sequences=True, input_shape=(1, look_back)))
model.add(LSTM(50, return_sequences=True))
model.add(LSTM(50))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
history = model.fit(trainX, trainY, epochs=15, batch_size=4096, validation_data=(testX, testY), shuffle=False)
I'm using a lookback value of 24*60, which should mean 2 months.
Train/Validation Loss:
https://i.stack.imgur.com/CjDbR.png
Final result:
https://i.stack.imgur.com/p6SnD.png
I read that this train/validation loss pattern means the model is underfitting. Is that right? What could I do to prevent this?
Before using an LSTM I tried Prophet, which gave really bad results, and autoarima, which couldn't handle a yearly seasonality (365 days).
In the case of underfitting, you can try increasing the learning rate, training for longer, and using more training data.
It is also worth tracking an external metric such as the F1 score, because the loss is not a good metric for human evaluation.
Looking at your example, I would start by experimenting with the loss function: your target is binary, so it would be wiser to use a binary loss (such as binary cross-entropy) instead of a regression loss.

Training accuracy increases aggressively, test accuracy settles

While training a convolutional neural network following this article, the accuracy on the training set increases too much while the accuracy on the test set settles.
Below is an example with 6400 training examples, randomly chosen at each epoch (so some examples might have been seen in previous epochs, some might be new), and the same 6400 test examples each time.
For a bigger data set (64000 or 100000 training examples), the increase in training accuracy is even more abrupt, reaching 98 by the third epoch.
I also tried using the same 6400 training examples each epoch, just randomly shuffled. As expected, the result is worse.
epoch 3 loss 0.54871 acc 79.01
learning rate 0.1
nr_test_examples 6400
TEST epoch 3 loss 0.60812 acc 68.48
nr_training_examples 6400
tb 91
epoch 4 loss 0.51283 acc 83.52
learning rate 0.1
nr_test_examples 6400
TEST epoch 4 loss 0.60494 acc 68.68
nr_training_examples 6400
tb 91
epoch 5 loss 0.47531 acc 86.91
learning rate 0.05
nr_test_examples 6400
TEST epoch 5 loss 0.59846 acc 68.98
nr_training_examples 6400
tb 91
epoch 6 loss 0.42325 acc 92.17
learning rate 0.05
nr_test_examples 6400
TEST epoch 6 loss 0.60667 acc 68.10
nr_training_examples 6400
tb 91
epoch 7 loss 0.38460 acc 95.84
learning rate 0.05
nr_test_examples 6400
TEST epoch 7 loss 0.59695 acc 69.92
nr_training_examples 6400
tb 91
epoch 8 loss 0.35238 acc 97.58
learning rate 0.05
nr_test_examples 6400
TEST epoch 8 loss 0.60952 acc 68.21
This is my model (I'm using RELU activation after each convolution):
conv 5x5 (1, 64)
max-pooling 2x2
dropout
conv 3x3 (64, 128)
max-pooling 2x2
dropout
conv 3x3 (128, 256)
max-pooling 2x2
dropout
conv 3x3 (256, 128)
dropout
fully_connected(18*18*128, 128)
dropout
output(128, 128)
What could be the cause?
I'm using Momentum Optimizer with learning rate decay:
batch = tf.Variable(0, trainable=False)
train_size = 6400
learning_rate = tf.train.exponential_decay(
    0.1,                 # Base learning rate.
    batch * batch_size,  # Current index into the dataset.
    train_size * 5,      # Decay step.
    0.5,                 # Decay rate.
    staircase=True)
# Use simple momentum for the optimization.
optimizer = tf.train.MomentumOptimizer(learning_rate,
                                       0.9).minimize(cost, global_step=batch)
This is very much expected. The problem is called overfitting: your model starts "memorizing" the training examples without learning anything that helps on the test set. In fact, this is exactly why we use a test set in the first place. With a complex enough model we can always fit the training data perfectly, even if not meaningfully. The test set is what tells us what the model has actually learned.
It's also useful to have a validation set, which is like a test set, except that you use it to decide when to stop training: when the validation error stops decreasing, you stop. Why not use the test set for this? The test set is there to tell you how well your model would do in the real world. If you start using information from the test set to make choices about your training process, it's like cheating, and you will be punished by your test error no longer representing your real-world error.
Lastly, convolutional neural networks are notorious for their ability to overfit. It has been shown that conv-nets can reach zero training error even if you shuffle the labels, and even on random pixels. That means there doesn't have to be a real pattern for a conv-net to learn to represent it, so you have to regularize a conv-net: use things like dropout, batch normalization, and early stopping.
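Early stopping with a patience counter can be sketched as follows. The function name and the patience values are illustrative assumptions. The loss values are the TEST losses for epochs 3 to 8 from the log in the question, used here only as example numbers; in practice they would come from a separate validation set, for the reason given above.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (index of best loss, index at which training would stop).

    Training stops once the loss has failed to improve for `patience`
    consecutive epochs, or at the last epoch if that never happens.
    """
    best_loss = float("inf")
    best_index = -1
    waited = 0
    for index, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_index, waited = loss, index, 0
        else:
            waited += 1
            if waited >= patience:
                return best_index, index
    return best_index, len(val_losses) - 1


# Losses reported for epochs 3..8 in the question's log.
logged_losses = [0.60812, 0.60494, 0.59846, 0.60667, 0.59695, 0.60952]
first_epoch = 3

best_index, stop_index = early_stopping_epoch(logged_losses, patience=2)
print("best epoch:", first_epoch + best_index, "stopped at epoch:", first_epoch + stop_index)
```

The weights saved at the best epoch are the ones to keep. A larger patience tolerates noisy curves like this one at the cost of a few extra epochs.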
I'll leave a few links if you want to read more:
Over-fitting, validation, early stopping
https://elitedatascience.com/overfitting-in-machine-learning
Conv-nets fitting random labels:
https://arxiv.org/pdf/1611.03530.pdf
(this paper is a bit advanced, but it's interesting to skim through)
P.S. to actually improve your test accuracy you will need to change your model or train with data augmentation. You might want to try transfer learning as well.

understanding test and validation set usage for early stop and model selection

I implemented an ANN (1 hidden layer of 64 units, learning rate = 0.001, epsilon = 0.001, iters = 500) with Python's OpenCV module. Train error ~ 3% and test error ~ 12%.
To improve the accuracy/generalisation of my NN, I decided to proceed by implementing model selection (of #hidden units and learning rate) to get good hyperparameter values, and by plotting learning curves to determine whether more data is needed (currently have 2.5k).
Having read some sources regarding NN training and model selection, I'm very confused on the following matter -
1) In order to perform model selection, I know the following needs to be done-
create set possibleHiddenUnits {4, 8, 16, 32, 64}
randomly select Tr & Va sets from the total set of Tr + Va with some split e.g. 80/20
foreach ele in possibleHiddenUnits
    (*) compute weights for the NN using backpropagation and an iterative optimisation algorithm like Gradient Descent (where we provide the termination criteria in the form of number of iterations / epsilon)
    compute Validation set error using these trained weights
select the number of hidden units which min Va set error
Alternatively, I believe we can also use k-fold cross validation.
a. how do you decide what the number of iterations/ epsilon for GD should be?
b. does 1 iteration out of x iterations of GD (where the entire training set is used to compute the gradients of cost wrt weights through backprop) constitute an 'epoch'?
2) Sources (whats is the difference between train, validation and test set, in neural networks? and How to use k-fold cross validation in a neural network) mention that the training for a NN is done in the following way as it prevents over-fitting
for each epoch
    for each training data instance
        propagate error through the network
        adjust the weights
        calculate the accuracy over training data
    for each validation data instance
        calculate the accuracy over the validation data
    if the threshold validation accuracy is met
        exit training
    else
        continue training
a. I believe this method should be executed once the model selection has been done. But then how do we avoid overfitting the model in step (*) of the model selection process above?
b. Am I right in assuming that one epoch constitutes one iteration of training, where the weights are calculated using the entire Tr set through GD + backprop, and GD involves x (>1) iterations over the entire Tr set to calculate the weights?
Also, out of 1b and 2b, which is correct?
This is more of a comment, but since I can't make comments yet I am writing it here. Have you tried other methods like L2 regularization or dropout? I don't know a lot about model selection, but dropout has an effect very similar to training lots of models and averaging them. Normally dropout should do the trick, and you won't have problems with overfitting anymore.
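The model-selection loop described in the question (split into Tr and Va, train once per candidate, keep the candidate with the lowest Va error) can be sketched as below. This is a sketch under assumptions: the helper names are hypothetical, and `validation_error` stands for a user-supplied function that trains the network with the given number of hidden units on the Tr indices and returns its error on the Va indices.

```python
import random


def train_validation_split(n_samples, validation_fraction=0.2, seed=0):
    """Shuffle sample indices and return (train_indices, validation_indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * validation_fraction))
    return indices[n_val:], indices[:n_val]


def select_hidden_units(candidates, validation_error):
    """Evaluate every candidate and return (best candidate, all errors).

    `validation_error(units)` is assumed to train a model with `units`
    hidden units on the training split and return its validation error.
    """
    errors = {units: validation_error(units) for units in candidates}
    best = min(errors, key=errors.get)
    return best, errors


# 2500 samples with an 80/20 split, as in the question.
train_idx, val_idx = train_validation_split(2500, validation_fraction=0.2)
possible_hidden_units = [4, 8, 16, 32, 64]
# best_units, errors = select_hidden_units(possible_hidden_units, my_validation_error)
```

Here `my_validation_error` is a placeholder for your own training routine. The test set stays untouched until the hyperparameters are fixed, so that the final test error still estimates real-world performance.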

Resources