Loss in Keras Model evaluation - machine-learning

I am doing binary classification with Keras, using loss='binary_crossentropy', optimizer=tf.keras.optimizers.Adam, and a final layer of keras.layers.Dense(1, activation=tf.nn.sigmoid).
As far as I know, the loss value is used to evaluate the model during the training phase. However, when I use Keras model evaluation on my test dataset (e.g. m_recall.evaluate(testData, testLabel)), there are also loss values, accompanied by accuracy values, as in the output below:
test size: (1889, 18525)
1889/1889 [==============================] - 1s 345us/step
m_acc: [0.5690245978371045, 0.9523557437797776]
1889/1889 [==============================] - 1s 352us/step
m_recall: [0.24519687695911097, 0.9359449444150344]
1889/1889 [==============================] - 1s 350us/step
m_f1: [0.502442331737344, 0.9216516675489677]
1889/1889 [==============================] - 1s 360us/step
metric name: ['loss', 'acc']
What is the meaning/usage of loss during testing? Why is it so high (e.g. 0.5690 in m_acc)? The accuracy seems fine to me (e.g. 0.9523 in m_acc), but I am concerned about the loss too. Does it mean my model performs badly?
P.S.
m_acc, m_recall, etc. are just the names I gave my models (they were selected on different metrics in GridSearchCV).
Update:
I just realized that loss values are not percentages, so how are they calculated? And with the current values, are they good enough or do I need to optimize them further?
Suggestions for further reading are appreciated too!

When defining a machine learning model, we want a way to measure the performance of our model so that we could compare it with other models to choose the best one and also make sure that it is good enough. Therefore, we define some metrics like accuracy (in the context of classification), which is the proportion of correctly classified samples by the model, to measure how our model performs and whether it is good enough for our task or not.
Although these metrics are easy for us to interpret, the problem is that they cannot be used directly by the learning process to tune the parameters of the model (accuracy, for example, is not differentiable). Instead, we define other measures, usually called loss functions or objective functions, which can be used directly by the training process (i.e. optimization). These functions are usually defined such that low values are expected to go along with high accuracy. That's why you commonly see machine learning algorithms minimizing a loss function with the expectation that the accuracy increases. In other words, the models learn indirectly by optimizing the loss function. The loss values are important during training, e.g. if they are not decreasing, or are fluctuating, there is a problem somewhere that needs to be fixed.
As a result, what we are ultimately concerned about (i.e. when testing a model) is the value of the metrics (like accuracy) we initially defined, and we don't care much about the final value of the loss function. That's why you don't hear things like "the loss value of a [specific model] on the ImageNet dataset is 8.732"! That tells you nothing about whether the model is great, good, bad or terrible. Rather, you would hear that "this model performs with 87% accuracy on the ImageNet dataset".
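To address the "how are they calculated" part of the question, here is a minimal pure-Python sketch of mean binary cross-entropy next to accuracy. The helper names, the clipping constant eps, and the two made-up prediction lists are assumptions for illustration, not Keras internals. It shows that two models with identical accuracy can have very different loss, because the loss also penalizes how confident a wrong prediction is.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) never occurs (eps is an assumption)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    hits = sum(int(p >= threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Made-up labels and predicted probabilities (illustrative only).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
calibrated = [0.9, 0.8, 0.9, 0.7, 0.1, 0.2, 0.1, 0.3, 0.4, 0.2]
overconfident = [0.99, 0.99, 0.99, 0.99, 0.01, 0.01, 0.01, 0.01, 0.001, 0.01]

# Both models misclassify only the ninth sample, so both reach 90% accuracy.
print(accuracy(y_true, calibrated), accuracy(y_true, overconfident))

# The overconfident model is wrong with p=0.001 on a positive sample, which
# alone contributes -log(0.001), roughly 6.9, so its mean loss is much higher
# (roughly 0.70 versus roughly 0.27).
print(binary_crossentropy(y_true, calibrated))
print(binary_crossentropy(y_true, overconfident))
```

So a loss of 0.569 alongside 95% accuracy is not contradictory: it may simply mean that some of the few wrong predictions are made with high confidence.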

Related

Why does pre-trained ResNet18 have a higher validation accuracy than training?

For PyTorch's tutorial on performing transfer learning for computer vision (https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html), we can see that there is a higher validation accuracy than training accuracy. Applying the same steps to my own dataset, I see similar results. Why is this the case? Does it have something to do with ResNet 18's architecture?
Assuming there aren't bugs in your code and the train and validation data are in the same domain, then there are a couple reasons why this may occur.
Training loss/acc is computed as the average across an entire training epoch. The network begins the epoch with one set of weights and ends the epoch with a different (hopefully better!) set of weights. During validation you're evaluating everything using only the most recent weights. This means that the comparison between validation and train accuracy is misleading since training accuracy/loss was computed with samples from potentially much worse states of your model. This is usually most noticeable at the start of training or right after the learning rate is adjusted since the network often starts the epoch in a much worse state than it ends. It's also often noticeable when the training data is relatively small (as is the case in your example).
Another difference is the data augmentations used during training that aren't used during validation. During training you randomly crop and flip the training images. While these random augmentations are useful for increasing the ability of your network to generalize they aren't performed during validation because they would diminish performance.
If you were really motivated and didn't mind spending the extra computational power you could get a more meaningful comparison by running the training data back through your network at the end of each epoch using the same data transforms used for validation.
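The averaging effect described above can be illustrated with a toy calculation. The per-batch accuracies below are hypothetical numbers, not measurements from the tutorial; they only assume that the weights improve steadily within one epoch.

```python
# Hypothetical per-batch training accuracies within ONE epoch: the weights
# are updated after every batch, so later batches tend to score higher.
batch_acc = [0.50, 0.60, 0.70, 0.80, 0.90]

# What a typical training loop reports: the mean over all batches in the epoch,
# which mixes results from early (worse) and late (better) weights.
reported_train_acc = sum(batch_acc) / len(batch_acc)

# What validation effectively measures: only the weights at the end of the epoch.
end_of_epoch_acc = batch_acc[-1]

print(reported_train_acc, end_of_epoch_acc)
```

Under these assumptions the reported training accuracy (about 0.70) trails the end-of-epoch state (0.90), even though nothing is wrong with the model.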
The short answer is that the train and validation data are from different distributions, and it's "easier" for the model to predict the target in the validation data than in the training data.
The likely reason for this particular case, as indicated by this answer, is data augmentation during training. This is a way to regularize your model by increasing variability in the training data.
Other architectures can use Dropout (or its modifications), which are deliberately "hurting" training performance, reducing the potential of overfitting.
Note that you're using a pretrained model, which already contains some information about how to solve the classification problem. If your domain is not that different from the data it was trained on, you can expect good performance off the shelf.

Decreasing training loss, stable validation loss - is the model overfitting?

Does my model overfit? I would be sure it overfitted, if the validation loss increased heavily, while the training loss decreased. However the validation loss is nearly stable, so I am not sure. Can you please help?
I assume that you're using different hyperparameters? Perhaps save
the parameters and resume with a different set of hyperparameters.
This comment really depends on how you're doing hyperparameter
optimization.
Try with different training/test splits. It might be idiosyncratic.
Especially with so few epochs.
Depending on how costly it is to train the model and evaluate it,
consider bagging your models, akin to how a random forest operates.
In other words, fit your model to many different train/test splits,
and average the model outputs, either in terms of a majority
classification vote, or an averaging of the predicted probabilities.
In this case, I'd err on the side of a slightly overfit model,
because of the way that averaging can mitigate overfitting. But I
wouldn't train to death either, unless you're going to fit very very
many neural nets, and somehow ensure that you're decorrelating them
akin to the method of random subspaces from random forests.
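The two ways of combining bagged models mentioned above (majority vote and averaging predicted probabilities) can be sketched as follows. The function names and the three lists of predicted probabilities are hypothetical; in practice each list would come from a model fitted on a different train/test split.

```python
def average_probabilities(prob_lists):
    """Average the predicted probabilities of several models, per sample."""
    n_models = len(prob_lists)
    return [sum(sample_probs) / n_models for sample_probs in zip(*prob_lists)]

def majority_vote(prob_lists, threshold=0.5):
    """Threshold each model's probability, then take the majority class per sample.

    Assumes an odd number of models so that ties cannot occur.
    """
    n_models = len(prob_lists)
    result = []
    for sample_probs in zip(*prob_lists):
        positive_votes = sum(p >= threshold for p in sample_probs)
        result.append(1 if 2 * positive_votes > n_models else 0)
    return result

# Made-up predictions of three models on the same three samples.
model_a = [0.9, 0.4, 0.2]
model_b = [0.8, 0.6, 0.1]
model_c = [0.7, 0.3, 0.6]

averaged = average_probabilities([model_a, model_b, model_c])
voted = majority_vote([model_a, model_b, model_c])
print(averaged, voted)
```

Averaging probabilities keeps information about each model's confidence, while voting only uses the thresholded decisions; which one works better depends on how well calibrated the individual models are.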

Identifying accuracy and dropped features with AutoML (ml.net)

I have been playing with ML.Net AutoML and having a blast with it. I still have some questions and hope someone either could help or guide me in the right direction with some of my questions.
Question 1:
I have a trained binary classification model from AutoML. This resulted in a top 5 list of algorithms based on highest accuracy, and I ended up with a SdcaLogisticRegressionBinary binary classification model with an accuracy of 89%.
Now when I do my evaluation the accuracy drops to 84%. Would this mean the original training model was overfitted by 5%? Would it be fair to say that the accuracy of my model is not 89% but actually 84% based on the evaluation?
Question 2:
AutoML also drops features during training where needed. Is there a way to retrieve the actual list of features that was included in the final model, e.g. determine which features were dropped and didn't improve the accuracy of the model?
When I inspect the final model, the OutputSchema tends to always include all the features based on the initial training data.
Would this mean the original training model was overfitted by 5%?
This terminology says nothing, and it is never used. Sadly, "overfitting" is a much abused term nowadays, used to mean almost everything linked to suboptimal performance; nevertheless, and practically speaking, overfitting means something very specific: its telltale signature is when your validation loss starts increasing while your training loss continues decreasing.
The 5% "margin" between your training and validation accuracy is another story altogether (it is called generalization gap), and does not signify overfitting.
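As a rough sketch of that signature, the helper below looks for the epoch after which validation loss rises while training loss keeps falling. The function name, the patience and min_delta parameters, and the loss curves are all assumptions for illustration; they are not part of ML.NET or any other library.

```python
def overfitting_onset(train_loss, val_loss, patience=2, min_delta=0.0):
    """Return the epoch with the best validation loss if, `patience` or more
    epochs later, validation loss is higher than that best value while
    training loss is lower. Return None if the signature is not seen."""
    best = 0
    for epoch in range(1, len(val_loss)):
        if val_loss[epoch] < val_loss[best]:
            best = epoch
        elif (epoch - best >= patience
              and val_loss[epoch] > val_loss[best] + min_delta
              and train_loss[epoch] < train_loss[best]):
            return best
    return None

# Made-up loss curves (one value per epoch).
train = [0.90, 0.70, 0.50, 0.40, 0.30, 0.20]
val_rising = [0.95, 0.80, 0.70, 0.72, 0.78, 0.85]  # turns upward after epoch 2
val_flat = [0.95, 0.80, 0.70, 0.70, 0.70, 0.70]    # plateaus, never rises

print(overfitting_onset(train, val_rising))  # 2
print(overfitting_onset(train, val_flat))    # None
```

A constant gap between the two curves, like the 5% accuracy difference in the question, does not trigger this check; only a diverging validation curve does.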
Would it be fair to say that the accuracy of my model is not 89% but actually 84% based on the evaluation?
As you have probably already suspected, "accuracy" by itself is an ambiguous term; the truth is that, in practice, when used without any other qualifier, it is usually taken to mean the validation accuracy (practically nobody cares about the exact value of the training accuracy). In any case, the correct way to report your results would be: training accuracy 89%, validation accuracy 84%.

Test accuracy is greater than train accuracy what to do?

I am using a random forest. My test accuracy is 70%, while my train accuracy is 34%. What should I do? How can I solve this problem?
Test accuracy should not be higher than train since the model is optimized for the latter. Ways in which this behavior might happen:
you did not use the same source dataset for test. You should do a proper train/test split in which both of them have the same underlying distribution. Most likely you provided a completely different (and more agreeable) dataset for test
an unreasonably high degree of regularization was applied. Even so there would need to be some element of "test data distribution is not the same as that of train" for the observed behavior to occur.
The other answers are correct in most cases. But I'd like to offer another perspective. There are specific training regimes that could cause the training data to be harder for the model to learn - for instance, adversarial training or adding Gaussian noise to the training examples. In these cases, the benign test accuracy could be higher than train accuracy, because benign examples are easier to evaluate. This isn't always a problem, however!
If this applies to you, and the gap between train and test accuracies is larger than you'd like (~30%, as in your question, is a pretty big gap), then this indicates that your model is underfitting the harder patterns, so you'll need to increase the expressiveness of your model. In the case of random forests, this might mean training the trees to a greater depth.
First you should check the data that is used for training. I think there is some problem with the data, the data may not be properly pre-processed.
Also, in this case, you should try more epochs. Plot the learning curve to analyze when the model is going to converge.
You should check the following:
1. Both training and validation accuracy scores should increase and loss should decrease.
2. If there is something wrong in step 1 after any particular epoch, then train your model until that epoch only, because your model is over-fitting after that.

Should a neural network be able to have a perfect train accuracy?

The title says it all: Should a neural network be able to have a perfect train accuracy? Mine saturates at ~0.9 accuracy and I am wondering if that indicates a problem with my network or the training data.
Training instances: ~4500 sequences with an average length of 10 elements.
Network: Bi-directional vanilla RNN with a softmax layer on top.
Perfect accuracy on training data is usually a sign of a phenomenon called overfitting (https://en.wikipedia.org/wiki/Overfitting) and the model may generalize poorly to unseen data. So, no, probably this alone is not an indication that there is something wrong (you could still be overfitting but it is not possible to tell from the information in your question).
You should check the accuracy of the NN on the validation set (data your network has not seen during training) and judge its generalizability. Usually it's an iterative process where you train many networks with different configurations in parallel and see which one performs best on the validation set. Also see cross validation (https://en.wikipedia.org/wiki/Cross-validation_(statistics)).
If you have low measurement noise, a model may still not get zero training error. This could be for many reasons including that the model is not flexible enough to capture the true underlying function (which can be a complicated, high-dimensional, non-linear function). You can try increasing the number of hidden layers and nodes but you have to be careful about the same things like overfitting and only judge based on evaluation through cross validation.
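For readers unfamiliar with cross validation, here is a minimal sketch of how k-fold splits can be generated. The function name is hypothetical, and the sketch assumes contiguous, unshuffled folds; real data would normally be shuffled first, and sequence data may need extra care.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds.

    Every sample is used for validation exactly once; the first
    n_samples % k folds get one extra sample.
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# Example: 10 samples split into 3 folds of sizes 4, 3 and 3.
folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Each candidate configuration (e.g. a given number of hidden layers) would be trained once per fold on train_idx and scored on val_idx, and the scores averaged before comparing configurations.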
You can definitely get a 100% accuracy on training datasets by increasing model complexity but I would be wary of that.
You cannot expect your model to be better on your test set than on your training set. This means if your training accuracy is lower than the desired accuracy, you have to change something. Most likely you have to increase the number of parameters of your model.
The reason why you might be ok with not having a perfect training accuracy is (1) the problem of overfitting (2) training time. The more complex your model is, the more likely is overfitting.
You might want to have a look at Structural Risk Minimization (source: svms.org).
