Low R2 but high MAPE - machine-learning

I'm currently working on a project where I have to solve a regression problem. I have to try different models and compare the accuracy of each one. So far I've tried Decision Trees, Random Forests and Bagging. Currently, I'm trying out ANNs. The metrics that I'm using to evaluate model performance are the R2 score, RMSE and MAPE. For the first 3 models, the results that I'm getting are:
Decision Tree:
R2: 0.608
RMSE:11.640681667132872
MAPE:78.73%
Bagging:
R2: 0.752
RMSE:9.193
MAPE:78.46%
Random Forest:
R2: 0.726
RMSE:9.731
MAPE: 78.27%
However, with the ANN, the results that I'm getting are really baffling.
R2:0.264
RMSE:12.034
MAPE:88.73%
As you can see, although the R2 score is very low compared to the other models, the MAPE accuracy is surprisingly high. Can anyone please give me some insight as to why this might be happening?
The code that I'm using for calculating the MAPE accuracy is:
import numpy as np

# Function to calculate MAPE accuracy (100 - MAPE)
def evaluate(model, test_features, test_labels):
    predictions = model.predict(test_features)
    errors = abs(predictions - test_labels)
    # Note: this divides by the raw labels, so labels at or near zero inflate the MAPE
    mape = 100 * np.mean(errors / test_labels)
    accuracy = 100 - mape
    print('Model Performance')
    print('Average Error: {:0.4f} degrees.'.format(np.mean(errors)))
    print('Accuracy = {:0.2f}%.'.format(accuracy))
    return accuracy
P.S. I'm using the hold out method of evaluation.
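R2 and MAPE weight errors differently, so they need not rank models the same way: R2 is driven by squared absolute errors (dominated by large targets), while MAPE is driven by errors relative to each target (dominated by small targets). The sketch below uses made-up numbers, not the data from the question, and the helper names r2 and mape are hypothetical; it only illustrates that one set of predictions can win on R2 and lose on MAPE.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# Made-up targets spanning two orders of magnitude
y_true = np.array([1.0, 2.0, 100.0, 200.0])
# Model A: exact on the large targets, off on the small ones
pred_a = np.array([3.0, 5.0, 100.0, 200.0])
# Model B: exact on the small targets, off on the large ones
pred_b = np.array([1.0, 2.0, 130.0, 160.0])

r2_a, r2_b = r2(y_true, pred_a), r2(y_true, pred_b)
mape_a, mape_b = mape(y_true, pred_a), mape(y_true, pred_b)
print("A: R2 = %.4f, MAPE = %.1f%%" % (r2_a, mape_a))
print("B: R2 = %.4f, MAPE = %.1f%%" % (r2_b, mape_b))
```

Here model A has the higher R2 but the worse MAPE, so a "100 - MAPE" accuracy can move in the opposite direction from R2.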

Related

High AUC and 100% recall, but precision and F1 are low

I have an imbalanced dataset with 43323 rows, 9 of which belong to the 'failure' class; the other rows belong to the 'normal' class. I trained a classifier with 100% recall and 94.89% AUC on test data (0.75/0.25 split with stratify = y). However, the classifier has 0.18% precision and a 0.37% F1 score. I assumed I could find a better F1 score by changing the threshold, but I failed (I checked thresholds between 0 and 1 with step = 0.01). Also, it seems weird to me, because usually when dealing with an imbalanced dataset it is hard to get a high recall. The goal is to get a better F1 score. What can I do as a next step? Thanks!
(To be clear, I used SMOTE to upsample the failure samples in training dataset)

Getting 100% recall is trivial in fact: just classify everything as 1.
Is the precision/recall curve any good? Perhaps a more thorough scan could yield a better result:
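import numpy as np
from sklearn.metrics import precision_recall_curve
# predict_proba returns one column per class; keep the positive-class column
probabilities = model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probabilities)
# precision and recall have one more entry than thresholds, so drop the last point
f1_scores = 2 * recall[:-1] * precision[:-1] / (recall[:-1] + precision[:-1] + 1e-12)
best_f1 = np.max(f1_scores)
best_thresh = thresholds[np.argmax(f1_scores)]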
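The point about trivially reaching 100% recall can be checked with a few lines. This is a sketch that reuses the class counts from the question (9 failures in 43323 rows) as an illustrative label vector, not the actual test split:

```python
import numpy as np

# Illustrative labels: 9 positives ("failure") among 43323 rows
y_true = np.zeros(43323, dtype=int)
y_true[:9] = 1

# Degenerate classifier: predict "failure" for every row
y_pred = np.ones_like(y_true)

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
print("recall=%.4f precision=%.6f f1=%.6f" % (recall, precision, f1))
```

Recall is perfect while precision equals the positive-class prevalence, which is why recall alone says little on data this imbalanced.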

Display inverted ROC Curve

My anomaly detection algorithm gave me an array of predictions where all values greater than 0 should belong to the normal class (= 0) and all others should be classified as anomalies (= 1). I built my evaluation as follows (I have three datasets: one with only non-anomaly values and the other two with only anomaly values):
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve, RocCurveDisplay

normal = np.load('normal_score.pkl')
anom_1 = np.load('anom1_score.pkl')
anom_2 = np.load('anom2_score.pkl')
y_normal = np.asarray([0]*len(normal)) # I know they are normal
y_anom_1 = np.asarray([1]*len(anom_1)) # I know they are anomaly
y_anom_2 = np.asarray([1]*len(anom_2)) # I know they are anomaly
score = np.concatenate([normal, anom_1, anom_2])
y = np.concatenate([y_normal, y_anom_1, y_anom_2])
auc = roc_auc_score(y, score)
fpr, tpr, thresholds = roc_curve(y, score)
display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc)
The AUC score I get is 0.02 and the plot looks like:
From what I understood this result is great because I should just reverse the labels to make it almost 0.98, but my question is: is there a way to specify it and automatically reverse it through a function?
The values in my normal score data are all in the range (21;57) and the anomalies values are in the range (-1090; -1836) so it should be easy to spot them.

"I should just reverse the labels to make it almost 0.98"
That's not how it should be done. It is because if you can predict "normal", let's say, with 95% confidence, you can not infer from this that you can also predict "anomaly" with the same confidence.
It becomes crucial in case of heavily imbalanced data which is probably the case here.
You should define which of these two you want to predict with high confidence and what are the target prediction metrics. For example, if you have a target on the precision and recall for predicting the "anomaly" then that should be your class "1" and calculate the metrics accordingly, and vice versa.
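On the narrower question of orientation: roc_auc_score and roc_curve treat larger scores as stronger evidence for the positive label (1). In the question the anomalies (label 1) have the lowest scores, which is why the AUC comes out near 0. A minimal sketch, assuming synthetic scores drawn from the ranges quoted in the question, shows that negating the score (rather than swapping the labels) gives the complementary AUC while keeping "anomaly" as class 1:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Synthetic stand-ins for the real score files (ranges taken from the question)
normal = rng.uniform(21, 57, size=200)          # label 0
anomalies = rng.uniform(-1836, -1090, size=50)  # label 1

score = np.concatenate([normal, anomalies])
y = np.concatenate([np.zeros(len(normal)), np.ones(len(anomalies))])

auc_raw = roc_auc_score(y, score)       # low scores mean "anomaly": AUC near 0
auc_flipped = roc_auc_score(y, -score)  # negated: high values mean "anomaly"
print(auc_raw, auc_flipped)
```

The same negated score can be passed to roc_curve so that the plotted curve is oriented the usual way.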

When do I use scoring vs metrics to evaluate ML performance

Hi, what is the basic difference between 'scoring' and 'metrics'? Both are used to measure performance, but how do they differ?
For example, in the snippet below, cross-validation uses 'neg_mean_squared_error' for scoring:
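from sklearn import model_selection
from sklearn.linear_model import LinearRegression

X = array[:, 0:13]
Y = array[:, 13]
seed = 7
# shuffle=True is required by recent scikit-learn versions when random_state is set
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
model = LinearRegression()
scoring = 'neg_mean_squared_error'
results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
# The format arguments belong inside print(); results holds negated MSE values
print("MSE: %.3f (%.3f)" % (results.mean(), results.std()))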
But in the XGBoost example below I am using metrics='rmse':
import xgboost as xgb

cmatrix = xgb.DMatrix(data=X, label=y)
# 'reg:squarederror' is the current name of the objective formerly called 'reg:linear'
params = {'objective': 'reg:squarederror', 'max_depth': 3}
cv_results = xgb.cv(dtrain=cmatrix, params=params, nfold=3, num_boost_round=5, metrics='rmse', as_pandas=True, seed=123)
print(cv_results)

how do they differ?
They don't; these are just different terms for the same thing.
To be very precise, scoring is the process in which one measures the model performance according to some metric (or score). The scikit-learn choice of the name scoring for the argument (as in your first snippet) is rather unfortunate (it actually implies a scoring function), as the MSE (and its variants, such as negative MSE and RMSE) are metrics or scores. But practically speaking, as shown in your example snippets, the two terms are used as synonyms and are frequently used interchangeably.
The real distinction of interest here is not between "score" and "metric", but between loss (often referred to as cost) and metrics such as the accuracy (for classification problems); this is often a source of confusion among new users. You may find my answers in the following threads useful (ignore the Keras mentions in some titles, the answers are generally applicable):
Loss & accuracy - Are these reasonable learning curves?
How does Keras evaluate the accuracy?
Optimizing for accuracy instead of loss in Keras model
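To make the "same thing, different name" point concrete, here is a small sketch on synthetic data (made up for this illustration) that compares the values returned by cross_val_score with scoring='neg_mean_squared_error' against the mean_squared_error metric computed by hand on the same folds:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (made up for this illustration)
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

kfold = KFold(n_splits=5, shuffle=True, random_state=7)

# Route 1: let cross_val_score compute the score named by `scoring`
scores = cross_val_score(LinearRegression(), X, y, cv=kfold,
                         scoring='neg_mean_squared_error')

# Route 2: compute the MSE metric by hand on the same folds
manual_mse = []
for train_idx, test_idx in kfold.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    manual_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
manual_mse = np.array(manual_mse)

print(np.allclose(scores, -manual_mse))
```

The "neg_" prefix exists because scikit-learn scorers follow a higher-is-better convention, so the score is simply the metric with its sign flipped.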

Is there an optimizer in keras based on precision or recall instead of loss?

I am developing a segmentation neural network with only two classes, 0 and 1 (0 is the background and 1 is the object that I want to find in the image). On each image, about 80% is class 1 and 20% is class 0. As you can see, the dataset is unbalanced and it makes the results wrong. My accuracy is 85% and my loss is low, but that is only because my model is good at finding the background!
I would like to base the optimizer on another metric, like precision or recall, which is more useful in this case.
Does anyone know how to implement this?

You don't optimize precision or recall directly. You just track them as validation scores to pick the best weights. Do not mix up the loss, the optimizer, and the metrics; they are not meant for the same thing.
# Assumes the Keras 2 API, where the backend is exposed as keras.backend
import numpy as np
from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout

THRESHOLD = 0.5

def precision(y_true, y_pred, threshold_shift=0.5-THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)
    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)
    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))
    precision = tp / (tp + fp)
    return precision

def recall(y_true, y_pred, threshold_shift=0.5-THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)
    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)
    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))
    recall = tp / (tp + fn)
    return recall

def fbeta(y_true, y_pred, beta = 2, threshold_shift=0.5-THRESHOLD):
    # just in case
    y_pred = K.clip(y_pred, 0, 1)
    # shifting the prediction threshold from .5 if needed
    y_pred_bin = K.round(y_pred + threshold_shift)
    tp = K.sum(K.round(y_true * y_pred_bin)) + K.epsilon()
    fp = K.sum(K.round(K.clip(y_pred_bin - y_true, 0, 1)))
    # use the binarized predictions here too, consistent with recall() above
    fn = K.sum(K.round(K.clip(y_true - y_pred_bin, 0, 1)))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    beta_squared = beta ** 2
    return (beta_squared + 1) * (precision * recall) / (beta_squared * precision + recall)

def model_fit(X, y, X_test, y_test):
    # weight the positive class by the inverse of its frequency
    class_weight = {
        1: 1/(np.sum(y) / len(y)),
        0: 1}
    np.random.seed(47)
    model = Sequential()
    model.add(Dense(1000, input_shape=(X.shape[1],)))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(500))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(250))
    model.add(Activation('relu'))
    model.add(Dropout(0.35))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    # the loss drives training; fbeta, precision and recall are only tracked
    model.compile(loss='binary_crossentropy', optimizer='adamax', metrics=[fbeta, precision, recall])
    model.fit(X, y, validation_data=(X_test, y_test), epochs=200, batch_size=50, verbose=2, class_weight=class_weight)
    return model

No. To do gradient descent, you need to compute a gradient, and for this the function needs to be reasonably smooth. Precision, recall and accuracy are not smooth functions of the weights: they are piecewise constant, with sharp jumps where the gradient is undefined and flat regions where the gradient is zero. Hence you cannot use a gradient-based numerical method to find a minimum of such a function; you would have to use some kind of combinatorial optimization, and that would be NP-hard.
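The "flat places" argument can be seen numerically. The following is a minimal sketch on a made-up one-parameter classifier (the data, the helper names and the finite-difference step are all assumptions for illustration): nudging the weight leaves the thresholded accuracy unchanged, while the log loss responds smoothly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy one-parameter classifier: p = sigmoid(w * x)
x = np.array([-2.0, -1.0, 1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 1.0])

def accuracy(w, threshold=0.5):
    return np.mean((sigmoid(w * x) > threshold) == (y == 1))

def log_loss(w):
    p = sigmoid(w * x)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

w, eps = 0.5, 1e-3
# Central finite differences as a stand-in for the gradient
acc_slope = (accuracy(w + eps) - accuracy(w - eps)) / (2 * eps)
loss_slope = (log_loss(w + eps) - log_loss(w - eps)) / (2 * eps)
print("d(accuracy)/dw ~", acc_slope)
print("d(log loss)/dw ~", loss_slope)
```

The accuracy slope is exactly zero, so it gives an optimizer no direction to move in, whereas the log loss slope is non-zero and points toward a better weight.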

As others have stated, precision/recall is not directly usable as a loss function. However, better proxy loss functions have been found that help with a whole family of precision/recall related functions (e.g. ROC AUC, precision at fixed recall, etc.)
The research paper Scalable Learning of Non-Decomposable Objectives covers this with a method to sidestep the combinatorial optimization by the use of certain calculated bounds, and some Tensorflow code by the authors is available at the tensorflow/models repository. Additionally, there is a followup question on StackOverflow that has an answer that adapts this into a usable Keras loss function.
Special thanks to Francois Chollet and other participants on the Keras issue thread here that turned up that research paper. You may also find that thread provides other useful insights into the problem at hand.

Having the same problem with an unbalanced dataset, I'd suggest you use the F1 score as the metric of your optimizer.
Andrew Ng teaches that having ONE metric for the model is the simplest (best?) way to train a model. If you have 2 metrics, like precision and recall - it's not clear which one is more important. Trying to set limits on one metric obviously impacts the other metric...
The F1 score combines recall and precision: it is their harmonic mean.
Keras, which I'm using, unfortunately has no implementation of the F1 score as a metric, as it does for accuracy and many other Keras metrics https://keras.io/api/metrics/.
I found an implementation of the F1 score as a Keras metric, used at each epoch at:
https://medium.com/#aakashgoel12/how-to-add-user-defined-function-get-f1-score-in-keras-metrics-3013f979ce0d
I've implemented the simple function from the above article and the model trains now on F1 score as its Keras optimizer metric. Results on test: accuracy went down a bit and F1 score went up a lot.
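For reference, the harmonic-mean definition mentioned above is short enough to write out. This is a plain-Python sketch with made-up precision and recall values, not the Keras metric from the linked article:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

precision, recall = 0.5, 1.0
f1 = f1_score(precision, recall)
arithmetic_mean = (precision + recall) / 2
print("F1 = %.4f, arithmetic mean = %.4f" % (f1, arithmetic_mean))
```

The harmonic mean is pulled toward the smaller of the two values, so a model cannot score well on F1 by maximising only one of them.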

I have the same problem with an unbalanced dataset for binary classification, and I want to increase the recall (sensitivity) too. I found that tf.keras has a built-in recall metric that can be used in the compile statement as follows:
from tensorflow.keras.metrics import BinaryAccuracy, Recall
# BinaryAccuracy thresholds the sigmoid output at 0.5; plain Accuracy() checks for exact equality with the labels
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=[BinaryAccuracy(), Recall()])

The recommended approach for an unbalanced dataset like yours is to use class_weight or sample_weight. See the model fit API for details.
Quote:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
With weights that are inversely proportional to the class frequency the loss will avoid just predicting the background class.
I understand that this is not how you formulated the question but imho it is the most practical approach to the issue you are facing.
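As a sketch of what "inversely proportional to the class frequency" can look like, assuming made-up labels with the 80%/20% split from the question and the common "balanced" heuristic (one of several reasonable choices):

```python
import numpy as np

# Made-up labels with the imbalance described in the question: 80% ones, 20% zeros
y = np.array([1] * 80 + [0] * 20)

counts = np.bincount(y)            # samples per class: [20, 80]
n_classes = len(counts)
# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
weights = len(y) / (n_classes * counts)
class_weight = {cls: float(w) for cls, w in enumerate(weights)}
print(class_weight)
```

The resulting dictionary can be passed as the class_weight argument of model.fit, as described in the quoted documentation.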

I think that the Callbacks and Early Stopping mechanisms provide one with techniques that can lead you as close as possible to what you want to achieve. Please read the following article by Jason Brownlee about Early Stopping (read to the end!):
https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/

Cost function training target versus accuracy desired goal

When we train neural networks, we typically use gradient descent, which relies on a continuous, differentiable real-valued cost function. The final cost function might, for example, take the mean squared error. Or put another way, gradient descent implicitly assumes the end goal is regression - to minimize a real-valued error measure.
Sometimes what we want a neural network to do is perform classification - given an input, classify it into two or more discrete categories. In this case, the end goal the user cares about is classification accuracy - the percentage of cases classified correctly.
But when we are using a neural network for classification, though our goal is classification accuracy, that is not what the neural network is trying to optimize. The neural network is still trying to optimize the real-valued cost function. Sometimes these point in the same direction, but sometimes they don't. In particular, I've been running into cases where a neural network trained to correctly minimize the cost function, has a classification accuracy worse than a simple hand-coded threshold comparison.
I've boiled this down to a minimal test case using TensorFlow. It sets up a perceptron (neural network with no hidden layers), trains it on an absolutely minimal dataset (one input variable, one binary output variable) assesses the classification accuracy of the result, then compares it to the classification accuracy of a simple hand-coded threshold comparison; the results are 60% and 80% respectively. Intuitively, this is because a single outlier with a large input value, generates a correspondingly large output value, so the way to minimize the cost function is to try extra hard to accommodate that one case, in the process misclassifying two more ordinary cases. The perceptron is correctly doing what it was told to do; it's just that this does not match what we actually want of a classifier. But the classification accuracy is not a continuous differentiable function, so we can't use it as the target for gradient descent.
How can we train a neural network so that it ends up maximizing classification accuracy?
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
tf.set_random_seed(1)
# Parameters
epochs = 10000
learning_rate = 0.01
# Data
train_X = [
[0],
[0],
[2],
[2],
[9],
]
train_Y = [
0,
0,
1,
1,
0,
]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))
# Model
pred = tf.tensordot(X, W, 1) + b
cost = tf.reduce_sum((pred-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Train
for epoch in range(epochs):
    # Print update at successive doublings of time
    if epoch&(epoch-1) == 0 or epoch == epochs-1:
        print('{} {} {} {}'.format(
            epoch,
            cost.eval({X: train_X, Y: train_Y}),
            W.eval(),
            b.eval(),
        ))
    optimizer.run({X: train_X, Y: train_Y})
# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))
# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))
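The 60% versus 80% figures can be reproduced without TensorFlow, assuming the gradient-descent loop above has essentially converged to the minimum of the squared-error cost. The sketch below swaps the training loop for a closed-form least-squares fit via np.linalg.lstsq; it is a stand-in for the trained perceptron, not the code from the question.

```python
import numpy as np

# Same data as in the question
train_X = np.array([[0.0], [0.0], [2.0], [2.0], [9.0]])
train_Y = np.array([0.0, 0.0, 1.0, 1.0, 0.0])

# Least-squares fit of pred = W * x + b (append a column of ones for the bias)
A = np.hstack([train_X, np.ones((len(train_X), 1))])
(W, b), *_ = np.linalg.lstsq(A, train_Y, rcond=None)

pred = A @ np.array([W, b])
perceptron_correct = int(np.sum((pred > 0.5) == (train_Y == 1)))
threshold_correct = int(np.sum((train_X[:, 0] > 1.0) == (train_Y == 1)))

print('W = {:.4f}, b = {:.4f}'.format(W, b))
print('{}/{} = least-squares accuracy'.format(perceptron_correct, len(train_Y)))
print('{}/{} = threshold accuracy'.format(threshold_correct, len(train_Y)))
```

The outlier at x = 9 pulls the fitted line down so far that both x = 2 points fall below the 0.5 cut-off, which is the effect described in the question.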

How can we train a neural network so that it ends up maximizing classification accuracy?
I'm asking for a way to get a continuous proxy function that's closer to the accuracy
To start with, the loss function used today for classification tasks in (deep) neural nets was not invented with them, but it goes back several decades, and it actually comes from the early days of logistic regression. Here is the equation for the simple case of binary classification:
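Assuming the loss referred to here is the standard binary cross-entropy (log loss) from logistic regression, and writing $y_i \in \{0, 1\}$ for the true label and $p_i$ for the predicted probability of class 1 over $n$ samples (these symbols are an assumption, chosen to match the p[i] notation used further down), it reads:

```latex
L = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
```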
The idea behind it was exactly to come up with a continuous & differentiable function, so that we would be able to exploit the (vast, and still expanding) arsenal of convex optimization for classification problems.
It is safe to say that the above loss function is the best we have so far, given the desired mathematical constraints mentioned above.
Should we consider this problem (i.e. better approximating the accuracy) solved and finished? At least in principle, no. I am old enough to remember an era when the only activation functions practically available were tanh and sigmoid; then came ReLU and gave a real boost to the field. Similarly, someone may eventually come up with a better loss function, but arguably this is going to happen in a research paper, and not as an answer to a SO question...
That said, the very fact that the current loss function comes from very elementary considerations of probability and information theory (fields that, in sharp contrast with the current field of deep learning, stand upon firm theoretical foundations) creates at least some doubt as to if a better proposal for the loss may be just around the corner.
There is another subtle point on the relation between loss and accuracy, which makes the latter something qualitatively different than the former, and is frequently lost in such discussions. Let me elaborate a little...
All the classifiers related to this discussion (i.e. neural nets, logistic regression etc) are probabilistic ones; that is, they do not return hard class memberships (0/1) but class probabilities (continuous real numbers in [0, 1]).
Limiting the discussion for simplicity to the binary case, when converting a class probability to a (hard) class membership, we are implicitly involving a threshold, usually equal to 0.5, such that if p[i] > 0.5, then class[i] = "1". Now, we can find many cases where this naive default choice of threshold will not work (heavily imbalanced datasets are the first to come to mind), and we'll have to choose a different one. But the important point for our discussion here is that this threshold selection, while being of central importance to the accuracy, is completely external to the mathematical optimization problem of minimizing the loss, and serves as a further "insulation layer" between them, compromising the simplistic view that loss is just a proxy for accuracy (it is not). As nicely put in the answer of this Cross Validated thread:
the statistical component of your exercise ends when you output a probability for each class of your new sample. Choosing a threshold beyond which you classify a new observation as 1 vs. 0 is not part of the statistics any more. It is part of the decision component.
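A small sketch of that separation, using made-up probabilities and labels: the loss is fixed once the probabilities are, yet the accuracy still depends on which threshold is applied afterwards.

```python
import numpy as np

# Made-up predicted probabilities and labels (positives are the majority here)
p = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.90])
y = np.array([0, 0, 1, 1, 1, 1, 1, 1])

def log_loss(p, y):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def accuracy(p, y, threshold):
    return np.mean((p > threshold).astype(int) == y)

loss = log_loss(p, y)            # depends only on the probabilities
acc_default = accuracy(p, y, 0.5)
acc_shifted = accuracy(p, y, 0.25)
print("loss=%.4f acc@0.5=%.3f acc@0.25=%.3f" % (loss, acc_default, acc_shifted))
```

Nothing about the model or its loss changes between the two accuracy values; only the decision rule does.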
Enlarging somewhat an already broad discussion: can we possibly move completely away from the (very) limiting constraint of mathematical optimization of continuous & differentiable functions? In other words, can we do away with back-propagation and gradient descent?
Well, we are actually doing so already, at least in the sub-field of reinforcement learning: 2017 was the year when new research from OpenAI on something called Evolution Strategies made headlines. And as an extra bonus, here is an ultra-fresh (Dec 2017) paper by Uber on the subject, again generating much enthusiasm in the community.

I think you are forgetting to pass your output through a sigmoid. Fixed below:
import numpy as np
import tensorflow as tf
sess = tf.InteractiveSession()
tf.set_random_seed(1)
# Parameters
epochs = 10000
learning_rate = 0.01
# Data
train_X = [
[0],
[0],
[2],
[2],
[9],
]
train_Y = [
0,
0,
1,
1,
0,
]
rows = np.shape(train_X)[0]
cols = np.shape(train_X)[1]
# Inputs and outputs
X = tf.placeholder(tf.float32)
Y = tf.placeholder(tf.float32)
# Weights
W = tf.Variable(tf.random_normal([cols]))
b = tf.Variable(tf.random_normal([]))
# Model
# CHANGE HERE: Remember, you need an activation function!
pred = tf.nn.sigmoid(tf.tensordot(X, W, 1) + b)
cost = tf.reduce_sum((pred-Y)**2/rows)
optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
tf.global_variables_initializer().run()
# Train
for epoch in range(epochs):
    # Print update at successive doublings of time
    if epoch&(epoch-1) == 0 or epoch == epochs-1:
        print('{} {} {} {}'.format(
            epoch,
            cost.eval({X: train_X, Y: train_Y}),
            W.eval(),
            b.eval(),
        ))
    optimizer.run({X: train_X, Y: train_Y})
# Classification accuracy of perceptron
classifications = [pred.eval({X: x}) > 0.5 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = perceptron accuracy'.format(correct, rows))
# Classification accuracy of hand-coded threshold comparison
classifications = [x[0] > 1.0 for x in train_X]
correct = sum([p == y for (p, y) in zip(classifications, train_Y)])
print('{}/{} = threshold accuracy'.format(correct, rows))
The output:
0 0.28319069743156433 [ 0.75648874] -0.9745011329650879
1 0.28302448987960815 [ 0.75775659] -0.9742625951766968
2 0.28285878896713257 [ 0.75902224] -0.9740257859230042
4 0.28252947330474854 [ 0.76154679] -0.97355717420578
8 0.28187844157218933 [ 0.76656926] -0.9726400971412659
16 0.28060704469680786 [ 0.77650583] -0.970885694026947
32 0.27818527817726135 [ 0.79593837] -0.9676888585090637
64 0.2738055884838104 [ 0.83302218] -0.9624817967414856
128 0.26666420698165894 [ 0.90031379] -0.9562843441963196
256 0.25691407918930054 [ 1.01172411] -0.9567816257476807
512 0.2461051195859909 [ 1.17413962] -0.9872989654541016
1024 0.23519910871982574 [ 1.38549554] -1.088881492614746
2048 0.2241383194923401 [ 1.64616168] -1.298340916633606
4096 0.21433120965957642 [ 1.95981205] -1.6126530170440674
8192 0.2075471431016922 [ 2.31746769] -1.989408016204834
9999 0.20618653297424316 [ 2.42539024] -2.1028473377227783
4/5 = perceptron accuracy
4/5 = threshold accuracy
