What Does the MAE Actually Tell Me? - machine-learning

I've created a simple linear regression model to predict S&P 500 closing prices, then calculated the Mean Absolute Error (MAE) and got an MAE score of 1290. Now, I don't want to know whether this is right or wrong; I want to know what an MAE of 1290 tells me about my model.

To be honest, "in general" it tells you nearly nothing. The value is fairly arbitrary, and you can only draw conclusions from it if you understand your data well.
MAE stands for Mean Absolute Error, so a value of 1290 means that if you pick a data point at random, you should expect your prediction to be about 1290 away from the true value. Is that good? Bad? It depends on the scale of your output. If your outputs are in the millions, an error this big is nothing and the model is good. If your outputs are in the range of thousands, it is horrible.
If I understand correctly, S&P 500 closing prices have been between 0 and roughly 2500 (over the last 36 years), so an error of 1290 suggests your model learned essentially nothing. It behaves much like a constant model that always answers "1200" or something around that value.
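To make the scale dependence concrete, here is a small sketch (with made-up target ranges, purely for illustration) showing how the same absolute error of 1290 looks tiny on million-scale targets and enormous on targets in the S&P 500 price range:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical targets on two very different scales (illustrative only).
    y_millions = rng.uniform(1e6, 5e6, size=1000)   # outputs in the millions
    y_sp500 = rng.uniform(0, 2500, size=1000)       # roughly the S&P 500 price range

    mae = 1290.0  # the same absolute error on both scales
    print(f"Relative to million-scale targets: {mae / y_millions.mean():.2%}")
    print(f"Relative to S&P-500-scale targets: {mae / y_sp500.mean():.2%}")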

The MAE obtained with a model should always be verified against a baseline model.
A frequently used baseline is median value assignment. Calculate the MAE for the case when all your predictions are equal to the median of your target variable, then see for yourself whether your model's MAE is significantly below that. If it is, congrats.
Note that in this case the baseline MAE will depend on the target distribution. If your test sample contains lots of instances that are really close to the median, it will be almost impossible to get a model with an MAE better than the baseline. Thus, MAE should only be used when your test sample is sufficiently diverse: in the extreme case of a single instance in the test sample, the baseline MAE is 0, which no model can beat.
This issue with MAE is especially noticeable when you compute it on your total sample and then check how it changes across subsamples. Say you have a model that predicts yearly income based on education, age, marital status, etc. You get an MAE of $1.2k against a baseline MAE of $5k, so you conclude that your model is pretty good. Then you check how the model deals with bottom earners and get an MAE of $1.7k with a baseline of $0.5k. The same is likely to happen if you inspect the errors for the 18-22-year-old demographic.
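A minimal sketch of this baseline comparison, using scikit-learn's DummyRegressor on made-up data (the features and target here are purely hypothetical stand-ins for your own):

    import numpy as np
    from sklearn.dummy import DummyRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Hypothetical data: X are your features, y the closing prices.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = 2000 + (X @ rng.normal(size=5)) * 100 + rng.normal(scale=50, size=1000)

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    model = LinearRegression().fit(X_train, y_train)
    baseline = DummyRegressor(strategy="median").fit(X_train, y_train)

    # The model is only useful if its MAE is clearly below the baseline MAE.
    print("model MAE:   ", mean_absolute_error(y_test, model.predict(X_test)))
    print("baseline MAE:", mean_absolute_error(y_test, baseline.predict(X_test)))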

Related

WER for wav2vec2-base model remains at 1 throughout the whole training process

I am trying to run the wav2vec2 speech recognition model as shared in https://huggingface.co/docs/transformers/tasks/asr
This is the loss and WER during the training process: the validation loss is decreasing significantly, whereas the WER remains at 1.
I tried to print out the predicted and label values, and this is what I got for the last 3 outputs, which results in WER = 1.
This is the set of parameters of the model.
What may actually be going wrong here? Please help. Thanks!
I have tried tuning the hyperparameters, hoping to reduce the WER.
Thank you for providing some useful information for troubleshooting.
Your loss is decreasing, which shows that the model is training; however, your learning rate of 0.01 is very high. Consider changing it to something like 1e-5, as shown in the example on Hugging Face.
The other thing I noticed was that all your input text is in UPPER CASE LIKE THIS. Depending on the training data used for the original model, it may not be expecting upper case text. Try lower-casing your text to see if that yields a lower WER.
Your save_steps and eval_steps are also both far too low. These parameters control how often the model is evaluated and checkpointed; with a value of 1 for both, the model is evaluated and saved after every single step, which is wasteful and leaves too little training between evaluations to show improvement. Increase these parameters and try again.
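As a rough sketch of the suggestions above (the exact values are assumptions; only the 1e-5 learning rate comes from the Hugging Face ASR example), the relevant TrainingArguments might look something like this, together with lower-casing the transcripts during dataset preparation:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="wav2vec2-base-asr",   # hypothetical output directory
        learning_rate=1e-5,               # much lower than 0.01, as in the HF ASR example
        evaluation_strategy="steps",      # named eval_strategy in newer transformers releases
        eval_steps=500,                   # evaluate every 500 steps instead of every step
        save_steps=500,                   # checkpoint on the same schedule
        per_device_train_batch_size=8,
        num_train_epochs=10,
        logging_steps=100,
    )

    # Lower-case the transcripts before tokenisation (hypothetical column name).
    def normalise_transcript(batch):
        batch["text"] = batch["text"].lower()
        return batch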

Does tweedie_variance_power matter when log-transforming predictions?

I haven't been able to find any canonical sources on how tweedie_variance_power comes into play when predicting with an XGBoost model trained with objective=reg:tweedie. My dependent variable is the log-transformed auto insurance claim amount, so to get predictions in dollars I apply exp to the "raw" predictions from XGBoost (which look like they are on a log scale).
However (and perhaps this is because the model is not a very good one), when I apply exp(log_predictions), the resulting, presumably dollar-amount predictions are much lower than expected, given the dollar amounts in the training data. Am I missing something? Does my tweedie_variance_power = 2 also need to be accounted for when transforming back to dollar units?
Related question: Xgboost tweedie: Why is the formula to get the prediction from the link = exp(link)/ 2?
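One hedged way to investigate this on your own model (a diagnostic sketch with toy data, not a definitive answer to the question): compare predict() with the raw scores from output_margin=True. If exp of the raw scores matches the default predictions, XGBoost is already applying the inverse log link for you, and applying exp a second time would be a mistake.

    import numpy as np
    import xgboost as xgb

    # Toy stand-in for claim amounts (hypothetical, for illustration only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = rng.gamma(shape=2.0, scale=500.0, size=1000)

    dtrain = xgb.DMatrix(X, label=y)
    params = {"objective": "reg:tweedie", "tweedie_variance_power": 1.5, "eta": 0.1}
    bst = xgb.train(params, dtrain, num_boost_round=50)

    pred = bst.predict(dtrain)                     # default predictions
    raw = bst.predict(dtrain, output_margin=True)  # raw link-scale scores

    # If this prints True, predict() already applies exp() for you.
    print(np.allclose(np.exp(raw), pred))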

Advantage of MAPE loss function over MAE and RMSE

I'm reading this article: Rolling Window Regression: a Simple Approach for Time Series Next value Predictions, where the author explains the difference between five loss functions:
The first question is asking how do we measure success? We do this via a loss function, where we try to minimize the loss function. There are several loss functions, and they have different pros and cons.
I managed to understand the first two loss functions:
MAE (Mean Absolute Error): here all errors, big and small, are treated equally.
Root Mean Square Error (RMSE): this penalizes large errors due to the squared term. For example, with errors [0.5, 0.5] and [0.1, 0.9], the MAE is 0.5 for both, while the RMSE is 0.5 and roughly 0.64 respectively.
But I don't understand the third one:
MAPE (Mean Absolute Percentage Error): since #1 and #2 depend on the value range of the target variable, they cannot be compared across datasets. In contrast, MAPE is a percentage, hence relative. It is like accuracy in a classification problem, where everyone knows 99% accuracy is pretty good.
Why can't they be compared across datasets just because they depend on the value range of the target variable?
Why is MAPE better than them?
I don't understand his explanation.
The thing is that MAPE uses percentages.
With both MAE and RMSE I get an error expressed in the units of the target variable. Therefore in one dataset, let's say beer prices, the numbers will be small, whereas in another dataset, let's say house prices, the numbers will be large. So I cannot compare the success of MAE/RMSE on one dataset to their success on another.
In contrast to them, MAPE expresses the error as a percentage of the true value, so it does not depend on the scale of the numbers in the data itself, and therefore I can compare its success on the beer prices and the house prices datasets.
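A small sketch with made-up numbers that checks both points: the RMSE example quoted above, and the fact that MAPE gives the same value for the same relative error on cheap beer prices and expensive house prices:

    import numpy as np

    def mae(err):  return np.mean(np.abs(err))
    def rmse(err): return np.sqrt(np.mean(err ** 2))

    # The two error vectors from the quoted example.
    e1 = np.array([0.5, 0.5])
    e2 = np.array([0.1, 0.9])
    print(mae(e1), mae(e2))    # 0.5 0.5   -> MAE treats both the same
    print(rmse(e1), rmse(e2))  # 0.5 ~0.64 -> RMSE penalizes the larger error

    def mape(y, pred):
        return np.mean(np.abs((y - pred) / y)) * 100

    # The same 10% error on beer prices and on house prices (hypothetical numbers).
    y_beer,  pred_beer  = np.array([5.0, 6.0]),         np.array([5.5, 6.6])
    y_house, pred_house = np.array([500_000, 600_000]), np.array([550_000, 660_000])
    print(mape(y_beer, pred_beer), mape(y_house, pred_house))  # both 10.0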

What do inconsistent test results mean?

I'm doing some research on CNNs for text classification using TensorFlow. When I run my model I get a very high training accuracy (around 100%). However, on the test split I get inconsistent accuracy results (sometimes 11% and sometimes 90%).
Moreover, I also noticed that the training loss decreases until it reaches small numbers like 0.000499564048368, while the test loss does not, and sometimes it gets high values like 70. What does this mean? Any ideas?
If you get very high training accuracy and bad testing accuracy, you are almost certainly overfitting. To get a better picture of your model's real accuracy, use cross-validation.
Cross-validation splits the dataset into a training and validation set, and does this multiple times, changing which data goes into training and validation each time. This is beneficial because it can prevent scenarios where you train your model on one label and it cannot accurately identify another one. For example, picture a training set like this:
Feature1  Feature2  Label
x         y         0
a         y         0
b         c         1
If we train the model only on the first two data points, it will not be able to identify the third one, because it has never seen label 1 and has not been built to generalize.
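A minimal cross-validation sketch for a text classifier (the tiny dataset and pipeline here are made up just to show the mechanics):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline

    # Tiny hypothetical text-classification dataset.
    texts = ["great movie", "awful film", "loved it", "terrible plot",
             "fantastic acting", "boring and slow", "wonderful story", "worst ever"]
    labels = np.array([1, 0, 1, 0, 1, 0, 1, 0])

    model = make_pipeline(TfidfVectorizer(), LogisticRegression())

    # Stratified folds keep the label balance in every split, so no fold
    # ends up training on a single class like the toy table above.
    cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    scores = cross_val_score(model, texts, labels, cv=cv)
    print(scores, scores.mean())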

training set with only one label, missing the other

Hi, I've been doing a machine learning project about predicting whether a given (query, answer) pair is a good match (label the pair with 1 if it is a good match, 0 otherwise). The problem is that in the training set all the items are labelled with 1. This confuses me, because I don't think such a training set has much discriminative power. To be more specific, I can extract some features like:
1. textual similarity between the query and the answer
2. attributes such as the posting date, who created it, which aspect it is about, etc.
Maybe I should try semi-supervised learning (I've never studied it, so I have no idea whether it will work)? But with such a training set I cannot even do validation...
Actually, you can train on a data set of only positive examples; a one-class SVM does this. However, this presumes that anything "sufficiently outside" the original data set is negative, with "sufficiently outside" controlled mainly by nu (an upper bound on the allowed training error, i.e. the fraction of points treated as outliers) and the kernel parameters (e.g. gamma, or the degree of a polynomial kernel).
A solution for your problem depends on the data you have. You are quite correct that a model trains better when given representative negative examples. The description you give strongly suggests that you do know there are insufficient matches.
Do you need a strict +/- scoring for the matches? Most applications simply rank them: the match strength is the score. This changes your problem from classification to ranking/prediction. If you do need a strict +/- partition (classification), then I suggest slightly altering your training set: include only obvious examples and throw out anything scored near your comfort threshold for declaring a match.
With these inputs only, train your model. You'll have a clear "alley" between good and bad matches, and the model will "decide" which way to judge the in-between cases in testing and production.
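If you do go the positive-examples-only route, a minimal one-class SVM sketch (the feature vectors here are hypothetical stand-ins for whatever features you extract from each (query, answer) pair):

    import numpy as np
    from sklearn.svm import OneClassSVM

    # Hypothetical features for the positively-labelled pairs,
    # e.g. [textual_similarity, recency, author_reputation].
    rng = np.random.default_rng(0)
    positive_pairs = rng.normal(loc=1.0, scale=0.2, size=(500, 3))

    # nu bounds the fraction of training points treated as outliers;
    # gamma controls how tightly the RBF kernel wraps the positive class.
    clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(positive_pairs)

    # +1 = looks like the training (good-match) data, -1 = "sufficiently outside" it.
    new_pairs = np.array([[1.0, 1.1, 0.9],    # similar to the positives
                          [3.0, -2.0, 5.0]])  # far from them
    print(clf.predict(new_pairs))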
