Time series with XGBoost

I am working with XGBoost on some time series data and I am getting a large RMSE value:
I scaled all the data (including the target) and got the logical result of values between 0 and 1:
Can I say that my model is accurate based on the values obtained on the scaled data?

Generally, we use MAE as the test statistic for real-world data.
High MSE is an indicator that there are big outliers in your predictions.
MAE vs MSE:
Mean Absolute Error (MAE) is less susceptible to outliers since it penalises every error in proportion to its size rather than its square.
It is used in cases where performance is measured using continuous variable data.
It is a linear score: every individual difference carries the same weight in the average.
Mean Squared Error (MSE) is more susceptible to outliers as it "penalises" them heavily by squaring each error.
This metric is useful when large errors on outliers or unexpected values (too high or too low) should count heavily.
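To make the MAE vs MSE difference concrete, here is a minimal sketch in plain Python. The numbers are made up for illustration and the helper names `mae` and `mse` are my own, not from any library.

```python
def mae(actual, predicted):
    """Mean absolute error: average of |actual - predicted|."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean squared error: average of (actual - predicted) squared."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical targets and two sets of predictions (illustrative values only).
actual = [10.0, 12.0, 11.0, 13.0, 12.0]
steady = [11.0, 11.0, 12.0, 12.0, 13.0]  # every prediction is off by 1
spiky = [10.0, 12.0, 11.0, 13.0, 22.0]   # four exact, one off by 10

print("steady: MAE", mae(actual, steady), "MSE", mse(actual, steady))
print("spiky:  MAE", mae(actual, spiky), "MSE", mse(actual, spiky))
```

A single error of 10 doubles the MAE compared with the steady predictions but multiplies the MSE by twenty, which is the outlier sensitivity described above.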
Additional tips:
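You should also look into the Root Mean Squared Error (RMSE) metric.
It is expressed in the same units as the target, which makes your model's prediction errors easier to interpret and fix.
If RMSE is close to MAE, the model makes many relatively small errors of similar size.
If RMSE is much larger than MAE, the model makes few but large errors.
MAE ≤ RMSE ≤ √n · MAE for n predictions (for Regression). MSE is in squared units, so it is not part of this ordering: it is smaller than RMSE whenever RMSE is below 1, which is typical for a target scaled to [0, 1].
For that reason a small RMSE on scaled data does not by itself mean the model is accurate; convert the error back to the original units before judging it.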
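Regarding the scaled target in the question, here is a rough sketch of how an RMSE measured on scaled data maps back to the original units. I am assuming min-max scaling of the target (the question only says the values end up between 0 and 1), and the data and helper names are hypothetical.

```python
import math


def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical target values and predictions in the original units.
actual = [100.0, 150.0, 200.0, 250.0, 300.0]
predicted = [110.0, 140.0, 215.0, 240.0, 290.0]

# Assumed min-max scaling, fitted on the target.
lo, hi = min(actual), max(actual)


def scale(values):
    return [(v - lo) / (hi - lo) for v in values]


rmse_original = rmse(actual, predicted)
rmse_scaled = rmse(scale(actual), scale(predicted))

# Min-max scaling divides every error by (hi - lo), so multiplying back recovers the original RMSE.
rmse_recovered = rmse_scaled * (hi - lo)

print(rmse_scaled, rmse_recovered, rmse_original)
```

The scaled RMSE looks small only because the errors were divided by the target range. The same data also satisfies MAE ≤ RMSE ≤ √n · MAE, which holds for any set of errors.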

Related

LSTM higher validation accuracy at higher validation loss ( Python: Keras )

When training my LSTM ( using the Keras library in Python ) the validation loss keeps increasing, although it eventually does obtain a higher validation accuracy. Which leads me to 2 questions:
How/Why does it obtain a (significantly) higher validation accuracy at a (significantly) higher validation loss?
Is it problematic that the validation loss increases? ( because it eventually does obtain a good validation accuracy either way )
This is an example history log of my LSTM for which this applies:
As visible when comparing epoch 0 with epoch ~430:
52% val accuracy at 1.1 val loss vs. 61% val accuracy at 1.8 val loss
For the loss function I'm using tf.keras.losses.CategoricalCrossentropy and I'm using the SGD optimizer at a high learning rate of 50-60% ( as it obtained the best validation accuracy with it ).
Initially I thought it may be overfitting, but then I don't understand how the validation accuracy does eventually get quite a lot higher at almost 2 times as high of a validation loss.
Any insights would be much appreciated.
EDIT: Another example of a different run, less fluctuating validation accuracy but still significantly higher validation accuracy as the validation loss increases:
In this run I used a low instead of high dropout.
As you stated, "at a high learning rate of 50-60%", this might be the reason why the graphs are oscillating. Lowering the learning rate or adding regularization should reduce the oscillation.
More generally,
Cross-entropy loss is unbounded, so a few very badly predicted outliers can make it explode.
Accuracy can still go up, which means your model is learning the rest of the dataset apart from the outliers.
If the validation set has many outliers, they can cause the loss values to oscillate.
To conclude whether you are overfitting or not, you should inspect the validation set for outliers.
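A small numeric sketch of how accuracy and cross-entropy can both go up at once. The probabilities are invented and stand for the probability the model assigns to the true class of each validation sample; I use the binary case, where a prediction counts as correct when that probability is above 0.5.

```python
import math


def accuracy(true_class_probs):
    """Fraction of samples whose true class gets probability above 0.5."""
    return sum(p > 0.5 for p in true_class_probs) / len(true_class_probs)


def cross_entropy(true_class_probs):
    """Mean negative log-probability assigned to the true class."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)


# Hypothetical early epoch: the model is unsure about everything.
early = [0.55] * 5 + [0.45] * 5

# Hypothetical late epoch: confident and right on 7 samples, confident and wrong on 3.
late = [0.99] * 7 + [0.001] * 3

print("early:", accuracy(early), round(cross_entropy(early), 3))
print("late: ", accuracy(late), round(cross_entropy(late), 3))
```

The three confidently wrong samples dominate the mean loss, so the late epoch has the higher accuracy and also the higher loss.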

Advantage of MAPE loss function over MAE and RMSE

I'm reading this article: Rolling Window Regression: a Simple Approach for Time Series Next value Predictions and he explains there the difference between five loss functions:
The first question is asking how do we measure success? We do this via
a loss function, where we try to minimize the loss function. There are
several loss functions, and they have different pros and cons.
I managed to understand the first two loss functions:
MAE (Mean Absolute Error): here all errors, big and small, are treated equally
Root Mean Square Error (RMSE): this penalizes large errors due to the squared term. For example, with errors [0.5, 0.5] and [0.1, 0.9],
MAE for both will be 0.5 while RMSE is 0.5 and about 0.64.
But I don't understand the third one:
MAPE (Mean Absolute Percentage Error): Since #1 and #2 depend on the value range of the target variable, they cannot be compared
across datasets. In contrast, MAPE is a percentage, hence relative. It
is like accuracy in a classification problem, where everyone knows 99%
accuracy is pretty good.
Why can't they be compared across datasets, given that they depend on the value range of the target variable?
Why is MAPE better than them?
I don't understand his explanation.
The thing is, MAPE uses percentages.
With both MAE and RMSE I get the mean absolute error, or the root of the mean squared error, of a dataset in the units of that dataset. Therefore in one dataset, let's say beer prices, the numbers will be small, whereas in another dataset, let's say house prices, the numbers will be large. Therefore I cannot compare the success of MAE/RMSE on one dataset to their success on another.
In contrast, MAPE expresses the error as a percentage of the actual value, so it does not depend on the size of the numbers in the data itself, and therefore I can compare its success on the beer prices and the house prices datasets.
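Here is a minimal sketch of that comparison. The beer and house prices are made-up numbers, chosen so that every prediction is off by exactly 10%, and the helper names are my own.

```python
def mae(actual, predicted):
    """Mean absolute error, in the units of the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical prices; each prediction misses the actual value by 10%.
beer_actual = [2.0, 3.0, 4.0]
beer_pred = [2.2, 2.7, 4.4]
house_actual = [200_000.0, 300_000.0, 400_000.0]
house_pred = [220_000.0, 270_000.0, 440_000.0]

print("beer:   MAE", mae(beer_actual, beer_pred), "MAPE", mape(beer_actual, beer_pred))
print("houses: MAE", mae(house_actual, house_pred), "MAPE", mape(house_actual, house_pred))
```

The MAE values differ by a factor of 100,000 even though both models are equally good in relative terms, while the MAPE is about 10% for both. Note that MAPE is undefined when an actual value is zero.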

How to deal with this unbalanced-class skewed data-set?

I have to deal with Class Imbalance Problem and do a binary-classification of the input test data-set where majority of the class-label is 1 (the other class-label is 0) in the training data-set.
For example, following is some part of the training data :
93.65034,94.50283,94.6677,94.20174,94.93986,95.21071,1
94.13783,94.61797,94.50526,95.66091,95.99478,95.12608,1
94.0238,93.95445,94.77115,94.65469,95.08566,94.97906,1
94.36343,94.32839,95.33167,95.24738,94.57213,95.05634,1
94.5774,93.92291,94.96261,95.40926,95.97659,95.17691,0
93.76617,94.27253,94.38002,94.28448,94.19957,94.98924,0
where the last column is the class-label, 0 or 1. The actual data-set is very skewed, with roughly a 1:10 ratio of classes: around 700 samples have 0 as their class label, while the remaining 6800 have 1 as their class label.
The rows above are only a few of the samples in the given data-set. In the full data-set about 90% of the samples have class-label 1 and the rest have class-label 0, even though nearly all the samples look very similar.
Which classifier should be best for handling this kind of data-set ?
I have already tried logistic-regression as well as svm with class-weight parameter set as "balanced", but got no significant improvement in accuracy.
"but got no significant improvement in accuracy."
Accuracy isn't the way to go (e.g. see Accuracy paradox). With a 10:1 ratio of classes you can easily get about 90% accuracy just by always predicting the majority class (class-label 1 in your data).
Some good starting points are:
try a different performance metric, e.g. F1-score or the Matthews correlation coefficient
"resample" the dataset: add examples from the under-represented class (over-sampling) or delete instances from the over-represented class (under-sampling, which needs a lot of data)
a different point of view: anomaly detection is worth trying for an imbalanced dataset
a different algorithm is another possibility, but not a silver bullet. Probably you should start with decision trees (they often perform well on imbalanced datasets)
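As a rough illustration of the over-sampling idea, here is a minimal random over-sampler in plain Python, run on the six rows from the question. The function name `oversample_minority` and its parameters are hypothetical, not part of any library.

```python
import random


def oversample_minority(rows, label_index=-1, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until all classes have equal size."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_index], []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        # Draw with replacement from the class to fill the gap to the largest class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend(group + extra)
    rng.shuffle(balanced)
    return balanced


# The sample rows from the question; the last column is the class label.
rows = [
    (93.65034, 94.50283, 94.6677, 94.20174, 94.93986, 95.21071, 1),
    (94.13783, 94.61797, 94.50526, 95.66091, 95.99478, 95.12608, 1),
    (94.0238, 93.95445, 94.77115, 94.65469, 95.08566, 94.97906, 1),
    (94.36343, 94.32839, 95.33167, 95.24738, 94.57213, 95.05634, 1),
    (94.5774, 93.92291, 94.96261, 95.40926, 95.97659, 95.17691, 0),
    (93.76617, 94.27253, 94.38002, 94.28448, 94.19957, 94.98924, 0),
]

balanced = oversample_minority(rows)
labels = [row[-1] for row in balanced]
print(labels.count(0), labels.count(1))
```

Resample only the training split, not the test split. Otherwise duplicated rows leak into the evaluation and inflate the scores.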
EDIT (now knowing you're using scikit-learn)
The weights from the class_weight (scikit-learn) parameter are used to train the classifier (so balanced is ok) but accuracy is a poor choice to know how well it's performing.
The sklearn.metrics module implements several loss, score and utility functions to measure classification performance. Also take a look at How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit learn?.
Have you tried plotting a ROC curve and computing the AUC to check your parameters and different thresholds? If not, that should give you a good starting point.
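To see why accuracy misleads here, this sketch scores a "classifier" that always predicts the majority label 1, using the class counts from the question (about 6800 ones and 700 zeros). The helper functions are hand-written so the arithmetic is visible; in practice `sklearn.metrics` has `f1_score` and `matthews_corrcoef` for the same job. I treat the minority class 0 as the positive class, which is my own choice.

```python
import math


def confusion_counts(y_true, y_pred, positive=0):
    """Return (tp, fp, tn, fn), treating `positive` as the class of interest."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc_from_counts(tp, fp, tn, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is taken as 0 when the denominator is 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Class counts taken from the question; predictions are always the majority label.
y_true = [1] * 6800 + [0] * 700
y_pred = [1] * len(y_true)

tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive=0)
accuracy = (tp + tn) / len(y_true)

print("accuracy:", round(accuracy, 3))
print("F1 (class 0):", f1_from_counts(tp, fp, fn))
print("MCC:", mcc_from_counts(tp, fp, tn, fn))
```

Accuracy comes out at about 0.907, while the F1-score for class 0 and the MCC are both 0, which reflects that the minority class is never detected.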

Cross Validation in Classification

I have two different datasets, dataset X and dataset Y, from which I calculate features to use for classification.
Case 1. When I combine both together as one large dataset and then use 10-fold cross validation, I get very good classification results, with accuracy and AUC > 95%.
Case 2. Yet if I use one of the datasets for training and the other for testing, results fall severely, with both accuracy and AUC becoming ~50%.
My questions are:
Which of the cases' results is more reliable??
And why the huge difference in results??
Thanks..
There could be a bias in the way the datasets were obtained that makes you get worse results.
Read this.
Another thing is that in one case you are training your classifier with a smaller dataset (the two combined are larger, assuming they are about the same size, even with the 10-fold cross validation). This usually leads to poorer performance.
So my answers would be:
Depends on how you obtained both datasets and on how the final classifier will be used.
Differences in the size of the training set and bias on how they are obtained.
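A toy sketch of how a bias between the two datasets can produce exactly this pattern. Everything here is an assumption made for illustration: a single synthetic feature, a constant offset between dataset X and dataset Y, and a 1-nearest-neighbour classifier.

```python
def nearest_neighbour_label(train, x):
    """Label of the training point closest to x (1-nearest-neighbour on one feature)."""
    return min(train, key=lambda item: abs(item[0] - x))[1]


def accuracy(train, test):
    hits = sum(nearest_neighbour_label(train, x) == y for x, y in test)
    return hits / len(test)


def make_dataset(offset, n_per_class=20):
    """Class 0 sits near `offset`, class 1 near `offset + 2`; the jitter is deterministic."""
    data = []
    for label in (0, 1):
        for i in range(n_per_class):
            jitter = (i - (n_per_class - 1) / 2) / n_per_class
            data.append((offset + 2 * label + jitter, label))
    return data


dataset_x = make_dataset(offset=0.0)
dataset_y = make_dataset(offset=10.0)  # same classes, but shifted by how the data was acquired

# Case 1: pool both datasets and run 10-fold cross-validation.
pooled = dataset_x + dataset_y
k = 10
fold_scores = []
for fold in range(k):
    test = [s for i, s in enumerate(pooled) if i % k == fold]
    train = [s for i, s in enumerate(pooled) if i % k != fold]
    fold_scores.append(accuracy(train, test))
pooled_cv_accuracy = sum(fold_scores) / k

# Case 2: train on dataset X only, test on dataset Y only.
cross_dataset_accuracy = accuracy(dataset_x, dataset_y)

print("pooled 10-fold CV:", pooled_cv_accuracy)
print("train X, test Y:  ", cross_dataset_accuracy)
```

In the pooled case every test sample has close neighbours from its own dataset in the training folds, so the score looks excellent. Trained on X alone, the classifier has never seen Y's offset and falls to chance level. Which number to trust depends on whether future data will look like the pooled mix or like an unseen dataset.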

Why the average weight of rnn keeps climbing?

I'm using Pybrain to train a recurrent neural network. However, the average of the weights keeps climbing and after several iterations the train and test accuracy become lower. Now the highest performance on train data is about 55% and on test data is about 50%.
I think the RNN may have training problems because of its large weights. How can I solve this? Thank you in advance.
The usual way to restrict the network parameters is to use a constrained error functional that penalizes the magnitude of the parameters. This is what "weight decay" does: you add a norm of the weights ||w|| to your sum-of-squares error. Usually this is the (squared) Euclidean norm, but sometimes the 1-norm is used, in which case it is called "Lasso". Weight decay with the Euclidean norm is also known as ridge regression or Tikhonov regularization.
In PyBrain, according to this page in the documentation, a Lasso version of weight decay is available, which can be controlled through the parameter wDecay.
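For intuition, here is a library-independent sketch of weight decay with the squared Euclidean norm. It is not PyBrain's implementation, and the names `decay` and `learning_rate` are hypothetical. The penalty adds `decay * w` to each gradient, so every step pulls the weights toward zero.

```python
def penalised_error(weights, residuals, decay):
    """Sum-of-squares error plus an L2 penalty of 0.5 * decay * ||w||^2."""
    sse = sum(r ** 2 for r in residuals)
    return sse + 0.5 * decay * sum(w ** 2 for w in weights)


def decayed_step(weights, gradients, learning_rate, decay):
    """One gradient step on the penalised error; `gradients` are those of the unpenalised error."""
    return [w - learning_rate * (g + decay * w) for w, g in zip(weights, gradients)]


# With a zero data gradient, only the decay term acts and the weights shrink geometrically.
weights = [4.0, -3.0]
zero_gradients = [0.0, 0.0]
for _ in range(3):
    weights = decayed_step(weights, zero_gradients, learning_rate=0.1, decay=0.5)

print(weights)
```

Each step multiplies the weights by (1 - learning_rate * decay), here 0.95, which is what keeps their average magnitude from climbing without bound.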
