XGBoost forecasting model missing holiday period - machine-learning

I am building a forecasting system to predict the number of cable subscribers that will disconnect at a given point in time. I am using Python, and of the different models I tried, XGBoost performs best.
I have a self-referential system in place which works in a moving-window fashion, e.g. as I run out of actuals, I start using forecasted figures in my lags.
To build the forecasting system, I used the previous 800 days of lags (disconnects per day), moving averages, ratios, seasonality, and indicators for year, month, day, week, etc. However, holidays are where it gets a little messed up. Initially I used just one column to indicate holidays of all sorts, but later I figured out that different holidays may have a different impact (some holidays cause high sales, some cause churn), so I added a column for each holiday. I also added indicators for long weekends and for holidays which fall on a Sunday, and a 'season' column indicating festive periods such as Thanksgiving and the New Year holidays.
Even after adding so many holiday-related columns, I largely miss the Thanksgiving and New Year period. Although the model does take care of holidays to some extent, it completely misses the spike. And as can be seen from the chart, the spikes are a trend and appear every year (orange). My forecast (grey) does address the holidays in Dec '17, but it under-forecasts. Any idea how that can be taken care of?
P.S. I tuned the XGBoost hyperparameters using grid search.
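For concreteness, here is a minimal sketch of the kind of per-holiday and season columns I mean (the holiday names, dates, and column names are just illustrative, not my real feature set):

```python
import pandas as pd

# Illustrative sketch: one indicator column per holiday instead of a single
# generic "is_holiday" flag. Holiday names and dates here are made up.
holiday_dates = {
    "thanksgiving": ["2017-11-23", "2018-11-22"],
    "christmas":    ["2017-12-25", "2018-12-25"],
    "new_year":     ["2018-01-01"],
}

df = pd.DataFrame(index=pd.date_range("2017-01-01", "2018-12-31", freq="D"))
for name, dates in holiday_dates.items():
    df[f"hol_{name}"] = df.index.isin(pd.to_datetime(dates)).astype(int)

# Festive-season indicator covering roughly Thanksgiving through New Year.
df["festive_season"] = ((df.index.month == 12) |
                        ((df.index.month == 11) & (df.index.day >= 20))).astype(int)

# Long-weekend flag: any holiday falling on a Monday (0) or Friday (4).
is_holiday = df.filter(like="hol_").any(axis=1)
df["long_weekend"] = (is_holiday & df.index.dayofweek.isin([0, 4])).astype(int)
```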

As I understand it, if you cleaned your data and removed outliers, your model will give a more stable prediction set overall, but it will fail to predict said outliers.
If you did clean the data, I'd play with the outlier threshold and see whether the wider regular-day errors balance against the ability to predict the higher spikes.
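Alternatively, if you keep the outliers in, you can up-weight the spike days so the model is penalized more for missing them. A rough sketch (the synthetic data and the weight of 5.0 are placeholders to tune):

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))           # stand-in feature matrix
y = rng.normal(size=1000)                 # stand-in target (daily disconnects)
spike_mask = rng.random(1000) < 0.05      # stand-in flag for known spike days

# Up-weight the spike days so the training loss penalizes missing them more.
# The factor 5.0 is an arbitrary starting point to tune like any hyperparameter.
weights = np.where(spike_mask, 5.0, 1.0)

model = XGBRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X, y, sample_weight=weights)
```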

Related

Features from consecutive measurements for classification

I'm currently working on a small machine learning project.
The task deals with medical data from a couple of thousand patients. For each patient, 12 measurements of the same set of vital signs were taken, each one hour apart.
These measurements need not have been taken immediately after the patient entered the hospital but may start with some offset. However, the patient stays 24 hours in the hospital in total, so the measurements cannot start later than 11 hours after admission.
Now the task is to predict for each patient whether none, one or multiple of 10 possible tests will be ordered during the remainder of the stay, and also to predict the future mean value of some of the vital signs for the remainder of the stay.
I have a training set that comes together with the labels that I should predict.
My question is mainly about how I can process the features. I thought about turning the measurement results for a patient into one long vector and using it as a training example for a classifier.
However, I'm not quite sure how I should include the time information of each measurement in the features (should I even consider time at all?).
If I understood correctly, you want to include the time information of each measurement in the features. One approach is to make an empty vector of length 24, as the patient stays for 24 hours in the hospital, and use a one-hot representation: for example, if measurements were taken in the 12th, 15th and 20th hours of the stay, your time feature vector will have a 1 at the 12th, 15th and 20th positions and zeros everywhere else. You can append this time vector to the other features to make a single vector for each patient of length = length(other vector) + length(time vector). Or you can use different approaches to combine these features.
Please let me know if this approach makes sense for you. Thanks.
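A small sketch of what I mean, with made-up shapes (12 hourly measurements of 5 vital signs):

```python
import numpy as np

N_HOURS = 24  # total length of the hospital stay

def build_feature_vector(measurements, measurement_hours):
    # One-hot time vector: 1 at each hour of the stay a measurement was taken.
    time_vec = np.zeros(N_HOURS)
    time_vec[measurement_hours] = 1.0
    # Append the time vector to the flattened measurement values.
    return np.concatenate([measurements.ravel(), time_vec])

# Toy usage: 12 hourly measurements of 5 vitals, starting 3 hours after admission.
vitals = np.random.rand(12, 5)
hours = list(range(3, 15))                 # hours 3..14 of the stay
x = build_feature_vector(vitals, hours)    # length 12*5 + 24 = 84
```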

Random cut forest for detecting anomalies in periodic time series patterns

I have time series data that measures the volume of an activity in half-hour intervals.
The activity has weekly periodic patterns, e.g. on Monday mornings the volume is highest, at weekends the volume is low, etc.
I couldn't work out whether RRCF detects periodic patterns and gives a different score to a volume that would be considered normal on a Monday morning but abnormal on a Thursday morning.
Of course, any suggestion of another algorithm would also be appreciated.
Technically yes, the algorithm is able to see this distinction. The reason is that RCF works by randomly cutting on features and seeing which points are most "isolated" (roughly; the score it actually computes is a bit more complex). If volumes on Monday mornings are always high, then a given high Monday point will not be easily isolated, because there will be many points with the same distribution. If, however, there is a point on a Wednesday that is particularly high, and the algorithm randomly splits on the volume and on the weekday, it will most likely be able to see that the point stands alone.
However, it is important to give the algorithm the means to split well. In particular, it is not trivial to split over weekdays, which form a categorical variable.
The best encoding for this kind of variable is a sine-cosine encoding, transforming the weekday into two sine/cosine variables, so that the algorithm can easily split on the days while maintaining the notion of distance between them (i.e. Monday is as close to Sunday as it is to Tuesday), which you would lose with a label encoder.
If the encoding is not clear, try reading this; it should explain the concept better:
https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/
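A minimal sketch of that encoding on half-hourly timestamps (the dates are made up):

```python
import numpy as np
import pandas as pd

# Half-hourly timestamps for one week, starting on a Monday.
df = pd.DataFrame({"timestamp": pd.date_range("2023-01-02", periods=7 * 48, freq="30min")})
weekday = df["timestamp"].dt.dayofweek        # 0 = Monday, ..., 6 = Sunday

# Sine/cosine encoding: Monday and Sunday end up as close to each other
# as Monday and Tuesday, which a label encoder would not preserve.
df["weekday_sin"] = np.sin(2 * np.pi * weekday / 7)
df["weekday_cos"] = np.cos(2 * np.pi * weekday / 7)
```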

Can / should I use past (e.g. monthly) label columns from a database as features in an ML prediction (no time-series!)?

The question: Is it normal / usual / professional to use the past of the labels as features?
I could not find anything reliable on this, although it is a basic question.
Edited: Please note, this is not a time-series question; I have deleted the time-series tag and changed the question. This question is about features that change regularly over time, yes! But we do not create a time series from this, as there are many other features as well which are not like the label and are also important features in the model. So please think of using past labels as normal features without a time-series approach.
I try to predict a certain month of data that is available monthly, thus technically a time series, but I am not using it as a time series; it is just monthly available data of various different features.
It is a classification model, and I want to predict a label column of a selected month of that series. The months before the selected label month are the point of the question.
I do not want to drop the past months of the label just because they are "almost" a label (or in other words: they were the label columns of the preceding models in time). I know the past of the label, so why not consider it as features as well?
My predictions are of course much better when I add the past labels to the features. This is logical, as the labels usually do not change much from one month to the next and thus can be predicted very well if you have fed the model with the past of the label. It would be strange not to use such "past labels" as features, as any simple time-series regression would otherwise beat the ML model.
Example: Let's say I predict the IQ test result of a person, and I use her past IQ test results as features in addition to other normal "non-label" features like age, education and so on. I use the first 11 months of "past labels" of a year as features in addition to my normal "non-label" features, and I predict the label of the 12th month.
Predicting the label of the 12th month works much better if you add the past of the labels to the features, obviously. This is because the historical labels, if there are any, are better indicators of the final outcome than ordinary columns like age and education.
Possibly related p.s.:
p.s.1: In auto-regressive models, the past of the dependent variable can well be used as an independent variable; see: https://de.wikipedia.org/wiki/Regressionsanalyse
p.s.2: In ML you can perhaps just try any features and take what gives you the best results, a bit like "Good question, try them [feature selection methods] all and see what works best" in https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/, and "If the features are relevant to the outcome, the model will figure out how to use them. Or most models will." The same is said in Does the feature selection matter for learning algorithm with regularization?
p.s.3: Also probably relevant is the problem of multicollinearity: https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/, though multicollinearity is said to be no issue for prediction: "Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, precision of the predictions, and the goodness-of-fit statistics. If your primary goal is to make predictions, and you don't need to understand the role of each independent variable, you don't need to reduce severe multicollinearity."
It is perfectly possible and also good practice to include past label columns as features, though it depends on your goal: do you want to explain the label only with other features (on purpose), or do you want to use both the other features and your past label columns to get the next label predicted, as a sort of adding time-series character to the model without using a time-series approach?
The sequence in time is not even important, as long as all such monthly columns are shifted consistently by the same period when going over to the prediction set. The model does not care whether a column is January or February of the same column type; for the model, every feature is isolated.
Example: You can perfectly well run a random forest on various features, including past label columns that repeat the same column type again and again, only representing different months. Any month's column can be treated as an independent new feature in the ML model; the only important thing is to shift all of those monthly columns by exactly the same period to reach a consistent prediction set. In other words, you should obviously avoid replacing the January column with the March column when you go from a training set of January-June to a prediction set of February-July; instead you must of course replace January with February.
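A minimal sketch of such consistently shifted past-label columns (the monthly labels are made up):

```python
import pandas as pd

# Made-up monthly labels; the lag columns are the "past labels" as features.
df = pd.DataFrame({
    "month": pd.period_range("2022-01", periods=12, freq="M").astype(str),
    "label": [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1],
})

# Every lag column is shifted by the same amount, so moving the whole window
# one month forward (train on Jan-Jun, predict Feb-Jul) stays consistent.
for k in (1, 2, 3):
    df[f"label_lag_{k}"] = df["label"].shift(k)

# Rows whose lags would reach before the start of the data are dropped.
df = df.dropna()
```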
Update 2023-01: this model setup is called "walk-forward"; see Why isn't out-of-time validation more ubiquitous? --> option 3, almost at the bottom of the page.
I got this from a comment at Splitting Time Series Data into Train/Test/Validation Sets.
The illustration there shows only a training and a testing set. It says "validation set", but this terminology gets mixed up all over the place (see What is the Difference Between Test and Validation Datasets?), and it must be meant as the testing set in the default understanding of the term.
Thus, with the right wording, it is a walk-forward scheme with a training set and a testing set only. This should be the best model for labels that become features in time.
Validation set in a "walk-forward" model?
As you can see in the model, no validation set is needed, since the test data must be biased "forward" in time (that is the whole idea of predicting the step forward in time), and any validation set would have to lie in that same biased artificial future, which is already the past at the time of training, though the model does not know this.
The validation happens by default, without a separate dataset split, during the walk-forward, when the model learns again and again to predict the future and the output metrics can be compared against each other. As the model is meant to predict the time-biased future, there is no need to prove how the artificial future is biased and sort of "overtrained by time". It is the aim of the model to have the validation in the artificial future and to predict the real future only as a last step.
But then, why not still have a validation set on top of this, at least a small k-fold validation? It could play a role if the testing set contains a few strong changes that happen in small time windows but are still important to be predicted, or at least hinted at, yet should also not be overtrained within each training step. The validation set would hit some of these time windows and might show whether the model can handle them well enough. Any method other than k-fold would shrink the power of the model too much: the more you take away from the testing set during training, the less it can predict the future.
Wrap-up:
Try it out, and if in doubt, leave the validation aside and judge the model by checking its metrics over time during the walk-forward. This model is not like the others.
Thus, in the end, you can, but do not have to, split a k-fold validation from the testing set. After predicting a lot of known futures, the very last step in time is then the prediction of the unknown future.
This also answers Does the training+testing set have to be different from the predicting set (so that you need to apply a time-shift to ALL columns)?.
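For illustration, a rough sketch of the walk-forward loop itself, on synthetic stand-in data (the split points and the model are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(36, 8))        # stand-in: 36 months of features
y = rng.integers(0, 2, size=36)     # stand-in: 36 monthly labels

# Walk-forward: train on everything up to month t, test on month t,
# then roll one month forward and retrain. The sequence of scores over
# time is the validation signal; no separate validation split is taken.
scores = []
for t in range(24, 36):
    model = RandomForestClassifier(n_estimators=200, random_state=0)
    model.fit(X[:t], y[:t])
    scores.append(accuracy_score(y[t:t + 1], model.predict(X[t:t + 1])))

print(np.round(scores, 2))
```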

Time Series Forecasting: weekly vs daily predictions

I have some daily time series data. I am trying to predict the next 3 days from the historical daily set of data.
The historical data shows a definite pattern based upon the day of week: Mondays and Tuesdays are high, Wednesday is typically highest, and volume then decreases over the remaining part of the week.
If I group the data monthly or weekly, I can definitely see a trend over time going up that appears to be additive.
My goal is to predict the next 3 days only. My intuition is telling me to take one approach and I am hoping for some feedback on pros/cons versus other approaches.
My intuition tells me it might be better to group the data by week or month and then predict the next week or month. Suppose I predict the next week's total by loading historical weekly data into ARIMA, then train, test and predict the next week. Within a week, each day of week typically contributes x percent of that weekly total. So, if Wednesday has historically contributed on average 50% of the weekly volume, and for the next week I predict 1000, then I would predict Wednesday to be 500. Is this a common approach? (A toy sketch of the allocation step follows.)
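To make the idea concrete, with made-up day-of-week shares:

```python
# Made-up historical day-of-week shares of the weekly volume (sum to 1).
historical_shares = {"Mon": 0.15, "Tue": 0.15, "Wed": 0.50,
                     "Thu": 0.10, "Fri": 0.05, "Sat": 0.03, "Sun": 0.02}

predicted_week_total = 1000          # output of the weekly ARIMA model
daily_forecast = {day: share * predicted_week_total
                  for day, share in historical_shares.items()}
# daily_forecast["Wed"] == 500.0, matching the 50% example above.
```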
Alternatively, I could load the historical daily values into ARIMA, train, test and let ARIMA predict the next 3 days. The big difference here is the whole "predict weekly" versus "predict daily" question.
In the time series forecasting space, is this a common debate? If so, perhaps someone can suggest some keywords I can Google to educate myself on the pros and cons.
Also, perhaps there is a suggested algorithm to use when day of week is a factor?
Thanks in advance for any responses.
Dan
This is a standard daily time series problem with day-of-week seasonality. If you are using R, you could make the series a ts object with frequency = 7 and then use auto.arima() from the forecast package to forecast it. Any other seasonal forecasting method is also potentially applicable.
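If you are in Python instead, a rough analogue of that R recipe, assuming the pmdarima package (a port of auto.arima), would be:

```python
import numpy as np
import pmdarima as pm

# Made-up daily series with a trend plus day-of-week seasonality.
rng = np.random.default_rng(0)
t = np.arange(120)
y = 100 + 0.5 * t + 20 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 3, 120)

# m=7 plays the role of frequency = 7 in the R ts object.
model = pm.auto_arima(y, seasonal=True, m=7, suppress_warnings=True)
forecast = model.predict(n_periods=3)   # the next 3 days
```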

Is it necessary to make time series data stationary before applying tree-based ML methods, e.g. Random Forest or XGBoost?

As in the case of ARIMA models, where we have to make our data stationary: is it necessary to make time series data stationary before applying tree-based ML methods?
I have a dataset of customers with monthly electricity consumption for the past 2 to 10 years, and I am supposed to predict each customer's consumption for the next 5 to 6 months. In the dataset, some customers show strange behavior: for a particular month their consumption varies considerably from what they consumed in the same month of the previous year or the previous 3 to 4 years, and this change is not because of temperature. As we don't know the reason behind this change, the model is unable to predict that consumption correctly.
So would making each customer's time series stationary help in this case or not?
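(For reference, a minimal sketch of what I mean by making the series stationary, e.g. a year-over-year difference, on made-up monthly data:)

```python
import numpy as np
import pandas as pd

# Made-up monthly consumption with a linear trend and annual seasonality.
idx = pd.period_range("2014-01", periods=96, freq="M")
consumption = pd.Series(
    200 + np.arange(96) + 50 * np.sin(2 * np.pi * np.arange(96) / 12),
    index=idx,
)

# A 12-month (year-over-year) difference removes the annual seasonality
# and the linear trend, which is one way to make such a series stationary.
yoy_diff = consumption.diff(12).dropna()
```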
