Model parameter selection for time series data - machine-learning

For model parameter selection, we always make a grid-search with cross validation to test which parameters are better than others.
It's right for general training data, like this one, but if data has time relationship with each other, like sells over days or stock over days, is that wrong to do cross validation directly?
As cross validation will use kFold which split randomly in training data, which means for time series data, recent days info will be used for training on old days.
My Question is, how to do parameter selection, or cross validation on time series data?

Yes, it's more common to use backtesting or rolling forecasting, in which you train on either the previous N time periods or else all of the periods up to N, and then test on period N+k (k>=1), or possibly even test on a range of future periods. (e.g., train on the past 60 months of data, and then predict the next 12 months). But the details really depend on the models you're using and the problem domain. Rob Hyndman gives a concrete example, and you can find many more by searching for "kaggle contest time series data cross validation."
In some cases, it makes sense to perform a random split stratified by time. Then perform the rolling training and CV evaluation as described above on the train split, and separately perform a rolling test evaluation on the random test holdout.
In some cases, you can get away with performing ordinary random CV where time is just another feature value (often encoded as many engineered time features, such as cos(2 pi hour_of_day / 24), sin(2 pi hour_of_day / 24), cos(2 pi hour_of_week / 168), etc.). This tends to work better with models such as XGBoost, allowing the model to discover the time dependencies in the data instead of encoding them in the train/test regimen.

Related

How to deal with a skewed Time series data

I have hourly data of no. of minutes spent online by people for 2 years. Hence the values are distributed between 0 and 60 and also most data is either 0 or 60. My goal is to predict the number of minutes the person will spend online in the future (next day/hour/month etc.). What kind of approach or machine learning model can I use to predict this data? Can this be modelled into a regression/forecasting problem in spite of the skewness?hourly data
In the case of time series data and its prediction, it’s better to use a regression model rather than a classification or clustering model. Because it’s related to calculating specific figures.
It can be modeled into a regression problem to some extent, but more skewness means getting far from the normal probability distribution which might influence the expression into the model, lower prediction accuracy, and so forth. Anyway, any data with significant skewness cannot be regarded as well-refined data. So you might need to rearrange the samples of the data so that the skewness of the data can decrease.

Validating accuracy on time-series data with an atypical ending

I'm working on a project to predict demand for a product based on past historical data for multiple stores. I have data from multiple stores over a 5 year period. I split the 5-year time series into overlapping subsequences and use the last 18 months to predict the next 3 and I'm able to make predictions. However, I've run into a problem in choosing a cross-validation method.
I want to have a holdout test split, and use some sort of cross-validation for training my model and tuning parameters. However, the last year of the data was a recession where almost all demand suffered. When I use the last 20% (time-wise) of the data as a holdout set, my test score is very low compared to my OOF cross-validation scores, even though I am using a timeseriessplit CV. This is very likely to be caused by this recession being new behavior, and the model can't predict these strong downswings since it has never seen them before.
The solution I'm thinking of is using a random 20% of the data as a holdout, and a shuffled Kfold as cross-validation. Since I am not feeding any information about when the sequence started into the model except the starting month (1 to 12) of the sequence (to help the model explain seasonality), my theory is that the model should not overfit this data based on that. If all types of economy are present in the data, the results of the model should extrapolate to new data too.
I would like a second opinion on this, do you think my assumptions are correct? Is there a different way to solve this problem?
Your overall assumption is correct in that you can probably take random chunks of time to form your training and testing set. However, when doing it this way, you need to be careful. Rather than predicting the raw values of the next 3 months from the prior 18 months, I would predict the relative increase/decrease of sales in the next 3 months vs. the mean of the past 18 months.
(see here)
http://people.stern.nyu.edu/churvich/Forecasting/Handouts/CourantTalk2.pdf
Otherwise, the correlation between the next 3 months with your prior 18 months data might give you a misleading impression about the accuracy of your model

Classification of Stock Prices Based on Probabilities

I'm trying to build a classifier to predict stock prices. I generated extra features using some of the well-known technical indicators and feed these values, as well as values at past points to the machine learning algorithm. I have about 45k samples, each representing an hour of ohlcv data.
The problem is actually a 3-class classification problem: with buy, sell and hold signals. I've built these 3 classes as my targets based on the (%) change at each time point. That is: I've classified only the largest positive (%) changes as buy signals, the opposite for sell signals and the rest as hold signals.
However, presenting this 3-class target to the algorithm has resulted in poor accuracy for the buy & sell classifiers. To improve this, I chose to manually assign classes based on the probabilities of each sample. That is, I set the targets as 1 or 0 based on whether there was a price increase or decrease.
The algorithm then returns a probability between 0 and 1(usually between 0.45 and 0.55) for its confidence on which class each sample belongs to. I then select some probability bound for each class within those probabilities. For example: I select p > 0.53 to be classified as a buy signal, p < 0.48 to be classified as a sell signal and anything in between as a hold signal.
This method has drastically improved the classification accuracy, at some points to above 65%. However, I'm failing to come up with a method to select these probability bounds without a large validation set. I've tried finding the best probability values within a validation set of 3000 and this has improved the classification accuracy, yet the larger the validation set becomes, it is clear that the prediction accuracy in the test set is decreasing.
So, what I'm looking for is any method by which I could discern what the specific decision probabilities for each training set should be, without large validation sets. I would also welcome any other ideas as to how to improve this process. Thanks for the help!
What you are experiencing is called non-stationary process. The market movement depends on time of the event.
One way I used to deal with it is to build your model with data in different time chunks.
For example, use data from day 1 to day 10 for training, and day 11 for testing/validation, then move up one day, day 2 to day 11 for training, and day 12 for testing/validation.
you can save all your testing results together to compute an overall score for your model. this way you have lots of test data and a model that will adapt to time.
and you get 3 more parameters to tune, #1 how much data to use for train, #2 how much data for test, # per how many days/hours/data points you retrain your data.

Can I use a transfer learning in facebook prophet?

I want my prophet model to predict values for every 10 minute interval over the next 24h (e.g. 24*6=144 values).
Let's say I've trained a model on a huge (over 900k of rows) .csv file where sample row is
...
ds=2018-04-24 16:10, y=10
ds=2018-04-24 16:20, y=14
ds=2018-04-24 16:30, y=12
...
So I call mode.fit(huge_df) and wait for 1-2 seconds to receive 144 values.
And then an hour passes and I want to tune my prediction for the following (144 - 6) 138 values given a new data (6 rows).
How can I tune my existing prophet model without having to call mode.fit(huge_df + live_df) and wait for some seconds again? I'd like to be able to call mode.tune(live_df) and get an instant prediction.
As far as I'm aware this is not really a possibility. I think they use a variant of the BFGS optimization algorithm to maximize the posterior probability of the the models. So as I see it the only way to train the model is to take into account the whole dataset you want to use. The reason why transfer learning works with neural networks is that it is just a weight (parameter) initialization and back propagation is then run iteratively in the standard SGD training schema. Theoretically you could initialize the parameters to the ones of the previous model in the case of prophet, which might or might not work as expected. I'm however not aware that something of the likes is currently implemented (but since its open-source you could give it a shot, hopefully reducing convergence times quite a bit).
Now as far as practical advice goes. You probably don't need all the data, just tail it to what you really need for the problem at hand. For instance it does not make sense to have 10 years of data if you have only monthly seasonality. Also depending on how strongly your data is autocorrelated, you may downsample a bit without loosing any predictive power. Another idea would be to try an algorithm that is suitable for online-learning (or batch) - You could for instance try a CNN with dilated convolution.
Time Series problems are quite different from usual Machine Learning Problems. When we are training cat/dog classifier, the feature set of cats and dogs are not going to change instantly (evolution is slow). But when it comes to the time series problems, training should happen, every time prior to forecasting. This becomes even more important when you are doing univariate forecasting (as is your case), as only feature we're providing to the model is the past values and these value will change at every instance. Because of these concerns, I don't think something like transfer learning will work in time series.
Instead what you can do is, try converting your time series problem into the regression problem by use of rolling windowing approach. Then, you can save that model and get you predictions. But, make sure to train it again and again in short intervals of time, like once a day or so, depending upon how frequently you need a forecast.

Using Random Forest for time series dataset

For a time series dataset, I would like to do some analysis and create prediction model. Usually, we would split data (by random sampling throughout entire data set) into training set and testing set and use the training set with randomForest function. and keep the testing part to check the behaviour of the model.
However, I have been told that it is not possible to split data by random sampling for time series data.
I would appreciate if someone explain how to split data into training and testing for time series data. Or if there is any alternative to do time series random forest.
Regards
We live in a world where "future-to-past-causality" only occurs in cool scifi movies. Thus, when modeling time series we like to avoid explaining past events with future events. Also, we like to verify that our models, strictly trained on past events, can explain future events.
To model time series T with RF rolling is used. For day t, value T[t] is the target and values T[t-k] where k= {1,2,...,h}, where h is the past horizon will be used to form features. For nonstationary time series, T is converted to e.g. the relatively change Trel. = (T[t+1]-T[t]) / T[t].
To evaluate performance, I advise to check the out-of-bag cross validation measure of RF. Be aware, that there are some pitfalls possibly rendering this measure over optimistic:
Unknown future to past contamination - somehow rolling is faulty and the model using future events to explain the same future within training set.
Non-independent sampling: if the time interval you want to forecast ahead is shorter than the time interval the relative change is computed over, your samples are not independent.
possible other mistakes I don't know of yet
In the end, everyone can make above mistakes in some latent way. To check that is not happening you need to validate your model with back testing. Where each day is forecasted by a model strictly trained on past events only.
When OOB-CV and back testing wildly disagree, this may be a hint to some bug in the code.
To backtest, do rolling on T[t-1 to t-traindays]. Model this training data and forecast T[t]. Then increase t by one, t++, and repeat.
To speed up you may train your model only once or at every n'th increment of t.
Reading Sales File
Sales<-read.csv("Sales.csv")
Finding length of training set.
train_len=round(nrow(Sales)*0.8)
test_len=nrow(Sales)
Splitting your data into training and testing set here I have considered 80-20 split you can change that. Make sure your data in sorted in ascending order.
Training Set
training<-slice(SubSales,1:train_len)
Testing Set
testing<-slice(SubSales,train_len+1:test_len)

Resources