I have some daily time series data. I am trying to predict the next 3 days from the historical daily set of data.
The historical data shows a definite day-of-week pattern: Monday and Tuesday are high, Wednesday is typically the highest, and volume decreases over the rest of the week.
If I group the data monthly or weekly, I can clearly see an upward trend over time that appears to be additive.
My goal is to predict the next 3 days only. My intuition is telling me to take one approach and I am hoping for some feedback on pros/cons versus other approaches.
My intuition tells me it might be better to group the data by week or month and then predict the next week or month. Suppose I load historical weekly totals into ARIMA, train, test, and predict the next week's total. Within a week, each day of the week typically contributes some fixed percentage of that weekly total. So if Wednesday has historically contributed 50% of the weekly volume on average, and I predict 1,000 for the next week, then I would predict 500 for Wednesday. Is this a common approach?
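For what it's worth, this "predict the total, then split it" idea is usually discussed under the keywords top-down forecasting and hierarchical (or temporal hierarchical) reconciliation. A minimal sketch of the split step in Python — the shares below are made-up illustrative numbers, and `weekly_forecast` stands in for whatever the weekly ARIMA model produces:

```python
# Historical average share of the weekly total contributed by each day.
# These shares are illustrative; in practice, estimate them from your data.
day_shares = {
    "Mon": 0.15, "Tue": 0.15, "Wed": 0.50,
    "Thu": 0.10, "Fri": 0.05, "Sat": 0.03, "Sun": 0.02,
}

def disaggregate(weekly_forecast, shares):
    """Split a predicted weekly total into daily predictions."""
    return {day: weekly_forecast * share for day, share in shares.items()}

daily = disaggregate(1000, day_shares)
print(daily["Wed"])  # 50% of 1000 -> 500.0
```

The daily numbers are guaranteed to sum back to the weekly forecast, which is the main appeal of the top-down approach; its main weakness is that it assumes the day-of-week shares are stable over time.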
Alternatively, I could load the historical daily values into ARIMA, train, test, and let ARIMA predict the next 3 days directly. The big difference here is "predict weekly" versus "predict daily".
In the time series forecasting space, is this a common debate? If so, perhaps someone can suggest some keywords I can Google to educate myself on the pros and cons.
Also, is there a suggested algorithm to use when day of week is a factor?
Thanks in advance for any responses.
Dan
This is a standard daily time series problem where there is a day-of-week seasonality. If you are using R, you could make the time series a ts object with frequency = 7 and then use auto.arima() from the forecast package to forecast it. Any other seasonal forecasting method is also potentially applicable.
In the GitHub notebook below, the author predicts the next day ("1" day ahead) with a multivariate LSTM model. But I think he is using data up to and including the same day, and also predicting for that same day. I am not sure, though.
Is it okay to include a day's own data when predicting that day? What can I change in this code to predict the next day's price without using that day's own data?
https://github.com/flo7up/relataly-public-python-tutorials/blob/master/007%20Time%20Series%20Forecasting%20-%20Multivariate%20Time%20Series%20Models.ipynb
Thanks in advance...
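One standard way to rule out that kind of leakage, regardless of the model, is to shift the target: the features on row t come from day t, while the label is day t+1's price. A minimal pandas sketch (the column names here are made up for illustration, not taken from the linked notebook):

```python
import pandas as pd

# Toy frame standing in for the notebook's feature matrix.
df = pd.DataFrame({
    "price":  [10.0, 11.0, 12.0, 13.0],
    "volume": [100, 110, 90, 95],
})

# Label = next day's price; the features on row t never include
# the value being predicted for day t+1.
df["target"] = df["price"].shift(-1)
df = df.dropna(subset=["target"])  # the last row has no next-day label

X = df[["price", "volume"]]  # features observed on day t
y = df["target"]             # price on day t+1
print(len(X), y.iloc[0])
```

With this framing, a model trained on (X, y) is by construction predicting tomorrow from today, so same-day leakage cannot occur.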
I have developed a machine learning regression tool for energy forecasting, where as a test set I need data from a weather API. This weather API gives me values with different time steps: minute values for one hour, hourly values for 48 hours, and daily values for 7 days. I want my forecast to return results for those same time frames, but of course, since it is energy, the results differ depending on whether the values are per minute, per hour, or per day.
Does anyone have experience dealing with a test set that has irregular time steps? Would I have to train my model at each time step to produce forecasts at minute, hourly, and daily resolution?
Info about the weather API: https://openweathermap.org/api/one-call-api
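One common way to reconcile mixed resolutions like this is to resample everything onto each target frequency and train (or at least evaluate) one model per frequency. A pandas sketch, under the assumption that energy is additive so aggregation means summing:

```python
import pandas as pd

# Minute-level energy readings (stand-in for the finest-grained data).
idx = pd.date_range("2021-01-01", periods=120, freq="min")
energy = pd.Series(1.0, index=idx)  # 1 unit per minute

# Aggregate the same series to each horizon the weather API uses.
hourly = energy.resample("h").sum()  # for the 48-hour forecasts
daily = energy.resample("D").sum()   # for the 7-day forecasts

print(hourly.iloc[0], daily.iloc[0])  # 60.0 per hour, 120.0 on day 1
```

Whether to sum or average on aggregation depends on whether the quantity is cumulative energy or instantaneous power, so that choice should be checked against the API's units.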
Use the Facebook Prophet model for time series forecasting. Build a future frame with make_future_dataframe(periods=...) for whatever horizon you need, then call predict() on it to get the forecast.
I'm trying to predict the price of tomatoes. I've collected a dataset that contains past tomato prices, to which I've also added features that might affect the change in price: for example, monthly agricultural wages, the monthly inflation rate, and monthly rainfall. Does this qualify as a multivariate time series? What machine learning technique can be used to solve this problem? The constraint is that there are only 48 data points (4 years × 12 months). Also, can the train and test sets be built using cross-validation?
Columns in my dataset:
Year
Month
Tomato price
Wage
Inflation
Rainfall
Number of festivals in the month
Thanks in advance!
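On the cross-validation part of the question: with only 48 ordered monthly points, a shuffled split would leak future information into training, so time-series cross-validation, where every training fold precedes its test fold, is the usual choice. A sketch with scikit-learn's TimeSeriesSplit on dummy data of the same size:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(48).reshape(-1, 1)  # 48 monthly rows, as in the question
y = np.arange(48, dtype=float)    # stand-in for tomato prices

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # every training index strictly precedes every test index
    assert train_idx.max() < test_idx.min()
    print(len(train_idx), len(test_idx))
```

Each of the 4 folds trains on an expanding prefix of the history and tests on the 9 months that follow it, which mimics how the model would actually be used.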
When working with features in machine learning and representing them in a matrix, what is the recommended way to represent hour of day and day of week as features for value-prediction models?
Is one-hot encoding (0 for every hour value except a 1 for the actual hour) the preferred way to represent these attributes as features? And the same for day of week?
Thanks
In this case there is a periodic weekly pattern and a long-term upward trend, so you would want to encode two time variables:
day_of_week
absolute_time
In general
There are several common time frames that trends occur over:
absolute_time
day_of_year
day_of_week
month_of_year
hour_of_day
minute_of_hour
Look for trends in all of these.
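Most of the fields listed above fall straight out of a datetime index; a small pandas sketch (taking `absolute_time` as seconds since the epoch, which is one reasonable choice among several):

```python
import pandas as pd

ts = pd.DataFrame(index=pd.date_range("2021-03-01", periods=3, freq="D"))

ts["absolute_time"]  = ts.index.astype("int64") // 10**9  # epoch seconds
ts["day_of_year"]    = ts.index.dayofyear
ts["day_of_week"]    = ts.index.dayofweek   # Monday = 0 .. Sunday = 6
ts["month_of_year"]  = ts.index.month
ts["hour_of_day"]    = ts.index.hour
ts["minute_of_hour"] = ts.index.minute

print(ts[["day_of_week", "day_of_year"]])
```

Plotting the target against each of these columns in turn is a quick way to carry out the "look for trends in all of these" step.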
Weird trends
Look for weird trends too. For example you may see rare but persistent time based trends:
is_easter
is_superbowl
is_national_emergency
etc.
These often require that you cross reference your data against some external source that maps events to time.
Why graph?
There are two reasons that I think graphing is so important.
Weird trends:
While the general trends can be automated pretty easily (just add them every time), weird trends will often require a human eye and knowledge of the world to find. This is one reason that graphing is so important.
Data errors:
All too often data has serious errors in it. For example, you may find that the dates were encoded in two formats and only one of them has been correctly loaded into your program. There are a myriad of such problems and they are surprisingly common. This is the other reason I think graphing is important, not just for time series, but for any data.
Answer from https://datascience.stackexchange.com/questions/2368/machine-learning-features-engineering-from-date-time-data
No, that choice isn't ideal, because it loses the cyclical structure. For hours, the model needs to know that 23:00 is close to 00:00, and the same holds for weekdays (which usually start with Monday as 0 and end with Sunday as 6). With that encoding, every hour or day becomes an independent category with no relation to its neighbours, and that's wrong.
The right way to represent this type of data is to encode each cyclic feature (hour, day of week, ...) as two features.
Those two features are the sine and cosine of the value. For hours, for example, you create hour_sin and hour_cos columns. Before applying sin and cos you first compute theta by dividing the value by its period (24 for hours, 7 for weekdays); in Python, import pi from math and then:
theta = 2 * pi * hour / 24
Then import sin and cos from math as well, and compute sin(theta) and cos(theta).
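Putting that into working Python (the function name is just for illustration; note the division by the period before multiplying by 2π, which maps each value onto the unit circle):

```python
from math import pi, sin, cos

def cyclical_encode(value, period):
    """Map a cyclic value (hour, weekday, ...) to a point on the unit circle."""
    theta = 2 * pi * value / period
    return sin(theta), cos(theta)

# 23:00 and 00:00 land next to each other on the circle...
h23 = cyclical_encode(23, 24)
h0 = cyclical_encode(0, 24)
# ...whereas a plain 0..23 integer feature puts them 23 apart.
print(h23, h0)
```

For day of week you would call the same function with period 7; the pair (sin, cos) is needed because either component alone maps two different times to the same value.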
I am building a forecasting system to predict the number of cable subscribers who will disconnect at a given point in time. I am using Python, and of the different models I tried, XGBoost performs the best.
I have a self-referential system in place that works in a moving-window fashion: as I run out of actuals, I start feeding forecasted figures into my lags.
To build the forecasting system, I used the previous 800 days of lags (disconnects per day), moving averages, ratios, seasonality, and indicators for year, month, day, week, etc. Holidays, however, are where it gets a little messy. Initially I used just one column to flag holidays of all sorts, but I later realized that different holidays can have different impacts (some drive high sales, others drive churn), so I added a column for each holiday. I also added indicators for long weekends and for holidays that fall on a Sunday, plus a "season" column marking festive periods such as Thanksgiving and the New Year holidays.
Even after adding so many holiday-related columns, I largely miss the Thanksgiving and New Year periods. The model accounts for holidays to some extent, but it completely misses the spike. As can be seen from the chart, the spikes recur every year (orange). My forecast (grey) does respond to the holidays in Dec '17, but it under-forecasts. Any idea how that can be addressed?
P.S. I tuned the XGBoost hyperparameters using grid search.
As I understand it, if you cleaned your data and removed outliers, your model will give a more stable prediction set overall, but it will fail to predict those outliers.
If you did clean the data, I'd experiment with the outlier threshold and see whether the wider regular-day errors are an acceptable trade-off for the ability to predict the higher spikes.