Google AutoML taking huge time for forecasting - time series

I have a dataset of around 500 time series covering a period of 2.5 years at a granularity of 1 day per series. This amounts to roughly 1 million data points.
I want to forecast 2 weeks ahead at 1-day granularity for each of the time series. There might be correlation among these 500 time series.
After ensuring that I have data for each timestamp, I am feeding these 500 time series to AutoML, where each time series is identified by a "series identifier".
So, my input to AutoML (Forecasting) is timestamp, series identifier, features, and target value. I have 30 features, which are a combination of categorical and numerical.
With this setup, AutoML takes more than 20 hours for training, which is not cost-effective for me.
Please help me optimize this.

AutoML is a black box.
There is little you can do to optimize training time, because AutoML does feature engineering under the hood and tries very hard not to overfit your data.
You have just two options here:
- Train a model on a smaller dataset containing only the most important time series (it will still take time, because AutoML has to fight not to overfit your dataset); see the sketch after this answer.
- Remove the time series identifier if that makes sense for your problem. This gives AutoML more chances not to overfit the data and might produce a result earlier.
Please remember you're tweaking a black box. Your mileage will vary.
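For the first option, here is a minimal sketch of subsetting the data to the highest-volume series before uploading it to AutoML. The file name and the column names (series_id, target) are assumptions, not from the question.

# Keep only the N series with the largest total target volume (hypothetical columns).
library(dplyr)

full <- read.csv("all_series.csv")                 # hypothetical export of all 500 series

top_ids <- full %>%
  group_by(series_id) %>%
  summarise(total = sum(target, na.rm = TRUE)) %>%
  arrange(desc(total)) %>%
  slice_head(n = 50) %>%                           # keep the 50 "most important" series
  pull(series_id)

subset_series <- filter(full, series_id %in% top_ids)
write.csv(subset_series, "subset_series.csv", row.names = FALSE)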

Related

SARIMA Model applied to Day Temperature Forecasting, Seasonal Period

I'm learning time series and want to model the daily temperature of a place. However, when applying a SARIMA model, I don't think 365 is a reasonable seasonal period, as it's quite large and I only have 5 years of data to train on. Is there a way to get around this?
I'm thinking smoothing the data might work, or there may be other methods to remove the seasonality from the dataset.
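As a rough illustration of the workaround the question hints at (removing the seasonality before fitting), one option is to strip the yearly seasonal component with STL and fit a plain, non-seasonal ARIMA to the remainder. This is only a sketch; `temps` is assumed to be a numeric vector of daily temperatures.

library(forecast)

y      <- ts(temps, frequency = 365)
decomp <- stl(y, s.window = "periodic")

# Model the seasonally adjusted series with a non-seasonal ARIMA,
# which avoids a SARIMA seasonal period of 365.
fit <- auto.arima(seasadj(decomp), seasonal = FALSE)
fc  <- forecast(fit, h = 14)
# The seasonal component from the decomposition can be added back to fc$mean
# if forecasts on the original temperature scale are needed.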

How to deal with skewed time series data

I have hourly data on the number of minutes spent online by people over 2 years. Hence the values are distributed between 0 and 60, and most of them are either 0 or 60. My goal is to predict the number of minutes a person will spend online in the future (next day/hour/month etc.). What kind of approach or machine learning model can I use to predict this data? Can this be modelled as a regression/forecasting problem in spite of the skewness?
In the case of time series data and its prediction, it's better to use a regression model rather than a classification or clustering model, because the task is to estimate specific numeric values.
It can be modelled as a regression problem to some extent, but stronger skewness means the data is further from a normal distribution, which can hurt how well the model fits and lower prediction accuracy. In any case, data with significant skewness cannot be regarded as well-prepared, so you might need to transform or resample the data so that its skewness decreases.
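A hedged sketch of one possible transformation for this bounded, heavily skewed target: map the minutes (0-60) to a fraction, apply a logit transform before fitting any regression model, and back-transform the predictions. The data frame `df` and the predictor `hour_of_day` are assumptions for illustration only.

eps <- 0.01
df$p      <- pmin(pmax(df$minutes / 60, eps), 1 - eps)   # clip away exact 0s and 60s
df$logitp <- log(df$p / (1 - df$p))                      # logit of the online fraction

fit  <- lm(logitp ~ hour_of_day, data = df)              # any regression model could be used here
pred <- predict(fit, newdata = df)
minutes_hat <- 60 * plogis(pred)                         # back to the 0-60 minute scale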

Which machine learning algorithm should I use to predict if a particular parking space will be occupied?

I'm working on an idea for my Master's thesis topic.
I got a dataset with millions of records which describe on-street parking sensors.
Data I have:
- vehicle present on a particular sensor (true or false)
It's normal that there are a few parking events in a row with False values and different durations.
- arrival time and departure time (month, day, hour, minute and even second)
- duration in minutes
And a few more columns, but I don't have any idea how to show this "continuity of time" in my analysis and reflect it in the calculations for a certain future time, based on when the parking space was usually free or occupied.
Any ideas?
You can take two approaches:
If you want to predict whether a particular space will be occupied or not, and you take the order of the events (time) into account, this is a time series problem. You should start by trying simple time series methods like moving averages or ARIMA models (see the sketch after this answer). There are more sophisticated methods that take long- and short-term relationships into account, like recurrent neural networks, especially LSTMs (Long Short-Term Memory), which have shown good performance on time series problems.
You can take all the variables into account and use them to train a clustering algorithm like k-means, or a classifier such as an SVM.
As you pointed out:
And a few more columns, but I don't have any idea how to show this "continuity of time" in my analysis and reflect it in the calculations for a certain future time, based on when the parking space was usually free or occupied.
I recommend treating this as a time series problem.
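Here is a minimal sketch of the simple time series starting point mentioned in the first approach: aggregate the sensor events into an hourly occupancy rate for one parking space and fit an ARIMA to it. The data frame `events`, with columns `arrival_time` and `occupied` (a logical), is an assumption about the dataset described in the question.

library(dplyr)
library(forecast)

hourly <- events %>%
  mutate(hour = format(as.POSIXct(arrival_time), "%Y-%m-%d %H")) %>%
  group_by(hour) %>%
  summarise(occupancy = mean(occupied))       # share of events in that hour with the space taken

y   <- ts(hourly$occupancy, frequency = 24)   # daily seasonality
fit <- auto.arima(y)
fc  <- forecast(fit, h = 24)                  # occupancy rate for the next day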
Time series modelling will be a better option for this kind of problem. As you said, you want to predict a binary output at different time intervals, i.e. whether the parking slot will be occupied in a particular time interval or not. You can use an LSTM for this purpose.
Time series is definitely an option here. If you are really going with LSTMs, why not look into Transformers and take advantage of the attention mechanism for time series forecasting? I don't know them thoroughly yet, just have a vague idea of their performance benefits over RNNs and LSTMs.

Is it necessary to make time series data stationary before applying tree-based ML methods, e.g. Random Forest or XGBoost?

As in the case of ARIMA models, we have to make our data stationary. Is it necessary to make our time series data stationary before applying tree-based ML methods?
I have a dataset of customers with monthly electricity consumption for the past 2 to 10 years, and I am supposed to predict each customer's consumption for the next 5 to 6 months. In the dataset, some customers show strange behaviour: for a particular month, their consumption differs considerably from what they consumed in the same month over the last year or the last 3 to 4 years, and this change is not caused by temperature. Since we don't know the reason behind this change, the model is unable to predict that consumption correctly.
So would making each customer's time series stationary help in this case or not?
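As a hedged illustration of what "making the series stationary" could mean for this kind of monthly data, one common step is seasonal (year-over-year) differencing, i.e. subtracting the value from the same month one year earlier. `cons` is a hypothetical numeric vector of one customer's monthly consumption.

y      <- ts(cons, frequency = 12)
y_diff <- diff(y, lag = 12)   # consumption minus the same month last year

# A tree-based model could then be trained on y_diff (plus lag features),
# with the seasonal level added back to the predictions afterwards.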

Using Random Forest for time series dataset

For a time series dataset, I would like to do some analysis and create a prediction model. Usually, we would split the data (by random sampling throughout the entire dataset) into a training set and a testing set, use the training set with the randomForest function, and keep the testing part to check the behaviour of the model.
However, I have been told that it is not possible to split time series data by random sampling.
I would appreciate it if someone could explain how to split time series data into training and testing sets, or whether there is an alternative way to do time series random forest.
Regards
We live in a world where "future-to-past causality" only occurs in cool sci-fi movies. Thus, when modelling time series we like to avoid explaining past events with future events. Also, we like to verify that our models, strictly trained on past events, can explain future events.
To model a time series T with RF, rolling is used. For day t, the value T[t] is the target, and the values T[t-k], where k = {1, 2, ..., h} and h is the past horizon, are used to form the features. For nonstationary time series, T is converted to e.g. the relative change T_rel[t] = (T[t+1] - T[t]) / T[t].
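A minimal sketch of the rolling setup described above, assuming `T_series` is a numeric vector and h = 7 past values are used as features; the names are illustrative only.

library(randomForest)

h <- 7

# Relative change, as suggested for nonstationary series: (T[t+1] - T[t]) / T[t].
T_rel <- diff(T_series) / head(T_series, -1)

# Build a lag matrix: each row has target T_rel[t] and features T_rel[t-1 .. t-h].
rows <- (h + 1):length(T_rel)
X <- sapply(1:h, function(k) T_rel[rows - k])
colnames(X) <- paste0("lag", 1:h)
dat <- data.frame(target = T_rel[rows], X)

fit <- randomForest(target ~ ., data = dat)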
To evaluate performance, I advise checking the out-of-bag cross-validation (OOB-CV) measure of RF. Be aware that there are some pitfalls that may render this measure over-optimistic:
Unknown future-to-past contamination: the rolling is somehow faulty and the model uses future events to explain the same future within the training set.
Non-independent sampling: if the time interval you want to forecast ahead is shorter than the time interval the relative change is computed over, your samples are not independent.
Possibly other mistakes I don't know of yet.
In the end, anyone can make the above mistakes in some latent way. To check that this is not happening, you need to validate your model with backtesting, where each day is forecasted by a model strictly trained on past events only.
When OOB-CV and backtesting wildly disagree, this may be a hint of a bug in the code.
To backtest, roll over T[t-1] to T[t-traindays], model this training data, and forecast T[t]. Then increase t by one (t++) and repeat.
To speed this up, you may train your model only once, or only at every n'th increment of t.
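A rough sketch of that backtesting loop, reusing the lag-feature data frame `dat` from the sketch above; `traindays` and `refit_every` are assumed values.

traindays   <- 365
refit_every <- 7
preds <- rep(NA_real_, nrow(dat))

for (t in (traindays + 1):nrow(dat)) {
  # Refit only at every n'th increment of t to speed things up.
  if ((t - traindays - 1) %% refit_every == 0) {
    fit <- randomForest(target ~ ., data = dat[(t - traindays):(t - 1), ])
  }
  preds[t] <- predict(fit, newdata = dat[t, ])
}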
# Reading the sales file
library(dplyr)
Sales <- read.csv("Sales.csv")

# Finding the size of the training set: an 80-20 split, which you can change.
# Make sure your data is sorted in ascending order of time.
train_len <- round(nrow(Sales) * 0.8)
test_len  <- nrow(Sales)

# Training set
training <- slice(Sales, 1:train_len)

# Testing set (note the parentheses: train_len + 1:test_len would pick the wrong rows)
testing <- slice(Sales, (train_len + 1):test_len)
