Dynamic Multi-step Time Series Forecasting

I am using an LSTM to forecast the future temperature of a specific device, so my data only has two columns: Date and Temperature.
I want to create a method that forecasts N time steps into the future. N is changeable (I will take it from the user). But the main thing is, I don't want to use any loops in this method, because of time restrictions.
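One common loop-free pattern, offered here only as a sketch, is direct multi-output forecasting: the LSTM is trained to emit a fixed maximum horizon H in a single forward pass, and the method simply returns the first N of those values. Everything below (lookback, H, forecast_n) is an illustrative assumption written against the keras R interface, not the asker's actual code:

library(keras)
lookback <- 30   # assumed length of the input window of past temperatures
H <- 24          # assumed maximum horizon the model is trained to emit
model <- keras_model_sequential() %>%
  layer_lstm(units = 64, input_shape = c(lookback, 1)) %>%
  layer_dense(units = H)   # one output unit per future step
model %>% compile(optimizer = "adam", loss = "mse")
# forecast_n returns the first N of the H jointly predicted steps,
# using a single predict() call and no loop
forecast_n <- function(model, last_window, N) {
  x <- array(last_window, dim = c(1, length(last_window), 1))
  preds <- predict(model, x)   # shape 1 x H
  preds[1, seq_len(min(N, ncol(preds)))]
}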

Related

How to compare a forecast based on the raw series itself with one based on the raw series integrated of order 1?

I have a dataset of close prices, let's say Close, which, after doing the unit root tests, I decided to model with two different models; in EViews the equations are as follows:
Close c ar(1)
and
d(Close) ma(5)
the first model takes a constant and an AR(1) process, while the second one fits an MA(5) process to the first-differenced series.
After getting the forecasts for both, my question is: how can I compare them and decide which forecast is better, and why? RMSE and some other statistics are supposed to be scale-dependent, and I wonder whether differencing affects that scale. The output for both forecasts, along with the statistics, is shown in the attached picture.
Thank you for your help.
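Differencing does change the scale here: the MA(5) model's error statistics are computed on d(Close), not on Close, so the two RMSEs are not directly comparable as reported. A minimal R sketch of the usual remedy, integrating the differenced forecast back to levels so both RMSEs live on the Close scale (fc_level, fc_diff, last_close and actual are assumed placeholder objects, not EViews output):

rmse <- function(a, f) sqrt(mean((a - f)^2))
# integrate the d(Close) forecast back to levels before comparing
fc_diff_in_levels <- last_close + cumsum(fc_diff)
rmse(actual, fc_level)            # AR(1)-in-levels model
rmse(actual, fc_diff_in_levels)   # MA(5)-in-differences model, now on the Close scale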

Similarity between time series data with different time intervals

Hi, I'm currently trying to compute a similarity score between two time series.
The metric used now is KL-divergence, but I want to know if there are other options. There are many metrics (e.g. dynamic time warping, Fréchet distance); however, I can't apply these because my time series start at different times. For example, stock_data_1 starts on a Monday and stock_data_2 starts on a Friday. I want to know whether there is a metric that can compare two time series with different starting points and can detect distribution differences specific to time series data.
Thanks in advance.
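For what it's worth, dynamic time warping aligns the two series before scoring them, so a shifted start does not by itself rule it out. A minimal sketch with R's dtw package, where the two series are random placeholders standing in for stock_data_1 and stock_data_2:

library(dtw)
set.seed(1)
stock_data_1 <- cumsum(rnorm(120))   # placeholder series starting on a Monday
stock_data_2 <- cumsum(rnorm(95))    # placeholder series starting on a Friday
alignment <- dtw(stock_data_1, stock_data_2, keep = TRUE)
alignment$distance             # raw alignment cost
alignment$normalizedDistance   # cost adjusted for the differing lengths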

Impute time series using similar time series

I have a problem where I have about one year of recordings from thermostats: every hour each thermostat gives me the mean temperature in that household. But a lot of data is unavailable, because some thermostats were only installed in the middle of the year, or were taken out for a week, etc. A lot of this thermostat data is really similar, though. What I want to do is impute the missing data using similar time series.
So let's say house A only started in July, but from there it is very similar to household B; I would then want to use the info from household B to predict what the data would have been before July in house A.
I was thinking about training a recurrent neural network to do this, but I am not sure what is out there, and when I search for papers they almost exclusively work on data sets spanning multiple years and impute the data using the data of previous years. I do not have that data, so it is not an option.
Does anyone have a clue how to tackle this problem, or a reference that solves a similar one?
As I understand it, you want to impute the data using cross-sectional information rather than time series information.
There are actually quite a lot of imputation packages in R that can do this for you (if you are using R).
You'd need equally spaced data: one value per hour, and if a value is not present, it needs to be NA. Ideally you then have multiple time series of equal length.
Then you merge these time series according to the timestamp / hour.
Afterwards you can apply an imputation package such as mice, missForest or imputeR with basically one line of code. These packages will use the correlations between the different time series to estimate the missing values in each series.
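A minimal sketch of that last step, assuming the merged data frame temps has one column per household, rows aligned on the same hourly timestamp, and missing hours coded as NA:

library(mice)
# predictive mean matching; mice exploits correlations across households
imp <- mice(temps, m = 5, method = "pmm", seed = 1)
temps_complete <- complete(imp)   # first completed data set, NAs filled in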

Can I build an ML model with independent variables containing time series, categorical and numeric data, and a binary dependent variable (0, 1)?

Let's say I have data containing
salary,
job profile,
work experience,
number of people in household,
other demographics, etc.
of multiple persons who visited my car dealership, and I also have data on whether each person bought a car from me or not.
I can leverage this dataset to predict whether a new customer coming in is likely to buy a car. Currently I am doing it using XGBoost.
NOW, I have additional data, but it is time series data of the monthly expenditure each person makes, and I have this data for my training set too. I want to build a model which uses this time series data together with the old demographic data (salary, age, etc.) to tell whether a customer is likely to buy or not.
Note: in the second part I have time series data for the monthly expenditure only. The other variables are measured at a single point in time; for example, I do not have a time series for Salary or Age.
Note 2: I also have categorical variables like job profile which I would like to use in the model, but I do not know whether the person has stayed in the same job profile or changed over from some other one.
As most of the data is specific to the person, except the expenditure time series, it is better to bring the time series data down to the person level. This can be done with feature engineering, for example:
As #cmxu suggested, take various statistical measures. It is even more beneficial to take these measures over different time intervals, e.g. the mean over the last 2, 5, 7, 15, 30, 90 and 180 days (see the sketch after this list).
Create mixed features like:
a) the ratio of salary to the expenditure statistical summary created in point 1 (choose an appropriate interval);
b) salary per household member, or average monthly expenditure per household member, etc.
With similar ideas you can easily create hundreds or thousands of features from your data, and then feed them all to XGBoost (which is easy to train and debug) or a NN (more complicated to train).
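A minimal sketch of point 1 and feature 2a, assuming spend is one person's daily expenditure vector (most recent value last) and salary is their point-in-time salary:

windows <- c(2, 5, 7, 15, 30, 90, 180)
recent_means <- sapply(windows, function(w) mean(tail(spend, w), na.rm = TRUE))
names(recent_means) <- paste0("mean_last_", windows, "d")
# mixed feature 2a: salary relative to recent spending
salary_to_spend_30d <- salary / recent_means["mean_last_30d"]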

Using Random Forest for time series dataset

For a time series dataset, I would like to do some analysis and create a prediction model. Usually, we would split the data (by random sampling throughout the entire data set) into a training set and a testing set, use the training set with the randomForest function, and keep the testing part to check the behaviour of the model.
However, I have been told that it is not possible to split time series data by random sampling.
I would appreciate it if someone could explain how to split time series data into training and testing sets, or whether there is an alternative way to do random forests for time series.
Regards
We live in a world where "future-to-past causality" only occurs in cool sci-fi movies. Thus, when modeling time series we like to avoid explaining past events with future events. We also like to verify that our models, strictly trained on past events, can explain future events.
To model a time series T with RF, rolling is used: for day t, the value T[t] is the target, and the values T[t-k] for k = 1, 2, ..., h, where h is the past horizon, are used to form features. For nonstationary time series, T is first converted to e.g. the relative change Trel[t] = (T[t+1] - T[t]) / T[t].
To evaluate performance, I advise checking the out-of-bag cross-validation (OOB-CV) measure of RF. Be aware that there are some pitfalls that can render this measure overoptimistic:
Unknown future-to-past contamination: the rolling is somehow faulty, and the model uses future events to explain that same future within the training set.
Non-independent sampling: if the time interval you want to forecast ahead is shorter than the time interval the relative change is computed over, your samples are not independent.
Possibly other mistakes I don't know of yet.
In the end, anyone can make the above mistakes in some latent way. To check that this is not happening, you need to validate your model with back-testing, where each day is forecasted by a model strictly trained on past events only.
When OOB-CV and back-testing wildly disagree, this may be a hint of a bug in the code.
To back-test, roll over T[t-1] to T[t-traindays], model this training data, and forecast T[t]. Then increase t by one (t++) and repeat; a sketch follows below.
To speed things up, you may train your model only once, or only at every n-th increment of t.
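A minimal sketch of the rolling features plus the back-test, assuming T_series is a numeric vector holding T and using the randomForest package (h, traindays and the object names are illustrative):

library(randomForest)
h <- 5                                      # past horizon
X <- as.data.frame(embed(T_series, h + 1))  # row for day t holds T[t], T[t-1], ..., T[t-h]
colnames(X) <- c("target", paste0("lag", 1:h))
traindays <- 200
preds <- rep(NA_real_, nrow(X))
for (t in (traindays + 1):nrow(X)) {
  # train strictly on past rows only, then forecast day t
  fit <- randomForest(target ~ ., data = X[(t - traindays):(t - 1), ])
  preds[t] <- predict(fit, X[t, ])
}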
Reading the sales file (the slice() function used below comes from the dplyr package):
library(dplyr)
Sales <- read.csv("Sales.csv")
Finding the length of the training set:
train_len <- round(nrow(Sales) * 0.8)
test_len <- nrow(Sales)
Splitting the data into training and testing sets; here I have used an 80-20 split, which you can change. Make sure your data is sorted in ascending date order.
Training set:
training <- slice(Sales, 1:train_len)
Testing set (note the parentheses: without them, train_len + 1:test_len would add train_len to every element of 1:test_len):
testing <- slice(Sales, (train_len + 1):test_len)
