How to handle multivariate data with varying time lengths in an LSTM? - time-series

The dataset I have is a medical dataset in which measurements were taken at 6-month intervals. I want the model to predict 5 years into the future. However, many subjects only have, for example, the first three years of 6-month interval data. How can I still use this data to build an LSTM whose objective is to forecast a value 5 years into the future? Do I have to do something with padding and masking?
Thank you
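One common way to keep the shorter subjects is to pad every sequence to a shared length and carry a mask that marks which time steps are real. A minimal sketch in plain Python, assuming each subject is a list of per-visit feature vectors; `pad_and_mask` and the pad value `0.0` are illustrative choices, not part of any library:

```python
def pad_and_mask(sequences, pad_value=0.0):
    """Right-pad variable-length sequences and return a matching mask.

    sequences: list of subjects; each subject is a list of time steps,
               and each time step is a list of feature values.
    """
    max_len = max(len(seq) for seq in sequences)
    n_features = len(sequences[0][0])
    padded, mask = [], []
    for seq in sequences:
        n_missing = max_len - len(seq)
        # Copy the real steps, then append filler steps up to max_len.
        filler = [[pad_value] * n_features for _ in range(n_missing)]
        padded.append([list(step) for step in seq] + filler)
        # True marks a real measurement, False marks padding.
        mask.append([True] * len(seq) + [False] * n_missing)
    return padded, mask


# Two hypothetical subjects with 2 features per visit.
subjects = [
    [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]],  # three visits
    [[0.5, 1.0]],                          # one visit only
]
padded, mask = pad_and_mask(subjects)
```

The mask can then be handed to whatever masking mechanism the framework offers (for example a masking layer in Keras or packed sequences in PyTorch), so that padded steps do not contribute to the hidden state or the loss.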

Related

SARIMA Model applied to Day Temperature Forecasting, Seasonal Period

I'm learning time series analysis and want to model the daily temperature of a place. However, when applying a SARIMA model, I don't think a seasonal period of 365 is reasonable: it is too large, and I only have 5 years of data to train on. Is there a way around this?
I'm thinking smoothing the data might work, or there may be other methods to remove the seasonality from the dataset.
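One way to sidestep a seasonal period of 365 is to remove the annual cycle first and model only the residuals with a non-seasonal model. A minimal sketch, assuming the data come as parallel lists of day-of-year and temperature; `deseasonalize` is a hypothetical helper, and averaging by calendar day is only one option (Fourier terms or a smoothed daily mean are others):

```python
from collections import defaultdict


def deseasonalize(days_of_year, temps):
    """Subtract the mean temperature of each calendar day.

    Returns the per-day means (the "climatology") and the anomalies,
    i.e. each observation minus the mean for its calendar day.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for day, temp in zip(days_of_year, temps):
        sums[day] += temp
        counts[day] += 1
    climatology = {day: sums[day] / counts[day] for day in sums}
    anomalies = [temp - climatology[day]
                 for day, temp in zip(days_of_year, temps)]
    return climatology, anomalies


# Toy data: two calendar days observed in two different years.
days = [1, 2, 1, 2]
temps = [10.0, 12.0, 14.0, 16.0]
climatology, anomalies = deseasonalize(days, temps)
```

A forecast is then the predicted anomaly plus the climatology value of the target day. With only five years of data each calendar-day mean rests on about five values, so smoothing the climatology across neighbouring days is probably worthwhile.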

Validating accuracy on time-series data with an atypical ending

I'm working on a project to predict demand for a product based on past historical data for multiple stores. I have data from multiple stores over a 5 year period. I split the 5-year time series into overlapping subsequences and use the last 18 months to predict the next 3 and I'm able to make predictions. However, I've run into a problem in choosing a cross-validation method.
I want to have a holdout test split and use some form of cross-validation for training my model and tuning parameters. However, the last year of the data was a recession in which almost all demand suffered. When I use the last 20% (time-wise) of the data as a holdout set, my test score is very low compared to my OOF cross-validation scores, even though I am using a TimeSeriesSplit CV. This is very likely because the recession is new behavior, and the model can't predict these strong downswings since it has never seen them before.
The solution I'm thinking of is to use a random 20% of the data as a holdout, and a shuffled KFold as cross-validation. Since the only information I feed the model about when a sequence started is its starting month (1 to 12), to help the model explain seasonality, my theory is that the model should not overfit on that basis. If all types of economy are present in the data, the results of the model should extrapolate to new data too.
I would like a second opinion on this. Do you think my assumptions are correct? Is there a different way to solve this problem?
Your overall assumption is correct in that you can probably take random chunks of time to form your training and testing set. However, when doing it this way, you need to be careful. Rather than predicting the raw values of the next 3 months from the prior 18 months, I would predict the relative increase/decrease of sales in the next 3 months vs. the mean of the past 18 months.
(See http://people.stern.nyu.edu/churvich/Forecasting/Handouts/CourantTalk2.pdf)
Otherwise, the correlation between the next 3 months and your prior 18 months of data might give you a misleading impression of your model's accuracy.
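The relative target suggested above can be sketched as follows, assuming monthly sales in a plain list with a positive mean in every window; `relative_targets`, its parameters, and the toy numbers are illustrative:

```python
def relative_targets(series, history=18, horizon=3):
    """Build (past window, relative future change) training pairs.

    Each target value is future / mean(past) - 1, so 0.10 means
    "10% above the average of the preceding `history` months".
    """
    samples = []
    for start in range(len(series) - history - horizon + 1):
        past = series[start:start + history]
        future = series[start + history:start + history + horizon]
        past_mean = sum(past) / len(past)  # assumed to be non-zero
        target = [value / past_mean - 1.0 for value in future]
        samples.append((past, target))
    return samples


# Toy series: 18 flat months followed by 3 months to predict.
series = [100.0] * 18 + [110.0, 90.0, 100.0]
samples = relative_targets(series)
past, target = samples[0]
```

Here `target` is approximately `[0.10, -0.10, 0.0]`: the model is scored on the change relative to recent history rather than on the raw level, which it could otherwise get mostly right just by copying the past.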

Python - How to predict future sales value by feeding other parameters? Using LSTM or?

I am new to this Regression world and I have a nerd question, you may say.
Actually I was trying to solve a problem to predict future sales in my organization.
I have collected all the data for last year. My data includes (for each day):
1. Total Sales (count)
2. Temperature
3. Wind Direction
4. Precipitation
5. Day of week (i.e. 1 or 2 or 3 ... or 7)
6. Whether a working day or not
7. etc.
My goal:
Train a model so that, given the values of items 2 to 7 for the day I want to predict (a day that is in neither the training nor the test data), it returns the predicted value of item 1 (Total Sales).
What I tried:
1. First I tried a univariate LSTM model (i.e. using total sales from the past year to predict the next value), but I couldn't feed the other data as input.
2. Then I tried a multivariate LSTM model, but this would predict all of the variables for the next step.
3. Then I searched through many tutorials, such as a video tutorial that uses an LSTM for electricity consumption, but it only shows the model building and not how to apply it.
I also came across another question on Stack Overflow, but there the user seems to be moving to reinforcement learning.
Conclusion: What should I do to solve such problems? How can I predict the future sales count by feeding in the data for that day?
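Since the inputs for the target day are known in advance, one possible framing is ordinary supervised regression with one row per day, rather than a sequence model. A minimal sketch of building such rows, assuming each day is a dict; the key names, `encode_day`, and `build_dataset` are hypothetical and not taken from the question:

```python
def encode_day(record):
    """Turn one day's known inputs into a numeric feature vector."""
    # One-hot encode the weekday (1..7) so no ordering is implied.
    weekday_one_hot = [1.0 if record["day_of_week"] == d else 0.0
                       for d in range(1, 8)]
    return [
        record["temperature"],
        # Kept as a raw number for brevity; direction is circular, so a
        # sine/cosine encoding may suit it better.
        record["wind_direction"],
        record["precipitation"],
        1.0 if record["working_day"] else 0.0,
    ] + weekday_one_hot


def build_dataset(records):
    """Split day records into feature rows X and sales targets y."""
    X = [encode_day(r) for r in records]
    y = [r["total_sales"] for r in records]
    return X, y


# One made-up day, for illustration only.
records = [{
    "total_sales": 120, "temperature": 21.5, "wind_direction": 180.0,
    "precipitation": 0.0, "day_of_week": 3, "working_day": True,
}]
X, y = build_dataset(records)
```

Any standard regressor can then be fitted on `X` and `y`, and a new day is predicted by passing its own feature vector through `encode_day`.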

Is it necessary to make time series data stationary before applying tree based ML methods i.e. Random Forest or Xgboost etc?

With ARIMA models, we have to make our data stationary. Is it also necessary to make time series data stationary before applying tree-based ML methods?
I have a dataset of customers with monthly electricity consumption for the past 2 to 10 years, and I am supposed to predict each customer's consumption for the next 5 to 6 months. Some customers in the dataset behave strangely: for a particular month, their consumption differs considerably from what they consumed in the same month of the previous year or the previous 3 to 4 years, and this change is not due to temperature. Since we don't know the reason behind this change, the model is unable to predict that consumption correctly.
So would making each customer's time series stationary help in this case?

What machine learning technique can be used for multivariate time series?

I'm trying to predict the price of tomatoes. I've collected a data set that contains the previous tomato prices, to which I've added features that might affect the change in tomato price, for example monthly agricultural wages, the monthly inflation rate, and monthly rainfall. Does this qualify as a multivariate time series? What machine learning technique can be used to solve this problem? The constraint is that there are only 48 data points (4 years * 12 months). Also, can the train and test sets be drawn using cross-validation?
Columns in my dataset:
Year
Month
Tomato price
Wage
Inflation
Rainfall
Number of festivals in the month
Thanks in advance!
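For time-ordered data, a common alternative to shuffled cross-validation is an expanding-window split, where each fold trains on the past and tests on the months that follow. A minimal sketch, assuming the 48 monthly rows are in chronological order; `expanding_window_splits` and its default sizes are illustrative choices:

```python
def expanding_window_splits(n_samples, initial_train=24, test_size=6):
    """Return (train_indices, test_indices) pairs in time order.

    Every fold trains on all rows before the test block, so no future
    information leaks into training.
    """
    splits = []
    train_end = initial_train
    while train_end + test_size <= n_samples:
        train_idx = list(range(train_end))
        test_idx = list(range(train_end, train_end + test_size))
        splits.append((train_idx, test_idx))
        train_end += test_size  # grow the training window
    return splits


splits = expanding_window_splits(48)
for train_idx, test_idx in splits:
    print(f"train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

With 48 rows and these defaults this yields four folds, each testing on the six months that follow its training window.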

Resources