Impute time series using similar time series - machine-learning

I have about one year of recordings from thermostats, each reporting the mean temperature of its household every hour. A lot of the data is missing, because the thermostat was only installed in the middle of the year, or was taken out for a week, and so on. Much of the thermostat data is very similar across households, though. What I want to do is impute the missing data using similar time series.
So let's say house A only started in July, but from then on it is very similar to household B. I would then want to use the information from household B to predict what the data would have been before July in house A.
I was thinking about training a recurrent neural network to do this, but I am not sure what is out there, and when I search for papers they almost exclusively work on data sets spanning multiple years and impute using the data of previous years. I do not have such data, so that is not an option.
Does anyone have a clue how to tackle this problem, or a reference I could use that solves a similar problem?

As I understand it, you want to impute the data using cross-sectional data rather than time series information.
There are actually quite a lot of imputation packages in R that can do this for you (if you are using R).
You'd need equally spaced data: one value per hour, and if a value is not present, it needs to be NA. Ideally you then have multiple time series of equal length.
Then you merge these time series according to the time stamp / hour.
Afterwards you can apply an imputation package such as mice, missForest, or imputeR with basically one line of code. These packages will use the correlations between the different time series to estimate the missing values in these series.
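To make the idea of borrowing information from a correlated series concrete, here is a minimal Python sketch. It is not what mice, missForest, or imputeR do internally; it swaps in a plain linear regression of one household on another, fitted on the hours where both have data. The helper name impute_from_neighbor and the toy temperatures are made up for illustration, and the two series are assumed to be already aligned on the same hourly grid with NaN for missing hours.

```python
import numpy as np

def impute_from_neighbor(target, donor):
    """Fill NaNs in `target` from `donor` via a linear fit (hypothetical helper)."""
    target = np.asarray(target, dtype=float)
    donor = np.asarray(donor, dtype=float)
    # Hours where both households have a reading: used to fit the relationship.
    both = ~np.isnan(target) & ~np.isnan(donor)
    slope, intercept = np.polyfit(donor[both], target[both], deg=1)
    filled = target.copy()
    # Only fill hours where the target is missing and the donor is observed.
    gaps = np.isnan(target) & ~np.isnan(donor)
    filled[gaps] = slope * donor[gaps] + intercept
    return filled

# Toy data (assumption): house A tracks house B with a constant offset,
# and house A's first four hours are missing.
house_b = np.array([18.0, 18.5, 19.0, 20.0, 21.0, 20.5, 19.5, 19.0])
house_a = house_b + 1.5
house_a[:4] = np.nan

house_a_filled = impute_from_neighbor(house_a, house_b)
```

With many households, a multivariate imputer that uses all columns at once is the closer analogue of the R packages named above; the single-donor fit is only the simplest case.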

Related

Multiple data from different sources on time series forecasting

I have an interesting question about time series forecasting. If someone has temporal data from multiple sensors, each dataset covering, e.g., 2010 to 2015, and wants to train a forecasting model using all the data from those different sensors, how should the data be organized? If one just stacked up the data sets, the result would be, e.g., sensorDataset1 (2010–2015), then sensorDataset2 (2010–2015), and the cycle would start over with sensors 3, 4, and n. Is this a problem with time series data or not?
If yes, what is the proper way to handle this?
I tried using all the data stacked up and training the model anyway, and actually it has a good error, but I wonder if that approach is actually valid.
Try resampling your individual sensor data sets to the same period.
For example, if sensor 1 has a data entry every 5 minutes and sensor 2 has an entry every 10 minutes, resample your data to a common period across all sensors. Each data point you show to your model will then be of better quality, which should improve the performance of your model.
How much this affects your error depends on what you are trying to forecast and on the relationships between the variables in your data.
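As a rough illustration of resampling to a common period, the sketch below averages a 5-minute series into 10-minute bins and then places both sensors side by side, one row per time step. The helper downsample_mean and the readings are invented for the example, and it assumes both sensors start at the same instant and have no gaps.

```python
import numpy as np

def downsample_mean(values, factor):
    """Average every `factor` consecutive readings (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    usable = len(values) - len(values) % factor  # drop an incomplete trailing bin
    return values[:usable].reshape(-1, factor).mean(axis=1)

sensor_1 = np.array([1.0, 3.0, 5.0, 7.0, 9.0, 11.0])  # one reading every 5 minutes
sensor_2 = np.array([10.0, 20.0, 30.0])                # one reading every 10 minutes

# Bring sensor 1 onto the 10-minute grid of sensor 2.
sensor_1_10min = downsample_mean(sensor_1, factor=2)

# One possible layout: rows are time steps, columns are sensors.
aligned = np.column_stack([sensor_1_10min, sensor_2])
```

With real timestamps, a time-indexed resampling tool would handle gaps and offsets that this reshape-based version ignores.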

Predicting time series based on previous events using neural networks

I want to see if the following problem can be solved by using neural networks: I have a database containing over 1000 basketball events, where the total score has been recorded every second from minute 5 till minute 20, and where the basketball games are all from the same league. This means that the events are occurring on different time periods. The data is afterwards interpolated to have the exact time difference between two timesteps, thus obtaining exactly 300 points between minute 5 and minute 20. This can be seen here:
[Figure: Time series]
The final goal is to have a model that can predict the y values between t=15 till t=20 and use as input data the y values between t=5 and t=15. I want to train the model by using the database containing the 1000 events. For this I tried using the following network:
[Figure: input data vs output data]
[Figure: Neural network]
The input data, that will be used to train the neural network model would have the shape (1000,200) and the output data, would have the shape (1000,100).
Can someone guide me in the right direction and give some feedback on whether this is a correct approach for such a problem? I have found some previous time series problems, but all of them were based on one large time series, while in this situation I have 1000 different time series.
There are a couple of different ways to approach this problem. Based on the comments, this sounds like a univariate, multi-step time series forecasting problem, albeit across many different events.
First, to clarify: most deep learning models/frameworks for time series take data in the format (batch_size, n_historical_steps, n_feature_time_series) and output the result in the format (batch_size, n_forecasted_steps, n_targets).
Since this is a univariate forecasting problem, n_feature_time_series would be one (unless I'm missing something). Now, n_historical_steps is a hyperparameter we often optimize, since the entire temporal history is often not relevant to forecasting the next n time steps. You might want to try optimizing it as well. However, let's say you choose to use the full temporal history; then the input would look like (batch_size, 200, 1). Following this approach, you would then have an output shape of (batch_size, 100, 1). You could then use a batch_size of 1000 to feed in all the different events at once (assuming, of course, you have a separate validation/test set). This would give you an input shape of (1000, 200, 1). This is how you would likely do it if, for instance, you were going to use models like DA-RNN, LSTM, vanilla Transformer, etc.
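The reshaping described above can be sketched in a few lines of NumPy. The array events is random placeholder data standing in for the 1000 interpolated score curves, and the variable names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder (assumption): 1000 events x 300 interpolated points each.
events = rng.random((1000, 300))

n_history = 200  # the 200 steps between t=5 and t=15

# Add a trailing axis of size 1 because the problem is univariate.
X = events[:, :n_history, np.newaxis]  # (batch_size, n_historical_steps, n_feature_time_series)
y = events[:, n_history:, np.newaxis]  # (batch_size, n_forecasted_steps, n_targets)
```

Shortening n_history, as suggested above, only changes the start of the slice used for X; the target slice stays the same.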
There are some other models, though, that create a learnable series embedding_id, such as the Convolutional Transformer paper or DeepAR. This is essentially a unique series identifier associated with each event, and the model would learn to forecast each series in the same pass.
I have models of both varieties implemented in Flow Forecast that you could use, though I don't have any detailed tutorials on this type of problem at the moment. I will also say, in all honesty, that given you only have 1000 basketball events (each with only 300 univariate time steps) and the many variables in play in basketball, I doubt that you will be able to accomplish this task with any real degree of accuracy. I would guess you need at least 20k+ basketball events to forecast this type of problem well, with deep learning at least.

Validating accuracy on time-series data with an atypical ending

I'm working on a project to predict demand for a product based on past historical data for multiple stores. I have data from multiple stores over a 5 year period. I split the 5-year time series into overlapping subsequences and use the last 18 months to predict the next 3 and I'm able to make predictions. However, I've run into a problem in choosing a cross-validation method.
I want to have a holdout test split, and use some sort of cross-validation for training my model and tuning parameters. However, the last year of the data was a recession where almost all demand suffered. When I use the last 20% (time-wise) of the data as a holdout set, my test score is very low compared to my OOF cross-validation scores, even though I am using a timeseriessplit CV. This is very likely to be caused by this recession being new behavior, and the model can't predict these strong downswings since it has never seen them before.
The solution I'm thinking of is using a random 20% of the data as a holdout, and a shuffled Kfold as cross-validation. Since I am not feeding any information about when the sequence started into the model except the starting month (1 to 12) of the sequence (to help the model explain seasonality), my theory is that the model should not overfit this data based on that. If all types of economy are present in the data, the results of the model should extrapolate to new data too.
I would like a second opinion on this, do you think my assumptions are correct? Is there a different way to solve this problem?
Your overall assumption is correct in that you can probably take random chunks of time to form your training and testing sets. However, when doing it this way, you need to be careful. Rather than predicting the raw values of the next 3 months from the prior 18 months, I would predict the relative increase/decrease of sales in the next 3 months vs. the mean of the past 18 months.
(See http://people.stern.nyu.edu/churvich/Forecasting/Handouts/CourantTalk2.pdf)
Otherwise, the correlation between the next 3 months and your prior 18 months of data might give you a misleading impression of the accuracy of your model.
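A minimal sketch of such a relative target is shown below, assuming monthly demand values and a non-zero 18-month mean. The function name relative_change_target and the demand numbers are invented for the example.

```python
import numpy as np

def relative_change_target(history, future):
    """Relative change of mean future demand vs. mean past demand (hypothetical helper)."""
    baseline = np.mean(history)  # assumed to be non-zero
    return np.mean(future) / baseline - 1.0

# Made-up series: 18 flat months followed by a 3-month downturn.
monthly_demand = np.array([100.0] * 18 + [90.0, 80.0, 70.0])
history, future = monthly_demand[:18], monthly_demand[18:]

target = relative_change_target(history, future)
```

A store selling 100 units and one selling 10,000 units then share the same target scale, so the score is no longer dominated by how well the model reproduces each store's level.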

What's the best way to represent Hour of Day and Day of Week as features for value prediction models in Machine Learning?

When working with features in Machine learning and representing them in a matrix, what's the recommended way to represent hour of day and day of week as features for value prediction models?
Is one-hot encoding (0 for all other hour values and 1 for the hour in question) the preferred way to represent these attributes as features? Same for day of week?
Thanks
In this case there is a periodic weekly trend and a long term upwards trend. So you would want to encode two time variables:
day_of_week
absolute_time
In general
There are several common time frames that trends occur over:
absolute_time
day_of_year
day_of_week
month_of_year
hour_of_day
minute_of_hour
Look for trends in all of these.
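As an illustration, the time frames listed above could be derived from a timestamp with the Python standard library roughly as follows. The function time_features and the example date are assumptions for the sketch, and absolute_time is taken here to be seconds since the Unix epoch, which is only one possible choice.

```python
from datetime import datetime, timezone

def time_features(ts):
    """Derive the common time-frame features from a datetime (hypothetical helper)."""
    return {
        "absolute_time": ts.timestamp(),        # seconds since the Unix epoch
        "day_of_year": ts.timetuple().tm_yday,  # 1..366
        "day_of_week": ts.weekday(),            # Monday = 0 ... Sunday = 6
        "month_of_year": ts.month,              # 1..12
        "hour_of_day": ts.hour,                 # 0..23
        "minute_of_hour": ts.minute,            # 0..59
    }

features = time_features(datetime(2015, 3, 1, 23, 30, tzinfo=timezone.utc))
```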
Weird trends
Look for weird trends too. For example you may see rare but persistent time based trends:
is_easter
is_superbowl
is_national_emergency
etc.
These often require that you cross reference your data against some external source that maps events to time.
Why graph?
There are two reasons that I think graphing is so important.
Weird trends:
While the general trends can be automated pretty easily (just add them every time), weird trends will often require a human eye and knowledge of the world to find. This is one reason that graphing is so important.
Data errors:
All too often data has serious errors in it. For example, you may find that the dates were encoded in two formats and only one of them has been correctly loaded into your program. There are a myriad of such problems and they are surprisingly common. This is the other reason I think graphing is important, not just for time series, but for any data.
Answer from https://datascience.stackexchange.com/questions/2368/machine-learning-features-engineering-from-date-time-data
No, your choice isn't ideal, because that way you lose the cyclical structure. For hours, the model needs to know that 23:00 is close to 00:00, and the same holds for weekdays (which generally run from Monday as 0 to Sunday as 6). With your method, the model treats every day or hour as an independent entity that has no relation to the others, and that's wrong.
The right way to represent this type of data is to represent each feature (hour, day of the week, ...) with two features.
Those two features are the sin and cos of the value. For example, for hours you create hours_sin and hours_cos. Before applying sin and cos, you need to calculate theta, the hour expressed as an angle on a 24-hour circle. In Python you just import pi from math, then:
theta = 2 * pi * hour / 24
Then you also import sin and cos from math and calculate sin(theta) and cos(theta).
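Here is a small sketch of the sin/cos encoding. Note that the angle has to be scaled by the length of the cycle (24 for hours, 7 for weekdays) so that one full cycle maps to one full turn. The helper names encode_cyclical and distance are made up for illustration.

```python
from math import cos, pi, sin

def encode_cyclical(value, period):
    """Map a cyclic value (e.g. hour 0..23) to a point on the unit circle."""
    theta = 2 * pi * value / period
    return sin(theta), cos(theta)

def distance(a, b):
    """Euclidean distance between two encoded points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

hours_23 = encode_cyclical(23, 24)
hours_00 = encode_cyclical(0, 24)
hours_12 = encode_cyclical(12, 24)

# 23:00 ends up close to 00:00, while 12:00 is on the opposite side of the circle.
near = distance(hours_23, hours_00)
far = distance(hours_12, hours_00)

# The same helper works for weekdays with period=7 (Monday = 0 ... Sunday = 6).
sunday = encode_cyclical(6, 7)
```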

Similarity of trends in time series analysis

I am new to time series analysis. I am trying to find the trend of a short (1 day) temperature time series and have tried different approximations. The sampling interval is 2 minutes. The data were collected at different stations, and I will compare the different trends to see whether they are similar or not.
I am facing three challenges in doing this:
Q1 - How can I extract the pattern?
Q2 - How can I quantify the trend, since I will compare trends belonging to two different places?
Q3 - When can I say two trends are similar or not similar?
Q1 - How can I extract the pattern?
You would start by performing time series analysis on both your data sets. You will need a statistical library to do the tests and comparisons.
If you can use Python, pandas is a good option.
In R, the forecast package is great. Start by running ets on both data sets.
Q2 - How can I quantify the trend, since I will compare trends belonging to two different places?
The idea behind quantifying trend is to start by looking for a (linear) trend line. All stats packages can assist with this. For example, if you are assuming a linear trend, then the trend line is the line that minimizes the squared deviation from your data points.
The Wikipedia article on trend estimation is quite accessible.
Also, keep in mind that trend can be linear, exponential or damped. Different trending parameters can be tried to take care of these.
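For the linear case, a least-squares trend line can be sketched with NumPy as below. The station data is synthetic, the helper name linear_trend is invented, and the 2-minute spacing follows the question; the fitted slope (degrees per minute) is one number that can be compared between stations.

```python
import numpy as np

def linear_trend(values, minutes_per_sample=2.0):
    """Least-squares line through the series; returns (slope per minute, intercept)."""
    values = np.asarray(values, dtype=float)
    t = np.arange(len(values)) * minutes_per_sample  # elapsed minutes
    slope, intercept = np.polyfit(t, values, deg=1)
    return slope, intercept

# Synthetic day of data (assumption): 720 samples at 2-minute spacing,
# starting at 15 degrees and warming by 0.01 degrees per minute.
station_a = 15.0 + 0.01 * np.arange(720) * 2.0

slope_a, intercept_a = linear_trend(station_a)
```

An exponential trend could be handled the same way by fitting the line to the logarithm of the values, provided they are all positive.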
Q3 - When can I say two trends are similar or not similar?
Run ARIMA on both data sets. The basic idea here is to see if the same set of parameters (which make up the ARIMA model) can describe both of your temperature time series. If you run auto.arima() in forecast (R), it will select the parameters p, d, q for your data, which is a great convenience.
Another thought is to perform a 2-sample t-test of both your series and check the p-value for significance. (Caveat: I am not a statistician, so I am not sure if there is any theory against doing this for time series.)
While researching I came across the Granger Test – where the basic idea is to see if one time series can help in forecasting another. Seems very applicable to your case.
So these are just a few things to get you started. Hope that helps.

Resources