If I have 15-minute interval data for predicting an hourly target, should I train on the 15-minute data or aggregate it to hourly data? - machine-learning

I have the following dataset, recorded at 15-minute intervals:
Time A B A+B
2021-01-01 00:00 10 20 30
2021-01-01 00:15 20 30 50
2021-01-01 00:30 30 40 70
2021-01-01 01:00 40 50 90
2021-01-01 01:00 10 20 30
2021-01-01 01:15 20 30 50
2021-01-01 01:30 30 40 70
2021-01-01 02:00 40 50 90
Basically, I need to develop a machine learning model that predicts the hourly A+B:
Time A+B
2021-01-02 00:00
2021-01-02 01:00
2021-01-02 02:00
2021-01-02 03:00
My question is about selecting the target label for training.
Should I train on the 15-minute data and sum the predictions afterwards to get the hourly A+B, or should I aggregate the 15-minute data into hourly data before training? What is the difference?
Is there any difference between training A and B separately and adding the predictions up, compared with training on A+B directly?
Thanks a lot

Here is a possible solution. Since you care about the hourly total and you get data every 15 minutes, I would feed the four 15-minute readings for a whole hour to the network as one input. The output would then be the final value at the end of that hour.
So, for example, the input to the net would have shape [4,2], holding the A and B values for the four readings. The output would be the final result for the hour.
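A minimal sketch of that input layout, assuming the readings sit on a regular 15-minute grid and that the hourly target is the sum of A+B over the four readings (the question does not say whether the target is a sum or the last reading, so that part is an assumption). The array values are hypothetical, loosely based on the table in the question.

```python
import numpy as np

# Hypothetical 15-minute readings for two full hours; columns are A and B.
readings = np.array([
    [10, 20], [20, 30], [30, 40], [40, 50],
    [10, 20], [20, 30], [30, 40], [40, 50],
], dtype=float)

steps_per_hour = 4
n_hours = len(readings) // steps_per_hour

# One sample per hour, each with shape [4, 2] (4 readings x 2 features).
X = readings[: n_hours * steps_per_hour].reshape(n_hours, steps_per_hour, 2)

# Assumed target: hourly total of A+B across the four readings.
y = X.sum(axis=(1, 2))

print(X.shape)  # (2, 4, 2)
print(y)
```

If the real target is the value at the end of the hour instead, `y` would be built from `X[:, -1, :].sum(axis=1)`.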
On another note, this doesn't sound like a problem that needs machine learning, but I'm sure there is more information I don't know about.

I would first split the data into a training and validation set as-is.
Then consider a third option: slide a 1-hour window over the samples in each set, one 15-minute step at a time, to produce hourly values. This creates about 3x more valid training samples than simple aggregation, although consecutive windows overlap and are therefore correlated.
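To make the sample-count difference concrete, here is a small pandas sketch comparing plain hourly aggregation with a sliding 1-hour window. The timestamps are a hypothetical regular 15-minute grid and the hourly value is assumed to be a sum; neither is confirmed by the question.

```python
import pandas as pd

# Hypothetical regular 15-minute grid covering two hours.
idx = pd.date_range("2021-01-01 00:00", periods=8, freq="15min")
df = pd.DataFrame(
    {"A": [10, 20, 30, 40, 10, 20, 30, 40],
     "B": [20, 30, 40, 50, 20, 30, 40, 50]},
    index=idx,
)
df["A+B"] = df["A"] + df["B"]

# Simple aggregation: one non-overlapping sample per hour.
hourly = df["A+B"].resample("60min").sum()

# Sliding window: a 4-reading (1-hour) total at every 15-minute step.
# The first three rows lack a full window and are dropped.
sliding = df["A+B"].rolling(window=4).sum().dropna()

print(len(hourly), len(sliding))  # 2 5
```

With N readings this gives N/4 aggregated samples versus N-3 sliding samples, which approaches a 4:1 ratio for long series. Applying the window separately inside the training and validation sets, as suggested above, keeps overlapping windows from straddling the split.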
Whether to build a model of A, B or A+B depends on what you want to predict. Do you need predictions for A and B separately, or do you only need A+B? If you only want A+B, then build the model around that. Any basic ML model will be able to handle the summation, so it will likely not make a significant difference. As with most data-driven problems, it will depend on the data, so if you really want to find out whether there is a difference for your data, try both and compare the results on a hold-out set.

Related

Best way to handle irregular time intervals when forecasting with LSTM?

I have two datasets that record the temperature received from a sensor and the time (down to the second) each reading was received. I am evaluating an LSTM as one option for predicting future temperatures. The intervals between records are very irregular.
The first dataset has the following characteristics:
Time period: 1 day. Number of samples: 2026. Max interval in seconds: 4377. Min interval in seconds: 0. Average interval in seconds: 41.17. 93% of intervals are under one minute, 85% are under 30 seconds and 80% are under 20 seconds.
The second dataset has the following characteristics:
Time period: 3 days. Number of samples: 1383. Max interval in seconds: 23976. Min interval in seconds: 0. Average interval in seconds: 184.93. 89% of intervals are under one minute, 79% are under 30 seconds and 74% are under 20 seconds.
Given the irregular nature of the intervals, what is the best way to prepare the data as input for an LSTM model? Leave it as is? Take the average temperature over fixed time intervals? Remove extreme values? Other options?
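As an illustration of the "average over fixed time intervals" option only (not a recommendation of which option is best), here is a pandas sketch that bins irregular readings into 1-minute averages and fills empty bins by time-based interpolation. The timestamps, temperatures and the 1-minute bin width are all hypothetical.

```python
import pandas as pd

# Hypothetical irregular sensor readings.
times = pd.to_datetime([
    "2021-01-01 00:00:05",
    "2021-01-01 00:00:20",
    "2021-01-01 00:01:10",
    "2021-01-01 00:04:30",
])
temps = pd.Series([20.0, 20.4, 21.0, 22.0], index=times)

# Average all readings that fall into each 1-minute bin.
# Bins without any reading become NaN.
regular = temps.resample("1min").mean()

# Fill the empty bins by interpolating linearly in time.
filled = regular.interpolate(method="time")

print(regular)
print(filled)
```

Long gaps (such as the multi-hour maximum intervals mentioned above) would be interpolated across as well, so in practice it may be worth capping the fill with the `limit` argument of `interpolate` or splitting the series at large gaps.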

Google Vertex Auto-ML Forecast every 30 mins predict 2 hours

GOAL
Every 30 mins I get a new bunch of price related data:
CurrentDatetime, CurrentPrice, Feature1, Feature2
I want to predict the price in 2 hours from now, so 4x30mins (4 steps into the future)
PROBLEM DESCRIPTION
I am puzzled about what Google Vertex Auto-ML forecasting is doing and whether I can trust the results I am getting. I am also unsure how to use a trained model for batch prediction.
WHAT I DID
I think the way to set up the training dataset is to add:
TargetDateTime column (2 hours ahead of CurrentDatetime)
TargetActualPrice column (the actual price 2 hours into the future)
TimeSeriesId column (always equal to 1 as all the data is one time series).
This means, every 30 mins I now have:
CurrentDatetime, CurrentPrice, Feature1, Feature2, TargetDateTime, TargetActualPrice, TimeSeriesId
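For reference, the three extra columns described above can be derived from the raw 30-minute rows roughly like this. It is a sketch in pandas, assuming one row per 30 minutes with no gaps; the date is hypothetical and the prices are taken from the example table further down.

```python
import pandas as pd

# Hypothetical 30-minute rows (Feature1/Feature2 omitted for brevity).
df = pd.DataFrame({
    "CurrentDatetime": pd.date_range("2021-01-01 11:00", periods=8, freq="30min"),
    "CurrentPrice": [2.1, 2.2, 2.3, 2.3, 2.4, 2.5, 2.6, 2.7],
})

horizon_steps = 4  # 4 x 30 minutes = 2 hours

# Timestamp the prediction refers to.
df["TargetDateTime"] = df["CurrentDatetime"] + pd.Timedelta(hours=2)

# Price observed 4 rows later; unknown (NaN) for the last 4 rows.
df["TargetActualPrice"] = df["CurrentPrice"].shift(-horizon_steps)

# Single series, so a constant identifier.
df["TimeSeriesId"] = 1

print(df)
```

The shift relies on the rows being evenly spaced; with missing rows, a merge on TargetDateTime against CurrentDatetime would be the safer way to look up the future price.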
I use this dataset to train an auto-ml forecast model, setting:
"series Identifier Column" to TimeSeriesId,
"Target Column" to TargetActualPrice,
"Timestamp Column" to TargetDateTime
"Data granularity" to 30mins
"Forecast Horizon" to 4
"Context Window" to 4 (use last 2 hours of historic data to predict next 2 hours)
Split train/val/test chronologically (on TargetDateTime, as it is the Timestamp column)
This model trains and gives some results.
Looking at the saved test data set, I can see 4 rows for each TargetDateTime, with a predictedvalue column containing a price prediction and a predicted_on_TargetDateTime column which goes from CurrentDateTime to TargetDateTime in 30 mins intervals.
This makes sense, for every 30 mins of input data, the model makes 4 predictions, each 30 mins into the future, ending up with a prediction 2 hours into the future. Great.
PROBLEM 1 : Batch predictions
I get confused when I try to use this trained model to make batch predictions. The crux of the problem is that Vertex will look at the batch input dataset, find the first row (30 min input data) for which there is no actual price data yet (TargetActualPrice is null) and then predict the next 4 steps (2 hours). This seems to mean that, to make the next prediction, I would need to wait for the actuals of the previous prediction. But that means that when I get the next set of input data (30 mins later, and still 1.5 hours before the previous prediction's target time), I cannot use the model to make a new prediction because the previous prediction has no TargetActualPrice yet.
To make it more explicit, suppose I have the following batch data:
CurrentDatetime  CurrentPrice  Feature1  Feature2  TargetDateTime  TargetActualPrice  TimeSeriesId
11:00            $2.1          3.4       abc       13:00           $2.4               1
11:30            $2.2          3.3       abd       13:30           $2.5               1
12:00            $2.3          3.1       abe       14:00           $2.6               1
12:30            $2.3          3.0       abe       14:30           $2.7               1
13:00            $2.4          2.9       abf       15:00           null               1
13:30            $2.5          2.8       abg       15:30           null               1
14:00            $2.6          2.7       abh       16:00           null               1
14:30            $2.7          2.6       abi       16:30           null               1
In the batch data above, I have 2 hours (4 rows) of historic data with actuals (11:00-12:30). Current time is 14:30 so I don't have actuals for 15:00 yet. The last prediction made was with the 13:00 input data (as it is the first row with actual data = null). The 13:30 - 14:30 rows I cannot use for a new prediction until I have the 15:00 actuals.
This doesn't make sense to me. Shouldn't I be able to make a new 4-step (2 hour) prediction every 30 mins? Am I doing something wrong?
Is the solution that, when I get the next 30 mins of input data, I put the last predicted value into the actuals column (and replace it with the real actual once I have it) so that I can proceed with the next prediction? That seems cumbersome.
PROBLEM 2 : Leakage
My other concern with this is how Vertex is training and calculating the results. I am worried that when (during training) Vertex picks up the next row of 30 mins data, it will create a prediction based on the previous 4x30 mins of data (2 hour "Context window") INCLUDING the TargetActualPrice data for those rows. But this would be incorrect, as the TargetActualPrice value is 2 hours into the future and not yet available when the next 30 mins of data comes in. This would mean leakage of actual data, predicting using actuals before they are known (ie cheating).
SUMMARY
In summary, I am hoping someone can tell if I am setting the dataset up incorrectly, and/or how to batch predict every 30 mins.
With regards to my leakage concern, it originally came about because I didn't understand the batch predictions, and it seemed that every 30 mins I needed the price from 2 hours in the future in order to update the actuals of the previous prediction and then create a new prediction, which is obviously unknown at that moment.
Now my understanding is that during training, even though every 30 min timestep has an actual price column for 2 hours in the future, the model will only use actual prices available at the moment of prediction. If making a prediction at 14.30, the model will use the 14.30 data (and historic data before 14.30), and then make 4x30 mins predictions. These four predictions neither use nor are influenced by the 30 mins data after 14.30 (15.00, 15.30, 16.00, 16.30). Nor do they use the 16.30 actual price value, which is the target column on the 14.30 row of data.
I think I now understand the batch predictions.
Every 30 mins I get a new set of data (Feature1 and Feature2) as well as the CurrentPrice. I just add this data to the batch table with TargetActualPrice set to NULL and TargetDateTime set to 2 hours in the future. I also update the TargetActualPrice in the previous row(s) in the batch table (those with TargetDateTime equal to the current time). Now I can run the model against this batch table and get a prediction for a max of 4 rows with TargetActualPrice = NULL.
For clarity, in the batch table, I end up with 4 TargetActualPrice = NULL rows, matching the prediction horizon of the model. When I run the prediction, I will get 4 prediction values for these NULL rows.

Ideas for model selection for predicting sales at locations based on time component and class column

I am trying to build a model for predicting sales out of three different storages based on previous sales. However, there is an extra (and very important) component to this: a column with the values A and B. These letters indicate a price category, where A signifies a comparatively cheaper price than similar products. Here is a mock example of the table:
week  Letter  Storage1 sales  Storage2 sales  Storage3 sales
1     A       50              28              34
2     A       47              29              19
3     B       13              11              19
4     B       14              19              8
5     B       21              13              3
6     A       39              25              23
I have previously worked with both types of prediction problem separately, namely time series analysis and regression, using classical methods and machine learning, but I have not built a model that takes both prediction types into account.
I am writing this to hear any suggestions as how to tackle such a prediction problem. I am thinking of converting the three storage sale columns into one, in order to have one feature column, and having three one-hot encoder columns to indicate the storage. However I am not sure how to tackle this problem with a machine learning approach and would like to hear if anyone knows where to start with such a prediction problem.
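The reshaping idea described above (one sales column plus one-hot storage indicators) can be sketched with pandas as follows. The column names `storage` and `sales` are made up for the illustration, and one-hot encoding the Letter column as well is an assumption, not something stated in the question.

```python
import pandas as pd

# The mock table from the question, in its original wide layout.
wide = pd.DataFrame({
    "week": [1, 2, 3, 4, 5, 6],
    "Letter": ["A", "A", "B", "B", "B", "A"],
    "Storage1 sales": [50, 47, 13, 14, 21, 39],
    "Storage2 sales": [28, 29, 11, 19, 13, 25],
    "Storage3 sales": [34, 19, 19, 8, 3, 23],
})

# Stack the three storage columns into one "sales" column,
# keeping track of which storage each row came from.
long = wide.melt(
    id_vars=["week", "Letter"],
    var_name="storage",
    value_name="sales",
)

# One-hot encode the storage and the price category.
features = pd.get_dummies(long, columns=["storage", "Letter"])

print(features.head())
```

From here, lagged sales per storage (for example `long.groupby("storage")["sales"].shift(1)`) could be added as extra columns so that a regular regression model sees both the time component and the price category.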

LSTM Predictions

I'm working on an LSTM model. I found some examples and was confused about the output.
Here, I'm trying to predict the next 24 hours. Should I put 1 or 24 in the Dense layer? Is this section correct?
I've been following this video
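from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

reg = Sequential()
reg.add(LSTM(units=5, activation='relu', input_shape=(24, 1)))  # 24 past hours, 1 feature
reg.add(Dense(24))  # one output per hour: predicts the next 24h in a single step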
Thank you.
A Dense layer with 1 unit gives you one output, so if you are predicting only the next hour, use Dense(1). If you want to predict the next 24 hours, there are two ways to do it. You can iteratively predict 1 hour 24 times, feeding each new prediction into the next input sequence. Or you can predict all 24 hours at once by using a Dense layer with 24 outputs, as in your code.
Example
[1,2,3,4,5] is my sequence and I want to predict the 10th value.
I can predict the 6th value, then shift my next time sequence so I end up with [2,3,4,5,6], and keep doing that to predict the 7th, 8th, 9th, and 10th.
Alternatively, I can use [1,2,3,4,5] to try and predict [6,7,8,9,10] in one step.
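The two strategies can be sketched in plain Python. The two `predict_*` functions are hypothetical stand-ins for trained models (a one-output network and a multi-output network); here they just continue the last observed difference so that the example runs without any ML library.

```python
def predict_next(window):
    # Hypothetical stand-in for a trained one-step model (Dense(1)):
    # continues the last observed difference.
    return window[-1] + (window[-1] - window[-2])


def predict_block(window, steps):
    # Hypothetical stand-in for a multi-output model (Dense(steps)):
    # returns all future values in one call.
    step = window[-1] - window[-2]
    return [window[-1] + step * k for k in range(1, steps + 1)]


def recursive_forecast(history, steps):
    # Predict one value, append it, drop the oldest value, repeat.
    window = list(history)
    preds = []
    for _ in range(steps):
        nxt = predict_next(window)
        preds.append(nxt)
        window = window[1:] + [nxt]
    return preds


history = [1, 2, 3, 4, 5]
recursive = recursive_forecast(history, 5)  # five one-step predictions
direct = predict_block(history, 5)          # one five-step prediction

print(recursive)  # [6, 7, 8, 9, 10]
print(direct)     # [6, 7, 8, 9, 10]
```

With a real model the two approaches generally give different results: the recursive one feeds its own errors back in as inputs, while the direct one needs a training target of length 5 (or 24) per sample.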

SARIMAX model fitting too slow in statsmodels

I am running a grid search to perform model selection by fitting SARIMAX(p, d, q)x(P, D, Q, s) models using the SARIMAX() class in statsmodels. I set d and D to 1 and s to 7 and iterate over values of p in {0, 1}, q in {0, 1, 2}, P in {0, 1}, Q in {0, 1}, and trend in {None, 'c'}, which makes for a total of 48 iterations. During the model fitting phase, if any combination of the parameters leads to a non-stationary or non-invertible model, I move on to the next combination of parameters.
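For readers who want to reproduce the search space, the 48 combinations can be enumerated like this. It is only a sketch of the grid itself; the dictionary keys are chosen to match the `order`, `seasonal_order` and `trend` keyword arguments of statsmodels' SARIMAX, and the fitting loop is left out.

```python
from itertools import product

d, D, s = 1, 1, 7  # fixed differencing orders and weekly seasonality

grid = [
    {"order": (p, d, q), "seasonal_order": (P, D, Q, s), "trend": trend}
    for p, q, P, Q, trend in product(
        [0, 1],        # p
        [0, 1, 2],     # q
        [0, 1],        # P
        [0, 1],        # Q
        [None, "c"],   # trend
    )
]

print(len(grid))  # 48
print(grid[0])
```

Each entry could then be unpacked into the model constructor, for example `SARIMAX(endog, **params)` where `endog` stands for one agent's series.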
I have a set of time-series, each one representing the performance of an agent over time and consisting of 83 (daily) measurements with a weekly seasonality. I keep 90% of the data for model fitting, and the last 10% for forecasting/testing purposes.
What I find is that model fitting during the grid search takes a very long time, about 11 minutes, for a couple of agents, whereas the same 48 iterations take much less time, less than 10 seconds, for others.
However, if, before performing my grid search, I log-transform the data corresponding to the agents whose analyses take a very long time, the same 48 iterations take about 15 seconds! However, as much as I love the speed-up factor, the final forecast turns out to be poorer compared to the case where the original (that is, not log-transformed) data was used. So, I'd rather keep the data in its original format.
My questions are the following:
What causes such a slowdown for certain time series?
And is there a way to speed up the model fitting by passing certain arguments to SARIMAX() or SARIMAX.fit()? I have tried simple_differencing = True, which constructs a smaller model in the state space and reduced the time from 11 minutes to 6 minutes, but that's still too long.
I'd appreciate any help.
