Google Vertex Auto-ML Forecast every 30 mins predict 2 hours - time-series

GOAL
Every 30 mins I get a new bunch of price related data:
CurrentDatetime, CurrentPrice, Feature1, Feature2
I want to predict the price in 2 hours from now, so 4x30mins (4 steps into the future)
PROBLEM DESCRIPTION
I am puzzled about what Google Vertex AutoML forecasting is doing and whether I can trust the results I am getting. I am also unsure how to use a trained model for batch prediction.
WHAT I DID
I think the way to set up the training dataset is to add:
TargetDateTime column (2 hours ahead of CurrentDatetime)
TargetActualPrice column (the actual price 2 hours into the future)
TimeSeriesId column (always equal to 1, as all the data is one time series).
This means, every 30 mins I now have:
CurrentDatetime, CurrentPrice, Feature1, Feature2, TargetDateTime, TargetActualPrice, TimeSeriesId
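The column layout above can be sketched in plain Python. This is only an illustration of the data preparation, not anything Vertex-specific: the helper name `add_targets`, the date, and the prices are assumptions, and Feature1/Feature2 are left out for brevity.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=30)
HORIZON_STEPS = 4  # 4 x 30 min = 2 hours ahead


def add_targets(rows):
    """Attach TargetDateTime, TargetActualPrice and TimeSeriesId to each row.

    `rows` is a list of dicts sorted by CurrentDatetime at a regular
    30-minute spacing. The target price is None when the observation
    2 hours ahead does not exist yet.
    """
    price_at = {r["CurrentDatetime"]: r["CurrentPrice"] for r in rows}
    out = []
    for r in rows:
        target_time = r["CurrentDatetime"] + HORIZON_STEPS * STEP
        out.append({
            **r,
            "TargetDateTime": target_time,
            "TargetActualPrice": price_at.get(target_time),  # None if unknown
            "TimeSeriesId": 1,
        })
    return out


start = datetime(2024, 1, 1, 11, 0)  # hypothetical date
prices = [2.1, 2.2, 2.3, 2.3, 2.4, 2.5, 2.6, 2.7]  # illustrative prices
rows = [
    {"CurrentDatetime": start + i * STEP, "CurrentPrice": p}
    for i, p in enumerate(prices)
]
table = add_targets(rows)
```

With eight 30-minute observations, the first four rows get a known target and the last four are left as None, because their target time has not been observed yet.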
I use this dataset to train an auto-ml forecast model, setting:
"series Identifier Column" to TimeSeriesId,
"Target Column" to TargetActualPrice,
"Timestamp Column" to TargetDateTime
"Data granularity" to 30mins
"Forecast Horizon" to 4
"Context Window" to 4 (use last 2 hours of historic data to predict next 2 hours)
Split train/val/test chronologically (on TargetDateTime, as it is the Timestamp column)
This model trains and gives some results.
Looking at the saved test data set, I can see 4 rows for each TargetDateTime, with a predictedvalue column containing a price prediction and a predicted_on_TargetDateTime column which goes from CurrentDateTime to TargetDateTime in 30 mins intervals.
This makes sense: for every 30 mins of input data, the model makes 4 predictions, each 30 mins further into the future, ending with a prediction 2 hours ahead. Great.
PROBLEM 1 : Batch predictions
I get confused when I try to use this trained model to make batch predictions. The crux of the problem is that Vertex will look at the batch input dataset, find the first row (30 min input data) for which there is no actual price data yet (TargetActualPrice is null), and then predict the next 4 steps (2 hours). This seems to mean that, to make the next prediction, I would need to wait for the actuals of the previous prediction. But then, when I get the next set of input data (30 mins later, and 1.5 hours before the previous prediction's target time), I cannot use the model to make a new prediction because the previous prediction has no TargetActualPrice yet.
To make it more explicit, suppose I have the following batch data:
CurrentDatetime  CurrentPrice  Feature1  Feature2  TargetDateTime  TargetActualPrice  TimeSeriesId
11:00            $2.1          3.4       abc       13:00           $2.4               1
11:30            $2.2          3.3       abd       13:30           $2.5               1
12:00            $2.3          3.1       abe       14:00           $2.6               1
12:30            $2.3          3.0       abe       14:30           $2.7               1
13:00            $2.4          2.9       abf       15:00           null               1
13:30            $2.5          2.8       abg       15:30           null               1
14:00            $2.6          2.7       abh       16:00           null               1
14:30            $2.7          2.6       abi       16:30           null               1
In the batch data above, I have 2 hours (4 rows) of historic data with actuals (11:00-12:30). Current time is 14:30 so I don't have actuals for 15:00 yet. The last prediction made was with the 13:00 input data (as it is the first row with actual data = null). The 13:30 - 14:30 rows I cannot use for a new prediction until I have the 15:00 actuals.
This doesn't make sense to me. I should be able to make a new 2 hour prediction every 30 mins? I must be doing something wrong?
Is the solution that, when I get the next 30 mins of input data, I put the last predicted value into the actuals column (and update it with the real actuals once I have them) so that I can proceed with the next prediction? That seems cumbersome.
PROBLEM 2 : Leakage
My other concern with this is how Vertex is training and calculating the results. I am worried that when (during training) Vertex picks up the next row of 30 mins data, it will create a prediction based on the previous 4x30 mins of data (2 hour "Context window") INCLUDING the TargetActualPrice data for those rows. But this would be incorrect, as the TargetActualPrice value is 2 hours into the future and not yet available when the next 30 mins of data comes in. This would mean leakage of actual data, predicting using actuals before they are known (ie cheating).
SUMMARY
In summary, I am hoping someone can tell if I am setting the dataset up incorrectly, and/or how to batch predict every 30 mins.

With regards to my leakage concern, it originally came about because I didn't understand the batch predictions, and it seemed that every 30 mins I needed the price from 2 hours in the future in order to update the actuals of the previous prediction and then create a new prediction, which is obviously unknown at that moment.
Now my understanding is that during training, even though every 30 min timestep has an actual price column for 2 hours in the future, the model will only use actual prices available at the moment of prediction. If making a prediction at 14.30, the model will use the 14.30 data (and historic data before 14.30), and then make 4x30 mins predictions. These four predictions neither use nor are influenced by the 30 mins data after 14.30 (15.00, 15.30, 16.00, 16.30). Nor do they use the 16.30 actual price value, which is the target column on the 14.30 row of data.
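The "only what is known at prediction time" idea can be checked mechanically on your own table, independent of whatever Vertex does internally (which is not shown here). A minimal sketch, where `visible_at`, the date, and all prices are assumptions made up for illustration:

```python
from datetime import datetime, timedelta


def visible_at(table, prediction_time):
    """Return the rows of `table` as they would look at `prediction_time`.

    Rows observed later are dropped; a target whose TargetDateTime is still
    in the future is masked to None, because that actual is not known yet.
    """
    view = []
    for row in table:
        if row["CurrentDatetime"] > prediction_time:
            continue  # this observation has not arrived yet
        known = row["TargetDateTime"] <= prediction_time
        view.append({
            **row,
            "TargetActualPrice": row["TargetActualPrice"] if known else None,
        })
    return view


day = datetime(2024, 1, 1)  # hypothetical date
full_table = [  # fully back-filled history, as it would exist for training
    {
        "CurrentDatetime": day + timedelta(hours=11, minutes=30 * i),
        "TargetDateTime": day + timedelta(hours=13, minutes=30 * i),
        "TargetActualPrice": price,
    }
    for i, price in enumerate([2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1])
]
snapshot = visible_at(full_table, day + timedelta(hours=14, minutes=30))
masked = [r for r in snapshot if r["TargetActualPrice"] is None]
```

At 14:30 all eight rows have been observed, but only the targets for 13:00 through 14:30 are known; the four targets from 15:00 to 16:30 are masked. Any feature pipeline that still sees those four values at 14:30 would be leaking.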

I think I now understand the batch predictions.
Every 30 mins I get a new set of data (Feature1 and Feature2) as well as the CurrentPrice. I just add this data to the batch table with TargetActualPrice set to NULL and TargetDateTime set to 2 hours in the future. I also update the TargetActualPrice in the previous row(s) in the batch table (those with TargetDateTime equal to the current time). Now I can run the model against this batch table and get a prediction for a maximum of 4 rows with TargetActualPrice = NULL.
For clarity, in the batch table I end up with 4 rows with TargetActualPrice = NULL, matching the prediction horizon of the model. When I run the prediction, I will get 4 prediction values for these NULL rows.
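The update routine described above could look roughly like this. It only maintains the batch table in memory; submitting the table to Vertex is not shown. The function name `on_new_data`, the date, and the feature values are placeholders.

```python
from datetime import datetime, timedelta

HORIZON = timedelta(hours=2)


def on_new_data(batch, now, price, feature1, feature2):
    """Update the batch table in place when a new 30-minute observation arrives."""
    # 1. The price observed now is the actual for rows whose target time is now.
    for row in batch:
        if row["TargetDateTime"] == now:
            row["TargetActualPrice"] = price
    # 2. Append the new row; its target lies 2 hours ahead and is still unknown.
    batch.append({
        "CurrentDatetime": now, "CurrentPrice": price,
        "Feature1": feature1, "Feature2": feature2,
        "TargetDateTime": now + HORIZON, "TargetActualPrice": None,
        "TimeSeriesId": 1,
    })
    return batch


batch = []
t0 = datetime(2024, 1, 1, 11, 0)  # hypothetical start time
for i, p in enumerate([2.1, 2.2, 2.3, 2.3, 2.4, 2.5, 2.6, 2.7]):
    on_new_data(batch, t0 + i * timedelta(minutes=30), p, 0.0, "x")

pending = [r for r in batch if r["TargetActualPrice"] is None]
```

After the 14:30 update, `pending` holds exactly the four rows with target times 15:00 to 16:30, which is the set of rows the model is asked to fill in.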

Related

How to forecast macro trend by multiple index by LSTM model?

I have just started exploring machine learning. I want to try predicting the macroeconomic trend from a group of index futures using an LSTM model. After reading many articles, I have come up with the 2 approaches below. Which is the better approach?
1. In the pre-processing stage, group the index futures (e.g. S&P 500, Dow Jones, Nasdaq 100, FTSE 100 etc.) and take the average price. Add an extra column holding the average price 2 days later.
data structure:
date, avg price, T+2 avg price
2. Simply pick one index future at random and add an extra column holding its average price 2 days later.
date, S&P, RTY, DJ, FESX, NK, S&P +2
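For reference, the "average, then add a T+2 column" step of approach 1 can be sketched as below. The function name and all prices are made up; nothing here says which approach is better.

```python
def average_and_shift(index_prices, lag=2):
    """Average several index series per day and attach the value `lag` days later.

    `index_prices` maps an index name to its list of daily prices (same length,
    same dates). Returns a list of (avg_price, avg_price_lag_days_later) tuples;
    the last `lag` rows have no label yet, so their second element is None.
    """
    n_days = len(next(iter(index_prices.values())))
    avg = [
        sum(series[d] for series in index_prices.values()) / len(index_prices)
        for d in range(n_days)
    ]
    return [
        (avg[d], avg[d + lag] if d + lag < n_days else None)
        for d in range(n_days)
    ]


prices = {  # made-up numbers, for illustration only
    "S&P": [100.0, 102.0, 104.0, 106.0],
    "DJ": [200.0, 202.0, 204.0, 206.0],
}
rows = average_and_shift(prices)
```

The last two rows carry None as their label and would be dropped before training. Approach 2 uses the same shift, applied to a single index column instead of the average.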

If I get 15 mins interval data for predicting a hourly target, should I use 15 mins data or aggregate to 1 hr data for training?

I have the following dataset, with the data at 15-minute intervals:
Time A B A+B
2021-01-01 00:00 10 20 30
2021-01-01 00:15 20 30 50
2021-01-01 00:30 30 40 70
2021-01-01 01:00 40 50 90
2021-01-01 01:00 10 20 30
2021-01-01 01:15 20 30 50
2021-01-01 01:30 30 40 70
2021-01-01 02:00 40 50 90
Basically, I need to develop a machine learning model for predicting the hourly A+B:
Time A+B
2021-01-02 00:00
2021-01-02 01:00
2021-01-02 02:00
2021-01-02 03:00
My question is about selecting the target label for training the model.
Should I train on the 15-minute data and sum the results afterward to get the hourly A+B, or should I aggregate the 15-minute data into hourly data before training? What is the difference?
Is there any difference between training A and B separately and adding them up, compared with training on A+B directly?
Thanks a lot
Here is a possible solution. Since you care about the hourly total and you get data every 15 minutes, I would give the network a whole hour of 15-minute data as input. The output would then be the final value at the end of that hour.
So, for example, the input to the net would have shape [4, 2]: four 15-minute steps, each with the A and B values. The output would be the final result after the hour.
On another note, this doesn't sound like a problem that needs machine learning, but I'm sure there is more info I don't know about.
I would first split the data into a training and validation set as-is.
Then take a third option: use a sliding window of 1 hour over the samples in each set to produce data at one hour intervals. This will create 3x more valid training samples than simply aggregating.
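A small sketch of the difference between plain hourly aggregation and the sliding window, using the A+B values from the question. The helper name `hourly_sums` is an assumption.

```python
def hourly_sums(values, window=4, stride=1):
    """Sum `values` over windows of `window` consecutive 15-minute steps."""
    return [
        sum(values[i:i + window])
        for i in range(0, len(values) - window + 1, stride)
    ]


a_plus_b = [30, 50, 70, 90, 30, 50, 70, 90]  # A+B column from the question

aggregated = hourly_sums(a_plus_b, stride=4)  # one sample per clock hour
sliding = hourly_sums(a_plus_b, stride=1)     # one sample per 15-minute shift
```

Here `aggregated` has 2 samples and `sliding` has 5. For a long series the sliding version approaches four times as many samples as simple aggregation, i.e. about 3x more. Note that neighbouring sliding samples overlap, which is why the split into training and validation sets should happen first.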
Whether to build a model of A, B or A+B depends on what you want to predict. Do you need predictions for A and B separately? or do you only need A+B? If you only want A+B, then build the model around that. Any basic ML model will be able to handle the summation, so it will likely not make significant difference. As with most data-driven problems, it will depend on the data, so if you really want to find out if there is a difference for your data, you may want to try both, and compare results on a hold-out set.

LSTM Predictions

I'm working on an LSTM model. I found some examples and was confused about the output.
Here, I'm trying to predict the next 24 hours. Should I put 1 or 24 on the Dense layer? Is this section correct?
I've been following this video
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

reg = Sequential()
reg.add(LSTM(units=5, activation='relu', input_shape=(24, 1)))
reg.add(Dense(24))  # predicting the next 24 h
Thank you.
A Dense layer with 1 unit means you will get one output. So if you are predicting only the next hour, use Dense(1). However, keep in mind that if you want to predict the next 24 hours, there are two ways to do that. You can iteratively predict 1 hour 24 times, feeding each new prediction into your next input sequence. Or you can predict all 24 hours at once by using a Dense layer with 24 outputs.
Example
[1,2,3,4,5] is my sequence and I want to predict the 10th value.
I can predict the 6th value, then shift my next time sequence so I end up with [2,3,4,5,6], and keep doing that to predict the 7th, 8th, 9th, and 10th.
Alternatively, I can use [1,2,3,4,5] to try and predict [6,7,8,9,10] in one step.
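The iterative option can be sketched without any deep learning library. Here `predict_next` is a toy stand-in for a trained one-step model (it just continues the average slope of the window); it is an assumption made for illustration, not an LSTM.

```python
def predict_next(window):
    """Toy stand-in for a trained one-step model: continue the average slope."""
    slope = (window[-1] - window[0]) / (len(window) - 1)
    return window[-1] + slope


def forecast_recursive(window, steps):
    """Predict `steps` values one at a time, feeding each prediction back in."""
    window = list(window)
    preds = []
    for _ in range(steps):
        nxt = predict_next(window)
        preds.append(nxt)
        window = window[1:] + [nxt]  # drop the oldest value, append the prediction
    return preds


preds = forecast_recursive([1, 2, 3, 4, 5], steps=5)
```

The direct option replaces the loop with a single model call that returns all five values at once, which is what a final Dense layer with as many units as forecast steps does. With the iterative option, errors in early predictions feed into later ones.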

Forecasts Machine Learning

This is a follow-up to my other question. I'm making a machine learning model to forecast when certain things will happen. I will use softmax as the output.
My question is: is it better to use 7 output nodes (ranging from Sunday to Saturday, e.g. for data on Monday, the model predicts that something will happen on Friday) or 0....n output nodes (the interval in days since day h)?
If the weekday doesn't have anything to do with your data, it's definitely better to use the 0....n output nodes (the day interval).
In that case, which differs from what you asked last time, a single neuron with relu as output might be even better. This time the weekday seems not to play a role, so you are not trying to classify the weekday (classification, discrete) but want to know the time to the next event (regression, continuous), which could also be 3.54 days.
Classification: Softmax
Regression: Single Neuron with relu/linear/...
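To make the two output types concrete, here is a minimal plain-Python sketch of what each head produces. The raw scores are made-up numbers, not outputs of a trained model.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1 (classification head)."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relu(x):
    """Single-neuron regression head: a non-negative continuous value."""
    return max(0.0, x)


day_scores = [0.1, 0.3, 2.0, 0.2, 0.1, 0.0, 0.4]  # hypothetical raw outputs, days 0..6
probs = softmax(day_scores)
most_likely_day = probs.index(max(probs))  # a discrete answer (one of the classes)

days_until_event = relu(3.54)  # a continuous answer, e.g. 3.54 days
```

The softmax head can only ever answer with one of its predefined classes, while the single relu neuron can return any non-negative number of days.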

LSTM and labels

Let's start off with "I know ML cannot predict stock markets better than monkeys."
But I just want to go through with it.
My question is a theoretical one.
Say I have date, open, high, low, close as columns. So I guess I have 4 features: open, high, low, close.
'my_close' is going to be my label (answer), and I will use the 'close' 7 days from the current row. Basically I shift the 'close' column up 7 rows and make it a new column called 'my_close'.
LSTMs work on sequences. So say the sequence I set is 20 days.
Hence my shape will be (1000 days of data, 20 days as a sequence, 3 features).
The problem that is bothering me is: should these 20 days or rows of data have the exact same label? Or can they have individual labels?
Or have I misunderstood the whole theory?
Thanks guys.
In your case, you want to predict the current day's stock price using the previous 7 days' stock values. The way you're building your inputs and outputs requires some modification before feeding them into the model.
You're making a mistake in understanding timesteps (in your sequences).
Timesteps (sequences), in layman's terms, is the total number of inputs we consider when predicting the output. In your case, it will be 7 (not 20), as we will be using the previous 7 days' data to predict the current day's output.
Your input should be the previous 7 days of info:
[F11,F12,F13],[F21,F22,F23],........,[F71,F72,F73]
In Fij, F represents the feature, i the timestep and j the feature number.
The output will be the stock price of the 8th day.
Here your model will analyze the previous 7 days' inputs and predict the output.
So, to answer your question: you will have a common label for the previous 7 days of input.
I strongly recommend you study LSTMs a bit more.
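The "one label per sequence" idea can be sketched as a windowing function. The helper `make_windows` and its parameters are assumptions: `seq_len` is the number of timesteps per sample (7 in this answer, 20 in the question) and `horizon` is how many rows after the window's last row the label is taken from. All data is made up.

```python
def make_windows(features, close, seq_len, horizon):
    """Build (X, y) pairs where each window of `seq_len` rows gets ONE label:
    the close price `horizon` rows after the window's last row."""
    X, y = [], []
    last_start = len(features) - seq_len - horizon
    for start in range(last_start + 1):
        end = start + seq_len  # exclusive end index of the window
        X.append(features[start:end])
        y.append(close[end - 1 + horizon])
    return X, y


n_rows = 30
close = [float(i) for i in range(n_rows)]         # made-up closing prices
features = [[c, c + 1, c - 1, c] for c in close]  # open, high, low, close (made up)

# The question's numbers: 20-day sequences, label taken 7 days after the window.
X, y = make_windows(features, close, seq_len=20, horizon=7)
```

Here `X` has shape (4, 20, 4) and `y` has 4 entries, one per window. Calling `make_windows(features, close, seq_len=7, horizon=1)` gives the setup described in this answer, where 7 days of input predict the 8th day.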

Resources