GOAL
Every 30 mins I get a new batch of price-related data:
CurrentDatetime, CurrentPrice, Feature1, Feature2
I want to predict the price in 2 hours from now, so 4x30mins (4 steps into the future)
PROBLEM DESCRIPTION
I am puzzled by what Google Vertex AI AutoML forecasting is doing and whether I can trust the results I am getting. I am also unsure how to use a trained model for batch prediction.
WHAT I DID
I think the way to set up the training dataset is to add:
TargetDateTime column (2 hours ahead of CurrentDatetime)
TargetActualPrice column (the actual price 2 hours into the future)
TimeSeriesId column (always equal to 1, as all the data is one time series).
This means, every 30 mins I now have:
CurrentDatetime, CurrentPrice, Feature1, Feature2, TargetDateTime, TargetActualPrice, TimeSeriesId
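A minimal sketch of how these extra columns could be derived from the raw 30-minute rows, using only the Python standard library. The function name `add_targets`, the calendar date and the dict-per-row layout are assumptions made for illustration; the prices are the CurrentPrice values from the batch example further down.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=30)   # data granularity from the question
HORIZON_STEPS = 4              # 4 x 30 mins = 2 hours ahead

def add_targets(rows):
    """Return copies of `rows` with the three extra training columns.

    `rows` is a list of dicts holding at least CurrentDatetime (datetime)
    and CurrentPrice, one row per 30-minute step.
    """
    price_at = {r["CurrentDatetime"]: r["CurrentPrice"] for r in rows}
    out = []
    for r in rows:
        target_time = r["CurrentDatetime"] + HORIZON_STEPS * STEP
        out.append({
            **r,
            "TargetDateTime": target_time,
            # None when the price 2 hours ahead has not been observed yet.
            "TargetActualPrice": price_at.get(target_time),
            "TimeSeriesId": 1,
        })
    return out

start = datetime(2024, 1, 1, 11, 0)  # hypothetical date
prices = [2.1, 2.2, 2.3, 2.3, 2.4, 2.5, 2.6, 2.7]
rows = [
    {"CurrentDatetime": start + i * STEP, "CurrentPrice": p}
    for i, p in enumerate(prices)
]
training = add_targets(rows)
```

Rows whose target time lies beyond the last observed price keep TargetActualPrice as None, which is what the null entries in a batch table represent.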
I use this dataset to train an AutoML forecast model, setting:
"Series Identifier Column" to TimeSeriesId,
"Target Column" to TargetActualPrice,
"Timestamp Column" to TargetDateTime,
"Data granularity" to 30 mins,
"Forecast Horizon" to 4,
"Context Window" to 4 (use the last 2 hours of historic data to predict the next 2 hours),
Split train/val/test chronologically (on TargetDateTime, as it is the Timestamp column).
This model trains and gives some results.
Looking at the saved test data set, I can see 4 rows for each TargetDateTime, with a predictedvalue column containing a price prediction and a predicted_on_TargetDateTime column which goes from CurrentDateTime to TargetDateTime in 30 mins intervals.
This makes sense: for every 30 mins of input data, the model makes 4 predictions, each 30 mins further into the future, ending with a prediction 2 hours ahead. Great.
PROBLEM 1 : Batch predictions
I get confused when I try to use this trained model to make batch predictions. The crux of the problem is that Vertex will look at the batch input dataset, find the first row (30 min input data) for which there is no actual price data yet (TargetActualPrice is null) and then predict the next 4 steps (2 hours). This seems to mean that, to make the next prediction, I would need to wait for the actuals of the previous prediction. But then, when I get the next set of input data (30 mins later, and 1.5 hours before the previous prediction's target), I cannot use the model to make a new prediction because the previous prediction has no TargetActualPrice yet.
To make it more explicit, suppose I have the following batch data:
CurrentDatetime  CurrentPrice  Feature1  Feature2  TargetDateTime  TargetActualPrice  TimeSeriesId
11:00            $2.1          3.4       abc       13:00           $2.4               1
11:30            $2.2          3.3       abd       13:30           $2.5               1
12:00            $2.3          3.1       abe       14:00           $2.6               1
12:30            $2.3          3.0       abe       14:30           $2.7               1
13:00            $2.4          2.9       abf       15:00           null               1
13:30            $2.5          2.8       abg       15:30           null               1
14:00            $2.6          2.7       abh       16:00           null               1
14:30            $2.7          2.6       abi       16:30           null               1
In the batch data above, I have 2 hours (4 rows) of historic data with actuals (11:00-12:30). Current time is 14:30 so I don't have actuals for 15:00 yet. The last prediction made was with the 13:00 input data (as it is the first row with actual data = null). The 13:30 - 14:30 rows I cannot use for a new prediction until I have the 15:00 actuals.
This doesn't make sense to me. I should be able to make a new 2-hour (4-step) prediction every 30 mins. I must be doing something wrong?
Is the solution that, when I get the next 30 mins of input data, I put the last predicted value into the actuals column (and update it with the real actuals once I have them) so that I can proceed with the next prediction? That seems cumbersome.
PROBLEM 2 : Leakage
My other concern with this is how Vertex is training and calculating the results. I am worried that when (during training) Vertex picks up the next row of 30 mins data, it will create a prediction based on the previous 4x30 mins of data (2 hour "Context window") INCLUDING the TargetActualPrice data for those rows. But this would be incorrect, as the TargetActualPrice value is 2 hours into the future and not yet available when the next 30 mins of data comes in. This would mean leakage of actual data, predicting using actuals before they are known (ie cheating).
SUMMARY
In summary, I am hoping someone can tell if I am setting the dataset up incorrectly, and/or how to batch predict every 30 mins.
With regards to my leakage concern, it originally came about because I didn't understand the batch predictions, and it seemed that every 30 mins I needed the price from 2 hours in the future in order to update the actuals of the previous prediction and then create a new prediction, which is obviously unknown at that moment.
Now my understanding is that during training, even though every 30 min timestep will have an actual price column for 2 hours in the future, the model will only use actual prices available at the moment of prediction. If making a prediction at 14.30, the model will use the 14.30 data (and historic data before 14.30), and then make 4x30 mins predictions. These four predictions neither use nor are influenced by the 30 mins data after 14.30 (15.00, 15.30, 16.00, 16.30). Nor do they use the 16.30 actual price value, which is the target column on the 14.30 row of data.
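The availability rule described here can be written down as a small check. This is only an illustration of that rule, not a description of how Vertex AI is implemented; the function name `known_at`, the calendar date and the last four target prices are hypothetical.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=30)
HORIZON = 4 * STEP  # each row's target lies 2 hours after its inputs

def known_at(rows, prediction_time):
    """Copy `rows`, hiding every value not yet observed at `prediction_time`."""
    visible = []
    for row in rows:
        if row["CurrentDatetime"] > prediction_time:
            continue  # this row's inputs do not exist yet
        actual = row["TargetActualPrice"]
        if row["TargetDateTime"] > prediction_time:
            actual = None  # target is still in the future: using it would be leakage
        visible.append({**row, "TargetActualPrice": actual})
    return visible

start = datetime(2024, 1, 1, 11, 0)  # hypothetical date
# Fully labelled training rows; the last four actuals are made-up values.
actuals = [2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1]
history = [
    {
        "CurrentDatetime": start + i * STEP,
        "TargetDateTime": start + i * STEP + HORIZON,
        "TargetActualPrice": a,
    }
    for i, a in enumerate(actuals)
]

at_1430 = known_at(history, datetime(2024, 1, 1, 14, 30))
```

For a prediction made at 14.30, only the rows from 11.00 to 12.30 keep their actuals; the four most recent rows have theirs hidden, mirroring the four null rows in a batch table.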
I think I now understand the batch predictions.
Every 30 mins I get a new set of data (Feature1 and Feature2) as well as the CurrentPrice. I just add this data to the batch table with TargetActualPrice set to NULL and TargetDateTime set to 2 hours in the future. I also update the TargetActualPrice in the previous row(s) in the batch table (those with TargetDateTime equal to the current time). Now I can run the model against this batch table and get a prediction for a maximum of 4 rows with TargetActualPrice = NULL.
For clarity, in the batch table, I end up with 4 TargetActualPrice = NULL rows, matching the prediction horizon of the model. When I run the prediction, I will get 4 prediction values for these NULL rows.
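The update routine described above could be sketched as follows, in plain Python. The helper name `add_step`, the calendar date and the in-memory list standing in for the batch table are assumptions; the values are those of the batch example earlier in the post.

```python
from datetime import datetime, timedelta

STEP = timedelta(minutes=30)
HORIZON = 4 * STEP  # forecast horizon of 4 steps = 2 hours

def add_step(batch, now, price, feature1, feature2):
    """Update the batch table in place for the data arriving at `now`."""
    # 1) The price observed now is the actual for any row that targets `now`.
    for row in batch:
        if row["TargetDateTime"] == now:
            row["TargetActualPrice"] = price
    # 2) Append the new row; its target lies 2 hours ahead, so it is unknown.
    batch.append({
        "CurrentDatetime": now,
        "CurrentPrice": price,
        "Feature1": feature1,
        "Feature2": feature2,
        "TargetDateTime": now + HORIZON,
        "TargetActualPrice": None,
        "TimeSeriesId": 1,
    })

batch = []
start = datetime(2024, 1, 1, 11, 0)  # hypothetical date
observations = [
    (2.1, 3.4, "abc"), (2.2, 3.3, "abd"), (2.3, 3.1, "abe"), (2.3, 3.0, "abe"),
    (2.4, 2.9, "abf"), (2.5, 2.8, "abg"), (2.6, 2.7, "abh"), (2.7, 2.6, "abi"),
]
for i, (price, f1, f2) in enumerate(observations):
    add_step(batch, start + i * STEP, price, f1, f2)

# Rows still waiting for their actual price.
pending = [r for r in batch if r["TargetActualPrice"] is None]
```

Once at least 4 steps have been received, `pending` always holds exactly 4 rows, one per step of the forecast horizon.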
I have just started exploring the machine learning world. I want to try predicting the macroeconomic trend by grouping different index futures with an LSTM model. After reading many articles, I have come up with the 2 approaches below. Which is the better approach?
1. In the pre-processing stage, group the index futures (e.g. S&P 500, Dow Jones, Nasdaq 100, FTSE 100, etc.) and take their average price. Add an extra column holding the average price 2 days later.
data structure:
date   avg price   T+2 avg price
2. Simply pick one index future at random and add an extra column holding its average price 2 days later.
date   S&P   RTY   DJ   FESX   NK   S&P +2
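The two data structures can be built along these lines. This is a sketch in plain Python with made-up prices for two indices only; it reads "2 days after" as "2 rows later", which is an assumption, and the helper name `with_future_column` is hypothetical.

```python
from statistics import mean

SHIFT = 2  # "2 days after", taken here as 2 rows later (an assumption)

def with_future_column(values, shift=SHIFT):
    """Pair each value with the value `shift` rows later (None near the end)."""
    return [
        (v, values[i + shift] if i + shift < len(values) else None)
        for i, v in enumerate(values)
    ]

# Hypothetical prices, one list entry per trading day.
prices = {
    "S&P": [100, 102, 104, 106],
    "DJ": [200, 198, 202, 206],
}

# Approach 1: average across the indices, then add the T+2 average.
avg_price = [mean(day) for day in zip(*prices.values())]
approach_1 = with_future_column(avg_price)

# Approach 2: keep every index as a feature and add the S&P value 2 rows later.
sp_future = [future for _, future in with_future_column(prices["S&P"])]
approach_2 = [
    {**{name: series[i] for name, series in prices.items()}, "S&P +2": sp_future[i]}
    for i in range(len(sp_future))
]
```

In both layouts the last two rows have no target value yet, so they would have to be dropped before training.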
Here is my task:
Split the data into two datasets: a training dataset and a test dataset. The training dataset should include the first 7,111 observations (until the last observation of 2004). The aim will be to use the training dataset to forecast the value of NOx concentration at 9am in January 2005. Therefore, split the original dataset into a training and a test dataset. The test dataset should include the 31 observations at 9am every day in January 2005.
These are the variables in my dataset, which has 9,375 observations:
Date Date (dd/mm/yyyy)
Time Time (hh:mm:ss)
NOx True hourly averaged NOx concentration in ppb
NO2 True hourly averaged NO2 concentration in microg/m3
Temp Temperature in °C
RH Relative Humidity (% )
AH Absolute Humidity
I used:
airdata_train <- airdata[1:7111,]
airdata_test <- subset(airdata,Date > 31/01/2005 & Date <= 01/01/2005, select = airdata)
but I'm unable to figure out how to combine multiple conditions.
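The logic of the split, sketched in Python rather than R to show how the conditions combine: the dates are parsed before comparing, the lower bound comes first (the attempt above has the two bounds reversed), and the 9am condition is joined with a logical AND. The function name `split_train_test` and the generated rows are hypothetical stand-ins for the real data, where the training size would be 7111.

```python
from datetime import date, datetime, timedelta

def split_train_test(rows, n_train, start, end, time_of_day):
    """Return (train, test): first `n_train` rows, and rows matching all conditions."""
    train = rows[:n_train]
    test = [
        r for r in rows
        # Both date bounds AND the time condition must hold for a row to be kept.
        if start <= datetime.strptime(r["Date"], "%d/%m/%Y").date() <= end
        and r["Time"] == time_of_day
    ]
    return train, test

# Hypothetical stand-in for the real data: hourly rows from 31/12/2004 to 01/02/2005.
first = datetime(2004, 12, 31, 0, 0)
rows = [
    {"Date": t.strftime("%d/%m/%Y"), "Time": t.strftime("%H:%M:%S")}
    for t in (first + timedelta(hours=h) for h in range(24 * 33))
]

train, test = split_train_test(
    rows, 24, date(2005, 1, 1), date(2005, 1, 31), "09:00:00"
)
```

On this toy data the test set contains one 9am row for each of the 31 days of January 2005.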
I am studying ML and want to practice building a model to predict stock market returns for the next day, for example based on price and volume of the preceding days.
The current values I have for each day:
M = [[Price at day-1, price at day 0, return at day+1]
[Volume at day-1, volume at day 0, return at day+1]]
I would like to find rules that define ranges for the price at day-1 and the price at day 0 in order to predict the return at day+1, in the following way:
If price is below 500 for day-1 AND price is above 200 at day 0
The average return at day+1 is 1.05 (5%)
or
If price is below 500 for day-1 AND price is above 200 at day 0
AND If volume is above 200 for day-1 AND volume is below 800 at day 0
The average return at day+1 is 1.09 (9%)
I am not looking for any solutions but just for the general strategy how to approach this problem.
Is ML useful here at all, or would it be better done using a for loop iterating through all values to find the rules? I am considering random forest, would that be a viable option?
Yes. Random forests can be used for regression.
They will have a tendency to predict the average though, because of the forest aggregation. Regular decision trees may be a bit more "decisive".
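Independently of which model is chosen, a rule in the format given in the question can be scored directly by averaging the next-day return over the days that satisfy it. A minimal sketch in plain Python; the observations, field names and function names are all made up for illustration.

```python
from statistics import mean

# Hypothetical observations, one dict per day.
days = [
    {"price_prev": 450, "price_now": 250, "vol_prev": 300, "vol_now": 700, "ret_next": 1.10},
    {"price_prev": 480, "price_now": 300, "vol_prev": 150, "vol_now": 900, "ret_next": 1.00},
    {"price_prev": 520, "price_now": 400, "vol_prev": 250, "vol_now": 500, "ret_next": 0.95},
    {"price_prev": 400, "price_now": 220, "vol_prev": 210, "vol_now": 790, "ret_next": 1.08},
]

def average_return(rows, rule):
    """Mean day+1 return over the rows satisfying `rule`, or None if none match."""
    matched = [r["ret_next"] for r in rows if rule(r)]
    return mean(matched) if matched else None

def price_rule(r):
    # Price below 500 at day-1 AND above 200 at day 0.
    return r["price_prev"] < 500 and r["price_now"] > 200

def price_and_volume_rule(r):
    # The price rule AND volume above 200 at day-1 AND below 800 at day 0.
    return price_rule(r) and r["vol_prev"] > 200 and r["vol_now"] < 800

price_only = average_return(days, price_rule)
price_and_volume = average_return(days, price_and_volume_rule)
```

A decision tree searches for thresholds of this kind automatically, whereas the loop approach requires enumerating the candidate thresholds by hand.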
I have the data below, which I want to use for predictive learning about which features of a model people find attractive when purchasing clothes online:
COLORofCLOTHING  MODELHAIR_COLOR  MODEL_BUILD  SELLER_CATEGORY
Red              Black            Lean         1
Blue             Brown            Lean         5
Black            Blonde           Healthy      10
I want to predict whether the clothing will sell well given a set of attributes.
However, the seller category can be anything from 1 to 10 (1 being best and 10 being worst), and I am not sure how to approach this problem. I am using Weka for this purpose. Can anyone give me ideas on how to approach it?
Basically, I want to build a model that learns features such as the color of the clothing and can predict how well the clothes will sell.
Transform and normalise your dataset into something along the lines of:
color_red  color_blue  color_black  hair_black  hair_brown  hair_blonde  ...  prediction
1          0           0            1           0           0            ...  0
0          1           0            0           1           0            ...  0.5
0          0           1            0           0           1            ...  1
Random Forests and Neural Networks should be able to give you predictions.
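The transformation above can be sketched as one-hot encoding plus min-max scaling of the seller category. The example is plain Python rather than Weka, and the function and field names are hypothetical. With a strict min-max scaling of the 1 to 10 range, category 5 maps to about 0.44 rather than the rounded 0.5 shown in the table.

```python
def encode(rows, categorical, target, lo=1, hi=10):
    """One-hot encode `categorical` columns and min-max scale `target` to [0, 1]."""
    # Collect the distinct values of each categorical column in a stable order.
    levels = {col: sorted({r[col] for r in rows}) for col in categorical}
    encoded = []
    for r in rows:
        out = {}
        for col in categorical:
            for level in levels[col]:
                out[f"{col}_{level}"] = 1 if r[col] == level else 0
        # Seller category 1 (best) maps to 0 and 10 (worst) maps to 1.
        out["prediction"] = (r[target] - lo) / (hi - lo)
        encoded.append(out)
    return encoded

rows = [
    {"color": "Red", "hair": "Black", "build": "Lean", "seller_category": 1},
    {"color": "Blue", "hair": "Brown", "build": "Lean", "seller_category": 5},
    {"color": "Black", "hair": "Blonde", "build": "Healthy", "seller_category": 10},
]
encoded = encode(rows, ["color", "hair", "build"], "seller_category")
```

Each row of `encoded` then holds only numeric values, which is the form the models named above would be trained on.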