How to forecast a macro trend from multiple indices with an LSTM model? - machine-learning

I have just started exploring machine learning. I want to try predicting the macroeconomic trend by combining several index futures in an LSTM model. After reading many articles, I have come up with the two approaches below. Which is the better approach?
1. In the pre-processing stage, group the index futures (e.g. S&P 500, Dow Jones, Nasdaq 100, FTSE 100) and take their average price, then add an extra column holding the average price two days later.
Data structure:
date, avg price, T+2 avg price
2. Keep each index future as its own column, pick one of them at random, and add an extra column holding its average price two days later.
Data structure:
date, S&P, RTY, DJ, FESX, NK, S&P +2
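Both layouts can be produced with a pandas shift. The following is a minimal sketch, not a recommendation of either approach: the column names and prices are made up for illustration, and it assumes one row per trading day so that "two days later" means two rows later.

```python
import pandas as pd

# Hypothetical daily settlement prices; values are made up for illustration.
prices = pd.DataFrame(
    {
        "S&P": [4500.0, 4510.0, 4490.0, 4520.0, 4530.0],
        "DJ": [35000.0, 35100.0, 34900.0, 35200.0, 35300.0],
        "NK": [32000.0, 32100.0, 31900.0, 32200.0, 32300.0],
    },
    index=pd.date_range("2023-01-02", periods=5, freq="D", name="date"),
)

HORIZON = 2  # look two rows (days) ahead

# Approach 1: collapse the indices into one average series plus its T+2 value.
approach_1 = pd.DataFrame({"avg_price": prices.mean(axis=1)})
approach_1["avg_price_t_plus_2"] = approach_1["avg_price"].shift(-HORIZON)

# Approach 2: keep every index as a feature and target one of them at T+2.
approach_2 = prices.copy()
approach_2["S&P_t_plus_2"] = approach_2["S&P"].shift(-HORIZON)

# The last HORIZON rows have no future value yet, so drop them before training.
approach_1 = approach_1.dropna()
approach_2 = approach_2.dropna()
```

Note that a plain average of raw prices is dominated by whichever index has the largest price level, which may or may not be what is wanted.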

Google Vertex Auto-ML Forecast every 30 mins predict 2 hours

GOAL
Every 30 mins I get a new bunch of price related data:
CurrentDatetime, CurrentPrice, Feature1, Feature2
I want to predict the price in 2 hours from now, so 4x30mins (4 steps into the future)
PROBLEM DESCRIPTION
I am puzzled about what Google Vertex AutoML forecasting is doing and whether I can trust the results I am getting. I am also unsure how to use a trained model for batch prediction.
WHAT I DID
I think the way to set up the training dataset is to add:
TargetDateTime column (2 hours ahead of CurrentDatetime)
TargetActualPrice column (the actual price 2 hours into the future)
TimeSeriesId column (always equal to 1, as all the data is one time series).
This means, every 30 mins I now have:
CurrentDatetime, CurrentPrice, Feature1, Feature2, TargetDateTime, TargetActualPrice, TimeSeriesId
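For illustration, the extra columns described above could be derived as follows. This is a sketch under the assumption that the rows are evenly spaced at 30 minutes with no gaps; the sample values are made up, and nothing here calls the Vertex API.

```python
import pandas as pd

STEP = pd.Timedelta(minutes=30)
HORIZON_STEPS = 4  # 4 x 30 mins = 2 hours

# Hypothetical 30-minute observations; values are made up for illustration.
df = pd.DataFrame(
    {
        "CurrentDatetime": pd.date_range("2023-01-02 11:00", periods=8, freq="30min"),
        "CurrentPrice": [2.1, 2.2, 2.3, 2.3, 2.4, 2.5, 2.6, 2.7],
        "Feature1": [3.4, 3.3, 3.1, 3.0, 2.9, 2.8, 2.7, 2.6],
    }
)

df["TargetDateTime"] = df["CurrentDatetime"] + HORIZON_STEPS * STEP
# With evenly spaced rows, the price 4 rows later is the price at TargetDateTime.
df["TargetActualPrice"] = df["CurrentPrice"].shift(-HORIZON_STEPS)
df["TimeSeriesId"] = 1
```

The last four rows end up with a missing TargetActualPrice, because their target time has not been observed yet.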
I use this dataset to train an AutoML forecast model, setting:
"Series Identifier Column" to TimeSeriesId
"Target Column" to TargetActualPrice
"Timestamp Column" to TargetDateTime
"Data granularity" to 30 mins
"Forecast Horizon" to 4
"Context Window" to 4 (use the last 2 hours of historic data to predict the next 2 hours)
Split train/val/test chronologically (on TargetDateTime, as it is the Timestamp column)
This model trains and gives some results.
Looking at the saved test dataset, I can see 4 rows for each TargetDateTime, with a predictedvalue column containing a price prediction and a predicted_on_TargetDateTime column which goes from CurrentDateTime to TargetDateTime in 30-minute intervals.
This makes sense: for every 30 mins of input data, the model makes 4 predictions, each one 30 mins further into the future, ending with a prediction 2 hours into the future. Great.
PROBLEM 1 : Batch predictions
I get confused when I try to use this trained model to make batch predictions. The crux of the problem is that Vertex looks at the batch input dataset, finds the first row (30 mins of input data) for which there is no actual price yet (TargetActualPrice is null), and then predicts the next 4 steps (2 hours). This seems to mean that, to make the next prediction, I need to wait for the actuals of the previous prediction. But then, when I get the next set of input data (30 mins later, and still 1.5 hours away from the previous prediction's target), I cannot use the model to make a new prediction because the previous prediction has no TargetActualPrice yet.
To make it more explicit, suppose I have the following batch data:
CurrentDatetime  CurrentPrice  Feature1  Feature2  TargetDateTime  TargetActualPrice  TimeSeriesId
11:00            $2.1          3.4       abc       13:00           $2.4               1
11:30            $2.2          3.3       abd       13:30           $2.5               1
12:00            $2.3          3.1       abe       14:00           $2.6               1
12:30            $2.3          3.0       abe       14:30           $2.7               1
13:00            $2.4          2.9       abf       15:00           null               1
13:30            $2.5          2.8       abg       15:30           null               1
14:00            $2.6          2.7       abh       16:00           null               1
14:30            $2.7          2.6       abi       16:30           null               1
In the batch data above, I have 2 hours (4 rows) of historic data with actuals (11:00-12:30). Current time is 14:30 so I don't have actuals for 15:00 yet. The last prediction made was with the 13:00 input data (as it is the first row with actual data = null). The 13:30 - 14:30 rows I cannot use for a new prediction until I have the 15:00 actuals.
This doesn't make sense to me. Shouldn't I be able to make a new 2-hour (4-step) prediction every 30 mins? Am I doing something wrong?
Is the solution, when I get the next 30 mins of input data, to put the last predicted value into the actuals column (and replace it with the real actual once I have it) so that I can proceed with the next prediction? That seems cumbersome.
PROBLEM 2 : Leakage
My other concern is how Vertex trains and calculates the results. I am worried that when (during training) Vertex picks up the next row of 30-minute data, it will create a prediction based on the previous 4x30 mins of data (the 2-hour "Context Window") INCLUDING the TargetActualPrice values for those rows. That would be incorrect, as each TargetActualPrice value lies 2 hours in the future and is not yet available when the next 30 mins of data comes in. It would mean leakage of actual data: predicting using actuals before they are known (i.e. cheating).
SUMMARY
In summary, I am hoping someone can tell if I am setting the dataset up incorrectly, and/or how to batch predict every 30 mins.
With regard to my leakage concern, it originally came about because I didn't understand the batch predictions. It seemed that every 30 mins I needed the price from 2 hours in the future in order to update the actuals of the previous prediction and then create a new prediction, and that price is obviously unknown at that moment.
Now my understanding is that during training, even though every 30-minute timestep has an actual price column for 2 hours in the future, the model only uses the actual prices available at the moment of prediction. When making a prediction at 14:30, the model uses the 14:30 data (and historic data before 14:30) and then makes 4x30 mins predictions. These four predictions neither use nor are influenced by the 30-minute data after 14:30 (15:00, 15:30, 16:00, 16:30). Nor does the model use the 16:30 actual price, which is the target column value on the 14:30 row.
I think I now understand the batch predictions.
Every 30 mins I get a new set of data (Feature1 and Feature2) as well as the CurrentPrice. I just add this data to the batch table with TargetActualPrice set to NULL and TargetDateTime set to 2 hours in the future. I also update the TargetActualPrice in the previous row(s) of the batch table whose TargetDateTime equals the current time. Now I can run the model against this batch table and get a prediction for a maximum of 4 rows with TargetActualPrice = NULL.
For clarity, in the batch table I end up with 4 rows where TargetActualPrice = NULL, matching the forecast horizon of the model. When I run the prediction, I get 4 prediction values for these NULL rows.
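The bookkeeping described above can be sketched in plain Python. The function name add_observation and the list-of-dicts batch table are hypothetical choices for illustration; this only maintains the table and does not call Vertex.

```python
from datetime import datetime, timedelta

HORIZON = timedelta(hours=2)

def add_observation(batch, now, price, feature1, feature2):
    """Update a batch table (a list of dict rows) with a new 30-minute observation."""
    # The price observed now is the actual for any row that targeted this time.
    for row in batch:
        if row["TargetDateTime"] == now and row["TargetActualPrice"] is None:
            row["TargetActualPrice"] = price
    # Append the new row; its target lies 2 hours ahead and is still unknown.
    batch.append({
        "CurrentDatetime": now,
        "CurrentPrice": price,
        "Feature1": feature1,
        "Feature2": feature2,
        "TargetDateTime": now + HORIZON,
        "TargetActualPrice": None,
        "TimeSeriesId": 1,
    })
    return batch

# Replay the prices from the example table, 11:00 through 14:30.
batch = []
start = datetime(2023, 1, 2, 11, 0)
prices = [2.1, 2.2, 2.3, 2.3, 2.4, 2.5, 2.6, 2.7]
for i, price in enumerate(prices):
    add_observation(batch, start + i * timedelta(minutes=30), price, 0.0, "x")

# Rows still waiting for their actual price.
pending = [row for row in batch if row["TargetActualPrice"] is None]
```

After replaying the eight observations, the rows still waiting for an actual are the four most recent ones, which matches the forecast horizon of 4.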

Can I use machine learning on below dataset sample

Dataset Sample
Can I use any algorithm to train on the above dataset?
Each row (Id) has a dependent variable (Status), but each Id also has multiple rows of features.
You can think of it as "each Id has multiple transactions, and all of its transactions share a common Status".
Will machine learning find patterns in these transactions?
Is there any other approach to this type of problem?
Just fill each empty ID cell with the value from the row above, and do the same for the Status column. This will lead to:
df
ID    Feature1  Feature2  Feature3  Status
8079  100       Asia      High      Approved
8079  200       Africa    Low       Approved
When you run a classification algorithm, you can use ID, Feature1, Feature2 and Feature3 as features and Status as the target. A classifier will learn from this, and everything is exactly the same as before.
The features are still independent. You would only have dependent features if the variables somehow depended on each other; in your case, the ID 8079 does not imply Feature2 = Africa, so they are independent.
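You can fill your cells with:
import numpy as np
# Treat empty cells as missing, then forward-fill them from the row above.
# (Assigning NaN to df[df[0] == ""] would blank out the whole row, features included.)
df = df.replace("", np.nan).ffill()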
Based on your comments, the approach can be slightly different: you need to convert your entries into new features (see "Python pandas convert rows to columns where multiple columns exist").
The dataframe should then look like:
ID    Feature1  Feature2  Feature3  Feature1a  ....  Feature3z  Status
8079  100       Asia      High      200                         Approved
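One way to get from several rows per ID to one wide row per ID is to number the transactions within each ID and pivot on that number. This is a minimal sketch with made-up data; the helper column txn and the Feature1_0, Feature1_1 naming are assumptions for illustration, not the naming used in the linked question.

```python
import pandas as pd

# Hypothetical long-format data: several transaction rows per ID, one Status per ID.
df = pd.DataFrame(
    {
        "ID": [8079, 8079, 8080],
        "Feature1": [100, 200, 150],
        "Feature2": ["Asia", "Africa", "Asia"],
        "Feature3": ["High", "Low", "Low"],
        "Status": ["Approved", "Approved", "Rejected"],
    }
)

# Number the transactions within each ID: 0, 1, 2, ...
df["txn"] = df.groupby("ID").cumcount()

# One row per ID, one column per (feature, transaction number) pair.
wide = df.pivot(index="ID", columns="txn", values=["Feature1", "Feature2", "Feature3"])
wide.columns = [f"{feature}_{txn}" for feature, txn in wide.columns]

# Status is constant within an ID, so take the first value per ID.
wide["Status"] = df.groupby("ID")["Status"].first()
```

IDs with fewer transactions than the maximum get NaN in the unused columns, so those gaps need handling (imputation or a model that tolerates missing values) before training.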
You can either assume that each row is independent and ignore the ID column or, if every ID has 3 rows, extend the dataset with more features.

Random Forest as best approach to this problem?

I am studying ML and want to practice building a model to predict stock market returns for the next day, for example based on price and volume of the preceding days.
The current values I have for each day:
M = [[Price at day-1, price at day 0, return at day+1],
     [Volume at day-1, volume at day 0, return at day+1]]
I would like to find rules that define ranges for the price at day-1 and the price at day 0 in order to predict the return at day+1, in the following way:
If price is below 500 for day-1 AND price is above 200 at day 0
The average return at day+1 is 1.05 (5%)
or
If price is below 500 for day-1 AND price is above 200 at day 0
AND If volume is above 200 for day-1 AND volume is below 800 at day 0
The average return at day+1 is 1.09 (9%)
I am not looking for a complete solution, just for a general strategy for approaching this problem.
Is ML useful here at all, or would it be better to use a for loop that iterates through all values to find the rules? I am considering a random forest; would that be a viable option?
Yes. Random forests can be used for regression.
They will have a tendency to predict the average though, because of the forest aggregation. Regular decision trees may be a bit more "decisive".
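For comparison, the "for loop" idea from the question amounts to scoring a hand-written rule by the average next-day return of the rows it matches. Below is a minimal sketch with made-up numbers; the rows and the function names are hypothetical. A decision tree or random forest differs mainly in that it searches for the thresholds itself instead of having them fixed in advance.

```python
from statistics import mean

# Hypothetical rows: (price day-1, price day 0, volume day-1, volume day 0, return day+1).
rows = [
    (450, 300, 250, 700, 1.10),
    (480, 250, 300, 600, 1.08),
    (400, 220, 150, 500, 1.00),
    (600, 300, 250, 700, 0.95),
    (450, 150, 250, 700, 0.97),
]

def average_return(rows, rule):
    """Mean day+1 return over the rows that satisfy rule, or None if none match."""
    matched = [r[-1] for r in rows if rule(r)]
    return mean(matched) if matched else None

def price_rule(r):
    # Price below 500 at day-1 AND price above 200 at day 0.
    return r[0] < 500 and r[1] > 200

def price_and_volume_rule(r):
    # Same price condition, plus volume above 200 at day-1 AND below 800 at day 0.
    return price_rule(r) and r[2] > 200 and r[3] < 800

price_only = average_return(rows, price_rule)
price_and_volume = average_return(rows, price_and_volume_rule)
```

Rules found by trying many thresholds on the same data tend to overfit, so any rule worth keeping should be checked on data that was not used to find it.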

LSTM and labels

Let's start off with "I know ML cannot predict stock markets better than monkeys."
But I just want to go through with it.
My question is a theoretical one.
Say I have date, open, high, low, close as columns. So I guess I have 4 features: open, high, low, close.
'my_close' is going to be my label (answer), and I will use the 'close' from 7 days after the current row. Basically, I shift the 'close' column up 7 rows and make it a new column called 'my_close'.
LSTMs work on sequences. So say the sequence length I set is 20 days.
Hence my shape will be (1000 days of data, 20 days as a sequence, 3 features).
The problem that is bothering me is: should these 20 days (rows) of data all have the exact same label, or can they have individual labels?
Or have I misunderstood the whole theory?
Thanks guys.
In your case, you want to predict the current day's stock price from the previous 7 days of stock values. The way you are building your inputs and outputs needs some modification before they are fed into the model.
You are misunderstanding timesteps (in your sequences).
In layman's terms, the number of timesteps (the sequence length) is the number of inputs considered when predicting one output. In your case it will be 7 (not 20), as we will use the previous 7 days of data to predict the current day's output.
Your input should be the previous 7 days of info:
[F11,F12,F13],[F21,F22,F23],........,[F71,F72,F73]
In Fij, F stands for a feature value, i is the timestep and j is the feature number.
The output will be the stock price of the 8th day.
Here your model will analyze the previous 7 days of inputs and predict the output.
So, to answer your question: you will have one common label for the previous 7 days of input.
I strongly recommend studying LSTMs a bit more.
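The "one label per sequence" layout can be sketched with NumPy. This is an illustration only: make_windows is a hypothetical helper, the data is random, and keeping the window length and the look-ahead as two separate parameters is an assumption about the question's intent. Calling it with seq_len=7 and horizon=1 gives the layout described in the answer (7 days in, the 8th day out).

```python
import numpy as np

def make_windows(features, close, seq_len, horizon):
    """Build (samples, seq_len, n_features) inputs and one label per sample.

    The label for a window whose last row is t is close[t + horizon].
    """
    X, y = [], []
    last_start = len(features) - seq_len - horizon
    for start in range(last_start + 1):
        end = start + seq_len  # exclusive end of the window
        X.append(features[start:end])
        y.append(close[end - 1 + horizon])
    return np.array(X), np.array(y)

# Hypothetical data: 100 days of open, high, low, close (random values).
rng = np.random.default_rng(0)
ohlc = rng.random((100, 4))

# 20-day windows, each labelled with the close 7 days after the window ends.
X, y = make_windows(ohlc, ohlc[:, 3], seq_len=20, horizon=7)
```

Each 20-row window gets exactly one label, so y has one entry per sample rather than one per row.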

Genetic Algorithm CrossOver

I have a GA with a population of X.
After I run the genes through the fitness function and get a result for each one, I apply a weighted multiplication to the genes (so the better-ranked genes get multiplied the most).
I end up with either x*2 or x*2 + x*10/100 genes. The extra 10% is random new genes, which may or may not be added depending on the mutation rate.
The problem is, I don't know the best approach to reduce the population to X again.
If the gene pool is a list, should I just use list[::2] (i.e. take every even-indexed item from the list)?
What is common practice when crossing genes?
EDIT:
Example of my GA with a population of 100:
1. Run the 100 genes through the fitness function and get the results. Current Population: 100
2. Add 10% new random genes. Current Population: 110
3. Duplicate the top 10% of genes. Current Population: 121
4. Remove the worst 10% of genes. Current Population: 108
5. Combine all possible genes (no duplicates). Current Population: 5778
6. Remove genes from the gene pool until Population = 100. Current Population: 100
7. Restart the fitness function
What I want to know is: how should I do the last step? Currently I have a list with 5778 items and I take every 58th one, or expressed in code, every len(list)/startpopulation-1 items.
Or should I use a 'while True' loop that deletes random items until len(list) == 100?
Should the new random genes be added before or after the crossover?
Is there a way to apply a Gaussian multiplication from the top-rated to the lowest-rated items?
e.g. the top rated are multiplied by n, the second best by (n-1), the third by (n-2), ..., and the worst rated by (n-n).
I do not really know why you are performing the GA like that; could you give some references?
In any case, here is my typical approach to implementing a functional GA:
1) Run the 100 genes through the fitness function and get the results.
2) Randomly choose 2 genes based on the normalized fitness function (consider this the probability of each gene being chosen from the pool) and cross them over. Repeat this step until you have 90 new genes (45 times in this case). Save the top 5 without modification and duplicate them. Total genes: 100.
3) Allow the 90 new genes and the 5 duplicates in the new pool to mutate based on your mutation probability (typically 1%). Total genes: 100.
4) Repeat steps 1) to 3) until convergence, or for X iterations.
Note: You always want to keep the best genes unchanged, so that the best solution never gets worse from one iteration to the next.
Good luck!
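The steps in this answer can be sketched with the standard library alone. Several details are assumptions made for illustration and are not part of the answer: genes are lists of 0/1 bits, crossover is single-point and yields two children per pair, mutation flips a bit, and the fitness must be strictly positive so it can serve as a selection weight.

```python
import random

def next_generation(population, fitness, n_elite=5, mutation_rate=0.01, rng=random):
    """One generation: elitism, fitness-proportional selection, crossover, mutation."""
    scores = [fitness(g) for g in population]
    size = len(population)

    # Keep the best genes untouched (elitism) and make one copy of each.
    ranked = [g for _, g in sorted(zip(scores, population), key=lambda p: p[0], reverse=True)]
    elites = [g[:] for g in ranked[:n_elite]]
    duplicates = [g[:] for g in elites]

    # Fitness-proportional selection plus single-point crossover.
    n_children = size - 2 * n_elite
    children = []
    while len(children) < n_children:
        mum, dad = rng.choices(population, weights=scores, k=2)
        cut = rng.randrange(1, len(mum))
        children.append(mum[:cut] + dad[cut:])
        children.append(dad[:cut] + mum[cut:])
    children = children[:n_children]

    # Mutate children and duplicates by flipping bits; the elites stay unchanged.
    mutable = children + duplicates
    for gene in mutable:
        for i in range(len(gene)):
            if rng.random() < mutation_rate:
                gene[i] = 1 - gene[i]
    return elites + mutable

# Toy problem: maximise the number of 1-bits in a 16-bit gene.
rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(16)] for _ in range(100)]
fitness = lambda g: sum(g) + 1  # +1 keeps every selection weight positive
best_before = max(fitness(g) for g in population)
for _ in range(20):
    population = next_generation(population, fitness, rng=rng)
best_after = max(fitness(g) for g in population)
```

Because the five elite genes are carried over unmodified, the best fitness in the population cannot decrease from one generation to the next, and the population size stays at 100 without a separate pruning step.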
