Random Forest as best approach to this problem? - machine-learning

I am studying ML and want to practice by building a model that predicts next-day stock market returns, for example based on the price and volume of the preceding days.
The values I currently have for each day:
M = [[price at day-1, price at day 0, return at day+1],
     [volume at day-1, volume at day 0, return at day+1]]
I would like to find rules that define ranges of the price at day-1 and the price at day 0 in order to predict the return at day+1, in the following way:
If price is below 500 for day-1 AND price is above 200 at day 0
The average return at day+1 is 1.05 (5%)
or
If price is below 500 for day-1 AND price is above 200 at day 0
AND If volume is above 200 for day-1 AND volume is below 800 at day 0
The average return at day+1 is 1.09 (9%)
I am not looking for any solutions, just for a general strategy for how to approach this problem.
Is ML useful here at all, or would this be better done with a for loop iterating through all values to find the rules? I am considering a random forest; would that be a viable option?

Yes. Random forests can be used for regression.
They will have a tendency to predict the average, though, because the forest's prediction is an average over many trees. A single decision tree may be a bit more "decisive".
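For illustration, here is a minimal Python sketch of that setup using sklearn, with toy data standing in for real prices and volumes (all column names are hypothetical). A shallow single tree is also fitted, because its learned splits are exactly the kind of human-readable threshold rules the question asks for:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor, export_text

# Toy data standing in for real daily prices and volumes.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "price": rng.uniform(100, 600, 500),
    "volume": rng.uniform(100, 1000, 500),
})

# Lagged features: day -1 and day 0 values predicting the day +1 return.
X = pd.DataFrame({
    "price_lag1": df["price"].shift(1),
    "price_lag0": df["price"],
    "volume_lag1": df["volume"].shift(1),
    "volume_lag0": df["volume"],
})
y = df["price"].shift(-1) / df["price"]  # next-day return, e.g. 1.05 = +5%

mask = X.notna().all(axis=1) & y.notna()
X, y = X[mask], y[mask]

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# The single tree's splits read as rules like "price_lag0 <= 500".
print(export_text(tree, feature_names=list(X.columns)))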

Related

How to forecast a macro trend from multiple indexes with an LSTM model?

I have just started exploring the machine learning world. I want to try predicting the macroeconomic trend by grouping different index futures with an LSTM model. After reading many articles, I have come up with the two approaches below. May I ask which is the best approach?
1. In the pre-processing stage, group the index futures (e.g. S&P 500, Dow Jones, Nasdaq 100, FTSE 100, etc.), take the average price, and add an extra column holding the average price 2 days later (a pandas sketch of this approach follows the column lists below).
data structure:
date
avg price
T+2 avg price
2. Simply pick one index future at random and add an extra column holding its price 2 days later.
date
S&P
RTY
DJ
FESX
NK
S&P +2
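For what it's worth, a minimal pandas sketch of approach 1 might look like the following (tickers and prices are made up for illustration):

import pandas as pd

# One column per index future; values are toy prices.
futures = pd.DataFrame({
    "SPX": [4000.0, 4010.0, 3995.0, 4020.0, 4030.0],
    "NDX": [13000.0, 13050.0, 12980.0, 13100.0, 13150.0],
    "FTSE": [7500.0, 7510.0, 7490.0, 7530.0, 7540.0],
}, index=pd.date_range("2023-01-02", periods=5, freq="B"))

prepared = pd.DataFrame({"avg_price": futures.mean(axis=1)})  # cross-index average per day
prepared["t_plus_2_avg_price"] = prepared["avg_price"].shift(-2)  # target: T+2 average price
print(prepared)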

How should I read the sum of the values that RMSPROP produces?

I have a 2D time-series dataset whose daily output is an integer ranging from 1,000,000 to 2,000,000. Of course the data is not limited to that range: I can sum up to weekly values, in which case the range increases to over 10,000,000.
I'm able to achieve RMSE = 0.02 whenever I normalize my data, but when I feed the raw (1-million range) data into the algorithm, the RMSE can land anywhere in the 30k - 150k range.
Why is my "global minimum" 0.02 in one version of the RMSE outputs, while the other lands in a much higher range? I've been testing with AdaDelta.
The definition of RMSE is:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$
The scale of this value depends directly on the scale of the predictions and actuals, so it is quite normal to get a higher RMSE value when you don't normalize the dataset.
This is why normalization is important, as it lets us compare error metrics across models and datasets.
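A quick way to see this is to compute the RMSE of the same predictions before and after min-max normalization; the relative error is identical, only the scale changes (the numbers below are synthetic):

import numpy as np

rng = np.random.default_rng(0)
actual = rng.uniform(1_000_000, 2_000_000, 1000)
predicted = actual * (1 + rng.normal(0, 0.02, 1000))  # ~2% relative error

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

print(rmse(actual, predicted))  # tens of thousands on the raw scale

# Min-max normalize both series with the same statistics.
lo, hi = actual.min(), actual.max()
print(rmse((actual - lo) / (hi - lo), (predicted - lo) / (hi - lo)))  # roughly 0.03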

Predict price range of houses

I have a dataset with several features of houses including type, location, the number of bedrooms, etc. For example:
Type: Apartment, Semi-detached House, Single-detached House
Location: (Lat, Lon) Pairs like (40.7128° N, 74.0059° W)
Number of Bedrooms: 1, 2, 3, 4 ...
The target variable I want to predict is the house price. However, the house price given in the original dataset is the intervals of prices instead of numeric values, for example:
House Price: [0,100000), [100000,150000), [150000,200000), [200000,250000), etc.
So my question is: what model should I use if I want to predict the range of the house price? Simple regression models seem not to work, because we are predicting intervals instead of continuous numeric values.
Thanks in advance.
I would use the midpoint of each price range and run a linear regression. In your case the labels would be {50000, 125000, 175000, 225000, ...}. After you get the predicted price, just pick the range it falls into.
Alternatively, if the price ranges are fixed, you can use a one-vs-all logistic regression, although I am sure this is not the best approach.
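Here is a minimal sklearn sketch of the midpoint approach, with a made-up feature standing in for bedrooms, type, location, etc.:

import numpy as np
from sklearn.linear_model import LinearRegression

# Interval edges from the question and their midpoints.
edges = np.array([0, 100_000, 150_000, 200_000, 250_000])
midpoints = (edges[:-1] + edges[1:]) / 2  # [50000, 125000, 175000, 225000]

# Toy feature (e.g. number of bedrooms); each house is labeled with the
# midpoint of its price range.
X = np.array([[1], [2], [3], [4], [2], [3]])
y = midpoints[[0, 1, 2, 3, 1, 2]]

model = LinearRegression().fit(X, y)
pred = model.predict([[3]])[0]

# Map the numeric prediction back to one of the fixed ranges.
bucket = int(np.clip(np.digitize(pred, edges) - 1, 0, len(edges) - 2))
print(f"predicted {pred:.0f}, range [{edges[bucket]}, {edges[bucket + 1]})")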

Machine Learning algorithm for finding drug based on diagnosis

Training Data Set:
--------------------
Patient Age: 25
Patient Weight: 60
Diagnosis one: Fever
Diagnosis two: Headache
> Medicine: **Crocin**
---------------------------------
Patient Age: 25
Patient Weight: 60
Diagnosis one: Fever
Diagnosis two: no headache
> Medicine: **Paracetamol**
----------------------------------
Given a sample data set like the above, with the drug/medicine prescribed to each patient:
How do I find which medicine to prescribe, based on patient info (age/weight) and diagnoses (fever/headache/etc.)?
The task you are aiming at is classification, since the target values are on a nominal scale.
Getting the vocabulary right is crucial, since most of the remaining work has already been done by others: the sklearn library for Python, for instance, contains the most relevant algorithms along with plenty of data sets for testing them and learning how they behave.
It seems you have four variables as input:
age - metric variable
weight - metric variable
Diagnosis one - nominal variable
Diagnosis two - nominal variable
You will have to encode your nominal variables; I would recommend an array over all possible diagnoses, such as:
Fever, Headache, Stomach pain, x - [0, 0, 0, 0]
Each array element is then set to 1 if the corresponding diagnosis is present, and 0 otherwise.
You therefore have a total of 2 + n input variables, where n is the number of possible symptoms.
Then you can simply go to the sklearn library and start with the simplest classification algorithm: nearest-neighbour classification (see the sketch below).
If this does not yield good results (it probably won't), you can move on to more sophisticated models (SVM, random forest). But first you should learn the vocabulary and use simple models to get to know the methods and the processing chain.
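As a concrete starting point, here is a minimal sketch of the encoding and the nearest-neighbour classifier, using only the two toy records from the question. In practice you would also scale age and weight, since otherwise they dominate the distance metric:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

DIAGNOSES = ["Fever", "Headache", "Stomach pain"]

def encode(age, weight, diagnoses):
    # [age, weight] followed by one 0/1 flag per possible diagnosis.
    return [age, weight] + [1 if d in diagnoses else 0 for d in DIAGNOSES]

X = np.array([
    encode(25, 60, {"Fever", "Headache"}),
    encode(25, 60, {"Fever"}),
])
y = np.array(["Crocin", "Paracetamol"])

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([encode(30, 70, {"Fever", "Headache"})]))  # -> ['Crocin']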

Wilson scoring doesn't factor in negative votes?

I'm using the Wilson scoring algorithm (code below) and realized it doesn't factor in negative votes.
Example:
Upvotes  Downvotes  Score
1        0          0.2070
0        0          0
0        1          0      <--- this is wrong
That isn't correct, as entries with net-negative votes should have a lower score.
require 'cmath'

# Lower bound of the Wilson score confidence interval on the upvote ratio.
def calculate_wilson_score(up_votes, down_votes)
  total_votes = up_votes + down_votes
  return 0 if total_votes == 0

  z = 1.96 # 95% confidence
  positive_ratio = (1.0 * up_votes) / total_votes
  score = (positive_ratio + z * z / (2 * total_votes) -
           z * CMath.sqrt((positive_ratio * (1 - positive_ratio) +
                           z * z / (4 * total_votes)) / total_votes)) /
          (1 + z * z / total_votes)
  score.round(3)
end
Update:
Here is a description of the Wilson scoring confidence interval on Wikipedia.
The Wilson score lower confidence bound you posted does take negative votes into account; it is just that the lower bound never drops below zero, which is perfectly fine. This approximation is generally used to identify the highest-ranked items on a best-rated list, so it may have undesirable properties when you look at the lowest-ranked items, which are the kind you are describing.
This method of ranking items was popularized by Evan Miller in a post on how not to sort by average rating, although he later stated:
The solution I proposed previously — using the lower bound of a confidence interval around the mean — is what computer programmers call a hack. It works not because it is a universally optimal solution, but because it roughly corresponds to our intuitive sense of what we'd like to see at the top of a best-rated list: items with the smallest probability of being bad, given the data.
If you are genuinely interested in analyzing the lowest-ranked items on a list, I would suggest either using the upper confidence bound or using a Bayesian rating system, as described in: https://stackoverflow.com/a/30111531/3884938
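To make the behaviour concrete, here is a small Python sketch (a translation of the Ruby code above) that prints both Wilson bounds. Note that 0 up / 1 down is floored at 0 on the lower bound, but it is still distinguished from 1 up / 0 down by the upper bound:

from math import sqrt

def wilson_bounds(up, down, z=1.96):  # z = 1.96 for a 95% confidence interval
    n = up + down
    if n == 0:
        return 0.0, 0.0
    p = up / n
    center = p + z * z / (2 * n)
    margin = z * sqrt((p * (1 - p) + z * z / (4 * n)) / n)
    denom = 1 + z * z / n
    return (center - margin) / denom, (center + margin) / denom

for up, down in [(1, 0), (0, 0), (0, 1)]:
    lower, upper = wilson_bounds(up, down)
    print(f"{up} up / {down} down -> lower {lower:.4f}, upper {upper:.4f}")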
