I've been going through Andrew Ng's machine learning course and just finished the learning curve lecture. I plotted a learning curve for a logistic regression model I built, and the training and CV scores converge, which suggests my model could benefit from more features. How could I do a similar analysis for something like a random forest? When I create a learning curve for a random forest classifier with the same data in sklearn, my training score just stays very close to 1. Do I need to use a different method of getting the training error?
Learning curves are a tool for diagnosing the bias-variance trade-off. Since your random forest's training score stays very close to 1, the model is able to fit the underlying function. If the underlying function were more non-linear or more complex, you would have had to add more features. See the following example, figure Learning Curves.
Start with only 2 features and train your random forest model. Then use all of your features and train the random forest model again.
You should see a similar graph for your example.
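A rough sketch of that comparison with scikit-learn's learning_curve is below. The data is synthetic (make_classification), the helper name forest_learning_curve is made up for illustration, and "the first 2 features" just means two arbitrary columns, so treat it as a template rather than a result about your data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the asker's data (an assumption, not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_redundant=0, random_state=0
)

def forest_learning_curve(X_subset, y):
    """Return training-set sizes plus mean train and CV accuracy per size."""
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    sizes, train_scores, cv_scores = learning_curve(
        model,
        X_subset,
        y,
        train_sizes=np.linspace(0.2, 1.0, 5),
        cv=5,
        scoring="accuracy",
        shuffle=True,
        random_state=0,
    )
    return sizes, train_scores.mean(axis=1), cv_scores.mean(axis=1)

# Curve with only 2 features, then with all features.
sizes_2, train_2, cv_2 = forest_learning_curve(X[:, :2], y)
sizes_all, train_all, cv_all = forest_learning_curve(X, y)

for n, tr, cv in zip(sizes_all, train_all, cv_all):
    print(f"n={n:4d}  train={tr:.3f}  cv={cv:.3f}")
```

Plotting cv_2 and cv_all against the sizes shows how much the extra features help. The training curve of a forest with fully grown trees tends to sit near 1 either way, so the CV curve is the one to compare.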
Related
I am working from this article: "A novel method for predicting kidney stone type using ensemble learning". The authors used a genetic algorithm to find the optimal weight vector for voting with WEKA, but I don't see how they did that. How can I use a genetic algorithm to find the weights of a voting classifier with WEKA?
The paragraph below is extracted from the article:
In order to enhance the performance of the voting algorithm, a weighted
majority vote is used. Simple majority vote algorithm is usually an
effective way to combine different classifiers, but not all
classifiers have the same effect on the classification problem. To
optimize the results from weight majority vote classifier, we need to
find the optimal weight vector. Applying Genetic algorithms is our
solution for finding the optimal weight vector in this problem.
Assuming you have some trained classifiers and a test set, you can create a method calculateFitness(double[] weights). In this method, compute every classifier's prediction for each Instance and merge them into one prediction according to the weights. Then use the combined predictions and the real values to calculate the total score you want to maximize or minimize.
Using the calculateFitness method as the fitness function, you can write a custom GA to find the best weights.
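Here is a minimal sketch of that idea in Python with NumPy rather than WEKA/Java, so none of it is WEKA's API. The classifier outputs are simulated probabilities, calculate_fitness mirrors the calculateFitness method described above, and genetic_search with its parameters is a hypothetical helper. It is also not the procedure from the article, just one simple GA (truncation selection, uniform crossover, Gaussian mutation, elitism).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated outputs of three already-trained classifiers (an assumption):
# probs[k, i, c] = probability classifier k gives class c for instance i.
n_instances, n_classes = 200, 3
y_true = rng.integers(0, n_classes, size=n_instances)
one_hot = np.eye(n_classes)[y_true]
noise_levels = [0.5, 1.0, 3.0]  # classifier 0 is the most reliable one here
logits = np.stack(
    [2.0 * one_hot + s * rng.normal(size=one_hot.shape) for s in noise_levels]
)
probs = np.exp(logits) / np.exp(logits).sum(axis=2, keepdims=True)

def calculate_fitness(weights):
    """Accuracy of the weighted vote over all instances."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    combined = np.tensordot(w, probs, axes=1)  # shape (n_instances, n_classes)
    return float((combined.argmax(axis=1) == y_true).mean())

def genetic_search(fitness, n_weights, pop_size=30, generations=40, mutation_scale=0.1):
    population = rng.random((pop_size, n_weights)) + 1e-6
    population[0] = 1.0  # include the unweighted vote as a baseline individual
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in population])
        order = np.argsort(scores)[::-1]
        parents = population[order[: pop_size // 2]]  # truncation selection
        idx_a = rng.integers(0, len(parents), size=pop_size - 1)
        idx_b = rng.integers(0, len(parents), size=pop_size - 1)
        mask = rng.random((pop_size - 1, n_weights)) < 0.5  # uniform crossover
        children = np.where(mask, parents[idx_a], parents[idx_b])
        children = children + mutation_scale * rng.normal(size=children.shape)
        children = np.clip(children, 1e-6, None)  # keep weights positive
        population = np.vstack([parents[0], children])  # elitism: keep the best
    scores = np.array([fitness(ind) for ind in population])
    best = population[scores.argmax()]
    return best / best.sum(), float(scores.max())

best_weights, best_score = genetic_search(calculate_fitness, n_weights=3)
print("weights:", np.round(best_weights, 3), "accuracy:", round(best_score, 3))
print("unweighted vote accuracy:", calculate_fitness(np.ones(3)))
```

In practice you would probably compute the fitness on a separate validation set and only report the final score on the test set, otherwise the weights are tuned on the data you evaluate with.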
Can we apply only a genetic algorithm to a dataset for linear regression?
For example:
Assume we have a dataset with features such as TOEFL score, CGPA, GRE score, etc., and output values giving the chance of admission. We have to predict the chance of admission from the features. Link to the dataset
A lot of things are possible with genetic algorithms. You just have to be sure that you are using the correct dataset, you have to know what you want to get from it and, last but not least, you have to know exactly what you are doing, which means you need the correct fitness function :)
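To make the "correct fitness function" point concrete, here is a small sketch where a GA searches for linear regression coefficients using negative mean squared error as the fitness. The data is synthetic (three made-up features, not the admission dataset), and the population size, mutation scale and number of generations are arbitrary choices. The closed-form least-squares solution is computed only as a reference.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data (an assumption): 3 features plus an intercept.
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([0.5, -0.3, 0.8]) + 0.2 + 0.05 * rng.normal(size=n)
X1 = np.column_stack([X, np.ones(n)])  # last column carries the intercept

def fitness(coef):
    """Negative mean squared error, so that higher is better."""
    return -np.mean((X1 @ coef - y) ** 2)

population = rng.normal(size=(50, 4))
for generation in range(200):
    scores = np.array([fitness(c) for c in population])
    elite = population[np.argsort(scores)[::-1][:10]]  # keep the 10 best
    parents_a = elite[rng.integers(0, 10, size=40)]
    parents_b = elite[rng.integers(0, 10, size=40)]
    # Arithmetic crossover (average of two parents) plus Gaussian mutation.
    children = 0.5 * (parents_a + parents_b) + 0.05 * rng.normal(size=(40, 4))
    population = np.vstack([elite, children])

scores = np.array([fitness(c) for c in population])
ga_coef = population[scores.argmax()]
ols_coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

print("GA  MSE:", -fitness(ga_coef))
print("OLS MSE:", -fitness(ols_coef))
```

For squared error the least-squares solution is already optimal, so the GA can at best match it. A GA becomes more interesting when the fitness is something without a closed-form solution.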
I am trying to use machine learning to make predictions on a dataset. It is a regression problem with 180 input features and 1 continuous-valued output. I am comparing deep neural networks, random forest regression, and linear regression.
As I expected, a 3-hidden-layer deep neural network outperforms the other two approaches, with a root mean square error (RMSE) of 0.1. However, I did not expect random forest to perform even worse than linear regression (RMSE 0.29 vs. 0.27). I expected the random forest to discover more complex dependencies between features and so decrease the error. I have tried tuning the parameters of the random forest (number of trees, maximum features, max_depth, etc.). I also tried different values of K for K-fold cross-validation, but the performance is still worse than linear regression.
I searched online, and one answer says linear regression may perform better if the target has a smooth, nearly linear dependence on the covariates. I do not fully get the point, because if that is the case, shouldn't deep neural networks give little performance gain?
I am struggling to find an explanation. In what situation is random forest worse than linear regression while deep neural networks perform much better?
If your features have a linear relation to the target variable, then a linear model usually performs better than a random forest model. It depends entirely on how linear the relations between your features and the target are.
That said, linear models are not superior in general, nor is random forest inferior.
Try scaling and transforming the data, for example with MinMaxScaler() from scikit-learn, to see if the linear model improves further. (Rescaling alone does not change the predictions of plain least squares; it matters for regularised linear models and when combined with other transforms.)
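As a quick illustration of the first point, the sketch below builds a synthetic target that is linear in the features and compares a MinMaxScaler + LinearRegression pipeline against a random forest by cross-validated RMSE. The data and settings are assumptions for illustration only; it says nothing about the 180-feature dataset in the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic data with a purely linear signal plus a little noise (an assumption).
X = rng.normal(size=(300, 20))
y = X @ rng.normal(size=20) + 0.1 * rng.normal(size=300)

models = {
    "linear": make_pipeline(MinMaxScaler(), LinearRegression()),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

rmse = {}
for name, model in models.items():
    scores = cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    rmse[name] = -scores.mean()  # scores are negated RMSE values

for name, value in rmse.items():
    print(f"{name}: RMSE = {value:.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so no information from the validation fold leaks into the scaling.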
Pro Tips
If a linear model is working like a charm, you need to ask yourself why and how, and go back to the basics of both models to understand why it worked on your data. These questions will lead you to better feature engineering. As a matter of fact, Kaggle Grandmasters do use linear models in stacking to get that top 1% score by capturing the linear relations in the dataset.
So at the end of the day, linear models can work wonders too.
According to my understanding, RF selects features randomly and hence is hard to overfit. But, in sklearn Gradient boosting also offers the option of max_features which can help to prevent overfitting. So, why would anyone use Random forest?
Can anyone explain when to use Gradient boosting vs Random forest based on the given data?
Any help is highly appreciated.
In my personal experience, Random Forest can be a better choice when:
You train a model on a small data set.
Your data set has few features to learn from.
Your data set has few positive labels (a low Y flag count), or you are trying to predict an event that rarely occurs.
In these situations, Gradient Boosting algorithms like XGBoost and LightGBM can overfit (even when their parameters are tuned), while simpler algorithms like Random Forest or even Logistic Regression may perform better. To illustrate, for XGBoost and LightGBM the ROC AUC on the test set may be higher than for Random Forest, but the difference from the ROC AUC on the training set is too large.
Despite the sharp predictions from Gradient Boosting algorithms, in some cases Random Forest benefits from the model stability of the bagging methodology (random selection) and outperforms XGBoost and LightGBM. However, Gradient Boosting algorithms perform better in general.
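One way to check the train/test gap described above is sketched here. It uses scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost and LightGBM (a swap made only to keep the example to one library) and a small, imbalanced synthetic data set. Which model shows the larger gap depends on the data and settings, so the numbers it prints are not evidence for either side.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Small data set with a rare positive class (roughly 10%), as an assumption.
X, y = make_classification(
    n_samples=400,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    weights=[0.9, 0.1],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

report = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    auc_train = roc_auc_score(y_train, model.predict_proba(X_train)[:, 1])
    auc_test = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    report[name] = (auc_train, auc_test, auc_train - auc_test)

for name, (auc_train, auc_test, gap) in report.items():
    print(f"{name}: train AUC={auc_train:.3f}  test AUC={auc_test:.3f}  gap={gap:.3f}")
```

With so few positives a single split is noisy, so repeating this over several random seeds or using cross-validation gives a fairer picture.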
Similar question asked on Quora:
https://www.quora.com/How-do-random-forests-and-boosted-decision-trees-compare
I agree with the author at the link that random forests are more robust -- they don't require much problem-specific tuning to get good results. Besides that, a couple other items based on my own experience:
Random forests can perform better on small data sets; gradient boosted trees are data hungry.
Random forests are easier to explain and understand. This may seem silly, but it can lead to better adoption of a model when it has to be used by less technical people.
I think that's also true. I have also read this page: How Random Forest Works.
It explains the advantages of random forest like this:
For applications in classification problems, Random Forest algorithm
will avoid the overfitting problem
For both classification and
regression task, the same random forest algorithm can be used
The Random Forest algorithm can be used for identifying the most
important features from the training dataset, in other words,
feature engineering.
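For the last point in the quote, a minimal sketch of reading feature importances from a fitted forest in scikit-learn is below (strictly speaking this is closer to feature ranking or selection than feature engineering). The data is synthetic, and with shuffle=False the first three columns are the informative ones by construction, which is an assumption of this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Columns 0-2 carry signal, columns 3-7 are pure noise (shuffle=False keeps that order).
X, y = make_classification(
    n_samples=500,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

importances = forest.feature_importances_  # impurity-based, sums to 1
ranking = np.argsort(importances)[::-1]    # most important feature first

for idx in ranking:
    print(f"feature {idx}: importance = {importances[idx]:.3f}")
```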
Imagine we have a classification problem on a dataset where the examples are only positive (or, equivalently, only negative). For instance, a problem where the winning class is specified by position (e.g. think of a tennis dataset where the first player is always the winner). How can we create negative examples in order to train a supervised learning algorithm on this dataset? One idea could be to generate negative examples by exchanging the positions of the features that are tied to each of the classes. Do you think this will give an unbiased dataset? Could we create negative duplicates of our original dataset and train a supervised learning algorithm on this doubled dataset?
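To make the two ideas concrete, here is a small NumPy sketch. The column layout (two features per player, first player always the winner) and the function names are hypothetical. mirrored_dataset builds the doubled dataset with swapped negatives, and random_swap_dataset is an alternative that flips each row with probability 0.5 so every match appears only once.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layout: columns 0-1 describe the first player (always the winner),
# columns 2-3 describe the second player. Values are made up.
matches = np.array([
    [1.0, 20.0, 5.0, 31.0],
    [3.0, 25.0, 2.0, 28.0],
    [7.0, 22.0, 9.0, 30.0],
    [4.0, 27.0, 1.0, 24.0],
])

def swap_players(rows, n_first):
    """Exchange the first-player block of columns with the second-player block."""
    return np.hstack([rows[:, n_first:], rows[:, :n_first]])

def mirrored_dataset(rows, n_first):
    """Doubled dataset: originals labelled 1, swapped copies labelled 0."""
    X = np.vstack([rows, swap_players(rows, n_first)])
    y = np.concatenate([np.ones(len(rows), dtype=int), np.zeros(len(rows), dtype=int)])
    return X, y

def random_swap_dataset(rows, n_first, rng):
    """Same size as the input: each row is swapped with probability 0.5."""
    flip = rng.random(len(rows)) < 0.5
    X = np.where(flip[:, None], swap_players(rows, n_first), rows)
    y = (~flip).astype(int)  # 1 means the first player won
    return X, y

X_double, y_double = mirrored_dataset(matches, n_first=2)
X_random, y_random = random_swap_dataset(matches, n_first=2, rng=rng)
print(X_double, y_double)
print(X_random, y_random)
```

If you use the doubled dataset, keep each match and its mirror in the same cross-validation fold; otherwise the model sees the mirrored copy of a test row during training.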