Looking for the best (clearest, shortest, brightest)
concise distinction between the ML terms “Decision Forest" and “Random Forest"?
Note the similar and also unanswered question:
Multiclass Decision Forest vs Random Forest
Random forests or random decision forests is an extension of the decision forests (ensemble of decision trees) combining bagging and random selection of features to construct a collection of decision trees with controlled variance.
Random forest is an extension of random decision forest that includes bagging. For details check the original paper by Breiman or more lightweight description on Wikipedia. Majority of well-known machine learning libraries, like Python's scikit learn, implements Random Forest.
Related
What is the difference if we use Decision Tree as Base estimator in AdaBoost algorithm ?
Is Random Forest a special case of AdaBoost?
Most certainly not; Random Forest is a case of bagging ensemble algorithm (short for bootstrap aggregating), which is different from boosting - check here for their differences.
What is the difference if we use Decision Tree as Base estimator in AdaBoost algorithm ?
You don't get a Random Forest, but a Gradient Tree Boosting Machine, available in several packages like xgboost (R/Python), gbm (R), scikit-learn (Python) etc.
Check chapter 8 of the excellent (and freely available) book An Introduction to Statistical Learning for more, or The Elements of Statistical Learning (heavy in math & theory, not for the faint-hearted)...
According to my understanding, RF selects features randomly and hence is hard to overfit. But, in sklearn Gradient boosting also offers the option of max_features which can help to prevent overfitting. So, why would anyone use Random forest?
Can anyone explain when to use Gradient boosting vs Random forest based on the given data?
Any help is highly appreciated.
According to my personal experience, Random Forest could be a better choice when..
You train a model on small data set.
Your data set has few features to learn.
Your data set has low Y flag count or you try to predict a situation that has low chance to occur or rarely occurs.
In these situations, Gradient Boosting algorithms like XGBoost and Light GBM can overfit (though their parameters are tuned) while simple algorithms like Random Forest or even Logistic Regression may perform better. To illustrate, for XGboost and Ligh GBM, ROC AUC from test set may be higher in comparison with Random Forest but shows too high difference with ROC AUC from train set.
Despite the sharp prediction form Gradient Boosting algorithms, in some cases, Random Forest take advantage of model stability from begging methodology (selecting randomly) and outperform XGBoost and Light GBM. However, Gradient Boosting algorithms perform better in general situations.
Similar question asked on Quora:
https://www.quora.com/How-do-random-forests-and-boosted-decision-trees-compare
I agree with the author at the link that random forests are more robust -- they don't require much problem-specific tuning to get good results. Besides that, a couple other items based on my own experience:
Random forests can perform better on small data sets; gradient boosted trees are data hungry
Random forests are easier to explain and understand. This perhaps seems silly but can lead to better adoption of a model if needed to be used by less technical people
I think that's also true. I have also read on this page How Random Forest Works
There explains the advantages of random forest. like this :
For applications in classification problems, Random Forest algorithm
will avoid the overfitting problem
For both classification and
regression task, the same random forest algorithm can be used
The Random Forest algorithm can be used for identifying the most
important features from the training dataset, in other words,
feature engineering.
Naive Bayes Algorithm assumes independence among features. What are some text classification algorithms which are not Naive i.e. do not assume independence among it's features.
The answer will be very straight forward, since nearly every classifier (besides Naive Bayes) is not naive. Features independence is very rare assumption, and is not taken by (among huge list of others):
logistic regression (in NLP community known as maximum entropy model)
linear discriminant analysis (fischer linear discriminant)
kNN
support vector machines
decision trees / random forests
neural nets
...
You are asking about text classification, but there is nothing really special about text, and you can use any existing classifier for such data.
How does Multiclass Decision Forest differ from Random Forest? What factors do they have in common? It appears there is not a clear answer on the web regarding this matter.
Random forests or random decision forests is an extension of the decision forests (ensemble of decision trees) combining bagging and random selection of features to construct a collection of decision trees with controlled variance.
A very good paper from Microsoft research you may consider to look at.
I want to implement a language classifier like Linguist in Github:-
http://www.github.com/github/linguist
I don't know if Random forest is better than Bayesian in terms of complexity.
There would be lot of sample data to train for each programming language.
Can Random forest outperform bayesian?
If you're not very experienced, and you don't use toooo many features ( and they are not too sparse ), then random forest will probably work just fine.