Decision Trees in the Random Forest Algorithm

Hi, I'm a beginner with the random forest algorithm in machine learning.
According to what I have read, in theory it works on a majority-vote concept for classification problems. But is it possible for the number of "Yes" votes to be the same as the number of "No" votes?
What would be done in that case?

Is it possible for the number of "Yes" votes to be the same as the number of "No" votes?
Decision trees do not "coordinate" their predictions amongst themselves, so any ratio of "Yes" to "No" predictions is possible, including a tie (i.e., a 50/50 split).
What would be done in that case?
You break a tie according to a predefined tie-breaking rule. For example, you can stipulate that the "Yes" class always wins over the "No" class in such a scenario.
It's advisable to use an odd number of decision trees in a random forest in order to eliminate the possibility of ties in binary classification. For example, if your random forest has 101 member decision trees, there can't be a 50/50 outcome: one class will always have at least one extra vote.
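Here is a minimal sketch of such a tie-breaking rule in plain Python; the tie_winner default is an illustrative choice, not a fixed convention:

from collections import Counter

def majority_vote(predictions, tie_winner="Yes"):
    # Count votes and check whether the top two classes are tied
    ranked = Counter(predictions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_winner  # predefined rule: "Yes" wins ties
    return ranked[0][0]

print(majority_vote(["Yes", "No", "Yes", "No"]))  # tie -> "Yes"
print(majority_vote(["Yes", "No", "No"]))         # clear majority -> "No"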


Why does a random forest with n_estimators equal to 1 sometimes perform worse than a decision tree? [duplicate]

This question already has an answer here:
Why is Random Forest with a single tree much better than a Decision Tree classifier? (1 answer)
Why does a random forest with n_estimators equal to 1 perform worse than a decision tree in some cases, even after setting bootstrap to False?
I was trying different machine learning models for predicting credit card default rates. I tried a random forest and a decision tree, but the random forest seemed to perform worse. Then I tried a random forest with only one tree, which is supposed to be the same as a decision tree, but it still performed worse.
A specific answer to your observations depends on the implementations of the decision tree (DT) and random forest (RF) methods that you're using. That said, these are the three most likely reasons:
Bootstrapping: Although you mention that you set this to False, in the most general form, RFs use two forms of random subsampling: bootstrapping of the dataset and random selection of features at each split. Perhaps the setting only controls one of these. Even if both are off, some RF implementations have other parameters that control the number of attributes considered for each split of the tree and how they are selected.
Tree hyperparameters: Related to my remark on the previous point, the other aspect to check is whether all of the other tree hyperparameters are the same. Tree depth, the minimum number of points per leaf node, etc. would all have to be matched to make the methods directly comparable.
Growing method: Lastly, it is important to remember that trees are learned via indirect/heuristic losses that are often greedily optimized. Accordingly, there are different algorithms for growing the trees (e.g., C4.5), and the DT and RF implementations may be using different approaches.
If all of these match, then the differences should really be minor. If there are still differences (i.e., "in some cases"), they may be due to randomness in initialization and the greedy learning schemes, which can lead to suboptimal trees. That is the main motivation for RFs, in which ensemble diversity is used to mitigate these issues.
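As a minimal sketch (assuming scikit-learn; the dataset and parameter values are illustrative), matching the settings above should make a single-tree forest behave essentially like a plain decision tree:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)

dt = DecisionTreeClassifier(random_state=0)
rf = RandomForestClassifier(
    n_estimators=1,
    bootstrap=False,    # disable dataset bootstrapping
    max_features=None,  # consider every feature at each split, like the DT
    random_state=0,
)
# Agreement between the two should be (near) 1.0 once settings match
print((dt.fit(X, y).predict(X) == rf.fit(X, y).predict(X)).mean())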

Regression tree output

I'm confused about the intuition behind decision trees when they are used to predict continuous targets in machine learning.
I understand that decision trees use splits based on feature values to decide which branches of the tree to go down to reach a leaf value.
It intuitively makes sense to me when doing inference for classification with nominal targets, because each leaf has a specific value (label), so after going down enough branches one eventually arrives at a discrete value, which is the label.
But if we're doing regression, where a machine learning model predicts a value on a continuum, for example a real number between 0 and 100, how could there be enough leaves to allow the model to output any real number between 0 and 100?
Regression trees are only what you could call "pseudo-continuous", in contrast, for example, to linear regression models. At the leaves, the output takes a constant value over certain ranges of the independent variable(s), determined by the aforementioned splits.
However, there exists some academic work that fits (regression) models in the nodes (...). See the accepted answer here:
https://stats.stackexchange.com/questions/439756/decision-tree-that-fits-a-regression-at-leaf-nodes
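To make this concrete, here is a minimal sketch (assuming scikit-learn) showing that a regression tree's output is piecewise constant, with at most one distinct value per leaf:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.linspace(0, 100, 500).reshape(-1, 1)
y = X.ravel() + rng.normal(0, 5, size=500)  # noisy continuous target

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
# A depth-3 tree has at most 2**3 = 8 leaves, hence at most 8 distinct outputs
print(len(np.unique(tree.predict(X))))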

Decision Tree Uniqueness sklearn

I have some questions regarding decision trees and the random forest classifier.
Question 1: Is a trained Decision Tree unique?
I believe that it should be unique, as it maximizes information gain over each split. Now, if it is unique, why is there a random_state parameter in the decision tree classifier? If it is unique, it will be reproducible every time, so there should be no need for random_state.
Question 2: What does a decision tree actually predict?
While going through the random forest algorithm I read that it averages the probability of each class from its individual trees. But as far as I know, a decision tree predicts a class, not the probability of each class.
Even without checking out the code, you will see this note in the docs:
The features are always randomly permuted at each split. Therefore, the best found split may vary, even with the same training data and max_features=n_features, if the improvement of the criterion is identical for several splits enumerated during the search of the best split. To obtain a deterministic behaviour during fitting, random_state has to be fixed.
For splitter='best', this is happening here:
# Draw a feature at random
f_j = rand_int(n_drawn_constants, f_i - n_found_constants,
               random_state)
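In practice (a minimal sketch, assuming scikit-learn), this is why fixing random_state makes fitting deterministic even when several candidate splits tie:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
t1 = DecisionTreeClassifier(random_state=42).fit(X, y)
t2 = DecisionTreeClassifier(random_state=42).fit(X, y)
print((t1.predict(X) == t2.predict(X)).all())  # True: fits are reproducible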
And for your other question, read this:
...
Just build the tree so that the leaves contain not just a single class estimate, but also a probability estimate as well. This could be done simply by running any standard decision tree algorithm, and running a bunch of data through it and counting what portion of the time the predicted label was correct in each leaf; this is what sklearn does. These are sometimes called "probability estimation trees," and though they don't give perfect probability estimates, they can be useful. There was a bunch of work investigating them in the early '00s, sometimes with fancier approaches, but the simple one in sklearn is decent for use in forests.
...
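As a brief illustration (assuming scikit-learn), a single tree exposes these leaf class fractions via predict_proba, and a random forest averages them across its trees:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(tree.predict(X[:3]))        # hard class labels
print(tree.predict_proba(X[:3]))  # class fractions in each sample's leaf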

Classification vs Regression?

I am not quite sure what the differences are between classification and regression.
From what I understand, classification is categorical: it's either this or it's that.
Regression is more of a numerical prediction.
Both of the problems above would be regression problems, right? Both use a learning algorithm to predict. Could anyone give an example of classification vs. regression?
You are correct: given some data point, classification assigns a label (or 'class') to that point. This label is, as you said, categorical. One example might be, say, malware classification: given some file, is it malware or is it not? (The "label" will be the answer to this question: 'yes' or 'no'.)
But in regression, the goal is instead to predict a real value (i.e. not categorical). An example here might be, given someone's height and age, predict their weight.
So in either of the questions you've quoted, the answer comes down to what you are trying to get out of your prediction: a category, or a real value?
(A side note: there are connections and relations between the two problems, and you could, if you wanted, see regression as an extension of classification to the case where the labels are ordinal and there are infinitely many of them.)
1. Classification is a process of organizing data into categories for its most effective and efficient use, whereas regression is the process of identifying a relationship and the effect of that relationship on a future value of the outcome.
2. Classification is used to predict categorical data, whereas regression is used to predict numerical (continuous) data.
Classification example:
Predicting whether a share of a company is good to buy or not, given the previous history of the company along with buyers' reviews saying yes or no to buying the share. (Discrete answer: Buy - Yes/No)
Regression example:
Predicting the best price at which one should buy a share of a company, given the previous history of the company along with the prices at which buyers bought the share in the past. (Continuous answer: a price)
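A minimal sketch (assuming scikit-learn; the toy data is made up for illustration) of the same distinction in code, where a classifier returns a category and a regressor a real value:

import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # a single toy feature
buy = np.array(["No", "No", "Yes", "Yes"])   # categorical target
price = np.array([10.5, 20.0, 31.2, 39.8])   # continuous target

# Classification: the output is one of the labels ("Yes"/"No")
print(DecisionTreeClassifier(random_state=0).fit(X, buy).predict([[3.5]]))
# Regression: the output is a real number
print(DecisionTreeRegressor(random_state=0).fit(X, price).predict([[3.5]]))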

How to select training data for a Naive Bayes classifier

I want to double-check some concepts I am uncertain of regarding the training set for classifier learning. When we select records for our training data, do we select an equal number of records per class, summing to N, or should we randomly pick N records regardless of class?
Intuitively I was thinking of the former, but then the prior class probabilities would be equal, and wouldn't that make them unhelpful?
It depends on the distribution of your classes, and the determination can only be made with domain knowledge of the problem at hand.
You can ask the following questions:
Are there any two classes that are very similar and does the learner have enough information to distinguish between them?
Is there a large difference in the prior probabilities of each class?
If so, you should probably redistribute the classes.
In my experience, there is no harm in redistributing the classes, but it's not always necessary.
It really depends on the distribution of your classes. In the case of fraud or intrusion detection, the class to be predicted can make up less than 1% of the data.
In this case, you must distribute the classes evenly in the training set if you want the classifier to learn the differences between the classes. Otherwise, it will produce a classifier that correctly classifies over 99% of the cases without ever correctly identifying a fraud case, which defeats the whole point of creating the classifier in the first place.
Once you have a set of evenly distributed classes you can use any technique, such as k-fold, to perform the actual training.
Another example where class distributions need to be adjusted, but not necessarily in an equal number of records for each, is the case of determining upper-case letters of the alphabet from their shapes.
If you take a distribution of letters commonly used in the English language to train the classifier, there will be almost no cases, if any, of the letter Q. On the other hand, the letter O is very common. If you don't redistribute the classes to allow for the same number of Q's and O's, the classifier doesn't have enough information to ever distinguish a Q. You need to feed it enough information (i.e. more Qs) so it can determine that Q and O are indeed different letters.
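Here is a minimal sketch of one way to redistribute classes by undersampling the more common ones; X and y are assumed to be NumPy arrays of features and labels:

import numpy as np

def undersample(X, y, random_state=0):
    # Keep only as many records of each class as the rarest class has
    rng = np.random.RandomState(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]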
The preferred approach is to use K-fold cross-validation for selecting the training and testing data.
Quote from Wikipedia:
K-fold cross-validation
In K-fold cross-validation, the original sample is randomly partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K − 1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. The K results from the folds can then be averaged (or otherwise combined) to produce a single estimation. The advantage of this method over repeated random sub-sampling is that all observations are used for both training and validation, and each observation is used for validation exactly once. 10-fold cross-validation is commonly used.
In stratified K-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly the same proportions of the two types of class labels.
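A minimal sketch (assuming scikit-learn; the dataset and model are illustrative) of stratified K-fold cross-validation, which preserves the class proportions in every fold:

from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# An imbalanced toy problem: roughly 90% / 10% class split
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
print(cross_val_score(GaussianNB(), X, y, cv=cv).mean())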
You should always take the common approach so that your results are comparable with other scientific work.
I built an implementation of a Bayesian classifier to determine whether a sample is NSFW (not safe for work) by examining the occurrence of words in examples. When training a classifier for NSFW detection, I tried making each class in the training set have the same number of examples. This didn't work out as well as I had planned, because one of the classes had many more words per example than the other.
Since I was computing the likelihood of NSFW based on these words, I found that balancing the classes based on their actual size (in MB) worked. I tried 10-fold cross-validation for both approaches (balancing by number of examples and by size of classes) and found that balancing by the size of the data worked well.
