I have learned that Sklearn treats multi-class classification problems as a collection of binary problems. Quoting the Sklearn user guide:
In extending a binary metric to multiclass or multilabel problems, the data is treated as a collection of binary problems, one for each class.
So, binary classification models like LogisticRegression or support vector machines can support multi-class cases by using either One-vs-One or One-vs-Rest strategies. I wanted to know if that is the case for RandomForestClassifier too. How about other classifiers in Sklearn - are they all used as binary classifiers under the hood when dealing with a multi-class problem?
According to the documentation for Decision Trees, multi-output problems require only a small change to the leaves of each tree in a random forest.
Suppose you have set criterion='gini'. In essence, each node is built by picking a subset of max_features features, calculating the average reduction in Gini impurity across all N classes, and choosing the feature-threshold combination that reduces it most.
This means that random forests do not create one model for each class. Instead, it's only one model that simultaneously reduces the criterion metric for all classes in each node of every tree and predicts the most common class at each leaf.
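As a minimal sketch of this (using the bundled Iris data and hypothetical parameter choices), a single RandomForestClassifier is fit on all three classes at once, with no One-vs-Rest wrapper involved:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Three classes, one model: no binary decomposition is needed
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(criterion='gini', max_features='sqrt', random_state=0)
clf.fit(X, y)
# One probability per class comes out of the same single model
print(clf.predict_proba(X[:2]))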
In the TensorFlow library, what does the tf.estimator.LinearClassifier class do in linear models? (In other words, what is it used for?)
LinearClassifier is essentially logistic regression.
According to the TensorFlow documentation, tf.estimator.LinearClassifier is used to
Train a linear model to classify instances into one of multiple possible classes. When number of possible classes is 2, this is binary classification.
Linear regression predicts a value, while a linear classifier predicts a class. Classification aims at predicting the probability of each class given a set of inputs.
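As a rough sketch of how it is used (tf.estimator is deprecated and has been removed from the newest TensorFlow releases, so this assumes a version where it is still available; the toy data here is made up):
import numpy as np
import tensorflow as tf

# Hypothetical toy data: 4 real-valued features, 3 possible classes
X = np.random.rand(150, 4).astype(np.float32)
y = np.random.randint(0, 3, size=150)

feature_columns = [tf.feature_column.numeric_column('x', shape=[4])]
# n_classes=3 gives a multi-class (softmax) linear model;
# the default n_classes=2 is plain logistic regression
clf = tf.estimator.LinearClassifier(feature_columns=feature_columns, n_classes=3)

def train_input_fn():
    ds = tf.data.Dataset.from_tensor_slices(({'x': X}, y))
    return ds.shuffle(150).batch(32).repeat()

clf.train(input_fn=train_input_fn, steps=200)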
For implementation of tf.estimator.LinearClassifier, please follow this tutorial by guru99.
To learn more about linear classifiers, read this article.
I recently found a model that classifies Iris flowers based on their measurements. There are 3 flower species as the target (dependent variable). As far as I know, categorical data should be encoded so that it can be used in machine learning. However, in this model the data is used directly, without an encoding step.
Can anyone help to explain when to use encoding? Thank you in advance!
Relevant question - encoding of continuous feature variables.
The Iris data were originally published by Fisher together with his linear discriminant classifier.
Generally, a distinction is made between:
Real-valued classifiers
Discrete-feature classifiers
Linear discriminant analysis and quadratic discriminant analysis are real-valued classifiers. Simply adding discrete variables as extra inputs does not work. Special procedures for working with indicator variables (the name used in statistics) in discriminant analysis have been developed. Likewise, the k-nearest neighbour classifier really only works well with real-valued feature variables.
The naive Bayes classifier is most commonly used for classification problems with discrete features. When you don't want to assume conditional independence between the feature variables, the multinomial classifier can be applied to discrete features.
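For instance, scikit-learn's CategoricalNB handles integer-coded discrete features directly; a minimal sketch with made-up data:
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical discrete features: each column is a categorical variable
X_raw = np.array([['small', 'red'], ['big', 'red'], ['small', 'blue'], ['big', 'blue']])
y = np.array([0, 1, 0, 1])

enc = OrdinalEncoder()  # CategoricalNB expects integer-coded categories
X = enc.fit_transform(X_raw)
clf = CategoricalNB().fit(X, y)
print(clf.predict(enc.transform([['small', 'blue']])))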
Neural networks and support vector machines can combine real-valued and discrete features. My advice is to use one separate input node for each discrete outcome - don't use a single input node fed with values like (0: small, 1: minor, 2: medium, 3: larger, 4: big). One-input-node-per-outcome encoding will improve your training result and yield better test set performance.
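That encoding is what scikit-learn's OneHotEncoder produces; a minimal sketch using the made-up size values from above:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One column with five discrete outcomes (the hypothetical size variable)
sizes = np.array([['small'], ['minor'], ['medium'], ['larger'], ['big']])
# One input node (column) per outcome; the argument is named
# sparse instead of sparse_output in scikit-learn < 1.2
enc = OneHotEncoder(sparse_output=False)
print(enc.fit_transform(sizes))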
The random forest classifier also combines real-valued and discrete features seamlessly.
My final advice is to train and compare on a test set at least 4 different types of classifiers, as there is no such thing as a universally best type of classifier.
I would like to check with you if my understanding about ensemble learning (homogeneous vs heterogeneous) is correct.
Is the following statement correct?
A homogeneous ensemble is a set of classifiers of the same type built on different data, such as a random forest, while a heterogeneous ensemble is a set of classifiers of different types built on the same data.
If it's not correct, could you please clarify this point?
A homogeneous ensemble consists of members that share a single type of base learning algorithm. Popular methods like bagging and boosting generate diversity by sampling from, or assigning weights to, training examples, but generally utilize a single type of base classifier to build the ensemble.
On the other hand, a heterogeneous ensemble consists of members with different base learning algorithms, such as an SVM, an ANN and a decision tree. A popular heterogeneous ensemble method is stacking, in which a meta-learner combines the base learners' predictions.
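A minimal stacking sketch in scikit-learn (X_train, y_train and X_test are assumed to be available, as in the snippets further below):
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Heterogeneous base learners, combined by a logistic-regression meta-learner
clf_stack = StackingClassifier(
    estimators=[('svm', SVC()), ('dt', DecisionTreeClassifier())],
    final_estimator=LogisticRegression(),
)
clf_stack.fit(X_train, y_train)
y_pred = clf_stack.predict(X_test)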
This table contains examples for both homogeneous and heterogeneous ensemble models.
EDIT:
Homogeneous ensemble methods use the same feature selection method with different training data, distributing the dataset over several nodes, while heterogeneous ensemble methods use different feature selection methods with the same training data.
Heterogeneous Ensembles (HEE) use different fine-tuned algorithms. They usually work well if we have a small number of estimators. Note that the number of algorithms should be odd (3+) in order to avoid ties. For example, we could combine a decision tree, an SVM and a logistic regression using a voting mechanism, then use the combined wisdom through a majority vote to classify a given sample. Besides voting, we can also use averaging or stacking to aggregate the results of the models. The data for each model is the same.
Homogeneous Ensembles (HOE), such as bagging, work by applying the same algorithm to all the estimators. These algorithms should not be fine-tuned -> they should be weak! In contrast to HEE, we use a large number of estimators. Note that the datasets for each model should be sampled separately in order to guarantee independence, so the datasets differ from model to model. This allows us to be more precise when aggregating the results of each model. Bagging reduces variance because the sampling is truly random. By using the ensemble itself, we reduce the risk of over-fitting and obtain a robust model. Unfortunately, bagging is computationally expensive.
EDIT: Here is an example in code.
Heterogeneous ensemble example:
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Instantiate the individual models
clf_knn = KNeighborsClassifier(5)
clf_decision_tree = DecisionTreeClassifier()
clf_logistic_regression = LogisticRegression()
# Create the voting classifier (hard voting, i.e. majority vote, by default)
clf_voting = VotingClassifier(
    estimators=[
        ('knn', clf_knn),
        ('dt', clf_decision_tree),
        ('lr', clf_logistic_regression),
    ]
)
# Fit it to the training set and predict
clf_voting.fit(X_train, y_train)
y_pred = clf_voting.predict(X_test)
Homogeneous ensemble example:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Instantiate the base estimator, a weak model (max depth limited to 3)
clf_decision_tree = DecisionTreeClassifier(max_depth=3)
# Build the Bagging classifier with 5 estimators (5 decision trees);
# the parameter is named base_estimator in scikit-learn < 1.2
clf_bag = BaggingClassifier(
    estimator=clf_decision_tree,
    n_estimators=5,
)
# Fit the Bagging model to the training set
clf_bag.fit(X_train, y_train)
# Make predictions on the test set
y_pred = clf_bag.predict(X_test)
Conclusion: in summary, yes, what you say is correct.
Suppose a dataset contains multiple categories, e.g. class 0, class 1 and class 2. Now the goal is to divide new samples into class 0 or non-class-0.
One can
combine classes 1 and 2 into a unified non-0 class and train a binary classifier,
or train a multi-class classifier and map its predictions to the binary decision.
How do these two approaches compare in performance?
I think more categories will bring about a more accurate discriminant surface; however, the weights of classes 1 and 2 are each lower than that of the non-0 class, resulting in fewer samples being judged as non-0.
Short answer: You would have to try both and see.
Why? It really depends on your data and the algorithm you use (just like many other machine learning questions).
For many classification algorithms (e.g. SVM, logistic regression), multi-class classification is handled via one-vs-rest under the hood, and the 0-vs-rest sub-problem already treats class 1 and class 2 as a single class. Therefore, there is little point in running a full multi-class scenario if you just need to separate out class 0.
For algorithms such as Neural Networks, where having multiple output classes is more natural, I think training a multi-class classifier might be more beneficial if your classes 0, 1 and 2 are very distinct. However, this means you would have to choose a more complex algorithm to fit all three. But the fit would possibly be nicer. Therefore, as already mentioned, you would really have to try both approaches and use a good metric to evaluate the performance (e.g. confusion matrices, F-score, etc..)
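To make "try both" concrete, here is a minimal sketch on synthetic data (all parameter choices hypothetical) that evaluates both approaches on the same 0-vs-non-0 question:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem; the actual goal is 0 vs non-0
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Approach 1: merge classes 1 and 2, then train a binary classifier
clf_bin = LogisticRegression(max_iter=1000).fit(X_tr, (y_tr != 0).astype(int))
pred_bin = clf_bin.predict(X_te)

# Approach 2: train a multi-class classifier, then map predictions to binary
clf_multi = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred_multi = (clf_multi.predict(X_te) != 0).astype(int)

y_true = (y_te != 0).astype(int)
print('binary F1:              ', f1_score(y_true, pred_bin))
print('multi-class-then-map F1:', f1_score(y_true, pred_multi))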
I hope this is somewhat helpful.
I am participating in the Kaggle San Francisco Crime competition and I am currently trying a number of different classifiers to benchmark performance. I am using LogisticRegression from sklearn, without any parameter tuning, and I noticed from sklearn.metrics.classification_report that it is only predicting the predominant classes, i.e. the classes which have the highest number of occurrences in my training set.
Intuition tells me that this comes down to parameter tuning, but I am not sure which parameters I have to tweak in order to make the classifier more aware of the less predominant classes (LogisticRegression has quite a few). At the moment it is predicting only 3 classes out of 38 or so, so it definitely needs improvement.
Any ideas?
If your model is classifying only the predominant classes, then you are facing a class-imbalance problem. Here are some good reads on tackling this in machine learning.
Logistic regression is a binary classifier and uses the one-vs-rest or one-vs-one technique for multiclass classification, which is not good if you have a higher number of output classes (38 in your case). Try other classifiers. For a start, use a softmax classifier, which is an extension of the logistic classifier with support for multi-class classification. In scikit-learn, set the multi_class parameter to 'multinomial' to use softmax regression.
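A minimal sketch of that suggestion (X_train and y_train are assumed to be your prepared data; note that the multi_class parameter is deprecated in recent scikit-learn versions, where the multinomial loss is already the default for multi-class problems):
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(
    multi_class='multinomial',  # softmax instead of one-vs-rest
    solver='lbfgs',             # a solver that supports the multinomial loss
    class_weight='balanced',    # reweight classes inversely to their frequency
    max_iter=1000,
)
clf.fit(X_train, y_train)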
Another way to improve your model is to use grid search for parameter tuning.
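For example, with scikit-learn's GridSearchCV (the parameter grid here is just a hypothetical starting point):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.01, 0.1, 1, 10], 'class_weight': [None, 'balanced']}
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,
    scoring='f1_macro',  # macro-averaged F1 weights rare classes equally
)
search.fit(X_train, y_train)
print(search.best_params_)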
On a side note, I would recommend trying some other models as well.