Homogeneous vs heterogeneous ensembles - machine-learning

I would like to check with you whether my understanding of ensemble learning (homogeneous vs. heterogeneous) is correct.
Is the following statement correct?
A homogeneous ensemble is a set of classifiers of the same type built upon different data, such as a random forest, whereas a heterogeneous ensemble is a set of classifiers of different types built upon the same data.
If it's not correct, could you please clarify this point?

A homogeneous ensemble consists of members built with a single type of base learning algorithm. Popular methods such as bagging and boosting generate diversity by sampling from, or assigning weights to, the training examples, but generally use a single type of base classifier to build the ensemble.
A heterogeneous ensemble, on the other hand, consists of members built with different base learning algorithms, such as an SVM, an ANN and a decision tree. A popular heterogeneous ensemble method is stacking, which is similar to boosting.
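For illustration, a minimal stacking sketch using scikit-learn's StackingClassifier might look like this (the choice of base learners and the Iris data are just an example, not part of the original answer):
# Heterogeneous base learners whose predictions are combined by a meta-learner
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
stack = StackingClassifier(
    estimators=[('svm', SVC()), ('dt', DecisionTreeClassifier())],
    final_estimator=LogisticRegression(max_iter=1000))  # meta-learner trained on the base-learner outputs
stack.fit(X, y)
print(stack.score(X, y))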
EDIT:
Homogeneous ensemble methods use the same feature selection method with different training data, distributing the dataset over several nodes, while heterogeneous ensemble methods use different feature selection methods with the same training data.

Heterogeneous Ensembles (HEE) use different fine-tuned algorithms. They usually work well if we have a small number of estimators. Note that the number of algorithms should always be odd (3+) in order to avoid ties. For example, we could combine a decision tree, an SVM and a logistic regression using a voting mechanism to improve the results, and then use the combined wisdom through a majority vote to classify a given sample. Besides voting, we can also use averaging or stacking to aggregate the results of the models. The data for each model is the same.
Homogeneous Ensembles (HOE), such as bagging, work by applying the same algorithm to all the estimators. These algorithms should not be fine-tuned -> they should be weak! In contrast to HEE, we use a large number of estimators. Note that the datasets for this model should be sampled separately in order to guarantee independence. Furthermore, the datasets should be different for each model. This allows us to be more precise when aggregating the results of each model. Bagging reduces variance because the sampling is truly random. By using the ensemble itself, we reduce the risk of over-fitting and we create a robust model. Unfortunately, bagging is computationally expensive.
EDIT: Here is an example in code.
Heterogeneous Ensemble Function:
# Imports (scikit-learn)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier

# Instantiate the individual models
clf_knn = KNeighborsClassifier(n_neighbors=5)
clf_decision_tree = DecisionTreeClassifier()
clf_logistic_regression = LogisticRegression()

# Create the voting classifier (a hard majority vote by default)
clf_voting = VotingClassifier(
    estimators=[
        ('knn', clf_knn),
        ('dt', clf_decision_tree),
        ('lr', clf_logistic_regression)])

# Fit it to the training set and predict
clf_voting.fit(X_train, y_train)
y_pred = clf_voting.predict(X_test)
Homogeneous Ensemble Function:
# Imports (scikit-learn)
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Instantiate the base estimator, which is a weak model (we set max depth to 3)
clf_decision_tree = DecisionTreeClassifier(max_depth=3)

# Build the Bagging classifier with 5 estimators (we use 5 decision trees)
clf_bag = BaggingClassifier(
    estimator=clf_decision_tree,   # named base_estimator in scikit-learn < 1.2
    n_estimators=5
)

# Fit the Bagging model to the training set
clf_bag.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf_bag.predict(X_test)
Conclusion: in summary, yes, what you say is correct.

Related

How does RandomForestClassifier work for classification?

I have learned that Sklearn treats multi-class classification problems as a collection of binary problems. Quoting the Sklearn user guide:
In extending a binary metric to multiclass or multilabel problems, the data is treated as a collection of binary problems, one for each class.
So, binary classification models like LogisticRegression or support vector machines can support multi-class cases by using either the One-vs-One or the One-vs-Rest strategy. I wanted to know whether that is the case for RandomForestClassifier too. How about other classifiers in Sklearn - are they all used as binary classifiers under the hood when dealing with a multi-class problem?
According to the documentation for Decision Trees, multi-output problems add a small change to the leaves of each tree in a random forest.
Suppose you have set criterion='gini'. In essence, each node is built by picking a subset of max_features features, calculating the average reduction in the gini impurity for all N classes and choosing the variable-threshold combination that reduces it most.
This means that random forests do not create one model for each class. Instead, a single model simultaneously reduces the criterion metric for all classes at each node of every tree and predicts the most common class at each leaf.
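A quick way to see this on the Iris data (a small check, just for illustration and not part of the original answer):
# A single RandomForestClassifier fits all three Iris classes at once
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.n_classes_)             # 3 -- one model covering all classes
print(clf.predict_proba(X[:2]))   # one probability per class for each sample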

Use categorical data as feature/target without encoding it

I recently found a model that classifies the Iris flower based on the size of its leaves. There are 3 types of flowers as the target (dependent variable). As far as I know, categorical data should be encoded so that it can be used in machine learning. However, in this model the data is used directly, without an encoding step.
Can anyone help to explain when to use encoding? Thank you in advance!
Relevant question - encoding of continuous feature variables.
Originally, the Iris data were published by Fisher when he published his linear discriminant classifier.
Generally, a distinction is made between:
Real-valued feature classifiers
Discrete feature classifiers
Linear discriminant analysis and quadratic discriminant analysis are real-valued feature classifiers. Trying to add discrete variables as extra inputs does not work. Special procedures have been developed for working with indicator variables (the term used in statistics) in discriminant analysis. The k-nearest-neighbour classifier also really only works well with real-valued feature variables.
The naive Bayes classifier is most commonly used for classification problems with discrete features. When you don't want to assume conditional independence between the feature variables, the multinomial classifier can be applied to discrete features. A classifier service that does all this for you in one go is insight classifiers.
Neural networks and support vector machines can combine real-valued and discrete features. My advice is to use one separate input node for each discrete outcome - don't use a single input node fed with values like (0: small, 1: minor, 2: medium, 3: larger, 4: big). A one-input-node-per-outcome encoding will improve your training result and yield better test-set performance.
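A minimal sketch of that one-input-node-per-outcome encoding, using scikit-learn's OneHotEncoder on the example categories above:
# One 0/1 column (input node) per discrete outcome instead of a single ordinal column
import numpy as np
from sklearn.preprocessing import OneHotEncoder

sizes = np.array([['small'], ['minor'], ['medium'], ['larger'], ['big']])
encoder = OneHotEncoder(sparse_output=False)   # use sparse=False in scikit-learn < 1.2
encoded = encoder.fit_transform(sizes)
print(encoder.categories_)   # the discrete outcomes
print(encoded)               # one indicator column per outcome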
The random forest classifier also combines real-valued and discrete features seamlessly.
My final advice is to train, and compare on a test set, at least 4 different types of classifiers, as there is no such thing as a universally best type of classifier.
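A small sketch of such a comparison (the four classifier types and the Iris split are only an illustration):
# Compare several classifier types on the same train/test split
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for model in [LogisticRegression(max_iter=1000), GaussianNB(),
              KNeighborsClassifier(), RandomForestClassifier()]:
    model.fit(X_train, y_train)
    print(type(model).__name__, model.score(X_test, y_test))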

Linear SVM is used for linearly separating the data which have two features

Can we use KNN and a linear SVM classifier for training a model on data which contains 4 features and has 6 classes? I ask because I think that a linear SVM and KNN are only used for linearly separating data which has two features and a binary classification target.
This is possible; you just need to use a One-vs-Rest (one-vs-all) wrapper, like this one: https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
Essentially you will train 6 classifiers, one per class, each of which seeks to separate one class from all the rest.
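A minimal sketch of what that looks like in code, using synthetic data with 4 features and 6 classes purely for illustration:
# One-vs-Rest: train 6 binary LinearSVC models, one per class
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.rand(300, 4)               # 4 features
y = rng.randint(0, 6, size=300)    # 6 classes
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(len(ovr.estimators_))        # 6 -- one binary classifier per class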

Train multi-class classifier for binary classification

Suppose a dataset contains multiple categories, e.g. 0-class, 1-class and 2-class, and the goal is to divide new samples into 0-class or non-0-class.
One can
combine the 1- and 2-classes into a unified non-0-class and train a binary classifier,
or train a multi-class classifier and use it to do binary classification.
How do these two approaches compare in performance?
I think more categories will produce a more accurate discriminant surface; however, the weights of the 1- and 2-classes are each lower than that of a combined non-0-class, resulting in fewer samples being judged as non-0-class.
Short answer: You would have to try both and see.
Why?: It would really depend on your data and the algorithm you use (just like for many other machine learning questions..)
For many classification algorithms (e.g. SVM, logistic regression), even if you want to do multi-class classification, a one-vs-all scheme is used under the hood, which means class 1 and class 2 end up being treated as the same class when separating out class 0. Therefore, there is no point running a multi-class scenario if you just need to separate out the 0-class.
For algorithms such as Neural Networks, where having multiple output classes is more natural, I think training a multi-class classifier might be more beneficial if your classes 0, 1 and 2 are very distinct. However, this means you would have to choose a more complex algorithm to fit all three. But the fit would possibly be nicer. Therefore, as already mentioned, you would really have to try both approaches and use a good metric to evaluate the performance (e.g. confusion matrices, F-score, etc..)
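A rough sketch of trying both approaches (assuming scikit-learn, integer numpy labels 0/1/2 and an existing X_train/X_test/y_train/y_test split; logistic regression is only a placeholder model):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Approach 1: merge classes 1 and 2 into a single non-0 class and train a binary model
y_train_bin = (y_train != 0).astype(int)
y_test_bin = (y_test != 0).astype(int)
clf_bin = LogisticRegression(max_iter=1000).fit(X_train, y_train_bin)

# Approach 2: train on all three classes, then map the predictions to 0 vs non-0
clf_multi = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred_from_multi = (clf_multi.predict(X_test) != 0).astype(int)

# Compare both on the same binary metric
print(f1_score(y_test_bin, clf_bin.predict(X_test)))
print(f1_score(y_test_bin, pred_from_multi))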
I hope this is somewhat helpful.

Determine most important feature per class

Imagine a machine learning problem where you have 20 classes and about 7000 sparse boolean features.
I want to figure out what the 20 most distinctive features per class are. In other words, features that are used a lot in a specific class but are used rarely or not at all in the other classes.
What would be a good feature selection algorithm or heuristic that can do this?
When you train a logistic regression multi-class classifier, the trained model is a num_class x num_feature coefficient matrix whose [i, j] value is the weight of feature j in class i. The feature indices are the same as in your input feature matrix.
In scikit-learn you can access these parameters directly.
If you use scikit-learn classification algorithms, you'll be able to find the most important features per class as follows:
import numpy as np
from sklearn.linear_model import SGDClassifier

# loss='log' and n_iter in older scikit-learn; regul is your chosen regularization strength
clf = SGDClassifier(loss='log_loss', alpha=regul, penalty='l1', l1_ratio=0.9,
                    learning_rate='optimal', max_iter=10, shuffle=False, n_jobs=3, fit_intercept=True)
clf.fit(X_train, Y_train)
for i in range(clf.coef_.shape[0]):
    top20_indices = np.argsort(clf.coef_[i])[-20:]  # indices of the 20 largest weights for class i
    print(top20_indices)
clf.coef_ is the matrix containing the weight of each feature in each class, so clf.coef_[0][2] is the weight of the third feature for the first class.
If, when you build your feature matrix, you keep track of the index of each feature in a dictionary where dic[id] = feature_name, you'll be able to retrieve the names of the top features using that dictionary.
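For example (with dic being that index-to-name dictionary and top20_indices taken from the loop above):
# Translate the top-20 feature indices of a class into human-readable names
print([dic[j] for j in top20_indices])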
For more information refer to scikit-learn text classification example
Random Forest and Naive Bayes should be able to handle this for you. Given the sparsity, I'd go for Naive Bayes first. Random Forest would be better if you're looking for combinations of features.
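As a rough sketch of the Naive Bayes route (BernoulliNB is a reasonable fit for sparse boolean features; X_train and Y_train are the same assumed training data as above):
# Per-class feature scores from a Bernoulli Naive Bayes model
import numpy as np
from sklearn.naive_bayes import BernoulliNB

nb = BernoulliNB().fit(X_train, Y_train)
for i, cls in enumerate(nb.classes_):
    top20 = np.argsort(nb.feature_log_prob_[i])[-20:]  # 20 features with the highest log-probability in this class
    print(cls, top20)
To favour features that are distinctive to a class rather than merely frequent in it, you could subtract the mean log-probability of the other classes before sorting.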

Resources