Stacking in data mining - machine-learning

I have trained and tested my data with two classifiers, Naive Bayes and SMO. Now I need to combine them using stacking. I would like to know how to perform stacking and what my base-level and meta-level classifiers should be.

It sounds like what you want is ensemble learning rather than stacking. In an ensemble, you would use both classifiers to make decisions and then combine those decisions.
Stacking is a process where the output of one level of classifiers is used as input for the next level: the predictions of some classifiers become the features for other classifiers. For this, you would need to retrain one of the models with the outputs of the first classifier as its input.
Which classifier should be used where depends on your specific application. Similarly, how to do it depends on the system you used to train these classifiers.
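To make the idea above concrete, here is a rough, hand-rolled sketch of stacking in scikit-learn. It is only an illustration: synthetic binary data stands in for your own, and SVC stands in for Weka's SMO (both train an SVM). The out-of-fold predictions of the base models become the features of the meta-model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

# Synthetic binary data as a stand-in for your own features and labels
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

base_models = [GaussianNB(), SVC(probability=True, random_state=0)]

# Out-of-fold predicted probabilities of the base models become the
# meta-level features, so the meta-model never sees leaked labels.
meta_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit the base models on all training data to build the test-time features.
meta_test = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for m in base_models
])

meta_model = LogisticRegression().fit(meta_train, y_train)
print("stacked accuracy:", meta_model.score(meta_test, y_test))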

To choose the base-level classifiers for stacking, consider diverse classifiers that can potentially learn from a subset of the features or a subset of the data. For example, your base-level classifiers could be K-NN, Random Forest and Naive Bayes. For the meta-level classifier, choose one that learns well from the base-level predictions used as features; a good candidate is logistic regression.
Using the mlxtend library as an example:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier

# Base-level classifiers
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()

# Meta-level classifier trained on the base-level predictions
lr = LogisticRegression()

sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          meta_classifier=lr)
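Continuing from the snippet above, a minimal way to fit and score the stacked model (here the iris data stands in for your own X and y):
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy of the stacked ensemble
scores = cross_val_score(sclf, X, y, cv=5, scoring="accuracy")
print("stacking accuracy: %0.3f (+/- %0.3f)" % (scores.mean(), scores.std()))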
For examples of stacked classifiers and regressors, see the mlxtend documentation page.

Related

How does RandomForestClassifier work for classification?

I have learned that Sklearn treats multi-class classification problems as a collection of binary problems. Quoting the Sklearn user guide:
In extending a binary metric to multiclass or multilabel problems, the data is treated as a collection of binary problems, one for each class.
So, binary classification models like LogisticRegression or support vector machines can support multi-class cases by using either One-vs-One or One-vs-Rest strategies. I wanted to know if that is also the case for RandomForestClassifier. How about other classifiers in sklearn - are they all used as binary classifiers under the hood when dealing with a multi-class problem?
According to the scikit-learn documentation on decision trees, multi-output problems only require a small change to the leaves of each tree in a random forest.
Suppose you have set criterion='gini'. In essence, each node is built by picking a subset of max_features features, calculating the average reduction in the Gini impurity over all N classes, and choosing the feature-threshold combination that reduces it the most.
This means that random forests do not create one model per class. Instead, a single model simultaneously reduces the criterion metric for all classes at each node of every tree and predicts the most common class at each leaf.
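A quick sketch on the iris data illustrates this: the forest is fit once on all three classes, and predict_proba returns one column per class from that single model.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)              # three classes
clf = RandomForestClassifier(random_state=0).fit(X, y)

print(clf.classes_)                     # [0 1 2] -- a single multi-class model
print(clf.predict_proba(X[:2]).shape)   # (2, 3): one probability column per class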

One Class SVM and Isolation Forest for novelty detection

My question is regarding the Novelty detection algorithms - Isolation Forest and One Class SVM.
I have a training dataset (with 4-5 features) where all the sample points are inliers, and I need to classify any new data as an inlier or an outlier and ingest it into another dataframe accordingly.
While trying to use Isolation Forest or One-Class SVM, I have to input the contamination percentage (nu) during the training phase. However, as the training dataset doesn't have any contamination, do I need to add outliers to the training dataframe and set that outlier fraction as nu?
Also, while using Isolation Forest, I noticed that the outlier percentage changes every time I predict, even though I don't change the model. Is there a way to deal with this problem apart from moving to the Extended Isolation Forest algorithm?
Thanks in advance.
Regarding contamination for Isolation Forest:
If you are training on normal instances only (all inliers), you should set contamination to zero. If you don't specify it, contamination defaults to 0.1 (as of scikit-learn 0.20).
The following is a simple example showing this:
1- Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
2- Generate a 2D dataset
X = 0.3 * rng.randn(1000, 2)
3- Train iForest model and predict the outliers
clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
y_pred_train = clf.predict(X)
4- Print # of anomalies
print(sum(y_pred_train==-1))
This would give you 0 anomalies. Now, if you change the contamination to 0.15, the model flags 150 points of the same dataset as anomalies (the same dataset because of RandomState(42)).
References:
[1] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation Forest." ICDM 2008: Eighth IEEE International Conference on Data Mining.
[2] Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-Based Anomaly Detection." ACM Transactions on Knowledge Discovery from Data (TKDD), 2012.
"Training with normal data(inliers) only".
This is against the nature of Isolation Forest. The training is here completely different than training in the Neural Networks. Because everyone is using these without clarifying what is going on, and writing blogs with 20% of ML knowledge, we are having questions like this.
clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
What does fit do here? Is it training? If yes, what is trained?
In Isolation Forest:
First, we build the trees.
Then, we pass each data point through each tree.
Then, we calculate the average path length required to isolate the point.
The shorter the path, the higher the anomaly score.
contamination determines your threshold. If it is 0, then what is your threshold?
Please read the original paper first to understand the logic behind it. Not every anomaly detection algorithm suits every occasion.
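To make the threshold point concrete, here is a small sketch on synthetic data showing that contamination does not change the learned anomaly scores, only where the decision threshold (offset_) is placed:
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = 0.3 * rng.randn(1000, 2)

clf = IsolationForest(random_state=0, contamination=0.15).fit(X)

scores = clf.score_samples(X)          # anomaly scores; do not depend on contamination
print(clf.offset_)                     # threshold placed at the 15th percentile of the scores
print(np.mean(scores < clf.offset_))   # roughly 0.15
print(np.mean(clf.predict(X) == -1))   # same fraction predicted as outliers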

How to set up multi-class classification with machine learning algorithms?

I'm using XGBoost, RandomForest (sklearn), SVM (sklearn) and MLPClassifier (sklearn) as classifiers, and I want to set these models up for multi-class classification.
How can I do that?
import xgboost as xgb
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
xgb.XGBClassifier()
SVC()
MLPClassifier()
RandomForestClassifier()
None of the algorithms you've mentioned are restricted to binary classification problems. They can be used for multi-class problems the same way as for binary classification, by calling model.fit(x_train, y_train).
You don't need to do anything extra for XGBoost, random forest and MLP. For SVC you can use OneVsRestClassifier(LinearSVC()). Then you just train with the algorithms you mentioned and tune them to get the best results.
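As a minimal sketch (on a synthetic three-class dataset standing in for your own), all four models can be trained on multi-class labels directly; only the SVM is wrapped in OneVsRestClassifier as suggested above:
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Synthetic 3-class data as a stand-in for your own
X, y = make_classification(n_samples=500, n_classes=3, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "xgboost": xgb.XGBClassifier(),                  # multi-class handled natively
    "random_forest": RandomForestClassifier(random_state=0),
    "mlp": MLPClassifier(max_iter=500, random_state=0),
    "svm_ovr": OneVsRestClassifier(LinearSVC()),     # one binary SVM per class
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))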

Time Weighted Samples when using Random Forest

I am wondering if there is a best practice for exponentially weighting the training samples for a random forest by time (putting more weight on more recent samples). One way I can think of is to resample the full dataset with replacement, with probabilities derived from the sample times. Are there any other methods I should consider? It would be great if anyone knows of Python packages that could help accomplish this. Any help is much appreciated!
The scikit-learn implementation of random forests lets you pass sample weights to the fit method.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# fill sample_weights with the desired weighting
sample_weights = np.ones(y.shape)

estimator = RandomForestClassifier()
estimator.fit(X, y, sample_weight=sample_weights)
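For the exponential time weighting the question asks about, one option is to fill sample_weights with an exponential decay over sample age. This is only a sketch: half_life is a made-up tuning parameter, the rows are assumed to be ordered oldest to newest, and synthetic data stands in for yours.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; rows assumed ordered oldest -> newest
X, y = make_classification(n_samples=1000, random_state=0)

half_life = 200                         # hypothetical: weight halves every 200 samples of age
age = np.arange(len(y))[::-1]           # newest sample has age 0
sample_weights = 0.5 ** (age / half_life)

clf = RandomForestClassifier(random_state=0)
clf.fit(X, y, sample_weight=sample_weights)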

Which supervised classifiers in scikit-learn are recommended for large datasets?

There are many supervised classification algorithms available in scikit-learn, but I couldn't find any information about their scalability on large datasets. I know that, for instance, support vector machines don't behave well with huge datasets, but what about the others?
Which supervised/semi-supervised classification algorithms are best suited to large datasets?
If you are specifically looking for classifiers in sklearn, have a look at this link: Scaling Strategies for large datasets.
Generally, these classifiers support incremental learning on your dataset by training on mini-batches (a minimal sketch follows the links below). Here are some links for reference:
Incremental Learning links
Advanced ML lecture on Incremental Learning
ML on streaming data
Incremental Learning
Microsoft paper on Incremental Learning
You can have a look at these classifiers in sklearn for more info:
SGD Classifier
Passive Aggressive Classifier
Multinomial Naive Bayes Incremental Learning
Bernoulli Naive Bayes
If your data arrives as a stream, you can have a look at Apache Spark Streaming and jump to MLlib in Apache Spark for more info.
You can also have a look at Feature Hasher for large scale feature hashing in sklearn.
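As a minimal sketch of the mini-batch idea mentioned above (using synthetic data and sklearn's SGDClassifier, which supports partial_fit):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a dataset too large to train on in one pass
X, y = make_classification(n_samples=100000, n_features=20, random_state=0)
classes = np.unique(y)                  # must be supplied on the first partial_fit call

clf = SGDClassifier(random_state=0)
batch_size = 10000
for start in range(0, len(y), batch_size):
    stop = start + batch_size
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

print("training accuracy:", clf.score(X, y))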
By huge datasets, do you mean something like the default "iris" dataset?
It depends on what you want to do with those algorithms, for example training and fitting.
These are the ones I use for big datasets, and they work fine:
from sklearn.model_selection import train_test_split
from sklearn import datasets, svm
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
But of course, you need to know what you want to do with them.
Here you can check everything you want to know about these and many more:
http://scikit-learn.org/stable/
