There are many supervised classifier algorithms available in scikit-learn, but I couldn't find any information about their scalability on large datasets. I know that, for instance, support vector machines don't behave well with huge datasets, but what about the others?
Which supervised/semi-supervised classifier algorithms are most suitable for large datasets?
If you are specifically looking for classifiers in sklearn, you can have a look at this link: Scaling Strategies for large datasets.
Generally, these classifiers learn incrementally, processing your dataset in mini-batches. Here are some links for reference:
Incremental Learning links
Advanced ML lecture on Incremental Learning
ML on streaming data
Incremental Learning
Microsoft paper on Incremental Learning
You can have a look at these classifiers in sklearn for more info:
SGD Classifier
Passive Aggressive Classifier
Multinomial Naive Bayes Incremental Learning
Bernoulli Naive Bayes
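As a minimal sketch of what incremental learning looks like with one of the classifiers listed above, the snippet below feeds an SGDClassifier one mini-batch at a time through partial_fit. The synthetic data, the batch size and the variable names are assumptions made for illustration; in practice each batch would be read from disk or from a stream.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
# Synthetic stand-in for a dataset that would not fit in memory (assumption).
X = rng.randn(10_000, 20)
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
# partial_fit must be told every class label on the first call.
classes = np.unique(y)

batch_size = 1_000  # illustrative value
for start in range(0, X.shape[0], batch_size):
    stop = start + batch_size
    # Each call updates the existing model with one mini-batch only.
    clf.partial_fit(X[start:stop], y[start:stop], classes=classes)

accuracy = clf.score(X, y)
```

The same loop should work with the other estimators in the list, since they also expose partial_fit.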
If your data arrives as a stream, you can have a look at Apache Spark Streaming and then at MLlib in Apache Spark for more info.
You can also have a look at Feature Hasher for large scale feature hashing in sklearn.
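A small sketch of the feature hashing mentioned above, assuming the input is a list of dicts mapping feature names to values. The feature names and the number of output columns are made up for illustration.

```python
from sklearn.feature_extraction import FeatureHasher

# Hash arbitrary feature names into a fixed number of columns.
# No vocabulary is stored, so memory use does not grow with the data.
hasher = FeatureHasher(n_features=2**10, input_type="dict")

# Hypothetical records: feature name -> value.
records = [
    {"word=cat": 1, "word=dog": 2},
    {"word=bird": 1},
]

# transform is stateless, so no fit step is needed.
X_hashed = hasher.transform(records)  # SciPy sparse matrix
```

Because the hasher is stateless, it can be applied batch by batch and combined with an estimator trained through partial_fit.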
By huge datasets, do you mean something like the default "iris" dataset?
It depends on what you want to do with those algorithms, for example training and fitting.
I'll write down the imports I use for big datasets; they work fine for me.
# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn import datasets, svm
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, SGDRegressor
But of course you need to know what you want to do with them.
Here you can check everything you want to know about these and many more:
http://scikit-learn.org/stable/
I was watching a YouTube video to learn about Support Vector Machines (SVM).
In the video, he mentions that an SVM finds Support Vector Classifiers (SVC) for dividing the data as one step in their classifying process.
I have used LinearSVC from scikit-learn for classification, but I have a hard time understanding if the implementation of LinearSVC in scikit-learn is an SVM or an SVC, or if the description in the video is incorrect. I find contradicting descriptions on different sites.
The accepted answer to this question states that LinearSVC is not an SVM, but it does not say that it is an SVC either.
On the description page of LinearSVC it says "Linear Support Vector Classification", but under "See also" on this page, it says that LinearSVC is "Scalable Linear Support Vector Machine for classification implemented using liblinear".
From what I can understand, LinearSVC and SVC(kernel='linear') are not the same, but that is not the question.
Thanks!
In terms of Machine Learning concepts LinearSVC is both because:
SVM is a model/algorithm used to find a plane that splits the space of samples
this can be applied for both classification (SVC) and regression (SVR) - both SVC and SVR are kinds of SVMs
So, an SVC would be a kind of SVM and LinearSVC looks like a specific kind of SVC, although not extending a base SVC class in scikit-learn.
If you mean sklearn source code - the LinearSVC is in the svm module... so it's an SVM. It doesn't extend the SVC or BaseSVC classes but to me this is an implementation issue/detail and I'd rather think of it as an SVC.
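The implementation detail described above can be checked directly. This is only a sketch using standard introspection; the variable names are illustrative.

```python
from sklearn.base import is_classifier
from sklearn.svm import SVC, LinearSVC

linear_svc = LinearSVC()
kernel_svc = SVC(kernel="linear")

# Both estimators are classifiers as far as scikit-learn is concerned.
both_are_classifiers = is_classifier(linear_svc) and is_classifier(kernel_svc)

# LinearSVC does not inherit from SVC.
shares_svc_base = issubclass(LinearSVC, SVC)

# Both classes live under the sklearn.svm package.
same_package = (
    LinearSVC.__module__.startswith("sklearn.svm")
    and SVC.__module__.startswith("sklearn.svm")
)

print(both_are_classifiers, shares_svc_base, same_package)
```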
My question is regarding the Novelty detection algorithms - Isolation Forest and One Class SVM.
I have a training dataset (with 4-5 features) where all the sample points are inliers, and I need to classify any new data as an inlier or outlier and ingest it into another dataframe accordingly.
While trying to use Isolation Forest or One Class SVM, I have to input the contamination percentage (nu) during the training phase. However, as the training dataset doesn't have any contamination, do I need to add outliers to the training dataframe and set nu to that outlier fraction?
Also, while using Isolation Forest, I noticed that the outlier percentage changes every time I predict, even though I don't change the model. Is there a way to take care of this problem apart from switching to the Extended Isolation Forest algorithm?
Thanks in advance.
Regarding contamination for isolation forest,
If you are training for the normal instances (all inliers), you should put zero for contamination. If you don't specify this, contamination would be 0.1 (for version 0.2).
The following simple code shows this:
1- Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
rng = np.random.RandomState(42)
2- Generate a 2D dataset
X = 0.3 * rng.randn(1000, 2)
3- Train iForest model and predict the outliers
clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
y_pred_train = clf.predict(X)
4- Print # of anomalies
print(sum(y_pred_train==-1))
This would give you 0 anomalies. Now if you change the contamination to 0.15, the program specifies 150 anomalies out of the same dataset you already had (same because of RandomState(42)).
[References]:
1 Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation forest." Data Mining, 2008. ICDM'08. Eighth IEEE International Conference
2 Liu, Fei Tony, Ting, Kai Ming and Zhou, Zhi-Hua. "Isolation-based anomaly detection." ACM Transactions on Knowledge Discovery from Data (TKDD), (2012)
"Training with normal data(inliers) only".
This is against the nature of Isolation Forest. Training here is completely different from training a neural network. Because many people use these algorithms without clarifying what is going on, and write blog posts with only 20% of the necessary ML knowledge, we end up with questions like this.
clf = IsolationForest(random_state=rng, contamination=0)
clf.fit(X)
What does fit do here? Is it training? If yes, what is trained?
In Isolation Forest:
First, we build trees,
Then, we pass each data point through each tree,
Then, we calculate the average path that is required to isolate the point.
The shorter the path, the higher the anomaly score.
contamination determines your threshold. If it is 0, then what is your threshold?
Please read the original paper first to understand the logic behind it. Not every anomaly detection algorithm suits every occasion.
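To make the scoring and threshold steps above concrete, here is a minimal sketch that works with the raw anomaly scores instead of relying on contamination. The 1% quantile used as the threshold and the two test points are assumptions chosen for illustration, not a recommendation. Passing an integer random_state is one way to get the same trees, and therefore the same scores, on every run.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(1000, 2)          # training data, inliers only
X_new = np.array([[0.0, 0.0], [4.0, 4.0]])  # one typical point, one far away

# An integer seed rebuilds identical trees on every fit.
clf = IsolationForest(n_estimators=100, random_state=42).fit(X_train)

# score_samples: the lower the score, the shorter the average path,
# i.e. the more anomalous the point.
train_scores = clf.score_samples(X_train)
new_scores = clf.score_samples(X_new)

# Hypothetical threshold: the 1st percentile of the training scores.
threshold = np.quantile(train_scores, 0.01)
is_outlier = new_scores < threshold

# Refitting with the same seed reproduces the same scores.
clf_again = IsolationForest(n_estimators=100, random_state=42).fit(X_train)
reproducible = np.allclose(new_scores, clf_again.score_samples(X_new))
```

How strict the threshold should be depends on how many false alarms are acceptable for the application.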
I'm using XGBoost, RandomForest (sklearn), SVM (sklearn) and MLPClassifier (sklearn) as classifiers.
I want to set these models up for multi-label classification.
How can I do that?
import xgboost as xgb
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
xgb.XGBClassifier()
SVC()
MLPClassifier()
RandomForestClassifier()
None of the algorithms you've mentioned is restricted to binary classification problems. They can be used for multiclass problems the same way as for binary classification, by calling model.fit(x_train, y_train).
I think you don't need to do anything extra for XGBoost, Random Forest and MLP. For SVC you can use OneVsRestClassifier(LinearSVC()). Then you just have to train with the algorithms you mentioned and tune them based on the predictors to get the best results.
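If the targets are multi-label in the strict sense (several labels per sample, encoded as a binary indicator matrix), a sketch along these lines may help. The synthetic dataset and all parameter values are assumptions. RandomForestClassifier accepts the indicator matrix directly, while LinearSVC is wrapped in OneVsRestClassifier as suggested above.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Y is a (n_samples, n_labels) 0/1 matrix; a row can contain several 1s.
X, Y = make_multilabel_classification(
    n_samples=200, n_features=10, n_classes=3, random_state=0
)

# Random forests handle a 2D indicator target natively.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, Y)

# One binary LinearSVC is trained per label column.
ovr_svc = OneVsRestClassifier(LinearSVC(random_state=0)).fit(X, Y)

rf_pred = rf.predict(X)        # one column per label
svc_pred = ovr_svc.predict(X)  # one column per label
```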
I am wondering if there is a best practice for exponentially weighting the training samples of a random forest by time (putting more weight on more recent samples). One way I can think of is to sample the full dataset with replacement according to the time-based weights. Are there any other methods I should consider? It would be great if anyone knows some Python packages that could help me accomplish this goal. Any help is much appreciated!
The sklearn implementation of Random Forests lets you specify sample weights in the fit function.
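import numpy as np
from sklearn.ensemble import RandomForestClassifier
# fill sample_weights with the desired weighting (one weight per training sample)
sample_weights = np.ones(y.shape)
# instantiate the estimator before fitting
estimator = RandomForestClassifier()
estimator.fit(X, y, sample_weight=sample_weights)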
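For the exponential time weighting asked about in the question, one possible sketch is to derive each weight from the sample's age with a half-life. The synthetic data, the age_days array and the 90-day half-life are all hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.randn(500, 4)
y = (X[:, 0] > 0).astype(int)
# Hypothetical: age of each sample in days (0 = most recent).
age_days = rng.randint(0, 365, size=500)

# Exponential decay: the weight halves every half_life days.
half_life = 90.0  # illustrative value, to be tuned
sample_weight = 0.5 ** (age_days / half_life)

estimator = RandomForestClassifier(n_estimators=100, random_state=0)
estimator.fit(X, y, sample_weight=sample_weight)
```

A shorter half-life makes the forest focus more strongly on recent samples, so it is reasonable to treat it as a hyperparameter and validate it on the most recent data.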
I have trained and tested the data using two classifiers, Naive Bayes and SMO. Now I need to combine them using stacking. I need to know how to perform stacking and which classifiers I should use at the base level and at the meta level.
It sounds like what you want is ensemble learning rather than stacking. In an ensemble, you would use both classifiers to make decisions and combine those decisions.
Stacking is a process where the output of one level of classifiers is used as input for the next level. That is, the predictions of some classifiers are the features for other classifiers. For this, you would need to retrain one of the models with the outputs of the first classifier as input.
Which classifier should be used where depends on your specific application. Similarly, how to do it depends on what system you've used to train these classifiers.
To choose the base-level classifiers for stacking, consider diverse classifiers that can potentially learn on a subset of features or a subset of data. For example, your base-level classifiers could be K-NN, Random Forest and Naive Bayes. For the meta-level classifier, we would like to choose a classifier that learns well from the base-level predictions as features. A good candidate is Logistic Regression.
Using mlxtend library as an example, we have:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          meta_classifier=lr)
For examples on stacked classifiers and regressors see mlxtend documentation page.
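As an alternative to mlxtend, recent scikit-learn versions (0.22 and later) ship their own StackingClassifier. The sketch below swaps it in with the same base-level and meta-level classifiers. The iris dataset and the cross-validation settings are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Base-level classifiers, given as (name, estimator) pairs.
base_learners = [
    ("knn", KNeighborsClassifier(n_neighbors=1)),
    ("rf", RandomForestClassifier(random_state=1)),
    ("gnb", GaussianNB()),
]

# The meta-level classifier is trained on out-of-fold predictions
# of the base learners (cv=5 inside the stack).
stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)

# Outer cross-validation to estimate the accuracy of the whole stack.
scores = cross_val_score(stack, X, y, cv=3)
print(scores.mean())
```

Training the meta-level classifier on out-of-fold predictions keeps it from simply trusting base learners that have memorised the training set, such as 1-NN.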