How to set up multi-class classification with machine learning algorithms?

I'm using XGBoost, RandomForest (sklearn), SVM (sklearn) and MLPClassifier (sklearn) as classifiers,
and I want to set these models up for multi-class problems.
How can I do that?
import xgboost as xgb
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
xgb.XGBClassifier()
SVC()
MLPClassifier()
RandomForestClassifier()

None of the algorithms you've mentioned is restricted to binary classification. They can be used for multi-class problems the same way as for binary classification, by calling model.fit(x_train, y_train).

You don't need to do anything extra for XGBoost, Random Forest and MLP. For SVC you can use OneVsRestClassifier(LinearSVC()). Then you just have to train with the algorithms you mentioned and tune them on your predictors to get the best results.
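To make this concrete, here is a minimal sketch (the make_classification toy data and the max_iter setting are placeholders for your own data and tuning); XGBoost, Random Forest and MLP handle multiple classes natively, while the SVM is wrapped in a one-vs-rest scheme as suggested above:
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
# toy data with three classes, only for illustration
X_train, y_train = make_classification(n_samples=500, n_classes=3,
                                        n_informative=5, random_state=0)
models = {
    "xgb": xgb.XGBClassifier(),              # multiclass handled natively
    "rf": RandomForestClassifier(),          # multiclass handled natively
    "mlp": MLPClassifier(max_iter=500),      # multiclass handled natively
    "svm": OneVsRestClassifier(LinearSVC()), # one binary linear SVM per class
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.predict(X_train[:5]))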

Related

What happens when we apply .fit() method to a kNN model in Scikit-learn if kNN has no training phase?

Since kNN keeps the whole dataset in memory and has no explicit training process, what exactly happens when a kNN model is fitted? I thought this step was related to training the model. Thank you.
Here is the error I get if I skip the fitting step.
NotFittedError: This KNeighborsClassifier instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.
Sample Code:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
f=r"aug_train.csv"
df=pd.read_csv(f)
X=df[:90000][["training_hours", "city_development_index"]].values
y=df[:90000]["target"].values
X_train, X_test, y_train, y_test=train_test_split(X,y)
knn=KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)
yhat=knn.predict(X_test)
print(yhat)
Unlike most other machine learning algorithms, kNN doesn't optimize a cost function; instead it memorizes the training data. When a prediction is made, kNN compares the input with the stored training data, and the class label of the data point with the greatest similarity to the queried input is returned as the prediction. Hence when we fit a kNN model it simply stores the dataset in memory.
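To make the "fit just stores the data" point concrete, here is a minimal sketch of a lazy 1-nearest-neighbour classifier (an illustration, not scikit-learn's actual implementation):
import numpy as np
class LazyKNN:
    def fit(self, X, y):
        # "training" is nothing more than keeping a reference to the data
        self.X_ = np.asarray(X)
        self.y_ = np.asarray(y)
        return self
    def predict(self, X):
        # all the work happens at prediction time: find the closest
        # stored point for each query and return its label
        preds = []
        for row in np.asarray(X):
            distances = np.linalg.norm(self.X_ - row, axis=1)
            preds.append(self.y_[np.argmin(distances)])
        return np.array(preds)
Calling predict before fit would fail because self.X_ does not exist yet, which is exactly the situation scikit-learn's NotFittedError guards against.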

Is sklearn LinearSVC an SVM or SVC?

I was watching a YouTube video to learn about Support Vector Machines (SVM).
In the video, he mentions that an SVM finds Support Vector Classifiers (SVC) for dividing the data as one step in its classification process.
I have used LinearSVC from scikit-learn for classification, but I have a hard time understanding if the implementation of LinearSVC in scikit-learn is an SVM or an SVC, or if the description in the video is incorrect. I find contradicting descriptions on different sites.
The accepted answer in this question states that LinearSVC is not an SVM, but it does not say that it is an SVC either.
On the description page of LinearSVC it says "Linear Support Vector Classification", but under "See also" on this page, it says that LinearSVC is "Scalable Linear Support Vector Machine for classification implemented using liblinear".
From what I can understand, LinearSVC and SVC(kernel='linear') are not the same, but that is not the question.
Thanks!
In terms of Machine Learning concepts LinearSVC is both because:
SVM is a model/algorithm used to find a plane that splits the space of samples
this can be applied for both classification (SVC) and regression (SVR) - both SVC and SVR are kinds of SVMs
So, an SVC would be a kind of SVM and LinearSVC looks like a specific kind of SVC, although not extending a base SVC class in scikit-learn.
If you mean the sklearn source code: LinearSVC lives in the svm module, so in that sense it's an SVM. It doesn't extend the SVC or BaseSVC classes, but to me this is an implementation detail and I'd still think of it as an SVC.
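You can verify this yourself by inspecting the class hierarchy (the exact base classes are an implementation detail and may change between scikit-learn versions):
from sklearn.svm import SVC, LinearSVC
print(issubclass(LinearSVC, SVC))               # False: LinearSVC does not extend SVC
print([c.__name__ for c in SVC.__mro__])        # includes BaseSVC (libsvm-based)
print([c.__name__ for c in LinearSVC.__mro__])  # no BaseSVC; it is built on liblinear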

Time Weighted Samples when using Random Forest

I am wondering if there is a best practice for exponentially weighting the training samples for random forest by time (putting more weight on more recent samples)? One way I can think of is to sample the full dataset with replacement according to the weights given by time. Are there any other methods I should consider? It would be great if anyone knows some Python packages that could help me accomplish this goal. Any help is much appreciated!
The sklearn implementation of Random Forests lets you specify sample weights via the sample_weight argument of the fit function.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# fill sample_weights with the desired weighting (all ones = unweighted)
sample_weights = np.ones(y.shape)
estimator = RandomForestClassifier()
estimator.fit(X, y, sample_weight=sample_weights)
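For the exponential time weighting asked about, one possible sketch (assuming X and y are ordered from oldest to newest sample, and halflife is a decay parameter you would tune) is:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
halflife = 100                             # assumed decay parameter, tune for your data
age = np.arange(len(y))[::-1]              # 0 for the newest sample, larger for older ones
sample_weights = 0.5 ** (age / halflife)   # weight halves every `halflife` samples
estimator = RandomForestClassifier()
estimator.fit(X, y, sample_weight=sample_weights)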

Which supervised classifiers in scikit-learn are recommended for large datasets?

There are many supervised classifier algorithms available in scikit-learn, but I couldn't find any information about their scalability on large datasets. I know that, for instance, support vector machines don't behave well with huge datasets, but what about the others?
Which supervised/semi-supervised classifier algorithms are most suitable for large datasets?
If you are specifically looking for classifiers in sklearn, you can have a look at this link : Scaling Strategies for large datasets.
Generally, these classifiers do incremental learning on your dataset by processing it in mini-batches (a short sketch follows at the end of this answer). Here are some links for reference:
Incremental Learning links
Advanced ML lecture on Incremental Learning
ML on streaming data
Incremental Learning
Microsoft paper on Incremental Learning
You can have a look at these classifiers in SKlearn for more info
SGD Classifier
Passive Aggressive Classifier
Multinomial Naive Bayes Incremental Learning
Bernoulli Naive Bayes
If your data arrives as a stream, you can have a look at Apache Spark Streaming and then MLlib in Apache Spark for more info.
You can also have a look at Feature Hasher for large scale feature hashing in sklearn.
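As a concrete illustration of the mini-batch idea, estimators such as SGDClassifier expose a partial_fit method; in this sketch the file name, chunk size and "target" column are placeholders for your own data:
import pandas as pd
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier()
classes = [0, 1]  # partial_fit needs the full set of labels up front
# stream the data in chunks instead of loading it all into memory
for chunk in pd.read_csv("big_dataset.csv", chunksize=10_000):
    X_batch = chunk.drop(columns=["target"]).values
    y_batch = chunk["target"].values
    clf.partial_fit(X_batch, y_batch, classes=classes)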
By huge datasets, do you mean something like the default "iris" dataset? It depends on what you want to do with those algorithms, for example training and fitting.
I am going to write down the imports I use for big datasets, which work fine for me.
from sklearn.model_selection import train_test_split
from sklearn import datasets, svm
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor
But of course you need to know what you want to do with them.
Here you can check everything you want to know about these and many more:
http://scikit-learn.org/stable/

Stacking in data mining

I have trained and tested the data using two classifiers, Naive Bayes and SMO. Now I need to combine them using stacking. I need to know how I can perform stacking and what my base-level and meta-level classifiers should be.
It sounds like what you want is ensemble learning rather than stacking. In an ensemble, you would use both classifiers to make decisions and combine those decisions.
Stacking is a process where the output of one level of classifiers is used as input for the next level. That is, the predictions of some classifiers are the features for other classifiers. For this, you would need to retrain one of the models with the outputs of the first classifier as input.
Which classifier should be used where depends on your specific application. Similarly, how to do it depends on what system you've used to train these classifiers.
To choose the base-level classifiers for stacking consider diverse classifiers that can potentially learn on a subset of features or a subset of data. For example, your base-level classifiers could be K-NN, Random Forest and Naive Bayes. For meta-level classifier we would like to choose a classifier that will learn well based on base-level predictions as features. A good candidate is Logistic Regression.
Using mlxtend library as an example, we have:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from mlxtend.classifier import StackingClassifier
clf1 = KNeighborsClassifier(n_neighbors=1)
clf2 = RandomForestClassifier(random_state=1)
clf3 = GaussianNB()
lr = LogisticRegression()
sclf = StackingClassifier(classifiers=[clf1, clf2, clf3],
                          meta_classifier=lr)
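Training and evaluating the stack then works like any other scikit-learn estimator; the cross-validation below (on the iris dataset, used here only as demo data) is one way to compare the stack against its base models:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
for clf, label in zip([clf1, clf2, clf3, sclf],
                      ["KNN", "Random Forest", "Naive Bayes", "Stacking"]):
    scores = cross_val_score(clf, X, y, cv=3, scoring="accuracy")
    print("%0.2f accuracy (%s)" % (scores.mean(), label))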
For examples on stacked classifiers and regressors see mlxtend documentation page.
