Logistic Regression sklearn with categorical Output - machine-learning

I have to train a model with logistic regression in sklearn. Everywhere I look, the outcome has to be binary, but my labels are good, bad, or normal. I have 12 features and I don't know how to deal with three labels. I am very thankful for every answer.

You can use multinomial logistic regression.
In Python, you can modify your LogisticRegression call as:
LogisticRegression(multi_class='multinomial').fit(X_train,y_train)
See the LogisticRegression documentation in scikit-learn for more details.
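As a minimal, self-contained sketch (the random data below is only a stand-in for your 12 features and the good/bad/normal labels), scikit-learn handles the three string labels directly:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in data: 12 features, three string labels (replace with your own X_train, y_train).
rng = np.random.RandomState(0)
X_train = rng.randn(30, 12)
y_train = np.array(['good', 'bad', 'normal'] * 10)

clf = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
clf.fit(X_train, y_train)
print(clf.classes_)              # the three label names found in y_train
print(clf.predict(X_train[:3]))  # predictions come back as the original string labels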

This is called one-vs-all (one-vs-rest) classification, or more generally multi-class classification.
From sklearn.linear_model.LogisticRegression:
In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)
Code example:
# Authors: Tom Dupre la Tour <tom.dupre-la-tour#m4x.org>
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# make 3-class dataset for classification
centers = [[-5, 0], [0, 1.5], [5, -1]]
X, y = make_blobs(n_samples=1000, centers=centers, random_state=40)
transformation = [[0.4, 0.2], [-0.4, 1.2]]
X = np.dot(X, transformation)
for multi_class in ('multinomial', 'ovr'):
    clf = LogisticRegression(solver='sag', max_iter=100, random_state=42,
                             multi_class=multi_class).fit(X, y)
    # print the training scores
    print("training score : %.3f (%s)" % (clf.score(X, y), multi_class))
See the full code example: Plot multinomial and One-vs-Rest Logistic Regression

Related

PCA on data and training with SVM with K-fold CV and Gridsearch

I need to train an SVM model using LinearSVC and 10-fold cross-validation with an internal 2-fold grid search to optimize gamma and C. But I also have to apply PCA to my data to reduce its size.
Should I apply PCA before the loop where the CV and training of the model happen, or within it?
In the latter case I would have different numbers of principal components for each fold, but is there a disadvantage to that?
The best solution would be to create a sklearn Pipeline and put both steps (PCA and LinearSVC) within it. This creates an object that implements fit() and predict() and that can be used within a GridSearchCV.
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
pipe = Pipeline([('pca', PCA()),
                 ('clf', LinearSVC())])
params = {
    'pca__n_components': [2, 5, 10, 15],
    'clf__C': [0.5, 1, 5, 10],
}
gs = GridSearchCV(estimator=pipe, param_grid=params)
gs.fit(X_train, y_train)
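If you also want the 10-fold outer cross-validation with the internal 2-fold grid search from the question, one common pattern (a sketch, assuming X and y hold your full training data and reusing the pipe and params defined above) is to wrap the grid search in cross_val_score, so the PCA is re-fit inside every training fold:
from sklearn.model_selection import GridSearchCV, cross_val_score

inner_gs = GridSearchCV(estimator=pipe, param_grid=params, cv=2)  # inner 2-fold grid search
outer_scores = cross_val_score(inner_gs, X, y, cv=10)             # outer 10-fold evaluation
print(outer_scores.mean(), outer_scores.std())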

Inspection of trees in a Quantile Random Forest Regression model

I am interested in training a random forest to learn some conditional quantile on some data {X, y} sampled independently from some distribution.
That is, for some $\alpha \in (0, 1)$, a mapping $\hat{q}_{\alpha}(x) \in [0, 1]$ such that for each $x$, $\hat{q}_{\alpha}(x) = \operatorname{argmin}_{q} \{\, q : P(y < q \mid X = x) > \alpha \,\}$.
Is there any clear way to build a random forest effectively in python that could yield such a model?
Additionally, I have one added requirement that may be possible with the current libraries, though I am unsure. Requirement: I would like to select a subset of points, A, from my training set and, when making predictions, exclude those trees that were trained with points in A from my random forest.
There is a Python-based, scikit-learn compatible/compliant Quantile Regression Forest implementation that can be used to estimate conditional quantiles here: https://github.com/zillow/quantile-forest
Your additional requirement of making predictions on training samples by excluding trees that included those samples during training is called out-of-bag (OOB) estimation, and can also be done with the above package.
Setup should be as easy as:
pip install quantile-forest
Then, here's an example of how to fit a quantile random forest model and use it to predict quantiles with OOB estimation for a subset (here the first 100 rows) of the training data:
import numpy as np
from quantile_forest import RandomForestQuantileRegressor
from sklearn import datasets
from sklearn.model_selection import train_test_split
X, y = datasets.fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
qrf = RandomForestQuantileRegressor()
qrf.fit(X_train, y_train)
# Predict OOB quantiles for first 100 training samples.
y_pred_oob = qrf.predict(
    X_train[:100, :],
    quantiles=[0.025, 0.5, 0.975],
    oob_score=True,
    indices=np.arange(100),
)
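With quantiles=[0.025, 0.5, 0.975], y_pred_oob should be an array with one row per sample and one column per requested quantile (lower bound, median, upper bound), and each prediction is computed only from trees that did not see that sample during training, which satisfies the tree-exclusion requirement for points in A.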

SVM duality: set of hyperparameters not supported

I am trying to train an SVM model on the Iris dataset. The aim is to classify Iris virginica flowers against the other types of flowers. Here is the code:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
X = iris["data"][:, (2,3)] # petal length, petal width
y = (iris["target"]==2).astype(np.float64) # Iris virginica
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", dual=False))
])
svm_clf.fit(X, y)
My book, Aurelien Geron's "Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow" (2nd edition), says at page 156:
For better performance, you should set the dual hyperparameter to
False, unless there are more features than training instances
But if I set the dual hyperparameter to False, I get the following error:
ValueError: Unsupported set of arguments: The combination of penalty='l2' and loss='hinge' are not supported when dual=False, Parameters: penalty='l2', loss='hinge', dual=False
It instead works if I set the dual hyperparameter to True.
Why is this set of hyperparameters not supported?
An L2-regularized SVM with L1 loss (hinge) cannot be solved in the primal form; only its dual form can be solved efficiently. This is due to a limitation of the LIBLINEAR library used by sklearn. If you want to solve the primal form of the L2-regularized SVM, you will have to use the L2 loss (squared hinge) instead.
LinearSVC(C=1, loss='squared_hinge', dual=False).fit(X,y)
For more details: Link 1
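A minimal sketch of the question's pipeline with that change applied (reusing the imports and the X, y from the question), so dual=False can be kept:
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="squared_hinge", dual=False))
])
svm_clf.fit(X, y)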

How can I import Elastic-Net, Lasso and Ridge regression in Pyspark?

Could you please tell me how I can use Elastic-Net, Lasso and Ridge regression in Pyspark? I chose Linear, Elastic-Net, Lasso and Ridge regression (these 4 algorithms) according to a machine learning cheatsheet. However, I don't know how to import Elastic-Net, Lasso and Ridge regression in Pyspark and cannot google the right answers. I only know how to use Linear Regression in Pyspark.
Have a look at https://spark.apache.org/docs/1.5.2/ml-linear-methods.html
You can use something like:
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils
# Load training data
training = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
lr = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.8)
# Fit the model
lrModel = lr.fit(training)
# Print the weights and intercept for linear regression
print("Weights: " + str(lrModel.weights))
print("Intercept: " + str(lrModel.intercept))
If you read into the setup, you'll find that:
By setting α properly, elastic net contains both L1 and L2 regularization as special cases. For example, if a linear regression model is trained with the elastic net parameter α set to 1, it is equivalent to a Lasso model. On the other hand, if α is set to 0, the trained model reduces to a ridge regression model.
Where:
elasticNetParam corresponds to α and regParam corresponds to λ.
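As a sketch of that parameterization (reusing the training DataFrame from the snippet above), Lasso and Ridge are just special settings of elasticNetParam:
from pyspark.ml.regression import LinearRegression

# Lasso: pure L1 regularization (alpha = 1.0)
lasso = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=1.0)
lasso_model = lasso.fit(training)

# Ridge: pure L2 regularization (alpha = 0.0)
ridge = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.0)
ridge_model = ridge.fit(training)

# Elastic-Net: a mix of L1 and L2 (0 < alpha < 1)
enet = LinearRegression(maxIter=10, regParam=0.3, elasticNetParam=0.5)
enet_model = enet.fit(training)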

scikit-learn calculate F1 in multilabel classification

I am trying to calculate macro-F1 with scikit-learn in multi-label classification
from sklearn.metrics import f1_score
y_true = [[1,2,3]]
y_pred = [[1,2,3]]
print(f1_score(y_true, y_pred, average='macro'))
However it fails with error message
ValueError: multiclass-multioutput is not supported
How can I calculate macro-F1 with multi-label classification?
In the current scikit-learn release, your code results in the following warning:
DeprecationWarning: Direct support for sequence of sequences multilabel
representation will be unavailable from version 0.17. Use
sklearn.preprocessing.MultiLabelBinarizer to convert to a label
indicator representation.
Following this advice, you can use sklearn.preprocessing.MultiLabelBinarizer to convert this multilabel class to a form accepted by f1_score. For example:
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import f1_score
y_true = [[1,2,3]]
y_pred = [[1,2,3]]
m = MultiLabelBinarizer().fit(y_true)
f1_score(m.transform(y_true),
         m.transform(y_pred),
         average='macro')
# 1.0
