How to extract coefficients from fitted pipeline for penalized logistic regression? - machine-learning

I have a set of training data that consists of X, which is a set of n columns of data (features), and Y, which is one column of target variable.
I am trying to train my model with logistic regression using the following pipeline:
pipeline = sklearn.pipeline.Pipeline([
('logistic_regression', LogisticRegression(penalty = 'none', C = 10))
])
My goal is to obtain the values of each of the n coefficients corresponding to the features, under the assumption of a linear model (y = coeff_0 + coeff_1*x1 + ... + coeff_n*xn).
What I tried was to train this pipeline on my data with model = pipeline.fit(X, Y). So I think that I now have the model that contains the coefficients that I want. However, I don't know how to access them. I'm looking for something like model.best_params_('logistic_regression').
Does anyone know how to extract the fitted coefficients from a model like this?

Have a look at the scikit-learn documentation for Pipeline; this example is inspired by it:
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.pipeline import Pipeline
# generate some data to play with
X, y = make_classification(n_informative=5, n_redundant=0, random_state=42)
# ANOVA SVM-C
anova_filter = SelectKBest(f_regression, k=5)
clf = svm.SVC(kernel='linear')
anova_svm = Pipeline([('anova', anova_filter), ('svc', clf)])
anova_svm.set_params(anova__k=10, svc__C=.1).fit(X, y)
# access coefficients
print(anova_svm['svc'].coef_)
The coef_ attribute of the fitted estimator does the job. best_params_ is usually associated with GridSearchCV, i.e. hyperparameter optimization.
In your specific case, try model['logistic_regression'].coef_ (and model['logistic_regression'].intercept_ for coeff_0).

Example of getting the coefficients from a pipeline:
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
X, y = load_iris(return_X_y=True)
pipeline = Pipeline([('lr', LogisticRegression(penalty = 'l2',
C = 10))])
pipeline.fit(X, y)
pipeline['lr'].coef_
array([[-0.42923513, 2.08235619, -4.28084811, -1.97174699],
[ 1.06321671, -0.08077595, -0.46911772, -2.3221883 ],
[-0.63398158, -2.00158024, 4.74996583, 4.29393529]])
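To make the array above easier to read, the coefficients can be paired with feature and class names. This is a minimal sketch, assuming the same iris data loaded as a DataFrame; the name coef_table and the max_iter value are illustrative choices, not part of the original answer.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Load iris as a DataFrame so the feature names are available.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# max_iter is raised only to avoid convergence warnings (an assumption).
pipeline = Pipeline([('lr', LogisticRegression(C=10, max_iter=1000))])
pipeline.fit(X, y)

# named_steps gives access to the fitted estimator by its step name.
lr = pipeline.named_steps['lr']

# One row per class, one column per feature, plus the intercept.
coef_table = pd.DataFrame(
    lr.coef_,
    index=iris.target_names[lr.classes_],
    columns=X.columns,
)
coef_table['intercept'] = lr.intercept_
print(coef_table)
```

pipeline['lr'] and pipeline.named_steps['lr'] return the same fitted object, so either form works.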

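Here is how to visualize the coefficients and measure model accuracy. I used baby weight, height, and gestation period to predict preterm birth.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# df is a pandas DataFrame with the columns bwt_lbs, height_ft, gestation_wks and PreTerm
pipeline = Pipeline([('lr', LogisticRegression(penalty='l2',C=10))])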
scaler=StandardScaler()
#X=np.array(df['gestation_wks']).reshape(-1,1)
X=scaler.fit_transform(df[['bwt_lbs','height_ft','gestation_wks']])
y=np.array(df['PreTerm'])
X_train,X_test, y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=123)
pipeline.fit(X_train,y_train)
y_pred_prob=pipeline.predict_proba(X_test)
predictions=pipeline.predict(X_test)
print(predictions)
sns.countplot(x=predictions, orient='h')
plt.show()
#print(predictions[:,0])
print(pipeline['lr'].coef_)
print(pipeline['lr'].intercept_)
print('Coefficients close to zero will contribute little to the end result')
num_err = np.sum(y != pipeline.predict(X))
print("Number of errors:", num_err)
def my_loss(y,w):
    s = 0
    for i in range(y.size):
        # Get the true and predicted target values for example 'i'
        y_i_true = y[i]
        y_i_pred = w[i]
        s = s + (y_i_true - y_i_pred)**2
    return s
print("Loss:",my_loss(y_test,predictions))
fpr, tpr, threshholds = roc_curve(y_test,y_pred_prob[:,1])
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()
accuracy=round(pipeline['lr'].score(X_train, y_train) * 100, 2)
print("Model Accuracy={accuracy}".format(accuracy=accuracy))
cm=confusion_matrix(y_test,predictions)
print(cm)
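To connect the printed coef_ and intercept_ back to the linear model in the question (coeff_0 + coeff_1*x1 + ... + coeff_n*xn), the sketch below recomputes the predicted probability by hand for a binary problem. It assumes synthetic data from make_classification and places the scaler inside the pipeline; neither is taken from the answer above. Note that the coefficients then refer to the standardized features, not the raw ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary data (an assumption; any binary dataset would do).
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

pipeline = Pipeline([('scale', StandardScaler()), ('lr', LogisticRegression())])
pipeline.fit(X, y)

lr = pipeline['lr']
# The logistic regression step sees the scaled features, so apply the same scaling.
X_scaled = pipeline['scale'].transform(X)

# Linear part: intercept + sum_j coef_j * x_j, then the logistic function.
z = lr.intercept_ + X_scaled @ lr.coef_.ravel()
manual_prob = 1.0 / (1.0 + np.exp(-z))

# Should match the probability of the positive class from the pipeline.
matches = np.allclose(manual_prob, pipeline.predict_proba(X)[:, 1])
print(matches)
```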

Related

Unexpected behaviour (inflated results on random-data) in scikit-learn with nested cross-validation

When trying to train/evaluate a support vector machine in scikit-learn, I am experiencing some unexpected behaviour, and I am wondering whether I am doing something wrong or whether this is a possible bug.
In a very specific subset of circumstances, nested cross-validation using GridSearchCV and an SVM produces inflated predictive results, even with randomly generated data.
For instance, see this code:
from sklearn import svm
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, LeaveOneOut
from sklearn.metrics import roc_auc_score, brier_score_loss
from tqdm import tqdm
import pandas as pd
N = 20
N_FEATURES = 50
param_grid = {'C': [1e-5, 1e-3, 1, 1e3, 1e5]}
scores = []
for z in tqdm(range(100)):
    X = np.random.uniform(size=(N, N_FEATURES))
    y = np.random.binomial(1, 0.5, size=N)
    if z < 10:
        y = np.array([0, 1] * int(N/2))
        y = np.random.permutation(y)
    for skf_outer in [StratifiedKFold(n_splits=5), LeaveOneOut()]:
        for skf_inner in [5, LeaveOneOut()]:
            for model in [svm.SVC(probability=True), LogisticRegression()]:
                y_pred, y_real = [], []
                for train_index, test_index in skf_outer.split(X, y):
                    X_train, X_test = X[train_index], X[test_index, :]
                    y_train, y_test = y[train_index], y[test_index]
                    clf = GridSearchCV(
                        model, param_grid, cv=skf_inner, n_jobs=-1, scoring='neg_brier_score'
                    )
                    clf.fit(X_train, y_train)
                    predictions = clf.predict_proba(X_test)[:, 1]
                    y_pred.extend(predictions)
                    y_real.extend(y_test)
                scores.append([str(skf_outer), str(skf_inner), str(model), np.mean(y), brier_score_loss(np.array(y_real), np.array(y_pred)), roc_auc_score(np.array(y_real), np.array(y_pred))])
df_scores = pd.DataFrame(scores)
df_scores.columns = ['skf_outer', 'skf_inner', 'model', 'y_label', 'brier', 'auc']
df_scores['y_0.5'] = df_scores['y_label'] == 0.5
df_scores = df_scores.groupby(['skf_outer', 'skf_inner', 'model', 'y_0.5']).mean()
print(df_scores)
When all of the following hold:
Both the inner and the outer loop of the CV use LeaveOneOut()
The SVM is used
The y labels are balanced (i.e. the mean of y is 0.5)
the predictions are much better than expected by random chance (AUC > 0.9, sometimes even 1; Brier of 0.15 or lower). I can replicate this when generating more samples, more features, etc.; the issue stays the same. Swapping the SVM for LogisticRegression (as shown in the analysis above) leads to the expected results (AUC 0.5, Brier of 0.25). For the other scenarios (no LOO-CV in either the inner or outer loop, or a different distribution of y labels), the results are as expected.
Can anyone replicate this? Am I missing something obvious?
I've replicated this with an older version of sklearn (0.24.0) and the newest one (1.2.0).
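One thing that may be worth ruling out (this is not a confirmed diagnosis of the behaviour above) is a known leave-one-out artifact with perfectly balanced labels: the class balance of each training fold depends on the held-out label, so pooled LOO predictions can be systematically related to the truth even when the features carry no signal. The sketch below shows the simplest case, where the "model" ignores the features and predicts the training-fold mean; the pooled AUC comes out far from 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)

# Perfectly balanced labels, as in the z < 10 branch of the question.
y = rng.permutation(np.array([0, 1] * 10))

preds = np.empty(len(y))
for train_idx, test_idx in LeaveOneOut().split(y.reshape(-1, 1)):
    # A "model" that ignores the features and predicts the training mean.
    # Holding out a 1 gives 9/19, holding out a 0 gives 10/19.
    preds[test_idx] = y[train_idx].mean()

# Every positive gets a lower score than every negative.
auc = roc_auc_score(y, preds)
print(auc)  # 0.0
```

Any step that flips the sign of such a score (or a classifier whose output is anti-correlated with the training class balance) would turn this into an AUC near 1, so pooled LOO predictions on balanced labels deserve some caution.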

sklearn GP return std dev is zero for predictions where it must be large

I am trying regression with the Gaussian process module of sklearn. The standard deviation of the predictions is zero where it should be large.
kernel = ConstantKernel() + 1.0 * DotProduct() ** 0.3 + 1.0 * WhiteKernel()
gpr = GaussianProcessRegressor(
kernel=kernel,
alpha=0.3,
normalize_y=True,
random_state=123,
n_restarts_optimizer=0
)
gpr.fit(X_train, y_train)
Here I have shown samples from the posterior after training the model. They clearly show that the standard deviation increases along the x-axis.
This is the output I got. As the value increases along the x-axis, the standard deviation should increase, whereas the plot shows a standard deviation of zero.
The actual results should look something like this.
Is it a bug?
Full Code to reproduce the issue.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, WhiteKernel, DotProduct
df = pd.read_csv('train.csv')
# Use .iloc for positional column access; df[:, 0] is not valid on a DataFrame
X_train = df.iloc[:, 0].to_numpy().reshape(-1,1)
y_train = df.iloc[:, 1].to_numpy()
X_pred = np.linspace(0.01, 8.5, 1000).reshape(-1,1)
# Instantiate a Gaussian Process model
kernel = ConstantKernel() + 1.0 * DotProduct() ** 0.3 + 1.0 * WhiteKernel()
gpr = GaussianProcessRegressor(
kernel=kernel,
alpha=0.3,
normalize_y=True,
random_state=123,
n_restarts_optimizer=0
)
gpr.fit(X_train, y_train)
print(
f"Kernel parameters before fit:\n{kernel} \n"
f"Kernel parameters after fit: \n{gpr.kernel_} \n"
f"Log-likelihood: {gpr.log_marginal_likelihood(gpr.kernel_.theta):.3f} \n"
f"Score = {gpr.score(X_train,y_train)}"
)
n_samples = 10
y_samples = gpr.sample_y(X_pred, n_samples)
for idx, single_prior in enumerate(y_samples.T):
    plt.plot(
        X_pred,
        single_prior,
        linestyle="--",
        alpha=0.7,
        label=f"Sampled function #{idx + 1}",
    )
plt.title('Sample from posterior distribution')
plt.show()
y_pred, sigma = gpr.predict(X_pred, return_std=True)
plt.figure(figsize=(10,6))
plt.plot(X_train, y_train, 'r.', markersize=3, label='Observations')
plt.plot(X_pred, y_pred, 'b-', label='Prediction',)
plt.fill_between(X_pred[:,0], y_pred-1*sigma, y_pred+1*sigma,
alpha=.4, fc='b', ec='None', label='68% confidence interval')
plt.fill_between(X_pred[:,0], y_pred-2*sigma, y_pred+2*sigma,
alpha=.3, fc='b', ec='None', label='95% confidence interval')
plt.fill_between(X_pred[:,0], y_pred-3*sigma, y_pred+3*sigma,
alpha=.1, fc='b', ec='None', label='99% confidence interval')
plt.legend()
plt.show()
Not really an answer, but something to look out for that might help. I had the same problem and got different results after changing alpha, changing some kernel parameters, or normalizing the data.
It was probably a matter of scale (with big numbers, the standard deviation is too small in proportion to be visible).
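A quick way to check the scale hypothesis is to print the range of the returned standard deviation instead of relying on the plot. The following is a rough sketch under stated assumptions: the data is synthetic (train.csv from the question is not available), the inputs are standardized by hand, and the kernel is swapped for RBF plus WhiteKernel for simplicity, so it is not the question's model.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: large targets whose noise grows with x.
X_train = rng.uniform(0.01, 8.5, size=(60, 1))
y_train = 1000.0 * X_train[:, 0] + rng.normal(scale=50.0 * X_train[:, 0])
X_pred = np.linspace(0.01, 8.5, 50).reshape(-1, 1)

# Standardize the inputs; normalize_y=True takes care of the targets.
x_scaler = StandardScaler().fit(X_train)

kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gpr.fit(x_scaler.transform(X_train), y_train)

y_pred, sigma = gpr.predict(x_scaler.transform(X_pred), return_std=True)

# Compare the size of sigma with the size of the predictions.
print(gpr.kernel_)
print("sigma range:", sigma.min(), sigma.max())
print("prediction range:", y_pred.min(), y_pred.max())
```

If sigma is nonzero but tiny relative to the predictions, the band is simply invisible at the plot's scale. Note also that a WhiteKernel models a single noise level for all inputs, so this sketch checks scale only; it does not model noise that grows with x.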

How can I plot a confusion matrix for image dataset from directory?

I've built up my own neural model, trained it, and got 99.58% accuracy. But I am facing a problem with plotting the confusion matrix. There are some examples available for flow_from_directory but no examples exist for image_dataset_from_directory. Can anyone help me?
See the post How to plot confusion matrix for prefetched dataset in Tensorflow using
true_categories = tf.concat([y for x, y in val_ds], axis=0)
to get the true labels for the validation set. Then you can plot the confusion matrix with something like this
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
# predicted_id holds the predicted class indices for the same samples, in the same order
cm = confusion_matrix(true_categories, predicted_id)
fig = plt.figure(figsize = (8,8))
ax1 = fig.add_subplot(1,1,1)
sns.set(font_scale=1.4) #for label size
sns.heatmap(cm, annot=True, annot_kws={"size": 12},
cbar = False, cmap='Purples');
ax1.set_ylabel('True Values',fontsize=14)
ax1.set_xlabel('Predicted Values',fontsize=14)
plt.show()
Here is the code I wrote to assemble the inputs for the confusion matrix.
Note:
validation_dataset is a tf.data.Dataset variable.
I used validation_dataset = tf.keras.preprocessing.image_dataset_from_directory()
import tensorflow as tf
y_true = []
y_pred = []
# Collect labels and predictions in the same pass so their order matches
for x,y in validation_dataset:
    # Labels are assumed to be one-hot encoded (label_mode='categorical')
    y= tf.argmax(y,axis=1)
    y_true.append(y)
    y_pred.append(tf.argmax(model.predict(x),axis = 1))
y_pred = tf.concat(y_pred, axis=0)
y_true = tf.concat(y_true, axis=0)
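The batch-wise accumulation above can be sketched without TensorFlow to show how the collected labels feed into the confusion matrix. Everything here is a stand-in: batches plays the role of the dataset, and fake_predict is a hypothetical replacement for model.predict that returns random scores.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
n_classes = 3

# Stand-in for the dataset: 4 batches of (features, one-hot labels).
batches = []
for _ in range(4):
    labels = rng.integers(0, n_classes, size=8)
    batches.append((rng.normal(size=(8, 5)), np.eye(n_classes)[labels]))

def fake_predict(x):
    """Hypothetical stand-in for model.predict: random class scores."""
    return rng.random(size=(x.shape[0], n_classes))

y_true, y_pred = [], []
for x, y in batches:
    # Take labels and predictions from the same batch so they stay aligned.
    y_true.append(np.argmax(y, axis=1))
    y_pred.append(np.argmax(fake_predict(x), axis=1))

y_true = np.concatenate(y_true)
y_pred = np.concatenate(y_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=np.arange(n_classes))
print(cm)
```

Collecting both lists in one loop matters when the dataset reshuffles on every iteration, because two separate passes would then pair labels with predictions from different samples.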

How can I distribute a SVC to different workers(on other computers) using dask

I have a scheduler running on my PC and I want to train 10 instances of an SVC on different worker computers. I fiddled around but could not find a solution.
I am assuming that you want to train those 10 SVCs with different hyperparameters and find the best one (i.e. hyperparameter optimization, which you can do with GridSearchCV). I am also assuming that you are using scikit-learn.
Usually you would train the SVC with code like this:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
# Loading the Digits dataset
digits = datasets.load_digits()
# To apply an classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target
# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=0)
# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
'C': [1, 10, 100, 1000]},
{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
scores = ['precision', 'recall']
for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
However, this trains the candidate models sequentially, one at a time.
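Before distributing anything, it may help to know that scikit-learn's own GridSearchCV can already run the candidate fits in parallel on a single machine through its n_jobs parameter. The sketch below is a cut-down variant of the code above, with a smaller grid and fewer folds chosen only to keep it quick; it does not distribute work to other computers.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Same digits data as above, flattened to (samples, features).
digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Reduced grid (an assumption, to keep the run short).
param_grid = {'kernel': ['linear'], 'C': [1, 10]}

# n_jobs=-1 asks joblib to use all local CPU cores for the fits.
clf = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)
clf.fit(X_train, y_train)

print(clf.best_params_)
test_score = clf.score(X_test, y_test)
print(test_score)
```

This covers the single-machine case only; spreading the fits over several worker computers still needs a distributed scheduler.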
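If you install dask-searchcv (its functionality has since moved into the dask-ml package as dask_ml.model_selection.GridSearchCV), you can use a drop-in replacement for grid search: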
conda install dask-searchcv -c conda-forge
Replacing
from sklearn.model_selection import GridSearchCV
by
from dask_searchcv import GridSearchCV
should be sufficient.
However, in your case, you don't want to use the threaded scheduler but the distributed scheduler. Hence, you have to add the following code at the beginning:
# Distribute grid-search across a cluster
from dask.distributed import Client
scheduler_address = '127.0.0.1:8786'
client = Client(scheduler_address)
The final code should look like this (not tested):
from sklearn import datasets
from sklearn.model_selection import train_test_split
from dask_searchcv import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
# Distribute grid-search across a cluster
from dask.distributed import Client
scheduler_address = '127.0.0.1:8786'
client = Client(scheduler_address)
# Loading the Digits dataset
digits = datasets.load_digits()
# To apply an classifier on this data, we need to flatten the image, to
# turn the data in a (samples, feature) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target
# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=0)
# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
'C': [1, 10, 100, 1000]},
{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
scores = ['precision', 'recall']
for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

StandardScaler with make_pipeline

If I use make_pipeline, do I still need to call fit and transform myself to fit my model and transform the data, or will it perform these steps itself?
Also, does StandardScaler perform normalization as well, or only scaling?
Explaining the code: I want to apply PCA and then apply normalization followed by an SVM.
from sklearn.decomposition import PCA
pca = PCA(n_components=4).fit(X)
X = pca.transform(X)
# training a linear SVM classifier 5-fold
from sklearn import preprocessing
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
clf = make_pipeline(preprocessing.StandardScaler(), SVC(kernel = 'linear'))
scores = cross_val_score(clf, X, y, cv=5)
I am also a bit confused about what happens if I don't call fit in the code below:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
clf = SVC(kernel = 'linear', C = 1)
scores = cross_val_score(clf, X, y, cv=5)
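StandardScaler does both centering and scaling: it subtracts each feature's mean and divides by its standard deviation (often called standardization or z-score normalization). It does not rescale features to a fixed range such as [0, 1]; that is what MinMaxScaler does.
cross_val_score() fits the estimator on each training fold for you (for a pipeline, this includes fitting and applying its transformers), so you don't need to call fit or transform explicitly.
A more common approach is to put all steps (StandardScaler, PCA, SVC) in one pipeline and use GridSearchCV to tune the hyperparameters and choose the best estimator.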
Demo:
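import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
# X_train, X_test, y_train, y_test are assumed to come from an earlier train_test_split
pipe = Pipeline([
('scale', StandardScaler()),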
('reduce_dims', PCA(n_components=4)),
('clf', SVC(kernel = 'linear', C = 1))
])
param_grid = dict(reduce_dims__n_components=[4,6,8],
clf__C=np.logspace(-4, 1, 6),
clf__kernel=['rbf','linear'])
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))
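The two points made in the answer (what StandardScaler does, and that cross_val_score handles fitting) can be checked directly. This is a small sketch on the iris data, which is an assumption made only to have something runnable.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# After StandardScaler, each column has mean 0 and standard deviation 1.
X_std = StandardScaler().fit_transform(X)
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)
print(col_means.round(6), col_stds.round(6))

# cross_val_score fits a fresh clone of the pipeline on each training fold,
# so no explicit fit or transform call is needed beforehand.
clf = make_pipeline(StandardScaler(), SVC(kernel='linear'))
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())

# The original pipeline object is left unfitted, because only clones were fitted.
still_unfitted = not hasattr(clf[-1], 'support_')
print(still_unfitted)
```

Because only clones are fitted, call clf.fit(X, y) yourself afterwards if you need a fitted model for prediction.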
