Including the training data in SHAP TreeExplainer gives different expected_value in scikit-learn GBT Regressor.
Reproducible example (run in Google Colab):
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
import shap
shap.__version__
# 0.37.0
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)
# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172
# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])
np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])
# explainer with training data
gbt_data_explainer = shap.TreeExplainer(model=gbt, data=X_train) # specifying feature_perturbation does not change the result
gbt_data_explainer.expected_value
# -23.564797322079635
So, the expected value when including the training data gbt_data_explainer.expected_value is quite different from the one calculated without supplying the data (gbt_explainer.expected_value).
Both approaches are additive and consistent when used with the (obviously different) respective shap_values:
np.abs(gbt_explainer.expected_value + gbt_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True
np.abs(gbt_data_explainer.expected_value + gbt_data_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True
but I wonder why they do not provide the same expected_value, and why gbt_data_explainer.expected_value is so different from the mean value of predictions.
What am I missing here?
Apparently shap subsets to 100 rows when data is passed, then runs those rows through the trees to reset the sample counts for each node. So the -23.5... being reported is the average model output for those 100 rows.
The data is passed to an Independent masker, which does the subsampling:
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_tree.py#L94
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_explainer.py#L68
https://github.com/slundberg/shap/blob/v0.37.0/shap/maskers/_tabular.py#L216
Running
from shap import maskers
another_gbt_explainer = shap.TreeExplainer(
gbt,
data=maskers.Independent(X_train, max_samples=800),
feature_perturbation="tree_path_dependent"
)
another_gbt_explainer.expected_value
gets back to
-11.534353657511172
Though #Ben did a great job in digging out how the data gets passed through Independent masker, his answer does not show exactly (1) how base values are calculated and where do we get the different base value from and (2) how to choose/lower the max_samples param
Where the different value comes from
The masker object has a data attribute that holds data after masking process. To get the value showed in gbt_explainer.expected_value:
from shap.maskers import Independent
gbt = GradientBoostingRegressor(random_state=0)
# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172
# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])
gbt_explainer = shap.TreeExplainer(gbt, Independent(X_train,100))
gbt_explainer.expected_value
# -23.56479732207963
one would need to do:
masker = Independent(X_train,100)
gbt.predict(masker.data).mean()
# -23.56479732207963
What about choosing max_samples?
Setting max_samples to the original dataset length seem to work with other explainers too:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import shap
from shap.maskers import Independent
from scipy.special import logit, expit
corpus,y = shap.datasets.imdb()
corpus_train, corpus_test, y_train, y_test = train_test_split(corpus, y, test_size=0.2, random_state=7)
vectorizer = TfidfVectorizer(min_df=10)
X_train = vectorizer.fit_transform(corpus_train)
model = sklearn.linear_model.LogisticRegression(penalty="l2", C=0.1)
model.fit(X_train, y_train)
explainer = shap.Explainer(model
,masker = Independent(X_train,100)
,feature_names=vectorizer.get_feature_names()
)
explainer.expected_value
# -0.18417413671991964
This value comes from:
masker=Independent(X_train,100)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[...,1])
# array([-0.18417414])
max_samples=100 seem to be a bit off for a true base_value (just feeding feature means):
logit(model.predict_proba(X_train.mean(0).reshape(1,-1))[:,1])
array([-0.02938039])
By increasing max_samples one might get reasonably close to true baseline, while keeping num of samples low:
masker = Independent(X_train,1000)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[:,1])
# -0.05957302658674238
So, to get base value for an explainer of interest (1) pass explainer.data (or masker.data) through your model and (2) choose max_samples so that base_value on sampled data is close enough to the true base value. You may also try to observe if the values and order of shap importances converge.
Some people may notice that to get to the base values sometimes we average feature inputs (LogisticRegression) and sometimes outputs (GBT)
I'm trying to build a simple regression model using keras and tensorflow. In my problem I have data in the form (x, y), where x and y are simply numbers. I'd like to build a keras model in order to predict y using x as an input.
Since I think images better explains thing, these are my data:
We may discuss if they are good or not, but in my problem I cannot really cheat them.
My keras model is the following (data are splitted 30% test (X_test, y_test) and 70% training (X_train, y_train)):
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(32, input_shape=() activation="relu", name="first_layer"))
model.add(tf.keras.layers.Dense(16, activation="relu", name="second_layer"))
model.add(tf.keras.layers.Dense(1, name="output_layer"))
model.compile(loss = "mean_squared_error", optimizer = "adam", metrics=["mse"] )
history = model.fit(X_train, y_train, epochs=500, batch_size=1, verbose=0, shuffle=False)
eval_result = model.evaluate(X_test, y_test)
print("\n\nTest loss:", eval_result, "\n")
predict_Y = model.predict(X)
note: X contains both X_test and X_train.
Plotting the prediction I get (blue squares are the prediction predict_Y)
I'm playing a lot with layers, activation funztions and other parameters. My goal is to find the best parameters to train the model, but the actual question, here, is slightly different: in fact I have hard times to force the model to overfit the data (as you can see from the above results).
Does anyone have some sort of idea about how to reproduce overfitting?
This is the outcome I would like to get:
(red dots are under blue squares!)
EDIT:
Here I provide you the data used in the example above: you can copy paste directly to a python interpreter:
X_train = [0.704619794270697, 0.6779457393024553, 0.8207082120250023, 0.8588819357831449, 0.8692320257603844, 0.6878750931810429, 0.9556331888763945, 0.77677964510883, 0.7211381534179618, 0.6438319113259414, 0.6478339581502052, 0.9710222750072649, 0.8952188423349681, 0.6303124926673513, 0.9640316662124185, 0.869691568491902, 0.8320164648420931, 0.8236399177660375, 0.8877334038470911, 0.8084042532069621, 0.8045680821762038]
y_train = [0.7766424210611557, 0.8210846773655833, 0.9996114311913593, 0.8041331063189883, 0.9980525368790883, 0.8164056182686034, 0.8925487603333683, 0.7758207470960685, 0.37345286573743475, 0.9325789202459493, 0.6060269037514895, 0.9319771743389491, 0.9990691225991941, 0.9320002808310418, 0.9992560731072977, 0.9980241561997089, 0.8882905258641204, 0.4678339275898943, 0.9312152374846061, 0.9542371205095945, 0.8885893668675711]
X_test = [0.9749191829308574, 0.8735366740730178, 0.8882783211709133, 0.8022891400991644, 0.8650601322313454, 0.8697902997857514, 1.0, 0.8165876695985228, 0.8923841531760973]
y_test = [0.975653685270635, 0.9096752789481569, 0.6653736469114154, 0.46367666660348744, 0.9991817903431941, 1.0, 0.9111205717076893, 0.5264993912088891, 0.9989199241685126]
X = [0.704619794270697, 0.77677964510883, 0.7211381534179618, 0.6478339581502052, 0.6779457393024553, 0.8588819357831449, 0.8045680821762038, 0.8320164648420931, 0.8650601322313454, 0.8697902997857514, 0.8236399177660375, 0.6878750931810429, 0.8923841531760973, 0.8692320257603844, 0.8877334038470911, 0.8735366740730178, 0.8207082120250023, 0.8022891400991644, 0.6303124926673513, 0.8084042532069621, 0.869691568491902, 0.9710222750072649, 0.9556331888763945, 0.8882783211709133, 0.8165876695985228, 0.6438319113259414, 0.8952188423349681, 0.9749191829308574, 1.0, 0.9640316662124185]
Y = [0.7766424210611557, 0.7758207470960685, 0.37345286573743475, 0.6060269037514895, 0.8210846773655833, 0.8041331063189883, 0.8885893668675711, 0.8882905258641204, 0.9991817903431941, 1.0, 0.4678339275898943, 0.8164056182686034, 0.9989199241685126, 0.9980525368790883, 0.9312152374846061, 0.9096752789481569, 0.9996114311913593, 0.46367666660348744, 0.9320002808310418, 0.9542371205095945, 0.9980241561997089, 0.9319771743389491, 0.8925487603333683, 0.6653736469114154, 0.5264993912088891, 0.9325789202459493, 0.9990691225991941, 0.975653685270635, 0.9111205717076893, 0.9992560731072977]
Where X contains the list of the x values and Y the corresponding y value. (X_test, y_test) and (X_train, y_train) are two (non overlapping) subset of (X, Y).
To predict and show the model results I simply use matplotlib (imported as plt):
predict_Y = model.predict(X)
plt.plot(X, Y, "ro", X, predict_Y, "bs")
plt.show()
Overfitted models are rarely useful in real life. It appears to me that OP is well aware of that but wants to see if NNs are indeed capable of fitting (bounded) arbitrary functions or not. On one hand, the input-output data in the example seems to obey no discernible pattern. On the other hand, both input and output are scalars in [0, 1] and there are only 21 data points in the training set.
Based on my experiments and results, we can indeed overfit as requested. See the image below.
Numerical results:
x y_true y_pred error
0 0.704620 0.776642 0.773753 -0.002889
1 0.677946 0.821085 0.819597 -0.001488
2 0.820708 0.999611 0.999813 0.000202
3 0.858882 0.804133 0.805160 0.001026
4 0.869232 0.998053 0.997862 -0.000190
5 0.687875 0.816406 0.814692 -0.001714
6 0.955633 0.892549 0.893117 0.000569
7 0.776780 0.775821 0.779289 0.003469
8 0.721138 0.373453 0.374007 0.000554
9 0.643832 0.932579 0.912565 -0.020014
10 0.647834 0.606027 0.607253 0.001226
11 0.971022 0.931977 0.931549 -0.000428
12 0.895219 0.999069 0.999051 -0.000018
13 0.630312 0.932000 0.930252 -0.001748
14 0.964032 0.999256 0.999204 -0.000052
15 0.869692 0.998024 0.997859 -0.000165
16 0.832016 0.888291 0.887883 -0.000407
17 0.823640 0.467834 0.460728 -0.007106
18 0.887733 0.931215 0.932790 0.001575
19 0.808404 0.954237 0.960282 0.006045
20 0.804568 0.888589 0.906829 0.018240
{'me': -0.00015776709314323828,
'mae': 0.00329163070145315,
'mse': 4.0713782563067185e-05,
'rmse': 0.006380735268216915}
OP's code seems good to me. My changes were minor:
Use deeper networks. It may not actually be necessary to use a depth of 30 layers but since we just want to overfit, I didn't experiment too much with what's the minimum depth needed.
Each Dense layer has 50 units. Again, this may be overkill.
Added batch normalization layer every 5th dense layer.
Decreased learning rate by half.
Ran optimization for longer using the all 21 training examples in a batch.
Used MAE as objective function. MSE is good but since we want to overfit, I want to penalize small errors the same way as large errors.
Random numbers are more important here because data appears to be arbitrary. Though, you should get similar results if you change random number seed and let the optimizer run long enough. In some cases, optimization does get stuck in a local minima and it would not produce overfitting (as requested by OP).
The code is below.
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense, BatchNormalization
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam
import matplotlib.pyplot as plt
# Set seed just to have reproducible results
np.random.seed(84)
tf.random.set_seed(84)
# Load data from the post
# https://stackoverflow.com/questions/61252785/how-to-overfit-data-with-keras
X_train = np.array([0.704619794270697, 0.6779457393024553, 0.8207082120250023,
0.8588819357831449, 0.8692320257603844, 0.6878750931810429,
0.9556331888763945, 0.77677964510883, 0.7211381534179618,
0.6438319113259414, 0.6478339581502052, 0.9710222750072649,
0.8952188423349681, 0.6303124926673513, 0.9640316662124185,
0.869691568491902, 0.8320164648420931, 0.8236399177660375,
0.8877334038470911, 0.8084042532069621,
0.8045680821762038])
Y_train = np.array([0.7766424210611557, 0.8210846773655833, 0.9996114311913593,
0.8041331063189883, 0.9980525368790883, 0.8164056182686034,
0.8925487603333683, 0.7758207470960685,
0.37345286573743475, 0.9325789202459493,
0.6060269037514895, 0.9319771743389491, 0.9990691225991941,
0.9320002808310418, 0.9992560731072977, 0.9980241561997089,
0.8882905258641204, 0.4678339275898943, 0.9312152374846061,
0.9542371205095945, 0.8885893668675711])
X_test = np.array([0.9749191829308574, 0.8735366740730178, 0.8882783211709133,
0.8022891400991644, 0.8650601322313454, 0.8697902997857514,
1.0, 0.8165876695985228, 0.8923841531760973])
Y_test = np.array([0.975653685270635, 0.9096752789481569, 0.6653736469114154,
0.46367666660348744, 0.9991817903431941, 1.0,
0.9111205717076893, 0.5264993912088891, 0.9989199241685126])
X = np.array([0.704619794270697, 0.77677964510883, 0.7211381534179618,
0.6478339581502052, 0.6779457393024553, 0.8588819357831449,
0.8045680821762038, 0.8320164648420931, 0.8650601322313454,
0.8697902997857514, 0.8236399177660375, 0.6878750931810429,
0.8923841531760973, 0.8692320257603844, 0.8877334038470911,
0.8735366740730178, 0.8207082120250023, 0.8022891400991644,
0.6303124926673513, 0.8084042532069621, 0.869691568491902,
0.9710222750072649, 0.9556331888763945, 0.8882783211709133,
0.8165876695985228, 0.6438319113259414, 0.8952188423349681,
0.9749191829308574, 1.0, 0.9640316662124185])
Y = np.array([0.7766424210611557, 0.7758207470960685, 0.37345286573743475,
0.6060269037514895, 0.8210846773655833, 0.8041331063189883,
0.8885893668675711, 0.8882905258641204, 0.9991817903431941, 1.0,
0.4678339275898943, 0.8164056182686034, 0.9989199241685126,
0.9980525368790883, 0.9312152374846061, 0.9096752789481569,
0.9996114311913593, 0.46367666660348744, 0.9320002808310418,
0.9542371205095945, 0.9980241561997089, 0.9319771743389491,
0.8925487603333683, 0.6653736469114154, 0.5264993912088891,
0.9325789202459493, 0.9990691225991941, 0.975653685270635,
0.9111205717076893, 0.9992560731072977])
# Reshape all data to be of the shape (batch_size, 1)
X_train = X_train.reshape((-1, 1))
Y_train = Y_train.reshape((-1, 1))
X_test = X_test.reshape((-1, 1))
Y_test = Y_test.reshape((-1, 1))
X = X.reshape((-1, 1))
Y = Y.reshape((-1, 1))
# Is data scaled? NNs do well with bounded data.
assert np.all(X_train >= 0) and np.all(X_train <= 1)
assert np.all(Y_train >= 0) and np.all(Y_train <= 1)
assert np.all(X_test >= 0) and np.all(X_test <= 1)
assert np.all(Y_test >= 0) and np.all(Y_test <= 1)
assert np.all(X >= 0) and np.all(X <= 1)
assert np.all(Y >= 0) and np.all(Y <= 1)
# Build a model with variable number of hidden layers.
# We will use Keras functional API.
# https://www.perfectlyrandom.org/2019/06/24/a-guide-to-keras-functional-api/
n_dense_layers = 30 # increase this to get more complicated models
# Define the layers first.
input_tensor = Input(shape=(1,), name='input')
layers = []
for i in range(n_dense_layers):
layers += [Dense(units=50, activation='relu', name=f'dense_layer_{i}')]
if (i > 0) & (i % 5 == 0):
# avg over batches not features
layers += [BatchNormalization(axis=1)]
sigmoid_layer = Dense(units=1, activation='sigmoid', name='sigmoid_layer')
# Connect the layers using Keras Functional API
mid_layer = input_tensor
for dense_layer in layers:
mid_layer = dense_layer(mid_layer)
output_tensor = sigmoid_layer(mid_layer)
model = Model(inputs=[input_tensor], outputs=[output_tensor])
optimizer = Adam(learning_rate=0.0005)
model.compile(optimizer=optimizer, loss='mae', metrics=['mae'])
model.fit(x=[X_train], y=[Y_train], epochs=40000, batch_size=21)
# Predict on various datasets
Y_train_pred = model.predict(X_train)
# Create a dataframe to inspect results manually
train_df = pd.DataFrame({
'x': X_train.reshape((-1)),
'y_true': Y_train.reshape((-1)),
'y_pred': Y_train_pred.reshape((-1))
})
train_df['error'] = train_df['y_pred'] - train_df['y_true']
print(train_df)
# A dictionary to store all the errors in one place.
train_errors = {
'me': np.mean(train_df['error']),
'mae': np.mean(np.abs(train_df['error'])),
'mse': np.mean(np.square(train_df['error'])),
'rmse': np.sqrt(np.mean(np.square(train_df['error']))),
}
print(train_errors)
# Make a plot to visualize true vs predicted
plt.figure(1)
plt.clf()
plt.plot(train_df['x'], train_df['y_true'], 'r.', label='y_true')
plt.plot(train_df['x'], train_df['y_pred'], 'bo', alpha=0.25, label='y_pred')
plt.grid(True)
plt.xlabel('x')
plt.ylabel('y')
plt.title(f'Train data. MSE={np.round(train_errors["mse"], 5)}.')
plt.legend()
plt.show(block=False)
plt.savefig('true_vs_pred.png')
A problem you may encountering is that you don't have enough training data for the model to be able to fit well. In your example, you only have 21 training instances, each with only 1 feature. Broadly speaking with neural network models, you need on the order of 10K or more training instances to produce a decent model.
Consider the following code that generates a noisy sine wave and tries to train a densely-connected feed-forward neural network to fit the data. My model has two linear layers, each with 50 hidden units and a ReLU activation function. The experiments are parameterized with the variable num_points which I will increase.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(7)
def generate_data(num_points=100):
X = np.linspace(0.0 , 2.0 * np.pi, num_points).reshape(-1, 1)
noise = np.random.normal(0, 1, num_points).reshape(-1, 1)
y = 3 * np.sin(X) + noise
return X, y
def run_experiment(X_train, y_train, X_test, batch_size=64):
num_points = X_train.shape[0]
model = keras.Sequential()
model.add(layers.Dense(50, input_shape=(1, ), activation='relu'))
model.add(layers.Dense(50, activation='relu'))
model.add(layers.Dense(1, activation='linear'))
model.compile(loss = "mse", optimizer = "adam", metrics=["mse"] )
history = model.fit(X_train, y_train, epochs=10,
batch_size=batch_size, verbose=0)
yhat = model.predict(X_test, batch_size=batch_size)
plt.figure(figsize=(5, 5))
plt.plot(X_train, y_train, "ro", markersize=2, label='True')
plt.plot(X_train, yhat, "bo", markersize=1, label='Predicted')
plt.ylim(-5, 5)
plt.title('N=%d points' % (num_points))
plt.legend()
plt.grid()
plt.show()
Here is how I invoke the code:
num_points = 100
X, y = generate_data(num_points)
run_experiment(X, y, X)
Now, if I run the experiment with num_points = 100, the model predictions (in blue) do a terrible job at fitting the true noisy sine wave (in red).
Now, here is num_points = 1000:
Here is num_points = 10000:
And here is num_points = 100000:
As you can see, for my chosen NN architecture, adding more training instances allows the neural network to better (over)fit the data.
If you do have a lot of training instances, then if you want to purposefully overfit your data, you can either increase the neural network capacity or reduce regularization. Specifically, you can control the following knobs:
increase the number of layers
increase the number of hidden units
increase the number of features per data instance
reduce regularization (e.g. by removing dropout layers)
use a more complex neural network architecture (e.g. transformer blocks instead of RNN)
You may be wondering if neural networks can fit arbitrary data rather than just a noisy sine wave as in my example. Previous research says that, yes, a big enough neural network can fit any data. See:
Universal approximation theorem. https://en.wikipedia.org/wiki/Universal_approximation_theorem
Zhang 2016, "Understanding deep learning requires rethinking generalization". https://arxiv.org/abs/1611.03530
As discussed in the comments, you should make a Python array (with NumPy) like this:-
Myarray = [[0.65, 1], [0.85, 0.5], ....]
Then you would just call those specific parts of the array whom you need to predict. Here the first value is the x-axis value. So you would call it to obtain the corresponding pair stored in Myarray
There are many resources to learn these types of things. some of them are ===>
https://www.geeksforgeeks.org/python-using-2d-arrays-lists-the-right-way/
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=video&cd=2&cad=rja&uact=8&ved=0ahUKEwjGs-Oxne3oAhVlwTgGHfHnDp4QtwIILTAB&url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DQgfUT7i4yrc&usg=AOvVaw3LympYRszIYi6_OijMXH72
In my classification scheme, there are several steps including:
SMOTE (Synthetic Minority Over-sampling Technique)
Fisher criteria for feature selection
Standardization (Z-score normalisation)
SVC (Support Vector Classifier)
The main parameters to be tuned in the scheme above are percentile (2.) and hyperparameters for SVC (4.) and I want to go through grid search for tuning.
The current solution builds a "partial" pipeline including step 3 and 4 in the scheme clf = Pipeline([('normal',preprocessing.StandardScaler()),('svc',svm.SVC(class_weight='auto'))])
and breaks the scheme into two parts:
Tune the percentile of features to keep through the first grid search
skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
# SMOTE synthesizes the training data (we want to keep test data intact)
X_train, y_train = SMOTE(X_train, y_train)
for percentile in percentiles:
# Fisher returns the indices of the selected features specified by the parameter 'percentile'
selected_ind = Fisher(X_train, y_train, percentile)
X_train_selected, X_test_selected = X_train[selected_ind,:], X_test[selected_ind, :]
model = clf.fit(X_train_selected, y_train)
y_predict = model.predict(X_test_selected)
f1 = f1_score(y_predict, y_test)
The f1 scores will be stored and then be averaged through all fold partitions for all percentiles, and the percentile with the best CV score is returned. The purpose of putting 'percentile for loop' as the inner loop is to allow fair competition as we have the same training data (including synthesized data) across all fold partitions for all percentiles.
After determining the percentile, tune the hyperparameters by second grid search
skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
# SMOTE synthesizes the training data (we want to keep test data intact)
X_train, y_train = SMOTE(X_train, y_train)
for parameters in parameter_comb:
# Select the features based on the tuned percentile
selected_ind = Fisher(X_train, y_train, best_percentile)
X_train_selected, X_test_selected = X_train[selected_ind,:], X_test[selected_ind, :]
clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
model = clf.fit(X_train_selected, y_train)
y_predict = model.predict(X_test_selected)
f1 = f1_score(y_predict, y_test)
It is done in the very similar way, except we tune the hyperparamter for SVC rather than percentile of features to select.
My questions are:
In the current solution, I only involve 3. and 4. in the clf and do 1. and 2. kinda "manually" in two nested loop as described above. Is there any way to include all four steps in a pipeline and do the whole process at once?
If it is okay to keep the first nested loop, then is it possible (and how) to simplify the next nested loop using a single pipeline
clf_all = Pipeline([('smote', SMOTE()),
('fisher', Fisher(percentile=best_percentile))
('normal',preprocessing.StandardScaler()),
('svc',svm.SVC(class_weight='auto'))])
and simply use GridSearchCV(clf_all, parameter_comb) for tuning?
Please note that both SMOTE and Fisher (ranking criteria) have to be done only for the training data in each fold partition.
It would be so much appreciated for any comment.
SMOTE and Fisher are shown below:
def Fscore(X, y, percentile=None):
X_pos, X_neg = X[y==1], X[y==0]
X_mean = X.mean(axis=0)
X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) +(1.0/(shape(X_neg[0]-1))*X_neg.var(axis=0)
num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
F = num/deno
sort_F = argsort(F)[::-1]
n_feature = (float(percentile)/100)*shape(X)[1]
ind_feature = sort_F[:ceil(n_feature)]
return(ind_feature)
SMOTE is from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py, it returns the synthesized data. I modified it to return the original input data stacked with the synthesized data along with its labels and synthesized ones.
def smote(X, y):
n_pos = sum(y==1), sum(y==0)
n_syn = (n_neg-n_pos)/float(n_pos)
X_pos = X[y==1]
X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)
y_syn = np.ones(shape(X_syn)[0])
X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
return(X, y)
scikit created a FunctionTransformer as part of the preprocessing class in version 0.17. It can be used in a similar manner as David's implementation of the class Fisher in the answer above - but with less flexibility. If the input/output of the function is configured properly, the transformer can implement the fit/transform/fit_transform methods for the function and thus allow it to be used in the scikit pipeline.
For example, if the input to a pipeline is a series, the transformer would be as follows:
def trans_func(input_series):
return output_series
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(trans_func)
sk_pipe = Pipeline([("trans", transformer), ("vect", tf_1k), ("clf", clf_1k)])
sk_pipe.fit(train.desc, train.tag)
where vect is a tf_idf transformer, clf is a classifier and train is the training dataset. "train.desc" is the series text input to the pipeline.
I don't know where your SMOTE() and Fisher() functions are coming from, but the answer is yes you can definitely do this. In order to do so you will need to write a wrapper class around those functions though. The easiest way to this is inherit sklearn's BaseEstimator and TransformerMixin classes, see this for an example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
If this isn't making sense to you, post the details of at least one of your functions (the library it comes from or your code if you wrote it yourself) and we can go from there.
EDIT:
I apologize, I didn't look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target so you will have do them prior as you originally were. For your reference, here is what it would look like to write your custom class for your Fisher process which would work if the function itself did not need to affect your target variable.
>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>>
>>> class Fisher(BaseEstimator, TransformerMixin):
... def __init__(self,percentile=0.95):
... self.percentile = percentile
... def fit(self, X, y):
... from numpy import shape, argsort, ceil
... X_pos, X_neg = X[y==1], X[y==0]
... X_mean = X.mean(axis=0)
... X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
... deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
... num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
... F = num/deno
... sort_F = argsort(F)[::-1]
... n_feature = (float(self.percentile)/100)*shape(X)[1]
... self.ind_feature = sort_F[:ceil(n_feature)]
... return self
... def transform(self, x):
... return x[self.ind_feature,:]
...
>>>
>>> data = load_iris()
>>>
>>> pipeline = Pipeline([
... ('fisher', Fisher()),
... ('normal',StandardScaler()),
... ('svm',SVC(class_weight='auto'))
... ])
>>>
>>> grid = {
... 'fisher__percentile':[0.75,0.50],
... 'svm__C':[1,2]
... }
>>>
>>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
>>> model.fit(data.data,data.target)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
for parameters in parameter_iterable
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
self.dispatch(function, args, kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
self.results = func(*args, **kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
(X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 75.
You actually can put all of these functions into a single pipeline!
In the accepted answer, #David wrote that your functions
transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target so you will have do them prior as you originally were.
It is true that sklearn's pipeline does not support this. However imblearn's pipeline here supports this. The imblearn pipeline is just like that of sklearn but it allows you to call transformations separately on the training and testing data via sample methods. Moreover, these sample methods are actually designed so that you can change both the data X and the labels y. This is important because many times you want to include smote in your pipeline but you want to smote just the training data, not the testing data. And with the imblearn pipeline, you can call smote in the pipeline to transform just X_train and y_train and not X_test and y_test.
So you can create an imblearn pipeline that has a smote sampler, pre-processing step, and svc.
For more details check out this stack overflow post here and machine learning mastery article here.