Is it possible to perform quantization on densenet169 and how? - machine-learning

I have been trying to performing quantization on a densenet model without success. I have been trying to implement pytorch post training static quantization. Pytorch has quantized versions of other models, but does not have for densenet. Is it possible to quantize the densenet architecture.
I have searched for tutorials on how to apply quantization on pre-trained models but i havent had any success.

Here's how to do this on DenseNet169 from torchvision:
from torch.ao.quantization import QuantStub, DeQuantStub
from torch import nn
from torchvision.models import densenet169, DenseNet169_Weights
from tqdm import tqdm
from torch.ao.quantization import HistogramObserver, PerChannelMinMaxObserver
import torch
# Wrap base model with quant/dequant stub
class QuantizedDenseNet169(nn.Module):
def __init__(self):
super().__init__()
self.dn = densenet169(weights=DenseNet169_Weights.IMAGENET1K_V1)
self.quant = QuantStub()
self.dequant = DeQuantStub()
def forward(self, x):
x = self.quant(x)
x = self.dn(x)
return self.dequant(x)
dn = QuantizedDenseNet169()
# move to gpu
dn.cuda()
# Propagate qconfig
dn.qconfig = torch.quantization.QConfig(
activation=HistogramObserver.with_args(),
weight=PerChannelMinMaxObserver.with_args(dtype=torch.qint8)
)
# fbgemm for x86 architecture
torch.backends.quantized.engine = 'fbgemm'
dn = torch.quantization.prepare(dn, inplace=False)
# calibrate with own dataset (I'm using random inputs to show process)
with torch.no_grad():
for _ in tqdm(range(5), desc="PTQ progess"):
input_ = torch.randn([1, 3, 128, 128], device='cuda')
dn.forward(input_)
# move to cpu before quantization
dn.cpu()
dn = torch.quantization.convert(dn, inplace=False)
# check if it's working
out = dn(torch.randn([1, 3, 128, 128]))

Related

Target size (torch.Size([32, 9])) must be the same as input size (torch.Size([32, 10]))

I have 10 classes. I have a model such as;
from brevitas.nn import QuantLinear, QuantReLU
import torch.nn as nn
# Setting seeds for reproducibility
torch.manual_seed(0)
model = nn.Sequential(
QuantLinear(input_size, hidden1, bias=True, weight_bit_width=weight_bit_width),
nn.BatchNorm1d(hidden1),
nn.Dropout(0.5),
QuantReLU(bit_width=act_bit_width),
QuantLinear(hidden1, hidden2, bias=True, weight_bit_width=weight_bit_width),
nn.BatchNorm1d(hidden2),
nn.Dropout(0.5),
QuantReLU(bit_width=act_bit_width),
QuantLinear(hidden2, hidden3, bias=True, weight_bit_width=weight_bit_width),
nn.BatchNorm1d(hidden3),
nn.Dropout(0.5),
QuantReLU(bit_width=act_bit_width),
QuantLinear(hidden3, num_classes, bias=True, weight_bit_width=weight_bit_width)
)
model.to(device)
and I have defined my training phase as:
def train(model, train_loader, optimizer, criterion):
losses = []
# ensure model is in training mode
model.train()
for i, data in enumerate(train_loader, 0):
inputs, target = data['pointcloud'].to(device).float(), data['category'].to(device)
target = torch.nn.functional.one_hot(target)
optimizer.zero_grad()
# forward pass
output = model(inputs)
loss = criterion(output, target.float())
# backward pass + run optimizer to update weights
loss.backward()
optimizer.step()
# keep track of loss value
losses.append(loss.data.cpu().numpy())
return losses
As I run the training code:
import numpy as np
from sklearn.metrics import accuracy_score
from tqdm import tqdm, trange
# Setting seeds for reproducibility
torch.manual_seed(0)
np.random.seed(0)
running_loss = []
running_test_acc = []
t = trange(num_epochs, desc="Training loss", leave=True)
for epoch in t:
loss_epoch = train(model, train_loader, optimizer,criterion)
test_acc = test(model, valid_loader)
t.set_description("Training loss = %f test accuracy = %f" % (np.mean(loss_epoch), test_acc))
t.refresh() # to show immediately the update
running_loss.append(loss_epoch)
running_test_acc.append(test_acc)
I get an error as:
Target size (torch.Size([32, 9])) must be the same as input size
(torch.Size([32, 10]))
Please help me about what can be possibly the solution. I have added one hot encoding because I have seen some solutions like that before.
The code error is pretty straightforward - the criterion (that you didn't show here in the code) expects both the input and the target arguments to be the same size, but they're not.
The problem is that you're using torch.nn.functional.one_hot(target) without telling it how many classes you need for one-hot encoding; it is then infered as the largest values in target +1 (see: https://pytorch.org/docs/stable/generated/torch.nn.functional.one_hot.html). You should change it to torch.nn.functional.one_hot(target, num_classes=10)

Pass information between pipeline steps in sklearn

I am working on a simple text generation problem with LSTMs. To make the preprocessing more compact and reproducible, I decided to implement everything in sklearn fashion, using custom sklearn transformers, and the KerasClassifier from scikeras to wrap the neural network definition in a sklearn-type estimator.
It almost works but I can't figure out how to pass information from within a certain custom transformer on to the KerasClassifier estimator. More precisely, for the method that creates the neural network, I need the number of outputs as an argument; but this depends on the number of words in the fitted vocabulary - which is an information that is currently encapsulated in ModelEncoder class.
(Note that in order to get the current logic work, I had to slightly modify the default sklearn Pipeline class, as it wouldn't allow modifying and returning both X and y. In other words, the default sklearn Pipeline only allows feature transformations but not target transformations. Modifying the custom Pipeline class was explained in this StackOverflow post.)
Example data:
train_data = ['o by no means honest ventidius i gave it freely ever and theres none can truly say he gives if our betters play at that game we must not dare to imitate them faults that are rich are fair'
'but was not this nigh shore'
'impairing henry strengthening misproud york the common people swarm like summer flies and whither fly the gnats but to the sun'
'what while you were there'
'chill pick your teeth zir come no matter vor your foins'
'thanks dear isabel' 'come prick me bullcalf till he roar again'
'go some of you knock at the abbeygate and bid the lady abbess come to me'
'an twere not as good deed as drink to break the pate on thee i am a very villain'
'beaufort it is thy sovereign speaks to thee'
'but say lucetta now we are alone wouldst thou then counsel me to fall in love'
'for being a bawd for being a bawd'
'all blest secrets all you unpublishd virtues of the earth spring with my tears'
'what likelihood' 'o find him']
Full code:
# Modify the sklearn Pipeline class to allow it to return tuples and hence enable both X and y modifications. (Current default implementation in sklearn only allows
# feature transformations, i.e. transformations on X, but not on y.)
class Pipeline(pipeline.Pipeline):
def _fit(self, X, y=None, **fit_params_steps):
self.steps = list(self.steps)
self._validate_steps()
memory = check_memory(self.memory)
fit_transform_one_cached = memory.cache(pipeline._fit_transform_one)
for (step_idx, name, transformer) in self._iter(
with_final=False, filter_passthrough=False
):
if transformer is None or transformer == "passthrough":
with _print_elapsed_time("Pipeline", self._log_message(step_idx)):
continue
try:
# joblib >= 0.12
mem = memory.location
except AttributeError:
mem = memory.cachedir
finally:
cloned_transformer = clone(transformer) if mem else transformer
X, fitted_transformer = fit_transform_one_cached(
cloned_transformer,
X,
y,
None,
message_clsname="Pipeline",
message=self._log_message(step_idx),
**fit_params_steps[name],
)
if isinstance(X, tuple): ###### unpack X if is tuple X = (X,y)
X, y = X
self.steps[step_idx] = (name, fitted_transformer)
return X, y
def fit(self, X, y=None, **fit_params):
fit_params_steps = self._check_fit_params(**fit_params)
Xt = self._fit(X, y, **fit_params_steps)
if isinstance(Xt, tuple): ###### unpack X if is tuple X = (X,y)
Xt, y = Xt
with _print_elapsed_time("Pipeline", self._log_message(len(self.steps) - 1)):
if self._final_estimator != "passthrough":
fit_params_last_step = fit_params_steps[self.steps[-1][0]]
self._final_estimator.fit(Xt, y, **fit_params_last_step)
return self
class ModelTokenizer(TransformerMixin, BaseEstimator):
def __init__(self, max_len=100):
super().__init__()
self.max_len = max_len
def fit(self, X=None, y=None):
return self
def transform(self, X, y=None):
X_flattened = " ".join(X).split()
sequences = list()
for i in range(self.max_len+1, len(X_flattened)):
seq = X_flattened[i-self.max_len-1:i]
sequences.append(seq)
return sequences
class ModelEncoder(TransformerMixin, BaseEstimator):
def __init__(self):
super().__init__()
self.tokenizer = Tokenizer()
def fit(self, X=None, y=None):
self.tokenizer.fit_on_texts(X)
return self
def transform(self, X, y=None):
encoded_sequences = np.array(self.tokenizer.texts_to_sequences(X))
return (encoded_sequences[:,:-1], encoded_sequences[:,-1])
def create_nn(input_shape=(100,1), output_shape=None):
model = Sequential()
model.add(LSTM(64, input_shape=input_shape, return_sequences=True))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(20, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(output_shape, activation='softmax'))
metrics_list = [tf.keras.metrics.BinaryAccuracy(name='accuracy')]
model.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = metrics_list)
return model
pipe = Pipeline([
('tokenizer', ModelTokenizer()),
('encoder', ModelEncoder()),
('model', KerasClassifier(build_fn=create_nn, epochs=10, output_shape=vocab_size)),
])
# Question: how to pass 'vocab_size'?
Imports:
from sklearn import pipeline
from sklearn.base import clone
from sklearn.utils import _print_elapsed_time
from sklearn.utils.validation import check_memory
from sklearn.base import BaseEstimator, TransformerMixin
from keras.preprocessing.text import Tokenizer
from scikeras.wrappers import KerasClassifier
KerasClassifier has its own internal transformer (see here, it is used to provide one-hot encoding and such) which has an API to pass metadata to the model (see here, that's how arguments such as n_outputs_ are passed into the model building function). Could you override that to pass this extra metadata to the model? It's stepping a bit outside of the Scikit-Learn API, but as you've noted the Scikit-Learn API doesn't have this functionality built in. If you want to propagate that information from a Transformer in your pipeline into SciKeras you could encode it into a feature and then use the above-mentioned hooks along with a custom encoder to remove that feature and convert it into metadata that can be passed into the model, but now you'd be really pushing the Scikit-Learn API.

Why I get different expected_value when I include the training data in TreeExplainer?

Including the training data in SHAP TreeExplainer gives different expected_value in scikit-learn GBT Regressor.
Reproducible example (run in Google Colab):
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
import shap
shap.__version__
# 0.37.0
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)
# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172
# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])
np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])
# explainer with training data
gbt_data_explainer = shap.TreeExplainer(model=gbt, data=X_train) # specifying feature_perturbation does not change the result
gbt_data_explainer.expected_value
# -23.564797322079635
So, the expected value when including the training data gbt_data_explainer.expected_value is quite different from the one calculated without supplying the data (gbt_explainer.expected_value).
Both approaches are additive and consistent when used with the (obviously different) respective shap_values:
np.abs(gbt_explainer.expected_value + gbt_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True
np.abs(gbt_data_explainer.expected_value + gbt_data_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True
but I wonder why they do not provide the same expected_value, and why gbt_data_explainer.expected_value is so different from the mean value of predictions.
What am I missing here?
Apparently shap subsets to 100 rows when data is passed, then runs those rows through the trees to reset the sample counts for each node. So the -23.5... being reported is the average model output for those 100 rows.
The data is passed to an Independent masker, which does the subsampling:
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_tree.py#L94
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_explainer.py#L68
https://github.com/slundberg/shap/blob/v0.37.0/shap/maskers/_tabular.py#L216
Running
from shap import maskers
another_gbt_explainer = shap.TreeExplainer(
gbt,
data=maskers.Independent(X_train, max_samples=800),
feature_perturbation="tree_path_dependent"
)
another_gbt_explainer.expected_value
gets back to
-11.534353657511172
Though #Ben did a great job in digging out how the data gets passed through Independent masker, his answer does not show exactly (1) how base values are calculated and where do we get the different base value from and (2) how to choose/lower the max_samples param
Where the different value comes from
The masker object has a data attribute that holds data after masking process. To get the value showed in gbt_explainer.expected_value:
from shap.maskers import Independent
gbt = GradientBoostingRegressor(random_state=0)
# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172
# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])
gbt_explainer = shap.TreeExplainer(gbt, Independent(X_train,100))
gbt_explainer.expected_value
# -23.56479732207963
one would need to do:
masker = Independent(X_train,100)
gbt.predict(masker.data).mean()
# -23.56479732207963
What about choosing max_samples?
Setting max_samples to the original dataset length seem to work with other explainers too:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import shap
from shap.maskers import Independent
from scipy.special import logit, expit
corpus,y = shap.datasets.imdb()
corpus_train, corpus_test, y_train, y_test = train_test_split(corpus, y, test_size=0.2, random_state=7)
vectorizer = TfidfVectorizer(min_df=10)
X_train = vectorizer.fit_transform(corpus_train)
model = sklearn.linear_model.LogisticRegression(penalty="l2", C=0.1)
model.fit(X_train, y_train)
explainer = shap.Explainer(model
,masker = Independent(X_train,100)
,feature_names=vectorizer.get_feature_names()
)
explainer.expected_value
# -0.18417413671991964
This value comes from:
masker=Independent(X_train,100)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[...,1])
# array([-0.18417414])
max_samples=100 seem to be a bit off for a true base_value (just feeding feature means):
logit(model.predict_proba(X_train.mean(0).reshape(1,-1))[:,1])
array([-0.02938039])
By increasing max_samples one might get reasonably close to true baseline, while keeping num of samples low:
masker = Independent(X_train,1000)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[:,1])
# -0.05957302658674238
So, to get base value for an explainer of interest (1) pass explainer.data (or masker.data) through your model and (2) choose max_samples so that base_value on sampled data is close enough to the true base value. You may also try to observe if the values and order of shap importances converge.
Some people may notice that to get to the base values sometimes we average feature inputs (LogisticRegression) and sometimes outputs (GBT)

Why different intermediate layer ouput of CNN in keras?

I am using this code to perform some experiment, I want to use intermediate layer representation of layer mainly before the fully connected layer(or last layer) of CNN.
from __future__ import print_function
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from keras.datasets import imdb
# set parameters:
max_features = 5000
maxlen = 400
batch_size = 100
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
epochs = 100
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
print(len(x_train), 'train sequences')
print(len(x_test), 'test sequences')
print('Pad sequences (samples x time)')
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)
print('x_train shape:', x_train.shape)
print('x_test shape:', x_test.shape)
print('Build model...')
model = Sequential()
# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
embedding_dims,
input_length=maxlen))
model.add(Dropout(0.2))
# we add a Convolution1D, which will learn filters
# word group filters of size filter_length:
model.add(Conv1D(filters,
kernel_size,
padding='valid',
activation='relu',
strides=1))
# we use max pooling:
model.add(GlobalMaxPooling1D())
# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))#<======== I need output after this.
# We project onto a single unit output layer, and squash it with a sigmoid:
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
optimizer='adam', metrics=['accuracy'])
To get the intermediate layer representation of penultimate layer I used following code.
CODE1
get_layer_output = K.function([model.layers[0].input, K.learning_phase()],
[model.layers[6].output])
# output in test mode = 0
layer_output_test = get_layer_output([x_test, 0])[0]
# output in train mode = 1
layer_output_train = get_layer_output([x_train, 1])[0]
print(layer_output_train)
print(layer_output_train.shape)
CODE2
def get_activations(model, layer, X_batch):
get_activations = K.function([model.layers[0].input, K.learning_phase()], [model.layers[layer].output,])
activations = get_activations([X_batch,1])
return activations
import numpy as np
X_train=np.array(get_activations(model=model,layer=6, X_batch=x_train)[0], dtype=np.float32)
print(X_train)
print(X_train.shape)
Which one is correct as I am getting/printing different output for above two codes? I want to use the above correct output to multiply by weights and optimise by custom optimiser.
If you pass 1 to K.learning_phase() you will get different results every time. But both codes give the same result.
Using a higher level approach, you can do this:
from keras.models import Model
newModel = Model(model.inputs,model.layers[6].output)
Do whatever you want with newModel. You can train it (and affect the original model), and use it to predict values.

Put customized functions in Sklearn pipeline

In my classification scheme, there are several steps including:
SMOTE (Synthetic Minority Over-sampling Technique)
Fisher criteria for feature selection
Standardization (Z-score normalisation)
SVC (Support Vector Classifier)
The main parameters to be tuned in the scheme above are percentile (2.) and hyperparameters for SVC (4.) and I want to go through grid search for tuning.
The current solution builds a "partial" pipeline including step 3 and 4 in the scheme clf = Pipeline([('normal',preprocessing.StandardScaler()),('svc',svm.SVC(class_weight='auto'))])
and breaks the scheme into two parts:
Tune the percentile of features to keep through the first grid search
skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
# SMOTE synthesizes the training data (we want to keep test data intact)
X_train, y_train = SMOTE(X_train, y_train)
for percentile in percentiles:
# Fisher returns the indices of the selected features specified by the parameter 'percentile'
selected_ind = Fisher(X_train, y_train, percentile)
X_train_selected, X_test_selected = X_train[selected_ind,:], X_test[selected_ind, :]
model = clf.fit(X_train_selected, y_train)
y_predict = model.predict(X_test_selected)
f1 = f1_score(y_predict, y_test)
The f1 scores will be stored and then be averaged through all fold partitions for all percentiles, and the percentile with the best CV score is returned. The purpose of putting 'percentile for loop' as the inner loop is to allow fair competition as we have the same training data (including synthesized data) across all fold partitions for all percentiles.
After determining the percentile, tune the hyperparameters by second grid search
skf = StratifiedKFold(y)
for train_ind, test_ind in skf:
X_train, X_test, y_train, y_test = X[train_ind], X[test_ind], y[train_ind], y[test_ind]
# SMOTE synthesizes the training data (we want to keep test data intact)
X_train, y_train = SMOTE(X_train, y_train)
for parameters in parameter_comb:
# Select the features based on the tuned percentile
selected_ind = Fisher(X_train, y_train, best_percentile)
X_train_selected, X_test_selected = X_train[selected_ind,:], X_test[selected_ind, :]
clf.set_params(svc__C=parameters['C'], svc__gamma=parameters['gamma'])
model = clf.fit(X_train_selected, y_train)
y_predict = model.predict(X_test_selected)
f1 = f1_score(y_predict, y_test)
It is done in the very similar way, except we tune the hyperparamter for SVC rather than percentile of features to select.
My questions are:
In the current solution, I only involve 3. and 4. in the clf and do 1. and 2. kinda "manually" in two nested loop as described above. Is there any way to include all four steps in a pipeline and do the whole process at once?
If it is okay to keep the first nested loop, then is it possible (and how) to simplify the next nested loop using a single pipeline
clf_all = Pipeline([('smote', SMOTE()),
('fisher', Fisher(percentile=best_percentile))
('normal',preprocessing.StandardScaler()),
('svc',svm.SVC(class_weight='auto'))])
and simply use GridSearchCV(clf_all, parameter_comb) for tuning?
Please note that both SMOTE and Fisher (ranking criteria) have to be done only for the training data in each fold partition.
It would be so much appreciated for any comment.
SMOTE and Fisher are shown below:
def Fscore(X, y, percentile=None):
X_pos, X_neg = X[y==1], X[y==0]
X_mean = X.mean(axis=0)
X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) +(1.0/(shape(X_neg[0]-1))*X_neg.var(axis=0)
num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
F = num/deno
sort_F = argsort(F)[::-1]
n_feature = (float(percentile)/100)*shape(X)[1]
ind_feature = sort_F[:ceil(n_feature)]
return(ind_feature)
SMOTE is from https://github.com/blacklab/nyan/blob/master/shared_modules/smote.py, it returns the synthesized data. I modified it to return the original input data stacked with the synthesized data along with its labels and synthesized ones.
def smote(X, y):
n_pos = sum(y==1), sum(y==0)
n_syn = (n_neg-n_pos)/float(n_pos)
X_pos = X[y==1]
X_syn = SMOTE(X_pos, int(round(n_syn))*100, 5)
y_syn = np.ones(shape(X_syn)[0])
X, y = np.vstack([X, X_syn]), np.concatenate([y, y_syn])
return(X, y)
scikit created a FunctionTransformer as part of the preprocessing class in version 0.17. It can be used in a similar manner as David's implementation of the class Fisher in the answer above - but with less flexibility. If the input/output of the function is configured properly, the transformer can implement the fit/transform/fit_transform methods for the function and thus allow it to be used in the scikit pipeline.
For example, if the input to a pipeline is a series, the transformer would be as follows:
def trans_func(input_series):
return output_series
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(trans_func)
sk_pipe = Pipeline([("trans", transformer), ("vect", tf_1k), ("clf", clf_1k)])
sk_pipe.fit(train.desc, train.tag)
where vect is a tf_idf transformer, clf is a classifier and train is the training dataset. "train.desc" is the series text input to the pipeline.
I don't know where your SMOTE() and Fisher() functions are coming from, but the answer is yes you can definitely do this. In order to do so you will need to write a wrapper class around those functions though. The easiest way to this is inherit sklearn's BaseEstimator and TransformerMixin classes, see this for an example: http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html
If this isn't making sense to you, post the details of at least one of your functions (the library it comes from or your code if you wrote it yourself) and we can go from there.
EDIT:
I apologize, I didn't look at your functions closely enough to realize that they transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target so you will have do them prior as you originally were. For your reference, here is what it would look like to write your custom class for your Fisher process which would work if the function itself did not need to affect your target variable.
>>> from sklearn.base import BaseEstimator, TransformerMixin
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.grid_search import GridSearchCV
>>> from sklearn.datasets import load_iris
>>>
>>> class Fisher(BaseEstimator, TransformerMixin):
... def __init__(self,percentile=0.95):
... self.percentile = percentile
... def fit(self, X, y):
... from numpy import shape, argsort, ceil
... X_pos, X_neg = X[y==1], X[y==0]
... X_mean = X.mean(axis=0)
... X_pos_mean, X_neg_mean = X_pos.mean(axis=0), X_neg.mean(axis=0)
... deno = (1.0/(shape(X_pos)[0]-1))*X_pos.var(axis=0) + (1.0/(shape(X_neg)[0]-1))*X_neg.var(axis=0)
... num = (X_pos_mean - X_mean)**2 + (X_neg_mean - X_mean)**2
... F = num/deno
... sort_F = argsort(F)[::-1]
... n_feature = (float(self.percentile)/100)*shape(X)[1]
... self.ind_feature = sort_F[:ceil(n_feature)]
... return self
... def transform(self, x):
... return x[self.ind_feature,:]
...
>>>
>>> data = load_iris()
>>>
>>> pipeline = Pipeline([
... ('fisher', Fisher()),
... ('normal',StandardScaler()),
... ('svm',SVC(class_weight='auto'))
... ])
>>>
>>> grid = {
... 'fisher__percentile':[0.75,0.50],
... 'svm__C':[1,2]
... }
>>>
>>> model = GridSearchCV(estimator = pipeline, param_grid=grid, cv=2)
>>> model.fit(data.data,data.target)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 596, in fit
return self._fit(X, y, ParameterGrid(self.param_grid))
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/grid_search.py", line 378, in _fit
for parameters in parameter_iterable
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 653, in __call__
self.dispatch(function, args, kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 400, in dispatch
job = ImmediateApply(func, args, kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 138, in __init__
self.results = func(*args, **kwargs)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py", line 1239, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/pipeline.py", line 130, in fit
self.steps[-1][-1].fit(Xt, y, **fit_params)
File "/Users/dmcgarry/anaconda/lib/python2.7/site-packages/sklearn/svm/base.py", line 149, in fit
(X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 1 samples, but y has 75.
You actually can put all of these functions into a single pipeline!
In the accepted answer, #David wrote that your functions
transform your target in addition to your training data (i.e. both X and y). Pipeline does not support transformations to your target so you will have do them prior as you originally were.
It is true that sklearn's pipeline does not support this. However imblearn's pipeline here supports this. The imblearn pipeline is just like that of sklearn but it allows you to call transformations separately on the training and testing data via sample methods. Moreover, these sample methods are actually designed so that you can change both the data X and the labels y. This is important because many times you want to include smote in your pipeline but you want to smote just the training data, not the testing data. And with the imblearn pipeline, you can call smote in the pipeline to transform just X_train and y_train and not X_test and y_test.
So you can create an imblearn pipeline that has a smote sampler, pre-processing step, and svc.
For more details check out this stack overflow post here and machine learning mastery article here.

Resources