Complex nesting within imblearn pipelines - machine-learning

I have been trying to find a solution to this but unsuccessfully so far.
I am working with some data for which I need to adopt a resampling procedure within a (scikit-learn/imblearn) pipeline, meaning that the size of both the samples and targets has to change within the pipeline. In order to do this I am using FunctionSampler from imblearn.
My problem is that the main pipeline is composed of steps which are, actually, pipelines themselves, which is giving me some problems.
The code below shows an extremely simplified version of the scenario I am working in. Please note this is not the actual code I am using (the transformers/classifiers are different and many more in the original code), only the structure is similar.
# pipeline definition
from sklearn.preprocessing import StandardScaler, Normalizer, PolynomialFeatures
from sklearn.feature_selection import VarianceThreshold, SelectKBest
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.svm import SVC
# from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline
from imblearn import FunctionSampler

def outlier_extractor(X, y):
    # just an example
    return X, y

pipe = Pipeline(steps=[("feature_engineering", PolynomialFeatures()),
                       ("variance_threshold", VarianceThreshold()),
                       ("outlier_correction", FunctionSampler(func=outlier_extractor)),
                       ("classifier", QuadraticDiscriminantAnalysis())])

# definition of the feature engineering options
feature_engineering_options = [
    Pipeline(steps=[
        ("scaling", StandardScaler()),
        ("PCA", PCA(n_components=3))
    ]),
    Pipeline(steps=[  # add div and prod features
        ("polynomial", PolynomialFeatures()),
        ("kBest", SelectKBest())
    ])
]
outlier_correction_options = [
    FunctionSampler(func=outlier_extractor),
    Pipeline(steps=[
        ("center_scaling", StandardScaler()),
        ("normalisation", Normalizer(norm="l2"))
    ])
]

# definition of the parameters to optimize in the pipeline
params = [
    # support vector machine
    {"feature_engineering": feature_engineering_options,
     "variance_threshold__threshold": [0, 0.5, 1],
     "outlier_correction": outlier_correction_options,
     "classifier": [SVC()],
     "classifier__C": [0.1, 1, 10, 50],
     "classifier__kernel": ["linear", "rbf"],
     },
    # quadratic discriminant analysis
    {"feature_engineering": feature_engineering_options,
     "variance_threshold__threshold": [0, 0.5, 1],
     "outlier_correction": outlier_correction_options,
     "classifier": [QuadraticDiscriminantAnalysis()]
     }
]
When using GridSearchCV(pipe, param_grid=params) I receive the error TypeError: All intermediate steps of the chain should be estimators that implement fit and transform or fit_resample. I know that I should unpack the pipelines, and I have also tried to follow this and this in order to solve the problem but my case seems (to me, at least) more complicated and I could not get these workarounds to work.
Any help/suggestion is very much appreciated. Thanks
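One possible reading of "unpack the pipelines" is to flatten each nested option into a single flat list of (name, step) pairs and then build one flat pipeline per combination. The helper below is only a sketch of that idea, not a confirmed fix for the error above: flatten_steps and DummyPipeline are hypothetical names, the helper relies only on the convention that pipeline-like objects expose a steps list, and the stand-in class and string "estimators" exist purely so the example runs without scikit-learn or imblearn installed.

```python
def flatten_steps(steps, prefix=""):
    """Recursively unpack nested pipeline-like objects into a flat step list.

    Any object with a `steps` attribute (a list of (name, estimator) pairs)
    is treated as a nested pipeline; its steps are inlined and their names
    are prefixed with the parent step name to keep them unique.
    """
    flat = []
    for name, est in steps:
        full_name = f"{prefix}{name}"
        nested = getattr(est, "steps", None)
        if nested is not None:
            # Nested pipeline: inline its steps instead of keeping the wrapper.
            flat.extend(flatten_steps(nested, prefix=f"{full_name}_"))
        else:
            flat.append((full_name, est))
    return flat


class DummyPipeline:
    """Hypothetical stand-in for a Pipeline: it only stores `steps`."""

    def __init__(self, steps):
        self.steps = steps


# Strings stand in for real estimators in this illustration.
nested = [
    ("feature_engineering", DummyPipeline([("scaling", "scaler"), ("PCA", "pca")])),
    ("outlier_correction", "sampler"),
    ("classifier", "qda"),
]
flat = flatten_steps(nested)
print([name for name, _ in flat])
```

The names are joined with a single underscore because the double underscore is reserved for addressing parameters of a step. With real estimators, the resulting flat list could be passed as steps to a pipeline, at the cost of running one grid search per combination of options.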

Related

Preprocess and data transformation in machine learning

I have a problem where I have to predict a buyer using machine learning (I created a dummy dataset). I need to transform the data before I can use it for machine learning. I am aggregating information at the (id, visit) level, which gives me a list of the food and cloths bought. These lists need to be one-hot encoded before applying a classifier model.
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier

def preprocess(df):
    # Only keep rows till buyer=1
    df = df.groupby(["id1", "visit"], group_keys=False).apply(
        lambda g: g.loc[: g["Buyer"].idxmax()]
    )
    # Form lists on each id1,visit level
    df1 = df.groupby(["id1", "visit"], as_index=False).agg(
        is_Pax=("Buyer", "max"),
        fruits=("fruits", lambda x: x.dropna().unique().tolist()),
        cloths=("cloths", lambda x: x.dropna().unique().tolist()),
    )
    col = ["fruits", "cloths"]
    df_transformed = onehot(df1, col)
    return df_transformed

def onehot(df, col):
    """
    This function does one hot encoding of a list column.
    """
    onehot_list_encoder = MultiLabelBinarizer()
    for cl in col:
        print("One hot encoding ", cl)
        newd = pd.DataFrame(
            onehot_list_encoder.fit_transform(df[cl]),
            columns=onehot_list_encoder.classes_,
        ).add_prefix(cl + "_")
        df = df.join(newd)
    return df

df = pd.DataFrame(
    np.array([
        ['a', 'a', 'b', 'b', 'a', 'a'],
        [1, 2, 2, 2, 1, 1],
        ['Apple', 'Apple', 'Banana', None, 'Orange', 'Pear'],
        [1, 2, 1, 3, 4, 5],
        [0, 0, 1, 0, 1, 0],
    ]).T,
    columns=['id1', 'visit', 'fruits', 'cloths', 'Buyer'],
)
df['Buyer'] = df['Buyer'].astype('int')
How can I now create a simple ML model that applies this preprocessing at both fit and predict time? On test data I want the same transformation (i.e. 0 for all columns not present in the test rows). Can a pipeline solve this? I am not very experienced with writing pipelines and am getting errors.
droplist = ['id1', 'visit', 'fruits', 'cloths']
pipe = Pipeline(steps=[
    ("preprocess", preprocess(df)),
    ("coltrans", ColumnTransformer([("drop", 'drop', droplist)])),
    ("model", GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)),
])
Can someone help?
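A pipeline step has to be an estimator object with fit and transform methods, not the result of calling a function on the data. One way to get "0 for labels not seen in training" is to fit one MultiLabelBinarizer per list column during fit and reuse it in transform. The sketch below assumes the aggregation into lists has already been done outside the pipeline (a standard transformer is not expected to change the number of rows); ListColumnsBinarizer is a hypothetical name, and plain dicts stand in for the aggregated DataFrame.

```python
import warnings

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MultiLabelBinarizer


class ListColumnsBinarizer(BaseEstimator, TransformerMixin):
    """One-hot encode list-valued columns, remembering the labels seen in fit.

    Hypothetical helper: X can be a DataFrame or any mapping where X[col]
    yields an iterable of lists.
    """

    def __init__(self, columns=("fruits", "cloths")):
        self.columns = columns

    def fit(self, X, y=None):
        # One binarizer per column, fitted on the training data only.
        self.binarizers_ = {
            col: MultiLabelBinarizer().fit(X[col]) for col in self.columns
        }
        return self

    def transform(self, X):
        blocks = []
        with warnings.catch_warnings():
            # Labels unseen during fit are ignored (encoded as zeros);
            # scikit-learn emits a UserWarning about them, silenced here.
            warnings.simplefilter("ignore", UserWarning)
            for col in self.columns:
                blocks.append(self.binarizers_[col].transform(X[col]))
        return np.hstack(blocks)


# Toy data standing in for the aggregated (id1, visit) rows.
train = {"fruits": [["Apple"], ["Banana", "Orange"]], "cloths": [["1", "2"], ["3"]]}
test = {"fruits": [["Pear", "Apple"]], "cloths": [["2"]]}

encoder = ListColumnsBinarizer().fit(train)
encoded_test = encoder.transform(test)
print(encoded_test)
```

Because the output always has one column per label learned during fit, a test row containing an unseen label such as "Pear" still produces a row of the same width, so the step can be followed directly by a classifier in a Pipeline.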

How do I set up my pipeline for GridSearchCV if I have a text feature and a numerical feature?

I'm trying to do a classification with a text feature and a numerical feature.
I'd like to run CountVectorizer on my text, pass the sparse matrix output together with my numerical feature to my classifier, and then run a GridSearchCV.
This is my failed attempt at setting up a pipeline for GridSearchCV.
I've referenced this: How to access ColumnTransformer elements in GridSearchCV, but can't seem to get it to work.
Any help will be appreciated.
Edit: got it to work with make_column_transformer.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# X_train contains 2 columns, ['text','num']
# y_train contains 1 column, ['label']
word_transformer = Pipeline(steps=[('cvec', CountVectorizer())])
preprocessor = ColumnTransformer(transformers=[('wt', word_transformer, ['text'])],
                                 remainder='passthrough')
pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('RBC', RandomForestClassifier())
])
pipe_params = {
    'preprocessor__wt__cvec__max_features': [1500, 2000, 3000],
    'RBC__max_features': ['sqrt', 'log2']
}
gs = GridSearchCV(pipe,
                  param_grid=pipe_params,
                  cv=5)  # 5-fold cross-validation.
gs.fit(X_train, y_train)
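Two things are worth checking in a setup like this, sketched below under the assumption that the columns are named 'text' and 'num' as in the comments above. First, ColumnTransformer passes a 1-D column to the transformer only when the column is given as a scalar name; CountVectorizer expects that 1-D input. Second, the exact parameter names accepted in param_grid can be listed with get_params() before running any search.

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

word_transformer = Pipeline(steps=[("cvec", CountVectorizer())])
preprocessor = ColumnTransformer(
    # A scalar column name ("text", not ["text"]) hands the vectorizer a 1-D
    # column of strings, which is what CountVectorizer expects.
    transformers=[("wt", word_transformer, "text")],
    remainder="passthrough",
)
pipe = Pipeline(
    steps=[("preprocessor", preprocessor), ("RBC", RandomForestClassifier())]
)

# Every key listed here is a valid param_grid key for GridSearchCV.
available = sorted(pipe.get_params().keys())
print([name for name in available if name.endswith("max_features")])
```

Nothing is fitted here, so the snippet runs without any data; it only inspects how the nested names are composed from the step names joined by double underscores.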

Using fit_params in neuraxle pipeline

I want to use a classifier, e.g. the sklearn.linear_model.SGDClassifier, within a neuraxle pipeline and fit it in an online fashion using partial_fit. I have the classifier wrapped in an SKLearnWrapper with use_partial_fit=True, like this:
from neuraxle.pipeline import Pipeline
from neuraxle.steps.sklearn import SKLearnWrapper
from sklearn.linear_model import SGDClassifier
p = Pipeline([
    SKLearnWrapper(SGDClassifier(), use_partial_fit=True)
])
X = [[1.], [2.], [3.]]
y = ['class1', 'class2', 'class1']
p.fit(X, y)
However, to fit the classifier in online fashion, one needs to provide an additional argument classes to the partial_fit function, that contains the possible classes that are occurring in the data, e.g. classes=['class1', 'class2'], at least for the first time it is called. So the above code results in an error:
ValueError: classes must be passed on the first call to partial_fit.
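For reference, this is the plain scikit-learn behaviour that any wrapper has to accommodate, shown outside of any pipeline. It is a minimal sketch with made-up toy batches: the full label set is passed on the first partial_fit call and can be omitted afterwards.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Two toy mini-batches; the second one contains only one of the classes.
X1, y1 = [[1.0], [2.0], [3.0]], ["class1", "class2", "class1"]
X2, y2 = [[4.0], [5.0]], ["class2", "class2"]

clf = SGDClassifier(random_state=0)
# The full label set must be given on the first call, even if a batch
# happens to contain only some of the classes.
clf.partial_fit(X1, y1, classes=np.array(["class1", "class2"]))
# Later calls can omit `classes`.
clf.partial_fit(X2, y2)
print(clf.classes_)
```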
The same issue arises for other fit_params like sample_weight. In a standard sklearn pipeline, fit_params can be handed down to individual steps via the <step name>__<parameter name> syntax, e.g. for the sample_weight parameter:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
q = Pipeline([
('clf', SGDClassifier())
])
q.fit(X, y, clf__sample_weight=[0.25, 0.5, 0.25])
Of course, the standard sklearn pipeline does not allow calling partial_fit on the classifier, which is why I want to use the neuraxle pipeline in the first place.
Is there any way to pass additional parameters to the fit or partial_fit functions of a step in a neuraxle pipeline?
I suggest that you edit (fork or subclass) the SKLearnWrapper and redefine the method that calls partial_fit, so that it passes the missing arguments you would like to have.
You could also add a method to this forked SKLearnWrapper, as follows. The classes argument can then be changed later using an apply call made from outside the pipeline.
class ConfigurablePartialSGDClassifier(SKLearnWrapper):
    def __init__(self):
        super().__init__(SGDClassifier(), use_partial_fit=True)

    def update_classes(self, classes: list):
        # Store the full set of labels to pass to partial_fit.
        self.classes = classes

    def _sklearn_fit_without_expected_outputs(self, data_inputs):
        self.wrapped_sklearn_predictor.partial_fit(data_inputs, classes=self.classes)
You can then do:
p = Pipeline([
    ('clf', ConfigurablePartialSGDClassifier())
])
X1 = [[1.], [2.], [3.]]
X2 = [[4.], [5.], [6.]]
Y1 = [0, 1, 1]
Y2 = [1, 1, 0]
# classes must list each distinct label that occurs in Y1 and Y2
classes = [0, 1]
p.apply("update_classes", classes)
p.fit(X1, Y1)
p.fit(X2, Y2)
Note that p could also simply have been defined this way to get the same behavior:
p = ConfigurablePartialSGDClassifier()
The thing is, calls to apply methods can pass through pipelines and are applied to all nested steps if the steps contain such methods.

Pipeline using CountVectorizer (max_df) before tfidf

I am not sure whether this question belongs on Stack Overflow or on a more theoretical statistics Q&A site, but I am confused about the following.
I am doing a binary text classification task. For this task I use a pipeline; one of the example codes is below:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__stop_words': [None, stopwords.words('dutch'), stopwordList],
    'clf__C': [0.1, 1, 10, 100, 1000]
}
Nothing really strange so far, but then I started playing with the parameter options and noticed that the steps and parameters in the code below gave the highest score (F1 score):
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])
parameters = {
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': [None],
    'vect__max_df': [0.2],
    'vect__max_features': [10000],
    'clf__C': [100]
}
I am pleased to have found the parameter settings and methods that give the highest score, but I can't figure out what they mean exactly. In the 'vect' step, the max_df setting (ignoring terms that appear in more than 20% of the documents) seems strange to apply before tf-idf (or somehow redundant with it).
Furthermore, it also uses max_features of 10,000. Which is applied first, max_df or max_features? And how do I interpret setting max_features and doing tf-idf afterwards? Does it then compute tf-idf over the 10,000 features?
To me it seems rather strange to do tf-idf after using parameters such as max_df and max_features. Am I correct, and why? Or should I just do what gives the highest score?
I hope someone can point me in the right direction. Thanks a lot in advance.
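The order of operations can be checked on a toy corpus. The sketch below uses four made-up documents: in CountVectorizer the document-frequency filter (max_df) is applied first, max_features then keeps the most frequent of the surviving terms, and TfidfTransformer only reweights the columns that are left.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["the cat sat", "the dog sat", "the bird flew", "the fish swam"]

# max_df=0.5 drops terms found in more than half of the documents ("the").
pruned = CountVectorizer(max_df=0.5).fit(docs)
# Adding max_features=3 then keeps the 3 most frequent of the surviving terms.
capped = CountVectorizer(max_df=0.5, max_features=3).fit(docs)

# tf-idf is computed only over the columns that survived both filters.
counts = capped.transform(docs)
tfidf = TfidfTransformer().fit_transform(counts)

print(sorted(pruned.vocabulary_))
print(sorted(capped.vocabulary_))
print(tfidf.shape)
```

So the two steps are not doing the same job twice: max_df and max_features decide which terms exist as features at all, while tf-idf changes the weights of the remaining features. Whether that pruning helps is an empirical question for the dataset at hand.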

How would you do RandomizedSearchCV with VotingClassifier for Sklearn?

I'm trying to tune my voting classifier. I want to use randomized search in sklearn. However, how can I set the parameter lists for my voting classifier, given that I currently use two algorithms (different tree algorithms)?
Do I have to separately run randomized search and combine them together in voting classifier later?
Could someone help? Code examples would be highly appreciated :)
Thanks!
You can perfectly combine both, the VotingClassifier with RandomizedSearchCV. No need to run them separately. See the documentation: http://scikit-learn.org/stable/modules/ensemble.html#using-the-votingclassifier-with-gridsearch
The trick is to prefix each parameter in your params with the name of its estimator. For example, if you have created a RandomForest estimator and registered it as ('rf', clf2), then you can set its parameters in the form <name>__<param>. Specific example: rf__n_estimators: [20, 200], so you refer to a specific estimator and set the values to test for a specific param.
Ready to test executable code example ;)
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
# sklearn.grid_search was removed; RandomizedSearchCV now lives in model_selection
from sklearn.model_selection import RandomizedSearchCV

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
clf1 = DecisionTreeClassifier()
clf2 = RandomForestClassifier(random_state=1)
# parameter names are prefixed with the estimator names used in VotingClassifier
params = {'dt__max_depth': [5, 10], 'rf__n_estimators': [20, 200]}
eclf = VotingClassifier(estimators=[('dt', clf1), ('rf', clf2)], voting='hard')
# cv=3 because each class has only 3 samples in this toy dataset
random_search = RandomizedSearchCV(eclf, param_distributions=params, n_iter=4, cv=3)
random_search.fit(X, y)
# grid_scores_ was replaced by cv_results_
print(random_search.cv_results_)
