RandomizedSearchCV does not work in a Pipeline when evaluating alternative classifiers

In scikit-learn, RandomizedSearchCV can evaluate different parameters in a pipeline, but only in cases where the candidate classifiers share the same (or similar) parameters. When you pass separate blocks of parameters for different classifiers, it fails where GridSearchCV succeeds.
You will notice in the code below that the setup is identical for the grid search and the random search, yet only the random search fails.
import numpy
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, GridSearchCV, RandomizedSearchCV

MY_RAND_SEED = 52          # numpy.random.seed() returns None, so store the seed value itself
numpy.random.seed(MY_RAND_SEED)

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classify', LogisticRegression())
])

X, y = make_classification(n_samples=500, n_features=58, n_redundant=13,
                           n_informative=7, n_clusters_per_class=2)

# A list of parameter dictionaries, one block per candidate classifier.
param_grid_linear = [
    {'classify': [LogisticRegression()],
     'classify__penalty': ['l1', 'l2'],
     'classify__C': numpy.logspace(-4, 4, 50),
     'classify__solver': ['liblinear']},
    {'classify': [LogisticRegression()],
     'classify__penalty': ['l2'],
     'classify__C': numpy.logspace(-4, 4, 50),
     'classify__solver': ['lbfgs']},
    {'classify': [SVC()],
     'classify__kernel': ['linear'],
     'classify__C': numpy.linspace(0.001, 200, 10)},
]

innercv = StratifiedKFold(n_splits=5, shuffle=True, random_state=MY_RAND_SEED)

gridA = GridSearchCV(pipe, param_grid_linear, scoring='accuracy',
                     iid=False, verbose=1, n_jobs=12)
gridA.fit(X, y)
print("finished grid search")

gridB = RandomizedSearchCV(pipe, param_grid_linear, scoring='accuracy',
                           n_iter=5, iid=False, verbose=1, n_jobs=12)
gridB.fit(X, y)

Normally one passes a single dictionary of parameter distributions, but doing what I describe above requires passing a list of dictionaries (one block per classifier), which is currently only accepted by GridSearchCV.
RandomizedSearchCV does not accept this now but will in a future release of scikit-learn. Here is a response I received on GitHub:
From: Thomas J Fan
Date: Wednesday, August 14, 2019 at 7:13 PM
This was addressed in #14549
This feature is not released yet, but you can try it out by installing the nightly build of scikit-learn:
pip install --pre -f https://sklearn-nightly.scdn8.secure.raxcdn.com scikit-learn
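For reference, a minimal sketch of how the same search should look once that change ships (an assumption on my part: a scikit-learn release that includes #14549, i.e. 0.22 or later; pipe, param_grid_linear, X and y are the objects defined above):
# With the linked change, param_distributions may be a list of dictionaries;
# for each of the n_iter candidates one dictionary is picked and sampled from.
# iid is deprecated in 0.22, so it is omitted here.
gridB = RandomizedSearchCV(pipe, param_distributions=param_grid_linear,
                           scoring='accuracy', n_iter=5, verbose=1, n_jobs=12)
gridB.fit(X, y)
print(gridB.best_params_)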

Related

How do I set up my pipeline for GridSearchCV if I have a text feature and a numerical feature?

I'm trying to do a classification with a text feature and a numerical feature.
I'd like to run CountVectorizer on my text, pass the resulting sparse matrix together with my numerical feature to my classifier, and then run a GridSearchCV.
This is my failed attempt at setting up a pipeline for GridSearchCV.
I've referenced this: How to access ColumnTransformer elements in GridSearchCV, but can't seem to get it to work.
Any help will be appreciated.
Edit: got it to work with make_column_transformer / ColumnTransformer; working code below.
# X_train contains 2 columns, ['text', 'num']
# y_train contains 1 column, ['label']
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

word_transformer = Pipeline(steps=[('cvec', CountVectorizer())])

# Vectorize the 'text' column and pass the numerical column through untouched.
# The column is given as a string, not a list, so CountVectorizer receives a
# 1-D series of documents rather than a one-column DataFrame.
preprocessor = ColumnTransformer(transformers=[('wt', word_transformer, 'text')],
                                 remainder='passthrough')

pipe = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('RBC', RandomForestClassifier())
])

pipe_params = {
    'preprocessor__wt__cvec__max_features': [1500, 2000, 3000],
    'RBC__max_features': ['sqrt', 'log2']
}

gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5)  # 5-fold cross-validation
gs.fit(X_train, y_train)
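For completeness, a sketch of the same preprocessor written with the make_column_transformer helper mentioned in the edit (this is not code from the original post; the helper auto-generates the step name, so the grid-search key changes accordingly):
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer

# The generated step name is 'countvectorizer', so the corresponding
# grid-search key becomes 'preprocessor__countvectorizer__max_features'.
preprocessor = make_column_transformer(
    (CountVectorizer(), 'text'),
    remainder='passthrough')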

High variance with a Random Forest learner

I'm using a Random Forest Regressor to fit a 10-dimensional regression problem with around 300 thousand samples. Although it is not necessary for Random Forests, I started by putting the data on the same scale (using sklearn's preprocessing module) and then ran a randomised search over the following parameter space:
n_estimators = [int(x) for x in linspace(start=100, stop=2000, num=11)]
max_features = ['auto', 'sqrt']
max_depth = from 1 to 150 with step 11
min_samples_split = [2, 5, 10, 12]
min_samples_leaf = [1, 2, 4, 6]
bootstrap = [True, False]
After getting the best parameters I then did a second, narrower search.
Even though I am using a 10-fold cross-validation scheme with the random search, I am still getting a serious overfitting problem!
I have also tried using the DBSCAN algorithm to check for outliers; after excluding some parts of the dataset I got even worse results!
Should I include other Random Forest parameters in the randomised search, or should I apply more preprocessing to the dataset before fitting?
For convenience, this is the implementation I wrote:
import numpy as np
from sklearn.model_selection import ShuffleSplit, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor

n_estimators = [int(x) for x in np.linspace(start=1, stop=15, num=15)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10, 12]
min_samples_leaf = [1, 2, 4, 6]
bootstrap = [True, False]

cv = ShuffleSplit(n_splits=10, test_size=0.01, random_state=0)

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=50, cv=cv, verbose=2, random_state=42,
                               n_jobs=32)
rf_random.fit(x_train, y_train)
The best parameters returned by the randomized search:
bootstrap=False, min_samples_leaf=2, n_estimators=1647, max_features='sqrt', min_samples_split=3, max_depth=None.
The target ranges from 0 to 10000 [unit]. The model achieves an RMSE of 6.98 [unit] on the training set and an average RMSE of 67.54 [unit] across the test sets.
That line:
max_depth = from 1 to 150 with step 11
For a 10-feature problem the optimal depth is under 10; you are overfitting like crazy because of that. Consider searching max_depth from 1 to 15 with a step of 1.
min_samples_split = [2, 5, 10, 12]
min_samples_leaf = [1, 2, 4, 6]
These should help reduce the variance; however, the step of 11 for max_depth is killing all the effort you could possibly make. A narrowed search space along these lines is sketched below.
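As a rough illustration of that advice (the concrete ranges below are assumptions for the sketch, not values from this thread), the search space could be narrowed like this:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, KFold

# Shallow trees plus larger leaf/split sizes to control variance.
narrow_grid = {
    'n_estimators': [200, 500, 1000],
    'max_depth': list(range(1, 16)),          # 1 to 15 with step 1
    'min_samples_split': [2, 5, 10, 12],
    'min_samples_leaf': [1, 2, 4, 6],
    'max_features': ['sqrt', 1.0],
}

search = RandomizedSearchCV(RandomForestRegressor(random_state=0),
                            param_distributions=narrow_grid,
                            n_iter=50,
                            cv=KFold(n_splits=10, shuffle=True, random_state=0),
                            scoring='neg_root_mean_squared_error',
                            n_jobs=-1, random_state=42)
# search.fit(x_train, y_train)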

Why does cross_val_score always report the same score for each fold?

I'm running a scikit-learn pipeline with cross-validation via cross_val_score.
But after running it a couple of times, the results for each fold are always the same. This bothers me: shouldn't the splits be random?
This is the relevant part of my code:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

pipeline = Pipeline([
    ('vect', CountVectorizer(preprocessor=clean_text_custom, max_features=MAX_NB_WORDS, strip_accents='unicode')),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC(), n_jobs=-1)),
])

cross_val_score(pipeline, data, binary_label_data, cv=5, scoring='f1_micro')
# array([ 0.25129587, 0.37780563, 0.33195376, 0.31269861, 0.14555337])
# then I run it again and I get the exact same scores for each fold
cross_val_score(pipeline, data, binary_label_data, cv=5, scoring='f1_micro')
# array([ 0.25129587, 0.37780563, 0.33195376, 0.31269861, 0.14555337])
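One likely explanation, sketched below: when cv is an integer, cross_val_score defaults to an unshuffled (Stratified)KFold, so the data is split at exactly the same positions on every call; passing an explicit splitter with shuffle=True and no fixed random_state gives different folds (and different scores) on each run.
from sklearn.model_selection import StratifiedKFold, cross_val_score

# cv=5 behaves like an unshuffled StratifiedKFold: identical folds every run.
# A shuffled splitter without a fixed random_state varies between runs.
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True)
scores = cross_val_score(pipeline, data, binary_label_data,
                         cv=shuffled_cv, scoring='f1_micro')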

Pipeline using CountVectorizer (max_df) before tfidf

Currently I am not sure whether this question belongs on Stack Overflow or on a more theoretical statistics Q&A site, but I am confused about the following.
I am doing a binary text classification task. For this task I use a pipeline; one example of the code is below:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from nltk.corpus import stopwords

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])

parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vect__stop_words': [None, stopwords.words('dutch'), stopwordList],
    'clf__C': [0.1, 1, 10, 100, 1000]
}
Nothing really strange about this so far, but then I started playing with the parameter options and settings and noticed that the code below (i.e. these steps and parameters) gave the highest accuracy (F1) score:
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])

parameters = {
    'vect__ngram_range': [(1, 1)],
    'vect__stop_words': [None],
    'vect__max_df': [0.2],
    'vect__max_features': [10000],
    'clf__C': [100]
}
So I'm pleased to have sort of found the parameter settings and methods that give the highest score, but I can't figure out their exact meaning. In the vectorizer step, the setting for max_df (ignoring terms that appear in more than 20% of the documents) seems strange to apply before TF-IDF (or somehow redundant).
Furthermore, it also uses max_features of 10,000. Which is applied first, max_df or max_features? And how do I interpret setting max_features and then doing TF-IDF afterwards: does it perform TF-IDF over the 10,000 retained features?
To me it seems rather strange to do TF-IDF after using parameters such as max_df and max_features. Am I correct, and why? Or should I just do whatever gives the highest score?
I hope someone can point me in the right direction, thanks a lot in advance.
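A short sketch of the order of operations as I understand it (not from the original thread): CountVectorizer first prunes the vocabulary with max_df, then keeps only the max_features most frequent of the remaining terms; TfidfTransformer afterwards merely reweights the count matrix built from that pruned vocabulary, so TF-IDF is indeed computed over the 10,000 surviving features.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Inside the 'vect' step:
#   1. build the vocabulary from the training documents
#   2. drop terms whose document frequency exceeds max_df (> 20% of documents)
#   3. keep the max_features (10,000) most frequent of the remaining terms
# The 'tfidf' step then only reweights the counts of those surviving terms.
vect_then_tfidf = Pipeline([
    ('vect', CountVectorizer(max_df=0.2, max_features=10000)),
    ('tfidf', TfidfTransformer()),
])
# X_tfidf = vect_then_tfidf.fit_transform(docs)   # docs: list of raw strings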

How to use a ValidationMonitor for an Estimator in TensorFlow 1.0?

TensorFlow provides the possibility of combining ValidationMonitors with several predefined estimators like tf.contrib.learn.DNNClassifier.
But I want to use a ValidationMonitor for my own estimator, which I have created based on [1].
For my own estimator I first initialize a ValidationMonitor:
validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(testX,testY,every_n_steps=50)
estimator = tf.contrib.learn.Estimator(model_fn=model,model_dir=direc,config=tf.contrib.learn.RunConfig(save_checkpoints_secs=1))
input_fn = tf.contrib.learn.io.numpy_input_fn({"x": x}, y, 4, num_epochs=1000)
Here I pass the monitor, as shown in [2] for tf.contrib.learn.DNNClassifier:
estimator.fit(input_fn=input_fn, steps=1000,monitors=[validation_monitor])
This fails and the following error is printed:
ValueError: Features are incompatible with given information. Given features: Tensor("input:0", shape=(?, 1), dtype=float64), required signatures: {'x': TensorSignature(dtype=tf.float64, shape=TensorShape([Dimension(None)]), is_sparse=False)}.
How can I use monitors for my own estimators?
Thanks.
The problem is solved by passing an input_fn containing testX and testY to the ValidationMonitor instead of passing the tensors testX and testY directly.
For the record, your error was caused by the fact that ValidationMonitor expects x to be a dictionary like { 'feature_name_as_a_string' : feature_tensor }, which in your input_fn is done internally by the call to tf.contrib.learn.io.numpy_input_fn(...).
More information about how to build feature dictionaries can be found in the "Building Input Functions with tf.contrib.learn" article of the documentation.
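A minimal sketch of the fix described above (assuming the TensorFlow 1.x tf.contrib.learn API; testX, testY, estimator and input_fn are the objects from the question):
import tensorflow as tf

# Wrap the validation data in an input_fn so that features arrive as a
# dictionary ({'x': tensor}), just like the training input_fn produces.
val_input_fn = tf.contrib.learn.io.numpy_input_fn(
    {"x": testX}, testY, batch_size=4, num_epochs=1)

validation_monitor = tf.contrib.learn.monitors.ValidationMonitor(
    input_fn=val_input_fn, every_n_steps=50)

estimator.fit(input_fn=input_fn, steps=1000, monitors=[validation_monitor])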
