Using SMOTEENN in GridSearchCV Pipeline with Preprocessing - random-forest

I am working on a classification problem with a highly imbalanced dataset. I am trying to use SMOTEENN in the grid search pipeline, however I keep getting this ValueError:
ValueError: Invalid parameter randomforestclassifier for estimator Pipeline(memory=None,
ColumnTransformer(n_jobs=None, remainder='drop',
verbose=False). Check the list of available parameters with `estimator.get_params().keys()`.
I found online that SMOTEENN can be used with GridSearchCV if the Pipeline from imblearn is imported. I am using the Pipeline from imblearn but it still gives me this error.
The issue first started when I tried to use SMOTEENN and get the X and y variables. I have a prepare_data() function that breaks the data into X,y. I wanted to use SMOTEENN in that function and return the balanced data. However, one of my features is of type string - and needs to be put in OneHotEncoder. For some reason, SMOTEENN doesn't seem to process strings. Thus, I needed to use it in the pipeline so that SMOTEENN would be effective post-preprocessing.
I am pasting my pipeline code below. Any help or explanation would be much appreciated! Thank you!
def ML_RandomF(X, y, random_state, n_folds, oneHot_ftrs,
               num_ftrs, ordinal_ftrs, ordinal_cats, beta, test_size, score_type):
    scoring = {'roc_auc_score': make_scorer(roc_auc_score),
               'f_beta': make_scorer(fbeta_score, beta=beta, average='weighted'),
               'accuracy': make_scorer(accuracy_score)}
    X_other, X_test, y_other, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=random_state)
    reg = RandomForestClassifier(random_state=random_state, n_estimators=100, class_weight="balanced")
    sme = SMOTEENN(random_state=random_state)
    model = Pipeline([
        ('sampling', sme),
        ('classification', reg)])
    # ordinal encoder
    ordinal_transformer = Pipeline(steps=[
        ('ordinal', OrdinalEncoder(categories=ordinal_cats))])
    # oneHot encoder
    onehot_transformer = Pipeline(steps=[
        ('ordinal', OneHotEncoder(sparse=False, handle_unknown='ignore'))])
    # standard scaler
    numeric_transformer = Pipeline(steps=[
        ('scaler', StandardScaler())])
    preprocessor_X = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, num_ftrs),
            ('oneH', onehot_transformer, oneHot_ftrs),
            ('ordinal', ordinal_transformer, ordinal_ftrs)])
    pipe = Pipeline(steps=[('preprocessor_X', preprocessor_X), ('model', model)])
    param_grid = {'randomforestclassifier__max_depth': [3,5,7,10],
                  'randomforestclassifier__min_samples_split': [10,25,40]}
    grid = GridSearchCV(pipe, param_grid=param_grid,
                        scoring=scoring, cv=kf, refit=score_type,
                        return_train_score=True, iid=True, verbose=2, n_jobs=-1)
    grid.fit(X_other, y_other)
    return grid, grid.score(X_test, y_test)

You named the RandomForestClassifier step classification, and the pipeline that contains it is the step named model in your outer pipeline. Parameter names in the grid have to follow that nesting, with each level joined by a double underscore, so change your param_grid as follows:
param_grid = {'model__classification__max_depth': [3,5,7,10],
              'model__classification__min_samples_split': [10,25,40]}
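A quick way to confirm the names is to list them from the pipeline itself, as the error message suggests. The sketch below is a stand-in under stated assumptions: it uses only scikit-learn, so a StandardScaler takes the place of the question's ColumnTransformer and the SMOTEENN step is left out, but the step names (model, classification) match the question.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Inner pipeline: the classifier step is named 'classification'.
# (In the question this pipeline also holds the SMOTEENN step.)
model = Pipeline([('classification',
                   RandomForestClassifier(n_estimators=10, random_state=0))])

# Outer pipeline: the inner pipeline is the step named 'model'.
# StandardScaler is a stand-in for the question's ColumnTransformer.
pipe = Pipeline(steps=[('preprocessor_X', StandardScaler()), ('model', model)])

# Every tunable parameter is addressed as <outer step>__<inner step>__<parameter>.
rf_params = sorted(name for name in pipe.get_params()
                   if name.startswith('model__classification__'))
print(rf_params)

# set_params (which GridSearchCV relies on) accepts the nested name ...
pipe.set_params(model__classification__max_depth=5)

# ... and rejects the name used in the question's param_grid.
try:
    pipe.set_params(randomforestclassifier__max_depth=5)
    rejected = False
except ValueError as err:
    rejected = True
    print('ValueError:', str(err)[:60])
```

Names such as randomforestclassifier__max_depth only exist when the steps are created with make_pipeline, which names each step after its lowercased class.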


Nested cross validation with pipeline sklearn

I am trying to apply nested cross-validation with pipeline from the Sklearn library as seen below:
pipeline = imbpipeline(steps=[['smote', SMOTE(random_state=11)],
['scaler', MinMaxScaler()],
['classifier', LogisticRegression(random_state=11,
cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
cv_outer = KFold(n_splits=10, shuffle=True, random_state=1)
param_grid = {'classifier__C':[0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}
grid_search = GridSearchCV(estimator=pipeline,
scores = cross_val_score(grid_search,
print('Accuracy: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
The code works just fine but I can't figure out how to extract the best parameters found with the above procedure.
As per documentation I tried:
but I get:
AttributeError: 'GridSearchCV' object has no attribute 'best_params_'
which I can't really understand.
Any thoughts?
Before you can get the best parameters, you need to fit the grid search on the data. You should add one line:
grid_search.fit(X_train, y_train)
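Note that cross_val_score fits clones of grid_search, one per outer fold, so the original grid_search object stays unfitted even after the scores are computed, which is consistent with the AttributeError. If the per-fold winners are of interest, one option is cross_validate with return_estimator=True. The sketch below assumes synthetic data from make_classification and drops the SMOTE step so that it needs only scikit-learn; the step name classifier and the C grid follow the question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (an assumption; the question's data is not shown).
X, y = make_classification(n_samples=200, n_features=8, random_state=11)

pipeline = Pipeline(steps=[('scaler', MinMaxScaler()),
                           ('classifier', LogisticRegression(random_state=11,
                                                             max_iter=1000))])
param_grid = {'classifier__C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]}

cv_inner = KFold(n_splits=3, shuffle=True, random_state=1)
cv_outer = KFold(n_splits=5, shuffle=True, random_state=1)
grid_search = GridSearchCV(estimator=pipeline, param_grid=param_grid,
                           scoring='accuracy', cv=cv_inner)

# Nested CV: each outer fold fits its own clone of grid_search.
results = cross_validate(grid_search, X, y, scoring='accuracy',
                         cv=cv_outer, return_estimator=True)

# The original object was never fitted, so it has no best_params_ yet.
fitted_before = hasattr(grid_search, 'best_params_')
print('original fitted after nested CV?', fitted_before)

# Each fitted clone carries the parameters chosen on its outer training fold.
for fold, fitted in enumerate(results['estimator']):
    print(fold, fitted.best_params_, round(results['test_score'][fold], 3))

# For a single final model, fit the search on the whole training set.
grid_search.fit(X, y)
print(grid_search.best_params_)
```

The per-fold parameters can differ from one another; the nested scores estimate how well the whole tuning procedure generalizes rather than the quality of one specific C.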

sklearn and weka kNN predictions exactly same for all except for one data point

I wrote kNN code using sklearn and compared its predictions with those of WEKA's kNN. The comparison used the 10 test set predictions: only one of them shows a large difference (>1.5), while all the others are exactly the same. So I am not sure whether my code is working correctly. Here is my code:
df = pd.read_csv('xxxx.csv')
X = df.drop(['Name', 'activity'], axis=1)
y = df['activity']
Xstd = StandardScaler().fit_transform(X)
x_train, x_test, y_train, y_test = train_test_split(Xstd, y, test_size=0.2,
                                                    shuffle=False, random_state=None)
print(x_train.shape, x_test.shape)
X_train_trans = x_train
X_test_trans = x_test
for i in range(2, 3):
    knn_regressor = KNeighborsRegressor(n_neighbors=i, algorithm='brute',
                                        weights='uniform', metric='euclidean', n_jobs=1, p=2)
    CV_pred_train = cross_val_predict(knn_regressor, X_train_trans, y_train,
                                      n_jobs=-1, verbose=0, cv=LeaveOneOut())
    print("LOO Q2: ", metrics.r2_score(y_train, CV_pred_train).round(2))
    # Train Test predictions
    knn_regressor.fit(X_train_trans, y_train)
    train_r2 = knn_regressor.score(X_train_trans, y_train)
    y_train_pred = knn_regressor.predict(X_train_trans).round(3)
    train_r2_1 = metrics.r2_score(y_train, y_train_pred)
    y_test_pred = knn_regressor.predict(X_test_trans).round(3)
    train_r = stats.pearsonr(y_train, y_train_pred)
    abs_error_train = (y_train - y_train_pred)
    train_predictions = pd.DataFrame({'Actual': y_train, 'Predcited':
                                      y_train_pred, "error": abs_error_train.round(3)})
    MAE_train = metrics.mean_absolute_error(y_train, y_train_pred)
    abs_error_test = (y_test_pred - y_test)
    test_predictions = pd.DataFrame({'Actual': y_test, 'predcited':
                                     y_test_pred, 'error': abs_error_test.round(3)})
    test_r = stats.pearsonr(y_test, y_test_pred)
    test_r2 = metrics.r2_score(y_test, y_test_pred)
    MAE_test = metrics.mean_absolute_error(y_test, y_test_pred).round(3)
The train set statistics are almost the same in both sklearn and WEKA kNN.
The sklearn predictions are:
Actual predcited error
6.00 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.440 0.480
7.96 7.150 -0.810
7.30 6.660 -0.640
6.68 7.200 0.520
***4.60 6.950 2.350***
and the weka predictions are:
actual predicted error
6 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.44 0.48
7.96 7.15 -0.81
7.3 6.66 -0.64
6.68 7.2 0.52
***4.6 5.285 0.685***
Parameters used in both algorithms are: k = 2, brute force for distance calculation, metric: euclidean.
Any suggestions for the difference?
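The question is left open here, so the following is only one hypothesis worth testing, not a confirmed diagnosis: with k = 2, if the second and third nearest training points are equally far from the query, two correct brute-force implementations can pick different neighbours and so return different averages. The toy data below are invented for illustration; the sketch lists the sorted distances for one query so that a tie at the k-th position becomes visible.

```python
import numpy as np

# Invented toy data: three training points, two of them equidistant from the query.
X_train = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])
y_train = np.array([5.0, 6.0, 9.0])
query = np.array([0.5, 0.5])
k = 2

# Brute-force Euclidean distances from the query to every training point.
dist = np.sqrt(((X_train - query) ** 2).sum(axis=1))
order = np.argsort(dist, kind='stable')

# A tie exists when the k-th and (k+1)-th neighbours are equally distant.
has_tie = bool(np.isclose(dist[order[k - 1]], dist[order[k]]))
print('sorted distances:', dist[order].round(3), 'tie at k:', has_tie)

# Two equally valid choices of the second neighbour give different predictions.
pred_a = y_train[[order[0], order[1]]].mean()
pred_b = y_train[[order[0], order[2]]].mean()
print(pred_a, pred_b)
```

On the real data, calling kneighbors(X_test_trans) on the fitted KNeighborsRegressor returns the distances and indices it actually used, which can be checked for such a tie on the one row that disagrees with WEKA.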

Using GridSearchCV and a Random Forest Regressor with the same parameters gives different results

As the huge title says, I'm trying to use GridSearchCV to find the best parameters for a Random Forest Regressor, and I'm measuring my results with MSE.
Inputs_Treino = dataset.iloc[:253,1:4].values
Outputs_Treino = dataset.iloc[:253,-1].values
Inputs_Teste = dataset.iloc[254:,1:4].values
Outputs_Teste = dataset.iloc[254:,-1].values
estimator = RandomForestRegressor()
para_grids = {
    "n_estimators" : [10,50,100],
    "max_features" : ["auto", "log2", "sqrt"],
    "bootstrap" : [True, False]
}
grid = GridSearchCV(estimator, para_grids, scoring = 'mean_squared_error')
grid.fit(Inputs_Treino, Outputs_Treino)
forest = grid.best_estimator_
reg_prediction = forest.predict(Inputs_Teste)
print (grid.best_score_, grid.best_params_)
mse = mean_absolute_error(Outputs_Teste, reg_prediction)
This is the gist of the code (nothing too complex I know, just getting started with it all)
When I print the result of grid.best_estimator_ I get this
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
max_features='auto', max_leaf_nodes=None,
min_impurity_split=1e-07, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=50, n_jobs=1, oob_score=False, random_state=None,
verbose=0, warm_start=False)
The problem is if I try to create a regressor with these parameters (without using grid search at all) and train it the same way I get a waaaay bigger MSE on the testing set (5.483837301587303 vs 43.801520165079467)
Inputs_Treino = dataset.iloc[:253,1:4].values
Outputs_Treino = dataset.iloc[:253,-1].values
Inputs_Teste = dataset.iloc[254:,1:4].values
Outputs_Teste = dataset.iloc[254:,-1].values
regressor = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                                  max_features='auto', max_leaf_nodes=None,
                                  min_impurity_split=1e-07, min_samples_leaf=1,
                                  min_samples_split=2, min_weight_fraction_leaf=0.0,
                                  n_estimators=50, n_jobs=1, oob_score=False, random_state=None,
                                  verbose=0, warm_start=False)
regressor.fit(Inputs_Treino, Outputs_Treino)
# make the predictions
Teste_Prediction = regressor.predict(Inputs_Teste)
mse = mean_squared_error(Outputs_Teste, Teste_Prediction)
Does this have to do with the cross validation GridSearchCV performs ? What am I missing here ?
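Two details in the snippets above may be worth ruling out before looking at cross-validation itself; both are observations about the posted code rather than a confirmed explanation. The first snippet stores the result of mean_absolute_error in a variable named mse, while the second calls mean_squared_error, so the two numbers are not the same metric. In addition, random_state=None means that two forests with identical parameters are still trained differently. The numbers below are invented purely to show the first effect.

```python
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Invented targets and predictions, used only to compare the two metrics.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [14.0, 14.0, 36.0, 48.0]

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error squared

# Same predictions, very different magnitudes.
print('MAE:', mae, 'MSE:', mse)
```

Fixing random_state in both RandomForestRegressor instances and computing the same metric in both scripts would make the comparison like for like.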

StandardScaler with make_pipeline

If I use make_pipeline, do I still need to call fit and transform myself to fit my model and transform the data, or will the pipeline perform these steps itself?
Also, does StandardScaler also perform the normalization or only the scaling?
Explaining the code: I want to apply PCA and then apply normalization before the SVM.
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
pca = PCA(n_components=4).fit(X)
X = pca.transform(X)
# training a linear SVM classifier 5-fold
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
clf = make_pipeline(preprocessing.StandardScaler(), SVC(kernel = 'linear'))
scores = cross_val_score(clf, X, y, cv=5)
I am also a bit confused about what happens if I don't use the fit function in the code below:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
clf = SVC(kernel = 'linear', C = 1)
scores = cross_val_score(clf, X, y, cv=5)
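StandardScaler standardizes each feature: it subtracts the feature's mean and divides by its standard deviation, so every column ends up with zero mean and unit variance. It does not rescale to a fixed range such as [0, 1] (MinMaxScaler does that), and it does not normalize each sample to unit norm (Normalizer does that).
cross_val_score() fits a clone of your estimator on the training part of each fold (for a pipeline, that includes fitting and applying the transformers), so you don't need to call fit explicitly.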
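As a small illustration of these two points, the sketch below checks what StandardScaler produces and then lets cross_val_score do the fitting. The data are synthetic (make_classification), which is an assumption since the question's X and y are not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (the question's X and y are not shown).
X, y = make_classification(n_samples=150, n_features=6, random_state=0)

# After StandardScaler, each column has zero mean and unit standard deviation.
X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0).round(6), X_std.std(axis=0).round(6))

# No explicit fit/transform calls: cross_val_score clones the pipeline and
# fits it (scaler included) on the training part of each of the 5 folds.
clf = make_pipeline(StandardScaler(), SVC(kernel='linear'))
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())

# The pipeline object that was passed in is left unfitted,
# so fit it yourself before predicting with it.
clf.fit(X, y)
print(clf.predict(X[:3]))
```

Because the scaler sits inside the pipeline, its mean and standard deviation are learned from the training folds only, which avoids leaking information from the held-out fold.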
A more common approach is to put all the steps (StandardScaler, PCA, SVC) in one pipeline and use GridSearchCV to tune the hyperparameters and choose the best parameters (estimators).
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel = 'linear', C = 1))
])
param_grid = dict(reduce_dims__n_components=[4,6,8],
                  clf__C=np.logspace(-4, 1, 6),
)
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))

How do you make TensorFlow + Keras fast with a TFRecord dataset?

What is an example of how to use a TensorFlow TFRecord with a Keras Model while keeping the dataset in tensors with queue runners?
Below is a snippet that works but it needs the following improvements:
Use the Model API
specify an Input()
Load a dataset from a TFRecord
Run through a dataset in parallel (such as with a queuerunner)
Here is the snippet, there are several TODO lines indicating what is needed:
from keras.models import Model
import tensorflow as tf
from keras import backend as K
from keras.layers import Dense, Input
from keras.objectives import categorical_crossentropy
from tensorflow.examples.tutorials.mnist import input_data
sess = tf.Session()
# Can this be done more efficiently than placeholders w/ TFRecords?
img = tf.placeholder(tf.float32, shape=(None, 784))
labels = tf.placeholder(tf.float32, shape=(None, 10))
# TODO: Use Input()
x = Dense(128, activation='relu')(img)
x = Dense(128, activation='relu')(x)
preds = Dense(10, activation='softmax')(x)
# TODO: Construct model = Model(input=inputs, output=preds)
loss = tf.reduce_mean(categorical_crossentropy(labels, preds))
# TODO: handle TFRecord data, is it the same?
mnist_data = input_data.read_data_sets('MNIST_data', one_hot=True)
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
# TODO remove default, add queuerunner
with sess.as_default():
    for i in range(1000):
        batch = mnist_data.train.next_batch(50)
        train_step.run(feed_dict={img: batch[0],
                                  labels: batch[1]})
    print(loss.eval(feed_dict={img: mnist_data.test.images,
                               labels: mnist_data.test.labels}))
Why is this question relevant?
For high performance training without going back to python
no TFRecord to numpy to tensor conversions
Keras will soon be part of tensorflow
Demonstrate how Keras Model() classes can accept tensors for input data correctly.
Here is some starter information for a semantic segmentation problem example:
example unet Keras model, happens to be for semantic segmentation.
Keras + Tensorflow Blog Post
An attempt at running the unet model in a tf session with TFRecords and a Keras model (not working)
Code to create the TFRecords:
I don't use the TFRecord dataset format, so I won't argue the pros and cons, but I got interested in extending Keras to support it. The code is in a repository; I will briefly explain the main changes.
Dataset creation and loading
data_to_tfrecord and read_and_decode take care of creating the TFRecord dataset and loading it. Take special care when implementing read_and_decode, otherwise you will face cryptic errors during training.
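For readers on TensorFlow 2, where queue runners are no longer the usual input mechanism, the same write-then-read round trip can be sketched with tf.data. This is a swapped-in technique, not the data_to_tfrecord and read_and_decode code referred to above; the helper names write_tfrecord and parse_example, the file name, and the random stand-in images are all hypothetical.

```python
import os
import tempfile

import numpy as np
import tensorflow as tf


def write_tfrecord(images, labels, filename):
    """Serialize uint8 images and integer labels into a TFRecord file."""
    with tf.io.TFRecordWriter(filename) as writer:
        for image, label in zip(images, labels):
            example = tf.train.Example(features=tf.train.Features(feature={
                'label': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])),
                'image_raw': tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[image.tobytes()])),
            }))
            writer.write(example.SerializeToString())


def parse_example(serialized, image_shape=(28, 28, 1), classes=10):
    """Decode one serialized example into a float image and a one-hot label."""
    features = tf.io.parse_single_example(serialized, {
        'label': tf.io.FixedLenFeature([], tf.int64),
        'image_raw': tf.io.FixedLenFeature([], tf.string),
    })
    image = tf.io.decode_raw(features['image_raw'], tf.uint8)
    image = tf.reshape(image, image_shape)
    image = tf.cast(image, tf.float32) / 255.0
    return image, tf.one_hot(features['label'], classes)


# Random stand-in data instead of MNIST, to keep the sketch self-contained.
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(64, 28, 28, 1), dtype=np.uint8)
labels = rng.integers(0, 10, size=64)

path = os.path.join(tempfile.mkdtemp(), 'toy.tfrecord')
write_tfrecord(images, labels, path)

# Parsing runs in parallel and batches are prefetched, so the data stay in
# tensors and are never passed through a feed_dict.
dataset = (tf.data.TFRecordDataset(path)
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           .shuffle(64)
           .batch(16)
           .prefetch(tf.data.AUTOTUNE))

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='rmsprop', loss='categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(dataset, epochs=1, verbose=0)
```

The feature keys label and image_raw have to match between the writer and the parser; a mismatch in key, dtype, or image shape is a typical source of the cryptic errors mentioned above.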
Initialization and Keras model
Now both tf.train.shuffle_batch and the Keras Input layer return tensors. But the ones returned by tf.train.shuffle_batch don't have the metadata that Keras needs internally. As it turns out, any tensor can easily be turned into a tensor with Keras metadata by calling the Input layer with the tensor param.
So this takes care of initialization:
x_train_, y_train_ = ktfr.read_and_decode('train.mnist.tfrecord', one_hot=True, n_class=nb_classes, is_train=True)
x_train_batch, y_train_batch = tf.train.shuffle_batch([x_train_, y_train_],
                                                      num_threads=32) # set the number of threads here
x_train_inp = Input(tensor=x_train_batch)
Now with x_train_inp any keras model can be developed.
Training (simple)
Let's say train_out is the output tensor of your Keras model. You can easily write a custom training loop along the lines of:
loss = tf.reduce_mean(categorical_crossentropy(y_train_batch, train_out))
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
with sess.as_default():
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        step = 0
        while not coord.should_stop():
            start_time = time.time()
            _, loss_value = sess.run([train_op, loss], feed_dict={K.learning_phase(): 0})
            duration = time.time() - start_time
            if step % 100 == 0:
                print('Step %d: loss = %.2f (%.3f sec)' % (step, loss_value,
                                                           duration))
            step += 1
    except tf.errors.OutOfRangeError:
        print('Done training for %d epochs, %d steps.' % (FLAGS.num_epochs, step))
Training (keras style)
One of the features of Keras that makes it so attractive is its generalized training mechanism with callback functions.
But to support TFRecord-style training, several changes are needed in the fit function:
running the queue threads
no feeding in batch data through feed_dict
supporting validation becomes tricky, because the validation data also comes in through another tensor, so a different model needs to be created internally, with shared upper layers and the validation tensor fed in by another TFRecord reader.
All of this can easily be supported by another flag parameter. What makes things messy are the Keras features sample_weight and class_weight, which are used to weigh each sample and each class. For these, compile() creates placeholders, and placeholders are also implicitly created for the targets, which is not needed in our case because the labels are already fed in by the TFRecord readers. These placeholders would need to be fed during the session run, which is unnecessary in our case.
So, taking these changes into account, compile_tfrecord and fit_tfrecord are extensions of compile and fit and share about 95% of the code.
They can be used in the following way:
import keras_tfrecord as ktfr
train_model = Model(input=x_train_inp, output=train_out)
ktfr.compile_tfrecord(train_model, optimizer='rmsprop', loss='categorical_crossentropy', out_tensor_lst=[y_train_batch], metrics=['accuracy'])
ktfr.fit_tfrecord(train_model, X_train.shape[0], batch_size, nb_epoch=3)
You are welcome to improve the code and send pull requests.
Update 2018-08-29 this is now directly supported in keras, see the following example:
Original Answer:
TFRecords are supported by using an external loss. Here are the key lines constructing an external loss:
# tf yield ops that supply dataset images and labels
x_train_batch, y_train_batch = read_and_decode_recordinput(...)
# create a basic cnn
x_train_input = Input(tensor=x_train_batch)
x_train_out = cnn_layers(x_train_input)
model = Model(inputs=x_train_input, outputs=x_train_out)
loss = keras.losses.categorical_crossentropy(y_train_batch, x_train_out)
model.compile(optimizer='rmsprop', loss=None)
Here is an example for Keras 2. It works after applying the small patch #7060:
'''MNIST dataset with TensorFlow TFRecords.
Gets to 99.25% test accuracy after 12 epochs
(there is still a lot of margin for parameter tuning).
'''
import os
import copy
import time
import numpy as np
import tensorflow as tf
from tensorflow.python.ops import data_flow_ops
from keras import backend as K
from keras.models import Model
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import Flatten
from keras.layers import Input
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.callbacks import EarlyStopping
from keras.callbacks import TensorBoard
from keras.objectives import categorical_crossentropy
from keras.utils import np_utils
from keras.utils.generic_utils import Progbar
from keras import callbacks as cbks
from keras import optimizers, objectives
from keras import metrics as metrics_module
from keras.datasets import mnist
if K.backend() != 'tensorflow':
    raise RuntimeError('This example can only run with the '
                       'TensorFlow backend for the time being, '
                       'because it requires TFRecords, which '
                       'are not supported on other platforms.')
def images_to_tfrecord(images, labels, filename):
    """ Save data into TFRecord """
    def _int64_feature(value):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
    def _bytes_feature(value):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
    if not os.path.isfile(filename):
        num_examples = images.shape[0]
        rows = images.shape[1]
        cols = images.shape[2]
        depth = images.shape[3]
        print('Writing', filename)
        writer = tf.python_io.TFRecordWriter(filename)
        for index in range(num_examples):
            image_raw = images[index].tostring()
            example = tf.train.Example(features=tf.train.Features(feature={
                'height': _int64_feature(rows),
                'width': _int64_feature(cols),
                'depth': _int64_feature(depth),
                'label': _int64_feature(int(labels[index])),
                'image_raw': _bytes_feature(image_raw)}))
            # write each serialized example to the file
            writer.write(example.SerializeToString())
        writer.close()
    else:
        print('tfrecord %s already exists' % filename)
def read_and_decode_recordinput(tf_glob, one_hot=True, classes=None, is_train=None,
                                batch_shape=[1000, 28, 28, 1], parallelism=1):
    """ Return tensor to read from TFRecord """
    print('Creating graph for loading %s TFRecords...' % tf_glob)
    with tf.variable_scope("TFRecords"):
        record_input = data_flow_ops.RecordInput(
            tf_glob, batch_size=batch_shape[0], parallelism=parallelism)
        records_op = record_input.get_yield_op()
        records_op = tf.split(records_op, batch_shape[0], 0)
        records_op = [tf.reshape(record, []) for record in records_op]
        progbar = Progbar(len(records_op))
        images = []
        labels = []
        for i, serialized_example in enumerate(records_op):
            with tf.variable_scope("parse_images", reuse=True):
                features = tf.parse_single_example(
                    serialized_example,
                    features={
                        'label': tf.FixedLenFeature([], tf.int64),
                        'image_raw': tf.FixedLenFeature([], tf.string),
                    })
                img = tf.decode_raw(features['image_raw'], tf.uint8)
                img.set_shape(batch_shape[1] * batch_shape[2])
                img = tf.reshape(img, [1] + batch_shape[1:])
                img = tf.cast(img, tf.float32) * (1. / 255) - 0.5
                label = tf.cast(features['label'], tf.int32)
                if one_hot and classes:
                    label = tf.one_hot(label, classes)
                # collect the per-example tensors so they can be stacked below
                images.append(img)
                labels.append(label)
        images = tf.parallel_stack(images, 0)
        labels = tf.parallel_stack(labels, 0)
        images = tf.cast(images, tf.float32)
        images = tf.reshape(images, shape=batch_shape)
        # StagingArea will store tensors
        # across multiple steps to
        # speed up execution
        images_shape = images.get_shape()
        labels_shape = labels.get_shape()
        copy_stage = data_flow_ops.StagingArea(
            [tf.float32, tf.float32],
            shapes=[images_shape, labels_shape])
        copy_stage_op = copy_stage.put(
            [images, labels])
        staged_images, staged_labels = copy_stage.get()
        return images, labels
def save_mnist_as_tfrecord():
    (X_train, y_train), (X_test, y_test) = mnist.load_data()
    X_train = X_train[..., np.newaxis]
    X_test = X_test[..., np.newaxis]
    images_to_tfrecord(images=X_train, labels=y_train, filename='train.mnist.tfrecord')
    images_to_tfrecord(images=X_test, labels=y_test, filename='test.mnist.tfrecord')
def cnn_layers(x_train_input):
    x = Conv2D(32, (3, 3), activation='relu', padding='valid')(x_train_input)
    x = Conv2D(64, (3, 3), activation='relu')(x)
    x = MaxPooling2D(pool_size=(2, 2))(x)
    x = Dropout(0.25)(x)
    x = Flatten()(x)
    x = Dense(128, activation='relu')(x)
    x = Dropout(0.5)(x)
    x_train_out = Dense(classes,
                        activation='softmax')(x)
    return x_train_out
sess = tf.Session()
batch_size = 100
batch_shape = [batch_size, 28, 28, 1]
epochs = 3000
classes = 10
parallelism = 10
x_train_batch, y_train_batch = read_and_decode_recordinput(
x_test_batch, y_test_batch = read_and_decode_recordinput(
x_batch_shape = x_train_batch.get_shape().as_list()
y_batch_shape = y_train_batch.get_shape().as_list()
x_train_input = Input(tensor=x_train_batch, batch_shape=x_batch_shape)
x_train_out = cnn_layers(x_train_input)
y_train_in_out = Input(tensor=y_train_batch, batch_shape=y_batch_shape, name='y_labels')
cce = categorical_crossentropy(y_train_batch, x_train_out)
train_model = Model(inputs=[x_train_input], outputs=[x_train_out])
tensorboard = TensorBoard()
# tensorboard disabled due to Keras bug,
epochs=epochs) # callbacks=[tensorboard])
# Second Session, pure Keras
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train[..., np.newaxis]
X_test = X_test[..., np.newaxis]
x_test_inp = Input(batch_shape=(None,) + (X_test.shape[1:]))
test_out = cnn_layers(x_test_inp)
test_model = Model(inputs=x_test_inp, outputs=test_out)
test_model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['accuracy'])
loss, acc = test_model.evaluate(X_test, np_utils.to_categorical(y_test), classes)
print('\nTest accuracy: {0}'.format(acc))
I've also been working to improve the support for TFRecords in the following issue and pull request:
#6928 Yield Op support: High Performance Large Datasets via TFRecords, and RecordInput
#7102 Keras Input Tensor API Design Proposal
Finally, it is possible to use tf.contrib.learn.Experiment to train Keras models in TensorFlow.
