Find the best pipeline model using CrossValidator and ParamGridBuilder - machine-learning

I have an acceptable model, but I would like to improve it by tuning its parameters in a Spark ML Pipeline with CrossValidator and ParamGridBuilder.
As the Estimator I will use the existing pipeline.
I do not understand what to put in the ParamMaps.
As the Evaluator I will use the RegressionEvaluator created previously.
I want to use 5 folds, with a list of 10 different depth values for the tree.
How can I select and show the best model, i.e. the one with the lowest RMSE?
Current example:
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator
dt = DecisionTreeRegressor()
pipeline = Pipeline(stages=[vectorizer, dt])
model =
regEval = RegressionEvaluator(predictionCol = "Predicted_XX", labelCol = "XX", metricName = "rmse")
rmse = regEval.evaluate(predictions)
print("Root Mean Squared Error: %.2f" % rmse)
Root Mean Squared Error: 3.60
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
dt2 = DecisionTreeRegressor()
pipeline2 = Pipeline(stages=[vectorizer, dt2])
model2 =
regEval2 = RegressionEvaluator(predictionCol = "Predicted_PE", labelCol = "PE", metricName = "rmse")
paramGrid = ParamGridBuilder().build() # ??????
crossval = CrossValidator(estimator = pipeline2, estimatorParamMaps = paramGrid, evaluator=regEval2, numFolds = 5) # ?????
rmse2 = regEval2.evaluate(predictions)
#bestPipeline = ????
#bestLRModel = ????
#bestParams = ????
print("Root Mean Squared Error: %.2f" % rmse2)
Root Mean Squared Error: 3.60 # the same?

You need to call .fit() on the crossval object with your training data to create the CV model; that call performs the cross-validation. You can then get the best model (according to your evaluator metric) from it, e.g.:
cvModel = crossval.fit(train)  # train: placeholder name for your training DataFrame
myBestModel = cvModel.bestModel


Prophet with Multiprocessing?

My problem is time series anomaly detection, and I use the Facebook Prophet library. I have a function called "fit_predict_model" and 90 different dataframes that I keep in a dictionary, so I have 90 different models, and training them takes a long time. I wanted to use multiprocessing to train faster, but I am getting a memory error. How can I solve this problem?
def fit_predict_model(dataframe, model_name, interval_width = 0.95, changepoint_range = 0.88):
    model = Prophet(yearly_seasonality=False, daily_seasonality=True,
                    seasonality_mode = "multiplicative", changepoint_range = changepoint_range)
    model = model.fit(dataframe)
    forecast = model.predict(forecast)
    return forecast
pred = {}
def run(key):
    pred[key] = fit_predict_model(train[key], model_name = key)
pool = Pool(cpu_count())
pool.map(run, list(train.keys()))
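Two things in the snippet above may be contributing to the problem. First, `pred[key] = ...` runs inside a worker process, so the parent's `pred` dictionary is never filled; results have to be returned from the worker. Second, `Pool(cpu_count())` keeps that many models and dataframes alive at once, which can exceed the available RAM. Below is a minimal sketch of the pattern, using a stand-in computation instead of Prophet so that it runs anywhere; `fit_one` and `train_all` are hypothetical names.

```python
from multiprocessing import Pool


def fit_one(item):
    """Worker: fit one model and return (key, result) to the parent process."""
    key, series = item
    # Stand-in for fit_predict_model(train[key], model_name=key);
    # replace this line with the real call.
    forecast = sum(series) / len(series)
    return key, forecast


def train_all(train, processes=4):
    """Fit every entry of `train` in worker processes and collect the results."""
    # maxtasksperchild=1 replaces each worker after one task, so memory held by
    # a finished model is released instead of accumulating in the worker.
    with Pool(processes=processes, maxtasksperchild=1) as pool:
        # imap_unordered hands out one item at a time and yields results
        # as soon as they are ready.
        return dict(pool.imap_unordered(fit_one, train.items(), chunksize=1))
```

On platforms that start workers by spawning a fresh interpreter, call `train_all(train)` from inside an `if __name__ == "__main__":` block. If memory still runs out, lowering `processes` is the simplest knob to try.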

Any way to efficiently stack/ensemble pre-trained models for image classification?

I am trying to stack a few pre-trained models by taking the last hidden layer of each model, concatenating them, and feeding the result into a meta-learner model (e.g. XGBoost).
I am running into a big problem: I have to process each image of my dataset multiple times, since each base model requires a different processing method. This makes training take so long that it is infeasible. Is there any way to work around this?
For example:
model_1, processor_1 = pretrained_model(), pretrained_processor()
model_2, processor_2 = pretrained_model2(), pretrained_processor2()
for img in images:
    input_1 = processor_1(img)
    input_2 = processor_2(img)
    out_1 = model_1(input_1)
    out_2 = model_2(input_2)
    features = torch.cat((out_1, out_2), dim=1) # concatenates hidden representations to feed into another model
Here's a recommendation if you want to process your images faster:
Note: I have not tested this.
import torch
import torch.nn as nn
# Wrap both base models in a single nn.Module
class StackedModel(nn.Module):
    def __init__(self, model1, model2):
        super().__init__()
        self.model1 = model1
        self.model2 = model2
    def forward(self, input_1, input_2):
        out_1 = self.model1(input_1)
        out_2 = self.model2(input_2)
        # Concatenate the hidden representations along the feature dimension
        return torch.cat((out_1, out_2), dim=1)
# Init model
model = StackedModel(model_1, model_2)
# Stack the preprocessed images and run them in larger batches, assuming you have spare GPU memory
stacked_preproc1 = []
stacked_preproc2 = []
max_batch_size = 16
total_output = []
for index, img in enumerate(images):
    stacked_preproc1.append(processor_1(img))
    stacked_preproc2.append(processor_2(img))
    # Run the model once a batch is full or the last image has been reached
    if len(stacked_preproc1) == max_batch_size or index == len(images) - 1:
        # No gradients are needed when the base models only extract features
        with torch.no_grad():
            total_output.append(model(torch.stack(stacked_preproc1), torch.stack(stacked_preproc2)))
        # Reset the buffers
        stacked_preproc1 = []
        stacked_preproc2 = []

XGBoost Hyperparameter Tuning using Hyperopt

I am trying to tune my XGBClassifier model, but I am failing to do so. Please find the code below and help me clean it up.
import csv
from hyperopt import STATUS_OK
from timeit import default_timer as timer
N_FOLDS = 10
def objective(params, n_folds = N_FOLDS):
    """Objective function for Gradient Boosting Machine Hyperparameter Optimization"""
    # Keep track of evals
    # Retrieve the subsample if present otherwise set to 1.0
    subsample = params['boosting_type'].get('subsample', 1.0)
    # Extract the boosting type
    params['boosting_type'] = params['boosting_type']['boosting_type']
    params['subsample'] = subsample
    # Make sure parameters that need to be integers are integers
    for parameter_name in ['num_leaves', 'subsample_for_bin',
        params[parameter_name] = int(params[parameter_name])
    start = timer()
    # Perform n_folds cross validation
    cv_results =, train_set, num_boost_round = 10000,
                 nfold = n_folds, early_stopping_rounds = 100,
                 metrics = 'auc', seed = 50)
    run_time = timer() - start
    # Extract the best score
    best_score = np.max(cv_results['auc-mean'])
    # Loss must be minimized
    loss = 1 - best_score
    # Boosting rounds that returned the highest cv score
    n_estimators = int(np.argmax(cv_results['auc-mean']) + 1)
    # Write to the csv file ('a' means append)
    of_connection = open(out_file, 'a')
    writer = csv.writer(of_connection)
    writer.writerow([loss, params, ITERATION, n_estimators,
    # Dictionary with information for evaluation
    return {'loss': loss, 'params': params, 'iteration': ITERATION,
            'estimators': n_estimators, 'train_time': run_time,
            'status': STATUS_OK}
I believe I am doing something wrong in the objective function, as I adapted it from an objective function written for LightGBM.
Please help me.
I created the hgboost library which provides XGBoost Hyperparameter Tuning using Hyperopt.
pip install hgboost
Examples can be found here
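For reference, the step in the objective above that turns cross-validation results into a Hyperopt loss can be isolated as follows. This is a minimal, library-free sketch: `summarize_cv` is a made-up helper, and its input stands for the per-round mean AUC that the objective reads as `cv_results['auc-mean']`. That key comes from the LightGBM version of the code, so the name under which another library's CV call reports it may differ and is worth checking, as are LightGBM-specific parameters such as `boosting_type`, `num_leaves` and `subsample_for_bin`.

```python
def summarize_cv(mean_auc_per_round):
    """Turn per-round mean AUC into a loss to minimize and a round count."""
    scores = list(mean_auc_per_round)
    # Index of the first round with the highest AUC (same convention as argmax)
    best_round = max(range(len(scores)), key=scores.__getitem__)
    best_score = scores[best_round]
    return {
        "loss": 1.0 - best_score,        # Hyperopt minimizes, AUC is maximized
        "estimators": best_round + 1,    # rounds are counted from 1
    }
```

The returned dictionary would be merged with `'status': STATUS_OK` and the remaining bookkeeping fields before being handed back to Hyperopt.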

PicklingError when running pipeline in sklearn

I am running the following pipeline in sklearn to perform grid search:
def logistic():
    sm = SMOTE()
    poly = polynomial_transform()
    stand = StandardScaler()
    pca = PCA()
    #I need to think about how I am training this...
    logistic = LogisticRegression(max_iter=100, tol=0.01, solver='saga') #what is my binary scorer here
    pipe = Pipeline(steps=[('smt', sm), ('poly', poly), ('standardise', stand), ('pca', pca), ('logistic', logistic)], memory=mem)
    # Parameters of pipelines can be set using ‘__’ separated parameter names:
    param_grid = {
        #'poly__degree': [1,2],
        'logistic__C': [0.0001],
    }
    scorers = {
        'precision_score': make_scorer(precision_score),
        'recall_score': make_scorer(recall_score),
        'accuracy_score': make_scorer(accuracy_score),
        'f1_score': make_scorer(f1_score),
        'tp': make_scorer(tp),
        'tn': make_scorer(tn),
        'fp': make_scorer(fp),
        'fn': make_scorer(fn),
        'ck': make_scorer(cohen_kappa_score)
    }
    #performing grid search
    search = GridSearchCV(pipe, param_grid, n_jobs=-1, verbose=2, scoring=scorers, refit='accuracy_score')
    search.fit(X_train, y_train.to_numpy().ravel())  # first argument is missing in the post; X_train is an assumed name
    #results of cross validation grid search
    print("Best parameter (CV score=%0.3f):" % search.best_score_)
    return search
polynomial_transform is a class I have created myself:
class polynomial_transform(BaseEstimator):
    def __init__(self, degree=None):
        self.degree = degree
    def fit(self, X, y=None):
        return self
    #Method that describes what we need this transformer to do
    def transform(self, X, y=None):
        for i in range(1,
            X = np.hstack((X, X**i))
        return X
    def set_params(self, degree):
        self.degree = degree
    def get_params(self, deep=True):
        return {'degree': self.degree}
When setting the memory parameter in the pipeline I am getting the following error:
FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
_pickle.PicklingError: ("Can't pickle <class '__main__.polynomial_transform'>: it's not found as __main__.polynomial_transform"
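The message says that pickle could not look the class up under the name `__main__.polynomial_transform`. That typically happens when the class lives in the script or notebook being run while worker processes (`n_jobs=-1`) or the cache behind `memory` have to serialize the transformer. One thing that may help is moving the class into its own importable module and letting scikit-learn supply `get_params` and `set_params`. Below is a minimal sketch under those assumptions; the class name `PolynomialTransform` is made up here, and the exact features it produces are a guess, since the loop bound is cut off in the post.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class PolynomialTransform(BaseEstimator, TransformerMixin):
    """Stack element-wise powers of X from 1 up to `degree` side by side."""

    def __init__(self, degree=2):
        # Store constructor arguments unchanged so that the inherited
        # get_params/set_params (and therefore cloning in GridSearchCV) work.
        self.degree = degree

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless
        return self

    def transform(self, X):
        X = np.asarray(X)
        # Columns of X, then X**2, ..., up to X**degree
        return np.hstack([X ** i for i in range(1, self.degree + 1)])
```

With the class saved in a separate file, for example a hypothetical `custom_transformers.py`, the pipeline script would use `from custom_transformers import PolynomialTransform`, so every process can import the class by name.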

How to prevent Keras predict_generator from shuffling data?

I created a deep learning model, and I want to check the performance of the model by using predict_generator. I am using the following code which compares the images' labels with the predicted classes and then returns the prediction error.
validation_generator = validation_datagen.flow_from_directory(
target_size=(image_size, image_size),
# Get the filenames from the generator
fnames = validation_generator.filenames
# Get the ground truth from generator
ground_truth = validation_generator.classes
# Get the label to class mapping from the generator
label2index = validation_generator.class_indices
# Getting the mapping from class index to class label
idx2label = dict((v,k) for k,v in label2index.items())
# Get the predictions from the model using the generator
predictions = model.predict_generator(validation_generator, steps=validation_generator.samples/validation_generator.batch_size,verbose=1)
predicted_classes = np.argmax(predictions,axis=1)
errors = np.where(predicted_classes != ground_truth)[0]
print("No of errors = {}/{}".format(len(errors),validation_generator.samples))
# Show the errors
for i in range(len(errors)):
    pred_class = np.argmax(predictions[errors[i]])
    pred_label = idx2label[pred_class]
    title = 'Original label:{}, Prediction :{}, confidence : {:.3f}'.format(
    original = load_img('{}/{}'.format(validation_dir,fnames[errors[i]]))
validation_generator.classes is in order, but predicted_classes is not.
I took the code from here
How can I prevent predict_generator from shuffling data?
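It is the generator, not predict_generator, that shuffles: `flow_from_directory` has a `shuffle` argument that defaults to `True`. Passing `shuffle=False` when creating `validation_generator` should keep the predictions in the same order as `validation_generator.classes` and `validation_generator.filenames`. A related detail is `steps`: `samples / batch_size` is a float whenever the sample count is not a multiple of the batch size, so rounding up keeps the last partial batch. A small sketch of that calculation, with the helper name `prediction_steps` made up here:

```python
import math


def prediction_steps(num_samples, batch_size):
    """Number of batches needed to see every sample exactly once."""
    return math.ceil(num_samples / batch_size)
```

It would be used as `steps=prediction_steps(validation_generator.samples, validation_generator.batch_size)` in the call to `predict_generator`.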
