How would you do RandomizedSearchCV with VotingClassifier in sklearn?

I'm trying to tune my voting classifier, and I wanted to use randomized search in sklearn. However, how can I set up the parameter lists for my voting classifier, given that it currently combines two different tree algorithms?
Do I have to run a separate randomized search for each one and then combine them in the voting classifier later?
Could someone help? Code examples would be highly appreciated :)
Thanks!

You can perfectly well combine the VotingClassifier with RandomizedSearchCV; there is no need to run them separately. See the documentation: http://scikit-learn.org/stable/modules/ensemble.html#using-the-votingclassifier-with-gridsearch
The trick is to prefix each entry in your params dict with the name you gave the estimator. For example, if you created a RandomForest estimator as ('rf', clf2), then you set up its parameters in the form <name>__<param>. Specific example: rf__n_estimators: [20, 200] refers to that specific estimator and lists the values to test for that specific param.
Ready-to-test executable code example ;)
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV  # sklearn.grid_search was removed in 0.20

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])

clf1 = DecisionTreeClassifier()
clf2 = RandomForestClassifier(random_state=1)

# prefix each param with the name given to its estimator below
params = {'dt__max_depth': [5, 10], 'rf__n_estimators': [20, 200]}

eclf = VotingClassifier(estimators=[('dt', clf1), ('rf', clf2)], voting='hard')
# cv=3 because this toy set has only 3 samples per class
random_search = RandomizedSearchCV(eclf, param_distributions=params, n_iter=4, cv=3)
random_search.fit(X, y)
print(random_search.cv_results_)  # grid_scores_ was removed; cv_results_ replaces it
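After fitting, the winning combination and its score can be read off directly; a short follow-up, assuming the random_search object from above:

print(random_search.best_params_)  # e.g. {'rf__n_estimators': 200, 'dt__max_depth': 5}
print(random_search.best_score_)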

Related

OperatorNotAllowedInGraphError: Iterating over a symbolic `tf.Tensor` is not allowed when using a dataset with tuples

I am trying to create my own transformer with TensorFlow, and of course I want to train it. For that purpose I use a dataset to handle my data. The data is created by a code snippet from the TensorFlow Dataset.from_tensor_slices() method documentation. Nevertheless, TensorFlow gives me the following error when I call the fit() method:
"OperatorNotAllowedInGraphError: Iterating over a symbolic tf.Tensor is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature."
Here is the code that I am using:
import numpy as np
import tensorflow as tf

batched_features = tf.constant([[[1, 3], [2, 3]],
                                [[2, 1], [1, 2]],
                                [[3, 3], [3, 2]]], shape=(3, 2, 2))
batched_labels = tf.constant([['A', 'A'],
                              ['B', 'B'],
                              ['A', 'B']], shape=(3, 2, 1))
dataset = tf.data.Dataset.from_tensor_slices((batched_features, batched_labels))
dataset = dataset.batch(1)
for element in dataset.as_numpy_iterator():
    print(element)

class MyTransformer(tf.keras.Model):
    def __init__(self):
        super().__init__()

    def call(self, inputs, training):
        print(type(inputs))
        feature, label = inputs
        return feature

model = MyTransformer()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.BinaryCrossentropy(),
              metrics=[tf.keras.metrics.BinaryAccuracy(),
                       tf.keras.metrics.FalseNegatives()])
model.fit(dataset, batch_size=1, epochs=1)  # the original passed train_data, presumably this dataset
The code is reduced significantly just for the purpose of reproducing the issue.
I've tried passing the data as a dictionary instead of a tuple, and a couple more things, but nothing worked. It seems that I am missing something.
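A likely cause, hedged since only this reduced example is available: when fit() receives a dataset of (features, labels) tuples, Keras unpacks the tuple itself and passes only the features tensor to call(); the line feature, label = inputs then tries to iterate over a single symbolic tensor, which raises exactly this error in graph mode. A minimal sketch of the corresponding change:

class MyTransformer(tf.keras.Model):
    def call(self, inputs, training=None):
        # inputs is just the features tensor; Keras routes the labels to the loss
        return inputs

Note that the string labels 'A'/'B' would separately need to be encoded as numbers before BinaryCrossentropy can be computed.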

Analyzing underfitting and overfitting in Machine learning Model

The snapshot below shows my code to get the MSE and score of my model during training and testing. From the code, could it be assumed:
Looking at the RandomForestRegressor, does it really show that the model is not performing well on the training set? Because the MSE is high on the training set and low on the test set. Can we say the model is underfitting?
Likewise, the XGBRegressor has low training error and high test error. Does this mean the model is overfitting?
[snapshot: training and testing MSE and scores for both models]
Both the RF and XGB regressors have issues with overfitting. Use cross-validation and hyperparameter tuning to address these issues. For example,
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=100)

# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}

# Create a base model
rf = RandomForestRegressor()

# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2)

# Fit the grid search to the data
grid_search.fit(X, y)
grid_search.best_params_
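Since the diagnosis rests on the gap between training and held-out performance, a hedged follow-up sketch (reusing X, y and grid_search from above) would compare cross-validated scores for the default and the tuned forest:

from sklearn.model_selection import cross_val_score

# Cross-validated R^2; a model that scores much worse here than on its own
# training data is showing the usual signature of overfitting.
baseline = cross_val_score(RandomForestRegressor(), X, y, cv=3)
tuned = cross_val_score(grid_search.best_estimator_, X, y, cv=3)
print(baseline.mean(), tuned.mean())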

Does the leave-one-out algorithm form a linear prediction?

I am using the leave-one-out algorithm, with code that I found here.
I'm copying the code below:
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from numpy import mean
from numpy import absolute
import pandas as pd

df = pd.DataFrame({'y': [6, 8, 12, 14, 14, 15, 17, 22, 24, 23],
                   'x1': [2, 5, 4, 3, 4, 6, 7, 5, 8, 9],
                   'x2': [14, 12, 12, 13, 7, 8, 7, 4, 6, 5]})

# define predictor and response variables
X = df[['x1', 'x2']]
y = df['y']

# define cross-validation method to use
cv = LeaveOneOut()

# build multiple linear regression model
model = LinearRegression()

# use LOOCV to evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error',
                         cv=cv, n_jobs=-1)

# view mean absolute error
mean(absolute(scores))
I have two questions regarding this method:
How does the model form a prediction from the data (apart from the one data point that it excludes)? Is it linear regression?
From what I understand, the error is calculated as the sum of (actual value - predicted value)^2. Is there any way that I could modify the code so that the error becomes the sum of [(actual value - predicted value)/actual value]^2?
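One way the second point could be done (a sketch, not the only option) is a custom scorer built with sklearn.metrics.make_scorer; the function name relative_squared_error below is made up for illustration, and model, X, y, cv are the objects defined above:

import numpy as np
from sklearn.metrics import make_scorer

def relative_squared_error(y_true, y_pred):
    # sum of ((actual - predicted) / actual)^2 over the fold, as described in the question
    return np.sum(((y_true - y_pred) / y_true) ** 2)

rse_scorer = make_scorer(relative_squared_error, greater_is_better=False)
scores = cross_val_score(model, X, y, scoring=rse_scorer, cv=cv, n_jobs=-1)

(And as to the first point: with LeaveOneOut, each fold fits an ordinary LinearRegression on the nine remaining rows and predicts the single held-out one.)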

How to pass the cv argument for randomsearchcv in sklearn if we have train, test, validate [duplicate]

Similar to Custom cross validation split sklearn, I want to define my own splits for GridSearchCV, for which I need to customize the built-in cross-validation iterator.
I want to pass my own set of train-test indices for cross-validation to the grid search instead of letting the iterator determine them for me. I went through the available cv iterators on the sklearn documentation page but couldn't find one.
For example, I want to implement something like this:
Data has 9 samples.
For 2-fold CV I create my own set of training-testing indices:
>>> train_indices = [[1,3,5,7,9], [2,4,6,8]]  # 1st fold, 2nd fold
>>> test_indices = [[2,4,6,8], [1,3,5,7,9]]   # 1st fold, 2nd fold
>>> custom_cv = sklearn.cross_validation.customcv(train_indices, test_indices)
>>> clf = GridSearchCV(X, y, params, cv=custom_cv)
What can be used to work like customcv?
Actually, cross-validation iterators are just that: iterators. They yield a (train indices, test indices) tuple at each iteration. This should then work for you:
custom_cv = list(zip(train_indices, test_indices))  # list() so the folds survive more than one pass
Also, for the specific case you are mentioning, you can derive the folds from group labels with LeaveOneGroupOut (the modern replacement for the old sklearn.cross_validation.LeaveOneLabelOut):
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
labels = np.arange(0, 10) % 2
cv = LeaveOneGroupOut()
Observe that list(cv.split(np.zeros((10, 1)), groups=labels)) yields
[(array([1, 3, 5, 7, 9]), array([0, 2, 4, 6, 8])),
 (array([0, 2, 4, 6, 8]), array([1, 3, 5, 7, 9]))]
Note that what the cv argument ultimately needs is an iterable of (train_indices, test_indices) tuples, one tuple per fold:
[(train_indices, test_indices)]  # one fold
[(train_indices, test_indices),  # 1st fold
 (train_indices, test_indices)]  # 2nd fold, etc.
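Putting it together, a minimal sketch of handing such custom folds to GridSearchCV (the estimator, grid, and data below are placeholders, not from the question):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X = np.random.rand(9, 3)  # 9 samples, as in the question
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0])

# explicit 2-fold split, with 0-based indices
custom_cv = [(np.array([0, 2, 4, 6, 8]), np.array([1, 3, 5, 7])),
             (np.array([1, 3, 5, 7]), np.array([0, 2, 4, 6, 8]))]

clf = GridSearchCV(RandomForestClassifier(), {'n_estimators': [10, 50]}, cv=custom_cv)
clf.fit(X, y)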

How should I split my data for cross validation and grid search?

Should I split my data into two similarly sized parts and use one half for each task, or should I run the grid search on my whole data and then cross-validate again on the whole data to check my accuracy?
You need to split the data into train and test sets (80:20, e.g. with train_test_split in sklearn), then fit the model on the train data and check the accuracy. If it's not what you expect, you can try hyperparameter tuning.
You can do this with GridSearchCV, where you fit the desired estimator (depending on the type of problem) over a grid of parameter values.
Attached is a sample code:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_depth': [50, 55, 60, 65],
    'max_features': ['sqrt', 2, 3],  # 'auto' was removed in recent sklearn versions
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 3, 4],
    'n_estimators': [60, 65, 70, 75]
}

rfcv = RandomForestRegressor(random_state=0)  # the estimator being tuned (left undefined in the original)
# X_train, Y_train come from the train/test split described above
grid_search = GridSearchCV(estimator=rfcv, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, Y_train)
grid_search.best_params_
Based on the best-parameter results, you can fine-tune the grid search.
E.g., if the best value for n_estimators is near 60, narrow the grid to values around 60, like [55, 58, 60, 62], to figure out the exact value.
Then build the machine learning model with the best parameter values, evaluate the accuracy on the train data, and predict the result on the test data.
rf = RandomForestRegressor(n_estimators=70, random_state=0, min_samples_split=2,
                           min_samples_leaf=1, max_features='sqrt',
                           bootstrap=True, max_depth=65)
regressor = rf.fit(X_train, Y_train)
pred_tuned = regressor.predict(X_test)
You should find an improvement in your accuracy!
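To quantify that improvement, the tuned predictions can be scored against the held-out targets; a short sketch, assuming Y_test comes from the same split as X_test:

from sklearn.metrics import mean_squared_error, r2_score

print(mean_squared_error(Y_test, pred_tuned))  # lower is better
print(r2_score(Y_test, pred_tuned))            # closer to 1 is better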
