The snapshot below shows the code I use to get the MSE and score of my model during training and testing. From the code, can the following be assumed?
For the RandomForestRegressor, the MSE is high on the training set and low on the test set. Does this really show that the model is performing poorly on the training set? Can we say the model is underfitting?
For the XGBRegressor, I have low training error and high test error. Does this mean the model is overfitting?
Both the RF and XGB regressors appear to have issues with overfitting. Tuning the hyperparameters with cross-validation can help resolve these issues. For example:
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
X, y = make_regression(n_samples=100)
# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_depth': [80, 90, 100, 110],
    'max_features': [2, 3],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [8, 10, 12],
    'n_estimators': [100, 200, 300, 1000]
}
# Create a base model
rf = RandomForestRegressor()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=-1, verbose=2)
# Fit the grid search to the data
grid_search.fit(X, y)
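To check whether a model is over- or underfitting in the first place, it can help to compare the training and validation error on the same folds. The following is a minimal sketch, assuming synthetic data from make_regression and an untuned RandomForestRegressor; the sample size, noise level and number of folds are illustrative choices, not values from the question.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (assumption: the real data set is not available here)
X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

# return_train_score=True also reports the error on the training folds
cv_results = cross_validate(
    RandomForestRegressor(n_estimators=100, random_state=0),
    X, y,
    cv=5,
    scoring="neg_mean_squared_error",
    return_train_score=True,
)

# Scores are negated MSE values, so flip the sign back
train_mse = -cv_results["train_score"].mean()
val_mse = -cv_results["test_score"].mean()
print("mean train MSE:", train_mse)
print("mean validation MSE:", val_mse)
```

A training MSE far below the validation MSE points towards overfitting; two similarly high values point towards underfitting.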
I am using leave-one-out cross-validation with code that I found online.
I'm copying the code below:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from numpy import mean
from numpy import absolute
from numpy import sqrt
import pandas as pd
df = pd.DataFrame({'y': [6, 8, 12, 14, 14, 15, 17, 22, 24, 23],
'x1': [2, 5, 4, 3, 4, 6, 7, 5, 8, 9],
'x2': [14, 12, 12, 13, 7, 8, 7, 4, 6, 5]})
#define predictor and response variables
X = df[['x1', 'x2']]
y = df['y']
#define cross-validation method to use
cv = LeaveOneOut()
#build multiple linear regression model
model = LinearRegression()
#use LOOCV to evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error',
cv=cv, n_jobs=-1)
#view mean absolute error
print(mean(absolute(scores)))
I have two questions regarding this method:
How does the model form a prediction from the data (apart from the one data point that it excludes)? Is it linear regression?
From what I understand, the error is calculated to be the sum of (actual value-predicted value)^2. Is there any way that I could modify the code such that the error could become the sum of [(actual value-predicted value)/actual value]^2?
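One possible way to approach both questions is sketched below. In each of the 10 folds, cross_val_score fits the estimator it is given (here LinearRegression) on the other 9 rows and predicts the held-out row. The error function squared_relative_error is a hypothetical helper written for this sketch, wrapped with make_scorer; it is not a built-in scikit-learn metric, and it assumes no actual value is zero.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import LeaveOneOut, cross_val_score

def squared_relative_error(y_true, y_pred):
    """Sum of [(actual - predicted) / actual]^2 (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum(((y_true - y_pred) / y_true) ** 2))

# greater_is_better=False makes scikit-learn negate the value,
# because scorers follow the convention "higher is better"
scorer = make_scorer(squared_relative_error, greater_is_better=False)

df = pd.DataFrame({'y': [6, 8, 12, 14, 14, 15, 17, 22, 24, 23],
                   'x1': [2, 5, 4, 3, 4, 6, 7, 5, 8, 9],
                   'x2': [14, 12, 12, 13, 7, 8, 7, 4, 6, 5]})
X = df[['x1', 'x2']]
y = df['y']

# One score per left-out row; each is the negated squared relative error
scores = cross_val_score(LinearRegression(), X, y, scoring=scorer, cv=LeaveOneOut())

# Undo the negation and add up the per-row errors
total_error = -scores.sum()
print(total_error)
```

Because every leave-one-out fold contains a single test row, summing the per-fold values gives the sum over all rows.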
I tried to tune the hyperparameters, but afterwards the accuracy score had not changed at all. What did I do wrong?
# Log reg
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=0.3326530612244898,max_iter=100,tol=0.01)
logreg.fit(X_train,y_train)
from sklearn.metrics import confusion_matrix
y_pred = logreg.predict(X_test)
print('Accuracy of log reg is: ', logreg.score(X_test,y_test))
# 0.9181286549707602 - accuracy before tuning
Accuracy of log reg is: 0.9181286549707602
confusion_matrix(y_test, y_pred)
array([[ 54, 9],
[ 5, 103]])
Here is my attempt using GridSearchCV:
from sklearn.model_selection import GridSearchCV
params ={'tol':[0.01,0.001,0.0001],
grid_model = GridSearchCV(logreg,param_grid=params,cv=5)
grid_model_result = grid_model.fit(X_train,y_train)
0.8867405063291139 {'C': 0.3326530612244898, 'max_iter': 100, 'tol': 0.01}
The problem is that the two numbers measure different things. In the first chunk you evaluate the model's performance on the test set, while the score reported by GridSearchCV (presumably best_score_) is the mean cross-validated accuracy on the training set during hyperparameter optimization.
The code below shows that both procedures, when used to predict the test set labels, perform equally well in terms of accuracy (~0.93).
Note that you might want to consider a hyperparameter grid with other solvers and a larger range of max_iter, because I obtained convergence warnings.
# Load packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
# Load the dataset and split in X and y
df = pd.read_csv('Breast_cancer_data.csv')
X = df.iloc[:, 0:5]
y = df.iloc[:, 5]
# Perform train and test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize a model
Log = LogisticRegression(n_jobs=-1)
# Initialize a parameter grid
params = [{'tol':[0.01,0.001,0.0001],
# Perform GridSearchCV and store the best parameters
grid_model = GridSearchCV(Log,param_grid=params,cv=5)
grid_model_result = grid_model.fit(X_train,y_train)
best_param = grid_model_result.best_params_
# This step is only to prove that both procedures actually result in the same accuracy score
Log2 = LogisticRegression(C=best_param['C'], max_iter=best_param['max_iter'], tol=best_param['tol'], n_jobs=-1)
Log2.fit(X_train, y_train)
# Perform two predictions one straight from the GridSearch and the other one with manually inputting the best params
y_pred1 = grid_model_result.best_estimator_.predict(X_test)
y_pred2 = Log2.predict(X_test)
# Compare the accuracy scores and see that both are the same
print("Accuracy:",metrics.accuracy_score(y_test, y_pred1))
print("Accuracy:",metrics.accuracy_score(y_test, y_pred2))
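To make the difference between the two scores visible side by side, here is a small sketch under a few assumptions: it uses the breast cancer data set bundled with scikit-learn instead of the CSV file from the question, it scales the features in a pipeline (one common way to reduce convergence warnings), and the grid over C is purely illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled data set (assumption: stands in for Breast_cancer_data.csv)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# make_pipeline names the steps after the lowercased class names,
# so the classifier's parameters are prefixed with "logisticregression__"
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}  # illustrative grid

grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)

# Mean accuracy over the 5 validation folds of the training set
cv_accuracy = grid.best_score_
# Accuracy of the refitted best model on the held-out test set
test_accuracy = grid.score(X_test, y_test)
print("cross-validated accuracy:", cv_accuracy)
print("test accuracy:", test_accuracy)
```

The two numbers are computed on different data, so they will generally differ even when nothing is wrong.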
Should I split my data into two parts of similar size and use one half for each task, or should I do the grid search on my whole data and then run cross-validation again on the whole data to check my accuracy?
You need to split the data into train and test sets (80:20), e.g. with train_test_split in sklearn, then fit the model on the training data and check the accuracy. If it's not what you expect, you can try hyperparameter tuning.
You can do this with GridSearchCV, to which you pass the desired estimator (depending on the type of problem) and the parameter values to try.
Sample code:
from sklearn.model_selection import GridSearchCV
# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_depth': [50, 55, 60, 65],
    # 'auto' was removed in scikit-learn 1.3; for regressors it meant all features (1.0)
    'max_features': [1.0, 'sqrt', 2, 3],
    'min_samples_leaf': [1, 2, 3],
    'min_samples_split': [2, 3, 4],
    'n_estimators': [60, 65, 70, 75]
}
# rfcv is the estimator to tune (assumed here to be a random forest regressor instance)
grid_search = GridSearchCV(estimator=rfcv, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, Y_train)
Based on the best parameter results, you can fine-tune the grid search.
E.g., if the best value for n_estimators is near 60, narrow the grid to values around 60, such as [50, 55, 60, 65], to pin down the best value.
Then build the machine learning model using the best parameter values. Evaluate the accuracy on the training data, and then predict using the test data.
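from sklearn.ensemble import RandomForestRegressor
# rgf in the original post is assumed to be an alias for RandomForestRegressor
rf = RandomForestRegressor(n_estimators=70, random_state=0, min_samples_split=2, min_samples_leaf=1, max_features='sqrt', bootstrap=True, max_depth=65)
regressor = rf.fit(X_train, Y_train)
pred_tuned = regressor.predict(X_test)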
You may well see an improvement in your accuracy!
I'm using Random Forest Regressor to fit a 10-dimensional regression problem with around 300 thousand samples. Although not necessary when dealing with Random Forest I started by putting the data on the same scale (by using preprocessing of sklearn) and then I did a randomised search over the following parameter space:
n_estimators=[int(x) for x in linspace (start=100, stop= 2000, num=11)]
max_features= auto, sqrt
max_depth= from 1- to 150 with step =11
Bootstrap true or false
Moreover, after getting the best parameters I did a second narrower search.
Though I am using a 10-Fold cross validation scheme with the random search I'm still getting a serious overfitting problem!
Moreover, I have also tried using DBSCAN algorithm to check for outliers. After excluding some parts of the dataset I got even worse results!
Should I include other parameters of the Random Forest in the randomised search? or should I apply some more preprocessing techniques on the data set before fitting?
For convenience, this is the implementation I wrote:
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
n_estimators = [int(x) for x in np.linspace(start=1, stop=15, num=15)]
# 'auto' was removed in scikit-learn 1.3; for regressors it meant all features (1.0)
max_features = [1.0, 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
min_samples_split = [2, 5, 10, 12]
min_samples_leaf = [1, 2, 4, 6]
bootstrap = [True, False]
cv = ShuffleSplit(n_splits=10, test_size=0.01, random_state=0)
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=50, cv=cv, verbose=2, random_state=42,
                               n_jobs=32)
rf_random.fit(X_train, y_train)
The best parameters returned by the randomized search:
bootstrap: False. min_samples_leaf=2. n_estimators=1647. max_features: sqrt. min_samples_split=3. max_depth: None.
The range of the target is from 0 to 10000 [unit]. This model gives an RMSE of 6.98 [unit] on the training set and an average RMSE of 67.54 [unit] on the test sets.
The problem is this line:
max_depth= from 1- to 150 with step =11
For a 10-feature problem, the optimum depth is likely under 10. You are overfitting badly because of that. Consider searching max_depth from 1 to 15 with step 1.
This should help reduce the variance. With a step of 11, the search skips almost all of the shallow depths (after 1 the next candidate is already 12), which undermines any other effort you make.
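The suggested search over shallow depths could look roughly like the sketch below. It runs on synthetic data from make_regression, and the number of trees, the other parameter ranges, n_iter and the 5-fold split are illustrative assumptions rather than values from the question. Setting return_train_score=True lets you compare the training and validation RMSE for the selected candidate.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic 10-feature stand-in for the real data (assumption)
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

param_distributions = {
    "max_depth": list(range(1, 16)),   # 1 to 15 with step 1, as suggested above
    "min_samples_leaf": [1, 2, 4, 6],
    "max_features": ["sqrt", 1.0],
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=50, random_state=0),
    param_distributions=param_distributions,
    n_iter=8,
    cv=5,
    scoring="neg_root_mean_squared_error",
    return_train_score=True,
    random_state=42,
)
search.fit(X, y)

# Scores are negated RMSE values, so flip the sign back
best = search.best_index_
train_rmse = -search.cv_results_["mean_train_score"][best]
val_rmse = -search.cv_results_["mean_test_score"][best]
print(search.best_params_)
print("train RMSE:", train_rmse, "validation RMSE:", val_rmse)
```

A large gap between the two RMSE values for the best candidate suggests the model is still overfitting.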
I'm trying to tune my voting classifier and wanted to use randomized search in sklearn. However, how do I set the parameter lists for the voting classifier, since I currently use two algorithms (different tree algorithms)?
Do I have to run randomized search separately for each and combine them in the voting classifier afterwards?
Could someone help? Code examples would be highly appreciated :)
You can combine VotingClassifier with RandomizedSearchCV directly. There is no need to run them separately. See the scikit-learn documentation for VotingClassifier.
The trick is to prefix each entry in your params list with the estimator's name. For example, if you added a RandomForest estimator as ('rf', clf2), then you set its parameters in the form <name>__<param>. Specific example: rf__n_estimators: [20, 200] refers to the 'rf' estimator and sets the values to test for its n_estimators parameter.
Here is a ready-to-run code example ;)
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
# RandomizedSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed)
from sklearn.model_selection import RandomizedSearchCV
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
clf1 = DecisionTreeClassifier()
clf2 = RandomForestClassifier(random_state=1)
params = {'dt__max_depth': [5, 10], 'rf__n_estimators': [20, 200]}
eclf = VotingClassifier(estimators=[('dt', clf1), ('rf', clf2)], voting='hard')
# cv=3 because each class has only 3 samples; the current default of 5 folds would fail
random_search = RandomizedSearchCV(eclf, param_distributions=params, n_iter=4, cv=3)
random_search.fit(X, y)
print(random_search.best_params_)
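If you are unsure which prefixed names are valid, one option is to ask the ensemble itself through get_params(). This is a small sketch reusing the 'dt' and 'rf' names from the example above; the filtering by prefix is just for readability.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier

eclf = VotingClassifier(
    estimators=[('dt', DecisionTreeClassifier()),
                ('rf', RandomForestClassifier(random_state=1))],
    voting='hard',
)

# get_params() lists every tunable parameter, including the nested
# ones in the <name>__<param> form expected by the search classes
param_names = sorted(eclf.get_params().keys())
tunable = [name for name in param_names if name.startswith(('dt__', 'rf__'))]
print(tunable)
```

Any of the printed names can be used as a key in param_distributions.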