I was trying to deploy my custom DecisionTreeRegressor for house price prediction to GCP Vertex AI, following the MPG dataset tutorial.
However, when I tried to build and test the container locally using the commands:
docker build ./ -t $IMAGE_URI
docker run $IMAGE_URI
The following error came out:
AttributeError: 'DecisionTreeRegressor' object has no attribute 'save'
Here is the train.py I run:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
# Load the Boston housing dataset
data = pd.read_csv('trainer/housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis = 1)
# Import 'train_test_split'
from sklearn.model_selection import train_test_split
# Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state = 42)
#Defining model fitting and tuning functions
# Import 'make_scorer', 'DecisionTreeRegressor', and 'GridSearchCV'
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import r2_score # Import 'r2_score'
from sklearn.metrics import accuracy_score
# TODO: replace `your-gcs-bucket` with the name of the Storage bucket you created earlier
BUCKET = 'gs://gardena-dps-bucket'
def performance_metric(y_true, y_predict):
    """ Calculates and returns the performance score between
    true (y_true) and predicted (y_predict) values based on the metric chosen. """
    score = r2_score(y_true, y_predict)
    # Return the score
    return score
def fit_model(X, y):
    """ Performs grid search over the 'max_depth' parameter for a
    decision tree regressor trained on the input data [X, y]. """
    # Create cross-validation sets from the training data
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
    # Create a decision tree regressor object
    regressor = DecisionTreeRegressor()
    # Create a dictionary for the parameter 'max_depth' with a range from 1 to 10
    params = {'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
    # Transform 'performance_metric' into a scoring function using 'make_scorer'
    scoring_fnc = make_scorer(performance_metric)
    # Create the grid search cv object --> GridSearchCV()
    # Make sure to include the right parameters in the object:
    # (estimator, param_grid, scoring, cv) which have values 'regressor', 'params', 'scoring_fnc', and 'cv_sets' respectively.
    grid = GridSearchCV(estimator=regressor, param_grid=params, scoring=scoring_fnc, cv=cv_sets)
    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)
    # Return the optimal model after fitting the data
    return grid.best_estimator_
# Fit the training data to the model using grid search
reg = fit_model(X_train, y_train)
# Produce a matrix for client data
client_data = [[12, 26.3, 16.99885]] # Client data in 2D array
# Show predictions
reprice = reg.predict(client_data).astype(int)
reprice
# Export model and save to GCS (this is the line that raises the AttributeError above)
reg.save(BUCKET + '/housing/model')
Scikit-learn estimators do not provide any method to save their state directly. According to the Google documentation, the recommended way to store a fitted model in GCS is to serialize the model locally with joblib and then upload the file to GCS, as follows:
import datetime

import joblib  # sklearn.externals.joblib is deprecated; import joblib directly
from google.cloud import storage

# Export the fitted model (`reg` from train.py above) to a local file
model_filename = 'model.joblib'
joblib.dump(reg, model_filename)

# Upload the model to GCS (BUCKET_NAME is the bucket name without the gs:// prefix)
bucket = storage.Client().bucket(BUCKET_NAME)
blob = bucket.blob('{}/{}'.format(
    datetime.datetime.now().strftime('model_%Y%m%d_%H%M%S'),
    model_filename))
blob.upload_from_filename(model_filename)
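If you want to sanity-check the artifact before pointing Vertex AI at it, you can download the blob and load it back with joblib. This is a minimal sketch, not part of the original answer; the bucket name and blob path are placeholders that should match whatever the upload step above produced:
import joblib
from google.cloud import storage

# Placeholders: use your actual bucket name and the blob path created by the upload step
BUCKET_NAME = 'gardena-dps-bucket'
BLOB_PATH = 'model_20240101_000000/model.joblib'

# Download the serialized model from GCS and load it back
bucket = storage.Client().bucket(BUCKET_NAME)
bucket.blob(BLOB_PATH).download_to_filename('model_check.joblib')
restored = joblib.load('model_check.joblib')

# Run a quick test prediction with the same client data used in train.py
print(restored.predict([[12, 26.3, 16.99885]]))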
Related
I was tasked with the creation of a dataset to test the functionality of the code we're working on.
The dataset must have a group of tensors that will be used later on in a generative model.
I'm trying to save the tensors to a .pt file, but I'm overwriting the tensors, thus creating a file with only one. I've read about torch.utils.data.Dataset but I haven't been able to figure out on my own how to use it.
Here is my code:
import torch
import numpy as np
from torch.utils.data import Dataset
#variables that will be used to create the size of the tensors:
num_jets, num_particles, num_features = 1, 30, 3
for i in range(100):
    # tensor from a Gaussian dist with mean=5, std=1 and shape=size:
    tensor = torch.normal(5, 1, size=(num_jets, num_particles, num_features))
    # We will need the tensors to be of the cpu type
    tensor = tensor.cpu()
    # save the tensor to 'tensor_dataset.pt'
    torch.save(tensor, 'tensor_dataset.pt')
#open the recently created .pt file inside a list
tensor_list = torch.load('tensor_dataset.pt')
#prints the list. Just one tensor inside .pt file
print(tensor_list)
Reason: you overwrite tensor (and the saved file) on each iteration of the loop, so you never build up a list; at the end the file contains only the last tensor.
Solution: since you know the size of each tensor, you can preallocate one tensor up front and fill it by iterating over lst_tensors:
import torch
import numpy as np
from torch.utils.data import Dataset
num_jets, num_particles, num_features = 1, 30, 3
lst_tensors = torch.empty(size=(100,num_jets, num_particles, num_features))
for i in range(100):
    lst_tensors[i] = torch.normal(5, 1, size=(num_jets, num_particles, num_features))
    lst_tensors[i] = lst_tensors[i].cpu()
torch.save(lst_tensors,'tensor_dataset.pt')
tensor_list = torch.load('tensor_dataset.pt')
print(tensor_list.shape) # [100,1,30,3]
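An alternative approach (just a sketch, not from the original answer) is to collect the tensors in a plain Python list and stack them once at the end; torch.stack produces the same [100, 1, 30, 3] result:
import torch

num_jets, num_particles, num_features = 1, 30, 3

# Build a Python list of tensors, then stack them into one [100, 1, 30, 3] tensor
tensors = [torch.normal(5, 1, size=(num_jets, num_particles, num_features)).cpu()
           for _ in range(100)]
stacked = torch.stack(tensors)

torch.save(stacked, 'tensor_dataset.pt')
print(torch.load('tensor_dataset.pt').shape)  # torch.Size([100, 1, 30, 3])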
I was trying to hyperparameter-tune my model, but after I did it the accuracy score has not changed at all. What am I doing wrong?
# Log reg
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=0.3326530612244898,max_iter=100,tol=0.01)
logreg.fit(X_train,y_train)
from sklearn.metrics import confusion_matrix
y_pred = logreg.predict(X_test)
print('Accuracy of log reg is: ', logreg.score(X_test,y_test))
confusion_matrix(y_test,y_pred)
# 0.9181286549707602 - accuracy before tuning
Output:
Accuracy of log reg is: 0.9181286549707602
array([[ 54,   9],
       [  5, 103]])
Here is my GridSearchCV code:
from sklearn.model_selection import GridSearchCV
params = {'tol': [0.01, 0.001, 0.0001],
          'max_iter': [100, 150, 200],
          'C': np.linspace(1, 20) / 10}
grid_model = GridSearchCV(logreg,param_grid=params,cv=5)
grid_model_result = grid_model.fit(X_train,y_train)
print(grid_model_result.best_score_,grid_model_result.best_params_)
Output:
0.8867405063291139 {'C': 0.3326530612244898, 'max_iter': 100, 'tol': 0.01}
The problem is that in the first chunk you evaluate the model's performance on the test set, while in GridSearchCV you are only looking at the cross-validated performance on the training set after hyperparameter optimization.
The code below shows that both procedures, when used to predict the test-set labels, perform equally well in terms of accuracy (~0.93).
Note: you might want to consider a hyperparameter grid with other solvers and a larger range of max_iter, because I obtained convergence warnings.
# Load packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
# Load the dataset and split in X and y
df = pd.read_csv('Breast_cancer_data.csv')
X = df.iloc[:, 0:5]
y = df.iloc[:, 5]
# Perform train and test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize a model
Log = LogisticRegression(n_jobs=-1)
# Initialize a parameter grid
params = [{'tol': [0.01, 0.001, 0.0001],
           'max_iter': [100, 150, 200],
           'C': np.linspace(1, 20) / 10}]
# Perform GridSearchCV and store the best parameters
grid_model = GridSearchCV(Log,param_grid=params,cv=5)
grid_model_result = grid_model.fit(X_train,y_train)
best_param = grid_model_result.best_params_
# This step is only to prove that both procedures actually result in the same accuracy score
Log2 = LogisticRegression(C=best_param['C'], max_iter=best_param['max_iter'], tol=best_param['tol'], n_jobs=-1)
Log2.fit(X_train, y_train)
# Perform two predictions one straight from the GridSearch and the other one with manually inputting the best params
y_pred1 = grid_model_result.best_estimator_.predict(X_test)
y_pred2 = Log2.predict(X_test)
# Compare the accuracy scores and see that both are the same
print("Accuracy:",metrics.accuracy_score(y_test, y_pred1))
print("Accuracy:",metrics.accuracy_score(y_test, y_pred2))
I have the results of an estimator running on X, as well as the ground truth, and I want to use plot_precision_recall_curve. But that requires passing in the estimator and X, which I can't do: the estimator is very complex and resides in another system. What should I do? (It would be nice to have a version of plot_precision_recall_curve that takes in y_true and y_pred...)
You can use precision_recall_curve, which accepts y_true and the predicted scores, and returns precision, recall, and thresholds. These can then be used to compute f1_score and auc, and to plot the curve manually.
This is an example:
# precision-recall curve and f1
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import f1_score
from sklearn.metrics import auc
from matplotlib import pyplot
# generate 2 class dataset
X, y = make_classification(n_samples=1000, n_classes=2, random_state=1)
# split into train/test sets
trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2)
# fit a model
model = LogisticRegression(solver='lbfgs')
model.fit(trainX, trainy)
# predict probabilities
lr_probs = model.predict_proba(testX)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# predict class values
yhat = model.predict(testX)
lr_precision, lr_recall, _ = precision_recall_curve(testy, lr_probs)
lr_f1, lr_auc = f1_score(testy, yhat), auc(lr_recall, lr_precision)
# summarize scores
print('Logistic: f1=%.3f auc=%.3f' % (lr_f1, lr_auc))
# plot the precision-recall curves
no_skill = len(testy[testy==1]) / len(testy)
pyplot.plot([0, 1], [no_skill, no_skill], linestyle='--', label='No Skill')
pyplot.plot(lr_recall, lr_precision, marker='.', label='Logistic')
# axis labels
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
# show the legend
pyplot.legend()
# show the plot
pyplot.show()
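If you are on scikit-learn 1.0 or newer, PrecisionRecallDisplay.from_predictions takes the ground truth and the predicted scores directly, so it sidesteps the estimator/X requirement entirely. A short sketch reusing testy and lr_probs from the example above:
from sklearn.metrics import PrecisionRecallDisplay
from matplotlib import pyplot

# Builds the curve straight from y_true and the positive-class scores,
# without needing access to the fitted estimator or X
PrecisionRecallDisplay.from_predictions(testy, lr_probs, name='Logistic')
pyplot.show()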
Including the training data in SHAP TreeExplainer gives different expected_value in scikit-learn GBT Regressor.
Reproducible example (run in Google Colab):
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingRegressor
import numpy as np
import shap
shap.__version__
# 0.37.0
X, y = make_regression(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)
# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172
# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])
np.isclose(mean_pred_gbt, gbt_explainer.expected_value)
# array([ True])
# explainer with training data
gbt_data_explainer = shap.TreeExplainer(model=gbt, data=X_train) # specifying feature_perturbation does not change the result
gbt_data_explainer.expected_value
# -23.564797322079635
So, the expected value when including the training data gbt_data_explainer.expected_value is quite different from the one calculated without supplying the data (gbt_explainer.expected_value).
Both approaches are additive and consistent when used with the (obviously different) respective shap_values:
np.abs(gbt_explainer.expected_value + gbt_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True
np.abs(gbt_data_explainer.expected_value + gbt_data_explainer.shap_values(X_train).sum(1) - gbt.predict(X_train)).max() < 1e-4
# True
but I wonder why they do not provide the same expected_value, and why gbt_data_explainer.expected_value is so different from the mean value of predictions.
What am I missing here?
Apparently shap subsamples the data to 100 rows when data is passed, then runs those rows through the trees to reset the sample counts for each node. So the -23.5... being reported is the average model output over those 100 rows.
The data is passed to an Independent masker, which does the subsampling:
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_tree.py#L94
https://github.com/slundberg/shap/blob/v0.37.0/shap/explainers/_explainer.py#L68
https://github.com/slundberg/shap/blob/v0.37.0/shap/maskers/_tabular.py#L216
Running
from shap import maskers
another_gbt_explainer = shap.TreeExplainer(
    gbt,
    data=maskers.Independent(X_train, max_samples=800),
    feature_perturbation="tree_path_dependent"
)
another_gbt_explainer.expected_value
gets back to
-11.534353657511172
(with max_samples=800 the masker covers the full 800-row training set, so no subsampling happens and the recomputed expectation matches the mean training prediction again).
Though #Ben did a great job digging out how the data gets passed through the Independent masker, his answer does not show exactly (1) how base values are calculated and where the different base value comes from, and (2) how to choose/lower the max_samples param.
Where the different value comes from
The masker object has a data attribute that holds the data after the masking (subsampling) process. To get the value shown in gbt_explainer.expected_value:
from shap.maskers import Independent
gbt = GradientBoostingRegressor(random_state=0)
gbt.fit(X_train, y_train)  # fit as in the question; X_train/y_train from the original example
# mean prediction:
mean_pred_gbt = np.mean(gbt.predict(X_train))
mean_pred_gbt
# -11.534353657511172
# explainer without data
gbt_explainer = shap.TreeExplainer(gbt)
gbt_explainer.expected_value
# array([-11.53435366])
gbt_explainer = shap.TreeExplainer(gbt, Independent(X_train,100))
gbt_explainer.expected_value
# -23.56479732207963
one would need to do:
masker = Independent(X_train,100)
gbt.predict(masker.data).mean()
# -23.56479732207963
What about choosing max_samples?
Setting max_samples to the original dataset length seems to work with other explainers too:
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
import shap
from shap.maskers import Independent
from scipy.special import logit, expit
corpus,y = shap.datasets.imdb()
corpus_train, corpus_test, y_train, y_test = train_test_split(corpus, y, test_size=0.2, random_state=7)
vectorizer = TfidfVectorizer(min_df=10)
X_train = vectorizer.fit_transform(corpus_train)
model = sklearn.linear_model.LogisticRegression(penalty="l2", C=0.1)
model.fit(X_train, y_train)
explainer = shap.Explainer(
    model,
    masker=Independent(X_train, 100),
    feature_names=vectorizer.get_feature_names()
)
explainer.expected_value
# -0.18417413671991964
This value comes from:
masker=Independent(X_train,100)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[...,1])
# array([-0.18417414])
max_samples=100 seems to be a bit off from the true base_value (obtained by just feeding the feature means):
logit(model.predict_proba(X_train.mean(0).reshape(1,-1))[:,1])
# array([-0.02938039])
By increasing max_samples one might get reasonably close to the true baseline while keeping the number of samples low:
masker = Independent(X_train,1000)
logit(model.predict_proba(masker.data.mean(0).reshape(1,-1))[:,1])
# -0.05957302658674238
So, to get the base value for an explainer of interest, (1) pass explainer.data (or masker.data) through your model, and (2) choose max_samples so that the base_value on the sampled data is close enough to the true base value. You may also want to check whether the values and ordering of the SHAP importances converge.
Some may notice that to get to the base values we sometimes average the feature inputs (LogisticRegression) and sometimes the outputs (GBT).
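Putting the two patterns side by side (a sketch that only reuses names from the snippets above: gbt and X_train from the regression example, and model plus the tf-idf matrix from the text example, renamed X_train_tfidf here to avoid the name clash):
from scipy.special import logit
from shap.maskers import Independent

# GBT regressor: base value = mean of the model *outputs* over the masker data
gbt_masker = Independent(X_train, max_samples=100)
print(gbt.predict(gbt_masker.data).mean())  # ~ -23.56 for max_samples=100

# LogisticRegression: base value = model output at the *mean input* of the masker data
# (X_train_tfidf is the tf-idf matrix that the text example also calls X_train)
text_masker = Independent(X_train_tfidf, max_samples=100)
print(logit(model.predict_proba(text_masker.data.mean(0).reshape(1, -1))[:, 1]))  # ~ array([-0.184])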
I am trying to export the decision tree as an image with the original labels of all categorical fields.
The current data I have is like so:
I transformed the categorical features into numerical:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 0:4]
y = dataset.iloc[:, 4]
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
X['Outlook'] = lb.fit_transform(X['Outlook'])
X['Temp'] = lb.fit_transform(X['Temp'])
X['Humidity'] = lb.fit_transform(X['Humidity'])
X['Windy'] = lb.fit_transform(X['Windy'])
y = lb.fit_transform(y)
Afterwards, I applied the DecisionTreeClassifier:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion="entropy")
dtc.fit(X, y)
At the end, I needed to check the tree generated from the model using the following:
# Import tools needed for visualization
from sklearn.tree import export_graphviz
import pydot
# Export the tree to a dot file
export_graphviz(dtc, out_file = 'tree.dot', feature_names = X.columns, rounded = True, precision = 1)
# Use dot file to create a graph
(graph, ) = pydot.graph_from_dot_file('tree.dot')
# Write graph to a png file
graph.write_png('tree.png')
The tree.png:
But what I really need is to see the original labels of each feature inside the nodes or at each branch, instead of True/False or a numeric representation.
I tried the following:
y = lb.inverse_transform(y)
and the same for the X features, but the tree is still generated the same as above.
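One thing that helps at least with the leaf labels (a hedged sketch, not a complete fix): export_graphviz has a class_names argument, and since lb was fitted on y last, lb.classes_ still holds the original target labels. The split thresholds in the internal nodes will still refer to the encoded feature values; mapping those back would require keeping a separate LabelEncoder per column and rewriting the node labels in the dot file yourself.
from sklearn.tree import export_graphviz
import pydot

# lb was last fitted on y, so lb.classes_ are the original target labels
export_graphviz(dtc, out_file='tree_labeled.dot',
                feature_names=X.columns,
                class_names=lb.classes_,   # show original class labels in the leaves
                rounded=True, precision=1)
(graph, ) = pydot.graph_from_dot_file('tree_labeled.dot')
graph.write_png('tree_labeled.png')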