CUDA memory error when calculating SHAP values although there is enough memory

I am trying to calculate SHAP Values from a previously trained Random Forest. I am getting the following error:
MemoryError: std::bad_alloc: CUDA error at: /opt/anaconda3/envs/rapids-21.12/include/rmm/mr/device/cuda_memory_resource.hpp
The code I am using is:
import pickle
from cuml.explainer import KernelExplainer
import cupy as cp
filename = 'cuml_random_forest_model.sav'
cuml_model = pickle.load(open(filename, 'rb'))
arr_cupy_X_test = cp.load("arr_cupy_X_test.npy")
cu_explainer = KernelExplainer(model=cuml_model.predict,
                               data=arr_cupy_X_test.astype(cp.float32),
                               is_gpu_model=True)
cu_shap_values = cu_explainer.shap_values(arr_cupy_X_test)
I am using gpu_usage() and torch.cuda.empty_cache() to clear GPU memory. I have reduced the test array arr_cupy_X_test down to 100 rows, but I am still getting the error.
Is there perhaps another issue with the cuML KernelExplainer?
Any suggestions welcome.
Reproducible code example (works with n_samples=2000, throws the error with 10000):
from cuml import RandomForestRegressor
from cuml import make_regression
from cuml import train_test_split
from cuml.explainer import KernelExplainer
X, y = make_regression(n_samples=10000,n_features=180,noise=0.1,random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2,random_state=42)
model = RandomForestRegressor().fit(X_train, y_train)
cu_explainer = KernelExplainer(model=model.predict, data=X_train, is_gpu_model=True)
cu_shap_values = cu_explainer.shap_values(X_test)
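One idea I have not tried yet, sketched below on the assumption that the explainer's memory footprint grows with the background dataset passed via data=, is to hand the explainer only a subsample of the training data:
import cupy as cp

# Hypothetical background size; tune it to the available GPU memory.
# Assumes X_train is a CuPy array, as returned by make_regression above.
n_background = 100
idx = cp.random.permutation(X_train.shape[0])[:n_background]

cu_explainer = KernelExplainer(model=model.predict,
                               data=X_train[idx],
                               is_gpu_model=True)
cu_shap_values = cu_explainer.shap_values(X_test)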

Related

After hyperparameter tuning accuracy remains the same

I was trying to tune the hyperparameters, but after doing so the accuracy score has not changed at all. What am I doing wrong?
# Log reg
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=0.3326530612244898,max_iter=100,tol=0.01)
logreg.fit(X_train,y_train)
from sklearn.metrics import confusion_matrix

y_pred = logreg.predict(X_test)

print('Accuracy of log reg is: ', logreg.score(X_test, y_test))

confusion_matrix(y_test, y_pred)
# 0.9181286549707602 - accuracy before tuning
Output:
Accuracy of log reg is: 0.9181286549707602
array([[ 54,   9],
       [  5, 103]])
Here is my GridSearchCV code:
from sklearn.model_selection import GridSearchCV
params = {'tol': [0.01, 0.001, 0.0001],
          'max_iter': [100, 150, 200],
          'C': np.linspace(1, 20)/10}
grid_model = GridSearchCV(logreg,param_grid=params,cv=5)
grid_model_result = grid_model.fit(X_train,y_train)
print(grid_model_result.best_score_,grid_model_result.best_params_)
Output:
0.8867405063291139 {'C': 0.3326530612244898, 'max_iter': 100, 'tol': 0.01}
The problem is that in the first chunk you evaluated the model's performance on the test set, while the GridSearchCV score you printed (best_score_) is the mean cross-validation score on the training data after hyperparameter optimization, so the two numbers are not directly comparable.
The code below shows that both procedures, when used to predict the test set labels, perform equally well in terms of accuracy (~0.93).
Note: you might want to consider a hyperparameter grid with other solvers and a larger range of max_iter, because I got convergence warnings.
# Load packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
# Load the dataset and split in X and y
df = pd.read_csv('Breast_cancer_data.csv')
X = df.iloc[:, 0:5]
y = df.iloc[:, 5]
# Perform train and test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize a model
Log = LogisticRegression(n_jobs=-1)
# Initialize a parameter grid
params = [{'tol': [0.01, 0.001, 0.0001],
           'max_iter': [100, 150, 200],
           'C': np.linspace(1, 20)/10}]
# Perform GridSearchCV and store the best parameters
grid_model = GridSearchCV(Log,param_grid=params,cv=5)
grid_model_result = grid_model.fit(X_train,y_train)
best_param = grid_model_result.best_params_
# This step is only to prove that both procedures actually result in the same accuracy score
Log2 = LogisticRegression(C=best_param['C'], max_iter=best_param['max_iter'], tol=best_param['tol'], n_jobs=-1)
Log2.fit(X_train, y_train)
# Perform two predictions one straight from the GridSearch and the other one with manually inputting the best params
y_pred1 = grid_model_result.best_estimator_.predict(X_test)
y_pred2 = Log2.predict(X_test)
# Compare the accuracy scores and see that both are the same
print("Accuracy:",metrics.accuracy_score(y_test, y_pred1))
print("Accuracy:",metrics.accuracy_score(y_test, y_pred2))

Linear Regression script not working in Python

I tried running my machine learning LinearRegression code, but it is not working. Here is the code:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import pandas as pd
df = pd.read_csv(r'C:\Users\SVISHWANATH\Downloads\datasets\GGP_data.csv')
df["OHLC"] = (df.open+df.high+df.low+df.close)/4
df['HLC'] = (df.high+df.low+df.close)/3
df.index = df.index+1
reg = LinearRegression()
reg.fit(df.index, df.OHLC)
Basically, I just imported a few libraries, used the read_csv function, and called the LinearRegression() function, and this is the error:
ValueError: Expected 2D array, got 1D array instead:
array=[ 1 2 3 ... 1257 1258 1259].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or
array.reshape(1, -1) if it contains a single sample
Thanks!
As mentioned in the error message, you need to give the fit method a 2D array.
df.index is a 1D array. You can do it this way:
reg.fit(df.index.values.reshape(-1, 1), df.OHLC)
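An equivalent sketch (assuming pandas 0.24+ for to_numpy()) that makes the 2-D shape explicit:
# Build an (n_samples, 1) feature matrix from the index, then fit as before
X = df.index.to_numpy().reshape(-1, 1)
reg.fit(X, df.OHLC)
print(reg.coef_, reg.intercept_)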

Keras Regressor giving a different prediction for my input every time

I built a Keras regressor using the following code:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as ny
import pandas
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(2)
X = ny.array([[1,2], [3,4], [5,6], [7,8], [9,10]])
sc_X=StandardScaler()
X_train = sc_X.fit_transform(X)
Y = ny.array([3, 4, 5, 6, 7])
Y=ny.reshape(Y,(-1,1))
sc_Y=StandardScaler()
Y_train = sc_Y.fit_transform(Y)
N = 5
def brain():
    # Create the brain
    br_model = Sequential()
    br_model.add(Dense(3, input_dim=2, kernel_initializer='normal', activation='relu'))
    br_model.add(Dense(2, kernel_initializer='normal', activation='relu'))
    br_model.add(Dense(1, kernel_initializer='normal'))
    # Compile the brain
    br_model.compile(loss='mean_squared_error', optimizer='adam')
    return br_model

def predict(X, sc_X, sc_Y, estimator):
    prediction = estimator.predict(sc_X.fit_transform(X))
    return sc_Y.inverse_transform(prediction)
estimator = KerasRegressor(build_fn=brain, epochs=1000, batch_size=5,verbose=0)
# print "Done"
estimator.fit(X_train,Y_train)
prediction = estimator.predict(X_train)
print(predict(X, sc_X, sc_Y, estimator))
X_test = ny.array([[1.5,4.5], [7,8], [9,10]])
print(predict(X_test, sc_X, sc_Y, estimator))
The issue I face is that the code does not predict the same value for the same input: for example, it predicts 6.64 for [9,10] in the first prediction (X) and 6.49 for [9,10] in the second prediction (X_test).
The full output is this:
[2.9929883 4.0016675 5.0103474 6.0190268 6.6434317]
[3.096634 5.422326 6.4955378]
Why do I get different values, and how do I resolve this?
The problem lies in this line of code:
prediction = estimator.predict(sc_X.fit_transform(X))
You are fitting a new scaler every time you predict values for new data; that is where the differences come from. Try:
prediction = estimator.predict(sc_X.transform(X))
In this case, you reuse the scaler that was already fitted on the training data.
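A small standalone sketch of the difference, with toy numbers of my own, just to illustrate that fit_transform re-estimates the statistics while transform reuses them:
import numpy as np
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
train = np.array([[1.0], [3.0], [5.0]])
new = np.array([[5.0]])

sc.fit_transform(train)       # the scaler learns mean=3 and std~1.63 from the training data
print(sc.transform(new))      # ~[[1.22]]  uses the training statistics
print(sc.fit_transform(new))  # [[0.]]     refits on `new`, so the result changes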

KerasRegressor giving different output every time I run (despite the inputs and training set being the same)

Whenever I run the following code, I keep getting different outputs. Could someone please help me out with this? Code:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.preprocessing import StandardScaler
import numpy as ny
X = ny.array([[1,2], [3,4], [5,6], [7,8], [9,10]])
sc_X=StandardScaler()
X_train = sc_X.fit_transform(X)
Y = ny.array([3, 4, 5, 6, 7])
Y=ny.reshape(Y,(-1,1))
sc_Y=StandardScaler()
Y_train = sc_Y.fit_transform(Y)
N = 5
def brain():
    # Create the brain
    br_model = Sequential()
    br_model.add(Dense(3, input_dim=2, kernel_initializer='normal', activation='relu'))
    br_model.add(Dense(2, kernel_initializer='normal', activation='relu'))
    br_model.add(Dense(1, kernel_initializer='normal'))
    # Compile the brain
    br_model.compile(loss='mean_squared_error', optimizer='adam')
    return br_model
estimator = KerasRegressor(build_fn=brain, epochs=1000, batch_size=5,verbose=0)
estimator.fit(X_train,Y_train)
prediction = estimator.predict(X_train)
print(Y)
print(sc_Y.inverse_transform(prediction))
Basically, I have declared a dataset and am training a neural network to do regression on it and predict the values. Given that everything is hardcoded, I would expect to get the same output every time I run it. However, this is not the case. I would appreciate any help.
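One thing I notice is that, unlike the earlier snippet, no random seeds are set here. A minimal sketch of that kind of seeding (seed values arbitrary, TensorFlow 1.x API as in the earlier snippet):
# Pin the random sources before building and training the model (values arbitrary)
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed  # TensorFlow 1.x
set_random_seed(2)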

How to find which rules in a decision tree are causing misclassifications

I built a binary decision tree classifier. From the confusion matrix I found that class 0 is misclassified 495 times and class 1 is misclassified 134 times. I want to find which rules in the decision tree are actually causing the records to be misclassified.
In short: which record failed at which tree node?
Is there a method that can be used to find the rules in the decision tree that are causing these misclassifications?
Confusion Matrix
[[14226   495]
 [  134  3271]]
Fitting the decision tree and plotting it
cv = CountVectorizer(max_features=200, analyzer='word', ngram_range=(1, 3))
cv_addr = cv.fit_transform(data.pop('Clean_addr'))
for i, col in enumerate(cv.get_feature_names()):
    data[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)
train = data.drop(['Resi'], axis=1)
Y = data['Resi']
X_train, X_test, y_train, y_test = train_test_split(train, Y, test_size=0.3,random_state =8)
rus = RandomUnderSampler(random_state=42)
X_train_res, y_train_res = rus.fit_sample(X_train, y_train)
dt=DecisionTreeClassifier(class_weight="balanced", min_samples_leaf=30)
fit_decision=dt.fit(X_train_res,y_train_res)
from sklearn.externals.six import StringIO
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
dot_data = StringIO()
export_graphviz(fit_decision, out_file=dot_data,
                filled=True, rounded=True,
                special_characters=True, feature_names=train.columns)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
Any help is appreciated.
Decision tree plot: [image]
Dataset
https://drive.google.com/open?id=1NhXfwBIB640wJ30AyPKFnbIECCdmpyi5
Resi is the target column. I am trying to predict it using the other data columns, and I have count-vectorized the Clean_addr column.
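A sketch of one way to trace this, assuming scikit-learn's decision_path and apply methods and reusing the variable names from the snippet above (illustrative only, not tested on this dataset):
import numpy as np

# Predict on the test set and find the misclassified rows
y_pred = fit_decision.predict(X_test)
mis_idx = np.where(y_pred != np.asarray(y_test))[0]

# decision_path gives, for each sample, the nodes it traverses;
# apply gives the id of the leaf it ends up in
node_indicator = fit_decision.decision_path(X_test)
leaf_ids = fit_decision.apply(X_test)

for i in mis_idx[:5]:  # first few misclassified records
    node_path = node_indicator.indices[node_indicator.indptr[i]:node_indicator.indptr[i + 1]]
    print("record", i, "nodes:", node_path.tolist(), "leaf:", leaf_ids[i])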
