Using dask backend for joblib in a remote cluster - dask

How can we use dask for parallel computations on a remote cluster node, for example PSC Bridges? As an example:
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from task.distributed import Client
from joblib import Parallel, delayed, parallel_backend
import numpy as np

client = Client()  # create local cluster LINE-1
# client = Client(processes=False)  # create local cluster LINE-2
# client = Client("scheduler-address:8786")  # or connect to remote cluster

def get_acc(X, y, i):
    X_ = X[i]
    X_train, X_test, y_train, y_test = train_test_split(
        X_, y, test_size=0.1, random_state=42)
    clf = LogisticRegressionCV(
        n_jobs=5,
        cv=5,
        max_iter=1000,
        solver='liblinear',
        penalty='l1').fit(X_train, y_train)
    return clf.score(X_test, y_test)

X, y = make_classification(n_samples=1000, n_features=10000, n_classes=2)
X = np.repeat(X[None, :, :], 150, axis=0)

num_cores = 30
with parallel_backend('dask'):
    scores = Parallel(n_jobs=num_cores, verbose=100)(
        delayed(get_acc)(X, y, i)
        for i in range(25)
    )
On creating the local cluster using LINE-1, the node memory (512 GB) fills up and no computation ever starts. On using the second line (LINE-2), I get "OSError: timed out connecting to inproc://10.8.10.235/791270/1 after 10s".
But this dask-parallel code works perfectly fine on my laptop.

I don't see how that could have worked on your laptop. In this line
from task.distributed import Client
task should be dask.
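For completeness, a minimal sketch of the corrected import in use (the scheduler address is the same placeholder used in the question, not a real address):
import math
from dask.distributed import Client  # dask, not task
from joblib import Parallel, delayed, parallel_backend

# Connect to the scheduler running on the cluster (placeholder address),
# or call Client() with no arguments for a local cluster instead.
client = Client("scheduler-address:8786")

# The 'dask' joblib backend picks up the most recently created Client,
# so the joblib tasks below are dispatched to the dask workers.
with parallel_backend('dask'):
    results = Parallel(n_jobs=-1, verbose=10)(
        delayed(math.sqrt)(i) for i in range(10))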

Related

CUDA memory error calculating shap values although enough memory

I am trying to calculate SHAP Values from a previously trained Random Forest. I am getting the following error:
MemoryError: std::bad_alloc: CUDA error at: /opt/anaconda3/envs/rapids-21.12/include/rmm/mr/device/cuda_memory_resource.hpp
The code I am using is:
import pickle
from cuml.explainer import KernelExplainer
import cupy as cp

filename = 'cuml_random_forest_model.sav'
cuml_model = pickle.load(open(filename, 'rb'))
arr_cupy_X_test = cp.load("arr_cupy_X_test.npy")

cu_explainer = KernelExplainer(model=cuml_model.predict,
                               data=arr_cupy_X_test.astype(cp.float32),
                               is_gpu_model=True)
cu_shap_values = cu_explainer.shap_values(arr_cupy_X_test)
I am using gpu_usage() and torch.cuda.empty_cache() to clear GPU memory. I have reduced the size of the test array arr_cupy_X_test down to 100, but I am still getting the error.
Is there maybe another issue with the cuml KernelExplainer?
Any suggestions welcome.
Reproducible code example (it works with n_samples=2000 and throws the error with 10000):
from cuml import RandomForestRegressor
from cuml import make_regression
from cuml import train_test_split
from cuml.explainer import KernelExplainer
X, y = make_regression(n_samples=10000,n_features=180,noise=0.1,random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2,random_state=42)
model = RandomForestRegressor().fit(X_train, y_train)
cu_explainer = KernelExplainer(model=model.predict, data=X_train, is_gpu_model=True)
cu_shap_values = cu_explainer.shap_values(X_test)

After hyperparameter tuning accuracy remains the same

I was trying to tune the hyperparameters, but after I did so the accuracy score did not change at all. What am I doing wrong?
# Log reg
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(C=0.3326530612244898, max_iter=100, tol=0.01)
logreg.fit(X_train, y_train)

from sklearn.metrics import confusion_matrix

y_pred = logreg.predict(X_test)

print('Accuracy of log reg is: ', logreg.score(X_test, y_test))

confusion_matrix(y_test, y_pred)
# 0.9181286549707602 - accuracy before tuning
Output:
Accuracy of log reg is: 0.9181286549707602
array([[ 54,   9],
       [  5, 103]])
Here is my GridSearchCV code:
from sklearn.model_selection import GridSearchCV

params = {'tol': [0.01, 0.001, 0.0001],
          'max_iter': [100, 150, 200],
          'C': np.linspace(1, 20) / 10}

grid_model = GridSearchCV(logreg, param_grid=params, cv=5)
grid_model_result = grid_model.fit(X_train, y_train)
print(grid_model_result.best_score_, grid_model_result.best_params_)
Output:
0.8867405063291139 {'C': 0.3326530612244898, 'max_iter': 100, 'tol': 0.01}
The problem is that in the first chunk you evaluated the model's performance on the test set, while with GridSearchCV you only looked at the cross-validated performance on the training set after hyperparameter optimization.
The code below shows that both procedures, when used to predict the test set labels, perform equally well in terms of accuracy (~0.93).
Note: you might want to consider a hyperparameter grid with other solvers and a larger range of max_iter, because I got convergence warnings (a possible grid is sketched after the code below).
# Load packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn import metrics
# Load the dataset and split in X and y
df = pd.read_csv('Breast_cancer_data.csv')
X = df.iloc[:, 0:5]
y = df.iloc[:, 5]
# Perform train and test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Initialize a model
Log = LogisticRegression(n_jobs=-1)
# Initialize a parameter grid
params = [{'tol': [0.01, 0.001, 0.0001],
           'max_iter': [100, 150, 200],
           'C': np.linspace(1, 20) / 10}]
# Perform GridSearchCV and store the best parameters
grid_model = GridSearchCV(Log,param_grid=params,cv=5)
grid_model_result = grid_model.fit(X_train,y_train)
best_param = grid_model_result.best_params_
# This step is only to prove that both procedures actually result in the same accuracy score
Log2 = LogisticRegression(C=best_param['C'], max_iter=best_param['max_iter'], tol=best_param['tol'], n_jobs=-1)
Log2.fit(X_train, y_train)
# Perform two predictions one straight from the GridSearch and the other one with manually inputting the best params
y_pred1 = grid_model_result.best_estimator_.predict(X_test)
y_pred2 = Log2.predict(X_test)
# Compare the accuracy scores and see that both are the same
print("Accuracy:",metrics.accuracy_score(y_test, y_pred1))
print("Accuracy:",metrics.accuracy_score(y_test, y_pred2))

How to parallelize sklearn's random forest regressor on SLURM

I am currently trying to make sklearn's random forest run in parallel on a SLURM cluster. I have submitted the jobs to the nodes, and then I noticed that the parameter n_jobs=-1 was no longer working on SLURM.
I have tried the ipyparallel package, but it gave me error messages. I do not necessarily need ipyparallel, so I would appreciate any module with which I can parallelize a random forest on the cluster.
from sklearn.ensemble import RandomForestRegressor
from joblib import parallel_backend, register_parallel_backend
from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend
import sys
import time
import pickle
import numpy as np

def fit_predict(self, X_train, y, X_test):
    """
    Train a model on X_train and y, and then return the prediction for X_test.
    """
    pred = None
    client = Client(profile='myprofile')
    bview = client.load_balanced_view()
    register_parallel_backend('ipyparallel',
                              lambda: IPythonParallelBackend(view=bview))
    regr = RandomForestRegressor(n_jobs=-1)
    try:
        with parallel_backend('ipyparallel'):
            regr.fit(X_train, y)
            pred = regr.predict(X_test)
    except Exception as e:
        print(e)
    return pred
Error:
Traceback (most recent call last):
  File "job.py", line 124, in <module>
    pred = rf.fit_predict(X_train, y_train, X_test)
  File "job.py", line 50, in fit_predict
    client = Client(profile='myprofile')
  File "/home/lfz/.conda/envs/mvi/lib/python3.7/site-packages/ipyparallel/client/client.py", line 419, in __init__
    raise IOError(no_file_msg)
OSError: You have attempted to connect to an IPython Cluster but no Controller could be found.
Please double-check your configuration and ensure that a cluster is running.
srun: error: c6-28: task 0: Exited with exit code 1

why 10-fold cross validation is even faster than 1-fold fit when using LGB?

I am using LightGBM for a machine learning task. I found that when I use the sklearn API cross_val_score with cv=10, the time cost is less than a single-fold fit. I split the dataset using train_test_split, then fit an LGBMClassifier on the training set. The time cost of the latter is much higher than the former. Why?
Sorry for my bad English.
Environment: Python 3.5, scikit-learn 0.20.3, lightgbm 2.2.3
Intel Xeon CPU E5-2650 v4
Memory 128GB
X = train_df.drop(['uId', 'age'], axis=1)
Y = train_df.loc[:, 'age']
X_test = test_df.drop(['uId'], axis=1)
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.1,
                                                  stratify=Y)
# (1809000, 12) (1809000,) (201000, 12) (201000,) (502500, 12)
print(X_train.shape, Y_train.shape, X_val.shape, Y_val.shape, X_test.shape)

from lightgbm import LGBMClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
import time

lgb = LGBMClassifier(n_jobs=-1)

tic = time.time()
scores = cross_val_score(lgb, X, Y,
                         scoring='accuracy', cv=10, n_jobs=-1)
toc = time.time()
# 0.3738402985074627 0.0009231167322574765 300.1487271785736
print(np.mean(scores), np.std(scores), toc - tic)

tic = time.time()
lgb.fit(X_train, Y_train)
toc = time.time()
# 0.3751492537313433 472.1763586997986 (is much more than 300)
print(accuracy_score(Y_val, lgb.predict(X_val)), toc - tic)
Sorry, I found the answer. The LightGBM documentation says: 'for the best speed, set this to the number of real CPU cores, not the number of threads'. So setting n_jobs=-1 is not the best choice.
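For example, a minimal sketch of following that advice, assuming the psutil package is available to query the number of physical cores:
import psutil
from lightgbm import LGBMClassifier

# Number of physical cores, excluding hyper-threaded logical cores
physical_cores = psutil.cpu_count(logical=False)

# Use physical cores rather than n_jobs=-1, which counts logical cores/threads
lgb = LGBMClassifier(n_jobs=physical_cores)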

Keras Regressor giving different prediction for my input everytime

I built a Keras regressor using the following code:
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as ny
import pandas
from numpy.random import seed
seed(1)
from tensorflow import set_random_seed
set_random_seed(2)

X = ny.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X)

Y = ny.array([3, 4, 5, 6, 7])
Y = ny.reshape(Y, (-1, 1))
sc_Y = StandardScaler()
Y_train = sc_Y.fit_transform(Y)

N = 5

def brain():
    # Create the brain
    br_model = Sequential()
    br_model.add(Dense(3, input_dim=2, kernel_initializer='normal', activation='relu'))
    br_model.add(Dense(2, kernel_initializer='normal', activation='relu'))
    br_model.add(Dense(1, kernel_initializer='normal'))
    # Compile the brain
    br_model.compile(loss='mean_squared_error', optimizer='adam')
    return br_model

def predict(X, sc_X, sc_Y, estimator):
    prediction = estimator.predict(sc_X.fit_transform(X))
    return sc_Y.inverse_transform(prediction)

estimator = KerasRegressor(build_fn=brain, epochs=1000, batch_size=5, verbose=0)
# print("Done")
estimator.fit(X_train, Y_train)
prediction = estimator.predict(X_train)
print(predict(X, sc_X, sc_Y, estimator))

X_test = ny.array([[1.5, 4.5], [7, 8], [9, 10]])
print(predict(X_test, sc_X, sc_Y, estimator))
The issue I face is that the code is not predicting the same value (for example, it predicts 6.64 for [9, 10] in the first prediction (X) and 6.49 for [9, 10] in the second prediction (X_test)).
The full output is this:
[2.9929883 4.0016675 5.0103474 6.0190268 6.6434317]
[3.096634 5.422326 6.4955378]
Why do I get different values and how do I resolve them?
The problem lies in this line of code:
prediction = estimator.predict(sc_X.fit_transform(X))
You are fitting a new scaler every time you predict values for new data. This is where the differences come from. Try:
prediction = estimator.predict(sc_X.transform(X))
In this case, you reuse the scaler that was already fitted on the training data.
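To see why re-fitting matters, here is a minimal standalone sketch (independent of the model above) showing that fit_transform on new data produces different scaled values than transform with the scaler fitted on the training data:
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], dtype=float)
X_new = np.array([[1.5, 4.5], [7, 8], [9, 10]], dtype=float)

scaler = StandardScaler().fit(X_train)

# Uses the mean/std learned from X_train, so [9, 10] is scaled exactly as during training
print(scaler.transform(X_new))

# Re-fits on X_new, so [9, 10] is scaled with a different mean/std and no longer matches training
print(StandardScaler().fit_transform(X_new))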
