How to parallelize sklearn's random forest regressor on SLURM

I am currently trying to make sklearn's random forest run in parallel on a SLURM cluster. I submitted the jobs to the compute nodes and then noticed that the parameter n_jobs=-1 no longer worked under SLURM.
I have tried the ipyparallel package, but it gave me error messages. I am not tied to ipyparallel, so I would appreciate any module that lets me parallelize the random forest on the cluster.
from sklearn.ensemble import RandomForestRegressor
from joblib import parallel_backend, register_parallel_backend
from ipyparallel import Client
from ipyparallel.joblib import IPythonParallelBackend
import sys
import time
import pickle
import numpy as np
def fit_predict(self, X_train, y, X_test):
    """
    train a model by X_train and y, and then return the prediction of
    X_test
    """
    pred = None
    client = Client(profile='myprofile')
    bview = client.load_balanced_view()
    register_parallel_backend('ipyparallel', lambda: IPythonParallelBackend(view=bview))
    regr = RandomForestRegressor(n_jobs=-1)
    try:
        with parallel_backend('ipyparallel'):
            regr.fit(X_train, y)
        pred = regr.predict(X_test)
    except Exception as e:
        print(e)
    return pred
Error:
Traceback (most recent call last):
File "job.py", line 124, in <module>
pred = rf.fit_predict(X_train, y_train, X_test)
File "job.py", line 50, in fit_predict
client = Client(profile='myprofile')
File "/home/lfz/.conda/envs/mvi/lib/python3.7/site-packages/ipyparallel/client/client.py", line 419, in __init__
raise IOError(no_file_msg)
OSError: You have attempted to connect to an IPython Cluster but no Controller could be found.
Please double-check your configuration and ensure that a cluster is running.
srun: error: c6-28: task 0: Exited with exit code 1
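One possible reason n_jobs=-1 behaves differently under SLURM is that the job is allocated fewer CPUs than the node physically has; the question does not confirm this, so treat it as an assumption. A minimal sketch, assuming the job was submitted with --cpus-per-task so that SLURM exports the SLURM_CPUS_PER_TASK environment variable. The helper name slurm_n_jobs is hypothetical:

```python
import os


def slurm_n_jobs(default=1, environ=None):
    """Return the CPU count SLURM allocated to this task, or `default`.

    Assumes the job was submitted with --cpus-per-task, in which case
    SLURM sets SLURM_CPUS_PER_TASK. `environ` exists only so the helper
    can be exercised without a real SLURM job.
    """
    env = os.environ if environ is None else environ
    raw = env.get("SLURM_CPUS_PER_TASK", "")
    try:
        n = int(raw)
    except ValueError:
        # Variable missing or not an integer: fall back to the default.
        return default
    return n if n > 0 else default


# Outside SLURM this prints the default value.
print(slurm_n_jobs())
```

The result could then be passed explicitly, for example RandomForestRegressor(n_jobs=slurm_n_jobs()). This only uses the cores of a single node; spreading one forest across several nodes still needs a distributed backend such as the ipyparallel setup above, with a controller actually running.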

Related

Error loading a file opened in 'rb' mode with pickle in random forest algorithm

import pickle
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,accuracy_score,roc_auc_score
# print ('Number of arguments:', len(sys.argv), 'arguments.')
# print ('Argument List:', str(sys.argv))
with open('RF_Model_Py3', 'rb') as f:
    RanFor = pickle.load(f)
Exception has occurred: EOFError
Ran out of input
File "C:\Users\HP\Downloads\Crop-Yield-Prediction-using-ML-master (1)\Crop-Yield-Prediction-using-ML-master\RF_predict.py", line 15, in <module>
RanFor = pickle.load(f)
This is the error shown in VS Code.
I have tried many things, but the error does not go away. It appears when loading the random forest model.
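pickle.load takes a file object opened in binary mode: pickle.load(open(filename,'rb'))
In your case I would do pickle.load(open('file.p','rb'))
You should also make sure you have actually dumped the data before you try to load it: "EOFError: Ran out of input" is what pickle raises when the file is empty.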
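To illustrate the point about dumping before loading, here is a minimal sketch of a dump/load round trip next to a load from an empty file. The dictionary stands in for a trained model, and the file names and the load_pickle helper are hypothetical:

```python
import os
import pickle
import tempfile


def load_pickle(path):
    """Load a pickled object, failing early with a clear message if the file is empty."""
    if os.path.getsize(path) == 0:
        raise ValueError(f"{path} is empty; dump the model before loading it")
    with open(path, "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    # Round trip: dump first, then load.
    path = os.path.join(tmp, "model.pkl")
    model = {"n_estimators": 100}  # stand-in for a trained model
    with open(path, "wb") as f:
        pickle.dump(model, f)
    restored = load_pickle(path)

    # Loading from an empty file reproduces the error from the question.
    empty_path = os.path.join(tmp, "empty.pkl")
    open(empty_path, "wb").close()
    empty_error = None
    with open(empty_path, "rb") as f:
        try:
            pickle.load(f)
        except EOFError as exc:
            empty_error = str(exc)

print(restored)
print(empty_error)
```

If the model file on disk has a size of zero bytes, it needs to be regenerated by running the training script that calls pickle.dump.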

CUDA memory error calculating shap values although enough memory

I am trying to calculate SHAP Values from a previously trained Random Forest. I am getting the following error:
MemoryError: std::bad_alloc: CUDA error at: /opt/anaconda3/envs/rapids-21.12/include/rmm/mr/device/cuda_memory_resource.hpp
The code I am using is:
import pickle
from cuml.explainer import KernelExplainer
import cupy as cp
filename = 'cuml_random_forest_model.sav'
cuml_model = pickle.load(open(filename, 'rb'))
arr_cupy_X_test = cp.load("arr_cupy_X_test.npy")
cu_explainer = KernelExplainer(model=cuml_model.predict,
                               data=arr_cupy_X_test.astype(cp.float32),
                               is_gpu_model=True)
cu_shap_values = cu_explainer.shap_values(arr_cupy_X_test)
I am using gpu_usage() and torch.cuda.empty_cache() to clear GPU memory. I have reduced the size of the test array arr_cupy_X_test to 100, but I still receive the error.
Is there maybe another issue with the cuml kernel explainer?
Any suggestions welcome.
Reproducible code example (works with n_samples=2000, throws the error with 10000):
from cuml import RandomForestRegressor
from cuml import make_regression
from cuml import train_test_split
from cuml.explainer import KernelExplainer
X, y = make_regression(n_samples=10000,n_features=180,noise=0.1,random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2,random_state=42)
model = RandomForestRegressor().fit(X_train, y_train)
cu_explainer = KernelExplainer(model=model.predict, data=X_train, is_gpu_model=True)
cu_shap_values = cu_explainer.shap_values(X_test)

Troubles using dask distributed with datashader: 'can't pickle weakref objects'

I'm working with datashader and dask but I'm having problems when trying to plot with a cluster running. To make it more concrete, I have the following example (embedded in a bokeh plot):
import holoviews as hv
import pandas as pd
import dask.dataframe as dd
import numpy as np
from dask.distributed import Client, LocalCluster
from holoviews.operation.datashader import datashade
import datashader.transfer_functions as tf
#initialize the client/cluster
cluster = LocalCluster(n_workers=4, threads_per_worker=1)
dask_client = Client(cluster)
def datashade_plot():
    hv.extension('bokeh')
    #create some random data (in the actual code this is a parquet file with millions of rows, this is just an example)
    delta = 1/1000
    x = np.arange(0, 1, delta)
    y = np.cumsum(np.sqrt(delta)*np.random.normal(size=len(x)))
    df = pd.DataFrame({'X':x, 'Y':y})
    #create dask dataframe
    points_dd = dd.from_pandas(df, npartitions=3)
    #create plot
    points = hv.Curve(points_dd)
    return datashade(points)
dask_client.submit(datashade_plot,).result()
This raises a:
TypeError: can't pickle weakref objects
My theory is that this happens because the datashade operations can't be distributed across the cluster. Sorry if this is a noob question; I'd be very grateful for any advice you could give me.
I think you want to go the other way. That is, pass datashader a dask dataframe instead of a pandas dataframe:
>>> from dask import dataframe as dd
>>> import multiprocessing as mp
>>> dask_df = dd.from_pandas(df, npartitions=mp.cpu_count())
>>> dask_df.persist()
...
>>> cvs = datashader.Canvas(...)
>>> agg = cvs.points(dask_df, ...)
XREF: https://datashader.org/user_guide/Performance.html

I am not able to train models in sklearn (scikit-learn) using Python

I have a data file that contains data for predicting admission to an MS program.
It contains 9 columns: 8 columns hold student data and the 9th column holds the student's chance of selection.
I am new to this and I don't understand the error that occurs when training the model.
import pandas
import numpy as np
import sklearn as sl
from sklearn.neural_network import MLPClassifier
classifier = MLPClassifier()
data = pandas.read_csv('Addmition.csv')
data_array = np.array(data)
X = data_array[:,1:8]
y = data_array[:,8]
classifier.fit(X,y)
print(classifier)
Traceback (most recent call last):
File "c.py", line 14, in <module>
classifier.fit(X,y)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 977, in fit
hasattr(self, "classes_")))
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 324, in _fit
X, y = self._validate_input(X, y, incremental)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\neural_network\multilayer_perceptron.py", line 920, in _validate_input
self._label_binarizer.fit(y)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\preprocessing\label.py", line 413, in fit
self.classes_ = unique_labels(y)
File "C:\Users\vishal jangid\AppData\Roaming\Python\Python37\site-packages\sklearn\utils\multiclass.py", line 96, in unique_labels
raise ValueError("Unknown label type: %s" % repr(ys))
ValueError: Unknown label type: (array
Try this:
import pandas
import numpy as np
import sklearn as sl
from sklearn.neural_network import MLPRegressor
classifier = MLPRegressor()
data = pandas.read_csv('Addmition.csv')
data_array = np.array(data)
X = data_array[:,1:8]
y = data_array[:,8]
classifier.fit(X,y)
print(classifier)
Explanation:
In machine learning there are two main types of supervised problems:
1) Classification:
Ex: Predict if a person is male or female. (discrete)
2) Regression:
Ex: Predict the age of the person. (continuous)
With this in mind, look at your problem: your label (chance of selection) is continuous, so this is a regression problem.
You are using MLPClassifier, which expects discrete class labels, and that results in the 'Unknown label type' error.
Try using MLPRegressor instead.
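The difference can be checked directly. The sketch below uses random stand-in data rather than the original CSV (the shapes and values are assumptions), asks scikit-learn how it interprets the label with type_of_target, and then fits both estimators:

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.utils.multiclass import type_of_target

# Hypothetical stand-in data: 20 students, 7 features,
# and a continuous "chance of selection" between 0 and 1.
rng = np.random.default_rng(0)
X = rng.random((20, 7))
y = rng.random(20)

# scikit-learn's own view of the label column.
target_kind = type_of_target(y)
print(target_kind)

# A classifier rejects continuous labels with a ValueError.
classifier_error = None
try:
    MLPClassifier(max_iter=50).fit(X, y)
except ValueError as exc:
    classifier_error = str(exc)
print(classifier_error is not None)

# A regressor accepts the same labels. The tiny max_iter keeps the demo
# fast, so the convergence warning is silenced on purpose.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=ConvergenceWarning)
    regressor = MLPRegressor(max_iter=50, random_state=0).fit(X, y)
predictions = regressor.predict(X[:3])
print(predictions.shape)
```

If the label were instead a yes/no admission decision, type_of_target would report a discrete type and MLPClassifier would be the appropriate choice.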

Google Cloud ML exited with a non-zero status of 245 when training

I tried to train my model on Google Cloud ML using this sample code:
import keras
from keras import optimizers
from keras import losses
from keras import metrics
from keras.models import Model, Sequential
from keras.layers import Dense, Lambda, RepeatVector, TimeDistributed
import numpy as np
def test():
    model = Sequential()
    model.add(Dense(2, input_shape=(3,)))
    model.add(RepeatVector(3))
    model.add(TimeDistributed(Dense(3)))
    model.compile(loss=losses.MSE,
                  optimizer=optimizers.RMSprop(lr=0.0001),
                  metrics=[metrics.categorical_accuracy],
                  sample_weight_mode='temporal')
    x = np.random.random((1, 3))
    y = np.random.random((1, 3, 3))
    model.train_on_batch(x, y)
if __name__ == '__main__':
    test()
and I got this error:
The replica master 0 exited with a non-zero status of 245. Termination reason: Error.
The detailed error output is long, so I have pasted it on Pastebin.
Note this output:
Module raised an exception for failing to call a subprocess Command '['python', '-m', u'trainer.test', '--job-dir', u'gs://my_test_bucket_keras/s_27_100630']' returned non-zero exit status -11.
I guess Google Cloud runs your code with an extra parameter called --job-dir, so perhaps you can try adding the following to your example code:
import ...
import argparse
def test():
    model = Sequential()
    model.add(Dense(2, input_shape=(3,)))
    model.add(RepeatVector(3))
    model.add(TimeDistributed(Dense(3)))
    model.compile(loss=losses.MSE,
                  optimizer=optimizers.RMSprop(lr=0.0001),
                  metrics=[metrics.categorical_accuracy],
                  sample_weight_mode='temporal')
    x = np.random.random((1, 3))
    y = np.random.random((1, 3, 3))
    model.train_on_batch(x, y)
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Input Arguments
    parser.add_argument(
        '--job-dir',
        help='GCS location to write checkpoints and export models',
        required=True
    )
    args = parser.parse_args()
    arguments = args.__dict__
    test()
    # test(**arguments) # or if you want to use this job_dir parameter in your code
Not 100% sure this will work but I think you can give it a try.
Also I have a post here to do something similar, perhaps you can take a look there as well.
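The --job-dir handling suggested above can be tried in isolation, without Keras. This is a minimal sketch: parse_job_args is a hypothetical helper, the bucket path and the extra --verbosity flag are made up, and parse_known_args is used here (instead of parse_args as in the answer) so that any additional flags the platform might append do not abort the script:

```python
import argparse


def parse_job_args(argv=None):
    """Parse --job-dir and ignore any arguments this script does not know."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--job-dir",
        default=None,
        help="GCS location to write checkpoints and export models",
    )
    # parse_known_args returns (namespace, list_of_unrecognized_arguments).
    args, _unknown = parser.parse_known_args(argv)
    return args


# Simulated command line; the bucket path is made up for illustration.
demo = parse_job_args(["--job-dir", "gs://example-bucket/run-1", "--verbosity", "INFO"])
print(demo.job_dir)
```

argparse converts the dash in --job-dir to an underscore, so the value is read as args.job_dir.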
The problem is resolved. All I had to do was use TensorFlow 1.1.0 instead of the default 1.0.1.
