Cannot Reproduce Results in Databricks for sklearn Random Forest

I'm running some machine learning experiments in Databricks. For the random forest algorithm, each time I restart the cluster the training output changes, even though the random state is set. Does anyone have any clue about this issue?
Note: I tried the same algorithm with the same code in an Anaconda environment on my local machine, and there is no difference in the result even when I restart the machine.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn import metrics

clf_rf = RandomForestClassifier(n_estimators=10, random_state=123)
clf_rf.fit(X_train,y_train)
y_pred = clf_rf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test,y_pred).ravel()
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1_score = metrics.f1_score(y_test, y_pred)
print(f"TP:{tp}")
print(f"FP:{fp}")
print(f"TN:{tn}")
print(f"FN:{fn}")
print(f"Accuracy : {accuracy}")
print(f"Precision : {precision}")
print(f"Recall : {recall}")
print(f"F1 Score : {f1_score}")
The output of this code changes every time I restart the cluster.

Try this:
from numpy.random import seed
seed(1)
clf_rf = RandomForestClassifier(n_estimators=10 , random_state=123)
clf_rf.fit(X_train,y_train)
y_pred = clf_rf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test,y_pred).ravel()
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1_score = metrics.f1_score(y_test, y_pred)
print(f"TP:{tp}")
print(f"FP:{fp}")
print(f"TN:{tn}")
print(f"FN:{fn}")
print(f"Accuracy : {accuracy}")
print(f"Precision : {precision}")
print(f"Recall : {recall}")
print(f"F1 Score : {f1_score}")

Randomness can also come into your workflow when you do the train-test splitting. If you set the random_state in train_test_split as well, I think you would be fine.
Here is an example to showcase that fixing the randomness in the dataset produces reproducible results:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12)
clf_rf = RandomForestClassifier(n_estimators=10 , random_state=123)
clf_rf.fit(X_train,y_train)
y_pred = clf_rf.predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test,y_pred).ravel()
accuracy = metrics.accuracy_score(y_test, y_pred)
precision = metrics.precision_score(y_test, y_pred)
recall = metrics.recall_score(y_test, y_pred)
f1_score = metrics.f1_score(y_test, y_pred)
print(f"TP:{tp}")
print(f"FP:{fp}")
print(f"TN:{tn}")
print(f"FN:{fn}")
print(f"Accuracy : {accuracy}")
print(f"Precision : {precision}")
print(f"Recall : {recall}")
print(f"F1 Score : {f1_score}")
Output:
TP:9
FP:1
TN:12
FN:3
Accuracy : 0.84
Precision : 0.9
Recall : 0.75
F1 Score : 0.8181818181818182
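
As a quick sanity check of this claim, here is a small sketch (the double fit and the comparison are my addition): with every random_state fixed, two independent fits produce identical predictions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Fix every source of randomness: the synthetic data, the split, and the forest itself
X, y = make_classification(random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=12)
preds = []
for run in range(2):
    clf = RandomForestClassifier(n_estimators=10, random_state=123)
    clf.fit(X_train, y_train)
    preds.append(clf.predict(X_test))
print(np.array_equal(preds[0], preds[1]))  # expected: True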

Related

Unexpected behaviour (inflated results on random-data) in scikit-learn with nested cross-validation

When trying to train/evaluate a support vector machine in scikit-learn, I am experiencing some unexpected behaviour and I am wondering whether I am doing something wrong or whether this is a possible bug.
In a very specific subset of circumstances, nested cross-validation using GridSearchCV and an SVM provides inflated predictive results, even with randomly generated data.
For instance, see this code:
from sklearn import svm
from sklearn.linear_model import LogisticRegression
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, LeaveOneOut
from sklearn.metrics import roc_auc_score, brier_score_loss
from tqdm import tqdm
import pandas as pd
N = 20
N_FEATURES = 50
param_grid = {'C': [1e-5, 1e-3, 1, 1e3, 1e5]}
scores = []
for z in tqdm(range(100)):
    X = np.random.uniform(size=(N, N_FEATURES))
    y = np.random.binomial(1, 0.5, size=N)
    if z < 10:
        y = np.array([0, 1] * int(N/2))
        y = np.random.permutation(y)
    for skf_outer in [StratifiedKFold(n_splits=5), LeaveOneOut()]:
        for skf_inner in [5, LeaveOneOut()]:
            for model in [svm.SVC(probability=True), LogisticRegression()]:
                y_pred, y_real = [], []
                for train_index, test_index in skf_outer.split(X, y):
                    X_train, X_test = X[train_index], X[test_index, :]
                    y_train, y_test = y[train_index], y[test_index]
                    clf = GridSearchCV(
                        model, param_grid, cv=skf_inner, n_jobs=-1, scoring='neg_brier_score'
                    )
                    clf.fit(X_train, y_train)
                    predictions = clf.predict_proba(X_test)[:, 1]
                    y_pred.extend(predictions)
                    y_real.extend(y_test)
                scores.append([str(skf_outer), str(skf_inner), str(model), np.mean(y),
                               brier_score_loss(np.array(y_real), np.array(y_pred)),
                               roc_auc_score(np.array(y_real), np.array(y_pred))])
df_scores = pd.DataFrame(scores)
df_scores.columns = ['skf_outer', 'skf_inner', 'model', 'y_label', 'brier', 'auc']
df_scores['y_0.5'] = df_scores['y_label'] == 0.5
df_scores = df_scores.groupby(['skf_outer', 'skf_inner', 'model', 'y_0.5']).mean()
print(df_scores)
In the following circumstances:
LeaveOneOut() is used in both the inner and outer loop of the CV
The SVM is used
The y labels are balanced (i.e. the mean of y is 0.5)
The predictions are much better than expected by random chance (AUC > 0.9, sometimes even 1, a Brier score of 0.15 or lower). I can replicate this by generating more samples, more features, etc.; the issue stays the same. Swapping the SVM for LogisticRegression (as shown in the analysis above) leads to the expected results (AUC 0.5, Brier score of 0.25). For the other scenarios (no LOO-CV in either the inner or outer loop, or a different distribution of y labels), the results are as expected.
Can anyone replicate this? Am I missing something obvious?
I've replicated this with an older version of sklearn (0.24.0) and the newest one (1.2.0).

How to understand the _dual_coef_ parameter in sklearn's kernel svm?

I have some small kernel SVM code.
from sklearn import datasets
from sklearn.svm import SVC
import numpy as np
# Load the IRIS dataset for demonstration
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Train-test split
X_train, y_train = X[:140], y[:140]
X_test, y_test = X[140:], y[140:]
print(X.shape, X_train.shape, X_test.shape) # prints (150, 4) (140, 4) (10, 4)
# Fit a rbf kernel SVM
gamma = 0.7
svc = SVC(kernel='rbf', gamma=gamma, C=64, decision_function_shape='ovo')
# svc = SVC(kernel='rbf', gamma=gamma, C=64, probability=True, decision_function_shape='ovo')
# svc = SVC(kernel='rbf', gamma=gamma, C=64)
svc.fit(X_train, y_train)
print(svc.score(X_test, y_test))
# Get prediction for a point X_test using the trained SVM, svc
def get_pred(svc, X_test):
    def RBF(x, z, gamma, axis=None):
        return np.exp((-gamma * np.linalg.norm(x - z, axis=axis) ** 2))

    A = []
    # Loop over all support vectors to calculate K(Xi, X_test), for Xi in the set of support vectors
    for x in svc.support_vectors_:
        # A.append(RBF(x, X_test, svc._gamma))
        A.append(RBF(x, X_test, gamma))
    A = np.array(A)
    return (np.sum(svc._dual_coef_ * A) + svc.intercept_)

for i in range(X_test.shape[0]):
    print(get_pred(svc, X_test[i]))
    print(svc.decision_function([X_test[i]]))  # The output should be the same
I want to understand the role of the dual_coef_ parameter in the SVM, so I implemented a prediction function get_pred for the SVM myself, according to the mathematical expression of the SVM here.
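For reference (this is my addition, assuming the standard kernel-SVM formulation rather than anything stated in the post), the binary decision function sklearn evaluates is f(x) = \sum_i \alpha_i y_i K(x_i, x) + b, where the sum runs over the support vectors x_i, dual_coef_ stores the products \alpha_i y_i, and intercept_ stores b. In the multiclass one-vs-one case, dual_coef_ has shape (n_classes - 1, n_SV) and the coefficients of the pairwise classifiers are laid out in a less obvious way.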
But the output of the function I implemented is different from the output of the built-in function:
(150, 4) (140, 4) (10, 4)
1.0
[-4.24105215 -4.38979215 -3.52427244]
[[-0.42115154 -1.06817962 -2.36560357]]
[-2.34091311 -2.48965311 -1.6241334 ]
[[-0.61615543 -0.86736268 -0.47127757]]
[-4.34859785 -4.49733785 -3.63181814]
[[-0.86662754 -1.14637099 -1.94948189]]
[-4.14797518 -4.29671518 -3.43119547]
[[-0.32438219 -1.12869709 -2.30877848]]
[-3.80505008 -3.95379007 -3.08827037]
[[-0.3341635 -1.03315401 -2.05161515]]
[-3.83632958 -3.98506957 -3.11954987]
[[-0.62920059 -0.97474828 -1.84626328]]
[-3.94804683 -4.09678683 -3.23126712]
[[-0.90348467 -1.04135143 -1.61709331]]
[-4.24990319 -4.39864319 -3.53312348]
[[-0.83485694 -1.07466796 -1.95426087]]
[-3.39840443 -3.54714443 -2.68162472]
[[-0.52530703 -0.9980642 -1.48891578]]
[-3.03105705 -3.17979705 -2.31427734]
[[-0.93796146 -1.09834078 -0.60863738]]
How can I understand this parameter dual_coef, or put another way, how can I implement the prediction function of the kernel svm myself?

sklearn and weka kNN predictions exactly same for all except for one data point

I wrote code for kNN using sklearn and then compared the predictions with the WEKA kNN. The comparison was done on the 10 test-set predictions, out of which only a single one shows a large difference of >1.5, while all the others are exactly the same. So I am not sure whether my code is working fine or not. Here is my code:
import pandas as pd
from scipy import stats
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_predict, LeaveOneOut
from sklearn.neighbors import KNeighborsRegressor

df = pd.read_csv('xxxx.csv')
X = df.drop(['Name', 'activity'], axis=1)
y = df['activity']
Xstd = StandardScaler().fit_transform(X)
x_train, x_test, y_train, y_test = train_test_split(Xstd, y, test_size=0.2,
                                                    shuffle=False, random_state=None)
print(x_train.shape, x_test.shape)
X_train_trans = x_train
X_test_trans = x_test
for i in range(2, 3):
    knn_regressor = KNeighborsRegressor(n_neighbors=i, algorithm='brute',
                                        weights='uniform', metric='euclidean', n_jobs=1, p=2)
    CV_pred_train = cross_val_predict(knn_regressor, X_train_trans, y_train,
                                      n_jobs=-1, verbose=0, cv=LeaveOneOut())
    print("LOO Q2: ", metrics.r2_score(y_train, CV_pred_train).round(2))
    # Train/test predictions
    knn_regressor.fit(X_train_trans, y_train)
    train_r2 = knn_regressor.score(X_train_trans, y_train)
    y_train_pred = knn_regressor.predict(X_train_trans).round(3)
    train_r2_1 = metrics.r2_score(y_train, y_train_pred)
    y_test_pred = knn_regressor.predict(X_test_trans).round(3)
    train_r = stats.pearsonr(y_train, y_train_pred)
    abs_error_train = (y_train - y_train_pred)
    train_predictions = pd.DataFrame({'Actual': y_train, 'Predicted': y_train_pred,
                                      'error': abs_error_train.round(3)})
    MAE_train = metrics.mean_absolute_error(y_train, y_train_pred)
    abs_error_test = (y_test_pred - y_test)
    test_predictions = pd.DataFrame({'Actual': y_test, 'predicted': y_test_pred,
                                     'error': abs_error_test.round(3)})
    test_r = stats.pearsonr(y_test, y_test_pred)
    test_r2 = metrics.r2_score(y_test, y_test_pred)
    MAE_test = metrics.mean_absolute_error(y_test, y_test_pred).round(3)
    print(test_predictions)
The train set statistics are almost the same in both the sklearn and WEKA kNN.
The sklearn predictions are:
Actual predicted error
6.00 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.440 0.480
7.96 7.150 -0.810
7.30 6.660 -0.640
6.68 7.200 0.520
***4.60 6.950 2.350***
and the WEKA predictions are:
actual predicted error
6 5.285 -0.715
5.44 5.135 -0.305
6.92 6.995 0.075
7.28 7.005 -0.275
5.96 6.44 0.48
7.96 7.15 -0.81
7.3 6.66 -0.64
6.68 7.2 0.52
***4.6 5.285 0.685***
The parameters used in both algorithms are: k = 2, brute force for the distance calculation, metric: Euclidean.
Any suggestions for the difference?

label_binarize Does not fit for sklearn Naive Bayes classifier showing bad input shape

I was trying to create an ROC curve for the multiclass case using Naive Bayes, but it ends with
ValueError: bad input shape.
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.naive_bayes import BernoulliNB
from scipy import interp
# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]
# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
random_state=0)
# Learn to predict each class against the other
classifier = BernoulliNB(alpha=1.0, binarize=6, class_prior=None, fit_prior=True)
y_score = classifier.fit(X_train, y_train).predict(X_test)
raise ValueError("bad input shape {0}".format(shape))
ValueError: bad input shape (75, 6)
The error is because of binarizing the y variable. The estimator can work with the original class values (even strings) itself.
Remove the following lines:
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]
You are good to go!
To get the predicted probabilities for roc_curve, use the following:
classifier.fit(X_train, y_train)
y_score = classifier.predict_proba(X_test)
y_score.shape
# (75, 3)
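If you then want the per-class ROC curves the question was after, here is a minimal sketch (my addition, reusing the iris setup above): binarize only the test labels for curve computation, not the labels used for fitting.
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
# Binarize y_test purely to compute one curve per class; the classifier itself
# was fit on the raw labels, and the columns of y_score follow classifier.classes_
y_test_bin = label_binarize(y_test, classes=[0, 1, 2])
fpr, tpr, roc_auc = {}, {}, {}
for i in range(y_test_bin.shape[1]):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
print(roc_auc)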

How can I distribute an SVC to different workers (on other computers) using dask

I have a scheduler running on my PC and I want to train 10 instances of an SVC on different worker computers. I fiddled around but could not find a solution.
I am assuming that you want to train those 10 SVCs with different hyperparameters and find the best one (i.e., hyperparameter optimization, which you can do using GridSearchCV). I am also assuming that you are using scikit-learn.
Usually you would train the SVC using code like:
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
# Loading the Digits dataset
digits = datasets.load_digits()
# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target
# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=0)
# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
'C': [1, 10, 100, 1000]},
{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
scores = ['precision', 'recall']
for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
but it would only train sequentially on one thread.
If you install Dask-ML, you can leverage a drop-in replacement for grid search:
conda install dask-searchcv -c conda-forge
Replacing
from sklearn.model_selection import GridSearchCV
by
from dask_searchcv import GridSearchCV
should be sufficient.
However, in your case, you don't want to use the threaded scheduler but the distributed scheduler. Hence, you have to add the following code at the beginning:
# Distribute grid-search across a cluster
from dask.distributed import Client
scheduler_address = '127.0.0.1:8786'
client = Client(scheduler_address)
The final code should look like this (not tested):
from sklearn import datasets
from sklearn.model_selection import train_test_split
from dask_searchcv import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC
# Distribute grid-search across a cluster
from dask.distributed import Client
scheduler_address = '127.0.0.1:8786'
client = Client(scheduler_address)
# Loading the Digits dataset
digits = datasets.load_digits()
# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target
# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.5, random_state=0)
# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
'C': [1, 10, 100, 1000]},
{'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
scores = ['precision', 'recall']
for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()
    clf = GridSearchCV(SVC(), tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(X_train, y_train)
    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()
    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()
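
Before launching the grid search, it can help to confirm the Client actually sees the remote machines. A small sketch (my addition; it assumes dask-worker processes were started on the other computers and pointed at the scheduler address used above):
from dask.distributed import Client
client = Client('127.0.0.1:8786')
# scheduler_info() returns one entry per connected worker process
print(client.scheduler_info()['workers'])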
