Inspection of trees in a Quantile Random Forest Regression model

I am interested in training a random forest to learn some conditional quantile on some data {X, y} sampled independently from some distribution.
That is, for some $$\alpha \in (0, 1)$$, a mapping $$\hat{q}_{\alpha}(x) \in [0, 1]$$ such that for each $X$, $$\arg\min_{\hat{q}_{\alpha}} P(y < \hat{q}_{\alpha}(x)) > \alpha$$.
Is there any clear way to build a random forest effectively in python that could yield such a model?
Additionally, I have one added requirement that may be possible with the current libraries, though I am unsure. Requirement: I would like to select a subset of points, A, from my training set and, when making predictions, exclude from my random forest those trees that were trained on points in A.

There is a Python-based, scikit-learn compatible/compliant Quantile Regression Forest implementation that can be used to estimate conditional quantiles here: https://github.com/zillow/quantile-forest
Your additional requirement of making predictions on training samples by excluding trees that included those samples during training is called out-of-bag (OOB) estimation, and can also be done with the above package.
Setup should be as easy as:
pip install quantile-forest
Then, here's an example of how to fit a quantile random forest model and use it to predict quantiles with OOB estimation for a subset (here the first 100 rows) of the training data:
import numpy as np
from quantile_forest import RandomForestQuantileRegressor
from sklearn import datasets
from sklearn.model_selection import train_test_split
X, y = datasets.fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)
qrf = RandomForestQuantileRegressor()
qrf.fit(X_train, y_train)
# Predict OOB quantiles for first 100 training samples.
y_pred_oob = qrf.predict(
    X_train[:100, :],
    quantiles=[0.025, 0.5, 0.975],
    oob_score=True,
    indices=np.arange(100),
)
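If the subset A is an arbitrary set of training rows rather than the rows being predicted, one possible approach is to bag the trees by hand so that each tree's bootstrap sample is known, and then keep only the trees that never saw a row in A. The following is a minimal sketch using plain scikit-learn decision trees, not the quantile-forest API. The synthetic data, the helper `predict_quantile_excluding`, and the choice of `A` are all hypothetical, and pooling the training targets that share a leaf with the query point is a simplification of the weighting a full quantile regression forest uses.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the training data {X, y}.
n_samples = 300
X = rng.uniform(-3, 3, size=(n_samples, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n_samples)

# Bag the trees by hand so each tree's bootstrap indices are known.
n_trees = 100
forest = []  # (tree, bootstrap indices, leaf id of each bootstrap row)
for seed in range(n_trees):
    idx = rng.integers(0, n_samples, size=n_samples)  # sample with replacement
    tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=seed)
    tree.fit(X[idx], y[idx])
    forest.append((tree, idx, tree.apply(X[idx])))


def predict_quantile_excluding(X_new, alpha, A):
    """Estimate the alpha-quantile using only trees that never saw a row in A."""
    kept = [entry for entry in forest if not np.isin(A, entry[1]).any()]
    if not kept:
        raise ValueError("every tree was trained on at least one point in A")
    preds = np.empty(len(X_new))
    for i in range(len(X_new)):
        pooled = []
        for tree, idx, train_leaves in kept:
            leaf = tree.apply(X_new[i : i + 1])[0]
            # Training targets that fall into the same leaf as the query point.
            pooled.append(y[idx][train_leaves == leaf])
        preds[i] = np.quantile(np.concatenate(pooled), alpha)
    return preds, len(kept)


A = [0, 1]  # hypothetical subset of training row indices to exclude
X_new = np.linspace(-3, 3, 5).reshape(-1, 1)
q10, n_kept = predict_quantile_excluding(X_new, 0.1, A)
q90, _ = predict_quantile_excluding(X_new, 0.9, A)
print(f"{n_kept} of {n_trees} trees kept")
```

Note that with bootstrap sampling each tree sees any given row with probability of roughly 63%, so this only leaves a usable number of trees when A is small.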

Related

How to compare baseline and GridSearchCV results fairly?

I am a bit confused about how to compare the best GridSearchCV model with a baseline.
For example, suppose we have a classification problem.
As a baseline, we'll fit a model with default settings (let it be logistic regression):
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
baseline = LogisticRegression()
baseline.fit(X_train, y_train)
pred = baseline.predict(X_train)
print(accuracy_score(y_train, pred))
So, the baseline gives us accuracy using the whole train sample.
Next, GridSearchCV:
from sklearn.model_selection import cross_val_score, GridSearchCV, StratifiedKFold, train_test_split
X_val, X_test_val, y_val, y_test_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42)
cv = StratifiedKFold(n_splits=5, random_state=0, shuffle=True)
parameters = [ ... ]
# The estimator, the parameter grid, and the CV settings are all arguments of GridSearchCV itself.
best_model = GridSearchCV(LogisticRegression(), parameters, scoring='accuracy', cv=cv)
best_model.fit(X_val, y_val)
print(best_model.best_score_)
Here, we have the cross-validated accuracy based on the validation sample.
My questions are:
Are those accuracy scores comparable? Generally, is it fair to compare GridSearchCV and model without any cross validation?
For the baseline, isn't it better to use Validation sample too (instead of the whole Train sample)?
No, they aren't comparable.
Your baseline model used X_train to fit the model, and you then score the fitted model on the same X_train sample. This is like cheating: the score is optimistically biased because the model is evaluated on data that it has already seen.
The grid searched model is at a disadvantage because:
It's working with less data since you have split the X_train sample.
Compound that with the fact that it's getting trained with even less data due to the 5 folds (it's training with only 4/5 of X_val per fold).
So your score for the grid search is going to be worse than your baseline.
Now you might ask, "so what's the point of best_model.best_score_?" Well, that score is used to compare all the models tried while searching for the optimal hyperparameters in your search space, but it should in no way be used to compare against a model that was trained outside of the grid search context.
So how should one go about conducting a fair comparison?
Split your training data for both models.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Fit your models using X_train.
# fit baseline
baseline.fit(X_train, y_train)
# fit using grid search
best_model.fit(X_train, y_train)
Evaluate models against X_test.
# baseline
baseline_pred = baseline.predict(X_test)
print(accuracy_score(y_test, baseline_pred))
# grid search
grid_pred = best_model.predict(X_test)
print(accuracy_score(y_test, grid_pred))
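Put together, the steps above could look like the following self-contained sketch. The synthetic dataset from make_classification and the parameter grid over `C` are assumptions made purely for illustration; substitute your own data and search space.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# One split, shared by both models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Baseline: default settings, fitted on the training portion only.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Grid search: cross-validation happens inside X_train only.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # hypothetical search space
best_model = GridSearchCV(
    LogisticRegression(max_iter=1000), param_grid, scoring="accuracy", cv=cv
)
best_model.fit(X_train, y_train)

# Both models are scored on the same held-out data, which neither has seen.
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
grid_acc = accuracy_score(y_test, best_model.predict(X_test))
print(f"baseline: {baseline_acc:.3f}, grid search: {grid_acc:.3f}")
```

Because GridSearchCV refits the best configuration on all of X_train by default, both models end up trained on the same amount of data before being scored on X_test.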

When are the features independent enough to use an NB classifier?

I am working with classification models and as I am new to it I have a question. It is said that Naive Bayes performs well when features are independent of each other. How do I know if features in my feature set are independent? Any example? Thanks!!
Independence of Features
In most cases people want to check whether one feature is highly correlated with another (or even repeated), so that one of them can be omitted. A correlation of 1 (or -1) means that you don't lose any information if one of the correlated features is omitted. There are multiple ways to check correlation in Python, e.g. np.corrcoef, pd.DataFrame.corr and scipy.stats.pearsonr.
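As a small illustration of such a correlation check, the sketch below builds three toy features, one of which is just a rescaled copy of another, and lists the feature pairs whose absolute correlation exceeds a cut-off. The toy data and the 0.95 threshold are assumptions chosen for the example, not a recommended setting.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)

# Columns: a, b, and a rescaled copy of a (perfectly correlated with column 0).
X_demo = np.column_stack([a, b, 2.0 * a + 1.0])

# rowvar=False treats columns as variables, giving a 3 x 3 correlation matrix.
corr = np.corrcoef(X_demo, rowvar=False)

threshold = 0.95  # illustrative cut-off
n_features = corr.shape[0]
redundant_pairs = [
    (i, j)
    for i in range(n_features)
    for j in range(i + 1, n_features)
    if abs(corr[i, j]) > threshold
]
print(redundant_pairs)
```

Here only the pair of columns 0 and 2 is flagged, so one of those two columns could be dropped without losing information.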
But things can be more complicated.
Features are independent of each other if you can't use features x_1, ..., x_n to predict feature x_{n+1}. In most cases one might check whether the features are linearly dependent on each other, meaning:
x_{n+1} = a_1 * x_1 + ... + a_n * x_n + error
If this is the case (and the error contribution is small), one might drop the dependent feature. Note that you can therefore omit any one of the n+1 features that appears with a nonzero coefficient, since you can rearrange the equation to have any such x_i on the left-hand side.
To check this, one might calculate the eigenvalues of the features' covariance matrix (or the singular values of the feature matrix, as in the code below) and check for values close to zero.
Removing dependent Features
from sklearn import datasets
import numpy as np
from sklearn import decomposition
from sklearn import naive_bayes
from sklearn import model_selection
X, y = datasets.make_classification(n_samples=10000, n_features=10, n_repeated=0, n_informative=6, n_redundant=4, n_classes=2)
u, s, vh = np.linalg.svd(X, full_matrices=False)  # thin SVD; the full u would be 10000 x 10000
#display s
s
array([8.06415389e+02, 6.69591201e+02, 4.31329281e+02, 4.02622029e+02,
2.85447317e+02, 2.53360358e+02, 4.07459972e-13, 2.55851809e-13,
1.72445591e-13, 6.68493846e-14])
So basically, 4 features are redundant. So now, we can use a feature reduction technique such as Principal Component Analysis or Linear Discriminant Analysis to reduce to only 6 features.
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y)
gnb = naive_bayes.GaussianNB()
gnb.fit(X_train, y_train)
gnb.score(X_test, y_test) #results in 0.7216
Now we reduce the features to 6.
pca = decomposition.PCA(n_components=6)
X_trafo = pca.fit_transform(X)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_trafo, y)
gnb.fit(X_train, y_train)
gnb.score(X_test, y_test) #results in 0.7216
Note that the values don't need to be exactly the same.

Logistic Regression sklearn with categorical Output

I have to train a model with logistic regression in sklearn. I saw everywhere that the outcome has to be binary, but my label is good, bad or normal. I have 12 features and I don't know how to deal with three labels. I am very thankful for every answer.
You can use Multinomial Logistic Regression.
In Python, you can modify your logistic regression code as follows:
LogisticRegression(multi_class='multinomial').fit(X_train,y_train)
You can see Logistic Regression documentation in Scikit-Learn for more details.
This is called multiclass classification, and one-vs-rest (also known as one-vs-all) is one way of handling it.
From sklearn.linear_model.LogisticRegression:
In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the ‘multi_class’ option is set to ‘ovr’, and uses the cross-entropy loss if the ‘multi_class’ option is set to ‘multinomial’. (Currently the ‘multinomial’ option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)
Code example:
# Authors: Tom Dupre la Tour <tom.dupre-la-tour#m4x.org>
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# make 3-class dataset for classification
centers = [[-5, 0], [0, 1.5], [5, -1]]
X, y = make_blobs(n_samples=1000, centers=centers, random_state=40)
transformation = [[0.4, 0.2], [-0.4, 1.2]]
X = np.dot(X, transformation)
for multi_class in ('multinomial', 'ovr'):
    clf = LogisticRegression(solver='sag', max_iter=100, random_state=42,
                             multi_class=multi_class).fit(X, y)
    # print the training scores
    print("training score : %.3f (%s)" % (clf.score(X, y), multi_class))
Check for full code example: Plot multinomial and One-vs-Rest Logistic Regression
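For the situation in the question (12 features and the three string labels good, bad and normal), a minimal sketch could look like the one below. The data is synthetic and the mapping from class index to label is an arbitrary assumption. As far as I know, recent scikit-learn versions already use the multinomial formulation for multiclass targets with the default lbfgs solver, so the sketch does not pass multi_class at all.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 12 features, 3 classes.
X, y_int = make_classification(
    n_samples=600, n_features=12, n_informative=6, n_classes=3, random_state=0
)

# Map the integer classes to string labels; this mapping is purely illustrative.
label_names = np.array(["bad", "normal", "good"])
y = label_names[y_int]

# String labels can be passed directly; no manual encoding is required.
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.classes_)            # labels in the column order used by predict_proba
proba = clf.predict_proba(X[:5])
pred = clf.predict(X[:5])
print(pred)
```

Each row of `proba` holds one probability per label, ordered as in `clf.classes_`, and `predict` returns the label with the highest probability.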

Implementation of Cross-validation

I am confused, since many people have their own approach to applying cross-validation. For instance, some apply it to the whole dataset and some apply it to the training set only.
My question is whether the code below is an appropriate way to implement cross-validation and to make predictions from a model while cross-validation is being applied.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold
model = GradientBoostingClassifier(n_estimators=10, max_depth=10, random_state=0)  # specifying the model
cv = KFold(n_splits=5, shuffle=True)
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import cross_val_score
#X -the whole dataset
#y - the whole dataset but target attributes only
y_pred = cross_val_predict(model, X, y, cv=cv)
scores = cross_val_score(model, X, y, cv=cv)
You need a test set to evaluate performance on completely unseen data, even when using cross-validation. Tuning should not be done on this test set, to avoid data leakage.
Split the data into two segments, train and test, and apply cross-validation to the training segment. There are various CV methods, such as K-Fold and Stratified K-Fold. Visualizations and further reading material are here:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html
https://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
In K-Fold CV, the training data is split into K folds. Then, in each of K rounds, the model is trained on K-1 of the folds and the remaining fold is used for performance evaluation.
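To make the fold mechanics concrete, here is a tiny sketch that splits 10 toy samples into 5 folds and prints which indices are used for training and which for evaluation in each round. The toy array `X_demo` is made up for the example.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # 10 toy samples, 2 features each
kf = KFold(n_splits=5)  # no shuffling, so each fold is a contiguous block

folds = list(kf.split(X_demo))
for round_no, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"round {round_no}: train on {train_idx.tolist()}, "
          f"evaluate on {test_idx.tolist()}")
```

Every sample is used for evaluation exactly once and for training in the other four rounds.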
An illustration and further detail about cross-validation, train/validation/test splits, etc. can be found here:
https://scikit-learn.org/stable/modules/cross_validation.html
(Figure: visualization of K-Fold cross-validation for 3 classes.)
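Applied to the model from the question, the advice above might be sketched as follows: hold out a test set first, run cross-validation on the training portion only, and touch the test set once at the end. The synthetic dataset, the 80/20 split and the shallower max_depth are assumptions made to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the whole dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a test set before any cross-validation takes place.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = GradientBoostingClassifier(n_estimators=10, max_depth=3, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Cross-validation sees the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=cv)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Final model: refit on all training data, then evaluate once on the test set.
model.fit(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"held-out test accuracy: {test_score:.3f}")
```

cross_val_score clones the estimator for each fold, so the explicit `fit` call at the end is what produces the model that is actually used for predictions.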

ROC AUC score for AutoEncoder and IsolationForest

I am new to the machine learning area and I am trying to implement anomaly detection algorithms. One algorithm is an autoencoder implemented with Keras from the TensorFlow library, and the second is IsolationForest implemented with the sklearn library. I want to compare these algorithms using roc_auc_score (a Python function from sklearn.metrics), but I am not sure if I am doing it correctly.
In the documentation of the roc_auc_score function I can see that the input should look like this:
sklearn.metrics.roc_auc_score(y_true, y_score, average='macro', sample_weight=None, max_fpr=None)
y_true :
True binary labels or binary label indicators.
y_score :
Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers). For binary y_true, y_score is supposed to be the score of the class with greater label.
For AE I am computing roc_auc_score like this:
model.fit(...) # model from https://www.tensorflow.org/api_docs/python/tf/keras/Sequential
pred = model.predict(x_test) # predict function from https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#predict
metric = np.mean(np.power(x_test - pred, 2), axis=1) #MSE
print(roc_auc_score(y_test, metric))  # where y_test is true binary labels 0/1
For IsolationForest I am computing roc_auc_score like this:
model.fit(...) # model from https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
metric = -(model.score_samples(x_test)) # https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html#sklearn.ensemble.IsolationForest.score_samples
print(roc_auc_score(y_test, metric))  # where y_test is true binary labels 0/1
I am just curious if returned roc_auc_score from both implementations of AE and IsolationForest are comparable (I mean, if I am computing them in the correct way)? Especially in AE model, where I am putting MSE into the roc_auc_score (if not, what should be the input as y_score to this function?)
Comparing AE and IsolationForest in the context of anomaly detection using sklearn.metrics.roc_auc_score, based on scores coming from the AE MSE loss and the IF decision_function() respectively, is okay. The differing range of y_score between the two models isn't an issue, since only the ordering of the scores within each model matters when computing the AUC.
To understand that AUC isn't range dependent, remember that you travel along the decision function values to obtain the ROC points. Rescaling the decision function values will only change the decision function thresholds accordingly, defining similar points of the ROC since the new thresholds will lead each to the same TPR and FPR as they did before the rescaling.
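This invariance can be checked numerically. The sketch below uses made-up labels and scores and compares the AUC of the raw scores with the AUC after an affine rescaling and after another strictly increasing transform. It also shows that the sign does matter, which is why the IsolationForest scores are negated before being passed to roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)       # made-up binary labels
scores = rng.normal(size=200) + y_true      # anomalies score higher on average

auc_raw = roc_auc_score(y_true, scores)
auc_affine = roc_auc_score(y_true, 1000.0 * scores + 5.0)  # rescaled and shifted
auc_exp = roc_auc_score(y_true, np.exp(scores))            # strictly increasing
auc_flipped = roc_auc_score(y_true, -scores)               # ordering reversed

print(auc_raw, auc_affine, auc_exp, auc_flipped)
```

The first three values agree because the ranking of the samples is unchanged, while reversing the ranking turns the AUC into one minus its original value.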
I couldn't find a convincing line of code in the implementation of sklearn.metrics.roc_auc_score, but you can easily observe this comparison in published code associated with a research paper. For example, in the Deep One-Class Classification paper's code (I'm not an author, I know the paper's code because I'm reproducing their results), the AE MSE loss and the IF decision_function() are the roc_auc_score inputs (whose outputs the paper is comparing):
AE roc_auc_score computation
Found in this script on github.
from sklearn.metrics import roc_auc_score
(...)
scores = torch.sum((outputs - inputs) ** 2, dim=tuple(range(1, outputs.dim())))
(...)
auc = roc_auc_score(labels, scores)
IsolationForest roc_auc_score computation
Found in this script on github.
from sklearn.metrics import roc_auc_score
(...)
scores = (-1.0) * self.isoForest.decision_function(X.astype(np.float32)) # compute anomaly score
y_pred = (self.isoForest.predict(X.astype(np.float32)) == -1) * 1 # get prediction
(...)
auc = roc_auc_score(y, scores.flatten())
Note: The two scripts come from two different repositories but are actually the source of a single paper's results. The authors only chose to create an extra repository for their PyTorch implementation of an AD method requiring a neural network.
