Accuracy in outlier detection - machine-learning

I am having trouble understanding this question while learning about outliers. I have attached an image of the question. Can anyone help me understand it? I am new to Data Mining and unable to crack this question. Resources for expanding my knowledge would also be appreciated.
All I know right now is that you can check the accuracy of a model for detecting outliers by comparing its predicted results against the actual values. But in this problem there are no such actual values, which is what has me stuck. I would be grateful if anyone could help me out.
Thanks in advance.

The purpose of the question seems to be more about interpreting ROC curves than about the task itself being an outlier detection problem. You need to understand how to compare two algorithms based on their ROC curves, and to conclude that the suitable metric in this case is the AUC score.
Using Python and scikit-learn we can easily plot the two ROC curves like this:
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score, RocCurveDisplay

# define three lists with the given data: two sets of scores and their true class
scores1 = [0.44, 0.94, 1.86, 2.15, 0.15, 0.5, 5.4, 3.09, 7.97, 5.21]
scores2 = [0.73, 0.18, 0.76, 1.6, 3.78, 4.45, 0.3, 3.3, 0.44, 9.94]
y = [0, 0, 0, 1, 0, 0, 1, 1, 0, 0]

# calculate fpr, tpr and classification thresholds for each algorithm
fpr1, tpr1, thresholds1 = roc_curve(y, scores1)
fpr2, tpr2, thresholds2 = roc_curve(y, scores2)
auc1 = roc_auc_score(y, scores1)
auc2 = roc_auc_score(y, scores2)

# build the curve displays from the metrics above and draw both on the same axes
curve1 = RocCurveDisplay(fpr=fpr1, tpr=tpr1, roc_auc=auc1, estimator_name='Algo1')
curve2 = RocCurveDisplay(fpr=fpr2, tpr=tpr2, roc_auc=auc2, estimator_name='Algo2')
ax = curve1.plot().ax_
curve2.plot(ax=ax)
plt.show()
Then, from the plots, you can read off the trade-off between the False Positive Rate on the x-axis and the True Positive Rate on the y-axis. You will see that algorithm 1, whose curve reaches higher TPR values than that of algorithm 2, is the better algorithm for this task. This can be formalized using the AUC as a metric, which was calculated above with roc_auc_score.
Note that you can also build the plot manually if you compute FPR and TPR for each algorithm at its corresponding classification thresholds.
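For example, here is a minimal sketch of that manual computation, reusing the scores1 and y lists from above (the loop is just my own illustration of the same threshold sweep that roc_curve performs):
import numpy as np

scores1 = np.array([0.44, 0.94, 1.86, 2.15, 0.15, 0.5, 5.4, 3.09, 7.97, 5.21])
y = np.array([0, 0, 0, 1, 0, 0, 1, 1, 0, 0])

# sweep every distinct score as a candidate threshold, flagging a point as an
# outlier whenever its score is at least the threshold
for t in np.sort(np.unique(scores1))[::-1]:
    pred = (scores1 >= t).astype(int)
    tpr = np.sum((pred == 1) & (y == 1)) / np.sum(y == 1)   # true positive rate
    fpr = np.sum((pred == 1) & (y == 0)) / np.sum(y == 0)   # false positive rate
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
Plotting these (FPR, TPR) pairs reproduces the points that roc_curve returns.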
Hope it helps :)
Regards,
Jehona.

Related

XGBoost feature importance (TFIDF + TruncatedSVD)

I have an XGBoost model that runs TF-IDF vectorization and TruncatedSVD reduction on text features. I want to understand the feature importance of the model.
This is how I process text features in my dataset:
.......
tfidf = TfidfVectorizer(tokenizer=tokenize)
tfs = tfidf.fit_transform(token_dict)
svd = TruncatedSVD(n_components=15)
temp = pd.DataFrame(svd.fit_transform(tfs))
temp.rename(columns=lambda x: text_feature+'_'+str(x), inplace=True)
dataset=dataset.join(temp,how='inner')
.......
It works okay-ish, and now I'm trying to understand the importance of the features in the dataset. I generate the chart using:
xgb.plot_importance(model, max_num_features=15)
pyplot.show()
And get something similar to the attached chart.
What would be the right way to "map" the importance of the SVD dimensions back to the dimensions of the initial dataset, so that I know the importance of summary rather than of summary_1, summary_2, ..., summary_X?
Thanks
One thing you can try is measuring how important each original TF-IDF feature is to creating the new SVD features. You can get that from the component loadings:
feature_importance_scores = np.abs(svd.components_).sum(axis=0)
feature_importance_scores /= feature_importance_scores.sum() # normalize to make it easier to compare
You can then get an overall importance per original term by combining these loadings with the model's feature_importances_ for the summary_* columns.
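For instance, here is a rough sketch of that combination. The names model, svd, tfidf and dataset are the objects from your snippet, and I'm assuming the summary_* columns are the 15 SVD outputs and that dataset's column order matches what the model was trained on:
import numpy as np

# positions of the SVD columns in the feature matrix (assumed naming/order)
svd_cols = [dataset.columns.get_loc(f'summary_{i}') for i in range(15)]

# importance that the booster assigns to each SVD component
component_weights = model.feature_importances_[svd_cols]

# weight each component's loadings by that component's importance and sum,
# giving one score per original TF-IDF term
term_importance = np.abs(svd.components_).T @ component_weights
term_importance /= term_importance.sum()

# show the terms that contribute most to the summary_* features
terms = np.array(tfidf.get_feature_names_out())   # get_feature_names() on older scikit-learn
for idx in np.argsort(term_importance)[::-1][:20]:
    print(terms[idx], round(term_importance[idx], 4))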

How exactly are the thresholds evaluated while plotting roc curves using sklearn

I am currently working on a 3-class classification problem using a random forest, and I wanted to do some analysis using ROC curves. Since ROC curves are usually for binary classifiers, I used OneVsRest with the random forest. This gave me 3 binary classifiers, and I plotted their curves using sklearn's roc_curve(). What I don't understand is how the thresholds are chosen. From my understanding of ROC curves for a binary classifier such as logistic regression, we can vary a threshold alpha between 0 and 1 in order to classify: for alpha = 0.5, if p > 0.5 assign class 1, else class 2;
for alpha = 0.6, if p > 0.6 assign class 1, else class 2; and so on. That is how the thresholds change (alpha is the threshold here), and we get the ROC curve by plotting TPR vs FPR for the different thresholds.
Now how does this work in the case of a random forest or any tree-based algorithm? What is the threshold, and how do we extend this to the multi-class case?
Please do correct me if I have made a wrong assumption here.
I tried different methods, e.g.
OneVsRestClassifier(RandomForestClassifier())
and got 3 curves, but I don't know where these thresholds come from or how exactly I should use them to interpret the results, e.g. finding the optimal threshold.
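For reference, here is a minimal, synthetic sketch of roughly what I am doing (made-up data, not my actual pipeline):
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve

# synthetic 3-class data standing in for my real dataset
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5, random_state=0)
Y = label_binarize(y, classes=[0, 1, 2])        # one indicator column per class

clf = OneVsRestClassifier(RandomForestClassifier(random_state=0)).fit(X, y)
probs = clf.predict_proba(X)                     # shape (n_samples, 3)

# one ROC curve per class; the candidate thresholds are drawn from the
# predicted probabilities themselves, not from anything inside the trees
for k in range(3):
    fpr, tpr, thresholds = roc_curve(Y[:, k], probs[:, k])
    print(f"class {k}: {len(thresholds)} thresholds, e.g. {thresholds[:5]}")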

RandomizedSearchCV-Appropriate hyper-parameter distribution

In "Hands-on Machine Learning with Scikit-Learn, Keras & TensorFlow" book I see below distributions(reciprocal and Expon) being applied for Hyperparameters C and gamma. How did Author(Aurelion) came up with these distributions ? I mean how to determine which distribution would be appropriate for application in RandomizedSearchCV ?
param_distribs = {
    'kernel': ['linear', 'rbf'],
    'C': reciprocal(20, 200000),
    'gamma': expon(scale=1.0),
}
I hope I got the question right.
It depends on the ML model. Randomized or grid search is used to search for the hyper-parameter values that result in the best estimator for prediction.
For example, consider the following code. Here rf_clf is the random forest model object, and param_distribs contains the parameters with an arbitrary choice of value ranges.
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distribs = {
    'n_estimators': randint(low=1, high=500),
    'max_depth': randint(low=1, high=10),
    'max_features': randint(low=1, high=10),
}

rf_clf = RandomForestClassifier(random_state=42)
rnd_search_rf = RandomizedSearchCV(rf_clf, param_distributions=param_distribs,
                                   n_iter=10, cv=5, scoring='accuracy', random_state=42)
rnd_search_rf.fit(X_train, y_train)
The best estimator can be accessed via
rnd_search_rf.best_estimator_
I also found the comments below in the sample code on GitHub.
On C: "The distribution we used for C looks quite different: the scale of the samples is picked from a uniform distribution within a given range, which is why the right graph, which represents the log of the samples, looks roughly constant. This distribution is useful when you don't have a clue of what the target scale is."
On reciprocal vs expon: "The reciprocal distribution is useful when you have no idea what the scale of the hyperparameter should be (indeed, as you can see on the figure on the right, all scales are equally likely, within the given range), whereas the exponential distribution is best when you know (more or less) what the scale of the hyperparameter should be."
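If it helps, here is a small sketch (my own illustration, not from the book) that shows the difference in scale behaviour by drawing samples from each distribution:
import numpy as np
from scipy.stats import reciprocal, expon

rng = np.random.RandomState(42)

# reciprocal (log-uniform): every scale between 20 and 200000 is equally
# likely -- useful when you have no idea of the right scale for C
c_samples = reciprocal(20, 200000).rvs(1000, random_state=rng)

# exponential: samples concentrate around the chosen scale (here 1.0) --
# useful when you roughly know the scale, as for gamma
gamma_samples = expon(scale=1.0).rvs(1000, random_state=rng)

print("C     min/median/max:", c_samples.min(), np.median(c_samples), c_samples.max())
print("gamma min/median/max:", gamma_samples.min(), np.median(gamma_samples), gamma_samples.max())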

Questions around XGBoost

I am trying to understand the XGBoost algorithm and have a few questions around it.
I have read various blogs but all seem to tell a different story. Below is a snippet from the code that I am using (only for reference).
param <- list(objective = 'reg:linear',
              eta = 0.01,
              max_depth = 7,
              subsample = 0.7,
              colsample_bytree = 0.7,
              min_child_weight = 5
)
Below are the 4 questions that I have:
1) It seems that XGBoost uses gradient descent to minimise the cost function by changing the coefficients. I understand how that can be done for a gblinear model, which uses linear regression.
However, for a gbtree model, how can XGBoost apply gradient descent when there are no coefficients in a tree-based model for it to change? Or are there?
2) Similarly, the gbtree model uses the parameter lambda for L2 regularisation and alpha for L1 regularisation. I understand that regularisation applies constraints to the coefficients, but again a gbtree model has no coefficients. So how can it apply constraints to them?
3) What is the job of the objective function, e.g. reg:linear? From what I understand, assigning an objective function only tells the model which evaluation metric to use, but then there is a separate eval_metric parameter for that. So why do we need the objective function?
4) What is min_child_weight in simple terms? I thought it was just the minimum number of observations in a leaf node, but I think it has something to do with the Hessian etc., which I don't understand well.
I would really appreciate it if anyone could throw some more light on these in simple, easy-to-understand terms.

Optimal parameter estimation for a classifier with multiple parameters

The image on the left shows a standard ROC curve formed by sweeping a single threshold and recording the corresponding True Positive Rate (TPR) and False Positive Rate (FPR).
The image on the right shows my problem setup, where there are 3 parameters and, for each, only 2 choices. Together, these produce 8 points, as depicted in the graph. In practice, I intend to have thousands of possible combinations of hundreds of parameters, but the concept remains the same in this down-scaled case.
I intend to find 2 things here:
Determine the optimum parameter(s) for the given data
Provide an overall performance score for all combinations of parameters
In the case of the ROC curve on the left, this is done easily using the following methods:
Optimal parameter: Maximal difference of TPR and FPR with a cost component (I believe it is called the J-statistic?)
Overall performance: Area under the curve (the shaded portion in the graph)
However, for my case in the image on the right, I do not know if the methods I have chosen are the standard principled methods that are normally used.
Optimal parameter set: the same maximal difference of TPR and FPR, i.e. parameter score = TPR - FPR * cost_ratio (a small sketch of this scoring follows below)
Overall performance: average of all "parameter scores"
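To make this concrete, here is a minimal sketch of the scoring I have in mind, using made-up (TPR, FPR) values for the 8 parameter combinations and an arbitrary cost_ratio (the numbers are only illustrative):
import itertools
import numpy as np

# hypothetical evaluation results: one (TPR, FPR) pair per combination of the
# 3 binary parameters (8 combinations in total) -- the numbers are made up
combos = list(itertools.product([0, 1], repeat=3))
tpr = np.array([0.55, 0.62, 0.70, 0.78, 0.60, 0.81, 0.74, 0.90])
fpr = np.array([0.30, 0.25, 0.35, 0.28, 0.15, 0.40, 0.22, 0.45])

cost_ratio = 1.0                    # relative cost of a false positive
scores = tpr - fpr * cost_ratio     # "parameter score" for each combination

best = int(np.argmax(scores))
print("optimal parameters:", combos[best], "score:", round(scores[best], 3))
print("overall performance:", round(scores.mean(), 3))   # average of all parameter scores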
I have found a lot of reference material for the ROC curve with a single threshold, and while there are other techniques available to measure performance, the ones mentioned in this question are definitely considered a standard approach. I have found no such reading material for the scenario presented on the right.
Bottom line, the question here is two-fold: (1) provide methods to evaluate the optimal parameter set and overall performance in my problem scenario, and (2) provide a reference showing that the suggested methods are a standard approach for the given scenario.
P.S.: I first posted this question on the Cross Validated forum, but didn't get any responses; in fact, it got only 7 views in 15 hours.
I'm going to expand a little on aberger's previous answer on a grid search. As with any model tuning, it's best to optimise the hyper-parameters using one portion of the data and to evaluate them using another portion, so GridSearchCV is well suited to this purpose.
First I'll create some data and split it into training and test
import numpy as np
from sklearn import model_selection, ensemble, metrics
np.random.seed(42)
X = np.random.random((5000, 10))
y = np.random.randint(0, 2, 5000)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3)
This gives us a classification problem, which is what I think you're describing, though the same would apply to regression problems too.
Now it's helpful to think about which parameters you may want to optimise. A cross-validated grid search is a computationally expensive process, so the smaller the search space, the quicker it gets done. I will show an example for a RandomForestClassifier because it's my go-to model.
clf = ensemble.RandomForestClassifier()
parameters = {'n_estimators': [10, 20, 30],
              'max_features': [5, 8, 10],
              'max_depth': [None, 10, 20]}
So now I have my base estimator and a list of parameters that I want to optimise. Next I have to think about how I want to evaluate each of the models that I'm going to build. It seems from your question that you're interested in the ROC AUC, so that's what I'll use for this example, though you can choose from many built-in metrics in scikit-learn or even define your own.
gs = model_selection.GridSearchCV(clf, param_grid=parameters,
                                  scoring='roc_auc', cv=5)
gs.fit(X_train, y_train)
This will fit a model for every possible combination of the parameters I have given it, using 5-fold cross-validation to evaluate how well each combination performs in terms of ROC AUC. Once that's been fit, we can look at the best parameters and pull out the best performing model.
print(gs.best_params_)
clf = gs.best_estimator_
Outputs:
{'max_features': 5, 'n_estimators': 30, 'max_depth': 20}
Now at this point you may want to retrain your classifier on all of the training data, as currently it's been trained using cross-validation. Some people prefer not to, but I'm a retrainer!
clf.fit(X_train, y_train)
So now we can evaluate how well the model performs on both our training and test set.
print(metrics.classification_report(y_train, clf.predict(X_train)))
print(metrics.classification_report(y_test, clf.predict(X_test)))
Outputs:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      1707
           1       1.00      1.00      1.00      1793

 avg / total       1.00      1.00      1.00      3500

              precision    recall  f1-score   support

           0       0.51      0.46      0.48       780
           1       0.47      0.52      0.50       720

 avg / total       0.49      0.49      0.49      1500
We can see that this model has overfit, as shown by the poor score on the test set. But this is not surprising, as the data is just random noise! Hopefully, when applying these methods to data with a real signal, you will end up with a well-tuned model.
EDIT
This is one of those situations where "everyone does it" but there's no real, clear reference saying it is the best way to do it. I would suggest looking for an example close to the classification problem that you're working on, for example using Google Scholar to search for "grid search" "SVM" "gene expression".
I feeeeel like we're talking about Grid Search in scikit-learn. It (1) provides methods to evaluate optimal (hyper)parameters and (2) is implemented in a massively popular and well-referenced statistical software package.
