I have been executing multiple runs of cross-validation and obtaining several AUROCs (area under the ROC curve). I have found that the distribution of those AUC values appears to follow a normal distribution. Is there any scientific explanation for that? Thanks.
The Central Limit Theorem is often used to justify approximate normality of (sample) distributions of statistics calculated on large amounts of data. This will obviously break down for AUC close to 0 or 1 because the normal distribution has support on the entire real line.
Why do you care? Is it just for curiosity or are you trying to do something with this intuition?
If you want to compute intervals, a better technique is to use the bootstrap. If you're comparing ROCs of two models, you can bootstrap the paired decisions of the two models to get intervals on the difference.
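For concreteness, a minimal sketch of that paired bootstrap (names such as scores_a and scores_b are placeholders for your two models' predicted scores, not anything from the question):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def bootstrap_auc_diff(y_true, scores_a, scores_b, n_boot=2000, alpha=0.05):
    # y_true, scores_a, scores_b: numpy arrays over the same test cases
    n = len(y_true)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample cases, keeping the pairing
        y, sa, sb = y_true[idx], scores_a[idx], scores_b[idx]
        if len(np.unique(y)) < 2:                 # need both classes to compute an AUC
            continue
        diffs.append(roc_auc_score(y, sa) - roc_auc_score(y, sb))
    lo, hi = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi                                 # percentile CI for the AUC difference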
A normal distribution of AUROC values is not possible.
Because normal distributions have support on the whole real line, while AUROC is bounded to [0, 1]. So at most it looks vaguely like a normal distribution.
It's more likely that you are observing a (scaled) binomial distribution.
There is a probabilistic interpretation of AUROC (sorry, I don't remember the source). Assuming there is a "true" probability p and you observe n random samples from it, the distribution of AUROC values might then be roughly B(n, p)/n.
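As a quick sanity check of that intuition (the values n = 500 and p = 0.8 below are arbitrary choices for the demo, not taken from the question), a scaled binomial already looks roughly normal for moderate n and p away from 0 or 1:

import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 0.8
samples = rng.binomial(n, p, size=10_000) / n     # "AUC-like" values bounded to [0, 1]
print(samples.mean(), samples.std())              # close to p and sqrt(p * (1 - p) / n)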
I have a question about variable importance ranking.
I built an MLP and an RF model using the same dataset with 34 variables and achieved the same accuracy on a similar test dataset. As you can see in the picture below the top variables for the SHAP summary plot and the RF VIM are quite different.
Interestingly, I removed the low-ranked variable from the MLP and the accuracy increased. However, the RF result didn’t change.
Does that mean the RF is not a good choice for modeling this dataset?
It’s still strange to me that the rankings are so different:
[Figure: SHAP summary plot vs. RF VIM, with the top- and low-ranked variables numbered]
Shouldn't the variable rankings be the same for the MLP and the RF?
No. There may be a tendency for different algorithms to rank certain features higher, but there is no reason for the rankings to be the same.
Different algorithms:
May have different objective functions to achieve intended goal.
May use features differently to achieve min (max) of the objective function.
On top of that, what you cite as RF "feature importance" (mean decrease in Gini) is only one of many ways to calculate feature importance for an RF (the choices include which metric you use and how you aggregate the total decrease due to a feature); a quick illustration follows the summary below. In contrast, SHAP is model-agnostic when it comes to explaining feature contributions to the outcome.
In sum:
Different models will have different opinions about what is important and what is not. What is important for one algorithm may not be so important for another, and vice versa. This doesn't tell you anything about the applicability of a model to a specific dataset.
Use SHAP values (or any other feature importance metric that you and your clients understand) to explain a model (if necessary).
Choose "best" model based on your goals: performance or explainability.
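For concreteness, a minimal sketch of the point about RF importances: even for the same fitted forest, mean decrease in Gini and permutation importance can rank features differently. The data here is synthetic, not the poster's 34-variable set:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4, random_state=0)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

gini_rank = rf.feature_importances_.argsort()[::-1]          # mean decrease in Gini
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
perm_rank = perm.importances_mean.argsort()[::-1]            # permutation importance

print("Gini ranking:       ", gini_rank)
print("Permutation ranking:", perm_rank)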
I am trying to implement a Gaussian Naive Bayes classifier on number classification data, where each feature represents a pixel.
While implementing this, I hit a bump: I've noticed that some of the feature variances equal 0.
This is an issue because I cannot divide by 0 when computing the probability.
What can I do to work around this?
The very short answer is that you cannot - even though you can usually try to fit a Gaussian distribution to any data (no matter its true distribution), there is one exception: the constant case (zero variance). So what can you do? There are three main solutions:
Ignore the 0-variance pixels. I do not generally recommend this approach since it loses information, but if a pixel has 0 variance within each class (which is common for MNIST - some pixels are black regardless of class), then it is fully justified mathematically. Why? If, for each class, a given feature is constant (equal to a single value), then it carries literally no information for classification, so ignoring it will not affect a model that assumes conditional independence of features (such as NB).
Instead of the MLE estimate (i.e., using N(mean(X), std(X))), use a regularised estimator, for example of the form N(mean(X), std(X) + eps), which is equivalent to adding eps-noise independently to each pixel. This is a very generic approach that I would recommend (see the sketch after this list).
Use a better distribution class. If your data is images (and since you have 0 variance I assume these are binary images, maybe even MNIST), you have K features, each in the [0, 1] interval. You can use a multinomial distribution with bucketing, so P(x ∈ Bi | y) = #{x ∈ Bi | y} / #{x | y}. This is usually the best thing to do (although it requires some knowledge of your data): the underlying problem is that you are using a model that is not suited to the data provided, and a properly chosen distribution will always give better results with NB. So how can you find a good distribution? Plot the conditional marginals P(xi | y) for each feature and look at their shape; based on that, choose a distribution class that matches the behaviour. I can assure you they will not look like Gaussians.
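A minimal sketch of option 2: scikit-learn's GaussianNB already applies this kind of variance smoothing through its var_smoothing parameter (it adds a small fraction of the largest feature variance to every variance), so zero-variance pixels no longer cause a division by zero. X and y below are random placeholders for your pixel matrix and labels:

import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.random.rand(1000, 64)          # placeholder: 1000 "images", 64 pixel features
X[:, 0] = 0.0                         # a constant, zero-variance pixel
y = np.random.randint(0, 10, size=1000)

clf = GaussianNB(var_smoothing=1e-6)  # eps-style regularisation of the per-feature variances
clf.fit(X, y)
print(clf.predict(X[:5]))             # works despite the zero-variance pixel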
Another general question on data science!
Let's say I have a bunch of samples and I have to detect outliers on each sample. My data would be univariate, so I can use simple methods like standard deviation or median absolute deviation.
Now my question is: how would one validate that the results are coherent, especially when inspecting them by eye isn't an option because of the size of the data? For example, how do you choose how many standard deviations should define an outlier? I haven't seen any quantitative method so far. Does one even exist?
Cheers
Interestingly, you didn't say in which dimension the data is "big", which I think matters here. E.g., you can draw a Q-Q plot for high-dimensional data, but that is not so easy with a very large number of data points.
However, for a general methodology I would attack this problem from a probabilistic perspective. This will never tell you which specific data point is an outlier; it will, however, tell you the probability that you have an outlier (in certain areas of your data).
I have to make two assumptions: (a) you know the family of distributions your data stems from, e.g., normal or Poisson; (b) you can estimate the parameters of this family from a data set.
Now you can state the null hypothesis (H0) that your data comes from this distribution, against the alternative that it does not. If you draw a random sample from your estimated distribution, that drawn sample should on average be about as likely under the distribution as your observed sample. If your observed sample is markedly less likely, that is evidence against H0 and hence evidence that it contains outliers.
However, it is probably more interesting to find the sub-space that contains the outliers. This can be done with the following empirical procedure: estimate the parameters of your distribution from the data, then compare the estimated distribution with the histogram of the observed data. This gives you, for each bin of the histogram, a probability that it contains an outlier. For high-dimensional data this can be checked programmatically.
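A rough sketch of that bin-by-bin comparison, assuming a normal family (the family, the number of bins, and the 0.01 expected-count threshold are all arbitrary choices for illustration):

import numpy as np
from scipy import stats

x = np.random.default_rng(0).normal(0, 1, size=5000)   # placeholder data
x = np.append(x, [8.0, 9.5])                            # inject two obvious outliers

mu, sigma = x.mean(), x.std(ddof=1)                     # estimate the parameters
counts, edges = np.histogram(x, bins=50)

# expected count per bin under the fitted N(mu, sigma)
probs = stats.norm.cdf(edges[1:], mu, sigma) - stats.norm.cdf(edges[:-1], mu, sigma)
expected = probs * len(x)

# bins that are occupied although the fitted model makes them very unlikely
suspicious = np.where((counts > 0) & (expected < 0.01))[0]
print("suspicious bins:", [(edges[i], edges[i + 1]) for i in suspicious])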
I need some points of view on whether what I am doing is good or wrong, or whether there is a better way to do it.
I have 10 000 elements. For each of them I have about 500 features.
I am looking to measure the separability between 2 sets of those elements. (I already know the 2 groups; I am not trying to find them.)
For now I am using an SVM. I train the SVM on 2000 of those elements, then I look at how good the score is when I test on the other 8000 elements.
Now I would like to know which features maximize this separation.
My first approach was to test each combination of features with the SVM and track the score. If the score is good, those features are relevant for separating the 2 sets of data.
But this takes far too much time: there are 2^500 - 1 possible feature subsets.
The second approach was to remove one feature and see how much the score is affected. If the score changes a lot, that feature is relevant. This is faster, but I am not sure it is right: with 500 features, removing just one rarely changes the final score much.
Is this a correct way to do it?
Have you tried any other methods? You could try a decision tree or a random forest; they would give you the best features based on entropy gain. Can I assume all the features are independent of each other? If not, please remove the dependent ones as well.
Also, for support vector machines, you can check out this paper:
http://axon.cs.byu.edu/Dan/778/papers/Feature%20Selection/guyon2.pdf
But it's based more on linear SVM.
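Assuming the linked paper is the SVM-RFE approach (recursive feature elimination driven by the weights of a linear SVM), scikit-learn has a ready-made implementation; a minimal sketch on synthetic data (nothing here is the poster's 10 000 x 500 matrix, and the parameter values are arbitrary):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=500, n_informative=20, random_state=0)

svm = LinearSVC(C=1.0, dual=False, max_iter=5000)
rfe = RFE(estimator=svm, n_features_to_select=20, step=0.1)  # drop 10% of features per round
rfe.fit(X, y)

selected = rfe.support_.nonzero()[0]   # indices of the features the SVM weights kept
print("selected features:", selected)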
You can do statistical analysis on the features to get indications of which terms best separate the data. I like Information Gain, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for ci are the ones distributed most differently in the sets of positive and negative examples of ci. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ2 is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent tk and ci are. The terms tk with the lowest value for χ2(tk, ci) are thus the most independent from ci; since we are interested in the terms which are not, we select the terms for which χ2(tk, ci) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem. The features with the highest Information Gain are likely to best separate your data.
I've been successful using Information Gain for feature reduction and found this paper (Christine Largeron, Christophe Moulin, and Mathias Géry, Entropy based feature selection for text categorization, SAC 2011, pp. 924-928) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term tj and a category ck, ECCD(tj, ck) can be computed from a contingency table. Let A be the number of documents in the category containing tj; B, the number of documents in the other categories containing tj; C, the number of documents of ck which do not contain tj; and D, the number of documents in the other categories which do not contain tj (with N = A + B + C + D).
Using this contingency table, Information Gain can be estimated by:
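The formula itself appeared as an image in the original answer. Assuming it is the usual mutual-information form of Information Gain written in terms of the contingency counts, it reads:

IG(t_j, c_k) = \sum_{t \in \{t_j, \bar{t}_j\}} \sum_{c \in \{c_k, \bar{c}_k\}} P(t, c) \log \frac{P(t, c)}{P(t)\,P(c)}

with P(t_j, c_k) = A/N, P(t_j, \bar{c}_k) = B/N, P(\bar{t}_j, c_k) = C/N, and P(\bar{t}_j, \bar{c}_k) = D/N.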
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
You needn't use a single technique either; you can combine them. Term-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
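To make "easy to implement" concrete, here is a small helper that computes Information Gain (mutual information) from the counts A, B, C, D; the function name and the toy numbers are mine, not from either paper:

import math

def information_gain(A, B, C, D):
    # A, B, C, D are the contingency-table counts defined above; N = A + B + C + D
    N = A + B + C + D
    ig = 0.0
    for joint, marg_t, marg_c in [(A, A + B, A + C), (B, A + B, B + D),
                                  (C, C + D, A + C), (D, C + D, B + D)]:
        if joint > 0:
            ig += (joint / N) * math.log((joint * N) / (marg_t * marg_c))
    return ig

print(information_gain(A=40, B=10, C=5, D=945))   # a term concentrated in category ck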
If you want a single feature to discriminate your data, use a decision tree, and look at the root node.
SVM by design looks at combinations of all features.
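A minimal sketch of the root-node idea on synthetic data (all sizes are arbitrary): fit a depth-1 tree (a stump) and read off which feature it split on.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=500, n_informative=10, random_state=0)

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
root_feature = stump.tree_.feature[0]      # index of the feature used at the root split
print("most discriminative single feature:", root_feature)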
Have you thought about Linear Discriminant Analysis (LDA)?
LDA aims at discovering a linear combination of features that maximizes the separability. The algorithm works by projecting your data into a space where the within-class variance is minimal and the between-class variance is maximal.
You can use it to reduce the number of dimensions required to classify, and you can also use it as a linear classifier.
However, with this technique you would lose the original features and their meaning, which you may want to avoid.
If you want more details I found this article to be a good introduction.
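For concreteness, a minimal sketch of the LDA route on synthetic two-class data (nothing here comes from the poster's 10 000 x 500 matrix):

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=2000, n_features=500, n_informative=10, random_state=0)

lda = LinearDiscriminantAnalysis()
X_proj = lda.fit_transform(X, y)           # for 2 classes this is a single discriminant axis
print("accuracy of LDA as a linear classifier:", lda.score(X, y))
print("features with the largest absolute weights:", abs(lda.coef_[0]).argsort()[::-1][:10])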
Given D = (x, y) with y = F(x), it seems most machine learning methods output y as a univariate quantity, either a label or a real value. But I am facing a situation where the x vector may only have 5~9 dimensions while I need y to be a multinomial distribution vector which can have up to 800 dimensions. This makes the problem really tricky.
I looked into a lot of multitask learning (MTL) methods, where I could train all these y_i at the same time. And of course, another naive option is to train all these dimensions separately without considering the linkage between tasks. But after reviewing many papers, it seems that most MTL experiments only deal with 10~30 tasks, which means 800 tasks could be crazy and hard to train. Maybe clustering could be a solution, but I am really curious whether anyone can suggest other ways to deal with this problem, not from an MTL perspective.
When the input is so "small" and the output so big, I would expect there to be a different representation of those output values. You could check whether they are some linear or nonlinear combination, so that you estimate the "function parameters" instead of the values themselves. Example: we once estimated a time series that could be "reduced" to a weighted sum of normal distributions, so we only had to estimate the weights and parameters.
In the end you will only reach a 6-to-12-dimensional subspace in some sense (probably not a linear one) when you have only 6 input parameters. It can of course be a bit complicated, but to avoid chaos in an 800-dimensional space I would really look into parametrizing the result.
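A rough sketch of that parametrization idea: instead of predicting the 800 bin values directly, predict a handful of mixture parameters and expand them back into the 800-dimensional vector. All names and sizes here are illustrative assumptions, not taken from the question.

import numpy as np
from scipy import stats

N_BINS = 800
bins = np.linspace(0.0, 1.0, N_BINS)

def expand(params):
    # params = (w1, mu1, sd1, w2, mu2, sd2) -> normalised 800-dim distribution vector
    w1, mu1, sd1, w2, mu2, sd2 = params
    density = w1 * stats.norm.pdf(bins, mu1, sd1) + w2 * stats.norm.pdf(bins, mu2, sd2)
    return density / density.sum()

# A regression model would be trained to map x (5~9 features) to the 6 parameters;
# here we just expand one hand-picked parameter vector.
y_vec = expand((0.7, 0.3, 0.05, 0.3, 0.8, 0.10))
print(y_vec.shape)   # (800,)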
And, as I commented, the machine learning methods that I know of do produce vectors: http://en.wikipedia.org/wiki/Bayes_estimator