What is the relationship between mutual information and prediction accuracy (for classification) or MSE (for regression)? Is it possible to have high accuracy / low MSE together with low mutual information in data mining?
Mutual information is defined between two random variables, i.e. for a pair of probability distributions. Much of what can be said about its relationship to other quantities depends heavily on how you compute and represent these distributions (e.g. discrete versus continuous).
Given a set of probability distributions, the relationship between classification accuracy and mutual information has been studied in the literature. In short, one quantity puts bounds on the other, at least for discrete probability distributions.
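One concrete form of such a bound is Fano's inequality (a standard information-theory result, stated here for discrete variables; H_b denotes the binary entropy function):

```latex
% Fano's inequality: any classifier predicting a discrete label Y
% (taking |Y| values) from X has error probability P_e satisfying
H_b(P_e) + P_e \log\bigl(|\mathcal{Y}| - 1\bigr) \;\ge\; H(Y \mid X) \;=\; H(Y) - I(X;Y)
% so small I(X;Y) forces H(Y|X) up, which in turn bounds P_e away from
% zero: high accuracy with genuinely low mutual information is impossible.
```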
I don't know of any formal studies looking at the relationship between the MSE and mutual information.
All of that being said: if I had a concrete data set and got low mutual information scores for two variables but also a very low MSE in a regression model, I would take a hard look at how the mutual information was computed. Ninety-nine times out of a hundred this happens because the original discrete formulation of Shannon entropy (and by extension mutual information) was applied to continuous / floating-point data, even though it only applies to discrete data.
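To illustrate the failure mode, here is a minimal sketch (the data and bin counts are made up; it assumes scikit-learn is available):

```python
# Naive discrete MI on continuous data vs. an estimator built for it.
import numpy as np
from sklearn.metrics import mutual_info_score
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 2 * x + rng.normal(scale=0.1, size=1000)  # strong relationship, low MSE

# Treating raw floats as discrete symbols: every value is unique, so the
# discrete formula degenerates (the estimate saturates at log(n)).
print(mutual_info_score(x, y))

# Histogram binning: the estimate swings wildly with the bin count.
for bins in (5, 50, 500):
    xb = np.digitize(x, np.histogram_bin_edges(x, bins))
    yb = np.digitize(y, np.histogram_bin_edges(y, bins))
    print(bins, mutual_info_score(xb, yb))

# A k-NN estimator designed for continuous variables behaves sensibly.
print(mutual_info_regression(x.reshape(-1, 1), y)[0])
```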
I am currently building a binary classification model to predict stock price movements (trend prediction). More specifically, the model predicts the probability that a stock outperforms the daily median return:
> Class 0: return >= median return
>
> Class 1: return < median return
Accordingly, I am (or at least should be) dealing with a balanced prediction problem.
The ten stocks with the highest predicted probability are bought, and the ten stocks with the lowest are shorted, every day. So ideally the model performs well on both classes (I use a softmax output, so the model must commit to one class or the other).
I am wondering whether I should use the Accuracy, F1 or AUC-ROC when choosing the optimal model under these circumstances?
My understanding is that all of them are suitable metrics when the two classes are equally important. This Stack Exchange answer recommends the AUC over accuracy because it will "strongly discourage people going for models that are representative, but not discriminative (...) and [only] select models that achieve false positive and true positive rates that are significantly above random chance, which is not guaranteed for accuracy". In contrast, this answer recommends the F1 score because it is a combination of accuracy and AUC.
I guess what's confusing me is that I will make use of both classes, based on the probability assigned by the model. Also, I do not have an imbalanced dataset, which is the usual reason for preferring the AUC-ROC.
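For concreteness, here is how the three candidates would be computed on validation output (a minimal sketch with made-up arrays, assuming scikit-learn):

```python
# Computing the three candidate metrics on a validation set. `y_val`
# (true labels) and `p_val` (predicted probability of class 1) are
# made-up stand-ins for real validation output.
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

y_val = np.array([0, 1, 1, 0, 1, 0, 0, 1])
p_val = np.array([0.20, 0.80, 0.60, 0.40, 0.90, 0.30, 0.55, 0.70])

y_hat = (p_val >= 0.5).astype(int)  # hard labels for accuracy and F1

print("accuracy:", accuracy_score(y_val, y_hat))
print("F1:      ", f1_score(y_val, y_hat))
# AUC-ROC is computed from the probabilities themselves, so it scores the
# ranking -- which is what the buy-top-ten / short-bottom-ten rule uses.
print("AUC-ROC: ", roc_auc_score(y_val, p_val))
```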
Which evaluation metric should I choose to find the optimal model on validation data?
Thanks a lot for any thoughts or recommendations.
I have a deep learning model handed over from a former colleague. For some reason, the train/dev sets are missing.
In my situation, I want to classify my dataset into 100 categories. The dataset is extremely imbalanced, and its size is in the tens of millions of records.
First, I ran the model and obtained predictions on the whole dataset.
Then I sampled 100 records per category (according to the predicted label), giving a test set of 10,000 records.
Next, I labeled the ground truth of each record in the test set, calculated precision, recall, and F1 per category, and computed the micro- and macro-averaged F1.
How can I estimate accuracy or other metrics on the whole dataset? Is it correct to use the weighted sum of each category's precision (the weight being that category's share of the predictions over the whole dataset) as the estimate?
Since the distribution of predicted categories is not the same as the distribution of true categories, I suspect the weighted approach does not work. Can anyone explain?
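For reference, here is what the weighted-precision estimator targets, writing \hat{y} for the predicted label (a sketch of the underlying identity):

```latex
% Decomposing overall accuracy over predicted categories:
\mathrm{accuracy} = P(y = \hat{y})
  = \sum_{c} P(\hat{y}=c)\, P(y=c \mid \hat{y}=c)
  = \sum_{c} w_c \,\mathrm{precision}_c,
\qquad w_c = P(\hat{y}=c)
```

Each factor matches the setup above: w_c comes from the full-dataset prediction counts and precision_c from the 100-record strata, so stratifying by predicted label is exactly what this identity needs. Recall-based metrics are a different story, since they condition on the true label.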
The issue with taking a weighted average is that if your classifier performs well on the majority class but poorly on the minority classes (the typical scenario), this will not be reflected in the score.
One recommended approach is instead to use the balanced accuracy score (see the scikit-learn implementation). Basically, it is an average of all recall scores: for each class, it looks at how many of its observations were correctly classified, and then averages this across all classes. This gives you a sensible overall score to report.
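A minimal sketch of that computation (made-up labels, assuming scikit-learn):

```python
# Balanced accuracy is the macro-average of per-class recall: every class
# counts equally, however rare it is. Labels below are made up.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, recall_score

y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])  # class 0 is the majority
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 2, 0])  # minority classes suffer

per_class_recall = recall_score(y_true, y_pred, average=None)
print(per_class_recall)                         # [1.0, 0.5, 0.5]
print(per_class_recall.mean())                  # 0.666..., the macro average
print(balanced_accuracy_score(y_true, y_pred))  # same value
```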
I'm working on sentiment analysis using a Naive Bayes (NB) classifier. I've found claims in blogs, tutorials, etc. that the training corpus should be balanced:
33.3% positive;
33.3% neutral;
33.3% negative.
My question is:
Why should the corpus be balanced? Bayes' theorem is based on the probability of each class (the prior). So for training purposes, isn't it important that, as in the real world, negative tweets make up only 10% rather than 33.3%?
You are correct, balancing data is important for many discriminative models, but not really for NB.
However, it might still be beneficial to bias the P(y) estimates to get better predictive performance (due to the various simplifications such models make, the probability assigned to the minority class can be heavily underestimated). For NB it is not about balancing the data, but literally about modifying the estimated P(y) so that accuracy on the validation set is maximised.
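A sketch of that prior-tuning idea (the data, the candidate grid, and the 10% minority rate are all made up; assumes scikit-learn):

```python
# Tune the class prior P(y) of Naive Bayes on a validation set instead of
# rebalancing the corpus. All data here are synthetic stand-ins.
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(200, 20))   # stand-in for token counts
y_train = (rng.random(200) < 0.1).astype(int)  # ~10% negative tweets
X_val = rng.integers(0, 5, size=(50, 20))
y_val = (rng.random(50) < 0.1).astype(int)

best = (-1.0, 0.0)
for p in (0.05, 0.10, 0.20, 0.30, 0.50):       # candidate values of P(y=1)
    nb = MultinomialNB(class_prior=[1 - p, p]).fit(X_train, y_train)
    acc = accuracy_score(y_val, nb.predict(X_val))
    best = max(best, (acc, p))
print("best validation accuracy %.3f at P(y=1) = %.2f" % best)
```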
In my opinion, the best dataset for training purposes is a sample of the real-world data that your classifier will be used on.
This is true for all classifiers (though some of them indeed handle imbalanced training sets poorly, in which case you have little choice but to skew the distribution), but particularly for probabilistic classifiers such as Naive Bayes. So the best sample is one that reflects the natural class distribution.
Note that this matters not only for the class prior estimates. Naive Bayes will calculate, for each feature, the likelihood of the class given that feature. If your Bayesian classifier is built specifically to classify texts, it will use global document frequency measures (the number of times a given word occurs in the dataset, across all categories). If the number of documents per category in the training set doesn't reflect their natural distribution, the global frequency of terms typical of infrequent categories will be overestimated, and that of terms typical of frequent categories underestimated. Thus not only will the prior class probability be incorrect, but so will all the P(category=c|term=t) estimates.
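A toy illustration of that distortion (all numbers invented): suppose the word "refund" appears in 50% of negative tweets and 5% of positive tweets, and negatives are naturally 10% of traffic.

```python
# P(class = neg | term present) via Bayes' rule, under two different priors.
p_t_neg, p_t_pos = 0.50, 0.05  # P(term | neg), P(term | pos): invented

def p_neg_given_term(p_neg):
    """Posterior probability of the negative class given the term."""
    p_pos = 1.0 - p_neg
    num = p_t_neg * p_neg
    return num / (num + p_t_pos * p_pos)

print(p_neg_given_term(0.10))  # natural 10% prior  -> ~0.53
print(p_neg_given_term(0.50))  # balanced corpus    -> ~0.91
# The balanced corpus inflates the posterior: the P(category|term)
# estimates no longer match what the classifier will see in production.
```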
My crime classification dataset has indicator features, such as has_rifle.
The job is to train a model and predict whether each data point is a criminal or not. The metric is a weighted mean absolute error: if the person is a criminal and the model predicts otherwise, the weight is as large as 5; if the person is not a criminal and the model predicts that they are, the weight is 1; when the model predicts correctly, the weight is 0.
I've used the classif.multinom method from mlr in R, and tuned the threshold to 1/6. The result is not that good. AdaBoost is slightly better, though neither is great.
I'm wondering which method is typically used in this kind of binary classification problem with a sparse {0,1} matrix? And how to improve the performance measured by the weighted mean absolute error metric?
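For reference, the metric can be computed as follows (a sketch with made-up arrays; `weighted_mae` is an illustrative helper, not an mlr function):

```python
# Weighted mean absolute error from the question: cost 5 for a missed
# criminal (false negative), 1 for a false alarm (false positive), 0
# otherwise. Arrays below are made up.
import numpy as np

def weighted_mae(y_true, y_pred, fn_cost=5.0, fp_cost=1.0):
    """Mean prediction cost with asymmetric error penalties."""
    fn = (y_true == 1) & (y_pred == 0)
    fp = (y_true == 0) & (y_pred == 1)
    return (fn_cost * fn + fp_cost * fp).mean()

# Minimising expected cost means predicting "criminal" whenever
# P(criminal) > fp_cost / (fp_cost + fn_cost) = 1/6, i.e. exactly the
# 1/6 threshold mentioned above.
p_hat = np.array([0.05, 0.10, 0.20, 0.70, 0.90])
y_true = np.array([0, 1, 1, 0, 1])
y_pred = (p_hat > 1.0 / 6.0).astype(int)
print(weighted_mae(y_true, y_pred))
```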
Dealing with sparse data is not a trivial task; the lack of information makes it difficult to capture properties such as variance. I would suggest looking into subspace clustering methods or, more specifically, soft subspace clustering. The latter usually identifies relevant/irrelevant data dimensions, and it is a good approach when you want to improve classification accuracy.
I'm building a binary classification tree using mutual information gain as the splitting function. But since the training data is skewed toward a few classes, it is advisable to weight each training example by the inverse class frequency.
How do I weight the training data? When calculating the probabilities to estimate the entropy, do I take weighted averages?
EDIT: I'd like an expression for entropy with the weights.
The Wikipedia article you cited goes into weighting. It says:
Weighted variants
In the traditional formulation of the mutual information,

I(X;Y) = sum_{x,y} p(x,y) log( p(x,y) / (p(x) p(y)) ),

each event or object specified by (x,y) is weighted by the corresponding probability p(x,y). This assumes that all objects or events are equivalent apart from their probability of occurrence. However, in some applications it may be the case that certain objects or events are more significant than others, or that certain patterns of association are more semantically important than others.
For example, the deterministic mapping {(1,1),(2,2),(3,3)} may be viewed as stronger (by some standard) than the deterministic mapping {(1,3),(2,1),(3,2)}, although these relationships would yield the same mutual information. This is because the mutual information is not sensitive at all to any inherent ordering in the variable values (Cronbach 1954, Coombs & Dawes 1970, Lockhead 1970), and is therefore not sensitive at all to the form of the relational mapping between the associated variables. If it is desired that the former relation, showing agreement on all variable values, be judged stronger than the latter relation, then it is possible to use the following weighted mutual information (Guiasu 1977):

I_w(X;Y) = sum_{x,y} w(x,y) p(x,y) log( p(x,y) / (p(x) p(y)) ),
which places a weight w(x,y) on the probability of each variable value co-occurrence, p(x,y). This allows that certain probabilities may carry more or less significance than others, thereby allowing the quantification of relevant holistic or prägnanz factors. In the above example, using larger relative weights for w(1,1), w(2,2), and w(3,3) would have the effect of assessing greater informativeness for the relation {(1,1),(2,2),(3,3)} than for the relation {(1,3),(2,1),(3,2)}, which may be desirable in some cases of pattern recognition, and the like.
http://en.wikipedia.org/wiki/Mutual_information#Weighted_variants
State-value weighted entropy as a measure of investment risk.
http://www56.homepage.villanova.edu/david.nawrocki/State%20Weighted%20Entropy%20Nawrocki%20Harding.pdf
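To address the EDIT directly: one common convention is to fold the per-example weights into the empirical class probabilities, H_w = -sum_c p_w(c) log2 p_w(c) with p_w(c) = (total weight of class c) / (total weight). A minimal sketch of that convention (arrays made up):

```python
# Entropy where each training example contributes its weight rather than
# a unit count, as used in a weighted splitting criterion. Illustrative.
import numpy as np

def weighted_entropy(labels, weights):
    """Shannon entropy (bits) with weighted class frequencies."""
    w = np.asarray(weights, dtype=float)
    totals = np.array([w[labels == c].sum() for c in np.unique(labels)])
    p = totals / totals.sum()
    return -(p * np.log2(p)).sum()

labels = np.array([0, 0, 0, 1])            # skewed toward class 0
weights = np.array([1/3, 1/3, 1/3, 1.0])   # inverse class frequencies
print(weighted_entropy(labels, weights))   # 1.0 bit: classes rebalanced
```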