Comparing two probabilities from the same normal distribution

I have a normal distribution. I would like to compare two values drawn from this population to measure how "similar" they are. Similarity is subjective, but I want to be able to say that x is more "similar" to y than z is to y, using some sort of equation based on the normal distribution.
For example, suppose the mean of my population is 10 and my standard deviation is 3. I would like my simple algorithm to find that two points (19 and 17) are more similar than two other points (9 and 10), simply because point 17 is much less likely (it's more than two sigmas away from the mean). Getting another random point near such a low-probability point shows higher similarity than comparing two points that both occur with much higher probability.
Using something like P(X < p1) - P(X < p2) is not good enough, because I may get 0 whenever both points are the same. However, two points of 9 and 9 should score lower similarity than two points of 20 and 20, since 20 is a lot less likely to occur than 9.
I feel like I need to use the difference above, but somehow also use the mean and sigma to formulate a similarity formula.
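For concreteness, here is a minimal sketch of those ingredients using scipy.stats; the particular way they are combined into one score (rarity minus CDF gap) is purely illustrative, not an established statistical test:

```python
from scipy.stats import norm

mu, sigma = 10, 3
dist = norm(loc=mu, scale=sigma)

def similarity(p1, p2):
    # CDF difference: 0 when the points coincide, so it cannot by itself
    # distinguish (9, 9) from (20, 20).
    cdf_gap = abs(dist.cdf(p1) - dist.cdf(p2))
    # Rarity: how improbable the less likely of the two points is,
    # scaled so a point at the mean gives 0 and a far-out point approaches 1.
    rarer = min(p1, p2, key=dist.pdf)
    rarity = 1 - dist.pdf(rarer) / dist.pdf(mu)
    return rarity - cdf_gap  # higher = "more similar" by the logic above

print(similarity(19, 17))  # rare pair, close together -> high score
print(similarity(9, 10))   # common pair, close together -> lower score
```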
Is there an existing stat test that captures what I'm trying to do above? If not, does anyone have a suggestion as how I would go about solving the problem above?
Thank you!

Related

Implementing a Naive Bayes Gaussian classifier on number classification data

I am trying to implement a Naive Bayes Gaussian classifier on the number classification data, where each feature represents a pixel.
While implementing it, I hit a bump: I've noticed that some of the feature variances equal 0.
This is an issue because the Gaussian density divides by the variance, so I cannot evaluate the probability when it is 0.
What can I do to work around this?
The very short answer is: you cannot. Even though you can usually try to fit a Gaussian distribution to any data (no matter its true distribution), there is one exception - the constant case (0 variance). So what can you do? There are three main solutions:
Ignore 0-variance pixels. I do not recommend this approach, as it loses information, but if a feature has 0 variance for each class (a common case for MNIST - some pixels are black independently of class), then it is actually fully justified mathematically. Why? The answer is really simple: if, for each class, a given feature is constant (equal to some single value), then it brings literally no information for classification, so ignoring it will not affect a model which assumes conditional independence of features (such as NB).
Instead of the MLE estimate (i.e., using N(mean(X), std(X))), use a regularised estimator, for example one of the form N(mean(X), std(X) + eps), which is equivalent to adding eps-noise independently to each pixel. This is a very generic approach that I would recommend (see the first sketch after this list).
Use a better distribution class. If your data is images (and since you have 0 variance I assume these are binary images, maybe even MNIST), you have K features, each in the [0, 1] interval. You can use a multinomial distribution with bucketing, so P(x ∈ B_i | y) = #{x ∈ B_i | y} / #{x | y} (see the second sketch after this list). This is usually the best thing to do (though it requires some knowledge of your data), as the problem is that you are trying to use a model which is not suited for the data provided, and I can assure you that a proper distribution will always give better results with NB. So how can you find a good distribution? Plot the conditional marginals P(x_i | y) for each feature and see what they look like; based on that, choose the distribution class which matches the behaviour. I can assure you these will not look like Gaussians.
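A minimal sketch of the regularised estimator from the second point, assuming the data arrives as numpy arrays X (n_samples × n_features) and labels y; scikit-learn's GaussianNB offers similar variance regularisation through its var_smoothing parameter.

```python
import numpy as np

def fit_regularised_gaussians(X, y, eps=1e-2):
    """Per-class N(mean(X), std(X) + eps); eps is an illustrative choice."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # std + eps can never be 0, so the Gaussian density stays well defined
        params[c] = (Xc.mean(axis=0), Xc.std(axis=0) + eps)
    return params
```

And a sketch of the bucketing estimate from the third point, for a single feature in [0, 1]; the Laplace smoothing term is an addition of mine, since the raw count formula gives probability 0 for empty buckets.

```python
import numpy as np

def bucket_probs(x, y, n_buckets=10):
    """Estimate P(x in B_i | y) = #{x in B_i | y} / #{x | y} via histogramming."""
    edges = np.linspace(0.0, 1.0, n_buckets + 1)
    probs = {}
    for c in np.unique(y):
        counts, _ = np.histogram(x[y == c], bins=edges)
        # +1 / +n_buckets is Laplace smoothing, added to avoid zero-count buckets
        probs[c] = (counts + 1) / (counts.sum() + n_buckets)
    return probs
```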

Classification Accuracy Only 5% Higher Than Random Picking

I am trying to predict a public DotA 2 match outcome from the given hero picks, which is usually possible for a human. There can only be 2 outcomes for a given side: either a win or a loss.
In fact, I am new to machine learning. I wanted to do this mini-project as an exercise, but it has already taken 2 days of my time.
So, I made a dataset of around 2,000 matches from about the same skill bracket. Each match contains about 13,000 features. Each feature is either 0 or 1 and specifies whether Radiant has a certain hero or not, whether Dire has a certain hero or not, or whether Radiant has one hero and Dire another at the same time (and vice versa). All combinations sum up to around 13,000 features; most of them are 0, of course. Labels are also 0 or 1 and indicate whether the Radiant team won.
I used different sets for training and for testing.
A logistic regression classifier gave me 100% accuracy on the training set and around 58% accuracy on the test set.
An SVM, on the other hand, scored 55% on training and 53% on test.
When I decreased the number of examples by 1,000, I got 54.5% on training and 55% on test.
Should I continue increasing number of examples?
Should I select different features?
If I add more combinations of heroes, the feature count will explode. Or maybe there is no way to predict the match outcome judging only from the heroes selected, and I need data about each player's online rating, the hero they selected, and so on?
[Plot: prediction accuracy vs. number of training examples]
Check out the 2 latest graphs I added below. I think I've got pretty decent results.
Also:
1. I asked 2 friends of mine to predict 10 matches and they both predicted 6 right. This amounts to 60%, just as you said. 10 matches is not a big set, but they won't bother with bigger ones.
2. I downloaded the 400,000 latest DotA matches (MMR > 3000, all-pick mode only). Assuming that around 1 billion DotA matches are played each year, 400k should all come from the same patch.
3. Concatenating hero picks of both sides was the original idea. Also, there are 114 heroes in DotA, so I have 228 features now.
4. In most matches the odds are more or less equal, but there is a fraction of picks where one of the teams has an advantage, from small up to critical.
What I ask you to do is to verify my conclusions, because the results I got seem too good for a linear model.
[Plots: actual probabilities vs. predicted probability ranges; distribution of predictions by probability range]
The issue here is with your assertion that predicting a DotA 2 match based on hero picks is "usually possible for a human". For this particular task, it's more likely that there's a low cap on possible accuracy than anything else. I watch a lot of DotA, and even when you focus on the pro scene, the accuracy of casters' predictions based on hero picks is quite low. My very preliminary analysis puts their accuracy within spitting distance of 60%.
Secondly, how many DotA matches are actually determined by hero picks? Not many. In the vast majority of cases, especially in pub matches where skill levels are highly variable, team play matters much more than hero picks.
That's the first issue with your problem, but there are definitely other large issues with the way you've structured it, and fixing them could get you another couple of accuracy points (though again, I doubt you can get far above 60%).
My first suggestion would be to change the way you're generating features. Feeding 13k features into an LR model with 2k examples is a recipe for disaster, especially in the case of DotA, where individual heroes don't matter very much and synergies and counters are drastically more important. I would start by reducing your feature count to ~200 by just concatenating the hero picks of both sides: 111 for Radiant, 111 for Dire, 1 if the hero is picked, 0 otherwise (as sketched below). This will help with overfitting, but then you run into the issue that LR is not a particularly good fit for the problem, because individual heroes don't matter as much.
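A minimal sketch of this encoding (the function name and the 0-based hero-id convention are illustrative):

```python
import numpy as np

N_HEROES = 111  # per the suggestion above; adjust to the actual hero count

def encode_match(radiant_heroes, dire_heroes):
    """One-hot blocks: first N_HEROES slots for Radiant picks, next for Dire."""
    x = np.zeros(2 * N_HEROES)
    x[list(radiant_heroes)] = 1.0                 # Radiant block
    x[[N_HEROES + h for h in dire_heroes]] = 1.0  # Dire block
    return x
```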
My second suggestion would be to constrain your match search to a single patch, ideally a later one where there's sufficient data. If you can get separate data for different patches, so much the better. An LR approach will give you decent accuracy for patches where specific heroes are overpowered, and especially at small data sizes you're a bit hosed if your data spans patches, as the heroes are actually changing.
My third suggestion would be to change to a model that's better at modelling inter-dependencies between your features. Random forest models are a pretty easy and straightforward approach that should give you better performance than straight LR for a problem like this, and there's a built-in implementation in sklearn (see the sketch below).
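A minimal sketch of that suggestion with scikit-learn; the placeholder data and hyperparameters are illustrative, and cross-validation gives a fairer accuracy estimate than a single train/test split:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data in the shape discussed above: 2,000 matches, 222 binary features.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 2 * 111))
y = rng.integers(0, 2, size=2000)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # ~0.5 on random placeholder data
```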
If you want to go a bit further, then using an MLP-style network model could be relatively effective. I don't see an obvious framing of the problem to take advantage of modern network models though (CNNs and RNNs), so unless you change the problem definition a bit I think that this is going to be more hassle than it's worth.
As always, when in doubt, get more data, and don't forget that people are very, very bad at this problem as well.

Intuition behind Kernelized perceptron

I understand the derivation of the kernelized perceptron function, but I'm trying to figure out the intuition behind the final formula:
f(X) = sum_i (alpha_i*y_i*K(X,x_i))
where (x_i, y_i) are all the samples in the training data, alpha_i is the number of times we've made a mistake on that sample, and X is the sample we're trying to predict (during training or otherwise). Now, I understand why the kernel function is considered a measure of similarity (since it's a dot product in a higher-dimensional space), but what I don't get is how this formula comes together.
My original attempt was that we're trying to predict a sample based on how similar it is to the other samples - and multiply it by y_i so that it contributes the correct sign (points that are closer are better indicators of the label than points that are farther). But why should a sample that we've made several mistakes on contribute more?
tl;dr: In a Kernelized perceptron, why should a sample that we've made several mistakes on contribute more to the prediction than ones we haven't made mistakes on?
> My original attempt was that we're trying to predict a sample based on how similar it is to the other samples - and multiply it by y_i so that it contributes the correct sign (points that are closer are better indicators of the label than points that are farther).
This is pretty much what's going on. The idea is that if a sample x_i is already well classified, its alpha_i does not need to be updated further.
But if the point is misclassified, we need to update. The best way is to move in the opposite direction, right? That is, if the result is negative, we should be adding a positive quantity (y_i); if the result is positive (and the point is misclassified), then we want to add a negative value (y_i again).
As you can see, y_i already gives us the right update direction, and hence we use the misclassification counter to give a magnitude to that update.
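A minimal sketch of the whole loop, assuming labels in {-1, +1} and an RBF kernel (any positive-definite kernel would do):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def train(X, y, epochs=10, kernel=rbf_kernel):
    n = len(X)
    alpha = np.zeros(n)  # alpha_i = number of mistakes made on sample i
    for _ in range(epochs):
        for j in range(n):
            # f(x_j) = sum_i alpha_i * y_i * K(x_j, x_i)
            f = sum(alpha[i] * y[i] * kernel(X[j], X[i]) for i in range(n))
            if y[j] * f <= 0:   # misclassified: this sample gets more weight
                alpha[j] += 1
    return alpha

def predict(X, y, alpha, x_new, kernel=rbf_kernel):
    f = sum(alpha[i] * y[i] * kernel(x_new, X[i]) for i in range(len(X)))
    return np.sign(f)
```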

What does Maximum Likelihood Estimation actually mean?

When we are training our model we usually use MLE to estimate its parameters. I know it means that the most probable data for such a learned model is our training set. But I'm wondering whether its probability is exactly 1 or not?
You almost have it right. The Likelihood of a model (theta) for the observed data (X) is the probability of observing X, given theta:
L(theta|X) = P(X|theta)
For Maximum Likelihood Estimation (MLE), you choose the value of theta that provides the greatest value of P(X|theta). This does not necessarily mean that the observed value of X is the most probable for the MLE estimate of theta. It just means that there is no other value of theta that would provide a higher probability for the observed value of X.
In other words, if T1 is the MLE estimate of theta, and T2 is any other possible value of theta, then P(X|T1) >= P(X|T2). However, there could still be another possible value of the data (Y), different from the observed data (X), such that P(Y|T1) > P(X|T1).
The probability of X for the MLE estimate of theta is not necessarily 1 (and probably never is except for trivial cases). This is expected since X can take multiple values that have non-zero probabilities.
To build on what bogatron said with an example, the parameters learned from MLE are the ones that explain the data you see (and nothing else) the best. And no, the probability is not 1 (except in trivial cases).
As an example (one that has been used billions of times) of what MLE does:
If you have a simple coin-toss problem, and you observe 5 results of coin tosses (H, H, H, T, H) and you do MLE, you will end up giving p(coin_toss == H) a high probability, 0.8 (= 4/5), because you see heads way too many times. There are good and bad things about MLE, obviously...
Pros: It is an optimization problem, so it is generally quite fast to solve (even if there isn't an analytical solution).
Cons: It can overfit when there isn't a lot of data (like our coin-toss example).
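A worked version of the coin-toss example: the likelihood of p for (H, H, H, T, H) peaks at exactly 4/5 = 0.8, and the maximised probability of the data is far below 1, as bogatron noted.

```python
import numpy as np

# Likelihood of p(heads) for the observed sequence H, H, H, T, H.
p = np.linspace(0.01, 0.99, 99)
likelihood = p**4 * (1 - p)         # 4 heads, 1 tail
print(p[np.argmax(likelihood)])     # -> 0.8
print(likelihood.max())             # -> ~0.082, nowhere near probability 1
```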
The example I got in my stat classes was as follows:
A suspect is on the run! Nothing is known about them, except that they're approximately 1.80 m tall. Should the police look for a man or a woman?
The idea here is that you have a parameter for your model (M/F) and probabilities given that parameter. There are tall men, tall women, short men and short women. However, in the absence of any other information, the probability of a man being 1.80 m is larger than the probability of a woman being 1.80 m. Likelihood (as bogatron very well explained) is a formalisation of that, and maximum likelihood is the estimation method based on favouring the parameters which are more likely to result in the actual observations.
But that's just a toy example, with a single binary variable... Let's expand it a bit: I threw two identical dice, and the sum of their values is 7. How many sides do my dice have? Well, we all know that the probability of two D6 summing to 7 is quite high. But it might as well be two D4, two D20, two D100, ... However, P(7 | 2D6) > P(7 | 2D20), and P(7 | 2D6) > P(7 | 2D100), etc., so you might estimate that my dice are 6-sided. That doesn't mean it's true, but it is a reasonable estimation, in the absence of any additional information.
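Working that example out numerically (a small sketch; the candidate die sizes are just the ones mentioned above):

```python
from itertools import product

# P(sum = 7) for two n-sided dice: n = 6 maximises it, hence "2D6" is the MLE.
for n in (4, 6, 20, 100):
    outcomes = list(product(range(1, n + 1), repeat=2))
    p7 = sum(a + b == 7 for a, b in outcomes) / len(outcomes)
    print(f"2D{n}: P(7) = {p7:.4f}")  # 2D6 -> 0.1667, 2D20 -> 0.0150, ...
```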
That's better, but we're not in machine-learning territory yet... Let's get there: if you want to fit your umptillion-layer neural network to some empirical data, you can consider all possible parameterisations and how likely each of them is to return the empirical data. That's exploring an umptillion-dimensional space, each dimension having infinitely many possibilities, but you can map every single one of these points to a likelihood. It is then reasonable to fit your network using the parameters that make this likelihood greatest: given that the empirical data did occur, it is reasonable to assume that it should be likely under your model.
That doesn't mean that your parameters are likely! Just that, under these parameters, the observed value is likely. Statistical estimation is usually not a closed problem with a single solution (like solving an equation might be, where you would have a probability of 1); instead, we need to find the best solution according to some metric. Likelihood is such a metric, and it is widely used because it has some interesting properties:
- It makes intuitive sense.
- It's reasonably simple to compute, fit and optimise for a large family of models.
- For normal variables (which tend to crop up everywhere), MLE gives the same results as other methods, such as least-squares estimation.
- Its formulation in terms of conditional probabilities makes it easy to use/manipulate in Bayesian frameworks.

Distance measure for categorical attributes for k-Nearest Neighbor

For my class project, I am working on the Kaggle competition - Don't get kicked
The project is to classify test data as good/bad buy for cars. There are 34 features and the data is highly skewed. I made the following choices:
The data is highly skewed: out of 73,000 instances, 64,000 are bad buys and only 9,000 are good buys. Since building a decision tree would overfit the data, I chose to use kNN (k-nearest neighbors).
After trying out kNN, I plan to try Perceptron and SVM techniques if kNN doesn't yield good results. Is my understanding about overfitting correct?
Since some features are numeric, I can directly use Euclidean distance as a measure, but other attributes are categorical. To use these features properly, I need to come up with my own distance measure. I read about Hamming distance, but I am still unclear on how to merge the 2 distance measures so that each feature gets equal weight.
Is there a way to find a good approximation for the value of k? I understand that this depends a lot on the use case and varies per problem. But if I am taking a simple vote from each neighbor, what should I set the value of k to? I'm currently trying out various values, such as 2, 3, 10, etc.
I researched around and found these links, but they are not specifically helpful -
a) Metric for nearest neighbor, which says that finding your own distance measure is equivalent to "kernelizing", but I couldn't make much sense of it.
b) Distance independent approximation of kNN talks about R-trees, M-trees etc. which I believe don't apply to my case.
c) Finding nearest neighbors using Jaccard coeff
Please let me know if you need more information.
Since the data is unbalanced, you should either sample an equal number of good/bad (losing lots of "bad" records), or use an algorithm that can account for this. I think there's an SVM implementation in RapidMiner that does this.
You should use Cross-Validation to avoid overfitting. You might be using the term overfitting incorrectly here though.
You should normalize distances so that they have the same weight. By normalize I mean force to be between 0 and 1. To normalize something, subtract the minimum and divide by the range.
The way to find the optimal value of K is to try all possible values of K (while cross-validating) and choose the one with the highest accuracy. If a merely "good" value of K is fine, you can use a genetic algorithm or similar to find it. Or you could try K in steps of, say, 5 or 10, see which K leads to good accuracy (say it's 55), then try steps of 1 near that "good value" (i.e., 50, 51, 52, ...), though this may not be optimal.
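A minimal sketch of one way to combine the two distance measures after per-feature normalization; the helper names and the equal-weight sum are illustrative choices, not a standard:

```python
import numpy as np

def minmax_normalize(col):
    # "Subtract the minimum and divide by the range" -> values in [0, 1].
    return (col - col.min()) / (col.max() - col.min())

def mixed_distance(a, b, numeric_idx, categorical_idx):
    # Numeric part: Euclidean distance on the normalized columns.
    num = np.sqrt(np.sum((a[numeric_idx] - b[numeric_idx]) ** 2))
    # Categorical part: Hamming distance (number of mismatching attributes).
    cat = np.sum(a[categorical_idx] != b[categorical_idx])
    return num + cat
```

And a sketch of the cross-validated search over K with scikit-learn, assuming a fully numeric, normalized feature matrix X and labels y:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Try odd K values with 5-fold cross-validation and keep the most accurate one.
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": list(range(1, 52, 2))}, cv=5)
search.fit(X, y)
print(search.best_params_)
```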
I'm looking at the exact same problem.
Regarding the choice of k, it's recommended to use an odd value to avoid "tie votes".
I hope to expand this answer in the future.
