When I read that likelihood is not probability, I had to think through what that actually meant, and the following case occurred to me.
What is the likelihood that a coin is fair, given that we see four heads in a row?
We can't really say anything about probability here, but the word "trust" seems apt.
Do we feel we can trust the coin?
Found on the internet:
Probability quantifies anticipation (of outcome), likelihood quantifies trust (in model).
Can someone give a clean explanation?
Probability is the quantity most people are familiar with; it deals with predicting new data given a known model ("what is the probability of getting heads six times in a row flipping this 50:50 coin?"),
while,
Likelihood deals with fitting models given some known data ("what is the likelihood that this coin is/isn't rigged given that I just flipped heads six times in a row?").
A likelihood value such as 0.12 is not meaningful to a layman unless it is explained what exactly 0.12 measures: the sample's support for the assumed model, i.e. low values mean either rare data or an incorrect model!
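A minimal sketch of the distinction in Python (a toy illustration of the coin example, not from the original post):

    # Probability: fix the model, ask about the data.
    # P(six heads in six flips | fair coin with p = 0.5):
    print(0.5 ** 6)  # 0.015625

    # Likelihood: fix the data (six heads observed), ask about the model.
    # The same expression, read as a function of p, is the likelihood of
    # each candidate coin given the observed flips:
    for p in (0.5, 0.7, 0.9, 1.0):
        print(p, p ** 6)
    # The values rise toward p = 1.0 (the maximum-likelihood coin), but
    # they are not probabilities of p itself and need not sum to 1.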
"https://www.youtube.com/watch?v=pYxNSUDSFH4" can be helpful
I have a dataset of user ratings on images. I am normalizing the ratings using mean/standard-deviation normalization to remove bias in the dataset due to user-specific preferences. Is this a correct way to handle bias, or is there another way to remove bias in user ratings?
This is certainly wrong on a couple of points:
If you 'normalise' input by standard deviation in this way, what you are saying is that "low variability doesn't matter much, only the outliers really count" -- because the outliers will themselves have a deviation larger than the standard one...
You are dealing with 'votes' of user satisfaction, not 'measurements'. Bias, by definition, is information about satisfaction -- you are throwing it away. 150 years ago people used to find the "No dogs, no Irish" sign acceptable; these days, not so much. If you want to predict how well a restaurant is likely to be regarded after a visit, you can't discount zero-star votes merely because the people objected to the sign!
When it comes to star ratings as a prediction for how likely something is to be "enjoyed" or "regretted" you might want to read this article: https://www.evanmiller.org/how-not-to-sort-by-average-rating.html
Note that the linked article is primarily interested in modelling, in terms of stars to award: given past ratings, does the current vote indicate (a) a continuation of past 'satisfaction', (b) a shifting trend towards increasing 'satisfaction', or (c) a shifting trend towards decreasing 'satisfaction'?
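For reference, the fix that article proposes is to rank items by the lower bound of the Wilson score confidence interval on the fraction of positive ratings, rather than by the raw average. A minimal sketch (function and variable names are mine):

    import math

    def wilson_lower_bound(pos, n, z=1.96):
        """Lower bound of the Wilson score interval for the true fraction
        of positive ratings, given pos positive votes out of n total
        (z = 1.96 corresponds to 95% confidence)."""
        if n == 0:
            return 0.0
        phat = pos / n
        return ((phat + z * z / (2 * n)
                 - z * math.sqrt((phat * (1 - phat) + z * z / (4 * n)) / n))
                / (1 + z * z / n))

    # A single positive vote ranks below ninety positive votes out of 100:
    print(wilson_lower_bound(1, 1))     # ~0.21
    print(wilson_lower_bound(90, 100))  # ~0.83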
First of all, I would like you to know that I am new to machine learning (ML). I am working on a project that detects how positive or negative a set of words is, so I have created a database containing possible negative words, the idea being that ML can then predict an overall score for how positive or negative the whole set of words is.
My questions are: is it possible to classify positive words with only negative words in the dataset? And if it is possible, does it affect the prediction accuracy?
No, it's not generally possible. The model will have no way to differentiate among (1) new negative phrases, (2) neutral phrases, and (3) positive phrases. In fact, with only negative phrases, the model will have a hard time learning that "bad" and "not bad" are opposites, since it has seen plenty of "not" in the negative examples, such as "not worth watching, even for free."
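To make the failure concrete, here is a toy negative-keyword scorer (the wordlist is hypothetical) showing why negation is invisible to a negative-only model:

    # Hypothetical negative wordlist, mimicking the question's database.
    negative_words = {"bad", "awful", "boring", "worthless"}

    def negativity_score(text):
        # Counts negative words only; it has never seen positive examples.
        return sum(token in negative_words for token in text.lower().split())

    print(negativity_score("this movie was bad"))      # 1
    print(negativity_score("this movie was not bad"))  # 1 -- same score,
    # because without positive or neutral training data the model cannot
    # learn that "not" flips the polarity of "bad".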
Hi, I've been working on a machine learning project: predicting whether a given (query, answer) pair is a good match (label the pair 1 if it is a good match, 0 otherwise). The problem is that in the training set all the items are labelled 1, so I don't think the training set has much discriminative power. To be more specific, I can currently extract features like:
1. textual similarity between query and answer
2. some attributes like the posting date, who created it, which aspect it is about, etc.
Maybe I should try semi-supervised learning (I've never studied it, so I have no idea whether it will work)? But with such a training set I can't even do validation...
Actually, you can train on a data set with only positive examples; a one-class SVM does exactly this. However, it presumes that anything "sufficiently outside" the original data set is negative, with "sufficiently outside" controlled mainly by nu (an upper bound on the allowed training error rate) and the kernel parameters (e.g. gamma for an RBF kernel).
A solution for your problem depends on the data you have. You are quite correct that a model trains better when given representative negative examples. The description you give strongly suggests that you do know there are insufficient matches.
Do you need a strict +/- scoring for the matches? Most applications simply rank them: the match strength is the score. This changes your problem from a classification to a prediction case. If you do need a strict +/- partition (classification), then I suggest you slightly alter your training set: include only obvious examples, and throw out anything scored near your comfort threshold for declaring a match.
With these inputs only, train your model. You'll have a clear "alley" between good and bad matches, and the model will "decide" which way to judge the in-between cases in testing and production.
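If you want to try the one-class route, here is a minimal scikit-learn sketch; the feature matrix is a stand-in for your actual (query, answer) features, which I don't have:

    import numpy as np
    from sklearn.svm import OneClassSVM

    # Stand-in for feature vectors of the known-good (query, answer) pairs,
    # e.g. [textual_similarity, recency, ...] -- assumed, not real data.
    rng = np.random.default_rng(0)
    X_pos = rng.normal(loc=1.0, scale=0.2, size=(200, 2))

    # nu bounds the fraction of training points treated as outliers;
    # gamma controls how tightly the RBF kernel wraps the positive class.
    model = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_pos)

    X_new = np.array([[1.0, 1.1],    # resembles the training matches
                      [3.0, -2.0]])  # far outside -> presumed non-match
    print(model.predict(X_new))            # typically [ 1 -1 ]
    print(model.decision_function(X_new))  # signed score = match strength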
When we are training a model we usually use MLE to estimate its parameters. I know this means that the most probable data for such a learned model is our training set. But I'm wondering whether that probability is exactly 1 or not?
You almost have it right. The likelihood of a model (theta) for the observed data (X) is the probability of observing X given theta:
L(theta|X) = P(X|theta)
For Maximum Likelihood Estimation (MLE), you choose the value of theta that provides the greatest value of P(X|theta). This does not necessarily mean that the observed value of X is the most probable for the MLE estimate of theta. It just means that there is no other value of theta that would provide a higher probability for the observed value of X.
In other words, if T1 is the MLE estimate of theta, and T2 is any other possible value of theta, then P(X|T1) >= P(X|T2). However, there could still be another possible value of the data (Y), different from the observed data (X), such that P(Y|T1) > P(X|T1).
The probability of X for the MLE estimate of theta is not necessarily 1 (and almost never is, except in trivial cases). This is expected, since X is only one of many possible values with non-zero probability.
To build on bogatron's answer with an example: the parameters learned by MLE are the ones that best explain the data you see (and nothing else). And no, the probability is not 1 (except in trivial cases).
As an example of what MLE does (one that has been used billions of times):
If you have a simple coin-toss problem and you observe 5 coin tosses (H, H, H, T, H), MLE will give p(coin_toss == H) a high value (0.80), because heads came up 4 times out of 5 (see the quick numeric check after the pros and cons below). There are good and bad things about MLE, obviously...
Pros: It is an optimization problem, so it is generally quite fast to solve (even if there isn't an analytical solution).
Cons: It can overfit when there isn't a lot of data (like our coin-toss example).
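Here is that check as a few lines of Python (a toy grid search, not from the answer):

    import numpy as np

    # Observed flips: H, H, H, T, H -> 4 heads out of 5.
    heads, n = 4, 5

    # Likelihood of each candidate p: P(flips | p) = p^heads * (1-p)^(n-heads).
    p_grid = np.linspace(0.01, 0.99, 99)
    likelihood = p_grid ** heads * (1 - p_grid) ** (n - heads)

    print(p_grid[np.argmax(likelihood)])  # ~0.80 -- the MLE
    print(likelihood.max())               # ~0.082 -- far from 1, even at the maximum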
The example I got in my stat classes was as follows:
A suspect is on the run! Nothing is known about them, except that they're approximately 1m80 tall. Should the police look for a man or a woman?
The idea here is that you have a parameter for your model (M/F), and probabilities given that parameter. There are tall men, tall women, short men and short women. However, in the absence of any other information, the probability of a man being 1m80 is larger than the probability of a woman being 1m80. Likelihood (as bogatron very well explained) is a formalisation of that, and maximum likelihood is the estimation method based on favouring parameters which are more likely to result in the actual observations.
But that's just a toy example, with a single binary variable... Let's expand it a bit: I threw two identical dice, and the sum of their values is 7. How many sides do my dice have? Well, we all know that the probability of two D6 summing to 7 is quite high. But they might as well be D4, D20, D100, ... However, P(7 | 2D6) > P(7 | 2D20), and P(7 | 2D6) > P(7 | 2D100), ..., so you might estimate that my dice are six-sided. That doesn't mean it's true, but it's a reasonable estimate in the absence of any additional information, and the quick check below confirms it.
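You can verify those inequalities with a few lines of Python (a toy check):

    from itertools import product

    def p_sum_is_7(sides):
        # Probability that two fair dice with `sides` faces sum to 7.
        rolls = list(product(range(1, sides + 1), repeat=2))
        return sum(a + b == 7 for a, b in rolls) / len(rolls)

    for sides in (4, 6, 20, 100):
        print(sides, p_sum_is_7(sides))
    # D6 wins: P(7 | 2D6) ~ 0.167 beats D4 (0.125), D20 (0.015) and
    # D100 (0.0006), so six-sided dice are the maximum-likelihood estimate.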
That's better, but we're not in machine-learning territory yet... Let's get there: if you want to fit your umptillion-layer neural network to some empirical data, you can consider all possible parameterisations and how likely each of them is to produce the empirical data. That's exploring an umptillion-dimensional space, each dimension having infinitely many possibilities, but you can map every single one of these points to a likelihood. It is then reasonable to fit your network with the most likely of these parameterisations: given that the empirical data did occur, it is reasonable to assume that it should be likely under your model.
That doesn't mean that your parameters are likely! Just that, under these parameters, the observed values are likely. Statistical estimation is usually not a closed problem with a single solution (like solving an equation might be, where you would have a probability of 1); instead, we need to find a best solution according to some metric. Likelihood is such a metric, and it is widely used because it has some interesting properties:
It makes intuitive sense
It's reasonably simple to compute, fit and optimise, for a large family of models
For normal variables (which tend to crop up everywhere), MLE gives the same results as other methods, such as least-squares estimation
Its formulation in terms of conditional probabilities makes it easy to use and manipulate in Bayesian frameworks
I am trying to implement my first spam filter using a naive Bayes classifier. I am using the data provided by UCI's machine learning data repository (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). The data is a table of features corresponding to a few thousand spam and non-spam (ham) messages. Therefore, my features are limited to those provided by the table.
My goal is to implement a classifier that can calculate P(S∣M), the probability of being spam given a message. So far I have been using the following equation to calculate P(S∣F), the probability of being spam given a feature.
P(S∣F) = P(F∣S) / (P(F∣S) + P(F∣H))
from http://en.wikipedia.org/wiki/Bayesian_spam_filtering
where P(F∣S) is the probability of feature given spam and P(F∣H) is the probability of feature given ham. I am having trouble bridging the gap from knowing a P(S∣F) to P(S∣M) where M is a message and a message is simply a bag of independent features.
At a glance I want to just multiply the features together, but that would make most numbers very small; I am not sure if that is normal.
In short these are the questions I have right now.
1.) How do I get from a set of P(S∣F) values to a single P(S∣M)?
2.) Once P(S∣M) has been calculated, how do I define a threshold for my classifier?
3.) Fortunately my feature set was selected for me; how would I go about selecting or finding my own feature set?
I would also appreciate resources that might help me out as well. Thanks for your time.
You want to use Naive Bayes:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
It's probably beyond the scope of this answer to explain it fully, but essentially you multiply together the probability of each feature given spam, and multiply that by the prior probability of spam. Then repeat for ham (i.e. multiply together the probability of each feature given ham, and multiply that by the prior probability of ham). Now you have two numbers which can be normalized to probabilities by dividing each by the total of both; that gives you P(S|M) and P(H|M). Again, read the article above. If you want to avoid numerical underflow, take the log of each conditional and prior probability (any base) and add, instead of multiplying the original probabilities; adding logs is equivalent to multiplying the original numbers. This won't give you a probability at the end, but you can still take the class with the larger value as the predicted class.
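In code, the log-space scoring looks roughly like this (a sketch; the conditional-probability tables are assumed to come from your training counts, ideally with Laplace smoothing so nothing is exactly zero):

    import math

    def classify(features, prior_spam, prior_ham, p_f_given_spam, p_f_given_ham):
        """Label a message's feature list as 'spam' or 'ham'.
        p_f_given_spam / p_f_given_ham map each feature to its estimated
        conditional probability (assumed precomputed from training data)."""
        log_spam = math.log(prior_spam)
        log_ham = math.log(prior_ham)
        for f in features:
            log_spam += math.log(p_f_given_spam[f])  # summing logs ==
            log_ham += math.log(p_f_given_ham[f])    # multiplying probabilities
        return "spam" if log_spam > log_ham else "ham"

    # To recover an actual probability P(S|M), exponentiate and normalize:
    # P(S|M) = exp(log_spam) / (exp(log_spam) + exp(log_ham))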
You should not need to set a threshold; simply classify each instance as whichever is more likely, spam or ham (i.e. whichever gives you the greater log likelihood).
There is no simple answer to this. Using a bag-of-words model is reasonable for this problem. Avoid very infrequent words (occurring in fewer than 5 documents) and also very frequent ones, such as "the", "and" and "a"; a stop word list is often used to remove the latter. A feature selection algorithm can also help, and removing features that are highly correlated will help particularly with Naive Bayes, which is very sensitive to this.