What is Maximum Entropy? - machine-learning

Can someone give me a clear and simple definition of Maximum entropy classification? It would be very helpful if someone can provide a clear analogy, as I am struggling to understand.

"Maximum Entropy" is synonymous with "Least Informative". You wouldn't want a classifier that was least informative. It is in reference to how the priors are established. Frankly, "Maximum Entropy Classification" is an example of using buzz words.
For an example of an uninformative prior, consider given a six-sided object. The probability that any given face will appear if the object is tossed is 1/6. This would be your starting prior. It's the least informative. You really wouldn't want to start with anything else or you will bias later calculations. Of course, if you have knowledge that one side will appear more often you should incorporate that into your priors.
The Bayes formula is P(H|E) = P(E|H)P(H)/P(D)
where P(H) is the prior for the hypothesis and P(D) is the sum of all possible numerators.
For text classification where a missing word is to be inserted, E is some given document and H is the given word. IOW, the hypothesis is that H is the word which should be selected and P(H) is the weight given to the word.
Maximum Entropy Text classification means: start with least informative weights (priors) and optimize to find weights that maximize the likelihood of the data, the P(D). Essentially, it's the EM algorithm.
A simple Naive Bayes classifier would assume the prior weights would be proportional to the number of times the word appears in the document. However,this ignore correlations between words.
The so-called MaxEnt classifier, takes the correlations into account.
I can't think of a simple example to illustrate this but I can think of some correlations. For example, "the missing" in English should give higher weights to nouns but a Naive Bayes classifier might give equal weight to a verb if its relative frequency were the same as a given noun. A MaxEnt classifier considering missing would give more weight to nouns because they would be more likely in context.

I may also advise HIDDEN MARKOV AND
MAXIMUM ENTROPY
MODELS from the Department of Computer Science, Johns Hopkins. Specifically, take a look at chapter 6.6. This book explains the Maximum Entropy on the example of PoS tagging and compare MaxEnt application in MEMM with Hidden Markov Model. There are also explanation what is exactly MaxEnt with math behind.

(Taken from UNDERSTANDING DEEP LEARNING
GENERALIZATION
BY
MAXIMUM
ENTROPY (Zheng et al., 2017):
(Original Maximum Entropy Model) Supposing the dataset has input X and label
Y, the task is to find a good prediction of Y using X. The prediction Yˆ needs to maximize the
conditional entropy H(Yˆ |X) while preserving the same distribution with data (X, Y ). This is
formulated as:
min −H(Yˆ |X) (1)
s.t. P(X, Y ) = P(X, Yˆ ),
\sum(Yˆ) P(Yˆ |X) = 1
Berger et al., 1996 solves this with lagrange multipliers ωi as an exponential form:
Pω(Yˆ = y|X = x) = 1/Zω(x) exp (\sum(i) ωifi(x, y))

Related

Mathematical representation of likelihood

Likelihood is dealing with fitting models given some known data ...so does it implies p(y|x)...??
As much I know, likelihood formula is p(x|y). So it do contradict with the definition. Please explain.
We can usually interpret the likelihood in a machine learning model as the probability of observing y when the model is given x.
The maximum likelihood estimator can be derived as the product sum of the conditioned probabilities:
or just
That is, the probability of observing the i'th output y, when we give the i'th input x to some model with hypothesis h.
For reference, you could consider a multivariate regression model with regression coefficients w. Naturally we want to maximize the likelihood, by searching for some optimal weights w*. We can thus write the likelihood as
Hope it clarifies a bit :)

Difference between Probabilistic kNN and Naive Bayes

I'm trying to modify an standard kNN algorithm to obtain the probability of belonging to a class instead of just the usual classification. I haven't found much information about Probabilistic kNN, but as far as I understand, it works similar to kNN, with the difference that it calculates the percentage of examples of every class inside the given radius.
So I wonder, what's the difference then between Naive Bayes and Probabilistic kNN? I just can spot that Naive Bayes takes into consideration the prior possibility, while PkNN does not. Am I getting it wrong?
Thanks in advance!
To be honest there is nearly no similarity.
Naive bayes assumes that each class is distributed according to a simple distribution, independent on feature basis. For contiuous case - It will fit a radial Normal distribution to your whole class (each of them) and then make a decision through argmax_y N(m_y, Sigma_y)
KNN on the other hand is not a probabilistic model. Modification that you are refering to is simply a "smooth" version of the original idea, where you return ratio of each class in the nearest neighbours set (and this is not really any "probabilistic kNN", it is just regular kNN which rough estimate of probability). This assumes nothing about data distribution (besides being localy smooth). In particular - it is a nonparametric model which, given enough training samples, will fit perfectly to any dataset. Naive Bayes will fit perfectly only to K gaussians (where K is number of classes).
(I don't know how to format math formulas. For more details and clear representations, please see this.)
I would like to propose an opposite view that KNN is a kind of simplified Naive Bayes (NB) by viewing KNN as a mean of density estimation.
To perform density estimation, we attempt to estimate p(x) = k/NV, where k is the number of samples lying in a region R, N is the total sample number, and V is the volume of the region R. Usually, there are two ways to estimate it: (1) fixing V, calculate k, which is known as kernel density estimation or Parzen window; (2) fixing k, calculate V, which is the KNN-based density estimation. The latter one is much less famous than the former one due to its many drawbacks.
Yet, we can use KNN-based density estimation to connect KNN and NB. Given total N samples, Ni samples for class ci, we can write the NB in the form of KNN-based density estimation by considering a region contain x:
P(ci|x) = P(x|ci)P(ci)/P(x) = (ki/NiV)(Ni/N)/(k/NV) = ki/k,
where ki is the sample number of class ci lying in the region. The final form ki/k is actually the KNN classifier.

word2vec: negative sampling (in layman term)?

I'm reading the paper below and I have some trouble , understanding the concept of negative sampling.
http://arxiv.org/pdf/1402.3722v1.pdf
Can anyone help , please?
The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have
v_c . v_w
-------------------
sum_i(v_ci . v_w)
The numerator is basically the similarity between words c (the context) and w (the target) word. The denominator computes the similarity of all other contexts ci and the target word w. Maximising this ratio ensures words that appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts ci. Negative sampling is one of the ways of addressing this problem- just select a couple of contexts ci at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measures by their dot product) than the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), instead of all other words in language. This makes word2vec much much faster to train.
Computing Softmax (Function to determine which words are similar to the current target word) is expensive since requires summing over all words in V (denominator), which is generally very large.
What can be done?
Different strategies have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact, but modify its architecture to improve its efficiency (e.g hierarchical softmax). Sampling-based approaches on the other hand completely do away with the softmax layer and instead optimise some other loss function that approximates the softmax (They do this by approximating the normalization in the denominator of the softmax with some other loss that is cheap to compute like negative sampling).
The loss function in Word2vec is something like:
Which logarithm can decompose into:
With some mathematic and gradient formula (See more details at 6) it converted to:
As you see it converted to binary classification task (y=1 positive class, y=0 negative class). As we need labels to perform our binary classification task, we designate all context words c as true labels (y=1, positive sample), and k randomly selected from corpora as false labels (y=0, negative sample).
Look at the following paragraph. Assume our target word is "Word2vec". With window of 3, our context words are: The, widely, popular, algorithm, was, developed. These context words consider as positive labels. We also need some negative labels. We randomly pick some words from corpus (produce, software, Collobert, margin-based, probabilistic) and consider them as negative samples. This technique that we picked some randomly example from corpus is called negative sampling.
Reference :
(1) C. Dyer, "Notes on Noise Contrastive Estimation and Negative Sampling", 2014
(2) http://sebastianruder.com/word-embeddings-softmax/
I wrote an tutorial article about negative sampling here.
Why do we use negative sampling? -> to reduce computational cost
The cost function for vanilla Skip-Gram (SG) and Skip-Gram negative sampling (SGNS) looks like this:
Note that T is the number of all vocabs. It is equivalent to V. In the other words, T = V.
The probability distribution p(w_t+j|w_t) in SG is computed for all V vocabs in the corpus with:
V can easily exceed tens of thousand when training Skip-Gram model. The probability needs to be computed V times, making it computationally expensive. Furthermore, the normalization factor in the denominator requires extra V computations.
On the other hand, the probability distribution in SGNS is computed with:
c_pos is a word vector for positive word, and W_neg is word vectors for all K negative samples in the output weight matrix. With SGNS, the probability needs to be computed only K + 1 times, where K is typically between 5 ~ 20. Furthermore, no extra iterations are necessary to compute the normalization factor in the denominator.
With SGNS, only a fraction of weights are updated for each training sample, whereas SG updates all millions of weights for each training sample.
How does SGNS achieve this? -> by transforming multi-classification task into binary classification task.
With SGNS, word vectors are no longer learned by predicting context words of a center word. It learns to differentiate the actual context words (positive) from randomly drawn words (negative) from the noise distribution.
In real life, you don't usually observe regression with random words like Gangnam-Style, or pimples. The idea is that if the model can distinguish between the likely (positive) pairs vs unlikely (negative) pairs, good word vectors will be learned.
In the above figure, current positive word-context pair is (drilling, engineer). K=5 negative samples are randomly drawn from the noise distribution: minimized, primary, concerns, led, page. As the model iterates through the training samples, weights are optimized so that the probability for positive pair will output p(D=1|w,c_pos)≈1, and probability for negative pairs will output p(D=1|w,c_neg)≈0.

maximum entropy model and logistic regression

I am doing a project that has some Natural Language Processing to do. I am using stanford MaxEnt Classifier for the purpose.But I am not sure, whether Maximum entropy model and logistic regression are one at the same or is it some special kind of logistic regression?
Can anyone come up with an explanation?
This is exactly the same model. NLP society prefers the name Maximum Entropy and uses the sparse formulation which allows to compute everything without direct projection to the R^n space (as it is common for NLP to have huge amount of features and very sparse vectors).
You may wanna read the attachment in this post, which gives a simple derivation:
http://www.win-vector.com/blog/2011/09/the-equivalence-of-logistic-regression-and-maximum-entropy-models/
An explanation is quoted from "Speech and Language Processing" by Daniel Jurafsky & James H. Martin.:
Each feature is an indicator function, which picks out a subset of the training observations. For each feature we add a constraint on our total distribution, specifying that our distribution for this subset should match the empirical distribution we saw in our training data. We then choose the maximum entropy distribution which otherwise accords with these constraints.
Berger et al. (1996) show that the solution to this optimization problem turns out to be exactly the probability distribution of a multinomial logistic regression model whose weights W maximize the likelihood of the training data!
In Max Entropy the feature is represnt with f(x,y), it mean you can design feature by using the label y and the observerable feature x, while, if f(x,y) = x it is the situation in logistic regression.
in NLP task like POS, it is common to design feature's combining labels. for example: current word ends with "ous" and next word is noun. it can be feature to predict whether the current word is adj

Text Classification - how to find the features that most affected the decision

When using SVMlight or LIBSVM in order to classify phrases as positive or negative (Sentiment Analysis), is there a way to determine which are the most influential words that affected the algorithms decision? For example, finding that the word "good" helped determine a phrase as positive, etc.
If you use the linear kernel then yes - simply compute the weights vector:
w = SUM_i y_i alpha_i sv_i
Where:
sv - support vector
alpha - coefficient found with SVMlight
y - corresponding class (+1 or -1)
(in some implementations alpha's are already multiplied by y_i and so they are positive/negative)
Once you have w, which is of dimensions 1 x d where d is your data dimension (number of words in the bag of words/tfidf representation) simply select the dimensions with high absolute value (no matter positive or negative) in order to find the most important features (words).
If you use some kernel (like RBF) then the answer is no, there is no direct method of taking out the most important features, as the classification process is performed in completely different way.
As #lejlot mentioned, with linear kernel in SVM, one of the feature ranking strategies is based on the absolute values of weights in the model. Another simple and effective strategy is based on F-score. It considers each feature separately and therefore cannot reveal mutual information between features. You can also determine how important a feature is by removing that feature and observe the classification performance.
You can see this article for more details on feature ranking.
With other kernels in SVM, the feature ranking is not that straighforward, yet still feasible. You can construct an orthogonal set of basis vectors in the kernel space, and calculate the weights by kernel relief. Then the implicit feature ranking can be done based on the absolute value of weights. Finally the data is projected into the learned subspace.

Resources