How is MLE used to train an n-gram model?

I have read many documents about training an n-gram model using MLE, but as far as I can tell every implementation just calculates conditional probabilities by counting n-grams. My question is: what is the relationship with MLE?

Intuitively, you would have to count all the n-grams in all texts in the world to compute their probabilities. Since this is highly unrealistic, MLE provides a way to estimate these n-gram probabilities by counting them in the given corpus.
For instance, if you need the bigram probability of a word y following a word x, you count the number of their occurrences as a pair, count(x, y). Then you normalize this count by dividing it by the sum of all bigram counts starting with x (i.e. x being followed by every possible word), sum_w count(x, w), so that the MLE estimate ultimately lies between 0 and 1.
Therefore, this bigram probability can be estimated by the following expression:
P_MLE(y | x) = count(x, y) / sum_w count(x, w)
Note that this expression can be further simplified, because the sum of all bigram counts starting with x must add up to the unigram count of x itself:
P_MLE(y | x) = count(x, y) / count(x)
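As a concrete sketch of this estimator (my own illustration, not from the original answer; the toy corpus and function name are made up), counting bigrams and dividing by the unigram count of the first word:
```python
from collections import Counter

def bigram_mle(tokens):
    """Estimate P(y | x) by maximum likelihood: count(x, y) / count(x)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    # The MLE estimate divides each bigram count by the unigram count of its first word.
    return {(x, y): c / unigrams[x] for (x, y), c in bigrams.items()}

corpus = "the cat sat on the mat the cat slept".split()
probs = bigram_mle(corpus)
print(probs[("the", "cat")])  # 2 occurrences of ("the", "cat") / 3 occurrences of "the" ≈ 0.667
```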

Related

What is Maximum Entropy?

Can someone give me a clear and simple definition of Maximum Entropy classification? It would be very helpful if someone could provide a clear analogy, as I am struggling to understand.
"Maximum Entropy" is synonymous with "Least Informative". You wouldn't want a classifier that was least informative. It is in reference to how the priors are established. Frankly, "Maximum Entropy Classification" is an example of using buzz words.
For an example of an uninformative prior, consider being given a six-sided object. The probability that any given face will appear when the object is tossed is 1/6. This would be your starting prior. It's the least informative. You really wouldn't want to start with anything else, or you would bias later calculations. Of course, if you have knowledge that one side will appear more often, you should incorporate that into your priors.
Bayes' formula is P(H|E) = P(E|H)P(H)/P(E),
where P(H) is the prior for the hypothesis and P(E), the evidence, is the sum of all possible numerators.
For text classification where a missing word is to be inserted, E is some given document and H is the given word. In other words, the hypothesis is that H is the word which should be selected, and P(H) is the weight given to the word.
Maximum Entropy text classification means: start with the least informative weights (priors) and optimize to find weights that maximize the likelihood of the data, P(E). Essentially, it's the EM algorithm.
A simple Naive Bayes classifier would assume the prior weights are proportional to the number of times the word appears in the document. However, this ignores correlations between words.
The so-called MaxEnt classifier takes the correlations into account.
I can't think of a simple example to illustrate this, but I can think of some correlations. For example, in English a missing word following "the" should give higher weight to nouns, but a Naive Bayes classifier might give equal weight to a verb if its relative frequency were the same as a given noun's. A MaxEnt classifier that considers the surrounding context would give more weight to nouns because they would be more likely in context.
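As a hedged illustration of the idea (not part of the original answer): a MaxEnt text classifier is essentially multinomial logistic regression over word features, so a minimal sketch, assuming scikit-learn and a made-up toy dataset, could look like this:
```python
# A minimal MaxEnt (multinomial logistic regression) text classifier sketch.
# Assumes scikit-learn is available; the toy data below is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["the cat sat on the mat", "stocks fell sharply today", "the dog chased the cat"]
labels = ["pets", "finance", "pets"]

# Bag-of-words features + a MaxEnt model: the weights are optimized to maximize
# the (regularized) likelihood of the training data, starting from uniform weights.
clf = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(docs, labels)
print(clf.predict(["my cat likes the mat"]))  # expected: ['pets']
```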
I can also recommend "Hidden Markov and Maximum Entropy Models" from the Department of Computer Science, Johns Hopkins. Specifically, take a look at chapter 6.6. This book explains Maximum Entropy using the example of PoS tagging and compares the MaxEnt application in MEMMs with Hidden Markov Models. There is also an explanation of what exactly MaxEnt is, with the math behind it.
(Taken from "Understanding Deep Learning Generalization by Maximum Entropy", Zheng et al., 2017):
(Original Maximum Entropy Model) Supposing the dataset has input X and label Y, the task is to find a good prediction of Y using X. The prediction Ŷ needs to maximize the conditional entropy H(Ŷ|X) while preserving the same distribution with the data (X, Y). This is formulated as:
min −H(Ŷ|X)   (1)
s.t. P(X, Y) = P(X, Ŷ),  sum_Ŷ P(Ŷ|X) = 1
Berger et al., 1996 solve this with Lagrange multipliers ω_i, obtaining an exponential form:
P_ω(Ŷ = y | X = x) = (1/Z_ω(x)) exp(sum_i ω_i f_i(x, y))
where the f_i(x, y) are feature functions and Z_ω(x) is the normalization factor.
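As a hedged sketch of that exponential form (my own illustration, not from the paper; the feature functions and weights below are made up), the probability is just a softmax over weighted feature scores:
```python
import math

def maxent_prob(x, y, labels, features, weights):
    """P_w(y | x) = exp(sum_i w_i * f_i(x, y)) / Z_w(x), with Z_w(x) normalizing over labels."""
    score = lambda lab: math.exp(sum(w * f(x, lab) for w, f in zip(weights, features)))
    z = sum(score(lab) for lab in labels)  # the partition function Z_w(x)
    return score(y) / z

# Toy binary feature functions and weights (purely illustrative).
features = [lambda x, y: 1.0 if ("cat" in x and y == "pets") else 0.0,
            lambda x, y: 1.0 if ("stock" in x and y == "finance") else 0.0]
weights = [1.5, 2.0]
print(maxent_prob("the cat sat", "pets", ["pets", "finance"], features, weights))
```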

Learning to tag sentences with keywords based on examples

I have a set (~50k elements) of small text fragments (usually one or two sentences) each one tagged with a set of keywords chosen from a list of ~5k words.
How would I go about implementing a system that, learning from these examples, can then tag new sentences with the same set of keywords? I don't need code, I'm just looking for pointers and methods/papers/possible ideas on how to implement this.
If I understood you well, what you need is a measure of similarity for a pair of documents. I have recently been using TF-IDF for clustering of documents and it worked quite well. I think you can compute TF-IDF values here and then calculate the cosine similarity between the TF-IDF vectors of each pair of documents.
TF-IDF computation
TF-IDF stands for Term Frequency - Inverse Document Frequency. Here is a definition of how it can be calculated:
Compute TF-IDF values for all words in all documents:
- the TF-IDF score of a word W in document D is
TF-IDF(W, D) = TF(W, D) * IDF(W)
where TF(W, D) is the frequency of word W in document D and
IDF(W) = log(N / (2 + #W))
N - number of documents
#W - number of documents that contain word W
- words contained in the title count twice (i.e. they are treated as more important)
- normalize the TF-IDF values: the sum of all TF-IDF(W, D)^2 in a document should be 1.
Depending on the technology you use, this may be achieved in different ways. I implemented it in Python using a nested dictionary: the document name D is the outer key, and for each document D there is a nested dictionary with word W as key and the calculated TF-IDF value as the corresponding value.
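A minimal sketch of that computation (my own illustration, following the formula and the nested-dictionary layout described above; the title-doubling step is omitted and the toy documents are made up):
```python
import math
from collections import Counter

def tfidf(documents):
    """documents: dict mapping document name -> list of tokens.
    Returns a nested dict: {doc_name: {word: normalized TF-IDF value}}."""
    n = len(documents)
    # #W = number of documents containing word W
    df = Counter(w for tokens in documents.values() for w in set(tokens))
    scores = {}
    for name, tokens in documents.items():
        tf = Counter(tokens)
        # Note: this uses the exact formula given above, log(N / (2 + #W)),
        # so words that appear in most documents can receive negative scores.
        vec = {w: tf[w] * math.log(n / (2 + df[w])) for w in tf}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        scores[name] = {w: v / norm for w, v in vec.items()}  # sum of squares = 1
    return scores

docs = {"d1": "the cat sat on the mat".split(), "d2": "the dog ate my homework".split()}
print(tfidf(docs)["d1"])
```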
Similarity computation
Let's say you have already calculated the TF-IDF values and you want to compare two documents D1 and D2 to see how similar they are. For that we need to use some similarity metric. There are many choices, each one having pros and cons. In this case, IMO, Jaccard similarity and cosine similarity would work well. Both functions would take the TF-IDF values and the names of the two documents D1 and D2 as arguments, and return a numeric value which indicates how similar the two documents are.
After computing the similarity between two documents you obtain a numeric value. The greater the value, the more similar the two documents D1 and D2 are. Now, depending on what you want to achieve, there are two scenarios.
If you want to assign to a document only the tags of the most similar document, you compare it with all other documents and assign to the new document the tags of the most similar one.
Alternatively, you can set a threshold and assign all tags of documents whose similarity with the document in question is greater than the threshold value. If you set threshold = 0.7, then each new document D will receive the tags of every already-tagged document V for which similarity(D, V) > 0.7.
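A hedged sketch of the similarity and thresholding steps (my own illustration; it assumes the nested {word: TF-IDF} dictionaries produced above, and the function names are made up):
```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two {word: tfidf} dicts."""
    dot = sum(vec_a[w] * vec_b.get(w, 0.0) for w in vec_a)
    na = math.sqrt(sum(v * v for v in vec_a.values()))
    nb = math.sqrt(sum(v * v for v in vec_b.values()))
    return dot / (na * nb) if na and nb else 0.0

def tag_document(new_vec, tagged_vectors, tags, threshold=0.7):
    """Assign the tags of every already-tagged document whose similarity exceeds the threshold."""
    assigned = set()
    for name, vec in tagged_vectors.items():
        if cosine_similarity(new_vec, vec) > threshold:
            assigned.update(tags[name])
    return assigned
```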
I hope it helps.
Best of luck :)
Given your description, you are looking for some form of supervised learning. There are many methods in that class, e.g. Naive Bayes classifiers, Support Vector Machines (SVM), k Nearest Neighbours (kNN) and many others.
For a numerical representation of your text, you can choose a bag-of-words or a frequency-list representation (essentially, each text is represented by a vector in a high-dimensional vector space spanned by all the words).
BTW, it is far easier to tag the texts with one keyword (a classification task) than to assign them up to five (the number of possible class combinations explodes combinatorially).
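As a hedged sketch of that supervised route (my own illustration, not part of the original answer; it assumes scikit-learn, and the toy fragments and keywords are made up), a bag-of-words representation with one one-vs-rest linear classifier per keyword:
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

texts = ["the cat sat on the mat", "stocks fell sharply today", "my dog chewed the mat"]
keywords = [["animals", "home"], ["finance"], ["animals", "home"]]

# Turn keyword sets into a binary indicator matrix, one column per keyword.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(keywords)

# One linear SVM per keyword over bag-of-words (tf-idf) features.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(texts, Y)
print(mlb.inverse_transform(model.predict(["the cat chewed my homework"])))
```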

word2vec: negative sampling (in layman terms)?

I'm reading the paper below and I have some trouble understanding the concept of negative sampling.
http://arxiv.org/pdf/1402.3722v1.pdf
Can anyone help, please?
The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have
v_c . v_w
-------------------
sum_i(v_ci . v_w)
The numerator is basically the similarity between the word c (the context) and the word w (the target). The denominator computes the similarity of all other contexts ci with the target word w. Maximising this ratio ensures that words which appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts ci. Negative sampling is one of the ways of addressing this problem: just select a couple of contexts ci at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measured by their dot product) than the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), instead of all other words in the language. This makes word2vec much, much faster to train.
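A small numeric sketch of that point (my own illustration; the vocabulary size, dimensions and vectors are made up): the exact denominator touches every context vector, while negative sampling only touches a handful of randomly chosen ones.
```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 50_000, 100                          # toy vocabulary size and embedding dimension
context_vectors = 0.1 * rng.normal(size=(V, dim))
v_w = 0.1 * rng.normal(size=dim)              # target word vector
v_c = context_vectors[123]                    # the observed context's vector

# Exact softmax denominator: sum over *every* context vector (O(V) work per update).
full_denominator = np.exp(context_vectors @ v_w).sum()

# Negative sampling: a handful of random contexts stand in for "all other words".
negatives = context_vectors[rng.integers(0, V, size=5)]
sampled = np.exp(v_c @ v_w) + np.exp(negatives @ v_w).sum()
print(full_denominator, sampled)
```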
Computing the softmax (the function that determines which words are similar to the current target word) is expensive, since it requires summing over all words in V (the denominator), which is generally very large.
What can be done?
Different strategies have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact but modify its architecture to improve its efficiency (e.g. hierarchical softmax). Sampling-based approaches, on the other hand, completely do away with the softmax layer and instead optimise some other loss function that approximates the softmax (they do this by approximating the normalization in the denominator of the softmax with some other loss that is cheap to compute, like negative sampling).
The loss function in Word2vec for a target word w with observed context c is something like:
J = −log ( exp(v_c · v_w) / sum_{ci in V} exp(v_ci · v_w) )
which the logarithm can decompose into:
J = −v_c · v_w + log sum_{ci in V} exp(v_ci · v_w)
With some maths and gradient formulas (see the references below for details) it is converted to:
J = −log σ(v_c · v_w) − sum_{i=1..k} log σ(−v_ci · v_w)
where σ is the sigmoid function. As you see, it is converted to a binary classification task (y=1 positive class, y=0 negative class). As we need labels to perform this binary classification task, we designate all context words c as true labels (y=1, positive samples) and k words randomly selected from the corpus as false labels (y=0, negative samples).
For example, assume our target word is "Word2vec". With a window of 3, our context words are: The, widely, popular, algorithm, was, developed. These context words are considered positive labels. We also need some negative labels. We randomly pick some words from the corpus (produce, software, Collobert, margin-based, probabilistic) and consider them as negative samples. This technique of picking a few random words from the corpus is called negative sampling.
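A hedged sketch of just that sampling step (my own illustration; the toy vocabulary and counts are made up, and the counts**0.75 smoothing is an assumption based on the original word2vec implementation, not something stated in this answer):
```python
import numpy as np

def draw_negative_samples(vocab, counts, positive_words, k=5, seed=0):
    """Draw k negative words, skipping the actual (positive) context words."""
    rng = np.random.default_rng(seed)
    # Smoothed unigram distribution (assumption: the 3/4 power used by word2vec).
    probs = np.array([counts[w] for w in vocab], dtype=float) ** 0.75
    probs /= probs.sum()
    negatives = []
    while len(negatives) < k:
        w = rng.choice(vocab, p=probs)
        if w not in positive_words:
            negatives.append(w)
    return negatives

vocab = ["the", "widely", "popular", "Word2vec", "algorithm", "produce", "software", "probabilistic"]
counts = {"the": 100, "widely": 5, "popular": 8, "Word2vec": 3, "algorithm": 7,
          "produce": 4, "software": 6, "probabilistic": 2}
print(draw_negative_samples(vocab, counts, positive_words={"the", "widely", "popular", "algorithm"}, k=3))
```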
References:
(1) C. Dyer, "Notes on Noise Contrastive Estimation and Negative Sampling", 2014
(2) http://sebastianruder.com/word-embeddings-softmax/
I wrote a tutorial article about negative sampling here.
Why do we use negative sampling? -> to reduce computational cost
The cost function for vanilla Skip-Gram (SG) and Skip-Gram negative sampling (SGNS) looks like this:
J(θ) = −(1/T) sum_{t=1..T} sum_{−c ≤ j ≤ c, j ≠ 0} log p(w_{t+j} | w_t)
where T is the number of words in the training text and c is the size of the context window.
The probability distribution p(w_{t+j}|w_t) in SG is computed over all V words of the vocabulary with a softmax:
p(w_O | w_I) = exp(u_{w_O} · v_{w_I}) / sum_{w=1..V} exp(u_w · v_{w_I})
V can easily exceed tens of thousands when training a Skip-Gram model. The probability needs to be computed V times, making it computationally expensive. Furthermore, the normalization factor in the denominator requires an extra V computations.
On the other hand, the objective in SGNS for one training pair is computed with:
J = −log σ(c_pos · w) − sum_{c_neg in W_neg} log σ(−c_neg · w)
where σ is the sigmoid function, c_pos is the word vector of the positive context word, and W_neg holds the word vectors of all K negative samples in the output weight matrix. With SGNS, the probability needs to be computed only K + 1 times, where K is typically between 5 and 20. Furthermore, no extra iterations are necessary to compute the normalization factor in the denominator.
With SGNS, only a fraction of the weights are updated for each training sample, whereas SG updates all of the (possibly millions of) weights for each training sample.
How does SGNS achieve this? -> by transforming a multi-class classification task into a binary classification task.
With SGNS, word vectors are no longer learned by predicting context words of a center word. It learns to differentiate the actual context words (positive) from randomly drawn words (negative) from the noise distribution.
In real life, you don't usually observe a word like regression together with random words like Gangnam-Style or pimples. The idea is that if the model can distinguish between the likely (positive) pairs and the unlikely (negative) pairs, good word vectors will be learned.
For example, suppose the current positive word-context pair is (drilling, engineer), and K=5 negative samples are randomly drawn from the noise distribution: minimized, primary, concerns, led, page. As the model iterates through the training samples, the weights are optimized so that the probability for the positive pair tends to p(D=1|w,c_pos) ≈ 1, and the probability for the negative pairs tends to p(D=1|w,c_neg) ≈ 0.
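A hedged numpy sketch of that objective (my own illustration; the vectors are random and made up): the loss pushes σ(w·c_pos) toward 1 and σ(w·c_neg) toward 0 for each negative sample.
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w, c_pos, C_neg):
    """Negative-sampling loss for a single (target, context) pair.
    w: target word vector, c_pos: positive context vector, C_neg: (K, dim) negative vectors."""
    positive_term = -np.log(sigmoid(w @ c_pos))           # push p(D=1 | w, c_pos) toward 1
    negative_term = -np.log(sigmoid(-(C_neg @ w))).sum()  # push p(D=1 | w, c_neg) toward 0
    return positive_term + negative_term

rng = np.random.default_rng(0)
dim, K = 50, 5
print(sgns_loss(0.1 * rng.normal(size=dim), 0.1 * rng.normal(size=dim), 0.1 * rng.normal(size=(K, dim))))
```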

Libsvm: SVM normalizing starts from 0 or 0.001

I am using libsvm for my document classification.
I use svm.h and svm.cc only in my project.
Its struct svm_problem requires an array of svm_node entries that are non-zero, so it uses a sparse representation.
I get a vector of tf-idf values, let's say in the range [5,10]. If I normalize it to [0,1], all the 5's would become 0.
Should I remove these zeroes when sending the data to svm_train?
Wouldn't removing them reduce the information and lead to poor results?
Should I start the normalization from 0.001 rather than 0?
More generally, does normalizing to [0,1] reduce information for an SVM?
An SVM is not Naive Bayes: feature values are not counters but dimensions in a multidimensional real-valued space, so 0's carry exactly the same amount of information as 1's (which also answers your concern regarding removing 0 values: don't do it). There is no reason to ever normalize data to [0.001, 1] for an SVM.
The only issue here is that column-wise normalization is not a good idea for tf-idf, as it will degenerate your features to plain tf (for a particular i-th dimension, tf-idf is simply the tf value in [0,1] multiplied by a constant idf, so the normalization will multiply by idf^-1). I would consider one of the alternative preprocessing methods (a sketch follows below):
normalizing each dimension so that it has mean 0 and variance 1
decorrelation, by transforming x → C^(-1/2) x, where C is the data covariance matrix
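A hedged numpy sketch of those two preprocessing options (my own illustration, not part of libsvm; the toy matrix is made up):
```python
import numpy as np

# Toy tf-idf-like matrix with values roughly in [5, 10] (purely illustrative).
rng = np.random.default_rng(0)
X = 5.0 + 5.0 * rng.random((100, 20))

# Option 1: standardize each dimension to mean 0 and variance 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Option 2: decorrelate (whiten) with x -> C^(-1/2) x, where C is the covariance matrix.
C = np.cov(X_std, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)
C_inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-12)) @ eigvecs.T
X_white = X_std @ C_inv_sqrt
```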

Tanimoto Score and when it is used

I read a wiki article which describes the Jaccard index and explains the Tanimoto score as an extended Jaccard index, but what exactly does it try to do?
How is it different from other similarity scores?
When is it used?
Thank you
I just read the Wikipedia article too, so I can only interpret the content for you.
The Jaccard score is used for vectors that take discrete values, most often binary values (1 or 0). The Tanimoto score is used for vectors that can take on continuous values. It is designed so that, if the vector only takes values of 1 and 0, it works the same as the Jaccard score.
I would imagine you would use Tanimoto's when you have a 'mixed' vector that has some continuous-valued parts and some binary-valued parts.
what exactly it tries to do?
The Tanimoto score assumes that each data object is a vector of attributes. The attributes may or may not be binary in this case. If they are all binary, the Tanimoto method reduces to the Jaccard method.
T(A,B) = A·B / (||A||^2 + ||B||^2 − A·B)
In the equation, A and B are data objects represented by vectors. The similarity score is the dot product of A and B divided by the sum of the squared magnitudes of A and B minus the dot product.
How is it different from other similarity scores?
Tanimoto v/s Jaccard: If the attributes are binary, Tanimoto is reduced to Jaccard Index.
There are various similarity scores available, but let's compare Tanimoto with the most frequently used ones.
Tanimoto v/s Dice:
The Tanimoto coefficient is determined by looking at the number of attributes that are common to both data objects (the intersection of the data strings) compared to the number of attributes that are in either (the union of the data objects).
The Dice coefficient is the number of attributes common to both data objects relative to the average size of the total number of attributes present, i.e.
(A intersect B) / 0.5 (A + B)
D(A,B) = A·B / (0.5 (||A||^2 + ||B||^2))
Tanimoto v/s Cosine
Finding the cosine similarity between two data objects requires that both objects represent their attributes in a vector. Similarity is then measured as the angle between the two vectors.
cos(θ) = A·B / (||A|| ||B||)
You can also refer to When can two objects have identical Tanimoto and Cosine score.
Tanimoto v/s Pearson:
The Pearson coefficient is a complex and sophisticated approach to finding similarity. The method generates a "best fit" line between the attributes of two data objects. The Pearson coefficient is found using the following equation:
ρ(A,B) = cov(A,B) / (σ_A σ_B)
where
cov(A,B) is the covariance of A and B,
σ_A is the standard deviation of A, and
σ_B is the standard deviation of B.
The coefficient is found from dividing the covariance by the product of the standard deviations of the attributes of two data objects. It is more robust against data that isn't normalized. For example, if one person ranked movies "a", "b", and "c" with scores of 1, 2, and 3 respectively, he would have a perfect correlation to someone who ranked the same movies with a 4, 5, and 6.
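A hedged numpy sketch comparing the four scores on the same pair of vectors (my own illustration of the formulas above; the example vectors are made up):
```python
import numpy as np

def tanimoto(a, b):
    return a @ b / (a @ a + b @ b - a @ b)

def dice(a, b):
    return a @ b / (0.5 * (a @ a + b @ b))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def pearson(a, b):
    return np.cov(a, b)[0, 1] / (a.std(ddof=1) * b.std(ddof=1))

# Pearson gives a perfect 1.0 for the movie-ranking example above: scores (1, 2, 3) vs (4, 5, 6).
A, B = np.array([1.0, 2.0, 3.0]), np.array([4.0, 5.0, 6.0])
for name, f in [("Tanimoto", tanimoto), ("Dice", dice), ("Cosine", cosine), ("Pearson", pearson)]:
    print(name, round(f(A, B), 3))
```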
For more information on Tanimoto score v/s other similarity scores/coefficients you can refer:
Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations?
When is it used?
Tanimoto score can be used in both the situations:
When attributes are Binary
When attributes are Non-Binary
Following applications extensively use Tanimoto score:
Chemoinformatics
Clustering
Plagiarism detection
Automatic thesaurus extraction
To visualize high-dimensional datasets
Analyze market-basket transactional data
Detect anomalies in spatio-temporal data
