I am reading Foundations of Statistical Natural Language Processing. It has the following statement about the relationship between information entropy and language models:
...The essential point here is that if a model captures more of the
structure of a language, then the entropy of the model should be
lower. In other words, we can use entropy as a measure of the quality
of our models...
But how about this example:
Suppose we have a machine that spits out two characters, A and B, one by one, and the designer of the machine makes A and B equally probable.
I am not the designer, and I try to model it through experiment.
During an initial experiment, I see the machine spit out the following character sequence:
A, B, A
So I model the machine as P(A)=2/3 and P(B)=1/3, and we can calculate the entropy of this model as:
-2/3*Log(2/3) - 1/3*Log(1/3) = 0.918 bit (the base is 2)
But then the designer tells me about his design, so I refine my model with this new information. The new model looks like this:
P(A)=1/2 P(B)=1/2
And the entropy of this new model is:
-1/2*Log(1/2) - 1/2*Log(1/2) = 1 bit
The second model is obviously better than the first one. But the entropy increased.
My point is that, due to the arbitrariness of the models being tried, we cannot blindly say that a smaller entropy indicates a better model.
Could anyone shed some light on this?
ADD 1
(Much thanks to Rob Neuhaus!)
Yes, after re-digesting the mentioned NLP book, I think I can explain it now.
What I calculated is actually the entropy of the language model's own distribution. That by itself cannot be used to evaluate the effectiveness of a language model.
To evaluate a language model, we should measure how much surprise it gives us for real sequences in that language. For each real character encountered, the language model gives a probability p, and we use -log(p) to quantify the surprise. We then average the total surprise over a long enough sequence. So, for a 1000-letter sequence with 500 A's and 500 B's,
the average surprise given by the 2/3-1/3 model will be:
[-500*log(2/3) - 500*log(1/3)]/1000 = 1/2 * Log(9/2) ≈ 1.085 bits
While the correct 1/2-1/2 model will give:
[-500*log(1/2) - 500*log(1/2)]/1000 = 1/2 * Log(8/2) = 1 bit
So we can see that the 2/3-1/3 model gives more surprise, which indicates that it is worse than the correct model.
Only when the sequence is long enough will this average mimic the expectation over the 1/2-1/2 distribution; if the sequence is short, it won't give a convincing result.
I didn't mention cross-entropy here since I think the jargon is intimidating and not very helpful for revealing the root cause.
If you had a larger sample of data, it's very likely that the model that assigns 2/3 to A and 1/3 to B would do worse than the true model, which gives 1/2 to each. The problem is that your training set is too small, so you were misled into thinking the wrong model was better. I encourage you to experiment: generate a random string of length 10000 where each character is equally likely. Then measure the cross entropy of the 2/3, 1/3 model vs the 1/2, 1/2 model on that much longer string. I am sure you will see that the latter performs better. Here is some sample Python code demonstrating the fact.
from math import log
import random

def cross_entropy(prediction_probability_seq):
    # Average surprise, -log2(p), over the probabilities the model assigned.
    probs = list(prediction_probability_seq)
    return -sum(log(p, 2) for p in probs) / len(probs)

def predictions(seq, model):
    # Yield the probability the model assigns to each observed character.
    for item in seq:
        yield model[item]

# A long sample from the true 50/50 source.
rand_char_seq = [random.choice(['a', 'b']) for _ in range(10000)]

def print_ent(m):
    print('cross entropy of', m, cross_entropy(predictions(rand_char_seq, m)))

print_ent({'a': .5, 'b': .5})
print_ent({'a': 2./3, 'b': 1./3})
Notice that if you add an extra 'a' to the choices (so the true distribution becomes 2/3 a, 1/3 b), then the second model (which is now closer to the true distribution) gets lower cross entropy than the first.
However, one other thing to consider is that you really want to measure the likelihood on held out data that you didn't observe during training. If you do not do this, more complicated models that memorize the noise in the training data will have an advantage over smaller/simpler models that don't have as much ability to memorize noise.
One real problem with likelihood as measuring language model quality is that it sometimes doesn't perfectly predict the actual higher level application error rate. For example, language models are often used in speech recognition systems. There have been improved language models (in terms of entropy) that didn't drive down the overall system's word error rate, which is what the designers really care about. This can happen if the language model improves predictions where the entire recognition system is already confident enough to get the right answer.
I'm working with an extremely unbalanced and heterogeneous multiclass (K = 16) dataset for research, with a small N ~= 250. For some labels the dataset has a sufficient number of examples for supervised machine learning, but for others I have almost none. I'm also not in a position to expand my dataset, for a number of reasons.
As a first approach I divided my dataset into training (80%) and test (20%) sets in a stratified way and applied several classification algorithms on top of that. I repeated this procedure over 500 stratified train/test splits (each stratified sampling takes individuals randomly within each stratum), hoping to select an algorithm (model) that performs acceptably.
Because of the nature of my dataset, the performance on the test set varies greatly depending on the specific examples that end up in the training set. I'm dealing with runs whose accuracy is as high (for my application) as 82% and as low as 40%; the median over all runs is around 67% accuracy.
When facing this situation, I'm unsure what the standard procedure is (if there is any) for selecting the best-performing model. My rationale is that the highest-accuracy model may generalize better because the specific examples selected for its training set are richer, so the test set is better classified. However, I'm fully aware of the possibility that the test set is composed of "simpler" cases that are easier to classify, or that the training set contains all the hard-to-classify cases.
Is there any standard procedure for selecting the best-performing model, considering that the distribution of examples across my train/test sets causes the results to vary greatly? Am I making a conceptual mistake somewhere? Do practitioners usually just select the best-performing model without any further exploration?
I don't like the idea of using the mean/median accuracy, as obviously some models generalize better than others, but I'm by no means an expert in the field.
Confusion matrix of the predicted label on the test set of one of the best cases:
Confusion matrix of the predicted label on the test set of one of the worst cases:
They both use the same algorithm and parameters.
Good Accuracy =/= Good Model
I want to first point out that good accuracy on your test set need not mean a good model in general! In your case this mainly has to do with your extremely skewed distribution of samples.
Especially when doing a stratified split with one class dominantly represented, you will likely get good results simply by predicting that one class over and over again.
A good way to see whether this is happening is to look at a confusion matrix of your predictions.
If one class absorbs predictions that should belong to other classes as well, that is an indicator of a bad model. I would argue that in your case it will generally be very hard to find a good model unless you actively try to balance your classes more during training.
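As a rough sketch of what I mean (assuming scikit-learn; the label and prediction arrays below are made up), the confusion matrix makes such a collapse obvious:

import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical skewed labels and a classifier that only ever predicts class 0.
y_test = np.array([0, 0, 0, 0, 0, 0, 0, 1, 2, 3])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

cm = confusion_matrix(y_test, y_pred, labels=[0, 1, 2, 3])
print(cm)
# Accuracy looks passable (7/10), but every row collapses into the first column.
print("accuracy:", np.trace(cm) / cm.sum())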
Use the power of Ensembles
Another idea is indeed to use ensembling over multiple models (in your case, the models resulting from different splits), since an ensemble can be expected to generalize better.
Even if you might sacrifice a lot of accuracy on paper, I would bet that the confusion matrix of an ensemble will look much better than that of a single "high accuracy" model. Especially if you disregard the models that perform extremely poorly (make sure, again, that the "poor" performance comes from an actually bad model and not just an unlucky split), I can see this generalizing very well.
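As a minimal sketch of that idea (assuming scikit-learn; the synthetic data and the logistic-regression base model are just stand-ins for your own setup), you can average the predicted class probabilities of models trained on different stratified splits:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression

# Stand-in data; replace with your own X, y.
X, y = make_classification(n_samples=250, n_classes=4, n_informative=8, random_state=0)
X_new = X[:5]  # pretend these are unseen examples

splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
probas = []
for train_idx, _ in splitter.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    probas.append(clf.predict_proba(X_new))

# Soft voting: average the probabilities across the ensemble, then take the argmax.
avg = np.mean(probas, axis=0)
print("ensemble predictions:", clf.classes_[np.argmax(avg, axis=1)])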
Try k-fold Cross-Validation
Another common technique is k-fold cross-validation. Instead of performing your evaluation on a single 80/20 split, you divide your data into k equally large sets and then repeatedly train on k-1 of them while evaluating on the remaining one. You not only get a feel for whether your split was reasonable (k-fold CV implementations, like the one in sklearn, usually report the results for all the different splits), but you also get an overall score that is the average over all folds.
Note that 5-fold CV corresponds to a split into five 20% sets, so it is essentially what you are doing now, plus the "shuffling part".
CV is also a good way to deal with little training data, with imbalanced classes, or whenever you generally want to make sure your model actually performs well.
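For example, with sklearn (a sketch; the synthetic data and the random-forest classifier are placeholders for your own), stratified k-fold CV gives you one score per fold plus the mean:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stand-in data; replace with your own X, y.
X, y = make_classification(n_samples=250, n_classes=4, n_informative=8, random_state=0)

clf = RandomForestClassifier(random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print("per-fold accuracy:", scores)   # one value per split
print("mean accuracy:", scores.mean())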
I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree. I'd like to find out how to combine the scores into an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
So for each trial, my input is a set of these score-vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing where a pool of candidates are interviewed by a few people who might have differing opinions but in general each tend to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to give the actual best candidate's score vector the highest meta-score among the input set. There are no degrees of bestness; it's binary, right or wrong. However, it doesn't seem like traditional binary classification, because among an input set of vectors at most one can be "classified" as right; the rest are wrong.
Thanks
Your problem doesn't exactly belong in the machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
ML, and more specifically classification, problems need training data from which your network can learn any existing patterns in the data and use them to assign a particular class to an input vector.
If you really want to use classification then I think your problem can fit into the category of OnevsAll classification. You will need a network (or just a single output layer) with as many cells/sigmoid units as you have candidates (each representing one). Note that here your number of candidates has to be fixed.
You can use your entire candidate vector as input to all the cells of your network. The output can be specified using one-hot encoding, i.e. 00100 if candidate no. 3 was the actual correct candidate, and 00000 in the case of no correct candidate.
For this to work, you will need a big data set containing your candidate vectors and the corresponding actual correct candidates. For this data you will either need a function (again, like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data in the same way as you did. This way it will maximize the number of correct outputs, but the definition of "correct" here is how you classified the training data.
You can also use a different type of output where each cell of the output layer corresponds to one of your scoring functions, and 00001 means that the candidate your 5th scoring function selected was the right one. This way your number of candidates does not have to be fixed. But again, you will have to manually set the outputs of the training data for your network to learn from.
OnevsAll is a classification technique where there are multiple cells in the output layer and each performs binary classification between one of the classes and all the others. At the end, the sigmoid with the highest probability is assigned 1 and the rest zero.
Once your system has learned how you classify data through your training data, you can feed your new data in and it will give you output in the same way i.e. 01000 etc.
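To make that concrete, here is a small sketch of the one-vs-all scheme, using scikit-learn's OneVsRestClassifier in place of a hand-built sigmoid output layer; the data is synthetic, and the 0.5 threshold for declaring "none" is just a placeholder:

import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_examples, n_candidates, n_scores = 200, 5, 3

# Each row concatenates the score vectors of all candidates; the label is the
# index of the correct candidate (synthetic here, just to show the shapes).
X = rng.random((n_examples, n_candidates * n_scores))
y = rng.integers(0, n_candidates, size=n_examples)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
probs = clf.predict_proba(X[:1])[0]   # one score per candidate position

best = int(np.argmax(probs))
answer = best if probs[best] > 0.5 else None   # the "00000" case maps to None
print(probs, "->", answer)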
I hope my answer was able to help you.:)
Hi, I've been doing a machine learning project about predicting whether a given (query, answer) pair is a good match (label the pair with 1 if it is a good match, 0 otherwise). But the problem is that in the training set all the items are labelled with 1, so I am confused because I don't think such a training set has much discriminative power. To be more specific, I could currently extract some features like:
1. textual similarity between query and answer
2. some attributes like the posting date, who created it, which aspect it is about, etc.
Maybe I should try semi-supervised learning (I have never studied it, so I have no idea whether it will work)? But with such a training set I cannot even do validation...
Actually, you can train on a data set with only positive examples; a 1-class SVM does this. However, this presumes that anything "sufficiently outside" the original data set is negative data, with "sufficiently outside" controlled mainly by the allowed error rate (nu) and the kernel parameters (e.g. gamma, or the degree k of a polynomial kernel).
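A minimal sketch of that, assuming scikit-learn (the synthetic features and the nu/gamma values are placeholders you would tune):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Features of the known-good (query, answer) pairs only.
X_pos = rng.normal(loc=1.0, scale=0.3, size=(500, 4))

clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_pos)

X_new = np.vstack([rng.normal(1.0, 0.3, (3, 4)),    # looks like the training data
                   rng.normal(-2.0, 0.3, (3, 4))])  # "sufficiently outside" it
print(clf.predict(X_new))   # +1 = treated as a match, -1 = treated as an outlier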
A solution for your problem depends on the data you have. You are quite correct that a model trains better when given representative negative examples. The description you give strongly suggests that you do know there are insufficient matches.
Do you need a strict +/- scoring for the matches? Most applications simply rank them: the match strength is the score. This changes your problem from a classification to a prediction case. If you do need a strict +/- partition (classification), then I suggest that you slightly alter your training set: include only obvious examples: throw out anything scored near your comfort threshold for declaring a match.
With these inputs only, train your model. You'll have a clear "alley" between good and bad matches, and the model will "decide" which way to judge the in-between cases in testing and production.
I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The predictors (independent variables) are all the phrases used in these papers (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with very high accuracy (more than 90%), while removing it drops the accuracy to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly, you ask whether some feature should be removed because it is a good predictor (it makes your classifier work better). The answer is short and simple: do not remove it. In fact, the whole point is to find exactly such features.
The only reason to remove such a feature would be that this phenomenon occurs only in the training set and not in real data. But in that case you have the wrong data, data that does not represent the underlying density, and you should gather better data or "clean" the current data so that it has characteristics analogous to the "real" data.
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued! In such a case, none of the papers you'd like to test your algorithm with will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
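For instance (a sketch; the toy phrase lists and labels below are made up):

from collections import Counter, defaultdict

# (phrases, label) pairs standing in for your papers; 1 = accepted, 0 = rejected.
docs_with_labels = [
    (["paper accepted on", "novel method"], 1),
    (["novel method", "baseline comparison"], 0),
    (["paper accepted on", "state of the art"], 1),
]

counts = defaultdict(Counter)
for phrases, label in docs_with_labels:
    counts[label].update(phrases)

for label, counter in counts.items():
    print(label, counter.most_common(3))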
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
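With scikit-learn, for example (a sketch; the toy corpus is a stand-in for your papers), that amounts to sorting the fitted coefficient vector:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["paper accepted on march", "we propose a novel method",
         "paper accepted on june", "baseline results are weak"]
labels = [1, 0, 1, 0]   # 1 = accepted, 0 = rejected

vec = CountVectorizer(ngram_range=(1, 3))   # unigrams, bigrams, trigrams
X = vec.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

order = np.argsort(clf.coef_[0])
features = np.array(vec.get_feature_names_out())
print("most indicative of rejection:", features[order[:5]])
print("most indicative of acceptance:", features[order[-5:]])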
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.
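A sketch of that greedy loop (assuming scikit-learn; the synthetic data, the validation split, and the logistic-regression model are placeholders), which, as noted, retrains N+1 models and so only scales to modest feature counts:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline: all N features.
base = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_val, y_val)

# Drop one feature at a time and record the change in validation accuracy.
for j in range(X.shape[1]):
    keep = [k for k in range(X.shape[1]) if k != j]
    acc = LogisticRegression(max_iter=1000).fit(X_tr[:, keep], y_tr).score(X_val[:, keep], y_val)
    print(f"drop feature {j}: accuracy change {acc - base:+.3f}")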
I currently have a classification problem with two classes. What I want to do is, given a bunch of candidates, find out which one is more likely to be class 1. The problem is that class 1 is very rare (around 1%), which I guess makes my prediction quite inaccurate.
For the training dataset, can I sample half class 1 and half class 0? This will change the prior distribution, but I don't know whether the prior distribution affects the classification results.
Indeed, a very imbalanced dataset can cause problems in classification, because simply by defaulting to the majority class 0 a classifier can already achieve a very low error rate.
There are some workarounds that may or may not work for your particular problem, such as giving equal weight to the two classes (thus weighting instances from the rare class more strongly), oversampling the rare class (i.e. learning each instance multiple times), producing slight variations of the rare objects to restore balance (SMOTE), and so on.
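As a sketch of two of those options (assuming scikit-learn; the SMOTE part additionally assumes the imbalanced-learn package, and the data is synthetic):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with a ~1% positive class.
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)

# Option 1: reweight the classes instead of resampling.
clf_weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# Option 2: oversample the rare class with SMOTE (from imbalanced-learn).
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
clf_smote = LogisticRegression(max_iter=1000).fit(X_res, y_res)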
You really should grab some classification or machine learning book and check the index for "imbalanced classification" or "unbalanced classification". If the book is any good, it will discuss this problem. (I just assume you did not know the term that they use.)
If you're forced to pick exactly one from a group, then the prior distribution over classes won't matter because it is constant for all members of that group. If you must look at each in turn and make an independent decision as to whether it belongs to class one or class two, the prior will potentially change the decision, depending on which method you choose to do the classification. I would suggest you get hold of as many examples of the rare class as possible, but beware that blindly feeding a 50-50 split to a classifier as training data may make it implicitly fit a model that assumes this is the distribution at test time.
Sampling your two classes evenly doesn't change assumed priors unless your classification algorithm computes (and uses) priors based on the training data. You stated that your problem is "given a bunch of candidates, find out who will more likely to be the class 1". I read this to mean that you want to determine which observation is most likely to belong to class 1. To do this, you want to pick the observation $x_i$ that maximizes $p(c_1|x_i)$. Using Bayes' theorem, this becomes:
$$
p(c_1|x_i)=\frac{p(x_i|c_1)p(c_1)}{p(x_i)}
$$
You can ignore $p(c_1)$ in the equation above since it is constant across candidates. However, computing the denominator will still involve using prior probabilities. Since your problem is really more of a target detection problem than a classification problem, an alternate approach for detecting low probability targets is to take the likelihood ratio of the two classes:
$$
\Lambda=\frac{p(x_i|c_1)}{p(x_i|c_0)}
$$
To pick which of your candidates is most likely to belong to class 1, pick the one with the highest value of $\Lambda$. If your two classes are described by multivariate Gaussian distributions, you can replace $\Lambda$ with its natural logarithm, resulting in a simpler quadratic detector. If you further assume that the target and background have the same covariance matrices, this results in a linear discriminant (http://en.wikipedia.org/wiki/Linear_discriminant_analysis).
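A small numeric sketch of that likelihood-ratio detector (assuming numpy/scipy; the Gaussian class models here are fit from synthetic labelled samples):

import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X0 = rng.normal(0.0, 1.0, size=(990, 2))   # class 0 (common) samples
X1 = rng.normal(2.0, 1.0, size=(10, 2))    # class 1 (rare) samples

# One Gaussian per class; note that no prior probabilities are needed here.
g0 = multivariate_normal(mean=X0.mean(axis=0), cov=np.cov(X0, rowvar=False))
g1 = multivariate_normal(mean=X1.mean(axis=0), cov=np.cov(X1, rowvar=False))

candidates = rng.normal(1.0, 1.0, size=(5, 2))
log_lr = g1.logpdf(candidates) - g0.logpdf(candidates)   # log Lambda per candidate
print("candidate most likely to be class 1:", int(np.argmax(log_lr)))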
You may want to consider Bayesian utility theory to re-weight the costs of different kinds of error to get away from the problem of the priors dominating the decision.
Let A be the 99% prior probability class, B be the 1% class.
If we just say that all errors incur the same cost (negative utility), then it's possible that the optimal decision approach is to always declare "A". Many classification algorithms (implicitly) assume this.
If instead, we declare that the cost of declaring "B" when, in fact, the instance was "A" is much bigger than the cost of the opposite error, then the decision logic becomes, in a sense, more sensitive to slighter differences in the features.
This kind of situation frequently comes up in fault detection -- faults in the monitored system will be rare, but you want to be sure that if we see any data that points to an error condition, action needs to be taken (even if it is just reviewing the data).
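A tiny sketch of that decision rule (pure Python; the posterior values and the costs below are made-up numbers): declare "B" whenever the expected cost of declaring "A" exceeds the expected cost of declaring "B".

def decide(p_b, cost_miss=50.0, cost_false_alarm=1.0):
    # p_b: posterior probability of the rare class B given the observed data.
    expected_cost_declare_a = p_b * cost_miss               # declared A, but it was B
    expected_cost_declare_b = (1 - p_b) * cost_false_alarm  # declared B, but it was A
    return "B" if expected_cost_declare_b < expected_cost_declare_a else "A"

for p in (0.01, 0.05, 0.3):
    print(p, "->", decide(p))

With these particular costs the threshold for declaring "B" drops to roughly p > 1/51, which is exactly the "more sensitive to slighter differences" behaviour described above.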