What is the difference between mteval-v13a.pl and NLTK BLEU? - machine-learning

There is an implementation of BLEU score in Python NLTK,
nltk.translate.bleu_score.corpus_bleu
But I am not sure if it is the same as the mteval-v13a.pl script.
What is the difference between them?

TL;DR
Use https://github.com/mjpost/sacrebleu when evaluating Machine Translation systems.
In Short
No, the BLEU in NLTK isn't exactly the same as mteval-13a.pl.
But it can get really close, see https://github.com/nltk/nltk/issues/1330#issuecomment-256237324
nltk.translate.corpus_bleu corresponds to mteval-13a.pl up to the 4th order of ngram with some floating point discrepancies
The details of the comparison and the dataset used can be downloaded from https://github.com/nltk/nltk_data/blob/gh-pages/packages/models/wmt15_eval.zip or:
import nltk
nltk.download('wmt15_eval')
The major differences:
In Long
There are several differences between mteval-13a.pl and nltk.translate.corpus_bleu:
The first difference is the fact that mteval-13a.pl comes with its own NIST tokenizer while the NLTK version of BLEU is the implementation of the metric and assumes that input is pre-tokenized.
BTW, this ongoing PR will bridge the gap between NLTK and NIST tokenizers
The other major difference is that mteval-13a.pl expects the input to be in .sgm format while NLTK BLEU takes Python lists of lists of strings; see the README.txt in the zipball here for more information on how to convert a text file to SGM.
mteval-13a.pl expects an ngram order of at least 1-4. If the highest ngram order available in the sentence/corpus is less than 4, one of the ngram precisions is 0, whose logarithm is float('-inf'), so the final score is 0. To emulate this behavior, NLTK has put in an _emulate_multibleu flag:
See https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L477
mteval-13a.pl is able to generate NIST scores while NLTK doesn't have NIST score implementation (at least not yet)
NIST score in NLTK is upcoming in this PR
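To make the input-format difference concrete, here is a minimal sketch of calling corpus_bleu on pre-tokenized data. The sentences are made up, and str.split() is only a stand-in for a real tokenizer (it is not the NIST tokenizer). Each hypothesis is identical to one of its references, so the score should come out as 1.0.

```python
from nltk.translate.bleu_score import corpus_bleu

# One token list per hypothesis (str.split is a stand-in tokenizer).
hypotheses = [
    "the cat sat on the mat".split(),
    "there is a dog in the garden".split(),
]

# For each hypothesis, a list of one or more tokenized references.
list_of_references = [
    ["the cat sat on the mat".split(), "a cat was sitting on the mat".split()],
    ["there is a dog in the garden".split()],
]

# corpus_bleu takes the references first, then the hypotheses.
score = corpus_bleu(list_of_references, hypotheses)
print(score)
```

Note that the argument order is references first and hypotheses second, and that no tokenization happens inside the call.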
Beyond these differences, NLTK's BLEU packs in more features:
to handle fringe cases that the original BLEU (Papineni, ‎2002) overlooked
See https://github.com/nltk/nltk/pull/1383
Also, to handle fringe cases where the largest ngram order is < 4, the uniform weights of the individual ngram precisions are reweighted so that the weights sum to 1.0
See https://github.com/nltk/nltk/blob/develop/nltk/translate/bleu_score.py#L175
While NIST has a smoothing method for geometric sequence smoothing, NLTK has an equivalent object with the same smoothing method and even more smoothing methods to handle sentence-level BLEU from Chen and Cherry, 2014
Lastly, to validate the features added in NLTK's version of BLEU, a regression test was added that accounts for them; see https://github.com/nltk/nltk/blob/develop/nltk/test/unit/translate/test_bleu.py
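As a rough illustration of the smoothing point, the sketch below scores a short made-up hypothesis that shares no 4-gram with its reference. Without smoothing such a sentence ends up at (or very near) zero; passing one of the SmoothingFunction methods gives a usable sentence-level score. The choice of method1 here is arbitrary, not a recommendation.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Made-up example: unigram to trigram matches exist, but no 4-gram matches.
reference = "the cat sat on the mat".split()
hypothesis = "the cat is on the mat".split()

# method1 adds a small epsilon to ngram orders that have zero matches.
smoother = SmoothingFunction().method1
smoothed = sentence_bleu([reference], hypothesis, smoothing_function=smoother)
print(smoothed)
```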

Related

How to use bigrams + trigrams + word-marks vocabulary in countVectorizer?

I'm using text classification with naive Bayes and countVectorizer to classify dialects. I read a research paper that the author has used a combination of :
bigrams + trigrams + word-marks vocabulary
He means by word-marks here, the words that are specific to a certain dialect.
How can I tweak those parameters in countVectorizer?
word marks
So those are examples of word marks, but it isn't what I have, because mine are arabic. So I translated them.
word_marks=['love', 'funny', 'happy', 'amazing']
Those are used to classify a text.
Also, in this post:
Understanding the `ngram_range` argument in a CountVectorizer in sklearn
There was this answer :
>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]]) # unigram and bigram found
I couldn't understand the output. What does [1, 1] mean here? And how was he able to use ngram_range with vocabulary? Aren't the two mutually exclusive?
You want to use the ngram_range argument to use bigrams and trigrams. In your case, it would be CountVectorizer(ngram_range=(2, 3)) for bigrams and trigrams only, or CountVectorizer(ngram_range=(1, 3)) if you also want to keep unigrams.
See the accepted answer to this question for more details.
Please provide example of "word-marks" for the other part of your question.
You may have to run CountVectorizer twice - once for n-grams and once for your custom word-mark vocabulary. You can then concatenate the two outputs from the two CountVectorizers to get a single feature set of n-gram counts and custom vocabulary counts. The answer to the above question also explains how to specify a custom vocabulary for this second use of CountVectorizer.
Here's a SO answer on concatenating arrays
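A minimal sketch of the two-vectorizer idea follows. The documents are made up, the word-mark list is the one from the question, and scipy.sparse.hstack is just one way to do the concatenation. The second vectorizer also shows why ngram_range and vocabulary are not mutually exclusive: vocabulary only fixes which columns are counted.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love this funny movie", "what an amazing happy day"]
word_marks = ["love", "funny", "happy", "amazing"]

# First pass: bigram and trigram counts learned from the documents.
ngram_vectorizer = CountVectorizer(ngram_range=(2, 3))
ngram_counts = ngram_vectorizer.fit_transform(docs)

# Second pass: counts restricted to the fixed word-mark vocabulary.
# With a list, column order follows the order of the list.
mark_vectorizer = CountVectorizer(vocabulary=word_marks)
mark_counts = mark_vectorizer.transform(docs)

# Stack the two sparse matrices side by side into one feature matrix.
features = hstack([ngram_counts, mark_counts]).tocsr()
print(features.shape)
print(mark_counts.toarray())
```

Each row of mark_counts has one column per word mark, so a 1 means that word mark occurs once in that document, which is also how to read the [1, 1] in the quoted answer.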

Use pos tagging in bag of words

I'm using the bag of words for text classification.
Results aren't good enough, test set accuracy is below 70%.
One of the things I'm considering is to use POS tagging to distinguish the function of words. What is the go-to approach for doing it?
I'm thinking of appending the tags to the words, for example the word "love", if it's used as a noun use:
love_noun
and if it's a verb use:
love_verb
Test set accuracy near 70% is not that bad if you have hundreds of categories. You might want to measure overall precision and recall instead of accuracy.
What you proposed sounds good; it is an approach that adds feature conjunctions as additional features. Here are a few suggestions:
Still keep your original features. That is to say, don't replace love with love_noun or love_verb. Instead, you have two features coming from love:
love, love_noun (or)
love, love_verb
If you need some sample code, you can start from the nltk Python package.
>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("Love is a lovely thing"))
[('Love', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('lovely', 'JJ'), ('thing', 'NN')]
Consider using n-grams, maybe starting from adding 2-grams. For example, you might have "in" and "stock" and you might just remove "in" because it is a stop-word. If you consider 2-grams, you will get a new feature:
in-stock
which has a different meaning from "stock". It might help a lot in certain cases, for example, to distinguish "finance" from "shopping".
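The "keep the original feature and add the tagged one" suggestion could be sketched like this. The helper name pos_augmented_tokens is hypothetical, and it takes (word, tag) pairs that are assumed to come from something like nltk.pos_tag, so the snippet itself needs no tagger models.

```python
def pos_augmented_tokens(tagged_words):
    """Return each lowercased word followed by its word_TAG conjunction."""
    tokens = []
    for word, tag in tagged_words:
        word = word.lower()
        tokens.append(word)               # original bag-of-words feature
        tokens.append(f"{word}_{tag}")    # additional POS-conjoined feature
    return tokens


# Tagged pairs copied from the pos_tag output shown above.
tagged = [("Love", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("lovely", "JJ"), ("thing", "NN")]
tokens = pos_augmented_tokens(tagged)
print(tokens)
```

The resulting token list can be fed to any bag-of-words counter in place of the plain words.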

Modeling features of Relation Extraction in the SVMlight input format

I am currently working on a project that focuses on relation extraction from a corpus of Wikipedia text, and I plan to use an SVM to extract these relations. To model this, I plan to use Word features, POS Tag features, Entity features, Mention features and so on as mentioned in the following paper - https://gate.ac.uk/sale/eswc06/eswc06-relation.pdf (Page 6 onwards)
Now, I have set up the pipeline for feature extraction and got the corpus annotated and I wish to use a package like SVM-Light for the purpose of the project. According to the input file format of the SVM-Light package, this is the requisite format -
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
Example (from the SVM-Light webpage) -
In classification mode, the target value denotes the class of the example. +1 as the target value marks a positive example, -1 a negative example respectively. So, for example, the line
-1 1:0.43 3:0.12 9284:0.2 # abcdef
specifies a negative example for which feature number 1 has the value 0.43, feature number 3 has the value 0.12, feature number 9284 has the value 0.2, and all the other features have value 0. In addition, the string abcdef is stored with the vector, which can serve as a way of providing additional information for user defined kernels.
Now, I wish to know how do we model the features that I am using whose values include words, POS Tags and entity types and subtypes into the feature vector accepted by the SVM-Light package, where each feature has a real number value associated with it. How is the mapping from my choice of features to these real values done?
It would be of great help if someone who has worked at a similar problem before could just prod me in the right direction.
Thanks.
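One common way to do this mapping (a sketch under assumptions, not the method of the paper linked above) is to turn every distinct "feature name = value" pair into its own binary feature with an integer id, and write value 1 when it is present. The function name to_svmlight_line and the example feature names are hypothetical.

```python
def to_svmlight_line(label, features, index, comment=""):
    """Encode symbolic features as one line in the SVM-Light input format.

    `index` maps "name=value" strings to integer feature ids (starting at 1)
    and is extended in place whenever a new pair is seen.
    """
    ids = set()
    for name, value in features.items():
        key = f"{name}={value}"
        if key not in index:
            index[key] = len(index) + 1
        ids.add(index[key])
    # SVM-Light expects feature ids in increasing order.
    pairs = " ".join(f"{i}:1" for i in sorted(ids))
    line = f"{label:+d} {pairs}"
    return f"{line} # {comment}" if comment else line


index = {}
line1 = to_svmlight_line(1, {"word": "Paris", "pos": "NNP", "entity": "LOCATION"}, index)
line2 = to_svmlight_line(-1, {"word": "runs", "pos": "VBZ", "entity": "LOCATION"}, index, "ex2")
print(line1)
print(line2)
```

The same index has to be reused (and not extended) when encoding test data, so that feature ids mean the same thing in both files.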

How do I cluster with KL-divergence?

I want to cluster my data with KL-divergence as my metric.
In K-means:
Choose the number of clusters.
Initialize each cluster's mean at random.
Assign each data point to a cluster c with minimal distance value.
Update each cluster's mean to that of the data points assigned to it.
In the Euclidean case it's easy to update the mean, just by averaging each vector.
However, if I'd like to use KL-divergence as my metric, how do I update my mean?
Clustering with KL-divergence may not be the best idea, because KLD is missing an important property of metrics: symmetry. The resulting clusters could then be quite hard to interpret. If you want to go ahead with KLD, you could use as the distance the average of the two KLDs, i.e.
d(x,y) = KLD(x,y)/2 + KLD(y,x)/2
It is not a good idea to use KLD, for two reasons:
It is not symmetric: KLD(x,y) != KLD(y,x)
You need to be careful when using KLD in programming: the division may lead to Inf values and NaN as a result.
Adding a small number may affect the accuracy.
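The averaged distance and the division issue above can be sketched together with NumPy. The eps value and the renormalisation step are assumptions of this sketch; as noted, the added constant slightly changes the result.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, smoothed by eps to avoid Inf/NaN."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()  # renormalise after smoothing
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))


def symmetric_kl(p, q, eps=1e-12):
    """d(x, y) = KLD(x, y)/2 + KLD(y, x)/2."""
    return 0.5 * kl_divergence(p, q, eps) + 0.5 * kl_divergence(q, p, eps)


p = [0.5, 0.5, 0.0]
q = [0.1, 0.2, 0.7]
print(kl_divergence(p, q), kl_divergence(q, p))  # the two directions differ
print(symmetric_kl(p, q), symmetric_kl(q, p))    # the average is symmetric
```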
Well, it might not be a good idea to use KL in the "k-means framework". As was said, it is not symmetric, and K-Means is intended to work in Euclidean space.
However, you can try using NMF (non-negative matrix factorization). In fact, in the book Data Clustering (edited by Aggarwal and Reddy) you can find the proof that NMF (in a clustering task) works like k-means, only with the non-negativity constraint. The fun part is that NMF may use a bunch of different distances and divergences. If you program in Python: scikit-learn 0.19 implements the beta divergence, which has a variable beta as a degree of freedom. Depending on the value of beta, the divergence behaves differently: beta = 2 gives the squared Euclidean (Frobenius) distance, while beta = 1 gives the KL divergence.
This is actually very used in the topic model context, where people try to cluster documents/words over topics (or themes). By using KL, the results can be interpreted as a probabilistic function on how the word-topic and topic distributions are related.
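A minimal sketch of NMF with the KL divergence in scikit-learn, on a made-up count matrix. Taking the largest entry of each row of W as the cluster label is one simple convention, and the init, max_iter and random_state values are arbitrary choices for this example.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy non-negative data (e.g. term counts): rows are the items to cluster.
X = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 5],
], dtype=float)

# The KL loss requires the multiplicative-update solver ("mu").
model = NMF(
    n_components=2,
    beta_loss="kullback-leibler",
    solver="mu",
    init="random",
    max_iter=500,
    random_state=0,
)
W = model.fit_transform(X)   # item-by-component weights
labels = W.argmax(axis=1)    # assign each row to its strongest component
print(labels)
```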
You can find more information:
FÉVOTTE, C., IDIER, J. “Algorithms for Nonnegative Matrix Factorization with the β-Divergence”, Neural Computation, v. 23, n. 9, pp. 2421–2456, 2011. ISSN: 0899-7667. doi: 10.1162/NECO_a_00168.
LUO, M., NIE, F., CHANG, X., et al. “Probabilistic Non-Negative Matrix Factorization and Its Robust Extensions for Topic Modeling.” In: AAAI, pp. 2308–2314, 2017.
KUANG, D., CHOO, J., PARK, H. “Nonnegative matrix factorization for interactive topic modeling and document clustering”. In: Partitional Clustering Algorithms, Springer, pp. 215–243, 2015.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
K-means is intended to work with Euclidean distance: if you want to use non-Euclidean similarities in clustering, you should use a different method. The most principled way to cluster with an arbitrary similarity metric is spectral clustering, and K-means can be derived as a variant of this where the similarities are the Euclidean distances.
And as @mitchus says, KL divergence is not a metric. You want the Jensen-Shannon divergence, or its square root, known as the Jensen-Shannon distance, since it is symmetric.
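For reference, SciPy ships the Jensen-Shannon distance (the square root of the divergence) as scipy.spatial.distance.jensenshannon. The two distributions below are made up for illustration.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.1, 0.2, 0.7])

# Unlike KL, the result does not depend on the argument order.
d_pq = jensenshannon(p, q)
d_qp = jensenshannon(q, p)
print(d_pq, d_qp)
```

Because it is a proper distance, it can be plugged into clustering methods that accept a precomputed distance matrix.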

How do you write a program to find if certain words are similar?

I.e., "college", "schoolwork" and "academy" belong in the same cluster, and
the words "essay", "scholarships" and "money" also belong in the same cluster. Is this an ML or NLP problem?
It depends on how strict your definition of similar is.
Machine Learning Techniques
As others have pointed out, you can use something like latent semantic analysis or the related latent Dirichlet allocation.
Semantic Similarity and WordNet
As was pointed out, you may wish to use an existing resource for something like this.
Many research papers (example) use the term semantic similarity. It is usually computed by finding the distance between two words on a graph, where a word is a child if it is a type of its parent. Example: "songbird" would be a child of "bird". Semantic similarity can be used as a distance metric for creating clusters, if you wish.
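To illustrate the graph-distance idea, here is a toy sketch over a hand-made is-a hierarchy. The PARENT table is hypothetical and is not WordNet data; the score 1 / (1 + path length) mirrors how a path-based similarity is typically defined.

```python
# Hypothetical toy hierarchy (child -> parent); not real WordNet data.
PARENT = {
    "songbird": "bird",
    "parrot": "bird",
    "bird": "animal",
    "dog": "animal",
    "animal": None,
}


def ancestors(word):
    """Return the chain from `word` up to the root, starting with `word`."""
    chain = []
    while word is not None:
        chain.append(word)
        word = PARENT[word]
    return chain


def path_similarity(a, b):
    """1 / (1 + number of edges on the shortest path between a and b)."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    steps_in_b = {w: i for i, w in enumerate(chain_b)}
    for steps_a, w in enumerate(chain_a):
        if w in steps_in_b:  # first shared ancestor
            return 1.0 / (1 + steps_a + steps_in_b[w])
    return 0.0


print(path_similarity("songbird", "bird"))
print(path_similarity("songbird", "dog"))
```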
Example Implementation
In addition, if you put a threshold on the value of some semantic similarity measure, you can get a boolean True or False. Here is a Gist I created (word_similarity.py) that uses NLTK's corpus reader for WordNet. Hopefully that points you towards the right direction, and gives you a few more search terms.
def sim(word1, word2, lch_threshold=2.15, verbose=False):
    """Determine if two (already lemmatized) words are similar or not.

    Call with verbose=True to print the WordNet senses from each word
    that are considered similar.

    The documentation for the NLTK WordNet Interface is available here:
    http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
    """
    from nltk.corpus import wordnet as wn

    results = []
    for net1 in wn.synsets(word1):
        for net2 in wn.synsets(word2):
            try:
                lch = net1.lch_similarity(net2)
            except Exception:
                # LCH is undefined across parts of speech; skip such pairs.
                continue
            # The value to compare the LCH to was found empirically.
            # (The value is very application dependent. Experiment!)
            if lch is not None and lch >= lch_threshold:
                results.append((net1, net2))
    if not results:
        return False
    if verbose:
        for net1, net2 in results:
            print(net1)
            print(net1.definition())
            print(net2)
            print(net2.definition())
            print('path similarity:')
            print(net1.path_similarity(net2))
            print('lch similarity:')
            print(net1.lch_similarity(net2))
            print('wup similarity:')
            print(net1.wup_similarity(net2))
            print('-' * 79)
    return True
Example output
>>> sim('college', 'academy')
True
>>> sim('essay', 'schoolwork')
False
>>> sim('essay', 'schoolwork', lch_threshold=1.5)
True
>>> sim('human', 'man')
True
>>> sim('human', 'car')
False
>>> sim('fare', 'food')
True
>>> sim('fare', 'food', verbose=True)
Synset('fare.n.04')
the food and drink that are regularly served or consumed
Synset('food.n.01')
any substance that can be metabolized by an animal to give energy and build tissue
path similarity:
0.5
lch similarity:
2.94443897917
wup similarity:
0.909090909091
-------------------------------------------------------------------------------
True
>>> sim('bird', 'songbird', verbose=True)
Synset('bird.n.01')
warm-blooded egg-laying vertebrates characterized by feathers and forelimbs modified as wings
Synset('songbird.n.01')
any bird having a musical call
path similarity:
0.25
lch similarity:
2.25129179861
wup similarity:
0.869565217391
-------------------------------------------------------------------------------
True
>>> sim('happen', 'cause', verbose=True)
Synset('happen.v.01')
come to pass
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
Synset('find.v.01')
come upon, as if by accident; meet with
Synset('induce.v.02')
cause to do; cause to act in a specified manner
path similarity:
0.333333333333
lch similarity:
2.15948424935
wup similarity:
0.5
-------------------------------------------------------------------------------
True
I suppose you could build your own database of such associations using ML and NLP techniques, but you might also consider querying existing resources such as WordNet to get the job done.
If you have a sizable collection of documents related to the topic of interest, you might want to look at Latent Dirichlet Allocation. LDA is a fairly standard NLP technique that automatically clusters words into topics, where similarity between words is determined by collocation in the same document (you can treat a single sentence as a document if that serves your needs better).
You'll find a number of LDA toolkits available. We'd need more detail on your exact problem before recommending one over another. I'm not enough of an expert to make that recommendation anyway, but I can at least suggest you look at LDA.
The famous quote regarding your question is by John Rupert Firth in 1957:
You shall know a word by the company it keeps
To start delving into this topic you can look into this presentation.
Word2Vec can help to find similar words (contextually/semantically). In word2vec, words are represented as vectors in an n-dimensional space, so we can calculate the distance between words (e.g. Euclidean distance) or simply build clusters.
From this, we can come up with a numerical value for the similarity between two words.
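As a rough sketch of turning word vectors into a similarity number, the snippet below uses cosine similarity, a common alternative to Euclidean distance for embeddings. The three-dimensional vectors are invented for illustration and do not come from a trained word2vec model.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical embeddings; a real model would provide these vectors.
vectors = {
    "college": [0.9, 0.1, 0.0],
    "academy": [0.8, 0.2, 0.1],
    "money": [0.1, 0.9, 0.3],
}

sim_close = cosine_similarity(vectors["college"], vectors["academy"])
sim_far = cosine_similarity(vectors["college"], vectors["money"])
print(sim_close, sim_far)
```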

Resources