Fasttext aligned word vectors for translating homographs - machine-learning

A homograph is a word that shares the same written form as another word but has a different meaning, like "right" in the sentences below:
Success is about making the right decisions.
Turn right after the traffic light.
The English word "right", in the first case is translated to Swedish as "rätt" and to "höger" in the second case. The correct translation is possible by looking at the context (surrounding words).
Question 1. Can fastText aligned word embeddings help with translating such homographs, or other words with several possible translations, into another language?
[EDIT] The goal is not to query the model for the right translation. The goal is to pick the right translation when the following information is given:
the two (or several) possible translations options in the target language like "rätt" and "höger"
the surrounding words in the source language
Question 2. I loaded the English pre-trained vector model and the English aligned vector model. Both were trained on Wikipedia articles, and I noticed that the distances between two words were more or less preserved, but the sizes of the dataset files (wiki.en.vec vs wiki.en.align.vec) are noticeably different (about 1 GB). Wouldn't it make sense to use only the aligned version? What information is not captured by the aligned dataset?

For question 1, I suppose it's possible that these 'aligned' vectors could help translate homographs, but they still face the problem that any token has only a single vector, even if that one token has multiple meanings.
Are you assuming that you already know that right[en] could be translated into either rätt[se] or höger[se], from some external table? (That is, you're not using the aligned word-vectors as the primary means of translation, just an adjunct to other methods?)
If so, one technique that might help would be to see which of rätt[se] or höger[se] is closer to other words that surround your particular instance of right[en]. (You might tally each's rank-closeness to every word within n spots of right[en], or calculate their cosine-similarity to the average of the n words around right[en], for example.)
(You could potentially even do this with non-aligned word vectors, if your more-precise words have multiple, alternate, non-homograph/non-polysemous translations in English. For example, to determine which sense of right[en] is more likely, you could use the non-aligned English word vectors for correct[en] and rightward[en] – less polysemous correlates of rätt[se] & höger[se] – to check for similarity-to-surrounding words.)
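As a rough illustration of the "cosine-similarity to the average of the n words around right[en]" idea, here is a minimal sketch. It assumes the aligned vectors are already loaded into plain dicts mapping word to vector; the 3-dimensional toy vectors and the helper names (`pick_translation`, `mean_vector`) are invented for the example and are not part of fastText.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def mean_vector(vectors):
    """Element-wise average of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def pick_translation(context_words, candidates, src_vecs, tgt_vecs):
    """Pick the target-language candidate closest to the averaged source context."""
    known = [src_vecs[w] for w in context_words if w in src_vecs]
    if not known:
        return None  # no usable context words
    context = mean_vector(known)
    return max(candidates, key=lambda c: cosine(tgt_vecs[c], context))

# Toy "aligned" vectors, made up for illustration only.
en = {"turn": [0.9, 0.1, 0.0], "traffic": [0.8, 0.2, 0.1], "light": [0.7, 0.1, 0.2],
      "making": [0.1, 0.9, 0.1], "decisions": [0.0, 0.8, 0.3]}
sv = {"höger": [0.85, 0.15, 0.05], "rätt": [0.05, 0.9, 0.2]}

print(pick_translation(["turn", "traffic", "light"], ["rätt", "höger"], en, sv))
print(pick_translation(["making", "decisions"], ["rätt", "höger"], en, sv))
```

With real aligned vectors the result depends heavily on the window size and on whether stop words are kept in the context, so treat this only as a starting point.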
A write-up that might create other ideas is "Linear algebraic structure of word meanings" which, quite surprisingly, is able to tease-out alternate meanings of homograph tokens even when the original word-vectors training was not word-sense-aware. (Might the 'atoms of discourse' in their model be equally findable across merged/aligned multi-language vector spaces, and then the closeness-of-context-words to different atoms a good guide to word-sense-disambiguation?)
For question 2, you imply the aligned word set is smaller in size. Have you checked if that's just because it includes fewer words? That seems the simplest explanation, and just checking which words are left out would let you know what you're losing.
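A quick way to check this is to compare the vocabularies of the two files. The sketch below assumes both files use the fastText text format (a header line with the word count and dimension, then one word per line followed by its vector components); the function names are made up for the example.

```python
def read_vec_vocab(path):
    """Return the set of words stored in a fastText-style .vec text file."""
    vocab = set()
    with open(path, encoding="utf-8", errors="replace") as f:
        next(f, None)  # skip the "<word count> <dimension>" header (assumed present)
        for line in f:
            word = line.rstrip("\n").split(" ", 1)[0]
            if word:
                vocab.add(word)
    return vocab

def missing_words(full_path, aligned_path):
    """Words present in the full file but absent from the aligned one."""
    return read_vec_vocab(full_path) - read_vec_vocab(aligned_path)

# Example usage with the file names from the question:
# dropped = missing_words("wiki.en.vec", "wiki.en.align.vec")
# print(len(dropped), sorted(dropped)[:20])
```

If the set comes back empty, the size difference has to come from somewhere else (for example numeric precision in the text file), which is also worth knowing.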

Related

How to account for variation in spelling (especially for slang) for Word Embeddings/Word2Vec generation using song lyrics?

So I am working on an artist classification project that uses hip hop lyrics from genius.com. The problem is that these lyrics are user generated, so the same word can be spelled in various ways, especially if it is slang, which is very common in hip hop.
I looked into spell correction using hunspell/pyhunspell, but the problem with that is it doesn't fix slang misspellings. I technically could make a mini dictionary with a bunch of misspelled variations but that is effectively useless because there could be a dozen variations of the same word over my (growing) 6000 song corpus.
Any suggestions?
You could try to stem your words. More information on stemming here. This would help grouping together words with close spelling variations.
A popular stemming scheme is the Porter Stemmer, implementations of which can be found in most NLP packages, e.g. NLTK.
I would discard, if possible, short words, or contracted words which somehow are too hard to automatically correct them (conditioned on checking that it won't affect your final result).
For longer words, you may want to use metrics like Levenshtein distance or Jaro similarity. The first is the minimum number of insertions, deletions or substitutions needed to convert one candidate word into another. The second gives a comparable result as a score between 0 and 1; its Jaro-Winkler variant puts more emphasis on the first characters of a word.
If you have access to the correct version of your slang word, you could convert the closest candidates to the correct one. Of course, trying not to apply it to different correct words.
If you're working with Python, here some implementations are provided.
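For reference, here is a minimal pure-Python sketch of the Levenshtein idea described above, plus a hypothetical `normalize` helper that maps a spelling variant to the closest known "correct" form. The thresholds (`max_dist`, `min_len`) and the example word list are assumptions you would need to tune on your own corpus.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

def normalize(word, canonical, max_dist=2, min_len=5):
    """Map word to its closest canonical form if it is close enough.

    Short words are left alone because a small edit distance means little for them.
    """
    if not canonical or len(word) < min_len or word in canonical:
        return word
    best = min(canonical, key=lambda c: levenshtein(word, c))
    return best if levenshtein(word, best) <= max_dist else word

canonical = {"gangster", "money"}  # hypothetical list of known-correct forms
print([normalize(w, canonical) for w in ["gangsta", "monney", "yo", "street"]])
```

Note that a distance threshold alone will also rewrite genuinely different words that happen to be close (e.g. "honey" vs "money"), which is the caveat mentioned above about not applying it to other correct words.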

How to use word embeddings/word2vec .. differently? With an actual, physical dictionary

If my title is incorrect/could be better, please let me know.
I've been trying to find an existing paper/article describing the problem that I'm having: I'm trying to create vectors for words so that they are equal to the sum of their parts.
For example: Cardinal(the bird) would be equal to the vectors of: red, bird, and ONLY that.
In order to train such a model, the input might be something like a dictionary, where each word is defined by its attributes.
Something like:
Cardinal: bird, red, ....
Bluebird: blue, bird,....
Bird: warm-blooded, wings, beak, two eyes, claws....
Wings: Bone, feather....
So in this instance, each word-vector is equal to the sum of the word-vector of its parts, and so on.
I understand that in the original word2vec, semantic relationships were preserved, such that Vec(Madrid) - Vec(Spain) + Vec(France) = approx Vec(Paris).
Thanks!
PS: Also, if it's possible, new words should be able to be added later on.
If you're going to be building a dictionary of the components you want, you don't really need word2vec at all. You've already defined the dimensions you want specified: just use them, e.g. in Python:
kb = {"wings": {"bone", "feather"},
"bird": {"wings", "warm-blooded", ...}, ...}
Since the values are sets, you can do set operations such as intersection (use | instead if you want the union):
kb["bird"] & kb["reptile"]
You'll need to find some way to decompose the elements recursively for comparisons, simplifications, etc. These are decisions you'll have to make based on what you expect to happen during such operations.
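One possible way to do the recursive decomposition is sketched below. It treats any attribute without its own entry as a primitive and expands everything else until only primitives remain. The `expand` helper and the small `kb` are illustrative, built from the examples in the question.

```python
kb = {
    "cardinal": {"bird", "red"},
    "bluebird": {"bird", "blue"},
    "bird": {"warm-blooded", "wings", "beak"},
    "wings": {"bone", "feather"},
}

def expand(word, kb, seen=None):
    """Return the set of primitive attributes reachable from word."""
    seen = set() if seen is None else seen
    if word in seen:        # already expanded (also guards against circular definitions)
        return set()
    seen.add(word)
    if word not in kb:      # no definition of its own: treat as a primitive
        return {word}
    result = set()
    for part in kb[word]:
        result |= expand(part, kb, seen)
    return result

print(sorted(expand("cardinal", kb)))
print(sorted(expand("cardinal", kb) & expand("bluebird", kb)))  # shared primitives
print(sorted(expand("cardinal", kb) - expand("bluebird", kb)))  # what sets a cardinal apart
```

Whether intermediate concepts like "bird" should also be kept in the expanded set is one of those design decisions; here they are dropped in favour of their parts.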
This sort of manual dictionary development is quite an old-fashioned approach. Folks like Schank and Abelson used to do stuff like this in the 1970s. The problem is, as these dictionaries get more complex, they become intractable to maintain and more inaccurate in their approximations. You're welcome to try as an exercise (it can be kind of fun!) but keep your expectations low.
You'll also find aspects of meaning lost in these sorts of decompositions. One of word2vec's remarkable properties is its sensitivity to the gestalt of words: words may have meaning that is composed of parts, but there's a piece in that composition that makes the whole greater than the sum of the parts. In a decomposition, the gestalt is lost.
Rather than trying to build a dictionary, you might be best off exploring what W2V gives you anyway, from a large corpus, and seeing how you can leverage that information to your advantage. The linguistics of what exactly W2V renders from text aren't wholly understood, but in trying to do something specific with the embeddings, you might learn something new about language.

Named entity recognition (NER) features

I'm new to Named Entity Recognition and I'm having some trouble understanding what/how features are used for this task.
Some papers I've read so far mention features used, but don't really explain them, for example in
Introduction to the CoNLL-2003 Shared Task:Language-Independent Named Entity Recognition, the following features are mentioned:
Main features used by the sixteen systems that participated in the
CoNLL-2003 shared task sorted by performance on the English test data.
Aff: affix information (n-grams); bag: bag of words; cas: global case
information; chu: chunk tags; doc: global document information; gaz:
gazetteers; lex: lexical features; ort: orthographic information; pat:
orthographic patterns (like Aa0); pos: part-of-speech tags; pre:
previously predicted NE tags; quo: flag signing that the word is
between quotes; tri: trigger words.
I'm a bit confused by some of these, however. For example:
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
how can POS tags exactly be used as features ? Don't we have a POS tag for each word? Isn't each object/instance a "text"?
what is global document information?
what is the feature trigger words?
I think all I need here is to just to look at an example table with each of these features as columns and see their values to understand how they really work, but so far I've failed to find an easy to read dataset.
Could someone please clarify or point me to some explanation or example of these features being used?
Here's a shot at some answers (and by the way the terminology on all this stuff is super overloaded).
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
In my experience BOW feature extraction is used to produce word features out of sentences. So IMO BOW is not one feature; it is a method of generating features out of a sentence (or whatever block of text you are using). Using n-grams can help account for sequence, but BOW features amount to unordered bags of strings.
how can POS tags exactly be used as features ? Don't we have a POS tag for each word?
POS Tags are used as features because they can help with "word sense disambiguation" (at least on a theoretical level). For instance, the word "May" can be a name of a person or a month of a year or a poorly capitalized conjugated verb, but the POS tag can be the feature that differentiates that fact. And yes, you can get a POS tag for each word, but unless you explicitly use those tags in your "feature space" then the words themselves have no idea what they are in terms of their POS.
Isn't each object/instance a "text"?
If you mean what I think you mean, then this is true only if you have extracted object-instance "pairs" and stored them as features (an array of them derived from a string of tokens).
what is global document information?
I perceive this one to mean as such: Most NLP tasks function on a sentence. Global document information is data from all the surrounding text in the entire document. For instance, if you are trying to extract geographic placenames but disambiguate them, and you find the word Paris, which one is it? Well if France is mentioned 5 sentences above, that could increase the likelihood of it being Paris France rather than Paris Texas or worst case, the person Paris Hilton. It's also really important in what is called "coreference resolution", which is when you correlate a name to a pronoun reference (mapping a name mention to "he" or "she" etc).
what is the feature trigger words?
Trigger words are specific tokens or sequences that have high reliability as a stand alone thing to have a specific meaning. For instance, in sentiment analysis, curse words with exclamation marks often indicate negativity. There can be many permutations of this.
Anyway, my answers here are not perfect, and are prone to all manner of problems in human epistemology and inter-subjectivity, but this is the way I've been thinking about these things over the years I've been trying to solve problems with NLP.
Hopefully someone else will chime in, especially if I'm way off.
You should probably keep in mind that NER classifies each word/token separately, using features that are internal or external clues. Internal clues take into account the word itself (morphology such as uppercase letters, whether the token is present in a dedicated lexicon, POS) and external ones rely on contextual information (previous and next word, document features).
isn't bag of words supposed to be a method to generate features (one
for each word)? How can BOW itself be a feature? Or does this simply
mean we have a feature for each word as in BOW, besides all the other
features mentioned?
Yes, BOW generates one feature per word, sometimes with feature selection methods that reduce the number of features taken into account (e.g. a minimum word frequency).
how can a gazetteer be a feature?
A gazetteer may also generate one feature per word, but in most cases it enriches the data by labelling words or multi-word expressions (such as full proper names). It is an ambiguous step: "George Washington" will lead to two features: the entire "George Washington" as a celebrity and "Washington" as a city.
how can POS tags exactly be used as features ? Don't we have a POS tag
for each word? Isn't each object/instance a "text"?
For classifiers, each instance is a word. This is why sequence labelling methods (e.g. CRF) are used: they make it possible to leverage previous and next words as additional contextual features to classify the current word. Labelling a text is done as a process relying on the most likely NE types for each word in the sequence.
what is global document information?
This could be metadata (e.g. date, author), topics (full text categorization), coreference, etc.
what is the feature trigger words?
Triggers are external clues, contextual patterns that help disambiguation. For instance, "Mr" will be used as a feature that strongly suggests that the following tokens are a person.
I recently implemented a NER system in python and I found the following features helpful:
character-level ngrams (using CountVectorizer)
previous word features and labels (i.e. context)
viterbi or beam-search on label sequence probability
part of speech (pos), word-length, word-count, is_capitalized, is_stopword
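Since the question asked for an example table, here is a small sketch of how several of the CoNLL-2003 feature types (lex, pos, ort, pat, gaz, tri, aff) could look as one row of feature values per token. The gazetteer, the trigger list, the POS tags and the function names are all made up for the example; real systems use much larger resources and a proper tagger.

```python
import re

GAZETTEER = {"paris", "washington"}                       # hypothetical place-name list
TRIGGERS = {"mr", "mr.", "mrs", "mrs.", "dr", "dr."}      # hypothetical trigger words

def shape(token):
    """Orthographic pattern: uppercase -> A, lowercase -> a, digit -> 0, repeats collapsed."""
    s = re.sub(r"[A-Z]", "A", token)
    s = re.sub(r"[a-z]", "a", s)
    s = re.sub(r"[0-9]", "0", s)
    return re.sub(r"(.)\1+", r"\1", s)

def token_features(tokens, pos_tags, i):
    """Build the feature dict for the token at position i (one instance per token)."""
    tok = tokens[i]
    prev_tok = tokens[i - 1].lower() if i > 0 else "<s>"
    return {
        "lex": tok.lower(),                        # the word itself
        "pos": pos_tags[i],                        # part-of-speech tag of this token
        "ort_is_capitalized": tok[:1].isupper(),   # orthographic information
        "pat": shape(tok),                         # orthographic pattern such as Aa or 0
        "aff_suffix3": tok[-3:].lower(),           # affix information
        "gaz_in_gazetteer": tok.lower() in GAZETTEER,
        "tri_prev_is_trigger": prev_tok in TRIGGERS,
        "prev_word": prev_tok,                     # simple context feature
    }

tokens = ["Mr.", "Washington", "visited", "Paris", "in", "2003"]
pos_tags = ["NNP", "NNP", "VBD", "NNP", "IN", "CD"]   # assumed to come from a POS tagger

for i in range(len(tokens)):
    print(token_features(tokens, pos_tags, i))
```

Each printed dict is one row of the table: the instance is the token, and the columns are the features. A sequence labeller such as a CRF would consume one such dict per token.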

Selecting suitable model for creating Language Identification tool

I am working on developing a tool for language identification of a given text i.e. given a sample text, identify the language (for e.g. English, Swedish, German, etc.) it is written in.
Now the strategy I have decided to follow (based on a few references I have gathered) is as follows -
a) Create a character n-gram model (The value of n is decided based on certain heuristics and computations)
b) Use a machine learning classifier(such as naive bayes) to predict the language of the given text.
Now, the doubt I have is: is creating a character n-gram model necessary? That is, what disadvantage does a simple bag-of-words strategy have? If I use all the possible words in each language to create a prediction model, in which cases would it fail?
The reason why this doubt arose was the fact that any reference document/research paper I've come across states that language identification is a very difficult task. However, just using this strategy of using the words in the language seems to be a simple task.
EDIT: One reason why N-gram should be preferred is to make the model robust even if there are typos as stated here. Can anyone point out more?
if I use all the words possible in the respective language to create a prediction model, what could be the possible cases where it would fail
Pretty much the same cases where a character n-gram model would fail. The problem is that you're not going to find appropriate statistics for all possible words.(*) Character n-gram statistics are easier to accumulate and more robust, even for text without typos: words in a language tend to follow the same spelling patterns. E.g. had you not found statistics for the Dutch word "uitbuiken" (a pretty rare word), then the occurrence of the n-grams "uit", "bui" and "uik" would still be strong indicators of this being Dutch.
(*) In agglutinative languages such as Turkish, new words can be formed by stringing morphemes together and the number of possible words is immense. Check the first few chapters of Jurafsky and Martin, or any undergraduate linguistics text, for interesting discussions on the possible number of words per language.
Cavnar and Trenkle proposed a very simple yet efficient approach using character n-grams of variable length. Maybe you should try to implement it first and move to a more complex ML approach if C&T approach doesn't meet your requirements.
Basically, the idea is to build a language model using only the X (e.g. X = 300) most frequent n-grams of variable length (e.g. 1 <= N <= 5). Doing so, you are very likely to capture most functional words/morphemes of the considered language... without any prior linguistic knowledge on that language!
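To make the idea concrete, here is a minimal sketch in the spirit of the Cavnar and Trenkle approach: each language and each document is reduced to a ranked list of its most frequent character n-grams, and the document is assigned to the language whose ranking is least "out of place". The padding character, the tie-breaking rule and the tiny training sentences are my own choices for the example; real profiles need far more text.

```python
from collections import Counter

def ngram_profile(text, max_n=5, top_k=300):
    """Map each of the top_k most frequent character n-grams (1 <= n <= max_n) to its rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language get a maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
               for gram, rank in doc_profile.items())

def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Tiny illustrative training samples.
EN_SAMPLE = "the quick brown fox jumps over the lazy dog and the cat sat on the mat with the other animals"
SV_SAMPLE = "den snabba bruna räven hoppar över den lata hunden och katten satt på mattan med de andra djuren"
profiles = {"en": ngram_profile(EN_SAMPLE), "sv": ngram_profile(SV_SAMPLE)}

print(identify("the dog and the cat", profiles))
```

Note that no word list or linguistic knowledge is involved: frequent function words and morphemes end up at the top of the ranking on their own.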
Why would you choose character n-grams over a BoW approach? I think the notion of a character n-gram is pretty straightforward and applies to every written language. A word is a much more complex notion that differs greatly from one language to another (consider languages with almost no spacing marks).
Reference: http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf
The performance really depends on your expected input. If you will be classifying multi-paragraph text all in one language, a functional words list (which your "bag of words" with pruning of hapaxes will quickly approximate) might well serve you perfectly, and could work better than n-grams.
There is significant overlap between individual words -- "of" could be Dutch or English; "and" is very common in English but also means "duck" in the Scandinavian languages, etc. But given enough input data, overlaps for individual stop words will not confuse your algorithm very often.
My anecdotal evidence is from using libtextcat on the Reuters multilingual newswire corpus. Many of the telegrams contain a lot of proper names, loan words etc. which throw off the n-gram classifier a lot of the time; whereas just examining the stop words would (in my humble estimation) produce much more stable results.
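For comparison, the stop-word idea can be sketched in a few lines: count how many tokens of the input appear in each language's function-word list and pick the language with the most hits. The word lists below are a tiny hand-picked sample for illustration, not a vetted resource, and the function names are made up.

```python
STOP_WORDS = {
    "en": {"the", "of", "and", "to", "in", "is", "that"},
    "nl": {"de", "het", "van", "en", "een", "of", "is", "dat"},
    "sv": {"och", "att", "det", "en", "som", "är", "av"},
}

def stop_word_scores(text, stop_words=STOP_WORDS):
    """Count, per language, how many tokens of the text are in that language's list."""
    tokens = text.lower().split()
    return {lang: sum(tok in words for tok in tokens)
            for lang, words in stop_words.items()}

def guess_language(text, stop_words=STOP_WORDS):
    """Return the language with the most stop-word hits, or None if there are none."""
    scores = stop_word_scores(text, stop_words)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

print(stop_word_scores("the house of the king is old"))
print(guess_language("the house of the king is old"))
```

The scores also show the overlap issue: "of" and "is" count for Dutch as well, but with enough text the correct language accumulates more hits. On very short inputs with no function words the function simply returns None.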
On the other hand, if you need to identify short, telegraphic utterances which might not be in your dictionary, a dictionary-based approach is obviously flawed. Note that many North European languages have very productive word formation by free compounding -- you see words like "tandborstställbrist" and "yhdyssanatauti" being coined left and right (and Finnish has agglutination on top -- "yhdyssanataudittomienkinkohan") which simply cannot be expected to be in a dictionary until somebody decides to use them.

How do I design a heuristic for matching translated sentences?

Summary
I am trying to design a heuristic for matching up sentences in a translation (from the original language to the translated language) and would like guidance and tips. Perhaps there is a heuristic that already does something similar? So given two text files, I would like to be able to match up the sentences (so I can pick out a sentence and say this is the translation of that sentence).
Details
The input text would be translated novels. So I do not expect the translations to be literal, although, using something like google translate might be a good way to test the accuracy of the heuristic.
To help me, I have a library that will gloss the contents of the translated text and give me the definitions of the words in the sentence. Other things I know:
Chapters and order are preserved; I know that the first sentence in chapter three will match with the first sentence in chapter three of the translation (Note, this is not strictly true; the first sentence might match up with the first two sentences, or even the second sentence)
I can calculate the overall size (characters, sentences, paragraphs); which could give me an idea of the average difference in sentence size (for example, the translation might be 30% longer).
Looking at some books I have, the translated version has about 30% more sentences than the original text.
Implementation
(if it matters)
I am planning to do this in Java - but I am not that fussed - any language will do.
I am not greatly concerned about speed.
I guess that to be sure of the matches, some user feedback might be required, like saying "Yes, this sentence definitely matches with that sentence." This would give the heuristic some more ground to stand on. It would mean that the user needs a little proficiency in both languages.
Background
(for those interested)
The reason I want to make this is that I want it to assist with my foreign language study. I am studying Japanese and find it hard to find "good" material (where "good" is defined by what I like). There are already tools to do something similar with subtitles from videos (an easier task - using the timing information of the video). But nothing, as far as I know, for texts.
There are tools called "sentence aligners", used in NLP research, that do exactly what you want.
I advise hunalign:
http://mokk.bme.hu/resources/hunalign/
and MS sentence aligner:
http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/
Both are quite OK, but remember that nothing is perfect. Sentences that are too hard to be aligned will be dropped and some sentences may be wrongly aligned.
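If you want a feel for how such a heuristic can work before trying the tools, here is a deliberately simplified, length-only sketch using dynamic programming. It is not the algorithm of either tool above: it only compares character lengths, allows 1-1, 1-2 and 2-1 matches, and never drops a sentence. The `ratio` parameter (expected target/source length ratio) and `merge_penalty` are assumptions you would have to tune.

```python
import math

def align(src, tgt, ratio=1.0, merge_penalty=0.5):
    """Align two lists of sentences by character length only.

    Returns a list of (source_indices, target_indices) pairs, or None if the
    allowed match types (1-1, 1-2, 2-1) cannot cover both texts.
    """
    def cost(s_len, t_len, penalty):
        # Zero when the target length equals the expected ratio times the source length.
        return abs(math.log((t_len + 1) / (s_len * ratio + 1))) + penalty

    moves = [(1, 1, 0.0), (1, 2, merge_penalty), (2, 1, merge_penalty)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj, pen in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                total = best[i][j] + cost(s_len, t_len, pen)
                if total < best[ni][nj]:
                    best[ni][nj] = total
                    back[ni][nj] = (i, j)
    if best[n][m] == inf:
        return None
    # Walk the back-pointers from the end to recover the matches in order.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return pairs[::-1]

source = ["a" * 20, "b" * 5]
target = ["A" * 10, "A" * 10, "B" * 5]
print(align(source, target))
```

Running it per chapter keeps errors from propagating, since chapter boundaries are known to match. The word definitions from your glossing library could be added as an extra term in `cost`, and user-confirmed matches could be used to split the text into smaller independent segments.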
