What does ph stand for in reinforcement learning - machine-learning

I'm reading the OpenAI baselines code for DQN.
There's a function named make_obs_ph that takes a name and creates an input placeholder with that name.
What does ph stand for?

The meaning there is placeholder.
place, holder. ph.

Related

Fasttext aligned word vectors for translating homographs

A homograph is a word that shares the same written form as another word but has a different meaning, like "right" in the sentences below:
success is about making the right decisions.
Turn right after the traffic light
The English word "right" is translated into Swedish as "rätt" in the first case and as "höger" in the second. The correct translation can be found by looking at the context (surrounding words).
Question 1. Can fastText aligned word embeddings help with translating these homographs, or other words with several possible translations, into another language?
[EDIT] The goal is not to query the model for the right translation. The goal is to pick the right translation when the following information is given:
the two (or several) possible translations options in the target language like "rätt" and "höger"
the surrounding words in the source language
Question 2. I loaded the English pre-trained vector model and the English aligned vector model. While both were trained on Wikipedia articles, I noticed that the distances between two words were roughly preserved, but the sizes of the dataset files (wiki.en.vec vs wiki.en.align.vec) are noticeably different (1GB). Wouldn't it make sense to use only the aligned version? What information is not captured by the aligned dataset?
For question 1, I suppose it's possible that these 'aligned' vectors could help translate homographs, but still face the problem that any token only has a single vector – even if that one token has multiple meanings.
Are you assuming that you already know that right[en] could be translated into either rätt[se] or höger[se], from some external table? (That is, you're not using the aligned word-vectors as the primary means of translation, just an adjunct to other methods?)
If so, one technique that might help would be to see which of rätt[se] or höger[se] is closer to other words that surround your particular instance of right[en]. (You might tally each's rank-closeness to every word within n spots of right[en], or calculate their cosine-similarity to the average of the n words around right[en], for example.)
(You could potentially even do this with non-aligned word vectors, if your more-precise words have multiple, alternate, non-homograph/non-polysemous translations in English. For example, to determine which sense of right[en] is more likely, you could use the non-aligned English word vectors for correct[en] and rightward[en] – less polysemous correlates of rätt[se] & höger[se] – to check for similarity-to-surrounding words.)
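The context-similarity idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the 3-dimensional vectors are made-up toy values standing in for real aligned embeddings, and `pick_translation` is a hypothetical helper name, not part of fastText.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_vector(vectors):
    """Component-wise average of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def pick_translation(context_words, candidates, src_vectors, tgt_vectors):
    """Return the candidate closest to the average of the context vectors."""
    known = [src_vectors[w] for w in context_words if w in src_vectors]
    if not known:
        raise ValueError("no context word has a vector")
    context = mean_vector(known)
    scores = {c: cosine(context, tgt_vectors[c]) for c in candidates}
    return max(scores, key=scores.get), scores

# Toy vectors (assumed values, NOT real embeddings) in a shared aligned space.
en = {
    "making": [0.8, 0.2, 0.1],
    "decisions": [0.9, 0.1, 0.0],
    "turn": [0.1, 0.9, 0.0],
    "traffic": [0.0, 0.8, 0.2],
}
sv = {"rätt": [1.0, 0.0, 0.0], "höger": [0.0, 1.0, 0.0]}

first, _ = pick_translation(["making", "decisions"], ["rätt", "höger"], en, sv)
second, _ = pick_translation(["turn", "traffic"], ["rätt", "höger"], en, sv)
print(first, second)
```

With real aligned vectors the same function would be fed the loaded English and Swedish tables; whether the averaged context is discriminative enough is something to check empirically.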
A write-up that might create other ideas is "Linear algebraic structure of word meanings" which, quite surprisingly, is able to tease-out alternate meanings of homograph tokens even when the original word-vectors training was not word-sense-aware. (Might the 'atoms of discourse' in their model be equally findable across merged/aligned multi-language vector spaces, and then the closeness-of-context-words to different atoms a good guide to word-sense-disambiguation?)
For question 2, you imply the aligned word set is smaller in size. Have you checked if that's just because it includes fewer words? That seems the simplest explanation, and just checking which words are left out would let you know what you're losing.
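One way to run that check is to compare the vocabularies of the two files directly. The sketch below assumes both files use the plain-text .vec layout (an optional header line with word count and dimension, then one word followed by its values per line); the helper names and the tiny demo files are hypothetical.

```python
import os
import tempfile

def read_vec_vocab(path):
    """Collect the words (first field of each line) from a .vec text file."""
    vocab = set()
    with open(path, encoding="utf-8") as handle:
        first = handle.readline().split()
        # A header holds exactly two integers: word count and dimension.
        is_header = len(first) == 2 and all(tok.isdigit() for tok in first)
        if first and not is_header:
            vocab.add(first[0])
        for line in handle:
            word = line.rstrip("\n").split(" ", 1)[0]
            if word:
                vocab.add(word)
    return vocab

def missing_words(full_path, aligned_path):
    """Words present in the full file but absent from the aligned one."""
    return read_vec_vocab(full_path) - read_vec_vocab(aligned_path)

# Tiny stand-in files; in practice pass wiki.en.vec and wiki.en.align.vec.
with tempfile.TemporaryDirectory() as tmp:
    full = os.path.join(tmp, "full.vec")
    aligned = os.path.join(tmp, "aligned.vec")
    with open(full, "w", encoding="utf-8") as f:
        f.write("3 2\nright 0.1 0.2\nleft 0.3 0.4\nrare 0.5 0.6\n")
    with open(aligned, "w", encoding="utf-8") as f:
        f.write("2 2\nright 0.1 0.2\nleft 0.3 0.4\n")
    dropped = missing_words(full, aligned)

print(sorted(dropped))
```

For the real files the vocabularies are large, so it may be more practical to look at a sample of the dropped words, or at their frequency ranks, than to print them all.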

Named entity recognition (NER) features

I'm new to Named Entity Recognition and I'm having some trouble understanding what/how features are used for this task.
Some papers I've read so far mention features used, but don't really explain them, for example in
Introduction to the CoNLL-2003 Shared Task:Language-Independent Named Entity Recognition, the following features are mentioned:
Main features used by the sixteen systems that participated in the
CoNLL-2003 shared task sorted by performance on the English test data.
Aff: affix information (n-grams); bag: bag of words; cas: global case
information; chu: chunk tags; doc: global document information; gaz:
gazetteers; lex: lexical features; ort: orthographic information; pat:
orthographic patterns (like Aa0); pos: part-of-speech tags; pre:
previously predicted NE tags; quo: flag signing that the word is
between quotes; tri: trigger words.
I'm a bit confused by some of these, however. For example:
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
how can POS tags exactly be used as features ? Don't we have a POS tag for each word? Isn't each object/instance a "text"?
what is global document information?
what is the feature trigger words?
I think all I need here is to just to look at an example table with each of these features as columns and see their values to understand how they really work, but so far I've failed to find an easy to read dataset.
Could someone please clarify or point me to some explanation or example of these features being used?
Here's a shot at some answers (and by the way the terminology on all this stuff is super overloaded).
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
In my experience BOW feature extraction is used to produce word features out of sentences. So IMO BOW is not one feature; it is a method of generating features out of a sentence (or whatever block of text you are using). Using n-grams can help account for sequence, but BOW features amount to unordered bags of strings.
how can POS tags exactly be used as features? Don't we have a POS tag for each word?
POS Tags are used as features because they can help with "word sense disambiguation" (at least on a theoretical level). For instance, the word "May" can be a name of a person or a month of a year or a poorly capitalized conjugated verb, but the POS tag can be the feature that differentiates that fact. And yes, you can get a POS tag for each word, but unless you explicitly use those tags in your "feature space" then the words themselves have no idea what they are in terms of their POS.
Isn't each object/instance a "text"?
If you mean what I think you mean, then this is true only if you have extracted object-instance "pairs" and stored them as features (an array of them derived from a string of tokens).
what is global document information?
I take this to mean the following: most NLP tasks operate on a single sentence, and global document information is data from the surrounding text in the entire document. For instance, if you are trying to extract geographic place names and disambiguate them, and you find the word Paris, which one is it? If France is mentioned 5 sentences above, that could increase the likelihood of it being Paris, France rather than Paris, Texas or, worst case, the person Paris Hilton. It's also really important in what is called "coreference resolution", which is when you link a name to a pronoun reference (mapping a name mention to "he" or "she" etc).
what is the feature trigger words?
Trigger words are specific tokens or sequences that have high reliability as a stand alone thing to have a specific meaning. For instance, in sentiment analysis, curse words with exclamation marks often indicate negativity. There can be many permutations of this.
Anyway, my answers here are not perfect, and are prone to all manner of problems in human epistemology and inter-subjectivity, but this is the way I've been thinking about these things over the years I've been trying to solve problems with NLP.
Hopefully someone else will chime in, especially if I'm way off.
You should probably keep in mind that NER classifies each word/token separately, using features that are internal or external clues. Internal clues take into account the word itself (morphology such as uppercase letters, whether the token is present in a dedicated lexicon, POS), and external ones rely on contextual information (previous and next word, document features).
isn't bag of words supposed to be a method to generate features (one
for each word)? How can BOW itself be a feature? Or does this simply
mean we have a feature for each word as in BOW, besides all the other
features mentioned?
Yes, BOW generates one feature per word, sometimes with feature selection methods that reduce the number of features taken into account (e.g. a minimum word frequency).
how can a gazetteer be a feature?
A gazetteer may also generate one feature per word, but in most cases it enriches the data by labelling words or multi-word expressions (such as full proper names). It is an ambiguous step: "George Washington" will lead to two features: the entire "George Washington" as a celebrity and "Washington" as a city.
how can POS tags exactly be used as features ? Don't we have a POS tag
for each word? Isn't each object/instance a "text"?
For classifiers, each instance is a word. This is why sequence labelling methods (e.g. CRF) are used: they make it possible to use the previous and next words as additional contextual features when classifying the current word. Labelling a text then amounts to finding the most likely NE type for each word in the sequence.
what is global document information?
This could be metadata (e.g. date, author), topics (full text categorization), coreference, etc.
what is the feature trigger words?
Triggers are external clues, contextual patterns that help disambiguation. For instance "Mr" will be used as a feature that strongly suggests that the following tokens are a person.
I recently implemented a NER system in python and I found the following features helpful:
character-level ngrams (using CountVectorizer)
previous word features and labels (i.e. context)
viterbi or beam-search on label sequence probability
part of speech (pos), word-length, word-count, is_capitalized, is_stopword
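To make the "table of features" idea concrete, here is a rough sketch of how one token could be turned into a feature dictionary combining internal clues (shape, affixes, gazetteer membership) and external clues (neighbouring words, trigger words). The function names, the sentinel strings, and the tiny gazetteer and trigger sets are all made up for illustration; real systems differ in which features they use.

```python
import re

def word_shape(token):
    """Collapse a token into an orthographic pattern such as 'Aa0'.

    Only ASCII letters and digits are mapped; other characters are kept.
    """
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    # Squeeze runs of the same symbol: 'Aaaaaa' -> 'Aa'.
    return re.sub(r"(.)\1+", r"\1", shape)

def token_features(tokens, i, gazetteer=frozenset(), triggers=frozenset()):
    """Build one feature row for the token at position i."""
    tok = tokens[i]
    prev_tok = tokens[i - 1] if i > 0 else "<bos>"
    next_tok = tokens[i + 1] if i < len(tokens) - 1 else "<eos>"
    return {
        # Internal clues: the word itself.
        "word": tok.lower(),                  # lex / bag-of-words style feature
        "shape": word_shape(tok),             # pat: orthographic pattern
        "is_capitalized": tok[:1].isupper(),  # ort: orthographic information
        "prefix3": tok[:3].lower(),           # aff: affix n-grams
        "suffix3": tok[-3:].lower(),
        "in_gazetteer": tok in gazetteer,     # gaz: gazetteer lookup
        # External clues: the context.
        "prev_word": prev_tok.lower(),
        "next_word": next_tok.lower(),
        "prev_is_trigger": prev_tok in triggers,  # tri: trigger words
    }

tokens = ["Mr", "Washington", "visited", "Paris", "in", "May"]
feats = token_features(
    tokens, 1,
    gazetteer={"Washington", "Paris"},
    triggers={"Mr", "Mrs", "Dr"},
)
for name, value in feats.items():
    print(f"{name:16} {value}")
```

Each token gets one such row, and a sequence model such as a CRF is then trained on the rows together with the gold labels.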

Entities on my gazette are not recognized

I would like to create a custom NER model. This is what I did:
TRAINING DATA (stanford-ner.tsv):
Hello O
! O
My O
name O
is O
Damiano PERSON
. O
PROPERTIES (stanford-ner.prop):
trainFile = stanford-ner.tsv
serializeTo = ner-model.ser.gz
map = word=0,answer=1
maxLeft=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useDisjunctive=true
useSequences=true
usePrevSequences=true
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useGazettes=true
gazette=gazzetta.txt
cleanGazette=true
GAZETTE (gazzetta.txt):
PERSON John
PERSON Andrea
I build the model via command line with:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -prop stanford-ner.prop
And test with:
java -classpath "stanford-ner.jar:lib/*" edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier ner-model.ser.gz -textFile test.txt
I did two tests with the following texts:
>>> TEST 1 <<<
TEXT:
Hello! My name is Damiano and this is a fake text to test.
OUTPUT
Hello/O !/O
My/O name/O is/O Damiano/PERSON and/O this/O is/O a/O fake/O text/O to/O test/O ./O
>>> TEST 2 <<<
TEXT:
Hello! My name is John and this is a fake text to test.
OUTPUT
Hello/O !/O
My/O name/O is/O John/O and/O this/O is/O a/O fake/O text/O to/O test/O ./O
As you can see, only the "Damiano" entity is found. This entity is in my training data, but "John" (second test) is in the gazette. So the question is:
Why is the John entity not recognized?
Thank you so much in advance.
As Stanford FAQ says,
If a gazette is used, this does not guarantee that words in the
gazette are always used as a member of the intended class, and it does
not guarantee that words outside the gazette will not be chosen. It
simply provides another feature for the CRF to train against. If the
CRF has higher weights for other features, the gazette features may be
overwhelmed.
If you want something that will recognize text as a member of a class
if and only if it is in a list of words, you might prefer either the
regexner or the tokensregex tools included in Stanford CoreNLP. The
CRF NER is not guaranteed to accept all words in the gazette as part
of the expected class, and it may also accept words outside the
gazette as part of the class.
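To illustrate the difference between a gazette feature and a strict list lookup, here is a minimal sketch of the "if and only if it is in a list" behaviour. It is plain Python dictionary matching on single tokens, written only for illustration; it is not regexner or tokensregex, and the helper names are hypothetical. The input lines follow the "CLASS entry" layout of the gazzetta.txt file shown in the question.

```python
def load_gazette(lines):
    """Parse 'CLASS entry' lines into an {entry: CLASS} table."""
    table = {}
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) == 2:
            label, entry = parts
            table[entry.strip()] = label
    return table

def tag_by_lookup(tokens, table, default="O"):
    """Label a token with its gazette class if listed, otherwise `default`."""
    return [(tok, table.get(tok, default)) for tok in tokens]

gazette = load_gazette(["PERSON John", "PERSON Andrea"])
tagged = tag_by_lookup("Hello ! My name is John .".split(), gazette)
print(" ".join(f"{tok}/{label}" for tok, label in tagged))
```

Unlike the CRF, this never generalises: a name that is not in the list is never tagged, and a listed word is always tagged regardless of context.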
Btw, it is not a good practice to test machine learning pipelines in a 'unit-test'-way, i.e. with only one or two examples, because it is supposed to work on much greater volume of data and, more importantly, it is probabilistic by nature.
If you want to check whether your gazette file is actually used, it may be better to take existing examples (see the bottom of the page linked above for the austen.gaz.prop and austen.gaz.txt examples), replace several names with your own, and then check. If it fails, first try changing your test, e.g. add more names, reformulate the text and so on.
A gazette only helps extract extra features from the training data. If none of its words occur in your training data, or have any connection to labeled tokens, your model will not benefit from it. One experiment I would suggest is adding Damiano to your gazette.
Why is the John entity not recognized?
It looks to me that your minimal example should most probably add "Damiano" to the gazetteer as a PERSON category. Currently, the training data allows the model to learn that "Damiano" is a PERSON label, but I think this is not related to the gazetteer categories (i.e. having PERSON on both sides is not sufficient).

Selecting suitable model for creating Language Identification tool

I am working on developing a tool for language identification of a given text i.e. given a sample text, identify the language (for e.g. English, Swedish, German, etc.) it is written in.
Now the strategy I have decided to follow (based on a few references I have gathered) are as follows -
a) Create a character n-gram model (The value of n is decided based on certain heuristics and computations)
b) Use a machine learning classifier(such as naive bayes) to predict the language of the given text.
Now, my question is: is creating a character n-gram model necessary? That is, what disadvantage does a simple bag-of-words strategy have? If I use all the possible words in each language to create a prediction model, in which cases would it fail?
The reason why this doubt arose was the fact that any reference document/research paper I've come across states that language identification is a very difficult task. However, just using this strategy of using the words in the language seems to be a simple task.
EDIT: One reason why N-gram should be preferred is to make the model robust even if there are typos as stated here. Can anyone point out more?
if I use all the words possible in the respective language to create a prediction model, what could be the possible cases where it would fail
Pretty much the same cases where a character n-gram model would fail. The problem is that you're not going to find appropriate statistics for all possible words.(*) Character n-gram statistics are easier to accumulate and more robust, even for text without typos: words in a language tend to follow the same spelling patterns. E.g. had you not found statistics for the Dutch word "uitbuiken" (a pretty rare word), then the occurrence of the n-grams "uit", "bui" and "uik" would still be strong indicators of this being Dutch.
(*) In agglutinative languages such as Turkish, new words can be formed by stringing morphemes together and the number of possible words is immense. Check the first few chapters of Jurafsky and Martin, or any undergraduate linguistics text, for interesting discussions on the possible number of words per language.
Cavnar and Trenkle proposed a very simple yet efficient approach using character n-grams of variable length. Maybe you should try to implement it first and move to a more complex ML approach if the C&T approach doesn't meet your requirements.
Basically, the idea is to build a language model using only the X (e.g. X = 300) most frequent n-grams of variable length (e.g. 1 <= N <= 5). Doing so, you are very likely to capture most functional words/morphemes of the considered language... without any prior linguistic knowledge on that language!
Why would you choose character n-grams over a BoW approach? I think the notion of a character n-gram is pretty straightforward and applies to every written language. A word is a much more complex notion, which differs greatly from one language to another (consider languages with almost no spacing marks).
Reference: http://odur.let.rug.nl/~vannoord/TextCat/textcat.pdf
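The rank-profile idea described above can be sketched roughly as follows. This is a simplified reading of the approach, not the TextCat implementation: the underscore padding, the penalty for unseen n-grams, and the two one-sentence training samples are assumptions chosen to keep the example short. Real profiles need far more text per language.

```python
from collections import Counter

def char_ngrams(text, n_min=1, n_max=5):
    """Yield character n-grams of each word, padded to mark word boundaries."""
    for word in text.lower().split():
        padded = "_" + word + "_"
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                yield padded[i:i + n]

def build_profile(text, top_k=300):
    """Map the top_k most frequent n-grams to their rank (0 = most frequent)."""
    counts = Counter(char_ngrams(text))
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; unseen n-grams get a fixed maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )

def identify(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = build_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Toy training samples (assumed); real profiles are built from large corpora.
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "nl": "de snelle bruine vos springt over de luie hond en daarna slaapt de hond in de zon",
}
profiles = {lang: build_profile(text) for lang, text in samples.items()}

print(identify("the dog and the fox", profiles))
print(identify("de hond en de vos", profiles))
```

Because the profiles keep only ranks, no probabilities or smoothing are needed, which is much of the method's appeal as a first baseline.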
The performance really depends on your expected input. If you will be classifying multi-paragraph text all in one language, a functional words list (which your "bag of words" with pruning of hapaxes will quickly approximate) might well serve you perfectly, and could work better than n-grams.
There is significant overlap between individual words -- "of" could be Dutch or English; "and" is very common in English but also means "duck" in the Scandinavian languages, etc. But given enough input data, overlaps for individual stop words will not confuse your algorithm very often.
My anecdotal evidence is from using libtextcat on the Reuters multilingual newswire corpus. Many of the telegrams contain a lot of proper names, loan words etc. which throw off the n-gram classifier a lot of the time; whereas just examining the stop words would (in my humble estimation) produce much more stable results.
On the other hand, if you need to identify short, telegraphic utterances which might not be in your dictionary, a dictionary-based approach is obviously flawed. Note that many North European languages have very productive word formation by free compounding -- you see words like "tandborstställbrist" and "yhdyssanatauti" being coined left and right (and Finnish has agglutination on top -- "yhdyssanataudittomienkinkohan") which simply cannot be expected to be in a dictionary until somebody decides to use them.
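For comparison, the stop-word approach from this answer might look like the sketch below. The word lists are tiny hand-picked samples (an assumption for illustration, nowhere near complete), and the overlap between them ("of", "en", "is") is deliberate, to show that shared words only matter when the input is short.

```python
# Small, hand-picked stop-word samples; real lists would be much longer.
STOP_WORDS = {
    "en": {"the", "and", "of", "is", "to", "in"},
    "nl": {"de", "het", "en", "of", "is", "een", "van"},
    "sv": {"och", "det", "att", "en", "är", "som"},
}

def guess_by_stop_words(text, stop_words=STOP_WORDS):
    """Count stop-word hits per language; return (best language or None, scores)."""
    tokens = text.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in stop_words.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores

print(guess_by_stop_words("the house of the king is big"))
print(guess_by_stop_words("de hond en de kat"))
print(guess_by_stop_words("tandborstställbrist"))
```

The last call shows the weakness discussed above: a single freshly coined compound contains no stop words at all, so the lookup has nothing to go on.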

How do I design a heuristic for matching translated sentences?

Summary
I am trying to design a heuristic for matching up sentences in a translation (from the original language to the translated language) and would like guidance and tips. Perhaps there is a heuristic that already does something similar? So given two text files, I would like to be able to match up the sentences (so I can pick out a sentence and say this is the translation of that sentence).
Details
The input text would be translated novels. So I do not expect the translations to be literal, although, using something like google translate might be a good way to test the accuracy of the heuristic.
To help me, I have a library that will gloss the contents of the translated text and give me the definitions of the words in the sentence. Other things I know:
Chapters and order are preserved; I know that the first sentence in chapter three will match with the first sentence in chapter three of the translation (Note, this is not strictly true; the first sentence might match up with the first two sentences, or even the second sentence)
I can calculate the overall size (characters, sentences, paragraphs); which could give me an idea of the average difference in sentence size (for example, the translation might be 30% longer).
Looking at some books I have, the translated version has about 30% more sentences than the original text.
Implementation
(if it matters)
I am planning to do this in Java - but I am not that fussed - any language will do.
I am not greatly concerned about speed.
I guess that to be sure of the matches, some user feedback might be required, like saying "Yes, this sentence definitely matches that sentence." This would give the heuristic some more ground to stand on. It would also mean that the user needs a little proficiency in both languages.
Background
(for those interested)
The reason I want to make this is that I want it to assist with my foreign language study. I am studying Japanese and find it hard to find "good" material (where "good" is defined by what I like). There are already tools to do something similar with subtitles from videos (an easier task - using the timing information of the video). But nothing, as far as I know, for texts.
There are tools called "sentence aligners", used in NLP research, that do exactly what you want.
I advise hunalign:
http://mokk.bme.hu/resources/hunalign/
and MS sentence aligner:
http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/
Both are quite OK, but remember that nothing is perfect. Sentences that are too hard to be aligned will be dropped and some sentences may be wrongly aligned.
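To give a feel for what such an aligner does, here is a much simplified length-based sketch. It is not how hunalign or the Microsoft aligner are implemented; it only uses the observations from the question (order is preserved, a sentence may map to one or two sentences, and overall length scales by a roughly constant ratio). The function name, the penalty values and the demo sentences are assumptions.

```python
def align_by_length(src, tgt, merge_penalty=2.0, skip_penalty=10.0):
    """Align two sentence lists with a dynamic program over character lengths.

    Allowed groupings: 1-1, 1-2, 2-1, and skipping a sentence on either side.
    Returns a list of (source_sentences, target_sentences) pairs.
    """
    n, m = len(src), len(tgt)
    src_total = sum(len(s) for s in src)
    tgt_total = sum(len(t) for t in tgt)
    # Expected target length per source character, estimated from the totals.
    ratio = tgt_total / src_total if src_total else 1.0

    moves = [(1, 1, 0.0), (1, 2, merge_penalty), (2, 1, merge_penalty),
             (1, 0, skip_penalty), (0, 1, skip_penalty)]
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(src_len * ratio - tgt_len) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the chosen groupings.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs

original = ["It was a long and stormy night.", "He left.", "Nobody saw him again."]
translated = ["The night was long and stormy.", "He went", "away.", "No one saw him again."]
for left, right in align_by_length(original, translated):
    print(left, "<->", right)
```

A practical next step would be to add a lexical score (for example from the glossing library mentioned in the question) to the length cost, since length alone cannot separate neighbouring sentences of similar size.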

Resources