Using bigrams with Stanford NLP in Java - tokenization

I am using the Stanford NLP API on a document collection, and this is the code I used for tokenization:
PTBTokenizer<CoreLabel> ptbt = new PTBTokenizer<>(reader,
        new CoreLabelTokenFactory(), "");
while (ptbt.hasNext()) {
    CoreLabel token = ptbt.next();
    String word = token.get(TextAnnotation.class);
}
This code, however, splits on whitespace: it turns a phrase like "ALARM Activated" into the two tokens "ALARM" and "Activated". I guess bigrams could solve the problem, but I am not sure how to use them here. Can anybody suggest a way to use bigrams with PTBTokenizer, or how to use bigrams in tokenization with Stanford NLP?

Related

BERT model output interpretation

I searched a lot for this but still haven't got a clear idea, so I hope you can help me out:
I am trying to translate German texts to English! I used this code:
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-de-en")
model = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-de-en")
batch = tokenizer(
    list(data_bert[:100]),
    padding=True,
    truncation=True,
    max_length=250,
    return_tensors="pt")["input_ids"]
results = model(batch)
This returned a size error! I fixed the problem (thanks to the community: https://github.com/huggingface/transformers/issues/5480) by switching the last line of code to:
results = model(input_ids = batch,decoder_input_ids=batch)
Now my output looks like a really long array. What is this output precisely? Is it some sort of word embeddings? And if yes: how should I go about converting these embeddings to texts in the English language? Thanks a lot!
Adding to Timbus's answer,
What is this output precisely? Are these some sort of word embeddings?
results is of type <class 'transformers.modeling_outputs.Seq2SeqLMOutput'> and you can do
results.__dict__.keys()
to check that results contains the following:
dict_keys(['loss', 'logits', 'past_key_values', 'decoder_hidden_states', 'decoder_attentions', 'cross_attentions', 'encoder_last_hidden_state', 'encoder_hidden_states', 'encoder_attentions'])
You can read more about this class in the huggingface documentation.
How shall I go on with converting these embeddings to the texts in the English language?
To get the text in English, you can use model.generate, whose output is easily decoded in the following way:
predictions = model.generate(batch)
english_text = tokenizer.batch_decode(predictions)
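As a small addition (not part of the original answer): if the decoded strings still contain padding or end-of-sequence markers, the standard skip_special_tokens argument of the Hugging Face tokenizers removes them:
english_text = tokenizer.batch_decode(predictions, skip_special_tokens=True)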
I think one possible answer to your dilemma is provided in this question:
https://stackoverflow.com/questions/61523829/how-can-i-use-bert-fo-machine-translation#:~:text=BERT%20is%20not%20a%20machine%20translation%20model%2C%20BERT,there%20are%20doubts%20if%20it%20really%20pays%20off.
Practically, with the output of BERT you get a vector representation for each of your words. In essence, that output is easier to use for other tasks, but trickier to use for machine translation.
A good starting point of using a seq2seq model from the transformers library in the context of machine translation is the following: https://github.com/huggingface/notebooks/blob/master/examples/translation.ipynb.
The example above shows how to translate from English to Romanian.

Find sentences with describing context using Stanford NLP

Is there any way to find sentences that describe objects?
For example sentences like "This is a good product" or "You are very beautiful"
I guess I could create an algorithm using TokenSequencePattern and filter with POS patterns like PRONOUN + VERB + ADJECTIVE, but I don't think that would be very reliable.
I am asking whether there is something out of the box; what I am trying to do is identify review comments on a webpage.
Instead of POS tagging, you would achieve better results with dependency parsing. By using that instead of the POS tags and patterns you mentioned, you will have richer and more accurate information about the sentence structure.
Example:
https://demos.explosion.ai/displacy/?text=The%20product%20was%20really%20very%20good.&model=en_core_web_sm&cpu=0&cph=0
Stanford NLP does support dependency parsing.
Apart from that, you can also use the excellent spaCy.
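For illustration only, here is a minimal sketch of the idea with spaCy (my own example, not code from the original answer, and the "subject plus adjective" heuristic is an assumption): dependency-parse each sentence and keep those that contain a grammatical subject and an adjective, which catches patterns like "This is a good product".
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("This is a good product. You are very beautiful. I bought it yesterday.")
for sent in doc.sents:
    # crude heuristic: a grammatical subject plus an adjective somewhere in the clause
    has_subj = any(tok.dep_ in ("nsubj", "nsubjpass") for tok in sent)
    has_adj = any(tok.pos_ == "ADJ" for tok in sent)
    if has_subj and has_adj:
        print("descriptive:", sent.text)
The first two sentences are kept and the third is filtered out; a real system would refine the rule with the dependency labels of the adjective (e.g. acomp or amod).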

Transform Text into Different Languages

I want to render some words and phrases in different languages, as Google Translator does, without translating their actual meaning. Is it possible to convert the text to other languages rather than translating it?
Example:
I want a plain conversion, like cambridge - كامبردج, कैंब्रिज, cambridge, 剑桥, Кембридж
I do not want a translation, like university - جامعة, विश्वविद्यालय, universitet, 大学, Университет
Yes. This is called "transliteration". There are multiple ways to do it programmatically, depending on which programming language you are using. Here, for demonstration, I'm using the ICU4J library in Groovy:
// https://mvnrepository.com/artifact/com.ibm.icu/icu4j
@Grapes(
    @Grab(group='com.ibm.icu', module='icu4j', version='59.1')
)
import com.ibm.icu.text.Transliterator;

String sourceString = "cambridge";
List<String> transformSchemes = ["Latin-Arabic", "Latin-Cyrillic", "Latin-Devanagari", "Latin-Hiragana"]
for (t in transformSchemes) {
    println "${t}: " + Transliterator.getInstance(t).transform(sourceString);
}
Which returns:
Latin-Arabic: كَمبرِدگِ
Latin-Cyrillic: цамбридге
Latin-Devanagari: चंब्रिद्गॆ
Latin-Hiragana: かんぶりでげ
Obviously, since these are rule-based transformations from one language to another, they tend to be imperfect.
Therefore, if you are looking for names of places (since you mentioned "Cambridge" as an example), you'll have better luck using a database of place names; ICU has some names of cities and many names of countries. You could also use the Wikidata API to retrieve such information; here is a sample call: https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q350
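As an illustration of that API call (my own sketch, not part of the original answer): wbgetentities can return the labels of the Wikidata item for Cambridge (Q350) in several languages, which gives exactly the kind of script-level variants asked about.
import requests

resp = requests.get(
    "https://www.wikidata.org/w/api.php",
    params={
        "action": "wbgetentities",
        "ids": "Q350",                # Q350 = Cambridge
        "props": "labels",
        "languages": "en|ar|hi|ru|zh",
        "format": "json",
    },
)
labels = resp.json()["entities"]["Q350"]["labels"]
for lang, entry in labels.items():
    print(lang, entry["value"])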

Named entity recognition (NER) features

I'm new to Named Entity Recognition and I'm having some trouble understanding what/how features are used for this task.
Some papers I've read so far mention features used, but don't really explain them, for example in
Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition, the following features are mentioned:
Main features used by the sixteen systems that participated in the
CoNLL-2003 shared task sorted by performance on the English test data.
Aff: affix information (n-grams); bag: bag of words; cas: global case
information; chu: chunk tags; doc: global document information; gaz:
gazetteers; lex: lexical features; ort: orthographic information; pat:
orthographic patterns (like Aa0); pos: part-of-speech tags; pre:
previously predicted NE tags; quo: flag signing that the word is
between quotes; tri: trigger words.
I'm a bit confused by some of these, however. For example:
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
how can POS tags exactly be used as features? Don't we have a POS tag for each word? Isn't each object/instance a "text"?
what is global document information?
what is the feature trigger words?
I think all I need here is just to look at an example table with each of these features as columns and see their values to understand how they really work, but so far I've failed to find an easy-to-read dataset.
Could someone please clarify or point me to some explanation or example of these features being used?
Here's a shot at some answers (and by the way the terminology on all this stuff is super overloaded).
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
how can a gazetteer be a feature?
In my experience, BOW feature extraction is used to produce word features out of sentences. So IMO BOW is not one feature; it is a method of generating features out of a sentence (or the block of text you are using). Using n-grams can help with accounting for sequence, but BOW features amount to unordered bags of strings.
how can POS tags exactly be used as features? Don't we have a POS tag for each word?
POS tags are used as features because they can help with "word sense disambiguation" (at least on a theoretical level). For instance, the word "May" can be the name of a person, a month of the year, or a poorly capitalized conjugated verb, and the POS tag can be the feature that differentiates that. And yes, you can get a POS tag for each word, but unless you explicitly use those tags in your "feature space", the words themselves carry no information about their POS.
Isn't each object/instance a "text"?
If you mean what I think you mean, then this is true only if you have extracted object-instance "pairs" and stored them as features (an array of them derived from a string of tokens).
what is global document information?
I perceive this one to mean as such: Most NLP tasks function on a sentence. Global document information is data from all the surrounding text in the entire document. For instance, if you are trying to extract geographic placenames but disambiguate them, and you find the word Paris, which one is it? Well if France is mentioned 5 sentences above, that could increase the likelihood of it being Paris France rather than Paris Texas or worst case, the person Paris Hilton. It's also really important in what is called "coreference resolution", which is when you correlate a name to a pronoun reference (mapping a name mention to "he" or "she" etc).
what is the feature trigger words?
Trigger words are specific tokens or sequences that, on their own, reliably signal a specific meaning. For instance, in sentiment analysis, curse words with exclamation marks often indicate negativity. There can be many permutations of this.
Anyway, my answers here are not perfect, and are prone to all manner of problems in human epistemology and inter-subjectivity, but that is the way I've been thinking about these things over the years I've been trying to solve problems with NLP.
Hopefully someone else will chime in, especially if I'm way off.
You should probably keep in mind that NER classifies each word/token separately, using features that are internal or external clues. Internal clues take into account the word itself (morphology such as uppercase letters, whether the token is present in a dedicated lexicon, POS), and external ones rely on contextual information (previous and next word, document features).
isn't bag of words supposed to be a method to generate features (one for each word)? How can BOW itself be a feature? Or does this simply mean we have a feature for each word as in BOW, besides all the other features mentioned?
Yes, BOW generates one feature per word, sometimes with feature selection methods that reduce the number of features taken into account (e.g. a minimum word frequency).
how can a gazetteer be a feature?
A gazetteer may also generate one feature per word, but in most cases it enriches the data by labelling words or multi-word expressions (such as full proper names). It is an ambiguous step: "Georges Washington" will lead to two features: the entire "Georges Washington" as a celebrity and "Washington" as a city.
how can POS tags exactly be used as features? Don't we have a POS tag for each word? Isn't each object/instance a "text"?
For classifiers, each instance is a word. This is why sequence labelling methods (e.g. CRF) are used: they make it possible to leverage the previous and next words as additional contextual features when classifying the current word. Labelling a text then becomes a process of finding the most likely NE type for each word in the sequence.
what is global document information?
This could be metadata (e.g. date, author), topics (full text categorization), coreference, etc.
what is the feature trigger words?
Triggers are external clues: contextual patterns that help disambiguation. For instance, "Mr" will be used as a feature that strongly suggests that the following tokens are a person.
I recently implemented a NER system in Python and I found the following features helpful (a rough sketch of such features follows this list):
character-level n-grams (using CountVectorizer)
previous-word features and labels (i.e. context)
Viterbi or beam search on label-sequence probability
part of speech (POS), word length, word count, is_capitalized, is_stopword
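Here is a rough sketch of what such per-token features can look like (my own illustration, not the answerer's code; the one-dict-per-token format is the one commonly fed to CRF libraries such as sklearn-crfsuite):
def word2features(sent, i):
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "word.is_capitalized": word[0].isupper(),
        "word.length": len(word),
        "word.prefix3": word[:3],    # crude character-level n-gram
        "word.suffix3": word[-3:],   # crude character-level n-gram
    }
    if i > 0:
        feats["prev_word.lower"] = sent[i - 1].lower()  # left context
    else:
        feats["BOS"] = True          # beginning of sentence
    if i < len(sent) - 1:
        feats["next_word.lower"] = sent[i + 1].lower()  # right context
    else:
        feats["EOS"] = True          # end of sentence
    return feats

sentence = ["Paris", "Hilton", "visited", "Paris", "in", "May", "."]
features = [word2features(sentence, i) for i in range(len(sentence))]
print(features[0])
POS and stopword features would be added the same way, using the output of a tagger and a stopword list.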

How to do part-of-speech tagging of texts containing mathematical expressions?

The goal is syntactic parsing of scientific texts, and first I need to do part-of-speech tagging of the sentences of such texts. The texts are from arxiv.org, so they are originally in LaTeX. When extracting text from LaTeX documents, math expressions can be converted into MathML (or maybe some other format, but I prefer MathML because this work is being done to create a specific web app, and MathML is a convenient tool for that).
The only idea I have is to substitute the mathematical expressions with some natural-language phrases and then use an existing POS-tagging algorithm. So the question is how to implement this substitution or, in general, how to do POS tagging of texts with mathematics in them?
I have implemented a formula-substitution algorithm on top of the Stanford tagger and it works quite nicely. The way to go is, as abecadel has written, to replace every formula with a unique new word; I used a combination of a word and a hash, e.g. 'formula-duwkziah'.
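A minimal sketch of that substitution step (my own illustration, not the answerer's actual code, and it assumes inline math is delimited by $...$): each formula is replaced by a placeholder built from a word plus a hash of the formula, and a map is kept so the formulas can be restored after tagging.
import hashlib
import re

def mask_formulas(text):
    mapping = {}
    def replace(match):
        formula = match.group(0)
        token = "formula" + hashlib.md5(formula.encode("utf-8")).hexdigest()[:8]
        mapping[token] = formula
        return token
    masked = re.sub(r"\$[^$]+\$", replace, text)   # naive: inline $...$ math only
    return masked, mapping

masked, mapping = mask_formulas("We show that $E = mc^2$ holds for every $m > 0$.")
print(masked)        # the formulas are now single opaque tokens for the tagger
# after tagging, each placeholder can be mapped back to its formula via `mapping`
The hash is joined to the word without punctuation here so that the placeholder stays a single token for most tokenizers; the hyphenated form from the answer works as well if the tokenizer keeps hyphenated words together.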
Replacing all of the mathematical formulae with a single, unique word seems to be the way to go.
