I need a model for the following task:
Given a sequence of words with its POS tags, I want to judge whether this sequence of words is a noun phrase or not.
One model I can think of is an HMM.
For the sequences that are noun phrases, we train one HMM (HMM+); for those that are not, we train another HMM (HMM-). To make a prediction for a sequence, we calculate P(sequence | HMM+) and P(sequence | HMM-). If the former is larger, we consider the sequence a noun phrase; otherwise it's not.
What do you think of this approach? And do you have any other models suited to this problem?
From what I understand, you already have POS tags for the sequence of words. Once you have tags for the sequence, you don't need an HMM to classify whether the sequence is an NP. All you need to do is look for patterns of the following forms:
determiner followed by noun
adjective followed by noun
determiner followed by adjective followed by noun
etc.
As somebody just mentioned, HMMs are used to obtain POS tags for a new sequence of words, but for that you need a tagged corpus to train the HMM. Some tagged corpora are available in NLTK.
If your sequences are already tagged, then just use grammar rules as mentioned in the previous answer.
People do use HMMs to label noun phrases in POS-labeled sentences, but the typical model setup does not work in quite the way you're describing.
Instead, the setup (see Chunk tagger-statistical recognition of noun phrases (PDF) and Named entity recognition using an HMM-based chunk tagger (PDF) for examples) is to use an HMM with three states:
O (not in an NP),
B (beginning of an NP),
I (in an NP, but not the beginning).
Each word in a sentence will be assigned one of the states by the HMM. As an example, the sentence:
The/DT boy/NN hit/VT the/DT ball/NN with/PP the/DT red/ADJ bat/NN ./.
might be ideally labeled as follows:
The/DT B boy/NN I hit/VT O the/DT B ball/NN I with/PP O the/DT B red/ADJ I bat/NN I ./. O
The transitions among these three HMM states can be limited based on prior knowledge of how the sequences will behave; in particular, you cannot transition into I directly from O (an NP has to start with B), but all the other transitions are possible with nonzero probability. You can then use Baum-Welch on a corpus of unlabeled text to train up your HMM (to identify any type of chunk at all -- see Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models (PDF) for an example), or some sort of maximum-likelihood method with a corpus of labeled text (in case you're looking specifically for noun phrases).
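As a concrete illustration of the supervised (maximum-likelihood) variant, here is a minimal sketch using NLTK's HMM trainer. The toy training data is made up purely for illustration; in practice you would extract (POS tag, B/I/O label) pairs from a chunk-annotated corpus such as CoNLL-2000.

```python
# Minimal sketch: a B/I/O chunking HMM trained by maximum likelihood with
# NLTK. Observations are POS tags, hidden states are the chunk labels.
# The tiny training corpus below is a made-up assumption.
from nltk.tag import hmm
from nltk.probability import LidstoneProbDist

train = [
    [("DT", "B"), ("NN", "I"), ("VT", "O"), ("DT", "B"), ("NN", "I"),
     ("PP", "O"), ("DT", "B"), ("ADJ", "I"), ("NN", "I"), (".", "O")],
    [("DT", "B"), ("ADJ", "I"), ("NN", "I"), ("VT", "O"), ("DT", "B"),
     ("NN", "I"), (".", "O")],
]

trainer = hmm.HiddenMarkovModelTrainer()
# Lidstone smoothing keeps unseen (state, observation) pairs at nonzero probability.
tagger = trainer.train_supervised(
    train, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins)
)

# Assign a B/I/O state to each position of an unseen POS-tag sequence.
print(tagger.tag(["DT", "NN", "VT", "DT", "ADJ", "NN", "."]))
```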
My hunch is that an HMM is not the right model. It can be used to guess POS tags, by deriving the sequence of tags with the highest probabilities based on prior probabilities and conditional probabilities from one token to the next.
For a complete noun phrase, I don't see how this model fits.
Any probability-based approach will be very difficult to train, because noun phrases can contain many tokens, which makes for an enormous number of combinations. To get useful training probabilities, you would need a really huge training set.
You might quickly and easily get a sufficiently good start by crafting a set of grammar rules, for example regular expressions, over POS tags by following the description in
http://en.wikipedia.org/wiki/Noun_phrase#Components_of_noun_phrases
or any other linguistic description of noun phrases.
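If you go the rule-based route, a minimal sketch with NLTK's RegexpParser might look like the following. The single NP rule (optional determiner, any number of adjectives, one or more nouns) is only an illustrative assumption; a serious rule set would follow a linguistic description like the one linked above.

```python
# Minimal sketch: a regular-expression chunk grammar over POS tags.
# The grammar below is deliberately simplistic and only illustrative.
import nltk

grammar = "NP: {<DT>?<JJ>*<NN.*>+}"   # det? adj* noun+
chunker = nltk.RegexpParser(grammar)

tagged = [("The", "DT"), ("boy", "NN"), ("hit", "VBD"), ("the", "DT"),
          ("ball", "NN"), ("with", "IN"), ("the", "DT"), ("red", "JJ"),
          ("bat", "NN"), (".", ".")]

tree = chunker.parse(tagged)
print(tree)  # subtrees labeled NP are the chunks matched by the rule
```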
Related
For example, I have an original sentence and an incomplete version of it; the word barking is the word that is missing.
Original Sentence : The dog is barking.
Incomplete Sentence : The dog is ___________.
For example, the BERT model predicts the word crying instead of the word barking.
How will I measure the accuracy of the BERT Model in terms of how syntactically correct and semantically coherent the predicted word is?
(For instance, there are a lot of incomplete sentences, and the task is to evaluate BERT's accuracy across all of them.) Please help.
For syntax, you can use, for instance, the English Resource Grammar to decide whether a sentence is grammatical. It is the largest manually curated description of English grammar, and you can try an online demo. A grammar (provided it has sufficiently large coverage, which they usually don't) refuses to parse ungrammatical sentences, unlike a statistical/neural parser, which happily parses everything (and usually better than grammars do).
Estimating semantic plausibility is a very difficult task, and given that BERT is probably one of the best current language models, you cannot use another language model as a reference. There are some academic papers that deal with modeling semantic plausibility; you can start, e.g., with this one from NAACL 2018.
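If all you need is a rough automatic number, one simple (and admittedly crude) measure is top-k exact-match accuracy: how often the original word appears among BERT's top k predictions for the blank. Here is a minimal sketch with the HuggingFace fill-mask pipeline; the example sentences are made up, and note that this does not capture whether an alternative prediction such as crying is grammatical or semantically coherent, which is the harder problem discussed above.

```python
# Minimal sketch: top-k exact-match accuracy of BERT's fill-in-the-blank
# predictions against the original (gold) word. Example data is made up.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

examples = [
    ("The dog is [MASK].", "barking"),
    ("The cat is [MASK] on the sofa.", "sleeping"),
]

k = 5
hits = 0
for masked_sentence, gold_word in examples:
    predictions = unmasker(masked_sentence, top_k=k)
    if gold_word in [p["token_str"].strip() for p in predictions]:
        hits += 1

print(f"top-{k} accuracy: {hits / len(examples):.2f}")
```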
In a bag-of-words model, I know we should remove stopwords and punctuation before training. But in an RNN model, if I want to do text classification, should I remove stopwords too?
This depends on what your model classifies. If you're doing something in which the classification is aided by stop words -- some level of syntax understanding, for instance -- then you need to either leave in the stop words or alter your stop list, such that you don't lose that information. For instance, cutting out all verbs of being (is, are, should be, ...) can mess up an NN that depends somewhat on sentence structure.
However, if your classification is topic-based (as suggested by your bag-of-words reference), then treat the input the same way: remove those pesky stop words before they burn valuable training time.
Do not remove stop words when they add information (context awareness) to the sentence (viz., text summarization, machine/language translation, language modeling, question answering).
Remove stop words if we want only the general idea of the sentence (viz., sentiment analysis, language/text classification, spam filtering, caption generation, auto-tag generation, topic/document classification).
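For the cases where removal makes sense, a minimal sketch with NLTK's English stop-word list (the sentence is just an example) could look like this:

```python
# Minimal sketch: removing stop words with NLTK's English stop list.
# Assumes the NLTK "stopwords" corpus has been downloaded.
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))

sentence = "The dog is barking at the mailman"
content_only = [t for t in sentence.split() if t.lower() not in stop_words]
print(content_only)  # ['dog', 'barking', 'mailman']
```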
Given a sentence like:
Complimentary gym access for two for the length of stay ($12 value per person per day)
What general approach can I take to identify the word gym or gym access?
Is this a POS tagger for nouns?
One of the most widely used techniques for extracting keywords from text is TF-IDF of the terms. A higher TF-IDF score indicates that a word is both important to the document and relatively uncommon across the document corpus. This is often interpreted to mean that the word is significant to the document.
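As a minimal sketch of the TF-IDF idea with scikit-learn (the tiny corpus below is invented; you need a real document collection for the IDF part to be meaningful):

```python
# Minimal sketch: rank the terms of one document by TF-IDF score.
# The three "documents" are made-up examples.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Complimentary gym access for two for the length of stay",
    "Free breakfast for two for the length of stay",
    "Complimentary airport shuttle and late checkout",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(docs)

terms = vectorizer.get_feature_names_out()
scores = tfidf[0].toarray().ravel()          # scores for the first document
top = sorted(zip(terms, scores), key=lambda x: -x[1])[:5]
print(top)  # terms like "gym" and "gym access" should rank highly
```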
One other method is using lexical chains. I refer you to this paper for full description.
There are many other approaches out there that you can explore, depending on your domain. A short survey can be found here.
Noun POS tags are not sufficient. For your example, "length of stay" is also a noun phrase, but might not be a key phrase.
There are many tools and papers available which perform this task using basic sentence separators.
Such tools include:
http://nlp.stanford.edu/software/tokenizer.shtml
OpenNLP
NLTK
and there might be others. They mainly focus on:
(a) If it's a period, it ends a sentence.
(b) If the preceding token is on my hand-compiled list of abbreviations, then it doesn't end a sentence.
(c) If the next token is capitalized, then it ends a sentence.
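A minimal sketch of those three heuristics in code (the abbreviation list is a made-up assumption; real tools use much larger lists and trained models):

```python
# Minimal sketch: rule-based sentence boundary detection using the three
# heuristics above. The abbreviation list is only illustrative.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "etc.", "e.g.", "i.e."}

def split_sentences(text):
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if token.endswith("."):                      # (a) a period may end a sentence
            if token.lower() in ABBREVIATIONS:       # (b) known abbreviation: no boundary
                continue
            nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
            if nxt == "" or nxt[0].isupper():        # (c) next token capitalized: boundary
                sentences.append(" ".join(current))
                current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith went to Washington. He arrived at noon."))
# ['Dr. Smith went to Washington.', 'He arrived at noon.']
```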
There are a few papers which suggest techniques for SBD in ASR text:
http://pdf.aminer.org/000/041/703/experiments_on_sentence_boundary_detection.pdf
http://www.icsd.aegean.gr/lecturers/kavallieratou/publications_files/icpr_2000.pdf
Are there any tools which can perform sentence detection on ambiguous sentences like:
John is actor and his father Mr Smith was top city doctor in NW (2 sentences)
Where is statue of liberty, what is it's height and what is the history behind? (3 sentences)
What you are seeking to do is to identify the independent clauses in a compound sentence. A compound sentence is a sentence with at least two independent clauses joined by a coordinating conjunction. There is no readily available tool for this, but you can identify compound sentences with a high degree of precision by using constituency parse trees.
Be wary, though. Slight grammatical mistakes can yield a very wrong parse tree! For example, if you use the Berkeley parser (demo page: http://tomato.banatao.berkeley.edu:8080/parser/parser.html) on your first example, the parse tree is not what you would expect, but correct it to "John is an actor and his father ... ", and you can see the parse tree neatly divided into the structure S CC S:
Now, you simply take each sentence-label S as an independent clause!
Questions are not handled well, I am afraid, as you can check with your second example.
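To make the last step concrete, here is a minimal sketch that takes a constituency parse (hand-written bracketing here as an assumption; in practice it would come from a parser such as the Berkeley parser) and pulls out each top-level S child as an independent clause:

```python
# Minimal sketch: split a compound sentence at its top-level "S CC S"
# structure. The bracketed parse below is hand-written for illustration.
from nltk import Tree

parse = Tree.fromstring(
    "(S (S (NP (NNP John)) (VP (VBZ is) (NP (DT an) (NN actor)))) "
    "(CC and) "
    "(S (NP (PRP$ his) (NN father)) (VP (VBD was) (NP (DT a) (NN doctor)))))"
)

clauses = [" ".join(child.leaves())
           for child in parse
           if isinstance(child, Tree) and child.label() == "S"]
print(clauses)  # ['John is an actor', 'his father was a doctor']
```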
I want to classify sentences with Weka. My features are the sentence terms (words) and the part-of-speech tag of each term. I don't know how to set up the attributes, because if each term is represented as one feature, the number of features differs from one instance (sentence) to another. And if all the words in a sentence are represented as one feature, how do I relate the words to their POS tags?
Any ideas how I should proceed?
If I understand the question correctly, the answer is as follows: It is most common to treat words independently of their position in the sentence and represent a sentence in the feature space by the number of times each of the known words occurs in that sentence. I.e. there is usually a separate numerical feature for each word present in the training data. Or, if you're willing to use n-grams, a separate feature for every n-gram in the training data (possibly with some frequency threshold).
As for the POS tags, it might make sense to use them as separate features, but only if the classification you're interested in has to do with sentence structure (syntax). Otherwise you might want to just append the POS tag to the word, which would partly disambiguate those words that can represent different parts of speech.
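For illustration only (scikit-learn instead of Weka, with made-up sentences), the two representations could look like this; in Weka you would build the same word-count attributes, e.g. with the StringToWordVector filter:

```python
# Minimal sketch: (1) a bag-of-words count per known word, and
# (2) the same counts over word_TAG tokens (POS tag appended to the word).
# Sentences and tags are made-up examples; scikit-learn is used purely
# to illustrate the feature layout.
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The dog barks", "The cat sleeps"]
pos_tagged = ["The_DT dog_NN barks_VBZ", "The_DT cat_NN sleeps_VBZ"]

bow = CountVectorizer()                            # one numeric feature per word
X_words = bow.fit_transform(sentences)
print(bow.get_feature_names_out(), X_words.toarray())

bow_pos = CountVectorizer(token_pattern=r"\S+")    # keep word_TAG tokens intact
X_word_pos = bow_pos.fit_transform(pos_tagged)
print(bow_pos.get_feature_names_out(), X_word_pos.toarray())
```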