sentence boundary detection in noisy or ASR data - parsing

There are many tools and papers available which perform this task using basic sentence separators.
Such tools include:
http://nlp.stanford.edu/software/tokenizer.shtml
OpenNLP
NLTK
and there might be others. They mainly rely on simple rules such as the following (a minimal sketch of these heuristics appears after the list):
(a) If it's a period, it ends a sentence.
(b) If the preceding token is on my hand-compiled list of abbreviations, then it doesn't end a sentence.
(c) If the next token is capitalized, then it ends a sentence.
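As a concrete illustration, here is a minimal Python sketch of those heuristics; the abbreviation set is a hypothetical stand-in for a real hand-compiled list:

    # Hypothetical hand-compiled abbreviation list; extend it for your domain.
    ABBREVIATIONS = {"mr", "mrs", "dr", "prof", "etc", "e.g", "i.e"}

    def naive_sentence_split(text):
        """Rule-based splitter mirroring heuristics (a)-(c) above."""
        tokens = text.split()
        sentences, current = [], []
        for i, tok in enumerate(tokens):
            current.append(tok)
            if tok.endswith("."):                          # (a) a period may end a sentence
                prev = tok.rstrip(".").lower()
                nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
                if prev in ABBREVIATIONS:                  # (b) known abbreviation: not a boundary
                    continue
                if nxt and not nxt[0].isupper():           # (c) next token not capitalized: not a boundary
                    continue
                sentences.append(" ".join(current))
                current = []
        if current:
            sentences.append(" ".join(current))
        return sentences

    print(naive_sentence_split("Mr. Smith went to Washington. He met Dr. Jones."))
    # ['Mr. Smith went to Washington.', 'He met Dr. Jones.']

Rules like these break down exactly on ASR-style text, which has no punctuation or casing to key on.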
There are a few papers which suggest techniques for SBD in ASR text:
http://pdf.aminer.org/000/041/703/experiments_on_sentence_boundary_detection.pdf
http://www.icsd.aegean.gr/lecturers/kavallieratou/publications_files/icpr_2000.pdf
Are there any tools which can perform sentence detection on ambiguous sentences like:
John is actor and his father Mr Smith was top city doctor in NW (2 sentences)
Where is statue of liberty, what is it's height and what is the history behind? (3 sentences)

What you are seeking to do is to identify the independent clauses in a compound sentence. A compound sentence is a sentence with at least two independent clauses joined by a coordinating conjunction. There is no readily available tool for this, but you can identify compound sentences with a high degree of precision by using constituency parse trees.
Be wary, though: slight grammatical mistakes can yield a very wrong parse tree! For example, if you use the Berkeley parser (demo page: http://tomato.banatao.berkeley.edu:8080/parser/parser.html) on your first example, the parse tree is not what you would expect; but correct the input to "John is an actor and his father ... ", and you can see the parse tree neatly divided into the structure S CC S.
Now, you simply take each sentence-label S as an independent clause!
Questions are not handled well, I am afraid, as you can check with your second example.
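If your parser gives you a bracketed parse, pulling out the clauses is mechanical. Below is a small sketch using NLTK's Tree class; the bracketing is a hand-written approximation of what a constituency parser might return for the corrected sentence, not verbatim parser output:

    from nltk.tree import Tree

    # Hand-written, approximate constituency parse of the corrected sentence.
    parse = Tree.fromstring(
        "(ROOT (S (S (NP (NNP John)) (VP (VBZ is) (NP (DT an) (NN actor)))) "
        "(CC and) "
        "(S (NP (PRP$ his) (NN father)) (VP (VBD was) (NP (DT a) (NN doctor))))))"
    )

    # Treat every S sitting directly under the top-level S as an independent clause.
    top = parse[0]
    clauses = [" ".join(child.leaves())
               for child in top
               if isinstance(child, Tree) and child.label() == "S"]
    print(clauses)
    # ['John is an actor', 'his father was a doctor']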


Retrieving the top 5 sentences - algorithm, if any present

I am new to Data Science. This could be a dumb question, but I just want to hear opinions and confirm whether I can do this well.
My question is about getting the 5 most common/frequent sentences from a database. I know I could gather all the sentences into a list and, using the Counter library, fetch the 5 most frequent ones, but I am interested to know whether any algorithm (ML/DL/NLP) exists for such a requirement. All the sentences are given by the user, and I need to know their top 5 (most frequent) sentences (not phrases, please).
Examples of sentences -
"Welcome to the world of Geeks"
"This portal has been created to provide well written subject"
"If you like Geeks for Geeks and would like to contribute"
"to contribute at geeksforgeeks org See your article appearing on "
"to contribute at geeksforgeeks org See your article appearing on " (occurring for the second time)
"the Geeks for Geeks main page and help thousands of other Geeks."
Note: all the sentences in my database are distinct (context-wise, with no exact duplicates). The above is just an example for my requirement.
Thanks in Advance.
I'd suggest you start with sentence embeddings. Briefly, a sentence embedding is a vector that roughly represents the meaning of a given sentence.
Let's say you have n sentences in your database and you compute a sentence embedding for each one, so now you have n vectors.
Once you have the vectors, you can use dimensionality reduction techniques such as t-SNE to visualize your sentences in 2 or 3 dimensions. In this visualization, sentences that have similar meanings should ideally be close to each other. That may help you pinpoint the most frequent sentences that are also close in meaning.
I think one problem is that it is still hard to draw boundaries around the meanings of sentences, since meaning is intrinsically subjective. You may have to add some heuristics to the process I described above.
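As a rough sketch of that pipeline (assuming the sentence-transformers and scikit-learn packages; the model name is just one common English choice):

    from sentence_transformers import SentenceTransformer
    from sklearn.manifold import TSNE

    sentences = [
        "Welcome to the world of Geeks",
        "This portal has been created to provide well written subject",
        "If you like Geeks for Geeks and would like to contribute",
    ]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(sentences)        # shape: (n_sentences, embedding_dim)

    # Project to 2D; sentences with similar meanings should land near each other.
    coords = TSNE(n_components=2, perplexity=2, init="random").fit_transform(embeddings)
    for sent, (x, y) in zip(sentences, coords):
        print(f"({x:7.1f}, {y:7.1f})  {sent}")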
Adding to MGoksu's answer: once you get sentence embeddings, you can apply LSH (locality-sensitive hashing) to group the embeddings into clusters.
Once you have the clusters of embeddings, it is trivial to pick out the clusters with the highest number of vectors.
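To make that concrete, here is a small from-scratch sketch of random-hyperplane LSH over the embedding matrix (not tied to any particular LSH library); the largest buckets are your candidate groups of near-duplicate meanings:

    import numpy as np
    from collections import defaultdict

    def lsh_buckets(embeddings, n_planes=12, seed=0):
        """Random-hyperplane LSH: each embedding gets a bit-string bucket key."""
        rng = np.random.default_rng(seed)
        planes = rng.normal(size=(embeddings.shape[1], n_planes))
        bits = embeddings @ planes > 0              # sign of each projection
        return ["".join("1" if b else "0" for b in row) for row in bits]

    def largest_clusters(sentences, embeddings, top_k=5):
        """Group sentences by bucket and return the top_k largest groups."""
        buckets = defaultdict(list)
        for sent, key in zip(sentences, lsh_buckets(np.asarray(embeddings))):
            buckets[key].append(sent)
        return sorted(buckets.values(), key=len, reverse=True)[:top_k]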

General approach to extract key text from sentence (nlp)

Given a sentence like:
Complimentary gym access for two for the length of stay ($12 value per person per day)
What general approach can I take to identify the word gym or gym access?
Is this a job for a POS tagger (extracting nouns)?
One of the most widely used techniques for extracting keywords from text is TF-IDF scoring of the terms. A higher TF-IDF score indicates that a word is both frequent within the document and relatively uncommon across the document corpus, which is often interpreted to mean that the word is significant to the document.
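As a rough sketch with scikit-learn's TfidfVectorizer (the extra sentences are a made-up mini-corpus just so the IDF has something to compare against, and get_feature_names_out assumes a reasonably recent scikit-learn):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical mini-corpus: your sentence plus a few sibling descriptions.
    corpus = [
        "Complimentary gym access for two for the length of stay",
        "Free breakfast buffet for the length of stay",
        "Complimentary parking for one vehicle per day",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(corpus)

    # The highest-scoring terms of the first document are its candidate keywords.
    terms = vectorizer.get_feature_names_out()
    scores = tfidf[0].toarray().ravel()
    print(sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)[:3])
    # 'gym' and 'access' rank highest because they appear only in that sentence.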
One other method is using lexical chains; I refer you to this paper for a full description.
There are many other approaches that you can explore depending on your domain. A short survey can be found here.
Noun POS tags are not sufficient. For your example, "length of stay" is also a noun phrase, but might not be a key phrase.

Models for classifying noun phrases?

I need a model for the following task:
given a sequence of words with its POS tags, I want to judge whether this sequence of words is a noun phrase or not.
One model I can think of is the HMM.
For those sequences which are noun phrases, we train one HMM (HMM+). For those which are not noun phrases, we train another HMM (HMM-). When we make a prediction for a sequence, we calculate P(sequence | HMM+) and P(sequence | HMM-). If the former is larger, we consider the phrase a noun phrase; otherwise it is not.
What do you think of this, and do you have any other models suited to this task?
From what I understand, you already have POS tags for the sequence of words. Once you have tags for the sequence, you don't need an HMM to classify whether the sequence is an NP. All you need to do is look for patterns of the following forms (a sketch follows the list):
determiner followed by noun
adjective followed by noun
determiner followed by adjective followed by noun
etc
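Here is a minimal sketch of that pattern idea with NLTK's RegexpParser; the grammar just encodes "optional determiner, any number of adjectives, one or more nouns":

    import nltk

    # Chunk grammar for the patterns listed above.
    grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
    chunker = nltk.RegexpParser(grammar)

    tagged = [("the", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
              ("jumped", "VBD"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

    tree = chunker.parse(tagged)
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
        print(subtree.leaves())
    # [('the', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN')]
    # [('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

To decide whether a whole tagged sequence is itself a noun phrase, check whether the chunker wraps the entire sequence in a single NP chunk.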
As somebody just mentioned, HMMs are used to obtain POS tags for a new sequence of words, but for that you need a tagged corpus to train the HMM. There are some tagged corpora available in NLTK.
If your sequences are already tagged, then just use grammar rules as mentioned in the previous answer.
People do use HMMs to label noun phrases in POS-labeled sentences, but the typical model setup does not work in quite the way you're describing.
Instead, the setup (see Chunk tagger-statistical recognition of noun phrases (PDF) and Named entity recognition using an HMM-based chunk tagger (PDF) for examples) is to use an HMM with three states:
O (not in an NP),
B (beginning of an NP),
I (in an NP, but not the beginning).
Each word in a sentence will be assigned one of the states by the HMM. As an example, the sentence:
The/DT boy/NN hit/VT the/DT ball/NN with/PP the/DT red/ADJ bat/NN ./.
might be ideally labeled as follows:
The/DT B boy/NN I hit/VT O the/DT B ball/NN I with/PP O the/DT B red/ADJ I bat/NN I ./. O
The transitions among these three HMM states can be limited based on prior knowledge of how the sequences will behave; in particular, you can only transition to I from B, but the other transitions are all possible with nonzero probability. You can then use Baum-Welch on a corpus of unlabeled text to train up your HMM (to identify any type of chunk at all -- see Simple Unsupervised Grammar Induction from Raw Text with Cascaded Finite State Models (PDF) for an example), or some sort of maximum-likelihood method with a corpus of labeled text (in case you're looking specifically for noun phrases).
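To see what this B/I/O encoding looks like on real data (this is the kind of labeled input you would train an HMM, or any other sequence model, on), NLTK ships the CoNLL-2000 chunking corpus:

    import nltk                                   # nltk.download('conll2000') once
    from nltk.corpus import conll2000
    from nltk.chunk import tree2conlltags

    # NP-chunked sentences converted to (word, POS, B/I/O) triples.
    train_sents = conll2000.chunked_sents("train.txt", chunk_types=["NP"])
    print(tree2conlltags(train_sents[0])[:6])
    # Roughly: [('Confidence', 'NN', 'B-NP'), ('in', 'IN', 'O'), ('the', 'DT', 'B-NP'),
    #           ('pound', 'NN', 'I-NP'), ('is', 'VBZ', 'O'), ('widely', 'RB', 'O')]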
My hunch is that an HMM is not the right model. It can be used to guess POS tags, by deriving the sequence of tags with the highest probabilities based on prior probabilities and conditional probabilities from one token to the next.
For a complete noun phrase I don't see how this model fits.
Any probability-based approach will be very difficult to train, because noun phrases can contain many tokens, which leads to a very large number of combinations. To get useful training probabilities, you need really huge training sets.
You might quickly and easily get a sufficiently good start by crafting a set of grammar rules, for example regular expressions over POS tags, by following the description in
http://en.wikipedia.org/wiki/Noun_phrase#Components_of_noun_phrases
or any other linguistic description of noun phrases.

Feature extraction from a single word

Usually one gets features from a text by using the bag-of-words approach: counting the words and calculating different measures, for example tf-idf values, as in: How to include words as numerical feature in classification
But my problem is different, I want to extract a feature vector from a single word. I want to know for example that potatoes and french fries are close to each other in the vector space, since they are both made of potatoes. I want to know that milk and cream also are close, hot and warm, stone and hard and so on.
What is this problem called? Can I learn the similarities and features of words by just looking at a large number of documents?
I will not make the implementation in English, so I can't use databases.
Hmm, feature extraction methods on text data (e.g. tf-idf) are based on statistics. You, on the other hand, are looking for sense (semantics). Therefore no method like tf-idf will work for you.
In NLP there are 3 basic levels:
morphological analysis
syntactic analysis
semantic analysis
(the higher the number, the bigger the problems :)). Morphology is well understood for the majority of languages. Syntactic analysis is a bigger problem (it deals with things like what is the verb or the noun in a given sentence, ...). Semantic analysis is the most challenging, since it deals with meaning, which is quite difficult to represent in machines, has many exceptions and is language-specific.
As far as I understand, you want to know some relationships between words; this can be done via so-called dependency treebanks (or just treebanks): http://en.wikipedia.org/wiki/Treebank . A treebank is a database/graph of sentences where a word can be considered a node and a relationship an arc. There is a good treebank for Czech, and for English there are some as well, but for many 'less-covered' languages it can be a problem to find one...
user1506145,
Here is a simple idea that I have used in the past. Collect a large number of short documents like Wikipedia articles. Do a word count on each document. For the ith document and the jth word let
I = the number of documents,
J = the number of words,
x_ij = the number of times the jth word appears in the ith document, and
y_ij = ln( 1+ x_ij).
Let [U, D, V] = svd(Y) be the singular value decomposition of Y, so Y = U*D*transpose(V), where U is IxI, D is a diagonal IxJ matrix, and V is JxJ.
You can use the first four entries of the jth row of V, (V_j1, V_j2, V_j3, V_j4), as a feature vector in R^4 for the jth word.
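A minimal numpy sketch of that recipe, with a hypothetical toy count matrix standing in for your real document-term counts:

    import numpy as np

    # x: I x J document-term count matrix (documents in rows, words in columns).
    x = np.array([[3, 0, 1, 0],
                  [0, 2, 0, 1],
                  [1, 1, 4, 0]])

    y = np.log1p(x)                                     # y_ij = ln(1 + x_ij)
    u, d, vt = np.linalg.svd(y, full_matrices=False)    # y = u @ diag(d) @ vt

    # Row j of V (i.e. column j of V^T) is a dense feature vector for word j;
    # keep the first few components.
    word_vectors = vt.T[:, :2]                          # J x 2, one row per word
    print(word_vectors)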
I am surprised the previous answers haven't mentioned word embeddings. A word embedding algorithm produces a word vector for each word in a given dataset. These algorithms infer the word vectors from context. For instance, by looking at the context of the following sentences we can say that "clever" and "smart" are somehow related, because the context is almost the same:
He is a clever guy
He is a smart guy
A co-occurrence matrix can be constructed to do this. However, it is too inefficient. A famous technique designed for this purpose is called Word2Vec. It can be studied from the following papers.
https://arxiv.org/pdf/1411.2738.pdf
https://arxiv.org/pdf/1402.3722.pdf
I have been using it for Swedish. It is quite effective in detecting similar words and is completely unsupervised.
Implementations can be found in gensim and TensorFlow.
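A minimal gensim sketch (the four toy sentences only show the API; with so little data the similarities will be noisy, so in practice you would train on a large corpus in your language):

    from gensim.models import Word2Vec

    sentences = [
        ["he", "is", "a", "clever", "guy"],
        ["he", "is", "a", "smart", "guy"],
        ["she", "is", "a", "clever", "person"],
        ["she", "is", "a", "smart", "person"],
    ]

    model = Word2Vec(sentences, min_count=1, window=2)
    print(model.wv.most_similar("clever", topn=3))   # nearest neighbours in the vector space
    print(model.wv.similarity("clever", "smart"))    # cosine similarity of the two words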

rails - comparison of arrays of sentences

I have two arrays of sentences. As you can see, I'm trying to match applicant abilities with job requirements.
Array A
-Must be able to use MS Office
-Applicant should be prepared to work 40 to 50 hours a week
-Must know FDA Regulations, FCC Regulations
-Must be willing to work in groups
Array B
-Proficient in MS Office
-Experience with FDA Regulations
-Willing to work long hours
-Has experience with math applications.
Is there any way to compare the two arrays and determine how many similarities there are? Preferably on a sentence by sentence basis (not just picking out words that are similar) returning a percentage similar.
Any suggestions?
What you are asking for is pretty difficult and it is the buzz of natural language processing today.
NLTK is the toolkit of choice, but it's in Python. There are lots of academic papers in this field. Most use corpora to train a model under the hypothesis that words that are similar tend to appear in similar contexts (i.e. surrounded by similar words). This is very computationally expensive.
You can come up with a rudimentary solution by using the nltk library with this plan in mind:
Remove filler words (a, the, and).
Use the part-of-speech tagger to label verbs, nouns, etc. (I'd remove anything other than nouns and verbs).
For, say, any two nouns (or verbs), use the WordNet library to get the synonyms of each word, and if there is a match, count it. There are lots of other papers on this that use corpora to build lexicons which use word frequencies to measure word similarity. The latter method is preferred because you are likely to relate words that are similar but do not have synonyms in common.
You can then give a relative measure of sentence similarity based on the word similarities (a rough sketch of this plan appears below).
Other methods consider the syntactic structure of the sentence, but you don't get that much benefit from this. Unfortunately, the above method is not very good, because of the nature of WordNet.
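Since the plan above leans on NLTK, here is a rough Python sketch of it (you would port the idea to Ruby or call out to a small Python service from Rails); the score is a crude synonym-set overlap, not a polished metric:

    import nltk
    from nltk.corpus import stopwords, wordnet as wn
    # One-time downloads: punkt, averaged_perceptron_tagger, wordnet, stopwords

    def content_synsets(sentence):
        """Tokenize, keep nouns/verbs, and collect their WordNet synonym lemmas."""
        stop = set(stopwords.words("english"))
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence.lower()))
        words = [w for w, t in tagged if t.startswith(("NN", "VB")) and w not in stop]
        return {lemma.name() for w in words
                             for syn in wn.synsets(w)
                             for lemma in syn.lemmas()}

    def sentence_similarity(a, b):
        """Percentage overlap between the two sentences' synonym sets."""
        sa, sb = content_synsets(a), content_synsets(b)
        if not sa or not sb:
            return 0.0
        return 100.0 * len(sa & sb) / len(sa | sb)

    print(sentence_similarity("Must be able to use MS Office",
                              "Proficient in MS Office"))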
