How to extract information from given sentences - machine-learning

I am building a system that will receive queries related to file management, such as deleting, copying, moving, or creating items.
What is the best approach to extract information from queries like the one below?
can you delete file "file name" from "folder name"
The system should then collect:
Action : deleting
upon : "file name"
destination : "folder name"

Natural language processing is rather complex, and there are many challenges that make parsing unstructured natural language queries like this more difficult than it might seem, depending on how broad the underlying set of commands is.
But in general, you would probably run the query through a part-of-speech tagger to extract verb phrases for the actions, verb-object pairs for upon/destination, etc. Then you would map these terms to a list of acceptable synonyms for each action. For instance, you might have a list of synonyms for "delete" such as ['delete', 'remove', 'rm', 'toss', 'eliminate', ...] and then set the action to delete if the verb phrase contains any of these words. For how to use the NLTK POS tagger and other tools to parse queries, take a look at this tutorial, which covers many of the difficulties in analyzing the semantics of sentences: Analyzing the Meaning of Sentences
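As a rough illustration of the synonym-mapping step only, here is a minimal sketch. It skips the POS tagger entirely and assumes that file and folder names always appear in double quotes, which will not hold for real queries. The synonym lists and the extract_command helper are made up for this example.

```python
import re

# Hypothetical synonym lists; a real system would need far broader coverage.
ACTION_SYNONYMS = {
    "delete": {"delete", "remove", "rm", "toss", "eliminate"},
    "copy": {"copy", "duplicate", "cp"},
    "move": {"move", "mv", "relocate"},
    "create": {"create", "make", "new"},
}


def extract_command(query):
    """Return the action and the quoted arguments found in a query."""
    quoted = re.findall(r'"([^"]*)"', query)
    # Drop the quoted names before looking for action words, so that a
    # file called "remove me" is not mistaken for an action.
    remainder = re.sub(r'"[^"]*"', " ", query).lower()
    action = None
    for word in re.findall(r"[a-z]+", remainder):
        for name, synonyms in ACTION_SYNONYMS.items():
            if word in synonyms:
                action = name
                break
        if action:
            break
    return {
        "action": action,
        "upon": quoted[0] if quoted else None,
        "destination": quoted[1] if len(quoted) > 1 else None,
    }


print(extract_command('can you delete file "file name" from "folder name"'))
```

Taking the first quoted string as "upon" and the second as "destination" is a positional guess. Deciding which argument plays which role is exactly where the tagger or a dependency parse would be needed.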
You might also want to check out these related threads:
How to process natural language queries?
Natural Language to SQL query
You had mentioned finding an academic paper on this, and if you are looking for more journal articles I suggest searching for the term "natural language query" (and variations thereof). A search for this on Semantic Scholar from 2010-present turned up more than 75,000 results.

Related

I have a dataset on which I want to do Phrase extraction using NLP but I am unable to do so?

How can I extract a phrase from a sentence using a dataset that contains sentences and their corresponding labels in the following form?
Sentence1: I want to play cricket
Label1: play cricket
Sentence2: Need to wash my clothes
Label2: wash clothes
I have tried using chunking with nltk but I am not able to use training data along with the chunks.
The "reminder paraphrases" you describe don't map exactly to other kinds of "phrases" with explicit software support.
For example, the gensim Phrases module uses a purely statistical approach to discover neighboring word-pairings that are so common, relative to the base rates of each word individually, that they might usefully be considered a combined unit. It might turn certain entities into phrases (eg: "New York" -> "New_York"), or repeated idioms (eg: "slacking off" -> "slacking_off"). But it'd only be neighboring-runs-of-words, and not the sort of contextual paraphrase you're seeking.
Similarly, libraries which are suitably grammar-aware to mark-up logical parts-of-speech (and inter-dependencies) also tend to simply group and label existing phrases in the text – not create simplified, imperative summaries like you desire.
Still, such libraries' output might help you work up your own rules-of-thumb. For example, it appears in your examples so far, your desired "reminder paraphrase" is always one verb and one noun (that verb's object). So after using part-of-speech tagging (as from NLTK or SpaCy), choosing the last verb (perhaps also preferring verbs in present/imperative tense), and the following noun-phrase (perhaps stripped of other modifiers/prepositions) may do most of what you need.
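The "last verb plus following noun" rule of thumb could be sketched roughly as below. To stay self-contained, the sketch works on tokens that are already tagged with Penn Treebank-style tags. The tags in the usage lines are written by hand as an assumption about what a tagger such as NLTK's might return, and reminder_phrase is a made-up helper name.

```python
def reminder_phrase(tagged):
    """Pick the last verb and the first noun after it from (word, tag) pairs."""
    verb_index = None
    for i, (_, tag) in enumerate(tagged):
        if tag.startswith("VB"):  # VB, VBP, VBZ, VBD, ...
            verb_index = i        # keep overwriting so the last verb wins
    if verb_index is None:
        return None
    verb = tagged[verb_index][0]
    for word, tag in tagged[verb_index + 1:]:
        if tag.startswith("NN"):  # NN, NNS, NNP, NNPS
            return f"{verb} {word}"
    return verb  # no noun followed the verb


# Hand-tagged versions of the two example sentences (assumed tagger output).
s1 = [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("play", "VB"), ("cricket", "NN")]
s2 = [("Need", "VB"), ("to", "TO"), ("wash", "VB"), ("my", "PRP$"), ("clothes", "NNS")]
print(reminder_phrase(s1))  # play cricket
print(reminder_phrase(s2))  # wash clothes
```

Real tagger output is noisier than this, so mis-tagged verbs or multi-word objects would need extra handling.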
Of course, more complicated examples would need better heuristics. And if the full range of texts you need to work on is very varied, finding a general approach might require many more (hundreds/thousands) of positive training examples: what you think the best paraphrase is, given certain texts. Then, you could consider a number of machine-learning methods that might be able to pick the right ~2 words from larger texts.
Researching published work for "paraphrasing", rather than just "phrase extraction", might also guide you to ideas, but I unfortunately don't know any ready-to-use paraphrasing libraries.

What is parsing? (And differences from search and grep?)

What exactly is parsing? I mean, generally. How is parsing different from searching? On the command line, if I use the grep tool/command, is that parsing?
For example, if I have just one string:
"Hello world! How are you doing today?"
and I tried to search (using grep or any other tool) whether the word "you" is within that string; is that parsing?
What if I do a web search; for example in Google? Is that parsing?
Or is parsing the name of the process that is a part of the process known as "Search"?
The verb "parse" is essentially related to the word "part", as in "part of speech". (See, for example, the on-line etymology dictionary.)
To "parse" a sentence has traditionally meant to break the sentence down into its component parts and identify their relationship with each other. For example, given "I asked a question.", we can parse it into a subject ("I"), a transitive verb in past tense ("asked"), and an object phrase consisting of an article ("a") and a noun ("question"). The parse indicates that the subject performed some action on the object; this is not the same statement as *"A question asked I", and not just because the latter is ungrammatical.
With the advent of computer languages and computational theory, the term "parsing" has been generalized to include analysis of strings which are not human languages. Some people would even use it to simply mean "to divide a string into its component parts", such as "parsing" a line in a CSV file into fields.
It's quite a stretch to apply that to merely searching for a string inside another string, although there may be contexts in which that is an acceptable use of the word. Personally, I would only use it for the action of completely deconstructing a structured string.
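To make the distinction concrete, here is a small sketch contrasting the two. The CSV line is a made-up example. The search only reports whether a substring occurs, while the parse breaks the whole line into fields according to the format's rules.

```python
import csv

# Searching: is the substring present anywhere in the string?
text = "Hello world! How are you doing today?"
found = "you" in text
print(found)  # True

# Parsing: decompose a structured string into its component parts.
# The CSV rules matter here: the comma inside the quotes is not a separator.
line = 'alice,42,"New York, NY"'
fields = next(csv.reader([line]))
print(fields)  # ['alice', '42', 'New York, NY']
```

A plain split on commas would give four pieces for that line, which is why dividing a CSV line into fields is usually described as parsing rather than searching.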

Algorithm for keyword/phrase trend search similar to Twitter trends

I would like some ideas about building a tool that scans text sentences (written in English) and builds a keyword ranking based on the most frequent words or phrases within the texts.
This would be very similar to Twitter trends, wherein Twitter detects and reports the top 10 words within the tweets.
I have identified the high-level steps in the algorithm as follows:
1) Scan the text and remove all the common, frequent words (such as "the", "is", "are", "what", "at", etc.).
2) Add the remaining words to a hashmap. If a word is already in the map, increment its count.
3) To get the top 10 words, iterate through the hashmap and find the 10 highest counts.
Steps 2 and 3 are straightforward, but for step 1 I do not know how to detect the important words within a text and separate them from the common words (prepositions, conjunctions, etc.).
Also, if I want to track phrases, what could the approach be?
For example if I have a text saying "This honey is very good"
I might want to track "honey" and "good" but I may also want to track the phrases "very good" or "honey is very good"
Any suggestions would be greatly appreciated.
Thanks in advance
For detecting phrases, I suggest using a chunker. You can use one provided by an NLP toolkit such as OpenNLP or Stanford CoreNLP.
NOTE
"honey is very good" is not a phrase; it is a clause. "very good" is a phrase.
In information retrieval systems, those common words are called stop words.
Actually, your step 1 would be quite similar to step 3, in the sense that you may first want to build a database of the most common words in the English language. Such lists are easy to find on the internet (Wikipedia even has an article listing the 100 most common words in the English language). You can store those words in a hashmap and, while scanning your text, simply ignore the common tokens.
If you don't trust Wikipedia and the already existing listing for common words, you can build your own database. For that purpose, just scan thousands of tweets (the more the better) and make your own frequency chart.
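The stop-word filtering and counting described above might look roughly like this with only the standard library. The STOP_WORDS set here is a tiny hand-picked placeholder, not a real list, and top_words is a made-up helper name.

```python
import re
from collections import Counter

# Placeholder stop-word list; substitute a fuller list in practice.
STOP_WORDS = {"the", "is", "are", "what", "at", "a", "an", "and", "of", "to", "in", "this"}


def top_words(texts, n=10):
    """Count the non-stop words across all texts and return the n most frequent."""
    counts = Counter()
    for text in texts:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return counts.most_common(n)


print(top_words(["This honey is very good", "The honey is at the store"]))
```

Counter plays the role of the hashmap from step 2, and most_common covers step 3.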
You're facing an n-gram-like problem.
Do not reinvent the wheel. What you seem to be wanting to do has been done thousands of times, just use existing libs or pieces of code (check the External Links section of the n-gram Wikipedia page.)
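For a sense of what the n-gram view means for phrase tracking, here is a minimal sketch that counts every run of n neighboring words. The function name ngram_counts is made up, and the tokenization is a deliberately crude regular expression.

```python
import re
from collections import Counter


def ngram_counts(texts, n=2):
    """Count every run of n consecutive words across the given texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        # A text with fewer than n words yields an empty range here.
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return counts


counts = ngram_counts(["This honey is very good", "very good honey"], n=2)
print(counts["very good"])  # 2
```

Running it for several values of n (2, 3, 4) gives candidate phrases of different lengths. An existing library will handle tokenization and filtering better than this.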
Check out the NLTK library. It has code that covers steps one, two, and three:
1 Removing common words can be done using the stopwords corpus; a stemmer can additionally merge inflected forms of the same word
2,3 Getting the most common words can be done with FreqDist
Second, you can use tools from Stanford NLP for tracking your text

How to use Bayesian analysis to compute and combine weights for multiple rules to identify books

I am experimenting with machine learning in general, and Bayesian analysis in particular, by writing a tool to help me identify my collection of e-books. The input data consist of a set of e-book files, whose names and in some cases contents contain hints as to the book they correspond to.
Some are obvious to the human reader, like:
Artificial Intelligence - A Modern Approach 3rd.pdf
Microsoft Press - SharePoint Foundation 2010 Inside Out.pdf
The Complete Guide to PC Repair 5th Ed [2011].pdf
Hamlet.txt
Others are not so obvious:
Vsphere5.prc (Actually 'Mastering VSphere 5' by Scott Lowe)
as.ar.pdf (Actually 'Atlas Shrugged' by Ayn Rand)
Rather than try to code various parsers for different formats of file names, I thought I would build a few dozen simple rules, each with a score.
For example, one rule would look in the first few pages of the file for something resembling an ISBN number, and if found would propose a hypothesis that the file corresponds to the book identified by that ISBN number.
Another rule would look to see if the file name is in 'Author - Title' format and, if so, would propose a hypothesis that the author is 'Author' and the title is 'Title'. Similar rules for other formats.
I thought I could also get a list of book titles and authors from Amazon or an ISBN database, and search the file name and first few pages of the file for any of these; any matches found would result in a hypothesis being suggested by that rule.
In the end I would have a set of tuples like this:
[rulename,hypothesis]
I expect that some rules, such as the ISBN match, will have a high probability of being correct, when they are available. Other rules, like matches based on known book titles and authors, would be more common but not as accurate.
My questions are:
Is this a good approach for solving this problem?
If so, is Bayesian analysis a good candidate for combining all of these rules' hypotheses into a compound score to help determine which hypothesis is the strongest, or most likely?
Is there a better way to solve this problem, or some research paper or book which you can suggest I turn to for more information?
It depends on the size of your collection and the time you want to spend training the classifier. It will be difficult to get good generalization that will save you time. For any type of classifier you will have to create a large training set, and also find a lot of rules before you get good accuracy. It will probably be more efficient (fewer false positives) to create the rules and use them only to suggest title alternatives for you to choose from, and not to implement the classifier. But, if the purpose is learning, then go ahead.
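If you do want to experiment, one naive way to rank the suggested alternatives is to add up log-odds for every rule that proposed a given hypothesis. This is only a sketch under strong assumptions: the rules are treated as independent, every hypothesis starts from even prior odds, and the precision numbers are invented (in practice you would estimate them from a hand-checked sample). The rule names and rank_hypotheses are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical precision estimates: how often each rule's proposal turned
# out to be correct on a hand-checked sample. These numbers are made up.
RULE_PRECISION = {
    "isbn_match": 0.95,
    "author_title_filename": 0.70,
    "known_title_match": 0.60,
}


def rank_hypotheses(proposals):
    """Rank hypotheses by the summed log-odds of the rules that proposed them."""
    scores = defaultdict(float)
    for rule, hypothesis in proposals:
        p = RULE_PRECISION[rule]
        scores[hypothesis] += math.log(p / (1 - p))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


proposals = [
    ("isbn_match", "Atlas Shrugged"),
    ("known_title_match", "Atlas Shrugged"),
    ("author_title_filename", "As Ar"),
]
for hypothesis, score in rank_hypotheses(proposals):
    print(f"{hypothesis}: {score:.2f}")
```

The ranked list can then be shown as title suggestions to pick from, rather than being trusted as a final classification.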

Hierarchy of meaning

I am looking for a method to build a hierarchy of words.
Background: I am an "amateur" natural language processing enthusiast, and right now one of the problems that I am interested in is determining the hierarchy of word semantics from a group of words.
For example, if I have the set which contains a "super" representation of others, i.e.
[cat, dog, monkey, animal, bird, ... ]
I am interested to use any technique which would allow me to extract the word 'animal' which has the most meaningful and accurate representation of the other words inside this set.
Note: they are NOT the same in meaning. cat != dog != monkey != animal
BUT cat is a subset of animal and dog is a subset of animal.
I know by now a lot of you will be telling me to use WordNet. Well, I will try to, but I am actually interested in a very domain-specific area to which WordNet doesn't apply because:
1) Most words are not found in WordNet
2) All the words are in another language; translation is possible but only to limited effect.
another example would be:
[ noise reduction, focal length, flash, functionality, .. ]
so functionality includes everything in this set.
I have also tried crawling Wikipedia pages and applying techniques such as tf-idf, but the Wikipedia pages don't really help much either.
Can someone possibly enlighten me as to what direction my research should go towards? (I could use anything)
It looks like you want to use something like the hypernym/hyponym relationships in WordNet, but without actually using WordNet due to language and domain specific coverage issues? That is, if you had the domain specific hypernym relationships, you could get the "super" representation by just looking for the nearest parent that subsumed all of the words in the list, or the nearest node that was equal to one of the list words and subsumed all of the others.
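Assuming you already had such hypernym links, the "nearest parent that subsumes all of the words" lookup could be sketched as follows. The PARENT table is a tiny hand-built example that assumes each word has a single parent and that there are no cycles; the function names are made up.

```python
# Hypothetical hand-built hypernym links (child -> parent).
PARENT = {
    "cat": "mammal", "dog": "mammal", "monkey": "mammal",
    "mammal": "animal", "bird": "animal", "animal": "entity",
}


def ancestors(word):
    """Return the word followed by its chain of hypernyms, nearest first."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def common_hypernym(words):
    """Return the nearest node that subsumes every word, or None."""
    if not words:
        return None
    chains = [ancestors(w) for w in words]
    shared = set(chains[0]).intersection(*chains[1:])
    # Walk up from the first word; the first shared node is the nearest one.
    for node in chains[0]:
        if node in shared:
            return node
    return None


print(common_hypernym(["cat", "dog", "monkey", "animal", "bird"]))  # animal
print(common_hypernym(["cat", "dog"]))  # mammal
```

Because each chain starts with the word itself, the case where the answer is one of the list words (here "animal") falls out without special handling.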
To start, I would first point out that WordNets are actually available for many of the world's major languages; see the list at Global WordNet.
To get domain-specific hypernym relationships, you could use the technique presented in Snow et al.'s Learning syntactic patterns for automatic hypernym discovery. That is, you could start off with a small list of seed hypernyms, and then use them to train a classifier to detect hypernyms in a corpus. You would then run this classifier over data from your domain in order to build a list of domain-specific hypernym pairs.
The opinion mining and sentiment analysis folks might be doing related things, in terms of deciding what words represent features of products, without knowing anything about the products.
A quick sketch of an idea for how you might do this, which I've totally made up on the spot:
Parse a bunch of sentences in the relevant domain; find the noun phrases and adjectives. Figure out which noun phrases are associated with which adjectives. Cluster the noun phrases together based on the set of adjectives used to describe them. Animals will tend to cluster together because they're going to be described by adjectives like "furry" or "cute", etc. (In particular, hierarchical clustering would probably be most appropriate.)
If you try this, and it works, let me know. :)
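The similarity measure underneath that idea could be sketched like this. It covers only the "compare nouns by their adjective sets" step, using Jaccard similarity as one possible choice, and it does not perform the hierarchical clustering itself. The noun-to-adjective table is invented, standing in for what a parser would harvest from real text.

```python
# Hypothetical noun -> adjectives observations, as if harvested from parsed text.
ADJECTIVES = {
    "cat": {"furry", "cute", "small"},
    "dog": {"furry", "cute", "loyal"},
    "lens": {"sharp", "expensive", "heavy"},
    "flash": {"bright", "expensive", "heavy"},
}


def jaccard(a, b):
    """Share of adjectives two nouns have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(noun):
    """Return the other noun whose adjective set overlaps most with this one."""
    others = (n for n in ADJECTIVES if n != noun)
    return max(others, key=lambda n: jaccard(ADJECTIVES[noun], ADJECTIVES[n]))


print(most_similar("cat"))   # dog
print(most_similar("lens"))  # flash
```

A hierarchical clustering routine could take these pairwise similarities as input and merge the closest nouns first.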
