Parsing a phrase - parsing

I am trying to write an algorithm to parse (I don't know if this is the correct word) a question and find the correct answer to it.
Example
If someone asks "What is the Sun?", the correct answer would be "Is a Star".
This would be obtained from a list of phrases such as this:
"Is a Star"
"Is hot and bright"
"I don't know"
etc
Now, I would like to know where I can get information about this.
I think the main problem here is how to make the program understand that the "Sun" is a star, and how to get the most accurate answer about it, because "Is hot and bright" is also a valid answer.
Thanks

This is a problem from the Machine Learning area of Artificial Intelligence.
You cannot just parse some phrases if you want to write a good algorithm. It is not as simple as it seems.
What you want to write is your own application like http://www.cleverbot.com
I think you need to read and learn more about Machine Learning.

Related

How to Convert NLP Question to Knowledge Graph triple?

I have what I think is a simple question. I am trying to put together a question answering system and I am having trouble converting a natural question to a knowledge graph triple. Here is an example of what I mean:
Assume I have a prebuilt knowledge graph with the relationship:
((Todd) -[:picked_up_by]-> (Jane))
How can I make this conversion:
"Who picked up Todd today?" -> ((Todd) -[:picked_up_by]-> (?))
I am aware that there is a field dedicated to "Relationship Extraction", but I don't think it fits this problem. If I could name it, "question triple extraction" would be the name of what I am trying to do.
Generally speaking, it looks like a relation extraction problem, with your custom relations. Since the question is too generic, this is not an answer, just some links.
Check out reading comprehension: there are projects on GitHub and a lecture by Christopher Manning.
Also, look up Semantic Role Labeling.
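For a small, fixed set of relations, the conversion in the question can be sketched with hand-written patterns. This is only a rough illustration and not a general solution: the pattern list, the function name question_to_triple and the relation name are assumptions taken from the example in the question.

```python
import re

# Hypothetical hand-written patterns: each maps a question shape to a relation
# in the knowledge graph. Real systems would need many more of these, or a
# trained model (e.g. semantic role labeling) instead.
QUESTION_PATTERNS = [
    (re.compile(r"^who picked up (?P<subject>\w+)", re.IGNORECASE), "picked_up_by"),
]


def question_to_triple(question):
    """Return (subject, relation, '?') for a recognised question, else None."""
    for pattern, relation in QUESTION_PATTERNS:
        match = pattern.search(question.strip())
        if match:
            return (match.group("subject"), relation, "?")
    return None


print(question_to_triple("Who picked up Todd today?"))
```

The "?" in the result stands for the node to look up in the graph. Anything the patterns do not cover returns None, which is where a learned approach would have to take over.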

Summarization of simple Q&A

Is there a way to generate a one-sentence summarization of Q&A pairs?
For example, provided:
Q: What is the color of the car?
A: Red
I want to generate a summary as
The color of the car is red
Or, given
Q: Are you a man?
A: Yes
to
Yes, I am a man.
which accounts for both question and answer.
What would be some of the most reasonable ways to do this?
I had to once work on solving the opposite problem, i.e. generating questions out of sentences from Wikipedia articles.
I used the Stanford Parser to generate parse trees out of all possible sentences in my training dataset.
e.g.
Go to http://nlp.stanford.edu:8080/parser/index.jsp
Enter "The color of the car is red." and click "Parse".
Then look at the Parse section of the response. The first layer of that sentence is NP VP (noun phrase followed by a verb phrase).
The second layer is NP PP VBZ ADJP.
I basically collected these patterns across 1000s of sentences, sorted them by how common each pattern was, and then figured out how best to modify each parse tree to convert the sentence into a different Wh-question (What, Who, When, Where, Why, etc.).
You could easily do something very similar. Study the parse trees of all of your training data, and figure out what patterns you could extract to get your work done. In many cases, just replacing the Wh-word from the question with the answer would give you a valid, albeit somewhat awkwardly phrased, sentence.
e.g. "Red is the color of the car."
In the case of questions like "Are you a man?" (i.e. primary verb is something like 'are', 'can', 'should', etc), swapping the first 2 words usually does the trick - "You are a man?"
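The two shortcuts above (replace the Wh-word with the answer, or swap the first two words of a yes/no question) can be sketched without a parser. This is a rough sketch under my own assumptions: the word lists and the function name qa_to_statement are made up for illustration, and pronouns are not flipped, so "you" stays "you".

```python
# Illustrative word lists; neither is meant to be exhaustive.
WH_WORDS = {"what", "who", "when", "where", "which"}
AUXILIARIES = {"is", "are", "am", "was", "were", "can", "could",
               "should", "would", "will", "do", "does", "did"}


def qa_to_statement(question, answer):
    """Turn a question/answer pair into one sentence, or None if no rule fits."""
    words = question.strip().rstrip("?").split()
    if not words:
        return None
    first = words[0].lower()
    if first in WH_WORDS:
        # Replace the Wh-word with the answer.
        return " ".join([answer] + words[1:]) + "."
    if first in AUXILIARIES and len(words) >= 2:
        # Swap the first two words and prefix the yes/no answer.
        swapped = [words[1], first] + words[2:]
        return answer + ", " + " ".join(swapped) + "."
    return None


print(qa_to_statement("What is the color of the car?", "Red"))
print(qa_to_statement("Are you a man?", "Yes"))
```

The first call gives the awkward but valid "Red is the color of the car." and the second gives "Yes, you are a man." Getting to "Yes, I am a man." would need an extra pronoun-rewriting step on top of this.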
I don't know any NLP task that explicitly handles your requirement.
Broadly, there are two kinds of questions. The first kind expects a passage as the answer, such as a definition or an explanation: "What is Ebola Fever?" The second kind is fill-in-the-blank, referred to as factoid questions in the literature, such as "What is the height of Mt. Everest?" It is not clear what kind of question you would like to summarize. I am assuming you are interested in factoid questions, as your examples refer only to them.
A very similar problem arises in the task of Question Answering. One of the first stages of this task is to generate a query. In the paper "An Exploration of the Principles Underlying Redundancy-Based Factoid Question Answering" (Jimmy Lin, 2007), the author claims that better performance can be achieved by reformulating the query (see section 4.1) into a form more likely to appear in free text. Let me copy some of the examples discussed in the paper.
1. What year did Alaska become a state?
2. Alaska became a state ?x
1. Who was the first person to run the mile in less than four minutes?
2. The first person to run the mile in less than four minutes was ?x
In the above examples, the query in 1 is reformulated to 2. As you might have already observed, ?x is the blank that should be filled by the answer. This reformulation is carried out through a dozen hand-written rules that are built into the software tool discussed in the paper: ARANEA. All you have to do is find the tool and use it. The paper is a good ten years old, though, so I cannot promise you anything :)
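To give a feel for what such hand-written rules look like, here is a minimal sketch covering only the two examples above. These are not the rules from the paper or from the tool; the rule list and the function name reformulate are my own assumptions.

```python
import re

# Illustrative rules only: each pairs a question pattern with a template
# for the declarative form, where ?x marks the blank for the answer.
REFORMULATION_RULES = [
    (re.compile(r"^What year did (?P<subj>.+?) (?:become|became) (?P<rest>.+)\?$",
                re.IGNORECASE),
     "{subj} became {rest} ?x"),
    (re.compile(r"^Who was (?P<rest>.+)\?$", re.IGNORECASE),
     "{rest} was ?x"),
]


def reformulate(question):
    """Return the declarative form with a ?x blank, or None if no rule matches."""
    for pattern, template in REFORMULATION_RULES:
        match = pattern.match(question.strip())
        if match:
            text = template.format(**match.groupdict())
            return text[0].upper() + text[1:]
    return None


statement = reformulate("Who was the first person to run the mile in less than four minutes?")
print(statement)
print(statement.replace("?x", "Roger Bannister"))
```

Once the question is in this form, the one-sentence summary is just the reformulated string with ?x replaced by the answer, as the last line shows.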
Hope this helps.

Find location from text

I am currently thinking of how to find a location from a text, such as a blogpost, without the user having to input any additional information. For example a post could look like this:
"Aberdeen, With a Foot on the Seafloor
Since the early 1970s, Aberdeen, Scotland, has evolved from a gritty fishing town into the world’s center of innovation in technology for the offshore energy industry."
By reading it I realize that the post is about Aberdeen, Scotland, but how can I geotag it? I have been using the geocoder gem (https://github.com/alexreisner/geocoder) by Alex Reisner, but it seems weird to check every word against Google or Nominatim (OSM). My initial idea was to simply brute-force it by checking every word with the geocoder and trying to see if there are similarities between the words. But it seems like there could be a better way to do this.
Has anyone done anything similar to this? Any algorithm that could be suggested (or gem :) ) would be immensely appreciated!
I'm sure there have been projects dedicated to this. Consider, for example, Google's uncanny ability to geotag and pick data out of your personal emails effortlessly.
The most obvious answer I can see here would be to create a few regular expressions for locations. The simplest one would be for City, Country:
Regexp.new("((?:[a-z][a-z]+))(,)(\\s+)((?:[a-z][a-z]+))", Regexp::IGNORECASE)
This would recognize Aberdeen, Scotland, but also phrases such as "course, it" or even "thanks, bye". It would be a start though, to query only those recognized spots instead of every word in the document.
There are also widely known regular expressions for addresses, cities, etc. You could use those as well if you find your algorithm missing matches.
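As a rough sketch of the "query only the recognized spots" idea, here is a Python variant that collects candidate pairs first, so that only those need to go to a geocoder. It differs from the expression above in one assumption of mine: it requires both words to be capitalised, which drops matches like "thanks, bye". The function name location_candidates is made up, and no geocoding call is shown.

```python
import re

# Two capitalised words separated by a comma, e.g. "Aberdeen, Scotland".
# Requiring capitals is an assumption added here to cut false positives.
CITY_COUNTRY = re.compile(r"\b([A-Z][a-z]+),\s+([A-Z][a-z]+)\b")


def location_candidates(text):
    """Return unique 'City, Country' style candidates in order of appearance."""
    candidates = []
    for city, country in CITY_COUNTRY.findall(text):
        candidate = f"{city}, {country}"
        if candidate not in candidates:
            candidates.append(candidate)
    return candidates


post = ("Since the early 1970s, Aberdeen, Scotland, has evolved from a gritty "
        "fishing town. Of course, thanks, bye.")
print(location_candidates(post))
```

It is still not precise: a headline such as "Aberdeen, With a Foot on the Seafloor" would yield the candidate "Aberdeen, With", so each candidate should still be confirmed by the geocoder before tagging.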
Cheers!

How do you think the "Quick Add" feature in Google Calendar works?

I am thinking about a project which might use functionality similar to how "Quick Add" parses natural language into something that can be understood with some level of semantics. I'm interested in understanding this better and wondered what your thoughts were on how this might be implemented.
If you're unfamiliar with what "Quick Add" is, check out Google's KB about it.
6/4/10 Update
Additional research on "Natural Language Parsing" (NLP) yields results which are MUCH broader than what I feel is actually implemented in something like "Quick Add". Given that this feature expects specific types of input rather than the true free-form text, I'm thinking this is a much more narrow implementation of NLP. If anyone could suggest more narrow topic matter that I could research rather than the entire breadth of NLP, it would be greatly appreciated.
That said, I've found a nice collection of resources about NLP including this great FAQ.
I would start by deciding on a standard way to represent all the information I'm interested in: event name, start/end time (and date), guest list, location. For example, I might use an XML notation like this:
<event>
<name>meet Sam</name>
<starttime>16:30 07/06/2010</starttime>
<endtime>17:30 07/06/2010</endtime>
</event>
I'd then aim to build up a corpus of diary entries about dates, annotated with their XML forms. How would I collect the data? Well, if I was Google, I'd probably have all sorts of ways. Since I'm me, I'd probably start by writing down all the ways I could think of to express this sort of stuff, then annotating it by hand. If I could add to this by going through friends' e-mails and whatnot, so much the better.
Now that I've got a corpus, it can serve as a set of unit tests. I need to code a parser to fit the tests. The parser should translate a string of natural language into the logical form of my annotation. First, it should split the string into its constituent words. This is called tokenising, and there is off-the-shelf software available to do it. (For example, see NLTK.) To interpret the words, I would look for patterns in the data: for example, text following 'at' or 'in' should be tagged as a location; 'for X minutes' means I need to add that number of minutes to the start time to get the end time. Statistical methods would probably be overkill here - it's best to create a series of hand-coded rules that express your own knowledge of how to interpret the words, phrases and constructions in this domain.
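The two rules mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not how "Quick Add" actually works: the function name parse_event and the dictionary layout are my own, and the start time is passed in rather than parsed, since date and time parsing would need its own set of rules.

```python
import re
from datetime import datetime, timedelta


def parse_event(text, start):
    """Apply two hand-coded rules: 'for N minutes' and 'at/in <place>'."""
    event = {"name": None, "start": start, "end": None, "location": None}
    remaining = text.strip()

    # Rule 1: 'for X minutes' gives the end time relative to the start.
    duration = re.search(r"\bfor (\d+) minutes?\b", remaining)
    if duration:
        event["end"] = start + timedelta(minutes=int(duration.group(1)))
        remaining = (remaining[:duration.start()] + remaining[duration.end():]).strip()

    # Rule 2: text following 'at' or 'in' is tagged as the location.
    location = re.search(r"\b(?:at|in) (.+)$", remaining)
    if location:
        event["location"] = location.group(1).strip()
        remaining = remaining[:location.start()].strip()

    # Whatever is left over is treated as the event name.
    event["name"] = remaining
    return event


print(parse_event("meet Sam at the cafe for 60 minutes", datetime(2010, 6, 7, 16, 30)))
```

Each annotated corpus entry then becomes an assertion against this function, and every failing entry points at a rule that is missing or too greedy (for instance, 'at 4pm' would currently be tagged as a location).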
It would seem that there's really no narrow approach to this problem. I wanted to avoid having to pull along the entirety of NLP to figure out a solution, but I haven't found any alternative. I'll update this if I find a really great solution later.

How can I use NLP to parse recipe ingredients?

I need to parse recipe ingredients into amount, measurement, item, and description as applicable to the line, such as 1 cup flour, the peel of 2 lemons and 1 cup packed brown sugar, etc. What would be the best way of doing this? I am interested in using Python for the project, so I am assuming NLTK is the best bet, but I am open to other languages.
I actually do this for my website, which is now part of an open source project for others to use.
I wrote a blog post on my techniques, enjoy!
http://blog.kitchenpc.com/2011/07/06/chef-watson/
The New York Times faced this problem when they were parsing their recipe archive. They used an NLP technique called a linear-chain conditional random field (CRF). This blog post provides a good overview:
"Extracting Structured Data From Recipes Using Conditional Random Fields"
They open-sourced their code, but quickly abandoned it. I maintain the most up-to-date version of it and I wrote a bit about how I modernized it.
If you're looking for a ready-made solution, several companies offer ingredient parsing as a service:
Zestful (full disclosure: I'm the author)
Spoonacular
Edamam
I guess this is a few years out, but I was thinking of doing something similar myself and came across this, so thought I might have a stab at it in case it is useful to anyone else in f
Even though you say you want to parse free text, most recipes have a pretty standard format for their ingredient lists: each ingredient is on a separate line, and exact sentence structure is rarely all that important. The range of vocab is relatively small as well.
One way might be to check each line for words which might be nouns and words/symbols which express quantities. I think WordNet may help with seeing if a word is likely to be a noun or not, but I've not used it before myself. Alternatively, you could use http://en.wikibooks.org/wiki/Cookbook:Ingredients as a word list, though again, I wouldn't know exactly how comprehensive it is.
The other part is to recognise quantities. These come in a few different forms, but few enough that you could probably create a list of keywords. In particular, make sure you have good error reporting. If the program can't fully parse a line, get it to report back to you what that line is, along with what it has/hasn't recognised so you can adjust your keyword lists accordingly.
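Here is a rough sketch of the keyword-list idea with the error reporting built in. The unit list is a tiny made-up sample and the names is_quantity and check_line are my own; the point is only that every line reports what was and wasn't recognised.

```python
# A small sample keyword list; a real one would grow as unparsed lines turn up.
UNITS = {"cup", "cups", "tsp", "tbsp", "dash", "l", "g", "kg", "oz"}


def is_quantity(token):
    """True for whole numbers, decimals and simple fractions such as 1/2."""
    parts = token.split("/")
    return len(parts) <= 2 and all(p.replace(".", "", 1).isdigit() for p in parts)


def check_line(line):
    """Classify the tokens of one ingredient line and list what is missing."""
    tokens = line.lower().split()
    quantities = [t for t in tokens if is_quantity(t)]
    units = [t for t in tokens if t in UNITS]
    rest = [t for t in tokens if t not in quantities and t not in units]
    problems = []
    if not quantities:
        problems.append("no quantity")
    if not units:
        problems.append("no unit")
    return {"line": line, "quantities": quantities, "units": units,
            "rest": rest, "problems": problems}


for line in ["1 cup flour", "the peel of 2 lemons"]:
    report = check_line(line)
    if report["problems"]:
        print("could not fully parse:", report)
```

Running it over a whole recipe archive and reading the "could not fully parse" output is how the keyword lists get adjusted over time.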
Aaanyway, I'm not guaranteeing any of this will work (and it's almost certain not to be 100% reliable) but that's how I'd start to approach the problem
This is an incomplete answer, but you're looking at writing up a free-text parser, which as you know, is non-trivial :)
Some ways to cheat, using knowledge specific to cooking:
Construct lists of words for the "adjectives" and "verbs", and filter against them
measurement units form a closed set, using words and abbreviations like {L., c, cup, t, dash}
instructions -- cut, dice, cook, peel. Things that come after this are almost certain to be ingredients
Remember that you're mostly looking for nouns, and you can take a labeled list of non-nouns (from WordNet, for example) and filter against them.
If you're more ambitious, you can look in the NLTK Book at the chapter on parsers.
Good luck! This sounds like a mostly doable project!
Can you be more specific what your input is? If you just have input like this:
1 cup flour
2 lemon peels
1 cup packed brown sugar
It won't be too hard to parse it without using any NLP at all.
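For input that regular, something like the following sketch would probably do. It assumes every line starts with a number, optionally followed by a unit from a short made-up list; the names LINE, UNITS and parse_ingredient are hypothetical.

```python
import re

# A leading amount (integer, decimal or fraction) followed by everything else.
LINE = re.compile(r"^(?P<amount>\d+(?:[./]\d+)?)\s+(?P<rest>.+)$")
UNITS = {"cup", "cups", "tsp", "tbsp"}  # sample list, extend as needed


def parse_ingredient(line):
    """Split a line into amount, optional unit and item; None if it doesn't fit."""
    match = LINE.match(line.strip())
    if not match:
        return None
    words = match.group("rest").split()
    unit = words[0] if words[0] in UNITS else None
    if unit:
        words = words[1:]
    return {"amount": match.group("amount"), "unit": unit, "item": " ".join(words)}


for line in ["1 cup flour", "2 lemon peels", "1 cup packed brown sugar"]:
    print(parse_ingredient(line))
```

Lines such as "the peel of 2 lemons" return None here, which is the point where the simple approach stops being enough.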
