I need to parse recipe ingredients into amount, measurement, item, and description as applicable to the line, such as 1 cup flour, the peel of 2 lemons and 1 cup packed brown sugar etc. What would be the best way of doing this? I am interested in using python for the project so I am assuming using the nltk is the best bet but I am open to other languages.
I actually do this for my website, which is now part of an open source project for others to use.
I wrote a blog post on my techniques, enjoy!
http://blog.kitchenpc.com/2011/07/06/chef-watson/
The New York Times faced this problem when they were parsing their recipe archive. They used an NLP technique called linear-chain condition random field (CRF). This blog post provides a good overview:
"Extracting Structured Data From Recipes Using Conditional Random Fields"
They open-sourced their code, but quickly abandoned it. I maintain the most up-to-date version of it and I wrote a bit about how I modernized it.
If you're looking for a ready-made solution, several companies offer ingredient parsing as a service:
Zestful (full disclosure: I'm the author)
Spoonacular
Edamam
I guess this is a few years out, but I was thinking of doing something similar myself and came across this, so thought I might have a stab at it in case it is useful to anyone else in f
Even though you say you want to parse free test, most recipes have a pretty standard format for their recipe lists: each ingredient is on a separate line, exact sentence structure is rarely all that important. The range of vocab is relatively small as well.
One way might be to check each line for words which might be nouns and words/symbols which express quantities. I think WordNet may help with seeing if a word is likely to be a noun or not, but I've not used it before myself. Alternatively, you could use http://en.wikibooks.org/wiki/Cookbook:Ingredients as a word list, though again, I wouldn't know exactly how comprehensive it is.
The other part is to recognise quantities. These come in a few different forms, but few enough that you could probably create a list of keywords. In particular, make sure you have good error reporting. If the program can't fully parse a line, get it to report back to you what that line is, along with what it has/hasn't recognised so you can adjust your keyword lists accordingly.
Aaanyway, I'm not guaranteeing any of this will work (and it's almost certain not to be 100% reliable) but that's how I'd start to approach the problem
This is an incomplete answer, but you're looking at writing up a free-text parser, which as you know, is non-trivial :)
Some ways to cheat, using knowledge specific to cooking:
Construct lists of words for the "adjectives" and "verbs", and filter against them
measurement units form a closed set, using words and abbreviations like {L., c, cup, t, dash}
instructions -- cut, dice, cook, peel. Things that come after this are almost certain to be ingredients
Remember that you're mostly looking for nouns, and you can take a labeled list of non-nouns (from WordNet, for example) and filter against them.
If you're more ambitious, you can look in the NLTK Book at the chapter on parsers.
Good luck! This sounds like a mostly doable project!
Can you be more specific what your input is? If you just have input like this:
1 cup flour
2 lemon peels
1 cup packed brown sugar
It won't be too hard to parse it without using any NLP at all.
Related
I am looking to write a basic profanity filter in a Rails based application. This will use a simply search and replace mechanism whenever the appropriate attribute gets submitted by a user. My question is, for those who have written these before, is there a CSV file or some database out there where a list of profanity words can be imported into my database? We are submitting the words that we will replace the profanities with on our own. We more or less need a database of profanities, racial slurs and anything that's not exactly rated PG-13 to get triggered.
As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc). CleanSpeak is capable of filtering 20,000 messages per second on a low end server, so it is possible to build something that works well and performs well. I will mention that CleanSpeak is the result of about 3 years of on-going development though.
There are a few things I tell everyone that is looking to try and tackle a language filter.
Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
Decide what type of performance you need and whether or not you can make multiple passes on the String. The more passes you make the slow your filter will be.
Understand the scunthrope and clbuttic problems and determine how you will handle these. This usually requires some form of language intelligence and whitelisting.
Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this)
Be careful with your handling of punctuation because it can be used to get around the filter (l.i.k.e th---is)
Understand how people use ascii art and unicode to replace characters (/ = v - those are slashes). There are a lot of unicode characters that look like English characters and you will want to handle those appropriately.
Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
You can search around StackOverflow for my comments on other threads as I might have more information on those threads that I've forgotten here.
Here's one you could use: Offensive/Profane Word List from CMU site
Based on personal experience, you do understand that it's an exercise in futility?
If someone wants to inject profanity, there's a slew of words that are innocent in one context, and profane in another so you'll have to write a context parser to avoid black-listing clean words. A quick glance at CMU's list shows words I'd never consider rude/crude/socially unacceptable. You'll see there are many words that could be proper names or nouns, countries, terms of endearment, etc. And, there are myriads of ways to throw your algorithm off using L33T speak and such. Search Wikipedia and the internets and you can build tables of variations of letters.
Look at CMU's list and imagine how long the list would be if, in addition to the correct letter, every a could also be 4, o could be 0 or p, e could be 3, s could be 5. And, that's a very, very, short example.
I was asked to do a similar task and wrote code to generate L33T variations of the words, and generated a hit-list of words based on several profanity/offensive lists available on the internet. After running the generator, and being a little over 1/4 of the way through the file, I had over one million entries in my DB. I pulled the plug on the project at that point, because the time spent searching, even using Perl's Regex::Assemble, was going to be ridiculous, especially since it'd still be so easy to fool.
I recommend you have a long talk with whoever requested that, and ask if they understand the programming issues involved, and low-likelihood of accuracy and success, especially over the long-term, or the possible customer backlash when they realize you're censoring them.
I have one that I've added to (obfuscated a bit) but here it is: https://github.com/rdp/sensible-cinema/blob/master/lib/subtitle_profanity_finder.rb
I am searching for information on algorithms to process text sentences or to follow a structure when creating sentences that are valid in a normal human language such as English. I would like to know if there are projects working in this field that I can go learn from or start using.
For example, if I gave a program a noun, provided it with a thesaurus (for related words) and part-of-speech (so it understood where each word belonged in a sentence) - could it create a random, valid sentence?
I'm sure there are many sub-sections of this kind of research so any leads into this would be great.
The field you're looking for is called natural language generation, a subfield of natural language processing
http://en.wikipedia.org/wiki/Natural_language_processing
Sentence generation is either really easy or really hard depending on how good you want the sentences to be. Currently, there aren't programs that will be able to generate 100% sensible sentences about given nouns (even with a thesaurus) -- if that is what you mean.
If, on the other hand, you would be satisfied with nonsense that was sometimes ungrammatical, then you could try an n-gram based sentence generator. These just chain together of words that tend to appear in sequence, and 3-4-gram generators look quite okay sometimes (although you'll recognize them as what generates a lot of spam email).
Here's an intro to the basics of n-gram based generation, using NLTK:
http://www.nltk.org/book/ch02.html#generating-random-text-with-bigrams
This is called NLG (Natural Language Generation), although that is mainly the task of generating text that describes a set of data. There is also a lot of research on completely random sentence generation as well.
One starting point is to use Markov chains to generate sentences. How this is done is that you have a transition matrix that says how likely it is to transition between every every part-of-speech. You also have the most likely starting and ending part-of-speech of a sentence. Put this all together and you can generate likely sequences of parts-of-speech.
Now, you are far from done, this will first of all not offer a very good result as you are only considering the probability between adjacent words (also called bi-grams), so what you want to do is to extend this to look for instance at the transition matrix between three parts-of-speech (this makes a 3D matrix and gives you trigrams). You can extend it to 4-grams, 5-grams, etc. depending on the processing power and if your corpus can fill such matrix.
Lastly, you need to patch up things such as object agreement (subject-verb-agreement, adjective-verb-agreement (not in English though), etc.) and tense, so that everything is congruent.
Yes. There is some work dealing with solving problems in NLG with AI techniques. As far as I know, currently, there is no method that you can use for any practical use.
If you have the background, I suggest getting familiar with some work by Alexander Koller from Saarland University. He describes how to code NLG to PDDL. The main article you'll want to read is "Sentence generating as a planning problem".
If you do not have any background in NLP, just search for the online courses or course materials by Michael Collings or Dan Jurafsky.
Writing random sentences is not that hard. Any parser textbook's simple-english-grammar example can be run in reverse to generate grammatically correct nonsense sentences.
Another way is the word-tuple-random-walk, made popular by the old BYTE magazine TRAVESTY, or stuff like
http://www.perlmonks.org/index.pl?node_id=94856
I've studied some simple semantic network implementations and basic techniques for parsing natural language. However, I haven't seen many projects that try and bridge the gap between the two.
For example, consider the dialog:
"the man has a hat"
"he has a coat"
"what does he have?" => "a hat and coat"
A simple semantic network, based on the grammar tree parsing of the above sentences, might look like:
the_man = Entity('the man')
has = Entity('has')
a_hat = Entity('a hat')
a_coat = Entity('a coat')
Relation(the_man, has, a_hat)
Relation(the_man, has, a_coat)
print the_man.relations(has) => ['a hat', 'a coat']
However, this implementation assumes the prior knowledge that the text segments "the man" and "he" refer to the same network entity.
How would you design a system that "learns" these relationships between segments of a semantic network? I'm used to thinking about ML/NL problems based on creating a simple training set of attribute/value pairs, and feeding it to a classification or regression algorithm, but I'm having trouble formulating this problem that way.
Ultimately, it seems I would need to overlay probabilities on top of the semantic network, but that would drastically complicate an implementation. Is there any prior art along these lines? I've looked at a few libaries, like NLTK and OpenNLP, and while they have decent tools to handle symbolic logic and parse natural language, neither seems to have any kind of proabablilstic framework for converting one to the other.
There is quite a lot of history behind this kind of task. Your best start is probably by looking at Question Answering.
The general advice I always give is that if you have some highly restricted domain where you know about all the things that might be mentioned and all the ways they interact then you can probably be quite successful. If this is more of an 'open-world' problem then it will be extremely difficult to come up with something that works acceptably.
The task of extracting relationship from natural language is called 'relationship extraction' (funnily enough) and sometimes fact extraction. This is a pretty large field of research, this guy did a PhD thesis on it, as have many others. There are a large number of challenges here, as you've noticed, like entity detection, anaphora resolution, etc. This means that there will probably be a lot of 'noise' in the entities and relationships you extract.
As for representing facts that have been extracted in a knowledge base, most people tend not to use a probabilistic framework. At the simplest level, entities and relationships are stored as triples in a flat table. Another approach is to use an ontology to add structure and allow reasoning over the facts. This makes the knowledge base vastly more useful, but adds a lot of scalability issues. As for adding probabilities, I know of the Prowl project that is aimed at creating a probabilistic ontology, but it doesn't look very mature to me.
There is some research into probabilistic relational modelling, mostly into Markov Logic Networks at the University of Washington and Probabilstic Relational Models at Stanford and other places. I'm a little out of touch with the field, but this is is a difficult problem and it's all early-stage research as far as I know. There are a lot of issues, mostly around efficient and scalable inference.
All in all, it's a good idea and a very sensible thing to want to do. However, it's also very difficult to achieve. If you want to look at a slick example of the state of the art, (i.e. what is possible with a bunch of people and money) maybe check out PowerSet.
Interesting question, I've been doing some work on a strongly-typed NLP engine in C#: http://blog.abodit.com/2010/02/a-strongly-typed-natural-language-engine-c-nlp/ and have recently begun to connect it to an ontology store.
To me it looks like the issue here is really: How do you parse the natural language input to figure out that 'He' is the same thing as "the man"? By the time it's in the Semantic Network it's too late: you've lost the fact that statement 2 followed statement 1 and the ambiguity in statement 2 can be resolved using statement 1. Adding a third relation after the fact to say that "He" and "the man" are the same is another option but you still need to understand the sequence of those assertions.
Most NLP parsers seem to focus on parsing single sentences or large blocks of text but less frequently on handling conversations. In my own NLP engine there's a conversation history which allows one sentence to be understood in the context of all the sentences that came before it (and also the parsed, strongly-typed objects that they referred to). So the way I would handle this is to realize that "He" is ambiguous in the current sentence and then look back to try to figure out who the last male person was that was mentioned.
In the case of my home for example, it might tell you that you missed a call from a number that's not in its database. You can type "It was John Smith" and it can figure out that "It" means the call that was just mentioned to you. But if you typed "Tag it as Party Music" right after the call it would still resolve to the song that's currently playing because the house is looking back for something that is ITaggable.
I'm not exactly sure if this is what you want, but take a look at natural language generation wikipedia, the "reverse" of parsing, constructing derivations that conform to the given semantical constraints.
Am thinking about a project which might use similar functionality to how "Quick Add" handles parsing natural language into something that can be understood with some level of semantics. I'm interested in understanding this better and wondered what your thoughts were on how this might be implemented.
If you're unfamiliar with what "Quick Add" is, check out Google's KB about it.
6/4/10 Update
Additional research on "Natural Language Parsing" (NLP) yields results which are MUCH broader than what I feel is actually implemented in something like "Quick Add". Given that this feature expects specific types of input rather than the true free-form text, I'm thinking this is a much more narrow implementation of NLP. If anyone could suggest more narrow topic matter that I could research rather than the entire breadth of NLP, it would be greatly appreciated.
That said, I've found a nice collection of resources about NLP including this great FAQ.
I would start by deciding on a standard way to represent all the information I'm interested in: event name, start/end time (and date), guest list, location. For example, I might use an XML notation like this:
<event>
<name>meet Sam</name>
<starttime>16:30 07/06/2010</starttime>
<endtime>17:30 07/06/2010</endtime>
</event>
I'd then aim to build up a corpus of diary entries about dates, annotated with their XML forms. How would I collect the data? Well, if I was Google, I'd probably have all sorts of ways. Since I'm me, I'd probably start by writing down all the ways I could think of to express this sort of stuff, then annotating it by hand. If I could add to this by going through friends' e-mails and whatnot, so much the better.
Now I've got a corpus, it can serve as a set of unit tests. I need to code a parser to fit the tests. The parser should translate a string of natural language into the logical form of my annotation. First, it should split the string into its constituent words. This is is called tokenising, and there is off-the-shelf software available to do it. (For example, see NLTK.) To interpret the words, I would look for patterns in the data: for example, text following 'at' or 'in' should be tagged as a location; 'for X minutes' means I need to add that number of minutes to the start time to get the end time. Statistical methods would probably be overkill here - it's best to create a series of hand-coded rules that express your own knowledge of how to interpret the words, phrases and constructions in this domain.
It would seem that there's really no narrow approach to this problem. I wanted to avoid having to pull along the entirety of NLP to figure out a solution, but I haven't found any alternative. I'll update this if I find a really great solution later.
Here's the problem -- I have a few thousand small text snippets, anywhere from a few words to a few sentences - the largest snippet is about 2k on disk. I want to be able to compare each to each, and calculate a relatedness factor so that I can show users related information.
What are some good ways to do this? Are there known algorithms for doing this that are any good, are there any GPL'd solutions, etc?
I don't need this to run in realtime, as I can precalculate everything. I'm more concerned with getting good results than runtime.
I just thought I would ask the Stack Overflow community before going and writing my own thing. There HAVE to be people out there who have found good solutions to this before.
These articles on semantic relatedness and semantic similarity may be helpful. And this SO question about Latent Semantic Analysis.
You could also look into Soundex for words that "sound alike" phonetically.
I've never used it, but you might want to look into Levenshtein distance
Jeff talked about something like this on the pod cast to find the Related questions listed on the right side here. (in podcast 32)
One big tip was to remove all common words, like "the" "and" "this" etc. This will leave you with more meaningful words to compare.
And here is a similar question Is there an algorithm that tells the semantic similarity of two phrases
This is quite doable for reasonable large texts, however harder for smaller texts.
I did it once like this, and it worked pretty well:
Filter all "general" words (like a, an, the, in, etc...) (filters about 10-30% of the words)
Count the frequencies of the remaining words, store the top x of most frequent words, these are your topics.
As an extra step you can create groups of 2/3/4 subsequent words and compare them with the groups in other texts. I used it as a measure for plagerism.
See Manning and Raghavan course notes about MinHashing and searching for similar items, and a C#(?) version. I believe the techniques come from Ullman and Motwani's research.
This book may be relevant.
Edit: here is a related SO question
Phonetic algorithms
The article, Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, shows how to install and use the SimMetrics library into SQL Server. This library lets you find relative similarity between strings and includes numerous algorithms.
I ended up mostly using Jaro Winkler to match on names. Here's more information where I asked about matching names on SO: Matching records based on Person Name
A few algorithms based on Levenshtein Distance are also available in the SimMetric library and would probably be useful in your application.