I am currently taking a compilers class and I am having a hard time understanding LR(1) parsing using the action/goto table, and also how to hand-generate these tables. Right now we are using Engineering a Compiler by Cooper and Torczon as our class textbook, and I have also read the Wikipedia pages on table generation, but I still do not understand the concepts. If possible, can anyone recommend another book that explains parsing well, or an online resource? I would think many universities would have good online resources/slides on the subject, but I have no idea where to start looking. Thanks!
The books are always hard to read because of the algorithm details. Greek symbols and abstract operations are hard to interpret unless you already know what they mean.
The way I learned how to do this was to write a tiny grammar (simple expression,
assignment statement, if-then statement, sequence of statements) and then hand-simulate the algorithm. Get a really big piece of paper. Draw the starting configuration state with just the goal symbol and the dot [ G = DOT RHS1 ... RHSM ]. Then process the unprocessed states, following the algorithm in detail; write down what each Greek symbol represents at that moment. As you gain confidence, you'll get a better feel for it and it will go faster.
Essentially what you are going to do is, for each item I
[LHS RHS1 DOT RHS2 RHS3 ... RHSN]
in a state, push the dot in the item one place to the right to produce a new item
[LHS RHS1 RHS2 DOT RHS3 ... RHSN ]
draw a new state on your paper with that item as the seed, then expand the state by taking the closure: for each item whose dot sits in front of a nonterminal, add an item for each of that nonterminal's productions, with lookaheads taken from FIRST of whatever follows that nonterminal (followed by the parent item's lookahead). Then repeat on the new unprocessed states.
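If it helps to cross-check your hand work, here is a minimal Python sketch (my own toy example, not from the book) that computes the closure of the starting LR(1) state for the classic grammar S' -> S, S -> C C, C -> c C | d:

# Toy grammar: S' -> S, S -> C C, C -> c C | d
# An LR(1) item is (lhs, rhs, dot, lookahead); rhs is a tuple of symbols.
GRAMMAR = {
    "S'": [("S",)],
    "S":  [("C", "C")],
    "C":  [("c", "C"), ("d",)],
}
NONTERMINALS = set(GRAMMAR)

def first(symbols, lookahead):
    # FIRST of the string `symbols` followed by `lookahead`.
    # No production here derives the empty string, so only the first symbol matters.
    if not symbols:
        return {lookahead}
    sym = symbols[0]
    if sym not in NONTERMINALS:
        return {sym}
    result = set()
    for rhs in GRAMMAR[sym]:
        result |= first(rhs, lookahead)
    return result

def closure(items):
    # Repeatedly expand: for [A -> alpha . B beta, a] add [B -> . gamma, b]
    # for every production B -> gamma and every b in FIRST(beta a).
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot, la) in list(items):
            if dot < len(rhs) and rhs[dot] in NONTERMINALS:
                for prod in GRAMMAR[rhs[dot]]:
                    for b in first(rhs[dot + 1:], la):
                        new_item = (rhs[dot], prod, 0, b)
                        if new_item not in items:
                            items.add(new_item)
                            changed = True
    return items

# The starting state: closure of [S' -> . S, $]
for lhs, rhs, dot, la in sorted(closure({("S'", ("S",), 0, "$")})):
    body = list(rhs)
    body.insert(dot, ".")
    print("[%s -> %s, %s]" % (lhs, " ".join(body), la))

GOTO on a symbol X is then just: collect the items with the dot in front of X, advance the dot, and take the closure of that seed set, which is exactly the by-hand procedure described above.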
This will take you several hours the first time you try it. Worth every second.
Use a pencil!
some decent lecture notes...
http://cs.oberlin.edu/~jdonalds/331/lecture14.html
Understanding and Writing Compilers has a section, "What are the True Advantages of LR(1) Analysis?"
http://www.amazon.com/Understanding-Writing-Compilers-Yourself-Macmillan/dp/0333217322
(also available freely online)
Here is a link to a decent summary, although the explanation is lacking.
http://arantxa.ii.uam.es/~modonnel/Compilers/LR1Summary.pdf
more lecture notes...
http://www.cs.umd.edu/class/spring2011/cmsc430/lectures/lec07.pdf
and notes here...
http://cobweb.ecn.purdue.edu/~smidkiff/ece495S/files/handouts/w3w4bBW.pdf
(including goto and action tables)
Sorry I can't explain personally, I'm not too sure myself. Maybe you will find a kind, more knowledgeable soul around.
I'm writing a program that takes as input a straight play in a custom format and then performs some analysis on it (like the number of lines and words for each character). It's just for fun, and a pretext for learning cool stuff.
The first step in that process is writing a parser for that format. It goes:
####Play
###Act I
##Scene 1
CHARACTER 1. Line 1, he's saying some stuff.
#Comment, stage direction
CHARACTER 2, doing some stuff. Line 2, she's saying some stuff too.
It's quite a simple format. I have read extensively about basic parser concepts like CFGs, so I am now ready to get some work done.
I have written my grammar in EBNF and started playing with flex/bison, but it raises some questions:
Is flex/bison too much for such a simple parser? Should I just write it myself, as described here: Is there an alternative for flex/bison that is usable on 8-bit embedded systems?
What is good practice regarding the respective tasks of the tokenizer and the parser itself? There is never a single solution, and for such a simple language they often overlap. This is especially true for flex/bison, where flex can perform some intense stuff with regex matching. For example, should "#" be a token? Should "####" be a token too? Should I create types that carry semantic information so I can directly identify, for example, a character? Or should I just process it with flex the simplest way and then let the grammar defined in bison decide what is what?
With flex/bison, does it make sense to perform the analysis while parsing, or is it more elegant to parse first and then operate on the file again with some other tool?
This got me really confused. I am looking for an elegant, perhaps simple solution. Any guidelines?
By the way, I don't care much about the programming language. For now I am using C because of flex/bison, but feel free to advise me on anything more practical as long as it is a widely used language.
It's very difficult to answer those questions without knowing what your parsing expectations are. That is, an example of a few lines of text does not provide a clear understanding of what the intended parse is; what the lexical and syntactic units are; what relationships you would like to extract; and so on.
However, a rough guess might be that you intend to produce a nested parse in which the number of #'s indicates the nesting level (inversely: more #'s means a shallower level), with a single # reserved for comments rather than structure. That violates one principle of language design ("don't make the user count things which the computer could count more accurately"), which might suggest a structure more like:
#play {
#act {
#scene {
#location: Elsinore. A platform before the castle.
#direction: FRANCISCO at his post. Enter to him BERNARDO
BERNARDO: Who's there?
FRANCISCO: Nay, answer me: stand, and unfold yourself.
BERNARDO: Long live the king!
FRANCISCO: Bernardo?
or even something XML-like. But that would be a different language :)
The problem with parsing either of these with a classic scanner/parser combination is that the lexical structure is inconsistent; the first token on a line is special, but most of the file consists of unparsed text. That will almost inevitably lead to spreading syntactic information between the scanner and the parser, because the scanner needs to know the syntactic context in order to decide whether or not it is scanning raw text.
You might be able to avoid that issue. For example, you might require that a continuation line start with whitespace, so that every line not otherwise marked with #'s starts with the name of a character. That would be more reliable than recognizing a dialogue line just because it starts with the name of a character and a period, since it is quite possible for a character's name to be used in dialogue, even at the end of a sentence (which consequently might be the first word in a continuation line.)
If you do intend for dialogue lines to be distinguished by the fact that they start with a character name and some punctuation then you will definitely have to give the scanner access to the character list (as a sort of symbol table), which is a well-known but not particularly respected hack.
Consider the above a reflection about your second question ("What are the roles of the scanner and the parser?"), which does not qualify as an answer but hopefully is at least food for thought. As to your other questions, and recognizing that all of this is opinionated:
Is flex/bison too much for such a simple parser? Should I just write it myself...
The fact that flex and bison are (potentially) more powerful than necessary to parse a particular language is a red herring. C is more powerful than necessary to write a factorial function -- you could easily do it in assembler -- but writing a factorial function is a good exercise in learning C. Similarly, if you want to learn how to write parsers, it's a good idea to start with a simple language; obviously, that's not going to exercise every option in the parser/scanner generators, but it will get you started. The question really is whether the language you're designing is appropriate for this style of parsing, not whether it is too simple.
With flex/bison, does it make sense to perform the analysis while parsing or is it more elegant to parse first, then operate on the file again with some other tool?
Either can be elegant, or disastrous; elegance has more to do with how you structure your thinking about the problem at hand. Having said that, it is often better to build a semantic structure (commonly referred to as an AST -- abstract syntax tree) during the parse phase and then analyse that structure using other functions.
Rescanning the input file is very unlikely to be either elegant or effective.
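To make that concrete, here is a rough Python sketch (not flex/bison, and the format details are my guesses from your example) of a single pass that classifies each line, builds a simple structure, and then runs the line/word counts over that structure rather than over the file:

import re
from collections import defaultdict

# Hypothetical character list -- in practice you'd have to collect this from the
# text itself, which is exactly the "scanner needs the symbol table" issue above.
CHARACTERS = {"CHARACTER 1", "CHARACTER 2"}
SPEAKER = re.compile(r"^([A-Z0-9 ]+?)\s*[.,]\s*(.*)$")

def parse_play(lines):
    # One pass over the input: classify each line and build a simple structure.
    headings, dialogue = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("#"):
            # '####' play, '###' act, '##' scene, '#' comment / stage direction
            level = len(line) - len(line.lstrip("#"))
            headings.append((level, line.lstrip("#").strip()))
            continue
        m = SPEAKER.match(line)
        if m and m.group(1).strip() in CHARACTERS:
            dialogue.append([m.group(1).strip(), m.group(2)])
        elif dialogue:
            # Unmarked line: treat it as a continuation of the previous speech.
            dialogue[-1][1] += " " + line.strip()
    return headings, dialogue

def stats(dialogue):
    # The analysis runs over the built structure, not over the raw file again.
    lines_per, words_per = defaultdict(int), defaultdict(int)
    for speaker, text in dialogue:
        lines_per[speaker] += 1
        words_per[speaker] += len(text.split())
    return dict(lines_per), dict(words_per)

Whether you do this by hand or with flex/bison, the interesting design decision is the same: what the line classifier (the scanner) needs to know before the grammar ever sees a token.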
I'm making an application using a dependency tree parser. Actually, the parser is this one:
the Stanford Parser. However, it occasionally changes one or two letters of some words in a sentence that I want to parse. This is a big problem for me, because I can't see any pattern in these changes, and I need the dependency tree to use the same words as my sentence.
All I can see is that only some words have these problems. I'm working with a database of tweets, so I have a lot of grammar mistakes in this data. For example, the hashtag '#AllAmericanhumour' becomes AllAmericanhumor. It is missing one letter (u).
Is there anything I can do to solve this problem? My first thought was to use an edit distance algorithm, but I think there might be an easier way to do it.
Thanks everybody in advance
You can give options to the tokenizer with the -tokenize.options flag/property. For this particular normalization, you can turn it off with
-tokenize.options americanize=false
There are also various other normalizations that you can turn off (see PTBTokenizer or http://nlp.stanford.edu/software/tokenizer.shtml). You can turn off a lot with
-tokenize.options ptb3Escaping=false
However, the parser is trained on data that looks like the output of ptb3Escaping=true and so will tend to degrade in performance if used with unnormalized tokens. So, you may want to consider alternative strategies.
If you're working at the Java level, you can look at the word tokens, which are actually Maps, and they have various keys. OriginalTextAnnotation will give you the unnormalized token, even when it has been normalized. CharacterOffsetBeginAnnotation and CharacterOffsetEndAnnotation will map to character offsets into the text.
p.s. And you should accept some answers :-).
I am searching for information on algorithms to process text sentences or to follow a structure when creating sentences that are valid in a normal human language such as English. I would like to know if there are projects working in this field that I can go learn from or start using.
For example, if I gave a program a noun, provided it with a thesaurus (for related words) and part-of-speech (so it understood where each word belonged in a sentence) - could it create a random, valid sentence?
I'm sure there are many sub-sections of this kind of research so any leads into this would be great.
The field you're looking for is called natural language generation, a subfield of natural language processing:
http://en.wikipedia.org/wiki/Natural_language_processing
Sentence generation is either really easy or really hard depending on how good you want the sentences to be. Currently, there aren't programs that will be able to generate 100% sensible sentences about given nouns (even with a thesaurus) -- if that is what you mean.
If, on the other hand, you would be satisfied with nonsense that is sometimes ungrammatical, then you could try an n-gram based sentence generator. These just chain together words that tend to appear in sequence, and 3- or 4-gram generators can look quite okay sometimes (although you'll recognize them as what generates a lot of spam email).
Here's an intro to the basics of n-gram based generation, using NLTK:
http://www.nltk.org/book/ch02.html#generating-random-text-with-bigrams
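The example in that chapter boils down to something along these lines (this is my condensed version; it needs nltk installed and a one-time nltk.download('genesis')):

import nltk
from nltk.corpus import genesis

# Build a conditional frequency distribution over word bigrams from the corpus.
words = genesis.words('english-kjv.txt')
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

def generate(seed='living', length=15):
    # Greedily follow the most likely next word; swap .max() for random
    # sampling if you want more variety (the greedy version loops quickly).
    out, word = [], seed
    for _ in range(length):
        out.append(word)
        word = cfd[word].max()
    return ' '.join(out)

print(generate())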
This is called NLG (Natural Language Generation), although that is mainly the task of generating text that describes a set of data. There is also a lot of research on completely random sentence generation.
One starting point is to use Markov chains to generate sentences. The way this is done is that you have a transition matrix that says how likely it is to transition between every pair of parts of speech. You also have the most likely starting and ending parts of speech of a sentence. Put this all together and you can generate likely sequences of parts of speech.
Now, you are far from done. This will, first of all, not offer a very good result, as you are only considering the probability between adjacent words (also called bigrams). What you want to do is extend this to look, for instance, at the transition matrix between three parts of speech (this makes a 3D matrix and gives you trigrams). You can extend it to 4-grams, 5-grams, etc., depending on the processing power available and on whether your corpus can fill such a matrix.
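A minimal sketch of the bigram-level idea (the transition probabilities and word lists here are made up; a real model would estimate them from a tagged corpus):

import random

# Made-up transition probabilities between parts of speech; '^' and '$'
# mark sentence start and end.
TRANSITIONS = {
    '^':    {'DET': 0.7, 'NOUN': 0.3},
    'DET':  {'NOUN': 0.8, 'ADJ': 0.2},
    'ADJ':  {'NOUN': 1.0},
    'NOUN': {'VERB': 0.6, '$': 0.4},
    'VERB': {'DET': 0.5, 'NOUN': 0.3, '$': 0.2},
}
WORDS = {
    'DET': ['the', 'a'],
    'ADJ': ['red', 'small'],
    'NOUN': ['cat', 'dog', 'house'],
    'VERB': ['sees', 'likes'],
}

def sample(dist):
    # Pick a key from `dist` with probability proportional to its value.
    r, total = random.random(), 0.0
    for tag, p in dist.items():
        total += p
        if r <= total:
            return tag
    return tag  # fall through on floating-point rounding

def generate():
    tags, tag = [], '^'
    while True:
        tag = sample(TRANSITIONS[tag])
        if tag == '$':
            break
        tags.append(tag)
    return ' '.join(random.choice(WORDS[t]) for t in tags)

print(generate())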
Lastly, you need to patch up things such as agreement (subject-verb agreement, adjective-noun agreement (not in English, though), etc.) and tense, so that everything is congruent.
Yes. There is some work dealing with solving problems in NLG with AI techniques. As far as I know, there is currently no method that is ready for practical use.
If you have the background, I suggest getting familiar with some work by Alexander Koller from Saarland University. He describes how to encode NLG problems in PDDL. The main article you'll want to read is "Sentence generation as a planning problem".
If you do not have any background in NLP, just search for the online courses or course materials by Michael Collins or Dan Jurafsky.
Writing random sentences is not that hard. Any parser textbook's simple-english-grammar example can be run in reverse to generate grammatically correct nonsense sentences.
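For example, with a toy grammar (my own, not from any particular textbook), random expansion of nonterminals is only a few lines:

import random

# A toy grammar; expanding 'S' at random yields grammatical nonsense.
GRAMMAR = {
    'S':   [['NP', 'VP']],
    'NP':  [['Det', 'N'], ['Det', 'Adj', 'N']],
    'VP':  [['V', 'NP'], ['V']],
    'Det': [['the'], ['a']],
    'Adj': [['green'], ['furious']],
    'N':   [['idea'], ['compiler'], ['sentence']],
    'V':   [['sleeps'], ['parses'], ['eats']],
}

def expand(symbol='S'):
    # Terminals are any symbols without productions of their own.
    if symbol not in GRAMMAR:
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    return [word for sym in production for word in expand(sym)]

print(' '.join(expand()))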
Another way is the word-tuple-random-walk, made popular by the old BYTE magazine TRAVESTY, or stuff like
http://www.perlmonks.org/index.pl?node_id=94856
I've studied some simple semantic network implementations and basic techniques for parsing natural language. However, I haven't seen many projects that try and bridge the gap between the two.
For example, consider the dialog:
"the man has a hat"
"he has a coat"
"what does he have?" => "a hat and coat"
A simple semantic network, based on the grammar tree parsing of the above sentences, might look like:
the_man = Entity('the man')
has = Entity('has')
a_hat = Entity('a hat')
a_coat = Entity('a coat')
Relation(the_man, has, a_hat)
Relation(the_man, has, a_coat)
print(the_man.relations(has))  # => ['a hat', 'a coat']
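(For this pseudocode to actually run, you'd need Entity and Relation definitions along these lines; a minimal sketch, purely for illustration:)

class Entity:
    def __init__(self, name):
        self.name = name
        self._edges = []  # (relation_entity, target_entity) pairs

    def relations(self, relation):
        return [target.name for rel, target in self._edges if rel is relation]

class Relation:
    def __init__(self, source, relation, target):
        # Registering the relation on the source entity is enough for the example.
        source._edges.append((relation, target))
        self.source, self.relation, self.target = source, relation, target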
However, this implementation assumes the prior knowledge that the text segments "the man" and "he" refer to the same network entity.
How would you design a system that "learns" these relationships between segments of a semantic network? I'm used to thinking about ML/NLP problems in terms of creating a simple training set of attribute/value pairs and feeding it to a classification or regression algorithm, but I'm having trouble formulating this problem that way.
Ultimately, it seems I would need to overlay probabilities on top of the semantic network, but that would drastically complicate an implementation. Is there any prior art along these lines? I've looked at a few libraries, like NLTK and OpenNLP, and while they have decent tools to handle symbolic logic and parse natural language, neither seems to have any kind of probabilistic framework for converting one to the other.
There is quite a lot of history behind this kind of task. Your best start is probably by looking at Question Answering.
The general advice I always give is that if you have some highly restricted domain where you know about all the things that might be mentioned and all the ways they interact then you can probably be quite successful. If this is more of an 'open-world' problem then it will be extremely difficult to come up with something that works acceptably.
The task of extracting relationships from natural language is called 'relationship extraction' (funnily enough), and sometimes fact extraction. This is a pretty large field of research; this guy did a PhD thesis on it, as have many others. There are a large number of challenges here, as you've noticed, like entity detection, anaphora resolution, etc. This means that there will probably be a lot of 'noise' in the entities and relationships you extract.
As for representing facts that have been extracted in a knowledge base, most people tend not to use a probabilistic framework. At the simplest level, entities and relationships are stored as triples in a flat table. Another approach is to use an ontology to add structure and allow reasoning over the facts. This makes the knowledge base vastly more useful, but adds a lot of scalability issues. As for adding probabilities, I know of the Prowl project that is aimed at creating a probabilistic ontology, but it doesn't look very mature to me.
There is some research into probabilistic relational modelling, mostly into Markov Logic Networks at the University of Washington and Probabilistic Relational Models at Stanford and other places. I'm a little out of touch with the field, but this is a difficult problem and it's all early-stage research as far as I know. There are a lot of issues, mostly around efficient and scalable inference.
All in all, it's a good idea and a very sensible thing to want to do. However, it's also very difficult to achieve. If you want to look at a slick example of the state of the art, (i.e. what is possible with a bunch of people and money) maybe check out PowerSet.
Interesting question, I've been doing some work on a strongly-typed NLP engine in C#: http://blog.abodit.com/2010/02/a-strongly-typed-natural-language-engine-c-nlp/ and have recently begun to connect it to an ontology store.
To me it looks like the issue here is really: How do you parse the natural language input to figure out that 'He' is the same thing as "the man"? By the time it's in the Semantic Network it's too late: you've lost the fact that statement 2 followed statement 1 and the ambiguity in statement 2 can be resolved using statement 1. Adding a third relation after the fact to say that "He" and "the man" are the same is another option but you still need to understand the sequence of those assertions.
Most NLP parsers seem to focus on parsing single sentences or large blocks of text but less frequently on handling conversations. In my own NLP engine there's a conversation history which allows one sentence to be understood in the context of all the sentences that came before it (and also the parsed, strongly-typed objects that they referred to). So the way I would handle this is to realize that "He" is ambiguous in the current sentence and then look back to try to figure out who the last male person was that was mentioned.
In the case of my home, for example, it might tell you that you missed a call from a number that's not in its database. You can type "It was John Smith" and it can figure out that "It" means the call that was just mentioned to you. But if you typed "Tag it as Party Music" right after the call, it would still resolve "it" to the song that's currently playing, because the house is looking back for something that is ITaggable.
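A stripped-down version of that "look back through the conversation" idea might look like this (the Mention type and its attributes are hypothetical, just to show the shape of it; my engine itself is C# and does considerably more):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Mention:
    text: str
    gender: Optional[str] = None  # 'male', 'female', or None for things

history = []  # mentions in the order they appeared in the conversation

def resolve(pronoun):
    # Walk the conversation history backwards for the most recent compatible mention.
    wanted = {"he": "male", "him": "male", "she": "female", "her": "female"}
    for mention in reversed(history):
        if pronoun in wanted and mention.gender == wanted[pronoun]:
            return mention
        if pronoun == "it" and mention.gender is None:
            return mention
    return None

# "the man has a hat" / "he has a coat"
history.append(Mention("the man", gender="male"))
print(resolve("he").text)  # -> the man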
I'm not exactly sure if this is what you want, but take a look at natural language generation on Wikipedia; it is the "reverse" of parsing, constructing derivations that conform to the given semantic constraints.
I need to parse recipe ingredients into amount, measurement, item, and description as applicable to the line, such as "1 cup flour", "the peel of 2 lemons", and "1 cup packed brown sugar", etc. What would be the best way of doing this? I am interested in using Python for the project, so I am assuming NLTK is the best bet, but I am open to other languages.
I actually do this for my website, which is now part of an open source project for others to use.
I wrote a blog post on my techniques, enjoy!
http://blog.kitchenpc.com/2011/07/06/chef-watson/
The New York Times faced this problem when they were parsing their recipe archive. They used an NLP technique called linear-chain conditional random fields (CRF). This blog post provides a good overview:
"Extracting Structured Data From Recipes Using Conditional Random Fields"
They open-sourced their code, but quickly abandoned it. I maintain the most up-to-date version of it and I wrote a bit about how I modernized it.
If you're looking for a ready-made solution, several companies offer ingredient parsing as a service:
Zestful (full disclosure: I'm the author)
Spoonacular
Edamam
I guess this is a few years late, but I was thinking of doing something similar myself and came across this, so I thought I might have a stab at it in case it is useful to anyone else in future.
Even though you say you want to parse free text, most recipes have a pretty standard format for their ingredient lists: each ingredient is on a separate line, and the exact sentence structure is rarely all that important.
One way might be to check each line for words which might be nouns and words/symbols which express quantities. I think WordNet may help with seeing if a word is likely to be a noun or not, but I've not used it before myself. Alternatively, you could use http://en.wikibooks.org/wiki/Cookbook:Ingredients as a word list, though again, I wouldn't know exactly how comprehensive it is.
The other part is to recognise quantities. These come in a few different forms, but few enough that you could probably create a list of keywords. In particular, make sure you have good error reporting. If the program can't fully parse a line, get it to report back to you what that line is, along with what it has/hasn't recognised so you can adjust your keyword lists accordingly.
Aaanyway, I'm not guaranteeing any of this will work (and it's almost certain not to be 100% reliable), but that's how I'd start to approach the problem.
This is an incomplete answer, but you're looking at writing up a free-text parser, which as you know, is non-trivial :)
Some ways to cheat, using knowledge specific to cooking:
Construct lists of words for the "adjectives" and "verbs", and filter against them
measurement units form a closed set, using words and abbreviations like {L., c, cup, t, dash}
instructions -- cut, dice, cook, peel. Things that come after this are almost certain to be ingredients
Remember that you're mostly looking for nouns, and you can take a labeled list of non-nouns (from WordNet, for example) and filter against them.
If you're more ambitious, you can look in the NLTK Book at the chapter on parsers.
Good luck! This sounds like a mostly doable project!
Can you be more specific about what your input is? If you just have input like this:
1 cup flour
2 lemon peels
1 cup packed brown sugar
It won't be too hard to parse it without using any NLP at all.
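For instance, something as blunt as this sketch (a quantity, an optional unit from a small closed set, then the rest of the line) already handles those three lines, and you could fall back to fancier techniques only for the lines it rejects:

import re
from fractions import Fraction

UNITS = {'cup', 'cups', 'tsp', 'tbsp', 'teaspoon', 'tablespoon', 'oz', 'lb', 'g', 'ml'}
LINE = re.compile(r'^\s*(\d+(?:/\d+)?)\s+(\w+)\s+(.*)$')

def parse_ingredient(line):
    m = LINE.match(line)
    if not m:
        return None  # report these lines back to the user instead of guessing
    amount, maybe_unit, rest = m.groups()
    if maybe_unit.lower() in UNITS:
        return {'amount': Fraction(amount), 'unit': maybe_unit, 'item': rest}
    # No recognised unit: treat the second word as part of the item ("2 lemon peels").
    return {'amount': Fraction(amount), 'unit': None, 'item': maybe_unit + ' ' + rest}

for line in ['1 cup flour', '2 lemon peels', '1 cup packed brown sugar']:
    print(parse_ingredient(line))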