While many questions here ask for help resolving reduce-reduce conflicts, I don't have any of those; I am actually asking for your help finding some.
I am writing documentation and exercises about LR(1) parser conflicts. While I could find some interesting shift-reduce conflicts, such as operator precedence or the dangling-else ambiguity, I cannot find examples of reduce-reduce conflicts that are both subtle and arise with as few rules as possible.
For a change, could you help me find problems instead of solutions?
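To get you started, here are two classics in yacc/Bison-style notation (the rule names are mine). The minimal, if blunt, form needs three rules: two nonterminals with identical right-hand sides, so after shifting x the parser cannot decide which reduction to perform:

```
s : a | b ;
a : 'x' ;      /* reduce-reduce: at end of input after 'x',  */
b : 'x' ;      /* reduce by a -> 'x' or by b -> 'x'?         */
```

A subtler one, adapted from the Bison manual's discussion of reduce/reduce conflicts, comes from defining "possibly empty sequence" in two overlapping ways; empty input can be reduced either directly to an empty sequence or first to an empty maybeword, and a lone 'word' is similarly ambiguous:

```
sequence  : /* empty */
          | maybeword
          | sequence 'word'
          ;
maybeword : /* empty */
          | 'word'
          ;
```

What makes the second one subtle is that each rule looks innocent on its own; the conflict only appears because the two "empty" derivations compete under the same lookahead.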
I am trying to write a parser that can analyze mixed languages and generate an AST from them. I first tried to build it from scratch on my own in Java and failed, because this is quite a hard topic for a parser beginner. Then I googled and found http://www2.cs.tum.edu/projects/cup/examples.php and JFlex.
The question now is: what is the best way to do it?
For example, I have a code file that contains several tags, JS code, and some $CMS_SET(x,y)$ code. Is the best way to solve this to define a grammar for all those things in CUP and let CUP generate a parser, based on my grammar, that can analyze those mixed-language files and generate an AST of them?
Thanks for all helpful answers. :)
EDIT: I need to do it in Java...
This topic is quite hard even for an expert in this area, which I consider myself to be; check my bio.
The first issue is to build individual parsers for each sublanguage. The first thing you will discover is that defining parsers for specific languages is actually hard; you can read the endless list of SO requests for "can I get a parser for X" or "how do I fix my parser for X". Mostly I think these requests end up not going anywhere: the parsing engines aren't very good, you have to twist the grammars and the parsers to make them work on real languages, there isn't any such thing as "pure HTML", the standards documents disagree, and your customer always has some twist in his code that you're not prepared for. Finally, there are the glitches related to character set encodings, variations in newline endings, and preprocessors that complicate the parsing problem. The C++ preprocessor is a lot more complex than you might think, and you have to get this right. The easiest way to defeat this problem is to find some parser generator with the languages already predefined. ANTLR has a bunch for the now-deprecated ANTLR3; there's no assurance these parsers are robust, let alone compatible for your purposes.
CUP isn't a particularly helpful parser generator; none of the LL(x) or LALR(x) parser generators are really helpful, because no real language matches the categories of things they can parse. The consequence: an endless stream of requests (on SO!) for help "resolving my shift-reduce conflict" or "eliminating right recursion". The only parser generator that has, IMHO, stood the test of time is a GLR parser generator (I hear good things about GLL, but that's pretty recent). We've done 40+ languages with one GLR parser generator, including production IBM COBOL, full C++14, and Java 8.
Your second problem will be building ASTs. You can hand-code the AST-building process, but that gets old fast when you have to change the grammars often and/or you have many grammars, as you are effectively contemplating. This you can beat your way through with sweat. (We chose to push the problem of building ASTs into the parser so we didn't have to put any extra energy into it when building a grammar; to do this, your parser engine has to offer you this help, and none of the mainstream ones do.)
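For concreteness, "hand-coding the AST-building process" in CUP means attaching a semantic action to every production, along these lines (a hypothetical fragment; the Plus and Num node classes are assumed and not part of CUP):

```
/* CUP grammar fragment: each action builds a tree node by hand via RESULT. */
expr ::= expr:l PLUS term:r   {: RESULT = new Plus(l, r); :}
       | term:t               {: RESULT = t; :}
       ;
term ::= NUMBER:n             {: RESULT = new Num(n); :}
       ;
```

Multiply that by every rule in every grammar you own, and redo it on each grammar change, and you can see why it gets old.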
Now you need to compose parsers. You need to have one invoke the other as the need comes up; of course, your chosen parser isn't designed to do this, so you'll have to twist it. The first hard part is providing a parser with clues that a sublanguage is coming up in the input stream, having it hand off that parsing to the sublanguage parser, and getting it to pass a tree back to be incorporated into the parent parser's tree, presumably with some kind of marker so you can tell where the transitions between different sublanguages are in the tree. You can often do this by hacking one language's lexer, when it sees the clue, to invoke the other; but then what do you do with the tree it returns? There's no way to give that tree to the current parser and say "integrate this". You get around this by modifying the parsing machinery in arcane ways.
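To make the lexer-hacking trick concrete, here is a minimal Java sketch; every name in it (HandOffLexer, SubParser, the $CMS_SET marker handling) is hypothetical and invented for illustration, not taken from CUP or JFlex:

```java
/** Sketch of the hand-off idea: the host tokenizer spots the
 *  $CMS_SET(...)$ marker, parses the island with a sub-parser,
 *  and emits ONE opaque token whose payload is the subtree, so
 *  the host grammar can treat the island as a single leaf. */
class HandOffLexer {
    interface SubParser { Object parse(String islandText); }

    /** A token: either a plain host-language token or an island subtree. */
    record Token(String kind, Object payload) {}

    private final String input;
    private final SubParser cmsParser;
    private int pos = 0;

    HandOffLexer(String input, SubParser cmsParser) {
        this.input = input;
        this.cmsParser = cmsParser;
    }

    Token nextToken() {
        if (input.startsWith("$CMS_SET(", pos)) {
            int end = input.indexOf(")$", pos);   // island's end (error handling elided)
            String island = input.substring(pos, end + 2);
            pos = end + 2;
            // The parent grammar sees EMBEDDED_CMS as a single leaf token.
            return new Token("EMBEDDED_CMS", cmsParser.parse(island));
        }
        // Ordinary host-language scanning elided to one char per token.
        return new Token("CHAR", input.charAt(pos++));
    }
}
```

Note what the sketch dodges: the host grammar still needs a slot for EMBEDDED_CMS, and the "integrate this tree" step is exactly the part mainstream engines give you no hook for.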
But all of the above isn't where the problem is.
Parsing is inconvenient, but only a small part of what you need to analyze your programs in any interesting way; you need symbol tables, control and data flow analysis, maybe points-to analyses, and the engineering on these will swamp the work listed above. See my essay on "Life After Parsing" (google or via my bio) for a long discussion of what else you need.
In short, I think you are biting off an enormous task in just "parsing", and you haven't even told us what you intend to do with the result. You are welcome to start down this path, but very few people have succeeded; my team has spent over 50 man-years of PhD-level engineering to get where we are, and we are hardly done.
Java won't make the solution any easier or harder; the language in which you solve all of the above is irrelevant.
Can anybody point me at references on techniques for parsing code that contains syntax errors, or is missing necessary punctuation, for example?
The application that I'm working on is an IDE, where we'd like to provide features like "jump to definition", auto-complete, and refactoring features, without requiring the source to be syntactically correct at the moment the functions are invoked.
Most parser code I've seen appears to work on the principle of "fail early", rather than focusing on error recovery or parsing partially-complete code.
Have you tried ANTLR?
In "The Definitive ANTLR Reference", section 10.7 Automatic Error Recovery Strategy for 5 pages Terrence talks about this. He references Algorithms + Data Structures = Programs, A Note on Error Recovery in Recursive Descent Parsers, Efficient and Comfortable Error Recovery in Recursive Descent Parsers.
Also see the pages from the web site:
Error reporting and recovery
ANTLR 3.0 Error Reporting and Recovery
Custom Syntax Error Recovery
Also check the ANTLR tag for accessing the ANTLR forum, where Terence Parr answers questions. He answers some questions here too, as The ANTLR Guy.
Also, the new version, ANTLR 4, is due out soon, as is its book.
Sorry to sound like a sales pitch, but I have been using ANTLR for years because it is used by lots of people and in production systems, has a few solid target languages (Java, C, C#), has a very active community, has a website, has books, is evolving, is maintained, is open source under a BSD license, is easy to use, and has some GUI tools.
One of the people working on a GUI for ANTLR 4 that has syntax highlighting and auto-completion, among other useful IDE editing features, is Sam Harwell. If you can reach him through the ANTLR forum, he might be able to help you out.
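When you do get onto ANTLR 4, the runtime makes the error-tolerant-IDE use case fairly direct: replace the console error listener with your own collector and let the default strategy keep recovering, so a (partial) tree still comes back for jump-to-definition and completion to use. A hedged sketch against a recent 4.x runtime; MyLangLexer, MyLangParser, and the compilationUnit start rule are assumed generated names:

```java
import java.util.*;
import org.antlr.v4.runtime.*;
import org.antlr.v4.runtime.tree.ParseTree;

class TolerantParseDemo {
    // Parse, collecting syntax errors instead of printing them; the
    // default error strategy keeps recovering past errors, so a
    // (partial) parse tree is still produced.
    static ParseTree parse(String sourceText, List<String> errors) {
        MyLangLexer lexer = new MyLangLexer(CharStreams.fromString(sourceText));
        MyLangParser parser = new MyLangParser(new CommonTokenStream(lexer));
        parser.removeErrorListeners();            // drop the console listener
        parser.addErrorListener(new BaseErrorListener() {
            @Override
            public void syntaxError(Recognizer<?, ?> recognizer,
                                    Object offendingSymbol,
                                    int line, int charPositionInLine,
                                    String msg, RecognitionException e) {
                errors.add(line + ":" + charPositionInLine + " " + msg);
            }
        });
        return parser.compilationUnit();          // parses past errors where it can
    }
}
```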
I don’t know of any papers or tutorials, but uu-parsinglib is a Haskell parsing library that can recover from syntax errors in a general fashion. If, for example, ; was expected but int was received, the parser can continue as though ; were inserted at that source position.
It’s up to you where the parser will fail and where it will proceed with corrections, and the results will be delivered alongside a set of the errors corrected during parsing. Even if you don’t intend to implement your parsing code in Haskell, an examination of the library may offer you some insight. Or you can write a parser in Haskell and call it from C.
Research on "Island grammars" may interest you. It's been a while since I looked at them, but I believe that they are supposed to reasonably handle cases where there are many chunks of nonsense in the file. I didn't have much luck with CiteSeer (oddly; usually it's pretty good), but Google Scholar found a number of relevant papers. Generating robust parsers using island grammars looks like a good place to start.
On the web, there are a lot of examples showing how to construct the parsing tables of an LL(1) parser for a context-free grammar from the FIRST/FOLLOW sets.
But I haven't found anything useful for the k > 1 case. Even Wikipedia gives no info about it.
I expect that it must be in some way similar, but pointers to existing research in this area would be very helpful.
I struggled with pretty much the same issues, though building an LR parser, not LL. I found a somewhat better page than the LL(k) one mentioned by @cakeplus -- http://www.seanerikoconnor.freeservers.com/ComputerScience/Compiler/ParserGeneratorAndParser/QuickReviewOfLRandLALRParsingTheory.html There is also a related paper available for free -- http://ci.nii.ac.jp/naid/110002673618/
However, even those didn't help me much, so I started myself from the basics. If anyone is interested: https://aboutskila.wordpress.com/2013/06/14/lalrk-first-sets/ -- and the battle will continue :-)
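For anyone landing here, the definitional side at least is standard textbook material (Sippu and Soisalon-Soininen's Parsing Theory covers it in depth): the k = 1 construction carries over once you work with length-k prefixes and truncated concatenation. Writing $k{:}w$ for the first $\min(k, |w|)$ symbols of $w$:

$$\mathrm{FIRST}_k(\alpha) = \{\, k{:}w \mid \alpha \Rightarrow^{*} w,\; w \in \Sigma^{*} \,\}, \qquad L_1 \oplus_k L_2 = \{\, k{:}(xy) \mid x \in L_1,\; y \in L_2 \,\}$$

so that $\mathrm{FIRST}_k(X\beta) = \mathrm{FIRST}_k(X) \oplus_k \mathrm{FIRST}_k(\beta)$, and an LL(k) table is indexed by pairs of a nonterminal and a lookahead string of length at most k. A tiny example: for $S \to aA$, $A \to b \mid \varepsilon$, we get $\mathrm{FIRST}_2(S) = \{ab,\ a\}$, because the $\varepsilon$ branch cuts the prefix short.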
I am currently working on a parser, and it seems that I have made a few mistakes during the follow-set calculation. So I was wondering if someone knows a good tool for calculating FOLLOW and FIRST sets, so I could skip/re-evaluate this error-prone part of the parser construction.
Take a look at http://hackingoff.com/compilers/predict-first-follow-set
It's an awesome tool for computing the FIRST and FOLLOW sets of a grammar. Also, you can check your answer with this visualization tool:
http://smlweb.cpsc.ucalgary.ca/start.html
I found my mistake by comparing my FIRST/FOLLOW sets with the ones generated by this web app.
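If you'd also like to sanity-check the computation itself rather than trust any one tool, FIRST sets are a small fixed-point iteration. Here is a hedged Java sketch over a toy grammar (single-character symbols, with 'e' standing in for epsilon in the output; not any particular generator's API):

```java
import java.util.*;

/** Fixed-point computation of FIRST sets for a toy grammar.
 *  Uppercase letters are nonterminals, everything else terminals,
 *  "" encodes an epsilon production. */
public class FirstSets {
    public static void main(String[] args) {
        // E -> T X ;  X -> + T X | eps ;  T -> ( E ) | i
        Map<Character, List<String>> g = new LinkedHashMap<>();
        g.put('E', List.of("TX"));
        g.put('X', List.of("+TX", ""));
        g.put('T', List.of("(E)", "i"));

        Map<Character, Set<Character>> first = new HashMap<>();
        g.keySet().forEach(nt -> first.put(nt, new HashSet<>()));

        boolean changed = true;
        while (changed) {                        // iterate to a fixed point
            changed = false;
            for (var e : g.entrySet())
                for (String rhs : e.getValue())
                    changed |= first.get(e.getKey())
                                    .addAll(firstOf(rhs, g, first));
        }
        first.forEach((nt, f) -> System.out.println("FIRST(" + nt + ") = " + f));
    }

    // FIRST of a sentential form; 'e' marks epsilon in result sets.
    static Set<Character> firstOf(String rhs, Map<Character, List<String>> g,
                                  Map<Character, Set<Character>> first) {
        Set<Character> out = new HashSet<>();
        for (char sym : rhs.toCharArray()) {
            if (!g.containsKey(sym)) { out.add(sym); return out; } // terminal
            Set<Character> f = first.get(sym);
            out.addAll(f);
            out.remove('e');                    // epsilon itself doesn't propagate
            if (!f.contains('e')) return out;   // sym not nullable: stop here
        }
        out.add('e');                           // every symbol was nullable
        return out;
    }
}
```

Running it prints FIRST(E) = {(, i}, FIRST(X) = {+, e}, FIRST(T) = {(, i}, which you can compare against your hand calculation. FOLLOW sets fall out of a very similar loop.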
Most parser generators that I've encountered don't have an obvious means to dump this information, let alone dump it in a readable way. (I built one that does, for the reason you are suggesting, but it isn't available by itself, and I doubt you want the rest of the baggage.)
If your parser definition doesn't work, you mostly don't need to know these things to debug it. Staring at the rules, amazingly enough, helps; it also helps to build the two smallest test inputs you can think of, one being something you expect to be accepted, and the other being a slight variant that should be rejected.
In spite of having a parser generator that will dump this information, I rarely resort to it when debugging grammars, and I've built 20-30 pretty big grammars with it.
Is there a tool out there that will automatically show the full language for a given grammar, including highlighting ambiguities (if any)?
There might be some peculiarity about BNF-style grammars, but in general, deciding whether a given context-free grammar (such as one written in BNF) is ambiguous is undecidable.
In short, such a tool does not exist because, in general, it is mathematically impossible. There might be some special cases that could work for you, though.
In general, no.
But as a practical approach, what you can do, given a grammar, is enumerate possible strings of terminals/nonterminals for each rule, to see whether any string has two or more distinct derivations (which would be an ambiguity).
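Here's a toy Java sketch of that enumeration idea (the grammar encoding is invented for illustration): it enumerates leftmost derivations of the deliberately ambiguous grammar S -> S S | a up to a depth bound and flags any terminal string reached by more than one derivation. Since distinct leftmost derivations correspond to distinct parse trees, any flagged string witnesses an ambiguity:

```java
import java.util.*;

/** Brute-force ambiguity probe: enumerate leftmost derivations up to
 *  a depth bound, counting how many ways each terminal string arises.
 *  Exponential, so only usable on tiny grammars. */
public class AmbiguityProbe {
    // Uppercase single letters are nonterminals, everything else terminal.
    static Map<Character, List<String>> rules = Map.of(
        'S', List.of("SS", "a")     // classic ambiguous grammar S -> S S | a
    );

    public static void main(String[] args) {
        Map<String, Integer> derivations = new HashMap<>();
        expand("S", 0, 6, derivations);
        derivations.forEach((w, n) -> {
            if (n > 1)
                System.out.println("ambiguous: \"" + w + "\" has " + n + " derivations");
        });
    }

    // Expand the leftmost nonterminal; record fully terminal strings.
    static void expand(String form, int depth, int maxDepth,
                       Map<String, Integer> out) {
        int i = 0;
        while (i < form.length() && !rules.containsKey(form.charAt(i))) i++;
        if (i == form.length()) {               // all terminals: one derivation
            out.merge(form, 1, Integer::sum);
            return;
        }
        if (depth == maxDepth) return;          // give up on this branch
        for (String alt : rules.get(form.charAt(i)))
            expand(form.substring(0, i) + alt + form.substring(i + 1),
                   depth + 1, maxDepth, out);
    }
}
```

Running it reports that "aaa" has 2 derivations. The search is exponential, which is why the practical version described below needs parse tables to guide it and a depth cutoff.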
Our DMS Software Reengineering Toolkit is a program transformation system for arbitrary computer languages, driven by explicit grammar descriptions. DMS uses a parser generator to drive its GLR parsing engine.
DMS's parser generator will optionally run the ambiguity check sketched above, by running an iterative-deepening search across all grammar rules. This is practical because it has the parse tables to efficiently guide the enumeration of choices. You can tell it to run this check up to some chosen depth; it can take a long time if you choose a depth of any interesting size, but in fact a depth of 3 or 4 is sufficient to find many stupid ambiguities introduced into a large grammar. We generally do this during our initial grammar debugging, and again at the point where we think we have it pretty much right.