Tool/Application to calculate first and follow sets - parsing

I am currently working on a parser and it seems that I have made a few mistakes during the follow set calculation. So I was wondering if someone knows a good tool to calculate first and follow sets, so I could skip or re-evaluate this error-prone part of the parser construction.

Take a look at http://hackingoff.com/compilers/predict-first-follow-set
It's an awesome tool to compute first and follow sets in a grammar. Also, you can check your answer with this visualization tool:
http://smlweb.cpsc.ucalgary.ca/start.html

I found my mistake by comparing my first/follow sets with the ones generated by this web app.
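If you would rather cross-check the sets locally, the textbook fixed-point computation is short enough to write yourself. The following is a minimal sketch, not the algorithm used by either web tool above: the grammar encoding (a dict of nonterminal to list of productions, with an empty list standing for an epsilon production), the names EPS, END, first_follow and the sample GRAMMAR are all assumptions made for illustration.

```python
EPS = "ε"   # marker for the empty string (assumed not to be a terminal)
END = "$"   # end-of-input marker

def first_of_sequence(symbols, first, nonterminals):
    """FIRST set of a sequence of grammar symbols."""
    result = set()
    for sym in symbols:
        if sym not in nonterminals:       # a terminal starts the sequence
            result.add(sym)
            return result
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:         # sym cannot vanish, stop here
            return result
    result.add(EPS)                       # every symbol can derive epsilon
    return result

def first_follow(grammar, start):
    """Compute FIRST and FOLLOW for all nonterminals by iterating to a fixed point."""
    nonterminals = set(grammar)
    first = {nt: set() for nt in grammar}
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for head, productions in grammar.items():
            for body in productions:
                f = first_of_sequence(body, first, nonterminals)
                if not f <= first[head]:
                    first[head] |= f
                    changed = True
                for i, sym in enumerate(body):
                    if sym not in nonterminals:
                        continue
                    tail = first_of_sequence(body[i + 1:], first, nonterminals)
                    new = tail - {EPS}
                    if EPS in tail:       # the rest can vanish: inherit FOLLOW(head)
                        new |= follow[head]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return first, follow

# The classic expression grammar, used here only as a test case.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}

FIRST, FOLLOW = first_follow(GRAMMAR, "E")
```

Comparing FIRST and FOLLOW from a sketch like this against hand-computed sets (or against the web tool) is a quick way to locate the rule where the two disagree.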

Most parser generators that I've encountered don't have obvious means to dump this information, let alone dump it in a readable way. (I built one that does, for the reason you are suggesting, but it isn't available by itself and I doubt you want the rest of the baggage).
If your parser definition doesn't work, you mostly don't need to know these things to debug it. Staring at the rules amazingly enough helps; it also helps to build the two smallest grammar instances you can think of, one being something you expect to be accepted, and the other being a slight variant that should be rejected.
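To make the "two smallest instances" idea concrete, here is a small sketch using a hand-written recognizer as a stand-in for whatever your parser generator produces. The toy grammar, the function names and the two inputs are hypothetical; the point is only the shape of the test: one minimal input that must be accepted, and a near-identical one that must be rejected.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[+()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"bad character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse_expr(tokens, pos):
    # expr -> term ('+' term)*
    pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        pos = parse_term(tokens, pos + 1)
    return pos

def parse_term(tokens, pos):
    # term -> NUMBER | '(' expr ')'
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[pos].isdigit():
        return pos + 1
    if tokens[pos] == "(":
        pos = parse_expr(tokens, pos + 1)
        if pos < len(tokens) and tokens[pos] == ")":
            return pos + 1
        raise SyntaxError("expected ')'")
    raise SyntaxError(f"unexpected token {tokens[pos]!r}")

def accepts(text):
    """True if the whole input is a sentence of the toy grammar."""
    try:
        tokens = tokenize(text)
        return parse_expr(tokens, 0) == len(tokens)
    except SyntaxError:
        return False

SHOULD_ACCEPT = "1+2"   # smallest interesting valid input
SHOULD_REJECT = "1+"    # slight variant that must fail
```

If the accepted case fails or the rejected case passes, the difference between the two inputs usually points at the rule that is wrong.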
In spite of having a parser generator that will dump this information, I rarely resort to using it to debug grammars, and I've built 20-30 pretty big grammars with it.

Related

Writing a Parser for mixed Languages

I am trying to write a parser that can analyze mixed languages and generate an AST from them. I first tried to build it from scratch on my own in Java and failed, because this is quite a hard topic for a parser beginner. Then I googled and found http://www2.cs.tum.edu/projects/cup/examples.php and JFlex.
The question now is: What is the best way to do it?
For example, I have a code file that contains several tags, JS code, and some $CMS_SET(x,y)$ code. Is the best way to solve this to define a grammar for all those things in CUP and let CUP generate a parser based on my grammar that can analyze those mixed-language files and generate an AST for them?
Thanks for all helpful answers. :)
EDIT: I need to do it in Java...
This topic is quite hard even for an expert in this area, which I consider myself to be; check my bio.
The first issue is to build individual parsers for each sublanguage. The first thing you will discover is that defining parsers for specific languages is actually hard; you can read the endless list of SO requests for "can I get a parser for X" or "how do I fix my parser for X". Mostly I think these requests end up not going anywhere; the parsing engines aren't very good, you have to twist the grammars and the parsers to make them work on real languages, there isn't any such thing as "pure HTML", the standards documents disagree and your customer always has some twist in his code that you're not prepared for. Finally there are the glitches related to character set encodings, variations in newline endings, and preprocessors to complicate the parsing problem. The C++ preprocessor is a lot more complex than you might think, and you have to have this right. The easiest way to defeat this problem is to find some parser generator with the languages already predefined. ANTLR has a bunch for the now deprecated ANTLR3; there's no assurance these parsers are robust, let alone compatible, for your purposes.
CUP isn't a particularly helpful parser generator; none of the LL(x) or LALR(x) parser generators are really helpful, because no real language matches the categories of things they can parse. The consequence: an endless stream of requests (on SO!) for help "resolving my shift-reduce conflict", or "eliminating right recursion". The only parser generator IMHO that has stood the test of time is a GLR parser generator (I hear good things about GLL but that's pretty recent). We've done 40+ languages with one GLR parser generator, including production IBM COBOL, full C++14 and Java8.
Your second problem will be building ASTs. You can hand code the AST building process, but that gets old fast when you have to change the grammars often and/or you have many grammars, as you are effectively contemplating. This you can beat your way through with sweat. (We chose to push the problem of building ASTs into the parser so we didn't have to put any energy into this when building a grammar; to do this, your parser engine has to offer you this help and none of the mainstream ones do.)
Now you need to compose parsers. You need to have one invoke the other as the need comes up; of course your chosen parser isn't designed to do this so you'll have to twist it. The first hard part is to provide a parser with clues that a sublanguage is coming up in the input stream, and for it to hand off that parsing to the sublanguage parser, and get it to pass a tree back to be incorporated in the parent parser's tree, presumably with some kind of marker so you can tell where the transitions between different sublanguages are in the tree. You can often do this by hacking one language's lexer, when it sees the clue, to invoke the other; but then what do you do with the tree it returns? There's no way to give that tree to the current parser and say "integrate this". You get around this by modifying the parsing machinery in arcane ways.
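To show only the shape of that hand-off (clue, delegate, graft the subtree back with a marker), here is a deliberately tiny sketch. It is not how a real parser generator does it, and it sidesteps all the hard parts described above: the "host parser" just keeps host text as opaque leaves, the clue is a regular expression modeled on the $CMS_SET(x,y)$ example from the question, and every name (ISLAND, parse_cms, parse_host, the node tags) is hypothetical.

```python
import re

# Assumed clue: an embedded call looks like $CMS_NAME(arg, arg)$.
ISLAND = re.compile(r"\$CMS_(\w+)\((.*?)\)\$")

def parse_cms(name, raw_args):
    """Stand-in sublanguage parser: returns a small tree for one embedded call."""
    args = [a.strip() for a in raw_args.split(",") if a.strip()]
    return ("cms_call", name, args)

def parse_host(text):
    """Stand-in host parser: grafts sublanguage subtrees into its own tree."""
    children, pos = [], 0
    for m in ISLAND.finditer(text):
        if m.start() > pos:
            children.append(("host_text", text[pos:m.start()]))
        # The "island" wrapper marks the transition between languages.
        children.append(("island", "cms", parse_cms(m.group(1), m.group(2))))
        pos = m.end()
    if pos < len(text):
        children.append(("host_text", text[pos:]))
    return ("document", children)

TREE = parse_host("<p>$CMS_SET(x,y)$</p>")
```

In a real system both the host and the sublanguage would be full parsers, and the grafting step is exactly where generated parsers tend to resist you.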
But all of the above isn't where the problem is.
Parsing is inconvenient, but only a small part of what you need to analyze your programs in any interesting way; you need symbol tables, control and data flow analysis, maybe points-to analyses, and the engineering on these will swamp the work listed above. See my essay on "Life After Parsing" (google or via my bio) for a long discussion of what else you need.
In short, I think you are biting off an enormous task in just "parsing", and you haven't even told us what you intend to do with the result. You are welcome to start down this path, but very few people have succeeded; my team spent over 50 man-years of PhD-level engineering to get where we are, and we are hardly done.
Java won't make the solution any easier or harder; the language in which you solve all of the above is irrelevant.

F# Debugging and code maintenance: Is there an un-factoring, factoring tool for F#?

When working down a long chain of factored F# code there are times I have to un-factor the code for various reasons so I can modify the code to fix a bug or add a change, followed by factoring the code again.
Since un-factoring and factoring are for the most part symbolic transformations which should be able to be automated, (I know it's not easy in reality); has anyone made such a tool?
I don't know of any tools like that. I am assuming that you are asking about breaking up the pipes so that you can put a breakpoint somewhere and inspect the result. I agree that it's the hardest part of working with code in F#: once you've composed it, it's virtually impossible to step through it with all the lazy evaluation and compositions.
In situations like this, it can be useful to override the forward pipe operator, which allows you to put a breakpoint on it. Doesn't necessarily solve the lazy/composition issues but a useful trick anyway.
Details here:
http://www.kiteason.com/blogengine/post/2012/09/13/Tapping-into-the-pipe.aspx

Writing a Parser (for a markup language): Theory & Practice

I'd like to write an idiomatic parser for a markup language like Markdown. My version will be slightly different, but I perceive at least a minor need for something like this in Clojure, and I'd like to get on it.
I don't want to use a mess of RegExes (though I realize some will probably be needed), and I'd like to make something both powerful and in idiomatic Clojure.
I've begun a few different attempts (mostly on paper), but I'm not terribly happy with them, as I feel as though I'm just improvising. That would be fine, but I've done plenty of exploring in the language of Clojure in the past month or two, and would like to, at least in part, follow in the paths of giants.
I'd like some pointers, or suggestions, or resources (books from O'Reilly would be awesome–love me some eBooks–but Amazon or wherever would be great, too). Whatever you can offer.
EDIT Brian Carper has an interesting post on using ANTLR from Clojure.
There's also clojure-pg and fnparse, which are Clojure parser-generators. fnparse even looks like it's got some decent documentation.
Still looking for resources etc! Just thought I'd update these with some findings of my own.
Best I can think of is that Terence Parr - the guy that leads the ANTLR parser generator - has written a markup language documented here. Anyway, there's source code there to look at.
There is also the clj-peg project, which lets you specify a PEG grammar for parsing data.
Another not yet mentioned here is clarsec, a port of Haskell's parsec library.
I've recently been on a very similar quest to build a parser in Clojure. I went pretty far down the fnparse path, in particular using the (yet unreleased) fnparse 3 which you can find in the develop branch on github. It is broken into two forms: hound (specifically for LL(1) single lookahead parsers) and cat, which is a packrat parser. Both are functional parsers built on monads (like clarsec). fnparse has some impressive work - the ability to document your parser, build error messages, etc is neat. The documentation on the develop branch is non-existent though other than the function docstrings, which are actually quite good. In the end, I hit some road-blocks with trying to make LL(k) work. I think it's possible to make it work, it's just hard without a decent set of examples on how to make backtracking work well. I'm also so familiar with parsers that split lexing and parsing that it was hard for me to think that way. I'm still very interested in this as a good solution in the future.
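For readers who haven't seen the functional-parser style that fnparse and clarsec follow, here is a rough sketch of the idea. It is written in Python rather than Clojure, it does not reproduce either library's API, and all names (literal, take_while1, seq, alt, many, fmap) are made up for illustration. Each parser is a function from (text, position) to either (value, new position) or None, and alt gives you backtracking by retrying each alternative from the same position.

```python
def literal(s):
    def parse(text, pos):
        return (s, pos + len(s)) if text.startswith(s, pos) else None
    return parse

def take_while1(pred):
    """Consume one or more characters satisfying pred."""
    def parse(text, pos):
        end = pos
        while end < len(text) and pred(text[end]):
            end += 1
        return (text[pos:end], end) if end > pos else None
    return parse

def seq(*parsers):
    def parse(text, pos):
        values = []
        for p in parsers:
            result = p(text, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def alt(*parsers):
    def parse(text, pos):
        for p in parsers:
            result = p(text, pos)   # every alternative restarts at pos
            if result is not None:
                return result
        return None
    return parse

def many(p):
    def parse(text, pos):
        values = []
        while True:
            result = p(text, pos)
            if result is None or result[1] == pos:   # stop on failure or no progress
                return values, pos
            values.append(result[0])
            pos = result[1]
    return parse

def fmap(p, fn):
    def parse(text, pos):
        result = p(text, pos)
        return None if result is None else (fn(result[0]), result[1])
    return parse

# A toy inline grammar: bold is **...**, everything else is plain text.
not_star = take_while1(lambda c: c != "*")
bold = fmap(seq(literal("**"), not_star, literal("**")), lambda v: ("bold", v[1]))
plain = fmap(not_star, lambda s: ("text", s))
inline = many(alt(bold, plain))

NODES, END_POS = inline("a **b** c", 0)
```

Note there is no separate lexer here, which is the shift in thinking mentioned above for people used to split lexing and parsing.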
In the meantime, I've fallen back to Antlr, which is very robust, well-traveled, well-documented (in 2 books), etc. It doesn't have a Clojure back-end but I hope it will in the future, which would make it really nice for parser work. I'm using it for lexing, parsing, tree transformation, and templating via StringTemplate. It hasn't been entirely bump-free, but I've been able to find workable solutions to all problems so far. Antlr's unique LL(*) parsing algorithm lets you write really readable grammars but still make them fairly efficient (and tweak things gradually if they're not).
Two functional markup translators are:
Pandoc, a Markdown implementation in Haskell, with source on GitHub
Simple_markdown, implemented in OCaml.

Will rewriting a multipurpose log file parser to use formal grammars improve maintainability?

TLDR: I built a multipurpose parser by hand with different code for each format. Will it work better in the long run if I use one chunk of parser code and an ANTLR, PyParsing or similar grammar to specify each format?
Context:
My job involves lots of benchmark log files from ~50 different benchmarks. There are a few in XML, a few HTML, a few CSV and lots of proprietary stuff with no documented spec. To save me and my coworkers the time of entering this data by hand, I wrote a parsing tool that handles all of the formats we deal with regularly with a uniform interface. The design, though, is not so clean.
I wrote this thing in Python and created a Parser class. Each file format is handled as an implementation that provides its own code for the Parser's read() method. I like the idea of having only one definition of Parser that uses grammars to understand each format, but I've never done it before.
Is it worth my time, and will it be easier for other newbies to work with in the future once I finish refactoring?
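To picture the design being asked about (one Parser definition, with the per-format knowledge moved into declarative descriptions), here is a minimal sketch. It uses a table of regular expressions as a stand-in for a real ANTLR or PyParsing grammar, so it is not a formal-grammar solution, only the same separation of concerns. The format names, field names and log lines are invented for illustration.

```python
import re

# Hypothetical format descriptions: field name -> pattern with one capture group.
FORMATS = {
    "bench_a": {
        "score": r"^Score:\s*([\d.]+)",
        "runtime_s": r"^Elapsed:\s*([\d.]+)\s*s",
    },
    "bench_b": {
        "score": r"^result=([\d.]+)",
        "runtime_s": r"^time=([\d.]+)",
    },
}

class Parser:
    """One parser; per-format knowledge lives in data, not in subclasses."""

    def __init__(self, format_name):
        self.rules = {field: re.compile(pattern, re.MULTILINE)
                      for field, pattern in FORMATS[format_name].items()}

    def read(self, text):
        """Return the fields found in text as a dict of floats."""
        results = {}
        for field, rule in self.rules.items():
            match = rule.search(text)
            if match:
                results[field] = float(match.group(1))
        return results

SAMPLE_A = "Benchmark X\nScore: 12.5\nElapsed: 3.2 s\n"
SAMPLE_B = "result=7\ntime=0.5\n"
```

Adding a new benchmark then means adding an entry to FORMATS rather than writing a new read() implementation; whether that entry should be a regex table or a real grammar depends on how irregular the input is.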
I can't answer your question with 100% certainty, but I can give you an opinion.
I find the choice to use a proper grammar vs hand rolled regex "parsers" often comes down to how uniform the input is.
If the input is very uniform and you already know a language that deals with strings well, like Python or Perl, then I'd keep your existing code.
On the other hand I find parser generators, like Antlr, really shine when the input can have errors and inconsistencies in it. The reason is that the formal grammar allows you to focus on what should be matched in a certain context without having to worry about walking the input stream manually.
Furthermore, if the input stream has an error, then I find it's often easier to deal with it using Antlr than with regexes. The reason is that if a couple of options are available, Antlr has built-in functionality for choosing the correct path, including rollback via predicates.
Having said all that, there is a lot to be said for working code. If I want to rewrite something, I try to make a good use case for how the rewrite will benefit the user of the product.

Is Yacc still used in the industry?

The software base I am developing for uses a significant amount of yacc which I don't need to deal with. Sometimes I think it would be helpful in understanding some problems I find, but most of the time I can get away with my complete ignorance of yacc.
My question: are there enough new projects out there that still use yacc to warrant the time I'll need to learn it?
Edit: Given the response is mostly in favour of learning Yacc, is there a similar language that you would recommend over yacc?
Yes, these tools are worth learning if you ever need to create or modify code that parses a grammar.
For many years the de facto tool for generating code to parse a grammar was yacc, or its GNU cousin, bison.
Lately I've heard there are a couple of new kids on the block, but the principle is the same: you write a declarative grammar in a format that is more or less in Backus-Naur Form (BNF) and yacc/bison/whatever generates some code for you that would be extremely tedious to write by hand.
Also, the principles behind grammars can be very useful to learn even if you don't need to work on such code directly. I haven't worked with parsers much since taking a course on Compiler Design in college, but understanding runtime stacks, lookahead parsers, expression evaluation, and a lot of other related things has helped me immensely to write and debug my code effectively.
edit: Given your followup question about other tools, Yacc/Bison of course are best for C/C++ projects, since they generate C code. There are similar tools for other languages. Not all grammars are equivalent, and some parser generators can only grok grammars of a certain complexity. So you might need to find a tool that can parse your grammar. See http://en.wikipedia.org/wiki/Comparison_of_parser_generators
I don't know about new projects using it but I'm involved in seven different maintenance jobs that use lex and yacc for processing configuration files.
No XML for me, no-sir-ee :-).
Solutions using lex/yacc are a step up from the old configuration files of key=val lines since they allow better hierarchical structures like:
server = "mercury" {
    ip = "172.3.5.13"
    gateway = "172.3.5.1"
}
server = "venus" {
    ip = "172.3.5.21"
    gateway = "172.3.5.1"
}
And, yes, I know you can do that with XML, but these are primarily legacy applications written in C and, to be honest, I'd probably use lex/yacc for new (non-Java) jobs as well.
That's because I prefer delivering software on time and budget rather than delivering the greatest new whizz-bang technology - my clients won't pay for my education, they want results first and foremost and I'm already expert at lex/yacc and have all the template code for doing it quickly.
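For anyone curious what parsing such a hierarchical config involves, here is a small sketch. It is a hand-written recursive-descent parser in Python, not lex/yacc and not the code from those maintenance jobs; the grammar in the docstring is my guess at the format from the example above, and the names (tokenize, parse_config, the dict layout of the result) are assumptions.

```python
import re

# A token is a bare word, a double-quoted string, or one of = { }.
TOKEN = re.compile(r'\s*(?:(\w+)|"([^"]*)"|([={}]))')

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        name, string, punct = m.groups()
        if name is not None:
            tokens.append(("NAME", name))
        elif string is not None:
            tokens.append(("STRING", string))
        else:
            tokens.append((punct, punct))
        pos = m.end()
    return tokens

def parse_config(text):
    """config  -> block*
       block   -> NAME '=' STRING '{' setting* '}'
       setting -> NAME '=' STRING"""
    tokens = tokenize(text)
    pos = 0

    def expect(kind):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos][0] != kind:
            raise SyntaxError(f"expected {kind} at token {pos}")
        value = tokens[pos][1]
        pos += 1
        return value

    blocks = []
    while pos < len(tokens):
        block_kind = expect("NAME")
        expect("=")
        label = expect("STRING")
        expect("{")
        settings = {}
        while pos < len(tokens) and tokens[pos][0] == "NAME":
            key = expect("NAME")
            expect("=")
            settings[key] = expect("STRING")
        expect("}")
        blocks.append({"kind": block_kind, "name": label, "settings": settings})
    return blocks

CONFIG = '''
server = "mercury" {
    ip = "172.3.5.13"
    gateway = "172.3.5.1"
}
'''
BLOCKS = parse_config(CONFIG)
```

With lex/yacc the three grammar rules in the docstring would be written declaratively and the token loop and expect() bookkeeping would be generated for you, which is the appeal described above.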
A general rule of thumb: code lasts a long time, so the technologies used in that code last a long time, too. It would take an enormous amount of time to replace the codebase you mention (it took 15 years to build it...), which in turn implies that it will still be around in 5, 10, or more years. (There's even a chance that someone who reads this answer will end up working on it!)
Another rule of thumb: if a general-purpose technology is commonplace enough that you have encountered it already, it's probably commonplace enough that you should familiarize yourself with it, because you'll see it again one day. Who knows: by familiarizing yourself with it, maybe you added a useful tool to your toolbox...
Yacc is one of these technologies: you're probably going to run into it again, it's not that difficult, and the principles you'll learn apply to the whole family of parser constructors.
PEGs are the new hotness, but there are still a ton of projects that use yacc or tools more modern than yacc. I would frown on a new project that chose to use yacc, but for existing projects porting to a more modern tool may not make sense. This makes having rough familiarity with yacc a useful skill.
If you're totally unfamiliar with the topic of parser generators I'd encourage you to learn about one, any one. Many of the concepts are portable between them. Also, it's a useful tool to have in the belt: once you know one you'll understand how they can often be superior compared to regex heavy hand written parsers. If you're already comfortable with the topic of parsers, I wouldn't worry about it. You'll learn yacc if and when you need to in order to get something done.
I work on projects that use Yacc. Not new code - but were they new, they'd still use Yacc or a close relative (Bison, Byacc, ...).
Yes, I regard it as worth learning if you work in C.
Also consider learning ANTLR, or other more modern parser generators. But knowledge of Yacc will stand you in good stead - it will help you learn any other similar tools too, since a lot of the basic theory is similar.
I don't know about yacc/bison specifically, but I have used antlr, cup, jlex and javacc. I thought they would only be of academic importance, but as it turns out we needed a domain-specific language, and this gave us a much nicer solution than some "simpler" (regex based) parsers out there. Maintenance might be an issue in many environments, though, since most coders these days won't have any experience with parsing tools.
I haven't had the chance to compare it with other parsing systems but I can definitely recommend ANTLR based on my own experience and also with its large and active user base.
Another plus point for ANTLR is ANTLRWorks: The ANTLR GUI Development Environment which is a great help while developing and debugging your grammars. I've yet to see another parsing system which is supported by such an IDE.
We are writing new yacc code at my company for shipping products. Yes, this stuff is still used.

Resources