I'm writing a program that takes as input a straight play in a custom format and then performs some analysis on it (such as the number of lines and words for each character). It's just for fun, and a pretext for learning cool stuff.
The first step in that process is writing a parser for that format. It goes :
####Play
###Act I
##Scene 1
CHARACTER 1. Line 1, he's saying some stuff.
#Comment, stage direction
CHARACTER 2, doing some stuff. Line 2, she's saying some stuff too.
It's quite a simple format. I read extensively about basic parser stuff like CFG, so I am now ready to get some work done.
I have written my grammar in EBNF and started playing with flex/bison, but it raises some questions:
Is flex/bison too much for such a simple parser? Should I just write it myself, as described in "Is there an alternative for flex/bison that is usable on 8-bit embedded systems?"
What is good practice regarding the respective tasks of the tokenizer and the parser itself? There is never a single solution, and for such a simple language they often overlap. This is especially true for flex/bison, where flex can perform some intense stuff with regex matching. For example, should "#" be a token? Should "####" be a token too? Should I create types that carry semantic information so I can directly identify, for example, a character? Or should I just process it with flex the simplest way, then let the grammar defined in bison decide what is what?
With flex/bison, does it make sense to perform the analysis while parsing, or is it more elegant to parse first, then operate on the file again with some other tool?
This got me really confused. I am looking for an elegant, perhaps simple solution. Any guidelines?
By the way, about the programming language, I don't care much. For now I am using C because of flex/bison, but feel free to advise me on anything more practical as long as it is a widely used language.
It's very difficult to answer those questions without knowing what your parsing expectations are. That is, an example of a few lines of text does not provide a clear understanding of what the intended parse is; what the lexical and syntactic units are; what relationships you would like to extract; and so on.
However, a rough guess might be that you intend to produce a nested parse, where ##{i} indicates the nesting level (inversely), with i≥1, since a single # is not structural. That violates one principle of language design ("don't make the user count things which the computer could count more accurately"), which might suggest a structure more like:
#play {
  #act {
    #scene {
      #location: Elsinore. A platform before the castle.
      #direction: FRANCISCO at his post. Enter to him BERNARDO
      BERNARDO: Who's there?
      FRANCISCO: Nay, answer me: stand, and unfold yourself.
      BERNARDO: Long live the king!
      FRANCISCO: Bernardo?
or even something XML-like. But that would be a different language :)
The problem with parsing either of these with a classic scanner/parser combination is that the lexical structure is inconsistent; the first token on a line is special, but most of the file consists of unparsed text. That will almost inevitably lead to spreading syntactic information between the scanner and the parser, because the scanner needs to know the syntactic context in order to decide whether or not it is scanning raw text.
You might be able to avoid that issue. For example, you might require that a continuation line start with whitespace, so that every line not otherwise marked with #'s starts with the name of a character. That would be more reliable than recognizing a dialogue line just because it starts with the name of a character and a period, since it is quite possible for a character's name to be used in dialogue, even at the end of a sentence (which consequently might be the first word in a continuation line.)
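The continuation-line convention described above can be sketched as a line classifier that looks only at the start of each line. This is a rough illustration, not a definitive design: the function name `classify`, the `LEVELS` table, and the rule that a speaker's name ends at the first comma or period are all assumptions made for the example (a name such as "MR. SMITH" would break that rule).

```python
# Hypothetical mapping from the number of leading '#' characters to a line kind.
LEVELS = {4: "play", 3: "act", 2: "scene", 1: "direction"}

def classify(line):
    """Return (kind, speaker, text) for one line of the play format."""
    text = line.rstrip("\n")
    if not text.strip():
        return ("blank", None, "")
    # Assumed convention: a continuation line starts with whitespace.
    if text[0].isspace():
        return ("continuation", None, text.strip())
    if text.startswith("#"):
        hashes = len(text) - len(text.lstrip("#"))
        return (LEVELS.get(hashes, "unknown"), None, text[hashes:].strip())
    # Anything else is a speech: "NAME[, direction]. spoken text"
    head, _, speech = text.partition(". ")
    speaker = head.split(",", 1)[0].rstrip(".").strip()
    return ("speech", speaker, speech.strip())
```

With this convention the classifier never needs the character list: a character's name at the start of a wrapped line is harmless because the wrapped line is indented.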
If you do intend for dialogue lines to be distinguished by the fact that they start with a character name and some punctuation then you will definitely have to give the scanner access to the character list (as a sort of symbol table), which is a well-known but not particularly respected hack.
Consider the above a reflection about your second question ("What are the roles of the scanner and the parser?"), which does not qualify as an answer but hopefully is at least food for thought. As to your other questions, and recognizing that all of this is opinionated:
Is flex/bison too much for such a simple parser ? Should I just write it myself...
The fact that flex and bison are (potentially) more powerful than necessary to parse a particular language is a red herring. C is more powerful than necessary to write a factorial function -- you could easily do it in assembler -- but writing a factorial function is a good exercise in learning C. Similarly, if you want to learn how to write parsers, it's a good idea to start with a simple language; obviously, that's not going to exercise every option in the parser/scanner generators, but it will get you started. The question really is whether the language you're designing is appropriate for this style of parsing, not whether it is too simple.
With flex/bison, does it make sense to perform the analysis while parsing or is it more elegant to parse first, then operate on the file again with some other tool?
Either can be elegant, or disastrous; elegance has more to do with how you structure your thinking about the problem at hand. Having said that, it is often better to build a semantic structure (commonly referred to as an AST -- abstract syntax tree) during the parse phase and then analyse that structure using other functions.
Rescanning the input file is very unlikely to be either elegant or effective.
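As a minimal sketch of "build a structure first, analyse it afterwards", the following Python parses the format from the question into a small tree and computes the per-character counts in a separate function. The class names (`Act`, `Scene`, `Speech`) and the functions `parse_play` and `stats` are invented for this illustration, and the parser assumes well-formed input (an act before any scene, a scene before any speech, no continuation lines).

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Speech:
    speaker: str
    text: str

@dataclass
class Scene:
    title: str
    speeches: list = field(default_factory=list)

@dataclass
class Act:
    title: str
    scenes: list = field(default_factory=list)

def parse_play(source):
    """Parse phase: build a list of Act objects and do no counting."""
    acts = []
    for line in source.splitlines():
        if line.startswith("###") and not line.startswith("####"):
            acts.append(Act(line[3:].strip()))
        elif line.startswith("##") and not line.startswith("###"):
            acts[-1].scenes.append(Scene(line[2:].strip()))
        elif line.startswith("#") or not line.strip():
            continue  # play title, stage direction, or blank line
        else:
            head, _, text = line.partition(". ")
            speaker = head.split(",", 1)[0].strip()
            acts[-1].scenes[-1].speeches.append(Speech(speaker, text))
    return acts

def stats(acts):
    """Analysis phase: walk the tree and count speeches and words per speaker."""
    lines, words = Counter(), Counter()
    for act in acts:
        for scene in act.scenes:
            for speech in scene.speeches:
                lines[speech.speaker] += 1
                words[speech.speaker] += len(speech.text.split())
    return lines, words
```

Because the counting lives in `stats`, a second analysis (say, words per scene) would be another function over the same tree, with no change to the parser and no second pass over the file.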
I am trying to write a simple YAML parser and have read the spec from yaml.org. Before I start, I was wondering whether it is better to write a hand-rolled parser or to use a generator (flex/bison). I looked at libyaml (a C library), and it doesn't seem to use lex/yacc.
YAML (excluding the flow styles) seems to be more line-oriented, so is it easier to write a hand-rolled parser or to use flex/bison?
Thanks.
This answer is basically an answer to the question: "Should I roll my own parser or use parser generator?" and has not much to do with YAML. But nevertheless it will "answer" your question.
The question you need to ask is not "does this work with this given language/grammar", but "do I feel confident to implement this". The truth of the matter is that most formats you want to parse will just work with a generated parser. The other truth is that it is feasible to parse even complex languages with a simple hand written recursive descent parser.
I have written, among others, a recursive descent parser for EDDL (C and structured elements) and a bison/flex parser for INI. I picked these examples because they go against intuition: in both cases, exterior requirements dictated the decision.
Since I have established that both are possible on a technical level, why would you pick one over the other? This is a really hard question to answer, but here are some thoughts on the subject:
Writing a good lexer is really hard. In most cases it makes sense to use flex to generate the lexer. There is little use in hand-rolling your own lexer, unless you have really exotic input formats.
Using bison or a similar generator makes the grammar used for parsing explicitly visible. The primary gain here is that the developer maintaining your parser in five years will immediately see the grammar used and can compare it with any specs.
Using a recursive descent parser makes it quite clear what happens in the parser. This provides an easy means of gracefully handling hairy conflicts. You can write a simple if, instead of rearranging the entire grammar to be LALR(1).
While developing the parser you can "gloss over details" with a hand-written parser; with bison this is almost impossible. In bison the grammar must work or the generator will not do anything.
Bison is awesome at pointing out formal flaws in the grammar. Unfortunately, you are left alone to fix them. When hand-rolling a parser you will only find the flaws when the parser reads nonsense.
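To make the recursive descent points concrete, here is a small sketch in Python for arithmetic expressions. It is only an illustration: the names `tokenize`, `Parser`, and `evaluate` are invented, the lexer is a single regular expression rather than a flex-generated one, and the parser evaluates as it goes instead of building a tree. Note how unary minus is handled with a plain `if` in `factor`, the kind of local decision that would otherwise need precedence declarations or grammar changes in a generator.

```python
import re

# One number or one operator/parenthesis, with optional leading whitespace.
TOKEN = re.compile(r"\s*(\d+|[-+*/()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):    # expr := term (('+' | '-') term)*
        value = self.term()
        while self.peek() in ("+", "-"):
            op, rhs = self.take(), self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):    # term := factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            op, rhs = self.take(), self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):  # factor := '-' factor | '(' expr ')' | number
        tok = self.take()
        if tok == "-":                 # the "simple if" for unary minus
            return -self.factor()
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is not None and tok.isdigit():
            return int(tok)
        raise SyntaxError(f"unexpected token {tok!r}")

def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return value
```

Each method corresponds to one grammar rule (given in the comments), which is why the control flow is easy to follow; the price is that the grammar is only visible through those comments.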
This is not a definite answer for one or the other, but it points you in the right direction. Since it appears that you are writing the parser for fun, I think you should write both types of parser.
PS. Where to read about parsing theory?
Summary: the shortest is probably Antlr.
It's tempting to go to the Dragon Book to learn about parsing theory. But I don't think the Dragon Book and you have the same idea of what "theory" means. The Dragon Book describes how to build hand-written parsers, parser generators, etc., but you almost certainly want to use a parser-generation tool instead.
A few people have suggested Bison and Flex (or their older versions Yacc and Lex).
Those are the old stalwarts, but they are not very usable tools.
Their documentation is not poor per se; it's just that it doesn't quite help in dealing with the accidental complexity of using them.
Their internal data is not well encapsulated, and it is very hard to do anything advanced with them. As an example, in phc we still do not have correct line numbers because it is very difficult. They got better when we modified our grammar to include no-op statements, but that is an incredible hack which should not be necessary.
Ostensibly, Bison and Flex work together, but the interface is awkward. Worse, there are many versions of each, which only play nicely with some specific versions of the other. And, last I checked at least, the documentation of which versions went with which was pretty poor.
Writing a recursive descent parser is straightforward, but can be tedious. Antlr can do that for you, and it seems to be a pretty good toolset, with the benefit that what you learn on this project can be applied to lots of other languages and platforms (Antlr is very portable). There are also lots of existing grammars to learn from.
It's not clear what language you're working in, but some languages have excellent parsing frameworks. In particular, the Haskell Parsec library seems very elegant. If you use C++ you might be tempted to use Spirit. I found it very easy to get started with, and difficult (but still possible) to do advanced things with it. This matches my experience of C++ in general. I say I found it easy to start, but then I had already written a couple of parsers, and studied parsing in compiler class.
Long story short: Antlr, unless you've a very good reason.
It's always a good idea to read the Dragon Book. But be aware that if your language is not trivial, there's not really a "short" way to do it.
It rather depends on your language. Some very simple languages take very little parsing, so can be hand-coded; other languages use PEG generators such as Rats! (PEG stands for parsing expression grammar, which sits between a regex and an LR parser) or conventional parser generators such as Antlr and Yacc. Less formal languages require probabilistic techniques such as link grammars.
Write a Recursive Descent Parser. This is sometimes easier than YACC/BISON, and usually more intuitive.
Douglas Crockford has an approachable example of a parser written in JavaScript.
YACC: there are various implementations for different languages.
Good luck with your language ;-)
I used the GOLD Parsing System, because it seemed easier to use than ANTLR for a novice like me, while still being sufficiently full-featured for my needs. The web site includes documentation (including instructions on writing grammars, which is half the work) as well as software.
Try Bison for parsing and Flex for lexing.
The Bison definition of your language is in the form of a context-free grammar. The Wikipedia article on this topic is quite good, and is probably a good place to start.
Using a parser generator for your host language is the fastest way, combined with parsing theory from a book such as the Dragon Book or the Modern Compiler Implementation in {C,ML} series.
If you use C, yacc and the GNU version bison are the standard generators. Antlr is widely used in many languages, supporting Java, C#, and C++ as far as I know. There are also many others in almost any language.
My personal favorite at present is Menhir, an excellent parser generator for OCaml. ML-style languages (OCaml, Standard ML, etc.) are in general very good for building compilers and interpreters.
ANTLR is the easiest for someone without compiler theory background because of:
ANTLRWORKS (visual parsing and AST debugging)
The ANTLR book (no compiler theory background required)
Just 1 syntax for lexer and parser.
If you are happy with parsing expression grammars, writing your own parsers can be incredibly short. Here is a simple Packrat parser that takes a reasonable subset of PEG:
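import functools

class peg_parse:
    def __init__(self, grammar):
        # Store each alternative as a tuple so it is hashable (needed for the cache).
        self.grammar = {k: [tuple(l) for l in rules] for k, rules in grammar.items()}

    # Memoizing on (key, text, at) is what makes this a packrat parser.
    @functools.lru_cache(maxsize=None)
    def unify_key(self, key, text, at=0):
        # A key that is not a nonterminal is matched literally against the text.
        if key not in self.grammar:
            return (at + len(key), (key, [])) if text[at:].startswith(key) \
                else (at, None)
        rules = self.grammar[key]
        # Ordered choice: the first alternative that matches wins.
        for rule in rules:
            l, res = self.unify_rule(rule, text, at)
            if res is not None: return l, (key, res)
        return (0, None)

    def unify_rule(self, parts, text, tfrom):
        # A rule matches only if every part matches in sequence.
        results = []
        for part in parts:
            tfrom, res = self.unify_key(part, text, tfrom)
            if res is None: return tfrom, None
            results.append(res)
        return tfrom, results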
It accepts grammars in the form of a Python dictionary, with nonterminals as keys and a list of alternatives as each value; each alternative is a sequence of expressions. Below is an example grammar.
term_grammar = {
    'expr': [
        ['term', 'add_op', 'expr'],
        ['term']],
    'term': [
        ['fact', 'mul_op', 'term'],
        ['fact']],
    'fact': [
        ['digits'],
        ['(','expr',')']],
    'digits': [
        ['digit','digits'],
        ['digit']],
    'digit': [[str(i)] for i in list(range(10))],
    'add_op': [['+'], ['-']],
    'mul_op': [['*'], ['/']]
}
Here is the driver:
import sys

def main(to_parse):
    result = peg_parse(term_grammar).unify_key('expr', to_parse)
    assert (len(to_parse) - result[0]) == 0
    print(result[1])

if __name__ == '__main__': main(sys.argv[1])
Which can be invoked thus:
python3 parser.py '1+2'
('expr',
[('term',
[('fact',
[('digits', [('digit', [('1', [])])])])]),
('add_op', [('+', [])]),
('expr',
[('term', [('fact', [('digits', [('digit', [('2', [])])])])])])])
Parsing Expression Grammars take some care to write: the ordering of alternatives is important (unlike in a Context Free Grammar, the alternatives are an ordered choice, with the first choice being tried first, and the second being tried only if the first did not match). However, they can represent the deterministic (LR(k)) context-free languages, which cover most programming-language syntax; whether every context-free language has a PEG is, as far as I know, an open question.
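Why the ordering matters can be seen with a deliberately tiny sketch of ordered choice over literal strings. The helper `first_match` is hypothetical and far simpler than a real PEG engine; it only shows that once an earlier alternative matches, later ones are never tried.

```python
def first_match(alternatives, text):
    """PEG-style ordered choice over literals: return the first alternative
    that is a prefix of text, or None if none matches."""
    for alt in alternatives:
        if text.startswith(alt):
            return alt
    return None

# The shorter alternative comes first and shadows the longer one.
shadowed = first_match(["a", "ab"], "ab")
# Listing the longer alternative first lets it match the whole input.
longest_first = first_match(["ab", "a"], "ab")
```

Here `shadowed` is `"a"`, leaving `"b"` unconsumed, while `longest_first` is `"ab"`. This is why the example grammar lists `['term', 'add_op', 'expr']` before `['term']`.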
If, on the other hand, you decide to go with a Context Free Grammar, an Earley parser is one of the simplest to implement.
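For comparison, here is a sketch of an Earley recognizer in Python. It only answers yes or no (it builds no parse tree), and it assumes the grammar has no empty productions, which keeps the completion step simple. The function name `earley_recognize` and the grammar encoding (a dictionary mapping each nonterminal to a list of tuples of symbols) are choices made for this example. Unlike the recursive PEG parser, it copes with a left-recursive rule such as E -> E + T.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if tokens can be derived from start.
    An item is (lhs, rhs, dot, origin). Assumes no empty productions."""
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(tokens) + 1):
        worklist = list(chart[i])
        while worklist:
            lhs, rhs, dot, origin = worklist.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:
                    # Predict: start every production of the expected nonterminal.
                    for prod in grammar[sym]:
                        new = (sym, prod, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            worklist.append(new)
                elif i < len(tokens) and tokens[i] == sym:
                    # Scan: the expected terminal is the next token.
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Complete: advance every item that was waiting for lhs.
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[i]:
                            chart[i].add(new)
                            worklist.append(new)
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[len(tokens)])

# A left-recursive grammar: E -> E + T | T ; T -> 1 | 2
left_recursive = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("1",), ("2",)],
}
```

Calling `earley_recognize(left_recursive, "E", list("1+2"))` accepts the input, while `list("1+")` is rejected because no complete E item spans the whole input.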
I'm making an application that will parse commands in Scala. An example of a command would be:
todo get milk for friday
So the plan is to have a pretty smart parser break the line apart and recognize the command part and the fact that there is a reference to time in the string.
In general I need to make a tokenizer in Scala. So I'm wondering what my options are for this. I'm familiar with regular expressions but I plan on making an SQL like search feature also:
search todo for today with tags shopping
And I feel that regular expressions will be inflexible implementing commands with a lot of variation. This leads me to think of implementing some sort of grammar.
What are my options in this regard in Scala?
You want to search for "parser combinators". I have a blog post using this approach (http://cleverlytitled.blogspot.com/2009/04/shunting-yard-algorithm.html), but I think the best reference is this series of posts by Stefan Zeiger (http://szeiger.de/blog/2008/07/27/formal-language-processing-in-scala-part-1/).
Here are slides from a presentation I did in Sept. 2009 on Scala parser combinators. (http://sites.google.com/site/compulsiontocode/files/lambdalounge/ImplementingExternalDSLsUsingScalaParserCombinators.ppt) An implementation of a simple Logo-like language is demonstrated. It might provide some insights.
Scala has a parser library (scala.util.parsing.combinator) which enables one to write a parser directly from its EBNF specification. If you have an EBNF for your language, it should be easy to write the Scala parser. If not, you'd better first try to define your language formally.
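The idea behind parser combinators can be sketched in a few lines. The example below is in Python, to match the other code on this page, rather than Scala, and none of its names (`word`, `many_until`, `seq`, `optional`, `parse_todo`) come from scala.util.parsing.combinator; they are invented here. Each parser is a function from (tokens, position) to either (value, new position) or None, and larger parsers are built by combining smaller ones. The assumed grammar is: "todo" task-words [ "for" time-words ].

```python
def word(expected):
    """Parser that matches one specific token."""
    def parse(tokens, pos):
        if pos < len(tokens) and tokens[pos] == expected:
            return expected, pos + 1
        return None
    return parse

def many_until(stop_words):
    """Parser that collects one or more tokens up to a stop word or the end."""
    def parse(tokens, pos):
        end = pos
        while end < len(tokens) and tokens[end] not in stop_words:
            end += 1
        return (tokens[pos:end], end) if end > pos else None
    return parse

def seq(*parsers):
    """Run parsers one after another; fail if any of them fails."""
    def parse(tokens, pos):
        values = []
        for p in parsers:
            result = p(tokens, pos)
            if result is None:
                return None
            value, pos = result
            values.append(value)
        return values, pos
    return parse

def optional(parser):
    """Try a parser; on failure succeed with None and consume nothing."""
    def parse(tokens, pos):
        result = parser(tokens, pos)
        return result if result is not None else (None, pos)
    return parse

# command := "todo" task-words [ "for" time-words ]
todo_command = seq(
    word("todo"),
    many_until({"for"}),
    optional(seq(word("for"), many_until(set()))),
)

def parse_todo(line):
    tokens = line.split()
    result = todo_command(tokens, 0)
    if result is None or result[1] != len(tokens):
        return None
    (_, task, due), _ = result
    return {"task": " ".join(task), "due": " ".join(due[1]) if due else None}
```

For the command from the question, `parse_todo("todo get milk for friday")` yields a task of "get milk" and a due value of "friday". Interpreting "friday" as a date is a separate step that this sketch does not attempt.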
I have stumbled upon the following F77 yacc grammar: http://yaxx.cvs.sourceforge.net/viewvc/yaxx/yaxx/fortran/fortran.y?revision=1.3&view=markup.
How can I make a Fortran 77 parser out of this file using Happy?
Why is there some C?/C++? code in that .y file?
UPDATE: Thank you for your replies!
I've been playing with two fresh approaches for a while now:
extracting and modifying the parser from the source code package bundled with a paper titled Parametric Fortran,
writing a grammar from scratch with the help of BNFC.
I've got both to parse simple code excerpts already. I'll keep people in the know should something usable come into existence within this century ^__^" hehe.
P/S: Want to see whether I could gather enough momentum on my own to initiate a project for an automatic differentiation engine to replace a binary-only one we depend on for the time being. For entertainment at the initial stages: I'm watching Love Shuffle! It's a very enjoyable J-Drama! Highly recommendable ...
The C is the semantic action for reducing the stack when the syntax is read in. These actions are in C because the definition is intended for Bison/Yacc which produces a C source file.
If you want to use Happy, port the BNF to the Happy definition syntax and write your semantics in Haskell.
That is just the tip of the iceberg for getting anything useful, however.
If you don't have a copy already, invest in the Dragon Book (Compilers: Principles, Techniques & Tools by Aho, Lam, Sethi, Ullman - Pearson).
While the other answers are true in the general sense, in that you'll need to write your own actions to do anything meaningful, the Yacc definition that you linked to actually doesn't have any actions associated with the grammar rules. What it does is define the yyerror function and some code for extracting values from yylval based on the token type.
If you have no clue what yyerror/yylval are about you should read a bison/flex tutorial. The Dragon book is also a good resource if you're more serious about this. There are also some excellent handouts from a Stanford course on compilers floating around the Net, which are based on the book.
You'll need to build an AST, which can be constructed in a way equivalent to the C fragments in the Yacc file.
Use BNFC and write your own grammar from scratch! BNFC works wonders and you could do your parsing exactly as you desire.