How to turn a token stream into a parse tree [closed] - parsing

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
I have a lexer built that streams out tokens from in input but I'm not sure how to build the next step in the process - the parse tree. Does anybody have any good resources or examples on how to accomplish this?

I would really recommend http://www.antlr.org/ and of course the classic Dragon Compilers book.
For an easy language like JavaScript it's not hard to hand roll a recursive descent parser, but it's almost always easier to use a tool like yacc or antlr.
I think to step back to the basics of your question, you really want to study up on BNF-esque grammar syntax and pick a syntax for your target. If you have that, the parse tree should sort of fall out, being the 'instance' manifestation of that grammar.
Also, don't try to turn the creation of your parse tree into your final solution (like generating code, or what-not). It might seem do-able and more effecient; but invariably there will come a time when you'll really wish you had that parse tree 'as is' laying around.

You should investigate parser generator tools for your platform. A parser generator allows you to specify a context-free grammar for your language. The language consists of a number of rules which "reduce" a series of symbols into a new symbol. You can usually also specify precedence and associativity for different rules to eliminate ambiguity in the language. For instance, a very simple calculator language might look something like this:
%left PLUS, MINUS # low precedence, evaluated left-to-right
%left TIMES, DIV # high precedence, left-to-right
expr ::= INT
| expr PLUS expr
| expr MINUS expr
| expr TIMES expr
| expr DIV expr
| LEFT_PAREN expr RIGHT_PAREN
Usually, you can associate a bit of code with each rule to construct a new value (in this case an expression) from the other symbols in that rule. The parser generator will take in the grammar and produce code in your language that translates a token stream to a parse tree.
Most parser generators are language specific. ANTLR is well-known and supports C, C++, Objective C, Java, and Python. I've heard it's hard to use though. I've used bison for C/C++, CUP for Java, and ocamlyacc for OCaml, and they're all pretty good. If you are already using a lexer generator, you should look for a parser generator that is specifically compatible with it.

I believe a common a approach is to use a Finite State Machine. For example if you read an operand you go into a state where you next expect an operator, and you usually use the operator as the root node for the operands and so on.

As described above by Marcos Marin, a state machine that uses your language rules in BNF to parse your token list will do the trick if you want to do it yourself. Only, as said in above comment by Paul Hollingsworth, the easier way is to use a Pushdown-Automaton that has a simple FiFo memory stack.
Every class of token has a next expected token in your grammar, which also is represented in your state-machine. The stack is used to "remember" what the previous token class was, to reduce the required states (could be done without stack, but you would need a new state for every class and subclass split in the grammar tree).
The accepting state(s) would be (in natural languages and most programming languages too) the starting state, and maybe some other state in particular cases.
Antlr would be my suggestion if you want to use a tool (waaay faster and less extensive). Good luck!

Related

Can we use BNF for parsing AND lexing instead of regex?

With a Backus-Naur form grammar (BNF), we can specify the syntax of the programming language in order to parse it and produce an abstract syntax tree (AST).
<if> ::= "if" <expression> "then" <action> "end"
But we can also specify the tokens with a BNF grammar, as the first usage of BNF did for ALGOL-60:
<digit> ::= "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9"
<digit_with_zero> ::= <digit> | "0"
<integer> ::= <digit> | <digit_with_zero> <integer>
However, this usage of the BNF in order to lex (= produce a list of minimal meaningful units aka tokens) has been deprecated in favor of regular expressions (like [1-9][0-9]*).
It seems clear that the regex are much more concise.
It seems also that keeping the structure of an if statement is interesting for the interpreter or the compiler which will handle the AST produced by the parser, but keeping the structure of an integer (or a float) is not.
But do you agree that BNF could be used for both lexing and parsing?
And do you agree with the reasons which make regex much more suited for lexing?
Or are there others?
Regular expressions (in the mathematical sense) are equivalent in power to regular grammars and regular grammars can be written in BNF. So in that sense, it is clearly possible to write a full grammar for any context-free language in pure BNF.
Indeed, it is not even necessary to maintain the lexer/parser dichotomy. Some programmers find it convenient to use scannerless parsing (the article is not great but it has some interesting references), although many of these are based on the PEG formalism (which is not context-free) rather than BNF. (These are not the same despite the superficial resemblance.)
That said, it might not be convenient. In general, like most questions related to the structure of parsers, the answer is going to be based less on theory and more on a combination of practicality (with reference to a specific use case) and programmer prejudice.
As is well known, purity is rarely the most practical. Most real-life parser and scanner generators deviate from the pure theoretical models in order to provide mechanisms which are easier to use, easier to implement efficiently, or more powerful. For example, the character class syntax ([a-zA-Z]), which is almost universal in scanner generators, is a clear extension to regular expression syntax which deliberately avoids the need to explicitly list the entire contents of the set. One could say that the listing is implicit and unambiguous in the example I just presented, but most scanner generators also allow the use of classes like [[:alnum:]] ("alphanumeric symbols"), where the precise list of matched symbols is either locale-dependent or, in the Unicode world, extensible in the future. Regardless, this is obviously a useful extension.
While it is true that some aspects of regular expressions are more compact than their equivalent regular grammars -- especially the Kleene star operator, which in BNF requires an additional non-terminal and thus an additional name -- there are also cases where the ability to name subexpressions makes regular grammars more compact. Many scanner generators, starting with Lex, allowed named subpatterns as another regular expression extension. Furthermore, it is possible (with some caveats) to add the Kleene star and other operators to BNF as macros, and many parser generators do so. So there is a certain convergence of notation.
As you say, one difference between scanners and parsers is that the scanner generally makes no attempt to parse the substructure of a lexeme. But it is not true that no lexeme has substructure, and these substructures often do need to be analysed. The most notorious example is probably floating point numbers, which have to be analysed into a multiplier and an exponent, and the multiplier also analysed into an integer part and a fractional part. This analysis is commonly done using primitive functions available in the scanner implementation language (such as strtod for C scanners), but that does mean a second lexical scan. (Using the built-in avoids the considerable inconvenience of writing a mathematically correct string-to-internal converter, which is a much more difficult problem than it first appears. Rolling your own number converter is not recommended.)
Other lexemes with internal structure include string literals (which may contain escape sequences) and a large variety of more complex lexemes available in certain languages (dates and times, IP addresses, HTML tags, etc., etc.). All of these things tend to blur the boundary between scanning and parsing. Which is fine, because, as I said, the boundary is situational and not restrained by any absolute law of nature.
Still, it is certainly the case that many lexemes do not have any interesting internal structure, and furthermore that while it is easy to rewrite a regular expression as a regular grammar, it is considerably harder to rewrite it as an unambiguous, deterministic regular grammar, much less an LALR(1) regular grammar. (This is one of the reasons scannerless parsing is often associated with PEG, but it can also be solved with GLL or GLR parsers, at a slight loss of efficiency.)

Testing grammar for ambiguities

I'm writing a grammar for a formal language. Ideally I'd want that grammar to be unambiguous, but that might not be possible. In either case, I want to know about all possible ambiguities while developing the grammar. How can I do that?
So far, most of the time when I developed a language, I'd turn to Bison, write a LR(1) grammar for it, run Bison in verbose mode and have a look at all the shift-reduce and reduce-reduce conflicts it tells me about. Make sure that I agree with its choice in each case.
But now I'm in a project where Bison doesn't have a code generator for one of the required target languages, and where ANTLR is already being used. Plus the language isn't LR(1), and rewriting it as LR(1) would entail additional syntax checks after the parser is done, thus reducing the expressiveness of the grammar as a tool to describe the language.
So I'm now working with ANTLR, fed it my grammar, and all seems to be working well. But ANTLR doesn't seem to check for ambiguities at compile time. For example, the following grammar is ambiguous:
grammar test;
lst: '(' ')' {System.out.println("a");}
| '(' elts ')' {System.out.println("b");} ;
elts: elt (',' elt)* ;
elt: 'x' | /* empty */ ;
The input () could be interpreted as the empty list, or it could be interpreted as a list consisting of a single empty element. The generated parser chooses the former interpretation, but I'd like to be able to manually verify that choice.
Is there command line switch I can use to get ANTLR to tell me about ambiguities?
Or perhaps an option I can set in the grammar file?
Or should I use some other tool to check the grammar for ambiguities?
If so, is there one which can already read ANTLR syntax, or do I have to strip out all the actions and boil this down to BNF myself?
The ANTLRErrorListener.reportAmbiguity method suggests that ANTLR might be able to perform some ambiguity testing at runtime. But I guess that's only going to tell you whether the parsing of a given input is ambiguous. Is there some strategy how I could leverage this to detect all ambiguities, using a carefully selected set of inputs?
Well, as far as I know, ANTLR has no real option to check for ambiguity, other than the errors it produced IF you write an ambiguous grammar and feed an input that triggers the ambiguity. I do, however know a few tools which can check for ambiguity. They all have different syntax, and I don't know any tool which uses ANTLR grammar.
A software called AtoCC has a tool called KfG which can check for ambiguity.
ACLA (Ambiguity Checking with Language Approximations).
Context Free Grammar Tool.
Personally, I find tool 3 easiest to use, but is the most limited as well. It is important, however to note that none of the tools can be 100% sure; if the tools says you're grammar is ambiguous, it is ambiguous, however if they say you're grammar is unambiguous, they might still be ambiguous, as they have no way of testing an infinite number of ways, that your language can be written.
Hope this helps.

Are there different types of parsers? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 9 years ago.
Improve this question
Consider this simple grammar:
S -> a | b
The set of strings that may be generated by the grammar is:
{a, b}
Thus, a grammar generates a set of strings.
A parser for a grammar takes an input string and determines if the string could be generated by the grammar.
Thus, a parser is a recognizer for a grammar.
At least, that is one use of a parser.
But oftentimes a parser is used for other things. For example, a parser for a grammar may take an input string and create a tree structure which contains the input data and conforms to the grammar.
In that case the parser is not a recognizer, it is a data structure builder.
I conclude that there are different types of parsers.
Am I thinking logically? Are there indeed different types of parsers?
Has someone created a list of the different types of things that parsers have been created for?
Please let me know of any looseness or ambiguity in the above statements. I am trying to learn to be precise in statements about these concepts. For example, do you agree that "a grammar generates a set of strings"? Is that precise? correct?
No, I do not agree. A grammar does not generate anything. It is a set of rules which define the structure of something. A parser takes a grammar and some form of input and produces some form of output, whether that be an abstract syntax tree, an indication of whether the input is well-formed according to the grammar, or whatever else it might be. There are different types of parsers, but not because of what they produce. Rather, parsers are classified based on what kind of grammars they can accept and how that grammar is interpreted. For example, there are LL parsers and LR parsers, with various subtypes having additional restrictions on, for example, how many tokens of lookahead are needed.
Regarding a grammar "generating" something, what would this generate?
S -> ("a" | "b") S?
As soon as the grammar becomes non-trivial, finding all valid input starts to no longer make sense.

Writing correct LL(1) grammars?

I'm currently trying to write a (very) small interpreter/compiler for a programming language. I have set the syntax for the language, and I now need to write down the grammar for the language. I intend to use an LL(1) parser because, after a bit of research, it seems that it is the easiest to use.
I am new to this domain, but from what I gathered, formalising the syntax using BNF or EBNF is highly recommended. However, it seems that not all grammars are suitable for implementation using an LL(1) parser. Therefore, I was wondering what was the correct (or recommended) approach to writing grammars in LL(1) form.
Thank you for your help,
Charlie.
PS: I intend to write the parser using Haskell's Parsec library.
EDIT: Also, according to SK-logic, Parsec can handle an infinite lookahead (LL(k) ?) - but I guess the question still stands for that type of grammar.
I'm not an expert on this as I have only made a similar small project with an LR(0) parser. The general approach I would recommend:
Get the arithmetics working. By this, make rules and derivations for +, -, /, * etc and be sure that the parser produces a working abstract syntax tree. Test and evaluate the tree on different input to ensure that it does the arithmetic correctly.
Make things step by step. If you encounter any conflict, resolve it first before moving on.
Get simper constructs working like if-then-else or case expressions working.
Going further depends more on the language you're writing the grammar for.
Definetly check out other programming language grammars as an reference (unfortunately I did not find in 1 min any full LL grammar for any language online, but LR grammars should be useful as an reference too). For example:
ANSI C grammar
Python grammar
and of course some small examples in Wikipedia about LL grammars Wikipedia LL Parser that you probably have already checked out.
I hope you find some of this stuff useful
There are algorithms both for determining if a grammar is LL(k). Parser generators implement them. There are also heuristics for converting a grammar to LL(k), if possible.
But you don't need to restrict your simple language to LL(1), because most modern parser generators (JavaCC, ANTLR, Pyparsing, and others) can handle any k in LL(k).
More importantly, it is very likely that the syntax you consider best for your language requires a k between 2 and 4, because several common programming constructs do.
So first off, you don't necessarily want your grammar to be LL(1). It makes writing a parser simpler and potentially offers better performance, but it does mean that you're language will likely end up more verbose than commonly used languages (which generally aren't LL(1)).
If that's ok, your next step is to mentally step through the grammar, imagine all possibilities that can appear at that point, and check if they can be distinguished by their first token.
There's two main rules of thumb to making a grammar LL(1)
If of multiple choices can appear at a given point and they can
start with the same token, add a keyword in front telling you which
choice was taken.
If you have an optional or repeated part, make
sure it is followed by an ending token which can't appear as the first token of the optional/repeated part.
Avoid optional parts at the beginning of a production wherever possible. It makes the first two steps a lot easier.

Practical consequences of formal grammar power?

Every undergraduate Intro to Compilers course reviews the commonly-implemented subsets of context-free grammars: LL(k), SLR(k), LALR(k), LR(k). We are also taught that for any given k, each of those grammars is a subset of the next.
What I've never seen is an explanation of what sorts of programming language syntactic features might require moving to a different language class. There's an obvious practical motivation for GLR parsers, namely, avoiding an unholy commingling of parser and symbol table when parsing C++. But what about the differences between the two "standard" classes, LL and LR?
Two questions:
What (general) syntactic constructions can be parsed with LR(k) but not LL(k')?
In what ways, if any, do those constructions manifest as desirable language constructs?
There's a plausible argument for reducing language power by making k as small as possible, because a language requiring many, many tokens of lookahead will be harder for humans to parse, as well as "harder" for machines to parse. Question (2) implicitly asks if the same reasoning ends up holding between classes, as well as within a class.
edit: Here's one example to illustrate the sorts of answers I'm looking for, but for regular languages instead of context-free:
When describing a regular language, one usually gets three operators: +, *, and ?. Now, you can remove + without reducing the power of the language; instead of writing x+, you write xx*, and the effect is the same. But if x is some big and hairy expression, the two xs are likely to diverge over time due to human forgetfulness, yielding a syntactically correct regular expression that doesn't match the original author's intent. Thus, even though adding + doesn't strictly add power, it does make the notation less error-prone.
Are there constructs with similar practical (human?) effects that must be "removed" when switching from LR to LL?
Parsing (I claim) is a bit like sorting: a problem that was the focus of a lot of thought in the early days of CS, leading to a set of well-understood solutions with some nice theoretical results.
My claim is that the picture that we get (or give, for those of us who teach) in a compilers class is, to some degree, a beautiful answer to the wrong question.
To answer your question more directly, an LL(1) grammar can't parse all kinds of things that you might want to parse; the "natural" formulation of an 'if' with an optional 'else', for instance.
But wait! Can't I reformulate my grammar as an LL(1) grammar and then patch up the source tree by walking over it afterward? Sure you can! To some degree, this is what makes the question of what kind of grammar your parser uses largely moot.
Also, back when I was an undergraduate (1990-94), whitespace-sensitive grammars were clearly the work of the Devil; now, Python and Haskell's designs are bringing whitespace-sensitivity back into the light. Also, Packrat parsing says "to heck with your theoretical purity: I'm just going to define a parser as a set of rules, and I don't care what class my grammar belongs to." (paraphrased)
In summary, I would agree with what I believe to be your implied suggestion: in 2009, a clear understanding of the difference between the classes LL(k) and LR(k) is less important in itself than the ability to formulate and debug a grammar that makes your parser generator happy.
The difference between LL and LR is primarily in the lookahead mechanism. People generally say that LR parsers carry more "context". To see this practically, consider a recursive grammar definition with S as the starting symbol:
A -> Ax | x
B -> Ay
C -> Az
S -> B | C
When k is a small fixed value, parsing a string like xxxxxxy is a task better suited to an LR parser. However, these days the popular LL parsers such as ANTLR do not restrict k to such small values and most people no longer care.
I hope this is more or less in line with your question. Of course Knuth showed that any unambiguous context-free language can be recognized by some LR(1) grammar. However, in practice we are also concerned with translation.
As a side note: You might also enjoy reading http://www.antlr.org/article/needlook.html.
This is by no means proven, but I have always questioned whether LR-like parsing is really similar to how the brain works when reading certain notations. For example, when reading an English sentence it is pretty obvious that we read from left-to-right. But, consider the pattern bellow:
. . . . . | . . . . .
I rather expect that with short patterns such as this one people do not literally read "dot dot dot dot dot bar dot dot dot dot dot" from left to right, but rather processes the pattern in parallel or at least in some kind of fuzzy iterative manner. In other words, I do not believe we necessarily read all patterns in a left-to-right manner with the kind of linear lookahead that a LL/LR parser employs.
Furthermore, if we can describe any context-free language using an LR(1) grammar then it is clear that simply recognizing a string is not the same as "understanding" it.
well, for one, Left recursive definitions are impossible in LL(k) grammars (as far as i know), don't know about others. This doesn't make itimpossible to define other things just a massive pain to do otherwise. For instance, putting together expressions can be easy in a left-recursive language (in pseudocode):
lexer rule expression = other rules
| expression
| '(' expression ')';
As far as syntactically useful things that can be made with left-recursion, um does simpler grammars count as syntactically useful?
The capabilities of a language are not limited by its syntax and grammar.
It's possible to define any language feature with an LL(k) grammar, it just might not be very readable to humans.

Resources