I have a grammar which has the following productions:
S -> if e then S else S | while e do S | begin L end | s
L -> S ; L | S
I am supposed to construct the operator precedence parsing table for the above, but I'm a little confused about how to decide the precedence of the various terminals here. Until now we only worked with ordinary operators (like +, *, (, id, etc.). How do I decide the precedences in this case? I googled for how to parse an if-else grammar using an operator precedence parser, but couldn't find any link explaining it. I actually need to design the error correcting routines for parsing this grammar using an operator precedence parser and an SLR parser. Any help will be appreciated (it's a question from the book Compiler Design by Aho and Ullman)!
Thanks in advance!!
Answering my own question for people who want to learn: read this PDF. It presents a method for doing operator precedence parsing for general operators.
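To give a flavour of the method on this grammar (my own worked illustration, not the book's official solution): the "operators" here are simply all of the terminals (if, e, then, else, while, do, begin, end, s, ;). Two terminals separated by at most one nonterminal in a production body get equal precedence; a terminal followed by a nonterminal yields precedence to (⋖) every terminal that nonterminal can start with; a terminal preceded by a nonterminal takes precedence over (⋗) every terminal that nonterminal can end with. A few representative relations:

if ≐ e, e ≐ then, then ≐ else      (from S -> if e then S else S)
while ≐ e, e ≐ do                  (from S -> while e do S)
begin ≐ end                        (from S -> begin L end)
then ⋖ if, while, begin, s         (the S after then can start with these)
s ⋗ else, end ⋗ else, s ⋗ ;, end ⋗ ;   (an S before else or ; can end with s or end)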
I'm following along with Bob Nystrom's great book "Crafting Interpreters".
Please let me know if this question is too specific for this site - I've been trying for hours but couldn't figure this out on my own :)
In the chapter Compiling Expressions, in the function unary(), parsePrecedence(Precedence) is called with PREC_UNARY instead of PREC_UNARY + 1.
The book explains this is in order to enable "nesting" of unary operators. E.g.: --1.
However, in parsePrecedence(Precedence) no precedence level is checked before parsing prefix operators - it is checked only before infix ones. And unary is a prefix parser.
So passing PREC_UNARY or PREC_UNARY + 1 to parsePrecedence(Precedence) doesn't seem to make a difference. What am I missing?
The simple answer is that you are right: with this particular grammar, there is no difference because no binary (or postfix) operator has precedence PREC_UNARY, and the test that will be used is ≤.
All the same, the conventional answer is to use PREC_UNARY, because unary prefix operators are (necessarily) right-associative. This convention comes from the case of binary operators, where you parse the right-hand operand at the operator's precedence plus one for left-associative operators (the normal case) and at the operator's precedence itself for right-associative operators (exponentiation and assignment, for example). (Assignment is actually somewhat more complicated, but I personally think the solution proposed by Bob Nystrom is more complicated than necessary.)
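To make that convention concrete, here is a minimal self-contained sketch (Python rather than the book's C, with made-up names; a toy, not clox's actual code) of a Pratt-style parser where a left-associative infix operator parses its right operand at precedence + 1 and a right-associative one reuses its own precedence:

# Toy Pratt parser over a list of tokens (numbers, "+", "-", "^").
PREC_NONE, PREC_TERM, PREC_POWER, PREC_UNARY = range(4)

# infix operator -> (precedence, is_right_associative)
INFIX = {"+": (PREC_TERM, False), "-": (PREC_TERM, False),
         "^": (PREC_POWER, True)}

def parse(tokens, min_prec=PREC_NONE + 1):
    tok = tokens.pop(0)
    if tok == "-":  # prefix minus: right-associative, so we pass
        node = ("neg", parse(tokens, PREC_UNARY))  # PREC_UNARY, not +1
    else:
        node = tok  # a number literal
    while tokens and tokens[0] in INFIX:
        prec, right_assoc = INFIX[tokens[0]]
        if prec < min_prec:  # the infix loop is where precedence is tested,
            break            # which is why the prefix call above only matters
        op = tokens.pop(0)   # if some infix operator shares PREC_UNARY
        rhs = parse(tokens, prec if right_assoc else prec + 1)
        node = (op, node, rhs)
    return node

print(parse(["1", "-", "2", "-", "3"]))  # ('-', ('-', '1', '2'), '3')
print(parse(["2", "^", "3", "^", "4"]))  # ('^', '2', ('^', '3', '4'))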
Another conventional answer derives from the possibility of using a bottom-up operator precedence parser (Dijkstra's "shunting yard") instead of the top-down Pratt parser. Fully exploring bottom-up parsing goes well beyond the scope of this question; suffice it to say that the same principle applies with respect to associativity.
The grammar S -> a S a | a a generates all even-length strings of a's. We can devise a recursive-descent parser with backtrack for this grammar. If we choose to expand by production S -> aa first, then we shall only recognize the string aa. Thus, any reasonable recursive-descent parser will try S -> aSa first.
Show that this recursive-descent parser recognizes inputs aa, aaaa, and aaaaaaaa, but not aaaaaa.
The parser will first try to invoke match(a); S(); match(a); rather than match(a); match(a);, as described in the problem. Note that when you recursively invoke S() inside the block match(a); S(); match(a);, you have only invoked match(a) once so far; the 'a' symbol at the end has not yet been consumed.
I'll paraphrase myself from this other answer.
It's actually a property of the "singleton match strategy" usual in naive implementations of recursive descent parsers with backtracking.
Quoting this answer by Dr Adrian Johnstone:
The trick here is to realise that many backtracking parsers use what we call a singleton match strategy, in which as soon as the parse function for a rule finds a match, it returns. In general, parse functions need to return a set of putative matches. Just try working it through, and you'll see that a singleton match parser misses some of the possible derivations.
Also, the images available in this answer will help you visualize what's happening in the case you exemplified.
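To see this behaviour directly, here is a small runnable sketch of my own (Python, not from the book) of a singleton-match backtracking parser for S -> aSa | aa. Each call commits to the first alternative that matches instead of returning the set of all possible match lengths, and the result is exactly what the exercise describes:

# Recursive-descent recognizer for S -> a S a | a a that tries aSa first
# and returns only the FIRST match position it finds (singleton match).
def parse_S(s, pos):
    # Alternative 1: a S a
    if pos < len(s) and s[pos] == "a":
        mid = parse_S(s, pos + 1)
        if mid is not None and mid < len(s) and s[mid] == "a":
            return mid + 1
    # Alternative 2: a a (only tried if alternative 1 failed outright)
    if pos + 1 < len(s) and s[pos] == "a" and s[pos + 1] == "a":
        return pos + 2
    return None  # a full parser would return a SET of positions here

def recognizes(s):
    return parse_S(s, 0) == len(s)

for n in (2, 4, 6, 8):
    print("a" * n, recognizes("a" * n))
# prints: aa True, aaaa True, aaaaaa False, aaaaaaaa True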
As I understand it, the first condition for a recursive descent parser is that the given grammar should not be left recursive or non-deterministic. Here the given grammar is non-deterministic, so we need to convert it using left factoring; we get S -> aS', S' -> Sa | a, and I guess this will work.
OK, so here's a question: Given that Haskell allows you to define new operators with arbitrary operator precedence... how is it possible to actually parse Haskell source code?
You cannot know what operator precedences are set until you parse the source. But you cannot parse the source until you know the correct operator precedences. So... um, how?
Consider, for example, the expression
x *** y +++ z
Until we finish parsing the module, we don't know what other modules are imported, and hence what operators (and other identifiers) might be in scope. We certainly don't know their precedences yet. But the parser has to return something. Should it return
(x *** y) +++ z
Or should it return
x *** (y +++ z)
The poor parser has no way to know. This can only be determined once you hunt down the import that brings (+++) and (***) into scope, load that file off disk, and discover what the operator precedences are. Clearly the parser itself isn't going to do all that I/O; a parser just turns a stream of characters into an AST.
Clearly somebody somewhere has figured out how to do this. But I can't work it out... Any hints?
Quoting the page on GHC trac for the parser:
Infix operators are parsed as if they were all left-associative. The
renamer uses the fixity declarations to re-associate the syntax tree.
András Kovács's answer tells what's really done in GHC, but there's some history to this.
There was actually a somewhat hypothetical change from the Haskell 98 to the Haskell 2010 standard. In the former's BNF grammar, operator fixity and parsing were intertwined in such a way that you could in theory have some very strange interactions between the rules for fixity and the rules for when expressions and indentation blocks end. (For the latter two, the rules are essentially, "keep on going until you have to stop".)
In particular, you could redefine a local operator and its fixity such that a use of it belonged in the redefining inner where block exactly... when it didn't. So you got a parser paradox. I cannot find any of the old examples, but this may be one:
let (+) = (Prelude.+)
infix 9 + -- make the inner + high precedence and non-associative
in 2 + 3 + 4
-- ^ this + cannot parse here as the inner operator, which means
-- the let ... in ... expression should end automatically first,
-- but then it's the standard +, and its fixity says it should parse
-- as part of the inner expression...
In Haskell 2010 they officially changed that so that operator fixities are determined in a separate stage after the parsing proper.
So why was this a hypothetical change? Because all the compiler writers already did it the Haskell 2010 way, and always had, for their own sanity.
Summarising the comments so far, it seems the possibilities are these:
1. Return a parse tree where any infix operators are left as some kind of "list" structure, and then rearrange once precedences become known.
2. Pretend you know the operator precedences, and then rearrange the parse tree after the fact (sketched below).
3. Do a first parse that only reads imports and fixity declarations, load the imports, and then do a full parse with known precedences.
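As a toy illustration of option 2 (a Python sketch of my own; GHC's renamer does something analogous on Haskell syntax trees), here is a pass that takes a tree parsed as if every operator were left-associative and re-associates it once the real fixities are known:

# Nodes are (op, left, right) tuples; leaves are plain strings.
FIXITY = {"+++": 6, "***": 7}  # hypothetical precedences for the example

def reassoc(node):
    """Fix up a tree that was parsed all-left-associative.
    (Equal precedences and associativity handling omitted for brevity.)"""
    if isinstance(node, str):
        return node
    op, left, right = node
    left, right = reassoc(left), reassoc(right)
    # The all-left-associative parse buries earlier operators on the left.
    # If the left child's operator binds less tightly than ours, rotate:
    #   (ll lop lr) op right  ==>  ll lop (lr op right)
    if not isinstance(left, str):
        lop, ll, lr = left
        if FIXITY[lop] < FIXITY[op]:
            return (lop, ll, reassoc((op, lr, right)))
    return (op, left, right)

# "x *** y +++ z" parsed all-left-associative is already correct:
print(reassoc(("+++", ("***", "x", "y"), "z")))
# ('+++', ('***', 'x', 'y'), 'z')

# "x +++ y *** z" parsed all-left-associative needs the rotation:
print(reassoc(("***", ("+++", "x", "y"), "z")))
# ('+++', 'x', ('***', 'y', 'z'))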
I wrote a grammar for an LALR parser and I am stuck on an optional non-terminal. Consider for example C++ dereferencing, where you can write:
******expression;
Of course you can write:
expression;
And here is my problem: the dereference non-terminal is really optional, and this has such an impact on the grammar that the parser now sees it fitting (almost) everywhere, because, well, it might be empty.
Is there a common pattern for how I should rewrite the grammar to fix this?
I would also be grateful for pointers to a book or other resources that deal with "common problems & patterns when writing grammars".
First of all, the problem you are having is not the one you are claiming to have. Having a nullable (possibly empty) nonterminal does not mean that the parser will try to stick it everywhere. (I use the term “nullable” here to avoid confusion, because “optional” might refer to an optional occurrence of a nonterminal, as in x? where x is the nonterminal name). It just means that whenever you use that nonterminal in your grammar, the parser might skip over it or match with an empty word (details are according to the rules of the particular parsing algorithm, in your case LALR).
Secondly, the problem most probably is that the resulting grammar is ambiguous. My guess is that you used some kind of combination of right recursion for defining the nonterminal with the stars, and having an asterisk as a binary multiplication operator. (Feel free to update the question with a grammar fragment, then I might be able to offer more detailed help).
Thirdly, and mainly concerning your quest for general problems and patterns in grammars: usually people would not put the stars in one nonterminal and the expression in another, because ultimately you would want to transform your parse tree into an abstract syntax tree on which you probably intend to perform some calculations, in that case you would prefer to have a construction that says “dereference of a dereference of a dereference of an expression” rather than “three stars followed by an expression”. Again, the answer would have been less vague if you provided more details.
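For example (a sketch of my own, since the original grammar fragment was not posted), the usual pattern is to make dereference its own level of the expression grammar rather than a separate nullable nonterminal:

Expression : Expression '*' Unary   /* binary multiplication */
           | Unary ;
Unary      : '*' Unary              /* dereference */
           | Primary ;
Primary    : IDENTIFIER | '(' Expression ')' ;

Because Unary is never empty, each star is anchored to the operand it applies to, and the parser no longer has a nullable "dereference" nonterminal that could fit almost anywhere.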
I'm making my own JavaScript-based programming language (yeah, it is crazy, but it's for learning only... maybe?). Well, I'm reading about parsers, and the first pass is to convert the source code to tokens, like:
if(x > 5)
return true;
The tokenizer turns this into:
T_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_RPAREN ")"
T_IDENTIFIER "return"
T_TRUE "true"
T_TERMINATOR ";"
I don't know if my logic is correct so far. In my parser I go even further (or not?) and translate it to this (yeah, a multidimensional array):
T_IF "if"
T_EXPRESSION ...
T_IDENTIFIER "x"
T_GT ">"
T_NUMBER "5"
T_CLOSURE ...
T_IDENTIFIER "return"
T_TRUE "true"
I have some doubts:
Is my way better or worse than the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of interpreted all the time.
After I tokenize, what exactly do I need to do? I'm really lost at this pass!
Are there some good tutorials to learn how I can do it?
Well, that's it. Bye!
Generally, you want to separate the functions of the tokeniser (also called a lexer) from other stages of your compiler or interpreter. The reason for this is basic modularity: each pass consumes one kind of thing (e.g., characters) and produces another one (e.g., tokens).
So you’ve converted your characters to tokens. Now you want to convert your flat list of tokens to meaningful nested expressions, and this is what is conventionally called parsing. For a JavaScript-like language, you should look into recursive descent parsing. For parsing expressions with infix operators of different precedence levels, Pratt parsing is very useful, and you can fall back on ordinary recursive descent parsing for special cases.
Just to give you a more concrete example based on your case, I’ll assume you can write two functions: accept(token) and expect(token), which test the next token in the stream you’ve created. You’ll make a function for each type of statement or expression in the grammar of your language. Here’s Pythonish pseudocode for a statement() function, for instance:
def statement():
    if accept("if"):
        x = expression()
        y = statement()
        return IfStatement(x, y)
    elif accept("return"):
        x = expression()
        return ReturnStatement(x)
    elif accept("{"):
        xs = []
        while True:
            xs.append(statement())
            if not accept(";"):
                break
        expect("}")
        return Block(xs)
    else:
        error("Invalid statement!")
This gives you what’s called an abstract syntax tree (AST) of your program, which you can then manipulate (optimisation and analysis), output (compilation), or run (interpretation).
Most toolkits split the complete process into two separate parts:
lexer (aka. tokenizer)
parser (aka. grammar)
The tokenizer will split the input data into tokens. The parser will only operate on the token "stream" and build the structure.
Your question seems to be focused on the tokenizer. But your second solution mixes the grammar parser and the tokenizer into one step. Theoretically this is also possible, but for a beginner it is much easier to do it the same way as most other tools/frameworks: keep the steps separate.
To your first solution: I would tokenize your example like this:
T_KEYWORD_IF "if"
T_LPAREN "("
T_IDENTIFIER "x"
T_GT ">"
T_LITERAL "5"
T_RPAREN ")"
T_KEYWORD_RET "return"
T_KEYWORD_TRUE "true"
T_TERMINATOR ";"
In most languages keywords cannot be used as method names, variable names and so on. This is already reflected at the tokenizer level (T_KEYWORD_IF, T_KEYWORD_RET, T_KEYWORD_TRUE).
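To show roughly how little code such a tokenizer needs, here is a regex-based Python sketch of my own (the token names match the list above; this is illustrative, not production-quality lexing):

import re

# Order matters: keywords must be tried before the identifier rule.
TOKEN_SPEC = [
    ("T_KEYWORD_IF",   r"\bif\b"),
    ("T_KEYWORD_RET",  r"\breturn\b"),
    ("T_KEYWORD_TRUE", r"\btrue\b"),
    ("T_IDENTIFIER",   r"[A-Za-z_]\w*"),
    ("T_LITERAL",      r"\d+"),
    ("T_LPAREN",       r"\("),
    ("T_RPAREN",       r"\)"),
    ("T_GT",           r">"),
    ("T_TERMINATOR",   r";"),
    ("SKIP",           r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":  # drop whitespace
            yield match.lastgroup, match.group()

for token in tokenize("if(x > 5)\n    return true;"):
    print(token)
# ('T_KEYWORD_IF', 'if'), ('T_LPAREN', '('), ('T_IDENTIFIER', 'x'), ...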
The next level would take this stream and - by applying a formal grammar - build some data structure (often called an AST, for Abstract Syntax Tree) which might look like this:
IfStatement:
    Expression:
        BinaryOperator:
            Operator: T_GT
            LeftOperand:
                IdentifierExpression:
                    "x"
            RightOperand:
                LiteralExpression:
                    5
    IfBlock:
        ReturnStatement:
            ReturnExpression:
                LiteralExpression:
                    "true"
    ElseBlock: (empty)
Implementing the parser is usually done with some framework. Implementing something like that by hand, and efficiently, usually takes the better part of a semester at university. So you really should use some kind of framework.
The input for a grammar parser framework is usually a formal grammar in some kind of BNF. Your "if" part might look like this:
IfStatement: T_KEYWORD_IF T_LPAREN Expression T_RPAREN Statement ;
Expression: LiteralExpression | BinaryExpression | IdentifierExpression | ... ;
BinaryExpression: LeftOperand BinaryOperator RightOperand;
....
That's only to give you the idea. Parsing a real-world language like JavaScript correctly is not an easy task. But it is fun.
Is my way better or worse than the original way? Note that my code will be read and compiled (translated to another language, like PHP), instead of interpreted all the time.
What's the original way? There are many different ways to implement languages. I think yours is actually fine; I once tried to build a hack of a programming language myself that translated to C#. Many language compilers translate to an intermediate language; it's quite common.
After I tokenize, what exactly do I need to do? I'm really lost at this pass!
After tokenizing, you need to parse it. Use some good lexer/parser framework, such as Boost.Spirit, or Coco/R, or whatever. There are hundreds of them. Or you can implement your own lexer, but that takes time and resources. There are many ways to parse code; I generally rely on recursive descent parsing.
Next you need to do code generation. That's the most difficult part in my opinion. There are tools for that too, but you can do it manually if you want to. I tried to do it in my project, but it was pretty basic and buggy; there's some helpful code here and here.
Are there some good tutorials to learn how I can do it?
As I suggested earlier, use tools to do it. There are a lot of pretty good, well-documented parser frameworks. For further information, you can try asking some people who know about this stuff. DeadMG, over at the Lounge<C++>, is building a programming language called "Wide". You may try consulting him.
Let's say I have this statement in a programming language:
if (0 < 1) then
print("Hello")
The lexer will translate it into:
keyword: if
num: 0
op: <
num: 1
keyword: then
keyword: print
string: "Hello"
The parser will then take the information (aka "Token Stream") and make this:
if:
    expression:
        <:
            0, 1
    then:
        print:
            "Hello"
I don't know if this will help or not, but I hope it does.