Building AST in OCaml - parsing

I'm using OCaml to build a recursive descent parser for a subset of Scheme. Here's the grammar:
S -> a|b|c|(T)
T -> ST | Epsilon
So say I have:
type expr =
Num of int | String of string | Tuple of expr * expr
Pseudocode
These functions have to return expr type to build the AST
parseS lr =
if head matches '(' then
parseL lr
else
match tokens a, b, or c
Using First Set of S which are tokens and '(':
parseL lr =
if head matches '(' or the tokens then
Tuple (parseS lr, parseL lr)
else
match Epsilon
My question is "How do I return for the Epsilon part since I just can't return ()?". An OCaml function requires same return type and even if I leave blank for Epsilon part, OCaml still assumes unit type.

As far as I can see, your AST doesn't match your grammar.
You can solve the problem by having a specifically empty node in your AST type to represent the Epsilon in your grammar.
Or, you can change your grammar to factor out the Epsilon.
Here's an equivalent grammar with no Epsilon:
S -> a|b|c|()|(T)
T -> S | S T

Maybe instead of creating parser-functions manually it is better to use existent approaches: for example, LALR(1) ocamlyacc or camlp4 based LL(k) parsers ?

Related

Invalid grammar for a top-down recursive parser

I am trying to create grammar for a naive top-down recursive parser. As I understand the basic idea is to write a list of functions (top-down) that correspond to the productions in the grammar. Each function can call other functions (recursive).
The rules for a list include any number of numbers, but they must be separated by commas.
Here's an example of grammar I came up with:
LIST ::= NUM | LIST "," NUM
NUM ::= [0-9]+
Apparently this is incorrect, so my question is: why is this grammar not able to be parsed by a naive top-down recursive descent parser? What would be an example of a valid solution?
The issue is that for a LL(1) recursive decent parser such as this:
For any i and j (where j ≠ i) there is no symbol that can start both an instance of Wi and an instance of Wj.
This is because otherwise the parser will have errors knowing what path to take.
The correct solution can be obtained by left-factoring, it would be:
LIST ::= NUM REST
REST ::= "" | "," NUM
NUM ::= [0-9]+

How to deal with the implicit 'cat' operator while building a syntax tree for RE(use stack evaluation)

I am trying to build a syntax tree for regular expression. I use the strategy similar to arithmetic expression evaluation (i know that there are ways like recursive descent), that is, use two stack, the OPND stack and the OPTR stack, then to process.
I use different kind of node to represent different kind of RE. For example, the SymbolExpression, the CatExpression, the OrExpression and the StarExpression, all of them are derived from RegularExpression.
So the OPND stack stores the RegularExpression*.
while(c || optr.top()):
if(!isOp(c):
opnd.push(c)
c = getchar();
else:
switch(precede(optr.top(), c){
case Less:
optr.push(c)
c = getchar();
case Equal:
optr.pop()
c = getchar();
case Greater:
pop from opnd and optr then do operation,
then push the result back to opnd
}
But my primary question is, in typical RE, the cat operator is implicit.
a|bc represents a|b.c, (a|b)*abb represents (a|b)*.a.b.b. So when meeting an non-operator, how should i do to determine whether there's a cat operator or not? And how should i deal with the cat operator, to correctly implement the conversion?
Update
Now i've learn that there is a kind of grammar called "operator precedence grammar", its evaluation is similar to arithmetic expression's. It require that the pattern of the grammar cannot have the form of S -> ...AB...(A and B are non-terminal). So i guess that i just cannot directly use this method to parse the regular expression.
Update II
I try to design a LL(1) grammar to parse the basic regular expression.
Here's the origin grammar.(\| is the escape character, since | is a special character in grammar's pattern)
E -> E \| T | T
T -> TF | F
F -> P* | P
P -> (E) | i
To remove the left recursive, import new Variable
E -> TE'
E' -> \| TE' | ε
T -> FT'
T' -> FT' | ε
F -> P* | P
P -> (E) | i
now, for pattern F -> P* | P, import P'
P' -> * | ε
F -> PP'
However, the pattern T' -> FT' | ε has problem. Consider case (a|b):
E => TE'
=> FT' E'
=> PT' E'
=> (E)T' E'
=> (TE')T'E'
=> (FT'E')T'E'
=> (PT'E')T'E'
=> (iT'E')T'E'
=> (iFT'E')T'E'
Here, our human know that we should substitute the Variable T' with T' -> ε, but program will just call T' -> FT', which is wrong.
So, what's wrong with this grammar? And how should i rewrite it to make it suitable for the recursive descendent method.
1. LL(1) grammar
I don't see any problem with your LL(1) grammar. You are parsing the string
(a|b)
and you have gotten to this point:
(a T'E')T'E' |b)
The lookahead symbol is | and you have two possible productions:
T' ⇒ FT'
T' ⇒ ε
FIRST(F) is {(, i}, so the first production is clearly incorrect, both for the human and the LL(1) parser. (A parser without lookahead couldn't make the decision, but parsers without lookahead are almost useless for practical parsing.)
2. Operator precedence parsing
You are technically correct. Your original grammar is not an operator grammar. However, it is normal to augment operator precedence parsers with a small state machine (otherwise algebraic expressions including unary minus, for example, cannot be correctly parsed), and once you have done that it is clear where the implicit concatenation operator must go.
The state machine is logically equivalent to preprocessing the input to insert an explicit concatenation operator where necessary -- that is, between a and b whenever a is in {), *, i} and b is in {), i}.
You should take note that your original grammar does not really handle regular expressions unless you augment it with an explicit ε primitive to represent the empty string. Otherwise, you have no way to express optional choices, usually represented in regular expressions as an implicit operand (such as (a|), also often written as a?). However, the state machine is easily capable of detecting implicit operands as well because there is no conflict in practice between implicit concatenation and implicit epsilon.
I think just keeping track of the previous character should be enough. So if we have
(a|b)*abb
^--- we are here
c = a
pc = *
We know * is unary, so 'a' cannot be its operand. So we must have concatentation. Similarly at the next step
(a|b)*abb
^--- we are here
c = b
pc = a
a isn't an operator, b isn't an operator, so our hidden operator is between them. One more:
(a|b)*abb
^--- we are here
c = b
pc = |
| is a binary operator expecting a right-hand operand, so we do not concatenate.
The full solution probably involves building a table for each possible pc, which sounds painful, but it should give you enough context to get through.
If you don't want to mess up your loop, you could do a preprocessing pass where you insert your own concatenation character using similar logic. Can't tell you if that's better or worse, but it's an idea.

ANTLR does not automatically do lookahead matching?

I'm currently writing a simple grammar that requires operator precedence and mixed associativities in one expression. An example expression would be a -> b ?> C ?> D -> e, which should be parsed as (a -> (((b ?> C) ?> D) -> e). That is, the ?> operator is a high-precedence left-associative operator wheras the -> operator is a lower-precedence right-associative operator.
I'm prototyping the grammar in ANTLR 3.5.1 (via ANTLRWorks 1.5.2) and find that it can't handle the following grammar:
prog : expr EOF;
expr : term '->' expr
| term;
term : ID rest;
rest : '?>' ID rest
| ;
It produces rule expr has non-LL(*) decision due to recursive rule invocations reachable from alts 1,2 error.
The term and rest productions work fine in isolation when I tested it , so I assumed this happened because the parser is getting confused by expr. To get around that, I did the following refactor:
prog : expr EOF;
expr : term exprRest;
exprRest
: '->' expr
| ;
term : ID rest;
rest : DU ID rest
| ;
This works fine. However, because of this refactor I now need to check for empty exprRest nodes in the output parse tree, which is non-ideal. Is there a way to make ANTLR work around the ambiguity in the initial declaration of expr? I would of assumed that the generated parser would fully match term and then do a lookahead search for "->" and either continue parsing or return the lone term. What am I missing?
As stated, the problem is in this rule:
expr : term '->' expr
| term;
The problematic part is the term which is common to both alternatives.
LL(1) grammar doesn't allow this at all (unless term only matches zero tokens - but such rules would be pointless), because it cannot decide which alternative to use with only being able to see one token ahead (that's the 1 in LL(1)).
LL(k) grammar would only allow this if the term rule could match at most k - 1 tokens.
LL(*) grammar which ANTLR 3.5 uses does some tricks that allows it to handle rules that match any number of tokens (ANTLR author calls this "variable look-ahead").
However, one thing that these tricks cannot handle is if the rule is recursive, i.e. if it or any rules it invokes reference itself in any way (direct or indirect) - and that is exactly what your term rule does:
term : ID rest;
rest : '?>' ID rest
| ;
- the rule rest, referenced from term, recursively references itself. Thus, the error message
rule expr has non-LL(*) decision due to recursive rule invocations ...
The way to solve this limitation of LL grammars is called left-factoring:
expr : term
( '->' expr )?
;
What I did here is said "match term first" (since you want to match it in both alternatives, there's no point in deciding which one to match it in), then decide whether to match '->' expr (this can be decided just by looking at the very next token - if it's ->, use it - so this is even LL(1) decision).
This is very similar to what you came to as well, but the parse tree should look very much like you intended with the original grammar.

Find an equivalent LR grammar

I am trying to find an LR(1) or LR(0) grammar for pascal. Here is a part of my grammar which is not LR(0) as it has shift/reduce conflict.
EXPR --> AEXPR | AEXPR realop AEXPR
AEXPR --> TERM | SIGN TERM | AEXPR addop TERM
TERM --> TERM mulop FACTOR | FACTOR
FACTOR --> id | num | ( EXPR )
SIGN --> + | -
(Uppercase words are variables and lowercase words, + , - are terminals)
As you see , EXPR --> AEXPR | AEXPR realop AEXPR cause a shift/reduce conflict on LR(0) parsing. I tried adding a new variable , and some other ways to find an equivalent LR (0) grammar for this, but I was not successful.
I have two problems.
First: Is this grammar a LR(1) grammar?
Second: Is it possible to find a LR(0) equivalent for this grammar? what about LR(1) equivalent?
Yes, your grammar is an LR(1) grammar. [see note below]
It is not just the first production which causes an LR(0) conflict. In an LR(0) grammar, you must be able to predict whether to shift or reduce (and which production to reduce) without consulting the lookahead symbol. That's a highly restrictive requirement.
Nonetheless, there is a grammar which will recognize the same language. It's not an equivalent grammar in the sense that it does not produce the same parse tree (or any useful parse tree), so it depends on what you consider equivalent.
EXPR → TERM | EXPR OP TERM
TERM → num | id | '(' EXPR ')' | addop TERM
OP → addop | mulop | realop
The above works by ignoring operator precedence; it regards an expression as simply the regular language TERM (op TERM)*. (I changed + | - to addop because I couldn't see how your scanner could work otherwise, but that's not significant.)
There is a transformation normally used to make LR(1) expression grammars suitable for LL(1) parsing, but since LL(1) is allowed to examine the lookahead character, it is able to handle operator precedence in a normal way. The LL(1) "equivalent" grammar does not produce a parse tree with the correct operator associativity -- all operators become right-associative -- but it is possible to recover the correct parse tree by a simple tree rotation.
In the case of the LR(0) grammar, where operator precedence has been lost, the tree transformation would be almost equivalent to reparsing the input, using something like the shunting yard algorithm to create the true parse tree.
Note
I don't believe the grammar presented is the correct grammar, because it makes unary plus and minus bind less tightly than multiplication, with the result that -3*4 is parsed as -(3*4). As it happens, there is no semantic difference most of the time, but it still feels wrong to me. I would have written the grammar as:
EXPR → AEXPR | AEXPR realop AEXPR
AEXPR → TERM | AEXPR addop TERM
TERM → FACTOR | TERM mulop FACTOR
FACTOR → num | id | '(' EXPR ')' | addop FACTOR
which makes unary operators bind more tightly. (As above, I assume that addop is precisely + or -.)

Left recursion, associativity and AST evaluation

So I have been reading a bit on lexers, parser, interpreters and even compiling.
For a language I'm trying to implement I settled on a Recrusive Descent Parser. Since the original grammar of the language had left-recursion, I had to slightly rewrite it.
Here's a simplified version of the grammar I had (note that it's not any standard format grammar, but somewhat pseudo, I guess, it's how I found it in the documentation):
expr:
-----
expr + expr
expr - expr
expr * expr
expr / expr
( expr )
integer
identifier
To get rid of the left-recursion, I turned it into this (note the addition of the NOT operator):
expr:
-----
expr_term {+ expr}
expr_term {- expr}
expr_term {* expr}
expr_term {/ expr}
expr_term:
----------
! expr_term
( expr )
integer
identifier
And then go through my tokens using the following sub-routines (simplified pseudo-code-ish):
public string Expression()
{
string term = ExpressionTerm();
if (term != null)
{
while (PeekToken() == OperatorToken)
{
term += ReadToken() + Expression();
}
}
return term;
}
public string ExpressionTerm()
{
//PeekToken and ReadToken accordingly, otherwise return null
}
This works! The result after calling Expression is always equal to the input it was given.
This makes me wonder: If I would create AST nodes rather than a string in these subroutines, and evaluate the AST using an infix evaluator (which also keeps in mind associativity and precedence of operators, etcetera), won't I get the same result?
And if I do, then why are there so many topics covering "fixing left recursion, keeping in mind associativity and what not" when it's actually "dead simple" to solve or even a non-problem as it seems? Or is it really the structure of the resulting AST people are concerned about (rather than what it evaluates to)? Could anyone shed a light, I might be getting it all wrong as well, haha!
The shape of the AST is important, since a+(b*3) is not usually the same as (a+b)*3 and one might reasonably expect the parser to indicate which of those a+b*3 means.
Normally, the AST will actually delete parentheses. (A parse tree wouldn't, but an AST is expected to abstract away syntactic noise.) So the AST for a+(b*3) should look something like:
Sum
|
+---+---+
| |
Var Prod
| |
a +---+---+
| |
Var Const
| |
b 3
If you language obeys usual mathematical notation conventions, so will the AST for a+b*3.
An "infix evaluator" -- or what I imagine you're referring to -- is just another parser. So, yes, if you are happy to parse later, you don't have to parse now.
By the way, showing that you can put tokens back together in the order that you read them doesn't actually demonstrate much about the parser functioning. You could do that much more simply by just echoing the tokenizer's output.
The standard and easiest way to deal with expressions, mathematical or other, is with a rule hierarchy that reflects the intended associations and operator precedence:
expre = sum
sum = addend '+' sum | addend
addend = term '*' addend | term
term = '(' expre ')' | '-' integer | '+' integer | integer
Such grammars let the parse or abstract trees be directly evaluatable. You can expand the rule hierarchy to include power and bitwise operators, or make it part of the hierarchy for logical expressions with and or and comparisons.

Resources