Parse grammar alternating and repeating - parsing

I was able to add support to my parser's grammar for alternating characters (e.g. ababa or baba) by following along with this question.
I'm now looking to extend that by allowing repeats of characters.
For example, I'd like to be able to support abaaabab and aababaaa as well. In my particular case, only the a is allowed to repeat but a solution that allows for repeating b's would also be useful.
Given the rules from the other question:
expr ::= A | B
A ::= "a" B | "a"
B ::= "b" A | "b"
... I tried extending it to support repeats, like so:
expr ::= A | B
# support 1 or more "a"
A_one_or_more ::= A_one_or_more "a" | "a"
A ::= A_one_or_more B | A_one_or_more
B ::= "b" A | "b"
... but that grammar is ambiguous. Is it possible for this to be made unambiguous, and if so could anyone help me disambiguate it?
I'm using the lemon parser which is an LALR(1) parser.

The point of parsing, in general, is to parse; that is, determine the syntactic structure of an input. That's significantly different from simply verifying that an input belongs to a language.
For example, the language which consists of arbitrary repetitions of a and b can be described with the regular expression (a|b)*, which can be written in BNF as
S ::= /* empty */ | S a | S b
But that probably does not capture the syntactic structure you are trying to define. On the other hand, since you don't specify that structure, it is hard to know.
Here are a couple more possibilities, which build different parse trees:
S ::= E | S E
E ::= A b | E b
A ::= a | A a
S ::= E | S E
E ::= A B
A ::= a | A a
B ::= b | B b
When writing a grammar to parse a language, it is useful to start by drawing your proposed parse trees. Usually, you can write the grammar directly from the form of the trees, which shows that a formal grammar is primarily a documentation tool, since it clearly describes the language in a way that informal descriptions cannot. Using a parser generator to turn that grammar into a parser ensures that the parser implements the described language. Or, at least, that is the goal.
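To make the difference between recognizing and parsing concrete, here is a minimal Python sketch. It assumes the second grammar above (S ::= E | S E, E ::= A B) and uses regular expressions as a shortcut instead of a generated parser. The names recognize and parse_groups are made up for illustration.

```python
import re

def recognize(text):
    """Membership test only: is text in (a|b)*? No structure is returned."""
    return re.fullmatch(r"[ab]*", text) is not None

def parse_groups(text):
    """Return the structure described by S ::= E | S E, E ::= A B:
    a list of (run of a's, run of b's) pairs, or None if text does not match."""
    if re.fullmatch(r"(?:a+b+)+", text) is None:
        return None
    # Each tuple corresponds to one E node: (A, B).
    return re.findall(r"(a+)(b+)", text)

print(recognize("aabab"))     # membership only, no tree
print(parse_groups("aabab"))  # one (A, B) pair per E
print(parse_groups("aba"))    # ends in "a", so not in this grammar's language
```

Both functions accept "aabab", but only the second one says anything about how the input is grouped, which is the part a parse tree is supposed to capture.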

Here is a nice tool for checking your grammar online http://smlweb.cpsc.ucalgary.ca/start.html. It actually accepts the grammar you provided as a valid LALR(1) grammar.
A different LALR(1) grammar that allows repeating a's would be:
S ::= "a" S | "a" | "b" A | "b"
A ::= "a" S | "a"

Related

LR(1) Parser: Why adding an epsilon production makes r/r or s/r conflicts

I wanted to make a reader which reads configuration files similar to INI files on MS Windows.
It is an exercise to teach myself to use a lexer/parser generator which I made.
The grammar is:
%lexer
HEADER ::= "\\[[0-9a-zA-Z]+\\]"
TRUE ::= "yes|true"
FALSE ::= "no|false"
ASSIGN ::= "="
OPTION_NAME ::= "[a-zA-Z][0-9a-zA-Z]*"
INT ::= "[0-9]+"
STRING ::= "\"(\\\"|[^\"])*\""
CODE ::= "<{(.*)}>"
BLANK ::= "[ \t\f]+" :ignore
COMMENT ::= "#[^\n\r]*(\r|\n)?" :ignore
NEWLINE ::= "\r|\n"
%parser
Options ::= OptionGroup Options | OptionGroup | #epsilon#
OptionGroup ::= HEADER NEWLINE OptionsList
OptionsList ::= Option NEWLINE OptionsList | Option
Option ::= OPTION_NAME ASSIGN OptionValue
OptionValue ::= TRUE | FALSE | INT | STRING | CODE
The problem lies in the #epsilon# production. I added it because I want my reader to also accept empty files. But I'm getting conflicts when 'OptionsList' or 'OptionGroup' contains an epsilon production. I tried rearranging elements in the productions, but I only get conflicts (r/r or s/r, depending on what I did) unless I remove the epsilon completely from my grammar. That removes the problem, but by my logic one of 'OptionsList' or 'OptionGroup' should contain an epsilon; otherwise my goal of accepting empty files is not met.
My parser generator uses the LR(1) method, so I thought I could use epsilon productions in my grammar. It seems I'm good at writing generators, but not at constructing error-free grammars :(.
Should I forget about epsilons? Or is my grammar accepting empty inputs even when there is no epsilon production?
Your Options production allows an Options to be a sequence of OptionGroups, starting with either an empty list or a list consisting of a single element. That's obviously ambiguous, because a list of exactly one OptionGroup could be either:
1. The base case OptionGroup, or
2. The base case #epsilon# with the addition of an OptionGroup.
In short, instead of
Options ::= OptionGroup Options | OptionGroup | #epsilon#
you need
Options ::= OptionGroup Options | #epsilon#
which matches exactly the same set of sentences, but unambiguously.
In general terms, you are usually better off writing left-recursive rules for bottom-up parsers. So I would have written
Options ::= Options OptionGroup | #epsilon#
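The ambiguity can be checked by counting derivations. The following Python sketch is only an illustration (the function names are invented, and it counts parse trees by hand rather than running a parser generator): it counts how many ways a list of n OptionGroups can be derived under the original rule and under the corrected right-recursive rule.

```python
def count_ambiguous(n):
    """Derivations of n OptionGroups under
    Options ::= OptionGroup Options | OptionGroup | #epsilon#"""
    if n == 0:
        return 1                    # only #epsilon# derives the empty list
    ways = count_ambiguous(n - 1)   # alternative: OptionGroup Options
    if n == 1:
        ways += 1                   # alternative: a lone OptionGroup
    return ways

def count_unambiguous(n):
    """Derivations of n OptionGroups under
    Options ::= OptionGroup Options | #epsilon#"""
    return 1 if n == 0 else count_unambiguous(n - 1)

for n in range(4):
    print(n, count_ambiguous(n), count_unambiguous(n))
```

Any non-empty list has two derivations under the original rule (the last OptionGroup can end the list directly or be followed by #epsilon#), which is exactly the conflict an LR(1) generator reports. The corrected rule always gives one.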

Find an equivalent LR grammar

I am trying to find an LR(1) or LR(0) grammar for Pascal. Here is a part of my grammar which is not LR(0), as it has a shift/reduce conflict.
EXPR --> AEXPR | AEXPR realop AEXPR
AEXPR --> TERM | SIGN TERM | AEXPR addop TERM
TERM --> TERM mulop FACTOR | FACTOR
FACTOR --> id | num | ( EXPR )
SIGN --> + | -
(Uppercase words are variables and lowercase words, + , - are terminals)
As you see, EXPR --> AEXPR | AEXPR realop AEXPR causes a shift/reduce conflict in LR(0) parsing. I tried adding a new variable, and some other ways, to find an equivalent LR(0) grammar for this, but I was not successful.
I have two problems.
First: Is this grammar an LR(1) grammar?
Second: Is it possible to find an LR(0) equivalent for this grammar? What about an LR(1) equivalent?
Yes, your grammar is an LR(1) grammar. [see note below]
It is not just the first production which causes an LR(0) conflict. In an LR(0) grammar, you must be able to predict whether to shift or reduce (and which production to reduce) without consulting the lookahead symbol. That's a highly restrictive requirement.
Nonetheless, there is a grammar which will recognize the same language. It's not an equivalent grammar in the sense that it does not produce the same parse tree (or any useful parse tree), so it depends on what you consider equivalent.
EXPR → TERM | EXPR OP TERM
TERM → num | id | '(' EXPR ')' | addop TERM
OP → addop | mulop | realop
The above works by ignoring operator precedence; it regards an expression as simply the regular language TERM (op TERM)*. (I changed + | - to addop because I couldn't see how your scanner could work otherwise, but that's not significant.)
There is a transformation normally used to make LR(1) expression grammars suitable for LL(1) parsing, but since LL(1) is allowed to examine the lookahead character, it is able to handle operator precedence in a normal way. The LL(1) "equivalent" grammar does not produce a parse tree with the correct operator associativity -- all operators become right-associative -- but it is possible to recover the correct parse tree by a simple tree rotation.
In the case of the LR(0) grammar, where operator precedence has been lost, the tree transformation would be almost equivalent to reparsing the input, using something like the shunting yard algorithm to create the true parse tree.
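As a rough illustration of that last step, here is a small shunting-yard style sketch in Python. It assumes the flat TERM (op TERM)* sequence has already been collected into a list of alternating operands and operators. The PRECEDENCE table and the name build_tree are hypothetical, and relational operators are simply treated as the lowest-precedence left-associative level, which is a simplification of the Pascal rules.

```python
# Hypothetical precedence levels: realop < addop < mulop.
PRECEDENCE = {"<": 1, "=": 1, "+": 2, "-": 2, "*": 3, "/": 3}

def build_tree(flat):
    """flat is [operand, op, operand, op, ...].
    Returns nested tuples of the form (op, left, right)."""
    operands, operators = [], []

    def reduce_top():
        # Combine the two most recent operands under the most recent operator.
        op = operators.pop()
        right = operands.pop()
        left = operands.pop()
        operands.append((op, left, right))

    for index, item in enumerate(flat):
        if index % 2 == 0:
            operands.append(item)
        else:
            # Reduce while the stacked operator binds at least as tightly,
            # which makes every operator left-associative.
            while operators and PRECEDENCE[operators[-1]] >= PRECEDENCE[item]:
                reduce_top()
            operators.append(item)
    while operators:
        reduce_top()
    return operands[0]

print(build_tree(["a", "+", "b", "*", "c"]))
print(build_tree(["a", "-", "b", "-", "c"]))
```

The point is only that precedence has to be reintroduced somewhere. If the grammar does not encode it, a second pass like this one ends up doing the real parsing.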
Note
I don't believe the grammar presented is the correct grammar, because it makes unary plus and minus bind less tightly than multiplication, with the result that -3*4 is parsed as -(3*4). As it happens, there is no semantic difference most of the time, but it still feels wrong to me. I would have written the grammar as:
EXPR → AEXPR | AEXPR realop AEXPR
AEXPR → TERM | AEXPR addop TERM
TERM → FACTOR | TERM mulop FACTOR
FACTOR → num | id | '(' EXPR ')' | addop FACTOR
which makes unary operators bind more tightly. (As above, I assume that addop is precisely + or -.)

Left recursion, associativity and AST evaluation

So I have been reading a bit on lexers, parser, interpreters and even compiling.
For a language I'm trying to implement I settled on a Recursive Descent Parser. Since the original grammar of the language had left-recursion, I had to slightly rewrite it.
Here's a simplified version of the grammar I had (note that it's not any standard format grammar, but somewhat pseudo, I guess, it's how I found it in the documentation):
expr:
-----
expr + expr
expr - expr
expr * expr
expr / expr
( expr )
integer
identifier
To get rid of the left-recursion, I turned it into this (note the addition of the NOT operator):
expr:
-----
expr_term {+ expr}
expr_term {- expr}
expr_term {* expr}
expr_term {/ expr}
expr_term:
----------
! expr_term
( expr )
integer
identifier
And then go through my tokens using the following sub-routines (simplified pseudo-code-ish):
public string Expression()
{
    string term = ExpressionTerm();
    if (term != null)
    {
        while (PeekToken() == OperatorToken)
        {
            term += ReadToken() + Expression();
        }
    }
    return term;
}
public string ExpressionTerm()
{
    // PeekToken and ReadToken accordingly, otherwise return null
}
This works! The result after calling Expression is always equal to the input it was given.
This makes me wonder: If I would create AST nodes rather than a string in these subroutines, and evaluate the AST using an infix evaluator (which also keeps in mind associativity and precedence of operators, etcetera), won't I get the same result?
And if I do, then why are there so many topics covering "fixing left recursion, keeping in mind associativity and what not" when it's actually "dead simple" to solve or even a non-problem as it seems? Or is it really the structure of the resulting AST people are concerned about (rather than what it evaluates to)? Could anyone shed a light, I might be getting it all wrong as well, haha!
The shape of the AST is important, since a+(b*3) is not usually the same as (a+b)*3 and one might reasonably expect the parser to indicate which of those a+b*3 means.
Normally, the AST will actually delete parentheses. (A parse tree wouldn't, but an AST is expected to abstract away syntactic noise.) So the AST for a+(b*3) should look something like:
       Sum
        |
    +---+---+
    |       |
   Var    Prod
    |       |
    a   +---+---+
        |       |
       Var    Const
        |       |
        b       3
If your language obeys usual mathematical notation conventions, so will the AST for a+b*3.
An "infix evaluator" -- or what I imagine you're referring to -- is just another parser. So, yes, if you are happy to parse later, you don't have to parse now.
By the way, showing that you can put tokens back together in the order that you read them doesn't actually demonstrate much about the parser functioning. You could do that much more simply by just echoing the tokenizer's output.
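To see why the shape matters, here is a minimal Python sketch with hand-built trees. The node classes Const, Var, Sum and Prod and the evaluate function are illustrative names, not taken from any particular library. The same three leaves give different results depending on how they are grouped.

```python
from dataclasses import dataclass

@dataclass
class Const:
    value: int

@dataclass
class Var:
    name: str

@dataclass
class Sum:
    left: object
    right: object

@dataclass
class Prod:
    left: object
    right: object

def evaluate(node, env):
    """Evaluate a tree bottom-up; env maps variable names to values."""
    if isinstance(node, Const):
        return node.value
    if isinstance(node, Var):
        return env[node.name]
    if isinstance(node, Sum):
        return evaluate(node.left, env) + evaluate(node.right, env)
    if isinstance(node, Prod):
        return evaluate(node.left, env) * evaluate(node.right, env)
    raise TypeError(f"unknown node: {node!r}")

env = {"a": 1, "b": 2}
tree_1 = Sum(Var("a"), Prod(Var("b"), Const(3)))   # a+(b*3)
tree_2 = Prod(Sum(Var("a"), Var("b")), Const(3))   # (a+b)*3
print(evaluate(tree_1, env), evaluate(tree_2, env))
```

Note that the evaluator never looks at precedence or parentheses. All of that information is carried by the tree, which is why the parser has to get the tree right.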
The standard and easiest way to deal with expressions, mathematical or other, is with a rule hierarchy that reflects the intended associations and operator precedence:
expre = sum
sum = addend '+' sum | addend
addend = term '*' addend | term
term = '(' expre ')' | '-' integer | '+' integer | integer
Such grammars let the parse trees or abstract syntax trees be evaluated directly. You can expand the rule hierarchy to include power and bitwise operators, or make it part of the hierarchy for logical expressions with "and", "or", and comparisons.
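As a sketch of how such a hierarchy maps onto a recursive descent parser, here is one Python method per rule of the grammar above, evaluating as it goes. The tokenizer and the names Parser and evaluate are assumptions made for this example. Characters the tokenizer does not know are silently skipped, which a real implementation would want to report.

```python
import re

def tokenize(text):
    # Integers and the symbols used by the grammar; anything else is skipped.
    return re.findall(r"\d+|[-+*()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        tok = self.peek()
        if tok is None or (expected is not None and tok != expected):
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    def expre(self):                  # expre = sum
        return self.sum()

    def sum(self):                    # sum = addend '+' sum | addend
        left = self.addend()
        if self.peek() == "+":
            self.take("+")
            return left + self.sum()
        return left

    def addend(self):                 # addend = term '*' addend | term
        left = self.term()
        if self.peek() == "*":
            self.take("*")
            return left * self.addend()
        return left

    def term(self):                   # term = '(' expre ')' | sign integer | integer
        tok = self.take()
        if tok == "(":
            value = self.expre()
            self.take(")")
            return value
        sign = 1
        if tok in ("+", "-"):
            sign = -1 if tok == "-" else 1
            tok = self.take()
        if not tok.isdigit():
            raise SyntaxError(f"expected integer, got {tok!r}")
        return sign * int(tok)

def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expre()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value

print(evaluate("2+3*4"), evaluate("(2+3)*4"), evaluate("-2*3"))
```

The rules as written are right-recursive, so the operators group to the right. That is harmless for + and *, but it would give the wrong answer for - or /, so those would need a loop or a tree rotation instead.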

Entry rule position convention in BNF?

Is it mandatory for the first (topmost) rule of a BNF (or EBNF) grammar to represent the entry point? For example, from the Wikipedia BNF page, the US Postal address grammar below has <postal-address> as the first derivation rule, and also the entry point:
<postal-address> ::= <name-part> <street-address> <zip-part>
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
| <personal-part> <name-part>
<personal-part> ::= <first-name> | <initial> "."
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""
Am I at liberty to put the <postal-address> rule in, say, the second position, and so provide the grammar in the following alternate form:
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
| <personal-part> <name-part>
<postal-address> ::= <name-part> <street-address> <zip-part>
<personal-part> ::= <first-name> | <initial> "."
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""
No, this isn't a requirement. It is just a convention used by some.
In practice, one must designate "the" goal rule. We have a set of tools in which one identifies the nonterminal which is the goal nonterminal, and you can provide the rules (including goal rules) in any order. How you designate that may be outside the grammar formalism, or may be a special rule included in the grammar.
As a practical matter, this is not a big deal (OK, so some tool insists you put all the goal rules first; that's not actually hard to do) and it's not hard for a tool to handle nicely (it checks the left-hand side of each grammar rule to see if it matches the goal nonterminal).
Of course, you need to know which way your tool works, but that takes about 2 minutes to figure out.
Some tools only allow one goal rule. As a practical matter, real (re-engineering, see my bio) parsers often find it useful to allow multiple rules (consider parsing COBOL as "whole programs" and as "COPYLIBS"), so you end up writing (clumsily IMHO):
G = G1 | G2 | G3 ... ;
G1 = ...
in this case. Still not a big deal. None of these constraints hurt expressiveness or in fact cost you much engineering time.
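To illustrate that rule order carries no information by itself, here is a small Python sketch using the reordered postal grammar from the question. It is a hypothetical heuristic, not how any particular tool designates its goal: it lists the nonterminals that no other rule refers to, which are the only plausible entry points. The function name goal_candidates and the dictionary encoding are made up for the example.

```python
def goal_candidates(rules):
    """rules maps each nonterminal to its alternatives (lists of symbols).
    Returns the nonterminals that no *other* rule refers to."""
    referenced = set()
    for head, alternatives in rules.items():
        for alternative in alternatives:
            for symbol in alternative:
                if symbol != head:      # ignore direct self-recursion
                    referenced.add(symbol)
    return [head for head in rules if head not in referenced]

# The postal grammar with <postal-address> in second position.
rules = {
    "name-part": [["personal-part", "last-name", "opt-suffix-part", "EOL"],
                  ["personal-part", "name-part"]],
    "postal-address": [["name-part", "street-address", "zip-part"]],
    "personal-part": [["first-name"], ["initial", "."]],
    "street-address": [["house-num", "street-name", "opt-apt-num", "EOL"]],
    "zip-part": [["town-name", ",", "state-code", "ZIP-code", "EOL"]],
    "opt-suffix-part": [["Sr."], ["Jr."], ["roman-numeral"], []],
    "opt-apt-num": [["apt-num"], []],
}

print(goal_candidates(rules))
```

If a grammar has several unreferenced nonterminals, or a goal that is also used recursively, this heuristic is not enough, which is why tools ask you to name the goal explicitly.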

Producing Expressions from This Grammar with Recursive Descent

I've got a simple grammar. Actually, the grammar I'm using is more complex, but this is the smallest subset that illustrates my question.
Expr ::= Value Suffix
| "(" Expr ")" Suffix
Suffix ::= "->" Expr
| "<-" Expr
| Expr
| epsilon
Value matches identifiers, strings, numbers, et cetera. The Suffix rule is there to eliminate left-recursion. This matches expressions such as:
a -> b (c -> (d) (e))
That is, a graph where a goes to both b and the result of (c -> (d) (e)), and c goes to d and e. I'm trying to produce an abstract syntax tree for these expressions, but I'm running into difficulty because all of the operators can accept any number of operands on each side. I'd rather keep the logic for producing the AST within the recursive descent parsing methods, since it avoids having to duplicate the logic of extracting an expression. My current strategy is as follows:
1. If a Value appears, push it to the output.
2. If a From or To appears:
   1. Output a separator.
   2. Get the next Expr.
   3. Create a Link node.
   4. Pop the first set of operands from output into the Link until a separator appears.
   5. Erase the separator discovered.
   6. Pop the second set of operands into the Link until a separator.
   7. Push the Link to the output.
If I run this through without obeying steps 2.3–2.7, I get a list of values and separators. For the expression quoted above, a -> b (c -> (d) (e)), the output should be:
A sep_1 B sep_2 C sep_3 D E
Applying the To rule would then yield:
A sep_1 B sep_2 (link from C to {D, E})
And subsequently:
(link from A to {B, (link from C to {D, E})})
The important thing to note is that sep_2, crucial to delimit the left-hand operands of the second ->, does not appear, so the parser believes that the expression was actually written:
a -> (b c -> (d) (e))
In order to solve this with my current strategy, I would need a way to produce a separator between adjacent expressions, but only if the current expression is a From or To expression enclosed in parentheses. If that's possible, then I'm just not seeing it and the answer ought to be simple. If there's a better way to go about this, however, then please let me know!
I haven't tried to analyze it in detail, but: "From or To expression enclosed in parentheses" starts to sound a lot like "context dependent", which recursive descent can't handle directly. To avoid context dependence you'll probably need a separate production for a From or To in parentheses vs. a From or To without the parens.
Edit: Though it may be too late to do any good, if my understanding of what you want to match is correct, I think I'd write it more like this:
Graph :=
| List Sep Graph
;
Sep := "->"
| "<-"
;
List :=
| Value List
;
Value := Number
| Identifier
| String
| '(' Graph ')'
;
It's hard to be certain, but I think this should at least be close to matching (only) the inputs you want, and should make it reasonably easy to generate an AST that reflects the input correctly.
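Here is a rough Python sketch of a recursive descent parser along those lines. It departs from the grammar as written in one respect, which is an assumption on my part: a Graph is allowed to end with a plain List (roughly Graph := List | List Sep Graph), since otherwise an input such as a -> b would have no way to finish. The tuple-based AST and the function names are invented for the example.

```python
import re

def tokenize(text):
    # Arrows, parentheses, and identifier/number words; whitespace is skipped.
    return re.findall(r"->|<-|[()]|\w+", text)

def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def graph():
        # Graph := List | List Sep Graph  (assumed variant, see above)
        nonlocal pos
        left = value_list()
        if peek() in ("->", "<-"):
            op = tokens[pos]
            pos += 1
            return (op, left, graph())
        return left

    def value_list():
        # List := zero or more Values; a Value is a word or '(' Graph ')'.
        nonlocal pos
        items = []
        while peek() is not None and peek() not in ("->", "<-", ")"):
            if peek() == "(":
                pos += 1
                items.append(graph())
                if peek() != ")":
                    raise SyntaxError("expected ')'")
                pos += 1
            else:
                items.append(tokens[pos])
                pos += 1
        return items

    result = graph()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return result

print(parse("a -> b (c -> (d) (e))"))
```

Because each parenthesized group is parsed by its own call to graph(), the operands of the inner -> are delimited by the call stack, so no explicit separators are needed on an output list.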
