How does Bison deal with optional parts in BNF grammars?

I'm new to Bison and compiler principles, and I'm trying to write a simple Verilog parser with Flex and Bison according to the IEEE Standard. Here is my question: when a rule's body has optional parts, which are enclosed in square brackets, what should I do in Bison?
The BNF grammar may look like this:
input_declaration ::= input [ net_type ] [ signed ] [ range ] list_of_port_identifiers
1. Should I enumerate them one by one, like the following?
input_declaration : INPUT list_of_port_identifiers
| INPUT net_type list_of_port_identifiers
| INPUT signed list_of_port_identifiers
| INPUT net_type signed list_of_port_identifiers
....
This does solve the problem, but it feels clumsy.
2. Should I add an %empty alternative to the rule for each optional part, like the following?
net_type:
%empty
| SUPPLY0
| SUPPLY1
| TRI
| TRIAND
| TRIOR
| WIRE
| WAND
| WOR
;
But I think this will cause some conflicts. What is the best approach?

The best solution is (2):
net_type: %empty | ...;
signed: %empty | ...;
range: %empty | ...;
You should not see any conflicts between them unless they share tokens.
If you wrote out all the combinations as in (1) and it worked without conflicts, I would still suspect that you left out a combination by mistake: Bison would detect the same conflict in a complete version of (1) as it does in (2), and warn you.
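To make the trade-off concrete, here is a small Python sketch (not Bison input; the names OPTIONAL and expand are made up for this illustration) that enumerates every alternative approach (1) would need for the input_declaration rule. With three optional parts there are 2^3 = 8 of them, which is why a combination is easy to miss.

```python
from itertools import product

# Optional parts of input_declaration, in the order the BNF lists them.
OPTIONAL = ["net_type", "signed", "range"]


def expand(prefix, optional, suffix):
    """Return every alternative obtained by keeping or dropping each optional part."""
    alternatives = []
    for keep in product([False, True], repeat=len(optional)):
        middle = [name for name, kept in zip(optional, keep) if kept]
        alternatives.append([prefix, *middle, suffix])
    return alternatives


alternatives = expand("INPUT", OPTIONAL, "list_of_port_identifiers")
for alt in alternatives:
    # One line per alternative that approach (1) has to spell out by hand.
    print(" ".join(alt))
```

Approach (2) replaces all of these with a single alternative, because each optional nonterminal carries its own empty case.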

Related

Obscure Antlr Error when Parsing Data Type

I am trying to parse a variable type for a toy language meant to teach Antlr fundamentals. I wish to parse at the rule var using the below code.
// Parser
var : TYPE ID;
// Lexer
TYPE: SIGNED PTR? DIMENSIONS?
| UNSIGNED PTR? DIMENSIONS?
| UNSIGNABLE PTR? DIMENSIONS?;
fragment DIMENSIONS : '[' ((NAT | ':') ',')* (NAT | ':')? ']';
fragment SIGNED : 'I16' | 'I32' | 'I64' | 'F32' | 'CHAR';
fragment UNSIGNED : 'U_I16' | 'U_I32' | 'U_I64' | 'U_F32' | 'U_CHAR';
fragment UNSIGNABLE : 'VOID' | 'STR' | 'BOOL' | 'CPLX';
PTR : 'PTR';
NAT : [0-9]+;
ID : [A-Z][A-Z0-9_]*;
However, when I test my program with the example declaration I32 HELLO_9, I receive the following error.
line 1:0 missing TYPE at 'I32'
PTR and DIMENSIONS should be marked as optional, so I am unsure as to why my lexer will not identify the I32 token for the SIGNED fragment. As a secondary question, I wonder how it is ever possible for professional programmers to create sophisticated projects with Antlr. I have experimented with Haskell parsing libraries in the past and it appears (from my subjective view) that Antlr is more prone to producing obscure errors. My perception is probably just a consequence of my inexperience, and I would be thankful to hear the opinions of a more suave programmer.
Given your grammar, I can't reproduce this. If I add SPACE : [ \t\r\n] -> skip; to it, the following code:
TLexer lexer = new TLexer(CharStreams.fromString("I32 HELLO_9"));
TParser parser = new TParser(new CommonTokenStream(lexer));
ParseTree root = parser.var();
System.out.println(root.toStringTree(parser));
produces no warnings/errors and prints:
(var I32 HELLO_9)
which is the parse tree with var at the root and the tokens I32 and HELLO_9 as its children.
The real problem is either what @rici mentioned, or it is hidden because you minimized your real grammar and the minimized form does not produce the error that your real grammar does.

Interpretation variants of binary operators

I'm writing a grammar for a language that contains some binary operators that can also be used as unary operators (with the argument on the right side of the operator), and for better error recovery I'd like them to be usable as nular operators as well.
My simplified grammar looks like this:
start:
code EOF
;
code:
(binaryExpression SEMICOLON?)*
;
binaryExpression:
binaryExpression BINARY_OPERATOR binaryExpression //TODO: check before primaryExpression
| primaryExpression
;
primaryExpression:
unaryExpression
| nularExpression
;
unaryExpression:
operator primaryExpression
| BINARY_OPERATOR primaryExpression
;
nularExpression:
operator
| BINARY_OPERATOR
| NUMBER
| STRING
;
operator:
ID
;
BINARY_OPERATOR is just a set of defined keywords that are fed into the parser.
My problem is that Antlr prefers to use BINARY_OPERATORs as unary expressions (or nular ones if there is no other choice) instead of trying to use them in a binary expression, which is what I need.
For example, consider the following input: for varDec from one to twelve do something, where from, to and do are binary operators. In the parser's output, all binary operators are interpreted as unary ones.
What I'm trying to achieve is the following: try to match each BINARY_OPERATOR in a binary expression; only if that is not possible, try to match it as a unary expression; and if that isn't possible either, consider it a nular expression (which can only be the case if the BINARY_OPERATOR is the only content of an expression).
Has anyone an idea about how to achieve the desired behaviour?
A fairly standard approach is to use a single recursive rule to establish the acceptable expression syntax. ANTLR is left-associative by default, so op expr meets the stated unary op requirement of "argument to the right side of the operator". See p. 70 of TDAR for further discussion of associativity.
Ex1: -y+x -> binaryOp{unaryOp{-, literal}, +, literal}
Ex2: -y+-x -> binaryOp{unaryOp{-, literal}, +, unaryOp{-, literal}}
expr
: LPAREN expr RPAREN
| expr op expr #binaryOp
//| op expr #unaryOp // standard formulation
| op literal #unaryOp // limited formulation
| op #errorOp
| literal
;
op : .... ;
literal
: KEYWORD
| ID
| NUMBER
| STRING
;
You allow operators to act like operands ("nularExpression") and operands to act like operators ("operator: ID"). Between those two curious decisions, your grammar is 100% ambiguous, and there is never any need for a binary operator to be parsed. I don't know much about Antlr, but it surprises me that it doesn't warn you that your grammar is completely ambiguous.
Antlr has mechanisms to handle and recover from errors. You would be much better off using them than writing a deliberately ambiguous grammar which makes erroneous constructs part of the accepted grammar. (As I said, I'm not an Antlr expert, but there are Antlr experts who pass by here pretty regularly; if you ask a specific question about error recovery, I'm sure you'll get a good answer. You might also want to search this site for questions and answers about Antlr error recovery.)
I think what I'm going to write down now is what @GRosenberg meant with his answer. However, as it took me a while to fully understand it, I will provide a concrete solution for my problem in case someone else stumbles across this question while searching for an answer:
The trick was to remove the option to use a BINARY_OPERATOR inside the unaryExpression rule because this always got preferred. Instead what I really wanted was to specify that if there was no left side argument it should be okay to use a BINARY_OPERATOR in a unary way. And that's the way I had to specify it:
binaryExpression:
binaryExpression BINARY_OPERATOR binaryExpression
| BINARY_OPERATOR primaryExpression
| primaryExpression
;
That way this syntax only becomes possible if there is nothing to the left side of a BINARY_OPERATOR and in every other case the binary syntax has to be used.
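The same priority can be sketched as a hand-written parser in Python. This is only an illustration of the idea, not what ANTLR generates: the keyword set BINARY_OPERATORS is taken from the example input, the function names and tuple shapes are made up, the nular BINARY_OPERATOR case is left out, and left associativity is assumed.

```python
# Assumed keyword set, taken from the example input in the question.
BINARY_OPERATORS = {"from", "to", "do"}


def parse_primary(tokens, pos):
    """primaryExpression: operator primaryExpression | operator."""
    if pos >= len(tokens) or tokens[pos] in BINARY_OPERATORS:
        raise SyntaxError(f"operand expected at token {pos}")
    head = tokens[pos]
    pos += 1
    if pos < len(tokens) and tokens[pos] not in BINARY_OPERATORS:
        arg, pos = parse_primary(tokens, pos)
        return ("unary", head, arg), pos
    return ("nular", head), pos


def parse_binary(tokens):
    """binaryExpression with the prefix form allowed only when nothing is on the left."""
    if tokens and tokens[0] in BINARY_OPERATORS:
        # No left operand exists, so the operator is used in a unary way.
        operand, pos = parse_primary(tokens, 1)
        left = ("unary", tokens[0], operand)
    else:
        left, pos = parse_primary(tokens, 0)
    # parse_primary only stops at a binary operator or at the end of input,
    # so every token seen here is a binary operator with a left operand.
    while pos < len(tokens):
        op = tokens[pos]
        right, pos = parse_primary(tokens, pos + 1)
        left = ("binary", left, op, right)
    return left


tree = parse_binary("for varDec from one to twelve do something".split())
print(tree)
```

In this sketch from, to and do all end up as binary nodes, and a BINARY_OPERATOR is treated as a prefix operator only when it is the first token of the expression.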

Entry rule position convention in BNF?

Is it mandatory for the first (topmost) rule of a BNF (or EBNF) grammar to represent the entry point? For example, from the Wikipedia BNF page, the US Postal address grammar below has <postal-address> as the first derivation rule, and also the entry point:
<postal-address> ::= <name-part> <street-address> <zip-part>
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
| <personal-part> <name-part>
<personal-part> ::= <first-name> | <initial> "."
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""
Am I at liberty to put the <postal-address> rule in, say, the second position, and so provide the grammar in the following alternate form:
<name-part> ::= <personal-part> <last-name> <opt-suffix-part> <EOL>
| <personal-part> <name-part>
<postal-address> ::= <name-part> <street-address> <zip-part>
<personal-part> ::= <first-name> | <initial> "."
<street-address> ::= <house-num> <street-name> <opt-apt-num> <EOL>
<zip-part> ::= <town-name> "," <state-code> <ZIP-code> <EOL>
<opt-suffix-part> ::= "Sr." | "Jr." | <roman-numeral> | ""
<opt-apt-num> ::= <apt-num> | ""
No, this isn't a requirement. It is just a convention used by some.
In practice, one must designate "the" goal rule. We have a set of tools in which one identifies the nonterminal that is the goal nonterminal, and you can provide the rules (including goal rules) in any order. How you designate the goal may be outside the grammar formalism, or may be a special rule included in the grammar.
As a practical matter, this is not a big deal (OK, so some tool insists you put all the goal rules first, not actually that hard) and not that hard to do nicely (ok, the tool checks the left hand side of a grammar rule to see if it matches the goal nonterminal).
Of course, you need to know which way your tool works, but that takes about 2 minutes to figure out.
Some tools only allow one goal rule. As a practical matter, real (re-engineering, see my bio) parsers often find it useful to allow multiple rules (consider parsing COBOL as "whole programs" and as "COPYLIBS"), so you end up writing (clumsily IMHO):
G = G1 | G2 | G3 ... ;
G1 = ...
in this case. Still not a big deal. None of these constraints hurt expressiveness or in fact cost you much engineering time.
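As a rough Python sketch of the "goal designated outside the grammar" style: the rules below are the postal-address grammar stored in a dictionary in the reordered form from the question, and the goal is passed in separately. The dictionary layout and the reachable helper are assumptions made for this illustration, not any particular tool's API.

```python
# Rules in the reordered form from the question; <postal-address> is second.
# Symbols that are not keys of the dictionary are treated as terminals.
GRAMMAR = {
    "name-part": [["personal-part", "last-name", "opt-suffix-part", "EOL"],
                  ["personal-part", "name-part"]],
    "postal-address": [["name-part", "street-address", "zip-part"]],
    "personal-part": [["first-name"], ["initial", "."]],
    "street-address": [["house-num", "street-name", "opt-apt-num", "EOL"]],
    "zip-part": [["town-name", ",", "state-code", "ZIP-code", "EOL"]],
    "opt-suffix-part": [["Sr."], ["Jr."], ["roman-numeral"], []],
    "opt-apt-num": [["apt-num"], []],
}


def reachable(grammar, goal):
    """Return the nonterminals that can be reached from the designated goal."""
    if goal not in grammar:
        raise KeyError(f"goal nonterminal {goal!r} has no rule")
    seen = set()
    stack = [goal]
    while stack:
        nonterminal = stack.pop()
        if nonterminal in seen:
            continue
        seen.add(nonterminal)
        for alternative in grammar[nonterminal]:
            for symbol in alternative:
                if symbol in grammar and symbol not in seen:
                    stack.append(symbol)
    return seen


# The goal is named explicitly, so the position of its rule does not matter.
print(sorted(reachable(GRAMMAR, "postal-address")))
```

Because the goal is an argument rather than "whatever rule comes first", shuffling the dictionary changes nothing about the result.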

ANTLR doesn't find the defined start rule

I'm facing a strange ANTLR issue with a grammar that should just output an AST.
grammar ltxt.g;
options
{
language=CSharp3;
}
prog : start
;
start : '{Start 'loopname'}'statement'{Ende 'loopname'}'
| statement
;
loopname : (('a'..'z')|('A'..'Z')|('1'..'9'))*;
statement : '<%' table_ref '>'
| start;
table_ref : '{'format'}'ID;
format : FSTRING
| FSTRING OFSTRING{0,5}
;
FSTRING : '#F'
| '#D'
| '#U'
| '#K'
;
OFSTRING: 'F'
| 'D'
| 'U'
| 'K'
//| 1..65536
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
When I try to code-gen this I get
error(100):LTXT.g:1:13:syntax error: antlr: MismatchedTokenException(74!=52). I didn't declare any 74 or 52.
Also, I do not get a syntax diagram, since "rule "start"" cannot be found as a start state...
I know that this isn't pretty, but I thought it would work at least :)
Best,
wishi
There are four errors that I see.
1. A grammar name can't contain a period. That's the syntax error you're getting. The 74!=52 error message is a hint telling you that ANTLR found token id 74 when it was expecting token id 52, which in this case just translates to "it found one thing when it expected something else."
2. The grammar name ("ltxt") and the file name before the extension ("LTXT") need to match exactly.
3. The grammar won't produce an AST unless you specify output=AST; in the options section.
4. format's second alternative (FSTRING OFSTRING{0,5}) won't do what I think you think it's going to do. ANTLR doesn't support an arbitrary number of matches such as "match zero to five OFSTRINGs". You'll need to redefine the rule using semantic predicates that count occurrences for you. They aren't hard to use, but they're one of the trickier parts of ANTLR.
I hope that helps get you started.
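The counting idea behind the last point can be sketched in plain Python. This is not ANTLR predicate syntax, only the logic such a predicate would implement; the token spellings come from the FSTRING and OFSTRING rules in the question, while parse_format and MAX_OFSTRINGS are names invented for this sketch.

```python
FSTRINGS = {"#F", "#D", "#U", "#K"}
OFSTRINGS = {"F", "D", "U", "K"}
MAX_OFSTRINGS = 5  # upper bound that OFSTRING{0,5} was meant to express


def parse_format(tokens):
    """Match one FSTRING followed by zero to five OFSTRINGs.

    Returns the number of tokens consumed.
    """
    if not tokens or tokens[0] not in FSTRINGS:
        raise SyntaxError("format must start with an FSTRING")
    pos = 1
    count = 0
    while pos < len(tokens) and tokens[pos] in OFSTRINGS:
        count += 1
        if count > MAX_OFSTRINGS:
            # This is the check a counting predicate would perform.
            raise SyntaxError(f"more than {MAX_OFSTRINGS} OFSTRING tokens")
        pos += 1
    return pos


print(parse_format(["#F", "D", "U"]))
```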

Strange problem with context free grammar

I begin with an otherwise well formed (and well working) grammar for a language. Variables,
binary operators, function calls, lists, loops, conditionals, etc. To this grammar I'd like to add what I'm calling the object construct:
object
: object_name ARROW more_objects
;
more_objects
: object_name
| object_name ARROW more_objects
;
object_name
: IDENTIFIER
;
The point is to be able to access scalars nested in objects. For example:
car->color
monster->weapon->damage
pc->tower->motherboard->socket_type
I'm adding object as a primary_expression:
primary_expression
: id_lookup
| constant_value
| '(' expression ')'
| list_initialization
| function_call
| object
;
Now here's a sample script:
const list = [ 1, 2, 3, 4 ];
for var x in list {
send "foo " + x + "!";
}
send "Done!";
Prior to adding the nonterminal object as a primary_expression everything is sunshine and puppies. Even after I add it, Bison doesn't complain. No shift and/or reduce conflicts reported. And the generated code compiles without a sound. But when I try to run the sample script above, I get told error on line 2: Attempting to use undefined symbol '{' on line 2.
If I change the script to:
var list = 0;
for var x in [ 1, 2, 3, 4 ] {
send "foo " + x + "!";
}
send "Done!";
Then I get error on line 3: Attempting to use undefined symbol '+' on line 3.
Clearly the presence of object in the grammar is messing up how the parser behaves [SOMEhow], and I feel like I'm ignoring a rather simple principle of language theory that would fix this in a jiff, but the fact that there aren't any shift/reduce conflicts has left me bewildered.
Is there a better way (grammatically) to write these rules? What am I missing? Why aren't there any conflicts?
(And here's the full grammar file in case it helps)
UPDATE: To clarify, this language, which compiles into code being run by a virtual machine, is embedded into another system - a game, specifically. It has scalars and lists, and there are no complex data types. When I say I want to add objects to the language, that's actually a misnomer. I am not adding support for user-defined types to my language.
The objects being accessed with the object construct are actually objects from the game which I'm allowing the language processor to access through an intermediate layer which connects the VM to the game engine. This layer is designed to decouple as much as possible the language definition and the virtual machine mechanics from the implementation and details of the game engine.
So when, in my language I write:
player->name
That only gets codified by the compiler. "player" and "name" are not traditional identifiers because they are not added to the symbol table, and nothing is done with them at compile time except to translate the request for the name of the player into 3-address code.
It seems you are making a classic mistake by using literal strings directly in the yacc source file. Since you are using a lexer, you can only use token names in yacc source files.
So I spent a reasonable amount of time picking over the grammar (and the bison output) and can't see what is obviously wrong here. Without having the means to execute it, I can't easily figure out what is going on by experimentation. Therefore, here are some concrete steps I usually go through when debugging grammars. Hopefully you can do any of these you haven't already done and then perhaps post follow-ups (or edit your question) with any results that might be revealing:
1. Contrive the smallest (in terms of number of tokens) possible working input, and the smallest possible non-working inputs based on the rules you expect to be applied.
2. Create a copy of the grammar file including only the troublesome rules and as few other supporting rules as you can get away with (i.e. you want a language that only allows construction of sequences consisting of the object and more_objects rules, joined by ARROW). Does this work as you expect?
3. Does the rule in which it is nested work as you expect? Try replacing object with some other very simple rule (using some tokens not occurring elsewhere) and seeing if you can include those tokens without it breaking everything else.
4. Run bison with --report=all. Inspect the output to try to trace the rules you've added and the states that they affect. Try removing those rules and repeat the process: what has changed? This is often extremely time-consuming and a giant pain, but it's a good last resort. I recommend a pencil and some paper.
Looking at the structure of your error output, '+' is being recognised as an identifier token, and is therefore being looked up as a symbol. It might be worth checking your lexer to see how it is processing identifier tokens. You might just accidentally be grabbing too much. As a further debugging technique, you might consider turning some of those token literals (e.g. '+', '{', etc.) into real tokens so that bison's error reporting can help you out a little more.
EDIT: OK, the more I've dug into it, the more I'm convinced that the lexer is not necessarily working as it should be. I would double-check that the stream of tokens you are getting from yylex() matches your expectations before proceeding any further. In particular, it looks like a bunch of symbols that you consider special (e.g. '+' and '{') are being captured by some of your regular expressions, or at least are being allowed to pass for identifiers.
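One way to run that check is to dump the token stream on its own, before any parsing happens. The Python sketch below only illustrates the kind of bug being suspected here; it is not the OP's lexer. Both identifier patterns and the tokenize helper are hypothetical, and the greedy pattern is a deliberately broken example.

```python
import re

# A conventional identifier pattern.
STRICT_IDENTIFIER = r"[A-Za-z_][A-Za-z0-9_]*"
# Hypothetical bug: "anything that is not whitespace, a quote or a bracket".
GREEDY_IDENTIFIER = r'[^\s"\[\],;]+'


def tokenize(source, identifier_pattern):
    """Return (kind, text) pairs so the token stream can be inspected by eye."""
    token_re = re.compile(
        r'(?P<STRING>"[^"]*")'
        r"|(?P<IDENTIFIER>" + identifier_pattern + r")"
        r"|(?P<PUNCT>[+{}\[\],;])"
    )
    tokens = []
    pos = 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        match = token_re.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character {source[pos]!r} at offset {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


line = 'send "foo " + x + "!";'
for pattern in (STRICT_IDENTIFIER, GREEDY_IDENTIFIER):
    print(tokenize(line, pattern))
```

With the greedy pattern, '+' comes back tagged as an IDENTIFIER, which is the sort of stream that would lead a parser to look '+' up as a symbol.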
You don't get shift/reduce conflicts because your rules using object_name and more_objects are right-recursive - rather than the left-recursive rules that Yacc (Bison) handles most naturally.
On classic Yacc, you would find that you can run out of stack space with deep enough nesting of the 'object->name->what->not' notation. Bison extends its stack at runtime, so you have to run out of memory, which is a lot harder these days than it was when machines had a few megabytes of memory (or less).
One result of the right-recursion is that no reductions occur until you read the last of the object names in the chain (or, more accurately, one symbol beyond that). I see that you've used right-recursion with your statement_list rule too - and in a number of other places too.
I think your principal problem is that you failed to define a subtree constructor in your object subgrammar. (EDIT: OP says he left the semantic actions for object out of his example text. That doesn't change the following answer.)
You probably have to look up the objects in the order encountered, too.
Maybe you intended:
primary_expression
: constant_value { $$ = $1; }
| '(' expression ')' { $$ = $2; }
| list_initialization { $$ = $1; }
| function_call { $$ = $1; }
| object { $$ = $1; }
;
object
: IDENTIFIER { $$ = LookupVariableOrObject( yytext ); }
| object ARROW IDENTIFIER { $$ = LookupSubobject( $1, yytext ); }
;
I assume that if one encounters an identifier X by itself, your default interpretation is that it is a variable name. But if you encounter X -> Y, then even if X is a variable name, you want the object X with subobject Y.
What LookupVariableOrObject does is look up the leftmost identifier encountered to see if it is a variable (and return essentially the same value as id_lookup, which must produce an AST node of type AST_VAR), or see if it is a valid object name and return an AST node marked as an AST_OBJ, or complain if the identifier isn't one of these.
What LookupSubobject does is check its left operand to ensure it is an AST_OBJ (or an AST_VAR whose name happens to be the same as that of an object), and complain if it is not. If it is, then it looks up the yytext-child object of the named AST_OBJ.
EDIT: Based on discussion comments in another answer, right-recursion in the OP's original
grammar might be problematic if the OP's semantic checks inspect global lexer state (yytext).
This solution is left-recursive and won't run afoul of that particular trap.
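The yytext trap can be shown with a toy model in Python. This is not Bison's actual algorithm: it only assumes that the parser has to read one lookahead token before it runs an action, so shared lexer state no longer holds the text of the token the rule matched. The Lexer class and parse_reference are hypothetical names.

```python
class Lexer:
    """Toy lexer that keeps the latest token text in shared state, like yytext."""

    def __init__(self, words):
        self._words = iter(words)
        self.yytext = None

    def next_token(self):
        self.yytext = next(self._words, None)
        return self.yytext


def parse_reference(words, use_global_state):
    """Parse IDENTIFIER ('->' IDENTIFIER)* and return the names the actions saw."""
    lexer = Lexer(words)
    looked_up = []
    token = lexer.next_token()
    while token is not None:
        name = token                    # value captured when the token was scanned
        lookahead = lexer.next_token()  # the parser peeks to see whether '->' follows
        # The action runs only now, after the lookahead has been read.
        looked_up.append(lexer.yytext if use_global_state else name)
        if lookahead != "->":
            break
        token = lexer.next_token()
    return looked_up


print(parse_reference(["x", "+", "y"], use_global_state=True))
print(parse_reference(["x", "+", "y"], use_global_state=False))
```

In this model the action that reads the shared state looks up '+' instead of x, which resembles the "undefined symbol '+'" message in the question. Passing the token's text as its semantic value and using $1 in the action is the usual way to avoid depending on yytext.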
id_lookup
: IDENTIFIER
is formally identical to
object_name
: IDENTIFIER
and object_name would accept everything that id_lookup wouldn't. So assertLookup( yytext ); probably runs on everything that looks like an IDENTIFIER and is not accepted by another rule, just to decide between the two, and then object_name can't accept it because the single token of lookahead forbids that.
As for the twilight zone: the two characters that you got errors for are not declared as tokens, which opens the zone of undefined behavior and could trip the parser into treating them as potential identifiers when the grammar gets loose.
I just tried running muscle in Ubuntu 10.04 using bison 2.4.1 and I was able to run both of your examples with no syntax errors. My guess is that you have a bug in your version of bison. Let me know if I'm somehow running your parser wrong. Below is the output from the first example you gave.
./muscle < ./test1.m (this was your first test)
\-statement list
|-declaration (constant)
| |-symbol reference
| | \-list (constant)
| \-list
| |-value
| | \-1
| |-value
| | \-2
| |-value
| | \-3
| \-value
| \-4
|-loop (for-in)
| |-symbol reference
| | \-x (variable)
| |-symbol reference
| | \-list (constant)
| \-statement list
| \-send statement
| \-binary op (addition)
| |-binary op (addition)
| | |-value
| | | \-foo
| | \-symbol reference
| | \-x (variable)
| \-value
| \-!
\-send statement
\-value
\-Done!
+-----+----------+-----------------------+-----------------------+
| 1 | VALUE | 1 | |
| 2 | ELMT | #1 | |
| 3 | VALUE | 2 | |
| 4 | ELMT | #3 | |
| 5 | VALUE | 3 | |
| 6 | ELMT | #5 | |
| 7 | VALUE | 4 | |
| 8 | ELMT | #7 | |
| 9 | LIST | | |
| 10 | CONST | #10 | #9 |
| 11 | ITER_NEW | #11 | #10 |
| 12 | BRA | #14 | |
| 13 | ITER_INC | #11 | |
| 14 | ITER_END | #11 | |
| 15 | BRT | #22 | |
| 16 | VALUE | foo | |
| 17 | ADD | #16 | #11 |
| 18 | VALUE | ! | |
| 19 | ADD | #17 | #18 |
| 20 | SEND | #19 | |
| 21 | BRA | #13 | |
| 22 | VALUE | Done! | |
| 23 | SEND | #22 | |
| 24 | HALT | | |
+-----+----------+-----------------------+-----------------------+
foo 1!
foo 2!
foo 3!
foo 4!
Done!

Resources