Antlr parsing rule to parse string with matching braces - parsing

I have a parser rule which looks like below
nested_query: ~(LPARAN | RPARAN)+? LPARAN nested_query RPARAN ~(LPARAN | RPARAN)+?
| nested_query nested_query_op LPARAN nested_query RPARAN
| ~(LPARAN | RPARAN)+?
;
nested_query_op: binary_in | binary_not_in ;
binary_in: 'in';
binary_not_in: 'not' 'in';
LPARAN: '(';
RPARAN: ')';
This correctly matches the string list(srcVm) of flows where typeTag ="TAG_SRC_IP_VM" until timestamp
But when I try to parse a string having more than one matching brackets it does not get properly parsed for example list(srcVm) of flows where (typeTag ="TAG_SRC_IP_VM") until timestamp
Can someone let me know how can I modify the above rule to match a string with more than one matching braces under nested_query rule like below
nested_query:1
|
---------------------------------------------------------
list ( nested_query:3 ) of flows where ( nested_query:4) until timestamp
| |
srcVM (typeTag ="TAG_SRC_IP_VM")

This ought to do the trick:
nested_query
: ( LPARAN nested_query RPARAN | ~( LPARAN | RPARAN ) )+
;
list(srcVm) of flows where typeTag ="TAG_SRC_IP_VM" until timestamp
list(srcVm) of flows where (typeTag ="TAG_SRC_IP_VM") until timestamp

Well there simply isn't any rule that would allow input with two sets of parentheses following each other without in|not in just in front of the second opening brace`:
The only first alternative in nested_query only allows one occurrence of parentheses (although they might be nested) - outside of the top-level parentheses the must be no parentheses.
The second alternative allows top-level parentheses (from first nested_query) followed by non-parentheses, then in or not in and then second top/level parentheses.
The third alternative doesn't allow parentheses at all.
To match multiple parentheses on one level, the first alternative of nested_query should be something like
~(LPARAN | RPARAN)* (LPARAN nested_query RPARAN ~(LPARAN | RPARAN)*)+ ~(LPARAN | RPARAN)*
But then the second alternative will be in conflict with this, because everything that can match it can also match this modified first alternative.

Related

Calculate first, follow and predict from EBNF notation

Grammatical rules are defined as:
an integer literal is a sequence of digits;
a boolean literal is one of true or false;
a keyword is one of if, while, or the boolean literals;
a variable is a string that starts with a letter and is followed by letters or digits, and
is not a keyword;
an operator is one of <= >= == != && || = + - * < >
punctuation is one of the ( ) { } , ; characters.
Based on the description I wrote out grammar in EBNF notation as fallows:
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;
int literal = digit {digit} ;
bool = "true" | "false" ;
keyword = "if" | "while" | bool ;
letter = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P" |
"Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a" | "b" | "c" | "d" | "e" | "f" | "g" |
"h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x" |
"y" | "z" ;
variable = (letter {digit | letter}) -keyword ;
operator = "<=" |">=" | "==" | "!=" | "&&" | "||" | "=" | "+" | "-" | "*" | "<" | ">" | "!" ;
punctuation = "(" | ")" | " {" | " }" | " , " | " ; " ;
Now i want to calculate FIRST, FOLLOW and PREDICT sets but I'm not sure how to do it out of EBNF notation. Should I first change it to Chomsky normal form? Is so then how? Would that be right?
DIGIT -> 0 1 2 3 ...
INT -> DIGIT | DIGIT DIGIT
BOOL -> true false
KEYWORD -> if while BOOL
LETTER -> A B C D ...
VARIABLE -> LETTER | LETTER DIGIT | LETTER LETTER
First and follow are pretty straight-forward, even with EBNF. In this case, they are even easier, since you have no nullable non-terminals. (You need to watch out for repetition groups, since the repetition count can be 0. If you have:
... A { X ... } Y ...
then FOLLOW(A) must include both FIRST(X) and FIRST(Y). And if you have
C -> A { X }
then FOLLOW(A) must include FOLLOW(C).
None of this should be complicated if you're doing the computation by hand. For an automated solution, I would probably unroll the repetition operators into unextended BNF by creating new non-terminals, but you could do the computation directly on the EBNF as well.
The one wrinkle is your use of the set difference operator -, in
variable = (letter {digit | letter}) - keyword ;
In this particular case, it does not create any difficulties, but the general solution is tricky. In fact, since there is no guarantee that the difference between two context-free languages is context-free, it will not really be possible to find a truly general solution.
Predict sets are another story. Indeed, I'm not even 100% sure what a predict set would be for EBNF, since you need to be able to predict repetition of a subpattern, not just derivations. Again, expanding to BNF might help, but it can happen that the expansion creates a predict conflict which didn't exist in the original grammar.
The grammar you present is incomplete, so I don't know how useful computing LL(1) sets will be. I suppose that it is intended to be just the lexical part of the grammar, but really there is a reason why lexical analysis is usually done with regular expressions rather than context-free parsing.
Several reasons, really: aside from the fact that lexical analysis usually involves reasonably readable regular expressions, there is also the important fact that lexical analysis does not usually involve parsing the internal structure of a token. That lets you choose to simply recognize a repeated element rather than worrying about whether the parse tree for the repetition should be left- or right-leaning.
The key insight about computing FIRST and FOLLOW sets is that they mean just what their names indicate. The FIRST set of a non-terminal is precisely the set of tokens which can begin a complete derivation from the non-terminal; similarly, the FOLLOW set is precisely the set of tokens which might immediately follow the non-terminal during a derivation from the start symbol. In many simple grammars, these sets can be computed by inspection; that certainly should be the case for your grammar, at least for the FIRST sets.
The fact that you have no start symbol here is another indication that you are probably not solving the right problem; without a start symbol, there is no meaningful definition of FOLLOW.
If you are trying to do lexical analysis, you might be able to get away with:
start -> { token }
token -> int literal | keyword | identifier | ...
Although to be formally correct, you'd also need to handle "ignored tokens" such as comments and whitespace.

ANTLR Parsing Literals and Quoted IDs

I'm working on an SQL grammar in ANTLR which allows quoted identifiers (table names, field names, etc), as well as quoted literal strings.
The problem is that this grammar seems to always match quoted inputs as "QUOTED_LITERAL", and never as IDs wrapped in quotes.
Here are my results:
input: 'blahblah' result: string_literal as expected.
input: field1 restul: column_name as expected
input: table.field1 result: column_spec as expected
input: 'table'.'field1' result: string_literal, MissingTokenException
Below is my simplified grammar for the expression portion of the SQL grammar, if anybody can help identify what is needed to match quoted rules other than the quoted literal, thanks.
grammar test;
expression
:
simpleExpression EOF!
;
simpleExpression
:
column_spec
| literal_value
;
column_spec
:
(table_name '.')? column_name
| ('\''table_name '\'''.')? '\'' column_name '\''
| ('\"'table_name '\"' '.')? '\"' column_name '\"'
;
string_literal: QUOTED_LITERAL ;
boolean_literal: 'TRUE' | 'FALSE' ;
literal_value :
(
string_literal
| boolean_literal
)
;
table_name :ID;
column_name :ID;
QUOTED_LITERAL:
( '\''
( ('\\' '\\') | ('\'' '\'') | ('\\' '\'') | ~('\'') )*
'\'' )
|
( '\"'
( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )*
'\"' )
;
ID
:
( 'A'..'Z' | 'a'..'z' ) ( 'A'..'Z' | 'a'..'z' | '_' | '0'..'9'| '::' )*
;
WHITE_SPACE : ( ' '|'\r'|'\t'|'\n' ) {$channel=HIDDEN;} ;
In case anybody is interested, I removed a little bit of the flexibility from the quoted literal strings. Literal strings can only be quoted by single quotes, and identifiers can be optionally quoted by double quotes. As long as the literal quote and the identifier quote is well defined and they don't overlap, the grammar is trivial.
This policy makes the grammar much cleaner, and doesn't remove the ability to quote identifiers. I make use of the JDBC method getIdentifierQuote to report which quote can be used to wrap identifiers.
This is your classical shift/reduce conflict. (Except that ANTLR does not shift or reduce; since it is not a stack automaton.)
You have the following problem:
When you are in the simpleExpression state you need to decide what branch to take with one token lookahead. In the case of ANTLR, since no difference is done between lexer and parser the one token is a single character. (You should see a warning from ANTLR about the conflict.)
It gets even better, what is the difference between "Bob Dillan" and "table1"? From the parsers point of view, none. So how do you expect to make a difference between:
('\"'table_name '\"' '.')? '\"' column_name '\"'
and
( '\"'
( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )*
'\"' )
I strongly suggest to rewrite the simpleExpression rule to:
simpleExpression:
IDENTIFIER |
IDENTIFIER . IDENTIFIER |
QUOTED_LITERAL |
QUOTED_LITERAL . QUOTED_LITERAL |
boolean_literal;
And then decide in the action code of simpleExpression what to do. Especially since I am quite sure that you can reference a table with a quoted name; never the less "users" and "Bod Dillan" are syntactically equal.
It also depends on the grater grammar, you may also be able to resolve the amiability on a higher level.
The antlr lexer is greedy, in that when there are two possible token matches, it will match the longest possible one.
When the lexer sees 'some_id', it can match the first quote as just a quote, or a quoted literal. The literal is longer, so that matches.
As a side note, you generally do not want lexer rules that can match nothing (like ID) or to uses string constants in the parser rules, but only reference token names.
What you want to do is something like this.
QUOTE: '\'';
ID: ('a'..'z' | 'A'..'Z')+; // Must have at least one character
QUOTED_LITERAL: QUOTE ( (ID QUOTE) => { $type=QUOTE; } ) | .* QUOTE;
id: ID | QUOTE ID QUOTE;
quoted_literal: QUOTED_LITERAL | QUOTE ID QUOTE;
If the lexer sees something that looks like a quoted id, it cannot tell which to use, so it breaks it up into smaller tokens. In your parser, you use id where you expect a possibly quoted ID, and quoted_literal where you expect a QUOTED_LITERAL.
The syntactical predicate in QUOTED_LITERAL prevents it from matching the full quote when the input is ambiguous.
Looking that this, it will fail to correctly parse lines like
'tag' text 'second'
as ' text ' will be parsed as a QUOTED_LITERAL. If that is a valid input, then you would need something like
fragment QUOTED_ID;
QUOTED_LITERAL: QUOTE ( ID {$type=QUOTED_ID} | .* ) QUOTE;
id: ID | QUOTED_ID;
quoted_literal: QUOTED_LITERAL | QUOTED_ID;
(My example does not cover all the cases in your input, but extending it should be obvious. You also probably need some actions to either generate the correct tokens in your AST or add/remove quotes from the text, depending one what you do after you parse.)

ANTLR doesn't find the defined start rule

I'm facing a strange ANTLR issue with a that should just output an AST.
grammar ltxt.g;
options
{
language=CSharp3;
}
prog : start
;
start : '{Start 'loopname'}'statement'{Ende 'loopname'}'
| statement
;
loopname : (('a'..'z')|('A'..'Z')|('1'..'9'))*;
statement : '<%' table_ref '>'
| start;
table_ref : '{'format'}'ID;
format : FSTRING
| FSTRING OFSTRING{0,5}
;
FSTRING : '#F'
| '#D'
| '#U'
| '#K'
;
OFSTRING: 'F'
| 'D'
| 'U'
| 'K'
//| 1..65536
;
ID : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
When I try to code-gen this I get
error(100):LTXT.g:1:13:syntax error: antlr: MismatchedTokenException(74!=52). I didn't declare any 74 or 52.
also I do not get a Synatx diagram, since "rule "start"" cannot be found as a start state...
I know that this isn't pretty, but I thought it would work at least :)
Best,
wishi
There are four errors that I see.
A grammar name can't contain a period. That's the syntax error you're getting. The 74!=52 error message is a hint telling you that ANTLR found token id 74 when it was expecting token id 52, which in this case just translates to "it found one thing when it expected something else."
The grammar name ("ltxt") and the file name before the extension ("LTXT") need to match exactly.
The grammar won't produce an AST unless you specify output=AST; in the options section.
format's second alternative (FSTRING OFSTRING{0,5}) won't do what I think you think it's going to do. ANTLR doesn't support an arbitrary number of matches such as "match zero to five OFSTRINGs". You'll need to redefine the rule using semantic predicates that count occurrences for you. They aren't hard to use, but they're one of the trickier parts of ANTLR.
I hope that helps get you started.

How to remove Left recursive in this ANTLR grammar?

I am trying to parse CSP(Communicating Sequential Processes) CSP Reference Manual. I have defined following grammar rules.
assignment
: IDENT '=' processExpression
;
processExpression
: ( STOP
| SKIP
| chaos
| prefix
| prefixWithValue
| seqComposition
| interleaving
| externalChoice
....
seqComposition
: processExpression ';' processExpression
;
interleaving
: processExpression '|||' processExpression
;
externalChoice
: processExpression '[]' processExpression
;
Now ANTLR reports that
seqComposition
interleaving
externalChoice
are left recursive . Is there any way to remove this or I should better used Bison Flex for this type of grammar. (There are many such rules)
Define a processTerm. Then write rules looking like
assignment
: IDENT '=' processExpression
;
processTerm
: ( STOP
| SKIP
| chaos
| prefix
...
processExpression
: ( processTerm
| processTerm ';' processExpression
| processTerm '|||' processExpression
| processTerm '[]' processExpression
....
If you want to have things like seqComposition still defined, I think that would be OK as well. But you need to make sure that the parsing of processExpansion is going to always consume more text as you proceed through your rules.
Read the guide to removing left recursion in on the ANTLR wiki. It helped me a lot.

Strange problem with context free grammar

I begin with an otherwise well formed (and well working) grammar for a language. Variables,
binary operators, function calls, lists, loops, conditionals, etc. To this grammar I'd like to add what I'm calling the object construct:
object
: object_name ARROW more_objects
;
more_objects
: object_name
| object_name ARROW more_objects
;
object_name
: IDENTIFIER
;
The point is to be able to access scalars nested in objects. For example:
car->color
monster->weapon->damage
pc->tower->motherboard->socket_type
I'm adding object as a primary_expression:
primary_expression
: id_lookup
| constant_value
| '(' expression ')'
| list_initialization
| function_call
| object
;
Now here's a sample script:
const list = [ 1, 2, 3, 4 ];
for var x in list {
send "foo " + x + "!";
}
send "Done!";
Prior to adding the nonterminal object as a primary_expression everything is sunshine and puppies. Even after I add it, Bison doesn't complain. No shift and/or reduce conflicts reported. And the generated code compiles without a sound. But when I try to run the sample script above, I get told error on line 2: Attempting to use undefined symbol '{' on line 2.
If I change the script to:
var list = 0;
for var x in [ 1, 2, 3, 4 ] {
send "foo " + x + "!";
}
send "Done!";
Then I get error on line 3: Attempting to use undefined symbol '+' on line 3.
Clearly the presence of object in the grammar is messing up how the parser behaves [SOMEhow], and I feel like I'm ignoring a rather simple principle of language theory that would fix this in a jiff, but the fact that there aren't any shift/reduce conflicts has left me bewildered.
Is there a better way (grammatically) to write these rules? What am I missing? Why aren't there any conflicts?
(And here's the full grammar file in case it helps)
UPDATE: To clarify, this language, which compiles into code being run by a virtual machine, is embedded into another system - a game, specifically. It has scalars and lists, and there are no complex data types. When I say I want to add objects to the language, that's actually a misnomer. I am not adding support for user-defined types to my language.
The objects being accessed with the object construct are actually objects from the game which I'm allowing the language processor to access through an intermediate layer which connects the VM to the game engine. This layer is designed to decouple as much as possible the language definition and the virtual machine mechanics from the implementation and details of the game engine.
So when, in my language I write:
player->name
That only gets codified by the compiler. "player" and "name" are not traditional identifiers because they are not added to the symbol table, and nothing is done with them at compile time except to translate the request for the name of the player into 3-address code.
It seems you are doing a classical error when using direct strings in the yacc source file. Since you are using a lexer, you can only use token names in yacc source files. More on this here
So I spent a reasonable amount of time picking over the grammar (and the bison output) and can't see what is obviously wrong here. Without having the means to execute it, I can't easily figure out what is going on by experimentation. Therefore, here are some concrete steps I usually go through when debugging grammars. Hopefully you can do any of these you haven't already done and then perhaps post follow-ups (or edit your question) with any results that might be revealing:
Contrive the smallest (in terms of number of tokens) possible working input, and the smallest possible non-working inputs based on the rules you expect to be applied.
Create a copy of the grammar file including only the troublesome rules and as few other supporting rules as you can get away with (i.e. you want a language that only allows construction of sequences consisting of the object and more_object rules, joined by ARROW. Does this work as you expect?
Does the rule in which it is nested work as you expect? Try replacing object with some other very simple rule (using some tokens not occuring elsewhere) and seeing if you can include those tokens without it breaking everything else.
Run bison with --report=all. Inspect the output to try to trace the rules you've added and the states that they affect. Try removing those rules and repeat the process - what has changed? This is extremely time consuming often, and is a giant pain, but it's a good last resort. I recommend a pencil and some paper.
Looking at the structure of your error output - '+' is being recognised as an identifier token, and is therefore being looked up as a symbol. It might be worth checker your lexer to see how it is processing identifier tokens. You might just accidentally be grabbing too much. As a further debugging technique, you might consider turning some of those token literals (e.g. '+', '{', etc) into real tokens so that bison's error reporting can help you out a little more.
EDIT: OK, the more I've dug into it, the more I'm convinced that the lexer is not necessarily working as it should be. I would double-check that the stream of tokens you are getting from yylex() matches your expectations before proceeding any further. In particular, it looks like a bunch of symbols that you consider special (e.g. '+' and '{') are being captured by some of your regular expressions, or at least are being allowed to pass for identifiers.
You don't get shift/reduce conflicts because your rules using object_name and more_objects are right-recursive - rather than the left-recursive rules that Yacc (Bison) handles most naturally.
On classic Yacc, you would find that you can run out of stack space with deep enough nesting of the 'object->name->what->not' notation. Bison extends its stack at runtime, so you have to run out of memory, which is a lot harder these days than it was when machines had a few megabytes of memory (or less).
One result of the right-recursion is that no reductions occur until you read the last of the object names in the chain (or, more accurately, one symbol beyond that). I see that you've used right-recursion with your statement_list rule too - and in a number of other places too.
I think your principal problem is that you failed to define a subtree constructor
in your object subgrammar. (EDIT: OP says he left the semantic actions for
object out of his example text. That doesn't change the following answer).
You probably have to lookup up the objects in the order encountered, too.
Maybe you intended:
primary_expression
: constant_value { $$ = $1; }
| '(' expression ')' { $$ = $2; }
| list_initialization { $$ = $1; }
| function_call { $$ = $1; }
| object { $$ = $1; }
;
object
: IDENTIFIER { $$ = LookupVariableOrObject( yytext ); }
| object ARROW IDENTIFIER { $$ = LookupSubobject( $1, yytext ); }
;
I assume that if one encounters an identifier X by itself, your default interpretation
is that it is a variable name. But, if you encounter X -> Y, then even if X
is a variable name, you want the object X with subobject Y.
What LookupVarOrObject does is to lookup the leftmost identifier encountered to see if it is variable
(and return essentially the same value as idlookup which must produce an AST node of type AST_VAR),
or see if it is valid object name, and return an AST node marked as an AST_OBJ,
or complain if the identifier isn't one of these.
What LookupSuboject does, is to check its left operand to ensure it is an AST_OBJ
(or an AST_VAR whose name happens to be the same as that of an object).
and complain if it is not. If it is, then its looks up the yytext-child object of
the named AST_OBJ.
EDIT: Based on discussion comments in another answer, right-recursion in the OP's original
grammar might be problematic if the OP's semantic checks inspect global lexer state (yytext).
This solution is left-recursive and won't run afoul of that particular trap.
id_lookup
: IDENTIFIER
is formally identical to
object_name
: IDENTIFIER
and object_name would accept everything that id_lookup wouldn't, so assertLookup( yytext ); probably runs on everything that may look like IDENTIFIER and is not accepted by enother rule just to decide between the 2 and then object_name can't accept because single lookahead forbids that.
For the twilight zone, the two chars that you got errors for are not declared as tokens with opends the zone of undefinded behavior and could trip parser into trying to treat them as potential identifiers when the grammar gets loose.
I just tried running muscl in Ubuntu 10.04 using bison 2.4.1 and I was able to run both of your examples with no syntax errors. My guess is that you have a bug in your version of bison. Let me know if I'm somehow running your parser wrong. Below is the output from the first example you gave.
./muscle < ./test1.m (this was your first test)
\-statement list
|-declaration (constant)
| |-symbol reference
| | \-list (constant)
| \-list
| |-value
| | \-1
| |-value
| | \-2
| |-value
| | \-3
| \-value
| \-4
|-loop (for-in)
| |-symbol reference
| | \-x (variable)
| |-symbol reference
| | \-list (constant)
| \-statement list
| \-send statement
| \-binary op (addition)
| |-binary op (addition)
| | |-value
| | | \-foo
| | \-symbol reference
| | \-x (variable)
| \-value
| \-!
\-send statement
\-value
\-Done!
+-----+----------+-----------------------+-----------------------+
| 1 | VALUE | 1 | |
| 2 | ELMT | #1 | |
| 3 | VALUE | 2 | |
| 4 | ELMT | #3 | |
| 5 | VALUE | 3 | |
| 6 | ELMT | #5 | |
| 7 | VALUE | 4 | |
| 8 | ELMT | #7 | |
| 9 | LIST | | |
| 10 | CONST | #10 | #9 |
| 11 | ITER_NEW | #11 | #10 |
| 12 | BRA | #14 | |
| 13 | ITER_INC | #11 | |
| 14 | ITER_END | #11 | |
| 15 | BRT | #22 | |
| 16 | VALUE | foo | |
| 17 | ADD | #16 | #11 |
| 18 | VALUE | ! | |
| 19 | ADD | #17 | #18 |
| 20 | SEND | #19 | |
| 21 | BRA | #13 | |
| 22 | VALUE | Done! | |
| 23 | SEND | #22 | |
| 24 | HALT | | |
+-----+----------+-----------------------+-----------------------+
foo 1!
foo 2!
foo 3!
foo 4!
Done!

Resources