EBNF with comma repetition - parsing

I frequently write the following pattern to allow one or more entries separated by commas:
( function | expression ) ( ',' ( function | expression ))*
Is there a more compact way to do this? Ideally I'd just like to be able to do something along the lines of:
( function | expression ) [,...]
Or:
( function | expression ',')*
By the way, I am using this as a validator: https://www.bottlecaps.de/rr/ui#_Production
The whole grammar I am trying to 'clean up' is the following:
AGGREGATION
::= 'GROUP BY' ( GROUPING_ROWS | PIVOT )?
PIVOT
::= 'PIVOT(' AXIS_EXPR (AXIS_EXPR ',' )? ')'
AXIS_EXPR
::= expr ( 'AS'? alias )? 'ON' ( 'ROWS' | 'COLS' ) ( 'HAVING' expr )? ( 'ORDER BY' expr ( 'ASC' | 'DESC' )? )? ( 'LIMIT' num 'PERCENT'? )?
GROUPING_ROWS
::= 'GROUPING_ROWS(' GROUPING_EXPR (GROUPING_EXPR ',' )? ')'
GROUPING_EXPR
::= NAME_OR_POS 'SUBTOTAL' 'S'? GROUPING_EXPR_SUBTOTAL (',' GROUPING_EXPR_SUBTOTAL)*
GROUPING_EXPR_SUBTOTAL
::= NAME_OR_POS ':' AGGREGATED_CALCULATION ( ',' AGGREGATED_CALCULATION )*
NAME_OR_POS
::= ( name | pos )
AGGREGATED_CALCULATION
::= ( aggregation_function | aggregation_expression ) ( 'AS'? alias)?
And as an example of the construct I find myself using all the time:

( function | expression ) ( ',' ( function | expression ))*
Is there a more compact way to do this?
Other than introducing "helper rules" like this:
rule
: atom_list
;
atom_list
: atom (',' atom)*
;
atom
: function
| expression
;
the answer is no: ANTLR has no shorter notation for a (',' a)*. Note that (a ',')* is not equivalent, because it requires a comma after every a, including the last one.
If you're repeating function | expression a lot, at the very least make a separate rule of those alternatives.
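The same factoring idea carries over to a hand-written parser, where the "helper rule" becomes a reusable function. The following is only a sketch in Python, not ANTLR: `parse_separated` and `parse_atom` are hypothetical names, and the token list is assumed to be already split.

```python
def parse_separated(tokens, pos, parse_item, sep=","):
    """Parse `item (sep item)*` starting at tokens[pos].

    Returns (items, next_pos). `parse_item` is any function that takes
    (tokens, pos) and returns (item, next_pos).
    """
    item, pos = parse_item(tokens, pos)
    items = [item]
    while pos < len(tokens) and tokens[pos] == sep:
        # A separator must always be followed by another item.
        item, pos = parse_item(tokens, pos + 1)
        items.append(item)
    return items, pos


def parse_atom(tokens, pos):
    """Hypothetical item rule: accept any single token that is not a comma."""
    if pos >= len(tokens) or tokens[pos] == ",":
        raise SyntaxError(f"expected an item at position {pos}")
    return tokens[pos], pos + 1
```

For example, `parse_separated(["f(x)", ",", "a+b", ",", "c"], 0, parse_atom)` returns `(["f(x)", "a+b", "c"], 5)`, while a trailing comma raises `SyntaxError`.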


How to resolve the no viable input error in racket?

I've been trying to execute the following program in a parser
#lang racket
( define ( highest-number xs )
( define ( max x1 x2 )
( if ( > x1 x2) x1 x2 ) )
( foldl max ( first xs ) ( rest xs ) ) )
The error that is generated is as follows :
line 3:4 no viable alternative at input '( define'
These are my rules :
grammar hello;
program
: defOrExpr+ EOF
;
defOrExpr
: definition
| expr
| testCase
| libraryRequire
;
definition
: '(' 'define' '(' name NAME+ ')' expr ')'
| '(' 'define' name expr ')'
| '(' 'define-struct' name '(' name* ')' ')'
;
(As Bart mentioned, it's much easier to help if there's a buildable grammar and sample input to reproduce your problem.)
In this case, I think the problem is fairly obvious.
Your definition rule does not contain an alternative that allows for a nested definition, and your sample input has a nested definition.
I'm not familiar enough with Racket to suggest an alternative to address the issue, but that's why you're getting your error. (It is reported at the nested ( define ( max x1 x2 ) ... ) part on line 3, not at the first define.)
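To make the diagnosis concrete, here is a rough Python sketch (not ANTLR) that checks a function definition with and without allowing inner definitions before the final body expression. Everything in it is an assumption for illustration: `read`, `is_expr` and `is_definition` are hypothetical helpers, and the question's expr rule is not shown, so expressions are approximated as "any form that does not start with define".

```python
def read(src):
    """Read one balanced s-expression into nested Python lists."""
    toks = src.replace("(", " ( ").replace(")", " ) ").split()

    def form(i):
        if toks[i] == "(":
            items, i = [], i + 1
            while toks[i] != ")":
                item, i = form(i)
                items.append(item)
            return items, i + 1
        return toks[i], i + 1

    tree, _ = form(0)
    return tree


def is_expr(f):
    # Approximation: an atom, or a non-empty list that is not a define form.
    return isinstance(f, str) or (
        bool(f) and f[0] != "define" and all(is_expr(x) for x in f)
    )


def is_definition(f, allow_nested):
    if not (isinstance(f, list) and len(f) >= 3 and f[0] == "define"):
        return False
    header, body = f[1], f[2:]
    if isinstance(header, str):  # '(' 'define' name expr ')'
        return len(body) == 1 and is_expr(body[0])
    if len(header) < 2 or not all(isinstance(n, str) for n in header):
        return False  # function name plus at least one parameter
    *inner, last = body
    if inner and not allow_nested:
        return False  # mirrors the question's rule: exactly one body expr
    return all(is_definition(d, allow_nested) for d in inner) and is_expr(last)


PROGRAM = """
(define (highest-number xs)
  (define (max x1 x2) (if (> x1 x2) x1 x2))
  (foldl max (first xs) (rest xs)))
"""
```

With this input, `is_definition(read(PROGRAM), allow_nested=False)` is `False` and `is_definition(read(PROGRAM), allow_nested=True)` is `True`, which matches the observation that the grammar needs an alternative for definitions inside a function body.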

Ambiguity between tuple and parenthesized expression

One construct that can remain ambiguous almost until its very end is a tuple versus a parenthesized expression. A tuple is distinguished from a parenthesized expression by the presence of a comma, and a single-member tuple is often not allowed because it would be ambiguous. For example, from BigQuery:
Tuple syntax
(expr1, expr2 [, ... ])
The output type is an anonymous STRUCT type with anonymous fields with types matching the types of the input expressions. There must be at least two expressions specified. Otherwise this syntax is indistinguishable from an expression wrapped with parentheses.
I am having trouble figuring out why my grammar is ambiguous here, which allows for both:
grammar DBParser;
options { caseInsensitive=true; }
statement: select EOF;
select:
'SELECT' expr (',' expr)*
('FROM' expr) ?
('WHERE' expr) ?
;
expr
: '(' expr ')' # parenExpression
| '(' expr (',' expr)+ ')' # tupleLiteralExpression
| expr 'IN' expr # inExpression
| select # subSelectExpression
| Atom # constantExpression
;
Atom:
[a-z-]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;
And with my input:
SELECT id FROM sales WHERE country IN ((select 1,1,1,1,1,1,1,1,1),1)
The ANTLR profiler reports ambiguities for this input.
Why is this occurring, and how would I properly resolve this?
The ambiguity arises from the non-parenthesized sub-select expression. For example if we have:
SELECT a FROM b WHERE x IN (select 1,1)
The IN expression part can be parsed in two different ways:
Atom inExpression(tupleLiteralExpression(subSelectExpression, Atom))
Or as:
Atom inExpression(subSelectExpression)
This is because (SELECT 1,1) can be read either as a parenthesized select clause SELECT 1,1 or as a tuple containing two elements, SELECT 1 and 1.
Because of this, we must require parentheses around the sub-select so we know where the select clause starts and ends. Here is a grammar that resolves the ambiguity:
grammar DBParser;
options { caseInsensitive=true; }
statement: select EOF;
select:
'SELECT' expr (',' expr)*
('FROM' expr) ?
('WHERE' expr) ?
;
expr
: '(' expr ')' # parenExpression
| '(' expr (',' expr)+ ')' # tupleLiteralExpression
| expr 'IN' expr # inExpression
| '(' select ')' # subSelectExpression
| Atom # constantExpression
;
Atom:
[a-z-]+ | [0-9]+ | '\'' Atom '\''
;
WHITESPACE: [ \t\r\n] -> skip;
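The two readings can be checked by counting parse trees. The sketch below is plain Python, not ANTLR, and it simplifies the grammar: IN, FROM and WHERE are left out, and `count_parses` with its `bare_subselect` flag is a hypothetical helper. With `bare_subselect=True` a select is an expr on its own (the first grammar); with `False` it must be written as '(' select ')' (the second grammar).

```python
from functools import lru_cache

RESERVED = {"(", ")", ",", "select"}


def count_parses(tokens, bare_subselect):
    """Count the distinct parse trees of `expr` over the whole token list."""
    toks = tuple(t.lower() for t in tokens)

    @lru_cache(maxsize=None)
    def expr(i, j):
        if j - i < 1:
            return 0
        n = 0
        if j - i == 1 and toks[i] not in RESERVED:
            n += 1                               # Atom
        if toks[i] == "(" and toks[j - 1] == ")":
            n += expr(i + 1, j - 1)              # '(' expr ')'
            n += expr_list(i + 1, j - 1, 2)      # '(' expr (',' expr)+ ')'
            if not bare_subselect:
                n += select(i + 1, j - 1)        # '(' select ')'
        if bare_subselect:
            n += select(i, j)                    # select used directly as expr
        return n

    @lru_cache(maxsize=None)
    def select(i, j):
        if j - i < 2 or toks[i] != "select":
            return 0
        return expr_list(i + 1, j, 1)            # 'SELECT' expr (',' expr)*

    @lru_cache(maxsize=None)
    def expr_list(i, j, at_least):
        # Ways toks[i:j] derives a comma-separated list with >= at_least items.
        n = expr(i, j) if at_least <= 1 else 0
        for k in range(i + 1, j - 1):            # k = position of the first separator
            if toks[k] == ",":
                n += expr(i, k) * expr_list(k + 1, j, max(at_least - 1, 1))
        return n

    return expr(0, len(toks))
```

For the token list `"( select 1 , 1 )".split()`, the count is 2 with `bare_subselect=True` and 1 with `bare_subselect=False`.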

ANTLR4 parser rules with other parser rules as arguments (meta-rules)

I would like to be able to write a "meta-rule" in ANTLR4 that takes a rule as an input argument and performs a set modification to that rule. Here's an example grammar:
grammar G;
WS: [ \t\n\r] + -> skip;
CHAR: [a-z];
term: (CHAR)+;
sum: term ('+' term)+;
pterm: '(' term ')' | '(' pterm ')';
psum: '(' sum ')' | '(' psum ')';
expr: term | sum | pterm | psum;
The rules for pterm and psum perform the same action on term and sum, enclosing them in possibly nested parentheses. I would like to be able to replace the last three lines above with something like the following:
enclose[rule]: '(' rule ')' | '(' enclose(rule) ')';
expr: term | sum | enclose(term) | enclose(sum);
Is there a way to construct a meta-rule like this?
The short answer is no.
It is better to refactor the grammar around the structurally significant terms:
expr: sum | LPAREN expr RPAREN ; // a bare sum, or any expr wrapped in parentheses
sum : term ('+' term)* ; // changed to Kleene star
term: CHAR+ ;
LPAREN : '(' ;
RPAREN : ')' ;
CHAR : [a-z] ;
WS : [ \t\n\r]+ -> skip ;
The sum rule will consume all terms, so the expr rule only needs to handle sums.
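As a quick way to check which inputs such a refactoring accepts, here is a small Python recognizer. It is only a sketch and rests on one assumption: expr should accept a bare sum as well as any expr in parentheses, as the question's original expr rule does. The function name `accepts` is hypothetical.

```python
def accepts(text):
    """Recognize: expr : sum | '(' expr ')' ;  sum : term ('+' term)* ;  term : CHAR+ ;"""
    toks = [c for c in text if not c.isspace()]  # whitespace is skipped, as with WS
    pos = 0

    def term():
        nonlocal pos
        start = pos
        while pos < len(toks) and "a" <= toks[pos] <= "z":
            pos += 1
        return pos > start                       # at least one CHAR

    def sum_():
        nonlocal pos
        if not term():
            return False
        while pos < len(toks) and toks[pos] == "+":
            pos += 1
            if not term():
                return False
        return True

    def expr():
        nonlocal pos
        if pos < len(toks) and toks[pos] == "(":
            pos += 1
            if not expr():
                return False
            if pos < len(toks) and toks[pos] == ")":
                pos += 1
                return True
            return False
        return sum_()

    return expr() and pos == len(toks)
```

Under these rules `accepts("ab")`, `accepts("a+b")` and `accepts("((ab + c))")` are `True`, while `accepts("(a")`, `accepts("a+")` and `accepts("(a)+b")` are `False`; the last one is also rejected by the question's original rules.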

ANTLR Parsing Literals and Quoted IDs

I'm working on an SQL grammar in ANTLR which allows quoted identifiers (table names, field names, etc), as well as quoted literal strings.
The problem is that this grammar seems to always match quoted inputs as "QUOTED_LITERAL", and never as IDs wrapped in quotes.
Here are my results:
input: 'blahblah' result: string_literal as expected.
input: field1 result: column_name as expected
input: table.field1 result: column_spec as expected
input: 'table'.'field1' result: string_literal, MissingTokenException
Below is my simplified grammar for the expression portion of the SQL grammar, if anybody can help identify what is needed to match quoted rules other than the quoted literal, thanks.
grammar test;
expression
:
simpleExpression EOF!
;
simpleExpression
:
column_spec
| literal_value
;
column_spec
:
(table_name '.')? column_name
| ('\''table_name '\'''.')? '\'' column_name '\''
| ('\"'table_name '\"' '.')? '\"' column_name '\"'
;
string_literal: QUOTED_LITERAL ;
boolean_literal: 'TRUE' | 'FALSE' ;
literal_value :
(
string_literal
| boolean_literal
)
;
table_name :ID;
column_name :ID;
QUOTED_LITERAL:
( '\''
( ('\\' '\\') | ('\'' '\'') | ('\\' '\'') | ~('\'') )*
'\'' )
|
( '\"'
( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )*
'\"' )
;
ID
:
( 'A'..'Z' | 'a'..'z' ) ( 'A'..'Z' | 'a'..'z' | '_' | '0'..'9'| '::' )*
;
WHITE_SPACE : ( ' '|'\r'|'\t'|'\n' ) {$channel=HIDDEN;} ;
In case anybody is interested, I removed a little bit of the flexibility from the quoted literal strings. Literal strings can only be quoted by single quotes, and identifiers can optionally be quoted by double quotes. As long as the literal quote and the identifier quote are well defined and do not overlap, the grammar is trivial.
This policy makes the grammar much cleaner, and doesn't remove the ability to quote identifiers. I use the JDBC method DatabaseMetaData.getIdentifierQuoteString to report which quote can be used to wrap identifiers.
This is your classical shift/reduce conflict. (Except that ANTLR does not shift or reduce; since it is not a stack automaton.)
You have the following problem:
When you are in the simpleExpression state you need to decide which branch to take with one token of lookahead. In the case of ANTLR, since no distinction is made between lexer and parser, the one token is a single character. (You should see a warning from ANTLR about the conflict.)
It gets even better: what is the difference between "Bob Dillan" and "table1"? From the parser's point of view, none. So how do you expect to distinguish between:
('\"'table_name '\"' '.')? '\"' column_name '\"'
and
( '\"'
( ('\\' '\\') | ('\"' '\"') | ('\\' '\"') | ~('\"') )*
'\"' )
I strongly suggest rewriting the simpleExpression rule as:
simpleExpression:
IDENTIFIER |
IDENTIFIER '.' IDENTIFIER |
QUOTED_LITERAL |
QUOTED_LITERAL '.' QUOTED_LITERAL |
boolean_literal;
And then decide in the action code of simpleExpression what to do. Especially since I am quite sure that you can reference a table with a quoted name; nevertheless "users" and "Bob Dillan" are syntactically equal.
It also depends on the greater grammar; you may also be able to resolve the ambiguity at a higher level.
The ANTLR lexer is greedy: when more than one token rule can match at the current position, it picks the longest possible match.
When the lexer sees 'some_id', it can match the first quote as just a quote, or a quoted literal. The literal is longer, so that matches.
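The longest-match behaviour can be imitated in a few lines of Python. This is a sketch under simplifying assumptions, not ANTLR's implementation: the token patterns are reduced versions of the question's rules (no escape handling, single quotes only), DOT is an added rule name, and ties go to the rule listed first.

```python
import re

# QUOTE is deliberately listed before QUOTED_LITERAL to show that rule order
# does not matter when one match is strictly longer.
RULES = [
    ("QUOTE", re.compile(r"'")),
    ("QUOTED_LITERAL", re.compile(r"'[^']*'")),
    ("ID", re.compile(r"[A-Za-z][A-Za-z0-9_]*")),
    ("DOT", re.compile(r"\.")),
]


def tokenize(text):
    """Split text into (rule_name, lexeme) pairs using longest match."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        best_name, best_end = None, pos
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strictly longer matches win; on a tie the earlier rule is kept.
            if match and match.end() > best_end:
                best_name, best_end = name, match.end()
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        tokens.append((best_name, text[pos:best_end]))
        pos = best_end
    return tokens
```

Here `tokenize("'table'.'field1'")` yields `QUOTED_LITERAL`, `DOT`, `QUOTED_LITERAL`: the parser never sees a separate quote followed by an ID, which is why the quoted alternatives of column_spec cannot match.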
As a side note, you generally do not want lexer rules that can match nothing (like ID), and you generally want parser rules to reference token names rather than use string constants.
What you want to do is something like this.
QUOTE: '\'';
ID: ('a'..'z' | 'A'..'Z')+; // Must have at least one character
QUOTED_LITERAL: QUOTE ( (ID QUOTE) => { $type=QUOTE; } ) | .* QUOTE;
id: ID | QUOTE ID QUOTE;
quoted_literal: QUOTED_LITERAL | QUOTE ID QUOTE;
If the lexer sees something that looks like a quoted id, it cannot tell which to use, so it breaks it up into smaller tokens. In your parser, you use id where you expect a possibly quoted ID, and quoted_literal where you expect a QUOTED_LITERAL.
The syntactic predicate in QUOTED_LITERAL prevents it from matching the full quote when the input is ambiguous.
Looking at this, it will fail to correctly parse lines like
'tag' text 'second'
as ' text ' will be parsed as a QUOTED_LITERAL. If that is a valid input, then you would need something like
fragment QUOTED_ID;
QUOTED_LITERAL: QUOTE ( ID {$type=QUOTED_ID} | .* ) QUOTE;
id: ID | QUOTED_ID;
quoted_literal: QUOTED_LITERAL | QUOTED_ID;
(My example does not cover all the cases in your input, but extending it should be obvious. You also probably need some actions to either generate the correct tokens in your AST or add/remove quotes from the text, depending on what you do after you parse.)

Practical solution to fix a Grammar Problem

We have little snippets of VB6 code (they only use a subset of features) that get written by non-programmers. These are called rules. They are hard to debug for the people writing them, so somebody wrote a kind of ad hoc parser to evaluate the subexpressions and thereby show more clearly where the problem is.
This ad hoc parser is very bad and does not really work well, so I'm trying to write a real parser. Because I'm writing it by hand (I found no parser generator with a VB6 backend that I could understand), I want to go with a recursive descent parser. I had to reverse-engineer the grammar because I couldn't find anything. (Eventually I found something, http://www.notebar.com/GoldParserEngine.html, but it's LALR and it's way bigger than I need.)
Here is the grammar for the subset of VB.
<Rule> ::= expr rule | e
<Expr> ::= ( expr )
| Not_List CompareExpr <and_or> expr
| Not_List CompareExpr
<and_or> ::= Or | And
<Not_List> ::= Not Not_List | e
<CompareExpr> ::= ConcatExpr comp CompareExpr
|ConcatExpr
<ConcatExpr> ::= term term_tail & ConcatExpr
|term term_tail
<term> ::= factor factor_tail
<term_tail> ::= add_op term term_tail | e
<factor> ::= add_op Value | Value
<factor_tail> ::= multi_op factor factor_tail | e
<Value> ::= ConstExpr | function | expr
<ConstExpr> ::= <bool> | number | string | Nothing
<bool> ::= True | False
<Nothing> ::= Nothing | Null | Empty
<function> ::= id | id ( ) | id ( arg_list )
<arg_list> ::= expr , arg_list | expr
<add_op> ::= + | -
<multi_op> ::= * | /
<comp> ::= > | < | <= | => | =< | >= | = | <>
All in all it works pretty well. Here is a simple example:
my_function(1, 2 , 3)
looks like
(Programm
(rule
(expr
(Not_List)
(CompareExpr
(ConcatExpr
(term
(factor
(value
(function
my_function
(arg_list
(expr
(Not_List)
(CompareExpr
(ConcatExpr (term (factor (value 1))) (term_tail))))
(arg_list
(expr
(Not_List)
(CompareExpr
(ConcatExpr (term (factor (value 2))) (term_tail))))
(arg_list
(expr
(Not_List)
(CompareExpr
(ConcatExpr (term (factor (value 3))) (term_tail))))
(arg_list))))))))
(term_tail))))
(rule)))
Now, what's my problem?
If you have code that looks like (( true OR false ) AND true), I get an infinite recursion. But the real problem is that in (true OR false) AND true, everything after the first ( expr ) is dropped: the input is understood as only (true OR false).
So how do I solve this? Should I change the grammar somehow or use some implementation hack?
Here is a harder example in case you need it:
(( f1 OR f1 ) AND (( f3="ALL" OR f4="test" OR f5="ALL" OR f6="make" OR f9(1, 2) ) AND ( f7>1 OR f8>1 )) OR f8 <> "")
You have several issues that I see.
You are treating OR and AND as equal-precedence operators. You should have separate rules for OR and for AND. Otherwise you will get the wrong precedence (and therefore the wrong evaluation) for the expression A OR B AND C.
So as a first step, I'd revise your rules as follows:
<Expr> ::= ( expr )
| Not_List AndExpr Or Expr
| Not_List AndExpr
<AndExpr> ::= CompareExpr And AndExpr
| Not_List CompareExpr
Next problem is that you have ( expr ) at the top level of your list. What if I write:
A AND (B OR C)
To fix this, change these two rules:
<Expr> ::= Not_List AndExpr Or Expr
| Not_List AndExpr
<Value> ::= ConstExpr | function | ( expr )
I think your implementation of Not is not appropriate. Not is an operator, just with one operand, so its "tree" should have a Not node and a child which is the expression being negated. What you have is a list of Nots with no operands.
Try this instead:
<Expr> ::= AndExpr Or Expr
| AndExpr
<Value> ::= ConstExpr | function | ( expr ) | Not Value
I haven't looked, but I think VB6 expressions have other messy things in them.
If you notice, the style of Expr and AndExpr I have written uses right recursion to avoid left recursion. You should change your Concat, Sum, and Factor rules to follow a similar style; what you have is pretty complicated and hard to follow.
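Since the question is about a hand-written recursive descent parser, here is a minimal Python sketch of the revised rules, one function per rule. It makes several assumptions: operands are plain identifiers (comparison, concatenation and arithmetic are left out), the function names are hypothetical, and the tokenizer silently ignores characters it does not know.

```python
import re


def parse(text):
    """Parse OR / AND / NOT with parentheses into nested tuples."""
    toks = re.findall(r"\(|\)|[A-Za-z_]\w*", text)  # unknown characters are skipped
    pos = 0

    def peek():
        return toks[pos].upper() if pos < len(toks) else None

    def advance():
        nonlocal pos
        tok = toks[pos]
        pos += 1
        return tok

    def expr():       # Expr ::= AndExpr Or Expr | AndExpr
        left = and_expr()
        if peek() == "OR":
            advance()
            return ("or", left, expr())
        return left

    def and_expr():   # AndExpr ::= Value And AndExpr | Value
        left = value()
        if peek() == "AND":
            advance()
            return ("and", left, and_expr())
        return left

    def value():      # Value ::= ( Expr ) | Not Value | identifier
        tok = peek()
        if tok == "(":
            advance()
            inner = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return inner
        if tok == "NOT":
            advance()
            return ("not", value())
        if tok is None or tok in (")", "AND", "OR"):
            raise SyntaxError(f"unexpected token {tok!r}")
        return advance()

    tree = expr()
    if pos != len(toks):
        raise SyntaxError(f"unexpected token {toks[pos]!r}")
    return tree
```

With this layout, `parse("A OR B AND C")` gives `("or", "A", ("and", "B", "C"))`, and the problematic input `parse("(( true OR false ) AND true)")` gives `("and", ("or", "true", "false"), "true")`, because the parenthesized expression is handled at the Value level rather than at the top of Expr.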
If they are just creating snippets then perhaps VB5 is "good enough" for creating them. And if VB5 is good enough, the free VB5 Control Creation Edition might be worth tracking down for them to use:
http://www.thevbzone.com/vbcce.htm
You could have them start from a "test harness" project they add snippets to, and they can even test them out.
With a little orientation this will probably prove much more practical than hand crafting a syntax analyzer, and a lot more useful since they can test for more than correct syntax.
Where VB5 is lacking you might include a static module in the "test harness" that provides a rough and ready equivalent of Split(), Replace(), etc:
http://support.microsoft.com/kb/188007
