How to choose lexer/parser rule on instanciation - parsing

I have a grammar for parsing diverse SQL code.
Problem :
- sometimes, I want it to handle nested comments (ex: Microsoft SQL):
COMMENT: '/*' (COMMENT|.)*? ('*/' | EOF) -> channel(HIDDEN);
- sometimes, I want it to not handle them (ex: Oracle):
COMMENT: '/*' .*? '*/' -> channel(HIDDEN);
I don't want to :
make two different grammars
compile my grammar twice to get two different lexers and two different parsers.
Best solution would be to have an argument, passed to the lexer/parser, to choose which "COMMENT implementation" to use.
Can I do this ? If yes, how, if not, is there a satisfying solution to my problem ?
Thanks !

You can achieve this using a semantic predicate in your lexer. You will need to (1) split the lexer and parser from each other. (2) Create a base class for the lexer with a boolean field, property, or method that you can set true if you want the lexer to allow nested comments, or false to disallow. For sake of below code, assume you add "bool nested = false;" to the lexer base class. (3) Within your lexer grammar, create one COMMENT rule as shown below. (4) After creating you lexer, assign the "nested" field to true if you want nested comments to be recognized.
COMMENT
: (
{nested}? '/*' (COMMENT|.)*? ('*/' | EOF)
| '/*' .*? '*/') -> channel(HIDDEN)
;

Related

Validating expressions in the parser

I am working on a SQL grammar where pretty much anything can be an expression, even places where you might not realize it. Here are a few examples:
-- using an expression on the list indexing
SELECT ([1,2,3])[(select 1) : (select 1 union select 1 limit 1)];
Of course this is an extreme example, but my point being, many places in SQL you can use an arbitrarily nested expression (even when it would seem "Oh that is probably just going to allow a number or string constant).
Because of this, I currently have one long rule for expressions that may reference itself, the following being a pared down example:
grammar DBParser;
options { caseInsensitive=true; }
statement:select_statement EOF;
select_statement
: 'SELECT' expr
'WHERE' expr // The WHERE clause should only allow a BoolExpr
;
expr
: expr '=' expr # EqualsExpr
| expr 'OR' expr # BoolExpr
| ATOM # ConstExpr
;
ATOM: [0-9]+ | '\'' [a-z]+ '\'';
WHITESPACE: [ \t\r\n] -> skip;
With sample input SELECT 1 WHERE 'abc' OR 1=2. However, one place I do want to limit what expressions are allowed is in the WHERE (and HAVING) clause, where the expression must be a boolean expression, in other words WHERE 1=1 is valid, but WHERE 'abc' is invalid. In practical terms what this means is the top node of the expression must be a BoolExpr. Is this something that I should modify in my parser rules, or should I be doing this validation downstream, for example in the semantic phase of validation? Doing it this way would probably be quite a bit simpler (even if the lexer rules are a bit lax), as there would be so much indirection and probably indirect left-recursion involved that it would become incredibly convoluted. What would be a good approach here?
Your intuition is correct that breaking this out would probably create indirect left recursion. Also, is it possible that an IDENTIFIER could represent a boolean value?
This is the point of #user207421's comment. You can't fully capture types (i.e. whether an expression is boolean or not) in the parser.
The parser's job (in the Lexer & Parser sense), put fairly simply, is to convert your input stream of characters into the a parse tree that you can work with. As long as it gives a parse tree that is the only possible way to interest the input (whether it is semantically valid or not), it has served its purpose. Once you have a parse tree then during semantic validation, you can consider the expression passed as a parameter to your where clause and determine whether or not it has a boolean value (this may even require consulting a symbol table to determine the type of an identifier). Just like your semantic validation of an OR expression will need to determine that both the lhs and rhs are, themselves, boolean expressions.
Also consider that even if you could torture the parser into catching some of your type exceptions, the error messages you produce from semantic validation are almost guaranteed to be more useful than the generated syntax errors. The parser only catches syntax errors, and it should probably feel a bit "odd" to consider a non-boolean expression to be a "syntax error".

ANTLR: Why is this grammar rule for a tuples not LL(1)?

I have the following grammar rules defined to cover tuples of the form: (a), (a,), (a,b), (a,b,) and so on. However, antlr3 gives the warning:
"Decision can match input such as "COMMA" using multiple alternatives: 1, 2
I believe this means that my grammar is not LL(1). This caught me by surprise as, based on my extremely limited understanding of this topic, the parser would only need to look one token ahead from (COMMA)? to ')' in order to know which comma it was on.
Also based on the discussion I found here I am further confused: Amend JSON - based grammar to allow for trailing comma
And their source code here: https://github.com/doctrine/annotations/blob/1.13.x/lib/Doctrine/Common/Annotations/DocParser.php#L1307
Is this because of the kind of parser that antlr is trying to generate and not because my grammar isn't LL(1)? Any insight would be appreciated.
options {k=1; backtrack=no;}
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
DIGIT : '0'..'9' ;
LOWER : 'a'..'z' ;
UPPER : 'A'..'Z' ;
IDENT : (LOWER | UPPER | '_') (LOWER | UPPER | '_' | DIGIT)* ;
edit: changed typo in tuple: ... from (IDENT)? to (COMMA)?
Note:
The question has been edited since this answer was written. In the original, the grammar had the line:
tuple : '(' IDENT (COMMA IDENT)* (IDENT)? ')';
and that's what this answer is referring to.
That grammar works without warnings, but it doesn't describe the language you intend to parse. It accepts, for example, (a, b c) but fails to accept (a, b,).
My best guess is that you actually used something like the grammars in the links you provide, in which the final optional element is a comma, not an identifier:
tuple : '(' IDENT (COMMA IDENT)* (COMMA)? ')';
That does give the warning you indicate, and it won't match (a,) (for example), because, as the warning says, the second alternative has been disabled.
LL(1) as a property of formal grammars only applies to grammars with fixed right-hand sides, as opposed to the "Extended" BNF used by many top-down parser generators, including Antlr, in which a right-hand side can be a set of possibilities. It's possible to expand EBNF using additional non-terminals for each subrule (although there is not necessarily a canonical expansion, and expansions might differ in their parsing category). But, informally, we could extend the concept of LL(k) by saying that in every EBNF right-hand side, at every point where there is more than one alternative, the parser must be able to predict the appropriate alternative looking only at the next k tokens.
You're right that the grammar you provide is LL(1) in that sense. When the parser has just seen IDENT, it has three clear alternatives, each marked by a different lookahead token:
COMMA ↠ predict another repetition of (COMMA IDENT).
IDENT ↠ predict (IDENT).
')' ↠ predict an empty (IDENT)?.
But in the correct grammar (with my modification above), IDENT is a syntax error and COMMA could be either another repetition of ( COMMA IDENT ), or it could be the COMMA in ( COMMA )?.
You could change k=1 to k=2, thereby allowing the parser to examine the next two tokens, and if you did so it would compile with no warnings. In effect, that grammar is LL(2).
You could make an LL(1) grammar by left-factoring the expansion of the EBNF, but it's not going to be as pretty (or as easy for a reader to understand). So if you have a parser generator which can cope with the grammar as written, you might as well not worry about it.
But, for what it's worth, here's a possible solution:
tuple : '(' idents ')' ;
idents : IDENT ( COMMA ( idents )? )? ;
Untested because I don't have a working Antlr3 installation, but it at least compiles the grammar without warnings. Sorry if there is a problem.
It would probably be better to use tuple : '(' (idents)? ')'; in order to allow empty tuples. Also, there's no obvious reason to insist on COMMA instead of just using ',', assuming that '(' and ')' work as expected on Antlr3.

Set a rule based on the value of a global variable

In my lexer & parser by ocamllex and ocamlyacc, I have a .mly as follows:
%{
open Params
open Syntax
%}
main:
| expr EOF { $1 }
expr:
| INTEGER { EE_integer $1 }
| LBRACKET expr_separators RBRACKET { EE_brackets (List.rev $2) }
expr_separators:
/* empty */ { [] }
| expr { [$1] }
| expr_separators ...... expr_separators { $3 :: $1 }
In params.ml, a variable separator is defined. Its value is either ; or , and set by the upstream system.
In the .mly, I want the rule of expr_separators to be defined based on the value of Params.separator. For example, when params.separtoris ;, only [1;2;3] is considered as expr, whereas [1,2,3] is not. When params.separtoris ,, only [1,2,3] is considered as expr, whereas [1;2;3] is not.
Does anyone know how to amend the lexer and parser to realize this?
PS:
The value of Params.separator is set before the parsing, it will not change during the parsing.
At the moment, in the lexer, , returns a token COMMA and ; returns SEMICOLON. In the parser, there are other rules where COMMA or SEMICOLON are involved.
I just want to set a rule expr_separators such that it considers ; and ignores , (which may be parsed by other rules), when Params.separator is ;; and it considers , and ignore ; (which may be parsed by other rules), when Params.separator is ,.
In some ways, this request is essentially the same as asking a macro preprocessor to alter its substitution at runtime, or a compiler to alter the type of a variable. As with the program itself, once the grammar has been compiled (whether into executable code or a parsing table), it's not possible to go back and modify it. At least, that's the case for most LR(k) parser generators, which produce deterministic parsers.
Moreover, it seems unlikely that the only difference the configuration parameter makes is the selection of a single separator token. If the non-selected separator token "may be parsed by other rules", then it may be parsed by those other rules when it is the selected separator token, unless the configuration setting also causes those other rules to be suppressed. So at a minimum, it seems like you'd be looking at something like:
expr : general_expr
expr_list : expr
%if separator is comma
expr : expr_using_semicolon
expr_list : expr_list ',' expr
%else
expr : expr_using_comma
expr_list : expr_list ';' expr
%endif
Without a more specific idea of what you're trying to achieve, the best suggestion I can provide is that you write two grammars and select which one to use at runtime, based on the configuration setting. Presumably the two grammars will be mostly similar, so you can probably use your own custom-written preprocessor to generate both of them from the same input text, which might look a bit like the above example. (You can use m4, which is a general-purpose macro processor, but you might feel the learning curve is too steep for such a simple application.)
Parser generators which produce general parsers have an easier time with run-time dynamic modifications; many such parser generators have mechanisms which can do that (although they are not necessarily efficient mechanisms). For example, the Bison tool can produce GLR parsers, in which case you can select or deselect specific rules using a predicate action. The OCAML GLR generator Dypgen allows sets of rules to be dynamically added to the grammar during the parse. (I've never used dypgen, but I keep on meaning to try it; it looks interesting.) And there are many others.
Having played around with dynamic parsing features in some GLR parsers, I can only say that my personal experience has been a bit mixed. Modifying grammars at run-time is a brittle technique; grammars tend not to be very easy to split into independent pieces, so modifying a grammar rule can have unexpected consequences in places you don't expect to be affected. You don't always know exactly what language your parsing accepts, because the dynamic modifications can be hard to predict. And so on. My suggest, if you try this technique, is to start with the simplest modification possible and put a lot more effort into grammar tests (which is always a good idea, anyway).

Antlr Matlab grammar lexing conflict

I've been using the Antlr Matlab grammar from Antlr grammars
I found out I need to implement the ' Matlab operator. It is the complex conjugate transpose operator, used as such
result = input'
I tried a straightforward solution of adding it to unary_expression as an option postfix_expression '\''
However, this failed to parse when multiple of these operators were used on a single line.
Here's a significantly simplified version of the grammar, still exhibiting the exact problem:
grammar Grammar;
unary_expression
: IDENTIFIER
| unary_expression '\''
;
translation_unit : unary_expression CR ;
STRING_LITERAL : '\'' [a-z]* '\'' ;
IDENTIFIER : [a-zA-Z] ;
CR : [\r\n] + ;
Test cases, being parsed as translation_unit:
"x''\n" //fails getNumberOfSyntaxErrors returns 1
"x'\n" //passes
The failure also prints the message line 1:1 extraneous input '''' expecting CR to stderr.
The failure goes away if I either remove STRING_LITERAL, or change the * to +. Neither is a proper solution of course, as removing it is entirely off the table, and mandating non-empty strings is not quite correct, though I might be able to live with it. Also, forcing non-empty string does nothing to help the real use case, when the input is something like x' + y' instead of using the operator twice.
For some reason removing CR from the grammar and \n from the tests also makes the parsing run without problems, but yet again is not a useable solution.
What can I do to the grammar to make it work correctly? I'm assuming it's a problem with lexing specifically because removing STRING_LITERAL or making it unable to match '' makes it go away.
The lexer can never be made that context aware I think, but I don't know Matlab well enough to be sure. How could you check during tokenisation that these single quotes are operators:
x' + y';
while these are strings:
x = 'x' + ' + y';
?
Maybe you can do something similar as how in ECMAScript a / can be a division operator or a regex delimiter. In this grammar that is handled by a predicate in the lexer that uses some target code to check this.
If something like the above is not possible, I see no other way than to "promote" the creation of strings to the parser. That would mean removing STRING_LITERAL and introducing a parser rule that matches something like this:
string_literal
: QUOTE ~(QUOTE | CR)* QUOTE
;
// Needed to match characters inside strings
OTHER
: .
;
However, that will fail when a string like 'hi there' is encountered: the space in between hi and there will now be skipped by the WS rule. So WS should also be removed (spaces will then get matched by the OTHER rule). But now (of course) all spaces will litter the token stream and you'll have to account for them in all parser rules (not really a viable solution).
All in all: I don't see ANTLR as a suitable tool in this case. You might look into parser generators where there is no separation between tokenisation and parsing. Google for "PEG" and/or "scannerless parsing".

How to resolve a shift-reduce conflict in unambiguous grammar

I'm trying to parse a simple grammar using an LALR(1) parser generator (Bison, but the problem is not specific to that tool), and I'm hitting a shift-reduce conflict. The docs and other sources I've found about fixing these tend to say one or more of the following:
If the grammar is ambiguous (e.g. if-then-else ambiguity), change the language to fix the ambiguity.
If it's an operator precedence issue, specify precedence explicitly.
Accept the default resolution and tell the generator not to complain about it.
However, none of these seem to apply to my situation: the grammar is unambiguous so far as I can tell (though of course it's ambiguous with only one character of lookahead), it has only one operator, and the default resolution leads to parse errors on correctly-formed input. Are there any techniques for reworking the definition of a grammar to remove shift-reduce conflicts that don't fall into the above buckets?
For concreteness, here's the grammar in question:
%token LETTER
%%
%start input;
input: /* empty */ | input input_elt;
input_elt: rule | statement;
statement: successor ';';
rule: LETTER "->" successor ';';
successor: /* empty */ | successor LETTER;
%%
The intent is to parse semicolon-separated lines of the form "[A-Za-z]+" or "[A-Za-z] -> [A-Za-z]+".
Using the Solaris version of yacc, I get:
1: shift/reduce conflict (shift 5, red'n 7) on LETTER
state 1
$accept : input_$end
input : input_input_elt
successor : _ (7)
$end accept
LETTER shift 5
. reduce 7
input_elt goto 2
rule goto 3
statement goto 4
successor goto 6
So, the trouble is, as it very often is, the empty rule - specifically, the empty successor. It isn't completely clear whether you want to allow a semi-colon as a valid input - at the moment, it is. If you modified the successor rule to:
successor: LETTER | successor LETTER;
the shift/reduce conflict is eliminated.
Thanks for whittling down the grammar and posting it. Changing the successor rule to -
successor: /* empty */ | LETTER successor;
...worked for me. ITYM the language looked unambiguous.

Resources