Differentiate between multiplication and comment - parsing

I am writing an ANTLR grammar for SAS and running into an issue where the lexer cannot differentiate between a single line comment and a multiplication operation.
The SAS syntax for comments is regrettably:
*message;
or
/*message*/
I have written a simple test grammar to illustrate the problem:
grammar TEST;
prog: expr* EOF;
expr
: VAR #base
| expr '*' expr #mult
;
VAR: ALPHA+;
fragment ALPHA : [a-zA-Z]+ ;
COMMENT: '*' ~[\r\n];
WS: [ \t\r\n] -> skip;
I'm not sure how I can qualify the lexer to differentiate between these two situations. I am an ANTLR beginner so I may have missed something obvious.

Related

When does order of alternation matter in antlr?

In the following example, the order matters in terms of precedence:
grammar Precedence;
root: expr EOF;
expr
: expr ('+'|'-') expr
| expr ('*' | '/') expr
| Atom
;
Atom: [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
For example, on the expression 1+1*2 the above would produce the following parse tree which would evaluate to (1+1)*2=4:
Whereas if I changed the first and second alternations in the expr I would then get the following parse tree which would evaluate to 1+(1*2)=3:
What are the 'rules' then for when the ordering within an alternation actually matters? Is this only relevant if one of the 'edges' of the alternation recursively calls expr? For example, something like ~ expr or expr + expr would matter, but something like func_call '(' expr ')' or Atom would not. Or, when is it important to order things for precedence?
If ANTLR did not have the rule that gives precedence to the first alternative that could match, then either of those trees would be a valid interpretation of your input (which means the grammar is technically ambiguous).
However, when two alternatives could be used to match your input, ANTLR uses the first one to resolve the ambiguity, which in this case establishes operator precedence. So typically you would put the multiplication/division alternative before the addition/subtraction alternative, since that is the traditional order of operations:
grammar Precedence;
root: expr EOF;
expr
: expr ('*' | '/') expr
| expr ('+'|'-') expr
| Atom
;
Atom: [0-9]+;
WHITESPACE: [ \t\r\n] -> skip;
Most grammar authors will just put them in precedence order, but things like Atoms or parenthesized exprs won’t really care about the order since there’s only a single alternative that could be used.
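To make the effect of alternative order concrete, here is a minimal hand-written sketch in Python. It is not ANTLR's generated code; it only mimics the idea that an earlier alternative binds tighter. The function name evaluate and the two alternative lists are made up for this illustration.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def evaluate(text, alternatives):
    """Evaluate `text`; operator groups listed earlier bind tighter."""
    # Earlier alternative -> larger precedence number.
    precedence = {}
    for index, group in enumerate(alternatives):
        for op in group:
            precedence[op] = len(alternatives) - index

    tokens = re.findall(r"[0-9]+|[-+*/]", text)
    pos = 0

    def parse_expr(min_prec):
        nonlocal pos
        left = int(tokens[pos])  # an Atom
        pos += 1
        # Keep consuming operators that bind at least as tightly as min_prec.
        while pos < len(tokens) and precedence[tokens[pos]] >= min_prec:
            op = tokens[pos]
            pos += 1
            # +1 makes every operator left-associative.
            right = parse_expr(precedence[op] + 1)
            left = OPS[op](left, right)
        return left

    return parse_expr(1)

PLUS_FIRST = [("+", "-"), ("*", "/")]   # order used in the question
TIMES_FIRST = [("*", "/"), ("+", "-")]  # conventional order

print(evaluate("1+1*2", PLUS_FIRST))   # 4, i.e. (1+1)*2
print(evaluate("1+1*2", TIMES_FIRST))  # 3, i.e. 1+(1*2)
```

Alternatives such as Atom never enter the precedence table at all, which is why their position in the list makes no difference.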

Understanding what makes a rule left-recursive in antlr

I've been using trial and error to figure out, at an intuitive level, when a rule in antlr is left-recursive or not. For example, this (Removing left recursion) is left-recursive in theory, but works in Antlr:
// Example input: x+y+1
grammar DBParser;
expression
: expression '+' term
| term;
term
: term '*' atom
| atom;
atom
: NUMBER
| IDENTIFIER
;
NUMBER : [0-9]+ ;
IDENTIFIER : [a-zA-Z]+ ;
So what makes a rule left-recursive and a problem in antlr4, and what would be the simplest example showing that (in an actual program)? I'm trying to practice removing left-recursive productions, but I can't even figure out how to intentionally add a left-recursive rule that antlr4 can't resolve!
Antlr4 can handle direct left-recursion as long as it is not hidden. It cannot handle indirect left-recursion.
Hidden means that the recursion is not at the beginning of the right-hand side but it might still be at the beginning of the expansion because all previous symbols are nullable.
Indirect means that the first symbol on the right-hand side of a production for N eventually derives a sequence starting with N. Antlr describes that as "mutual recursion".
Here are some SO questions I found by searching for [antlr] left recursive:
ANTLR4 - Mutually left-recursive grammar
ANTLR4 mutually left-recursive error when parsing
ANTLR4 left-recursive error
ANTLR Grammar Mutually Left Recursive
Mutually left-recursive lexer rules on ANTL4?
Mutually left recursive with simple calculator
There are lots more, including one you asked. I'm sure you can mine some examples.
As mentioned by rici above, one of the items that Antlr4 does not support is indirect left-recursion. Here would be an example:
grammar DBParser;
expr: binop | Atom;
binop: expr '+' expr;
Atom: [a-z]+ | [0-9]+ ;
error(119): DBParser.g4::: The following sets of rules are mutually left-recursive [expr, binop]
Notice that Antlr4 can support the following direct left-recursion though:
grammar DBParser;
expr: expr '+' expr | Atom;
Atom: [a-z]+ | [0-9]+ ;
However, if you wrap the left-recursive alternative (or the whole alternation) in parentheses, it no longer works:
grammar DBParser;
expr: (expr '+' expr) | Atom;
Atom: [a-z]+ | [0-9]+ ;
Or:
grammar DBParser;
expr: (expr '+' expr | Atom);
Atom: [a-z]+ | [0-9]+ ;
Both will raise:
error(119): DBParser.g4::: The following sets of rules are mutually left-recursive [expr]
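The difference between direct and indirect left recursion can be sketched with a small Python check. This is only an illustration of the definitions above, not ANTLR's actual analysis: the grammar is a hypothetical dict mapping rule names to lists of alternatives, nullable symbols (the "hidden" case) are ignored, and the parenthesized variants are not modeled.

```python
def first_rules(grammar, rule):
    """Parser rules that appear as the first symbol of an alternative."""
    return {alt[0] for alt in grammar[rule] if alt and alt[0] in grammar}

def left_recursion_kind(grammar, rule):
    """Return 'indirect', 'direct', or 'none' for the given rule."""
    direct = rule in first_rules(grammar, rule)
    # Look for a cycle back to `rule` that passes through another rule.
    stack = [r for r in first_rules(grammar, rule) if r != rule]
    seen = set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for nxt in first_rules(grammar, current):
            if nxt == rule:
                return "indirect"
            stack.append(nxt)
    return "direct" if direct else "none"

# expr: expr '+' expr | Atom;   (accepted by ANTLR4)
supported = {"expr": [["expr", "'+'", "expr"], ["Atom"]]}

# expr: binop | Atom;  binop: expr '+' expr;   (rejected with error 119)
rejected = {
    "expr": [["binop"], ["Atom"]],
    "binop": [["expr", "'+'", "expr"]],
}

print(left_recursion_kind(supported, "expr"))  # direct
print(left_recursion_kind(rejected, "expr"))   # indirect
print(left_recursion_kind(rejected, "binop"))  # indirect
```

Token names such as Atom are not keys of the dict, so they never count as a recursive step.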

Matching parentheses in ANTLR

I'm new to Antlr and recently ran into an issue with parentheses matching. Each node in the parse tree has the form (Node1,W1,W2,Node2), where Node1 and Node2 are two nodes and W1 and W2 are two weights between them. Given an input file such as (1,C,10,2).((2,P,2,3).(3,S,3,2))*.(2,T,2,4), the parse tree looks wrong: the operator is not the parent of those nodes and the parentheses are not matched.
The parse file I wrote is like this:
grammar Semi;
prog
: expr+
;
expr
: expr '*'
| expr ('.'|'+') expr
| tuple
| '(' expr ')'
;
tuple
: LP NODE W1 W2 NODE RP
;
LP : '(' ;
RP : ')' ;
W1 : [PCST0];
W2 : [0-9]+;
NODE: [0-9]+;
WS : [ \t\r\n]+ -> skip ; // toss out whitespace
COMMA: ',' -> skip;
It seems like expr | '(' expr ')' doesn't work correctly. So what should I do to make this parser detect whether the parentheses belong to a node or not?
Update:
There are two errors in the command:
line 1:1 no viable alternative at input '(1'
line 1:13 no viable alternative at input '(2'
So it seems like the lexer didn't detect the tuples, but why is that?
Your W2 and NODE rules are the same, so nodes you intend to be NODE are matching W2.
grun with -tokens option: (notice, no NODE tokens)
[#0,0:0='(',<'('>,1:0]
[#1,1:1='1',<W2>,1:1]
[#2,3:3='C',<W1>,1:3]
[#3,5:6='10',<W2>,1:5]
[#4,8:8='2',<W2>,1:8]
[#5,9:9=')',<')'>,1:9]
[#6,10:10='.',<'.'>,1:10]
[#7,11:11='(',<'('>,1:11]
[#8,12:12='(',<'('>,1:12]
[#9,13:13='2',<W2>,1:13]
[#10,15:15='P',<W1>,1:15]
[#11,17:17='2',<W2>,1:17]
[#12,19:19='3',<W2>,1:19]
[#13,20:20=')',<')'>,1:20]
[#14,21:21='.',<'.'>,1:21]
[#15,22:22='(',<'('>,1:22]
[#16,23:23='3',<W2>,1:23]
[#17,25:25='S',<W1>,1:25]
[#18,27:27='3',<W2>,1:27]
[#19,29:29='2',<W2>,1:29]
[#20,30:30=')',<')'>,1:30]
[#21,31:31=')',<')'>,1:31]
[#22,32:32='*',<'*'>,1:32]
[#23,33:33='.',<'.'>,1:33]
[#24,34:34='(',<'('>,1:34]
[#25,35:35='2',<W2>,1:35]
[#26,37:37='T',<W1>,1:37]
[#27,39:39='2',<W2>,1:39]
[#28,41:41='4',<W2>,1:41]
[#29,42:42=')',<')'>,1:42]
[#30,43:42='<EOF>',<EOF>,1:43]
If I replace the NODEs in your parse rule with W2s (sorry, I have no idea what this is supposed to represent), I get:
It appears that your misconception is that the recursive descent parsing starts with the parser rule and when it encounters a Lexer rule, attempts to match it.
This is not how ANTLR works. With ANTLR, your input is first run through the Lexer (aka Tokenizer) to produce a stream of tokens. This step knows absolutely nothing about your parser rules. That's why it's so often useful to use grun to dump the stream of tokens: it gives you a picture of what your parser rules are acting upon. In your example you can see that there are no NODE tokens, because they all matched W2.
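To see why NODE can never be produced, here is a rough Python imitation of how the lexer picks a rule: the longest match wins, and on a tie the rule defined first wins. This is a simplified sketch, not ANTLR's implementation; the rule table copies the lexer rules from the question, and the tokenize helper is made up for the illustration.

```python
import re

# Lexer rules in the order they appear in the question's grammar.
RULES = [
    ("LP", re.compile(r"\(")),
    ("RP", re.compile(r"\)")),
    ("W1", re.compile(r"[PCST0]")),
    ("W2", re.compile(r"[0-9]+")),
    ("NODE", re.compile(r"[0-9]+")),  # same pattern as W2, defined later
    ("WS", re.compile(r"[ \t\r\n]+")),
    ("COMMA", re.compile(r",")),
]
SKIPPED = {"WS", "COMMA"}  # rules marked with -> skip

def tokenize(text):
    tokens = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Strictly longer only: on a tie the earlier rule is kept.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name not in SKIPPED:
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens

for name, value in tokenize("(1,C,10,2)"):
    print(name, value)
```

Every digit sequence ties between W2 and NODE, and W2 is defined first, so the parser never receives a NODE token no matter what the parser rules ask for.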
Also, a suggestion... It would appear that commas are an essential part of correct input (unless (1C102).((2P23).(3S32))*.(2T24) is considered valid input). On that assumption, I removed the -> skip and added them to your parser rule (that's why you see them in the parse tree). The resulting grammar I used was:
grammar Semi;
prog: expr+;
expr: expr '*' | expr ('.' | '+') expr | tuple | LP expr RP;
tuple: LP W2 COMMA W1 COMMA W2 COMMA W2 RP;
LP: '(';
RP: ')';
W1: [PCST0];
W2: [0-9]+;
NODE: [0-9]+;
WS: [ \t\r\n]+ -> skip; // toss out whitespace
COMMA: ',';
To take a bit more liberty with your grammar, I'd suggest that your Lexer rules should be focused on raw types, and that you use labels to make the various elements of your tuple more easily accessible in your code. Here's an example:
grammar Semi;
prog: expr+;
expr: expr '*' | expr ('.' | '+') expr | tuple | LP expr RP;
tuple: LP nodef=INT COMMA w1=PCST0 COMMA w2=INT COMMA nodet=INT RP;
LP: '(';
RP: ')';
PCST0: [PCST0];
INT: [0-9]+;
COMMA: ',';
WS: [ \t\r\n]+ -> skip; // toss out whitespace
With this change, your tuple Context class will have fields for nodef, w1, w2, and nodet, each holding the single token matched at that position. The unlabeled INT accessor returns a list of all three INT tokens as I've defined it here.

antlr4 not parsing according to grammar

I'm trying to parse 'for loop' according to this (partial) grammar:
grammar GaleugParserNew;
/*
* PARSER RULES
*/
relational
: '>'
| '<'
;
varChange
: '++'
| '--'
;
values
: ID
| DIGIT
;
for_stat
: FOR '(' ID '=' values ';' values relational values ';' ID varChange ')' '{' '}'
;
/*
* LEXER RULES
*/
FOR : 'for' ;
ID : [a-zA-Z_] [a-zA-Z_0-9]* ;
DIGIT : [0-9]+ ;
SPACE : [ \t\r\n] -> skip ;
When I try to generate the gui of how it's parsed, it's not following the grammar I provided above. This is what it produces:
I've encountered this problem before, what I did then was simply exit cmd, open it again and compile everything and somehow that worked then. It's not working now though.
I'm not really very knowledgeable about antlr4 so I'm not sure where to look to solve this problem.
Must be a problem of the IDE you are using. The grammar is fine and produces this parse tree in Visual Studio Code:
I guess the IDE is using the wrong parser or lexer (maybe from a different work file?). Print the lexer tokens to see if they are what you expect. Hint: avoid implicit lexer tokens (like '(', '}' etc.) and define them as explicit lexer rules instead, so that you can give the tokens meaningful names.

Ignoring whitespace (in certain parts) in Antlr4

I am not so familiar with antlr. I am using version 4 and I have a grammar where whitespace is not important in some parts (but it might be in others, or rather its luck).
So say we have the following grammar
grammar Foo;
program : A* ;
A : ID '#' ID '(' IDList ')' ';' ;
ID : [a-zA-Z]+ ;
IDList : ID (',' IDList)* ;
WS : [ \t\r\n]+ -> skip ;
and a test input
foo#bar(X,Y);
foo#baz ( z,Z) ;
The first line is parsed correctly whereas the second one is not.
I don't want to pollute my rules with the places where whitespace is not relevant, since my actual grammar is more complicated than the toy example. In case it's not clear, the part ID'#'ID should not contain whitespace. Whitespace in any other position shouldn't matter at all.
Even though you are skipping WS, lexer rules are still sensitive to the existence of the whitespace characters. Skip simply means that no token is generated for consumption by the parser. Thus, the lexer Addr rule explicitly does not permit any interior whitespace characters.
Conversely, the a and idList parser rules never see interior whitespace tokens so those rules are insensitive to the occurrence of whitespace characters occurring between the generated tokens.
grammar Foo;
program : a* EOF ; // EOF will require parsing the entire input
a : Addr LParen idList RParen Semi ;
idList : ID (Comma ID)* ; // simpler equivalent construct
Addr : ID '#' ID ; // lexer rule, so no whitespace is allowed around '#'
ID : [a-zA-Z]+ ;
LParen : '(' ;
RParen : ')' ;
Comma : ',' ;
Semi : ';' ;
WS : [ \t\r\n]+ -> skip ;
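The split between lexer and parser rules can be imitated in a few lines of Python. This is a simplified sketch and not ANTLR's implementation: the rule table mirrors the grammar above, the tokenize helper is made up, and ties between rules are not exercised here. It shows that whitespace between tokens disappears, while whitespace inside what should be a single Addr token stops the lexer from recognising it.

```python
import re

RULES = [
    ("Addr", re.compile(r"[a-zA-Z]+#[a-zA-Z]+")),  # ID '#' ID as one token
    ("ID", re.compile(r"[a-zA-Z]+")),
    ("LParen", re.compile(r"\(")),
    ("RParen", re.compile(r"\)")),
    ("Comma", re.compile(r",")),
    ("Semi", re.compile(r";")),
    ("WS", re.compile(r"[ \t\r\n]+")),
]
SKIPPED = {"WS"}  # -> skip: matched, but never handed to the parser

def tokenize(text):
    names = []
    pos = 0
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            match = pattern.match(text, pos)
            # Longest match wins; the earlier rule is kept on a tie.
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no rule matches {text[pos]!r} at position {pos}")
        if best_name not in SKIPPED:
            names.append(best_name)
        pos += best_len
    return names

print(tokenize("foo#bar(X,Y);"))
print(tokenize("foo#baz ( z,Z) ;"))  # same token types as the line above

try:
    tokenize("foo # baz(X);")
except ValueError as error:
    print("rejected:", error)  # a lone '#' matches no lexer rule
```

Both test lines from the question produce the same sequence of token types, so the parser rule a accepts them equally.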
Define ID '#' ID as lexer token rather than as parser token.
a : AID '(' idList ')' ';' ; // parser rule (lowercase), so skipped whitespace between tokens is ignored
AID : [a-zA-Z]+ '#' [a-zA-Z]+;
Other options
enable/disable whitespaces in your token stream, e.g. here
enable/disable whitespaces with lexer modes (may be a problem because lexer modes are triggered on context, which is not easy to determine in your case)

Resources