How to exclude " and \ in ANTLR 4 string matching? - antlrworks

I have the following string that I want to match against the rule, stringLiteral:
"D:\\Downloads\\Java\\MyFile"
And my grammar is the file: String.g4, as follows:
grammar String;
fragment
HexDigit : ('0'..'9'|'a'..'f'|'A'..'F') ;
stringLiteral
: '"' ( EscapeSequence | XXXXX )* '"'
;
fragment
EscapeSequence
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UnicodeEscape
| OctalEscape
;
fragment
OctalEscape
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UnicodeEscape
: '\\' 'u' HexDigit HexDigit HexDigit HexDigit
;
What should I put in the XXXXX location in order to match any character that is not \ or "?
I tried the following, and it all doesn't work:
~['\\'"']
~['\\'\"']
~["\]
~[\"\\]
~('\"'|'\\')
~[\\\"]
I am using ANTLRWorks 2 to try this out. Errors are the following:
D:\Downloads\ANTLR\String.g4 line 26:5 mismatched character '<EOF>' expecting '"'
error(50): D:\Downloads\ANTLR\String.g4:26:5: syntax error: '<EOF>' came as a complete surprise to me while looking for rule element

Inside a character class, you only need to escape the backslash:
The following is illegal, it escapes the ]:
[\]
The following matches a backslash:
[\\]
The following matches a quote:
["]
And the following matches either a backslash or quote:
[\\"]
In v4 style, your grammar could look like this:
grammar String;
/* other rules */
StringLiteral
: '"' ( EscapeSequence | ~[\\"] )* '"'
;
fragment
HexDigit
: [0-9a-fA-F]
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]
| UnicodeEscape
| OctalEscape
;
fragment
OctalEscape
: '\\' [0-3] [0-7] [0-7]
| '\\' [0-7] [0-7]
| '\\' [0-7]
;
fragment
UnicodeEscape
: '\\' 'u' HexDigit HexDigit HexDigit HexDigit
;
Note that you can't use fragments inside parser rules: StringLiteral must be a lexer rule!

Related

Match any printable letter-like characters in ANTLR4 with Go as target

This is freaking me out, I just can't find a solution to it. I have a grammar for search queries and would like to match any searchterm in a query composed out of printable letters except for special characters "(", ")". Strings enclosed in quotes are handled separately and work.
Here is a somewhat working grammar:
/* ANTLR Grammar for Minidb Query Language */
grammar Mdb;
start
: searchclause EOF
;
searchclause
: table expr
;
expr
: fieldsearch
| searchop fieldsearch
| unop expr
| expr relop expr
| lparen expr relop expr rparen
;
lparen
: '('
;
rparen
: ')'
;
unop
: NOT
;
relop
: AND
| OR
;
searchop
: NO
| EVERY
;
fieldsearch
: field EQ searchterm
;
field
: ID
;
table
: ID
;
searchterm
:
| STRING
| ID+
| DIGIT+
| DIGIT+ ID+
;
STRING
: '"' ~('\n'|'"')* ('"' )
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '='
;
fragment VALID_ID_START
: ('a' .. 'z') | ('A' .. 'Z') | '_'
;
fragment VALID_ID_CHAR
: VALID_ID_START | ('0' .. '9')
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: ('0' .. '9')
;
/*
NOT_SPECIAL
: ~(' ' | '\t' | '\n' | '\r' | '\'' | '"' | ';' | '.' | '=' | '(' | ')' )
; */
WS
: [ \r\n\t] + -> skip
;
The problem is that searchterm is too restricted. It should match any character that is in the commented out NOT_SPECIAL, i.e., valid queries would be:
Person Name=%
Person Address=^%Street%%%$^&*#^
But whenever I try to put NOT_SPECIAL in any way into the definition of searchterm it doesn't work. I have tried putting it literally into the rule, too (commenting out NOT_SPECIAL) and many others things, but it just doesn't work. In most of my attempts the grammar just complained about extraneous input after "=" and said it was expecting EOF. But I also cannot put EOF into NOT_SPECIAL.
Is there any way I can simply parse every text after "=" in rule fieldsearch until there is a whitespace or ")", "("?
N.B. The STRING rule works fine, but the user ought not be required to use quotes every time, because this is a command line tool and they'd need to be escaped.
Target language is Go.
You could solve that by introducing a lexical mode that you'll enter whenever you match an EQ token. Once in that lexical mode, you either match a (, ) or a whitespace (in which case you pop out of the lexical mode), or you keep matching your NOT_SPECIAL chars.
By using lexical modes, you must define your lexer- and parser rules in their own files. Be sure to use lexer grammar ... and parser grammar ... instead of the grammar ... you use in a combined .g4 file.
A quick demo:
lexer grammar MdbLexer;
STRING
: '"' ~[\r\n"]* '"'
;
OPAR
: '('
;
CPAR
: ')'
;
AND
: 'and'
;
OR
: 'or'
;
NOT
: 'not'
;
NO
: 'no'
;
EVERY
: 'every'
;
EQ
: '=' -> pushMode(NOT_SPECIAL_MODE)
;
ID
: VALID_ID_START VALID_ID_CHAR*
;
DIGIT
: [0-9]
;
WS
: [ \r\n\t]+ -> skip
;
fragment VALID_ID_START
: [a-zA-Z_]
;
fragment VALID_ID_CHAR
: [a-zA-Z_0-9]
;
mode NOT_SPECIAL_MODE;
OPAR2
: '(' -> type(OPAR), popMode
;
CPAR2
: ')' -> type(CPAR), popMode
;
WS2
: [ \t\r\n] -> skip, popMode
;
NOT_SPECIAL
: ~[ \t\r\n()]+
;
Your parser grammar would start like this:
parser grammar MdbParser;
options {
tokenVocab=MdbLexer;
}
start
: searchclause EOF
;
// your other parser rules
My Go is a bit rusty, but a small Java test:
String source = "Person Address=^%Street%%%$^&*#^()";
MdbLexer lexer = new MdbLexer(CharStreams.fromString(source));
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (Token t : tokens.getTokens()) {
System.out.printf("%-15s %s\n", MdbLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
print the following:
ID Person
ID Address
EQ =
NOT_SPECIAL ^%Street%%%$^&*#^
OPAR (
CPAR )
EOF <EOF>

How to debug ANTLR4 grammar extraneous / mismatched input error

I want to parse the Rulebook "demo.rb" files like below:
rulebook Titanic-Normalization {
version 1
meta {
description "Test"
source "my-rules.xslx"
user "joltie"
}
rule remove-first-line {
description "Removes first line when offset is zero"
when(present(offset) && offset == 0) then {
filter-row-if-true true;
}
}
}
I wrote the ANTLR4 grammar file Rulebook.g4 like below. For now, it can parse the *.rb files generally well, but it throw unexpected error when encounter the "expression" / "statement" rules.
grammar Rulebook;
rulebookStatement
: KWRulebook
(GeneralIdentifier | Identifier)
'{'
KWVersion
VersionConstant
metaStatement
(ruleStatement)+
'}'
;
metaStatement
: KWMeta
'{'
KWDescription
StringLiteral
KWSource
StringLiteral
KWUser
StringLiteral
'}'
;
ruleStatement
: KWRule
(GeneralIdentifier | Identifier)
'{'
KWDescription
StringLiteral
whenThenStatement
'}'
;
whenThenStatement
: KWWhen '(' expression ')'
KWThen '{' statement '}'
;
primaryExpression
: GeneralIdentifier
| Identifier
| StringLiteral+
| '(' expression ')'
;
postfixExpression
: primaryExpression
| postfixExpression '[' expression ']'
| postfixExpression '(' argumentExpressionList? ')'
| postfixExpression '.' Identifier
| postfixExpression '->' Identifier
| postfixExpression '++'
| postfixExpression '--'
;
argumentExpressionList
: assignmentExpression
| argumentExpressionList ',' assignmentExpression
;
unaryExpression
: postfixExpression
| '++' unaryExpression
| '--' unaryExpression
| unaryOperator castExpression
;
unaryOperator
: '&' | '*' | '+' | '-' | '~' | '!'
;
castExpression
: unaryExpression
| DigitSequence // for
;
multiplicativeExpression
: castExpression
| multiplicativeExpression '*' castExpression
| multiplicativeExpression '/' castExpression
| multiplicativeExpression '%' castExpression
;
additiveExpression
: multiplicativeExpression
| additiveExpression '+' multiplicativeExpression
| additiveExpression '-' multiplicativeExpression
;
shiftExpression
: additiveExpression
| shiftExpression '<<' additiveExpression
| shiftExpression '>>' additiveExpression
;
relationalExpression
: shiftExpression
| relationalExpression '<' shiftExpression
| relationalExpression '>' shiftExpression
| relationalExpression '<=' shiftExpression
| relationalExpression '>=' shiftExpression
;
equalityExpression
: relationalExpression
| equalityExpression '==' relationalExpression
| equalityExpression '!=' relationalExpression
;
andExpression
: equalityExpression
| andExpression '&' equalityExpression
;
exclusiveOrExpression
: andExpression
| exclusiveOrExpression '^' andExpression
;
inclusiveOrExpression
: exclusiveOrExpression
| inclusiveOrExpression '|' exclusiveOrExpression
;
logicalAndExpression
: inclusiveOrExpression
| logicalAndExpression '&&' inclusiveOrExpression
;
logicalOrExpression
: logicalAndExpression
| logicalOrExpression '||' logicalAndExpression
;
conditionalExpression
: logicalOrExpression ('?' expression ':' conditionalExpression)?
;
assignmentExpression
: conditionalExpression
| unaryExpression assignmentOperator assignmentExpression
| DigitSequence // for
;
assignmentOperator
: '=' | '*=' | '/=' | '%=' | '+=' | '-=' | '<<=' | '>>=' | '&=' | '^=' | '|='
;
expression
: assignmentExpression
| expression ',' assignmentExpression
;
statement
: expressionStatement
;
expressionStatement
: expression+ ';'
;
KWRulebook: 'rulebook';
KWVersion: 'version';
KWMeta: 'meta';
KWDescription: 'description';
KWSource: 'source';
KWUser: 'user';
KWRule: 'rule';
KWWhen: 'when';
KWThen: 'then';
KWTrue: 'true';
KWFalse: 'false';
fragment
LeftParen : '(';
fragment
RightParen : ')';
fragment
LeftBracket : '[';
fragment
RightBracket : ']';
fragment
LeftBrace : '{';
fragment
RightBrace : '}';
Identifier
: IdentifierNondigit
( IdentifierNondigit
| Digit
)*
;
GeneralIdentifier
: Identifier
('-' Identifier)+
;
fragment
IdentifierNondigit
: Nondigit
//| // other implementation-defined characters...
;
VersionConstant
: DigitSequence ('.' DigitSequence)*
;
DigitSequence
: Digit+
;
fragment
Nondigit
: [a-zA-Z_]
;
fragment
Digit
: [0-9]
;
StringLiteral
: '"' SCharSequence? '"'
| '\'' SCharSequence? '\''
;
fragment
SCharSequence
: SChar+
;
fragment
SChar
: ~["\\\r\n]
| '\\\n' // Added line
| '\\\r\n' // Added line
;
Whitespace
: [ \t]+
-> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
BlockComment
: '/*' .*? '*/'
-> skip
;
LineComment
: '//' ~[\r\n]*
-> skip
;
I tested the Rulebook parser with unit test like below:
public void testScanRulebookFile() throws IOException {
String fileName = "C:\\rulebooks\\demo.rb";
FileInputStream fis = new FileInputStream(fileName);
// create a CharStream that reads from standard input
CharStream input = CharStreams.fromStream(fis);
// create a lexer that feeds off of input CharStream
RulebookLexer lexer = new RulebookLexer(input);
// create a buffer of tokens pulled from the lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);
// create a parser that feeds off the tokens buffer
RulebookParser parser = new RulebookParser(tokens);
RulebookStatementContext context = parser.rulebookStatement();
// WhenThenStatementContext context = parser.whenThenStatement();
System.out.println(context.toStringTree(parser));
// ParseTree tree = parser.getContext(); // begin parsing at init rule
// System.out.println(tree.toStringTree(parser)); // print LISP-style tree
}
For the "demo.rb" as above, the parser got the error as below. I also print the RulebookStatementContext as toStringTree.
line 12:25 mismatched input '&&' expecting ')'
(rulebookStatement rulebook Titanic-Normalization { version 1 (metaStatement meta { description "Test" source "my-rules.xslx" user "joltie" }) (ruleStatement rule remove-first-line { description "Removes first line when offset is zero" (whenThenStatement when ( (expression (assignmentExpression (conditionalExpression (logicalOrExpression (logicalAndExpression (inclusiveOrExpression (exclusiveOrExpression (andExpression (equalityExpression (relationalExpression (shiftExpression (additiveExpression (multiplicativeExpression (castExpression (unaryExpression (postfixExpression (postfixExpression (primaryExpression present)) ( (argumentExpressionList (assignmentExpression (conditionalExpression (logicalOrExpression (logicalAndExpression (inclusiveOrExpression (exclusiveOrExpression (andExpression (equalityExpression (relationalExpression (shiftExpression (additiveExpression (multiplicativeExpression (castExpression (unaryExpression (postfixExpression (primaryExpression offset))))))))))))))))) ))))))))))))))))) && offset == 0 ) then { filter-row-if-true true ;) }) })
I also write the unit test to test short input context like "when (offset == 0) then {\n" + "filter-row-if-true true;\n" + "}\n" to debug the problem. But it still got the error like:
line 1:16 mismatched input '0' expecting {'(', '++', '--', '&&', '&', '*', '+', '-', '~', '!', Identifier, GeneralIdentifier, DigitSequence, StringLiteral}
line 2:19 extraneous input 'true' expecting {'(', '++', '--', '&&', '&', '*', '+', '-', '~', '!', ';', Identifier, GeneralIdentifier, DigitSequence, StringLiteral}
With two day's tries, I didn't got any progress. The question is so long as above, please someone give me some advises about how to debug ANTLR4 grammar extraneous / mismatched input error.
I don't know if there are any more sophisticated methods to debug a grammar/parser but here's how I usally do it:
Reduce the input that causes the problem to as few characters as
possible.
Reduce your grammar as far as possible so that it still produces the same error on the respective input (most of the time that means wrinting a minimal grammar for the reduced input by recycling the rules of the original grammar (simplifying as far as possible)
Make sure the lexer segments the input properly (for that the feature in ANTLRWorks that shows you the lexer output is great)
Have a look at the ParseTree. ANTLR's testRig has a feature that displays the ParseTree graphically (You can access this functionality either through ANTLRWorks or by ANTLR's TreeViewer) so you can have a look where the parser's interpretation differs from the one you have
Do the parsing "by hand". That means you will take your grammar and go through the input by yourself, step by step and try to apply no logic or assumptions/knowledge/etc. during that process. Just follow through your own grammar as a computer would do it. Question every step you take (Is there another way to match the input) and always try to match the input in another way than the one you actually want it to be parsed
Try to fix the error in the minimal grammar and migrate the solution to your real grammar afterwards.
In addition to Raven answer I have used the Intellij 12+ Plugin for ANTLR 4 and it saved me a lot of effort debugging a grammar. I had a very simple bug (unescaped dot . instead of '.' in a floating number rule) that I could not find. This tool allows to select any parser rule of the grammar, test it with an input and shows graphically the parsing tree. I didn't notice it has this very useful feature until I started searching for ways to debug the grammar. Highly recommended.
Update the g4 file to fix parsing error
grammar Rulebook;
#header {
package com.someone.commons.rulebook.parser;
}
rulebookStatement
: KWRulebook
(GeneralIdentifier | Identifier)
'{'
KWVersion
VersionConstant
metaStatement
(ruleStatement)+
'}'
;
metaStatement
: KWMeta
'{'
KWDescription
StringLiteral
KWSource
StringLiteral
KWUser
StringLiteral
'}'
;
ruleStatement
: KWRule
(GeneralIdentifier | Identifier)
'{'
KWDescription
StringLiteral
whenThenStatement
'}'
;
whenThenStatement
: KWWhen '(' expression ')'
KWThen '{' (statement)* '}'
;
primaryExpression
: GeneralIdentifier
| Identifier
| StringLiteral+
| Constant
| '(' expression ')'
| '[' expression ']'
;
postfixExpression
: primaryExpression
| postfixExpression '[' expression ']'
| postfixExpression '(' argumentExpressionList? ')'
| postfixExpression '.' Identifier
| postfixExpression '->' Identifier
| postfixExpression '++'
| postfixExpression '--'
;
argumentExpressionList
: assignmentExpression
| argumentExpressionList ',' assignmentExpression
;
unaryExpression
: postfixExpression
| '++' unaryExpression
| '--' unaryExpression
| unaryOperator castExpression
;
unaryOperator
: '&' | '*' | '+' | '-' | '~' | '!'
;
castExpression
: unaryExpression
;
multiplicativeExpression
: castExpression
| multiplicativeExpression '*' castExpression
| multiplicativeExpression '/' castExpression
| multiplicativeExpression '%' castExpression
;
additiveExpression
: multiplicativeExpression
| additiveExpression '+' multiplicativeExpression
| additiveExpression '-' multiplicativeExpression
;
shiftExpression
: additiveExpression
| shiftExpression '<<' additiveExpression
| shiftExpression '>>' additiveExpression
;
relationalExpression
: shiftExpression
| relationalExpression '<' shiftExpression
| relationalExpression '>' shiftExpression
| relationalExpression '<=' shiftExpression
| relationalExpression '>=' shiftExpression
;
equalityExpression
: relationalExpression
| equalityExpression '==' relationalExpression
| equalityExpression '!=' relationalExpression
;
andExpression
: equalityExpression
| andExpression '&' equalityExpression
;
exclusiveOrExpression
: andExpression
| exclusiveOrExpression '^' andExpression
;
inclusiveOrExpression
: exclusiveOrExpression
| inclusiveOrExpression '|' exclusiveOrExpression
;
logicalAndExpression
: inclusiveOrExpression
| logicalAndExpression '&&' inclusiveOrExpression
;
logicalOrExpression
: logicalAndExpression
| logicalOrExpression '||' logicalAndExpression
;
conditionalExpression
: logicalOrExpression ('?' expression? ':' conditionalExpression)?
;
assignmentExpression
: conditionalExpression
| unaryExpression assignmentOperator assignmentExpression
;
assignmentOperator
: '=' | '*=' | '/=' | '%=' | '+=' | '-=' | '<<=' | '>>=' | '&=' | '^=' | '|='
;
expression
: assignmentExpression
| expression ',' assignmentExpression
;
statement
: expressionStatement
;
expressionStatement
: expression+ ';'
;
KWRulebook: 'rulebook';
KWVersion: 'version';
KWMeta: 'meta';
KWDescription: 'description';
KWSource: 'source';
KWUser: 'user';
KWRule: 'rule';
KWWhen: 'when';
KWThen: 'then';
Identifier
: IdentifierNondigit
( IdentifierNondigit
| Digit
)*
;
GeneralIdentifier
: Identifier
( '-'
| '.'
| IdentifierNondigit
| Digit
)*
;
fragment
IdentifierNondigit
: Nondigit
//| // other implementation-defined characters...
;
VersionConstant
: DigitSequence ('.' DigitSequence)*
;
Constant
: IntegerConstant
| FloatingConstant
;
fragment
IntegerConstant
: DecimalConstant
;
fragment
DecimalConstant
: NonzeroDigit Digit*
;
fragment
FloatingConstant
: DecimalFloatingConstant
;
fragment
DecimalFloatingConstant
: FractionalConstant
;
fragment
FractionalConstant
: DigitSequence? '.' DigitSequence
| DigitSequence '.'
;
fragment
DigitSequence
: Digit+
;
fragment
Nondigit
: [a-zA-Z_]
;
fragment
Digit
: [0-9]
;
fragment
NonzeroDigit
: [1-9]
;
StringLiteral
: '"' SCharSequence? '"'
| '\'' SCharSequence? '\''
;
fragment
SCharSequence
: SChar+
;
fragment
SChar
: ~["\\\r\n]
| '\\\n' // Added line
| '\\\r\n' // Added line
;
Whitespace
: [ \t]+
-> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
BlockComment
: '/*' .*? '*/'
-> skip
;
LineComment
: '//' ~[\r\n]*
-> skip
;

Antlr3 - Non Greedy Double Quoted String with Escaped Double Quote

The following Antlr3 Grammar file doesn't cater for escaped double quotes as part of the STRING lexer rule. Any ideas why?
Expressions working:
\"hello\"
ref(\"hello\",\"hello\")
Expressions NOT working:
\"h\"e\"l\"l\"o\"
ref(\"hello\", \"hel\"lo\")
Antlr3 grammar file runnable in AntlrWorks:
grammar Grammar;
options
{
output=AST;
ASTLabelType=CommonTree;
language=CSharp3;
}
public oaExpression
: exponentiationExpression EOF!
;
exponentiationExpression
: equalityExpression ( '^' equalityExpression )*
;
equalityExpression
: relationalExpression ( ( ('==' | '=' ) | ('!=' | '<>' ) ) relationalExpression )*
;
relationalExpression
: additiveExpression ( ( '>' | '>=' | '<' | '<=' ) additiveExpression )*
;
additiveExpression
: multiplicativeExpression ( ( '+' | '-' ) multiplicativeExpression )*
;
multiplicativeExpression
: primaryExpression ( ( '*' | '/' ) primaryExpression )*
;
primaryExpression
: '(' exponentiationExpression ')' | value | identifier (arguments )?
;
value
: STRING
;
identifier
: ID
;
expressionList
: exponentiationExpression ( ',' exponentiationExpression )*
;
arguments
: '(' ( expressionList )? ')'
;
/*
* Lexer rules
*/
ID
: LETTER (LETTER | DIGIT)*
;
STRING
: '"' ( options { greedy=false; } : ~'"' )* '"'
;
WS
: (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=Hidden;}
;
/*
* Fragment Lexer rules
*/
fragment
LETTER
: 'a'..'z'
| 'A'..'Z'
| '_'
;
fragment
EXPONENT
: ('e'|'E') ('+'|'-')? ( DIGIT )+
;
fragment
HEX_DIGIT
: ( DIGIT |'a'..'f'|'A'..'F')
;
fragment
DIGIT
: '0'..'9'
;
Try this:
STRING
: '"' // a opening quote
( // start group
'\\' ~('\r' | '\n') // an escaped char other than a line break char
| // OR
~('\\' | '"'| '\r' | '\n') // any char other than '"', '\' and line breaks
)* // end group and repeat zero or more times
'"' // the closing quote
;
When I test the 4 different test cases from your comment:
"\"hello\""
"ref(\"hello\",\"hello\")"
"\"h\"e\"l\"l\"o\""
"ref(\"hello\", \"hel\"lo\")"
with the lexer rule I suggested:
grammar T;
parse
: string+ EOF
;
string
: STRING
;
STRING
: '"' ('\\' ~('\r' | '\n') | ~('\\' | '"'| '\r' | '\n'))* '"'
;
SPACE
: (' ' | '\t' | '\r' | '\n')+ {skip();}
;
ANTLRWorks' debugger produces the following parse tree:
In other words: it works just fine (on my machine :)).
EDIT II
And I've also used your grammar (making some small changes to make it Java compatible) where I replaced the incorrect STRING rule into the one I suggested:
oaExpression
: STRING+ EOF!
//: exponentiationExpression EOF!
;
exponentiationExpression
: equalityExpression ( '^' equalityExpression )*
;
equalityExpression
: relationalExpression ( ( ('==' | '=' ) | ('!=' | '<>' ) ) relationalExpression )*
;
relationalExpression
: additiveExpression ( ( '>' | '>=' | '<' | '<=' ) additiveExpression )*
;
additiveExpression
: multiplicativeExpression ( ( '+' | '-' ) multiplicativeExpression )*
;
multiplicativeExpression
: primaryExpression ( ( '*' | '/' ) primaryExpression )*
;
primaryExpression
: '(' exponentiationExpression ')' | value | identifier (arguments )?
;
value
: STRING
;
identifier
: ID
;
expressionList
: exponentiationExpression ( ',' exponentiationExpression )*
;
arguments
: '(' ( expressionList )? ')'
;
/*
* Lexer rules
*/
ID
: LETTER (LETTER | DIGIT)*
;
//STRING
// : '"' ( options { greedy=false; } : ~'"' )* '"'
// ;
STRING
: '"' ('\\' ~('\r' | '\n') | ~('\\' | '"'| '\r' | '\n'))* '"'
;
WS
: (' '|'\r'|'\t'|'\u000C'|'\n') {$channel=HIDDEN;} /*{$channel=Hidden;}*/
;
/*
* Fragment Lexer rules
*/
fragment
LETTER
: 'a'..'z'
| 'A'..'Z'
| '_'
;
fragment
EXPONENT
: ('e'|'E') ('+'|'-')? ( DIGIT )+
;
fragment
HEX_DIGIT
: ( DIGIT |'a'..'f'|'A'..'F')
;
fragment
DIGIT
: '0'..'9'
;
which parses the input from my previous example in an identical parse tree.
This is how I do this with strings that can contain escape sequences (not just \" but any):
DOUBLE_QUOTED_TEXT
#init { int escape_count = 0; }:
DOUBLE_QUOTE
(
DOUBLE_QUOTE DOUBLE_QUOTE { escape_count++; }
| ESCAPE_OPERATOR . { escape_count++; }
| ~(DOUBLE_QUOTE | ESCAPE_OPERATOR)
)*
DOUBLE_QUOTE
{ EMIT(); LTOKEN->user1 = escape_count; }
;
The rule additionally counts the escapes and stores them in the token. This allows the receiver to quickly see if it needs to do anything with the string (if user1 > 0). If you don't need that remove the #init part and the actions.

ANTLR ambiguity '-'

I have a grammar and everything works fine until this portion:
lexp
: factor ( ('+' | '-') factor)*
;
factor :('-')? IDENT;
This of course introduces an ambiguity. For example a-a can be matched by either Factor - Factor or Factor -> - IDENT
I get the following warning stating this:
[18:49:39] warning(200): withoutWarningButIncomplete.g:57:31:
Decision can match input such as "'-' {IDENT, '-'}" using multiple alternatives: 1, 2
How can I resolve this ambiguity? I just don't see a way around it. Is there some kind of option that I can use?
Here is the full grammar:
program
: includes decls (procedure)*
;
/* Check if correct! */
includes
: ('#include' STRING)*
;
decls
: (typedident ';')*
;
typedident
: ('int' | 'char') IDENT
;
procedure
: ('int' | 'char') IDENT '(' args ')' body
;
args
: typedident (',' typedident )* /* Check if correct! */
| /* epsilon */
;
body
: '{' decls stmtlist '}'
;
stmtlist
: (stmt)*;
stmt
: '{' stmtlist '}'
| 'read' '(' IDENT ')' ';'
| 'output' '(' IDENT ')' ';'
| 'print' '(' STRING ')' ';'
| 'return' (lexp)* ';'
| 'readc' '(' IDENT ')' ';'
| 'outputc' '(' IDENT ')' ';'
| IDENT '(' (IDENT ( ',' IDENT )*)? ')' ';'
| IDENT '=' lexp ';';
lexp
: term (( '+' | '-' ) term) * /*Add in | '-' to reveal the warning! !*/
;
term
: factor (('*' | '/' | '%') factor )*
;
factor : '(' lexp ')'
| ('-')? IDENT
| NUMBER;
fragment DIGIT
: ('0' .. '9')
;
IDENT : ('A' .. 'Z' | 'a' .. 'z') (( 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_'))* ;
NUMBER
: ( ('-')? DIGIT+)
;
CHARACTER
: '\'' ('a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '\\n' | '\\t' | '\\\\' | '\\' | 'EOF' |'.' | ',' |':' ) '\'' /* IS THIS COMPLETE? */
;
As mentioned in the comments: these rules are not ambiguous:
lexp
: factor (('+' | '-') factor)*
;
factor : ('-')? IDENT;
This is the cause of the ambiguity:
'return' (lexp)* ';'
which can parse the input a-b in two different ways:
a-b as a single binary expression
a as a single expression, and -b as an unary expression
You will need to change your grammar. Perhaps add a comma in multiple return values? Something like this:
'return' (lexp (',' lexp)*)? ';'
which will match:
return;
return a;
return a, -b;
return a-b, c+d+e, f;
...

Parser rule is ignored in ANTLR grammar

I am trying to write my first ANTLR grammar. And I am parsing following test example:
token1 token2
chapter1 token3 token4 token5
chapter2
token6 token7
chapter3 token8
And using following grammar:
grammar Chapters;
message : chapter+ EOF
;
chapter : (chapter1|chapter2|chapter3) text
;
text : ~(chapter1|chapter2|chapter3)*
;
chapter1 : 'chapter1'
;
chapter2 : 'chapter2'
;
chapter3 : 'chapter3'
;
Id : ('a'..'z'|'A'..'Z'|'_') ('a'..'z'|'A'..'Z'|'0'..'9'|'_')*
;
Int : '0'..'9'+
;
Float
: ('0'..'9')+ '.' ('0'..'9')* EXPONENT?
| '.' ('0'..'9')+ EXPONENT?
| ('0'..'9')+ EXPONENT
;
WS : ( ' '
| '\t'
| '\r'
| '\n'
) {$channel=HIDDEN;}
;
Char: '\'' ( ESC_SEQ | ~('\''|'\\') ) '\''
;
fragment
EXPONENT : ('e'|'E') ('+'|'-')? ('0'..'9')+ ;
fragment
HEX_DIGIT : ('0'..'9'|'a'..'f'|'A'..'F') ;
fragment
ESC_SEQ
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UNICODE_ESC
| OCTAL_ESC
;
fragment
OCTAL_ESC
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UNICODE_ESC
: '\\' 'u' HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
;
And I get following result:
What I wanted to see was token1 and token2 under text node, same goes for tokens 3,4 and 5 etc. So I wanted to break content under each chapter node into chapter name and chapter text. How should I change my grammar to achieve this?

Resources