I am trying to parse input like this:
(0002,0980);
(000a,f987);
(0001,[foo]00);
The pattern is ( g, e);
where g is a four-digit hex number. If g is even, e is a four-digit hex number either even or odd. If g is odd, e has the pattern '[IDENT] hex-digit hex-digit'.
I have tried many variations but this summarizes my thinking...
grammar Post;
script : statement (statement)* EOF ;
statement : tag ';' ;
tag : even_tag | odd_tag ;
even_tag : '(' g_even ',' e_even ')' ;
odd_tag : '(' g_odd ',' e_odd ')' ;
g_even : HEXDIGIT HEXDIGIT HEXDIGIT EVEN_HEXDIGIT ;
g_odd : HEXDIGIT HEXDIGIT HEXDIGIT ODD_HEXDIGIT ;
e_even : HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT ;
e_odd : '[' IDENT ']' HEXDIGIT HEXDIGIT ;
HEXDIGIT : ODD_HEXDIGIT | EVEN_HEXDIGIT ;
ODD_HEXDIGIT : ['1','3','5','7','9', 'b', 'B', 'd', 'D', 'f', 'F'];
EVEN_HEXDIGIT : ['0','2','4','6','8', 'a', 'A', 'c', 'C', 'e', 'E'];
IDENT : LETTER (LETTER | DIGIT | ' ')*;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : ('0'..'9');
It fails with these uninspiring errors:
line 2:12 token recognition error at: '\n'
line 3:4 no viable alternative at input '(0001'
Modifying this to
grammar P2;
script : statement (statement)* EOF ;
statement : tag ';' ;
tag : even_tag | odd_tag ;
even_tag : '(' g_even ',' e_even ')' ;
odd_tag : '(' g_odd ',' e_odd ')' ;
g_even : G_EVEN ;
g_odd : G_ODD ;
e_even : E_EVEN ;
e_odd : E_ODD ;
G_EVEN : HEXDIGIT HEXDIGIT HEXDIGIT EVEN_HEXDIGIT ;
G_ODD : HEXDIGIT HEXDIGIT HEXDIGIT ODD_HEXDIGIT ;
E_EVEN : HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT ;
E_ODD : '[' IDENT ']' HEXDIGIT HEXDIGIT ;
ODD_HEXDIGIT : ['1','3','5','7','9', 'b', 'B', 'd', 'D', 'f', 'F'];
EVEN_HEXDIGIT : ['0','2','4','6','8', 'a', 'A', 'c', 'C', 'e', 'E'];
HEXDIGIT : ODD_HEXDIGIT | EVEN_HEXDIGIT ;
IDENT : LETTER (LETTER | DIGIT | ' ')*;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : ('0'..'9');
helps a lot, but now the problem looks more clearly like an ambiguity in e:
line 2:5 mismatched input ',098' expecting ','
line 2:12 token recognition error at: '\n'
I suspect the problem results from the fact that g_even and e_even are ambiguous and g_odd and e_even are ambiguous. However, the pattern is such that this ambiguity can be avoided because g is always parsed first and g_even and g_odd are not ambiguous. Once g is known, there is no ambiguity left. There is only ambiguity if the parser doesn't know it is always looking for g first. There is only ambiguity if parsing might begin with e and that is never the case.
Perhaps the problem isn't ambiguity at all. I'm new to this game.
How can I parse this so that the parse tree labels g_even, g_odd, e_even, e_odd?
Thanks!
I got this to work with the following:
grammar P2;
script : statement (statement)* EOF ;
statement : tag ';' ;
tag : even_tag | odd_tag ;
even_tag : '(' g_even ',' e_even ')' ;
odd_tag : '(' g_odd ',' e_odd ')' ;
g_even : G_EVEN ;
g_odd : G_ODD ;
e_even : G_EVEN | G_ODD;
e_odd : E_ODD ;
G_EVEN : HEXDIGIT HEXDIGIT HEXDIGIT EVEN_HEXDIGIT ;
G_ODD : HEXDIGIT HEXDIGIT HEXDIGIT ODD_HEXDIGIT ;
//E_EVEN : HEXDIGIT HEXDIGIT HEXDIGIT HEXDIGIT ;
E_ODD : '[' IDENT ']' HEXDIGIT HEXDIGIT ;
ODD_HEXDIGIT : [13579bBdDfF];
EVEN_HEXDIGIT : [02468aAcCeE];
HEXDIGIT : ODD_HEXDIGIT | EVEN_HEXDIGIT ;
IDENT : LETTER (LETTER | DIGIT | ' ')*;
fragment LETTER : ('a'..'z' | 'A'..'Z') ;
fragment DIGIT : ('0'..'9');
There was no need to define the overlapping E_EVEN. Plus, the list-of-characters-to-match syntax was wrong and caused unintended matching on commas. Doh.
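To see why the quoted, comma-separated sets misbehaved, here is a small Python sketch. It assumes that everything between the brackets is read as a literal set member, which is how Python's re module treats a character class; Python is only a stand-in here, not ANTLR itself, and the names BAD_ODD and GOOD_ODD are made up for the illustration.

```python
import re

# The quoted, comma-separated form from the first two grammars, read as a plain set.
# The quote, comma and space characters all become members of the set.
BAD_ODD = re.compile(r"['1','3','5','7','9', 'b', 'B', 'd', 'D', 'f', 'F']")

# The corrected form used in the working grammar.
GOOD_ODD = re.compile(r"[13579bBdDfF]")

for ch in [",", "'", " ", "1", "f"]:
    print(repr(ch),
          "bad set:", bool(BAD_ODD.fullmatch(ch)),
          "good set:", bool(GOOD_ODD.fullmatch(ch)))
```

With the bad set, a comma counts as a "hex digit", which fits the earlier error where ',098' was swallowed as a single token instead of ',' followed by the digits.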
I have 3 types of numbers defined, number, decimal and percentage.
Percentage : (Sign)? Digit+ (Dot Digit+)? '%' ;
Number : Sign? Digit+;
Decimal : Sign? Digit+ Dot Digit*;
Percentage and decimal work fine but when I assign a number, unless I put a sign (+ or -) in front of the number, it doesn't recognize it as a number.
number foo = +5 // does recognize
number foo = 5; // does not recognize
It does recognize it in an evaluation expression.
if (foo == 5 ) // does recognize
Here is my language (I took out the functions and left only the language recognition).
grammar Fetal;
transaction : begin statements end;
begin : 'begin' ;
end : 'end' ;
statements : (statement)+
;
statement
: declaration ';'
| command ';'
| assignment ';'
| evaluation
| ';'
;
declaration : type var;
var returns : identifier;
type returns
: DecimalType
| NumberType
| StringType
| BooleanType
| DateType
| ObjectType
| DaoType
;
assignment
: lharg Equals rharg
| lharg unaryOP rharg
;
assignmentOp : Equals
;
unaryOP : PlusEquals
| MinusEquals
| MultiplyEquals
| DivideEquals
| ModuloEquals
| ExponentEquals
;
expressionOp : arithExpressOp
| bitwiseExpressOp
;
arithExpressOp : Multiply
| Divide
| Plus
| Minus
| Modulo
| Exponent
;
bitwiseExpressOp
: And
| Or
| Not
;
comparisonOp : IsEqualTo
| IsLessThan
| IsLessThanOrEqualTo
| IsGreaterThan
| IsGreaterThanOrEqualTo
| IsNotEqualTo
;
logicExpressOp : AndExpression
| OrExpression
| ExclusiveOrExpression
;
rharg returns
: rharg expressionOp rharg
| '(' rharg expressionOp rharg ')'
| var
| literal
| assignmentCommands
;
lharg returns : var;
identifier : Identifier;
evaluation : IfStatement '(' evalExpression ')' block (Else block)?;
block : OpenBracket statements CloseBracket;
evalExpression
: evalExpression logicExpressOp evalExpression
| '(' evalExpression logicExpressOp evalExpression ')'
| eval
| '(' eval ')'
;
eval : rharg comparisonOp rharg ;
assignmentCommands
: GetBalance '(' stringArg ')'
| GetVariableType '(' var ')'
| GetDescription
| Today
| GetDays '(' startPeriod=dateArg ',' endPeriod=dateArg ')'
| DayOfTheWeek '(' dateArg ')'
| GetCalendarDay '(' dateArg ')'
| GetMonth '(' dateArg ')'
| GetYear '(' dateArg ')'
| Import '(' stringArg ')' /* Import( path ) */
| Lookup '(' sql=stringArg ',' argumentList ')' /* Lookup( table, SQL) */
| List '(' sql=stringArg ',' argumentList ')' /* List( table, SQL) */
| invocation
;
command : Print '(' rharg ')'
| Credit '(' amtArg ',' stringArg ')'
| Debit '(' amtArg ',' stringArg ')'
| Ledger '(' debitOrCredit ',' amtArg ',' acc=stringArg ',' desc=stringArg ')'
| Alias '(' account=stringArg ',' name=stringArg ')'
| MapFile ':' stringArg
| invocation
| Update '(' sql=stringArg ',' argumentList ')'
;
invocation
: o=objectLiteral '.' m=identifier '('argumentList? ')'
| o=objectLiteral '.' m=identifier '()'
;
argumentList
: rharg (',' rharg )*
;
amtArg : rharg ;
stringArg : rharg ;
numberArg : rharg ;
dateArg : rharg ;
debitOrCredit : charLiteral ;
literal
: numericLiteral
| doubleLiteral
| booleanLiteral
| percentLiteral
| stringLiteral
| dateLiteral
;
fileName : '<' fn=Identifier ('.' ft=Identifier)? '>' ;
charLiteral : ('D' | 'C');
numericLiteral : Number ;
doubleLiteral : Decimal ;
percentLiteral : Percentage ;
booleanLiteral : Boolean ;
stringLiteral : String ;
dateLiteral : Date ;
objectLiteral : Identifier ;
daoLiteral : Identifier ;
//Below are Token definitions
// Data Types
DecimalType : 'decimal' ;
NumberType : 'number' ;
StringType : 'string' ;
BooleanType : 'boolean' ;
DateType : 'date' ;
ObjectType : 'object' ;
DaoType : 'dao' ;
/******************************************************************
* Assignment operator
******************************************************************/
Equals : '=' ;
/*****************************************************************
* Unary operators
*****************************************************************/
PlusEquals : '+=' ;
MinusEquals : '-=' ;
MultiplyEquals : '*=' ;
DivideEquals : '/=' ;
ModuloEquals : '%=' ;
ExponentEquals : '^=' ;
/*****************************************************************
* Binary operators
*****************************************************************/
Plus : '+' ;
Minus : '-' ;
Multiply : '*' ;
Divide : '/' ;
Modulo : '%' ;
Exponent : '^' ;
/***************************************************************
* Bitwise operators
***************************************************************/
And : '&' ;
Or : '|' ;
Not : '!' ;
/*************************************************************
* Comparison operators
*************************************************************/
IsEqualTo : '==' ;
IsLessThan : '<' ;
IsLessThanOrEqualTo : '<=' ;
IsGreaterThan : '>' ;
IsGreaterThanOrEqualTo : '>=' ;
IsNotEqualTo : '!=' ;
/*************************************************************
* Expression operators
*************************************************************/
AndExpression : '&&' ;
OrExpression : '||' ;
ExclusiveOrExpression : '^^' ;
// Reserved words (Assignment Commands)
GetBalance : 'getBalance';
GetVariableType : 'getVariableType' ;
GetDescription : 'getDescription' ;
Today : 'today';
GetDays : 'getDays' ;
DayOfTheWeek : 'dayOfTheWeek' ;
GetCalendarDay : 'getCalendarDay' ;
GetMonth : 'getMonth' ;
GetYear : 'getYear' ;
Import : 'import' ;
Lookup : 'lookup' ;
List : 'list' ;
// Reserved words (Commands)
Credit : 'credit';
Debit : 'debit';
Ledger : 'ledger';
Alias : 'alias' ;
MapFile : 'mapFile' ;
Update : 'update' ;
Print : 'print';
IfStatement : 'if';
Else : 'else';
OpenBracket : '{';
CloseBracket : '}';
Percentage : (Sign)? Digit+ (Dot Digit+)? '%' ;
Boolean : 'true' | 'false';
Number : Sign? Digit+;
Decimal : Sign? Digit+ Dot Digit*;
Date : Year '-' Month '-' Day;
Identifier
: IdentifierNondigit
( IdentifierNondigit
| Digit
)*
;
String: '"' ( ESC | ~[\\"] )* '"';
/************************************************************
* Fragment Definitions
************************************************************/
fragment
ESC : '\\' [abtnfrv"'\\]
;
fragment
IdentifierNondigit
: Nondigit
//| // other implementation-defined characters...
;
fragment
Nondigit
: [a-zA-Z_]
;
fragment
Digit
: [0-9]
;
fragment
Sign : Plus | Minus;
fragment
Digits
: [-+]?[0-9]+
;
fragment
Year
: Digit Digit Digit Digit;
fragment
Month
: Digit Digit;
fragment
Day
: Digit Digit;
fragment Dot : '.';
fragment
SCharSequence
: SChar+
;
fragment
SChar
: ~["\\\r\n]
| SimpleEscapeSequence
| '\\\n' // Added line
| '\\\r\n' // Added line
;
fragment
CChar
: ~['\\\r\n]
| SimpleEscapeSequence
;
fragment
SimpleEscapeSequence
: '\\' ['"?abfnrtv\\]
;
ExtendedAscii
: [\x80-\xfe]+
-> skip
;
Whitespace
: [ \t]+
-> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
BlockComment
: '/*' .*? '*/'
-> skip
;
LineComment
: '//' ~[\r\n]*
-> skip
;
I have a hunch that this use of a fragment is incorrect:
fragment Sign : Plus | Minus;
I couldn't find anything in the reference book, but I think it needs to be changed to something like this:
fragment Sign : [+-];
I found the issue. I was using version 4.5.2-1 because every attempt to upgrade to 4.7 caused more errors and I didn't want to cause more errors while trying to solve another. I finally broke down and upgraded the libraries to 4.7, fixed the errors and the number recognition issue disappeared. It was a bug in the library, all this time.
I want to parse Rulebook files such as the "demo.rb" below:
rulebook Titanic-Normalization {
version 1
meta {
description "Test"
source "my-rules.xslx"
user "joltie"
}
rule remove-first-line {
description "Removes first line when offset is zero"
when(present(offset) && offset == 0) then {
filter-row-if-true true;
}
}
}
I wrote the ANTLR4 grammar file Rulebook.g4 shown below. So far it parses most of a *.rb file, but it throws unexpected errors when it reaches the "expression" / "statement" rules.
grammar Rulebook;
rulebookStatement
: KWRulebook
(GeneralIdentifier | Identifier)
'{'
KWVersion
VersionConstant
metaStatement
(ruleStatement)+
'}'
;
metaStatement
: KWMeta
'{'
KWDescription
StringLiteral
KWSource
StringLiteral
KWUser
StringLiteral
'}'
;
ruleStatement
: KWRule
(GeneralIdentifier | Identifier)
'{'
KWDescription
StringLiteral
whenThenStatement
'}'
;
whenThenStatement
: KWWhen '(' expression ')'
KWThen '{' statement '}'
;
primaryExpression
: GeneralIdentifier
| Identifier
| StringLiteral+
| '(' expression ')'
;
postfixExpression
: primaryExpression
| postfixExpression '[' expression ']'
| postfixExpression '(' argumentExpressionList? ')'
| postfixExpression '.' Identifier
| postfixExpression '->' Identifier
| postfixExpression '++'
| postfixExpression '--'
;
argumentExpressionList
: assignmentExpression
| argumentExpressionList ',' assignmentExpression
;
unaryExpression
: postfixExpression
| '++' unaryExpression
| '--' unaryExpression
| unaryOperator castExpression
;
unaryOperator
: '&' | '*' | '+' | '-' | '~' | '!'
;
castExpression
: unaryExpression
| DigitSequence // for
;
multiplicativeExpression
: castExpression
| multiplicativeExpression '*' castExpression
| multiplicativeExpression '/' castExpression
| multiplicativeExpression '%' castExpression
;
additiveExpression
: multiplicativeExpression
| additiveExpression '+' multiplicativeExpression
| additiveExpression '-' multiplicativeExpression
;
shiftExpression
: additiveExpression
| shiftExpression '<<' additiveExpression
| shiftExpression '>>' additiveExpression
;
relationalExpression
: shiftExpression
| relationalExpression '<' shiftExpression
| relationalExpression '>' shiftExpression
| relationalExpression '<=' shiftExpression
| relationalExpression '>=' shiftExpression
;
equalityExpression
: relationalExpression
| equalityExpression '==' relationalExpression
| equalityExpression '!=' relationalExpression
;
andExpression
: equalityExpression
| andExpression '&' equalityExpression
;
exclusiveOrExpression
: andExpression
| exclusiveOrExpression '^' andExpression
;
inclusiveOrExpression
: exclusiveOrExpression
| inclusiveOrExpression '|' exclusiveOrExpression
;
logicalAndExpression
: inclusiveOrExpression
| logicalAndExpression '&&' inclusiveOrExpression
;
logicalOrExpression
: logicalAndExpression
| logicalOrExpression '||' logicalAndExpression
;
conditionalExpression
: logicalOrExpression ('?' expression ':' conditionalExpression)?
;
assignmentExpression
: conditionalExpression
| unaryExpression assignmentOperator assignmentExpression
| DigitSequence // for
;
assignmentOperator
: '=' | '*=' | '/=' | '%=' | '+=' | '-=' | '<<=' | '>>=' | '&=' | '^=' | '|='
;
expression
: assignmentExpression
| expression ',' assignmentExpression
;
statement
: expressionStatement
;
expressionStatement
: expression+ ';'
;
KWRulebook: 'rulebook';
KWVersion: 'version';
KWMeta: 'meta';
KWDescription: 'description';
KWSource: 'source';
KWUser: 'user';
KWRule: 'rule';
KWWhen: 'when';
KWThen: 'then';
KWTrue: 'true';
KWFalse: 'false';
fragment
LeftParen : '(';
fragment
RightParen : ')';
fragment
LeftBracket : '[';
fragment
RightBracket : ']';
fragment
LeftBrace : '{';
fragment
RightBrace : '}';
Identifier
: IdentifierNondigit
( IdentifierNondigit
| Digit
)*
;
GeneralIdentifier
: Identifier
('-' Identifier)+
;
fragment
IdentifierNondigit
: Nondigit
//| // other implementation-defined characters...
;
VersionConstant
: DigitSequence ('.' DigitSequence)*
;
DigitSequence
: Digit+
;
fragment
Nondigit
: [a-zA-Z_]
;
fragment
Digit
: [0-9]
;
StringLiteral
: '"' SCharSequence? '"'
| '\'' SCharSequence? '\''
;
fragment
SCharSequence
: SChar+
;
fragment
SChar
: ~["\\\r\n]
| '\\\n' // Added line
| '\\\r\n' // Added line
;
Whitespace
: [ \t]+
-> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
BlockComment
: '/*' .*? '*/'
-> skip
;
LineComment
: '//' ~[\r\n]*
-> skip
;
I tested the Rulebook parser with a unit test like the one below:
public void testScanRulebookFile() throws IOException {
String fileName = "C:\\rulebooks\\demo.rb";
FileInputStream fis = new FileInputStream(fileName);
// create a CharStream that reads from the rulebook file
CharStream input = CharStreams.fromStream(fis);
// create a lexer that feeds off of input CharStream
RulebookLexer lexer = new RulebookLexer(input);
// create a buffer of tokens pulled from the lexer
CommonTokenStream tokens = new CommonTokenStream(lexer);
// create a parser that feeds off the tokens buffer
RulebookParser parser = new RulebookParser(tokens);
RulebookStatementContext context = parser.rulebookStatement();
// WhenThenStatementContext context = parser.whenThenStatement();
System.out.println(context.toStringTree(parser));
// ParseTree tree = parser.getContext(); // begin parsing at init rule
// System.out.println(tree.toStringTree(parser)); // print LISP-style tree
}
For the "demo.rb" above, the parser reports the error below. I also print the RulebookStatementContext with toStringTree.
line 12:25 mismatched input '&&' expecting ')'
(rulebookStatement rulebook Titanic-Normalization { version 1 (metaStatement meta { description "Test" source "my-rules.xslx" user "joltie" }) (ruleStatement rule remove-first-line { description "Removes first line when offset is zero" (whenThenStatement when ( (expression (assignmentExpression (conditionalExpression (logicalOrExpression (logicalAndExpression (inclusiveOrExpression (exclusiveOrExpression (andExpression (equalityExpression (relationalExpression (shiftExpression (additiveExpression (multiplicativeExpression (castExpression (unaryExpression (postfixExpression (postfixExpression (primaryExpression present)) ( (argumentExpressionList (assignmentExpression (conditionalExpression (logicalOrExpression (logicalAndExpression (inclusiveOrExpression (exclusiveOrExpression (andExpression (equalityExpression (relationalExpression (shiftExpression (additiveExpression (multiplicativeExpression (castExpression (unaryExpression (postfixExpression (primaryExpression offset))))))))))))))))) ))))))))))))))))) && offset == 0 ) then { filter-row-if-true true ;) }) })
I also wrote a unit test with a shorter input, "when (offset == 0) then {\n" + "filter-row-if-true true;\n" + "}\n", to narrow down the problem, but it still fails with:
line 1:16 mismatched input '0' expecting {'(', '++', '--', '&&', '&', '*', '+', '-', '~', '!', Identifier, GeneralIdentifier, DigitSequence, StringLiteral}
line 2:19 extraneous input 'true' expecting {'(', '++', '--', '&&', '&', '*', '+', '-', '~', '!', ';', Identifier, GeneralIdentifier, DigitSequence, StringLiteral}
After two days of trying I have made no progress. Sorry for the long question; could someone please give me some advice on how to debug ANTLR4 grammar extraneous / mismatched input errors?
I don't know if there are more sophisticated methods to debug a grammar/parser, but here's how I usually do it:
Reduce the input that causes the problem to as few characters as possible.
Reduce your grammar as far as possible so that it still produces the same error on that input. Most of the time that means writing a minimal grammar for the reduced input by recycling the rules of the original grammar, simplified as far as possible.
Make sure the lexer segments the input properly. The feature in ANTLRWorks that shows you the lexer output is great for that.
Have a look at the parse tree. ANTLR's TestRig can display the parse tree graphically (you can access this either through ANTLRWorks or through ANTLR's TreeViewer), so you can see where the parser's interpretation differs from the one you have in mind.
Do the parsing "by hand". That means you take your grammar and go through the input yourself, step by step, applying no logic, assumptions or prior knowledge during the process. Just follow your own grammar as a computer would. Question every step you take (is there another way to match the input?) and always try to match the input in a different way than the one you actually want.
Try to fix the error in the minimal grammar and migrate the solution to your real grammar afterwards.
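The "check the lexer output" step can be tried without any tooling. Below is a rough Python sketch that imitates ANTLR's lexer policy (the longest match wins, and ties go to the rule defined first) on a hand-simplified subset of the lexer rules from the question, kept in the same relative order. The rule called Punct is a hypothetical catch-all for the literal tokens, and the regular expressions are my approximations of the original rules, not generated from the grammar.

```python
import re

# Simplified copies of a few lexer rules from the question, in the same order.
# "Punct" is a made-up stand-in for the implicit literal tokens.
RULES = [
    ("KWWhen", r"when"),
    ("KWThen", r"then"),
    ("KWTrue", r"true"),
    ("Identifier", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("GeneralIdentifier", r"[a-zA-Z_][a-zA-Z_0-9]*(?:-[a-zA-Z_][a-zA-Z_0-9]*)+"),
    ("VersionConstant", r"[0-9]+(?:\.[0-9]+)*"),
    ("DigitSequence", r"[0-9]+"),
    ("Punct", r"==|&&|[(){};]"),
    ("Whitespace", r"[ \t\r\n]+"),
]


def tokenize(text):
    """Return (rule name, text) pairs using longest match, first rule on ties."""
    pos, out = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.compile(pattern).match(text, pos)
            # Strict '>' keeps the earlier rule when two matches have equal length.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"token recognition error at: {text[pos]!r}")
        if best_name != "Whitespace":
            out.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return out


for token in tokenize("when (offset == 0) then { filter-row-if-true true; }"):
    print(token)
```

In this sketch the 0 comes out as VersionConstant rather than DigitSequence, and true comes out as KWTrue rather than Identifier. The parser rules in the question only accept DigitSequence and Identifier at those points, which would be consistent with the two error messages.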
In addition to Raven's answer: I have used the IntelliJ 12+ plugin for ANTLR 4, and it saved me a lot of effort debugging a grammar. I had a very simple bug (an unescaped dot . instead of '.' in a floating-point number rule) that I could not find. The tool lets you select any parser rule of the grammar, test it against an input, and see the parse tree graphically. I didn't notice it had this very useful feature until I started searching for ways to debug the grammar. Highly recommended.
Here is the updated g4 file that fixes the parsing error:
grammar Rulebook;
@header {
package com.someone.commons.rulebook.parser;
}
rulebookStatement
: KWRulebook
(GeneralIdentifier | Identifier)
'{'
KWVersion
VersionConstant
metaStatement
(ruleStatement)+
'}'
;
metaStatement
: KWMeta
'{'
KWDescription
StringLiteral
KWSource
StringLiteral
KWUser
StringLiteral
'}'
;
ruleStatement
: KWRule
(GeneralIdentifier | Identifier)
'{'
KWDescription
StringLiteral
whenThenStatement
'}'
;
whenThenStatement
: KWWhen '(' expression ')'
KWThen '{' (statement)* '}'
;
primaryExpression
: GeneralIdentifier
| Identifier
| StringLiteral+
| Constant
| '(' expression ')'
| '[' expression ']'
;
postfixExpression
: primaryExpression
| postfixExpression '[' expression ']'
| postfixExpression '(' argumentExpressionList? ')'
| postfixExpression '.' Identifier
| postfixExpression '->' Identifier
| postfixExpression '++'
| postfixExpression '--'
;
argumentExpressionList
: assignmentExpression
| argumentExpressionList ',' assignmentExpression
;
unaryExpression
: postfixExpression
| '++' unaryExpression
| '--' unaryExpression
| unaryOperator castExpression
;
unaryOperator
: '&' | '*' | '+' | '-' | '~' | '!'
;
castExpression
: unaryExpression
;
multiplicativeExpression
: castExpression
| multiplicativeExpression '*' castExpression
| multiplicativeExpression '/' castExpression
| multiplicativeExpression '%' castExpression
;
additiveExpression
: multiplicativeExpression
| additiveExpression '+' multiplicativeExpression
| additiveExpression '-' multiplicativeExpression
;
shiftExpression
: additiveExpression
| shiftExpression '<<' additiveExpression
| shiftExpression '>>' additiveExpression
;
relationalExpression
: shiftExpression
| relationalExpression '<' shiftExpression
| relationalExpression '>' shiftExpression
| relationalExpression '<=' shiftExpression
| relationalExpression '>=' shiftExpression
;
equalityExpression
: relationalExpression
| equalityExpression '==' relationalExpression
| equalityExpression '!=' relationalExpression
;
andExpression
: equalityExpression
| andExpression '&' equalityExpression
;
exclusiveOrExpression
: andExpression
| exclusiveOrExpression '^' andExpression
;
inclusiveOrExpression
: exclusiveOrExpression
| inclusiveOrExpression '|' exclusiveOrExpression
;
logicalAndExpression
: inclusiveOrExpression
| logicalAndExpression '&&' inclusiveOrExpression
;
logicalOrExpression
: logicalAndExpression
| logicalOrExpression '||' logicalAndExpression
;
conditionalExpression
: logicalOrExpression ('?' expression? ':' conditionalExpression)?
;
assignmentExpression
: conditionalExpression
| unaryExpression assignmentOperator assignmentExpression
;
assignmentOperator
: '=' | '*=' | '/=' | '%=' | '+=' | '-=' | '<<=' | '>>=' | '&=' | '^=' | '|='
;
expression
: assignmentExpression
| expression ',' assignmentExpression
;
statement
: expressionStatement
;
expressionStatement
: expression+ ';'
;
KWRulebook: 'rulebook';
KWVersion: 'version';
KWMeta: 'meta';
KWDescription: 'description';
KWSource: 'source';
KWUser: 'user';
KWRule: 'rule';
KWWhen: 'when';
KWThen: 'then';
Identifier
: IdentifierNondigit
( IdentifierNondigit
| Digit
)*
;
GeneralIdentifier
: Identifier
( '-'
| '.'
| IdentifierNondigit
| Digit
)*
;
fragment
IdentifierNondigit
: Nondigit
//| // other implementation-defined characters...
;
VersionConstant
: DigitSequence ('.' DigitSequence)*
;
Constant
: IntegerConstant
| FloatingConstant
;
fragment
IntegerConstant
: DecimalConstant
;
fragment
DecimalConstant
: NonzeroDigit Digit*
;
fragment
FloatingConstant
: DecimalFloatingConstant
;
fragment
DecimalFloatingConstant
: FractionalConstant
;
fragment
FractionalConstant
: DigitSequence? '.' DigitSequence
| DigitSequence '.'
;
fragment
DigitSequence
: Digit+
;
fragment
Nondigit
: [a-zA-Z_]
;
fragment
Digit
: [0-9]
;
fragment
NonzeroDigit
: [1-9]
;
StringLiteral
: '"' SCharSequence? '"'
| '\'' SCharSequence? '\''
;
fragment
SCharSequence
: SChar+
;
fragment
SChar
: ~["\\\r\n]
| '\\\n' // Added line
| '\\\r\n' // Added line
;
Whitespace
: [ \t]+
-> skip
;
Newline
: ( '\r' '\n'?
| '\n'
)
-> skip
;
BlockComment
: '/*' .*? '*/'
-> skip
;
LineComment
: '//' ~[\r\n]*
-> skip
;
I have the following string that I want to match against the rule, stringLiteral:
"D:\\Downloads\\Java\\MyFile"
And my grammar is the file: String.g4, as follows:
grammar String;
fragment
HexDigit : ('0'..'9'|'a'..'f'|'A'..'F') ;
stringLiteral
: '"' ( EscapeSequence | XXXXX )* '"'
;
fragment
EscapeSequence
: '\\' ('b'|'t'|'n'|'f'|'r'|'\"'|'\''|'\\')
| UnicodeEscape
| OctalEscape
;
fragment
OctalEscape
: '\\' ('0'..'3') ('0'..'7') ('0'..'7')
| '\\' ('0'..'7') ('0'..'7')
| '\\' ('0'..'7')
;
fragment
UnicodeEscape
: '\\' 'u' HexDigit HexDigit HexDigit HexDigit
;
What should I put in the XXXXX location in order to match any character that is not \ or "?
I tried the following, and none of them work:
~['\\'"']
~['\\'\"']
~["\]
~[\"\\]
~('\"'|'\\')
~[\\\"]
I am using ANTLRWorks 2 to try this out. Errors are the following:
D:\Downloads\ANTLR\String.g4 line 26:5 mismatched character '<EOF>' expecting '"'
error(50): D:\Downloads\ANTLR\String.g4:26:5: syntax error: '<EOF>' came as a complete surprise to me while looking for rule element
Inside a character class, you only need to escape the backslash:
The following is illegal because it escapes the ]:
[\]
The following matches a backslash:
[\\]
The following matches a quote:
["]
And the following matches either a backslash or quote:
[\\"]
In v4 style, your grammar could look like this:
grammar String;
/* other rules */
StringLiteral
: '"' ( EscapeSequence | ~[\\"] )* '"'
;
fragment
HexDigit
: [0-9a-fA-F]
;
fragment
EscapeSequence
: '\\' [btnfr"'\\]
| UnicodeEscape
| OctalEscape
;
fragment
OctalEscape
: '\\' [0-3] [0-7] [0-7]
| '\\' [0-7] [0-7]
| '\\' [0-7]
;
fragment
UnicodeEscape
: '\\' 'u' HexDigit HexDigit HexDigit HexDigit
;
Note that you can't use fragments inside parser rules: StringLiteral must be a lexer rule!
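The v4 rule above can be mirrored with an ordinary regular expression to check which inputs it accepts. This is a rough Python approximation, not ANTLR; the names ESCAPE and STRING_LITERAL are made up for the sketch, and Python's escaping rules inside a character class happen to be close enough to make the comparison useful.

```python
import re

# EscapeSequence: backslash followed by a simple escape, a \uXXXX escape,
# or an octal escape (same alternatives as the grammar above).
ESCAPE = r"""\\(?:[btnfr"'\\]|u[0-9a-fA-F]{4}|[0-3][0-7][0-7]|[0-7][0-7]?)"""

# StringLiteral: '"' ( EscapeSequence | ~[\\"] )* '"'
# [^\\"] is the regex spelling of ~[\\"]: anything except backslash or quote.
STRING_LITERAL = re.compile(r'"(?:' + ESCAPE + r'|[^\\"])*"')

samples = [
    r'"D:\\Downloads\\Java\\MyFile"',  # the string from the question
    r'"tab\t and \u0041"',             # simple and unicode escapes
    r'"D:\Downloads"',                 # \D is not a valid escape
]
for s in samples:
    print(s, "->", bool(STRING_LITERAL.fullmatch(s)))
```

The first two samples are accepted and the third is rejected, because a lone backslash can only appear as the start of an escape sequence.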
I have a grammar and everything works fine until this portion:
lexp
: factor ( ('+' | '-') factor)*
;
factor :('-')? IDENT;
This of course introduces an ambiguity. For example a-a can be matched by either Factor - Factor or Factor -> - IDENT
I get the following warning stating this:
[18:49:39] warning(200): withoutWarningButIncomplete.g:57:31:
Decision can match input such as "'-' {IDENT, '-'}" using multiple alternatives: 1, 2
How can I resolve this ambiguity? I just don't see a way around it. Is there some kind of option that I can use?
Here is the full grammar:
program
: includes decls (procedure)*
;
/* Check if correct! */
includes
: ('#include' STRING)*
;
decls
: (typedident ';')*
;
typedident
: ('int' | 'char') IDENT
;
procedure
: ('int' | 'char') IDENT '(' args ')' body
;
args
: typedident (',' typedident )* /* Check if correct! */
| /* epsilon */
;
body
: '{' decls stmtlist '}'
;
stmtlist
: (stmt)*;
stmt
: '{' stmtlist '}'
| 'read' '(' IDENT ')' ';'
| 'output' '(' IDENT ')' ';'
| 'print' '(' STRING ')' ';'
| 'return' (lexp)* ';'
| 'readc' '(' IDENT ')' ';'
| 'outputc' '(' IDENT ')' ';'
| IDENT '(' (IDENT ( ',' IDENT )*)? ')' ';'
| IDENT '=' lexp ';';
lexp
: term (( '+' | '-' ) term) * /*Add in | '-' to reveal the warning! !*/
;
term
: factor (('*' | '/' | '%') factor )*
;
factor : '(' lexp ')'
| ('-')? IDENT
| NUMBER;
fragment DIGIT
: ('0' .. '9')
;
IDENT : ('A' .. 'Z' | 'a' .. 'z') (( 'A' .. 'Z' | 'a' .. 'z' | '0' .. '9' | '_'))* ;
NUMBER
: ( ('-')? DIGIT+)
;
CHARACTER
: '\'' ('a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '\\n' | '\\t' | '\\\\' | '\\' | 'EOF' |'.' | ',' |':' ) '\'' /* IS THIS COMPLETE? */
;
As mentioned in the comments: these rules are not ambiguous:
lexp
: factor (('+' | '-') factor)*
;
factor : ('-')? IDENT;
This is the cause of the ambiguity:
'return' (lexp)* ';'
which can parse the input a-b in two different ways:
a-b as a single binary expression
a as a single expression, and -b as a unary expression
You will need to change your grammar. Perhaps add a comma in multiple return values? Something like this:
'return' (lexp (',' lexp)*)? ';'
which will match:
return;
return a;
return a, -b;
return a-b, c+d+e, f;
...
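The two readings can be counted mechanically. The Python sketch below is a simplification I made up for illustration: it uses the reduced rules from the top of the question (lexp as factors joined by + or -, factor as an optional minus plus IDENT), works on a pre-tokenized list where "ID" stands for any identifier, and counts how many ways the tokens after return can be derived under each version of the rule.

```python
def factor_end(toks, i):
    """Index just past a factor ('-'? IDENT) starting at i, or None."""
    if i < len(toks) and toks[i] == "-":
        i += 1
    if i < len(toks) and toks[i] == "ID":
        return i + 1
    return None


def lexp_ends(toks, i):
    """Every index where an lexp (factor (('+'|'-') factor)*) starting at i may stop."""
    ends = []
    end = factor_end(toks, i)
    while end is not None:
        ends.append(end)
        if end < len(toks) and toks[end] in ("+", "-"):
            end = factor_end(toks, end + 1)
        else:
            end = None
    return ends


def count_star(toks, i=0):
    """Number of ways to read toks[i:] as (lexp)*."""
    if i == len(toks):
        return 1
    return sum(count_star(toks, e) for e in lexp_ends(toks, i))


def count_comma(toks, i=0):
    """Number of ways to read toks[i:] as (lexp (',' lexp)*)?."""
    if not toks:
        return 1
    total = 0
    for e in lexp_ends(toks, i):
        if e == len(toks):
            total += 1
        elif toks[e] == ",":
            total += count_comma(toks, e + 1)
    return total


a_minus_b = ["ID", "-", "ID"]
print("(lexp)* readings of a-b:", count_star(a_minus_b))
print("comma-separated readings of a-b:", count_comma(a_minus_b))
```

For a-b the starred rule yields two derivations and the comma-separated rule yields one, which matches the explanation above.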
If I just add on to the following yacc file, will it turn into a parser?
/* C-Minus BNF Grammar */
%token ELSE
%token IF
%token INT
%token RETURN
%token VOID
%token WHILE
%token ID
%token NUM
%token LTE
%token GTE
%token EQUAL
%token NOTEQUAL
%%
program : declaration_list ;
declaration_list : declaration_list declaration | declaration ;
declaration : var_declaration | fun_declaration ;
var_declaration : type_specifier ID ';'
| type_specifier ID '[' NUM ']' ';' ;
type_specifier : INT | VOID ;
fun_declaration : type_specifier ID '(' params ')' compound_stmt ;
params : param_list | VOID ;
param_list : param_list ',' param
| param ;
param : type_specifier ID | type_specifier ID '[' ']' ;
compound_stmt : '{' local_declarations statement_list '}' ;
local_declarations : local_declarations var_declaration
| /* empty */ ;
statement_list : statement_list statement
| /* empty */ ;
statement : expression_stmt
| compound_stmt
| selection_stmt
| iteration_stmt
| return_stmt ;
expression_stmt : expression ';'
| ';' ;
selection_stmt : IF '(' expression ')' statement
| IF '(' expression ')' statement ELSE statement ;
iteration_stmt : WHILE '(' expression ')' statement ;
return_stmt : RETURN ';' | RETURN expression ';' ;
expression : var '=' expression | simple_expression ;
var : ID | ID '[' expression ']' ;
simple_expression : additive_expression relop additive_expression
| additive_expression ;
relop : LTE | '<' | '>' | GTE | EQUAL | NOTEQUAL ;
additive_expression : additive_expression addop term | term ;
addop : '+' | '-' ;
term : term mulop factor | factor ;
mulop : '*' | '/' ;
factor : '(' expression ')' | var | call | NUM ;
call : ID '(' args ')' ;
args : arg_list | /* empty */ ;
arg_list : arg_list ',' expression | expression ;
Heh, that is only the grammar of the language.
To make it a working parser you need to add some code to it, as described at http://dinosaur.compilertools.net/yacc/index.html
Look at chapter 2, Actions.
You will also need a lexical analyzer; see chapter 3, Lexical Analysis.
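To give an idea of what the lexical analyzer has to do, here is a minimal sketch in Python rather than C, purely for illustration. It turns source text into the token names declared with %token in the grammar above. The spellings of LTE, GTE, EQUAL and NOTEQUAL (<=, >=, ==, !=) and the letters-only ID pattern are my assumptions, since the grammar file does not define them; a real yacc setup would supply this logic as a C yylex function, usually generated by lex.

```python
import re

# Keywords map to the %token names from the grammar.
KEYWORDS = {"else": "ELSE", "if": "IF", "int": "INT",
            "return": "RETURN", "void": "VOID", "while": "WHILE"}

# Two-character operators are listed before single characters so that
# "<=" is not split into "<" and "=". Their spellings are assumptions.
TOKEN_RE = re.compile(r"""
    (?P<WS>\s+)
  | (?P<ID>[A-Za-z]+)
  | (?P<NUM>[0-9]+)
  | (?P<LTE><=) | (?P<GTE>>=) | (?P<EQUAL>==) | (?P<NOTEQUAL>!=)
  | (?P<CHAR>[-+*/<>=;,()\[\]{}])
""", re.VERBOSE)


def lex(src):
    """Return a list of (token name, text) pairs for the given source."""
    pos, out = 0, []
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise ValueError(f"unexpected character {src[pos]!r}")
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind == "WS":
            continue
        if kind == "ID":
            kind = KEYWORDS.get(text, "ID")  # keywords win over identifiers
        elif kind == "CHAR":
            kind = text  # single-character tokens stand for themselves, like ';' in the grammar
        out.append((kind, text))
    return out


print(lex("int x; while (x <= 10) x = x + 1;"))
```

The parser would then consume these tokens one at a time, and the actions attached to each grammar rule would decide what to build or compute when a rule is recognized.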