ANTLR lexer rule that may match almost anything

I have a problem here that I'm sure is about how ANTLR works and about me doing it all wrong, but I have read a lot of docs and tutorials and I still don't fully understand it.
My symptom is that my grammar stops working when I add (because I need it) a lexer rule that may match things it should not. It should only be applied in the right context.
I need the ATTR rule because I need description (and other rules that will follow) to capture a string running from that keyword to the end of the line.
This is the conflicting rule:
ATTR
: (~('\r'| '\n'))*
;
It seems to match anything, so it 'eats' text that should match different tokens. That makes sense, but I need it, or I need another solution.
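For example, one alternative that occurred to me (untested, just to illustrate the kind of solution I am looking for) would be to let a single lexer rule consume the keyword together with the rest of the line, so that the free-floating catch-all rule is no longer needed:
DESCRIPTION
: 'description' (' ' | '\t')* (~('\r' | '\n'))*
;
The description parser rule would then reference DESCRIPTION instead of 'description' ATTR, but I don't know whether that is the right approach.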
This is my current example input:
; This is a comment
; Comment 2
audit-template this is the id {
description a description may include any char but {line break}
}
For reference, this is my current complete grammar:
grammar Grammar;
options {
superClass = AbstractTParser;
}
@header {
package antlrTest;
}
@lexer::header {
package antlrTest;
}
@lexer::members {
private void debug(String str) {
System.err.println("DEBUG(L) " + str);
}
}
@members {
private void debug(String str) {
System.err.println("DEBUG(P) " + str);
}
}
parse
: (template|'**TODO**') EOF { debug ("EOF"); }
;
template : 'audit-template' id=IDENTIFIER OB content=templateContent CB { debug("template id=" + $id.text); }
;
templateContent:
description?
;
description : 'description' ATTR
;
//
COMMENT
: ';' ~( '\r' | '\n' )* {$channel=HIDDEN; debug("COMMENT");}
;
SPACES : ( '\t' | '\f' | ' ' | '\n'| '\r' ) {$channel=HIDDEN;}
;
OB : '{' { debug("OB"); }
;
CB : '}' ('\r'| '\n')+ { debug("CB(lf)"); }
| '}' EOF { debug("CB(eof)"); }
;
IDENTIFIER
: ( 'a'..'z' | 'A'..'Z' | '_' )
( 'a'..'z' | 'A'..'Z' | '_' | '0'..'9' | '.' | ' ' | '\t')*
;
ATTR
: (~('\r'| '\n'))*
;
I am getting this error (the error handling is a bit tweaked):
Line 4 char 0
at antlrTest.AbstractTParser.reportError(AbstractTParser.java:45)
at antlrTest.GrammarParser.parse(GrammarParser.java:106)
at antlrTest.TemplateParser.parseInput(TemplateParser.java:15)
at antlrTest.Main.testFile(Main.java:32)
at antlrTest.Main.main(Main.java:11)
Caused by: NoViableAltException(7@[])
at antlrTest.GrammarParser.parse(GrammarParser.java:71)
... 3 more
Solution? Alternatives? Thank you.

Related

ANTLR 4 IDE strange <.> symbol in parse tree

I'm trying to create a sort of CSS preprocessor for my studies.
This is my grammar. It does little for now, but it already doesn't work.
grammar CSSS;
@header {
package antlr;
}
program
: member*
;
member
: element
;
element
: selector '{' property* '}'
;
selector
: selector_atom (('>'|'+'|'~') selector_atom)*
;
selector_atom
: ('#'|'.')? NAME
| '*'
;
property
: prop_name ':' prop_value+ ';'
;
prop_name
: NAME
;
prop_value
: (CSSLETTER | NUMBER)+
;
NAME: CSSLETTER+;
NUMBER: '0'..'9';
CSSLETTER: 'a'..'z' | 'A'..'Z' | '-' | '_';
WS: (' '|'\n'|'\r'|'\t'|'\f')+ { skip(); };
The problem is when I try to feed it this input:
body {
font-size: 19px;
...
}
The parse tree outputs this strange dot symbol where it should skip whitespace.
Can someone please explain what this thing is and how to make everything work correctly?

BibTex grammar for ANTLR

I'm looking for a BibTeX grammar in ANTLR to use in a hobby project. I don't want to spend my time writing an ANTLR grammar (this may take some time for me because it will involve a learning curve), so I'd appreciate any pointers.
Note: I've found BibTeX grammars for bison and yacc but couldn't find any for ANTLR.
Edit: As Bart pointed out, I don't need to parse the preambles and the TeX in the quoted strings.
Here's a (very) rudimentary BibTex grammar that emits an AST (contrary to a simple parse tree):
grammar BibTex;
options {
output=AST;
ASTLabelType=CommonTree;
}
tokens {
BIBTEXFILE;
TYPE;
STRING;
PREAMBLE;
COMMENT;
TAG;
CONCAT;
}
//////////////////////////////// Parser rules ////////////////////////////////
parse
: (entry (Comma? entry)* Comma?)? EOF -> ^(BIBTEXFILE entry*)
;
entry
: Type Name Comma tags CloseBrace -> ^(TYPE Name tags)
| StringType Name Assign QuotedContent CloseBrace -> ^(STRING Name QuotedContent)
| PreambleType content CloseBrace -> ^(PREAMBLE content)
| CommentType -> ^(COMMENT CommentType)
;
tags
: (tag (Comma tag)* Comma?)? -> tag*
;
tag
: Name Assign content -> ^(TAG Name content)
;
content
: concatable (Concat concatable)* -> ^(CONCAT concatable+)
| Number
| BracedContent
;
concatable
: QuotedContent
| Name
;
//////////////////////////////// Lexer rules ////////////////////////////////
Assign
: '='
;
Concat
: '#'
;
Comma
: ','
;
CloseBrace
: '}'
;
QuotedContent
: '"' (~('\\' | '{' | '}' | '"') | '\\' . | BracedContent)* '"'
;
BracedContent
: '{' (~('\\' | '{' | '}') | '\\' . | BracedContent)* '}'
;
StringType
: '@' ('s'|'S') ('t'|'T') ('r'|'R') ('i'|'I') ('n'|'N') ('g'|'G') SP? '{'
;
PreambleType
: '@' ('p'|'P') ('r'|'R') ('e'|'E') ('a'|'A') ('m'|'M') ('b'|'B') ('l'|'L') ('e'|'E') SP? '{'
;
CommentType
: '@' ('c'|'C') ('o'|'O') ('m'|'M') ('m'|'M') ('e'|'E') ('n'|'N') ('t'|'T') SP? BracedContent
| '%' ~('\r' | '\n')*
;
Type
: '@' Letter+ SP? '{'
;
Number
: Digit+
;
Name
: Letter (Letter | Digit | ':' | '-')*
;
Spaces
: SP {skip();}
;
//////////////////////////////// Lexer fragments ////////////////////////////////
fragment Letter
: 'a'..'z'
| 'A'..'Z'
;
fragment Digit
: '0'..'9'
;
fragment SP
: (' ' | '\t' | '\r' | '\n' | '\f')+
;
(if you don't want the AST, remove all -> and everything to the right of it and remove both the options{...} and tokens{...} blocks)
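For example, after stripping them, the entry rule would read just:
entry
: Type Name Comma tags CloseBrace
| StringType Name Assign QuotedContent CloseBrace
| PreambleType content CloseBrace
| CommentType
;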
which can be tested with the following class:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
// parse the file 'test.bib'
BibTexLexer lexer = new BibTexLexer(new ANTLRFileStream("test.bib"));
BibTexParser parser = new BibTexParser(new CommonTokenStream(lexer));
// you can use the following tree in your code
// see: http://www.antlr.org/api/Java/classorg_1_1antlr_1_1runtime_1_1tree_1_1_common_tree.html
CommonTree tree = (CommonTree)parser.parse().getTree();
// print a DOT tree of our AST
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
and the following example Bib-input (file: test.bib):
@PREAMBLE{
"\newcommand{\noopsort}[1]{} "
# "\newcommand{\singleletter}[1]{#1} "
}
@string {
me = "Bart Kiers"
}
@ComMENt{some comments here}
% or some comments here
@article{mrx05,
auTHor = me # "Mr. X",
Title = {Something Great},
publisher = "nob" # "ody",
YEAR = 2005,
x = {{Bib}\TeX},
y = "{Bib}\TeX",
z = "{Bib}" # "\TeX",
},
@misc{ patashnik-bibtexing,
author = "Oren Patashnik",
title = "BIBTEXing",
year = "1988"
} % no comma here
@techreport{presstudy2002,
author = "Dr. Diessen, van R. J. and Drs. Steenbergen, J. F.",
title = "Long {T}erm {P}reservation {S}tudy of the {DNEP} {P}roject",
institution = "IBM, National Library of the Netherlands",
year = "2002",
month = "December",
}
Run the demo
If you now generate a parser & lexer from the grammar:
java -cp antlr-3.3.jar org.antlr.Tool BibTex.g
and compile all .java source files:
javac -cp antlr-3.3.jar *.java
and finally run the Main class:
*nix/MacOS
java -cp .:antlr-3.3.jar Main
Windows
java -cp .;antlr-3.3.jar Main
You'll see some output on your console which corresponds to the following AST:
(AST image generated with graphviz-dev.appspot.com)
To emphasize: I did not properly test the grammar! I wrote it a while back and never really used it in any project.

Expressions in a CoCo to ANTLR translator

I'm parsing CoCo/R grammars in a utility to automate CoCo -> ANTLR translation. The core ANTLR grammar is:
rule '=' expression '.' ;
expression
: term ('|' term)*
-> ^( OR_EXPR term term* )
;
term
: (factor (factor)*)? ;
factor
: symbol
| '(' expression ')'
-> ^( GROUPED_EXPR expression )
| '[' expression']'
-> ^( OPTIONAL_EXPR expression)
| '{' expression '}'
-> ^( SEQUENCE_EXPR expression)
;
symbol
: IF_ACTION
| ID (ATTRIBUTES)?
| STRINGLITERAL
;
My problem is with constructions such as these:
CS = { ExternAliasDirective }
{ UsingDirective }
EOF .
CS results in an AST with an OR_EXPR node although no '|' character
actually appears. I'm sure this is due to the definition of
expression but I cannot see any other way to write the rules.
I did experiment with this to resolve the ambiguity.
// explicitly test for the presence of an '|' character
expression
@init { bool ored = false; }
: term {ored = (input.LT(1).Type == OR); } (OR term)*
-> {ored}? ^(OR_EXPR term term*)
-> ^(LIST term term*)
It works but the hack reinforces my conviction that something fundamental is wrong.
Any tips much appreciated.
Your rule:
expression
: term ('|' term)*
-> ^( OR_EXPR term term* )
;
always causes the rewrite rule to create a tree with a root of type OR_EXPR. You can create "sub rewrite rules" like this:
expression
: (term -> REWRITE_RULE_X) ('|' term -> ^(REWRITE_RULE_Y))*
;
And to resolve the ambiguity in your grammar, it's easiest to enable global backtracking which can be done in the options { ... } section of your grammar.
A quick demo:
grammar CocoR;
options {
output=AST;
backtrack=true;
}
tokens {
RULE;
GROUP;
SEQUENCE;
OPTIONAL;
OR;
ATOMS;
}
parse
: rule EOF -> rule
;
rule
: ID '=' expr* '.' -> ^(RULE ID expr*)
;
expr
: (a=atoms -> $a) ('|' b=atoms -> ^(OR $expr $b))*
;
atoms
: atom+ -> ^(ATOMS atom+)
;
atom
: ID
| '(' expr ')' -> ^(GROUP expr)
| '{' expr '}' -> ^(SEQUENCE expr)
| '[' expr ']' -> ^(OPTIONAL expr)
;
ID
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')*
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
with input:
CS = { ExternAliasDirective }
{ UsingDirective }
EOF .
produces the AST:
and the input:
foo = a | b ({c} | d [e f]) .
produces:
The class to test this:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
/*
String source =
"CS = { ExternAliasDirective } \n" +
"{ UsingDirective } \n" +
"EOF . ";
*/
String source = "foo = a | b ({c} | d [e f]) .";
ANTLRStringStream in = new ANTLRStringStream(source);
CocoRLexer lexer = new CocoRLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
CocoRParser parser = new CocoRParser(tokens);
CocoRParser.parse_return returnValue = parser.parse();
CommonTree tree = (CommonTree)returnValue.getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
and with the output this class produces, I used the following website to create the AST-images: http://graph.gafol.net/
HTH
EDIT
To account for epsilon (empty string) in your OR expressions, you might try something (quickly tested!) like this:
expr
: (a=atoms -> $a) ( ( '|' b=atoms -> ^(OR $expr $b)
| '|' -> ^(OR $expr NOTHING)
)
)*
;
which parses the source:
foo = a | b | .
into the following AST:
The production for expression explicitly says that it can only return an OR_EXPR node. You can try something like:
expression
:
term
|
term ('|' term)+
-> ^( OR_EXPR term term* )
;
Further down, you could use:
term
: factor*;

Solving ambiguities in grammars

I am writing a parser for Delphi's DFM files. The lexer looks like this:
EXP ([Ee][-+]?[0-9]+)
%%
("#"([0-9]{1,5}|"$"[0-9a-fA-F]{1,6})|"'"([^']|'')*"'")+ {
return tkStringLiteral; }
"object" { return tkObjectBegin; }
"end" { return tkObjectEnd; }
"true" { /*yyval.boolean = true;*/ return tkBoolean; }
"false" { /*yyval.boolean = false;*/ return tkBoolean; }
"+" | "." | "(" | ")" | "[" | "]" | "{" | "}" | "<" | ">" | "=" | "," |
":" { return yytext[0]; }
[+-]?[0-9]{1,10} { /*yyval.integer = atoi(yytext);*/ return tkInteger; }
[0-9A-F]+ { return tkHexValue; }
[+-]?[0-9]+"."[0-9]+{EXP}? { /*yyval.real = atof(yytext);*/ return tkReal; }
[a-zA-Z_][0-9A-Z_]* { return tkIdentifier; }
"$"[0-9A-F]+ { /* yyval.integer = atoi(yytext);*/ return tkHexNumber; }
[ \t\r\n] { /* ignore whitespace */ }
. { std::cerr << boost::format("Mystery character %c\n") % *yytext; }
<<EOF>> { yyterminate(); }
%%
and the bison grammar looks like
%token tkInteger
%token tkReal
%token tkIdentifier
%token tkHexValue
%token tkHexNumber
%token tkObjectBegin
%token tkObjectEnd
%token tkBoolean
%token tkStringLiteral
%%
object:
tkObjectBegin tkIdentifier ':' tkIdentifier
property_assignment_list tkObjectEnd
;
property_assignment_list:
property_assignment
| property_assignment_list property_assignment
;
property_assignment:
property '=' value
| object
;
property:
tkIdentifier
| property '.' tkIdentifier
;
value:
atomic_value
| set
| binary_data
| strings
| collection
;
atomic_value:
tkInteger
| tkReal
| tkIdentifier
| tkBoolean
| tkHexNumber
| long_string
;
long_string:
tkStringLiteral
| long_string '+' tkStringLiteral
;
atomic_value_list:
atomic_value
| atomic_value_list ',' atomic_value
;
set:
'[' ']'
| '[' atomic_value_list ']'
;
binary_data:
'{' '}'
| '{' hexa_lines '}'
;
hexa_lines:
tkHexValue
| hexa_lines tkHexValue
;
strings:
'(' ')'
| '(' string_list ')'
;
string_list:
tkStringLiteral
| string_list tkStringLiteral
;
collection:
'<' '>'
| '<' collection_item_list '>'
;
collection_item_list:
collection_item
| collection_item_list collection_item
;
collection_item:
tkIdentifier property_assignment_list tkObjectEnd
;
%%
void yyerror(const char *s, ...) {...}
The problem with this grammar occurs while parsing the binary data. Binary data in the DFM files is nothing but a sequence of hexadecimal characters which never spans more than 80 characters per line. An example of it is:
Picture.Data = {
055449636F6E0000010001002020000001000800A80800001600000028000000
2000000040000000010008000000000000000000000000000000000000000000
...
FF00000000000000000000000000000000000000000000000000000000000000
00000000FF000000FF000000FF00000000000000000000000000000000000000
00000000}
As you can see, this element lacks any markers, so the strings clash with other elements. In the example above, the first line returns the proper token, tkHexValue. The second, however, returns a tkInteger token and the third a tkIdentifier token. So when the parsing comes, it fails with a syntax error, because binary data is composed only of tkHexValue tokens.
My first workaround was to require integers to have a maximum length (which helped in all but the last line of the binary data). The second was to move the tkHexValue token above tkIdentifier, but that means I will no longer have identifiers like F0.
I was wondering if there is any way to fix this grammar?
OK, I solved this one. I needed to define a state so that tkHexValue is only returned while reading binary data. In the preamble part of the lexer I added
%x BINARY
and modified the following rules:
"{" {BEGIN BINARY; return yytext[0];}
<BINARY>"}" {BEGIN INITIAL; return yytext[0];}
<BINARY>[ \t\r\n] { /* ignore whitespace */ }
And that was all!
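Note: BINARY is declared with %x, which makes it an exclusive state, so rules without a start-condition prefix stop matching inside it. The hexadecimal rule itself therefore also needs the prefix to keep being returned while reading the binary data; roughly:
<BINARY>[0-9A-F]+ { /* hex data between the braces */ return tkHexValue; }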

Parsing string interpolation in ANTLR

I'm working on a simple string manipulation DSL for internal purposes, and I would like the language to support string interpolation as it is used in Ruby.
For example:
name = "Bob"
msg = "Hello ${name}!"
print(msg) # prints "Hello Bob!"
I'm attempting to implement my parser in ANTLRv3, but I'm pretty inexperienced with using ANTLR so I'm unsure how to implement this feature. So far, I've specified my string literals in the lexer, but in this case I'll obviously need to handle the interpolation content in the parser.
My current string literal grammar looks like this:
STRINGLITERAL : '"' ( StringEscapeSeq | ~( '\\' | '"' | '\r' | '\n' ) )* '"' ;
fragment StringEscapeSeq : '\\' ( 't' | 'n' | 'r' | '"' | '\\' | '$' | ('0'..'9')) ;
Moving the string literal handling into the parser seems to make everything else stop working as it should. Cursory web searches didn't yield any information. Any suggestions as to how to get started on this?
I'm no ANTLR expert, but here's a possible grammar:
grammar Str;
parse
: ((Space)* statement (Space)* ';')+ (Space)* EOF
;
statement
: print | assignment
;
print
: 'print' '(' (Identifier | stringLiteral) ')'
;
assignment
: Identifier (Space)* '=' (Space)* stringLiteral
;
stringLiteral
: '"' (Identifier | EscapeSequence | NormalChar | Space | Interpolation)* '"'
;
Interpolation
: '${' Identifier '}'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
EscapeSequence
: '\\' SpecialChar
;
SpecialChar
: '"' | '\\' | '$'
;
Space
: (' ' | '\t' | '\r' | '\n')
;
NormalChar
: ~SpecialChar
;
As you notice, there are a couple of (Space)*-es inside the example grammar. This is because stringLiteral is a parser rule instead of a lexer rule. Therefore, when tokenizing the source file, the lexer cannot know whether a white space is part of a string literal or is just a space inside the source file that can be ignored.
I tested the example with a little Java class and all worked as expected:
/* the same grammar, but now with a bit of Java code in it */
grammar Str;
@parser::header {
package antlrdemo;
import java.util.HashMap;
}
@lexer::header {
package antlrdemo;
}
@parser::members {
HashMap<String, String> vars = new HashMap<String, String>();
}
parse
: ((Space)* statement (Space)* ';')+ (Space)* EOF
;
statement
: print | assignment
;
print
: 'print' '('
( id=Identifier {System.out.println("> "+vars.get($id.text));}
| st=stringLiteral {System.out.println("> "+$st.value);}
)
')'
;
assignment
: id=Identifier (Space)* '=' (Space)* st=stringLiteral {vars.put($id.text, $st.value);}
;
stringLiteral returns [String value]
: '"'
{StringBuilder b = new StringBuilder();}
( id=Identifier {b.append($id.text);}
| es=EscapeSequence {b.append($es.text);}
| ch=(NormalChar | Space) {b.append($ch.text);}
| in=Interpolation {b.append(vars.get($in.text.substring(2, $in.text.length()-1)));}
)*
'"'
{$value = b.toString();}
;
Interpolation
: '${' i=Identifier '}'
;
Identifier
: ('a'..'z' | 'A'..'Z' | '_') ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
;
EscapeSequence
: '\\' SpecialChar
;
SpecialChar
: '"' | '\\' | '$'
;
Space
: (' ' | '\t' | '\r' | '\n')
;
NormalChar
: ~SpecialChar
;
And a class with a main method to test it all:
package antlrdemo;
import org.antlr.runtime.*;
public class ANTLRDemo {
public static void main(String[] args) throws RecognitionException {
String source = "name = \"Bob\"; \n"+
"msg = \"Hello ${name}\"; \n"+
"print(msg); \n"+
"print(\"Bye \\${for} now!\"); ";
ANTLRStringStream in = new ANTLRStringStream(source);
StrLexer lexer = new StrLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
StrParser parser = new StrParser(tokens);
parser.parse();
}
}
which produces the following output:
> Hello Bob
> Bye \${for} now!
Again, I am no expert, but this (at least) gives you a way to solve it.
HTH.

Resources