I'm trying to get a disambiguation working, one in the same vein as the question I asked a few days ago. In that previous question, there was an undocumented limitation in the language implementation; I'm wondering if there's something similar going on here.
Tests [tuvw]1 are all throwing ambiguity exceptions (BTW: How do you catch those? [Edit: answered]). All of them look like they ought to pass. Note that they have to be unambiguous in order to pass. Neither the priority rule Scheme nor the reserve rules UnknownScheme[23] seem to be removing the ambiguity. There might be some interaction with follow rules I'm not understanding; it might be another limitation or a defect. What's up?
I'm on the unstable branch. Version (from Eclipse): 0.10.0.201806220838
EDIT.
I modified the example code to more clearly highlight what's happening. I removed some redundant tests and the tests that were behaving correctly. I expanded some possibly-verbose diagnostics. I changed the exposition above to match. Newer results follow.
It looks like there are two different things at play here. "http" is being accepted (correctly) by both KnownScheme and UnknownScheme in tests s1[ab]. It seems to be behaving as if the priority declaration in Scheme just isn't functioning, as if > is being substituted with |.
In the other case, tests s1[cde] are failing, but s1f is passing. This looks even more like a defect. It's possible to reserve a single keyword, apparently, but not more than one. Since the various reservation declarations are failing, it's no surprise that there's an ambiguity when put into an alternative.
module ssce
import analysis::grammars::Ambiguity;
import IO;
lexical Scheme = AnyScheme ;
lexical AnyScheme = KnownScheme > UnknownScheme ;
lexical AnySchemeChar = [a-z*];
lexical KnownScheme = KnownSchemes !>> AnySchemeChar ;
lexical KnownSchemes = "http" | "https" | "http*" | "javascript" ;
lexical UnknownScheme = UnknownFixedScheme | UnknownWildScheme ;
lexical UnknownFixedScheme = [a-z]+ !>> AnySchemeChar ;
lexical UnknownWildScheme = [a-z]* '*' AnySchemeChar* !>> AnySchemeChar ;
lexical Scheme2 = UnknownScheme2 | KnownScheme ;
lexical UnknownScheme2 = UnknownScheme \ KnownSchemes ;
lexical Scheme3 = UnknownScheme3 | KnownScheme ;
lexical UnknownScheme3 = AnySchemeChar+ \ KnownSchemes ;
lexical Scheme4 = UnknownScheme4 | KnownScheme ;
lexical UnknownScheme4 = AnySchemeChar+ \ ("http"|"https") ;
lexical Scheme5 = UnknownScheme5 | KnownScheme ;
lexical UnknownScheme5 = AnySchemeChar+ \ "http" ;
test bool t1() { return parseAccept( #Scheme, "http" ); }
test bool u1() { return parseAccept( #Scheme2, "http" ); }
test bool v1() { return parseAccept( #Scheme3, "http" ); }
test bool w1() { return parseAccept( #Scheme4, "http" ); }
test bool x1() { return parseAccept( #Scheme5, "http" ); }
test bool s1a() { return parseAccept( #KnownScheme, "http" ); }
test bool s1b() { return parseAccept( #UnknownScheme, "http" ); }
test bool s1c() { return parseReject( #UnknownScheme2, "http" ); }
test bool s1d() { return parseReject( #UnknownScheme3, "http" ); }
test bool s1e() { return parseReject( #UnknownScheme4, "http" ); }
test bool s1f() { return parseReject( #UnknownScheme5, "http" ); }
bool verbose = false;
bool parseAccept( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return false;
}
catch Ambiguity(loc l, str a, str b):
{
if (verbose)
{
println("[Ambiguity] " + a + ", " + b);
Tree tt = parse(begin, input, allowAmbiguity=true) ;
iprintln(tt);
list[Message] m = diagnose(tt) ;
println( ToString(m) );
}
fail;
}
return true;
}
bool parseReject( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return true;
}
return false;
}
str ToString( list[Message] msgs ) =
( ToString( msgs[0] ) | it + "\n" + ToString(m) | m <- msgs[1..] );
str ToString( Message msg)
{
switch(msg)
{
case error(str s, loc _): return "error: " + s;
case warning(str s, loc _): return "warning: " + s;
case info(str s, loc _): return "info: " + s;
}
return "";
}
I've been making this ambiguity diagnostics tool, and here's what it came up with for your grammar. It seems you've discovered more things we need to document and write little checkers for.
Well-formedness of \ is murky.
The problem is that the \ operator accepts only literal strings, as in A \ "a" \ "b", or a keyword non-terminal defined like keyword Hello = "a" | "b"; and used as A \ Hello, and nothing else. So A \ ("a" | "b") is not allowed, and neither are indirect non-terminals such as A \ Hello where lexical Hello = Bye; lexical Bye = "if" | "then";. Only the very simplest forms are supported.
Well-formedness of follow-restrictions
Similar rules apply to !>>: no non-terminal may appear to the right of the !>> operator.
So [a-z]+ !>> [a-z] or [a-z]+ !>> "*" is fine, but not [a-z]+ !>> myCharClass where lexical myCharClass = [a-z];
Names for character classes are on our todo list, but they will not be like non-terminals. They will be more like aliases that are substituted at parser-generation time.
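For readers who know regular expressions better than Rascal, a follow restriction behaves roughly like a negative lookahead. The Java sketch below is only an analogy under that assumption (the class and method names are made up for illustration); it shows that [a-z]+ !>> [a-z*] refuses to match a word that is directly followed by another scheme character, instead of settling for a shorter match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Rascal: models a follow restriction
// with a regex negative lookahead.
final class FollowRestriction {
    // [a-z]+ !>> [a-z*]  is modelled as  [a-z]+ not followed by [a-z*]
    private static final Pattern WORD = Pattern.compile("[a-z]+(?![a-z*])");

    // Returns the word matched at the start of input, or null if there is none.
    static String matchAtStart(String input) {
        Matcher m = WORD.matcher(input);
        return m.lookingAt() ? m.group() : null;
    }
}
```

With this model, "http://x" yields "http", while "http*" yields no match at all, because every candidate prefix is followed by a character from the restricted class.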
Whole words
Keyword reservation only works if you subtract the sentence from the whole word. Sometimes you have to group non-terminals to get this right:
lexical Ex = ([a-z]+ "*") \ "https*" instead of lexical Ex = [a-z]+ "*" \ "https*"
The latter would try to subtract the "https*" language from the "*" language. The former works.
Case-insensitivity
'if' is defined by lexical 'if' = [iI][fF];
"if" is defined by lexical "if" = [i][f];
'*' is defined by lexical '*' = [*];
"*" is defined by lexical "*" = [*];
New grammar
I used a random generator to generate all the ambiguities I could find, and resolved them step by step by adding keyword reservation:
lexical Scheme = AnyScheme ;
lexical AnyScheme = KnownScheme > UnknownScheme ;
lexical AnySchemeChar = [a-z*];
lexical KnownScheme = KnownSchemes !>> AnySchemeChar ;
keyword KnownSchemes = "http" | "https" | "http*" | "javascript" ;
lexical UnknownScheme = UnknownFixedScheme | UnknownWildScheme ;
lexical UnknownFixedScheme = [a-z]+ !>> AnySchemeChar \ KnownSchemes ;
lexical UnknownWildScheme = ([a-z]* '*' AnySchemeChar*) !>> AnySchemeChar \ KnownSchemes ;
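To make the intended outcome concrete outside Rascal, here is a minimal Java sketch of how a whole word should be classified once the reservation works. The class, enum, and method names are hypothetical, and it tests whole strings rather than parsing, so it models only the effect of \ KnownSchemes (check the reserved set first, then the general shapes) and not Rascal's follow restrictions.

```java
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical model of the disambiguated grammar; not generated by Rascal.
final class SchemeClassifier {
    enum Kind { KNOWN, UNKNOWN_FIXED, UNKNOWN_WILD, INVALID }

    // Mirrors: keyword KnownSchemes = "http" | "https" | "http*" | "javascript" ;
    private static final Set<String> KNOWN = Set.of("http", "https", "http*", "javascript");
    // Mirrors: [a-z]+
    private static final Pattern FIXED = Pattern.compile("[a-z]+");
    // Mirrors: [a-z]* '*' AnySchemeChar*
    private static final Pattern WILD = Pattern.compile("[a-z]*\\*[a-z*]*");

    static Kind classify(String word) {
        // Reserved words win, which is what "\ KnownSchemes" achieves in the grammar.
        if (KNOWN.contains(word)) return Kind.KNOWN;
        if (FIXED.matcher(word).matches()) return Kind.UNKNOWN_FIXED;
        if (WILD.matcher(word).matches()) return Kind.UNKNOWN_WILD;
        return Kind.INVALID;
    }
}
```

Each word falls into exactly one category here, which is the property the ambiguity tests in the question check for.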
Related
I gave ANTLR4 the following parser and lexer grammars in separate files (together they form a simple grammar for BNF grammars)
parser grammar BNFParser;
options {tokenVocab = BNFLexer;}
compileUnit
: grammar_rule+
;
grammar_rule : NON_TERMINAL COLON (OR? grammar_rule_alternative)* SEMICOLON
;
grammar_rule_alternative : (NON_TERMINAL|TERMINAL)+
;
and
lexer grammar BNFLexer;
TERMINAL : [A-Z][A-Za-z0-9_]*;
NON_TERMINAL : [a-z][A-Za-z0-9_]*;
OR : '|';
COLON : ':';
SEMICOLON : ';';
WS
: [ \t\r\n]+ -> skip
;
The main program
private static void Main(string[] args) {
StreamReader reader = new StreamReader(args[0]);
AntlrInputStream stream = new AntlrInputStream(reader);
BNFLexer lexer = new BNFLexer(stream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
BNFParser parser = new BNFParser(tokens);
IParseTree root = parser.compileUnit();
Console.WriteLine(root.ToStringTree());
}
I also supplied the following test file for testing the grammar
compileunit : x a
;
x : S b
;
S : compileunit f
;
Note from the lexer grammar that non-terminals begin with a lowercase letter, while terminals begin with an uppercase letter. The test input contains an error: the third rule uses a capital letter (S) to define the non-terminal S. I expected this to be reported as an error. Instead, parsing succeeds by consuming the first two rules and ignoring the third rule for S, without reporting any error. I also looked at the generated files and noticed the following
try {
EnterOuterAlt(_localctx, 1);
{
State = 7;
_errHandler.Sync(this);
_la = _input.La(1);
do {
{
{
State = 6; grammar_rule();
}
}
State = 9;
_errHandler.Sync(this);
_la = _input.La(1);
} while ( _la==NON_TERMINAL );
}
}
catch (RecognitionException re) {
_localctx.exception = re;
_errHandler.ReportError(this, re);
_errHandler.Recover(this, re);
}
The above code shows that the parser expects a non-terminal symbol at the start of a grammar_rule, which is what I expect. But what happens when that is not the case? Another odd thing is that the CommonTokenStream object holding the tokens recognized by the lexer contains only the tokens up to the end of the second rule, and none of the tokens of the third rule (S). Is this proper behaviour?
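Add an EOF token to the end of your main rule (compileUnit). That forces the parser to consume all input up to EOF and to report an error if the input did not fully match. Without it, compileUnit is free to stop after the last grammar_rule it can match, which is why the third rule is silently ignored. The token stream is filled on demand, so it only holds the tokens the parser actually asked for.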
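In the grammar this amounts to writing compileUnit : grammar_rule+ EOF ;. To illustrate why the trailing EOF matters, here is a small hand-written Java sketch that is independent of ANTLR. The class and method names are hypothetical, and the rule body is simplified to "anything up to the semicolon"; it only demonstrates the difference between matching a prefix and matching the whole input.

```java
import java.util.List;

// Hypothetical hand-written matcher; it does not use the ANTLR runtime.
final class BnfPrefixParser {
    // Token kinds follow the lexer grammar in the question:
    // lowercase start = NON_TERMINAL, uppercase start = TERMINAL.
    static String kind(String tok) {
        if (tok.equals(":")) return "COLON";
        if (tok.equals(";")) return "SEMICOLON";
        if (tok.equals("|")) return "OR";
        return Character.isLowerCase(tok.charAt(0)) ? "NON_TERMINAL" : "TERMINAL";
    }

    // Tries to match one grammar_rule at pos. Returns the position just after
    // the rule, or -1 if no rule starts here.
    static int rule(List<String> toks, int pos) {
        if (pos >= toks.size() || !kind(toks.get(pos)).equals("NON_TERMINAL")) return -1;
        pos++;
        if (pos >= toks.size() || !kind(toks.get(pos)).equals("COLON")) return -1;
        pos++;
        // Simplification: accept any symbols or '|' until the closing semicolon.
        while (pos < toks.size() && !kind(toks.get(pos)).equals("SEMICOLON")) {
            if (kind(toks.get(pos)).equals("COLON")) return -1;
            pos++;
        }
        return pos < toks.size() ? pos + 1 : -1;
    }

    // compileUnit : grammar_rule+ ;      (requireEof == false)
    // compileUnit : grammar_rule+ EOF ;  (requireEof == true)
    static boolean compileUnit(List<String> toks, boolean requireEof) {
        int pos = rule(toks, 0);
        if (pos < 0) return false;
        while (true) {
            int next = rule(toks, pos);
            if (next < 0) break;   // the loop simply stops at the first non-rule
            pos = next;
        }
        return !requireEof || pos == toks.size();
    }
}
```

Fed the tokens of the test file from the question, the version without the EOF check accepts after two rules, while the version with the check rejects because the tokens of the S rule are left over.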
I'm parsing a script language that defines two types of statements; control statements and non control statements. Non control statements are always ended with ';', while control statements may end with ';' or EOL ('\n'). A part of the grammar looks like this:
script
: statement* EOF
;
statement
: control_statement
| no_control_statement
;
control_statement
: if_then_control_statement
;
if_then_control_statement
: IF expression THEN end_control_statment
( statement ) *
( ELSEIF expression THEN end_control_statment ( statement )* )*
( ELSE end_control_statment ( statement )* )?
END IF end_control_statment
;
no_control_statement
: sleep_statement
;
sleep_statement
: SLEEP expression END_STATEMENT
;
end_control_statment
: END_STATEMENT
| EOL
;
END_STATEMENT
: ';'
;
ANY_SPACE
: ( LINE_SPACE | EOL ) -> channel(HIDDEN)
;
EOL
: [\n\r]+
;
LINE_SPACE
: [ \t]+
;
In all other aspects of the script language I never care about EOL, so I use the normal lexer rules to hide whitespace.
This works in every case except where I need an EOL to terminate a control statement: with the grammar above, every EOL is hidden and never reaches the control statement rules.
Is there a way to change my grammar so that I can skip all EOL but the ones needed to terminate parts of my control statements?
Found one way to handle this.
The idea is to divert EOL into one hidden channel and the other things I don't want to see (such as spaces and comments) into another hidden channel. Then, at the point where an EOL is supposed to show up, I use some code to walk back over the tokens and examine the channels of the previous tokens (they have already been consumed). If I find something on the EOL channel before I run into something from the ordinary channel, the EOL is accepted.
It looks like this:
Changed the lexer rules:
@lexer::members {
public static int EOL_CHANNEL = 1;
public static int OTHER_CHANNEL = 2;
}
...
EOL
: '\r'? '\n' -> channel(EOL_CHANNEL)
;
LINE_SPACE
: [ \t]+ -> channel(OTHER_CHANNEL)
;
I also diverted all other HIDDEN channels (comments) to the OTHER_CHANNEL.
Then I changed the rule end_control_statment:
end_control_statment
: END_STATEMENT
| { isEOLPrevious() }?
;
and added
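@parser::members {
    public static int EOL_CHANNEL = 1;
    public static int OTHER_CHANNEL = 2;

    // True if an EOL token sits between the current token and the
    // previous token on the ordinary channel.
    boolean isEOLPrevious()
    {
        int idx = getCurrentToken().getTokenIndex();
        int ch = OTHER_CHANNEL;
        // Walk back over spaces and comments; stop at the start of the stream
        // instead of reading index -1.
        while (ch == OTHER_CHANNEL && idx > 0)
        {
            ch = getTokenStream().get(--idx).getChannel();
        }
        // Channel 1 is only carrying EOL, no need to check token itself
        return (ch == EOL_CHANNEL);
    }
}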
One could stick to the ordinary hidden channel, but then both the channel and the token type have to be checked while walking back, so this is maybe a bit easier...
I hope this helps someone else dealing with this kind of issue...
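The look-back logic can be tried out without ANTLR. The sketch below is a plain Java model of it, assuming the token stream is represented as a list of channel numbers; the class name, the list representation, and the explicit current index are made up for illustration and are not part of the ANTLR API.

```java
import java.util.List;

// Hypothetical stand-alone model of isEOLPrevious(); no ANTLR types involved.
final class EolLookback {
    static final int DEFAULT_CHANNEL = 0;
    static final int EOL_CHANNEL = 1;
    static final int OTHER_CHANNEL = 2;

    // channels.get(i) is the channel of token i; current is the index of the
    // token the parser is about to consume.
    static boolean isEolPrevious(List<Integer> channels, int current) {
        for (int idx = current - 1; idx >= 0; idx--) {
            int ch = channels.get(idx);
            if (ch == OTHER_CHANNEL) continue; // skip spaces and comments
            return ch == EOL_CHANNEL;          // first other token decides
        }
        return false; // reached the start of the stream without seeing an EOL
    }
}
```

For example, with channels [0, 2, 1, 2, 0] and the current token at index 4, the walk skips the space at index 3 and finds the EOL at index 2.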
I have run into a problem when I tried to parse a stacked arithmetic comparison expression:
"1<2<3<4<5"
into a logical Tree of Conjunctions:
CONJUNCTION(COMPARISON(1,2,<) COMPARISON(2,3,<) COMPARISON(3,4,<) COMPARISON(4,5,<))
Is there a way in ANTLR3 tree rewrite rules to iterate through the matched tokens and build the result tree from them in the target language (I'm using Java)? That way I could make COMPARISON nodes from elements x and x-1 of the matched 'addition' tokens. I know I can reference the last result of a rule, but that way I'd only get nested COMPARISON nodes, which is not what I want.
//This is how I approached the problem; sadly, it doesn't do what I would like yet.
fragment COMPARISON:;
operator
:
('<'|'>'|'<='|'>='|'=='|'!=')
;
comparison
@init{boolean secondpart = false;}
:
e=addition (operator {secondpart=true;} k=addition)*
-> {secondpart}? ^(COMPARISON ^(VALUES addition*) ^(OPERATORS operator*))
-> $e
;
//Right now what this does is:
tree=(COMPARISON (VALUES (INTEGERVALUE (VALUE 1)) (INTEGERVALUE (VALUE 2)) (INTEGERVALUE (VALUE 3)) (INTEGERVALUE (VALUE 4)) (INTEGERVALUE (VALUE 5))) (OPERATORS < < < <))
//The label for the CONJUNCTION TreeNode that i would like to use:
fragment CONJUNCTION:;
I came up with a nasty solution to this problem by writing actual tree building java code:
grammar testgrammarforcomparison;
options {
language = Java;
output = AST;
}
tokens
{
CONJUNCTION;
COMPARISON;
OPERATOR;
ADDITION;
}
WS
:
('\t' | '\f' | ' ' | '\r' | '\n' )+
{$channel = HIDDEN;}
;
comparison
@init
{
List<Object> additions = new ArrayList<Object>();
List<Object> operators = new ArrayList<Object>();
boolean secondpart = false;
}
:
(( e=addition {additions.add(e.getTree());} ) ( op=operator k=addition {additions.add(k.getTree()); operators.add(op.getTree()); secondpart = true;} )*)
{
if(secondpart)
{
root_0 = (Object)adaptor.nil();
Object root_1 = (Object)adaptor.nil();
root_1 = (Object)adaptor.becomeRoot(
(Object)adaptor.create(CONJUNCTION, "CONJUNCTION")
, root_1);
Object lastaddition = additions.get(0);
for(int i=1;i<additions.size();i++)
{
Object root_2 = (Object)adaptor.nil();
root_2 = (Object)adaptor.becomeRoot(
(Object)adaptor.create(COMPARISON, "COMPARISON")
, root_2);
adaptor.addChild(root_2, additions.get(i-1));
adaptor.addChild(root_2, operators.get(i-1));
adaptor.addChild(root_2, additions.get(i));
adaptor.addChild(root_1, root_2);
}
adaptor.addChild(root_0, root_1);
}
else
{
root_0 = (Object)adaptor.nil();
adaptor.addChild(root_0, e.getTree());
}
}
;
/** lowercase letters */
fragment LOWCHAR
: 'a'..'z';
/** uppercase letters */
fragment HIGHCHAR
: 'A'..'Z';
/** numbers */
fragment DIGIT
: '0'..'9';
fragment LETTER
: LOWCHAR
| HIGHCHAR
;
IDENTIFIER
:
LETTER (LETTER|DIGIT)*
;
addition
:
IDENTIFIER ->^(ADDITION IDENTIFIER)
;
operator
:
('<'|'>') ->^(OPERATOR '<'* '>'*)
;
parse
:
comparison EOF
;
For input
"DATA1 < DATA2 > DATA3"
This outputs tree such as:
If you guys know any better solutions, please tell me about them
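The pairing step in the tree-building action above can be looked at in isolation. This is a minimal Java sketch that works on plain strings instead of ANTLR tree nodes; the class and method names are hypothetical, and the output format simply copies the CONJUNCTION(...) notation from the question.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical string-based model of the pairing loop; no ANTLR adaptor involved.
final class ChainedComparison {
    // Builds CONJUNCTION(COMPARISON(a,b,op) ...) from n operands and n-1 operators.
    static String toConjunction(List<String> operands, List<String> operators) {
        if (operators.size() != operands.size() - 1) {
            throw new IllegalArgumentException("need exactly one operator between operands");
        }
        if (operators.isEmpty()) return operands.get(0); // single operand passes through
        List<String> parts = new ArrayList<>();
        for (int i = 1; i < operands.size(); i++) {
            // Pair each operand with its left neighbour and the operator between them.
            parts.add("COMPARISON(" + operands.get(i - 1) + "," + operands.get(i)
                    + "," + operators.get(i - 1) + ")");
        }
        return "CONJUNCTION(" + String.join(" ", parts) + ")";
    }
}
```

Keeping operands and operators in two parallel lists is the same design as in the grammar action: operator i-1 always sits between operands i-1 and i.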
I'm trying to write a simple parser for a meta programming language.
Everything works fine, but I want ';' to be the mandatory statement delimiter: a newline should not end a statement, and the semicolon must not be omitted.
So this is the expected behaviour:
// good code
v1 = v2;
v3 = 23;
should parse without errors
But:
// bad code
v1 = v2
v3 = 23;
should fail
Yet if I remove the 'empty' rule from separator, both inputs fail like this:
ID to ID
Error detected in parsing: syntax error, unexpected ID, expecting SEMICOLON
;
If I leave the 'empty' rule active, then both codes are accepted, which is not desired.
ID to ID // should raise error
ID to NUM;
Any help is welcome here, as most tutorials do not cover delimiters at all.
Here is a simplified version of my parser/lexer:
parser.l:
%{
#include "parser.tab.h"
#include<stdio.h>
%}
num [0-9]
alpha [a-zA-Z_]
alphanum [a-zA-Z_0-9]
comment "//"[^\n]*"\n"
string \"[^\"]*\"
whitespace [ \t\n]
%x ML_COMMENT
%%
<INITIAL>"/*" {BEGIN(ML_COMMENT); printf("/*");}
<ML_COMMENT>"*/" {BEGIN(INITIAL); printf("*/");}
<ML_COMMENT>[.]+ { }
<ML_COMMENT>[\n]+ { printf("\n"); }
{comment}+ {printf("%s",yytext);}
{alpha}{alphanum}+ { yylval.str= strdup(yytext); return ID;}
{num}+ { yylval.str= strdup(yytext); return NUM;}
{string} { yylval.str= strdup(yytext); return STRING;}
';' {return SEMICOLON;}
"=" {return ASSIGNMENT;}
" "+ { }
<<EOF>> {exit(0); /* this is suboptimal */}
%%
parser.y:
%{
#include<stdio.h>
#include<string.h>
%}
%error-verbose
%union{
char *str;
}
%token <str> ID
%token <str> NUM
%token <str> STRING
%left SEMICOLON
%left ASSIGNMENT
%start input
%%
input: /* empty */
| expression separator input
;
expression: assign
| error {}
;
separator: SEMICOLON
| empty
;
empty:
;
assign: ID ASSIGNMENT ID { printf("ID to ID"); }
| ID ASSIGNMENT STRING { printf("ID to STRING"); }
| ID ASSIGNMENT NUM { printf("ID to NUM"); }
;
%%
yyerror(char* str)
{
printf("Error detected in parsing: %s\n", str);
}
main()
{
yyparse();
}
Compiled like this:
$>flex -t parser.l > parser.lex.yy.c
$>bison -v -d parser.y
$>cc parser.tab.c parser.lex.yy.c -lfl -o parser
Never mind... the problematic line was this one:
';' {return SEMICOLON;}
which needed to be changed to
";" {return SEMICOLON;}
Now the behaviour is correct. :-)
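The reason is that flex only treats double quotes as quoting characters; a single quote is an ordinary character, so the pattern ';' matches the three-character sequence quote, semicolon, quote. A bare ; in the input then matches no rule and falls through to flex's default rule, which just echoes it. Java's regex engine happens to treat single quotes the same way, so the effect can be shown with a small sketch (this is an analogy in Java, not flex itself, and the class name is made up):

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows that ' is an ordinary pattern character.
final class QuoteDemo {
    // The pattern text ';' is quote, semicolon, quote: three literal characters.
    static final Pattern SINGLE_QUOTED = Pattern.compile("';'");
    // A pattern that matches just the semicolon.
    static final Pattern SEMICOLON = Pattern.compile(";");

    // True if the whole input matches the pattern.
    static boolean matchesWhole(Pattern p, String input) {
        return p.matcher(input).matches();
    }
}
```

Here SINGLE_QUOTED matches the input ';' including its quotes but never a lone semicolon, which is why the SEMICOLON token was never returned to the parser.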
I use this BNF to parse my script:
{identset} = {ASCII} - {"\{\}}; //<--all ascii charset except '\"' '{' and '}'
{strset} = {ASCII} - {"};
ident = {identset}*;
str = {strset}*;
node ::= ident "{" nodes "}" | //<--entry point
"\"" str "\"" |
ident;
nodes ::= node nodes |
node;
It correctly parses the following text into a tree structure:
doc {
title { "some title goes here" }
refcode { "SDS-1" }
rev { "1.0" }
revdate { "04062010" }
body {
"this is the body of the document
all text should go here"
chapter { "some inline section" }
"text again"
}
}
My question is, how do I handle escape sequences inside a string literal:
"some text of \"quotation\" should escape"
Define str as:
str = ( strset | strescape ) *;
with
strescape = { \\ } {\" } ;
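For comparison, this is roughly what a hand-written scanner does with the same idea: the body of a string is a run of ordinary characters and escape pairs, and only an unescaped quote closes it. The Java sketch below is an illustration under a few assumptions: the class and method names are made up, and it also accepts \\ as an escape, which the grammar above does not define.

```java
// Hypothetical hand-written scanner; not produced by any parser generator.
final class StringLiteralScanner {
    // Reads a double-quoted literal starting at src[start] and returns its
    // unescaped contents. Only \" and \\ are treated as escapes (assumption).
    static String scan(String src, int start) {
        if (start >= src.length() || src.charAt(start) != '"') {
            throw new IllegalArgumentException("expected opening quote");
        }
        StringBuilder out = new StringBuilder();
        int i = start + 1;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '"') return out.toString();   // unescaped quote closes the literal
            if (c == '\\' && i + 1 < src.length()) {
                char next = src.charAt(i + 1);
                if (next == '"' || next == '\\') {  // escape pair
                    out.append(next);
                    i += 2;
                    continue;
                }
            }
            out.append(c);                          // ordinary character
            i++;
        }
        throw new IllegalArgumentException("unterminated string literal");
    }
}
```

The important detail is that the escape check runs before the backslash can be consumed as an ordinary character; in grammar terms, the ordinary character set should not itself contain the backslash.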