antlr - specify a parser rule with any sequence - parsing

I have a section of an ANTLR grammar that goes like this:
mainfilter: mandatoryfilter (optionalfilter1)? (optionalfilter2)? (optionalfilter3)? ;
mandatoryfilter: 'NAME' '=' ID;
optionalfilter1: 'VALUE1' EQ ID;
optionalfilter2: 'VALUE2' EQ ID;
optionalfilter3: 'VALUE3' EQ ID;
EQ: '=' ;
ID: [A-Za-z0-9]+ ;
WS: [ \t\r\n]+ -> skip ; // also, spaces and other whitespace are skipped
My requirement is that the "optionalfilter" rules can occur in any order.
One approach I can think of is to rewrite the rule as below and then validate it using a listener:
mainfilter: mandatoryfilter (optionalfilter1|optionalfilter2|optionalfilter3)*;
Another way to achieve this is to spell out every permutation as a separate alternative in the parser rule, but that does not scale well if the number of optional filters increases.
Sample input:
NAME = BOB VALUE1=X VALUE2=Y VALUE3 = Z
NAME = BILL VALUE3=X VALUE1=Y VALUE2 = Z
My grammar successfully parses the first input but not the second one.
So, is there an elegant way to handle this in the grammar itself?

So is there an elegant way to handle this in my grammar itself?
No.
Usually, zero or more occurrences are matched, and after parsing it is validated that each filter occurs at most once.
Take, for example, the Java Language Specification, which defines that a class declaration can have zero or more class modifiers (the {ClassModifier} part):
NormalClassDeclaration:
{ClassModifier} class Identifier [TypeParameters] [Superclass] [Superinterfaces] ClassBody
ClassModifier:
(one of)
Annotation public protected private abstract static final strictfp
which would match public public class Foo {}. Such duplicated modifiers are rejected at a stage after parsing.
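For the rewritten rule mainfilter: mandatoryfilter (optionalfilter1|optionalfilter2|optionalfilter3)*; that post-parse check can be a short listener. Below is a minimal sketch, assuming the grammar is called Filters, so the generated names (FiltersBaseListener, FiltersParser) and the UniqueFilterListener class are placeholders for whatever ANTLR generates for the real grammar:
public class UniqueFilterListener extends FiltersBaseListener {
    @Override
    public void exitMainfilter(FiltersParser.MainfilterContext ctx) {
        // With the (a|b|c)* rewrite, each accessor returns the list of matches;
        // reject the input if any optional filter occurs more than once.
        if (ctx.optionalfilter1().size() > 1
                || ctx.optionalfilter2().size() > 1
                || ctx.optionalfilter3().size() > 1) {
            throw new IllegalStateException("an optional filter occurs more than once");
        }
    }
}
The listener is then run over the parse tree after parsing, e.g. ParseTreeWalker.DEFAULT.walk(new UniqueFilterListener(), parser.mainfilter()).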

Related

Make lexer consider parser before determining tokens?

I'm writing a lexer and parser in ocamllex and ocamlyacc as follows. function_name and table_name are the same regular expression, i.e., a string containing only English letters. The only way to determine whether a string is a function_name or a table_name is to check its surroundings. For example, if such a string is surrounded by [ and ], then we know that it is a table_name. Here is the current code:
In lexer.mll,
... ...
let function_name = ['a'-'z' 'A'-'Z']+
let table_name = ['a'-'z' 'A'-'Z']+
rule token = parse
| function_name as s { FUNCTIONNAME s }
| table_name as s { TABLENAME s }
... ...
In parser.mly:
... ...
main:
| LBRACKET TABLENAME RBRACKET { Table $2 }
... ...
Because I wrote | function_name as s { FUNCTIONNAME s } before | table_name as s { TABLENAME s }, the above code fails to parse [haha]: the lexer first considers haha to be a function_name, and then the parser cannot find any corresponding rule for it. If the lexer considered haha to be a table_name, the parser would match [haha] as a table.
One workaround is to be more precise in the lexer. For example, we could define let table_name_with_brackets = '[' ['a'-'z' 'A'-'Z']+ ']' and | table_name_with_brackets as s { TABLENAMEWITHBRACKETS s } in the lexer. But I would like to know whether there are any other options. Is it not possible to make the lexer and parser work together to determine the tokens and the reduction?
You should avoid trying to get the lexer to do the parser's work. The lexer should just identify lexemes; it should not try to figure out where a lexeme fits into the syntax. So in your (simplified) example, there should be only one lexical type, name. The parser will figure it out from there.
But it seems, from the comments, that in the unsimplified original, the two patterns are overlapping rather than identical. That's more annoying, although it's only slightly more complicated. Basically, you need to separate out the common pattern as one lexical type, and then add the additional matches as one or two other lexical types (depending on whether or not one pattern is a strict superset of the other).
That might not be too difficult, depending on the precise relationship between the two patterns. You might be able to find a very simple solution by writing the patterns in the correct order, for example, because of the longest match rule:
If several regular expressions match a prefix of the input, the “longest match” rule applies: the regular expression that matches the longest prefix of the input is selected. In case of tie, the regular expression that occurs earlier in the rule is selected.
Most of the time, that's all it takes: first define the intersection of the two patterns as a base lexeme, and then add the full lexical patterns of each contextual type to provide additional matches. Your parser will then have to match name | function_name in one context and name | table_name in the other context. But that's not too bad.
Where it will fail is when an input stream cannot be unambiguously divided into lexemes. For example, suppose that in a function context a name could include a ? character, but in a table context the ? is a valid postfix operator. In that case, you have to actively prevent foo? from being analysed as a single token in the table context, which means that the lexer does have to be aware of parser context.

Token with different interpretations (i.e. keyword and identifier)

I am writing a grammar with a lot of case-insensitive keywords in ANTLR4. I collected some example files for the format that I am trying to test-parse, and some of them use tokens that exist as keywords as identifiers in other places. For example, there is a CORE keyword which in other places is used as an ID for a structure from user input. Here are some parts of my grammar:
fragment A : [aA]; // match either an 'a' or 'A'
fragment B : [bB];
fragment C : [cC];
[...]
CORE: C O R E ;
[...]
IDSTRING: [a-zA-Z_] [a-zA-Z0-9_]*;
id: IDSTRING ;
The error thrown is line 7982:8 mismatched input 'core' expecting IDSTRING, because the user input is intended as an IDSTRING but is always consumed by the keyword rule. In the input the word appears both as a keyword and as an id, like this:
MACRO oa12f01
CLASS CORE ; #here it is a KEYWORD
[...]
SITE core ; #here it is a ID
Is there a way I can let users use some keywords as identifiers, by changing my grammar to somehow "cast" the token to IDSTRING in rules like this, or is that a false hope with generated (as opposed to hand-written) parsers?
You can simply list the keywords that are allowed as identifiers as alternatives in the id rule:
id: IDSTRING | CORE | ... ;
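As a quick sanity check, a small driver like the following should then accept core where an id is expected; core still lexes as a CORE token, but the parser now accepts it in the id rule. The names MyFormatLexer/MyFormatParser and the site rule are hypothetical stand-ins for whatever the actual grammar generates:
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;

public class IdKeywordDemo {
    public static void main(String[] args) {
        // hypothetical rule in the grammar: site : 'SITE' id ';' ;
        MyFormatLexer lexer = new MyFormatLexer(CharStreams.fromString("SITE core ;"));
        MyFormatParser parser = new MyFormatParser(new CommonTokenStream(lexer));
        parser.site();
        System.out.println(parser.getNumberOfSyntaxErrors()); // 0 once CORE is an alternative of id
    }
}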

How do I format a grammar rule that requires two tokens to be equal?

How can I write a Yacc grammar that matches two tokens? For example:
START some_random_id
stuff stuff stuff
END some_random_id
I would like to require that some_random_id match in both places for the entire block to match. So it would be something like:
block <- START ID block_body END ID
with the additional requirement that both IDs are equal.
So long as some_random_id is drawn from a set of arbitrary size, this is not possible to do with grammar rules alone; there is a classic mathematical proof of this (the language of a word repeated around a separator, { w c w }, is not context-free). You can only do it with parser action code that checks whether the ids are the same, but this is not very hard. Define the yylval union to have a field id_string that the scanner fills in with its own copy of the matched text (e.g. via strdup, so the two ID tokens do not share a buffer). Then you'll have something like:
%union {
char *id_string;
...
}
%token <id_string> ID KW_START KW_END
%%
...
block : KW_START ID stuff KW_END ID {
          if (strcmp($2, $5) != 0) YYERROR; }

How to back reference AST in custom rewrite action?

I already know the workaround for this problem, but I would really like to use this particular approach, for at least one reason: it should work.
This is a rule taken from "The Definitive ANTLR Reference" by Terence Parr (the book is for ANTLR3):
expr : (INT -> INT) ('+' i=INT -> ^('+' $expr $i) )*;
If INT is not followed by a +, the result is a single INT node; if it is, a subtree is built with the first INT (referred to as $expr) as the left branch.
I would like to build similar rule, yet with custom action:
mult_expr : (pow_expr -> pow_expr )
(op=MUL exr=pow_expr
-> { new BinExpr($op,$mult_expr.tree,$exr.tree) })*;
ANTLR accepts such a rule, but when I run my parser with input (for example) "5 * 3", it gives me an error "line 1:1 missing EOF at '*'5".
QUESTION: how do I use a back reference with a custom rewrite action?
I'd recommend creating your own CommonTreeAdaptor and moving the creation of custom nodes into this CommonTreeAdaptor instead of doing it in your grammar file. For more information on that, see: Extend ANTLR3 AST's
In the case of operators that can have multiple meanings, like the minus sign (binary or unary operator), let your parser rule rewrite the unary operator like this:
grammar X;
...
tokens { U_SUB; }
add_expr
: mult_expr ((SUB | ADD)^ mult_expr)*
;
...
unary_expr
: SUB atom -> ^(U_SUB atom)
| atom
;
...
And then in your implementation of your CommonTreeAdaptor, do something like this:
@Override
public Object create(Token t) {
    ...
    switch(t.getType()) {
        case X.SUB   : /* return a binary tree */
        ...
        case X.U_SUB : /* return a unary tree */
    }
    ...
}
I am a persistent guy, and this idea of creating my custom nodes in one step kept bothering me... ;-)
So, I did it. The crucial points are:
putting EOF! at the end of the "main" rule,
when labeling the tokens, putting the label on each token, not on the group, so (op='*'|op='/'), not op=('*'|'/').
I don't know for sure whether this approach of using grammar rules to immediately create custom nodes is a good idea, but since it solves the problem asked in the question, I am marking this as the solution.
And for the record, the most interesting rule now looks like this:
mult_expr : (exl=pow_expr -> $exl )
((op=MUL|op=IDIV|op=RDIV|op=MOD) exr=pow_expr
-> { new BinaryExpression($op,$exl.tree,$exr.tree) })*;
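The BinaryExpression class is not shown in the question. For completeness, a rough sketch of what such a custom ANTLR3 node might look like, assuming the grammar sets options { output=AST; ASTLabelType=CommonTree; } so that $exl.tree and $exr.tree are CommonTree instances (the constructor signature is an assumption, not the asker's actual class):
import org.antlr.runtime.Token;
import org.antlr.runtime.tree.CommonTree;

// Hypothetical custom AST node: the operator token becomes the node's payload
// and the two operands become its children.
public class BinaryExpression extends CommonTree {
    public BinaryExpression(Token op, CommonTree left, CommonTree right) {
        super(op);
        addChild(left);
        addChild(right);
    }
}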

How to define syntax

I am new to language processing and I want to create a parser with Irony for the following syntax:
name1:value1 name2:value2 name3:value ...
where name1 is the name of an XML element and value1 is the value of that element, which can also include spaces.
I have tried to modify the included samples like this:
public TestGrammar()
{
    var name = CreateTerm("name");
    var value = new IdentifierTerminal("value");
    var queries = new NonTerminal("queries");
    var query = new NonTerminal("query");
    queries.Rule = MakePlusRule(queries, null, query);
    query.Rule = name + ":" + value;
    Root = queries;
}
private IdentifierTerminal CreateTerm(string name)
{
    IdentifierTerminal term = new IdentifierTerminal(name, "!##$%^*_'.?-", "!##$%^*_'.?0123456789");
    term.CharCategories.AddRange(new[]
    {
        UnicodeCategory.UppercaseLetter,      //Lu
        UnicodeCategory.LowercaseLetter,      //Ll
        UnicodeCategory.TitlecaseLetter,      //Lt
        UnicodeCategory.ModifierLetter,       //Lm
        UnicodeCategory.OtherLetter,          //Lo
        UnicodeCategory.LetterNumber,         //Nl
        UnicodeCategory.DecimalDigitNumber,   //Nd
        UnicodeCategory.ConnectorPunctuation, //Pc
        UnicodeCategory.SpacingCombiningMark, //Mc
        UnicodeCategory.NonSpacingMark,       //Mn
        UnicodeCategory.Format                //Cf
    });
    //StartCharCategories are the same
    term.StartCharCategories.AddRange(term.CharCategories);
    return term;
}
but this doesn't work if the values include spaces. Can this be done (using Irony) without modifying the syntax (like adding quotes around values)?
Many thanks!
If newlines were included between key-value pairs, it would be easily achievable. I have no knowledge of "Irony", but my initial feeling is that almost no parser/lexer generator is going to deal with this given only a naive grammar description. This requires essentially unbounded lookahead.
Conceptually (because I know nothing about this product), here's how I would do it:
Tokenise based on spaces and colons (i.e. every contiguous sequence of characters that isn't a space or a colon is an "identifier" token of some sort).
You then need to make it such that every "sentence" is described from colon-to-colon:
sentence = identifier_list
| : identifier_list identifier : sentence
That's not enough to make it work, but you get the idea at least, I hope. You would need to be very careful to distinguish an identifier_list from a single identifier such that they could be parsed unambiguously. Similarly, if your tool allows you to define precedence and associativity, you might be able to get away with making ":" bind very tightly to the left, such that your grammar is simply:
sentence = identifier : identifier_list
And the behaviour of that needs to be (identifier :) identifier_list.
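Outside of any parser generator, the idea can be sketched in plain Java (this is not Irony; the assumption here is that every name is glued directly to its colon, so any whitespace-separated token containing a colon starts a new pair and every other token extends the previous value):
import java.util.LinkedHashMap;
import java.util.Map;

public class ColonPairs {
    // Splits "name1:value one name2:value two" into {name1=value one, name2=value two}.
    public static Map<String, String> parse(String input) {
        Map<String, String> result = new LinkedHashMap<>();
        String name = null;
        StringBuilder value = new StringBuilder();
        for (String tok : input.trim().split("\\s+")) {
            int colon = tok.indexOf(':');
            if (colon >= 0) {                       // token starts a new pair
                if (name != null) result.put(name, value.toString().trim());
                name = tok.substring(0, colon);
                value = new StringBuilder(tok.substring(colon + 1));
            } else if (name != null) {              // token continues the current value
                value.append(' ').append(tok);
            }
        }
        if (name != null) result.put(name, value.toString().trim());
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("name1:value with spaces name2:value2 name3:v 3"));
        // {name1=value with spaces, name2=value2, name3=v 3}
    }
}
This is essentially the "bind the colon tightly to the identifier on its left" behaviour described above, done as a post-tokenisation pass rather than in the grammar.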
