I already know a workaround for this problem, but I would really like to use this particular approach, for at least one reason: it should work.
This is a rule taken from "The Definitive ANTLR Reference" by Terence Parr (the book is for ANTLR3):
expr : (INT -> INT) ('+' i=INT -> ^('+' $expr $i) )*;
If INT is not followed by +, the result is INT (a single node); if it is, a subtree is built with the first INT (referred to as $expr) as the left branch.
I would like to build a similar rule, but with a custom action:
mult_expr : (pow_expr -> pow_expr )
(op=MUL exr=pow_expr
-> { new BinExpr($op,$mult_expr.tree,$exr.tree) })*;
ANTLR accepts this rule, but when I run my parser with input (for example) "5 * 3" it gives me an error "line 1:1 missing EOF at '*'5".
QUESTION: how to use back reference with custom rewrite action?
I'd recommend creating your own CommonTreeAdaptor and moving the creation of custom nodes into it instead of doing this in your grammar file. For more information on that, see: Extend ANTLR3 AST's
For operators that can have multiple meanings, like the minus sign (binary or unary operator), let your parser rule rewrite the unary operator like this:
grammar X;
...
tokens { U_SUB; }
add_expr
: mult_expr ((SUB | ADD)^ mult_expr)*
;
...
unary_expr
: SUB atom -> ^(U_SUB atom)
| atom
;
...
And then in your implementation of your CommonTreeAdaptor, do something like this:
@Override
public Object create(Token t) {
...
switch(t.getType()) {
case X.SUB : /* return a binary-tree */
...
case X.U_SUB : /* return an unary-tree */
}
...
}
I am a persistent guy, and this idea of creating my custom nodes in one step kept bothering me... ;-)
So, I did it. The crucial points are:
putting EOF! at the end of the "main" rule,
when labeling the tokens, putting the labels next to each token, not on the group, so (op='*'|op='/'), not op=('*'|'/')
I don't know for sure whether using grammar rules to create custom nodes immediately is a good idea, but since this solves the problem asked in the question, I am marking this as the solution.
And for the record, the most interesting rule looks now like this:
mult_expr : (exl=pow_expr -> $exl )
((op=MUL|op=IDIV|op=RDIV|op=MOD) exr=pow_expr
-> { new BinaryExpression($op,$exl.tree,$exr.tree) })*;
The relevant part of my grammar is structured like this:
someRule: subrule1 | WS sign=('+' | '-') subrule2 ; // whitespace required here
// ... etc
WS: [ \t\r\n]+ -> channel(HIDDEN) ; // whitespace is usually ignored
I want to ignore whitespace, but require it in a specific rule. I'm pretty sure there was a way to do it in a previous ANTLR version (though I don't remember exactly, I think there was a syntax allowing you to not hide them in a specific rule). I don't know how to do it in ANTLR4, or if it can be done at all without using language-specific actions.
I thought about making WS a parser rule somehow, but I don't think that's the right approach...
(and obviously I don't want to put WS? everywhere in the grammar)
Is there a (preferably language-independent) way to either (a) ensure that a specific point has whitespace, or (b) ensure both ends on a specific point are not "touching" on that channel, or (c) selectively choose the WS channel (default or hidden) depending on which rule it appears in somehow?
I'm guessing (c) is impossible and (a|b) would require language-dependent actions, unless I'm missing something?
I don't believe there is any way to have parser rules evaluate tokens on the HIDDEN channel (or any channel other than 0). Maybe I'm missing something, but I couldn't find it.
The question I can't answer from your excerpts is whether there is another parser rule that should match if there is NOT a WS before your sign. That makes a big difference.
I tend to think of a successful grammar as one that produces a parse tree representing the correct way to interpret the input stream. IMHO, too many people complicate grammars by trying to encode "all the rules" into the grammar. If you have an accurate tree of the only way to interpret the input (whether it's "error free" or not), then you can write a Listener (or maybe a Visitor) that walks the tree and performs checks for additional rules (such as "the 'sign' must be preceded by whitespace").
This accomplishes a couple of things:
keeps the grammar simpler
allows you to be very specific in your error messages.
ANTLR is pretty good about error messages, for what information it has, but "expected WS, but saw '+'", is just not going to be as good an error message as "signs must follow whitespace".
With that in mind, you can get to the HIDDEN channel inside a listener.
First of all you'll need to make the token Stream available in your Listener:
class TestListener extends TestBaseListener {
BufferedTokenStream tokens;
public TestListener(BufferedTokenStream tokens) {
this.tokens = tokens;
}
// ...
}
and pass it to the constructor of your listener:
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestListener listener = new TestListener(tokens);
then, in the enter* method for whatever rule you need to add this to, you can do something like the following:
static final int HIDDEN = Token.HIDDEN_CHANNEL; // channel 1, where the WS rule sends whitespace
@Override
public void enterAddSub(TxlParser.AddSubContext ctx) {
Token op = ctx.op;
int opIndex = op.getTokenIndex();
List<Token> hiddenChannel = tokens.getHiddenTokensToLeft(opIndex, HIDDEN);
if (hiddenChannel != null) {
Token ws = hiddenChannel.get(0);
if (ws != null) {
System.out.println("Found Ws (" + ws.getText() + ")");
}
} else {
System.out.println("There was no WS to the left of the operator");
// Your code here to add an error
}
}
For reference, this was the rule I used for AddSub:
expr:
expr (MULT | DIV) expr # MulDiv
| lExpr = expr op = (PLUS | MINUS) rExpr = expr # AddSub
// ...
;
If I run this with input a=x+y I get:
There was no WS to the left of the operator
But the input a=x +y gives me:
Found Ws ( )
I'm using ANTLR4 to try to parse code that has asterisk-leading comments, like:
* This is a comment
I was initially having issues with multiplication expressions getting mistaken for these comments, so decided to make my lexer rule:
LINE_COMMENT : '\r\n' '*' ~[\r\n]* ;
This forces there to be a newline so it doesn't see 2 * 3, with '* 3' being a comment.
This worked just fine until I had code that starts with a comment on the first line, which does not have a newline to begin with. For example:
* This is the first line of the code's file\r\n
* This is the second line of the code's file\r\n
I have also tried the {getCharPositionInLine==x}? to make sure that it only recognizes a comment if there is an asterisk or spaces/tabs coming first in the current line. This works when using
antlr4 *.g4
, but will not work with my JavaScript parser generated using
antlr4 -Dlanguage=JavaScript *.g4
Is there a way to get the same results of {getCharPositionInLine==x}? with my JavaScript parser or some way to prevent multiplication from being recognized as a comment? I should also mention that this coding language doesn't use semicolons at the end of lines.
I've tried playing around with this simple grammar, but I haven't had any luck.
grammar wow;
program : expression | Comment ;
expression : expression '*' expression
| NUMBER ;
Comment : '*' ~[\r\n]*;
NUMBER : [0-9]+ ;
Asterisk : '*' ;
Space : ' ' -> skip;
and using a test file: test.txt
5 * 5
Make the comment rule match at least one more non-whitespace character, otherwise it could match the same content as the Asterisk rule, like so:
Comment: '*' ' '* ~[\r\n]+;
Do comments have to be at the beginning of line?
If so, you can check for that with this._tokenStartCharPositionInLine == 0 and write a lexer rule like this:
Comment : '*' ~[\r\n]* {this._tokenStartCharPositionInLine == 0}?;
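To make the start-of-line idea concrete outside of ANTLR, here is a minimal hand-written scanner in plain Java. It is only a sketch: the class name LineStartCommentLexer, the tokenize method and the string token format are hypothetical, and it relaxes the column-zero check to "only spaces or tabs so far on this line", which is what the question asked for.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone scanner; not generated by or tied to ANTLR. */
final class LineStartCommentLexer {
    /** Returns tokens as strings such as "NUMBER(5)", "STAR" or "COMMENT(* text)". */
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        boolean atLineStart = true; // true while only spaces/tabs were seen on this line
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (c == '\n' || c == '\r') {
                atLineStart = true; // a new line begins
                i++;
            } else if (c == ' ' || c == '\t') {
                i++; // skipped; does not change atLineStart
            } else if (c == '*' && atLineStart) {
                // A leading asterisk starts a comment that runs to the end of the line.
                int end = i;
                while (end < src.length() && src.charAt(end) != '\n' && src.charAt(end) != '\r') {
                    end++;
                }
                tokens.add("COMMENT(" + src.substring(i, end) + ")");
                i = end;
            } else if (c == '*') {
                tokens.add("STAR"); // an asterisk after other content is multiplication
                i++;
            } else if (Character.isDigit(c)) {
                int end = i;
                while (end < src.length() && Character.isDigit(src.charAt(end))) {
                    end++;
                }
                tokens.add("NUMBER(" + src.substring(i, end) + ")");
                i = end;
                atLineStart = false;
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at " + i);
            }
        }
        return tokens;
    }
}
```

With this sketch, tokenize("5 * 5") yields NUMBER(5), STAR, NUMBER(5), while a line that begins with an asterisk, including the very first line of the input, becomes a single COMMENT token.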
If not, you need to track the previous token to know whether a multiplication is possible at this point (for example, right after a NUMBER), so you would write something like this (Java code):
@lexer::members {
private static final Set<Integer> MULTIPLIABLE_TOKENS = new HashSet<>();
static {
MULTIPLIABLE_TOKENS.add(NUMBER);
}
private boolean canBeMultiplied = false;
@Override
public void emit(final Token token) {
final int type = token.getType();
if (type != Whitespace && type != Newline) { // skip ws tokens from consideration
canBeMultiplied = MULTIPLIABLE_TOKENS.contains(type);
}
super.emit(token);
}
}
Comment : {!canBeMultiplied}? '*' ~[\r\n]*;
UPDATE
If you need function analogs for JavaScript, take a look into the sources -> Lexer.js
I've got a problem with using a reserve (backslash) declaration for priority disambiguation. Below is a self-contained example. The production 'Ipv4Address' is a strict subset of 'Domain0'. In parsing URL's, though, you want dotted-quad addresses to be handled differently than domain names, so you want to split 'Domain0' into two parts; 'Domain1' is one of those two parts. The test suite included, however, is failing at 't3()', where 'Domain1' is accepting an IP address, which looks like it should be excluded.
Is this a problem with the reserve declaration, or is this a defect in the current version of Rascal? I'm on the 0.10.x unstable branch at present, per advice to see if that corrected a different problem (with the Tutor). I haven't checked with the stable branch since keeping them both installed means parallel Eclipse environments, which I haven't been motivated to do.
module grammar_test
import ParseTree;
syntax Domain0 = { Subdomain '.' }+;
syntax Domain1 = Domain0 \ IPv4Address ;
lexical Subdomain = [0-9A-Za-z]+ | [0-9A-Za-z]+'-'[a-zA-Z0-9\-]*[a-zA-Z0-9] ;
lexical IPv4Address = DecimalOctet '.' DecimalOctet '.' DecimalOctet '.' DecimalOctet ;
lexical DecimalOctet = [0-9] | [1-9][0-9] | '1'[0-9][0-9] | '2'[0-4][0-9] | '25'[0-5] ;
test bool t1()
{
return parseAccept(#IPv4Address, "192.168.0.1");
}
test bool t2()
{
return parseAccept(#Domain0, "192.168.0.1");
}
test bool t3()
{
return parseReject(#Domain1, "192.168.0.1");
}
bool parseAccept( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return false;
}
return true;
}
bool parseReject( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return true;
}
return false;
}
This example has been cut down from larger code. I first encountered the error in a larger scope. Using the rule "IPv4Address | Domain1" was throwing an Ambiguity exception, which I tracked down to the behavior that "Domain1" was accepting something it shouldn't be. Curiously "IPv4Address > Domain1" was also throwing Ambiguity, but I'm guessing this has the same root cause as the present isolated example.
The difference operator for keyword reservations currently only works correctly if the right-hand side is a finite language expressed as a disjunction of literal keywords like "if" | "then" | "while", or a non-terminal which is defined like that: lexical X = "if" | "then" | "while". Then you can write A \ X to some effect.
For other kinds of non-terminals the parser is still generated, but the \ constraint has no effect. You wrote Domain0 \ IPv4Address, and IPv4Address does not satisfy the above assumption.
(We should either add a warning about that or generate a parser which can implement the full semantics of language difference; but that's for another time).
Admittedly, such a powerful difference operator could be used to express some order of preference between non-terminals. Alas.
Possible (sketches of) solutions:
two-stage solution: parse the input using the more general Subdomain syntax, then pattern match and rewrite all quadruples to IPv4Address in a single pass
maximal munch solution: adapt the grammar using follow restrictions to implement eager behavior for the IPv4Address, like {Subdomain !>> [.][0-9] "."}+ or something in that vein.
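The two-stage idea can be illustrated independently of Rascal. The following Java sketch first accepts the general dotted form and then reclassifies dotted quads in a second step; the regular expressions mirror the Subdomain and DecimalOctet definitions from the question, while the class, enum and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical helper: stage 1 accepts any dotted name, stage 2 picks out IPv4 addresses. */
final class HostClassifier {
    enum Kind { IPV4_ADDRESS, DOMAIN, INVALID }

    // Mirrors: lexical Subdomain = [0-9A-Za-z]+ | [0-9A-Za-z]+'-'[a-zA-Z0-9\-]*[a-zA-Z0-9]
    private static final Pattern SUBDOMAIN =
            Pattern.compile("[0-9A-Za-z]+|[0-9A-Za-z]+-[a-zA-Z0-9-]*[a-zA-Z0-9]");

    // Mirrors: lexical DecimalOctet = [0-9] | [1-9][0-9] | '1'[0-9][0-9] | '2'[0-4][0-9] | '25'[0-5]
    private static final Pattern DECIMAL_OCTET =
            Pattern.compile("[0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5]");

    static Kind classify(String host) {
        // Stage 1: the general syntax, one or more Subdomains separated by dots.
        String[] labels = host.split("\\.", -1); // -1 keeps empty trailing labels
        for (String label : labels) {
            if (!SUBDOMAIN.matcher(label).matches()) {
                return Kind.INVALID;
            }
        }
        // Stage 2: a quadruple of decimal octets is reclassified as an IPv4 address.
        boolean dottedQuad = labels.length == 4
                && Arrays.stream(labels).allMatch(l -> DECIMAL_OCTET.matcher(l).matches());
        return dottedQuad ? Kind.IPV4_ADDRESS : Kind.DOMAIN;
    }
}
```

Under these assumptions "192.168.0.1" is classified as IPV4_ADDRESS, whereas "256.1.1.1" stays a DOMAIN because 256 is not a valid octet but is still a valid Subdomain.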
I am trying to do a top-down visit of an Algebraic Data Type. When I find a node of a certain type, I would also like to bind to the children of that particular node, e.g.
data Script=script(list[Stmt] body) | ...
data Stmt =exprstmt(Expr expr)| ...
data Expr =assign(Expr left, Expr right) | var(str name)| scalar(Type aType)|... ;
Script myScript=someScript(srcFile);
top-down visit(myScript)
{
case (Expr e:assign(left,right), left:=var(_), right :=scalar(_) )
{
str varName=left.name;
Type myType=right.aType;
}
}
So what I'm trying to do in the case statement is to search for a specific kind of node: i.e. of type assign(var(),scalar()) by doing a couple of pattern matches. My intention is to bind the variables left and right to var() and scalar() respectively at the same time that I find the particular kind of node. I hoped to NOT do a nested 'case' statement in order to retrieve information about the sub-nodes. Maybe that is possible, but I'm not sure.
You can just nest the patterns, like so:
top-down visit(myScript) {
case e:assign(l:var(varName),r:scalar(aType)) :
// do something useful
println("<varName> : <aType>");
}
I believe left and right might be reserved keywords so this is why I used l and r instead.
Or simpler (because you don't need names for the nested patterns):
top-down visit(myScript) {
case assign(var(varName),scalar(aType)) :
// do something useful
println("<varName> : <aType>");
}
I am trying to create a parser for simple chemical formulae. Meaning, they have no states of matter, charge, or anything like that. The formulae only have strings representing compounds, quantities, and parentheses.
Following this answer to a similar question, and some rudimentary knowledge of discrete math, I hoped that I could write a simple Recursive Descent Parser to generate the number of each atom inside of the formula. I already have a really simple answer for this that involves single parentheses, but not nested parentheses.
Here are the productions of the grammar without parentheses:
Compound: Component { Component };
Component: Atom [Quantity]
Atom: 'H' | 'He' | 'Li' | 'Be' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'
[...] is read as optional, and will be an if test in the program (either it is there or missing)
| is alternatives, and so is an if .. else if .. else or switch 'test', it is saying the input must match one of these
{ ... } is read as repetition of 0 or more, and will be a while loop in the program
Characters between quotes are literal characters which will be in the string. All the other words are names of rules, and for a recursive descent parser, end up being the names of the functions which get called to chop up, and handle the input.
With nested parentheses, I have no idea what to do. By nested parentheses I mean something like (Fe2(OH)2(H2O)8)2, or something fictitious and complicated like (Ab(CD2(Ef(G2H)3)(IJ2)4)3)2
Because now there is a production that I don't really understand how to articulate, but here is my best attempt:
Parenthetical: Compound { Parenthetical } [Quantity]
So the basic rules parse any simple sequence of chemical symbols and quantities without parentheses.
I assume the Quantity is defining the quantity of the whole chunk of stuff between '(' ... ')'
So, '(' ... ')' [Quantity] needs to be parsed as exactly the same thing as the Component, i.e. as an alternative to: Atom [Quantity]
So the only thing to change is the Component rule; it becomes:
Component: Atom [Quantity] | '(' Compound ')' [Quantity]
The function (or procedure) which parses Component looks at the next character (token). If it is a '(', it consumes it, calls the function (or procedure) responsible for parsing Compound, and then checks that the next character (token) is a ')' (if not, it's a syntax error). Finally it handles the optional Quantity, and then it is finished.
I am assuming you are using a programming language which supports recursive function (or procedure) calls. That housekeeping, done by code behind the scenes for your program, will make this 'just work' (TM).
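As a rough illustration of the modified Component rule, here is a minimal recursive descent sketch in Java. It assumes an Atom is one uppercase letter followed by any number of lowercase letters (instead of a fixed list of element symbols), and the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical parser for: Compound, Component, Atom and Quantity as described above. */
final class FormulaParser {
    private final String input;
    private int pos = 0;

    private FormulaParser(String input) {
        this.input = input;
    }

    /** Parses a whole formula and returns the number of each atom. */
    static Map<String, Integer> count(String formula) {
        FormulaParser parser = new FormulaParser(formula);
        Map<String, Integer> counts = parser.compound();
        if (parser.pos != formula.length()) {
            throw new IllegalArgumentException("Unexpected character at " + parser.pos);
        }
        return counts;
    }

    // Compound: Component { Component }
    private Map<String, Integer> compound() {
        Map<String, Integer> total = new TreeMap<>();
        do {
            component().forEach((symbol, n) -> total.merge(symbol, n, Integer::sum));
        } while (pos < input.length() && (peek() == '(' || Character.isUpperCase(peek())));
        return total;
    }

    // Component: Atom [Quantity] | '(' Compound ')' [Quantity]
    private Map<String, Integer> component() {
        Map<String, Integer> counts;
        if (pos < input.length() && peek() == '(') {
            pos++;                 // consume '('
            counts = compound();   // the recursive call handles any nesting depth
            expect(')');
        } else {
            counts = new TreeMap<>();
            counts.put(atom(), 1);
        }
        int multiplier = quantity();
        counts.replaceAll((symbol, n) -> n * multiplier);
        return counts;
    }

    // Atom: assumed here to be an uppercase letter followed by lowercase letters
    private String atom() {
        if (pos >= input.length() || !Character.isUpperCase(peek())) {
            throw new IllegalArgumentException("Expected an atom at " + pos);
        }
        int start = pos++;
        while (pos < input.length() && Character.isLowerCase(peek())) {
            pos++;
        }
        return input.substring(start, pos);
    }

    // [Quantity]: Digit { Digit }, defaulting to 1 when absent
    private int quantity() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(peek())) {
            pos++;
        }
        return start == pos ? 1 : Integer.parseInt(input.substring(start, pos));
    }

    private void expect(char c) {
        if (pos >= input.length() || peek() != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at " + pos);
        }
        pos++;
    }

    private char peek() {
        return input.charAt(pos);
    }
}
```

For the example from the question, FormulaParser.count("(Fe2(OH)2(H2O)8)2") gives Fe=4, H=36 and O=20. Each grammar rule maps to one method, so the call stack does the bookkeeping for nested parentheses.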
Alternatively, you could solve the problem in a different way. Add a new rule, which says:
Stuff: Atom | '(' Compound ')'
Then modify the rule:
Component: Stuff [Quantity]
Then write a new function (or procedure) for Stuff, and change the Component code to simply call Stuff, then handle the optional Quantity.
There are good technical reasons for doing this to support some parsing technology. However you're using recursive descent where it won't really matter.
Edit:
The type of grammar which works very well for a recursive descent parser is called LL(1), which means parsing from left to right and producing the left-most derivation, using one token of lookahead. That is a 'natural' way to parse when the control flow is expressed through code and function calls. To find the theory of how to check whether a grammar is LL(1), search the web for "parsing LL(1)" or "grammar follow sets".
It is pretty uncommon to see nested brackets in chemical formula. But maybe, for instance ammonium carbonate and barium nitrate in a 2:3 ratio could be written as "( (NH4)2 CO3)2 ( Ba(NO3)2 )3"
I found a right-to-left parser that pushes the multiplier onto a multiplier stack worked really well for me:
double multiplier[8];
double num = 1.0;
int multdepth = 0;
multiplier[0] = 1;
char molecule[1024]; // contains molecular formula
//parse the molecular formula right-to-left whilst keeping track of multiplier
for (int i = strlen(molecule) - 1; i >= 0; i--)
{
if (isdigit(molecule[i]) || molecule[i] == '.')
i = readnum(i, &num);
if (isalpha(molecule[i]))
{
i = parseatom(i, num * multiplier[multdepth]);
num = 1.0; // need to reset the multiplier here
}
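if (molecule[i] == ')')
{
multdepth++;
if (multdepth >= 8) // multiplier[] only has room for 8 levels
error("Brackets nested too deeply");
multiplier[multdepth] = num * multiplier[multdepth - 1];
num = 1.0;
}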
if (molecule[i] == '(')
{
multdepth--;
if (multdepth < 0)
error("Opening bracket not terminated");
}
}
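Since readnum and parseatom are not shown above, here is a self-contained variant of the same right-to-left idea, written in Java as a sketch. It assumes whole-number quantities and atoms made of one uppercase letter followed by lowercase letters; the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical right-to-left counter that keeps a stack of bracket multipliers. */
final class RightToLeftFormula {
    static Map<String, Integer> count(String formula) {
        Map<String, Integer> counts = new TreeMap<>();
        Deque<Integer> multipliers = new ArrayDeque<>();
        multipliers.push(1); // multiplier that applies outside all brackets
        int num = 1;         // quantity most recently read (it sits to the right of its owner)
        int i = formula.length() - 1;
        while (i >= 0) {
            char c = formula.charAt(i);
            if (Character.isDigit(c)) {
                int end = i + 1;
                while (i >= 0 && Character.isDigit(formula.charAt(i))) {
                    i--;
                }
                num = Integer.parseInt(formula.substring(i + 1, end));
            } else if (Character.isLetter(c)) {
                int end = i + 1;
                while (i >= 0 && Character.isLowerCase(formula.charAt(i))) {
                    i--; // walk left over the lowercase tail of the symbol
                }
                if (i < 0 || !Character.isUpperCase(formula.charAt(i))) {
                    throw new IllegalArgumentException("Malformed atom ending at " + (end - 1));
                }
                counts.merge(formula.substring(i, end), num * multipliers.peek(), Integer::sum);
                num = 1; // the quantity has been used up
                i--;
            } else if (c == ')') {
                // Entering a group from the right: its quantity scales everything inside it.
                multipliers.push(num * multipliers.peek());
                num = 1;
                i--;
            } else if (c == '(') {
                if (multipliers.size() == 1) {
                    throw new IllegalArgumentException("Opening bracket not terminated at " + i);
                }
                multipliers.pop();
                i--;
            } else {
                throw new IllegalArgumentException("Unexpected character '" + c + "' at " + i);
            }
        }
        if (multipliers.size() != 1) {
            throw new IllegalArgumentException("Closing bracket without opening bracket");
        }
        return counts;
    }
}
```

Reading from the right means each quantity is seen before the atom or group it belongs to, so no lookahead is needed; the stack top always holds the product of the quantities of all enclosing groups.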