Is this reserve declaration hard to understand or defective? - rascal

I've got a problem with using a reserve (backslash) declaration for priority disambiguation. Below is a self-contained example. The production 'Ipv4Address' is a strict subset of 'Domain0'. In parsing URL's, though, you want dotted-quad addresses to be handled differently than domain names, so you want to split 'Domain0' into two parts; 'Domain1' is one of those two parts. The test suite included, however, is failing at 't3()', where 'Domain1' is accepting an IP address, which looks like it should be excluded.
Is this a problem with the reserve declaration, or is this a defect in the current version of Rascal? I'm on the 0.10.x unstable branch at present, per advice to see if that corrected a different problem (with the Tutor). I haven't checked with the stable branch since keeping them both installed means parallel Eclipse environments, which I haven't been motivated to do.
module grammar_test
import ParseTree;
syntax Domain0 = { Subdomain '.' }+;
syntax Domain1 = Domain0 \ IPv4Address ;
lexical Subdomain = [0-9A-Za-z]+ | [0-9A-Za-z]+'-'[a-zA-Z0-9\-]*[a-zA-Z0-9] ;
lexical IPv4Address = DecimalOctet '.' DecimalOctet '.' DecimalOctet '.' DecimalOctet ;
lexical DecimalOctet = [0-9] | [1-9][0-9] | '1'[0-9][0-9] | '2'[0-4][0-9] | '25'[0-5] ;
test bool t1()
{
return parseAccept(#IPv4Address, "192.168.0.1");
}
test bool t2()
{
return parseAccept(#Domain0, "192.168.0.1");
}
test bool t3()
{
return parseReject(#Domain1, "192.168.0.1");
}
bool parseAccept( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return false;
}
return true;
}
bool parseReject( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return true;
}
return false;
}
This example has been cut down from larger code. I first encountered the error in a larger scope. Using the rule "IPv4Address | Domain1" was throwing an Ambiguity exception, which I tracked down to the behavior that "Domain1" was accepting something it shouldn't be. Curiously "IPv4Address > Domain1" was also throwing Ambiguity, but I'm guessing this has the same root cause as the present isolated example.

The difference operator for keyword reservations currently only works correctly if the right-hand side is a finite language expressed as disjunction of literal keywords like "if" | "then" | "while" or a non-terminal which is defined like that: lexical X = "if" | "then" | "while". And then you can writeA \ X` for some effect.
For other types of non-terminals the parser is just generated but the \ constraint has no effect. You wrote Domain0 \ IPv4Address and IPv3Address does not hold to the above assumption.
(We should either add a warning about that or generate a parser which can implement the full semantics of language difference; but that's for another time).
Admittedly such a powerful difference operator could be used to express an some order of preference between non-terminals. Alas.
Possible (sketches of) solutions:
stage two passes solution: parse the input using the more general Subdomain syntax, then pattern and match rewrite in a single pass all quadruples to IPv4Address
maximal munch solution: adapt the grammar using follow restrictions to implement eager behavior for the IPv4Address, like {Subdomain !>> [.][0-9] "."}+ or something in that vain.

Related

How can I get mandatory whitespace in a specific rule while having other whitespace ignored?

The relevant part of my grammar is structured like this:
someRule: subrule1 | WS sign=('+' | '-') subrule2 ; // whitespace required here
// ... etc
WS: [ \t\r\n]+ -> channel(HIDDEN) ; // whitespace is usually ignored
I want to ignore whitespace, but require it on a specific rule. I'm pretty sure there was a way to do it in a previous ANTLR version (though I don't remember exactly, I think there was a syntax allowing to not hide them on a specific rule). I don't know how to do it in ANTLR4, of if it can be done at all without using language-specific actions.
I thought about making WS a parser rule somehow, but I don't think that's the right approach...
(and obviously I don't want to put WS? everywhere in the grammar)
Is there a (preferably language-independent) way to either (a) ensure that a specific point has whitespace, or (b) ensure both ends on a specific point are not "touching" on that channel, or (c) selectively choose the WS channel (default or hidden) depending on which rule it appears in somehow?
I'm guessing (c) is impossible and (a|b) would require language-dependent actions, unless I'm missing something?
I don't believe there is any way to have parser rules evaluate tokens on the HIDDEN channel (or any channel other than 0). Maybe I'm missing something but i couldn't find it.
The question I can't answer from your excerpts is whether there is another parser rule that should match if there is NOT a WS before your sign. That makes a big difference.
I tend to think of a successful grammar as one that will produce a parse tree that represents the correct way to interpret the input stream. IMHO, too many people complicate grammars by trying to encode "all the rules" into the grammar. If you have an accurate tree of the only way to interpret the input (whether it's "error free" or not), then you can write a Listener (maybe a visitor) that visits the tree and performs edits for additional rules (such as "the 'sign' much be preceded by whitespace).
This accomplishes a couple of things:
keeps the grammar simpler
allows you to be very specific in your error messages.
ANTLR is pretty good about error messages, for what information it has, but "expected WS, but saw '+'", is just not going to be as good an error message as "signs must follow whitespace".
With that in mind, you can get to the HIDDEN channel inside a listener.
First of all you'll need to make the token Stream available in your Listener:
class TestListener extends TestBaseListener {
BufferedTokenStream tokens;
public TestListener(BufferedTokenStream tokens) {
this.tokens = tokens;
}
// ...
}
and pass it to the constructor of your listener:
CommonTokenStream tokens = new CommonTokenStream(lexer);
TestListener listener = new TestListener(tokens);
then, in the enter* method for whatever rule you need to add this to, you can do something like the following:
const int HIDDEN = 1;
#Override
public void enterAddSub(TxlParser.AddSubContext ctx) {
Token op = ctx.op;
int opIndex = op.getTokenIndex();
List<Token> hiddenChannel = tokens.getHiddenTokensToLeft(opIndex, HIDDEN);
if (hiddenChannel != null) {
Token ws = hiddenChannel.get(0);
if (ws != null) {
System.out.println("Found Ws (" + ws.getText() + ")");
}
} else {
System.out.println("There was no WS to the left of the operator");
// Your code here to add an error
}
}
for reference, this was the rule for I used with AddSub
expr:
expr (MULT | DIV) expr # MulDiv
| lExpr = expr op = (PLUS | MINUS) rExpr = expr # AddSub
// ...
;
If I run this with input a=x+y I get:
There was no WS to the left of the operator
But the input a=x +y gives me:
Found Ws ( )

How do I fix my ANTLR Parser to separate comments from multiplication?

I'm using ANTLR4 to try to parse code that has asterisk-leading comments, like:
* This is a comment
I was initially having issues with multiplication expressions getting mistaken for these comments, so decided to make my lexer rule:
LINE_COMMENT : '\r\n' '*' ~[\r\n]* ;
This forces there to be a newline so it doesn't see 2 * 3, with '* 3' being a comment.
This worked just fine until I had code that starts with a comment on the first line, which does not have a newline to begin with. For example:
* This is the first line of the code's file\r\n
* This is the second line of the codes's file\r\n
I have also tried the {getCharPositionInLine==x}? to make sure that it only recognizes a comment if there is an asterisk or spaces/tabs coming first in the current line. This works when using
antlr4 *.g4
, but will not work with my JavaScript parser generated using
antlr4 -Dlanguage=JavaScript *.g4
Is there a way to get the same results of {getCharPositionInLine==x}? with my JavaScript parser or some way to prevent multiplication from being recognized as a comment? I should also mention that this coding language doesn't use semicolons at the end of lines.
I've tried playing around with this simple grammar, but I haven't had any luck.
grammar wow;
program : expression | Comment ;
expression : expression '*' expression
| NUMBER ;
Comment : '*' ~[\r\n]*;
NUMBER : [0-9]+ ;
Asterisk : '*' ;
Space : ' ' -> skip;
and using a test file: test.txt
5 * 5
Make the comment rule match at least one more non-whitespace character, otherwise it could match the same content as the Asterisk rule, like so:
Comment: '*' ' '* ~[\r\n]+;
Do comments have to be at the beginning of line?
If so you can check it with this._tokenStartCharPositionInLine == 0 and have lexer rule like this
Comment : '*' ~[\r\n]* {this._tokenStartCharPositionInLine == 0}?;
If not, you should gather information about previous tokens, which could allow us to have multiplication (for example your NUMBER rule), so you should write something like (java code)
#lexer::members {
private static final Set<Integer> MULTIPLIABLE_TOKENS = new HashSet<>();
static {
MULTIPLIABLE_TOKENS.add(NUMBER);
}
private boolean canBeMultiplied = false;
#Override
public void emit(final Token token) {
final int type = token.getType();
if (type != Whitespace && type != Newline) { // skip ws tokens from consideration
canBeMultiplied = MULTIPLIABLE_TOKENS.contains(type);
}
super.emit(token);
}
}
Comment : {!canBeMultiplied}? '*' ~[\r\n]*;
UPDATE
If you need function analogs for JavaScript, take a look into the sources -> Lexer.js

Parse a block where each line starts with a specific symbol

I need to parse a block of code which looks like this:
* Block
| Line 1
| Line 2
| ...
It is easy to do:
block : head lines;
head : '*' line;
lines : lines '|' line
| '|' line
;
Now I wonder, how can I add nested blocks, e.g.:
* Block
| Line 1
| * Subblock
| | Line 1.1
| | ...
| Line 2
| ...
Can this be expressed as a LALR grammar?
I can, of course, parse the top-level blocks and than run my parser again to deal with each of these top-level blocks. However, I'm just learning this topic so it's interesting for me to avoid such approach.
The nested-block language is not context-free [Note 2], so it cannot be parsed with an LALR(k) parser.
However, nested parenthesis languages are, of course, context-free and it is relatively easy to transform the input into a parenthetic form by replacing the initial | sequences in the lexical scanner. The transformation is simple:
when the initial sequence of |s is longer than the previous line, insert an BEGIN_BLOCK. (The initial sequence must be exactly one | longer; otherwise it is presumably a syntax error.)
when the initial sequence of |s, is shorter then the previous line, enough END_BLOCKs are inserted to bring the expected length to the correct value.
The |s themselves are not passed through to the parser.
This is very similar to the INDENT/DEDENT strategy used to parse layout-aware languages like Python an Haskell. The main difference is that here we don't need a stack of indent levels.
Once that transformation is finished, the grammar will look something like:
content: /* empty */
| content line
| content block
block : head BEGIN_BLOCK content END_BLOCK
| head
head : '*' line
A rough outline of a flex implementation would be something like this: (see Note 1, below).
%x INDENT CONTENT
%%
static int depth = 0, new_depth = 0;
/* Handle pending END_BLOCKs */
send_end:
if (new_depth < depth) {
--depth;
return END_BLOCK;
}
^"|"[[:blank:]]* { new_depth = 1; BEGIN(INDENT); }
^. { new_depth = 0; yyless(0); BEGIN(CONTENT);
goto send_end; }
^\n /* Ignore blank lines */
<INDENT>{
"|"[[:blank:]]* ++new_depth;
. { yyless(0); BEGIN(CONTENT);
if (new_depth > depth) {
++depth;
if (new_depth > depth) { /* Report syntax error */ }
return BEGIN_BLOCK;
} else goto send_end;
}
\n BEGIN(INITIAL); /* Maybe you care about this blank line? */
}
/* Put whatever you use here to lexically scan the lines */
<CONTENT>{
\n BEGIN(INITIAL);
}
Notes:
Not everyone will be happy with the goto but it saves some code-duplication. The fact that the state variable (depth and new_depth) are local static variables makes the lexer non-reentrant and non-restartable (after an error). That's only useful for toy code; for anything real, you should make the lexical scanner re-entrant and put the state variables into the extra data structure.
The terms "context-free" and "context-sensitive" are technical descriptions of grammars, and are therefore a bit misleading. Intuitions based on what the words seem to mean are often wrong. One very common source of context-sensitivity is a language where validity depends on two different derivations of the same non-terminal producing the same token sequence. (Assuming the non-terminal could derive more than one token sequence; otherwise, the non-terminal could be eliminated.)
There are lots of examples of such context-sensitivity in normal programming languages; usually, the grammar will allow these constructs and the check will be performed later in some semantic analysis phase. These include the requirement that an identifier be declared (two derivations of IDENTIFIER produce the same string) or the requirement that a function be called with the correct number of parameters (here, it is only necessary that the length of the derivations of the non-terminals match, but that is sufficient to trigger context-sensitivity).
In this case, the requirement is that two instances of what might be called bar-prefix in consecutive lines produce the same string of |s. In this case, since the effect is really syntactic, deferring to a later semantic analysis defeats the point of parsing. Whether the other examples of context-sensitivity are "syntactic" or "semantic" is a debate which produces a surprising amount of heat without casting much light on the discussion.
If you write an explicit end-of-block token, things become clearer:
*Block{
|Line 1
*SubBlock{
| line 1.1
| line 1.2
}
|Line 2
|...
}
and grammar becomes:
block : '*' ID '{' lines '}'
lines : lines '|' line
| lines block
|

Can nested parentheticals be parsed in chemical formulae?

I am trying to create a parser for simple chemical formulae. Meaning, they have no states of matter, charge, or anything like that. The formulae only have strings representing compounds, quantities, and parentheses.
Following this answer to a similar question, and some rudimentary knowledge of discrete math, I hoped that I could write a simple Recursive Descent Parser to generate the number of each atom inside of the formula. I already have a really simple answer for this that involves single parentheses, but not nested parentheses.
Here are the productions of the grammar without parentheses:
Compound: Component { Component };
Component: Atom [Quantity]
Atom: 'H' | 'He' | 'Li' | 'Be' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'
[...] is read as optional, and will be an if test in the program (either it is there or missing)
| is alternatives, and so is an if .. else if .. else or switch 'test', it is saying the input must match one of these
{ ... } is read as repetition of 0 or more, and will be a while loop in the program
Characters between quotes are literal characters which will be in the string. All the other words are names of rules, and for a recursive descent parser, end up being the names of the functions which get called to chop up, and handle the input.
With nested parentheses, I have no idea what to do. By nested parentheses I mean something like (Fe2(OH)2(H2O)8)2, or something fictitious and complicated like (Ab(CD2(Ef(G2H)3)(IJ2)4)3)2
Because now there is a production that I don't really understand how to articulate, but here is my best attempt:
Parenthetical: Compound { Parenthetical } [Quantity]
So the basic rules parse any simple sequence of chemical symbols and quantities without parenthesis.
I assume the Quantity is defining the quantity of the whole chunk of stuff between '(' ... ')'
So, '(' ... ') [Quantity] needs to be parsed as exactly the same thing as the Component, i.e. as an alternative to: Atom [Quantity]
So the only thing to change is the Component rule; it becomes:
Component: Atom [Quantity] | '(' Compound ')' [Quantity]
In the code function (or procedure) which is parsing Component, it will have a look at the next character (token), and if it is an '(', it will consume it, then call the function (or procedure) responsible for parsing Compound, and after that, check the next character (token) is a ')' (if not, it's a syntax error), then handle the optional Quantity, and then it is finished.
I am assuming you are using a programming language which supports recursive function (or procedure) calls. That housekeeping, done by code behind the scenes for your program, will make this 'just work' (TM).
Alternatively, you could solve the problem in a different way. Add a new rule, which says:
Stuff: Atom | '(' Compound ')'
Then modify the rule:
Compound: Stuff [Quantity]
Then write a new function (or procedure) for Stuff, and change the Compound code to simply call Stuff, then handle the optional Quantity.
There are good technical reasons for doing this to support some parsing technology. However you're using recursive descent where it won't really matter.
Edit:
The type of grammar which works very well for a recursive decent parser is called LL(1), which means parse from left-to-right, and create the left-most derivation. That is a 'natural' way to parse when the code and function calls is the control flow. To find the theory of how to check grammars are LL(1) search the web for "parsing LL(1)" or "grammar follow sets".
It is pretty uncommon to see nested brackets in chemical formula. But maybe, for instance ammonium carbonate and barium nitrate in a 2:3 ratio could be written as "( (NH4)2 CO3)2 ( Ba(NO3)2 )3"
I found a right-to-left parser that pushes the multiplier onto a multiplier stack worked really well for me:
double multiplier[8];
double num = 1.0;
int multdepth = 0;
multiplier[0] = 1;
char molecule[1024]; // contains molecular formula
//parse the molecular formula right-to-left whilst keeping track of multiplier
for (int i = strlen(molecule) - 1; i >= 0; i--)
{
if (isdigit(molecule[i]) || molecule[i] == '.')
i = readnum(i, &num);
if (isalpha(molecule[i]))
{
i = parseatom(i, num * multiplier[multdepth]);
num = 1.0; // need to reset the multiplier here
}
if (molecule[i] == ')')
{
multdepth++;
multiplier[multdepth] = num * multiplier[multdepth - 1];
num = 1.0;
}
if (molecule[i] == '(')
{
multdepth--;
if (multdepth < 0)
error("Opening bracket not terminated");
}
}

bison error recovery

I have found out that I can use 'error' in the grammar rule as a mechanism for error recovery. So if there was an error, the parser must discard the current line and resume parsing from the next line. An example from bison manual to achieve this could be something like this:
stmts:
exp
|stmts exp
| error '\n'
But I cannot use that; because I had to make flex ignores '\n' in my scannar, so that an expression is not restricted to be expressed in one line. How can I make the parser -when encountering an error- continue parsing to the following line, given that there is no special character (i.e. semicolon) to indicate an end of expression and there is no 'newline' token?
Thanks..
Since you've eliminated the marker used by the example, you're going to have to pull a stunt to get the equivalent effect.
I think you can use this:
stmts:
exp
| stmts exp
| error { eat_to_newline(); }
Where eat_to_newline() is a function in the scanner (source file) that arranges to discard any saved tokens and read up to the next newline.
extern void eat_to_newline(void);
void eat_to_newline(void)
{
int c;
while ((c = getchar()) != EOF && c != '\n')
;
}
It probably needs to be a little more complex than that, but not a lot more complex than that. You might need to use yyerrok; (and, as the comment reminds me, yyclearin; too) after calling eat_to_newline().

Resources