ANTLR on a noisy data stream Part 3

ANTLR on a noisy data stream Part 3 - parsing

Still in the process of learning ANTLR... Recently I have been posting 2 questions regarding parsing some text and extracting information leaving aside "unwanted" words or character. Following a very interesing discussion with Bart Kiers on parsing a noisy datastream Part 1 and and parsing a noisy datastream Part 2, I'm ending up with one more problem...
Originally, my grammar looks like this
VERB : 'SLEEPING' | 'WALKING';
SUBJECT : 'CAT'|'DOG'|'BIRD';
INDIRECT_OBJECT : 'CAR'| 'SOFA';
ANY2 :'A'..'Z'+ {skip();};
ANY : . {skip();};
parse
: sentenceParts+ EOF
;
sentenceParts
: SUBJECT VERB INDIRECT_OBJECT
;
a sentence like it's 10PM and the Lazy CAT is currently SLEEPING heavily on the SOFA in front of the TV. will produce the following
This is good... and it does what I want, i.e. extracting only the word CAT, SLEEPING and SOFA, leaving aside other words. Now, for another reason, I need to introduce a new token in my grammar, let's call it OTHER : 'PLANE'. It will be used later by another rule. I still want my primary rule to work : SUBJECT VERB INDIRECT_OBJECT. Let's say the token 'PLANE' appears in my sentence, like
it's 10PM and the Lazy CAT on the PLANE is currently SLEEPING heavily on the SOFA in front of the TV. It will produce the following error (no surprise here as the lexer has a clear definition of 'PLANE' as a token)
Is there a way to tell ANTLR that if I'm entering the rule sentenceParts I only care about the 3 tokens I have defined, namely SUBJECT, VERB or INDIRECT_OBJECT and that, even if it comes across a different token, not to take it into account ? I would like to be able to do that without putting OTHER? everywhere in this rule

Well in fact, I might have found a way to do it... Although it's questionable at that point to introduce tokens if you don't want to parse them, this solution works :
VERB : 'SLEEPING' | 'WALKING';
SUBJECT : 'CAT'|'DOG'|'BIRD';
INDIRECT_OBJECT : 'CAR'| 'SOFA';
OTHER : 'PLANE';
OTHER2 : 'BEAUTIFUL';
OTHER3 : 'HEAVILLY';
ANY2 :'A'..'Z'+ {skip();};
ANY : . {skip();};
parse
: sentenceParts+ EOF
;
next : ( options {greedy=false;}: .)*;
sentenceParts
: SUBJECT next VERB next INDIRECT_OBJECT
;
this will produce on the following sentence it's 10PM and the Lazy CAT on the BEAUTIFUL PLANE is currently SLEEPING HEAVILLY on the SOFA in front of the TV the following tree... So that intermediary token

Is there a way to tell ANTLR that if I'm entering the rule sentenceParts I only care about the 3 tokens I have defined, namely SUBJECT, VERB or INDIRECT_OBJECT and that, even if it comes across a different token, not to take it into account ? I would like to be able to do that without putting OTHER? everywhere in this rule
No.
You either ignore the token, or you don't, in which case you'll have to make it optional in your parser rule(s).

Related

Handling arbitrary text blocks in an Xtext grammar

In an effort to better understand Xtext, I'm working on writing a grammar and have hit a roadblock. I've boiled it down to the following scenario. I have some input such as this:
thing {abc}
{def}
There may be keywords (e.g.'thing') followed by other language elements (e.g. ID) in braces. Or, there can just be a block of content inside braces. This content should simply be passed along to the parser en masse.
If I try something like this:
Model: (things+=AThing | blocks+=ABlock)*;
AThing : 'thing' '{' name = ID '}';
ABlock : block=BLOCK;
terminal BLOCK:'{' -> '}';
and parse the sample text above, I get an error:
'mismatched input '{abc}' expecting '{'' on ABlock, offset 6, length 5
So, '{abc}' is being matched by the BLOCK terminal rule, which I understand. But how do I alter the grammar to properly handle the sample input? I've been wrestling with this problem for a while and have come up empty. So it's either something very simple that I've missed, or the problem is really complex and I don't realize it. Any enlightenment would be greatly appreciated.

Parsing happens in two stages: tokenizer and lexical. In the first one the text input is divided into tokens, in the second one the tokens are matched against lexical rules. Broadly something like (with some arbitrary language):
1st phase:
text: class X { this ; }
----- --- --- ---- --- ---
tokens: ID ID LB ID SC RB
2nd phase:
Is there a rule that starts with a 'class' string?
YES: Is the next expected token an ID?
YES: Is the next expected token a LB?
...
NO: Is there another rule that starts with 'class'?
...
NO: Is there a rule that starts with an ID token?
...
The lexer implementation is a bit more complex, but I hope you get the idea.
The issue with your grammar is that your termial BLOCK rule is used during the first phase, hence you get
thing {abc} {def}
----- ----- -----
ID BLOCK BLOCK
That is why the error message says if found '{abc}' and not a '{'. The lexer matched the thing and was expecting the next token to be a '{' but it got a BLOCK.
If you want arbitrary text inside the block, I don't think you can use '{' to identify the name of things.

This looks like what is mentioned here:
A quite common case requiring backtracking is when your language uses the same delimiter pair for two different semantics
So the simplest solution seems to use different delimiters. Otherwise you may have to look into enabling backtracking.

ANTLR: Different token with trailing bracket

I am working on an ANTLRv4 grammar for BUGS - my repo is here, the link points to a particular commit so shouldn't go out of date.
Minimum code example below.
I would like the input rule to go along t route if input is T(, but to go along the id route if the input is T for the grammar below.
grammar temp;
input: t | id;
t: T '(';
id: ID;
T: 'T' {_input.LA(1)==(}?;
ID: [a-zA-Z][a-zA-Z0-9._]*;
My ANLTRv4 specification of BUGS grammar was obtained heavily inspired with the FLEX+BISON lexing and parsing grammar incorporated in JAGS 4.3.0 source code, in files src/lib/compiler/parser.yy and src/lib/compiler/scanner.ll.
The way they accomplish it is by using the trailing context in the lexer, e.g. r/s. The way to do it in ANTLR is given here, but I cannot get it to work.
I need it to work this way because another part of the grammar depends on this mechanism - relevant code fragment here.
You can recreate my particular issue by cloning my repo and running make - this will give list of tokens lexed and error in parsing stage. In the tokens list the letter T is lexed as token 'T' rather than ID as I'd like it to be.
I feel there is much more natural/correct way to do it in ANTLR, but I'm new to this and cannot figure out a way.
PS If you have an idea how to better name this question please edit it.

If I understand the problem correctly the following code will work fine:
grammar temp;
input: t | id;
t: T '(';
id: ID | T;
T: 'T';
LPAREN: '(';
ID: [a-zA-Z][a-zA-Z0-9._]*;

Does -> skip change the behavior of the lexer rule precedence?

I am writing a grammar to parse a configuration export file from a closed system. when a parameter identified in the export file has a particularly long string value assigned to it, the export file inserts "\r\n\t" (double quotes included) every so often in the value. In the file I'll see something like:
"stuff""morestuff""maybesomemorestuff"\r\n\t"morestuff""morestuff"...etc."
In that line, "" is the way the export file escapes a " that is part of the actual string value - vs. a single " which indicates the end of the string value.
my current approach for the grammar to get this string value to the parser is to grab "stuff" as a token and \r\n\t as a token. So I have rules like:
quoted_value : (QUOTED_PART | QUOTE_SEPARATOR)+ ;
QUOTED_PART : '"' .*? '"';
QUOTE_SEPARATOR : '\r\n\t';
WS : [ \t\r\n] -> skip; //note - just one char at a time
I get no errors when I lex or parse a sample string. However, in the token stream - no QUOTE_SEPARATOR tokens show up and there is literally nothing in the stream where they should have been.
I had expected that since QUOTE_SEPARATOR is longer than WS and that it is first in the grammar that it would be selected, but it is behaving as if WS was matched and the characters were skipped and not send to the token string.
Does the -> skip do something to change how rule precedence works?
I am also open to a different approach to the lexing that completely removes the "\r\n\t" (all five characters) - this way just seemed easier and it should be easy enough for the program that will process the parse tree to deal with as other manipulations to the data will be done there anyway (my first grammar - teach me;) ).

No, skip does not affect rule precedence.
Change the QUOTE_SEPARATOR rule to
QUOTE_SEPARATOR : '\\r\\n\\t' ;
in order to match the actual textual content of the source string.