ANTLR on a noisy data stream Part 3 - parsing

Still in the process of learning ANTLR... Recently I have been posting 2 questions regarding parsing some text and extracting information leaving aside "unwanted" words or character. Following a very interesing discussion with Bart Kiers on parsing a noisy datastream Part 1 and and parsing a noisy datastream Part 2, I'm ending up with one more problem...
Originally, my grammar looks like this
VERB : 'SLEEPING' | 'WALKING';
SUBJECT : 'CAT'|'DOG'|'BIRD';
INDIRECT_OBJECT : 'CAR'| 'SOFA';
ANY2 :'A'..'Z'+ {skip();};
ANY : . {skip();};
parse
: sentenceParts+ EOF
;
sentenceParts
: SUBJECT VERB INDIRECT_OBJECT
;
a sentence like it's 10PM and the Lazy CAT is currently SLEEPING heavily on the SOFA in front of the TV. will produce the following
This is good... and it does what I want, i.e. extracting only the word CAT, SLEEPING and SOFA, leaving aside other words. Now, for another reason, I need to introduce a new token in my grammar, let's call it OTHER : 'PLANE'. It will be used later by another rule. I still want my primary rule to work : SUBJECT VERB INDIRECT_OBJECT. Let's say the token 'PLANE' appears in my sentence, like
it's 10PM and the Lazy CAT on the PLANE is currently SLEEPING heavily on the SOFA in front of the TV. It will produce the following error (no surprise here as the lexer has a clear definition of 'PLANE' as a token)
Is there a way to tell ANTLR that if I'm entering the rule sentenceParts I only care about the 3 tokens I have defined, namely SUBJECT, VERB or INDIRECT_OBJECT and that, even if it comes across a different token, not to take it into account ? I would like to be able to do that without putting OTHER? everywhere in this rule

Well in fact, I might have found a way to do it... Although it's questionable at that point to introduce tokens if you don't want to parse them, this solution works :
VERB : 'SLEEPING' | 'WALKING';
SUBJECT : 'CAT'|'DOG'|'BIRD';
INDIRECT_OBJECT : 'CAR'| 'SOFA';
OTHER : 'PLANE';
OTHER2 : 'BEAUTIFUL';
OTHER3 : 'HEAVILLY';
ANY2 :'A'..'Z'+ {skip();};
ANY : . {skip();};
parse
: sentenceParts+ EOF
;
next : ( options {greedy=false;}: .)*;
sentenceParts
: SUBJECT next VERB next INDIRECT_OBJECT
;
this will produce on the following sentence it's 10PM and the Lazy CAT on the BEAUTIFUL PLANE is currently SLEEPING HEAVILLY on the SOFA in front of the TV the following tree... So that intermediary token

Is there a way to tell ANTLR that if I'm entering the rule sentenceParts I only care about the 3 tokens I have defined, namely SUBJECT, VERB or INDIRECT_OBJECT and that, even if it comes across a different token, not to take it into account ? I would like to be able to do that without putting OTHER? everywhere in this rule
No.
You either ignore the token, or you don't, in which case you'll have to make it optional in your parser rule(s).

Related

Handling arbitrary text blocks in an Xtext grammar

In an effort to better understand Xtext, I'm working on writing a grammar and have hit a roadblock. I've boiled it down to the following scenario. I have some input such as this:
thing {abc}
{def}
There may be keywords (e.g.'thing') followed by other language elements (e.g. ID) in braces. Or, there can just be a block of content inside braces. This content should simply be passed along to the parser en masse.
If I try something like this:
Model: (things+=AThing | blocks+=ABlock)*;
AThing : 'thing' '{' name = ID '}';
ABlock : block=BLOCK;
terminal BLOCK:'{' -> '}';
and parse the sample text above, I get an error:
'mismatched input '{abc}' expecting '{'' on ABlock, offset 6, length 5
So, '{abc}' is being matched by the BLOCK terminal rule, which I understand. But how do I alter the grammar to properly handle the sample input? I've been wrestling with this problem for a while and have come up empty. So it's either something very simple that I've missed, or the problem is really complex and I don't realize it. Any enlightenment would be greatly appreciated.
Parsing happens in two stages: tokenizer and lexical. In the first one the text input is divided into tokens, in the second one the tokens are matched against lexical rules. Broadly something like (with some arbitrary language):
1st phase:
text: class X { this ; }
----- --- --- ---- --- ---
tokens: ID ID LB ID SC RB
2nd phase:
Is there a rule that starts with a 'class' string?
YES: Is the next expected token an ID?
YES: Is the next expected token a LB?
...
NO: Is there another rule that starts with 'class'?
...
NO: Is there a rule that starts with an ID token?
...
The lexer implementation is a bit more complex, but I hope you get the idea.
The issue with your grammar is that your termial BLOCK rule is used during the first phase, hence you get
thing {abc} {def}
----- ----- -----
ID BLOCK BLOCK
That is why the error message says if found '{abc}' and not a '{'. The lexer matched the thing and was expecting the next token to be a '{' but it got a BLOCK.
If you want arbitrary text inside the block, I don't think you can use '{' to identify the name of things.
This looks like what is mentioned here:
A quite common case requiring backtracking is when your language uses the same delimiter pair for two different semantics
So the simplest solution seems to use different delimiters. Otherwise you may have to look into enabling backtracking.

ANTLR: Different token with trailing bracket

I am working on an ANTLRv4 grammar for BUGS - my repo is here, the link points to a particular commit so shouldn't go out of date.
Minimum code example below.
I would like the input rule to go along t route if input is T(, but to go along the id route if the input is T for the grammar below.
grammar temp;
input: t | id;
t: T '(';
id: ID;
T: 'T' {_input.LA(1)==(}?;
ID: [a-zA-Z][a-zA-Z0-9._]*;
My ANLTRv4 specification of BUGS grammar was obtained heavily inspired with the FLEX+BISON lexing and parsing grammar incorporated in JAGS 4.3.0 source code, in files src/lib/compiler/parser.yy and src/lib/compiler/scanner.ll.
The way they accomplish it is by using the trailing context in the lexer, e.g. r/s. The way to do it in ANTLR is given here, but I cannot get it to work.
I need it to work this way because another part of the grammar depends on this mechanism - relevant code fragment here.
You can recreate my particular issue by cloning my repo and running make - this will give list of tokens lexed and error in parsing stage. In the tokens list the letter T is lexed as token 'T' rather than ID as I'd like it to be.
I feel there is much more natural/correct way to do it in ANTLR, but I'm new to this and cannot figure out a way.
PS If you have an idea how to better name this question please edit it.
If I understand the problem correctly the following code will work fine:
grammar temp;
input: t | id;
t: T '(';
id: ID | T;
T: 'T';
LPAREN: '(';
ID: [a-zA-Z][a-zA-Z0-9._]*;

Does -> skip change the behavior of the lexer rule precedence?

I am writing a grammar to parse a configuration export file from a closed system. when a parameter identified in the export file has a particularly long string value assigned to it, the export file inserts "\r\n\t" (double quotes included) every so often in the value. In the file I'll see something like:
"stuff""morestuff""maybesomemorestuff"\r\n\t"morestuff""morestuff"...etc."
In that line, "" is the way the export file escapes a " that is part of the actual string value - vs. a single " which indicates the end of the string value.
my current approach for the grammar to get this string value to the parser is to grab "stuff" as a token and \r\n\t as a token. So I have rules like:
quoted_value : (QUOTED_PART | QUOTE_SEPARATOR)+ ;
QUOTED_PART : '"' .*? '"';
QUOTE_SEPARATOR : '\r\n\t';
WS : [ \t\r\n] -> skip; //note - just one char at a time
I get no errors when I lex or parse a sample string. However, in the token stream - no QUOTE_SEPARATOR tokens show up and there is literally nothing in the stream where they should have been.
I had expected that since QUOTE_SEPARATOR is longer than WS and that it is first in the grammar that it would be selected, but it is behaving as if WS was matched and the characters were skipped and not send to the token string.
Does the -> skip do something to change how rule precedence works?
I am also open to a different approach to the lexing that completely removes the "\r\n\t" (all five characters) - this way just seemed easier and it should be easy enough for the program that will process the parse tree to deal with as other manipulations to the data will be done there anyway (my first grammar - teach me;) ).
No, skip does not affect rule precedence.
Change the QUOTE_SEPARATOR rule to
QUOTE_SEPARATOR : '\\r\\n\\t' ;
in order to match the actual textual content of the source string.

ANTLR on a noisy data stream

I'm very new in the ANTLR world and I'm trying to figure out how can I use this parsing tool to interpret a set of "noisy" string. What I would like to achieve is the following.
let's take for example this phrase : It's 10PM and the Lazy CAT is currently SLEEPING heavily on the SOFA in front of the TV
What I would like to extract is CAT, SLEEPING and SOFA and have a grammar that match easily the following pattern : SUBJECT - VERB - INDIRECT OBJECT... where I could define
VERB : 'SLEEPING' | 'WALKING';
SUBJECT : 'CAT'|'DOG'|'BIRD';
INDIRECT_OBJECT : 'CAR'| 'SOFA';
etc.. I don't want to ends up with a permanent "NoViableException" as I can't describe all the possibilities around the language structure. I just want to tear apart useless words and just keep the one that are interesting.
It's more like if I had a tokeniser and asked the parser "Ok, read the stream until you find a SUBJECT, then ignore the rest until you find a VERB, etc.."
I need to extract an organized structure in an un-organized set... For example, I would like to be able to interpret (I'm not judging the pertinence of this utterly basic and incorrect view of 'english grammar')
SUBJECT - VERB - INDIRECT OBJECT
INDIRECT OBJECT - SUBJECT - VERB
so I will parse sentences like
It's 10PM and the Lazy CAT is currently SLEEPING heavily on the SOFA in front of the TV or It's 10PM and, on the SOFA in front of the TV, the Lazy CAT is currently SLEEPING heavily
You could create only a couple of lexer rules (the ones you posted, for example), and as a last lexer rule, you could match any character and skip() it:
VERB : 'SLEEPING' | 'WALKING';
SUBJECT : 'CAT'|'DOG'|'BIRD';
INDIRECT_OBJECT : 'CAR'| 'SOFA';
ANY : . {skip();};
The order is important here: the lexer tries to match tokens from top to bottom, so if it can't match any of the tokens VERB, SUBJECT or INDIRECT_OBJECT, it "falls through" to the ANY rule and skips this token. You can then use these parser rules to filter your input stream:
parse
: sentenceParts+ EOF
;
sentenceParts
: SUBJECT VERB INDIRECT_OBJECT
;
which will parse the input text:
It's 10PM and the Lazy CAT is currently SLEEPING
heavily on the SOFA in front of the TV. The DOG
is WALKING on the SOFA.
as follows:

Help with Shift/Reduce conflict - Trying to model (X A)* (X B)*

Im trying to model the EBNF expression
("declare" "namespace" ";")* ("declare" "variable" ";")*
I have built up the yacc (Im using MPPG) grammar, which seems to represent this, but it fails to match my test expression.
The test case i'm trying to match is
declare variable;
The Token stream from the lexer is
KW_Declare
KW_Variable
Separator
The grammar parse says there is a "Shift/Reduce conflict, state 6 on KW_Declare". I have attempted to solve this with "%left PrologHeaderList PrologBodyList", but neither solution works.
Program : Prolog;
Prolog : PrologHeaderList PrologBodyList;
PrologHeaderList : /*EMPTY*/
| PrologHeaderList PrologHeader;
PrologHeader : KW_Declare KW_Namespace Separator;
PrologBodyList : /*EMPTY*/
| PrologBodyList PrologBody;
PrologBody : KW_Declare KW_Variable Separator;
KW_Declare KW_Namespace KW_Variable Separator are all tokens with values "declare", "naemsapce", "variable", ";".
It's been a long time since I've used anything yacc-like, but here are a couple of suggestions that may or may not help.
It seems that you need a 2-token lookahead in this situation. The parser gets to the last PrologHeader, and it has to decide whether the next construct is a PrologHeader or a PrologBody, and it can't tell that from the KW_Declare. If there's a directive to increase lookahead in this situation, it will probably solve the problem.
You could also introduce context into your actions: rather than define PrologHeaderList and PrologBodyList, define PrologRuleList and have the actions throw an error if a header appears after a body. Ugly, but sometimes you have to do it: what appears simple in a grammar may not be simple in the generated parser.
A hackish approach might be to combine the tokens: rather than KW_Declare and KW_Variable, have your lexer recognize the space and use KW_Declare_Variable. Since both are keywords, you're not going to run into namespace collision problems.
The grammar at the top is regular so IIRC you can plot it out as a DFA (or a NDA and convert it to a DFA) and then convert the DFA to a grammar. It's bean a while so I'll leave the work as an exercise for the reader.

Resources