I am trying to adapt a PDDL parser, and there is a token that may or may not be present. Suppose these are the 2 inputs I want to read.
(figure1)
(node1)
(node1 :isGood) // :isGood is optional
To support both situations, I wrote .jj code like figure2 below. It works correctly; however, it seems like an inappropriate way to write this.
(figure2)
<LEFT_BRACKET>
<NODE>
(LOOKAHEAD(2) <IS_GOOD> <RIGHT_BRACKET> | <RIGHT_BRACKET>)
The .jj code I actually want should look like figure3 below. The grammar in figure3 compiles successfully, but it cannot parse the script in figure1 that omits :isGood; I get an unexpected token ")" instead. (figure3)
<LEFT_BRACKET>
<NODE>
(LOOKAHEAD(2) <IS_GOOD>) // this is where it should support an optional token
<RIGHT_BRACKET>
Question: How do I write the .jj code to support both conditions in figure1? In other words, how do I support the optional token :isGood, which may or may not be present, in an appropriate way?
I probably don't understand how LOOKAHEAD works. Any solution that reads figure1 is appreciated.
You don't need the lookahead specification in this case. Just use
<LEFT_BRACKET>
<NODE>
[ <IS_GOOD> ]
<RIGHT_BRACKET>
or
<LEFT_BRACKET>
<NODE>
( <IS_GOOD> )?
<RIGHT_BRACKET>
or
<LEFT_BRACKET>
<NODE>
( <IS_GOOD> | {} )
<RIGHT_BRACKET>
They mean the same thing.
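For context, a complete production using the first form could look like the sketch below (the production name Node is made up; the token names are the ones from the figures):
void Node() :
{}
{
  <LEFT_BRACKET> <NODE> [ <IS_GOOD> ] <RIGHT_BRACKET>
}
The [ ... ] form is usually the most readable way to say "zero or one occurrence" in JavaCC, and no LOOKAHEAD is needed because the parser can decide from the next token alone.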
Related
I am writing a lexer/parser for a language that allows abbreviations (and globs) for its keywords, and I am trying to determine the best way to do it.
One thought that occurs to me is to insert a phase between the lexer and the parser: the lexer recognizes the general class (e.g., is this a "command name" or an "option"?) and passes those general tokens to a second phase, which does further analysis, recognizes which command name it is, and passes that on as the token type to the parser.
This will keep the parser simple: it only has to deal with well-formed command names, and it is clear what every token means.
It will also keep the lexer simple: it only has to divide things into classes (a simple name, a glob, an option name starting with a dash).
The phase in the middle will also be relatively simple. The simple-name (and option) forms only have to deal with strings, and the glob form can use standard glob techniques to match the glob against the legal candidates, which are stored in the tables of simple names and options.
The question is how to insert that phase into ANTLR, so that I call the lexer, it creates tokens, the intermediate phase massages them, and then the parser gets the tokens the intermediate phase has categorized.
Is there a known solution for this?
Something like:
lexer grammar simple
letter: [A-Z][a-z];
digit: [0-9];
glob-char: [*?];
name: letter (letter | digit)*;
option: '-'name;
glob: (glob-char|letter)(glob-char|letter|digit)*;
glob-option: '-'glob;
filter grammar name;
end: 'e' | 'end';
generate: 'ge' | 'generate';
goto: 'go' | 'goto';
help: 'h' | 'help';
if: 'i' | 'if';
then: 't' | 'then';
parser grammar simple;
The user (the programmer writing in the language I am parsing) needs to be able to write
g*te and have it match generate.
When the phase between the lexer and the parser sees a glob, it needs to look at the glob (and the list of keywords), check whether exactly one keyword matches the glob, and if so return that keyword. The stuff I listed in the "filter grammar" is what builds the list of keywords the globs can match. I have found code on the web that matches globs against a list of names; that part isn't hard.
I've since found in the ANTLR documentation how to run arbitrary code when a token is matched and how to change the resulting token's type. (See my answer.)
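For reference, here is a rough sketch of how such an intermediate phase could be wired between the generated lexer and parser in the ANTLR4 Java runtime: run the lexer to completion, let the middle phase rewrite the token types, and hand the massaged list to the parser through a ListTokenSource. SimpleLexer/SimpleParser stand in for the generated classes, NAME and GLOB for the "general class" token types, and classifyToken is a hypothetical second-phase lookup:
import org.antlr.v4.runtime.*;
import java.util.*;

public class TwoPhaseFrontEnd {
    public static SimpleParser buildParser(String source) {
        SimpleLexer lexer = new SimpleLexer(CharStreams.fromString(source));
        List<Token> reclassified = new ArrayList<>();
        for (Token t : lexer.getAllTokens()) {
            CommonToken copy = new CommonToken(t);            // writable copy of the original token
            if (t.getType() == SimpleLexer.NAME || t.getType() == SimpleLexer.GLOB) {
                copy.setType(classifyToken(t.getText()));     // middle phase picks the concrete keyword type
            }
            reclassified.add(copy);
        }
        // ListTokenSource replays the massaged tokens and supplies EOF at the end.
        CommonTokenStream tokens = new CommonTokenStream(new ListTokenSource(reclassified));
        return new SimpleParser(tokens);
    }

    // Hypothetical second phase: match the name/abbreviation/glob against the
    // keyword tables and return the matching keyword's token type.
    private static int classifyToken(String text) {
        return SimpleLexer.NAME; // placeholder
    }
}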
It looks like you can use lexerCustomActions to achieve the desired effect. Something like the following.
in your lexer:
GLOB: [-A-Za-z0-9_.]* '*' [-A-Za-z0-9_.*]* { setType(lexGlob(getText())); };
in your Java code (or whatever target language you are using):
int lexGlob(String origText) {
    return xyzzy; // some code that computes the right kind of token type
}
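A possible shape for lexGlob, assuming the abbreviatable keywords are declared in a tokens { ... } block so the generated lexer has type constants for them (END, GENERATE and the other names below are placeholders, and java.util would need to be imported in the grammar's @header): it converts the glob into a regular expression, collects the keywords it matches, and only rewrites the token type when exactly one keyword matches.
// placed in the lexer's @members section
private static final Map<String, Integer> KEYWORDS = Map.of(
        "end", END, "generate", GENERATE, "goto", GOTO,
        "help", HELP, "if", IF, "then", THEN);

int lexGlob(String origText) {
    String regex = origText.replace(".", "\\.")   // escape literal dots
                           .replace("*", ".*")    // glob '*' -> regex '.*'
                           .replace("?", ".");    // glob '?' -> regex '.'
    List<Integer> hits = new ArrayList<>();
    for (Map.Entry<String, Integer> kw : KEYWORDS.entrySet()) {
        if (kw.getKey().matches(regex)) {
            hits.add(kw.getValue());
        }
    }
    // Rewrite the type only when the glob picks out exactly one keyword;
    // otherwise keep the generic GLOB type and let the parser report it.
    return hits.size() == 1 ? hits.get(0) : GLOB;
}
With the example from the question, g*te becomes the regex g.*te, which matches only generate in this table, so the token comes out typed GENERATE.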
In an effort to better understand Xtext, I'm working on writing a grammar and have hit a roadblock. I've boiled it down to the following scenario. I have some input such as this:
thing {abc}
{def}
There may be keywords (e.g. 'thing') followed by other language elements (e.g. an ID) in braces, or there can just be a block of content inside braces. That content should simply be passed along to the parser en masse.
If I try something like this:
Model: (things+=AThing | blocks+=ABlock)*;
AThing : 'thing' '{' name = ID '}';
ABlock : block=BLOCK;
terminal BLOCK:'{' -> '}';
and parse the sample text above, I get an error:
'mismatched input '{abc}' expecting '{'' on ABlock, offset 6, length 5
So, '{abc}' is being matched by the BLOCK terminal rule, which I understand. But how do I alter the grammar to properly handle the sample input? I've been wrestling with this problem for a while and have come up empty. So it's either something very simple that I've missed, or the problem is really complex and I don't realize it. Any enlightenment would be greatly appreciated.
Parsing happens in two stages: lexing (tokenizing) and parsing proper. In the first stage the text input is divided into tokens; in the second stage the tokens are matched against the grammar rules. Broadly, it works something like this (with some arbitrary language):
1st phase:
text: class X { this ; }
----- --- --- ---- --- ---
tokens: ID ID LB ID SC RB
2nd phase:
Is there a rule that starts with a 'class' string?
YES: Is the next expected token an ID?
YES: Is the next expected token a LB?
...
NO: Is there another rule that starts with 'class'?
...
NO: Is there a rule that starts with an ID token?
...
The real implementation is a bit more complex, but I hope you get the idea.
The issue with your grammar is that your terminal BLOCK rule is applied during the first phase, hence you get
thing {abc} {def}
----- ----- -----
ID BLOCK BLOCK
That is why the error message says it found '{abc}' and not a '{'. The parser matched the thing keyword and was expecting the next token to be a '{', but it got a BLOCK.
If you want arbitrary text inside the block, I don't think you can use '{' to identify the name of things.
This looks like what is mentioned here:
A quite common case requiring backtracking is when your language uses the same delimiter pair for two different semantics
So the simplest solution seems to be to use different delimiters. Otherwise you may have to look into enabling backtracking.
Using ANTLR 3, my lexer has the rule
SELECT_ASSIGN:
'SELECT' WS+ IDENTIFIER WS+ 'ASSIGN' WS+ (('TO'|'USING') WS+)?
Using this, these inputs match correctly:
SELECT VAR1 ASSIGN TO
SELECT VAR1 ASSIGN USING
and this also matches
SELECT VAR1 ASSIGN FOO
However this does not match
SELECT VAR1 ASSIGN TWO
This is even though I have marked 'TO'|'USING' as optional in the rule.
From the generated Java code I see that when the lexer notices the T of TWO, it goes into match('TO'), but since it does not find an O after the T, it generates a failure and returns all the way out of the rule, hence not matching it.
How do I get my lexer rule to match when the input has a word that starts with the same characters as the optional suffix part of the rule?
Basically, I want my rule to also match this (besides what it already matches, as listed at the start):
SELECT VAR1 ASSIGN TWO
Kindly suggest how I can approach/resolve this situation.
NOTE:
Such rules are usually recommended in the parser, but I have this in the lexer because I do not want the parser to process the entire input, only the content of interest. By using such rules in the lexer, I locate the sections that I actually want the parser to parse.
UPDATE 1
I could circumvent this problem by making 2 rules, like so:
SELECT_ASSIGN_USING_TO
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN' WS+ ('USING'|'TO')
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
But is it possible to achieve this in a single lexer rule?
An approach to get this in one rule, suggested by my senior, is to use a syntactic predicate:
SELECT_ASSIGN
: tok='SELECT' WS+ name=IDENTIFIER WS+ 'ASSIGN'
(
(WS+ ('TO'|'USING') WS+)=> (WS+ ('TO'|'USING') WS+)
| (WS+)
)
A token matches a complete character sequence or none at all; it cannot match partially, and the grammar rule determines exactly which sequence. You cannot expect a rule for TO to match TWO. If you want TWO to match too, you have to add it to your lexer rule.
A few notes here:
The solution your "senior" gave you makes no sense at all. A syntactic predicate is a kind of lookahead used to guide the parser in case of ambiguities, and there are no ambiguities involved here.
Writing the entire SELECT_ASSIGN rule as a lexer rule is very uncommon and not flexible. A lexer rule should not be used for entire sentences, but only for small sets of characters, to find tokens and assign them a type (usually elementary structures of a language like strings, numbers, comments, etc.).
ANTLR3 is completely outdated, and I wonder why it is still used in your class. ANTLR4 has been out for 5 years and should be the choice for any new project.
I have a simple little grammar which keeps giving a multiple alternatives error when I try to generate Xtext artefacts.
The grammar is:
grammar org.xtext.example.hyrule.HyRule with org.eclipse.xtext.xbase.Xbase
generate hyRule
Start:
rules+=Rule+
;
Rule:
'FOR' 'PAYLOAD' payload=PAYLOAD 'ELEMENTS' elements+=JvmFormalParameter+ 'CONSTRAINED' 'BY' expressions+=XExpression*;
PAYLOAD:
"Stacons"|"PFResults"|"any"
;
And the exact error I get is:
warning(200): ../org.xtext.example.hyrule/src-gen/org/xtext/example/hyrule/parser/antlr/internal/InternalHyRule.g:3197:2: Decision can match input such as "{RULE_ID, '=>', '('}" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input
error(201): ../org.xtext.example.hyrule/src-gen/org/xtext/example/hyrule/parser/antlr/internal/InternalHyRule.g:3197:2: The following alternatives can never be matched: 2
I have attached the syntax diagram for the generated ANTLR grammar from ANTLRWorks, and I can clearly see the multiple alternatives (JvmFormalParameter can match RULE_ID via either the JvmTypeReference rule or the ValidID rule).
So it looks as if JvmFormalParameter is ambiguous... Apologies for my stupidity, but could someone point out what I'm missing? Is there some way of overcoming this ambiguity when using the JvmFormalParameter rule in my grammar?
The rule JvmFormalParameter is defined as
JvmFormalParameter returns types::JvmFormalParameter:
(parameterType=JvmTypeReference)? name=ValidID;
so the type of the parameter is optional. If you use elements+=JvmFormalParameter+, you allow multiple parameters without any delimiter, so the parser cannot decide how to interpret the input sequence
String s
since String and s could be the names of two separate parameters, or String s could be a single parameter with the type String and the name s. You should use a delimiter like
elements+=JvmFormalParameter (',' elements+=JvmFormalParameter)*
or use the rule FullJvmFormalParameter which is defined with a mandatory type reference:
FullJvmFormalParameter returns types::JvmFormalParameter:
parameterType=JvmTypeReference name=ValidID;
I'm trying to model the EBNF expression
("declare" "namespace" ";")* ("declare" "variable" ";")*
I have built up the yacc grammar (I'm using MPPG), which seems to represent this, but it fails to match my test expression.
The test case I'm trying to match is
declare variable;
The Token stream from the lexer is
KW_Declare
KW_Variable
Separator
The parser generator reports a "Shift/Reduce conflict, state 6 on KW_Declare". I have attempted to solve this with "%left PrologHeaderList PrologBodyList", but without success.
Program : Prolog;
Prolog : PrologHeaderList PrologBodyList;
PrologHeaderList : /*EMPTY*/
| PrologHeaderList PrologHeader;
PrologHeader : KW_Declare KW_Namespace Separator;
PrologBodyList : /*EMPTY*/
| PrologBodyList PrologBody;
PrologBody : KW_Declare KW_Variable Separator;
KW_Declare, KW_Namespace, KW_Variable, and Separator are all tokens with the values "declare", "namespace", "variable", and ";".
It's been a long time since I've used anything yacc-like, but here are a couple of suggestions that may or may not help.
It seems that you need a 2-token lookahead in this situation. The parser gets to the last PrologHeader, and it has to decide whether the next construct is a PrologHeader or a PrologBody, and it can't tell that from the KW_Declare. If there's a directive to increase lookahead in this situation, it will probably solve the problem.
You could also introduce context into your actions: rather than define PrologHeaderList and PrologBodyList, define PrologRuleList and have the actions throw an error if a header appears after a body. Ugly, but sometimes you have to do it: what appears simple in a grammar may not be simple in the generated parser.
A hackish approach might be to combine the tokens: rather than KW_Declare and KW_Variable, have your lexer recognize the space and use KW_Declare_Variable. Since both are keywords, you're not going to run into namespace collision problems.
The grammar at the top is regular, so IIRC you can plot it out as a DFA (or an NFA and then convert it to a DFA) and then convert the DFA back to a grammar. It's been a while, so I'll leave the work as an exercise for the reader.