I'm trying to build an MVS JCL recognizer using ANTLR4. The general endeavour is going reasonably well, but I am having trouble handling the MVS equivalent of *nix "here docs" (inline files). I cannot use lexer modes to flip-flop between JCL and here-doc content, so I am looking for alternatives that I might use at parser level.
IBM MVS allows the use of "instream datasets", similar to *nix here-docs.
Example:
This defines a three-line inline file, terminated by the characters "ZZ" and accessible to a referencing program using the label "ANYNAME":
//ANYNAME DD *,SYMBOLS=(JCLONLY,FILEREF),DLM=ZZ
HEREDOC TEXT 1
HEREDOC TEXT 2
HEREDOC TEXT 3
ZZ
//NEXTFILE DD ...stuff...
ANYNAME is a handle by which a program can access the here-doc content.
DD * is mandatory and informs MVS that a here-doc follows.
SYMBOLS=(JCLONLY,FILEREF) is optional detail relating to how the here-doc is handled.
DLM=ZZ is also optional and defines the here-doc terminator (default terminator = /*).
I need to be able, at parser level, to process the //ANYNAME... line (I have that bit), then to read the here-doc content until I find the (possibly non-default) here-doc terminator. In a sense, this looks like a lexer-modes opportunity, but at this point I am working within the parser and I do not have a fixed terminator to work with.
I need guidance on how to switch modes to handle my here-doc, then switch back again to continue processing my JCL.
A hugely abridged version of my grammar follows (the actual grammar, so far, is about 2,200 lines and is incomplete).
Thanks for any insights. I appreciate your help, comments and suggestions.
/* the ddstmt parser rule should be considered the main entry point. It handles (at least):
//ANYNAME DD *,SYMBOLS=(JCLONLY,FILEREF),DLM=ZZ
and // DD *,DLM=ZZ
and //ANYNAME DD *,SYMBOLS=EXECSYS
and //ANYNAME DD *
I need to be able process the above line as JCL then read the here-doc content...
"HEREDOC TEXT 1"
"HEREDOC TEXT 2"
"HEREDOC TEXT 3"
as either a single token or a series of tokens, then, after reading the here-doc
delimiter...
"ZZ"
, go back to processing regular JCL again.
*/
/* lexer rules: */
LINECOMMENT3 : SLASH SLASH STAR ;
DSLASH : SLASH SLASH ;
INSTREAMTERMINATE : SLASH STAR ;
SLASH : '/' ;
STAR : '*' ;
OPAREN : '(' ;
CPAREN : ')' ;
COMMA : ',' ;
KWDD : 'DD' ;
KWDLM : 'DLM' ;
KWSYMBOLS : 'SYMBOLS' ;
KWDATA : 'DATA' ;
SYMBOLSTARGET : 'JCLONLY'|'EXECSYS'|'CNVTSYS' ;
EQ : '=' ;
APOST : '\'' ;
fragment
SPC : ' ' ;
SPCS : SPC+ ;
NL : ('\r'? '\n') ;
UNQUOTEDTEXT : (APOST APOST|~[=\'\"\r\n\t,/() ])+ ;
/* parser rules: */
label : unquotedtext
;
separator : SPCS
;
/* handle crazy JCL comment rules - start */
partcomment : SPCS partcommenttext NL
;
partcommenttext : ((~NL+?)?)
;
linecomment : LINECOMMENT3 linecommenttext NL
;
linecommenttext : ((~NL+?)?)
;
postcommaeol : ( (partcomment|NL) linecomment* DSLASH SPCS )?
;
poststmteol : ( (partcomment|NL) linecomment* )?
;
/* handle crazy JCL comment rules - end */
ddstmt : DSLASH (label|) separator KWDD separator dddecl
;
dddecl : ...
| ddinstreamdecl
| ...
;
ddinstreamdecl : (STAR|KWDATA) poststmteol ddinstreamopts
;
ddinstreamopts : ( COMMA postcommaeol ddinstreamopt poststmteol )*
;
ddinstreamopt : ( ddinstreamdelim
| symbolsdecl
)
;
ddinstreamdelim : KWDLM EQ unquotedtext
;
symbolsdecl : KWSYMBOLS EQ symbolsdef
;
symbolsdef : OPAREN symbolstarget ( COMMA symbolsloggingdd )? CPAREN
| symbolstarget
;
symbolstarget : SYMBOLSTARGET
;
symbolsloggingdd : unquotedtext
;
unquotedtext : UNQUOTEDTEXT
;
Your lexer needs to be able to tokenize the entire document prior to the beginning of the parsing operation. Any attempt to control the lexer from within the parser is a recipe for endless nightmares down the road. The following fragments of a PHP Lexer show how predicates can be used in combination with lexer modes to detect the end of a string with a user-defined delimiter. The key part is recording the start delimiter, and then checking tokens which start at the beginning of the line against it.
PHP_NOWDOC_START
: '<<<\'' PHP_IDENTIFIER '\'' {_input.La(1) == '\r' || _input.La(1) == '\n'}?
-> pushMode(PhpNowDoc)
;
mode PhpNowDoc;
PhpNowDoc_NEWLINE : NEWLINE -> type(NEWLINE);
PHP_NOWDOC_END
: {_input.La(-1) == '\n'}?
PHP_IDENTIFIER ';'?
    {CheckHeredocEnd(_input.La(1), Text)}?
-> popMode
;
PHP_NOWDOC_TEXT
: ~[\r\n]+
;
The identifier is actually recorded in a custom override of NextToken() (shown here for a C# target):
public override IToken NextToken()
{
IToken token = base.NextToken();
switch (token.Type)
{
case PHP_NOWDOC_START:
// <<<'identifier'
_heredocIdentifier = token.Text.Substring(3).Trim('\'');
break;
case PHP_NOWDOC_END:
_heredocIdentifier = null;
break;
default:
break;
}
return token;
}
private bool CheckHeredocEnd(int la1, string text)
{
// identifier
// - or -
// identifier;
bool semi = text[text.Length - 1] == ';';
string identifier = semi ? text.Substring(0, text.Length - 1) : text;
    return string.Equals(identifier, _heredocIdentifier, StringComparison.Ordinal);
}
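The same delimiter-recording idea maps directly onto the JCL case: when a DD * (or DD DATA) statement is recognized, record the DLM= value (defaulting to /*), then consume raw lines until a line starting with that delimiter appears. As a minimal sketch of the control flow, outside ANTLR and with illustrative regexes that only cover the forms shown above (quoted DLM values, continuations, etc. are ignored):

```python
import re

def split_jcl(text):
    """Split JCL into statement lines and instream (here-doc) blocks.

    Mirrors the technique above: when a 'DD *' / 'DD DATA' statement is
    seen, record the (possibly user-defined) DLM= delimiter, then consume
    raw lines until a line beginning with that delimiter appears.
    """
    out = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        out.append(("JCL", line))
        i += 1
        m = re.match(r"//\S*\s+DD\s+(\*|DATA)([,\s].*|$)", line)
        if m:
            dm = re.search(r"DLM=(\S{1,2})", m.group(2))
            delim = dm.group(1) if dm else "/*"   # /* is the default terminator
            body = []
            while i < len(lines) and not lines[i].startswith(delim):
                body.append(lines[i])
                i += 1
            out.append(("INSTREAM", body))
            if i < len(lines):                    # the delimiter line itself
                out.append(("DELIM", lines[i]))
                i += 1
    return out
```

In an ANTLR lexer this corresponds to a member field holding the recorded delimiter, a pushMode on the DD * statement's end-of-line, and a predicate comparing each line against the field before popping back to JCL mode.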
I have an Xtext grammar which consists of one declaration per line. When I format the code, all the declarations end up on the same line and the line breaks are removed.
As I didn't manage to change the grammar to require line breaks, I would like to disable the removal of line breaks. How do I do that? Bonus points if someone can tell me how to require line breaks at the end of each declaration.
Part of the Grammar:
grammar com.example.Msg with org.eclipse.xtext.common.Terminals
hidden(WS, SL_COMMENT)
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate msg_idl "http://www.example.com/ex/ample/msg"
Model:
MsgDef
;
MsgDef:
(definitions+=definition)+
;
definition:
type=fieldType ' '+ name=ValidID (' '* '=' ' '* const=Value)?
;
fieldType:
value = ( builtinType | header)
;
builtinType:
BOOL = "bool"
| INT32 = "int32"
| CHAR = "char"
;
header:
value="Header"
;
Bool_l:
target=BOOL_E
;
String_l:
target = ('""'|STRING)
;
Number_l:
Double_l | Integer_l | NegInteger_l
;
NegInteger_l:
target=NEG_INT
;
Integer_l :
target=INT
;
Double_l:
target=DOUBLE
;
terminal NEG_INT returns ecore::EInt:
'-' INT
;
terminal DOUBLE returns ecore::EDouble :
('-')? ('0'..'9')* ('.' INT) |
('-')? INT ('.') |
('-')? INT ('.' ('0'..'9')*)? (('e'|'E')('-'|'+')? INT )|
'nan' | 'inf' | '-inf'
;
enum BOOL_E :
true | false
;
ValidID:
"bool"
| "string"
| "time"
| "duration"
| "char"
| ID ;
Value:
String_l | Number_l
;
terminal SL_COMMENT :
' '* '#' !('\n'|'\r')* ('\r'? '\n')?
;
Example data
string left
string top
string right
string bottom
I already tried:
class MsgFormatter extends AbstractDeclarativeFormatter {
extension MsgGrammarAccess msgGrammarAccess = grammarAccess as MsgGrammarAccess
override protected void configureFormatting(FormattingConfig c) {
c.setLinewrap(0, 1, 2).before(SL_COMMENTRule)
c.setLinewrap(0, 1, 2).before(ML_COMMENTRule)
c.setLinewrap(0, 1, 1).after(ML_COMMENTRule)
c.setLinewrap().before(definitionRule); // does not work
c.setLinewrap(1,1,2).before(definitionRule); // does not work
c.setLinewrap().before(fieldTypeRule); // does not work
}
}
In general it is a bad idea to encode whitespace into the language itself. Most of the time it is better to write the language in a way that you can use all kinds of whitespaces (blanks, tabs, newlines ...) to separate tokens.
You should implement a custom formatter for your language that inserts the line breaks after each statement. Xtext comes with two formatter APIs (an old one and a new one starting with Xtext 2.8). I propose to use the new one.
Here you extend AbstractFormatter2 and implement the format methods.
You can find a bit of information in the online manual: https://www.eclipse.org/Xtext/documentation/303_runtime_concepts.html#formatting
There is some more explanation in the following blog post: https://blogs.itemis.com/en/tabular-formatting-with-the-new-formatter-api
Some technical background: https://de.slideshare.net/meysholdt/xtexts-new-formatter-api
I'm parsing a script language that defines two types of statements: control statements and non-control statements. Non-control statements always end with ';', while control statements may end with ';' or EOL ('\n'). A part of the grammar looks like this:
script
: statement* EOF
;
statement
: control_statement
| no_control_statement
;
control_statement
: if_then_control_statement
;
if_then_control_statement
: IF expression THEN end_control_statment
( statement ) *
( ELSEIF expression THEN end_control_statment ( statement )* )*
( ELSE end_control_statment ( statement )* )?
END IF end_control_statment
;
no_control_statement
: sleep_statement
;
sleep_statement
: SLEEP expression END_STATEMENT
;
end_control_statment
: END_STATEMENT
| EOL
;
END_STATEMENT
: ';'
;
ANY_SPACE
: ( LINE_SPACE | EOL ) -> channel(HIDDEN)
;
EOL
: [\n\r]+
;
LINE_SPACE
: [ \t]+
;
In all other aspects of the script language, I never care about EOL so I use the normal lexer rules to hide white space.
This works fine in all cases except those where I need an EOL to find the termination of a control statement; with the grammar above, all EOLs are hidden and not available to the control statement rules.
Is there a way to change my grammar so that I can skip all EOL but the ones needed to terminate parts of my control statements?
Found one way to handle this.
The idea is to divert EOL into one hidden channel and the other stuff I don't want to see (like spaces and comments) into another hidden channel. Then I use some code to backtrack over the tokens when an EOL is supposed to show up and examine the previous tokens' channels (since they have already been consumed). If I find something on the EOL channel before I run into something from the ordinary channel, then it is OK.
It looks like this:
Changed the lexer rules:
@lexer::members {
public static int EOL_CHANNEL = 1;
public static int OTHER_CHANNEL = 2;
}
...
EOL
: '\r'? '\n' -> channel(EOL_CHANNEL)
;
LINE_SPACE
: [ \t]+ -> channel(OTHER_CHANNEL)
;
I also diverted all other tokens that were previously on the HIDDEN channel (comments) to the OTHER_CHANNEL.
Then I changed the rule end_control_statment:
end_control_statment
: END_STATEMENT
| { isEOLPrevious() }?
;
and added
@parser::members {
public static int EOL_CHANNEL = 1;
public static int OTHER_CHANNEL = 2;
boolean isEOLPrevious()
{
int idx = getCurrentToken().getTokenIndex();
int ch;
do
{
ch = getTokenStream().get(--idx).getChannel();
}
while (ch == OTHER_CHANNEL);
// Channel 1 is only carrying EOL, no need to check token itself
return (ch == EOL_CHANNEL);
}
}
One could stick to the ordinary hidden channel, but then there is a need to track both channels and token types while backtracking, so this is maybe a bit easier...
Hope this could help someone else dealing with these kinds of issues...
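Stripped of the ANTLR plumbing, the backtracking check is just a backwards scan over the already-consumed token stream. A small sketch of that logic, with tokens modelled as (text, channel) pairs and the same illustrative channel constants as above:

```python
DEFAULT_CHANNEL = 0
EOL_CHANNEL = 1    # carries only EOL tokens
OTHER_CHANNEL = 2  # spaces, comments

def is_eol_previous(tokens, current_index):
    """Walk backwards from the current token, skipping anything on
    OTHER_CHANNEL; report whether the first non-skipped token was an EOL."""
    idx = current_index
    while True:
        idx -= 1
        channel = tokens[idx][1]
        if channel != OTHER_CHANNEL:
            break
    # EOL_CHANNEL carries only EOL tokens, so the channel alone decides
    return channel == EOL_CHANNEL
```

This is exactly what the isEOLPrevious() parser member does against getTokenStream(): skip OTHER_CHANNEL tokens and test whether the first interesting token behind the cursor sits on the EOL channel.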
I am having a hard time trying to implement a grammar to parse jQuery blocks in between java code.
I do not need to implement a Java grammar. This is going to be a translator. I just need to output the Java as-is and translate the jQuery to Java...
jQuery blocks are surrounded by the following tokens: /*#jQ ... */. There can be multiple blocks, but nesting is not allowed. Here is an example:
package test;
public class Test {
public static void main(String[] args) {
System.out.println("Hello world!");
/*#jQ
*/
System.out.println("Good bye world!");
}
}
The desired output of the translator, for this particular case, would be:
package test;
public class Test {
public static void main(String[] args) {
System.out.println("Hello world!");
System.out.println("Good bye world!");
}
}
The problem is that I am not able to read Java until a /*#jQ is found. Here is an excerpt of what I have so far:
main
:
java
(
jQueryBlock+ java
)*
;
java
:
.*?
;
jQueryBlock
:
JQUERYBLOCKSTART
(
jQueryStatement SINGLE_LINE_COMMENT?
)* JQUERYBLOCKEND
;
and...
JQUERYBLOCKSTART
:
'/*#jQ'
;
Although the generated parse tree is somewhat acceptable (see below), I get several token recognition errors...
JjQuery::main:3:22: token recognition error at: '{'
JjQuery::main:5:44: token recognition error at: '{'
JjQuery::main:6:12: token recognition error at: '.'
JjQuery::main:6:16: token recognition error at: '.'
JjQuery::main:6:37: token recognition error at: '!"'
JjQuery::main:12:12: token recognition error at: '.'
JjQuery::main:12:16: token recognition error at: '.'
JjQuery::main:12:40: token recognition error at: '!"'
JjQuery::main:13:5: token recognition error at: '}'
JjQuery::main:15:4: token recognition error at: '}'
Thanks in advance!
UPDATE
I have modified my grammar as suggested, but I'm still having some problems. Here is an example input, the generated parse tree, and below it the errors thrown.
warning(155): Lexer.g4:22:28: rule SINGLE_LINE_COMMENT contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Lexer.g4:28:25: rule WS contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
Parser::src:1:3: extraneous input '\n\n' expecting {<EOF>, '/*#jQ', JAVA}
Parser::src:3:5: token recognition error at: '\n'
Parser::src:4:0: token recognition error at: '\n'
Parser::src:5:2: token recognition error at: ' '
Parser::src:5:8: token recognition error at: '\n'
Parser::src:6:0: token recognition error at: '\n'
Parser::src:7:2: extraneous input '\n\n' expecting {<EOF>, '/*#jQ', JAVA}
Here is the current Lexer.g4:
lexer grammar Lexer;
@lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
// Default mode rules (the SEA)
JQBegin
:
'/*#jQ' -> pushMode ( JQUERY )
;
JAVA
:
.
;
WS
:
[ \t\r\n]+ -> channel ( WHITESPACE ) // channel(1)
;
SINGLE_LINE_COMMENT
:
'//' .*? '\n' -> channel ( COMMENTS ) // channel(2)
;
mode JQUERY;
JQEnd
:
'*/' -> popMode
;
IN
:
'in'
;
OUT
:
'out'
;
ID
:
[a-zA-Z_] [a-zA-Z0-9_]*
;
SEMICOLON
:
';'
;
And the Parser.g4:
parser grammar Parser;
options {
tokenVocab = Lexer;
} // use tokens from Lexer.g4
src
:
(
JAVA
| jQuery
)+ EOF
;
jQuery
:
JQBegin
(
in
| out
)* JQEnd
;
in
:
IN ID SEMICOLON
;
out
:
OUT ID SEMICOLON
;
Use lexical modes to separately handle JQuery and Java blocks (even though the Java blocks are trivial in your case). Note, lexer modes are only available in Lexer grammars and not in combined grammars.
Also, the Java catchall must match a single character at a time. Otherwise it can consume the JQuery begin sequence (this is likely the source of the errors you are seeing).
main: ( JAVA | jqBlock )+ EOF ;
jqBlock: JQBegin
( ... | ... | ... ) // your JQuery rules
JQEnd
;
JQBegin: '/*#jQ' -> pushMode(JQ) ;
JAVA : . ;
mode JQ;
... // your JQuery specific rules
BlockComment : '/*' .*? '*/' ; // handle any possibly ambiguous
// sequences that otherwise might
// cause early exits
JQEnd: '*/' -> popMode ;
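Once the island grammar yields alternating Java and jQuery-block regions, the translator's output shape is straightforward: Java passes through verbatim and each block is rewritten. Since nesting is not allowed, the effect can even be sketched without ANTLR; here the block bodies are simply dropped, standing in for the real jQuery-to-Java rewrite:

```python
import re

# Non-greedy body so one block cannot swallow the start of the next;
# nesting is not allowed, per the problem statement.
JQ_BLOCK = re.compile(r"/\*#jQ.*?\*/\n?", re.DOTALL)

def translate(source, rewrite=lambda block: ""):
    """Replace each /*#jQ ... */ block via `rewrite`; leave the Java untouched."""
    return JQ_BLOCK.sub(lambda m: rewrite(m.group(0)), source)
```

Passing a real rewrite function instead of the default produces the actual translation; the lexer-mode approach above does the same job but gives you a proper token stream for the block contents.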
Following is my antlr 3 grammar. I want to strip off content inside html tags.
The problem arises when I have arithmetic operator < > inside the tag.
How can this be handled?
grammar T;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: text+ ;
text
: (tag)=> tag !
| SPACE !
| outsidetag
;
SPACE
: (' ' | '\t' | '\r' | '\n')+ ;
tag
: OPEN INSIDETAG CLOSE ;
CLOSE : '>' ;
OPEN : '<' ;
INSIDETAG
: ~(CLOSE|OPEN)+ ;
outsidetag
: ~(SPACE) ;
First, you don't need to check for OPEN in your INSIDETAG rule, since there is no harm in skipping it there. In fact, you want it that way. Additionally, combine tag and INSIDETAG and make it greedy so it tries to consume anything until the last CLOSE token, skipping any intermediate ones:
tag: options { greedy = true; }: OPEN ~CLOSE* CLOSE;
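One way to read the rule's effect: everything from an OPEN to the next CLOSE is dropped, while text outside tags (including a '<' that never finds a '>') survives. A small character-level sketch of that behaviour, illustrative only and not the ANTLR 3 machinery:

```python
def strip_tags(text):
    """Drop everything from each '<' to the next '>', keeping the
    text outside tags; an unmatched '<' is kept as plain text."""
    out, i, n = [], 0, len(text)
    while i < n:
        if text[i] == '<':
            end = text.find('>', i + 1)
            if end == -1:              # unmatched '<' (e.g. an operator): keep it
                out.append(text[i:])
                break
            i = end + 1                # skip the whole tag
        else:
            out.append(text[i])
            i += 1
    return ''.join(out)
```

Note this simple interpretation cannot distinguish an arithmetic '<' that is later followed by an unrelated '>'; resolving that ambiguity properly needs context (e.g. only treating '<' as OPEN when followed by a tag name), which is the hard part of the original question.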
I am trying to write an ANTLR grammar for the PHP serialize() format, and everything seems to work fine, except for strings. The problem is that the format of serialized strings is :
s:6:"length";
In terms of regexes, a rule like s:(\d+):".{\1}"; would describe this format if only backreferences were allowed in the "number of matches" count (but they are not).
But I cannot find a way to express this for either a lexer or parser grammar: the whole idea is to make the number of characters read depend on a backreference describing the number of characters to read, as in Fortran Hollerith constants (i.e. 6HLength), not on a string delimiter.
This example from the ANTLR grammar for Fortran seems to point the way, but I don't see how. Note that my target language is Python, while most of the doc and examples are for Java:
// numeral literal
ICON {int counter=0;} :
/* other alternatives */
// hollerith
'h' ({counter>0}? NOTNL {counter--;})* {counter==0}?
{
$setType(HOLLERITH);
String str = $getText;
str = str.replaceFirst("([0-9])+h", "");
$setText(str);
}
/* more alternatives */
;
Since input like s:3:"a"b"; is valid, you can't define a String token in your lexer, unless the first and last double quote are always the start and end of your string. But I guess this is not the case.
So, you'll need a lexer rule like this:
SString
: 's:' Int ':"' ( . )* '";'
;
In other words: match an s:, then an integer value followed by :", then zero or more characters that can be anything, ending with ";. But you need to tell the lexer to stop consuming once the number of characters given by Int has been read. You can do that by mixing some plain code into your grammar; you can embed plain code by wrapping it inside { and }. So first convert the value the Int token holds into an integer variable called chars:
SString
: 's:' Int {chars = int($Int.text)} ':"' ( . )* '";'
;
Now embed some code inside the ( . )* loop to stop it consuming as soon as chars is counted down to zero:
SString
: 's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
;
and that's it.
A little demo grammar:
grammar Test;
options {
language=Python;
}
parse
: (SString {print 'parsed: [\%s]' \% $SString.text})+ EOF
;
SString
: 's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
;
Int
: '0'..'9'+
;
(note that you need to escape the % inside your grammar!)
And a test script:
import antlr3
from TestLexer import TestLexer
from TestParser import TestParser
input = 's:6:"length";s:1:""";s:0:"";s:3:"end";'
char_stream = antlr3.ANTLRStringStream(input)
lexer = TestLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = TestParser(tokens)
parser.parse()
which produces the following output:
parsed: [s:6:"length";]
parsed: [s:1:""";]
parsed: [s:0:"";]
parsed: [s:3:"end";]
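The counting technique is independent of ANTLR: the same length-prefixed scan can be written directly in plain Python, which is handy for cross-checking the grammar's behaviour on inputs like the one above (a sketch; no error recovery beyond a basic format check):

```python
def parse_sstrings(data):
    """Scan s:<len>:"<len chars>"; records, using the length prefix
    (not the quotes) to decide where each string value ends."""
    results, i = [], 0
    while i < len(data):
        if not data.startswith('s:', i):
            raise ValueError('expected "s:" at offset %d' % i)
        colon = data.index(':', i + 2)          # end of the length field
        length = int(data[i + 2:colon])
        start = colon + 2                       # skip ':"'
        value = data[start:start + length]      # exactly <len> chars, quotes and all
        results.append(value)
        i = start + length + 2                  # skip the closing '";'
    return results
```

Like the grammar, this happily accepts embedded double quotes (s:1:""";) and empty strings (s:0:"";), because the prefix, not a delimiter, drives the scan.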