ANTLR v4 grammar to recognize jQuery blocks in Java code - parsing

I am having a hard time trying to implement a grammar to parse jQuery blocks in between java code.
I do not need to implement a java grammar. This is going to be a translator. I just need to output the java as it is and translate jQuery to java...
jQuery blocks are surrounded by the following tokens: /*#jQ ... */. There can be multiple blocks, but nesting is not allowed. Here is an example:
package test;
public class Test {
public static void main(String[] args) {
System.out.println("Hello world!");
/*#jQ
*/
System.out.println("Good bye world!");
}
}
The desired output of the translator, for this particular case, would be:
package test;
public class Test {
public static void main(String[] args) {
System.out.println("Hello world!");
System.out.println("Good bye world!");
}
}
The problem is I am not being able to read java until a /*#jQ is found. Here is an excerpt of what I have so far:
main
:
java
(
jQueryBlock+ java
)*
;
java
:
.*?
;
jQueryBlock
:
JQUERYBLOCKSTART
(
jQueryStatement SINGLE_LINE_COMMENT?
)* JQUERYBLOCKEND
;
and...
JQUERYBLOCKSTART
:
'/*#jQ'
;
Although the generated parse tree is somewhat acceptable (see below), I get several token recognition error...
JjQuery::main:3:22: token recognition error at: '{'
JjQuery::main:5:44: token recognition error at: '{'
JjQuery::main:6:12: token recognition error at: '.'
JjQuery::main:6:16: token recognition error at: '.'
JjQuery::main:6:37: token recognition error at: '!"'
JjQuery::main:12:12: token recognition error at: '.'
JjQuery::main:12:16: token recognition error at: '.'
JjQuery::main:12:40: token recognition error at: '!"'
JjQuery::main:13:5: token recognition error at: '}'
JjQuery::main:15:4: token recognition error at: '}'
Thanks in advance!
UPDATE
I have modified my grammar as suggested, but I'm still having some problems. Here is an example input, the generated parse tree, and below it the errors thrown.
warning(155): Lexer.g4:22:28: rule SINGLE_LINE_COMMENT contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Lexer.g4:28:25: rule WS contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
Parser::src:1:3: extraneous input '\n\n' expecting {<EOF>, '/*#jQ', JAVA}
Parser::src:3:5: token recognition error at: '\n'
Parser::src:4:0: token recognition error at: '\n'
Parser::src:5:2: token recognition error at: ' '
Parser::src:5:8: token recognition error at: '\n'
Parser::src:6:0: token recognition error at: '\n'
Parser::src:7:2: extraneous input '\n\n' expecting {<EOF>, '/*#jQ', JAVA}
Here is the current Lexer.g4:
lexer grammar Lexer;
#lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
// Default mode rules (the SEA)
JQBegin
:
'/*#jQ' -> pushMode ( JQUERY )
;
JAVA
:
.
;
WS
:
[ \t\r\n]+ -> channel ( WHITESPACE ) // channel(1)
;
SINGLE_LINE_COMMENT
:
'//' .*? '\n' -> channel ( COMMENTS ) // channel(2)
;
mode JQUERY;
JQEnd
:
'*/' -> popMode
;
IN
:
'in'
;
OUT
:
'out'
;
ID
:
[a-zA-Z_] [a-zA-Z0-9_]*
;
SEMICOLON
:
';'
;
And the Parser.g4:
parser grammar Parser;
options {
tokenVocab = Lexer;
} // use tokens from ModeTagsLexer.g4
src
:
(
JAVA
| jQuery
)+ EOF
;
jQuery
:
JQBegin
(
in
| out
)* JQEnd
;
in
:
IN ID SEMICOLON
;
out
:
OUT ID SEMICOLON
;

Use lexical modes to separately handle JQuery and Java blocks (even though the Java blocks are trivial in your case). Note, lexer modes are only available in Lexer grammars and not in combined grammars.
Also, the Java catchall must match a single character at a time. Otherwise it can consume the JQuery begin sequence (this is likely the source of the errors you are seeing).
main: ( JAVA | jqBlock )+ EOF ;
jqBlock: JQBegin
( ... | ... | ... ) // your JQuery rules
JQEnd
;
JQBegin: '/*#jQ' -> pushMode(JQ) ;
JAVA : . ;
mode JQ;
... // your JQuery specific rules
BlockComment : '/*' .*? '*/' ; // handle any possibly ambiguous
// sequences that otherwise might
// cause early exits
JQEnd: '*/' -> popMode() ;

Related

Antlr4: How can I both hide and use Tokens in a grammar

I'm parsing a script language that defines two types of statements; control statements and non control statements. Non control statements are always ended with ';', while control statements may end with ';' or EOL ('\n'). A part of the grammar looks like this:
script
: statement* EOF
;
statement
: control_statement
| no_control_statement
;
control_statement
: if_then_control_statement
;
if_then_control_statement
: IF expression THEN end_control_statment
( statement ) *
( ELSEIF expression THEN end_control_statment ( statement )* )*
( ELSE end_control_statment ( statement )* )?
END IF end_control_statment
;
no_control_statement
: sleep_statement
;
sleep_statement
: SLEEP expression END_STATEMENT
;
end_control_statment
: END_STATEMENT
| EOL
;
END_STATEMENT
: ';'
;
ANY_SPACE
: ( LINE_SPACE | EOL ) -> channel(HIDDEN)
;
EOL
: [\n\r]+
;
LINE_SPACE
: [ \t]+
;
In all other aspects of the script language, I never care about EOL so I use the normal lexer rules to hide white space.
This works fine in all cases but the cases where I need to use a EOL to find a termination of a control statement, but with the grammar above, all EOL is hidden and not used in the control statement rules.
Is there a way to change my grammar so that I can skip all EOL but the ones needed to terminate parts of my control statements?
Found one way to handle this.
The idea is to divert EOL into one hidden channel and the other stuff I donĀ“t want to see in another hidden channel (like spaces and comments). Then I use some code to backtrack the tokens when an EOL is supposed to show up and examine the previous tokens channels (since they already have been consumed). If I find something on EOL channel before I run into something from the ordinary channel, then it is ok.
It looks like this:
Changed the lexer rules:
#lexer::members {
public static int EOL_CHANNEL = 1;
public static int OTHER_CHANNEL = 2;
}
...
EOL
: '\r'? '\n' -> channel(EOL_CHANNEL)
;
LINE_SPACE
: [ \t]+ -> channel(OTHER_CHANNEL)
;
I also diverted all other HIDDEN channels (comments) to the OTHER_CHANNEL.
Then I changed the rule end_control_statment:
end_control_statment
: END_STATEMENT
| { isEOLPrevious() }?
;
and added
#parser::members {
public static int EOL_CHANNEL = 1;
public static int OTHER_CHANNEL = 2;
boolean isEOLPrevious()
{
int idx = getCurrentToken().getTokenIndex();
int ch;
do
{
ch = getTokenStream().get(--idx).getChannel();
}
while (ch == OTHER_CHANNEL);
// Channel 1 is only carrying EOL, no need to check token itself
return (ch == EOL_CHANNEL);
}
}
One could stick to the ordinary hidden channel but then there is a need to both track channel and tokens while backtracking so this is maybe a bit easier...
Hope this could help someone else dealing with these kind of issues...

Antlr 4: Method for switching modes in parser

I'm trying to build a MVS JCL recognizer using Antlr4. The general endeavour is going reasonably well, but I am having trouble handling the MVS equivalent of *nix "here docs" (inline files). I cannot use lexer modes to flip-flop between JCL and here-doc content, so I am looking for alternatives that I might use a parser level.
IBM MVS allows the use of "instream datasets", similar to *nix here-docs.
Example:
This defines a three-line inline file, terminated by the characters "ZZ" and accessible to a referencing program using the label "ANYNAME":
//ANYNAME DD *,SYMBOLS=(JCLONLY,FILEREF),DLM=ZZ
HEREDOC TEXT 1
HEREDOC TEXT 2
HEREDOC TEXT 3
ZZ
//NEXTFILE DD ...stuff...
ANYNAME is a handle by which a program can access the here-doc content.
DD * is mandatory and informs MVS that a here-doc follows.
SYMBOLS=(JCLONLY,FILEREF) is optional detail relating to how the here-doc is handled.
DLM=ZZ is also optional and defines the here-doc terminator (default terminator = /*).
I need to be able, at parser level, to process the //ANYNAME... line (I have that bit), then to read the here-doc content until I find the (possibly non-default) here-doc terminator. In a sense, this looks like a lexer modes opportunity- but at this point I am working within the parser and I do not have a fixed terminator to work with.
I need guidance on how to switch modes to handle my here-doc, then switch back again to continue processing my JCL.
A hugely abridged version of my grammar follows (the actual grammar, so far, is about 2,200 lines and is incomplete).
Thanks for any insights. I appreciate your help, comments and suggestions.
/* the ddstmt parser rule should be considered the main entry point. It handles (at least):
//ANYNAME DD *,SYMBOLS=(JCLONLY,FILEREF),DLM=ZZ
and // DD *,DLM=ZZ
and //ANYNAME DD *,SYMBOLS=EXECSYS
and //ANYNAME DD *
I need to be able process the above line as JCL then read the here-doc content...
"HEREDOC TEXT 1"
"HEREDOC TEXT 2"
"HEREDOC TEXT 3"
as either a single token or a series of tokens, then, after reading the here-doc
delimiter...
"ZZ"
, go back to processing regular JCL again.
*/
/* lexer rules: */
LINECOMMENT3 : SLASH SLASH STAR ;
DSLASH : SLASH SLASH ;
INSTREAMTERMINATE : SLASH STAR ;
SLASH : '/' ;
STAR : '*' ;
OPAREN : '(' ;
CPAREN : ')' ;
COMMA : ',' ;
KWDD : 'DD' ;
KWDLM : 'DLM' ;
KWSYMBOLS : 'SYMBOLS' ;
KWDATA : 'DATA' ;
SYMBOLSTARGET : 'JCLONLY'|'EXECSYS'|'CNVTSYS' ;
EQ : '=' ;
APOST : '\'' ;
fragment
SPC : ' ' ;
SPCS : SPC+ ;
NL : ('\r'? '\n') ;
UNQUOTEDTEXT : (APOST APOST|~[=\'\"\r\n\t,/() ])+ ;
/* parser rules: */
label : unquotedtext
;
separator : SPCS
;
/* handle crazy JCL comment rules - start */
partcomment : SPCS partcommenttext NL
;
partcommenttext : ((~NL+?)?)
;
linecomment : LINECOMMENT3 linecommenttext NL
;
linecommenttext : ((~NL+?)?)
;
postcommaeol : ( (partcomment|NL) linecomment* DSLASH SPCS )?
;
poststmteol : ( (partcomment|NL) linecomment* )?
;
/* handle crazy JCL comment rules - end */
ddstmt : DSLASH (label|) separator KWDD separator dddecl
;
dddecl : ...
| ddinstreamdecl
| ...
;
ddinstreamdecl : (STAR|KWDATA) poststmteol ddinstreamopts
;
ddinstreamopts : ( COMMA postcommaeol ddinstreamopt poststmteol )*
;
ddinstreamopt : ( ddinstreamdelim
| symbolsdecl
)
;
ddinstreamdelim : KWDLM EQ unquotedtext
;
symbolsdecl : KWSYMBOLS EQ symbolsdef
;
symbolsdef : OPAREN symbolstarget ( COMMA symbolsloggingdd )? CPAREN
| symbolstarget
;
symbolstarget : SYMBOLSTARGET
;
symbolsloggingdd : unquotedtext
;
unquotedtext : UNQUOTEDTEXT
;
Your lexer needs to be able to tokenize the entire document prior to the beginning of the parsing operation. Any attempt to control the lexer from within the parser is a recipe for endless nightmares down the road. The following fragments of a PHP Lexer show how predicates can be used in combination with lexer modes to detect the end of a string with a user-defined delimiter. The key part is recording the start delimiter, and then checking tokens which start at the beginning of the line against it.
PHP_NOWDOC_START
: '<<<\'' PHP_IDENTIFIER '\'' {_input.La(1) == '\r' || _input.La(1) == '\n'}?
-> pushMode(PhpNowDoc)
;
mode PhpNowDoc;
PhpNowDoc_NEWLINE : NEWLINE -> type(NEWLINE);
PHP_NOWDOC_END
: {_input.La(-1) == '\n'}?
PHP_IDENTIFIER ';'?
{CheckHeredocEnd(_input.La(1), Text);}?
-> popMode
;
PHP_NOWDOC_TEXT
: ~[\r\n]+
;
The identifier is actually recorded in a custom override of NextToken() (shown here for a C# target):
public override IToken NextToken()
{
IToken token = base.NextToken();
switch (token.Type)
{
case PHP_NOWDOC_START:
// <<<'identifier'
_heredocIdentifier = token.Text.Substring(3).Trim('\'');
break;
case PHP_NOWDOC_END:
_heredocIdentifier = null;
break;
default:
break;
}
return token;
}
private bool CheckHeredocEnd(int la1, string text)
{
// identifier
// - or -
// identifier;
bool semi = text[text.Length - 1] == ';';
string identifier = semi ? text.Substring(0, text.Length - 1) : text;
return string.Equals(identifier, HeredocIdentifier, StringComparison.Ordinal);
}

Catching (and keeping) all comments with ANTLR

I'm writing a grammar in ANTLR that parses Java source files into ASTs for later analysis. Unlike other parsers (like JavaDoc) I'm trying to keep all of the comments. This is difficult comments can be used literally anywhere in the code. If a comment is somewhere in the source code that doesn't match the grammar, ANTLR can't finish parsing the file.
Is there a way to make ANTLR automatically add any comments it finds to the AST? I know the lexer can simply ignore all of the comments using either {skip();} or by sending the text to the hidden channel. With either of those options set, ANTLR parses the file without any problems at all.
Any ideas are welcome.
Section 12.1 in "The Definitive Antlr 4 Reference" shows how to get access to comments without having to sprinkle the comments rules throughout the grammar. In short you add this to the grammar file:
grammar Java;
#lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
Then for your comments rules do this:
COMMENT
: '/*' .*? '*/' -> channel(COMMENTS)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(COMMENTS)
;
Then in your code ask for the tokens through the getHiddenTokensToLeft/getHiddenTokensToRight and look at the 12.1 section in the book and you will see how to do this.
first: direct all comments to a certain channel (only comments)
COMMENT
: '/*' .*? '*/' -> channel(2)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2)
;
second: print out all comments
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (int index = 0; index < tokens.size(); index++)
{
Token token = tokens.get(index);
// substitute whatever parser you have
if (token.getType() != Parser.WS)
{
String out = "";
// Comments will be printed as channel 2 (configured in .g4 grammar file)
out += "Channel: " + token.getChannel();
out += " Type: " + token.getType();
out += " Hidden: ";
List<Token> hiddenTokensToLeft = tokens.getHiddenTokensToLeft(index);
for (int i = 0; hiddenTokensToLeft != null && i < hiddenTokensToLeft.size(); i++)
{
if (hiddenTokensToLeft.get(i).getType() != IDLParser.WS)
{
out += "\n\t" + i + ":";
out += "\n\tChannel: " + hiddenTokensToLeft.get(i).getChannel() + " Type: " + hiddenTokensToLeft.get(i).getType();
out += hiddenTokensToLeft.get(i).getText().replaceAll("\\s", "");
}
}
out += token.getText().replaceAll("\\s", "");
System.out.println(out);
}
}
Is there a way to make ANTLR automatically add any comments it finds to the AST?
No, you'll have to sprinkle your entire grammar with extra comments rules to account for all the valid places comments can occur:
...
if_stat
: 'if' comments '(' comments expr comments ')' comments ...
;
...
comments
: (SingleLineComment | MultiLineComment)*
;
SingleLineComment
: '//' ~('\r' | '\n')*
;
MultiLineComment
: '/*' .* '*/'
;
The feature "island grammars" can also be used. See the the following section in the ANTLR4 book:
Island Grammars: Dealing with Different Formats in the Same File
I did that on my lexer part :
WS : ( [ \t\r\n] | COMMENT) -> skip
;
fragment
COMMENT
: '/*'.*'*/' /*single comment*/
| '//'~('\r' | '\n')* /* multiple comment*/
;
Like that it will remove them automatically !
For ANTLR v3:
The whitespace tokens are usually not processed by parser, but they are still captured on the HIDDEN channel.
If you use BufferedTokenStream, you can get to list of all tokens through it and do a postprocessing, adding them as needed.

Antlr whitespace token error

I have the following grammar and I want to match the String "{name1, name2}". I just want lists of names/intergers with at least one element. However I get the error:
line 1:6 no viable alternative at character ' '
line 1:11 no viable alternative at character '}'
line 1:7 mismatched input 'name' expecting SIMPLE_VAR_TYPE
I would expect whitespaces and such are ignored... Also interesting is the error does not occur with input "{name1,name2}" (no space after ',').
Heres my gramar
grammar NusmvInput;
options {
language = Java;
}
#header {
package secltlmc.grammar;
}
#lexer::header {
package secltlmc.grammar;
}
specification :
SIMPLE_VAR_TYPE EOF
;
INTEGER
: ('0'..'9')+
;
SIMPLE_VAR_TYPE
: ('{' (NAME | INTEGER) (',' (NAME | INTEGER))* '}' )
;
NAME
: ('A'..'Z' | 'a'..'z') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$' | '#' | '-')*
;
WS
: (' ' | '\t' | '\n' | '\r')+ {$channel = HIDDEN;}
;
And this is my testing code
package secltlmc;
public class Main {
public static void main(String[] args) throws
IOException, RecognitionException {
CharStream stream = new ANTLRStringStream("{name1, name2}");
NusmvInputLexer lexer = new NusmvInputLexer(stream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
NusmvInputParser parser = new NusmvInputParser(tokenStream);
parser.specification();
}
}
Thanks for your help.
The problem is that you are trying to parse SIMPLE_VAR_TYPE with the lexer, i.e. you are trying to make it a single token. In reality, it looks like you want a multi-token production, since you'd like whitespace to be re-directed to hidden channel through WS.
You should change SIMPLE_VAR_TYPE from a lexer rule to a parser rule by changing its initial letter (or better yet, the entire name) to lower case.
specification :
simple_var_type EOF
;
simple_var_type
: ('{' (NAME | INTEGER) (',' (NAME | INTEGER))* '}' )
;
The defintion of SIMPLE_VAR_TYPE specifies the following expression:
Open {
followed by one of NAME or INTEGER
follwoed by zero or more of:
comma (,) followed by one of NAME or INTEGER
followed by closing }
Nowhere does it allow white-space in the input (neither NAME nor INTEGER allows it either), so you get an error when you supply one
Try:
SIMPLE_VAR_TYPE
: ('{' (NAME | INTEGER) (WS* ',' WS* (NAME | INTEGER))* '}' )
;

ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting

I am trying to preprocess my C++ source files by ANTLR. I would like to output an input file preserving all the whitespace formatting of the original source file while inserting some new source codes of my own at the appropriate locations.
I know preserving WS requires this lexer rule:
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
With this my parser rules would have a $text attribute containing all the hidden WS. But the problem is, for any parser rule, its $text attribute only include those input text starting from the position that matches the first token of the rule. For example, if this is my input (note the formatting WS before and in between the tokens):
line 1; line 2;
And, if I have 2 separate parser rules matching
"line 1;"
and
"line 2;"
above separately but not the whole line:
" line 1; line 2;"
, then the leading WS and those WS in between "line 1" and "line 2" are lost (not accessible by any of my rules).
What should I do to preserve ALL THE WHITESPACEs while allowing my parser rules to determine when to add new codes at the appropriate locations?
EDIT
Let's say whenever my code contains a call to function(1) using 1 as the parameter but not something else, it adds an extraFunction() before it:
void myFunction() {
function();
function(1);
}
Becomes:
void myFunction() {
function();
extraFunction();
function(1);
}
This preprocessed output should remain human readable as people would continue coding on it. For this simple example, text editor can handle it. But there are more complicated cases that justify the use of ANTLR.
Another solution, but maybe also not very practical (?): You can collect all Whitespaces backwards, something like this untested pseudocode:
grammar T;
#members {
public printWhitespaceBetweenRules(Token start) {
int index = start.getTokenIndex() - 1;
while(index >= 0) {
Token token = input.get(index);
if(token.getChannel() != Token.HIDDEN_CHANNEL) break;
System.out.print(token.getText());
index--;
}
}
}
line1: 'line' '1' {printWhitespaceBetweenRules($start); };
line2: 'line' '2' {printWhitespaceBetweenRules($start); };
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
But you would still need to change every rule.
I guess one solution is to keep the WS tokens in the same channel by removing the $channel = HIDDEN;. This will allow you to get access to the information of a WS token in your parser.
Here's another way to solve it (at least the example you posted).
So you want to replace ...function(1) with ...extraFunction();\nfunction(1), where the dots are indents, and \n a line break.
What you could do is match:
Function1
: Spaces 'function' Spaces '(' Spaces '1' Spaces ')'
;
fragment Spaces
: (' ' | '\t')*
;
and replace that with the text it matches, but pre-pended with your extra method. However, the lexer will now complain when it stumbles upon input like:
'function()'
(without the 1 as a parameter)
or:
' x...'
(indents not followed by the f from function)
So, you'll need to "branch out" in your Function1 rule and make sure you only replace the proper occurrence.
You also must take care of occurrences of function(1) inside string literals and comments, assuming you don't want them to be pre-pended with extraFunction();\n.
A little demo:
grammar T;
parse
: (t=. {System.out.print($t.text);})* EOF
;
Function1
: indent=Spaces
( 'function' Spaces '(' Spaces ( '1' Spaces ')' {setText($indent.text + "extraFunction();\n" + $text);}
| ~'1' // do nothing if something other than `1` occurs
)
| '"' ~('"' | '\r' | '\n')* '"' // do nothing in case of a string literal
| '/*' .* '*/' // do nothing in case of a multi-line comment
| '//' ~('\r' | '\n')* // do nothing in case of a single-line comment
| ~'f' // do nothing in case of a char other than 'f' is seen
)
;
OtherChar
: . // a "fall-through" rule: it will match anything if none of the above matched
;
fragment Spaces
: (' ' | '\t')* // fragment rules are only used inside other lexer rules
;
You can test it with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"/* \n" +
" function(1) \n" +
"*/ \n" +
"void myFunction() { \n" +
" s = \"function(1)\"; \n" +
" function(); \n" +
" function(1); \n" +
"} \n";
System.out.println(source);
System.out.println("---------------------------------");
TLexer lexer = new TLexer(new ANTLRStringStream(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
And if you run this Main class, you will see the following being printed to the console:
bart#hades:~/Programming/ANTLR/Demos/T$ java -cp antlr-3.3.jar org.antlr.Tool T.g
bart#hades:~/Programming/ANTLR/Demos/T$ javac -cp antlr-3.3.jar *.java
bart#hades:~/Programming/ANTLR/Demos/T$ java -cp .:antlr-3.3.jar Main
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
function(1);
}
---------------------------------
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
extraFunction();
function(1);
}
I'm sure it's not fool-proof (I did't account for char-literals, for one), but this could be a start to solve this, IMO.

Resources