Catching (and keeping) all comments with ANTLR - parsing

I'm writing a grammar in ANTLR that parses Java source files into ASTs for later analysis. Unlike other parsers (like JavaDoc) I'm trying to keep all of the comments. This is difficult comments can be used literally anywhere in the code. If a comment is somewhere in the source code that doesn't match the grammar, ANTLR can't finish parsing the file.
Is there a way to make ANTLR automatically add any comments it finds to the AST? I know the lexer can simply ignore all of the comments using either {skip();} or by sending the text to the hidden channel. With either of those options set, ANTLR parses the file without any problems at all.
Any ideas are welcome.

Section 12.1 in "The Definitive Antlr 4 Reference" shows how to get access to comments without having to sprinkle the comments rules throughout the grammar. In short you add this to the grammar file:
grammar Java;
#lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
Then for your comments rules do this:
COMMENT
: '/*' .*? '*/' -> channel(COMMENTS)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(COMMENTS)
;
Then in your code ask for the tokens through the getHiddenTokensToLeft/getHiddenTokensToRight and look at the 12.1 section in the book and you will see how to do this.

first: direct all comments to a certain channel (only comments)
COMMENT
: '/*' .*? '*/' -> channel(2)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2)
;
second: print out all comments
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (int index = 0; index < tokens.size(); index++)
{
Token token = tokens.get(index);
// substitute whatever parser you have
if (token.getType() != Parser.WS)
{
String out = "";
// Comments will be printed as channel 2 (configured in .g4 grammar file)
out += "Channel: " + token.getChannel();
out += " Type: " + token.getType();
out += " Hidden: ";
List<Token> hiddenTokensToLeft = tokens.getHiddenTokensToLeft(index);
for (int i = 0; hiddenTokensToLeft != null && i < hiddenTokensToLeft.size(); i++)
{
if (hiddenTokensToLeft.get(i).getType() != IDLParser.WS)
{
out += "\n\t" + i + ":";
out += "\n\tChannel: " + hiddenTokensToLeft.get(i).getChannel() + " Type: " + hiddenTokensToLeft.get(i).getType();
out += hiddenTokensToLeft.get(i).getText().replaceAll("\\s", "");
}
}
out += token.getText().replaceAll("\\s", "");
System.out.println(out);
}
}

Is there a way to make ANTLR automatically add any comments it finds to the AST?
No, you'll have to sprinkle your entire grammar with extra comments rules to account for all the valid places comments can occur:
...
if_stat
: 'if' comments '(' comments expr comments ')' comments ...
;
...
comments
: (SingleLineComment | MultiLineComment)*
;
SingleLineComment
: '//' ~('\r' | '\n')*
;
MultiLineComment
: '/*' .* '*/'
;

The feature "island grammars" can also be used. See the the following section in the ANTLR4 book:
Island Grammars: Dealing with Different Formats in the Same File

I did that on my lexer part :
WS : ( [ \t\r\n] | COMMENT) -> skip
;
fragment
COMMENT
: '/*'.*'*/' /*single comment*/
| '//'~('\r' | '\n')* /* multiple comment*/
;
Like that it will remove them automatically !

For ANTLR v3:
The whitespace tokens are usually not processed by parser, but they are still captured on the HIDDEN channel.
If you use BufferedTokenStream, you can get to list of all tokens through it and do a postprocessing, adding them as needed.

Related

Antlr4: How can I both hide and use Tokens in a grammar

I'm parsing a script language that defines two types of statements; control statements and non control statements. Non control statements are always ended with ';', while control statements may end with ';' or EOL ('\n'). A part of the grammar looks like this:
script
: statement* EOF
;
statement
: control_statement
| no_control_statement
;
control_statement
: if_then_control_statement
;
if_then_control_statement
: IF expression THEN end_control_statment
( statement ) *
( ELSEIF expression THEN end_control_statment ( statement )* )*
( ELSE end_control_statment ( statement )* )?
END IF end_control_statment
;
no_control_statement
: sleep_statement
;
sleep_statement
: SLEEP expression END_STATEMENT
;
end_control_statment
: END_STATEMENT
| EOL
;
END_STATEMENT
: ';'
;
ANY_SPACE
: ( LINE_SPACE | EOL ) -> channel(HIDDEN)
;
EOL
: [\n\r]+
;
LINE_SPACE
: [ \t]+
;
In all other aspects of the script language, I never care about EOL so I use the normal lexer rules to hide white space.
This works fine in all cases but the cases where I need to use a EOL to find a termination of a control statement, but with the grammar above, all EOL is hidden and not used in the control statement rules.
Is there a way to change my grammar so that I can skip all EOL but the ones needed to terminate parts of my control statements?
Found one way to handle this.
The idea is to divert EOL into one hidden channel and the other stuff I donĀ“t want to see in another hidden channel (like spaces and comments). Then I use some code to backtrack the tokens when an EOL is supposed to show up and examine the previous tokens channels (since they already have been consumed). If I find something on EOL channel before I run into something from the ordinary channel, then it is ok.
It looks like this:
Changed the lexer rules:
#lexer::members {
public static int EOL_CHANNEL = 1;
public static int OTHER_CHANNEL = 2;
}
...
EOL
: '\r'? '\n' -> channel(EOL_CHANNEL)
;
LINE_SPACE
: [ \t]+ -> channel(OTHER_CHANNEL)
;
I also diverted all other HIDDEN channels (comments) to the OTHER_CHANNEL.
Then I changed the rule end_control_statment:
end_control_statment
: END_STATEMENT
| { isEOLPrevious() }?
;
and added
#parser::members {
public static int EOL_CHANNEL = 1;
public static int OTHER_CHANNEL = 2;
boolean isEOLPrevious()
{
int idx = getCurrentToken().getTokenIndex();
int ch;
do
{
ch = getTokenStream().get(--idx).getChannel();
}
while (ch == OTHER_CHANNEL);
// Channel 1 is only carrying EOL, no need to check token itself
return (ch == EOL_CHANNEL);
}
}
One could stick to the ordinary hidden channel but then there is a need to both track channel and tokens while backtracking so this is maybe a bit easier...
Hope this could help someone else dealing with these kind of issues...

Antlr 4: Method for switching modes in parser

I'm trying to build a MVS JCL recognizer using Antlr4. The general endeavour is going reasonably well, but I am having trouble handling the MVS equivalent of *nix "here docs" (inline files). I cannot use lexer modes to flip-flop between JCL and here-doc content, so I am looking for alternatives that I might use a parser level.
IBM MVS allows the use of "instream datasets", similar to *nix here-docs.
Example:
This defines a three-line inline file, terminated by the characters "ZZ" and accessible to a referencing program using the label "ANYNAME":
//ANYNAME DD *,SYMBOLS=(JCLONLY,FILEREF),DLM=ZZ
HEREDOC TEXT 1
HEREDOC TEXT 2
HEREDOC TEXT 3
ZZ
//NEXTFILE DD ...stuff...
ANYNAME is a handle by which a program can access the here-doc content.
DD * is mandatory and informs MVS that a here-doc follows.
SYMBOLS=(JCLONLY,FILEREF) is optional detail relating to how the here-doc is handled.
DLM=ZZ is also optional and defines the here-doc terminator (default terminator = /*).
I need to be able, at parser level, to process the //ANYNAME... line (I have that bit), then to read the here-doc content until I find the (possibly non-default) here-doc terminator. In a sense, this looks like a lexer modes opportunity- but at this point I am working within the parser and I do not have a fixed terminator to work with.
I need guidance on how to switch modes to handle my here-doc, then switch back again to continue processing my JCL.
A hugely abridged version of my grammar follows (the actual grammar, so far, is about 2,200 lines and is incomplete).
Thanks for any insights. I appreciate your help, comments and suggestions.
/* the ddstmt parser rule should be considered the main entry point. It handles (at least):
//ANYNAME DD *,SYMBOLS=(JCLONLY,FILEREF),DLM=ZZ
and // DD *,DLM=ZZ
and //ANYNAME DD *,SYMBOLS=EXECSYS
and //ANYNAME DD *
I need to be able process the above line as JCL then read the here-doc content...
"HEREDOC TEXT 1"
"HEREDOC TEXT 2"
"HEREDOC TEXT 3"
as either a single token or a series of tokens, then, after reading the here-doc
delimiter...
"ZZ"
, go back to processing regular JCL again.
*/
/* lexer rules: */
LINECOMMENT3 : SLASH SLASH STAR ;
DSLASH : SLASH SLASH ;
INSTREAMTERMINATE : SLASH STAR ;
SLASH : '/' ;
STAR : '*' ;
OPAREN : '(' ;
CPAREN : ')' ;
COMMA : ',' ;
KWDD : 'DD' ;
KWDLM : 'DLM' ;
KWSYMBOLS : 'SYMBOLS' ;
KWDATA : 'DATA' ;
SYMBOLSTARGET : 'JCLONLY'|'EXECSYS'|'CNVTSYS' ;
EQ : '=' ;
APOST : '\'' ;
fragment
SPC : ' ' ;
SPCS : SPC+ ;
NL : ('\r'? '\n') ;
UNQUOTEDTEXT : (APOST APOST|~[=\'\"\r\n\t,/() ])+ ;
/* parser rules: */
label : unquotedtext
;
separator : SPCS
;
/* handle crazy JCL comment rules - start */
partcomment : SPCS partcommenttext NL
;
partcommenttext : ((~NL+?)?)
;
linecomment : LINECOMMENT3 linecommenttext NL
;
linecommenttext : ((~NL+?)?)
;
postcommaeol : ( (partcomment|NL) linecomment* DSLASH SPCS )?
;
poststmteol : ( (partcomment|NL) linecomment* )?
;
/* handle crazy JCL comment rules - end */
ddstmt : DSLASH (label|) separator KWDD separator dddecl
;
dddecl : ...
| ddinstreamdecl
| ...
;
ddinstreamdecl : (STAR|KWDATA) poststmteol ddinstreamopts
;
ddinstreamopts : ( COMMA postcommaeol ddinstreamopt poststmteol )*
;
ddinstreamopt : ( ddinstreamdelim
| symbolsdecl
)
;
ddinstreamdelim : KWDLM EQ unquotedtext
;
symbolsdecl : KWSYMBOLS EQ symbolsdef
;
symbolsdef : OPAREN symbolstarget ( COMMA symbolsloggingdd )? CPAREN
| symbolstarget
;
symbolstarget : SYMBOLSTARGET
;
symbolsloggingdd : unquotedtext
;
unquotedtext : UNQUOTEDTEXT
;
Your lexer needs to be able to tokenize the entire document prior to the beginning of the parsing operation. Any attempt to control the lexer from within the parser is a recipe for endless nightmares down the road. The following fragments of a PHP Lexer show how predicates can be used in combination with lexer modes to detect the end of a string with a user-defined delimiter. The key part is recording the start delimiter, and then checking tokens which start at the beginning of the line against it.
PHP_NOWDOC_START
: '<<<\'' PHP_IDENTIFIER '\'' {_input.La(1) == '\r' || _input.La(1) == '\n'}?
-> pushMode(PhpNowDoc)
;
mode PhpNowDoc;
PhpNowDoc_NEWLINE : NEWLINE -> type(NEWLINE);
PHP_NOWDOC_END
: {_input.La(-1) == '\n'}?
PHP_IDENTIFIER ';'?
{CheckHeredocEnd(_input.La(1), Text);}?
-> popMode
;
PHP_NOWDOC_TEXT
: ~[\r\n]+
;
The identifier is actually recorded in a custom override of NextToken() (shown here for a C# target):
public override IToken NextToken()
{
IToken token = base.NextToken();
switch (token.Type)
{
case PHP_NOWDOC_START:
// <<<'identifier'
_heredocIdentifier = token.Text.Substring(3).Trim('\'');
break;
case PHP_NOWDOC_END:
_heredocIdentifier = null;
break;
default:
break;
}
return token;
}
private bool CheckHeredocEnd(int la1, string text)
{
// identifier
// - or -
// identifier;
bool semi = text[text.Length - 1] == ';';
string identifier = semi ? text.Substring(0, text.Length - 1) : text;
return string.Equals(identifier, HeredocIdentifier, StringComparison.Ordinal);
}

How to handle arithmetic operator < and > in antlr grammar that removes html tags

Following is my antlr 3 grammar. I want to strip off content inside html tags.
The problem arises when I have arithmetic operator < > inside the tag.
How can this be handled?
grammar T;
options {
output=AST;
}
tokens {
ROOT;
}
parse
: text+ ;
text
: (tag)=> tag !
| SPACE !
| outsidetag
;
SPACE
: (' ' | '\t' | '\r' | '\n')+ ;
tag
: OPEN INSIDETAG CLOSE ;
CLOSE : '>' ;
OPEN : '<' ;
INSIDETAG
: ~(CLOSE|OPEN)+ ;
outsidetag
: ~(SPACE) ;
First you don't need to check for OPEN in your INSIDETAG rule, since there is no harm in skipping it there. In fact you want it that way. Additionally combine tag and INSIDETAG and make it greedy so it tries to consume anything until the last CLOSE TOKEN, skipping so any intermediate ones:
tag: options { greedy = true; }: OPEN ~CLOSE* CLOSE;

ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting

I am trying to preprocess my C++ source files by ANTLR. I would like to output an input file preserving all the whitespace formatting of the original source file while inserting some new source codes of my own at the appropriate locations.
I know preserving WS requires this lexer rule:
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
With this my parser rules would have a $text attribute containing all the hidden WS. But the problem is, for any parser rule, its $text attribute only include those input text starting from the position that matches the first token of the rule. For example, if this is my input (note the formatting WS before and in between the tokens):
line 1; line 2;
And, if I have 2 separate parser rules matching
"line 1;"
and
"line 2;"
above separately but not the whole line:
" line 1; line 2;"
, then the leading WS and those WS in between "line 1" and "line 2" are lost (not accessible by any of my rules).
What should I do to preserve ALL THE WHITESPACEs while allowing my parser rules to determine when to add new codes at the appropriate locations?
EDIT
Let's say whenever my code contains a call to function(1) using 1 as the parameter but not something else, it adds an extraFunction() before it:
void myFunction() {
function();
function(1);
}
Becomes:
void myFunction() {
function();
extraFunction();
function(1);
}
This preprocessed output should remain human readable as people would continue coding on it. For this simple example, text editor can handle it. But there are more complicated cases that justify the use of ANTLR.
Another solution, but maybe also not very practical (?): You can collect all Whitespaces backwards, something like this untested pseudocode:
grammar T;
#members {
public printWhitespaceBetweenRules(Token start) {
int index = start.getTokenIndex() - 1;
while(index >= 0) {
Token token = input.get(index);
if(token.getChannel() != Token.HIDDEN_CHANNEL) break;
System.out.print(token.getText());
index--;
}
}
}
line1: 'line' '1' {printWhitespaceBetweenRules($start); };
line2: 'line' '2' {printWhitespaceBetweenRules($start); };
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
But you would still need to change every rule.
I guess one solution is to keep the WS tokens in the same channel by removing the $channel = HIDDEN;. This will allow you to get access to the information of a WS token in your parser.
Here's another way to solve it (at least the example you posted).
So you want to replace ...function(1) with ...extraFunction();\nfunction(1), where the dots are indents, and \n a line break.
What you could do is match:
Function1
: Spaces 'function' Spaces '(' Spaces '1' Spaces ')'
;
fragment Spaces
: (' ' | '\t')*
;
and replace that with the text it matches, but pre-pended with your extra method. However, the lexer will now complain when it stumbles upon input like:
'function()'
(without the 1 as a parameter)
or:
' x...'
(indents not followed by the f from function)
So, you'll need to "branch out" in your Function1 rule and make sure you only replace the proper occurrence.
You also must take care of occurrences of function(1) inside string literals and comments, assuming you don't want them to be pre-pended with extraFunction();\n.
A little demo:
grammar T;
parse
: (t=. {System.out.print($t.text);})* EOF
;
Function1
: indent=Spaces
( 'function' Spaces '(' Spaces ( '1' Spaces ')' {setText($indent.text + "extraFunction();\n" + $text);}
| ~'1' // do nothing if something other than `1` occurs
)
| '"' ~('"' | '\r' | '\n')* '"' // do nothing in case of a string literal
| '/*' .* '*/' // do nothing in case of a multi-line comment
| '//' ~('\r' | '\n')* // do nothing in case of a single-line comment
| ~'f' // do nothing in case of a char other than 'f' is seen
)
;
OtherChar
: . // a "fall-through" rule: it will match anything if none of the above matched
;
fragment Spaces
: (' ' | '\t')* // fragment rules are only used inside other lexer rules
;
You can test it with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"/* \n" +
" function(1) \n" +
"*/ \n" +
"void myFunction() { \n" +
" s = \"function(1)\"; \n" +
" function(); \n" +
" function(1); \n" +
"} \n";
System.out.println(source);
System.out.println("---------------------------------");
TLexer lexer = new TLexer(new ANTLRStringStream(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
And if you run this Main class, you will see the following being printed to the console:
bart#hades:~/Programming/ANTLR/Demos/T$ java -cp antlr-3.3.jar org.antlr.Tool T.g
bart#hades:~/Programming/ANTLR/Demos/T$ javac -cp antlr-3.3.jar *.java
bart#hades:~/Programming/ANTLR/Demos/T$ java -cp .:antlr-3.3.jar Main
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
function(1);
}
---------------------------------
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
extraFunction();
function(1);
}
I'm sure it's not fool-proof (I did't account for char-literals, for one), but this could be a start to solve this, IMO.

ANTLR rule to consume fixed number of characters

I am trying to write an ANTLR grammar for the PHP serialize() format, and everything seems to work fine, except for strings. The problem is that the format of serialized strings is :
s:6:"length";
In terms of regexes, a rule like s:(\d+):".{\1}"; would describe this format if only backreferences were allowed in the "number of matches" count (but they are not).
But I cannot find a way to express this for either a lexer or parser grammar: the whole idea is to make the number of characters read depend on a backreference describing the number of characters to read, as in Fortran Hollerith constants (i.e. 6HLength), not on a string delimiter.
This example from the ANTLR grammar for Fortran seems to point the way, but I don't see how. Note that my target language is Python, while most of the doc and examples are for Java:
// numeral literal
ICON {int counter=0;} :
/* other alternatives */
// hollerith
'h' ({counter>0}? NOTNL {counter--;})* {counter==0}?
{
$setType(HOLLERITH);
String str = $getText;
str = str.replaceFirst("([0-9])+h", "");
$setText(str);
}
/* more alternatives */
;
Since input like s:3:"a"b"; is valid, you can't define a String token in your lexer, unless the first and last double quote are always the start and end of your string. But I guess this is not the case.
So, you'll need a lexer rule like this:
SString
: 's:' Int ':"' ( . )* '";'
;
In other words: match a s:, then an integer value followed by :" then one or more characters that can be anything, ending with ";. But you need to tell the lexer to stop consuming when the value Int is not reached. You can do that by mixing some plain code in your grammar to do so. You can embed plain code by wrapping it inside { and }. So first convert the value the token Int holds into an integer variable called chars:
SString
: 's:' Int {chars = int($Int.text)} ':"' ( . )* '";'
;
Now embed some code inside the ( . )* loop to stop it consuming as soon as chars is counted down to zero:
SString
: 's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
;
and that's it.
A little demo grammar:
grammar Test;
options {
language=Python;
}
parse
: (SString {print 'parsed: [\%s]' \% $SString.text})+ EOF
;
SString
: 's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
;
Int
: '0'..'9'+
;
(note that you need to escape the % inside your grammar!)
And a test script:
import antlr3
from TestLexer import TestLexer
from TestParser import TestParser
input = 's:6:"length";s:1:""";s:0:"";s:3:"end";'
char_stream = antlr3.ANTLRStringStream(input)
lexer = TestLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = TestParser(tokens)
parser.parse()
which produces the following output:
parsed: [s:6:"length";]
parsed: [s:1:""";]
parsed: [s:0:"";]
parsed: [s:3:"end";]

Resources