ANTLR rule to consume fixed number of characters - parsing

I am trying to write an ANTLR grammar for the PHP serialize() format, and everything works except for strings. The problem is that serialized strings have this format:
s:6:"length";
In terms of regexes, a rule like s:(\d+):".{\1}"; would describe this format if only backreferences were allowed in the "number of matches" count (but they are not).
But I cannot find a way to express this for either a lexer or parser grammar: the whole idea is to make the number of characters read depend on a backreference describing the number of characters to read, as in Fortran Hollerith constants (i.e. 6HLength), not on a string delimiter.
This example from the ANTLR grammar for Fortran seems to point the way, but I don't see how. Note that my target language is Python, while most of the doc and examples are for Java:
// numeral literal
ICON {int counter=0;} :
/* other alternatives */
// hollerith
'h' ({counter>0}? NOTNL {counter--;})* {counter==0}?
{
$setType(HOLLERITH);
String str = $getText;
str = str.replaceFirst("([0-9])+h", "");
$setText(str);
}
/* more alternatives */
;

Since input like s:3:"a"b"; is valid, you can't define a String token in your lexer, unless the first and last double quote are always the start and end of your string. But I guess this is not the case.
So, you'll need a lexer rule like this:
SString
: 's:' Int ':"' ( . )* '";'
;
In other words: match s:, then an integer value followed by :", then zero or more characters that can be anything, ending with ";. But you need to tell the lexer to stop consuming once the number of characters given by Int has been read. You can do that by mixing some plain code into your grammar, which you embed by wrapping it inside { and }. So first convert the text of the Int token into an integer variable called chars:
SString
: 's:' Int {chars = int($Int.text)} ':"' ( . )* '";'
;
Now embed some code inside the ( . )* loop to stop it consuming as soon as chars is counted down to zero:
SString
: 's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
;
and that's it.
A little demo grammar:
grammar Test;
options {
language=Python;
}
parse
: (SString {print 'parsed: [\%s]' \% $SString.text})+ EOF
;
SString
: 's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
;
Int
: '0'..'9'+
;
(note that you need to escape the % inside your grammar!)
And a test script:
import antlr3
from TestLexer import TestLexer
from TestParser import TestParser
input = 's:6:"length";s:1:""";s:0:"";s:3:"end";'
char_stream = antlr3.ANTLRStringStream(input)
lexer = TestLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = TestParser(tokens)
parser.parse()
which produces the following output:
parsed: [s:6:"length";]
parsed: [s:1:""";]
parsed: [s:0:"";]
parsed: [s:3:"end";]
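The counting idea used in the SString rule can also be sketched without ANTLR. The following is a minimal hand-written scanner in plain Python, not generated ANTLR code; the function name scan_serialized_strings is made up for this illustration. It reads the length, then takes exactly that many characters regardless of any quotes inside them. Note that it counts characters, whereas PHP's serialize() counts bytes, so it is only equivalent for single-byte text.

```python
import re

# Hypothetical helper for illustration only; not part of ANTLR or PHP.
_HEADER = re.compile(r's:(\d+):"')


def scan_serialized_strings(data):
    """Return the payload of every s:N:"..."; item found back to back in data."""
    pos, items = 0, []
    while pos < len(data):
        header = _HEADER.match(data, pos)
        if header is None:
            raise ValueError('expected s:N:" at offset %d' % pos)
        length = int(header.group(1))
        start = header.end()
        end = start + length  # consume exactly `length` characters
        if data[end:end + 2] != '";':
            raise ValueError('expected "; at offset %d' % end)
        items.append(data[start:end])
        pos = end + 2  # skip the closing ";
    return items


print(scan_serialized_strings('s:6:"length";s:1:""";s:0:"";s:3:"end";'))
```

The closing "; is checked only after the counted characters have been consumed, which is why a quote inside the payload does not end the string early.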

Related

A case where ANTLR4 terminates parsing successfully before the end of file is reached due to a parsing error

I gave ANTLR4 the following parser and lexer grammars in separate files (a simple grammar for BNF grammars):
parser grammar BNFParser;
options {tokenVocab = BNFLexer;}
compileUnit
: grammar_rule+
;
grammar_rule : NON_TERMINAL COLON (OR? grammar_rule_alternative)* SEMICOLON
;
grammar_rule_alternative : (NON_TERMINAL|TERMINAL)+
;
and
lexer grammar BNFLexer;
TERMINAL : [A-Z][A-Za-z0-9_]*;
NON_TERMINAL : [a-z][A-Za-z0-9_]*;
OR : '|';
COLON : ':';
SEMICOLON : ';';
WS
: [ \t\r\n]+ -> skip
;
The main program
private static void Main(string[] args) {
StreamReader reader = new StreamReader(args[0]);
AntlrInputStream stream = new AntlrInputStream(reader);
BNFLexer lexer = new BNFLexer(stream);
CommonTokenStream tokens = new CommonTokenStream(lexer);
BNFParser parser = new BNFParser(tokens);
IParseTree root = parser.compileUnit();
Console.WriteLine(root.ToStringTree());
}
I also supplied the following test file for testing the grammar:
compileunit : x a
;
x : S b
;
S : compileunit f
;
Please notice from the lexer grammar that non-terminals begin with a lowercase letter while terminals begin with an uppercase letter. The test file therefore contains an error: the third rule uses a capital letter (S) to define the non-terminal S. The expected behaviour would be to report this as an error. On the contrary, parsing succeeds by consuming the first two rules and ignoring the third rule for S without reporting any error. I have also looked at the generated files and noticed the following:
try {
EnterOuterAlt(_localctx, 1);
{
State = 7;
_errHandler.Sync(this);
_la = _input.La(1);
do {
{
{
State = 6; grammar_rule();
}
}
State = 9;
_errHandler.Sync(this);
_la = _input.La(1);
} while ( _la==NON_TERMINAL );
}
}
catch (RecognitionException re) {
_localctx.exception = re;
_errHandler.ReportError(this, re);
_errHandler.Recover(this, re);
}
The above code shows that the parser expects a non-terminal symbol at the start of a grammar_rule, which is what I expect. However, what happens when this is not the case? Another weird issue is that the CommonTokenStream object that contains the tokens recognized by the lexer contains only the tokens up to the end of the second rule, but none of the tokens of the third rule (S). Is this proper behaviour?
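Add an EOF token to the end of your main rule, i.e. compileUnit : grammar_rule+ EOF ;. Without it, the parser is free to stop as soon as grammar_rule+ can no longer match, which is what happens at the S that starts the third rule. With EOF, the parser must use all input up to the end of the file and reports an error if the rest does not match. The missing tokens are most likely explained by CommonTokenStream fetching tokens from the lexer lazily, as the parser asks for them: once the parser stops, no further tokens are requested.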
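To make the effect of the trailing EOF concrete, here is a small hand-written recognizer in Python for the same grammar. It is only an analogue of what the generated parser does, not ANTLR code, and the names tokenize, parse_rule and parse_compile_unit are invented for this sketch. The require_eof flag switches between the two versions of compileUnit.

```python
import re

# Token patterns copied from the BNFLexer rules in the question.
TOKEN = re.compile(
    r"\s*(?:(?P<TERMINAL>[A-Z][A-Za-z0-9_]*)"
    r"|(?P<NON_TERMINAL>[a-z][A-Za-z0-9_]*)"
    r"|(?P<OR>\|)|(?P<COLON>:)|(?P<SEMICOLON>;))"
)
SYMBOLS = ("TERMINAL", "NON_TERMINAL")


def tokenize(text):
    """Return the list of token type names, terminated by 'EOF'."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError("bad character at offset %d" % pos)
        tokens.append(match.lastgroup)
        pos = match.end()
    return tokens + ["EOF"]


def expect(tokens, pos, kind):
    if tokens[pos] != kind:
        raise SyntaxError("expected %s, found %s" % (kind, tokens[pos]))
    return pos + 1


def parse_rule(tokens, pos):
    # grammar_rule : NON_TERMINAL COLON (OR? grammar_rule_alternative)* SEMICOLON
    pos = expect(tokens, pos, "NON_TERMINAL")
    pos = expect(tokens, pos, "COLON")
    while tokens[pos] in ("OR",) + SYMBOLS:
        if tokens[pos] == "OR":
            pos += 1
        if tokens[pos] not in SYMBOLS:
            raise SyntaxError("expected a symbol, found %s" % tokens[pos])
        while tokens[pos] in SYMBOLS:
            pos += 1
    return expect(tokens, pos, "SEMICOLON")


def parse_compile_unit(tokens, require_eof):
    """compileUnit : grammar_rule+ (EOF)? ; returns the number of rules matched."""
    pos = parse_rule(tokens, 0)
    rules = 1
    while tokens[pos] == "NON_TERMINAL":  # same loop condition as the generated code
        pos = parse_rule(tokens, pos)
        rules += 1
    if require_eof:
        expect(tokens, pos, "EOF")
    return rules


source = "compileunit : x a ;\nx : S b ;\nS : compileunit f ;"
print(parse_compile_unit(tokenize(source), require_eof=False))
```

Without the EOF check the loop simply ends at the S token and two rules are reported as a success; with require_eof=True the same input raises a SyntaxError at that token.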

How to parse Velocity variables using ANTLR4

Velocity variables have the following notation (see the Velocity User Guide):
The shorthand notation of a variable consists of a leading "$" character followed by a VTL Identifier. A VTL Identifier must start with an alphabetic character (a .. z or A .. Z). The rest of the characters are limited to the following types of characters:
alphabetic (a .. z, A .. Z)
numeric (0 .. 9)
underscore ("_")
I want to use lexer mode to split the normal text and the variables, so I wrote something like this:
// default mode
DOLLAR : '$' -> pushMode(VARIABLE);
TEXT : ~[$]+? -> skip;
mode VARIABLE;
ID : [a-zA-Z] [a-zA-Z0-9-_]*;
???? : XXX -> popMode; // how can I pop mode to default?
Because the notation of the variables has no explicit end character, I don't know how to determine where a variable ends.
Maybe I got it wrong?
You would pop out of that scope like this:
mode VARIABLE;
ID : [a-zA-Z] [a-zA-Z0-9-_]* -> popMode;
Here's a quick demo:
lexer grammar VelocityLexer;
DOLLAR : '$' -> more, pushMode(VARIABLE);
TEXT : ~[$]+ -> skip;
mode VARIABLE;
// the `-` needs to be escaped!
ID : [a-zA-Z] [a-zA-Z0-9\-_]* -> popMode;
Note the more command in the DOLLAR rule, which causes the $ to be included in the ID token. If you omit it, you end up with two tokens ($ and foo for the input $foo).
Test the grammar with the following Java class:
import org.antlr.v4.runtime.*;
public class Main {
public static void main(String[] args) {
VelocityLexer lexer = new VelocityLexer(CharStreams.fromString("<strong>$Mu</strong>$foo..."));
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
tokenStream.fill();
for (Token t : tokenStream.getTokens()) {
System.out.printf("%-10s '%s'\n", VelocityLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText());
}
}
}
which will print:
ID '$Mu'
ID '$foo'
EOF '<EOF>'
However, I think a lexical mode is not a good choice in case of an ID. Why not simply do:
lexer grammar VelocityLexer;
DOLLAR : '$' [a-zA-Z] [a-zA-Z0-9\-_]*;
TEXT : ~[$]+ -> skip;
?
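For comparison, the single-rule version corresponds to one regular expression. The sketch below uses Python's re module rather than ANTLR, and the name find_variables is made up for illustration; it keeps the same character classes as the lexer rule, including the escaped hyphen.

```python
import re

# Same shape as the single lexer rule: '$', one letter, then letters, digits, '-' or '_'.
VARIABLE = re.compile(r"\$[a-zA-Z][a-zA-Z0-9\-_]*")


def find_variables(text):
    """Return every Velocity-style shorthand variable found in text."""
    return VARIABLE.findall(text)


print(find_variables("<strong>$Mu</strong>$foo..."))  # ['$Mu', '$foo']
```

Everything that is not matched plays the role of the skipped TEXT token.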

Antlr4: How can I both hide and use Tokens in a grammar

I'm parsing a script language that defines two types of statements; control statements and non control statements. Non control statements are always ended with ';', while control statements may end with ';' or EOL ('\n'). A part of the grammar looks like this:
script
: statement* EOF
;
statement
: control_statement
| no_control_statement
;
control_statement
: if_then_control_statement
;
if_then_control_statement
: IF expression THEN end_control_statment
( statement ) *
( ELSEIF expression THEN end_control_statment ( statement )* )*
( ELSE end_control_statment ( statement )* )?
END IF end_control_statment
;
no_control_statement
: sleep_statement
;
sleep_statement
: SLEEP expression END_STATEMENT
;
end_control_statment
: END_STATEMENT
| EOL
;
END_STATEMENT
: ';'
;
ANY_SPACE
: ( LINE_SPACE | EOL ) -> channel(HIDDEN)
;
EOL
: [\n\r]+
;
LINE_SPACE
: [ \t]+
;
In all other aspects of the script language, I never care about EOL so I use the normal lexer rules to hide white space.
This works fine in all cases except those where I need an EOL to detect the termination of a control statement: with the grammar above, every EOL is hidden and never reaches the control statement rules.
Is there a way to change my grammar so that I can skip all EOL but the ones needed to terminate parts of my control statements?
Found one way to handle this.
The idea is to divert EOL into one hidden channel and the other stuff I don't want to see (like spaces and comments) into another hidden channel. Then I use some code to walk back through the tokens when an EOL is supposed to show up and examine the previous tokens' channels (since they have already been consumed). If I find something on the EOL channel before I run into something from the ordinary channel, then it is OK.
It looks like this:
Changed the lexer rules:
@lexer::members {
public static int EOL_CHANNEL = 1;
public static int OTHER_CHANNEL = 2;
}
...
EOL
: '\r'? '\n' -> channel(EOL_CHANNEL)
;
LINE_SPACE
: [ \t]+ -> channel(OTHER_CHANNEL)
;
I also diverted all other HIDDEN channels (comments) to the OTHER_CHANNEL.
Then I changed the rule end_control_statment:
end_control_statment
: END_STATEMENT
| { isEOLPrevious() }?
;
and added
@parser::members {
public static int EOL_CHANNEL = 1;
public static int OTHER_CHANNEL = 2;
boolean isEOLPrevious()
{
int idx = getCurrentToken().getTokenIndex();
int ch;
do
{
ch = getTokenStream().get(--idx).getChannel();
}
while (ch == OTHER_CHANNEL);
// Channel 1 is only carrying EOL, no need to check token itself
return (ch == EOL_CHANNEL);
}
}
One could stick to the ordinary hidden channel, but then both the channel and the token type would have to be checked while walking back, so this is maybe a bit easier.
Hope this helps someone else dealing with this kind of issue.
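The backward walk in isEOLPrevious() can be tried out in isolation. Below is a minimal Python sketch of the same logic over a plain list of tokens; the Token tuple and the sample token list are assumptions made for the illustration and are not ANTLR runtime classes. Unlike the Java version above, it also stops at the start of the list instead of indexing before it.

```python
from collections import namedtuple

# Stand-in for an ANTLR token: only the text and the channel matter here.
Token = namedtuple("Token", "text channel")

DEFAULT_CHANNEL, EOL_CHANNEL, OTHER_CHANNEL = 0, 1, 2


def is_eol_previous(tokens, index):
    """True if an EOL token precedes tokens[index], ignoring OTHER_CHANNEL tokens."""
    i = index - 1
    while i >= 0 and tokens[i].channel == OTHER_CHANNEL:
        i -= 1  # skip spaces and comments
    return i >= 0 and tokens[i].channel == EOL_CHANNEL


# THEN <space> <newline> <indent> SLEEP <space> 1 ;
tokens = [
    Token("THEN", DEFAULT_CHANNEL),
    Token(" ", OTHER_CHANNEL),
    Token("\n", EOL_CHANNEL),
    Token("  ", OTHER_CHANNEL),
    Token("SLEEP", DEFAULT_CHANNEL),
    Token(" ", OTHER_CHANNEL),
    Token("1", DEFAULT_CHANNEL),
    Token(";", DEFAULT_CHANNEL),
]

print(is_eol_previous(tokens, 4))  # True: a newline separates THEN and SLEEP
print(is_eol_previous(tokens, 6))  # False: only a space separates SLEEP and 1
```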

Antlr 4: Method for switching modes in parser

I'm trying to build an MVS JCL recognizer using Antlr4. The general endeavour is going reasonably well, but I am having trouble handling the MVS equivalent of *nix "here docs" (inline files). I cannot use lexer modes to flip-flop between JCL and here-doc content, so I am looking for alternatives that I might use at parser level.
IBM MVS allows the use of "instream datasets", similar to *nix here-docs.
Example:
This defines a three-line inline file, terminated by the characters "ZZ" and accessible to a referencing program using the label "ANYNAME":
//ANYNAME DD *,SYMBOLS=(JCLONLY,FILEREF),DLM=ZZ
HEREDOC TEXT 1
HEREDOC TEXT 2
HEREDOC TEXT 3
ZZ
//NEXTFILE DD ...stuff...
ANYNAME is a handle by which a program can access the here-doc content.
DD * is mandatory and informs MVS that a here-doc follows.
SYMBOLS=(JCLONLY,FILEREF) is optional detail relating to how the here-doc is handled.
DLM=ZZ is also optional and defines the here-doc terminator (default terminator = /*).
I need to be able, at parser level, to process the //ANYNAME... line (I have that bit), then to read the here-doc content until I find the (possibly non-default) here-doc terminator. In a sense, this looks like an opportunity for lexer modes, but at this point I am working within the parser and I do not have a fixed terminator to work with.
I need guidance on how to switch modes to handle my here-doc, then switch back again to continue processing my JCL.
A hugely abridged version of my grammar follows (the actual grammar, so far, is about 2,200 lines and is incomplete).
Thanks for any insights. I appreciate your help, comments and suggestions.
/* the ddstmt parser rule should be considered the main entry point. It handles (at least):
//ANYNAME DD *,SYMBOLS=(JCLONLY,FILEREF),DLM=ZZ
and // DD *,DLM=ZZ
and //ANYNAME DD *,SYMBOLS=EXECSYS
and //ANYNAME DD *
I need to be able to process the above line as JCL then read the here-doc content...
"HEREDOC TEXT 1"
"HEREDOC TEXT 2"
"HEREDOC TEXT 3"
as either a single token or a series of tokens, then, after reading the here-doc
delimiter...
"ZZ"
, go back to processing regular JCL again.
*/
/* lexer rules: */
LINECOMMENT3 : SLASH SLASH STAR ;
DSLASH : SLASH SLASH ;
INSTREAMTERMINATE : SLASH STAR ;
SLASH : '/' ;
STAR : '*' ;
OPAREN : '(' ;
CPAREN : ')' ;
COMMA : ',' ;
KWDD : 'DD' ;
KWDLM : 'DLM' ;
KWSYMBOLS : 'SYMBOLS' ;
KWDATA : 'DATA' ;
SYMBOLSTARGET : 'JCLONLY'|'EXECSYS'|'CNVTSYS' ;
EQ : '=' ;
APOST : '\'' ;
fragment
SPC : ' ' ;
SPCS : SPC+ ;
NL : ('\r'? '\n') ;
UNQUOTEDTEXT : (APOST APOST|~[=\'\"\r\n\t,/() ])+ ;
/* parser rules: */
label : unquotedtext
;
separator : SPCS
;
/* handle crazy JCL comment rules - start */
partcomment : SPCS partcommenttext NL
;
partcommenttext : ((~NL+?)?)
;
linecomment : LINECOMMENT3 linecommenttext NL
;
linecommenttext : ((~NL+?)?)
;
postcommaeol : ( (partcomment|NL) linecomment* DSLASH SPCS )?
;
poststmteol : ( (partcomment|NL) linecomment* )?
;
/* handle crazy JCL comment rules - end */
ddstmt : DSLASH (label|) separator KWDD separator dddecl
;
dddecl : ...
| ddinstreamdecl
| ...
;
ddinstreamdecl : (STAR|KWDATA) poststmteol ddinstreamopts
;
ddinstreamopts : ( COMMA postcommaeol ddinstreamopt poststmteol )*
;
ddinstreamopt : ( ddinstreamdelim
| symbolsdecl
)
;
ddinstreamdelim : KWDLM EQ unquotedtext
;
symbolsdecl : KWSYMBOLS EQ symbolsdef
;
symbolsdef : OPAREN symbolstarget ( COMMA symbolsloggingdd )? CPAREN
| symbolstarget
;
symbolstarget : SYMBOLSTARGET
;
symbolsloggingdd : unquotedtext
;
unquotedtext : UNQUOTEDTEXT
;
Your lexer needs to be able to tokenize the entire document prior to the beginning of the parsing operation. Any attempt to control the lexer from within the parser is a recipe for endless nightmares down the road. The following fragments of a PHP Lexer show how predicates can be used in combination with lexer modes to detect the end of a string with a user-defined delimiter. The key part is recording the start delimiter, and then checking tokens which start at the beginning of the line against it.
PHP_NOWDOC_START
: '<<<\'' PHP_IDENTIFIER '\'' {_input.La(1) == '\r' || _input.La(1) == '\n'}?
-> pushMode(PhpNowDoc)
;
mode PhpNowDoc;
PhpNowDoc_NEWLINE : NEWLINE -> type(NEWLINE);
PHP_NOWDOC_END
: {_input.La(-1) == '\n'}?
PHP_IDENTIFIER ';'?
{CheckHeredocEnd(_input.La(1), Text)}?
-> popMode
;
PHP_NOWDOC_TEXT
: ~[\r\n]+
;
The identifier is actually recorded in a custom override of NextToken() (shown here for a C# target):
public override IToken NextToken()
{
IToken token = base.NextToken();
switch (token.Type)
{
case PHP_NOWDOC_START:
// <<<'identifier'
_heredocIdentifier = token.Text.Substring(3).Trim('\'');
break;
case PHP_NOWDOC_END:
_heredocIdentifier = null;
break;
default:
break;
}
return token;
}
private bool CheckHeredocEnd(int la1, string text)
{
// identifier
// - or -
// identifier;
bool semi = text[text.Length - 1] == ';';
string identifier = semi ? text.Substring(0, text.Length - 1) : text;
return string.Equals(identifier, _heredocIdentifier, StringComparison.Ordinal);
}
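Applied to the JCL case, the same idea (record the delimiter when the start line is seen, then compare each following line against it) might look like the line-based Python sketch below. It is a simplification under several assumptions: the function name split_instream is invented, the DD statement is assumed to fit on one line with no continuation, and the terminator is assumed to stand alone on its line. Real JCL column rules are not modelled.

```python
import re

# "DD *" or "DD DATA" introduces instream data (simplified detection).
DD_INSTREAM = re.compile(r"\sDD\s+(\*|DATA)(,|\s|$)")
DLM = re.compile(r"\bDLM=([^,\s]+)")
DEFAULT_TERMINATOR = "/*"


def split_instream(lines):
    """Classify each line as ('JCL' | 'DATA' | 'END', line)."""
    result = []
    terminator = None  # None means we are reading ordinary JCL
    for line in lines:
        if terminator is not None:
            if line.rstrip() == terminator:
                result.append(("END", line))
                terminator = None  # back to ordinary JCL
            else:
                result.append(("DATA", line))
            continue
        result.append(("JCL", line))
        if DD_INSTREAM.search(line):
            match = DLM.search(line)
            # Record the user-defined delimiter, or fall back to the default.
            terminator = match.group(1) if match else DEFAULT_TERMINATOR
    return result


example = [
    "//ANYNAME DD *,SYMBOLS=(JCLONLY,FILEREF),DLM=ZZ",
    "HEREDOC TEXT 1",
    "HEREDOC TEXT 2",
    "HEREDOC TEXT 3",
    "ZZ",
    "//NEXTFILE DD ...stuff...",
]
for kind, text in split_instream(example):
    print(kind, text)
```

In an ANTLR lexer the terminator variable would play the role of _heredocIdentifier, and the comparison would live in a predicate like the one in PHP_NOWDOC_END.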

ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting

I am trying to preprocess my C++ source files with ANTLR. I would like to output an input file preserving all the whitespace formatting of the original source file while inserting some new source code of my own at the appropriate locations.
I know preserving WS requires this lexer rule:
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
With this, my parser rules would have a $text attribute containing all the hidden WS. But the problem is that, for any parser rule, its $text attribute only includes the input text starting from the position that matches the first token of the rule. For example, if this is my input (note the formatting WS before and in between the tokens):
line 1; line 2;
And, if I have 2 separate parser rules matching
"line 1;"
and
"line 2;"
above separately but not the whole line:
" line 1; line 2;"
, then the leading WS and those WS in between "line 1" and "line 2" are lost (not accessible by any of my rules).
What should I do to preserve ALL THE WHITESPACE while allowing my parser rules to determine when to add new code at the appropriate locations?
EDIT
Let's say whenever my code contains a call to function(1) using 1 as the parameter but not something else, it adds an extraFunction() before it:
void myFunction() {
function();
function(1);
}
Becomes:
void myFunction() {
function();
extraFunction();
function(1);
}
This preprocessed output should remain human readable as people would continue coding on it. For this simple example, text editor can handle it. But there are more complicated cases that justify the use of ANTLR.
Another solution, but maybe also not very practical (?): You can collect all Whitespaces backwards, something like this untested pseudocode:
grammar T;
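@members {
  // Prints the hidden-channel tokens (whitespace) directly preceding `start`, in source order.
  public void printWhitespaceBetweenRules(Token start) {
    int first = start.getTokenIndex();
    // walk backwards to the first token of the hidden run
    while(first > 0 && input.get(first - 1).getChannel() == Token.HIDDEN_CHANNEL) {
      first--;
    }
    for(int i = first; i < start.getTokenIndex(); i++) {
      System.out.print(input.get(i).getText());
    }
  }
}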
line1: 'line' '1' {printWhitespaceBetweenRules($start); };
line2: 'line' '2' {printWhitespaceBetweenRules($start); };
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
But you would still need to change every rule.
I guess one solution is to keep the WS tokens in the same channel by removing the $channel = HIDDEN;. This will allow you to get access to the information of a WS token in your parser.
Here's another way to solve it (at least the example you posted).
So you want to replace ...function(1) with ...extraFunction();\nfunction(1), where the dots are indents, and \n a line break.
What you could do is match:
Function1
: Spaces 'function' Spaces '(' Spaces '1' Spaces ')'
;
fragment Spaces
: (' ' | '\t')*
;
and replace that with the text it matches, but pre-pended with your extra method. However, the lexer will now complain when it stumbles upon input like:
'function()'
(without the 1 as a parameter)
or:
' x...'
(indents not followed by the f from function)
So, you'll need to "branch out" in your Function1 rule and make sure you only replace the proper occurrence.
You also must take care of occurrences of function(1) inside string literals and comments, assuming you don't want them to be pre-pended with extraFunction();\n.
A little demo:
grammar T;
parse
: (t=. {System.out.print($t.text);})* EOF
;
Function1
: indent=Spaces
( 'function' Spaces '(' Spaces ( '1' Spaces ')' {setText($indent.text + "extraFunction();\n" + $text);}
| ~'1' // do nothing if something other than `1` occurs
)
| '"' ~('"' | '\r' | '\n')* '"' // do nothing in case of a string literal
| '/*' .* '*/' // do nothing in case of a multi-line comment
| '//' ~('\r' | '\n')* // do nothing in case of a single-line comment
| ~'f' // do nothing in case of a char other than 'f' is seen
)
;
OtherChar
: . // a "fall-through" rule: it will match anything if none of the above matched
;
fragment Spaces
: (' ' | '\t')* // fragment rules are only used inside other lexer rules
;
You can test it with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"/* \n" +
" function(1) \n" +
"*/ \n" +
"void myFunction() { \n" +
" s = \"function(1)\"; \n" +
" function(); \n" +
" function(1); \n" +
"} \n";
System.out.println(source);
System.out.println("---------------------------------");
TLexer lexer = new TLexer(new ANTLRStringStream(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
And if you run this Main class, you will see the following being printed to the console:
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp antlr-3.3.jar org.antlr.Tool T.g
bart@hades:~/Programming/ANTLR/Demos/T$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp .:antlr-3.3.jar Main
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
function(1);
}
---------------------------------
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
extraFunction();
function(1);
}
I'm sure it's not fool-proof (I didn't account for char literals, for one), but this could be a start to solving this, IMO.
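The structure of the Function1 rule (try string literals and comments first and copy them through, only then look for the call) carries over to a single regular expression with alternatives. The following Python sketch is an analogue of the demo grammar, not a replacement for a real C++ parser; the name add_extra_function is invented here, and like the grammar it ignores char literals.

```python
import re

PATTERN = re.compile(
    r'"[^"\r\n]*"'      # string literal: copied through unchanged
    r"|/\*.*?\*/"        # multi-line comment: copied through unchanged
    r"|//[^\r\n]*"       # single-line comment: copied through unchanged
    r"|(?P<indent>[ \t]*)\b(?P<call>function[ \t]*\([ \t]*1[ \t]*\))",
    re.DOTALL,
)


def add_extra_function(source):
    """Insert extraFunction(); on its own line before every function(1) call."""
    def replace(match):
        if match.group("call") is None:
            return match.group(0)  # literal or comment: leave as is
        # Reuse the call's own indentation for the inserted line.
        return match.group("indent") + "extraFunction();\n" + match.group(0)

    return PATTERN.sub(replace, source)


source = (
    "/*\n"
    "  function(1)\n"
    "*/\n"
    "void myFunction() {\n"
    '  s = "function(1)";\n'
    "  function();\n"
    "  function(1);\n"
    "}\n"
)
print(add_extra_function(source))
```

Because the alternatives are tried at each position from left to right, a function(1) inside a comment or string is consumed by the earlier alternatives and never reaches the last one.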
