ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting

ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting - parsing

I am trying to preprocess my C++ source files by ANTLR. I would like to output an input file preserving all the whitespace formatting of the original source file while inserting some new source codes of my own at the appropriate locations.
I know preserving WS requires this lexer rule:
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
With this my parser rules would have a $text attribute containing all the hidden WS. But the problem is, for any parser rule, its $text attribute only include those input text starting from the position that matches the first token of the rule. For example, if this is my input (note the formatting WS before and in between the tokens):
line 1; line 2;
And, if I have 2 separate parser rules matching
"line 1;"
and
"line 2;"
above separately but not the whole line:
" line 1; line 2;"
, then the leading WS and those WS in between "line 1" and "line 2" are lost (not accessible by any of my rules).
What should I do to preserve ALL THE WHITESPACEs while allowing my parser rules to determine when to add new codes at the appropriate locations?
EDIT
Let's say whenever my code contains a call to function(1) using 1 as the parameter but not something else, it adds an extraFunction() before it:
void myFunction() {
function();
function(1);
}
Becomes:
void myFunction() {
function();
extraFunction();
function(1);
}
This preprocessed output should remain human readable as people would continue coding on it. For this simple example, text editor can handle it. But there are more complicated cases that justify the use of ANTLR.

Another solution, but maybe also not very practical (?): You can collect all Whitespaces backwards, something like this untested pseudocode:
grammar T;
#members {
public printWhitespaceBetweenRules(Token start) {
int index = start.getTokenIndex() - 1;
while(index >= 0) {
Token token = input.get(index);
if(token.getChannel() != Token.HIDDEN_CHANNEL) break;
System.out.print(token.getText());
index--;
}
}
}
line1: 'line' '1' {printWhitespaceBetweenRules($start); };
line2: 'line' '2' {printWhitespaceBetweenRules($start); };
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
But you would still need to change every rule.

I guess one solution is to keep the WS tokens in the same channel by removing the $channel = HIDDEN;. This will allow you to get access to the information of a WS token in your parser.

Here's another way to solve it (at least the example you posted).
So you want to replace ...function(1) with ...extraFunction();\nfunction(1), where the dots are indents, and \n a line break.
What you could do is match:
Function1
: Spaces 'function' Spaces '(' Spaces '1' Spaces ')'
;
fragment Spaces
: (' ' | '\t')*
;
and replace that with the text it matches, but pre-pended with your extra method. However, the lexer will now complain when it stumbles upon input like:
'function()'
(without the 1 as a parameter)
or:
' x...'
(indents not followed by the f from function)
So, you'll need to "branch out" in your Function1 rule and make sure you only replace the proper occurrence.
You also must take care of occurrences of function(1) inside string literals and comments, assuming you don't want them to be pre-pended with extraFunction();\n.
A little demo:
grammar T;
parse
: (t=. {System.out.print($t.text);})* EOF
;
Function1
: indent=Spaces
( 'function' Spaces '(' Spaces ( '1' Spaces ')' {setText($indent.text + "extraFunction();\n" + $text);}
| ~'1' // do nothing if something other than `1` occurs
)
| '"' ~('"' | '\r' | '\n')* '"' // do nothing in case of a string literal
| '/*' .* '*/' // do nothing in case of a multi-line comment
| '//' ~('\r' | '\n')* // do nothing in case of a single-line comment
| ~'f' // do nothing in case of a char other than 'f' is seen
)
;
OtherChar
: . // a "fall-through" rule: it will match anything if none of the above matched
;
fragment Spaces
: (' ' | '\t')* // fragment rules are only used inside other lexer rules
;
You can test it with the following class:
import org.antlr.runtime.*;
public class Main {
public static void main(String[] args) throws Exception {
String source =
"/* \n" +
" function(1) \n" +
"*/ \n" +
"void myFunction() { \n" +
" s = \"function(1)\"; \n" +
" function(); \n" +
" function(1); \n" +
"} \n";
System.out.println(source);
System.out.println("---------------------------------");
TLexer lexer = new TLexer(new ANTLRStringStream(source));
TParser parser = new TParser(new CommonTokenStream(lexer));
parser.parse();
}
}
And if you run this Main class, you will see the following being printed to the console:
bart#hades:~/Programming/ANTLR/Demos/T$ java -cp antlr-3.3.jar org.antlr.Tool T.g
bart#hades:~/Programming/ANTLR/Demos/T$ javac -cp antlr-3.3.jar *.java
bart#hades:~/Programming/ANTLR/Demos/T$ java -cp .:antlr-3.3.jar Main
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
function(1);
}
---------------------------------
/*
function(1)
*/
void myFunction() {
s = "function(1)";
function();
extraFunction();
function(1);
}
I'm sure it's not fool-proof (I did't account for char-literals, for one), but this could be a start to solve this, IMO.

Related

ANTLR v4 grammar to recognize jQuery blocks in Java code

I am having a hard time trying to implement a grammar to parse jQuery blocks in between java code.
I do not need to implement a java grammar. This is going to be a translator. I just need to output the java as it is and translate jQuery to java...
jQuery blocks are surrounded by the following tokens: /*#jQ ... */. There can be multiple blocks, but nesting is not allowed. Here is an example:
package test;
public class Test {
public static void main(String[] args) {
System.out.println("Hello world!");
/*#jQ
*/
System.out.println("Good bye world!");
}
}
The desired output of the translator, for this particular case, would be:
package test;
public class Test {
public static void main(String[] args) {
System.out.println("Hello world!");
System.out.println("Good bye world!");
}
}
The problem is I am not being able to read java until a /*#jQ is found. Here is an excerpt of what I have so far:
main
:
java
(
jQueryBlock+ java
)*
;
java
:
.*?
;
jQueryBlock
:
JQUERYBLOCKSTART
(
jQueryStatement SINGLE_LINE_COMMENT?
)* JQUERYBLOCKEND
;
and...
JQUERYBLOCKSTART
:
'/*#jQ'
;
Although the generated parse tree is somewhat acceptable (see below), I get several token recognition error...
JjQuery::main:3:22: token recognition error at: '{'
JjQuery::main:5:44: token recognition error at: '{'
JjQuery::main:6:12: token recognition error at: '.'
JjQuery::main:6:16: token recognition error at: '.'
JjQuery::main:6:37: token recognition error at: '!"'
JjQuery::main:12:12: token recognition error at: '.'
JjQuery::main:12:16: token recognition error at: '.'
JjQuery::main:12:40: token recognition error at: '!"'
JjQuery::main:13:5: token recognition error at: '}'
JjQuery::main:15:4: token recognition error at: '}'
Thanks in advance!
UPDATE
I have modified my grammar as suggested, but I'm still having some problems. Here is an example input, the generated parse tree, and below it the errors thrown.
warning(155): Lexer.g4:22:28: rule SINGLE_LINE_COMMENT contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
warning(155): Lexer.g4:28:25: rule WS contains a lexer command with an unrecognized constant value; lexer interpreters may produce incorrect output
Parser::src:1:3: extraneous input '\n\n' expecting {<EOF>, '/*#jQ', JAVA}
Parser::src:3:5: token recognition error at: '\n'
Parser::src:4:0: token recognition error at: '\n'
Parser::src:5:2: token recognition error at: ' '
Parser::src:5:8: token recognition error at: '\n'
Parser::src:6:0: token recognition error at: '\n'
Parser::src:7:2: extraneous input '\n\n' expecting {<EOF>, '/*#jQ', JAVA}
Here is the current Lexer.g4:
lexer grammar Lexer;
#lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
// Default mode rules (the SEA)
JQBegin
:
'/*#jQ' -> pushMode ( JQUERY )
;
JAVA
:
.
;
WS
:
[ \t\r\n]+ -> channel ( WHITESPACE ) // channel(1)
;
SINGLE_LINE_COMMENT
:
'//' .*? '\n' -> channel ( COMMENTS ) // channel(2)
;
mode JQUERY;
JQEnd
:
'*/' -> popMode
;
IN
:
'in'
;
OUT
:
'out'
;
ID
:
[a-zA-Z_] [a-zA-Z0-9_]*
;
SEMICOLON
:
';'
;
And the Parser.g4:
parser grammar Parser;
options {
tokenVocab = Lexer;
} // use tokens from ModeTagsLexer.g4
src
:
(
JAVA
| jQuery
)+ EOF
;
jQuery
:
JQBegin
(
in
| out
)* JQEnd
;
in
:
IN ID SEMICOLON
;
out
:
OUT ID SEMICOLON
;

Use lexical modes to separately handle JQuery and Java blocks (even though the Java blocks are trivial in your case). Note, lexer modes are only available in Lexer grammars and not in combined grammars.
Also, the Java catchall must match a single character at a time. Otherwise it can consume the JQuery begin sequence (this is likely the source of the errors you are seeing).
main: ( JAVA | jqBlock )+ EOF ;
jqBlock: JQBegin
( ... | ... | ... ) // your JQuery rules
JQEnd
;
JQBegin: '/*#jQ' -> pushMode(JQ) ;
JAVA : . ;
mode JQ;
... // your JQuery specific rules
BlockComment : '/*' .*? '*/' ; // handle any possibly ambiguous
// sequences that otherwise might
// cause early exits
JQEnd: '*/' -> popMode() ;

Matching a "text" in line by line file with XText

I try to write the Xtext BNF for Configuration files (known with the .ini extension)
For instance, I'd like to successfully parse
[Section1]
a = Easy123
b = This *is* valid too
[Section_2]
c = Voilà # inline comments are ignored
My problem is matching the property value (what's on the right of the '=').
My current grammar works if the property matches the ID terminal (eg a = Easy123).
PropertyFile hidden(SL_COMMENT, WS):
sections+=Section*;
Section:
'[' name=ID ']'
(NEWLINE properties+=Property)+
NEWLINE+;
Property:
name=ID (':' | '=') value=ID ';'?;
terminal WS:
(' ' | '\t')+;
terminal NEWLINE:
// New line on DOS or Unix
'\r'? '\n';
terminal ID:
('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z' | '_' | '-' | '0'..'9')*;
terminal SL_COMMENT:
// Single line comment
'#' !('\n' | '\r')*;
I don't know how to generalize the grammar to match any text (eg c = Voilà).
I certainly need to introduce a new terminal
Property:
name=ID (':' | '=') value=TEXT ';'?;
Question is: how should I define this TEXT terminal?
I have tried
terminal TEXT: ANY_OTHER+;
This raises a warning
The following token definitions can never be matched because prior tokens match the same input: RULE_INT,RULE_STRING,RULE_ML_COMMENT,RULE_ANY_OTHER
(I think it doesn't matter).
Parsing Fails with
Required loop (...)+ did not match anything at input 'à'
terminal TEXT: !('\r'|'\n'|'#')+;
This raises a warning
The following token definitions can never be matched because prior tokens match the same input: RULE_INT
(I think it doesn't matter).
Parsing Fails with
Missing EOF at [Section1]
terminal TEXT: ('!'|'$'..'~'); (which covers most characters, except # and ")
No warning during the generation of the lexer/parser.
However Parsing Fails with
Mismatch input 'Easy123' expecting RULE_TEXT
Extraneous input 'This' expecting RULE_TEXT
Required loop (...)+ did not match anything at 'is'
Thanks for your help (and I hope this grammar can be useful for others too)

This grammar does the trick:
grammar org.xtext.example.mydsl.MyDsl hidden(SL_COMMENT, WS)
generate myDsl "http://www.xtext.org/example/mydsl/MyDsl"
import "http://www.eclipse.org/emf/2002/Ecore"
PropertyFile:
sections+=Section*;
Section:
'[' name=ID ']'
(NEWLINE+ properties+=Property)+
NEWLINE+;
Property:
name=ID value=PROPERTY_VALUE;
terminal PROPERTY_VALUE: (':' | '=') !('\n' | '\r')*;
terminal WS:
(' ' | '\t')+;
terminal NEWLINE:
// New line on DOS or Unix
'\r'? '\n';
terminal ID:
('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z' | '_' | '-' | '0'..'9')*;
terminal SL_COMMENT:
// Single line comment
'#' !('\n' | '\r')*;
Key is, that you do not try to cover the complete semantics only in the grammar but take other services into account, too. The terminal rule PROPERTY_VALUE consumes the complete value including leading assignment and optional trailing semicolon.
Now just register a value converter service for that language and take care of the insignificant parts of the input, there:
import org.eclipse.xtext.conversion.IValueConverter;
import org.eclipse.xtext.conversion.ValueConverter;
import org.eclipse.xtext.conversion.ValueConverterException;
import org.eclipse.xtext.conversion.impl.AbstractDeclarativeValueConverterService;
import org.eclipse.xtext.conversion.impl.AbstractIDValueConverter;
import org.eclipse.xtext.conversion.impl.AbstractLexerBasedConverter;
import org.eclipse.xtext.nodemodel.INode;
import org.eclipse.xtext.util.Strings;
import com.google.inject.Inject;
public class PropertyConverters extends AbstractDeclarativeValueConverterService {
#Inject
private AbstractIDValueConverter idValueConverter;
#ValueConverter(rule = "ID")
public IValueConverter<String> ID() {
return idValueConverter;
}
#Inject
private PropertyValueConverter propertyValueConverter;
#ValueConverter(rule = "PROPERTY_VALUE")
public IValueConverter<String> PropertyValue() {
return propertyValueConverter;
}
public static class PropertyValueConverter extends AbstractLexerBasedConverter<String> {
#Override
protected String toEscapedString(String value) {
return " = " + Strings.convertToJavaString(value, false);
}
public String toValue(String string, INode node) {
if (string == null)
return null;
try {
String value = string.substring(1).trim();
if (value.endsWith(";")) {
value = value.substring(0, value.length() - 1);
}
return value;
} catch (IllegalArgumentException e) {
throw new ValueConverterException(e.getMessage(), node, e);
}
}
}
}
The follow test case will succeed, after you registered the service in the runtime module like this:
#Override
public Class<? extends IValueConverterService> bindIValueConverterService() {
return PropertyConverters.class;
}
Test case:
import org.junit.runner.RunWith
import org.eclipse.xtext.junit4.XtextRunner
import org.xtext.example.mydsl.MyDslInjectorProvider
import org.eclipse.xtext.junit4.InjectWith
import org.junit.Test
import org.eclipse.xtext.junit4.util.ParseHelper
import com.google.inject.Inject
import org.xtext.example.mydsl.myDsl.PropertyFile
import static org.junit.Assert.*
#RunWith(typeof(XtextRunner))
#InjectWith(typeof(MyDslInjectorProvider))
class ParserTest {
#Inject
ParseHelper<PropertyFile> helper
#Test
def void testSample() {
val file = helper.parse('''
[Section1]
a = Easy123
b : This *is* valid too;
[Section_2]
# comment
c = Voilà # inline comments are ignored
''')
assertEquals(2, file.sections.size)
val section1 = file.sections.head
assertEquals(2, section1.properties.size)
assertEquals("a", section1.properties.head.name)
assertEquals("Easy123", section1.properties.head.value)
assertEquals("b", section1.properties.last.name)
assertEquals("This *is* valid too", section1.properties.last.value)
val section2 = file.sections.last
assertEquals(1, section2.properties.size)
assertEquals("Voilà # inline comments are ignored", section2.properties.head.value)
}
}

The problem (or one problem anyway) with parsing a format like that is that, since the text part may contain = characters, a line like foo = bar will be interpreted as a single TEXT token, not an ID, followed by a '=', followed by a TEXT. I can see no way to avoid that without disallowing (or requiring escaping of) = characters in the text part.
If that is not an option, I think, the only solution would be to make a token type LINE that matches an entire line and then take that apart yourself. You'd do that by removing TEXT and ID from your grammar and replacing them with a token type LINE that matches everything up to the next line break or comment sign and must start with a valid ID. So something like this:
LINE :
('A'..'Z' | 'a'..'z') ('A'..'Z' | 'a'..'z' | '_' | '-' | '0'..'9')*
WS* '=' WS*
!('\r' | '\n' | '#')+
;
This token would basically replace your Property rule.
Of course this is a rather unsatisfactory solution as it will give you the entire line as a string and you still have to pick it apart yourself to separate the ID from the text part. It also prevents you from highlighting the ID part or the = sign as the entire line is one token and you can't highlight part of a token (as far as I know). Overall this does not buy you all that much over not using XText at all, but I don't see a better way.

As a workaround, I have changed
Property:
name=ID ':' value=ID ';'?;
Now, of course, = is not in conflict any more, but this is certainly not a good solution, because properties can usually defined with name=value
Edit: Actually, my input is a specific property file, and the properties are know in advance.
My code now looks like
Section:
'[' name=ID ']'
(NEWLINE (properties+=AbstractProperty)?)+;
AbstractProperty:
ADef
| BDef
ADef:
'A' (':'|'=') ID;
BDef:
'B' (':'|'=') Float;
There is an extra benefit, the property names are know as keywords, and colored as such. However, autocompletion only suggest '[' :(

Catching (and keeping) all comments with ANTLR

I'm writing a grammar in ANTLR that parses Java source files into ASTs for later analysis. Unlike other parsers (like JavaDoc) I'm trying to keep all of the comments. This is difficult comments can be used literally anywhere in the code. If a comment is somewhere in the source code that doesn't match the grammar, ANTLR can't finish parsing the file.
Is there a way to make ANTLR automatically add any comments it finds to the AST? I know the lexer can simply ignore all of the comments using either {skip();} or by sending the text to the hidden channel. With either of those options set, ANTLR parses the file without any problems at all.
Any ideas are welcome.

Section 12.1 in "The Definitive Antlr 4 Reference" shows how to get access to comments without having to sprinkle the comments rules throughout the grammar. In short you add this to the grammar file:
grammar Java;
#lexer::members {
public static final int WHITESPACE = 1;
public static final int COMMENTS = 2;
}
Then for your comments rules do this:
COMMENT
: '/*' .*? '*/' -> channel(COMMENTS)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(COMMENTS)
;
Then in your code ask for the tokens through the getHiddenTokensToLeft/getHiddenTokensToRight and look at the 12.1 section in the book and you will see how to do this.

first: direct all comments to a certain channel (only comments)
COMMENT
: '/*' .*? '*/' -> channel(2)
;
LINE_COMMENT
: '//' ~[\r\n]* -> channel(2)
;
second: print out all comments
CommonTokenStream tokens = new CommonTokenStream(lexer);
tokens.fill();
for (int index = 0; index < tokens.size(); index++)
{
Token token = tokens.get(index);
// substitute whatever parser you have
if (token.getType() != Parser.WS)
{
String out = "";
// Comments will be printed as channel 2 (configured in .g4 grammar file)
out += "Channel: " + token.getChannel();
out += " Type: " + token.getType();
out += " Hidden: ";
List<Token> hiddenTokensToLeft = tokens.getHiddenTokensToLeft(index);
for (int i = 0; hiddenTokensToLeft != null && i < hiddenTokensToLeft.size(); i++)
{
if (hiddenTokensToLeft.get(i).getType() != IDLParser.WS)
{
out += "\n\t" + i + ":";
out += "\n\tChannel: " + hiddenTokensToLeft.get(i).getChannel() + " Type: " + hiddenTokensToLeft.get(i).getType();
out += hiddenTokensToLeft.get(i).getText().replaceAll("\\s", "");
}
}
out += token.getText().replaceAll("\\s", "");
System.out.println(out);
}
}

Is there a way to make ANTLR automatically add any comments it finds to the AST?
No, you'll have to sprinkle your entire grammar with extra comments rules to account for all the valid places comments can occur:
...
if_stat
: 'if' comments '(' comments expr comments ')' comments ...
;
...
comments
: (SingleLineComment | MultiLineComment)*
;
SingleLineComment
: '//' ~('\r' | '\n')*
;
MultiLineComment
: '/*' .* '*/'
;

The feature "island grammars" can also be used. See the the following section in the ANTLR4 book:
Island Grammars: Dealing with Different Formats in the Same File

I did that on my lexer part :
WS : ( [ \t\r\n] | COMMENT) -> skip
;
fragment
COMMENT
: '/*'.*'*/' /*single comment*/
| '//'~('\r' | '\n')* /* multiple comment*/
;
Like that it will remove them automatically !

For ANTLR v3:
The whitespace tokens are usually not processed by parser, but they are still captured on the HIDDEN channel.
If you use BufferedTokenStream, you can get to list of all tokens through it and do a postprocessing, adding them as needed.

Antlr whitespace token error

I have the following grammar and I want to match the String "{name1, name2}". I just want lists of names/intergers with at least one element. However I get the error:
line 1:6 no viable alternative at character ' '
line 1:11 no viable alternative at character '}'
line 1:7 mismatched input 'name' expecting SIMPLE_VAR_TYPE
I would expect whitespaces and such are ignored... Also interesting is the error does not occur with input "{name1,name2}" (no space after ',').
Heres my gramar
grammar NusmvInput;
options {
language = Java;
}
#header {
package secltlmc.grammar;
}
#lexer::header {
package secltlmc.grammar;
}
specification :
SIMPLE_VAR_TYPE EOF
;
INTEGER
: ('0'..'9')+
;
SIMPLE_VAR_TYPE
: ('{' (NAME | INTEGER) (',' (NAME | INTEGER))* '}' )
;
NAME
: ('A'..'Z' | 'a'..'z') ('a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '$' | '#' | '-')*
;
WS
: (' ' | '\t' | '\n' | '\r')+ {$channel = HIDDEN;}
;
And this is my testing code
package secltlmc;
public class Main {
public static void main(String[] args) throws
IOException, RecognitionException {
CharStream stream = new ANTLRStringStream("{name1, name2}");
NusmvInputLexer lexer = new NusmvInputLexer(stream);
CommonTokenStream tokenStream = new CommonTokenStream(lexer);
NusmvInputParser parser = new NusmvInputParser(tokenStream);
parser.specification();
}
}
Thanks for your help.

The problem is that you are trying to parse SIMPLE_VAR_TYPE with the lexer, i.e. you are trying to make it a single token. In reality, it looks like you want a multi-token production, since you'd like whitespace to be re-directed to hidden channel through WS.
You should change SIMPLE_VAR_TYPE from a lexer rule to a parser rule by changing its initial letter (or better yet, the entire name) to lower case.
specification :
simple_var_type EOF
;
simple_var_type
: ('{' (NAME | INTEGER) (',' (NAME | INTEGER))* '}' )
;

The defintion of SIMPLE_VAR_TYPE specifies the following expression:
Open {
followed by one of NAME or INTEGER
follwoed by zero or more of:
comma (,) followed by one of NAME or INTEGER
followed by closing }
Nowhere does it allow white-space in the input (neither NAME nor INTEGER allows it either), so you get an error when you supply one
Try:
SIMPLE_VAR_TYPE
: ('{' (NAME | INTEGER) (WS* ',' WS* (NAME | INTEGER))* '}' )
;

ANTLR rule to consume fixed number of characters

I am trying to write an ANTLR grammar for the PHP serialize() format, and everything seems to work fine, except for strings. The problem is that the format of serialized strings is :
s:6:"length";
In terms of regexes, a rule like s:(\d+):".{\1}"; would describe this format if only backreferences were allowed in the "number of matches" count (but they are not).
But I cannot find a way to express this for either a lexer or parser grammar: the whole idea is to make the number of characters read depend on a backreference describing the number of characters to read, as in Fortran Hollerith constants (i.e. 6HLength), not on a string delimiter.
This example from the ANTLR grammar for Fortran seems to point the way, but I don't see how. Note that my target language is Python, while most of the doc and examples are for Java:
// numeral literal
ICON {int counter=0;} :
/* other alternatives */
// hollerith
'h' ({counter>0}? NOTNL {counter--;})* {counter==0}?
{
$setType(HOLLERITH);
String str = $getText;
str = str.replaceFirst("([0-9])+h", "");
$setText(str);
}
/* more alternatives */
;

Since input like s:3:"a"b"; is valid, you can't define a String token in your lexer, unless the first and last double quote are always the start and end of your string. But I guess this is not the case.
So, you'll need a lexer rule like this:
SString
: 's:' Int ':"' ( . )* '";'
;
In other words: match a s:, then an integer value followed by :" then one or more characters that can be anything, ending with ";. But you need to tell the lexer to stop consuming when the value Int is not reached. You can do that by mixing some plain code in your grammar to do so. You can embed plain code by wrapping it inside { and }. So first convert the value the token Int holds into an integer variable called chars:
SString
: 's:' Int {chars = int($Int.text)} ':"' ( . )* '";'
;
Now embed some code inside the ( . )* loop to stop it consuming as soon as chars is counted down to zero:
SString
: 's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
;
and that's it.
A little demo grammar:
grammar Test;
options {
language=Python;
}
parse
: (SString {print 'parsed: [\%s]' \% $SString.text})+ EOF
;
SString
: 's:' Int {chars = int($Int.text)} ':"' ( {if chars == 0: break} . {chars = chars-1} )* '";'
;
Int
: '0'..'9'+
;
(note that you need to escape the % inside your grammar!)
And a test script:
import antlr3
from TestLexer import TestLexer
from TestParser import TestParser
input = 's:6:"length";s:1:""";s:0:"";s:3:"end";'
char_stream = antlr3.ANTLRStringStream(input)
lexer = TestLexer(char_stream)
tokens = antlr3.CommonTokenStream(lexer)
parser = TestParser(tokens)
parser.parse()
which produces the following output:
parsed: [s:6:"length";]
parsed: [s:1:""";]
parsed: [s:0:"";]
parsed: [s:3:"end";]

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting - parsing

I guess one solution is to keep the WS tokens in the same channel by removing the $channel = HIDDEN;. This will allow you to get access to the information of a WS token in your parser.

Related

ANTLR v4 grammar to recognize jQuery blocks in Java code

Matching a "text" in line by line file with XText

Catching (and keeping) all comments with ANTLR

Antlr whitespace token error

ANTLR rule to consume fixed number of characters

Categories

Resources