Writing a parser with multiple expected completions

Let's say I have the following grammar for boolean queries:
Query := <Atom> [ <And-Chain> | <Or-Chain> ]
Atom := "(" <Query> ")" | <Var> <Op> <Var>
And-Chain := "&&" <Atom> [ <And-Chain> ]
Or-Chain := "||" <Atom> [ <Or-Chain> ]
Var := "A" | "B" | "C" | ... | "Z"
Op := "<" | ">" | "="
A parser for this grammar would have the pseudo-code:
parse(query) :=
    tokens <- tokenize(query)
    parse-query(tokens)
    assert-empty(tokens)

parse-query(tokens) :=
    parse-atom(tokens)
    if peek(tokens) equals "&&"
        parse-chain(tokens, "&&")
    else if peek(tokens) equals "||"
        parse-chain(tokens, "||")

parse-atom(tokens) :=
    if peek(tokens) equals "("
        assert-equal( "(", shift(tokens) )
        parse-query(tokens)
        assert-equal( ")", shift(tokens) )
    else
        parse-var(tokens)
        parse-op(tokens)
        parse-var(tokens)

parse-chain(tokens, connector) :=
    assert-equal( connector, shift(tokens) )
    parse-atom(tokens)
    if peek(tokens) equals connector
        parse-chain(tokens, connector)

parse-var(tokens) :=
    assert-matches( /[A-Z]/, shift(tokens) )

parse-op(tokens) :=
    assert-matches( /[<>=]/, shift(tokens) )
What I want to make sure of, though, is that my parser will report helpful parse errors. For example, given a query that starts with "( A < B && B < C || ...", I'd like an error like:
found "||" but expected "&&" or ")"
The trick with this is that it gathers expectations from across different parts of the parser. I can work out ways to do this, but it all ends up feeling a little clunky.
Like, in Java, I might throw a GreedyError when attempting to peek for a "&&" or "||":
// in parseAtom
if ( tokens.peek().equals("(") ) {
    assertEqual( "(", tokens.shift() );
    try {
        parseQuery(tokens);
        caught = null;
    }
    catch (GreedyError e) {
        caught = e;
    }
    try {
        assertEqual( ")", tokens.shift() );
    }
    catch (AssertionError e) {
        throw e.or(caught);
    }
}
// ...

// in parseChain
assertEqual( connector, tokens.shift() );
parseAtom(tokens);
if (tokens.peek().equals(connector)) {
    parseChain(tokens, connector);
}
else {
    throw new GreedyError(connector);
}
Or, in Haskell, I might use WriterT to keep track of my last failing comparisons for the current token, using censor to clear out after every successful token match.
But both of these solutions feel a bit hacky and clunky. I feel like I'm missing something fundamental, some pattern, that could handle this elegantly. What is it?

Build an LR(1) or LALR(1) state machine.
What you want to use for error reporting is the FIRST set for the dot in each state. This answer will make sense once you understand how the parser states are generated.
If you have such a set of states, you can record the union of the FIRST sets with each state; then, when you are in that state and no transition is possible, your error message is "Expected (FIRST of current state)".
If you don't want to record the FIRST sets, you can easily write an algorithm that climbs through the state tables to reconstruct them. That would be the algorithmic equivalent of your "peeking ahead".
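To make the suggestion concrete, here is a minimal runnable sketch of how a per-state table of viable next tokens yields exactly the error message asked for above. It is in Go purely for brevity; the state numbering, the table, and all names are invented for illustration, not taken from any real generator.

package main

import (
    "fmt"
    "strings"
)

// Hypothetical fragment of an LR-style table: for each parser state, the
// terminals on which a transition exists. State 7 stands in for the state
// reached after "( A < B && B < C" in the example above.
var expects = map[int][]string{
    7: {"&&", ")"},
}

// report derives the message from the table alone, so expectations from
// different grammar rules are gathered automatically.
func report(state int, found string) error {
    quoted := make([]string, len(expects[state]))
    for i, tok := range expects[state] {
        quoted[i] = fmt.Sprintf("%q", tok)
    }
    return fmt.Errorf("found %q but expected %s", found, strings.Join(quoted, " or "))
}

func main() {
    fmt.Println(report(7, "||")) // found "||" but expected "&&" or ")"
}

The "clunky" exception plumbing disappears because the expectations live in data rather than in control flow; on failure the parser only reports which state it was in and what it found.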

Related

PEG grammar to suppress execution in if command

I want to create a grammar for parsing some commands. Most of it works flawlessly, but the "if(condition,then-value,else-value)" command does not work together with the "out" command used to show a value.
It works fine when the out command is outside the if command:
out(if(1,42,43))
→ outputs and returns 42, as expected. OK.
But as soon as the out command is inside the then and else parts (which is required to be more intuitive), it fails:
if(1,out(42),out(43))
→ still returns only 42 as expected, but the output function is called twice, with 42 and with 43.
I'm working in C with the peg/leg parser generator.
The problem is also reproducible with the PEG.js online parser generator when using the following much-simplified grammar:
Expression
  = Int
  / "if(" cond:Expression "," ok:Expression "," nok:Expression ")" { return cond?ok:nok; }
  / "out(" num:Expression ")" { window.alert(num); return num; }

Int = [0-9]+ { return parseInt(text(), 10); }
The "window.alert()" is only a placeholder for the needed output function, but for this problem it behaves the same.
It looks like the scanner has to match the full if command, with the then and else values, up to the closing bracket ")". So it matches both out commands, and both execute the defined function, which is not what I expect.
Is there a way in peg/leg to match some characters but suppress execution of the corresponding function under some circumstances?
(I've already experimented with the "&" predicate element, without success.)
(Maybe left recursion vs. right recursion could help here, but the peg/leg generator seems to support only right recursion.)
Is there a way in peg/leg to match some characters but suppress execution of the corresponding function under some circumstances?
I'm not familiar with the tools in question, but it would surprise me if this were possible. And even if it were, you'd run into a similar problem when implementing loops: now you'd need to execute the action multiple times.
What you need is for your actions to not directly execute the code, but return something that can be used to execute it.
The usual way that interpreters work is that the parser produces some sort of representation of the source code (such as bytecode or an AST), which is then executed as a separate step.
The simplest (but perhaps not cleanest) way to make your parser work without changing too much would be to just wrap all your actions in 0-argument functions. You could then call the functions returned by the sub-expressions if and only if you want them to be executed. And to implement loops, you could then simply call the functions multiple times.
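As a minimal sketch of that idea, here is the opening if(1,out(42),out(43)) example with every action wrapped in a zero-argument closure. The sketch is in Go rather than C or PEG.js, since it only illustrates the principle; all names are invented.

package main

import "fmt"

// Each parser "action" returns a thunk instead of executing immediately.
type thunk func() int

func lit(n int) thunk { return func() int { return n } }

// out's side effect happens only when the returned thunk is forced.
func out(e thunk) thunk {
    return func() int {
        v := e()
        fmt.Println("out:", v)
        return v
    }
}

// ifExpr forces the condition, then only the chosen branch.
func ifExpr(cond, ok, nok thunk) thunk {
    return func() int {
        if cond() != 0 {
            return ok()
        }
        return nok()
    }
}

func main() {
    // if(1,out(42),out(43)): parsing builds the thunk tree; evaluation runs once.
    expr := ifExpr(lit(1), out(lit(42)), out(lit(43)))
    fmt.Println(expr()) // prints "out: 42", then "42"; out(43) never runs
}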
A solution could be to use a predicate expression "&{expression}" (not to be confused with the predicate element "& element"):
Expression
  = Function

Function
  = Int
  / "if(" IfCond "," ok:Function "," nok:FunctionDisabled ")" { return ok; }
  / "if(" FunctionDisabled "," ok:FunctionDisabled "," nok:Function ")" { return nok; }
  / "out(" num:Function ")" { window.alert("Out:"+num); return num; }

FunctionDisabled
  = Int
  / "if(" IfCond "," ok:FunctionDisabled "," nok:FunctionDisabled ")" { return ok; }
  / "if(" FunctionDisabled "," ok:FunctionDisabled "," nok:FunctionDisabled ")" { return nok; }
  / "out(" num:FunctionDisabled ")" { return num; }

IfCond
  = cond:FunctionDisabled &{ return cond; }

Int = [0-9]+ { return parseInt(text(), 10); }
The idea is to define out() twice: once really doing something, and a second time disabled, with no output.
The condition of the if command is evaluated using the code inside {}, so if the condition is false, the whole expression match fails.
A visible drawback is the redundant definition of the if command for the then and else branches, plus the recursive disabled variants.

Antlr4: How can I both hide and use Tokens in a grammar

I'm parsing a script language that defines two types of statements: control statements and non-control statements. Non-control statements always end with ';', while control statements may end with ';' or EOL ('\n'). A part of the grammar looks like this:
script
  : statement* EOF
  ;

statement
  : control_statement
  | no_control_statement
  ;

control_statement
  : if_then_control_statement
  ;

if_then_control_statement
  : IF expression THEN end_control_statement
    ( statement )*
    ( ELSEIF expression THEN end_control_statement ( statement )* )*
    ( ELSE end_control_statement ( statement )* )?
    END IF end_control_statement
  ;

no_control_statement
  : sleep_statement
  ;

sleep_statement
  : SLEEP expression END_STATEMENT
  ;

end_control_statement
  : END_STATEMENT
  | EOL
  ;

END_STATEMENT
  : ';'
  ;

ANY_SPACE
  : ( LINE_SPACE | EOL ) -> channel(HIDDEN)
  ;

EOL
  : [\n\r]+
  ;

LINE_SPACE
  : [ \t]+
  ;
In all other aspects of the script language I never care about EOL, so I use the normal lexer rules to hide whitespace.
This works fine in all cases except those where I need an EOL to find the termination of a control statement: with the grammar above, every EOL is hidden and cannot be used in the control statement rules.
Is there a way to change my grammar so that I can skip all EOLs except the ones needed to terminate parts of my control statements?
Found one way to handle this.
The idea is to divert EOL into one hidden channel, and the other stuff I don't want to see (like spaces and comments) into another hidden channel. Then, when an EOL is supposed to show up, I use some code to backtrack through the tokens and examine the channels of the previous tokens (since they have already been consumed). If I find something on the EOL channel before I run into something from the ordinary channel, then it is OK.
It looks like this:
Changed the lexer rules:
@lexer::members {
    public static int EOL_CHANNEL = 1;
    public static int OTHER_CHANNEL = 2;
}

...

EOL
  : '\r'? '\n' -> channel(EOL_CHANNEL)
  ;

LINE_SPACE
  : [ \t]+ -> channel(OTHER_CHANNEL)
  ;
I also diverted all other HIDDEN channels (comments) to the OTHER_CHANNEL.
Then I changed the rule end_control_statement:

end_control_statement
  : END_STATEMENT
  | { isEOLPrevious() }?
  ;
and added
@parser::members {
    public static int EOL_CHANNEL = 1;
    public static int OTHER_CHANNEL = 2;

    boolean isEOLPrevious() {
        int idx = getCurrentToken().getTokenIndex();
        int ch;
        do {
            ch = getTokenStream().get(--idx).getChannel();
        } while (ch == OTHER_CHANNEL);
        // The EOL channel only carries EOL tokens, no need to check the token itself
        return (ch == EOL_CHANNEL);
    }
}
One could stick to the ordinary hidden channel, but then you would need to track both channels and tokens while backtracking, so this is maybe a bit easier...
Hope this helps someone else dealing with this kind of issue...

How to implement a BNF grammar tree for parsing input in Go?

The grammar for the type language is as follows:
TYPE ::= TYPEVAR | PRIMITIVE_TYPE | FUNCTYPE | LISTTYPE;
PRIMITIVE_TYPE ::= 'int' | 'float' | 'long' | 'string';
TYPEVAR ::= '`' VARNAME; // Note, the character is a backwards apostrophe!
VARNAME ::= [a-zA-Z][a-zA-Z0-9]*; // Initial letter, then can have numbers
FUNCTYPE ::= '(' ARGLIST ')' -> TYPE | '(' ')' -> TYPE;
ARGLIST ::= TYPE ',' ARGLIST | TYPE;
LISTTYPE ::= '[' TYPE ']';
My input is a TYPE, as defined by this grammar.
For example, if I input (int,int)->float, this is valid. If I input ([int], int), it's a wrong type and invalid.
I need to parse input from the keyboard and decide whether it's valid under this grammar (for later type inference). However, I don't know how to build this grammar in Go or how to parse the input byte by byte. Is there any hint or similar implementation? That would be really helpful.
For your purposes, the grammar of types looks simple enough that you should be able to write a recursive descent parser that roughly matches the shape of your grammar.
As a concrete example, let's say that we're recognizing a similar language.
TYPE ::= PRIMITIVETYPE | TUPLETYPE
PRIMITIVETYPE ::= 'int'
TUPLETYPE ::= '(' ARGLIST ')'
ARGLIST ::= TYPE ARGLIST | TYPE
Not quite exactly the same as your original problem, but you should be able to see the similarities.
A recursive descent parser consists of functions for each production rule.
func ParseType(???) error {
    ???
}

func ParsePrimitiveType(???) error {
    ???
}

func ParseTupleType(???) error {
    ???
}

func ParseArgList(???) error {
    ???
}
where ??? marks the things we don't quite know what to put until we get there. We'll at least say for now that we get an error if we can't parse.
The input into each of the functions is some stream of tokens. In our case, those tokens consist of sequences of:
"int"
"("
")"
and we can imagine a Stream might be something that satisfies:
type Stream interface {
    Peek() string // peek at next token, stay where we are
    Next() string // pick next token, move forward
}
to let us walk sequentially through the token stream.
A lexer is responsible for taking something like a string or io.Reader and producing this stream of string tokens. Lexers are fairly easy to write: you can imagine just using regexps or something similar to break a string into tokens.
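For example, a throwaway lexer for our three tokens might look like this (an illustrative sketch using simple byte inspection instead of regexps; the Tokenize name and error wording are invented):

package main

import (
    "fmt"
    "strings"
)

func Tokenize(input string) ([]string, error) {
    var tokens []string
    for i := 0; i < len(input); {
        switch {
        case input[i] == ' ' || input[i] == '\t':
            i++ // skip whitespace between tokens
        case input[i] == '(' || input[i] == ')':
            tokens = append(tokens, string(input[i]))
            i++
        case strings.HasPrefix(input[i:], "int"):
            tokens = append(tokens, "int")
            i += len("int")
        default:
            return nil, fmt.Errorf("unexpected character %q at offset %d", input[i], i)
        }
    }
    return tokens, nil
}

func main() {
    fmt.Println(Tokenize("(int int)")) // [( int int )] <nil>
}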
Assuming we have a token stream, the parser just needs to deal with that stream and a very limited set of possibilities. As mentioned before, each production rule corresponds to a parsing function. Within a production rule, each alternative is a conditional branch. If the grammar is particularly simple (as yours is!), we can figure out which conditional branch to take.
For example, let's look at TYPE and its corresponding ParseType function:
TYPE ::= PRIMITIVETYPE | TUPLETYPE
PRIMITIVETYPE ::= 'int'
TUPLETYPE ::= '(' ARGLIST ')'
How might this correspond to the definition of ParseType?
The production says that there are two possibilities: it can either be (1) primitive, or (2) tuple. We can peek at the token stream: if we see "int", then we know it's primitive. If we see a "(", then since the only possibility is that it's tuple type, we can call the tupletype parser function and let it do the dirty work.
It's important to note: if we see neither a "(" nor an "int", then something has gone horribly wrong! We know this just from looking at the grammar: every type must start (FIRST) with one of those two tokens.
Ok, let's write the code.
func ParseType(s Stream) error {
    peeked := s.Peek()
    if peeked == "int" {
        return ParsePrimitiveType(s)
    }
    if peeked == "(" {
        return ParseTupleType(s)
    }
    return fmt.Errorf("ParseType on %#v", peeked)
}
Parsing PRIMITIVETYPE and TUPLETYPE is equally direct.
func ParsePrimitiveType(s Stream) error {
    next := s.Next()
    if next == "int" {
        return nil
    }
    return fmt.Errorf("ParsePrimitiveType on %#v", next)
}

func ParseTupleType(s Stream) error {
    lparen := s.Next()
    if lparen != "(" {
        return fmt.Errorf("ParseTupleType on %#v", lparen)
    }
    err := ParseArgList(s)
    if err != nil {
        return err
    }
    rparen := s.Next()
    if rparen != ")" {
        return fmt.Errorf("ParseTupleType on %#v", rparen)
    }
    return nil
}
The only one that might cause some issues is the parser for argument lists. Let's look at the rule.
ARGLIST ::= TYPE ARGLIST | TYPE
If we try to write the function ParseArgList, we might get stuck because we don't yet know which choice to make. Do we go for the first, or the second choice?
Well, let's at least parse out the part that's common to both alternatives: the TYPE part.
func ParseArgList(s Stream) error {
    err := ParseType(s)
    if err != nil {
        return err
    }
    // ... FILL ME IN. Do we call ParseArgList() again, or stop?
}
So we've parsed the prefix. If it was the second case, we're done. But if it was the first case, we'd still have to read additional types.
Ah, but if we are continuing to read additional types, then the stream must start with another type. And we know that all types FIRST-start with either "int" or "(". So we can peek at the stream: our decision between the first and the second choice hinges just on this!
func ParseArgList(s Stream) error {
    err := ParseType(s)
    if err != nil {
        return err
    }
    peeked := s.Peek()
    if peeked == "int" || peeked == "(" {
        // alternative 1
        return ParseArgList(s)
    }
    // alternative 2
    return nil
}
Believe it or not, that's pretty much all we need. Here is working code.
package main

import "fmt"

type Stream interface {
    Peek() string
    Next() string
}

type TokenSlice []string

func (s *TokenSlice) Peek() string {
    return (*s)[0]
}

func (s *TokenSlice) Next() string {
    result := (*s)[0]
    *s = (*s)[1:]
    return result
}

func ParseType(s Stream) error {
    peeked := s.Peek()
    if peeked == "int" {
        return ParsePrimitiveType(s)
    }
    if peeked == "(" {
        return ParseTupleType(s)
    }
    return fmt.Errorf("ParseType on %#v", peeked)
}

func ParsePrimitiveType(s Stream) error {
    next := s.Next()
    if next == "int" {
        return nil
    }
    return fmt.Errorf("ParsePrimitiveType on %#v", next)
}

func ParseTupleType(s Stream) error {
    lparen := s.Next()
    if lparen != "(" {
        return fmt.Errorf("ParseTupleType on %#v", lparen)
    }
    err := ParseArgList(s)
    if err != nil {
        return err
    }
    rparen := s.Next()
    if rparen != ")" {
        return fmt.Errorf("ParseTupleType on %#v", rparen)
    }
    return nil
}

func ParseArgList(s Stream) error {
    err := ParseType(s)
    if err != nil {
        return err
    }
    peeked := s.Peek()
    if peeked == "int" || peeked == "(" {
        // alternative 1
        return ParseArgList(s)
    }
    // alternative 2
    return nil
}

func main() {
    fmt.Println(ParseType(&TokenSlice{"int"}))
    fmt.Println(ParseType(&TokenSlice{"(", "int", ")"}))
    fmt.Println(ParseType(&TokenSlice{"(", "int", "int", ")"}))
    fmt.Println(ParseType(&TokenSlice{"(", "(", "int", ")", "(", "int", ")", ")"}))
    // Should show an error:
    fmt.Println(ParseType(&TokenSlice{"(", ")"}))
}
This is a toy parser, of course, because it is not handling certain kinds of errors very well (like premature end of input), and tokens should include, not only their textual content, but also their source location for good error reporting. For your own purposes, you'll also want to expand the parsers so that they don't just return error, but also some kind of useful result from the parse.
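As a rough sketch of those two points (the names Token and SafeStream are invented, not prescriptive):

package parser

// A token that carries its source location for good error reporting.
type Token struct {
    Text string // the matched text, e.g. "int" or "("
    Pos  int    // byte offset in the input
}

// A stream whose accessors report end-of-input instead of panicking on
// premature end of input.
type SafeStream interface {
    Peek() (tok Token, ok bool) // ok is false once the input is exhausted
    Next() (tok Token, ok bool)
}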
This answer is just a sketch of how recursive descent parsers work. But you should really read a good compiler book to get the details, because you need them. The Dragon Book, for example, spends at least a good chapter on how to write recursive descent parsers, with plenty of the technical details. In particular, you want to know about the concept of FIRST sets (which I hinted at), because you'll need to understand them to choose which alternative is appropriate when writing each of your parser functions.
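For the toy grammar above, the FIRST sets are small enough to write out by hand, and every conditional branch in the parser functions above is exactly a membership test against one of them (a sketch; a real generator computes these):

package parser

// FIRST sets for the toy grammar, written out by hand.
var first = map[string][]string{
    "TYPE":          {"int", "("},
    "PRIMITIVETYPE": {"int"},
    "TUPLETYPE":     {"("},
    "ARGLIST":       {"int", "("}, // FIRST(ARGLIST) = FIRST(TYPE)
}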

Recursive descent parser implementation

I am looking to write some pseudo-code for a recursive descent parser. Now, I have no experience with this type of coding. I have read some examples online, but they only cover grammars for mathematical expressions. Here is the grammar I am basing the parser on.
S -> if E then S | if E then S else S | begin S L | print E
L -> end | ; S L
E -> i
I must write methods S(), L() and E() and return some error messages, but the tutorials I have found online have not helped a lot. Can anyone point me in the right direction and give me some examples?
I would like to write it in C# or Java syntax, since it is easier for me to relate.
Update
public void S() {
    if (currentToken.equals("if")) {
        getNextToken();
        E();
        if (currentToken.equals("then")) {
            getNextToken();
            S();
            if (currentToken.equals("else")) {
                getNextToken();
                S();
            }
            return;
        } else {
            throw new IllegalTokenException("Procedure S() expected a 'then' token " +
                    "but received: " + currentToken);
        }
    } else if (currentToken.equals("begin")) {
        getNextToken();
        S();
        L();
        return;
    } else if (currentToken.equals("print")) {
        getNextToken();
        E();
        return;
    } else {
        throw new IllegalTokenException("Procedure S() expected an 'if', 'begin' or 'print' token " +
                "but received: " + currentToken);
    }
}
public void L() {
    if (currentToken.equals("end")) {
        getNextToken();
        return;
    } else if (currentToken.equals(";")) {
        getNextToken();
        S();
        L();
        return;
    } else {
        throw new IllegalTokenException("Procedure L() expected an 'end' or ';' token " +
                "but received: " + currentToken);
    }
}

public void E() {
    if (currentToken.equals("i")) {
        getNextToken();
        return;
    } else {
        throw new IllegalTokenException("Procedure E() expected an 'i' token " +
                "but received: " + currentToken);
    }
}
Basically, in recursive descent parsing, each non-terminal in the grammar is translated into a procedure. Inside each procedure you check whether the current token matches what you would expect to see on the right-hand side of the non-terminal symbol corresponding to the procedure; if it does, continue applying the production; if it doesn't, you have encountered an error and must take some action.
So in your case, as you mentioned above, you will have the procedures S(), L(), and E(). I will give an example of how L() could be implemented, and then you can try to do S() and E() on your own.
It is also important to note that you will need some other program to tokenize the input for you; then you can simply ask that program for the next token from your input.
/**
 * This procedure corresponds to the L non-terminal
 * L -> 'end'
 * L -> ';' S L
 */
public void L() {
    if (currentToken.equals("end")) {
        // we found an 'end' token, get the next token in the input stream
        // Notice, there are no other non-terminals or terminals after
        // 'end' in this production, so all we do now is return
        // Note: here we return to the procedure that called L()
        getNextToken();
        return;
    } else if (currentToken.equals(";")) {
        // we found a ';', get the next token in the input stream
        getNextToken();
        // Notice that there are two more non-terminal symbols after ';'
        // in this production, S and L; all we have to do is call those
        // procedures in the order they appear in the production, then return
        S();
        L();
        return;
    } else {
        // The currentToken is neither an 'end' token nor a ';' token
        // This is an error, raise some exception
        throw new IllegalTokenException(
                "Procedure L() expected an 'end' or ';' token " +
                "but received: " + currentToken);
    }
}
Now you try S() and E(), and post back.
Also, as Kristopher points out, your grammar has something called a dangling else, meaning that you have productions that start with the same thing up to a point:
S -> if E then S
S -> if E then S else S
This raises the question: if your parser sees an 'if' token, which production should it choose to process the input? The answer is that it has no idea which one to choose, because unlike humans the compiler can't look arbitrarily far ahead into the input stream to search for an 'else' token. This is a simple problem to fix by applying a rule known as left-factoring, very similar to how you would factor an algebra problem.
All you have to do is create a new non-terminal symbol S' (S-prime) whose right-hand side holds the pieces of the productions that aren't common, so your S productions now become:
S -> if E then S S'
S' -> else S
S' -> e
(e is used here to denote the empty string, which basically means there is no input seen)
This is not the easiest grammar to start with because you have an unlimited amount of lookahead on your first production rule:
S -> if E then S | if E then S else S | begin S L | print E
consider
if 5 then begin begin begin begin ...
When do we determine this stupid else?
also, consider
if 5 then if 4 then if 3 then if 2 then print 2 else ...
Now, was that else supposed to bind to the if 5 then fragment? If not, that's actually cool, but be explicit.
You can rewrite your grammar (possibly, depending on else rule) equivalently as:
S -> if E then S (else S)? | begin S L | print E
L -> end | ; S L
E -> i
Which may or may not be what you want. But the pseudocode sort of jumps out from this.
define S() {
    if (peek() == "if") {
        consume("if")
        E()
        consume("then")
        S()
        if (peek() == "else") {
            consume("else")
            S()
        }
    } else if (peek() == "begin") {
        consume("begin")
        S()
        L()
    } else if (peek() == "print") {
        consume("print")
        E()
    } else {
        throw error()
    }
}

define L() {
    if (peek() == "end") {
        consume("end")
    } else if (peek() == ";") {
        consume(";")
        S()
        L()
    } else {
        throw error()
    }
}

define E() {
    consume_token_i()
}
For each alternate, I created an if statement that looked at the unique prefix. The final else on any match attempt is always an error. I consume keywords and call the functions corresponding to production rules as I encounter them.
Translating from pseudocode to real code isn't too complicated, but it isn't trivial. Those peeks and consumes probably don't actually operate on strings. It's far easier to operate on tokens. And simply walking a sentence and consuming it isn't quite the same as parsing it. You'll want to do something as you consume the pieces, possibly building up a parse tree (which means these functions probably return something). And throwing an error might be correct at a high level, but you'd want to put some meaningful information into the error. Also, things get more complex if you do require lookahead.
I would recommend Language Implementation Patterns by Terence Parr (the guy who wrote antlr, a recursive descent parser generator) when looking at these kinds of problems. The Dragon Book (Aho, et al, recommended in a comment) is good, too (it is still probably the dominant textbook in compiler courses).
I taught (really, just helped with) the parsing section of a PL class last semester, and I really recommend looking at the parsing slides from our course page. Basically, for recursive descent parsing, you ask yourself the following question:
I've parsed a little bit of a nonterminal, and now I'm at a point where I can make a choice as to what I'm supposed to parse next. What I see next will decide which production I'm in.
Your grammar, by the way, exhibits a very common ambiguity called the "dangling else," which has been around since the Algol days. In shift-reduce parsers (typically produced by parser generators) it generates a shift/reduce conflict, where you typically choose to shift instead of reduce, giving you the common "maximal munch" principle. (So, if you see "if (b) then if (b2) S1 else S2", you read it as "if (b) then { if (b2) { s1; } else { s2; } }".)
Let's throw this out of your grammar, and work with a slightly simpler grammar:
T -> A + T
   | A - T
   | A

A -> NUM * A
   | NUM / A
   | NUM

We're also going to assume that NUM is something you get from the lexer (i.e., it's just a token). This grammar is LL(1), that is, you can parse it with a recursive descent parser implemented using a naive algorithm. The algorithm works like this:
parse_T() {
    a = parse_A();
    if (next_token() == '+') {
        next_token(); // advance to the next token in the stream
        t = parse_T();
        return T_node_plus_constructor(a, t);
    }
    else if (next_token() == '-') {
        next_token(); // advance to the next token in the stream
        t = parse_T();
        return T_node_minus_constructor(a, t);
    } else {
        return T_node_a_constructor(a);
    }
}

parse_A() {
    num = token();  // gets the current token from the token stream
    next_token();   // advance to the next token in the stream
    assert(token_type(num) == NUM);
    if (next_token() == '*') {
        a = parse_A();
        return A_node_times_constructor(num, a);
    }
    else if (next_token() == '/') {
        a = parse_A();
        return A_node_div_constructor(num, a);
    } else {
        return A_node_num_constructor(num);
    }
}
Do you see the pattern here? For each nonterminal in your grammar: parse the first thing, then see what you have to look at to decide what you should parse next. Continue this until you're done!
Also, please note that this type of parsing typically won't produce what you want for arithmetic. Recursive descent parsers (unless you use a little trick with iteration, shown below) won't produce left-associative parse trees. In particular, you can't directly write left-recursive rules like "a -> a - a", where left associativity is really necessary! This is why people typically use fancier parser generator tools. But the recursive descent trick is still worth knowing about and playing around with.
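To make that concrete, here is a small runnable sketch of the usual workaround, in Go to match the longer example earlier on this page: parse the first operand, then loop, folding each new operand into the left side. All names are invented for illustration.

package main

import "fmt"

type node struct {
    op   string
    l, r *node
    num  string
}

func (n *node) String() string {
    if n.op == "" {
        return n.num
    }
    return fmt.Sprintf("(%s %s %s)", n.l, n.op, n.r)
}

// parseT folds to the left as it loops, so "1 - 2 - 3" becomes
// ((1 - 2) - 3) instead of (1 - (2 - 3)).
func parseT(toks []string) (*node, []string) {
    left, rest := parseNum(toks)
    for len(rest) > 0 && (rest[0] == "+" || rest[0] == "-") {
        op := rest[0]
        var right *node
        right, rest = parseNum(rest[1:])
        left = &node{op: op, l: left, r: right}
    }
    return left, rest
}

func parseNum(toks []string) (*node, []string) {
    return &node{num: toks[0]}, toks[1:]
}

func main() {
    n, _ := parseT([]string{"1", "-", "2", "-", "3"})
    fmt.Println(n) // ((1 - 2) - 3)
}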

ANTLR Grammar to Preprocess Source Files While Preserving WhiteSpace Formatting

I am trying to preprocess my C++ source files with ANTLR. I would like to output the input file, preserving all the whitespace formatting of the original source file, while inserting some new source code of my own at the appropriate locations.
I know preserving WS requires this lexer rule:
WS: (' '|'\n'| '\r'|'\t'|'\f' )+ {$channel=HIDDEN;};
With this, my parser rules would have a $text attribute containing all the hidden WS. But the problem is that, for any parser rule, its $text attribute only includes the input text starting from the position that matches the first token of the rule. For example, if this is my input (note the formatting WS before and in between the tokens):
 line 1; line 2;
And if I have two separate parser rules matching "line 1;" and "line 2;" separately, but none matching the whole line " line 1; line 2;", then the leading WS and the WS in between "line 1" and "line 2" are lost (not accessible by any of my rules).
What should I do to preserve ALL THE WHITESPACEs while allowing my parser rules to determine when to add new codes at the appropriate locations?
EDIT
Let's say that whenever my code contains a call to function(1), using 1 as the parameter and not something else, the preprocessor adds an extraFunction() before it:
void myFunction() {
    function();
    function(1);
}

Becomes:

void myFunction() {
    function();
    extraFunction();
    function(1);
}
This preprocessed output should remain human-readable, as people will continue coding on it. For this simple example, a text editor could handle it. But there are more complicated cases that justify the use of ANTLR.
Another solution, though maybe also not very practical: you can collect all whitespace backwards, with something like this untested pseudocode:
grammar T;

@members {
    public void printWhitespaceBetweenRules(Token start) {
        int index = start.getTokenIndex() - 1;
        while (index >= 0) {
            Token token = input.get(index);
            if (token.getChannel() != Token.HIDDEN_CHANNEL) break;
            System.out.print(token.getText());
            index--;
        }
    }
}

line1: 'line' '1' { printWhitespaceBetweenRules($start); };
line2: 'line' '2' { printWhitespaceBetweenRules($start); };

WS: (' ' | '\n' | '\r' | '\t' | '\f')+ { $channel = HIDDEN; };
But you would still need to change every rule.
I guess one solution is to keep the WS tokens in the same channel by removing the $channel = HIDDEN;. This will give you access to the WS tokens in your parser.
Here's another way to solve it (at least the example you posted).
So you want to replace ...function(1) with ...extraFunction();\nfunction(1), where the dots are indents, and \n a line break.
What you could do is match:
Function1
  : Spaces 'function' Spaces '(' Spaces '1' Spaces ')'
  ;

fragment Spaces
  : (' ' | '\t')*
  ;
and replace that with the text it matches, prepended with your extra method. However, the lexer will now complain when it stumbles upon input like:
'function()'
(without the 1 as a parameter)
or:
' x...'
(indents not followed by the f from function)
So, you'll need to "branch out" in your Function1 rule and make sure you only replace the proper occurrence.
You also must take care of occurrences of function(1) inside string literals and comments, assuming you don't want them to be pre-pended with extraFunction();\n.
A little demo:
grammar T;
parse
  : (t=. {System.out.print($t.text);})* EOF
  ;

Function1
  : indent=Spaces
    ( 'function' Spaces '(' Spaces ( '1' Spaces ')' {setText($indent.text + "extraFunction();\n" + $text);}
      | ~'1' // do nothing if something other than `1` occurs
      )
    | '"' ~('"' | '\r' | '\n')* '"' // do nothing in case of a string literal
    | '/*' .* '*/'                  // do nothing in case of a multi-line comment
    | '//' ~('\r' | '\n')*          // do nothing in case of a single-line comment
    | ~'f'                          // do nothing in case a char other than 'f' is seen
    )
  ;

OtherChar
  : . // a "fall-through" rule: it will match anything if none of the above matched
  ;

fragment Spaces
  : (' ' | '\t')* // fragment rules are only used inside other lexer rules
  ;
You can test it with the following class:
import org.antlr.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        String source =
                "/* \n" +
                " function(1) \n" +
                "*/ \n" +
                "void myFunction() { \n" +
                " s = \"function(1)\"; \n" +
                " function(); \n" +
                " function(1); \n" +
                "} \n";
        System.out.println(source);
        System.out.println("---------------------------------");
        TLexer lexer = new TLexer(new ANTLRStringStream(source));
        TParser parser = new TParser(new CommonTokenStream(lexer));
        parser.parse();
    }
}
And if you run this Main class, you will see the following being printed to the console:
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp antlr-3.3.jar org.antlr.Tool T.g
bart@hades:~/Programming/ANTLR/Demos/T$ javac -cp antlr-3.3.jar *.java
bart@hades:~/Programming/ANTLR/Demos/T$ java -cp .:antlr-3.3.jar Main
/*
 function(1)
*/
void myFunction() {
 s = "function(1)";
 function();
 function(1);
}
---------------------------------
/*
 function(1)
*/
void myFunction() {
 s = "function(1)";
 function();
 extraFunction();
 function(1);
}
I'm sure it's not fool-proof (I didn't account for char literals, for one), but it could be a start to solving this, IMO.
