To make the ANTLR4 lexer recognize different kinds of tokens in one rule, I use a semantic predicate. The predicate evaluates a static field of a helper class. Have a look at some grammar excerpts:
// very simplified
@header {
import static ParserAndLexerState.*;
}
@lexer::members {
private boolean fooAllowed() {
    System.out.println(fooAllowed);
    return fooAllowed;
}
...
methodField
    : t = type
      { fooAllowed = false; }
      id = Identifier
      { fooAllowed = true; /* do something with t and id */ }
...
fragment CHAR_NO_OUT_1 : [a-eg-zA-Z_] ;
fragment CHAR_NO_OUT_2 : [a-nq-zA-Z_0-9] ;
fragment CHAR_NO_OUT_3 : [a-nq-zA-Z_0-9] ;
fragment CHAR_1 : [a-zA-Z_] ;
fragment CHAR_N : CHAR_1 | [0-9] ;
Identifier
    // returns every possible identifier
    : { fooAllowed() }? (CHAR_1 CHAR_N*)
    // returns everything but 'foo'
    | { !fooAllowed() }? CHAR_NO_OUT_1 (CHAR_NO_OUT_2 (CHAR_NO_OUT_3 CHAR_N*)?)? ;
Identifier now always behaves as if fooAllowed still had its initial value from ParserAndLexerState: if that value was true, Identifier only ever uses the first alternative of the rule; otherwise it only ever uses the second. This is strange behavior, especially considering that fooAllowed() prints the correct values to the console.
Is there anything in ANTLR4 that should discourage me from using global state within semantic predicates? How can I avoid this behavior?
ANTLR 4 uses unbounded lookahead with non-deterministic termination conditions for the prediction process. While the TokenStream implementations do call TokenSource.nextToken lazily, it is never safe to assume that the number of tokens consumed so far is bounded.
In other words, the actual semantics of using a parser action to change the behavior of the lexer are undefined. Different versions of ANTLR 4, or even subtle changes in the input you give it, could produce completely different results.
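If all you need is to reject foo as an identifier in certain contexts, one way to sidestep the problem entirely (a sketch, not the only option; identifierNoFoo is a made-up rule name) is to keep the lexer context-free and enforce the restriction in the parser, where the token text is already available:
// lexer: match every identifier, including 'foo'
Identifier : CHAR_1 CHAR_N* ;

// parser rule: reject a 'foo' token via a predicate on the lookahead text
identifierNoFoo
    : { !"foo".equals(_input.LT(1).getText()) }? Identifier
    ;
This keeps all the state inside the parser, so the prediction behavior described above can no longer interfere.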
Hm, well that was a hard question to name appropriately. Anyway, I'm wondering why, given this type declaration
type T = {
    C : int
}
This does not compile:
let foo () =
    {
        C = printfn ""; 3
    }
but this does:
let foo () =
    {
        C =
            printfn ""; 3
    }
Compiler bug or by design?
"Works as designed" probably more than a "bug", but it's just an overall weird thing to do.
Semicolon is an expression sequencing operator (as in your intended usage), but also a record field separator. In the first case, the parser assumes the latter, and gets confused by it. In the second case, by indenting it you make it clear that the semicolon means expression sequencing.
You could get away with this without splitting the field over two lines:
let foo () =
    {
        C = (printfn ""; 3)
    }
If I have:
if {
    yylval = $1;
}
Is this legal? If not, is there another way to say I want to reference what I put in?
(please don't say yylval = 'if'; it's not dynamic, and I want to use it in some more complicated scenarios)
No. $1 and friends refer to the semantic values of the terminal and non-terminal symbols of a rule in the .y grammar file; they mean nothing inside a lex action. I don't know what you're trying to do exactly, but normally you would have a set of rules like this:
"if" { return IF; }
"else" { return ELSE; }
[0-9]+ { yylval.intValue = atoi(yytext); return INTEGER; }
etc., where IF and ELSE are defined in y.tab.h as a result of being declared in your .y file via the %token directive.
please don't say yylval = 'if', it's not dynamic
Neither is a lex rule. Your purpose remains obscure.
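If the intent was to make the matched lexeme itself available to the parser, even for keywords, one common pattern is to copy yytext into yylval in the lex action. A sketch, assuming a strValue member that is not in the original %union:
/* assumes the .y file declares something like:
   %union { int intValue; char *strValue; }
   %token <strValue> IF ELSE */
"if"    { yylval.strValue = strdup(yytext); return IF; }
"else"  { yylval.strValue = strdup(yytext); return ELSE; }
The parser actions can then read the keyword's text through $1 as usual (remember to free the strdup'd copies when you're done with them).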
I want to write a PEG parser with PackCC (though peg/leg or other libraries are also possible) that can calculate fields containing variables defined at arbitrary positions.
The first simplified approach is the following grammar:
%source {
    int vars[256];
}
statement <- e:term EOL        { printf("answer=%d\n", e); }
term      <- l:primary
             ( '+' r:primary   { l += r; }
             / '-' r:primary   { l -= r; }
             )*                { $$ = l; }
           / i:var '=' s:term  { $$ = vars[i] = s; }
           / e:primary         { $$ = e; }
primary   <- < [0-9]+ >        { $$ = atoi($1); }
           / i:var !'='        { $$ = vars[i]; }
var       <- < [a-z] >         { $$ = $1[0]; }
EOL       <- '\n' / ';'
%%
When tested with the definitions in sequential order, it works fine:
a=42;a+1
answer=42
answer=43
But when a variable's definition comes after its use, it fails:
a=42;a+b;b=1
answer=42
answer=42
answer=1
Even deeper chains of late definitions should work, like:
a=42;a+b;b=c;c=1
answer=42
answer=42
answer=0
answer=1
Let's think about the input not as a sequential programming language, but more as an Excel-like spreadsheet, e.g.:
A1: 42
A2: =A1+A3
A3: 1
Is it possible to parse and handle such kind of text with a PEG grammar?
Is two-pass or multi-pass an option here?
Or do I need to switch over to old-style lex/yacc or flex/bison?
I'm not familiar with PEG per se, but it looks like what you have is an attributed grammar where you perform the execution logic directly within the semantic action.
That won't work if you have use before definition.
You can use the same parser generator but you'll probably have to define some sort of abstract syntax tree to capture the semantics and postpone evaluation until you've parsed all input.
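As a rough illustration of that idea (a sketch only; every name here is made up, and the parse actions that would build the nodes are omitted), the semantic actions would construct AST nodes instead of computing values, record each assignment as the variable's defining expression, and evaluate only after the whole input has been parsed, resolving references spreadsheet-style:
/* hypothetical deferred-evaluation AST -- names are illustrative only */
typedef enum { NODE_NUM, NODE_VAR, NODE_ADD, NODE_SUB } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;              /* NODE_NUM */
    char name;              /* NODE_VAR */
    struct Node *lhs, *rhs; /* NODE_ADD / NODE_SUB */
} Node;

/* one defining expression per variable, recorded during the parse pass */
static Node *defs[256];

/* evaluated after parsing: a variable reference is resolved through its
   definition no matter where in the input that definition appeared
   (note: this sketch has no cycle detection) */
int eval(const Node *n) {
    switch (n->kind) {
    case NODE_NUM: return n->value;
    case NODE_VAR: {
        Node *def = defs[(unsigned char)n->name];
        return def ? eval(def) : 0;
    }
    case NODE_ADD: return eval(n->lhs) + eval(n->rhs);
    case NODE_SUB: return eval(n->lhs) - eval(n->rhs);
    }
    return 0;
}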
Yes, it is possible to parse this with a PEG grammar. PEG is effectively greedy LL(*) with infinite lookahead. Expressions like this are easy.
But be careful with left recursion, which standard PEG does not support. Although some PEG parsers can handle left recursion, until you're an expert it's best to avoid it and use only right recursion where needed.
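For instance, a rule like expr <- expr '+' term / term is left recursive and will not terminate in a vanilla PEG implementation; the standard PEG idiom is repetition, as your term rule already does with l:primary ( '+' r:primary ... )*.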
If I write a grammar file in Yacc/Bison like this:
Module
    : ModuleName "=" Functions
        { $$ = Builder::concat($1, $3, ","); }
    ;

Functions
    : Functions Function
        { $$ = Builder::concat($1, $2, ","); }
    | Function
        { $$ = $1; }
    ;

Function
    : DEF ID ARGS BODY
        {
            /* Lacks the module name to do name mangling for the function. */
            /* How can I obtain the "parent" node's module name here?? */
            module_name = ; // ????
            $$ = Builder::def_function(module_name, $ID, $ARGS, $BODY);
        }
    ;
And this parser should parse code like this:
main_module:
    def funA (a,b,c) { ... }
In my AST, the name funA should be renamed to main_module.funA, but I can't get the module's information while the parser is processing the Function node!
Are there any Yacc/Bison facilities that can help me handle this problem, or should I change my parsing style to avoid such embarrassing situations?
There is a bison feature, but as the manual says, use it with care:
$N with N zero or negative is allowed for reference to tokens and groupings on the stack before those that match the current rule. This is a very risky practice, and to use it reliably you must be certain of the context in which the rule is applied. Here is a case in which you can use this reliably:
foo: expr bar '+' expr { ... }
   | expr bar '-' expr { ... }
   ;

bar: /* empty */
   { previous_expr = $0; }
   ;
As long as bar is used only in the fashion shown here, $0 always refers to the expr which precedes bar in the definition of foo.
More cleanly, you could use a mid-rule action (in Module) to push the module name on a name stack (which would have to be part of the parsing context). You would then pop the stack at the end of the rule.
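A sketch of that approach (push_module and pop_module are hypothetical helpers that would manage the name stack in your parsing context; they are not bison built-ins):
Module
    : ModuleName "="
        { push_module($1); }  /* mid-rule action: runs before Functions */
      Functions
        {
            /* the mid-rule action occupies position $3, so Functions is $4 */
            pop_module();
            $$ = Builder::concat($1, $4, ",");
        }
    ;
Inside the Function action, the current module name is then available from the top of that stack.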
For more information and examples of mid-rule actions, see the manual.
I'm new to the area of grammars and parsing.
I'm trying to write a recursive descent parser that evaluates strings like this:
((3 == 5 AND 4 == 5) OR (6 == 6 ))
Everything works fine for me until I start to deal with nested parentheses. Essentially I find that I'm reaching the end of my target string too early.
I think the problem is due to the fact that when I encounter a token like the "6" or the second-to-last parenthesis, I evaluate it and then move to the next token. I could remove the code for advancing to the next token, but then I'm not sure how I would move forward.
My grammar, such as it is, looks like this (the "=>" signs are my own notation for the "translation" of a rule):
Test: If CompoundSentence Then CompoundSentence | CompoundSentence

CompoundSentence: ( CompoundSentence ) PCSOpt | CompoundSentence Conjunction Sentence | Sentence
    => CompoundSentence = ( CompoundSentence ) PCSOpt | Sentence CSOpt
       PCSOpt = ParenConjunction CompoundSentence PCSOpt | Epsilon
       CSOpt = Conjunction Sentence CSOpt | Epsilon

ParenConjunction: And | Or
Conjunction: And | Or

Sentence: Subject Verb Predicate

Subject: Subject Infix Value | Value
    => Subject = Value SubjectOpt
       SubjectOpt = Infix Value SubjectOpt | Epsilon

Verb: == | != | > | <

Predicate: Predicate Infix Value | Value
    => Predicate = Value PredicateOpt
       PredicateOpt = Infix Value PredicateOpt | Epsilon

Infix: + | - | * | /
My code for a compound sentence is as follows:
private string CompoundSentence(IEnumerator<Token> ts)
{
    // CompoundSentence = ( CompoundSentence ) PCSOpt | Sentence CSOpt
    string sReturnValue = "";

    switch (ts.Current.Category) {
        case "OPENPAREN":
        {
            // Skip past the open parenthesis
            ts.MoveNext();

            string sCSValue = CompoundSentence(ts);

            if (ts.Current.Category != "CLOSEPAREN") {
                sReturnValue = "Missing parenthesis at " + ts.Current.OriginalString;
                return sReturnValue;
            }
            else {
                // Skip past the close parenthesis
                ts.MoveNext();
            }

            sReturnValue = PCSOpt(sCSValue, ts);
            break;
        }
        default:
        {
            string sSentenceVal = Sentence(ts);
            // sSentenceVal is the truth value -- "TRUE" or "FALSE" --
            // of the initial Sentence component.
            // CSOpt will use that value, along with the particular conjunction
            // and the value of the current token, to generate a new truth value.
            sReturnValue = CSOpt(sSentenceVal, ts);
            break;
        }
    }

    return sReturnValue;
}
As I say, I'm new to this area, so I'm probably not understanding something quite fundamental.
If anyone could steer me in the right direction, I'd greatly appreciate it.
For expressions, a hand-coded recursive descent parser is a pretty easy thing to code.
See my SO answer for how to write recursive descent parsers.
Once you have the structure of the parser, it is pretty easy to evaluate an expression as-you-parse.
The basic convention to follow for parsing is:
At the start of a rule, the current token should be the first token that the rule covers.
A rule should consume all of the tokens it covers.
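A tiny sketch of those two conventions in action (tok, next, expect, expr, and NUMBER are hypothetical helpers, not your code); the key point for your bug is that the rule which owns the parentheses consumes both of them:
/* assumed helpers: tok holds the current token kind, next() advances the
   lexer, expect(k) verifies tok == k and then calls next() */
static void primary(void) {
    if (tok == '(') {
        next();        /* consume '(' -- it belongs to this rule      */
        expr();        /* on return, tok must be sitting on the ')'   */
        expect(')');   /* consume ')' too; the caller never sees it   */
    } else {
        expect(NUMBER);
    }
}
Every rule enters with the current token being its first token, and exits having consumed everything it covers, including any closing delimiter.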
I thought it was incredibly subtle, but it turns out to have been quite simple: my scanner wasn't catching the second (and probably higher) close parentheses. Ouch.
Thanks everyone for your help.
Ira, I'll accept your answer for the detailed help it provides on RDPs.