PEG grammar to accept late definition - parsing

I want to write a PEG parser with PackCC (but also peg/leg or other libraries are possible) which is able to calculate some fields with variables on random position.
The first simplified approach is the following grammar:
%source {
int vars[256];
}
statement <- e:term EOL { printf("answer=%d\n", e); }
term <- l:primary
        ( '+' r:primary { l += r; }
        / '-' r:primary { l -= r; }
        )* { $$ = l; }
      / i:var '=' s:term { $$ = vars[i] = s; }
      / e:primary { $$ = e; }
primary <- < [0-9]+ > { $$ = atoi($1); }
         / i:var !'=' { $$ = vars[i]; }
var <- < [a-z] > { $$ = $1[0]; }
EOL <- '\n' / ';'
%%
When testing with sequential order, it works fine:
a=42;a+1
answer=42
answer=43
But when the variable definition comes after its usage, it fails:
a=42;a+b;b=1
answer=42
answer=42
answer=1
And even deeper chains of late definitions should work, like:
a=42;a+b;b=c;c=1
answer=42
answer=42
answer=0
answer=1
Let's think about the input not as a sequential programming language, but more as an Excel-like spreadsheet, e.g.:
A1: 42
A2: =A1+A3
A3: 1
Is it possible to parse and handle this kind of text with a PEG grammar?
Is a two-pass or multi-pass approach an option here?
Or do I need to switch over to old-style lex/yacc or flex/bison?

I'm not familiar with PEG per se, but it looks like what you have is an attributed grammar where you perform the execution logic directly within the semantic action.
That won't work if you have use before definition.
You can use the same parser generator but you'll probably have to define some sort of abstract syntax tree to capture the semantics and postpone evaluation until you've parsed all input.
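As a rough sketch of that idea (plain C, not tied to PackCC's generated code; the node type, helper names, and the evaluation strategy are invented for illustration), the semantic actions would only build tree nodes, and evaluation would run in a separate step after the whole input has been read. For late definitions you could keep all statement nodes in a list and evaluate that list in several passes, spreadsheet-style, until the values stop changing:
#include <stdio.h>
#include <stdlib.h>

typedef enum { NODE_NUM, NODE_VAR, NODE_ADD, NODE_SUB, NODE_ASSIGN } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;               /* NODE_NUM */
    char name;               /* NODE_VAR, NODE_ASSIGN */
    struct Node *lhs, *rhs;  /* operands */
} Node;

static int vars[256];

/* What a semantic action would do instead of computing: just build a node. */
static Node *mk(NodeKind kind, int value, char name, Node *lhs, Node *rhs) {
    Node *n = malloc(sizeof *n);
    n->kind = kind; n->value = value; n->name = name; n->lhs = lhs; n->rhs = rhs;
    return n;
}

/* Evaluation happens only here, after parsing is complete. */
static int eval(const Node *n) {
    switch (n->kind) {
    case NODE_NUM:    return n->value;
    case NODE_VAR:    return vars[(unsigned char)n->name];
    case NODE_ADD:    return eval(n->lhs) + eval(n->rhs);
    case NODE_SUB:    return eval(n->lhs) - eval(n->rhs);
    case NODE_ASSIGN: return vars[(unsigned char)n->name] = eval(n->rhs);
    }
    return 0;
}

int main(void) {
    /* Hand-built trees for "a=42" and "a+b"; with deferred evaluation the
     * later assignment "b=1" can be evaluated before the use of b. */
    Node *assign_a = mk(NODE_ASSIGN, 0, 'a', NULL, mk(NODE_NUM, 42, 0, NULL, NULL));
    Node *sum = mk(NODE_ADD, 0, 0, mk(NODE_VAR, 0, 'a', NULL, NULL),
                                   mk(NODE_VAR, 0, 'b', NULL, NULL));
    eval(assign_a);
    vars['b'] = 1;                      /* stands in for evaluating "b=1" first */
    printf("answer=%d\n", eval(sum));   /* prints answer=43 */
    return 0;
}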

Yes, it is possible to parse this with a PEG grammar. PEG is effectively greedy LL(*) with infinite lookahead. Expressions like this are easy.
But the grammar you have written is left recursive, which is not PEG. Although some PEG parsers can handle left recursion, until you're an expert it's best to avoid it, and use only right recursion if needed.

Related

How do I implement a parser that respects order of operations in a stack-based AST?

I have a parser that parses the following arithmetic
1 + 2 * 2
Into the following stack AST: Const(1) Const(2) Add Const(2) Mul.
I need it to parse into this stack AST instead: Const(2) Const(2) Mul Const(1) Add.
I would also need 2 * 2 + 1 / 3 to parse correctly as Const(2) Const(2) Mul Const(1) Const(3) Div Add, and any other combination.
My algorithm currently looks something like this (Rust pseudocode):
let mut add_next = None;
while let Some(token) = tokens.next() { // Iterate over tokens
    match_token(token, &mut add_next);
}

fn match_token(token: Token, add_next: &mut Option<Ops>) {
    let original_add_next = add_next.clone();
    match token {
        Token::Const(x) => push_ops(Ops::Const(x)),
        Token::Add => *add_next = Some(Ops::Add),
        Token::Mul => *add_next = Some(Ops::Mul),
        // ... some other rules
    }
    if let Some(add_next) = add_next { // if add_next has a value
        push_ops(add_next);
    }
}
I need help coming up with an algorithm that can put the operations on the stack in the right order, respecting the correct order of operations (Parentheses, Exponents, Multiplication, Division, Addition, Subtraction).
I am able to implement comparison methods for operations, so the following is valid
assert!(Ops::Mul > Ops::Add);
assert!(Ops::Pow > Ops::Div);
I am also able to call tokens.next() to get to the next token within the loop, and I can call the match_token function recursively as needed.
I don't need a solution written in Rust. I just need a pseudocode algorithm based on a loop with a match expression over a set of tokens that can convert mathematical expressions to a stack-based AST that respects order of operations.
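For reference, a minimal sketch of the classical shunting-yard approach (written in C for brevity; printf stands in for the question's push_ops, and tokens are simplified to single characters). It produces standard postfix order, which matches the second example exactly and differs from the first only in where Const(1) sits relative to the product; both orders evaluate identically on a stack:
#include <ctype.h>
#include <stdio.h>

static int prec(char op) {               /* larger value = binds tighter */
    switch (op) {
    case '+': case '-': return 1;
    case '*': case '/': return 2;
    case '^':           return 3;
    default:            return 0;
    }
}

static const char *name(char op) {
    switch (op) {
    case '+': return "Add"; case '-': return "Sub";
    case '*': return "Mul"; case '/': return "Div";
    default:  return "Pow";
    }
}

static void to_stack_ast(const char *s) {
    char ops[64];
    int top = 0;
    for (; *s; s++) {
        if (isdigit((unsigned char)*s)) {
            printf("Const(%c) ", *s);     /* emit operands right away */
        } else if (*s == '(') {
            ops[top++] = '(';
        } else if (*s == ')') {
            while (top > 0 && ops[top - 1] != '(')
                printf("%s ", name(ops[--top]));
            top--;                        /* drop the '(' */
        } else if (prec(*s)) {
            int right_assoc = (*s == '^');
            /* pop anything that binds at least as tightly
             * (strictly tighter for the right-associative '^') */
            while (top > 0 && ops[top - 1] != '(' &&
                   (prec(ops[top - 1]) > prec(*s) ||
                    (prec(ops[top - 1]) == prec(*s) && !right_assoc)))
                printf("%s ", name(ops[--top]));
            ops[top++] = *s;
        }
    }
    while (top > 0)
        printf("%s ", name(ops[--top]));
    printf("\n");
}

int main(void) {
    to_stack_ast("1 + 2 * 2");     /* Const(1) Const(2) Const(2) Mul Add */
    to_stack_ast("2 * 2 + 1 / 3"); /* Const(2) Const(2) Mul Const(1) Const(3) Div Add */
    return 0;
}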

PEG grammar to suppress execution in if command

I want to create a grammar for parsing some commands. Most of it works flawlessly, but the "if(condition,then-value,else-value)" does not work together with the "out" command used to show some value.
It works fine as long as the output command is outside the if-command:
out(if(1,42,43))
→ outputs and returns 42 as expected. OK
But as soon as the output command is inside the then- and else-parts (which is required to be more intuitive), it fails:
if(1,out(42),out(43))
→ still returns only 42 as expected, but the output function is called twice, with 42 and with 43
I'm working in C with the peg/leg parser generator here.
The problem is also reproducible with the PEG.js online parser generator here, using the following much simplified grammar:
Expression
  = Int
  / "if(" cond:Expression "," ok:Expression "," nok:Expression ")" { return cond?ok:nok; }
  / "out(" num:Expression ")" { window.alert(num); return num; }

Int = [0-9]+ { return parseInt(text(), 10); }
The "window.alert()" is only a placeholder for the needed output function, but for this problem it acts the same.
It looks like the scanner has to match the full if-command, including the then- and else-values, up to the closing bracket ")". So it matches both out-commands, and both of them execute the defined function, which is not what I expect.
Is there a way in peg/leg to match some characters but suppress execution of the corresponding function under some circumstances?
(I've already experimented with the "&" predicate element, without success.)
(Maybe left recursion vs. right recursion could help here, but the peg/leg generator used here seems to support only right recursion.)
Is there a way in peg/leg to match some characters but suppress execution of the corresponding function under some circumstances?
I'm not familiar with the tools in question, but it would surprise me if this were possible. And even if it were, you'd run into a similar problem when implementing loops: now you'd need to execute the action multiple times.
What you need is for your actions to not directly execute the code, but return something that can be used to execute it.
The usual way that interpreters work is that the parser produces some sort of representation of the source code (such as bytecode or an AST), which is then executed as a separate step.
The simplest (but perhaps not cleanest) way to make your parser work without changing too much would be to just wrap all your actions in 0-argument functions. You could then call the functions returned by the sub-expressions if and only if you want them to be executed. And to implement loops, you could then simply call the functions multiple times.
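Since the original problem is in C with peg/leg, where there are no closures, the same idea can be approximated with small "thunk" structs: a function pointer plus its captured operands. The Thunk type and run_* helpers below are invented for this sketch and are not part of peg/leg; the actions would return such thunks instead of values, and only the if-action decides which one actually runs:
#include <stdio.h>

typedef struct Thunk Thunk;
struct Thunk {
    int (*run)(const Thunk *self);
    int value;                       /* for literals */
    const Thunk *a, *b, *c;          /* captured sub-expressions */
};

static int run_const(const Thunk *t) { return t->value; }

static int run_out(const Thunk *t) {
    int v = t->a->run(t->a);         /* the argument is evaluated only now */
    printf("Out: %d\n", v);
    return v;
}

static int run_if(const Thunk *t) {
    /* only one branch is ever executed */
    return t->a->run(t->a) ? t->b->run(t->b) : t->c->run(t->c);
}

int main(void) {
    /* if(1, out(42), out(43)) built by hand; only "Out: 42" is printed */
    Thunk c1  = { run_const, 1,  0, 0, 0 };
    Thunk c42 = { run_const, 42, 0, 0, 0 };
    Thunk c43 = { run_const, 43, 0, 0, 0 };
    Thunk o42 = { run_out,   0,  &c42, 0, 0 };
    Thunk o43 = { run_out,   0,  &c43, 0, 0 };
    Thunk iff = { run_if,    0,  &c1, &o42, &o43 };
    iff.run(&iff);
    return 0;
}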
A solution could be to use a predicate expression "&{ expression }" (not to be confused with the predicate element "& element").
Expression
  = Function

Function
  = Int
  / "if(" IfCond "," ok:Function "," nok:FunctionDisabled ")" { return ok; }
  / "if(" FunctionDisabled "," ok:FunctionDisabled "," nok:Function ")" { return nok; }
  / "out(" num:Function ")" { window.alert("Out:"+num); return num; }

FunctionDisabled
  = Int
  / "if(" IfCond "," ok:FunctionDisabled "," nok:FunctionDisabled ")" { return ok; }
  / "if(" FunctionDisabled "," ok:FunctionDisabled "," nok:FunctionDisabled ")" { return nok; }
  / "out(" num:FunctionDisabled ")" { return num; }

IfCond
  = cond:FunctionDisabled &{ return cond; }

Int = [0-9]+ { return parseInt(text(), 10); }
The idea is to define out() twice: once really doing something, and a second time disabled, without output.
The condition of the if-command is evaluated by the code inside {}, so if the condition is false, the whole alternative fails to match.
A visible drawback is the redundant definition of the if-command for the then- and else-cases, plus the recursive disabled variants.

Parser expression grammar - how to match any string excluding a single character?

I'd like to write a PEG that matches filesystem paths. A path element may contain any character except / in POSIX Linux.
There is an expression in PEG to match any character, but I cannot figure out how to match any character except one.
The PEG parser I'm using is pest for Rust.
You can find the pest syntax at https://docs.rs/pest/0.4.1/pest/macro.grammar.html#syntax; in particular, there is a "negative lookahead":
!a — matches if a doesn't match without making progress
So you could write
!["/"] ~ any
Example:
// cargo-deps: pest
#[macro_use] extern crate pest;
use pest::*;

fn main() {
    impl_rdp! {
        grammar! {
            path = #{ soi ~ (["/"] ~ component)+ ~ eoi }
            component = #{ (!["/"] ~ any)+ }
        }
    }
    println!("should be true: {}", Rdp::new(StringInput::new("/bcc/cc/v")).path());
    println!("should be false: {}", Rdp::new(StringInput::new("/bcc/cc//v")).path());
}

ANTLR4: semantic predicate depending on state does not work

In order to have the lexer of ANTLR4 recognize different kinds of tokens in one rule I use a semantic predicate. This predicate evaluates a static field of a helper class. Have a look at some grammar excerpts:
// very simplified
@header {
import static ParserAndLexerState.*;
}
@members {
private boolean fooAllowed() {
    System.out.println(fooAllowed);
    return fooAllowed;
}
}
...
methodField
    : t = type
      { fooAllowed = false; }
      id = Identifier
      { fooAllowed = true; /* do something with t and id */ }
...
fragment CHAR_NO_OUT_1 : [a-eg-zA-Z_] ;
fragment CHAR_NO_OUT_2 : [a-nq-zA-Z_0-9] ;
fragment CHAR_NO_OUT_3 : [a-nq-zA-Z_0-9] ;
fragment CHAR_1 : [a-zA-Z_] ;
fragment CHAR_N : CHAR_1 | [0-9] ;
Identifier
    // returns every possible identifier
    : { fooAllowed() }? (CHAR_1 CHAR_N*)
    // returns everything but 'foo'
    | { !fooAllowed() }? CHAR_NO_OUT_1 (CHAR_NO_OUT_2 (CHAR_NO_OUT_3 CHAR_N*)?)? ;
Identifier now always behaves as if fooAllowed still had the initial value from its definition in ParserAndLexerState. So if that value was true, Identifier only ever uses the first alternative of the rule, otherwise always the second. This is weird behavior, especially considering that fooAllowed() prints the right values to the console.
Is there anything in ANTLR4 that should discourage me from using global state from within semantic predicates? How can I avoid this behavior?
ANTLR 4 uses unbounded lookahead with non-deterministic termination conditions for the prediction process. While the TokenStream implementations do call TokenSource.nextToken lazily, it is not safe to ever assume that the number of tokens consumed so far is bounded.
In other words, the actual semantics of using a parser action to change the behavior of the lexer are undefined. Different versions of ANTLR 4, or even subtle changes in the input you give it, could produce completely different results.

Recursive Descent Parser and Nested Parentheses

I'm new to the area of grammars and parsing.
I'm trying to write a recursive descent parser that evaluates strings like this:
((3 == 5 AND 4 == 5) OR (6 == 6 ))
Everything works fine for me until I start to deal with nested parentheses. Essentially I find that I'm reaching the end of my target string too early.
I think the problem is due to the fact that when I encounter a token like the "6" or the second-to-last parenthesis, I evaluate it and then move to the next token. I'd remove the code for advancing to the next token, but then I'm not sure how I'd move forward.
My grammar, such as it is, looks like this (the "=>" signs are my own notation for the "translation" of a rule):
Test: If CompoundSentence Then CompoundSentence | CompoundSentence
CompoundSentence: ( CompoundSentence ) PCSOpt | CompoundSentence Conjunction Sentence | Sentence
  => CompoundSentence = ( CompoundSentence ) PCSOpt | Sentence CSOpt
     PCSOpt = ParenConjunction CompoundSentence PCSOpt | Epsilon
     CSOpt = Conjunction Sentence CSOpt | Epsilon
ParenConjunction: And | Or
Conjunction: And | Or
Sentence: Subject Verb Predicate
Subject: Subject Infix Value | Value
  => Subject = Value SubjectOpt
     SubjectOpt = Infix Value SubjectOpt | Epsilon
Verb: == | != | > | <
Predicate: Predicate Infix Value | Value
  => Predicate = Value PredicateOpt
     PredicateOpt = Infix Value PredicateOpt | Epsilon
Infix: +, -, *, /
My code for a compound sentence is as follows:
private string CompoundSentence(IEnumerator<Token> ts)
{
    // CompoundSentence = ( CompoundSentence ) PCSOpt | Sentence CSOpt
    string sReturnValue = "";

    switch(ts.Current.Category) {
        case "OPENPAREN":
        {
            //Skip past the open parenthesis
            ts.MoveNext();

            string sCSValue = CompoundSentence(ts);

            if(ts.Current.Category != "CLOSEPAREN") {
                sReturnValue = "Missing parenthesis at " + ts.Current.OriginalString;
                return sReturnValue;
            }
            else {
                //Skip past the close parenthesis
                ts.MoveNext();
            }

            sReturnValue = PCSOpt(sCSValue, ts);
            break;
        }
        default:
        {
            string sSentenceVal = Sentence(ts);
            //sSentenceVal is the truth value -- "TRUE" or "FALSE" --
            //of the initial Sentence component.
            //CSOpt will use that value, along with the particular conjunction
            //and the value of the current token,
            //to generate a new truth value.
            sReturnValue = CSOpt(sSentenceVal, ts);
            break;
        }
    }

    return sReturnValue;
}
As I say, I'm new to this area, so I'm probably not understanding something quite fundamental.
If anyone could steer me in the right direction, I'd greatly appreciate it.
For expressions, a hand-coded recursive descent parser is a pretty easy thing to code.
See my SO answer for how to write recursive descent parsers.
Once you have the structure of the parser, it is pretty easy to evaluate an expression as-you-parse.
The basic convention to follow for parsing is:
At the start of a rule, the current token should be the first token that the rule covers.
A rule should consume all of the tokens it covers.
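As a tiny illustration of those two conventions (in C for brevity, and for a much smaller grammar than the question's), each function below is entered on the first token it covers and consumes every token it owns, including its own closing parenthesis, before returning:
#include <stdio.h>
#include <stdlib.h>

static const char *p;                  /* input cursor doubles as the "current token" */

static void skip(void) { while (*p == ' ') p++; }

static int expr(void);

static int primary(void) {             /* entered ON its first token */
    skip();
    if (*p == '(') {
        p++;                           /* consume '(' -- this rule owns it */
        int v = expr();
        skip();
        if (*p == ')') p++;            /* ... and it owns the matching ')' too */
        return v;
    }
    char *end;
    int v = (int)strtol(p, &end, 10);  /* consume the number */
    p = end;
    return v;                          /* leave the next token for the caller */
}

static int expr(void) {
    int v = primary();
    skip();
    while (*p == '+') {                /* the loop consumes every '+' it sees */
        p++;
        v += primary();
        skip();
    }
    return v;
}

int main(void) {
    p = "((3 + 5) + (6))";
    printf("%d\n", expr());            /* prints 14 */
    return 0;
}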
I thought it was incredibly subtle, but it turns out to have been quite simple: my scanner wasn't catching the second (and probably higher) close parentheses. Ouch.
Thanks everyone for your help.
Ira, I'll accept your answer for the detailed help it provides on RDPs.
