Recursive Descent Parser and Nested Parentheses - parsing

I'm new to the area of grammars and parsing.
I'm trying to write a recursive descent parser that evaluates strings like this:
((3 == 5 AND 4 == 5) OR (6 == 6 ))
Everything works fine for me until I start to deal with nested parentheses. Essentially I find that I'm reaching the end of my target string too early.
I think the problem is due to the fact when I encounter a token like the "6" or the second-to-last parenthesis, I evaluate it and then move to the next token. I'd remove the code for advancing to the next token, but then I'm not sure how I move forward.
My grammar, such as it is, looks like this (the "=>" signs are my own notation for the "translation" of a rule):
Test: If CompoundSentence Then CompoundSentence | CompoundSentence
CompoundSentence : ( CompoundSentence ) PCSopt | CompoundSentence Conjunction Sentence | Sentence =>
CompoundSentence = ( CompoundSentence ) PCSopt | Sentence CSOpt
PCSOpt = ParenConjunction CompoundSentence PCSOpt| Epsilon
CSOpt = Conjunction Sentence CSOpt| Epsilon
ParenConjunction: And|Or
Conjunction: And|Or
Sentence : Subject Verb Predicate
Subject: Subject Infix Value | Value =>
Subject = Value SubjectOpt
SubjectOpt = Infix Value SubjectOpt | Epsilon
Verb: ==|!=|>|<
Predicate: Predicate Infix Value | Value =>
Predicate= Value PredicateOpt
PredicateOpt = Infix Value PredicateOpt | Epsilon
Infix: +, -, *, /
My code for a compound sentence is as follows:
private string CompoundSentence(IEnumerator<Token> ts)
{
    // CompoundSentence = ( CompoundSentence ) PCSopt | Sentence CSOpt
    string sReturnValue = "";

    switch (ts.Current.Category) {
        case "OPENPAREN":
        {
            // Skip past the open parenthesis
            ts.MoveNext();

            string sCSValue = CompoundSentence(ts);

            if (ts.Current.Category != "CLOSEPAREN") {
                sReturnValue = "Missing parenthesis at " + ts.Current.OriginalString;
                return sReturnValue;
            }
            else {
                // Skip past the close parenthesis
                ts.MoveNext();
            }

            sReturnValue = PCSOpt(sCSValue, ts);
            break;
        }
        default:
        {
            string sSentenceVal = Sentence(ts);
            // sSentenceVal is the truth value -- "TRUE" or "FALSE" --
            // of the initial Sentence component.
            // CSOpt will use that value, along with the particular conjunction
            // and the value of the current token,
            // to generate a new truth value.
            sReturnValue = CSOpt(sSentenceVal, ts);
            break;
        }
    }

    return sReturnValue;
}
As I say, I'm new to this area, so I'm probably not understanding something quite fundamental.
If anyone could steer me in the right direction, I'd greatly appreciate it.

For expressions, a hand-coded recursive descent parser is a pretty easy thing to code.
See my SO answer for how to write recursive descent parsers.
Once you have the structure of the parser, it is pretty easy to evaluate an expression as-you-parse.

The basic conventions to follow when parsing are:
At the start of a rule, the current token should be the first token that the rule covers.
A rule should consume all of the tokens it covers.
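To make those two conventions concrete, here is a minimal sketch of the parenthesis case (in Rust rather than your C#, and with an assumed Parser/Token shape that is not taken from your code). The key point matches your CompoundSentence: the rule consumes the '(' on entry and the ')' before it returns, so the caller always finds the next unconsumed token waiting in Current.

struct Parser {
    tokens: Vec<String>,
    pos: usize,
}

impl Parser {
    fn current(&self) -> &str {
        self.tokens.get(self.pos).map(String::as_str).unwrap_or("<eof>")
    }

    fn advance(&mut self) {
        self.pos += 1;
    }

    // compound = '(' compound ')' | atom
    // Entered with `pos` at the rule's first token; returns with `pos`
    // just past the last token the rule covers.
    fn compound(&mut self) -> Result<String, String> {
        if self.current() == "(" {
            self.advance();                // consume '('
            let inner = self.compound()?;  // recurse; stops on the matching ')'
            if self.current() != ")" {
                return Err(format!("missing ')' at {}", self.current()));
            }
            self.advance();                // consume ')' before returning
            Ok(inner)
        } else {
            let atom = self.current().to_string();
            self.advance();                // consume the atom
            Ok(atom)
        }
    }
}

fn main() {
    let tokens: Vec<String> = ["(", "(", "6", ")", ")"].iter().map(|s| s.to_string()).collect();
    let mut p = Parser { tokens, pos: 0 };
    println!("{:?}", p.compound()); // Ok("6"), with pos past the final ')'
}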

I thought it was incredibly subtle, but it turns out to have been quite simple: my scanner wasn't catching the second (and probably higher) close parentheses. Ouch.
Thanks everyone for your help.
Ira, I'll accept your answer for the detailed help it provides on RDPs.

Related

How do I implement a parser that respects order of operations in a stack-based AST?

I have a parser that parses the following arithmetic
1 + 2 * 2
Into the following stack AST: Const(1) Const(2) Add Const(2) Mul.
I need it to parse into this stack AST Const(2) Const(2) Mul Const(1) Add
I would also need to parse 2 * 2 + 1 / 3 correctly as Const(2) Const(2) Mul Const(1) Const(3) Div Add and any other combination.
My algorithm currently looks something like this (rust pseudocode):
let mut add_next = None;
while let Some(token) = tokens.next() { // Iterate over tokens
match_token(token, &mut add_next);
}
fn match_token(token: Token, &mut add_next: Ops) {
let original_add_next = add_next.clone();
match token {
Token::Const(x) => push_ops(Ops::Const(x)),
Token::Add => add_next = Some(Ops::Add),
Token::Mul => add_next = Some(Ops::Mul),
//... some other rules
}
if let Some(add_next) = add_next { //if add_next has a value
push_ops(add_next);
}
}
I need help coming up with an algorithm that can put the operations on the stack in the right order with the correct order of operations (Parenthesis, Exponents, Multiplication, Division, Addition, Subtraction).
I am able to implement comparison methods for operations, so the following is valid
assert!(Ops::Mul > Ops::Add);
assert!(Ops::Pow > Ops::Div);
I am also able to call tokens.next() to get to the next token within the loop, and I can call the match_token function recursively as needed.
I don't need a solution written in rust. I just need a pseudocode algorithm based on a loop with a match expression for a set of tokens that can convert mathematical expressions to a stack-based AST that respects Order of Operations.
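One common way to get operators onto the stack only after both of their operands, while respecting precedence, is precedence climbing: track a minimum precedence, recurse for the right-hand side of each operator, and emit the operator last. The sketch below is only an illustration under assumed names (Tok, prec, and the output Vec are mine, not from your code); it produces conventional postfix order, which for 2 * 2 + 1 / 3 is exactly the Const(2) Const(2) Mul Const(1) Const(3) Div Add sequence you listed.

// Precedence climbing that emits a stack-based (postfix) sequence of ops.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok { Const(i64), Add, Sub, Mul, Div }

fn prec(t: Tok) -> Option<u8> {
    match t {
        Tok::Add | Tok::Sub => Some(1),
        Tok::Mul | Tok::Div => Some(2),
        _ => None,
    }
}

// Parse one expression whose operators all have precedence >= min_prec,
// pushing constants and operators onto `out` in evaluation order.
fn parse_expr<I>(tokens: &mut std::iter::Peekable<I>, min_prec: u8, out: &mut Vec<Tok>)
where
    I: Iterator<Item = Tok>,
{
    // An expression starts with a primary; here that is just a constant.
    match tokens.next() {
        Some(c @ Tok::Const(_)) => out.push(c),
        other => panic!("expected constant, got {:?}", other),
    }
    // Keep consuming operators that bind at least as tightly as min_prec.
    while let Some(&op) = tokens.peek() {
        let p = match prec(op) {
            Some(p) if p >= min_prec => p,
            _ => break,
        };
        tokens.next();                    // consume the operator
        parse_expr(tokens, p + 1, out);   // right operand must bind tighter
        out.push(op);                     // operator goes on after both operands
    }
}

fn main() {
    // 2 * 2 + 1 / 3
    let toks = vec![Tok::Const(2), Tok::Mul, Tok::Const(2),
                    Tok::Add, Tok::Const(1), Tok::Div, Tok::Const(3)];
    let mut out = Vec::new();
    parse_expr(&mut toks.into_iter().peekable(), 0, &mut out);
    println!("{:?}", out);
    // [Const(2), Const(2), Mul, Const(1), Const(3), Div, Add]
}

Parentheses and exponents can be layered on top: handle '(' by recursing inside the primary case, and for a right-associative operator like Pow recurse with p instead of p + 1.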

Why can't a statement be added on the same line in a record expression?

Hm, well that was a hard question to name appropriately. Anyway, I'm wondering why, given this type declaration
type T = {
    C : int
}
This does not compile:
let foo () =
    {
        C = printfn ""; 3
    }
but this does:
let foo () =
    {
        C =
            printfn ""; 3
    }
Compiler bug or by design?
"Works as designed" probably more than a "bug", but it's just an overall weird thing to do.
Semicolon is an expression sequencing operator (as in your intended usage), but also a record field separator. In the first case, the parser assumes the latter, and gets confused by it. In the second case, by indenting it you make it clear that the semicolon means expression sequencing.
You could get away with this without splitting the field over two lines:
let foo () =
    {
        C = (printfn ""; 3)
    }

PEG grammar to suppress execution in if command

I want to create a grammar that parses some commands. Most of it works flawlessly, but the "if(condition,then-value,else-value)" command does not work together with the "out" command used to show a value.
It works fine in case the output-command is outside the if-command:
out(if(1,42,43))
→ output and return 42 as expected OK
But as soon as the output command is inside the then- and else-parts (which is required to be more intuitive), it fails:
if(1,out(42),out(43))
→ still returns only 42 as expected OK, but the output function is called twice, with 42 and with 43
I'm working in C with the peg/leg parser generator here.
The problem is also reproducible with the PEG.js online parser generator here, using the following greatly simplified grammar:
Expression
  = Int
  / "if(" cond:Expression "," ok:Expression "," nok:Expression ")" { return cond?ok:nok; }
  / "out(" num:Expression ")" { window.alert(num); return num; }

Int = [0-9]+ { return parseInt(text(), 10); }
The "window.alert()" is only a placeholder for the needed output function, but for this problem it acts the same.
It looks like the scanner has to match the full if-command, with then- and else-values, up to the closing parenthesis ")". So it matches both out-commands, and both execute the defined function - which is not what I expect.
Is there a way in peg/leg to match some characters but suppress execution of the according function under some circumstances?
(I've already experimented with "&" predicate element without success)
(Maybe left recursion vs. right recursion could help here, but the peg/leg generator used seems to support only right recursion.)
Is there a way in peg/leg to match some characters but suppress execution of the according function under some circumstances?
I'm not familiar with the tools in question, but it would surprise me if this were possible. And even if it were, you'd run into a similar problem when implementing loops: now you'd need to execute the action multiple times.
What you need is for your actions to not directly execute the code, but return something that can be used to execute it.
The usual way that interpreters work is that the parser produces some sort of representation of the source code (such as bytecode or an AST), which is then executed as a separate step.
The simplest (but perhaps not cleanest) way to make your parser work without changing too much would be to just wrap all your actions in 0-argument functions. You could then call the functions returned by the sub-expressions if and only if you want them to be executed. And to implement loops, you could then simply call the functions multiple times.
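As a sketch of that thunk idea, here it is with plain Rust closures rather than peg/leg semantic actions (so every name below is an illustrative assumption, not part of your grammar). Each "action" builds a zero-argument function; the if only invokes the branch it selects, so out() runs exactly once:

// Each parse action returns a thunk instead of a value; side effects happen
// only when a thunk is actually called.
type Thunk = Box<dyn Fn() -> i64>;

fn lit(x: i64) -> Thunk {
    Box::new(move || x)
}

fn out(inner: Thunk) -> Thunk {
    Box::new(move || {
        let v = inner();          // the side effect runs here, not at parse time
        println!("Out: {}", v);
        v
    })
}

fn if_expr(cond: Thunk, ok: Thunk, nok: Thunk) -> Thunk {
    Box::new(move || if cond() != 0 { ok() } else { nok() })
}

fn main() {
    // if(1, out(42), out(43)) -- only out(42) ever prints
    let expr = if_expr(lit(1), out(lit(42)), out(lit(43)));
    println!("result = {}", expr());
}

A loop would then simply call its body thunk as many times as needed.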
A solution could be to use a predicate expression "& {expression}" (not to be confused with the predicate element "& element"):
Expression
  = Function

Function
  = Int
  / "if(" IfCond "," ok:Function "," nok:FunctionDisabled ")" { return ok; }
  / "if(" FunctionDisabled "," ok:FunctionDisabled "," nok:Function ")" { return nok; }
  / "out(" num:Function ")" { window.alert("Out:"+num); return num; }

FunctionDisabled
  = Int
  / "if(" IfCond "," ok:FunctionDisabled "," nok:FunctionDisabled ")" { return ok; }
  / "if(" FunctionDisabled "," ok:FunctionDisabled "," nok:FunctionDisabled ")" { return nok; }
  / "out(" num:FunctionDisabled ")" { return num; }

IfCond
  = cond:FunctionDisabled &{ return cond; }

Int = [0-9]+ { return parseInt(text(), 10); }
The idea is to define out() twice: once really doing something, and a second time disabled, without output.
The condition of the if-command is evaluated by the code inside {}, so if the condition is false, the whole expression match fails.
A visible drawback is the redundant definition of the if-command for the then- and else-cases, plus the recursively duplicated disabled rules.

PEG grammar to accept late definition

I want to write a PEG parser with PackCC (but peg/leg or other libraries are also possible) that is able to calculate some fields with variables at arbitrary positions.
The first simplified approach is the following grammar:
%source {
  int vars[256];
}

statement <- e:term EOL { printf("answer=%d\n", e); }

term <- l:primary
        ( '+' r:primary { l += r; }
        / '-' r:primary { l -= r; }
        )* { $$ = l; }
      / i:var '=' s:term { $$ = vars[i] = s; }
      / e:primary { $$ = e; }

primary <- < [0-9]+ > { $$ = atoi($1); }
         / i:var !'=' { $$ = vars[i]; }

var <- < [a-z] > { $$ = $1[0]; }

EOL <- '\n' / ';'

%%
When testing with sequential order, it works fine:
a=42;a+1
answer=42
answer=43
But when having the variable definition behind the usage, it fails:
a=42;a+b;b=1
answer=42
answer=42
answer=1
And even deeper chains of late definitions should work, like:
a=42;a+b;b=c;c=1
answer=42
answer=42
answer=0
answer=1
Let's think of the input not as a sequential programming language, but more as an Excel-like spreadsheet, e.g.:
A1: 42
A2: =A1+A3
A3: 1
Is it possible to parse and handle such kind of text with a PEG grammar?
Is two-pass or multi-pass an option here?
Or do I need to switch over to old-style lex/yacc or flex/bison?
I'm not familiar with PEG per se, but it looks like what you have is an attributed grammar where you perform the execution logic directly within the semantic action.
That won't work if you have use before definition.
You can use the same parser generator but you'll probably have to define some sort of abstract syntax tree to capture the semantics and postpone evaluation until you've parsed all input.
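As a hedged illustration of that suggestion (not PackCC-specific; the Expr type and names below are assumptions): the semantic actions would only build small trees and record definitions, and a separate pass evaluates each statement after the whole input has been parsed, so a+b can see the later b=1, and chains like b=c;c=1 resolve because variables are followed at evaluation time.

use std::collections::HashMap;

// A tiny AST; the parser would build these instead of computing values directly.
enum Expr {
    Num(i64),
    Var(char),
    Add(Box<Expr>, Box<Expr>),
}

// Evaluate after parsing, looking variables up in the collected definitions.
fn eval(e: &Expr, defs: &HashMap<char, Expr>) -> i64 {
    match e {
        Expr::Num(n) => *n,
        Expr::Var(v) => defs.get(v).map(|d| eval(d, defs)).unwrap_or(0),
        Expr::Add(l, r) => eval(l, defs) + eval(r, defs),
    }
}

fn main() {
    // a=42; a+b; b=1  (definitions hand-built here; the parser would fill them in)
    let mut defs = HashMap::new();
    defs.insert('a', Expr::Num(42));
    defs.insert('b', Expr::Num(1));

    let statements = vec![
        Expr::Var('a'),
        Expr::Add(Box::new(Expr::Var('a')), Box::new(Expr::Var('b'))),
        Expr::Var('b'),
    ];
    for s in &statements {
        println!("answer={}", eval(s, &defs)); // answer=42, answer=43, answer=1
    }
}

A real version would also want cycle detection (e.g. a=b;b=a), but that is orthogonal to the parsing question.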
Yes, it is possible to parse this with a PEG grammar. PEG is effectively greedy LL(*) with infinite lookahead. Expressions like this are easy.
But the grammar you have written is left recursive, which is not PEG. Although some PEG parsers can handle left recursion, until you're an expert it's best to avoid it, and use only right recursion if needed.

Scala Parser - Message Length

I'm toying with Scala's Parser library. I am trying to write a parser for a format where a length is specified followed by a message of that length. For example:
x.parseAll(x.message, "5helloworld") // result: "hello", remaining: "world"
I'm not sure how to do this using combinators. My mind first goes to:
def message = length ~ body
But obviously body depends on length, and I don't know how to do that :p
Instead you could just define message as a single Parser (not a combination of Parsers), and I think that is doable (although I haven't checked whether a single Parser can pull several elems).
Anyways, I'm a scala noob, I just find this awesome :)
You should use into for that, or its abbreviation, >>:
scala> object T extends RegexParsers {
     |   def length: Parser[String] = """\d+""".r
     |   def message: Parser[String] = length >> { length => """\w{%d}""".format(length.toInt).r }
     | }
defined module T

scala> T.parseAll(T.message, "5helloworld")
res0: T.ParseResult[String] =
[1.7] failure: string matching regex `\z' expected but `w' found

5helloworld
      ^

scala> T.parse(T.message, "5helloworld")
res1: T.ParseResult[String] = [1.7] parsed: hello
Be careful with precedence when using it. If you add an "~ remainder" after the function above, for instance, Scala will interpret it as length >> ({ length => ...} ~ remainder) instead of (length >> { length => ...}) ~ remainder.
This does not sound like a context-free language, so you will need to use flatMap:
def message = length.flatMap(l => bodyOfLength(l))
where length is of type Parser[Int] and bodyOfLength(n) would be based on repN, such as
def bodyOfLength(n: Int): Parser[String] =
  repN(n, elem("any", _ => true)) ^^ { _.mkString }
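For comparison, here is a hedged hand-rolled sketch of the same two-step idea outside the combinator library (in Rust, with names that are purely illustrative): first read the digits as the length, then take exactly that many characters as the body and hand back the remainder.

// Parse a length-prefixed message: "5helloworld" -> ("hello", "world").
fn parse_message(input: &str) -> Option<(String, &str)> {
    // Step 1: split off the leading digits as the length.
    let digits: String = input.chars().take_while(|c| c.is_ascii_digit()).collect();
    let rest = &input[digits.len()..];
    let n: usize = digits.parse().ok()?;

    if rest.chars().count() < n {
        return None; // not enough characters left for the body
    }

    // Step 2: take exactly n characters as the message body.
    let body: String = rest.chars().take(n).collect();
    let consumed: usize = rest.chars().take(n).map(|c| c.len_utf8()).sum();
    Some((body, &rest[consumed..]))
}

fn main() {
    assert_eq!(
        parse_message("5helloworld"),
        Some(("hello".to_string(), "world"))
    );
}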
I wouldn't use parser combinators for this purpose. But if you have to, or the problem becomes more complex, you could try this:
def times(x: Long, what: String): Parser[Any] = x match {
  case 1 => what
  case x => what ~ times(x - 1, what)
}
Don't use parseAll if you want something to remain; use parse.
You could parse length, store the result in a mutable field x (ugly, I know, but useful here) and parse body x times; then you get the parsed String and the rest remains in the parser.
