How to describe conditional statement (if-then-else) using PEG - parsing

I'm working on a parser for Qt's qmake project files (an open source project).
I'm having trouble describing qmake's variant of the conditional statement, which the documentation calls a "scope".
EBNF (simplified):
ScopeStatement -> Condition ScopeBody
Condition -> Identifier | TestFunctionCall | NotExpr | OrExpr | AndExpr
NotExpr -> "!" Condition
OrExpr -> Condition "|" Condition
AndExpr -> Condition ":" Condition
ScopeBody -> COLON Statement | BLOCK_OPEN Statement:* BLOCK_CLOSE
Statement -> AssignmentStatement
AssignmentStatement -> Identifier EQ String
// There are many other built-in boolean (test) functions
TestFunctionCall -> ("defined" | ...) ARG_LIST_OPEN (String COMMA:?):* ARG_LIST_CLOSE
Identifier -> Letter (Letter | Digit | UNDERSCP):+
String -> (Letter | Digit | UNDERSCP):+
EQ -> "="
COLON -> ":"
COMMA -> ","
ARG_LIST_OPEN -> "("
ARG_LIST_CLOSE -> ")"
BLOCK_OPEN -> "{"
BLOCK_CLOSE -> "}"
UNDERSCP -> "_"
First question: how can I distinguish the AND-operator colon from the colon that terminates the condition? Is it possible?
P.S. My grammar draft (without function call support) doesn't work even for a simple case like
win32:xml: x = y
PEG.JS Code:
Start
= ScopeStatement
// qmake scope statement
ScopeStatement
= BooleanExpression ws* ((":" ws* SingleLineStatement) / ("{" ws* MultiLineStatement ))
SingleLineStatement
= Identifier ws* "=" ws* Identifier lb*
MultiLineStatement
= (SingleLineStatement lb*)+
// qmake condition statement
BooleanExpression
= BooleanOrExpression
BooleanOrExpression
= left:BooleanAndExpression ws* "|" ws* right:BooleanOrExpression { return {type: "OR", left:left, right:right} }
/ BooleanAndExpression
BooleanAndExpression
= left:BooleanNotExpression ws* ":" ws* right:BooleanAndExpression { return {type: "AND", left:left, right:right} }
/ BooleanNotExpression
BooleanNotExpression
= "!" ws* operand:BooleanNotExpression { return {type: "NOT", operand: operand } }
/ BooleanComplexExpression
BooleanComplexExpression
= Identifier
/ "(" logical_or:BooleanOrExpression ")" { return logical_or; }
Identifier
= token:[a-zA-Z0-9_]+ { return token.join(""); }
ws
= [ \t]
lb
= [\r\n]
Thanks!

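You need a negative lookahead inside BooleanAndExpression so that a colon which introduces the scope body (that is, a colon followed by a statement) is not taken as the AND operator. Otherwise the rule keeps greedily consuming additional "and" operands, including the identifier that begins the statement.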
Start
= ScopeStatement
// qmake scope statement
ScopeStatement
= bool:BooleanExpression ws* state:Statement { return {bool:bool, state:state} }
Statement
= ":" ws* state:SingleLineStatement { return state }
SingleLineStatement
= left:Identifier ws* "=" ws* right:Identifier lb* { return {type: "ASSIGN", left:left, right:right} }
MultiLineStatement
= (SingleLineStatement lb*)+
// qmake condition statement
BooleanExpression
= BooleanOrExpression
BooleanOrExpression
= left:BooleanAndExpression ws* "|" ws* right:BooleanOrExpression { return {type: "OR", left:left, right:right} }
/ BooleanAndExpression
BooleanAndExpression
= left:BooleanNotExpression ws* !(":" ws* SingleLineStatement) ":" ws* right:BooleanAndExpression { return {type: "AND", left:left, right:right} }
/ BooleanNotExpression
BooleanNotExpression
= "!" ws* operand:BooleanNotExpression { return {type: "NOT", operand: operand } }
/ BooleanComplexExpression
BooleanComplexExpression
= Identifier
/ "(" logical_or:BooleanOrExpression ")" { return logical_or; }
Identifier
= token:[a-zA-Z0-9_]+ { return token.join(""); }
ws
= [ \t]
lb
= [\r\n]
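
The same idea can be sketched outside PEG.js as a small hand-written recursive-descent parser in Python. This is only an illustration under assumptions: the function names (parse_scope, parse_and and so on) are made up for this sketch, "|" and parentheses are left out, and the lookahead only checks for "name =" rather than a full SingleLineStatement. The line to look at is the check in parse_and, where a colon counts as AND only if no assignment follows it.

```python
import re

IDENT = re.compile(r"[A-Za-z0-9_]+")
# A colon followed by "name =": this colon starts the scope body, not an AND.
BODY_AHEAD = re.compile(r":[ \t]*[A-Za-z0-9_]+[ \t]*=")


def skip_ws(text, pos):
    while pos < len(text) and text[pos] in " \t":
        pos += 1
    return pos


def parse_ident(text, pos):
    m = IDENT.match(text, pos)
    if not m:
        raise SyntaxError(f"identifier expected at {pos}")
    return m.group(), m.end()


def parse_not(text, pos):
    if text.startswith("!", pos):
        operand, pos = parse_not(text, skip_ws(text, pos + 1))
        return {"type": "NOT", "operand": operand}, pos
    return parse_ident(text, pos)


def parse_and(text, pos):
    left, pos = parse_not(text, pos)
    after = skip_ws(text, pos)
    # Negative lookahead: consume ":" as AND only if no assignment follows it.
    if text.startswith(":", after) and not BODY_AHEAD.match(text, after):
        right, pos = parse_and(text, skip_ws(text, after + 1))
        return {"type": "AND", "left": left, "right": right}, pos
    return left, pos


def parse_scope(text):
    cond, pos = parse_and(text, 0)
    pos = skip_ws(text, pos)
    if not text.startswith(":", pos):
        raise SyntaxError(f"':' expected at {pos}")
    name, pos = parse_ident(text, skip_ws(text, pos + 1))
    pos = skip_ws(text, pos)
    if not text.startswith("=", pos):
        raise SyntaxError(f"'=' expected at {pos}")
    value, pos = parse_ident(text, skip_ws(text, pos + 1))
    state = {"type": "ASSIGN", "left": name, "right": value}
    return {"bool": cond, "state": state}


result = parse_scope("win32:xml: x = y")
```

Without the BODY_AHEAD check, parse_and would also swallow "x" as a third AND operand and the parse would then fail at "=", which is the behaviour the question describes.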

Related

Rust pest parser fail

I'm learning pest, a parser generator for Rust, and I need to parse input like this:
{ISomething as Something, IOther as Other}
I wrote a rule:
LBrace = { "{" }
RBrace = { "}" }
Comma = { "," }
As = { "as" }
symbolAliases = { LBrace ~ importAliases ~ ( Comma ~ importAliases )* ~ RBrace }
importAliases = { Identifier ~ (As ~ Identifier)? }
Identifier = { IdentifierStart ~ IdentifierPart* }
IdentifierStart = _{ 'a'..'z' | 'A'..'Z' | "$" | "_" }
IdentifierPart = _{ 'a'..'z' | 'A'..'Z' | '0'..'9' | "$" | "_" }
But the parser throws an error:
thread 'main' panicked at 'unsuccessful parse: Error { variant: ParsingError { positives: [As, RBrace, Comma], negatives: [] }, location: Pos(11), line_col: Pos((1, 12)), path: None, line: "{ISomething as Something, IOther as Other}", continued_line: None }', src\main.rs:18:10
stack backtrace:
Help me figure out what the problem is.
You're not parsing any whitespace, that's your problem. It's expecting something like {ISomethingasSomething,IOtherasOther}. You can add a whitespace rule and add it in sensible places:
LBrace = { "{" }
RBrace = { "}" }
Comma = { "," }
As = { ws+ ~ "as" ~ ws+ }
// ^^^^^------^^^^^
// an "as" token should probably be surrounded by whitespace to make sense
symbolAliases = { LBrace ~ importAliases ~ ( Comma ~ importAliases )* ~ RBrace }
importAliases = { ws* ~ Identifier ~ (As ~ Identifier)? }
// ^^^^^
Identifier = { IdentifierStart ~ IdentifierPart* }
IdentifierStart = _{ 'a'..'z' | 'A'..'Z' | "$" | "_" }
IdentifierPart = _{ 'a'..'z' | 'A'..'Z' | '0'..'9' | "$" | "_" }
// whitespace rule defined here (probably want to add tabs and so on, too):
ws = _{ " " }
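That's the minimum needed to make it parse, but you'll probably also want to allow whitespace around the braces and commas, and add some more fine-grained rules.
Alternatively, depending on your needs, you can have pest insert whitespace implicitly by defining its special whitespace rule (named WHITESPACE in pest 2).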
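
To see why the failure is reported at position 11, here is a rough Python sketch of the same matching logic with whitespace skipping switched off and on. It is not pest: parse_aliases and its skip_ws flag are hypothetical names invented for this illustration, and only the space character counts as whitespace.

```python
import re

IDENT = re.compile(r"[A-Za-z$_][A-Za-z0-9$_]*")


def parse_aliases(text, skip_ws):
    """Parse '{A as B, C}'; return (pairs, None) or (None, failure offset)."""
    pos = 0
    pairs = []

    def ws():
        nonlocal pos
        if skip_ws:
            while pos < len(text) and text[pos] == " ":
                pos += 1

    def lit(s):
        nonlocal pos
        ws()
        if text.startswith(s, pos):
            pos += len(s)
            return True
        return False

    def ident():
        nonlocal pos
        ws()
        m = IDENT.match(text, pos)
        if m:
            pos = m.end()
            return m.group()
        return None

    if not lit("{"):
        return None, pos
    while True:
        name = ident()
        if name is None:
            return None, pos
        alias = None
        if lit("as"):
            alias = ident()
            if alias is None:
                return None, pos
        pairs.append((name, alias))
        if not lit(","):
            break
    if not lit("}"):
        return None, pos
    return pairs, None


src = "{ISomething as Something, IOther as Other}"
strict = parse_aliases(src, skip_ws=False)
relaxed = parse_aliases(src, skip_ws=True)
```

With skip_ws=False the matcher stops right after ISomething, at offset 11, because the next character is a space rather than "as", "," or "}". That is the same place and the same set of expected tokens as in the pest error message.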

Parsing a statement and adding parentheses to it

I am trying to parse logical statements and add parentheses to them.
For example:
I am trying to parse the statement a=>b<=>c&d as ((a)=>(b))<=>((c)&(d)), and similar statements as well.
Problem: some statements work and some do not. The example above fails: the output is ((c)&(d))<=>((c)&(d)), so the second expr seems to overwrite the first one.
What works: simple examples such as a<=>b and a|(b&c) are fine.
I think I have made some basic error in my code that I cannot figure out.
Here is my code
lex file
letters [a-zA-Z]
identifier {letters}+
operator (?:<=>|=>|\||&|!)
separator [\(\)]
%%
{identifier} {
yylval.s = strdup(yytext);
return IDENTIFIER; }
{operator} { return *yytext; }
{separator} { return *yytext; }
[\n] { return *yytext; }
%%
yacc file
%start program
%union {char* s;}
%type <s> program expr IDENTIFIER
%token IDENTIFIER
%left '<=>'
%left '=>'
%left '|'
%left '&'
%right '!'
%left '(' ')'
%%
program : expr '\n'
{
cout<<$$;
exit(0);
}
;
expr : IDENTIFIER {
cout<<" Atom ";
cout<<$1<<endl;
string s1 = string($1);
cout<<$$<<endl;
}
| expr '<=>' expr {
cout<<"Inside <=>\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"<=>"+"(" + s2 +")";
$$ = (char * )s3.c_str();
cout<<s3<<endl;
}
| expr '=>' expr {
cout<<"Inside =>\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"=>"+"(" + s2 +")";
$$ = (char *)s3.c_str();
cout<<$$<<endl;
}
| expr '|' expr {
cout<<"Inside |\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"|"+"(" + s2 +")";
$$ = (char *)s3.c_str();
cout<<$$<<endl;
}
| expr '&' expr {
cout<<"Inside &\n";
string s1 = string($1);
string s2 = string($3);
string s3 = "(" + s1 +")" +"&"+"(" + s2 +")";
$$ = (char *)s3.c_str();
cout<<$$<<endl;
}
| '!' expr {
cout<<"Inside !\n";
string s1 = string($2);
cout<<s1<<endl;
string s2 = "!" + s1;
$$ = (char *)s2.c_str();
cout<<$$<<endl;
}
| '(' expr ')' { $$ = $2; cout<<"INSIDE BRACKETS"; }
;
%%
Please let me know the mistake I have made.
Thank you
The basic problem is that you save the pointer returned by string::c_str() on the yacc value stack, but once the action finishes and the string object is destroyed, that pointer is no longer valid.
To fix this, either don't use std::string at all, or change your %union to { std::string *s; } instead of char * (the lexer then has to store a new std::string(yytext) as well). In either case you have to release the operands yourself to avoid memory leaks. If you are using Linux, the former is pretty easy, since glibc provides asprintf. Your actions would become something like:
| expr '<=>' expr {
cout<<"Inside <=>\n";
asprintf(&$$, "(%s)<=>(%s)", $1, $3);
cout<<$$<<endl;
free($1);
free($3);
}
For the latter, the action would look like:
| expr '<=>' expr {
cout<<"Inside <=>\n";
$$ = new string("(" + *$1 +")" +"<=>"+"(" + *$3 +")");
cout<<*$$<<endl;
delete $1;
delete $3;
}
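
For comparison, the intended transformation can be sketched without yacc. The Python version below uses precedence climbing with the same operator order as the %left declarations in the question (<=> binds loosest, & tightest). Since Python strings are ordinary values, the lifetime problem described above does not arise here. The names parenthesize and PRECEDENCE are made up for this sketch, and whitespace in the input is not handled.

```python
import re

TOKEN = re.compile(r"<=>|=>|[|&!()]|[A-Za-z]+")
# Binding power of each binary operator; higher binds tighter, all left-assoc.
PRECEDENCE = {"<=>": 1, "=>": 2, "|": 3, "&": 4}


def tokenize(text):
    tokens = TOKEN.findall(text)
    if "".join(tokens) != text:
        raise SyntaxError("unexpected character in input")
    return tokens


def parenthesize(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def unary():
        if peek() == "!":
            advance()
            return "!" + unary()
        if peek() == "(":
            advance()
            inner = binary(1)
            if peek() != ")":
                raise SyntaxError("')' expected")
            advance()
            return inner
        tok = peek()
        if tok is None or not tok.isalpha():
            raise SyntaxError("identifier expected")
        return advance()

    def binary(min_prec):
        left = unary()
        while peek() in PRECEDENCE and PRECEDENCE[peek()] >= min_prec:
            op = advance()
            # Left associativity: the right operand may only bind tighter.
            right = binary(PRECEDENCE[op] + 1)
            left = f"({left}){op}({right})"
        return left

    result = binary(1)
    if peek() is not None:
        raise SyntaxError("unexpected token " + peek())
    return result
```

Calling parenthesize("a=>b<=>c&d") builds each operand as its own string, so the left operand is still intact when the outer <=> is assembled.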

unclear how to add extra productions to bison grammar to create error messages

This is not homework, but it is from a book.
I'm given the following bison spec file:
%{
#include <stdio.h>
#include <ctype.h>
int yylex();
int yyerror();
%}
%token NUMBER
%%
command : exp { printf("%d\n", $1); }
; /* allows printing of the result */
exp : exp '+' term { $$ = $1 + $3; }
| exp '-' term { $$ = $1 - $3; }
| term { $$ = $1; }
;
term : term '*' factor { $$ = $1 * $3; }
| factor { $$ = $1; }
;
factor : NUMBER { $$ = $1; }
| '(' exp ')' { $$ = $2; }
;
%%
int main() {
return yyparse();
}
int yylex() {
int c;
/* eliminate blanks*/
while((c = getchar()) == ' ');
if (isdigit(c)) {
ungetc(c, stdin);
scanf("%d", &yylval);
return (NUMBER);
}
/* makes the parse stop */
if (c == '\n') return 0;
return (c);
}
int yyerror(char * s) {
fprintf(stderr, "%s\n", s);
return 0;
} /* allows for printing of an error message */
The task is to do the following:
Rewrite the spec to add the following useful error messages:
"missing right parenthesis," generated by the string (2+3
"missing left parenthesis," generated by the string 2+3)
"missing operator," generated by the string 2 3
"missing operand," generated by the string (2+)
The simplest solution that I was able to come up with is to do the following:
half_exp : exp '+' { $$ = $1; }
| exp '-' { $$ = $1; }
| exp '*' { $$ = $1; }
;
factor : NUMBER { $$ = $1; }
| '(' exp '\n' { yyerror("missing right parenthesis"); }
| exp ')' { yyerror("missing left parenthesis"); }
| '(' exp '\n' { yyerror("missing left parenthesis"); }
| '(' exp ')' { $$ = $2; }
| '(' half_exp ')' { yyerror("missing operand"); exit(0); }
;
exp : exp '+' term { $$ = $1 + $3; }
| exp '-' term { $$ = $1 - $3; }
| term { $$ = $1; }
| exp exp { yyerror("missing operator"); }
;
These changes work, but they lead to a lot of conflicts.
Here is my question:
Is there a way to rewrite this grammar so that it doesn't generate conflicts?
Any help is appreciated.
Yes it is possible:
command : exp { printf("%d\n", $1); }
; /* allows printing of the result */
exp: exp '+' exp {
// code
}
| exp '-' exp {
// code
}
| exp '*' exp {
// code
}
| exp '/' exp {
// code
}
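| '(' exp ')' {
// code
}
| NUMBER {
// code
}
;
Bison accepts ambiguous grammars: it resolves the resulting conflicts by default rules, or by precedence declarations such as %left '+' '-' followed by %left '*' '/' in the declarations section.
I don't see how you could otherwise rewrite the grammar to avoid the conflicts. You have missed the point of terms, factors and so on: you use that layering when you want an unambiguous context-free grammar, and it is also the starting point for removing left recursion.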
Starting from this grammar:
E  -> E+T
    | T
T  -> T*F
    | F
F  -> (E)
    | num
Once you eliminate the left recursion you get:
E  -> TE'    { num , ( }
E' -> +TE'   { + }
    | eps    { ) , EOI }
T  -> FT'    { ( , num }
T' -> *FT'   { * }
    | eps    { + , ) , EOI }
F  -> (E)    { ( }
    | num    { num }
The set next to each rule lists the input tokens on which that rule is chosen. Of course, this is just an example for simple arithmetic expressions such as 2*(3+4)*5+(3*3*3+4+5*6).
If you want to learn more about this topic, I suggest reading about eliminating left recursion in context-free grammars. There are some great books that cover it, including how to compute these sets.
But as I said above, all of this can be avoided because Bison accepts ambiguous grammars.
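
The sets beside the rules map directly onto a hand-written predictive parser: each function looks at the next token and picks the rule whose set contains it. The Python sketch below is only an illustration of that idea and not a Bison solution. It adds '-' alongside '+', and it reports the four messages from the question at the points where no rule applies. All names (evaluate, starts_operand and so on) are assumptions of this sketch.

```python
import re


def tokenize(text):
    # Numbers become ints; every other non-blank character is its own token.
    found = re.findall(r"\d+|\S", text)
    return [int(t) if t.isdigit() else t for t in found] + ["EOI"]


def evaluate(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos]

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def starts_operand(tok):
        return isinstance(tok, int) or tok == "("

    def expr():  # E -> T E'
        value = term()
        while peek() in ("+", "-"):  # E' -> + T E' (and '-' likewise)
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value  # E' -> eps

    def term():  # T -> F T'
        value = factor()
        while peek() == "*":  # T' -> * F T'
            advance()
            value *= factor()
        if starts_operand(peek()):
            # Neither '*' nor anything in the eps set of T' follows.
            raise SyntaxError("missing operator")
        return value  # T' -> eps

    def factor():  # F -> ( E ) | num
        tok = peek()
        if isinstance(tok, int):
            return advance()
        if tok == "(":
            advance()
            value = expr()
            if peek() != ")":
                raise SyntaxError("missing right parenthesis")
            advance()
            return value
        raise SyntaxError("missing operand")

    value = expr()
    if peek() == ")":
        raise SyntaxError("missing left parenthesis")
    if peek() != "EOI":
        raise SyntaxError("unexpected token " + str(peek()))
    return value
```

Because the parser always knows which tokens it could accept next, each error message falls out of a failed set check. This is harder to express in an LR grammar, where error productions tend to introduce the conflicts the question ran into.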

Noob wants to make a parser for a small language

I want to write a parser in happy for a let-in expression language. For example, I want to parse the following string:
let x = 4 in x*x
At university we are studying attribute grammars, and I want to use this technique to compute the value of the parsed let-in expression directly. So in the happy file I set the data type of the parsing function to Int, and I created a new attribute called env. This attribute is a function from String to Int that maps a variable name to its value. For my example:
env "x" = 4
Here is the happy file with my grammar:
{
module Parser where
import Token
import Lexer
}
%tokentype { Token }
%token
let { TLet }
in { TIn }
int { TInt $$ }
var { TVar $$ }
'=' { TEq }
'+' { TPlus }
'-' { TMinus }
'*' { TMul }
'/' { TDiv }
'(' { TOB }
')' { TCB }
%name parse
%attributetype { Int }
%attribute env { String -> Int }
%error { parseError }
%%
Exp : let var '=' Exp in Exp
{
$4.env = $$.env;
$2.env = (\_ -> 0);
$6.env = (\str -> if str == $2 then $4 else 0);
$$ = $6;
}
| Exp1
{
$1.env = $$.env;
$$ = $1;
}
Exp1 : Exp1 '+' Term
{
$1.env = $$.env;
$2.env = $$.env;
$$ = $1 + $3;
}
| Exp1 '-' Term
{
$1.env = $$.env;
$2.env = $$.env;
$$ = $1 - $3;
}
| Term
{
$1.env = $$.env;
$$ = $1;
}
Term : Term '*' Factor
{
$1.env = $$.env;
$2.env = $$.env;
$$ = $1 * $3;
}
| Term '/' Factor
{
$1.env = $$.env;
$2.env = $$.env;
$$ = div $1 $3;
}
| Factor
{
$1.env = $$.env;
$$ = $1;
}
Factor
: int
{
$$ = $1;
}
| var
{
$$ = $$.env $1;
}
| '(' Exp ')'
{
$1.env = $$.env;
$$ = $1;
}
{
parseError :: [Token] -> a
parseError _ = error "Parse error"
}
When I load the Haskell file generated from the happy file above, I get the following error:
Ambiguous occurrence `Int'
It could refer to either `Parser.Int', defined at parser.hs:271:6
or `Prelude.Int',
imported from `Prelude' at parser.hs:2:8-13
(and originally defined in `GHC.Types')
I don't know why I get this, because I don't define the type Parser.Int in my happy file. I tried replacing Int with Prelude.Int, but then I get other errors.
How can I resolve this? I'd also welcome general tips if I'm doing something suboptimal.
See the happy explanation of %attributetype: http://www.haskell.org/happy/doc/html/sec-AtrributeGrammarsInHappy.html
Your line:
%attributetype { Int }
is declaring a new type named Int (the record type that holds the attributes), not saying that the result is an Int. This is what causes the ambiguity with Prelude.Int. Give that type a name of its own and declare the result as an attribute with %attribute, as you did for env.
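
Independently of happy, the attribute flow the question is aiming for can be sketched in plain Python: env is passed down the tree as a function from names to values (an inherited attribute), and the integer result is passed back up (a synthesized attribute). This is a hand-written evaluator under assumptions, not generated code. The names evaluate and take are made up, unbound variables default to 0 as in the question, and unlike the question's rule the body's env extends the outer one so that nested lets work.

```python
import re


def tokenize(text):
    return re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()=]", text) + ["<eof>"]


def evaluate(text, env=lambda name: 0):
    """Evaluate a let-in expression; unbound variables default to 0."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos]

    def take(expected=None):
        nonlocal pos
        tok = tokens[pos]
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        pos += 1
        return tok

    def exp(env):  # Exp : let var '=' Exp in Exp | Exp1
        if peek() != "let":
            return exp1(env)
        take("let")
        name = take()
        take("=")
        bound = exp(env)  # the bound expression sees the outer env
        take("in")
        # The body sees the outer env extended with the new binding.
        return exp(lambda s: bound if s == name else env(s))

    def exp1(env):  # Exp1 : Exp1 '+' Term | Exp1 '-' Term | Term
        value = term(env)
        while peek() in ("+", "-"):
            if take() == "+":
                value += term(env)
            else:
                value -= term(env)
        return value

    def term(env):  # Term : Term '*' Factor | Term '/' Factor | Factor
        value = factor(env)
        while peek() in ("*", "/"):
            if take() == "*":
                value *= factor(env)
            else:
                value //= factor(env)  # integer division, like div
        return value

    def factor(env):  # Factor : int | var | '(' Exp ')'
        tok = take()
        if tok == "(":
            value = exp(env)
            take(")")
            return value
        if tok.isdigit():
            return int(tok)
        if tok[0].isalpha() or tok[0] == "_":
            if tok in ("let", "in"):
                raise SyntaxError(f"unexpected keyword {tok!r}")
            return env(tok)  # look the variable up in the inherited env
        raise SyntaxError(f"unexpected token {tok!r}")

    value = exp(env)
    take("<eof>")
    return value
```

Note that every sub-expression receives env explicitly, including the right operand of each operator and the expression inside parentheses. In the happy file that corresponds to assigning env to $3 in the binary rules and to $2 in the parenthesised rule.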

Expressions in a CoCo to ANTLR translator

I'm parsing CoCo/R grammars in a utility to automate CoCo -> ANTLR translation. The core ANTLR grammar is:
rule '=' expression '.' ;
expression
: term ('|' term)*
-> ^( OR_EXPR term term* )
;
term
: (factor (factor)*)? ;
factor
: symbol
| '(' expression ')'
-> ^( GROUPED_EXPR expression )
| '[' expression']'
-> ^( OPTIONAL_EXPR expression)
| '{' expression '}'
-> ^( SEQUENCE_EXPR expression)
;
symbol
: IF_ACTION
| ID (ATTRIBUTES)?
| STRINGLITERAL
;
My problem is with constructions such as these:
CS = { ExternAliasDirective }
{ UsingDirective }
EOF .
CS results in an AST with an OR_EXPR node although no '|' character
actually appears. I'm sure this is due to the definition of
expression, but I cannot see any other way to write the rules.
I did experiment with the following to resolve the ambiguity.
// explicitly test for the presence of an '|' character
expression
@init { bool ored = false; }
: term {ored = (input.LT(1).Type == OR); } (OR term)*
-> {ored}? ^(OR_EXPR term term*)
-> ^(LIST term term*)
It works but the hack reinforces my conviction that something fundamental is wrong.
Any tips much appreciated.
Your rule:
expression
: term ('|' term)*
-> ^( OR_EXPR term term* )
;
always causes the rewrite rule to create a tree with a root of type OR_EXPR. You can create "sub rewrite rules" like this:
expression
: (term -> REWRITE_RULE_X) ('|' term -> ^(REWRITE_RULE_Y))*
;
And to resolve the ambiguity in your grammar, it's easiest to enable global backtracking which can be done in the options { ... } section of your grammar.
A quick demo:
grammar CocoR;
options {
output=AST;
backtrack=true;
}
tokens {
RULE;
GROUP;
SEQUENCE;
OPTIONAL;
OR;
ATOMS;
}
parse
: rule EOF -> rule
;
rule
: ID '=' expr* '.' -> ^(RULE ID expr*)
;
expr
: (a=atoms -> $a) ('|' b=atoms -> ^(OR $expr $b))*
;
atoms
: atom+ -> ^(ATOMS atom+)
;
atom
: ID
| '(' expr ')' -> ^(GROUP expr)
| '{' expr '}' -> ^(SEQUENCE expr)
| '[' expr ']' -> ^(OPTIONAL expr)
;
ID
: ('a'..'z' | 'A'..'Z') ('a'..'z' | 'A'..'Z' | '0'..'9')*
;
Space
: (' ' | '\t' | '\r' | '\n') {skip();}
;
with input:
CS = { ExternAliasDirective }
{ UsingDirective }
EOF .
produces the AST (image not reproduced here):
and the input:
foo = a | b ({c} | d [e f]) .
produces (image not reproduced here):
The class to test this:
import org.antlr.runtime.*;
import org.antlr.runtime.tree.*;
import org.antlr.stringtemplate.*;
public class Main {
public static void main(String[] args) throws Exception {
/*
String source =
"CS = { ExternAliasDirective } \n" +
"{ UsingDirective } \n" +
"EOF . ";
*/
String source = "foo = a | b ({c} | d [e f]) .";
ANTLRStringStream in = new ANTLRStringStream(source);
CocoRLexer lexer = new CocoRLexer(in);
CommonTokenStream tokens = new CommonTokenStream(lexer);
CocoRParser parser = new CocoRParser(tokens);
CocoRParser.parse_return returnValue = parser.parse();
CommonTree tree = (CommonTree)returnValue.getTree();
DOTTreeGenerator gen = new DOTTreeGenerator();
StringTemplate st = gen.toDOT(tree);
System.out.println(st);
}
}
and with the output this class produces, I used the following website to create the AST-images: http://graph.gafol.net/
HTH
EDIT
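To account for epsilon (empty string) in your OR expressions, you might try something (quickly tested!) like this, where NOTHING is an extra imaginary token that also has to be declared in the tokens { ... } block: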
expr
: (a=atoms -> $a) ( ( '|' b=atoms -> ^(OR $expr $b)
| '|' -> ^(OR $expr NOTHING)
)
)*
;
which parses the source:
foo = a | b | .
into the following AST (image not reproduced here).
The production for expression explicitly says that it can only return an OR_EXPR node. You can try something like:
expression
:
term
|
term ('|' term)+
-> ^( OR_EXPR term term* )
;
Further down, you could use:
term
: factor*;
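
The rule both answers rely on, creating an OR node only when a '|' was actually matched, can be sketched in plain Python as follows. This is a hand-written illustration and not ANTLR output. The function parse and the CLOSERS table are hypothetical, the node labels are borrowed from the question (LIST comes from the asker's experiment), and the input is assumed to be already split into tokens.

```python
CLOSERS = {
    "(": (")", "GROUPED_EXPR"),
    "[": ("]", "OPTIONAL_EXPR"),
    "{": ("}", "SEQUENCE_EXPR"),
}


def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expression():  # expression : term ('|' term)*
        nonlocal pos
        terms = [term()]
        while peek() == "|":
            pos += 1
            terms.append(term())
        # Only wrap in OR_EXPR when at least one '|' was actually seen.
        return terms[0] if len(terms) == 1 else ("OR_EXPR", *terms)

    def term():  # term : factor*
        factors = []
        while peek() is not None and peek() not in ("|", ")", "]", "}"):
            factors.append(factor())
        return ("LIST", *factors)

    def factor():  # factor : symbol | '(' e ')' | '[' e ']' | '{' e '}'
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok in CLOSERS:
            closer, label = CLOSERS[tok]
            inner = expression()
            if peek() != closer:
                raise SyntaxError(f"{closer!r} expected")
            pos += 1
            return (label, inner)
        return tok

    tree = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r}")
    return tree


cs = parse(["{", "ExternAliasDirective", "}", "{", "UsingDirective", "}", "EOF"])
alt = parse(["a", "|", "b"])
```

For the CS rule body the result is a plain LIST with two SEQUENCE_EXPR children and EOF, while "a | b" yields an OR_EXPR with two LIST children.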

Resources