Grammar for arithmeticthat accounts for integer division - parsing

I'm trying to learn about abstract syntax trees. I'm stuck at developing a grammar that works for integer division.
I've found this one that works for float division:
EXP -> TERM EXP1
EXP1 -> + TERM EXP1 |
- TERM EXP1 |
epsilon
TERM -> FACTOR TERM1
TERM1 -> * FACTOR TERM1 |
/ FACTOR TERM1 |
epsilon
FACTOR -> ( EXP ) | - EXP | number
But it doesn't work for integer division.

Related

Grammar for expressions which disallows outer parentheses

I have the following grammar for expressions involving binary operators (| ^ & << >> + - * /):
expression : expression BITWISE_OR xor_expression
| xor_expression
xor_expression : xor_expression BITWISE_XOR and_expression
| and_expression
and_expression : and_expression BITWISE_AND shift_expression
| shift_expression
shift_expression : shift_expression LEFT_SHIFT arith_expression
| shift_expression RIGHT_SHIFT arith_expression
| arith_expression
arith_expression : arith_expression PLUS term
| arith_expression MINUS term
| term
term : term TIMES factor
| term DIVIDE factor
| factor
factor : NUMBER
| LPAREN expression RPAREN
This seems to work fine, but doesn't quite match my needs because it allows outer parentheses e.g. ((3 + 4) * 2).
How can I change the grammar to disallow outer parentheses, while still allowing them within expressions e.g. (3 + 4) * 2, even redundantly e.g. (3 * 4) + 2?
Add this rule to your grammar:
top_level : expression BITWISE_OR xor_expression
| xor_expression BITWISE_XOR and_expression
| and_expression BITWISE_AND shift_expression
| shift_expression LEFT_SHIFT arith_expression
| shift_expression RIGHT_SHIFT arith_expression
| arith_expression PLUS term
| arith_expression MINUS term
| term TIMES factor
| term DIVIDE factor
| NUMBER
and use top_level where you want expressions without outer parens.

ANTLR grammar for SMT formulae

I am trying to make a grammar for SMT formulae and this is what I have so far
grammar Z3input;
startRule : formulaList? EOF;
LEFT_PAREN : '(';
RIGHT_PAREN : ')';
COMMA : ',';
SEMICOLON : ';';
PLUS : '+';
MINUS : '-';
TIMES : '*';
DIVIDE : '/';
DIGIT : [0-9];
INTEGER : '0' | [1-9] DIGIT*;
FLOAT : DIGIT+ '.' DIGIT+;
NUMERICAL_LITERAL : FLOAT | INTEGER;
BOOLEAN_LITERAL : 'True' | 'False';
LITERAL : MINUS? NUMERICAL_LITERAL | BOOLEAN_LITERAL;
COMPARISON_OPERATOR : '>' | '<' | '>=' | '<=' | '!=' | '==';
WHITESPACE: [ \t\n\r]+ -> skip;
IDENTIFIER : [a-uw-zB-DF-Z]+ ([a-zA-Z0-9]? [a-uw-zB-DF-Z])*; // omits 'v', 'A', 'E' and cannot end in those characters
IMPLIES : '->' | '-->' | 'implies';
AND : '&' | 'and' | '^';
OR : 'or' | 'v' | '|';
NOT : '~' | '!' | 'not';
QUANTIFIER : 'A' | 'E' | 'forall' | 'exists';
formulaList : formula ( SEMICOLON formula )*;
argumentList : expression ( COMMA expression )*;
formula : formulaConjunction
| LEFT_PAREN formula RIGHT_PAREN OR LEFT_PAREN formulaConjunction RIGHT_PAREN
| formula IMPLIES LEFT_PAREN formulaConjunction RIGHT_PAREN;
formulaConjunction : formulaNegation | formulaConjunction AND formulaNegation;
formulaNegation : formulaAtom | NOT formulaNegation;
formulaAtom : BOOLEAN_LITERAL
| IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )?
| QUANTIFIER '.' LEFT_PAREN formulaAtom RIGHT_PAREN
| compareExpn;
expression : boolConjunction | expression OR boolConjunction;
boolConjunction : boolNegation | boolConjunction AND boolNegation;
boolNegation : compareExpn | NOT boolNegation;
compareExpn : arithExpn COMPARISON_OPERATOR arithExpn;
arithExpn : term | arithExpn PLUS term | arithExpn MINUS term;
term : factor | term TIMES factor | term DIVIDE factor;
factor : primary | MINUS factor;
primary : LITERAL
| IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )?
| LEFT_PAREN expression RIGHT_PAREN;
SMT formulae are formulae of first-order logic with function symbols (identifiers which can be called with however many arguments), variables, comparison of either boolean literals (I.e. 'True' or 'False') or numeric literals or function calls or variables, arithmetic with operators '+', '*', '-', and '/'. Essentially these formulae are first-order logic over some signature and for my purposes I've chosen for this signature to be the theory of rationals.
I can get a proper interpretation of something like 'True ^ True' but anything more complicated, including even 'True | True', seems to always result in something along the lines of
... mismatched input '|' expecting {<EOF>, ';', IMPLIES, AND}
so I would like some help with correcting the grammar. And for the record I would prefer to keep the grammar run-time independent.
Your formula rule seems to be causing the issue here: LEFT_PAREN formula RIGHT_PAREN OR LEFT_PAREN formulaConjunction RIGHT_PAREN.
That's saying that only formulas of the form (FORMULA)|(CONJUNCTIVE) will be accepted by the language.
Instead, specify precedence rules for each operator, and use a nonterminal for each level of precedence. For example, your grammar might look something like the following:
formula : (QUANTIFIER IDENTIFIER '.')? formulaImplication;
formulaImplication : formulaConjunction (IMPLIES formula)?;
formulaConjunction : formulaDisjunction (AND formulaConjunction)?;
formulaDisjunction : formulaNegation (OR formulaDisjunction)?;
formulaNegation : formulaAtom | NOT formulaNegation;
formulaAtom : BOOLEAN_LITERAL | IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )? | LEFT_PAREN formula RIGHT_PAREN | compareExpn;
expression : boolConjunction | expression OR boolConjunction;
boolConjunction : boolNegation | boolConjunction AND boolNegation;
boolNegation : compareExpn | NOT boolNegation;
compareExpn : arithExpn COMPARISON_OPERATOR arithExpn;
arithExpn : term | arithExpn PLUS term | arithExpn MINUS term;
term : factor ((TIMES | DIVIDE) term)?;
factor : primary | MINUS factor;
primary : LITERAL | IDENTIFIER ( LEFT_PAREN argumentList? RIGHT_PAREN )? | LEFT_PAREN expression RIGHT_PAREN;

Creating Bison File for Simple Grammar

I have the following simple grammar:
E -> T | ^ v . E
T -> F T1
T1 -> F T1 | epsilon
F -> ( E ) | v
I'm pretty new to Bison, so I was hoping someone could help show me how to write it out in that format. All I have so far is the following, but I'm not sure if it's correct:
%left '.'
%left 'v'
%% /* The grammar follows. */
exp:
term {printf("1");}
| '^' 'v' '.' exp {printf("2");}
;
term:
factor term1 {printf("3");}
;
term1:
factor term1 {printf("4");}
| {printf("5");}
;
factor:
'(' exp ')' {printf("6");}
| 'v' {printf("7");}
;
%%
You are missing the closing semicolon from several of the productions. There's nothing in the source grammar to suggest you need the productions about lines.

ANTLR AST rules fail with RewriteEmptyStreamException

I have a simple grammar:
grammar sample;
options { output = AST; }
assignment
: IDENT ':=' expr ';'
;
expr
: factor ('*' factor)*
;
factor
: primary ('+' primary)*
;
primary
: NUM
| '(' expr ')'
;
IDENT : ('a'..'z')+ ;
NUM : ('0'..'9')+ ;
WS : (' '|'\n'|'\t'|'\r')+ {$channel=HIDDEN;} ;
Now I want to add some rewrite rules to generate an AST. From what I've read online and in the Language Patterns book, I should be able to modify the grammar like this:
assignment
: IDENT ':=' expr ';' -> ^(':=' IDENT expr)
;
expr
: factor ('*' factor)* -> ^('*' factor+)
;
factor
: primary ('+' primary)* -> ^('+' primary+)
;
primary
: NUM
| '(' expr ')' -> ^(expr)
;
But it does not work. Although it compiles fine, when I run the parser I get a RewriteEmptyStreamException error. Here's where things get weird.
If I define the pseudo tokens ADD and MULT and use them instead of the tree node literals, it works without error.
tokens { ADD; MULT; }
expr
: factor ('*' factor)* -> ^(MULT factor+)
;
factor
: primary ('+' primary)* -> ^(ADD primary+)
;
Alternatively, if I use the node suffix notation, it also appears to work fine:
expr
: factor ('*'^ factor)*
;
factor
: primary ('+'^ primary)*
;
Is this discrepancy in behavior a bug?
No, not a bug, AFAIK. Take your expr rule for example:
expr
: factor ('*' factor)* -> ^('*' factor+)
;
since the * might not be present, it should also not be in your AST rewrite rule. So, the above is incorrect and ANTLR complaining about it is correct.
Now if you insert an imaginary token like MULT instead:
expr
: factor ('*' factor)* -> ^(MULT factor+)
;
all is okay since your rule will always produce one or more factor's.
What you probably meant to do is something like this:
expr
: (factor -> factor) ('*' f=factor -> ^('*' $expr $f))*
;
Also see chapter 7: Tree Construction from The Definitive ANTLR Reference. Especially the paragraphs Rewrite Rules in Subrules (page 173) and Referencing Previous Rule ASTs in Rewrite Rules (page 174/175).
If you want to generate an N-ary tree for the '*' operator with all children at the same level you can do this:
expr
: (s=factor -> factor) (('*' factor)+ -> ^('*' $s factor+))?
;
Here are some examples of what this will return:
Tokens: AST
factor: factor
factor '*' factor: ^('*' factor factor)
factor '*' factor '*' factor: ^('*' factor factor factor)
Bart's third example above will produce a nested tree, since the result of $expr for each successive iteration is a node with two children, like this:
factor * factor * factor: ^('*' factor ^('*' factor factor))
which you probably don't need since multiplication is commutative.

How to write a recursive descent parser from scratch?

As a purely academic exercise, I'm writing a recursive descent parser from scratch -- without using ANTLR or lex/yacc.
I'm writing a simple function which converts math expressions into their equivalent AST. I have the following:
// grammar
type expr =
| Lit of float
| Add of expr * expr
| Mul of expr * expr
| Div of expr * expr
| Sub of expr * expr
// tokens
type tokens =
| Num of float
| LParen | RParen
| XPlus | XStar | XMinus | XSlash
let tokenize (input : string) =
Regex.Matches(input.Replace(" ", ""), "\d+|[+/*\-()]")
|> Seq.cast<Match>
|> Seq.map (fun x -> x.Value)
|> Seq.map (function
| "+" -> XPlus
| "-" -> XMinus
| "/" -> XSlash
| "*" -> XStar
| "(" -> LParen
| ")" -> RParen
| num -> Num(float num))
|> Seq.to_list
So, tokenize "10 * (4 + 5) - 1" returns the following token stream:
[Num 10.0; XStar; LParen; Num 4.0; XPlus; Num 5.0; RParen; XMinus; Num 1.0]
At this point, I'd like to map the token stream to its AST with respect to operator precedence:
Sub(
Mul(
Lit 10.0
,Add(Lit 4.0, Lit 5.0)
)
,Lit 1.0
)
However, I'm drawing a blank. I've never written a parser from scratch, and I don't know even in principle how to begin.
How do I convert a token stream its representative AST?
Do you know about language grammars?
Assuming yes, you have a grammar with rules along the lines
...
addTerm := mulTerm addOp addTerm
| mulTerm
addOp := XPlus | XMinus
mulTerm := litOrParen mulOp mulTerm
| litOrParen
...
which ends up turning into code like (writing code in browser, never compiled)
let rec AddTerm() =
let mulTerm = MulTerm() // will parse next mul term (error if fails to parse)
match TryAddOp with // peek ahead in token stream to try parse
| None -> mulTerm // next token was not prefix for addOp rule, stop here
| Some(ao) -> // did parse an addOp
let rhsMulTerm = MulTerm()
match ao with
| XPlus -> Add(mulTerm, rhsMulTerm)
| XMinus -> Sub(mulTerm, rhsMulTerm)
and TryAddOp() =
let next = tokens.Peek()
match next with
| XPlus | XMinus ->
tokens.ConsumeNext()
Some(next)
| _ -> None
...
Hopefully you see the basic idea. This assumes a global mutable token stream that allows both 'peek at next token' and 'consume next token'.
If I remember from college classes the idea was to build expression trees like:
<program> --> <expression> <op> <expression> | <expression>
<expression> --> (<expression>) | <constant>
<op> --> * | - | + | /
<constant> --> <constant><constant> | [0-9]
then once you have construction your tree completely so you get something like:
exp
exp op exp
5 + and so on
then you run your completed tree through another program that recursively descents into the tree calculating expressions until you have an answer. If your parser doesn't understand the tree, you have a syntax error. Hope that helps.

Resources