I would like to do a parser in prolog. This one should be able to parse something like this:
a = 3 + (6 * 11);
For now I only have this grammar done. It's working but I would like to improve it in order to have id such as (a..z)+ and digit such as (0..9)+.
parse(-ParseTree, +Program, []):-
parsor(+Program, []).
parsor --> [].
parsor --> assign.
assign --> id, [=], expr, [;].
id --> [a] | [b].
expr --> term, (add_sub, expr ; []).
term --> factor, (mul_div, term ; []).
factor --> digit | (['('], expr, [')'] ; []).
add_sub --> [+] | [-].
mul_div --> [*] | [/].
digit --> [0] | [1] | [2] | [3] | [4] | [5] | [6] | [7] | [8] | [9].
Secondly, I would like to store something in the ParseTree variable in order to print the ParseTree like this:
PARSE TREE:
assignment
ident(a)
assign_op
expression
term
factor
int(1)
mult_op
term
factor
int(2)
add_op
expression
term
factor
left_paren
expression
term
factor
int(3)
sub_op
...
And this is the function I'm going use to print the ParseTree:
output_result(OutputFile,ParseTree):-
open(OutputFile,write,OutputStream),
write(OutputStream,'PARSE TREE:'),
nl(OutputStream),
writeln_term(OutputStream,0,ParseTree),
close(OutputStream).
writeln_term(Stream,Tabs,int(X)):-
write_tabs(Stream,Tabs),
writeln(Stream,int(X)).
writeln_term(Stream,Tabs,ident(X)):-
write_tabs(Stream,Tabs),
writeln(Stream,ident(X)).
writeln_term(Stream,Tabs,Term):-
functor(Term,_Functor,0), !,
write_tabs(Stream,Tabs),
writeln(Stream,Term).
writeln_term(Stream,Tabs1,Term):-
functor(Term,Functor,Arity),
write_tabs(Stream,Tabs1),
writeln(Stream,Functor),
Tabs2 is Tabs1 + 1,
writeln_args(Stream,Tabs2,Term,1,Arity).
writeln_args(Stream,Tabs,Term,N,N):-
arg(N,Term,Arg),
writeln_term(Stream,Tabs,Arg).
writeln_args(Stream,Tabs,Term,N1,M):-
arg(N1,Term,Arg),
writeln_term(Stream,Tabs,Arg),
N2 is N1 + 1,
writeln_args(Stream,Tabs,Term,N2,M).
write_tabs(_,0).
write_tabs(Stream,Num1):-
write(Stream,'\t'),
Num2 is Num1 - 1,
write_tabs(Stream,Num2).
writeln(Stream,Term):-
write(Stream,Term),
nl(Stream).
write_list(_Stream,[]).
write_list(Stream,[Ident = Value|Vars]):-
write(Stream,Ident),
write(Stream,' = '),
format(Stream,'~1f',Value),
nl(Stream),
write_list(Stream,Vars).
I hope someone will be able to help me. Thank you !
Here's an enhancement of your parser as written which can get you started. It's an elaboration of the notions that #CapelliC indicated.
parser([]) --> [].
parser(Tree) --> assign(Tree).
assign([assignment, ident(X), '=', Exp]) --> id(X), [=], expr(Exp), [;].
id(X) --> [X], { atom(X) }.
expr([expression, Term]) --> term(Term).
expr([expression, Term, Op, Exp]) --> term(Term), add_sub(Op), expr(Exp).
term([term, F]) --> factor(F).
term([term, F, Op, Term]) --> factor(F), mul_div(Op), term(Term).
factor([factor, int(N)]) --> num(N).
factor([factor, Exp]) --> ['('], expr(Exp), [')'].
add_sub(Op) --> [Op], { memberchk(Op, ['+', '-']) }.
mul_div(Op) --> [Op], { memberchk(Op, ['*', '/']) }.
num(N) --> [N], { number(N) }.
I might have a couple of niggles in here, but the key elements I've added to your code are:
Replaced digit with num which accepts any Prolog term N for which number(N) is true
Used atom(X) to identify a valid identifier
Added an argument to hold the result of parsing the given expression item
As an example:
| ?- phrase(parser(Tree), [a, =, 3, +, '(', 6, *, 11, ')', ;]).
Tree = [assignment,ident(a),=,[expression,[term,[factor,int(3)]],+,[expression,[term,[factor,[expression,[term,[factor,int(6)],*,[term,[factor,int(11)]]]]]]]]] ? ;
This may not be an ideal representation of the parse tree. It may need some adjustment per your needs, which you can do by modifying what I've shown a little. And then you can write a predicate which formats the parse tree as you like.
You could also consider, instead of a list structure, an embedded Prolog term structure as follows:
parser([]) --> [].
parser(Tree) --> assign(Tree).
assign(assignment(ident(X), '=', Exp)) --> id(X), [=], expr(Exp), [;].
id(X) --> [X], { atom(X) }.
expr(expression(Term)) --> term(Term).
expr(expression(Term, Op, Exp)) --> term(Term), add_sub(Op), expr(Exp).
term(term(F)) --> factor(F).
term(term(F, Op, Term)) --> factor(F), mul_div(Op), term(Term).
factor(factor(int(N))) --> num(N).
factor(factor(Exp)) --> ['('], expr(Exp), [')'].
add_sub(Op) --> [Op], { memberchk(Op, ['+', '-']) }.
mul_div(Op) --> [Op], { memberchk(Op, ['*', '/']) }.
num(N) --> [N], { number(N) }.
Which results in something like this:
| ?- phrase(parser(T), [a, =, 3, +, '(', 6, *, 11, ')', ;]).
T = assignment(ident(a),=,expression(term(factor(int(3))),+,expression(term(factor(expression(term(factor(int(6)),*,term(factor(int(11)))))))))) ? ;
A recursive rule for id//0, made a bit more generic:
id --> [First], {char_type(First,lower)}, id ; [].
Building the tree could be done 'by hand', augmenting each non terminal with the proper term, like
...
assign(assign(Id, Expr)) --> id(Id), [=], expr(Expr), [;].
...
id//0 could become id//1
id(id([First|Rest])) --> [First], {memberchk(First, [a,b])}, id(Rest) ; [], {Rest=[]}.
If you're going to code such parsers frequently, a rewrite rule can be easily implemented...
Related
I've been trying for the past couple of days write a menhir (yacc) grammar for this seemingly simple language with some syntax sugar, but can't figure out to do so without ambiguities.
The relevant constructs from the language are:
Application is juxtaposition: x y z is a valid expression.
Allow type annotations: given some expression e and type t, you can write e : t. the : should be a non-associative operator.
You can (and sometimes need to) wrap certain expressions in parentheses
Arrow types: expressions of the form (x y z : a) -> b, which is sugar for (x : a) -> (y : a) -> (z : a) -> b (-> is right-associative). here x y z are identifiers, a and b are other expressions
Arrows can also be written as just a -> b, which is sugar for (_ : a) -> b.
My initial strategy was to split the expressions to two types based on the intuitive question "does it need to be wrapped in parentheses when used in application?", if the answer is yes I call it an "atom", otherwise it's an expression.
For example variables are atoms, because I can write x y. Parenthesized expressions are also atoms because they can be juxtaposed. Lambda abstraction is not an atom because it's too ambiguous (even for a human reader) to parse λx . y z - you have to write (λx . y) z. The (minimal error reproducing) grammar can be outlined as follows
%token EOF
%token <string> IDENT
%token LPAREN RPAREN COLON ARROW BOOL
%nonassoc COLON
%right ARROW
%start program
%type <expr> program
%type <expr> expr
%type <expr> atom
%%
program:
| e=expr; EOF { e }
expr:
| es=nonempty_list(atom) { unfoldApp es } // application
| e=expr; COLON; t=expr { Annot (e, t) } // annotation
| LPAREN; xs=nonempty_list(IDENT); COLON;
a=expr; RPAREN; ARROW; b=expr { unfoldPi xs a b } // arrow types
| a=expr; ARROW; b=expr { Pi ("_", a, b) } // sugar (not necessary to reproduce)
// more cases for lambda, let abstractions, etc...
atom:
| x=IDENT { Var x } // variable
| LPAREN; e=expr; RPAREN { e } // parentheses
| BOOL { Bool }
// more constants and atoms
There's clearly an undeniable ambiguity here. After the parser sees ( x with lookahead : it might be a variable x, annotated and parenthesized. Or, it could be the start of an arrow with just one variable. Even with more variables, it could be an annotated application, so the ambiguity remains. I want to prioritize the latter, meaning if it can be an arrow type, choose that.
I could probably check in the AST later if it's something like Annot (App (App (Var "x", Var "y"), Var "z"), a) and then convert manually to the correct result, but that feels sketchy and unreliable.
Is there some clean and easy way to do it with a LALR generator such as menhir?
I am creating parser and lexer rules for Decaf programming language written in ANTLR4. There is a parser test file I am trying to run to get the parser tree for it by printing the visited nodes on the terminal window and paste them into D3_parser_tree.html class. The current parser tree is missing the right square brackets with the number 10 according to this testing file : class program { int i [10]; }
The error I am getting : mismatched input '10' expecting INT_LITERAL
I am not sure why I am getting this error although I have declared a lexer rule for INT_LITERAL and then called it in a parser rule within field_decl according to the given Decaf spec :
** Parser rules **
<program> → class Program ‘{‘ <field_decl>* <method_decl>* ‘}’
<field_decl> → <type> { <id> | <id> ‘[‘ <int_literal> ‘]’ }+, ;
<method_decl> → { <type> | void } <id> ( [ { <type> <id> }+, ] ) <block>
<digit> → 0 | 1 | 2 | … | 9
<block> → ‘{‘ <var_decl>* <statement>* ‘}’
<literal> → <int_literal> | <char_literal> | <bool_literal>
<hex_digit> → <digit> | a | b | c | … | f | A | B | C | … | F
<int_literal> → <decimal_literal> | <hex_literal>
<decimal_literal> → <digit> <digit>*
<hex_literal> → 0x <hex_digit> <hex_digit>*
Related Lexer rules :
NUMBER : [0-9]+;
fragment ALPHA : [_a-zA-Z0-9];
fragment DIGIT : [0-9];
fragment DECIMAL_LITERAL : DIGIT+;
CHAR_LITERAL : '\'' CHAR '\'';
STRING_LITERAL : '"' CHAR+ '"' ;
COMMENT : '//' ~('\n')* '\n' -> skip;
WS : (' ' | '\n' | '\t' | '\r') + -> skip;
Related Parser rules :
program : CLASS VAR LCURLYBRACE field_decl*method_decl* RCURLYBRACE EOF;
field_decl : data_type field ( COMMA field )* SEMICOLON;
Please let me know if you need further details & I appreciate your help a lot.
The following rules conflict:
VAR : ALPHA+;
...
NUMBER : [0-9]+;
...
INT_LITERAL : DECIMAL_LITERAL | HEX_LITERAL;
They all match 10, but the lexer will always choose VAR since that is the rule defined first.
This is just how ANTLR's lexer works: it tries to match the most characters as possible, and when two (or more) rules all match the same amount of characters, the one defined first "wins".
You will see that it parses correctly if you change field into:
field : VAR | VAR LSQUAREBRACE VAR RSQUAREBRACE;
I have the following in a JavaCC file:
void condition() : {}
{
expression() comp_op() expression()
| condition() (<AND> | <OR>) condition()
}
where <AND> is "&&" and <OR> is "||". This is causing problems due to the fact that it is direct left-recursion. How can I fix this?
A condition is essentially 1 or more of expression comp_op expression separated by an AND or an OR. You could do the following
condition --> simpleCondition ( (<AND> | <OR>) simpleCondition )*
simpleCondition --> expression comp_op expression
I'm new to the language prolog and have been given an assignment regarding parsing in prolog. I need some help in solving the problem.
In the assingment we have the grammar:
Expr ::= + Expr Expr | * Expr Expr | Num | Xer
Xer ::= x | ^ x Num
Num ::= 2 | 3 | .... a Integer (bigger than 1) ...
The token ^ is the same as in math. 5^5 equals 25.
Parse needs to work both ways: a call with an instantiated list to generate an Ast, while
a call with an instantiated Ast should generate similar prefix list.
My assingment says that I need to make a prefix parse that does this:
Example(with the value of Ast removed):
?- parse([+, *, 2, x, ^, x, 5 ], Ast), parse(L, Ast).
X = ...,
L = [+, *, 2, x, ^, x, 5]
I would also like to know how the parse tree will look like.
Prolog has a particular formalism to handle context-free grammars directly: DCGs (Definite Clause Grammars). Your example translates almost immediately into a DCG:
expr --> [+], expr, expr | [*], expr, expr | num | xer.
xer --> [x] | [^], [x], num.
num --> [2] | [3] | [4] | [5].
Now, you already can test sentences:
?- phrase(expr, [+,*,2,x,^,x,5]).
true
; false.
?- phrase(expr, [+,*,*,2,x,^,x,5]).
false.
You can even generate all possible sentences like so:
?- length(L, N), phrase(expr, L).
L = [2], N = 1
; L = [3], N = 1
; ... .
And, finally, you can add the abstract syntax tree to your definition.
expr(plus(A,B)) --> [+], expr(A), expr(B).
expr(mul(A,B)) --> [*], expr(A), expr(B).
expr(Num) --> num(Num).
expr(Xer) --> xer(Xer).
xer(var(x)) --> [x].
xer(pow(var(x),N)) --> [^], [x], num(N).
num(num(2)) --> [2].
num(num(3)) --> [3].
num(num(4)) --> [4].
num(num(5)) --> [5].
So now you can use it as desired:
?- phrase(expr(AST), [+,*,2,x,^,x,5]), phrase(expr(AST),L).
AST = plus(mul(num(2),var(x)),pow(var(x),num(5))),
L = [+,*,2,x,^,x,5]
; false.
Just a nitpick: The interface predicate to DCGs is phrase/2 not parse/2.
I'm making a simple propositional logic parser on happy based on this BNF definition of the propositional logic grammar, this is my code
{
module FNC where
import Data.Char
import System.IO
}
-- Parser name, token types and error function name:
--
%name parse Prop
%tokentype { Token }
%error { parseError }
-- Token list:
%token
var { TokenVar $$ } -- alphabetic identifier
or { TokenOr }
and { TokenAnd }
'¬' { TokenNot }
"=>" { TokenImp } -- Implication
"<=>" { TokenDImp } --double implication
'(' { TokenOB } --open bracket
')' { TokenCB } --closing bracket
'.' {TokenEnd}
%left "<=>"
%left "=>"
%left or
%left and
%left '¬'
%left '(' ')'
%%
--Grammar
Prop :: {Sentence}
Prop : Sentence '.' {$1}
Sentence :: {Sentence}
Sentence : AtomSent {Atom $1}
| CompSent {Comp $1}
AtomSent :: {AtomSent}
AtomSent : var { Variable $1 }
CompSent :: {CompSent}
CompSent : '(' Sentence ')' { Bracket $2 }
| Sentence Connective Sentence {Bin $2 $1 $3}
| '¬' Sentence {Not $2}
Connective :: {Connective}
Connective : and {And}
| or {Or}
| "=>" {Imp}
| "<=>" {DImp}
{
--Error function
parseError :: [Token] -> a
parseError _ = error ("parseError: Syntax analysis error.\n")
--Data types to represent the grammar
data Sentence
= Atom AtomSent
| Comp CompSent
deriving Show
data AtomSent = Variable String deriving Show
data CompSent
= Bin Connective Sentence Sentence
| Not Sentence
| Bracket Sentence
deriving Show
data Connective
= And
| Or
| Imp
| DImp
deriving Show
--Data types for the tokens
data Token
= TokenVar String
| TokenOr
| TokenAnd
| TokenNot
| TokenImp
| TokenDImp
| TokenOB
| TokenCB
| TokenEnd
deriving Show
--Lexer
lexer :: String -> [Token]
lexer [] = [] -- cadena vacia
lexer (c:cs) -- cadena es un caracter, c, seguido de caracteres, cs.
| isSpace c = lexer cs
| isAlpha c = lexVar (c:cs)
| isSymbol c = lexSym (c:cs)
| c== '(' = TokenOB : lexer cs
| c== ')' = TokenCB : lexer cs
| c== '¬' = TokenNot : lexer cs --solved
| c== '.' = [TokenEnd]
| otherwise = error "lexer: Token invalido"
lexVar cs =
case span isAlpha cs of
("or",rest) -> TokenOr : lexer rest
("and",rest) -> TokenAnd : lexer rest
(var,rest) -> TokenVar var : lexer rest
lexSym cs =
case span isSymbol cs of
("=>",rest) -> TokenImp : lexer rest
("<=>",rest) -> TokenDImp : lexer rest
}
Now, I have two problems here
For some reason I get 4 shift/reduce conflicts, I don't really know where they might be since I thought the precedence would solve them (and I think I followed the BNF grammar correctly)...
(this is rather a Haskell problem) On my lexer function, for some reason I get parsing errors on the line where I say what to do with '¬', if I remove that line it's works, why could that be? (this issue is solved)
Any help would be great.
If you use happy with -i it will generate an info file. The file lists all the states that your parser has. It will also list all the possible transitions for each state. You can use this information to determine if the shift/reduce conflict is one you care about.
Information about invoking happy and conflicts:
http://www.haskell.org/happy/doc/html/sec-invoking.html
http://www.haskell.org/happy/doc/html/sec-conflict-tips.html
Below is some of the output of -i. I've removed all but State 17. You'll want to get a copy of this file so that you can properly debug the problem. What you see here is just to help talk about it:
-----------------------------------------------------------------------------
Info file generated by Happy Version 1.18.10 from FNC.y
-----------------------------------------------------------------------------
state 17 contains 4 shift/reduce conflicts.
-----------------------------------------------------------------------------
Grammar
-----------------------------------------------------------------------------
%start_parse -> Prop (0)
Prop -> Sentence '.' (1)
Sentence -> AtomSent (2)
Sentence -> CompSent (3)
AtomSent -> var (4)
CompSent -> '(' Sentence ')' (5)
CompSent -> Sentence Connective Sentence (6)
CompSent -> '¬' Sentence (7)
Connective -> and (8)
Connective -> or (9)
Connective -> "=>" (10)
Connective -> "<=>" (11)
-----------------------------------------------------------------------------
Terminals
-----------------------------------------------------------------------------
var { TokenVar $$ }
or { TokenOr }
and { TokenAnd }
'¬' { TokenNot }
"=>" { TokenImp }
"<=>" { TokenDImp }
'(' { TokenOB }
')' { TokenCB }
'.' { TokenEnd }
-----------------------------------------------------------------------------
Non-terminals
-----------------------------------------------------------------------------
%start_parse rule 0
Prop rule 1
Sentence rules 2, 3
AtomSent rule 4
CompSent rules 5, 6, 7
Connective rules 8, 9, 10, 11
-----------------------------------------------------------------------------
States
-----------------------------------------------------------------------------
State 17
CompSent -> Sentence . Connective Sentence (rule 6)
CompSent -> Sentence Connective Sentence . (rule 6)
or shift, and enter state 12
(reduce using rule 6)
and shift, and enter state 13
(reduce using rule 6)
"=>" shift, and enter state 14
(reduce using rule 6)
"<=>" shift, and enter state 15
(reduce using rule 6)
')' reduce using rule 6
'.' reduce using rule 6
Connective goto state 11
-----------------------------------------------------------------------------
Grammar Totals
-----------------------------------------------------------------------------
Number of rules: 12
Number of terminals: 9
Number of non-terminals: 6
Number of states: 19
That output basically says that it runs into a bit of ambiguity when it's looking at connectives. It turns out, the slides you linked mention this (Slide 11), "ambiguities are resolved through precedence ¬∧∨⇒⇔ or parentheses".
At this point, I would recommend looking at the shift/reduce conflicts and your desired precedences to see if the parser you have will do the right thing. If so, then you can safely ignore the warnings. If not, you have more work for yourself.
I can answer No. 2:
| c== '¬' == TokenNot : lexer cs --problem here
-- ^^
You have a == there where you should have a =.