Parser Implementation - parsing

Hi, I am trying to implement a parser for a simple language with a grammar like this:
program ::= "program" declarations "begin" statements "end"
declaration ::= "var" ident "as" type
type ::= "string" | "int"
I have the first two done. How would I write the rule for type?
program( prog( DECLS, STATS ) ) -->
[ 'program' ], declarations( DECLS ),
[ 'begin' ], statements( STATS ), [ 'end' ].
declaration( decl( IDENT, TYPE ) ) -->
[ 'var' ], ident( IDENT ), [ 'as' ], type( TYPE ).

Your grammar might be underspecified. In fact you do not define how keywords are separated from other tokens like identifiers. There are programming languages, where you do not need to separate keywords from identifiers. And there are other programming languages where some whitespace or layout character is needed.
In your case, is "varaasint" a valid declaration? Your grammar suggests it. Or do you have to write "var a as int".
You might want to look into this answer for more.

type(string) --> ['string'].
type(int) --> ['int'].
(Actually, the single quotes are not required.)
You could use | or ; but that would complicate the way you return the type you found.

You are missing the statements rule!
Anyway, DCG rules are just plain syntactic sugar on top of Prolog, so you can use any Prolog feature you like. If you need to keep the grammar compact:
type(type(T)) --> [T], {memberchk(T, [int, string])}.
Braces let you mix ordinary Prolog goals with grammar rules.
As @false noted, your grammar is useful only if you have a tokenizer that splits your input and discards whitespace. Or you could handle that more directly, using this schema (beware, untested code):
program( prog( DECLS, STATS ) ) -->
s, "program", s, declarations( DECLS ),
s, "begin", s, statements( STATS ), s, "end", s.
declaration( decl( IDENT, TYPE ) ) -->
"var", s, ident( IDENT ), s, "as", s, type( TYPE ).
declarations([D|Ds]) --> declaration(D), declarations(Ds).
declarations([]) --> [].
type(type(int)) --> "int".
type(type(string)) --> "string".
% skip 1 or more whitespace
s --> (" " ; "\n"), (s ; []).
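For comparison, the tokenizer route mentioned above can be sketched outside Prolog. The Python below is only an illustration under stated assumptions: tokens are separated by whitespace, identifiers are runs of letters, and the names tokenize and parse_declaration are made up for this example.

```python
KEYWORDS = {"program", "begin", "end", "var", "as"}
TYPES = {"int", "string"}


def tokenize(text):
    """Split the input on whitespace; the whitespace itself is discarded."""
    return text.split()


def parse_declaration(tokens, pos=0):
    """Parse  "var" ident "as" type  starting at tokens[pos].

    Returns (("decl", ident, type), next_pos) or raises SyntaxError.
    """
    def expect(word, at):
        if at >= len(tokens) or tokens[at] != word:
            raise SyntaxError(f"expected {word!r} at token {at}")

    expect("var", pos)
    if pos + 1 >= len(tokens):
        raise SyntaxError("expected identifier")
    ident = tokens[pos + 1]
    # Assumption: an identifier is alphabetic and is not a keyword or type name.
    if not ident.isalpha() or ident in (KEYWORDS | TYPES):
        raise SyntaxError(f"bad identifier {ident!r}")
    expect("as", pos + 2)
    if pos + 3 >= len(tokens) or tokens[pos + 3] not in TYPES:
        raise SyntaxError("expected type")
    return ("decl", ident, tokens[pos + 3]), pos + 4
```

With this split, "var a as int" yields four tokens and parses, while "varaasint" is a single token and is rejected, which answers the separation question by construction.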

Related

Validating a "break" statement with a recursive descent parser

In Crafting Interpreters, we implement a little programming language using a recursive descent parser. Among many other things, it has these statements:
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" statement ;
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
One of the exercises is to add a break statement to the language. Also, it should be a syntax error to have this statement outside a loop. Naturally, it can appear inside other blocks, if statements etc. if those are inside a loop.
My first approach was to create a new rule, whileBody, to accept break:
## FIRST TRY
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" whileBody ;
whileBody → statement
| break ;
break → "break" ";" ;
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
But we have to accept break inside nested loops, if conditionals etc. What I could imagine is, I'd need a new rule for blocks and conditionals which accept break:
## SECOND TRY
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block ;
block → "{" declaration* "}" ;
whileStmt → "while" "(" expression ")" whileBody ;
whileBody → statement
| break
| whileBlock
| whileIfStmt
whileBlock→ "{" (declaration | break)* "}" ;
whileIfStmt → "if" "(" expression ")" whileBody ( "else" whileBody )? ;
break → "break" ";"
ifStmt → "if" "(" expression ")" statement ( "else" statement )? ;
It is not infeasible for now, but it can be cumbersome to handle it once the language grows. It is boring and error-prone to write even today!
I looked for inspiration in C and Java BNF specifications. Apparently, none of those specifications prohibit the break outside loop. I guess their parsers have ad hoc code to prevent that. So, I followed suit and added code into the parser to prevent break outside loops.
TL;DR
My questions are:
Would the approach of my second try even work? In other words, could a recursive descent parser handle a break statement that only appears inside loops?
Is there a more practical way to bake the break command inside a syntax specification?
Or is the standard way indeed to change the parser to prevent breaks outside loops while parsing?
Attribute grammars are good at this sort of thing. Define an inherited attribute (I'll call it LC for loop count). The 'program' non-terminal passes LC = 0 to its children; loops pass LC = $LC + 1 to their children; all other constructs pass LC = $LC to their children. Make the rule for 'break' syntactically valid only if $LC > 0.
There is no standard syntax for attribute grammars, or for using attribute values in guards (as I am suggesting for 'break'), but using Prolog definite-clause grammar notation your grammar might look something like the following. I have added a few notes on DCG notation, in case it's been too long since you have used them.
/* nt(X) means, roughly, pass the value X as an inherited attribute.
** In a recursive-descent system, it can be passed as a parameter.
** N.B. in definite-clause grammars, semicolon separates alternatives,
** and full stop ends a rule.
*/
/* DCG doesn't have regular-right-part rules, so we have to
** handle repetition via recursion.
*/
program -->
statement(0);
statement(0), program.
statement(LC) -->
exprStmt(LC);
ifStmt(LC);
printStmt(LC);
whileStmt(LC);
block(LC);
break(LC).
block(LC) -->
"{", star_declaration(LC), "}".
/* The notation [] denotes the empty list, and matches zero
** tokens in the input.
*/
star_declaration(LC) -->
[];
declaration(LC), star_declaration(LC).
/* On the RHS of a rule, braces { ... } contain Prolog code. Here,
** the code "LC2 is LC + 1" adds 1 to LC and binds LC2 to that value.
*/
whileStmt(LC) -->
{ LC2 is LC + 1 }, "while", "(", expression(LC2), ")", statement(LC2).
ifStmt(LC) --> "if", "(", expression(LC), ")", statement(LC), opt_else(LC).
opt_else(LC) -->
"else", statement(LC);
[].
/* The definition of break checks the value of the loop count:
** "LC > 0" succeeds if LC is greater than zero, and allows the
** parse to succeed. If LC is not greater than zero, the expression
** fails. And since there is no other rule for 'break', any attempt
** to parse a 'break' rule when LC = 0 will fail.
*/
break(LC) --> { LC > 0 }, "break", ";".
Nice introductions to attribute grammars can be found in Grune and Jacobs, Parsing Techniques, and in the Springer volumes Lecture Notes in Computer Science 461 (Attribute Grammars and Their Applications, ed. P. Deransart and M. Jourdan) and 545 (Attribute Grammars, Applications, and Systems, ed. H. Alblas and B. Melichar).
The technique of duplicating some productions in order to distinguish two situations (am I in a loop or not?), as illustrated in the answer by @rici, can be regarded as a way of pushing a Boolean attribute into the non-terminal names.
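The loop-count idea does not depend on Prolog. As a rough sketch, here is the same inherited attribute written as a parameter of a recursive check over an already-built syntax tree, in Python. The tuple-based tree shape and the name check_breaks are assumptions made for this example; they do not come from Crafting Interpreters.

```python
def check_breaks(node, loop_count=0):
    """Return True if every 'break' in the tree sits inside a loop.

    Assumed node shapes (hypothetical, for illustration only):
      ("break",)
      ("expr", text) / ("print", text)
      ("block", [stmt, ...])
      ("while", cond, body)
      ("if", cond, then_stmt, else_stmt_or_None)
    """
    kind = node[0]
    if kind == "break":
        # The guard from the grammar: valid only if LC > 0.
        return loop_count > 0
    if kind == "block":
        return all(check_breaks(s, loop_count) for s in node[1])
    if kind == "while":
        # Loops pass LC + 1 down to their body.
        return check_breaks(node[2], loop_count + 1)
    if kind == "if":
        branches = [node[2]] + ([node[3]] if node[3] is not None else [])
        return all(check_breaks(b, loop_count) for b in branches)
    # Expression and print statements cannot contain a break.
    return True
```

Every construct other than the loop hands loop_count down unchanged, which is exactly the "LC = $LC" case in the description above.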
Would the approach of my second try even work? In other words, could a recursive descent parser handle a break statement that only appears inside loops?
Sure. But you need a lot of duplication. Since while is not the only loop construct, I've used a different way of describing the alternatives, which consists of adding _B to the name of non-terminals which might include break statements.
declaration → varDecl
| statement
declaration_B → varDecl
| statement_B
statement → exprStmt
| ifStmt
| printStmt
| whileStmt
| block
statement_B → exprStmt
| printStmt
| whileStmt
| breakStmt
| ifStmt_B
| block_B
breakStmt → "break" ";"
ifStmt → "if" "(" expression ")" statement ( "else" statement )?
ifStmt_B → "if" "(" expression ")" statement_B ( "else" statement_B )?
whileStmt → "while" "(" expression ")" statement_B ;
block → "{" declaration* "}"
block_B → "{" declaration_B* "}"
Not all statement types need to be duplicated. Non-compound statements like exprStmt don't, because they cannot possibly include a break statement (or any other statement type). And the statement which is the target of a loop statement like whileStmt can always include break, regardless of whether the while was inside a loop or not.
Is there a more practical way to bake the break command inside a syntax specification?
Not unless your syntax specification has marker macros, like the specification used to describe ECMAScript.
Is there a different way to do this?
Since this is a top-down (recursive descent) parser, it's pretty straightforward to handle this condition in the parser's execution. You just need to add an argument to every (or many) parsing functions which specifies whether a break is possible or not. Any parsing function called by whileStmt would set that argument to True (or an enumeration indicating that break is possible), while other statement types would just pass the parameter through, and the top-level parsing function would set the argument to False. The breakStmt implementation would just return failure if it is called with False.
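A minimal sketch of that flag-passing scheme in Python follows. It is not the book's parser: the class name Parser, the in_loop parameter, and the simplification that conditions and expression statements are single tokens are all assumptions made to keep the example short.

```python
class ParseError(Exception):
    pass


class Parser:
    """Tiny recursive descent parser over a list of string tokens."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, tok):
        if self.peek() != tok:
            raise ParseError(f"expected {tok!r}, got {self.peek()!r}")
        self.pos += 1

    def program(self):
        stmts = []
        while self.peek() is not None:
            # The top level sets the flag to False.
            stmts.append(self.statement(in_loop=False))
        return stmts

    def condition(self):
        # Simplification: a condition is "(" followed by one token and ")".
        self.expect("(")
        cond = self.peek()
        if cond is None:
            raise ParseError("unexpected end of input in condition")
        self.pos += 1
        self.expect(")")
        return cond

    def statement(self, in_loop):
        tok = self.peek()
        if tok is None:
            raise ParseError("unexpected end of input")
        if tok == "break":
            if not in_loop:
                raise ParseError("'break' outside of a loop")
            self.pos += 1
            self.expect(";")
            return ("break",)
        if tok == "while":
            self.pos += 1
            cond = self.condition()
            # The loop body is parsed with the flag set to True.
            return ("while", cond, self.statement(in_loop=True))
        if tok == "if":
            self.pos += 1
            cond = self.condition()
            then = self.statement(in_loop)  # pass the flag through
            other = None
            if self.peek() == "else":
                self.pos += 1
                other = self.statement(in_loop)
            return ("if", cond, then, other)
        if tok == "{":
            self.pos += 1
            body = []
            while self.peek() != "}":
                if self.peek() is None:
                    raise ParseError("unterminated block")
                body.append(self.statement(in_loop))
            self.pos += 1
            return ("block", body)
        # Simplification: an expression statement is one token followed by ";".
        self.pos += 1
        self.expect(";")
        return ("expr", tok)
```

For example, Parser("while ( x ) { if ( y ) break ; z ; }".split()).program() builds a tree, while the same call on "if ( x ) break ;" raises ParseError.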

Difficulties implementing DSL in Prolog from EBNF using DCG

I'm working on an implementation of the Google Protobuf compiler for proto files in Prolog, generating Prolog programs. The Prolog system is SWI-Prolog.
I'm translating EBNF definitions into DCG and ran across a few problems:
I have to deal with the [ ... ] and { ... } EBNF constructs, meaning optional (applied zero or one times) and repetitive (applied any number of times);
I have to insert the callbacks into DCG code to implement the part of compiler functionality (syntax switching/importing/ etc.) using DCG's construct { ... }, which allows goals in Prolog syntax inside DCG rules.
For the optional and repetitive constructs I'm using the meta-predicates $$rep/1 and $$opt/1:
EBNF
decimals = decimalDigit { decimalDigit }
exponent = ( "e" | "E" ) [ "+" | "-" ] decimals
DCG
decimals --> decimalDigit, '$$rep'( decimalDigit ).
exponent --> ( "e"; "E" ), '$$opt'( "+"; "-" ), decimals.
'$$rep'( Goal ) :- repeat, call(Goal); !, fail.
'$$opt'( Goal ) :- once(Goal) ; \+ Goal.
"Callback:"
import --> "import", opt(( "weak" ; "public", { record(public)} )), strLit,
{
import(public, strlit )
}, ";".
This looks awkward (not to say ugly) to me...
Questions:
What's wrong with my solutions?
Should I manually translate EBNF into DCG without using meta-predicates?
What is the alternative for the awkward penetration into a DCG rule?
From a quick first glance, the main issue is that you are uncleanly intermingling DCGs with regular Prolog predicates.
Stay within DCGs to define all nonterminals. For example:
optional(NT) --> [] | NT.
once_or_more(NT) --> NT, or_more(NT).
or_more(NT) --> [] | NT, or_more(NT).
With the following example definition:
a --> [a].
We can post:
?- phrase(optional(a), Ls).
Ls = [] ;
Ls = [a].
?- phrase(once_or_more(a), Ls).
Ls = [a] ;
Ls = [a, a] ;
Ls = [a, a, a] ;
Ls = [a, a, a, a] ;
Ls = [a, a, a, a, a] .
This seems to work as you need it.
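The same "define the EBNF operators once, then reuse them" idea can be sketched in Python with small parser combinators. This is an illustration, not a translation of the DCG: the names lit, seq, optional and zero_or_more are hypothetical, and these parsers are greedy and do not backtrack the way the DCG versions do.

```python
def lit(chars):
    """Match exactly one character drawn from chars."""
    def parse(text, pos):
        if pos < len(text) and text[pos] in chars:
            return pos + 1
        return None
    return parse


def seq(*parsers):
    """Match each parser in turn; fail (None) as soon as one fails."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def optional(p):
    """EBNF [ p ]: zero or one occurrence."""
    def parse(text, pos):
        end = p(text, pos)
        return pos if end is None else end
    return parse


def zero_or_more(p):
    """EBNF { p }: repeat p greedily until it fails or stops consuming."""
    def parse(text, pos):
        while True:
            end = p(text, pos)
            if end is None or end == pos:
                return pos
            pos = end
    return parse


# decimals = decimalDigit { decimalDigit }
decimal_digit = lit("0123456789")
decimals = seq(decimal_digit, zero_or_more(decimal_digit))
# exponent = ( "e" | "E" ) [ "+" | "-" ] decimals
exponent = seq(lit("eE"), optional(lit("+-")), decimals)


def matches(parser, text):
    """True if parser consumes the whole of text."""
    return parser(text, 0) == len(text)
```

Each grammar rule then reads almost like its EBNF original, and the operators themselves are written only once.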
For the callback, you can simply pass around the predicate that you need to call, with the general outline:
parse_with_callback(Goal) -->
...,
{ Goal },
...
This seems quite OK.
If such patterns arise frequently, you can always consider generating such DCGs from a different representation that lets you represent the task more cleanly.

ANTLR4 can't parse Integer if a parser rule has its own numeric literal

I am struggling a bit with trying to define integers in my grammar.
Let's say I have this small grammar:
grammar Hello;
r : 'hello' INTEGER;
INTEGER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
If I then type in
hello 5
it parses correctly.
However, if I have an additional parser rule (even if it's unused) which defines a token '5',
then I can't parse the previous example anymore.
So this grammar:
grammar Hello;
r : 'hello' INTEGER;
unusedRule: 'hi' '5';
INTEGER : [0-9]+ ;
WS : [ \t\r\n]+ -> skip ;
with
hello 5
won't parse anymore. It gives me the following error:
Hello::r:1:6: mismatched input '5' expecting INTEGER
How is that possible and how can I work around this?
When you define a parser rule like
unusedRule: 'hi' '5';
ANTLR creates implicit lexer tokens for the subterms. Since they are automatically created in the lexer, you have no control over where they sit in the precedence evaluation of lexer rules.
Consequently, the best policy is to never use literals in parser rules; always explicitly define your tokens.
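To see why the implicit token wins, here is a toy model of the tie-breaking in Python. It is a simplified sketch, not ANTLR's implementation, and it rests on two assumptions: the lexer prefers the longest match and, on a tie, the rule that comes first, and the implicit literal tokens are placed ahead of the explicitly written lexer rules. The function make_lexer and the token-type names in quotes are invented for this example.

```python
import re


def make_lexer(rules):
    """rules: ordered list of (token_type, regex).

    The longest match wins; on a tie the earlier rule wins.
    """
    compiled = [(name, re.compile(rx)) for name, rx in rules]

    def lex(text):
        pos, out = 0, []
        while pos < len(text):
            best = None
            for name, rx in compiled:
                m = rx.match(text, pos)
                # Strictly longer only, so earlier rules keep ties.
                if m and m.end() > pos and (best is None or m.end() > best[1]):
                    best = (name, m.end())
            if best is None:
                raise ValueError(f"no token at position {pos}")
            name, end = best
            if name != "WS":  # mimic "-> skip"
                out.append((name, text[pos:end]))
            pos = end
        return out

    return lex


# Assumed ordering: implicit literal tokens first, explicit rules after.
implicit = [("'hello'", r"hello"), ("'hi'", r"hi"), ("'5'", r"5")]
explicit = [("INTEGER", r"[0-9]+"), ("WS", r"[ \t\r\n]+")]
lex = make_lexer(implicit + explicit)
```

Under this model, lex("hello 5") produces a '5' token rather than INTEGER, so rule r no longer matches, whereas lex("hello 55") still produces INTEGER because the longer match wins.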

Layout in Rascal

When I import the Lisra recipe,
import demo::lang::Lisra::Syntax;
This creates the syntax:
layout Whitespace = [\t-\n\r\ ]*;
lexical IntegerLiteral = [0-9]+ !>> [0-9];
lexical AtomExp = (![0-9()\t-\n\r\ ])+ !>> ![0-9()\t-\n\r\ ];
start syntax LispExp
= IntegerLiteral
| AtomExp
| "(" LispExp* ")"
;
Through the start syntax-definition, layout should be ignored around the input when it is parsed, as is stated in the documentation: http://tutor.rascal-mpl.org/Rascal/Declarations/SyntaxDefinition/SyntaxDefinition.html
However, when I type:
rascal>(LispExp)` (something)`
This gives me a concrete syntax fragment error (or a ParseError when using the parse-function), in contrast to:
rascal>(LispExp)`(something)`
Which successfully parses. I tried this both with one of the latest versions of Rascal as well as the Eclipse plugin version. Am I doing something wrong here?
Thank you.
Ps. Lisra's parse-function:
public Lval parse(str txt) = build(parse(#LispExp, txt));
Also fails on the example:
rascal>parse(" (something)")
|project://rascal/src/org/rascalmpl/library/ParseTree.rsc|(10329,833,<253,0>,<279,60>): ParseError(|unknown:///|(0,1,<1,0>,<1,1>))
at *** somewhere ***(|project://rascal/src/org/rascalmpl/library/ParseTree.rsc|(10329,833,<253,0>,<279,60>))
at parse(|project://rascal/src/org/rascalmpl/library/demo/lang/Lisra/Parse.rsc|(163,3,<7,44>,<7,47>))
at $shell$(|stdin:///|(0,13,<1,0>,<1,13>))
When you define a start non-terminal, Rascal defines two non-terminals in one go:
rascal>start syntax A = "a";
ok
One non-terminal is A, the other is start[A]. Given a layout non-terminal in scope, say L, the latter is automatically defined by (something like) this rule:
syntax start[A] = L before A top L after;
If you call a parser or wish to parse a concrete fragment, you can use either non-terminal:
parse(#start[A], " a ") // parse using the start non-terminal and extra layout
parse(#A, "a") // parse only an A
(start[A]) ` a ` // concrete fragment for the start-non-terminal
(A) `a` // concrete fragment for only an A
[start[A]] " a "
[A] "a"

What do you call this and how to read this? (Parsing for Scheme)

At the moment, I am learning how to parse Scheme in Java. Here is the basic list (I do not know what its formal name is) Edit: Grammar!
exp -> ( rest
| #f
| #t
| ' exp
| integer_constant
| string_constant
| identifier
rest -> )
| exp+ [ . exp ] )
My question is: What is that list called, like what is the formal name for it? "Parse list"? Edit: According to a comment, it's called grammar.
And how to read it? My guess is that the expression goes in between the left and right parenthesis, example: ( exp ).
Additionally I guess any of the objects between the lines exp -> ( rest and rest ->), #f, #t, ' exp, integer_constant, string_constant, identifier go in place of the expression in the previous example. Like for example: ( #t )
And the last item on the list is | exp+ [ . exp] ), which I suppose is another expression to the right of the first right parenthesis such as for example with the previous example: ((#t) exp)?
Lastly, this bit [ . exp], the bracket just says it is optional?
If I am wrong, please correct me.
This is called a grammar. There are many different syntaxes for writing down grammars, but they are all quite similar to each other.
Here -> can be read as "is", | as "or", + as one or more, and [], as you suspected, as "optionally". The other symbols used here just stand for themselves. So this grammar can be read like this:
1. An expression is:
an opening parenthesis followed by a "rest" (see 2)
OR a hash mark followed by the letter f
OR a hash mark followed by the letter t
OR a single quote followed by an expression
OR an integer constant (like 123)
OR a string constant (like "foo")
OR an identifier (like foo)
2. A "rest" is:
a closing parenthesis
OR one or more expressions, optionally followed by a dot and one other expression, followed by a closing parenthesis
So foo is an expression (because identifiers are expressions), () is an expression (because ) is a "rest" and ( rest is an expression), (foo) is an expression (because foo is an expression, exp ) is a "rest" and ( rest is an expression), and so on.
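Reading the grammar this way maps directly onto code: one function per rule. The Python sketch below is a recognizer only (it says yes or no and builds nothing), and its tokenizer is an assumption: it treats parentheses and the quote as single tokens, double-quoted text as a string, and any other run of non-space characters (including a lone ".") as one token. The names tokenize, parse_exp, parse_rest and is_exp are made up for this example.

```python
import re

# Parens and quote are single tokens; "..." is a string; anything else
# is a run of characters that are not whitespace, parens, or quotes.
TOKEN = re.compile(r"""[()']|"[^"]*"|[^\s()'"]+""")


def tokenize(text):
    return TOKEN.findall(text)


def parse_exp(tokens, pos):
    """exp -> ( rest | #f | #t | ' exp | integer | string | identifier

    Returns the position just after the expression, or raises SyntaxError.
    """
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    if tok == "(":
        return parse_rest(tokens, pos + 1)
    if tok == "'":
        return parse_exp(tokens, pos + 1)
    if tok in (")", "."):
        raise SyntaxError(f"unexpected {tok!r}")
    return pos + 1  # #f, #t, integer, string, or identifier


def parse_rest(tokens, pos):
    """rest -> ) | exp+ [ . exp ] )"""
    if pos < len(tokens) and tokens[pos] == ")":
        return pos + 1
    pos = parse_exp(tokens, pos)  # at least one exp
    while pos < len(tokens) and tokens[pos] not in (")", "."):
        pos = parse_exp(tokens, pos)
    if pos < len(tokens) and tokens[pos] == ".":
        pos = parse_exp(tokens, pos + 1)  # optional dotted tail
    if pos >= len(tokens) or tokens[pos] != ")":
        raise SyntaxError("expected ')'")
    return pos + 1


def is_exp(text):
    """True if the whole text is exactly one expression."""
    tokens = tokenize(text)
    try:
        return parse_exp(tokens, 0) == len(tokens)
    except SyntaxError:
        return False
```

The three examples from the explanation, foo, () and (foo), are all accepted by is_exp, and so is a dotted form such as (a b . c).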
