PEST Parser does not recognise comments

I'm trying to write a parser with PEST, the Rust parser generator, but I'm having trouble with a fairly simple grammar. file is the top-level rule in the grammar; it contains the SOI and EOI rules.
// example.pest
WHITESPACE = _{ "\n" | " " }
COMMENT = _{ "(*" ~ ANY* ~ "*)" }
KEYWORD = { ^"keyword" }
file = _{ SOI ~ KEYWORD ~ EOI }
Here are the contents of the file I'm trying to parse:
(*
*)
keyword
The generated parser cannot parse this file. The error looks like this:
1 | (*␊
| ^---
|
= expected KEYWORD
The built-in COMMENT rule should handle this situation. Is whitespace handled differently inside comments?
How do I properly write a grammar with comments?

There is actually an error in the logic of the grammar as given here. PEG repetitions are greedy and never backtrack: in this rule, ANY* consumes everything up to the end of the file, so the closing *) can never match, the whole COMMENT rule fails, and the parser falls back to expecting KEYWORD at the very first character.
COMMENT = _{ "(*" ~ ANY* ~ "*)" }
The rule should be
COMMENT = _{ "(*" ~ (!"*)" ~ ANY)* ~ "*)" }
This means that any number of characters will be matched, but nothing that looks like *). Once *) is encountered, the repetition stops; the next part of the sequence is then reached, *) is matched, and the whole rule succeeds.
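To confirm the fix end to end, here is a minimal sketch (assuming the pest and pest_derive crates, with the corrected grammar saved as example.pest):
use pest::Parser;
use pest_derive::Parser;

#[derive(Parser)]
#[grammar = "example.pest"] // the grammar above, with the fixed COMMENT rule
struct ExampleParser;

fn main() {
    // The comment and newlines are consumed implicitly via the COMMENT
    // and WHITESPACE rules, leaving only KEYWORD to match before EOI.
    let input = "(*\n*)\nkeyword";
    match ExampleParser::parse(Rule::file, input) {
        Ok(pairs) => println!("parsed: {pairs:?}"),
        Err(e) => eprintln!("parse error:\n{e}"),
    }
}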

Related

Flex and Bison - Grammar that sometimes cares about spaces

Currently I'm trying to implement a grammar which is very similar to Ruby. To keep it simple, the lexer currently ignores space characters.
However, in some cases a space character makes a big difference:
def some_callback(arg=0)
  arg * 100
end
some_callback (1 + 1) + 1 # 300
some_callback(1 + 1) + 1 # 201
some_callback +1 # 100
some_callback+1 # 1
some_callback + 1 # 1
So currently all whitespace is ignored by the lexer:
{WHITESPACE} { ; }
And the grammar says, for example, something like:
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
One way I can think of to solve this problem would be to explicitly add whitespace throughout the grammar, but doing so would increase its complexity a lot:
// OLD:
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression T_ADD MultiplicativeExpression
| AdditiveExpression T_SUB MultiplicativeExpression
;
// NEW:
_:
/* empty */
| WHITESPACE _;
AdditiveExpression:
MultiplicativeExpression
| AdditiveExpression _ T_ADD _ MultiplicativeExpression
| AdditiveExpression _ T_SUB _ MultiplicativeExpression
;
//...
UnaryExpression:
PostfixExpression
| T_PLUS UnaryExpression
| T_MINUS UnaryExpression
;
So I would like to ask whether there is any best practice on how to handle this in the grammar.
Thank you in advance!
Without having a full specification of the syntax you are trying to parse, it's not easy to give a precise answer. In the following, I'm assuming that those are the only two places where the presence (or absence) of whitespace between two tokens affects the parse.
Differentiating between f(...) and f (...) occurs in a surprising number of languages. One common strategy is for the lexer to recognize an identifier which is immediately followed by an open parenthesis as a "FUNCTION_CALL" token.
You'll find that in most awk implementations, for example; in awk, the ambiguity between a function call and concatenation is resolved by requiring that the open parenthesis in a function call immediately follow the identifier. Similarly, the C pre-processor macro definition directive distinguishes between #define foo(A) A (the definition of a macro with arguments) and #define foo (A) (an ordinary macro whose expansion starts with a ( token).
If you're doing this with (f)lex, you can use the / trailing-context operator:
[[:alpha:]_][[:alnum:]_]*/"(" { yylval = strdup(yytext); return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]*     { yylval = strdup(yytext); return IDENT; }
The grammar is now pretty straightforward:
call: FUNC_CALL '(' expression_list ')' /* foo(1, 2) */
| IDENT expression_list /* foo (1, 2) */
| IDENT /* foo * 3 */
;
This distinction will not be useful in all syntactic contexts, so it will often prove useful to add a non-terminal which will match either identifier form:
name: IDENT | FUNC_CALL
But you will need to be careful with this non-terminal. In particular, using it as part of the expression grammar could lead to parser conflicts. But in other contexts, it will be fine:
func_defn: "def" name '(' parameters ')' block "end"
(I'm aware that this is not the precise syntax for Ruby function definitions. It's just for illustrative purposes.)
More troubling is the other ambiguity, in which it appears that the unary operators + and - should be treated as part of an integer literal in certain circumstances. The behaviour of the Ruby parser suggests that the lexer is combining the sign character with an immediately following number in the case where it might be the first argument to a function. (That is, in the context <identifier><whitespace><sign><digits> where <identifier> is not an already declared local variable.)
That sort of contextual rule could certainly be added to the lexical scanner using start conditions, although it's more than a little ugly. Here is a not-fully-fleshed-out implementation, building on the previous one:
%x SIGNED_NUMBERS
%%
[[:alpha:]_][[:alnum:]_]*/"(" { yylval.id = strdup(yytext);
                                return FUNC_CALL; }
[[:alpha:]_][[:alnum:]_]*/[[:blank:]] { yylval.id = strdup(yytext);
                                        if (!is_local(yylval.id))
                                          BEGIN(SIGNED_NUMBERS);
                                        return IDENT; }
[[:alpha:]_][[:alnum:]_]* { yylval.id = strdup(yytext);
                            return IDENT; }
<SIGNED_NUMBERS>[[:blank:]]+ ;
  /* Numeric patterns, one version for each context */
<SIGNED_NUMBERS>[+-]?[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
                                    BEGIN(INITIAL);
                                    return INTEGER; }
[[:digit:]]+ { yylval.integer = strtol(yytext, NULL, 0);
               return INTEGER; }
  /* ... */
  /* If the next character is not a digit or a sign, rescan in INITIAL state */
<SIGNED_NUMBERS>.|\n { yyless(0); BEGIN(INITIAL); }
Another possible solution would be for the lexer to distinguish sign characters which follow a space and are directly followed by a digit, and then let the parser try to figure out whether or not the sign should be combined with the following number. However, this will still depend on being able to distinguish between local variables and other identifiers, which will still require the lexical feedback through the symbol table.
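As a rough sketch of that alternative (the SPACED_SIGN token and the sign field in yylval are assumptions, not part of the original answer), the lexer rule might look like this:
 /* A sign preceded by whitespace and immediately followed by a digit
  * becomes its own token; the parser then decides whether it binds to
  * the number or acts as a binary operator. */
[[:blank:]]+[+-]/[[:digit:]] { yylval.sign = yytext[yyleng - 1];
                               return SPACED_SIGN; }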
It's worth noting that the end result of all this complication is a language whose semantics are not very obvious in some corner cases. The fact that f+3 and f +3 produce different results could easily lead to subtle bugs which might be very hard to detect. In many projects using languages with these kinds of ambiguities, the project style guide will prohibit legal constructs with unclear semantics. You might want to take this into account in your language design, if you have not already done so.

What does ~ mean inside a Grammar (in Perl 6)?

I found a tilde ~ in this Config::INI Perl 6 Grammar:
token header { ^^ \h* '[' ~ ']' $<text>=<-[ \] \n ]>+ \h* <.eol>+ }
There are no tildes ~ in the text I am processing. I know that '[' ~ ']' is important because omitting any or all of '[', ~, and ']' makes the grammar no longer match my text.
Since I knew what the pattern was that I was matching, I changed it so that the square brackets were around the text expression, thus:
token header { ^^ \h* '[' $<text>=<-[ \] \n ]>+ ']' \h* <.eol>+ }
So it seems to me that '[' ~ ']' is really saying: match an opening square bracket here and expect the closing bracket afterwards.
Anyway, I know that in normal Perl 6 syntax, the tilde ~ is used for concatenating strings. But this obviously means something different inside this Grammar. (In Perl 6, you can use Grammars to extract complicated data structures from text. They are like regular expressions taken to the next level.)
Anyway, I searched the documentation for Grammars and for Regular Expressions for a single ~, but I didn't find any mention of it inside a grammar or a regular expression.
Cross-posted on StackOverflow en español
You can find an explanation in the design documents here: https://github.com/perl6/roast/blob/master/S05-metachars/tilde.t#L6-L81
It mostly does what you discovered: replaces the tilde with the expression that follows the right bracket, and searches for it between the bracket characters. However, it adds some extra magic to help the expression recognize the terminating bracket and to provide a more useful error message if the final bracket isn't found. So you'll usually get the same results doing it either way, but not always.
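A small sketch of that difference, using a hypothetical grammar built from the header token above: both versions match well-formed input, but only the ~ version reports the missing closing bracket.
grammar Header {
    token TOP { '[' ~ ']' $<text>=<-[ \] \n ]>+ }
}
say Header.parse('[section]')<text>;  # 「section」
# An unterminated header now dies with a message about the missing ']'
# instead of the match silently failing:
try say Header.parse('[section');
say $!.message if $!;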

Layout in Rascal

When I import the Lisra recipe,
import demo::lang::Lisra::Syntax;
This creates the syntax:
layout Whitespace = [\t-\n\r\ ]*;
lexical IntegerLiteral = [0-9]+ !>> [0-9];
lexical AtomExp = (![0-9()\t-\n\r\ ])+ !>> ![0-9()\t-\n\r\ ];
start syntax LispExp
= IntegerLiteral
| AtomExp
| "(" LispExp* ")"
;
Through the start syntax definition, layout should be ignored around the input when it is parsed, as stated in the documentation: http://tutor.rascal-mpl.org/Rascal/Declarations/SyntaxDefinition/SyntaxDefinition.html
However, when I type:
rascal>(LispExp)` (something)`
This gives me a concrete syntax fragment error (or a ParseError when using the parse function), in contrast to:
rascal>(LispExp)`(something)`
Which successfully parses. I tried this both with one of the latest versions of Rascal and with the Eclipse plugin version. Am I doing something wrong here?
Thank you.
P.S. Lisra's parse function:
public Lval parse(str txt) = build(parse(#LispExp, txt));
Also fails on the example:
rascal>parse(" (something)")
|project://rascal/src/org/rascalmpl/library/ParseTree.rsc|(10329,833,<253,0>,<279,60>): ParseError(|unknown:///|(0,1,<1,0>,<1,1>))
at *** somewhere ***(|project://rascal/src/org/rascalmpl/library/ParseTree.rsc|(10329,833,<253,0>,<279,60>))
at parse(|project://rascal/src/org/rascalmpl/library/demo/lang/Lisra/Parse.rsc|(163,3,<7,44>,<7,47>))
at $shell$(|stdin:///|(0,13,<1,0>,<1,13>))
When you define a start non-terminal, Rascal defines two non-terminals in one go:
rascal>start syntax A = "a";
ok
One non-terminal is A, the other is start[A]. Given a layout non-terminal in scope, say L, the latter is automatically defined by (something like) this rule:
syntax start[A] = L before A top L after;
If you call a parser or wish to parse a concrete fragment, you can use either non-terminal:
parse(#start[A], " a ") // parse using the start non-terminal and extra layout
parse(#A, "a")          // parse only an A
(start[A]) ` a `        // concrete fragment for the start non-terminal
(A) `a`                 // concrete fragment for only an A
[start[A]] " a "        // the same two parses, written as type casts
[A] "a"

Parse further an expression in a special case

At the moment my frontend can parse normal expressions such as 123, "abcd", "=123", "=TRUE+123"... Here is the related code:
(* in `syntax.ml`: *)
and expression =
| E_integer of int
| E_string of string
(* in `parser.mly`: *)
expression:
| INTEGER { E_integer $1 }
| STRING { E_string $1 }
Now I would like to refine the parser so that, when we meet a string starting with =, we try to evaluate it as a formula, not a literal string. So syntax.ml becomes:
(* in `syntax.ml`: *)
and expression =
| E_integer of int
| E_string of string
| E_formula of formula
and formula =
| F_integer of int
| F_boolean of bool
| F_Add of formula * formula
The question is that I am not sure how to change parser.mly. I tried the following, which did not work (This expression has type string but an expression was expected of type Syntax.formula):
(* in `parser.mly`: *)
expression:
| INTEGER { E_integer $1 }
| STRING {
if String.sub $1 1 1 <> "="
then E_string $1
else E_formula (String.sub $1 2 ((String.length $1) - 1)) }
I don't know how to let the parser know that, for a string beginning with =, it needs to parse further based on the rules for formula... Could anyone help?
Following the comment of gasche:
I agree that I need to have a parser for formula. Now the question is whether I need a separate lexer.mll for formula. I hope not, because it is logical to lex the whole program only once, no? Also, can I add the formula grammar directly to the existing parser.mly?
In the current lexer.mll, I have:
let STRING = double_quote ([^ '\x0D' '\x0A' '\x22'])* double_quote
rule token = parse
| STRING as s { STRING s }
I think I can directly do something here:
let STRING = double_quote ([^ '\x0D' '\x0A' '\x22'])* double_quote
let FORMULA_STRING = double_quote = ([^ '\x0D' '\x0A' '\x22'])* double_quote
rule token = parse
| FORMULA_STRING as fs { XXXXX }
| STRING as s { STRING s }
I am not sure what I should write in place of XXXXX. Should it be Parser_formula.formula token fs, in the case that I have a separate parser_formula.mly? And what if I have only parser.mly, containing all the grammar rules including those for formula?
The problem is with your line
else E_formula (String.sub $1 2 ((String.length $1) - 1))
Instead of (String.sub ...), which has type string, you should return a value of type Syntax.formula. If you had a parse_formula : string -> Syntax.formula function, you could write here:
else E_formula (parse_formula (String.sub $1 2 ((String.length $1) - 1)))
I think you could define such a function by defining the formula grammar as a separate parser first.
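A minimal sketch of that route, assuming menhir and an EOF token (the entry-point name formula_entry is an assumption):
/* in parser.mly: declare a second entry point just for formulas */
%start <Syntax.formula> formula_entry
%%
formula_entry:
| f = formula EOF { f }

(* in OCaml code: run the formula entry point on the text after "=" *)
let parse_formula (s : string) : Syntax.formula =
  Parser.formula_entry Lexer.token (Lexing.from_string s)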
Edit, following your own edit:
If you go the route of calling a different parser for formulas, you don't need to define a different lexer.
If you choose to handle the distinction between strings and formulas at the lexer level (are you sure that's correct? what about real strings that begin with =?), then you don't need a separate parser for formulas: you can have them as rules in your current grammar. But to do that you need your lexer to behave in a more fine-grained way on formulas: instead of just recognizing "=.*" as a single token, it should recognize "= as a beginning-of-formula, and lex the rest of the formula until it encounters the closing ". To avoid conflicts you may also want to handle plain strings with a lexing rule rather than a simple regexp.
If you can get the second approach to work, I think it is indeed a simpler idea.
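Here is a rough sketch of that second approach (everything here — the in_formula flag and the FORMULA_BEGIN, FORMULA_END, PLUS, INT tokens — is an assumption for illustration):
{
  open Parser  (* token type generated from parser.mly *)
  let in_formula = ref false  (* true between "= and the closing " *)
}

rule ordinary = parse
  | '"' '='                    { in_formula := true; FORMULA_BEGIN }
  | '"' [^ '"' '\n']* '"' as s { STRING s }

and formula = parse
  | ['0'-'9']+ as n            { INT (int_of_string n) }
  | '+'                        { PLUS }
  | '"'                        { in_formula := false; FORMULA_END }

{
  (* single entry point for the parser: dispatch on the current mode *)
  let token lexbuf =
    if !in_formula then formula lexbuf else ordinary lexbuf
}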
PS: please use menhir variable naming facilities instead of $1 as soon as the variables are not consecutive (because of intermediary terminals) or you need to repeat it more than once.
Continuing on gasche's answer.
You want to include new syntactic rules in your parser, which means that you need to change the grammar rules in parser.mly to accommodate these new rules.
The String.sub approach is somewhat in the right direction, but you are actually doing by hand what the mly file could let you automate.
Consider your formula type: the F_Add constructor there lets you encode a binary sum formula, containing two sub-formulas.
In the mly file, you could describe it as:
formula:
  INTEGER { F_integer $1 }
| BOOL { F_boolean $1 }
| formula PLUS formula { F_Add ($1, $3) }
;
Note how the grammar rule definition mirrors the formula type definition.
As you can see, the recursive property of formulas is nicely handled by the grammar rule for you.
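One caveat worth noting: formula PLUS formula is ambiguous on its own (it produces a shift/reduce conflict), so the declarations section would also need an associativity for PLUS, for example:
%token PLUS
%left PLUS   /* a + b + c parses as (a + b) + c */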
Concerning lexer.mll, the regular expressions STRING and FORMULA_STRING are exactly the same. If you use them both in the same lexer rule (as in your code snippet), it will not work as you expect. The lexer has no knowledge of what is going on in the parser; it cannot choose to provide a STRING or a FORMULA_STRING when it's convenient for the parser to fill a specific rule. With ocamlyacc (and with the tools it drew inspiration from), it works the other way round: the parser receives tokens which the lexer has recognized from the text stream, and tries to find a rule which corresponds to them, according to what it has already figured out before.
Note that the BOOL terminal must be recognized by lexer.mll (just like INTEGER), so you will need to amend it with the proper regular expression.
Also, you should ask yourself the following questions:
In the =5 formula, isn't there somewhere an expression waiting to be discovered?
If so, could you reformulate the definition of a formula in terms of expressions and new tokens?

Bison matching wrong tokens

While having a problem with PLY, I tried rephrasing the same grammar fragment in bison and encountered a similar problem. This suggests I might be doing something wrong.
A symbolic representation of the grammar fragment looks like this:
document -> fragment?
fragment -> { \n line* \n fragment? }
line -> [^\n]+ \n
The relevant lex lines:
[{}] return *yytext;
[^\n]+ return ANYTHING;
\n return EOL;
The relevant bison lines:
multiline: '{' EOL lines EOL multiline '}'
|
;
lines: lines ANYTHING EOL
|
;
The grammar is deterministic; for all I know it should even be LALR(1) (I haven't really tried to build the tables, though). A document like "{\n\n}" parses OK, but a document where the multiline elements are nested (e.g. "{\n\n{\n\n}}") does not: the lexer sees the final "}}" as one ANYTHING token rather than two '}'s.
What am I doing wrong?
[{}] return *yytext;
[^{}\n]+ return ANYTHING;
\n return EOL;
Lex is greedy: if two patterns match the current input, the longest match wins. In the original lex fragment, the [^\n]+ pattern catches lines with { or } in them. On the input "}}", for example, [^\n]+ matches two characters while [{}] matches only one, so the lexer returns ANYTHING; excluding braces from the ANYTHING pattern fixes this.
