I have this lexeme:
a().length()
And this PEGjs grammar:
start = func_call
func_call = header "(" ")"
header = "a" suffix*
suffix = "(" ")" ("." "length")
In the grammar, I'm parsing a function call. This currently parses, as you can try online in the PEGjs playground.
Input parsed successfully.
However, if I add an asterisk to the end of the suffix production, like so:
suffix = "(" ")" ("." "length")*
Then the input fails to parse:
Line 1, column 13: Expected "(" or "." but end of input found.
I don't understand why. From the documentation:
expression * Match zero or more repetitions of the expression and return their match results in an array. The matching is greedy
This should be a greedy match of "." "length" which should match one time. But instead it fails to match at all. Is this related to the nested use of * in header?
* matches zero or more repetitions of its operand, as you say.
So when you write
suffix = "(" ")" ("." "length")*
You are saying that a suffix is () followed by zero or more repetitions of .length. Thus it could be (), which has zero repetitions.
Thus, suffix* (in header) can match ().length() as two repetitions of suffix, first ().length, and then (). That would be the greedy match, which is PEG's modus operandi.
But after that, there is no () left to be matched by func_call. Hence the parse error.
Related
Here is the grammar of the language id' like to parse:
expr ::= val | const | (expr) | unop expr | expr binop expr
var ::= letter
const ::= {digit}+
unop ::= -
binop ::= /*+-
I'm using an example from the haskell wiki.
The semantics and token parser are not shown here.
exprparser = buildExpressionParser table term <?> "expression"
table = [ [Prefix (m_reservedOp "-" >> return (Uno Oppo))]
,[Infix (m_reservedOp "/" >> return (Bino Quot)) AssocLeft
,Infix (m_reservedOp "*" >> return (Bino Prod)) AssocLeft]
,[Infix (m_reservedOp "-" >> return (Bino Diff)) AssocLeft
,Infix (m_reservedOp "+" >> return (Bino Somm)) AssocLeft]
]
term = m_parens exprparser
<|> fmap Var m_identifier
<|> fmap Con m_natural
The minus char appears two times, once as unary, once as binary operator.
On input "1--2", the parser gives only
Con 1
instead of the expected
"Bino Diff (Con 1) (Uno Oppo (Con 2))"
Any help welcome.Full code here
The purpose of reservedOp is to create a parser (which you've named m_reservedOp) that parses the given string of operator symbols while ensuring that it is not the prefix of a longer string of operator symbols. You can see this from the definition of reservedOp in the source:
reservedOp name =
lexeme $ try $
do{ _ <- string name
; notFollowedBy (opLetter languageDef) <?> ("end of " ++ show name)
}
Note that the supplied name is parsed only if it is not followed by any opLetter symbols.
In your case, the string "--2" can't be parsed by m_reservedOp "-" because, even though it starts with the valid operator "-", this string occurs as the prefix of a longer valid operator "--".
In a language with single-character operators, you probably don't want to use reservedOp at all, unless you want to disallow adjacent operators without intervening whitespace. Just use symbol "-", which will always parse "-", no matter what follows (and consume following whitespace, if any). Also, in a language with a fixed set of operators (i.e., no user-defined operators), you probably won't use the operator parser, so you won't need opStart, or reservedOpNames. Without reservedOp or operator, the opLetter parser isn't used, so you can drop it too.
This is probably pretty confusing, and the Parsec documentation does a terrible job of explaining how the "reserved" mechanism is supposed to work. Here's a primer:
Let's start with identifiers, instead of operators. In a typical language that allows user-defined identifiers (i.e., pretty much any language, since "variables" and "functions" have user-defined names) and may also have some reserved words that aren't allowed as identifiers, the relevant settings in the GenLanguageDef are:
identStart -- parser for first character of valid identifier
identLetter -- second and following characters of valid identifier
reservedNames -- list of reserved names not allowed as identifiers
The lexeme (whitespace-absorbing) parsers created using the GenTokenParser object are:
identifier - Parses an unknown, user-defined identifier. It parses a character from identStart followed by zero or more identLetters up to the first non-identLetter. (It never parses a partial identifier, so it'll never leave more identLetters on the table.) Additionally, it checks that the identifier is not in the list reservedNames.
symbol - Parses the given string. If the string is a reserved word, no check is made that it isn't part of a larger valid identifier. So, symbol "for" would match the beginning of foreground = "black", which is rarely what you want. Note that symbol makes no use of identStart, identLetter, or reservedNames.
reserved - Parses the given string, and then ensures that it's not followed by an identLetter. So, m_reserved "for" will parse for (i=1; ... but not parse foreground = "black". Usually, the supplied string will be a valid identifier, but no check is made for this, so you can write m_reserved "15" if you want -- in a language with the usual sorts of alphanumeric identifiers, this would parse "15" provided it wasn't following by a letter or another digit. Also, maybe somewhat surprisingly, no check is made that the supplied string is in reservedNames.
If that makes sense to you, then the operator settings follow the exact same pattern. The relevant settings are:
opStart -- parser for first character of valid operator
opLetter -- valid second and following operator chars, for multichar operators
reservedOpNames -- list of reserved operator names not allowed as user-defined operators
and the relevant parsers are:
operator - Parses an unknown, user-defined operator starting with an opStart and followed by zero or more opLetters up to the first non-opLetter. So, operator applied to the string "--2" will always take the whole operator "--", never just the prefix "-". An additional check is made that the resulting operator is not in the reservedOpNames list.
symbol - Exactly as for identifiers. It parses a string with no checks or reference to opStart, opLetter, or reservedOpNames, so symbol "-" will parse the first character of the string "--" just fine, leaving the second "-" character for a later parser.
reservedOp - Parses the given string, ensuring it's not followed by opLetter. So, m_reservedOp "-" will parse the start of "-x" but not "--2", assuming - matches opLetter. As before, no check is made that the string is in reservedOpNames.
my lex code is
/* description: Parses end executes mathematical expressions. */
/* lexical grammar */
%lex
%%
\s+ /* skip whitespace */
[0-9]+("."[0-9]+)?\b return 'NUMBER'
[a-zA-Z] return 'FUNCTION'
<<EOF>> return 'EOF'
. return 'INVALID'
/lex
/* operator associations and precedence */
%start expressions
%% /* language grammar */
expressions
: e EOF
{return $1;}
;
e
| FUNCTION '('e')'
{$$=$3}
| NUMBER
{$$ = Number(yytext);}
;
i got error
Parse error on line 1:
balaji()
-^
Expecting '(', got 'FUNCTION'
what i want to pass myfun(a,b,...) and also myfun(a) in this parser.thank you for your valuable time going to spent for me.
[a-zA-Z] matches a single alphabetic character (in this case, the letter b), returning FUNCTION. When the next token is needed, it again matches a single alphabetic character (a), returning another FUNCTION token. But of course the grammar doesn't allow two consecutive FUNCTIONs; it's expecting a (, as it says.
You probably intended [a-zA-Z]+, although a better identifier pattern is [A-Za-z_][A-Za-z0-9_]*, which matches things like my_function_2.
I'm trying to parse a comma separated list. To simplify, I'm just using digits. These expressions would be valid:
(1, 4, 3)
()
(4)
I can think of two ways to do this and I'm wondering why exactly the failed example does not work. I believe it is a correct BNF, but I can't get it to work as PEG. Can anyone explain why exactly? I'm trying to get a better understanding of the PEG parsing logic.
I'm testing using the online browser parser generator here:
https://pegjs.org/online
This does not work:
list = '(' some_digits? ')'
some_digits = digit / ', ' some_digits
digit = [0-9]
(actually, it parses okay, and likes () or (1) but doesn't recognize (1, 2)
But this does work:
list = '(' some_digits? ')'
some_digits = digit another_digit*
another_digit = ', ' digit
digit = [0-9]
Why is that? (Grammar novice here)
Cool question and after digging around in their docs for a second I found that the / character means:
Try to match the first expression, if it does not succeed, try the
second one, etc. Return the match result of the first successfully
matched expression. If no expression matches, consider the match
failed.
So this lead me to the solution:
list = '(' some_digits? ')'
some_digits = digit ', ' some_digits / digit
digit = [0-9]
The reason this works:
input: (1, 4)
eats '('
check are there some digits?
check some_digits - first condition:
eats '1'
eats ', '
check some_digits - first condition:
eats '4'
fails to eat ', '
check some_digits - second condition:
eats '4'
succeeds
succeeds
eats ')'
succeeds
if you reverse the order of the some_digits conditions the first number is comes across gets eaten by digit and no recursion occurs. Then it throws an error because ')' is not present.
In one line:
some_digits = '(' digit (', ' digit)* ')'
It depends on what you want with the values and on the PEG implementation, but extracting them might be easier this way.
As an exercise I try to parse a EBNF/ABNF grammar with Megaparsec. I got trivial stuff like terminals and optionals working, but I'm struggling with alternatives. With this grammar:
S ::= 'hello' ['world'] IDENTIFIER LITERAL | 'test';
And this code:
production :: Parser Production
production = sepBy1 alternativeTerm (char '|') >>= return . Production
alternativeTerm :: Parser AlternativeTerm
alternativeTerm = sepBy1 term space >>= return . AlternativeTerm
term :: Parser Term
term = terminal
<|> optional
<|> identifier
<|> literal
I get this error:
unexpected '|'
expecting "IDENTIFIER", "LITERAL", ''', '[', or white space
I guess the alternativeTerm parser is not returning to the production parser when it encounters a sequence that it cannot parse and throws an error instead.
What can I do about this? Change my ADT of an EBNF or should I somehow flatten the parsing. But then again, how can I do so?
It's probably best to expand my previous comment into a full answer.
Your grammar is basically a list of list of terms seperated (and ended) by whitespace, which in turn is seperated by |. Your solution with sepBy1 does not work because there is a trailing whitespace after LITERAL - sepBy1 assumes there is another term following that whitespace and tries to apply term to the |, which fails.
If your alternativeTerm is guaranteed to end with a whitespace character (or multiple), rewrite your alternativeTerm as follows:
alternativeTerm = (term `sepEndBy1` space) >>= return . AlternativeTerm
When I import the Lisra recipe,
import demo::lang::Lisra::Syntax;
This creates the syntax:
layout Whitespace = [\t-\n\r\ ]*;
lexical IntegerLiteral = [0-9]+ !>> [0-9];
lexical AtomExp = (![0-9()\t-\n\r\ ])+ !>> ![0-9()\t-\n\r\ ];
start syntax LispExp
= IntegerLiteral
| AtomExp
| "(" LispExp* ")"
;
Through the start syntax-definition, layout should be ignored around the input when it is parsed, as is stated in the documentation: http://tutor.rascal-mpl.org/Rascal/Declarations/SyntaxDefinition/SyntaxDefinition.html
However, when I type:
rascal>(LispExp)` (something)`
This gives me a concrete syntax fragment error (or a ParseError when using the parse-function), in contrast to:
rascal>(LispExp)`(something)`
Which succesfully parses. I tried this both with one of the latest versions of Rascal as well as the Eclipse plugin version. Am I doing something wrong here?
Thank you.
Ps. Lisra's parse-function:
public Lval parse(str txt) = build(parse(#LispExp, txt));
Also fails on the example:
rascal>parse(" (something)")
|project://rascal/src/org/rascalmpl/library/ParseTree.rsc|(10329,833,<253,0>,<279,60>): ParseError(|unknown:///|(0,1,<1,0>,<1,1>))
at *** somewhere ***(|project://rascal/src/org/rascalmpl/library/ParseTree.rsc|(10329,833,<253,0>,<279,60>))
at parse(|project://rascal/src/org/rascalmpl/library/demo/lang/Lisra/Parse.rsc|(163,3,<7,44>,<7,47>))
at $shell$(|stdin:///|(0,13,<1,0>,<1,13>))
When you define a start non-terminal Rascal defines two non-terminals in one go:
rascal>start syntax A = "a";
ok
One non-terminal is A, the other is start[A]. Given a layout non-terminal in scope, say L, the latter is automatically defined by (something like) this rule:
syntax start[A] = L before A top L after;
If you call a parser or wish to parse a concrete fragment, you can use either non-terminal:
parse(#start[A], " a ") // parse using the start non-terminal and extra layout
parse(A, "a") // parse only an A
(start[A]) ` a ` // concrete fragment for the start-non-terminal
(A) `a` // concrete fragment for only an A
[start[A]] " a "
[A] "a"