I'm using the rust-peg parsing expression grammar library, but the principles should be generally understandable. I'm using the library to create a parser for Go based on the spec. I'm having trouble getting if statements to parse, and I've distilled the issue down to a simple example.
sep
= ( " " / "\n" )*
expression
= "x" sep block
/ "x"
if_stmt
= "if" sep expression sep block
block
= "{" ( sep stmt )* "}"
stmt
= if_stmt
/ expression
pub file
= ( sep stmt )*
This grammar should (in my mind) parse a very simple language that contains two types of statements: if statements and expression statements. An expression can be x, or x followed by a block. An if statement is if followed by an expression, followed by a block. Here is an example input that my grammar fails to parse:
x {}
if x {
}
This fails to parse because the curly braces after the x on the if statement line are interpreted as a block belonging to the "x" sep block rule, not as the block of the if_stmt rule. Unfortunately, this parsing library does not backtrack and re-parse that part of the line as the if statement's block when the parse fails. I have realized that if I switch the order of the expression rule so that it tries to parse plain "x" first, then the if statement parses just fine. This does create a problem for the line x {}, though: the x at the beginning of the line parses as just a plain "x", and the parser backs out of the expression rule before it ever tries to parse the {}.
Do these limitations make the library incapable of parsing a grammar like this? Should I find another one, or just write my own parser? (I don't really want to do that)
Edit
I experimented with the go grammar, and I discovered that it is not legal to put a struct literal (the "x" sep block example) in an if statement condition. I was thus able to disambiguate the grammar as attdona suggested.
Try to disambiguate your grammar:
simple_expr
= "x"
block_expr
= "x" sep block
expression
= block_expr
/ simple_expr
if_stmt
= "if" sep simple_expr sep block
I don't know rust-peg, but I hope this helps you resolve the parsing of your grammar. (Note that the longer block_expr alternative comes first in expression: PEG choice is ordered and committed, so the alternative that can consume more input must be tried first.)
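Not rust-peg, but to make the ordering point concrete, here is a minimal sketch of the same disambiguated grammar in Haskell using Parsec (names and structure are mine, not the asker's). Parsec's <|> likewise commits once an alternative has consumed input unless it is wrapped in try, so it shows the same ordered-choice behavior:
import Text.Parsec
import Text.Parsec.String (Parser)

data Expr = X | XBlock [Stmt] deriving Show
data Stmt = If Expr [Stmt] | ExprStmt Expr deriving Show

sep :: Parser ()
sep = skipMany (oneOf " \n")

block :: Parser [Stmt]
block = char '{' *> many (try (sep *> stmt)) <* sep <* char '}'

-- Ordered choice: the longer "x block" form is tried first; `try`
-- rewinds to plain "x" if no block follows.
expression :: Parser Expr
expression = try (XBlock <$> (char 'x' *> sep *> block))
         <|> (X <$ char 'x')

-- The disambiguation from the answer: an if condition only allows the
-- simple form, so the block after "if x" always belongs to the if.
simpleExpr :: Parser Expr
simpleExpr = X <$ char 'x'

ifStmt :: Parser Stmt
ifStmt = If <$> (string "if" *> sep *> simpleExpr) <*> (sep *> block)

stmt :: Parser Stmt
stmt = try ifStmt <|> (ExprStmt <$> expression)

file :: Parser [Stmt]
file = many (try (sep *> stmt)) <* sep <* eof

-- parse file "" "x {}\nif x {\n}\n"
--   ==> Right [ExprStmt (XBlock []), If X []]
With this ordering, both the x {} statement and the if statement parse, without any global backtracking.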
Related
I've been using the ANTLR Matlab grammar from the ANTLR grammars repository.
I found out I need to implement the ' Matlab operator. It is the complex conjugate transpose operator, used as follows:
result = input'
I tried a straightforward solution of adding it to unary_expression as an option postfix_expression '\''
However, this failed to parse when multiple of these operators were used on a single line.
Here's a significantly simplified version of the grammar, still exhibiting the exact problem:
grammar Grammar;
unary_expression
: IDENTIFIER
| unary_expression '\''
;
translation_unit : unary_expression CR ;
STRING_LITERAL : '\'' [a-z]* '\'' ;
IDENTIFIER : [a-zA-Z] ;
CR : [\r\n] + ;
Test cases, being parsed as translation_unit:
"x''\n" //fails getNumberOfSyntaxErrors returns 1
"x'\n" //passes
The failure also prints the message line 1:1 extraneous input '''' expecting CR to stderr.
The failure goes away if I either remove STRING_LITERAL, or change the * to +. Neither is a proper solution of course, as removing it is entirely off the table, and mandating non-empty strings is not quite correct, though I might be able to live with it. Also, forcing non-empty string does nothing to help the real use case, when the input is something like x' + y' instead of using the operator twice.
For some reason, removing CR from the grammar and \n from the tests also makes the parsing run without problems, but yet again this is not a usable solution.
What can I do to the grammar to make it work correctly? I'm assuming it's a problem with lexing specifically because removing STRING_LITERAL or making it unable to match '' makes it go away.
The lexer can never be made that context-aware, I think, but I don't know Matlab well enough to be sure. How could you check during tokenisation that these single quotes are operators:
x' + y';
while these are strings:
x = 'x' + ' + y';
?
Maybe you can do something similar to how, in ECMAScript, a / can be either a division operator or a regex delimiter. In that grammar this is handled by a predicate in the lexer that uses some target code to check it.
If something like the above is not possible, I see no other way than to "promote" the creation of strings to the parser. That would mean removing STRING_LITERAL and introducing a parser rule that matches something like this:
string_literal
: QUOTE ~(QUOTE | CR)* QUOTE
;
// Needed to match characters inside strings
OTHER
: .
;
However, that will fail when a string like 'hi there' is encountered: the space in between hi and there will now be skipped by the WS rule. So WS should also be removed (spaces will then get matched by the OTHER rule). But now (of course) all spaces will litter the token stream and you'll have to account for them in all parser rules (not really a viable solution).
All in all: I don't see ANTLR as a suitable tool in this case. You might look into parser generators where there is no separation between tokenisation and parsing. Google for "PEG" and/or "scannerless parsing".
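For what it's worth, here is a sketch (in Haskell, since it can't be expressed as a plain ANTLR grammar) of the context-sensitive decision a hand-written or scannerless tokenizer can make. It uses the usual Matlab rule of thumb: a quote is the transpose operator when the previous token can end an expression, and opens a string otherwise. It is deliberately simplified - single-character identifiers, no '' escape inside strings, and whitespace never resets the context:
-- A hand-rolled tokenizer sketch: whether a quote is transpose or a
-- string delimiter depends on the previous token.
data Tok = Ident Char | Transpose | Str String | Plus | Assign | Semi
  deriving Show

-- True if a quote directly after this token means transpose.
endsExpr :: Tok -> Bool
endsExpr (Ident _) = True
endsExpr Transpose = True   -- x'' is a double transpose
endsExpr (Str _)   = True   -- 'abc'' transposes a string
endsExpr _         = False

tokenize :: Maybe Tok -> String -> [Tok]
tokenize _ [] = []
tokenize prev (c:cs)
  | c == ' '  = tokenize prev cs                 -- simplified: spaces ignored
  | c == '+'  = Plus   : tokenize (Just Plus) cs
  | c == '='  = Assign : tokenize (Just Assign) cs
  | c == ';'  = Semi   : tokenize (Just Semi) cs
  | c == '\'' = case prev of
      Just t | endsExpr t -> Transpose : tokenize (Just Transpose) cs
      _ -> let (s, rest) = break (== '\'') cs    -- no '' escapes here
           in Str s : tokenize (Just (Str s)) (drop 1 rest)
  | otherwise = Ident c : tokenize (Just (Ident c)) cs

-- tokenize Nothing "x' + y';"
--   ==> [Ident 'x',Transpose,Plus,Ident 'y',Transpose,Semi]
-- tokenize Nothing "x = 'x' + ' + y';"
--   ==> [Ident 'x',Assign,Str "x",Plus,Str " + y",Semi]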
I am having some difficulties understanding the specific difference between Lexical Grammar and Syntactic Grammar in the ECMAScript 2017 specification.
Excerpts from ECMAScript 2017
5.1.2 The Lexical and RegExp Grammars
A lexical grammar for ECMAScript is given in clause 11. This grammar
has as its terminal symbols Unicode code points that conform to the
rules for SourceCharacter defined in 10.1. It defines a set of
productions, starting from the goal symbol InputElementDiv,
InputElementTemplateTail, InputElementRegExp, or
InputElementRegExpOrTemplateTail, that describe how sequences of such
code points are translated into a sequence of input elements.
Input elements other than white space and comments form the terminal
symbols for the syntactic grammar for ECMAScript and are called
ECMAScript tokens. These tokens are the reserved words, identifiers,
literals, and punctuators of the ECMAScript language.
5.1.4 The Syntactic Grammar
When a stream of code points is to be parsed as an ECMAScript Script
or Module, it is first converted to a stream of input elements by
repeated application of the lexical grammar; this stream of input
elements is then parsed by a single application of the syntactic
grammar.
Questions
Lexical grammar
Here it says the terminal symbols are Unicode code points (individual characters)
It also says it produces input elements (aka. tokens)
How are these reconcilable? Either the terminal symbols are tokens, and thus it produces tokens. Or, the terminal symbols are individual code points, and that's what it produces.
Syntactic grammar
I have the same questions on this grammar as on the lexical grammar
It seems to say that the terminal symbols here are tokens
So by applying the syntactic grammar rules, valid tokens are produced, which in turn can be sent to the parser? Or does this grammar accept tokens as input and then test the overall stream of tokens for validity?
My Best Guess
Lexing phase
Input: Code points (source code)
Output: Applies lexical grammar productions to produce valid tokens (lexeme type + value) as output
Parsing phase
Input: Tokens
Output: Applies syntactic grammar productions (CFG) to decide if all the tokens together represent a valid stream (i.e. that the source code as a whole is a valid Script / Module)
I think you are confused about what terminal symbol means. In fact, terminal symbols are the inputs of the parser, not the outputs (the output is a parse tree, including the degenerate case of a list).
On the other hand, a production rule does indeed have terminal symbols as the output and a goal symbol as the input - it's backwards; that's where the term "terminal" comes from. A non-terminal can be expanded (in different ways - that's what the rules describe) to a sequence of terminal symbols.
Example:
Language:
S -> T | S '_' T
T -> D | T D
D -> '0' | '1' | '2' | … | '9'
String:
12_45
Production:
S // start: the goal
= S '_' T
= T '_' T
= T D '_' T
= T '2_' T
= D '2_' T
= '12_' T
= '12_' T D
= '12_' T '5'
= '12_' D '5'
= '12_45' // end: the terminals
Parse tree:
S
  S
    T
      T
        D
          '1'
      D
        '2'
  '_'
  T
    T
      D
        '4'
    D
      '5'
Parser output (generating a sequence of items from top-level Ts):
'12'
'45'
So
The lexing phase has code points as inputs and tokens as outputs. The code points are the terminal symbols of the lexical grammar.
The syntactic phase has tokens as inputs and programs as outputs. The tokens are the terminal symbols of the syntactic grammar.
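To make the two levels concrete, here is a toy Haskell version of the digit-group grammar above (the names are mine, purely illustrative): the lexer's terminal symbols are characters and it derives input elements; the parser's terminal symbols are those input elements, with the non-token ones (whitespace) filtered out first:
import Data.Char (isDigit, isSpace)

-- Input elements: some are tokens, some (whitespace) are not.
data InputElement = NumToken String | Underscore | WhiteSpace
  deriving (Show, Eq)

-- Lexical phase: the terminal symbols are code points (Chars).
lexer :: String -> [InputElement]
lexer [] = []
lexer (c:cs)
  | isDigit c = let (ds, rest) = span isDigit cs
                in NumToken (c:ds) : lexer rest
  | c == '_'  = Underscore : lexer cs
  | isSpace c = WhiteSpace : lexer cs
  | otherwise = error ("unexpected code point: " ++ [c])

-- Syntactic phase: the terminal symbols are the tokens; non-token
-- input elements are discarded first. S -> T | S '_' T amounts to
-- one or more Ts separated by underscores.
parseS :: [InputElement] -> Maybe [String]
parseS es = go [e | e <- es, e /= WhiteSpace]
  where
    go [NumToken n]                  = Just [n]
    go (NumToken n : Underscore : r) = (n :) <$> go r
    go _                             = Nothing

-- parseS (lexer "12_45")  ==>  Just ["12","45"]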
Your "best guess" is correct to a first approximation. The main correction is to change "tokens" to "input elements". That is, the lexical level produces input elements (only some of which are designated 'tokens'), and the syntactic level takes input elements as input.
The syntactic level can almost ignore input elements that aren't tokens, except that Automatic Semicolon Insertion rules require it to pay attention to line-terminators in whitespace and comments.
Your "How are these reconcilable?" questions seems to stem from a misunderstanding of either "terminal symbol" or "produces", but it's not clear to me which.
I'm totally new to Haskell and trying to implement a lambda calculus parser that will be used to read the input to a lambda reducer. It's required to parse bindings ("identifier = expression;") first from a text file, and then at the end there's an expression alone.
So far it can parse bindings only, and displays errors when encountering an expression alone. When I try to use the try or option functions, I get a type mismatch error:
Couldn't match type `[Expr]'
with `Text.Parsec.Prim.ParsecT s0 u0 m0 [[Expr]]'
Expected type: Text.Parsec.Prim.ParsecT
s0 u0 m0 (Text.Parsec.Prim.ParsecT s0 u0 m0 [[Expr]])
Actual type: Text.Parsec.Prim.ParsecT s0 u0 m0 [Expr]
In the second argument of `option', namely `bindings'
bindings wasn't supposed to return anything, but when I tried to add a return statement it also gave a type mismatch error:
Couldn't match type `[Expr]' with `Expr'
Expected type: Text.Parsec.Prim.ParsecT
[Char] u0 Data.Functor.Identity.Identity [Expr]
Actual type: Text.Parsec.Prim.ParsecT
[Char] u0 Data.Functor.Identity.Identity [[Expr]]
In the second argument of `(<|>)', namely `expressions'
Don't use <|> if you want to allow both
Your program parser does its main work with
program = do
spaces
try bindings <|> expressions
spaces >> eof
This <|> is choice - it does bindings if it can, and if that fails, expressions, which isn't what you want. You want zero or more bindings, followed by expressions, so let's make it do that.
Sadly, even when this works, the last line of your parser is spaces >> eof, so the whole do block returns () and the parse results are thrown away.
First, let's allow zero bindings, since they're optional, then let's get both the bindings and the expressions:
bindings = many binding
program = do
spaces
bs <- bindings
es <- expressions
spaces >> eof
return (bs,es)
This error would be easier to find with plenty more <?> "binding" type hints so you can see more clearly what was expected.
endBy doesn't need many
The error message you have stems from the line
expressions = many (endBy expression eol)
which should be
expressions :: Parser [Expr]
expressions = endBy expression eol
endBy works like sepBy - you don't need to use many on it because it already parses many.
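A quick illustration with a toy parser (not the asker's code):
import Text.Parsec
import Text.Parsec.String (Parser)

-- endBy already loops: it parses zero or more occurrences of the item,
-- each terminated by the separator.
nums :: Parser [String]
nums = endBy (many1 digit) (char ';')

-- parse nums "" "1;2;3;"  ==>  Right ["1","2","3"]
-- Wrapping nums in `many` would make the result a [[String]] - hence the
-- type error - and since endBy can succeed while consuming nothing,
-- Parsec's `many` would reject it at runtime anyway.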
This error would also have been easier to find with a stronger data type tree - see the last section below.
Use try to deal with common prefixes
One of the hard-to-debug problems you've had is when you get the error expecting space or "=" whilst parsing an expression. If we think about that, the only place we expect = is in a binding, so it must be part way through parsing a binding when we've given it an expression. This only happens if our expression starts with an identifier, just like a binding does.
binding sees the first identifier and says "It's OK guys, I've got this" but then finds no = and gives you an error, where we wanted it to backtrack and let expression have a go. The key point is we've already used the identifier input, and we want to unuse it. try is right for that.
Encase your binding parser with try so if it fails, we'll go back to the start of the line and hand over to expression.
binding = try (do
    (Var id) <- identifier
    _ <- char '='
    spaces
    exp <- expression
    spaces
    eol <?> "end of line"
    return $ Eq id exp)
    <?> "binding"
It's important that as far as possible each parser starts with matching something unique to avoid this problem. (try is backtracking, hence inefficient, so should be avoided if possible.)
In particular, avoid starting parsers with spaces, but instead make sure you finish them all with spaces. Your main program can start with spaces if you like, since it's the only alternative.
Use types for most productions - better structure & readability
My first piece of general advice is that you could do with a more fine-grained data type, and should annotate your parsers with their type. At the moment, everything's wrapped up in Expr, which means you can only get error messages about whether you have an Expr or a [Expr]. The fact that you had to add Eq to Expr is a sign you're pushing the type too far.
Usually it's worth making a data type for quite a lot of the productions, and if you import Control.Applicative hiding ((<|>), many) you can use <$> and <*> so that the production, the datatype and the parser all have the same structure:
--<program> ::= <spaces> [<bindings>] <expressions>
data Program = Prog [Binding] [Expr]
program = spaces >> Prog <$> bindings <*> expressions
-- <expression> ::= <abstraction> | <factors>
data Expression = Ab Abstraction | Fa [Factor]
expression = Ab <$> abstraction <|> Fa <$> factors <?> "expression"
Don't do this for every little production (letters, for example), but do it for the important things. What counts as important is a matter of judgement, but I'd start with identifiers. (You can use <* or *> to leave syntax like = out of the results, as in the sketch below.)
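For example (a self-contained sketch - Bind, symbol and friends are illustrative names, not from the asker's code):
import Text.Parsec
import Text.Parsec.String (Parser)

data Binding = Bind String Int deriving Show

-- Each parser consumes its own trailing spaces, per the advice above.
symbol :: String -> Parser String
symbol s = string s <* spaces

identifier :: Parser String
identifier = many1 letter <* spaces

number :: Parser Int
number = read <$> many1 digit <* spaces

-- <* and *> run both parsers but keep only one result, so the '=' and
-- ';' syntax never ends up inside the Binding value.
binding :: Parser Binding
binding = Bind <$> identifier <* symbol "=" <*> number <* symbol ";"

-- parse binding "" "x = 42;"  ==>  Right (Bind "x" 42)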
Amended code:
Before refactoring types and using Applicative here
And afterwards here
I'm trying to implement a JS parser in Haskell, but I'm stuck on automatic semicolon insertion. I have created a test project to play around with the problem, but I cannot figure out how to solve it.
In my test project, a program is a list of expressions (unary or binary):
data Program = Program [Expression]
data Expression
= UnaryExpression Number
| PlusExpression Number Number
The input stream is a list of tokens:
data Token
= SemicolonToken
| NumberToken Number
| PlusToken
I want to parse inputs like these:
1; - Unary expression
1 + 2; - Binary expression
1; 2 + 3; - Two expressions (unary and binary)
1 2 + 3; - Same as the previous input, but the first semicolon is missing. The parser consumes token 1, but token 2 is not allowed by any production of the grammar (the next expected token is a semicolon or a plus). The rule of automatic semicolon insertion says that in this case a semicolon is automatically inserted before token 2.
So, what is the most elegant way to implement such parser behavior?
You have
expression = try unaryExpression <|> plusExpression
but that doesn't work, since a UnaryExpression is a prefix of a PlusExpression. So for
input2 = [NumberToken Number1, PlusToken, NumberToken Number1, SemicolonToken]
the parser happily parses the first NumberToken and automatically adds a semicolon, since the next token is a PlusToken and not a SemicolonToken. Then it tries to parse the next Expression, but the next token is a PlusToken, and no Expression can start with that.
Change the order in which the parsers are tried,
expression = try plusExpression <|> unaryExpression
and it will first try to parse a PlusExpression, and only when that fails resort to the shorter parse of a UnaryExpression.
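As for the semicolon insertion itself, here is one possible hand-rolled sketch over the question's token type (Number1, Number2, Number3 are assumed constructors of the question's Number type, which isn't shown): try the longer production first, then treat a missing semicolon as implicitly present whenever an expression ends.
data Number = Number1 | Number2 | Number3 deriving (Show, Eq)

data Token
  = SemicolonToken
  | NumberToken Number
  | PlusToken
  deriving (Show, Eq)

data Expression
  = UnaryExpression Number
  | PlusExpression Number Number
  deriving Show

-- Pattern order mirrors `try plusExpression <|> unaryExpression`:
-- the longer production is attempted first.
expression :: [Token] -> Maybe (Expression, [Token])
expression (NumberToken a : PlusToken : NumberToken b : rest) =
  Just (PlusExpression a b, eatSemi rest)
expression (NumberToken a : rest) =
  Just (UnaryExpression a, eatSemi rest)
expression _ = Nothing

-- Automatic semicolon insertion: consume a real semicolon if there is
-- one, otherwise behave as if one had been inserted.
eatSemi :: [Token] -> [Token]
eatSemi (SemicolonToken : rest) = rest
eatSemi rest                    = rest

program :: [Token] -> Maybe [Expression]
program [] = Just []
program ts = do
  (e, rest) <- expression ts
  (e :) <$> program rest

-- program [NumberToken Number1, NumberToken Number2, PlusToken,
--          NumberToken Number3, SemicolonToken]
--   ==> Just [UnaryExpression Number1, PlusExpression Number2 Number3]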
I'm building a parser for a language I've designed, in which type names start with an upper case letter and variable names start with a lower case letter, such that the lexer can tell the difference and provide different tokens. Also, the string 'this' is recognised by the lexer (it's an OOP language) and passed as a separate token. Finally, data members can only be accessed on the 'this' object, so I built the grammar like so:
%token TYPENAME
%token VARNAME
%token THIS
%%
start:
Expression
;
Expression:
THIS
| THIS '.' VARNAME
| Expression '.' TYPENAME
;
%%
The first rule of Expression allows the user to pass 'this' around as a value (for example, returning it from a method or passing it to a method call). The second is for accessing data on 'this'. The third rule is for calling methods; however, I've removed the brackets and parameters since they are irrelevant to the problem. The original grammar was clearly much larger than this, but this is the smallest part that generates the same error (1 shift/reduce conflict). I isolated it into its own parser file and verified this, so the error has nothing to do with any other symbols.
As far as I can see, the grammar given here is unambiguous and so should not produce any errors. If you remove any of the three rules or change the second rule to
Expression '.' VARNAME
there is no conflict. In any case, I probably need someone to state the obvious of why this conflict occurs and how to resolve it.
The problem is that the parser can only look one token ahead. When it has seen THIS and then sees a '.', is it in line 2 (Expression: THIS '.' VARNAME) or in line 3 (Expression: Expression '.' TYPENAME, via a reduction according to line 1)?
The parser could reduce THIS to Expression and then look for a TYPENAME, or keep THIS and look for a VARNAME, but it has to decide the moment it reaches the '.'.
I try to avoid y.output, but sometimes it does help. I looked at the file it produced and saw:
state 1

  2 Expression: THIS .  [$end, '.']
  3           | THIS . '.' VARNAME

  '.'  shift, and go to state 4

  '.'       [reduce using rule 2 (Expression)]
  $default  reduce using rule 2 (Expression)
Basically it is saying that when it sees '.' it can either reduce or shift. Reduces make me angry sometimes because they are hard to find. The shift is rule 3 and is obvious (though the output doesn't mention the rule number). The reduce it can do when it sees '.' in this case comes from the line
| Expression '.' TYPENAME
When it goes into Expression it looks at the next token (the '.') and goes in. Now it sees THIS, so when it gets to the end of that rule it expects a '.' (or an error). However, while it is between THIS and '.' (hence the dot in the output file) it CAN also reduce a rule, so there is a path conflict. I believe you can use %glr-parser to allow it to try both, but the more conflicts you have, the more likely you'll get either unexpected output or an ambiguity error. I had ambiguity errors in the past; they are annoying to deal with, especially if you don't remember what rule caused or affected them. It is recommended to avoid conflicts.
I highly recommend this book before attempting to use bison.
I can't think of a 'great' solution, but this gives no conflicts:
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
    rval
    | rval '.' TYPENAME
    | THIS // trick is moving THIS away so it doesn't reduce too early
    ;
rval:
    THIS '.' VARNAME
    ;
Alternatively, you can make it reduce later, either by adding more to the rule so it doesn't reduce as soon, or by adding a token before or after that makes it clear which path to take (remember, it must know BEFORE reducing ANY path):
start:
ExpressionLoop
;
ExpressionLoop:
Expression
| ExpressionLoop ';' Expression
;
Expression:
    rval
    | rval '.' TYPENAME
    ;
rval:
    THIS '#'
    | THIS '.' VARNAME
    ;
%%
-edit- Note: if I want to support func param and type varname, I can't, because according to the lexer func is a Var (which is [A-Za-z0-9_]), and so is type; param and varname are Vars as well, so this causes a reduce/reduce conflict. You can't write tokens as what they are, only as what they look like, so keep that in mind when writing the grammar. You'll have to either introduce a distinguishing token, or write it as one of the two and add logic in code (the action part in { } on the right side of the rules) to check whether it is a func name or a type and handle both cases.