Is the following grammar suitable for predictive parsing, or is there an algorithm to modify the grammar to make it suitable for predictive parsing?
number = digit digit_or_sep* digit | digit
digit_or_sep = '0'..'9' | '_'
digit = '0'..'9'
where * means zero or more, and | separates the alternatives.
I have written a backtracking parser that works fine on the above grammar. However, I've read that modern parsers are mostly predictive these days and that backtracking parsers are rarely used: backtracking parsers need to rewind the parser state as they backtrack, which makes them less performant than predictive parsers.
Transforming the grammar above into:
number = digit (sep* digit+)*
sep = '_'
digit = '0'..'9'
where + means one or more.
This makes predictive parsing work because it prevents digit_or_sep* from consuming too many tokens before the final digit. But I am not sure if there is an algorithmic way to auto-transform the grammar to make it work.
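To illustrate why, here is a rough sketch (not my actual code, the names are made up) of a hand-written predictive parser for the transformed rule; it only ever inspects the next character before deciding what to do, so it never has to rewind:

// Predictive parser sketch for: number = digit (sep* digit+)*
// One character of lookahead decides whether to enter each repetition,
// so no backtracking is needed.
fn parse_number(input: &str) -> Result<&str, String> {
    let b = input.as_bytes();
    let mut pos = 0;

    // digit
    if !(pos < b.len() && b[pos].is_ascii_digit()) {
        return Err(format!("expected digit at {}", pos));
    }
    pos += 1;

    // (sep* digit+)* : enter an iteration only if the lookahead is '_' or a digit
    while pos < b.len() && (b[pos] == b'_' || b[pos].is_ascii_digit()) {
        // sep*
        while pos < b.len() && b[pos] == b'_' {
            pos += 1;
        }
        // digit+ : at least one digit must follow the separators
        if !(pos < b.len() && b[pos].is_ascii_digit()) {
            return Err(format!("expected digit after '_' at {}", pos));
        }
        while pos < b.len() && b[pos].is_ascii_digit() {
            pos += 1;
        }
    }

    Ok(&input[..pos])
}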
Edit: I read about left factoring on Google; it could be the missing piece of the puzzle I need to understand.
After a bit of left factoring:
number = digit digit_or_sep* digit | digit
digit_or_sep = '0'..'9' | '_'
digit = '0'..'9'
let g = digit_or_sep g | empty
number = digit g digit | digit
let h = g digit | empty
number = digit h
h = g digit | empty
g = digit_or_sep g | empty
g = digit g | sep g | empty
h = (digit g | sep g | empty) digit | empty
h = digit g digit | sep g digit | digit | empty
h = digit h | sep g digit | empty
The following grammar is produced:
number = digit h
h = digit h | sep g digit | empty
g = digit g | sep g | empty
Which should be suitable for a predictive parser. But I still have not come up with an algorithm to do this automatically.
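For reference, the textbook left-factoring step itself is easy to write down in code. The sketch below (all names are made up) factors alternatives that share the same first symbol; repeating it until nothing changes gives the usual left-factored form, although on its own that does not guarantee the result is LL(1):

use std::collections::BTreeMap;

// Sketch of one round of the classic left-factoring step.
// Alternatives of a rule that share the same first symbol, e.g.
//     A -> x B1 | x B2 | y
// are rewritten as
//     A -> x A' | y
//     A' -> B1 | B2
type Symbol = String;
type Alternative = Vec<Symbol>;

fn left_factor_once(
    name: &str,
    alts: Vec<Alternative>,
    fresh: &mut u32,
) -> Vec<(String, Vec<Alternative>)> {
    // Group the alternatives by their first symbol (None = the empty alternative).
    let mut groups: BTreeMap<Option<Symbol>, Vec<Alternative>> = BTreeMap::new();
    for alt in alts {
        let key = alt.first().cloned();
        groups.entry(key).or_default().push(alt);
    }

    let mut new_alts: Vec<Alternative> = Vec::new();
    let mut new_rules: Vec<(String, Vec<Alternative>)> = Vec::new();

    for (first, group) in groups {
        match first {
            Some(sym) if group.len() > 1 => {
                // Shared first symbol: factor it out into a fresh nonterminal.
                *fresh += 1;
                let tail_name = format!("{}_{}", name, *fresh);
                new_alts.push(vec![sym, tail_name.clone()]);
                let tails: Vec<Alternative> = group
                    .into_iter()
                    .map(|mut a| { a.remove(0); a })
                    .collect();
                new_rules.push((tail_name, tails));
            }
            _ => new_alts.extend(group),
        }
    }

    new_rules.push((name.to_string(), new_alts));
    new_rules
}

This only handles direct, single-symbol common prefixes; longer common prefixes and indirect cases need more work, which is where the difficulty mentioned further down comes in.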
Edit: Throwing in more details of what I am trying to do. The grammar above is just a small example of the kind of grammar I'd like to transform.
I am actually parsing Kotlin:
https://github.com/clinuxrulz/parse-bolt/blob/main/src/kotlin/parser.rs
And it is working fine with the backtracking parser, but it is having performance issues. Parsing a simple function type signature "(a: String, b: String) -> String" took 20ms, which is way too slow for such a small input. There are some optimisations I know of that I can do in the code, but it was still way slower than I expected.
On the other hand, for predictive parsing I cannot simply use the Kotlin grammar as is. It seems it will need some manual changes first. ANTLR must be doing some transformations on the grammar before generating the parser.
At the bottom of this grammar file for Kotlin is the same number rule as in the example above:
https://github.com/Kotlin/kotlin-spec/blob/release/grammar/src/main/antlr/KotlinLexer.g4
I've read ANTLR can parse the full Java standard library in under 1 second. I might have a long way to go.
Edit (29/04/2022): After more research, I found that the initial grammar (for number) poses no problem for a predictive parser; I just needed more look-ahead.
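Concretely, this is roughly what the extra look-ahead does for the original rule: while consuming digit_or_sep*, peek one character further and stop as soon as the character after the next one is no longer a digit or '_', so that the last digit is left over for the final digit of the rule. A sketch (assuming the number is a complete token, i.e. whatever follows it is not a digit or '_'):

// Sketch of the original rule parsed predictively with two characters of lookahead:
//   number = digit digit_or_sep* digit | digit
fn parse_number_ll2(input: &str) -> Result<&str, String> {
    let b = input.as_bytes();
    let is_digit_or_sep = |c: u8| c.is_ascii_digit() || c == b'_';
    let mut pos = 0;

    // leading digit, shared by both alternatives
    if !(pos < b.len() && b[pos].is_ascii_digit()) {
        return Err(format!("expected digit at {}", pos));
    }
    pos += 1;

    // second alternative (just `digit`) if the lookahead is not a digit or '_'
    if !(pos < b.len() && is_digit_or_sep(b[pos])) {
        return Ok(&input[..pos]);
    }

    // first alternative: digit_or_sep*, using two characters of lookahead
    while pos + 1 < b.len() && is_digit_or_sep(b[pos]) && is_digit_or_sep(b[pos + 1]) {
        pos += 1;
    }

    // ... followed by the final digit
    if pos < b.len() && b[pos].is_ascii_digit() {
        pos += 1;
        Ok(&input[..pos])
    } else {
        Err(format!("expected final digit at {}", pos))
    }
}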
I'm going with rici's comment that there is no algorithm to guarantee full left factoring of any grammar. However, there are heuristic search methods.
I'm going to stick with PEG and modify my own grammars when I need the performance, and optimise my backtracking parser combinator as much as possible, since it can handle more types of grammars and might still be useful.
Related
I am trying to write an EBNF production rule for parsing a math expression with +, -, *, / operations (op) and matching parentheses. I would like to point out that writing a = op b is not possible. Example: a = - 1 is not allowed. What is allowed is if + or - touch the literal (no whitespace between); then they are treated as a lit_num by the lexer. Identifiers are not allowed to have special characters, and the lexer throws an error if they do.
This is what I have done so far:
ex =
(literal | identifier)[('+'|'-'|'*'|'/') ex]
| '(' ex ')' [('+'|'-'|'*'|'/') ex]
I think it could also be written like this:
ex =
( '(' ex ')' | (literal | identifier) )[('+'|'-'|'*'|'/') ex]
It should work; I tried parsing some long expressions by hand, and it also works in the algorithm. I would just like to hear your opinion on it.
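For reference, the second form transcribes almost mechanically into a recursive descent function. The sketch below is only a recogniser with made-up single-character tokens and no whitespace handling, purely to show the shape of the rule (one primary, then an optional operator that recurses into ex):

// Sketch: ex = ( '(' ex ')' | (literal | identifier) ) [('+'|'-'|'*'|'/') ex]
// Hypothetical tokenisation: a slice of chars, single-character literals/identifiers.
fn ex(toks: &[char], pos: &mut usize) -> Result<(), String> {
    // '(' ex ')' | (literal | identifier)
    match toks.get(*pos) {
        Some('(') => {
            *pos += 1;
            ex(toks, pos)?;
            if toks.get(*pos) == Some(&')') {
                *pos += 1;
            } else {
                return Err(format!("expected ')' at {}", *pos));
            }
        }
        Some(c) if c.is_ascii_alphanumeric() => *pos += 1,
        _ => return Err(format!("expected literal, identifier or '(' at {}", *pos)),
    }
    // [ ('+'|'-'|'*'|'/') ex ]  : optional tail, recursing into ex
    if let Some(op) = toks.get(*pos) {
        if ['+', '-', '*', '/'].contains(op) {
            *pos += 1;
            ex(toks, pos)?;
        }
    }
    Ok(())
}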
EDIT: I just realised that by allowing -1 with no whitespace I could write something like a = 1 + -1. I will just put a rule into the lexer to not allow this. So I will not be able to prefix a number, but this is not a big deal. Assigning a negative number can be written like 0 - x.
The Goal
The goal is to interpret plain text content and recognise patterns, e.g. Arithmetic, Comments, Units of Measurement.
Example Input
This would be entered by a user.
# This is an example comment
10 + 10
// Another comment
This is one line of text
Tested
Expected Parse Tree
The goal of my grammar is to generate a tree that would look like this (if anyone has a better method I'd be interested to hear).
Note: The 10 + 10 is being recognised as an arithmetic rule.
Current Parse Tree aka The Problem
Below is the current output from the lexer and parser.
Note: The 10 + 10 is being recognised as text and not by the arithmetic rule.
Grammar Definition
The logic of the grammar at a high level is as follows:
Parse line by line
Determine the line content; if no specific rule matches, fall back to text
grammar ContentParser;
/*
* Tokens
*/
NUMBER: '-'? [0-9]+;
LPARAN: '(';
RPARAN: ')';
POW: '^';
MUL: '*';
DIV: '/';
ADD: '+';
SUB: '-';
LINE_COMMENT: '#' TEXT | '//' TEXT;
TEXT: ~[\n\r]+;
EOL: '\r'? '\n';
/*
* Rules
*/
start: file;
file: line+ EOF;
line: content EOL;
content
: comment
| arithmetic
| text
;
// Custom Content Types
comment: LINE_COMMENT;
/// Example taken from ANTLR Docs
arithmetic:
NUMBER # Number
| LPARAN inner = arithmetic RPARAN # Parentheses
| left = arithmetic operator = POW right = arithmetic # Power
| left = arithmetic operator = (MUL | DIV) right = arithmetic # MultiplicationOrDivision
| left = arithmetic operator = (ADD | SUB) right = arithmetic # AdditionOrSubtraction;
text: TEXT;
My Understanding
The content rule should first check for a match of the comment rule, then the arithmetic rule, and finally fall back to the text rule, which matches any characters apart from newlines.
However, I believe the lexer is being greedy with the TEXT token, which is causing the issue, but I'm not sure.
(I'm still learning ANTLR)
When you are writing a parser, it's always a good idea to print out the tokens for the input.
In the current grammar, 10 + 10 is recognized by the lexer as TEXT, which is not what is needed. It becomes TEXT because that is the longest string matched by any lexer rule; it does not matter here that the TEXT rule occurs after the NUMBER rule in the grammar. ANTLR lexers always match the longest possible string against the given lexer rules, and only if two or more rules match strings of equal length does the first rule "win". The lexer works pretty much independently of the parser.
There is no way to reliably allow spaces in a text string while also keeping them out of arithmetic. The fix is to push spaces and tabs into an "off-channel" stream, then reconstruct the text by looking at the start and end character indices of the first and last tokens of the text tree node. The tree is a little messier, but it does what you need.
Also, you should name the grammar "Content", not "ContentParser", because otherwise you end up with "ContentParserParser.java" and "ContentParserLexer.java" when you generate the parser, which is rather confusing. I also took the liberty of removing the labels and variables (I don't use them because I work with XPath expressions on the tree). And I reordered and reformatted the grammar into single-line rules, sorted alphabetically, to make rules quicker to find in a text editor rather than requiring an IDE to navigate around.
A grammar that does all this is:
grammar Content;
arithmetic: NUMBER | LPARAN arithmetic RPARAN | arithmetic POW arithmetic | arithmetic (MUL | DIV) arithmetic | arithmetic (ADD | SUB) arithmetic ;
comment: LINE_COMMENT;
content : comment | arithmetic | text ;
file: line+ EOF;
line: content? EOL;
start: file;
text: TEXT+;
ADD: '+';
DIV: '/';
LINE_COMMENT: '#' STUFF | '//' STUFF;
LPARAN: '(';
MUL: '*';
NUMBER: '-'? [0-9]+;
POW: '^';
RPARAN: ')';
SUB: '-';
fragment STUFF : ~[\n\r]* ;
EOL: '\r'? '\n';
WS : [ \t]+ -> channel(HIDDEN);
TEXT: .; // Must be last lexer rule, and only one char in length.
I am creating the simplest grammar possible that basically recognizes arithmetic expressions. The grammar needs to correctly follow arithmetic operator precedence rules (PEMDAS), and for that I placed expr ('*'|'/') term before expr ('+'|'-') term to ensure this precedence.
This is the arithmetic.g4 file that I have:
/*Productions */
expr: expr ('*'|'/') term
| expr ('+'|'-') term
| term
;
term: '('expr')'
| ID
| NUM
;
/*Tokens */
ID: [a-z]+;
NUM: [0-9]+;
WS: [\t\r\n]+->skip;
The output of the grammar is, however, not what it should be. For example, for the arithmetic expression 4 * (3 + 10) I get the parse tree below (which is absolutely not correct):
Any suggestions on how I can change the grammar to get what I am looking for? I am new to ANTLR and am not sure what mistake I am making. (By the way, my OS is Windows.)
(I'm assuming that you've made a mistake in your example (which looks fine) and you really meant that you're getting the wrong tree for the input 4 + 3 * 10, so that's what I'm going to answer. If that's not what you meant, please clarify.)
You're right that ANTLR resolves ambiguities based on the order of rules, but that does not apply to your grammar because your grammar is not ambiguous. For an input like 4 + 3 * 10, there's only one way to parse it according to your grammar: with * being the outer operator, with 4 + 3 as its left and 10 as its right operand. The correct way (+ as the outer operator with 3 * 10 as the right operand) doesn't work with your grammar because 3 * 10 is not a valid term and the right operand needs to be a term according to your grammar.
In order to get an ambiguity that's resolved in the way you want, you'll need to make both operands of your operators exprs.
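In other words, the operator alternatives would become something along these lines; ANTLR 4 rewrites the direct left recursion itself, gives earlier alternatives higher precedence, and treats binary operators as left-associative by default:

expr: expr ('*'|'/') expr
    | expr ('+'|'-') expr
    | term
    ;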
I have been trying to create a parser for expressions with variables and to simplify them to quadratic expression form.
This is my parser grammar:
Exercise : Expr '=' Expr
Expr : Term [+-] Expr | Term
Term : Factor [*/] Term | Factor
Factor: Atom [^] Factor | Atom
Atom: Number | Identifier | [- sqrt] Atom | '(' Expr ')'
For parsing I'm using a recursive descent parser. Let's say I'd like to parse this:
" 2 - 1 + 1 = 0"
and the result is 0, because the parser creates the wrong tree:
  -
 / \
2   +
   / \
  1   1
How can I make this grammar left-associative? I'm a newbie at this; can you please point me to a source where I can find more information? Can I achieve this with a recursive descent parser?
Take a look at Parsing Expressions by Recursive Descent by Theodore Norvell
There he gives three ways to solve the problem.
1. The shunting yard algorithm.
2. Classic solution by factoring the grammar.
3. Precedence climbing
Your problem stems from the fact that your grammar needs several changes, one example being
Expr: Term { [+-] Term }
notice the addition of the { } around [+-] Term, indicating that the operator and its right-hand Term are parsed together and that the group can occur zero or more times.
Also by default you are building the tree as
  -
 / \
2   +
   / \
  1   1
i.e. -(2,+(1,1))
when you should be building the tree for left associative operators of the same precedence as
    +
   / \
  -   1
 / \
2   1
i.e. +(-(2,1),1)
Since the paper covers all three methods, I won't expand on them here. Also, since you mention that you are new to this, you should get a good compiler book to understand the details you will encounter while reading the paper. Most of these methods are implemented in common programming languages and available for free on the internet, but be aware that many people do what you did and post wrong results.
The best way to check whether you have it right is with a test like this, using a sequence of multiple subtraction operations:
7-3-2 = 2
if you get
7-3-2 = 6 or something else
then it is wrong.
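To make the second option (factoring the grammar) concrete, here is a minimal sketch with deliberately simplified, hypothetical token handling: parse one Term, then loop over the following [+-] Term pairs, folding each one into the tree from the left, which yields +(-(2,1),1) for 2 - 1 + 1.

// Sketch: left-associative handling of one precedence level.
// Tokens are just single characters here (digits and + -), purely for illustration.
enum Ast {
    Num(i64),
    Bin(char, Box<Ast>, Box<Ast>),
}

// Expr: Term { [+-] Term }
fn expr(toks: &[char], pos: &mut usize) -> Ast {
    let mut left = term(toks, pos);
    while *pos < toks.len() && (toks[*pos] == '+' || toks[*pos] == '-') {
        let op = toks[*pos];
        *pos += 1;
        let right = term(toks, pos);
        // Fold to the left: the tree built so far becomes the left child.
        left = Ast::Bin(op, Box::new(left), Box::new(right));
    }
    left
}

// Term, Factor and Atom would follow the same pattern one level down;
// to keep the sketch short, term just accepts a single digit.
fn term(toks: &[char], pos: &mut usize) -> Ast {
    let d = toks[*pos].to_digit(10).expect("expected a digit") as i64;
    *pos += 1;
    Ast::Num(d)
}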
I'm trying to learn about shift-reduce parsing. Suppose we have the following grammar, using recursive rules that enforce order of operations, inspired by the ANSI C Yacc grammar:
S: A;
P
: NUMBER
| '(' S ')'
;
M
: P
| M '*' P
| M '/' P
;
A
: M
| A '+' M
| A '-' M
;
And we want to parse 1+2 using shift-reduce parsing. First, the 1 is shifted as a NUMBER. My question is, is it then reduced to P, then M, then A, then finally S? How does it know where to stop?
Suppose it does reduce all the way to S, then shifts '+'. We'd now have a stack containing:
S '+'
If we shift '2', the reductions might be:
S '+' NUMBER
S '+' P
S '+' M
S '+' A
S '+' S
Now, on either side of the last line, S could be P, M, A, or NUMBER, and it would still be valid in the sense that any combination would be a correct representation of the text. How does the parser "know" to make it
A '+' M
So that it can reduce the whole expression to A, then S? In other words, how does it know to stop reducing before shifting the next token? Is this a key difficulty in LR parser generation?
Edit: An addition to the question follows.
Now suppose we parse 1+2*3. Some shift/reduce operations are as follows:
Stack | Input | Operation
---------+-------+----------------------------------------------
| 1+2*3 |
NUMBER | +2*3 | Shift
A | +2*3 | Reduce (looking ahead, we know to stop at A)
A+ | 2*3 | Shift
A+NUMBER | *3 | Shift (looking ahead, we know to stop at M)
A+M | *3 | Reduce (looking ahead, we know to stop at M)
Is this correct (granted, it's not fully parsed yet)? Moreover, does lookahead by 1 symbol also tell us not to reduce A+M to A, as doing so would result in an inevitable syntax error after reading *3 ?
The problem you're describing is an issue with creating LR(0) parsers - that is, bottom-up parsers that don't do any lookahead beyond the symbol they are currently parsing. The grammar you've described doesn't appear to be an LR(0) grammar, which is why you run into trouble when trying to parse it without lookahead. It does appear to be LR(1), however, so by looking one symbol ahead in the input you could easily determine whether to shift or reduce. In this case, an LR(1) parser would look ahead when it had the 1 on the stack, see that the next symbol is a +, and realize that it shouldn't reduce past A (since that is the only thing it could reduce to that would still match a rule with + in the second position).
An interesting property of LR grammars is that for any grammar which is LR(k) for k>1, it is possible to construct an LR(1) grammar which is equivalent. However, the same does not extend all the way down to LR(0) - there are many grammars which cannot be converted to LR(0).
See here for more details on LR(k)-ness:
http://en.wikipedia.org/wiki/LR_parser
I'm not exactly sure of the Yacc / Bison parsing algorithm and when it prefers shifting over reducing, but I know that Bison generates LALR(1) parsers by default, which means it has one token of lookahead. This means that tokens aren't pushed onto the stack immediately; rather, the parser waits until no more reductions can happen. Then, if shifting the next token makes sense, it applies that operation.
First of all, in your case, if you're evaluating 1 + 2, it will shift 1. It will reduce that token up to an A because the '+' lookahead token indicates that it's the only valid course. Since there are no more reductions, it will shift the '+' token onto the stack and hold 2 as the lookahead. It will then shift the 2, reduce it up to an M, and reduce A '+' M to an A, since A '+' M produces an A and the expression is complete.
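Putting the two answers together, a plausible trace for 1 + 2 with one token of lookahead looks something like this:

Stack      | Input | Operation
-----------+-------+------------------------------------------------
           | 1+2   |
NUMBER     | +2    | Shift
P          | +2    | Reduce NUMBER to P
M          | +2    | Reduce P to M
A          | +2    | Reduce M to A (stop here: '+' cannot follow S)
A +        | 2     | Shift
A + NUMBER |       | Shift
A + P      |       | Reduce NUMBER to P
A + M      |       | Reduce P to M
A          |       | Reduce A '+' M to A
S          |       | Reduce A to S, accept

The reduction from A to S is postponed while the lookahead is '+' because, in this grammar, S can only be followed by end of input or ')'.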
First of all, in your case, if you're evaluating 1 + 2, it will shift 1. It will reduce that token to an A because the '+' lookahead token indicates that its the only valid course. Since there are no more reductions, it will shift the '+' token onto the stack and hold 2 as the lookahead. It will shift the 2 and reduce to an M since A + M produces an A and the expression is complete.