I'm parsing a language that doesn't have statement terminators like ;. Expressions are defined as the longest sequence of tokens, so 5-5 has to be parsed as a subtraction, not as two statements (the literal 5 followed by the unary negation -5).
I'm using LALRPOP as the parser generator (despite the name, it is LR(1) rather than LALR, AFAIK). LALRPOP doesn't have precedence attributes and doesn't prefer shift over reduce by default the way yacc would. I think I understand how regular operator precedence is encoded in an LR grammar by building a "chain" of rules, but I don't know how to apply that technique to this issue.
The expected parses would be (individual statements in brackets):
"5 - 5" → 5-5 instead of 5, -5
"5 (- 5)" → 5, -5
"- 5" → -5
"5 5" → 5, 5
How do I change the grammar such that it always prefers the longer parse?
Going through the first few pages of Google results as well as Stack Overflow didn't yield anything for this specific problem. Most related questions need more lookahead, or the conclusion is to disallow consecutive statements without terminators.
I created a minimal sample grammar that reproduces the shift/reduce conflict. A statement in this grammar is just an expression; in the full grammar there would also be "if", "while", etc. and more levels of operator precedence, but I've omitted them for brevity. Besides unary minus, there are other conflicts in the original grammar, like print(5), which could be parsed either as the identifier print followed by a parenthesized number (5), or as a function call. There might be more conflicts like this, but all of them have the same underlying issue: both the longer and the shorter parse are currently valid, and only the longer one should be.
For convenience, I created a repo (checkout and cargo run). The grammar is:
use std::str::FromStr;

grammar;

match {
    "+",
    "-",
    "(",
    ")",
    r"[0-9]+",
    // Skip whitespace
    r"\s*" => { },
}

Expr: i32 = {
    <l:Expr> "+" <r:Unary> => l + r,
    <l:Expr> "-" <r:Unary> => l - r,
    Unary,
};

Unary: i32 = {
    "-" <r:Unary> => -r,
    Term,
}

Term: i32 = {
    Num,
    "(" <Expr> ")",
};

Num: i32 = {
    r"[0-9]+" => i32::from_str(<>).unwrap(),
};

Stmt: i32 = {
    Expr
};

pub Stmts: Vec<i32> = {
    Stmt*
};
Part of the error (full error message):
/lalrpop-shift-repro/src/test.lalrpop:37:5: 37:8: Local ambiguity detected
The problem arises after having observed the following symbols in the input:
Stmt+ Expr
At that point, if the next token is a `"-"`, then the parser can proceed in two different ways.
First, the parser could execute the production at
/lalrpop-shift-repro/src/test.lalrpop:37:5: 37:8, which would consume
the top 1 token(s) from the stack and produce a `Stmt`. This might then yield a parse tree like
  Expr ╷ Stmt
├─Stmt──┤    │
├─Stmt+─┘    │
└─Stmt+──────┘
Alternatively, the parser could shift the `"-"` token and later use it to construct a `Expr`. This might
then yield a parse tree like
Stmt+ Expr "-" Unary
│     ├─Expr───────┤
│     └─Stmt───────┤
└─Stmt+────────────┘
See the LALRPOP manual for advice on making your grammar LR(1).
The issue you're going to have to confront is how to deal with function calls. I can't really give you any concrete advice based on your question, because the grammar you provide lacks any indication of the intended syntax of function calls, but the hint that print(5) is a valid statement makes it clear that there are two distinct situations, which need to be handled separately.
Consider:
5 - 5        One statement           5 (- 5)        Two statements
print(-5)    One statement           print (- 5)    Two statements (presumably)
a (- 5)      ???
The ambiguity of the third expression could be resolved if the compiler knew whether a is a function or a variable (if we assume that functions are not first-class values, making print an invalid statement). But there aren't many ways that the parser could know that, and none of them seem very likely:
There might not be any user-defined functions. Then the lexer could be built to recognise identifier-like tokens which happen to be built-in functions (like print) and then a(-5) would be illegal since a is not a built-in function.
The names of functions and identifiers might differ in some way that the lexer can detect. For example, the language might require functions to start with a capital letter. I presume this is not the case since you wrote print rather than Print but there might be some other simple distinction, such as requiring identifiers to be a single character.
Functions must be declared as such before the first use of the function, and the parser shares the symbol table with the lexer. (I didn't search the rather inadequate documentation for the generator you're using to see if lexical feedback is practical.)
If there were an optional statement delimiter (as with Lua, for example), then you could simply require that statements which start with parentheses (usually a pretty rare case) be explicitly delimited unless they are the first statement in a block. Or there might be an optional keyword such as compute which can be used as an unambiguous statement starter and whose use is required for statements which start with a parenthesis. I presume that neither of these is the case here, since you could have used that to force 5 - 5 to be recognised as two statements (5; -5 or 5 compute - 5.)
Another unlikely possibility, again based on the print(5) example, is that function calls use a different bracket than expression grouping. In that case, a[5] (for example) would be a function call and a(5) would unambiguously be two statements.
Since I don't know the precise requirements here, I'll show a grammar (in yacc/bison syntax, although it should be easy enough to translate it) which attempts to illustrate a representative sample. It implements one statement (return) in addition to expression statements, and expressions include multiplication, subtraction, negation and single argument function calls. To force "greedy" expressions, it prohibits certain statement sequences:
statements starting with a unary operator
statements starting with an open parenthesis if the previous statement ends with an identifier. (This effectively requires that the function to be applied in a call expression be a simple identifier. Without that restriction, it becomes close to impossible to distinguish two consecutive parenthesized expressions from a single function call expression, and you then need some other way to disambiguate.)
Those rules are easy to state, but the actual implementation is annoyingly repetitive because it requires various different kinds of expressions, depending on what the first and last token in the expression is, and possibly different kinds of statements, if you have statements which might end with an expression. (return x, for example.) The formalism used by ECMAScript would be useful here, but I suspect that your parser-generator doesn't implement it -- although it's possible that its macro facility could be used to that effect, if it came with something resembling documentation. Without that, there is a lot of duplication.
In a vague attempt to generate the grammar, I used the following suffixes:
_un / _pr / _oth: starts with unary / parenthesis / other token
_id / _nid: ends / does not end with an id
The absence of a suffix is used for the union of different possibilities. There are probably more unit productions than necessary. It has not been thoroughly debugged, but it worked on a few test cases (see below):
program  : block

block_id : stmt_id
         | block_id stmt_oth_id
         | block_nid stmt_pr_id
         | block_nid stmt_oth_id
block_nid: stmt_nid
         | block_id stmt_oth_nid
         | block_nid stmt_pr_nid
         | block_nid stmt_oth_nid
block    : %empty
         | block_id | block_nid

stmt_un_id  : expr_un_id
stmt_un_nid : expr_un_nid
stmt_pr_id  : expr_pr_id
stmt_pr_nid : expr_pr_nid
stmt_oth_id : expr_oth_id
            | return_id
stmt_oth_nid: expr_oth_nid
            | return_nid
stmt_id     : stmt_un_id | stmt_pr_id | stmt_oth_id
stmt_nid    : stmt_un_nid | stmt_pr_nid | stmt_oth_nid

return_id   : "return" expr_id
return_nid  : "return" expr_nid

expr_un_id  : sum_un_id
expr_un_nid : sum_un_nid
expr_pr_id  : sum_pr_id
expr_pr_nid : sum_pr_nid
expr_oth_id : sum_oth_id
expr_oth_nid: sum_oth_nid
expr_id     : expr_un_id | expr_pr_id | expr_oth_id
expr_nid    : expr_un_nid | expr_pr_nid | expr_oth_nid
expr        : expr_id | expr_nid

sum_un_id   : mul_un_id
            | sum_un '-' mul_id
sum_un_nid  : mul_un_nid
            | sum_un '-' mul_nid
sum_un      : sum_un_id | sum_un_nid
sum_pr_id   : mul_pr_id
            | sum_pr '-' mul_id
sum_pr_nid  : mul_pr_nid
            | sum_pr '-' mul_nid
sum_pr      : sum_pr_id | sum_pr_nid
sum_oth_id  : mul_oth_id
            | sum_oth '-' mul_id
sum_oth_nid : mul_oth_nid
            | sum_oth '-' mul_nid
sum_oth     : sum_oth_id | sum_oth_nid

mul_un_id   : unary_un_id
            | mul_un '*' unary_id
mul_un_nid  : unary_un_nid
            | mul_un '*' unary_nid
mul_un      : mul_un_id | mul_un_nid
mul_pr_id   : mul_pr '*' unary_id
mul_pr_nid  : unary_pr_nid
            | mul_pr '*' unary_nid
mul_pr      : mul_pr_id | mul_pr_nid
mul_oth_id  : unary_oth_id
            | mul_oth '*' unary_id
mul_oth_nid : unary_oth_nid
            | mul_oth '*' unary_nid
mul_oth     : mul_oth_id | mul_oth_nid
mul_id      : mul_un_id | mul_pr_id | mul_oth_id
mul_nid     : mul_un_nid | mul_pr_nid | mul_oth_nid

unary_un_id  : '-' unary_id
unary_un_nid : '-' unary_nid
unary_pr_nid : term_pr_nid
unary_oth_id : term_oth_id
unary_oth_nid: term_oth_nid
unary_id     : unary_un_id | unary_oth_id
unary_nid    : unary_un_nid | unary_pr_nid | unary_oth_nid

term_oth_id  : IDENT
term_oth_nid : NUMBER
             | IDENT '(' expr ')'
term_pr_nid  : '(' expr ')'
Here's a little test:
> 5-5
{ [- 5 5] }
> 5(-5)
{ 5; [~ -- 5] }
> a-5
{ [- a 5] }
> a(5)
{ [CALL a 5] }
> -7*a
{ [* [~ -- 7] a] }
> a*-7
{ [* a [~ -- 7]] }
> a-b*c
{ [- a [* b c]] }
> a*b-c
{ [- [* a b] c] }
> a*b(3)-c
{ [- [* a [CALL b 3]] c] }
> a*b-c(3)
{ [- [* a b] [CALL c 3]] }
> a*b-7(3)
{ [- [* a b] 7]; 3 }
I'm reading this tutorial on context free grammars in Prolog, and they mention at the bottom of the page implementing a context free grammar in Prolog using difference lists, with the following code block included:
s(X,Z):- np(X,Y), vp(Y,Z).
np(X,Z):- det(X,Y), n(Y,Z).
vp(X,Z):- v(X,Y), np(Y,Z).
vp(X,Z):- v(X,Z).
det([the|W],W).
det([a|W],W).
n([woman|W],W).
n([man|W],W).
v([shoots|W],W).
It mentions:
Consider these rules carefully. For example, the s rule says: I know that the pair of lists X and Z represents a sentence if (1) I can consume X and leave behind a Y, and the pair X and Y represents a noun phrase, and (2) I can then go on to consume Y leaving Z behind, and the pair Y and Z represents a verb phrase. The np rule and the second of the vp rules work similarly.
And
Think of the first list as what needs to be consumed (or if you prefer: the input list), and the second list as what we should leave behind (or: the output list). Viewed from this (rather procedural) perspective the difference list
[a,woman,shoots,a,man] [].
represents the sentence a woman shoots a man because it says: If I consume all the symbols on the left, and leave behind the symbols on the right, then I have the sentence I am interested in. That is, the sentence we are interested in is the difference between the contents of these two lists.
That's all we need to know about difference lists to rewrite our recogniser. If we simply bear in mind the idea of consuming something and leaving something behind, we obtain the following recogniser:
That's given as an explanation, but it just isn't doing anything to clarify this code to me. I understand it's more efficient than using append/3, but as for the notion of consuming things and leaving others behind, it seems so much more confusing than the previous append/3 implementation, and I just can't make heads or tails of it. Furthermore, when I copy and paste that code into a Prolog visualizer to try to understand it, the visualizer reports errors. Could anyone shed some light on this?
Listing as facts
Let's try to explain this with a counterexample. We specify the nouns, verbs, etc. with simple facts:
det(the).
det(a).
n(woman).
n(man).
v(shoots).
Now we can implement a noun phrase np as:
np([X,Y]) :-
    det(X),
    n(Y).
In other words, we say "a noun phrase is a sentence of two words, the first being a det, the second an n". And this will work: if we query np([a,woman]), it will succeed, etc.
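For example, querying against the facts above:

?- np([a,woman]).
true.

?- np([a,shoots]).
false.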
But now we need to do something more advanced: define the verb phrase. There are two possible verb phrases, the one with a verb and a noun phrase that was originally defined as:
vp(X,Z):- v(X,Y),np(Y,Z).
We could define it as:
vp([X|Y]) :-
    v(X),
    np(Y).
And the one with only a single verb:
vp(X,Z):- v(X,Z).
That can be converted into:
vp([X]) :-
    v(X).
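Both variants then work on word lists of the appropriate length; for example:

?- vp([shoots,a,man]).
true.

?- vp([shoots]).
true.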
The guessing problem
The problem however is that the two variants have a different number of words: there are verb phrases with one word and verb phrases with three words. That's not really a problem yet, but now say (I know this is not correct English) there exists a sentence defined as a vp followed by an np, so this would be:
s(X,Z):- vp(X,Y), np(Y,Z).
in the original grammar.
The problem is that if we want to transform this into our new way of representing it, we need to know how much vp will consume (how many words will be eaten by vp). We cannot know this in advance: since at this point we don't know much about the sentence, we cannot guess whether vp will eat one word or three.
We might of course guess the number of words with:
s([X|Y]) :-
    vp([X]),
    np(Y).

s([X,Y,Z|T]) :-
    vp([X,Y,Z]),
    np(T).
But I hope you can imagine that if you defined verb phrases with 1, 3, 5 and 7 words, things would get problematic. Another way to solve this is to leave it to Prolog:
s(S) :-
    append(VP, NP, S),
    vp(VP),
    np(NP).
Now Prolog will first guess how to subdivide the sentence into two parts and then try to match each part. But the problem is that for a sentence with n words, there are n+1 breakpoints.
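You can watch Prolog enumerate these splits directly, for the example sentence used below:

?- append(VP, NP, [shoots,the,man,the,woman]).
VP = [], NP = [shoots, the, man, the, woman] ;
VP = [shoots], NP = [the, man, the, woman] ;
VP = [shoots, the], NP = [man, the, woman] ;
...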
So Prolog will for instance first split it as:
VP=[],NP=[shoots,the,man,the,woman]
(remember we swapped the order of the verb phrase and noun phrase). Evidently vp will not be very happy if it gets an empty list, so that will be rejected easily. But next it says:
VP=[shoots],NP=[the,man,the,woman]
Now vp is happy with just shoots, but it will require some computational effort to realize that. np is however not amused with this long part. So Prolog backtracks again:
VP=[shoots,the],NP=[man,the,woman]
Now vp will complain again that it has been given too many words. Finally Prolog will split it correctly with:
VP=[shoots,the,man],NP=[the,woman]
The point is that this requires a large number of guesses, and for each of those guesses vp and np have to do work as well. For a really complicated grammar, vp and np might split the sentence further, resulting in a huge amount of trial and error.
The real reason is that append/3 has no "semantic" clue about how to split the sentence, so it tries all possibilities. One is, however, more interested in an approach where vp itself can indicate what share of the sentence it actually wants.
Furthermore, if you have to split the sentence into 3 parts, the number of ways to do this increases to O(n^2), and so on. So guessing will not do the trick.
You might also try to generate verb phrases first, and then hope that one of them matches:
s(S) :-
    vp(VP),
    append(VP, NP, S),
    np(NP).
But in that case the number of guessed verb phrases will blow up exponentially. Of course you can give "hints" etc. to speed up the process, but it will still take some time.
The solution
What you want to do is to provide the part of the sentence to every predicate such that a predicate looks like:
predicate(Subsentence,Remaining)
Subsentence is the list of words starting at the point where that predicate must begin parsing. For instance, for a noun phrase it might look like [the,woman,shoots,the,man]. Every predicate consumes the words it is interested in: the words up to a certain point. In this case, the noun phrase is only interested in ['the','woman'], because that is a noun phrase. To allow the remaining parsing, it returns the remaining part [shoots,the,man], in the hope that some other predicate can consume the remainder of the sentence.
For our table of facts that is easy:
det([the|W],W).
det([a|W],W).
n([woman|W],W).
n([man|W],W).
v([shoots|W],W).
It thus means that if you have a sentence [the,woman,shoots,...] and you ask det/2 whether it starts with a determiner, it will say: "yes, the is a determiner, but the remaining part [woman,shoots,...] is not part of the determiner; please match it with something else."
This matching is done because a list is represented as a linked list. [the,woman,shoots,...], is actually represented as [the|[woman|[shoots|...]]] (so it points to the next "sublist"). If you match:
[the|[woman|[shoots|...]]]
det([the|W] ,W)
It will unify [woman|[shoots|...]] with W and thus result in:
det([the|[woman|[shoots|...]]],[woman|[shoots|...]]).
It thus returns the remaining list, having consumed the the part.
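Concretely, with the facts above:

?- det([the,woman,shoots,the,man], W).
W = [woman, shoots, the, man].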
Higher level predicates
Now in case we define the noun phrase:
np(X,Z):- det(X,Y), n(Y,Z).
And we again call it with [the,woman,shoots,...]: X is unified with that list. np/2 first calls det/2, which consumes the without any need for backtracking. Next Y is equal to [woman,shoots,...]; now n/2 consumes woman and returns [shoots,...]. This is also the result np returns, and another predicate will have to consume it.
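Again, as a query:

?- np([the,woman,shoots,the,man], Rest).
Rest = [shoots, the, man].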
Usefulness
Say you introduce your name as an additional noun:
n([doug,smith|W],W).
(sorry for using lowercase, but otherwise Prolog sees these as variables).
It will simply try to match the first two words with doug and smith, and if that succeeds, try to match the rest of the sentence. So one can parse a sentence like [the,doug,smith,shoots,the,woman] (sorry about the the; in a fuller English grammar some noun phrases map to a noun directly, np(X,Y) :- n(X,Y), so the article could be dropped).
Guessing completely eliminated?
Is the guessing completely eliminated? No. It is still possible that there is overlap in consuming. For instance you might add:
n([doug|W],W).
n([doug,smith|W],W).
In that case, if you query [the,doug,smith,shoots,the,woman], it will first consume/eat the in det. Next it will look for a noun to consume in [doug,smith,...]. There are two candidates. Prolog will first try to eat only doug and match [smith,shoots,...] as an entire verb phrase, but since smith is not a verb, it will backtrack, reconsider, and decide to eat both doug and smith as a noun instead.
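You can see both candidate remainders on backtracking:

?- n([doug,smith,shoots,the,woman], Rest).
Rest = [smith, shoots, the, woman] ;
Rest = [shoots, the, woman].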
The point is that when using append, Prolog would have to backtrack as well.
Conclusion
By using difference lists, a predicate can eat an arbitrary number of words and return the remainder, so that other sentence parts like the verb phrase can aim to consume that remainder. Furthermore, the list is always fully grounded, so one definitely does not brute-force all kinds of verb phrases first.
This answer is a follow-up to the answer by @mat. Let's proceed step-by-step:
We start with the code shown in @mat's answer:
s --> np, vp.
np --> det, n.
vp --> v, np.
vp --> v.
det --> [the].
det --> [a].
n --> [woman].
n --> [man].
v --> [shoots].
So far, we have used (',')/2: (A,B) means sequence A followed by sequence B.
Next, we will also use ('|')/2: (A|B) means sequence A or sequence B.
For information on control-constructs in grammar bodies read this manual section.
s --> np, vp.
np --> det, n.
vp --> v, np | v.
det --> [the] | [a].
n --> [woman] | [man].
v --> [shoots].
We can make the code more concise by "inlining" a little:
s --> np, vp.
np --> ([the] | [a]), ([woman] | [man]).
vp --> v,np | v.
v --> [shoots].
How about some more inlining, as suggested by @mat in his comment...
s --> np, [shoots], (np | []).
np --> ([the] | [a]), ([woman] | [man]).
Take it to the max! (The following could be written as a one-liner.)
?- phrase((   ( [the]
              | [a] ),
              ( [woman]
              | [man] ),
              [shoots],
              (   ( [the]
                  | [a] ),
                  ( [woman]
                  | [man] )
              |   []
              )),
          Ws).
Different formulations all have their up- and down-sides; e.g., very compact code is harder to extend, but may be required when space is scarce, such as when putting code on presentation slides.
Sample query:
?- phrase(s,Ws).
Ws = [the, woman, shoots, the, woman]
; Ws = [the, woman, shoots, the, man]
; Ws = [the, woman, shoots, a, woman]
; Ws = [the, woman, shoots, a, man]
; Ws = [the, woman, shoots]
; Ws = [the, man, shoots, the, woman]
; Ws = [the, man, shoots, the, man]
; Ws = [the, man, shoots, a, woman]
; Ws = [the, man, shoots, a, man]
; Ws = [the, man, shoots]
; Ws = [a, woman, shoots, the, woman]
; Ws = [a, woman, shoots, the, man]
; Ws = [a, woman, shoots, a, woman]
; Ws = [a, woman, shoots, a, man]
; Ws = [a, woman, shoots]
; Ws = [a, man, shoots, the, woman]
; Ws = [a, man, shoots, the, man]
; Ws = [a, man, shoots, a, woman]
; Ws = [a, man, shoots, a, man]
; Ws = [a, man, shoots]. % 20 solutions
I know exactly what you mean. It is quite hard at first to think in terms of list differences. The good news is that you usually do not have to do this.
There is a built-in formalism called Definite Clause Grammars (DCGs) which make it completely unnecessary to manually encode list differences in cases like yours.
In your example, I recommend you simply write this as:
s --> np, vp.
np --> det, n.
vp --> v, np.
vp --> v.
det --> [the].
det --> [a].
n --> [woman].
n --> [man].
v --> [shoots].
and accept the fact that within DCGs, [T] represents the terminal T, and (',')/2 is read as "and then". That's how to describe lists with DCGs.
You can already use this to ask for all solutions, using the phrase/2 interface to DCGs:
?- phrase(s, Ls).
Ls = [the, woman, shoots, the, woman] ;
Ls = [the, woman, shoots, the, man] ;
Ls = [the, woman, shoots, a, woman] ;
etc.
It is hard to understand this in procedural terms at first, so I suggest that you do not try to. Instead, focus on a declarative reading of the grammar and see that it describes lists.
Later, you can go into more implementation details. Start with the way simple terminals are translated, and then move on to more advanced grammar constructs: rules containing a single terminal, then rules containing a terminal and a nonterminal, etc.
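To see how this connects to the difference lists discussed above: most Prolog systems expand DCG rules roughly like the following (the exact expansion varies between systems, so treat this as a sketch):

% det --> [the].   expands roughly to:
det([the|W], W).

% np --> det, n.   expands roughly to:
np(S0, S) :-
    det(S0, S1),
    n(S1, S).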
Difference lists work like this (a layman's explanation).
Consider using append to join two trains X and Y:
X = {1}[2][3][4]    Y = {a}[b][c]
{ } - represents the compartment having the engine or head.
[ ] - represents compartments or elements in the tail. Assume that we can remove the engine from one compartment and put it into another.
Append proceeds like this:
The new train Z starts as Y, i.e., {a}[b][c].
Next, the engine is removed from Z's head and put into the last compartment removed from X, and the new train Z is:
Z = {4}[a][b][c]
The same process is repeated
Z = {3}[4][a][b][c]
Z = {2}[3][4][a][b][c]
Z = {1}[2][3][4][a][b][c]
our new long train.
Introducing difference lists is like having a tow hook at the end of X that can readily fasten to Y.
The final hook is discarded.
n([man|W],W).
W is the hook here; unification of W with the head of the successor list is the fastening process.
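Here is a minimal sketch of the hook idea as code; the dl_append/3 name and the List-Hook pair notation are illustrative conventions of mine, not something from the answers above:

% Fastening two difference lists is a single unification, O(1):
% the hook of the first train is unified with the head of the second.
dl_append(X-HX, HX-HY, X-HY).

?- dl_append([1,2,3,4|H1]-H1, [a,b,c|H2]-H2, Z-[]).
H1 = [a, b, c],
H2 = [],
Z = [1, 2, 3, 4, a, b, c].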
Hi, I am trying to implement a parser for a simple language with a grammar like this:
program ::= "program" declarations "begin" statements "end"
declaration ::= "var" ident "as" type
type ::= "string" | "int"
I have the first two done, how would I write the type grammar?
program( prog( DECLS, STATS ) ) -->
    [ 'program' ], declarations( DECLS ),
    [ 'begin' ], statements( STATS ), [ 'end' ].

declaration( decl( IDENT, TYPE ) ) -->
    [ 'var' ], ident( IDENT ), [ 'as' ], type( TYPE ).
Your grammar might be underspecified. In fact, you do not define how keywords are separated from other tokens like identifiers. There are programming languages where you do not need to separate keywords from identifiers, and there are other programming languages where some whitespace or layout character is needed.
In your case, is "varaasint" a valid declaration? Your grammar suggests it is. Or do you have to write "var a as int"?
You might want to look into this answer for more.
type(string) --> ['string'].
type(int) --> ['int'].
(actually the quotes ' are not required)
You could use | or ; but that would complicate the way you return the type you found.
You're missing the statements rule!
Anyway, DCG rules are just syntactic sugar over plain Prolog, so you can use any Prolog feature you like. If you need to keep the grammar compact:
type(type(T)) --> [T], {memberchk(T, [int, string])}.
Braces allow mixing generic Prolog with grammar rules.
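For instance, given a tokenized input list:

?- phrase(type(T), [int]).
T = type(int).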
As @false noted, your grammar is useful only if you have a tokenizer that splits your input and discards whitespace. Or you could handle that more directly, using this schema (beware, untested code):
% Note: for "..." literals to act as lists of character codes,
% as this schema assumes, SWI-Prolog may need:
% :- set_prolog_flag(double_quotes, codes).

program( prog( DECLS, STATS ) ) -->
    s, "program", s, declarations( DECLS ),
    s, "begin", s, statements( STATS ), s, "end", s.

declaration( decl( IDENT, TYPE ) ) -->
    "var", s, ident( IDENT ), s, "as", s, type( TYPE ).

declarations([D|Ds]) --> declaration(D), declarations(Ds).
declarations([]) --> [].

type(type(int)) --> "int".
type(type(string)) --> "string".

% skip 1 or more whitespace characters
s --> (" " ; "\n"), (s ; []).
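A quick check of just the type//1 rules (assuming the double_quotes flag mentioned in the comment above is set to codes):

?- phrase(type(T), "int").
T = type(int).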