I am trying to write a small compiler for a language that handles lambda calculus. Here is the ambiguous definition of the language that I've found:
E → ^ v . E | E E | ( E ) | v
The symbols ^, ., (, ) and v are tokens. ^ represents lambda and v represents a variable.
An expression of the form ^v.E is a function definition where v is the formal parameter of the function and E is its body. If f and g are lambda expressions, then the lambda expression fg represents the application of the function f to the argument g.
I'm trying to write an unambiguous grammar for this language, under the assumption that function application is left associative, e.g., fgh = (fg)h, and that function application binds tighter than ., e.g., (^x. ^y. xy) ^z.z = (^x. (^y. xy)) ^z.z
Here is what I have so far, but I'm not sure if it's correct:
E -> ^v.E | T
T -> vF | (E) E
F -> v | epsilon
Could someone help out?
Between reading your question and comments, you seem to be looking more for help with learning and implementing lambda calculus than just the specific question you asked here. If so then I am on the same path so I will share some useful info.
The best book I have, which is not to say the best book possible, is Types and Programming Languages (WorldCat) by Benjamin C. Pierce. I know the title doesn't sound anything like lambda calculus, but take a look at "λ-Calculus extensions: meaning of extension symbols", which lists many of the lambda calculi that come from the book. There is code for the book in OCaml and F#.
Try searching in CiteSeerX for research papers on lambda calculus to learn more.
The best λ-Calculus evaluator I have found so far is:
Lambda calculus reduction workbench with info here.
Also, I find that you get much better answers for lambda calculus questions related to programming at CS:StackExchange and for math-related questions at Math:StackExchange.
As for programming languages to implement lambda calculus, you will probably need to learn a functional language if you haven't already. Yes, it's a different beast, but the enlightenment on the other side of the mountain is spectacular. Most of the source code I find uses a functional language such as ML or OCaml, and once you learn one, the rest get easier to learn.
To be more specific, here is the source code for the untyped lambda calculus project, here is the input file to an F# variation of YACC which from reading your previous questions seems to be in your world of knowledge, and here is sample input.
Since the grammar is for implementing a REPL, it starts with toplevel (think command prompt) and accepts multiple commands, which in this case are lambda calculus expressions. Since this grammar is used for many calculi, it has parts that are placeholders in the earlier examples; binding here is one such placeholder.
Finally we get to the part you are after
Note LCID is Lower Case Identifier
Term : AppTerm
| LAMBDA LCID DOT Term
| LAMBDA USCORE DOT Term
AppTerm : ATerm
| AppTerm ATerm
/* Atomic terms are ones that never require extra parentheses */
ATerm : LPAREN Term RPAREN
| LCID
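A minimal recursive-descent sketch of the three rules above, written in Python rather than the book's OCaml/F#. It assumes single-letter variables and the question's ^ and . tokens (so xy means x applied to y); the tuple-shaped tree and the function names are made up for illustration. The left recursion in AppTerm is turned into a loop, which is what makes application left associative.

```python
def parse(src):
    """Parse Term / AppTerm / ATerm; returns nested tuples (hypothetical AST shape)."""
    tokens = [ch for ch in src if not ch.isspace()]
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at {pos}, found {peek()!r}")
        pos += 1

    def variable():
        nonlocal pos
        tok = peek()
        if tok is None or not tok.isalpha():
            raise SyntaxError(f"expected a variable at {pos}, found {tok!r}")
        pos += 1
        return tok

    def term():
        # Term : AppTerm | LAMBDA LCID DOT Term
        if peek() == "^":
            expect("^")
            name = variable()
            expect(".")
            return ("lam", name, term())  # body extends as far right as possible
        return app_term()

    def app_term():
        # AppTerm : ATerm | AppTerm ATerm   (left recursion written as a loop)
        node = a_term()
        while peek() is not None and (peek() == "(" or peek().isalpha()):
            node = ("app", node, a_term())
        return node

    def a_term():
        # ATerm : LPAREN Term RPAREN | LCID
        if peek() == "(":
            expect("(")
            node = term()
            expect(")")
            return node
        return ("var", variable())

    tree = term()
    if peek() is not None:
        raise SyntaxError(f"unexpected {peek()!r} at {pos}")
    return tree
```

As in the grammar itself, a lambda used as an argument has to be parenthesized: x ^y.y is rejected, while x (^y.y) parses.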
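You can prove that a particular grammar is ambiguous by exhibiting a single sentence that has two distinct derivations, but deciding whether an arbitrary context-free grammar is unambiguous is undecidable in general. Enumerating sentences and checking that each has only one derivation can turn up a witness of ambiguity, but for an infinite language it can never finish confirming that the grammar is unambiguous.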
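Showing that a grammar is ambiguous only takes one witness sentence with more than one parse tree. As a rough sketch (not a general tool), the following Python counts the parse trees that the original ambiguous grammar E → ^ v . E | E E | ( E ) | v assigns to a token string; the function name and the one-character-per-token encoding are assumptions made for the example.

```python
from functools import lru_cache


def count_parses(sentence):
    """Count parse trees of `sentence` under E -> ^ v . E | E E | ( E ) | v."""
    tokens = tuple(sentence.replace(" ", ""))

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of distinct ways tokens[i:j] can be derived from E.
        n = j - i
        total = 0
        if n == 1 and tokens[i] == "v":                      # E -> v
            total += 1
        if n >= 4 and tokens[i:i + 3] == ("^", "v", "."):    # E -> ^ v . E
            total += count(i + 3, j)
        if n >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":  # E -> ( E )
            total += count(i + 1, j - 1)
        for k in range(i + 1, j):                            # E -> E E
            total += count(i, k) * count(k, j)
        return total

    return count(0, len(tokens))
```

For instance, count_parses("vvv") and count_parses("^v.vv") both give 2, which corresponds to the two choices the question wants to rule out: the grouping of applications, and whether a lambda body swallows a following application.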
Despite my limited knowledge about compiling/parsing I dared to build a small recursive-descent parser for OData $filter expressions. The parser only needs to check the expression for correctness and output a corresponding condition in SQL. As input and output have almost the same tokens and structure, this was fairly straightforward, and my implementation does 90% of what I want.
But now I got stuck with parentheses, which appear in separate rules for logical and arithmetic expressions. The full OData grammar in ABNF is here, a condensed version of the rules involved is this:
boolCommonExpr = ( boolMethodCallExpr
/ notExpr
/ commonExpr [ eqExpr / neExpr / ltExpr / ... ]
/ boolParenExpr
) [ andExpr / orExpr ]
commonExpr = ( primitiveLiteral
/ firstMemberExpr ; = identifier
/ methodCallExpr
/ parenExpr
) [ addExpr / subExpr / mulExpr / divExpr / modExpr ]
boolParenExpr = "(" boolCommonExpr ")"
parenExpr = "(" commonExpr ")"
How does this grammar match a simple expression like (1 eq 2)? From what I can see all ( are consumed by the rule parenExpr inside commonExpr, i.e. they must also close after commonExpr to not cause an error and boolParenExpr never gets hit. I suppose my experience / intuition on reading such a grammar is just insufficient to get it. A comment in the ABNF says: "Note that boolCommonExpr is also a commonExpr". Maybe that's part of the mystery?
Obviously an opening ( alone won't tell me where it's going to close: After the current commonExpr expression or further away after boolCommonExpr. My lexer has a list of all tokens ahead (URL is very short input). I was thinking to use that to find out what type of ( I have. Good idea?
I'd rather have restrictions in input or a little hack than switching to a generally more powerful parser model. For a simple expression translation like this I also want to avoid compiler tools.
Edit 1: Extension after answer by rici - Is grammar rewrite correct?
Actually I started out with the example for recursive-descent parsers given on Wikipedia. Then I thought it better to adapt to the official grammar given by the OData standard to be more "conformant". But with the advice from rici (and the comment from "Internal Server Error") to rewrite the grammar, I would tend to go back to the more comprehensible structure provided on Wikipedia.
Adapted to the boolean expression for the OData $filter this could maybe look like this:
boolSequence = boolExpr {("and"|"or") boolExpr} .
boolExpr = ["not"] expression ("eq"|"ne"|"lt"|"gt"|"ge"|"le") expression .
expression = term {("add"|"sub") term} .
term = factor {("mul"|"div"|"mod") factor} .
factor = IDENT | methodCall | LITERAL | "(" boolSequence ")" .
methodCall = METHODNAME "(" [ expression {"," expression} ] ")" .
Does the above make sense in general for boolean expressions, is it mostly equivalent to the original structure above and digestible for a recursive descent parser?
@rici: Thanks for your detailed remarks on type checking. The new grammar should resolve your concerns about precedence in arithmetic expressions.
For all three terminals (UPPERCASE in the grammar above) my lexer supplies a type (string, number, datetime or boolean). Non-terminals return the type they produce. With this I managed quite nicely to do type checking on the fly in my current implementation, including decent error messages. Hopefully this will also work for the new grammar.
Edit 2: Return to original OData grammar
The differentiation between a "logical" and an "arithmetic" ( is not a trivial one. To solve the problem, even N. Wirth uses a dodgy workaround to keep the grammar of Pascal simple. As a consequence, in Pascal an extra pair of () is mandatory around and and or expressions. That is neither intuitive nor OData conformant :-(. The best read about the "() difficulty" I found is in Let's Build a Compiler (Part VI). Other languages seem to go to great lengths in the grammar to solve the problem. As I don't have experience with grammar construction, I stopped writing my own.
I ended up implementing the original OData grammar. Before I run the parser I go over all tokens backwards to figure out which ( belong to a logical/arithmetic expression. Not a problem for the potential length of a URL.
Personally, I'd just modify the grammar so that it has only one type of expression and therefore one type of parenthesis. I'm not convinced that the OData grammar is actually correct; it is certainly not usable in an LL(1) (or recursive descent) parser for exactly the reason you mention.
Specifically, if the goal is boolCommonExpr, there are two productions which can match the ( lookahead token:
boolCommonExpr = ( …
/ commonExpr [ eqExpr / neExpr / … ]
/ boolParenExpr
/ …
) …
commonExpr = ( …
/ parenExpr
/ …
) …
For the most part, this is a misguided attempt to make the grammar detect a type violation. (If in fact it is a type violation.) It's misguided because it is doomed to failure if there are boolean variables, which there apparently are in this environment. Since there is no syntactic clue as to the type of a variable, the parser is not capable of deciding whether particular expressions are well-formed or not, so there is a good argument for not trying at all, particularly if it creates parsing headaches. A better solution is to first parse the expression into an AST of some form, and then do another pass over the AST to check that each operator has operands of the correct type (possibly inserting explicit cast operators if that is necessary).
Aside from any other advantage, doing the type check in a separate pass lets you produce much better error messages. If you make (some) type violations syntax errors, then you may leave the user puzzled about why their expression was rejected; in contrast, if you notice that a comparison operation is being used as an operand to multiply (and if your language's semantics don't allow an automatic conversion from True/False to 1/0), then you can produce a well-targeted error message ("comparisons cannot be used as the operand of an arithmetic operator", for example).
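A separate type-checking pass might look roughly like the Python sketch below. The AST shape (tuples tagged "lit", "var" and "bin"), the two type names and the var_types mapping are all assumptions made for illustration; OData's real type system is richer than this.

```python
ARITHMETIC = {"add", "sub", "mul", "div", "mod"}
COMPARISON = {"eq", "ne", "lt", "le", "gt", "ge"}
LOGICAL = {"and", "or"}


def check(node, var_types):
    """Return "num" or "bool" for an AST node, raising TypeError on a mismatch."""
    kind = node[0]
    if kind == "lit":
        # bool is tested first because Python's bool is a subclass of int
        return "bool" if isinstance(node[1], bool) else "num"
    if kind == "var":
        return var_types[node[1]]
    # Otherwise a binary node: ("bin", operator, left, right)
    _, op, left, right = node
    left_type = check(left, var_types)
    right_type = check(right, var_types)
    if op in ARITHMETIC:
        if (left_type, right_type) != ("num", "num"):
            raise TypeError(
                f"operands of {op!r} must be numeric, got {left_type} and {right_type}")
        return "num"
    if op in COMPARISON:
        if left_type != right_type:
            raise TypeError(f"cannot compare {left_type} with {right_type} using {op!r}")
        return "bool"
    if op in LOGICAL:
        if (left_type, right_type) != ("bool", "bool"):
            raise TypeError(
                f"operands of {op!r} must be boolean, got {left_type} and {right_type}")
        return "bool"
    raise ValueError(f"unknown operator {op!r}")
```

Because the checker knows which operator it is looking at when it fails, the message can name the operator and the offending operand types, which a syntax error from the parser could not do.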
One possible reason to put different operators (but not parentheses) into different grammatical variables is to express grammatical precedence. That consideration might encourage you to rewrite the grammar with explicit precedence. (As written, the grammar assumes that all arithmetic operators have the same precedence, which would presumably lead to 2 + 3 * a being parsed as (2 + 3) * a, which might be a huge surprise.) Alternatively, you might use some simple precedence aware subparser for expressions.
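One common form of such a precedence-aware subparser is precedence climbing. The Python sketch below uses a single kind of expression and a single kind of parenthesis, as suggested above; the precedence table is an assumption for illustration (logical below comparison below additive below multiplicative), not a transcription of the OData specification, and operands are left as plain token strings.

```python
# Assumed binding strengths; larger numbers bind tighter.
PRECEDENCE = {
    "or": 1, "and": 2,
    "eq": 3, "ne": 3, "lt": 3, "le": 3, "gt": 3, "ge": 3,
    "add": 4, "sub": 4,
    "mul": 5, "div": 5, "mod": 5,
}


def parse(tokens):
    """Parse a token list into nested (operator, left, right) tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def primary():
        tok = advance()
        if tok == "(":
            node = expression(1)
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or tok == ")" or tok in PRECEDENCE:
            raise SyntaxError(f"expected an operand, found {tok!r}")
        return tok

    def expression(min_prec):
        left = primary()
        # Keep absorbing operators that bind at least as tightly as min_prec.
        while peek() in PRECEDENCE and PRECEDENCE[peek()] >= min_prec:
            op = advance()
            # +1 makes every operator left associative
            right = expression(PRECEDENCE[op] + 1)
            left = (op, left, right)
        return left

    tree = expression(1)
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

With this table, parse("2 add 3 mul a".split()) groups the multiplication first, and ( 1 eq 2 ) and b needs no special "boolean parenthesis" rule; whether a parenthesized comparison is allowed where it appears becomes a question for a later type check.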
If you want to test your ABNF grammar for determinism (i.e. LL(1)), you can use Tunnel Grammar Studio (TGS). I have tested the full grammar, and there are plenty of conflicts, not only in these parenthesized expressions. If you are able to extract the relevant rules, you can use the desktop version of TGS to visualize the conflicts (the online checker only produces a textual result). If there are not too many rules, the demo may help you to create an LL(1) grammar from them.
If you extract all the rules you need and add them to your question, I can run it for you and tell you whether it is LL(1). Note that the grammar is not exactly in ABNF meta syntax, because case-sensitive strings are written with '. ABNF (RFC 5234) is by definition case insensitive; RFC 7405 adds the %s and %i prefixes (sensitive and insensitive) before the actual string. The default (without a prefix) still means insensitive. This means that you have to replace these invalid '...' strings with %s"..." before testing in TGS.
TGS is a project I work on.
To determine whether my parser is working correctly, I need to find an LR(2+) grammar. After some quick research I found this grammar, and I believe that it is LR(2). However, I am not sure how to determine this.
Terminals: b, e, o, r, s
NonTerminals: A, B, E, P, SL
Start: P
Productions:
P -> A
A -> E B SL E | b e
B -> b | o r
E -> e | Ɛ
SL -> s SL | s
I would be glad if someone could confirm or deny that this grammar is LR(2) and, ideally, give me a brief explanation of how to determine it myself.
Thank you very much!
I'm pretty sure it's LR(2), but I don't have an LR(2) parser generator handy to test it, which would be the definitive way to do the test. Of course, you could generate the parser tables by hand. It's not that complicated a grammar, so it shouldn't take you too long.
It's certainly not LR(1), as can be seen from the pair of inputs:
b e
b s e
The left-most derivations are:
P->A->b e
P->A->E B SL E->B SL E->b SL E->b s E->b s e
So at the beginning of the parse, the parser can either shift a b in order to follow the first derivation chain or reduce an empty sequence to E in order to proceed with the second derivation chain. The second token is needed to choose between these two options, hence a lookahead of at least 2 is required.
As a side note, it should be pretty simple to mine StackOverflow for LR(2) grammars; they come up from time to time in questions. Here are a few I found by searching for LALR(2). (I used a Google search with site:stackoverflow.com because SO's own search engine doesn't do well with search patterns which aren't words. Not that Google does it well, but it does do it better.)
Solving bison conflict over 2nd lookahead
Solving small shift reduce conflict
Persistent Shift - Reduce Conflict in Goldparser
How to reduce parser stack or 'unshift' the current token depending on what follows?
I didn't verify the claims in those questions and answers, and there are other questions which didn't seem to have as clear a result.
The most classic LALR(2) grammar is the grammar for Yacc itself, which is pretty ironic. Here's a simplified version:
grammar: %empty | grammar production
production: ID ':' symbols
symbols: %empty | symbols symbol
symbol: ID | QUOTED_LITERAL
That simple grammar leaves out actions and the optional semicolon. But it captures the essence of the LALR(2)-ness of the grammar, which is precisely the result of the semicolon being optional. That's not a complaint; the grammar is unambiguous so the semicolon really is redundant and no-one should be forced to type a redundant token :-)
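To make the two-token decision concrete, here is a small hand-written Python sketch for the simplified grammar above. It assumes the input is already split into tokens (identifiers, quoted literals and ':'); the function name and the returned list of (name, symbols) pairs are made up for the example. The key line is the loop condition: an identifier belongs to the current right-hand side unless the token after it is ':', in which case it starts the next production.

```python
def parse_grammar(tokens):
    """Split a semicolon-free rule list into (name, symbols) pairs."""
    productions = []
    i = 0
    while i < len(tokens):
        name = tokens[i]
        if name == ":" or name.startswith("'"):
            raise SyntaxError(f"expected a rule name, found {name!r}")
        if i + 1 >= len(tokens) or tokens[i + 1] != ":":
            raise SyntaxError(f"expected ':' after {name!r}")
        i += 2
        symbols = []
        # Two-token lookahead: stop when the token AFTER the current one is ':'.
        while i < len(tokens) and not (i + 1 < len(tokens) and tokens[i + 1] == ":"):
            if tokens[i] == ":":
                raise SyntaxError("unexpected ':' inside a right-hand side")
            symbols.append(tokens[i])
            i += 1
        productions.append((name, symbols))
    return productions
```

A parser limited to one token of lookahead cannot make that choice when it sees the identifier, which is the LALR(2)-ness described above.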
Goal: find a way to formally define a grammar that recognizes elements from a set 0 or 1 times in any order. Subsequently, I want to parse it and generate an AST as well.
For example: Say the set of valid strings in my language is {A, B, C}. I want to define a grammar that recognizes all valid permutations of any number of those elements.
Syntactically valid strings would include:
(the empty string)
A,
B A, and
C A B
Syntactically invalid strings would include:
A A, and
B A C B
To be clear, defining all possible permutations explicitly in a CFG is unacceptable for my purposes, since larger sets would be impossible to maintain.
From what I understand, such a language fails the pumping lemma for context free languages, so the solution will not be context free or regular.
Update
What I'm after is called a "permutation language", which Benedek Nagy has done some theoretical work on as an extension to context free languages.
Regarding a parser generator, I've only found talk of implementing parsers with a permutation phase (link). Parsers evidently have an exponential lower bound on the size of resulting CFG, and I haven't found any parser generators that support it anyhow.
A sort-of solution to this problem was written in ANTLR. It uses semantic predicates to 'code around' the issue.
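The idea behind the predicate approach can be sketched outside ANTLR as well: accept the elements with an ordinary loop, and keep a record of what has already been seen so that a repeat is rejected at the point where it occurs. The Python below is only an illustration of that idea; parse_options and its allowed parameter are hypothetical names, and the returned list stands in for whatever AST node you would build.

```python
def parse_options(tokens, allowed=("A", "B", "C")):
    """Accept each allowed token at most once, in any order."""
    seen = []
    for tok in tokens:
        if tok not in allowed:
            raise SyntaxError(f"unexpected token {tok!r}")
        if tok in seen:
            # Plays the role of the semantic predicate: a side condition on
            # parser state rather than something expressed in the grammar.
            raise SyntaxError(f"{tok!r} may appear at most once")
        seen.append(tok)
    return seen  # elements in the order they appeared
```

The underlying rule stays a plain repetition over the alternatives, so its size grows linearly with the set; the "at most once" constraint lives in the check rather than in the productions.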
Assuming that the set of alternative strings is fixed and known in advance, say of size n, one can come up with a (non context-free) grammar of size O(n!). This is not asymptotically smaller than enumerating all permutations, so I suppose it cannot be considered a good solution. I believe that this grammar can be reformulated as a context-sensitive grammar (although in the form I'm suggesting below it is not).
For the example {a, b, c} mentioned in the question, one such grammar is the following. I'm using lower case letters for terminal symbols and upper case letters for non-terminals, as is customary. S is the initial non-terminal symbol.
S ::= XabcY
XabcY ::= aXbcY | bXacY | cXabY
XabY ::= ab | ba
XacY ::= ac | ca
XbcY ::= bc | cb
Non-terminals X and Y enclose the substring in the production which has not been finalized yet; this substring will eventually be replaced by a permutation of the terminals that are given between X and Y (in some arbitrary order).
I am studying grammars in Prolog and I have a small doubt about the conversion from classic BNF grammars to Prolog's DCG form.
For example I have the following BNF grammar:
<s> ::= a b
<s> ::= a <s> b
that, by rewriting, generates all strings of type:
ab
aabb
aaabbb
aaaabbbb
.....
.....
a^n b^n
In Ivan Bratko's book Programming for Artificial Intelligence, he converts this BNF grammar into a DCG in this way:
s --> [a],[b].
s --> [a],s,[b].
At first glance this seems very similar to the classic BNF form, but I have a doubt about the , symbol used in the DCG.
It is not Prolog's symbol for a logical connective here; it is only a separator between the elements of the generated sequence.
Is it right?
You can read the , in DCGs as and then or concatenated with:
s -->
[a],
[b].
and
t -->
[a,b].
is the same:
?- phrase(s,X).
X = [a, b].
?- phrase(t,X).
X = [a, b].
It is different to , in a non-DCG/regular Prolog rule which means logical conjunction (AND):
a.
b.
u :-
a,
b.
?- u.
true.
i.e. u is true if a and b are true (which is the case here).
Another difference is that the predicate s/0 does not exist:
?- s.
ERROR: Undefined procedure: s/0
ERROR: However, there are definitions for:
ERROR: s/2
false.
The reason for this is that the grammar rule s is translated to a Prolog predicate, but this needs additional arguments. The intended way to evaluate a grammar rule is to use phrase/2 as above (phrase(startrule,List)). If you like, I can add explanations about a translation from DCG to plain rules, but I don’t know if this is too confusing if you are a beginner in Prolog.
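Roughly speaking, the two extra arguments are the list before and the list after the part that the rule consumes. As an analogy only (this is Python, not what a Prolog system generates), each rule can be modelled as a function from a token list to the possible remainders, with , becoming "run the next parser on whatever is left". All names below are made up for the sketch.

```python
def terminal(sym):
    """Like [sym] in a DCG body: consume one matching token."""
    def parse(tokens):
        if tokens and tokens[0] == sym:
            yield tokens[1:]
    return parse


def seq(*parsers):
    """Like ',' in a DCG body: thread the remaining input through each part."""
    def parse(tokens):
        if not parsers:
            yield tokens
            return
        first, rest = parsers[0], seq(*parsers[1:])
        for remaining in first(tokens):
            yield from rest(remaining)
    return parse


def s(tokens):
    # s --> [a],[b].
    yield from seq(terminal("a"), terminal("b"))(tokens)
    # s --> [a],s,[b].
    yield from seq(terminal("a"), s, terminal("b"))(tokens)


def phrase(rule, tokens):
    """Succeed if the rule can consume the whole input."""
    return any(rest == [] for rest in rule(list(tokens)))
```

Here phrase(s, "aabb") succeeds and phrase(s, "aab") fails. Unlike Prolog's phrase/2, this sketch only recognizes a given list; it does not enumerate the lists a rule can generate.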
Addendum:
An even better example would have been to define t as:
t -->
[b],
[a].
Where the evaluation with phrase results in the list [b,a] (which is definitely different from [a,b]):
?- phrase(t,X).
X = [b, a].
But if we reorder the goals in a rule, the cases in which the predicate is true never change (*), so in our case, defining
v :-
b,
a.
is equivalent to u.
(*) Because Prolog uses depth-first search to find a solution, it may have to try infinitely many candidates before it finds the solution after reordering. (In more technical terms, the solutions don't change, but your search might not terminate if you reorder goals.)
For a while now I have been intrigued by the fact that ANTLR isn't capable of parsing the following context-free grammar rule: S → 'x' S 'x' | 'x'.
It didn't seem that complex to me.
For all I know, ANTLR is the most powerful LL parser available.
Are there any other kind of parser generators (LR or other) that are capable of generating a parser for this?
gr,
Coen
I don't think your grammar is LL(n) or LALR(n) or LR(n) for any n. Proof: Fix any n. Your input stream starts with n x characters, followed by another one. At this point, without any further look-ahead, do you need to go up or down?
Since the standard parser generators only work on languages in one of those classes (and many only for small values of n), it is not surprising that you don't find one that handles your input. You might want to reconsider if your grammar really needs to look the way it does; for the reduced example you gave, you could just as well have S → 'x' 'x' S | 'x', for example.
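For the rewritten rule S → 'x' 'x' S | 'x', two characters of lookahead are enough to pick an alternative, since the first alternative is only viable when two more x's follow. A minimal Python sketch of that decision (the function name is made up, and the input is assumed to be a plain string of characters):

```python
def recognize(text):
    """Recognize S -> 'x' 'x' S | 'x' by peeking at the next two characters."""
    pos = 0
    while True:
        lookahead = text[pos:pos + 2]
        if lookahead == "xx":
            pos += 2          # S -> 'x' 'x' S
        elif lookahead == "x":
            return True       # S -> 'x', exactly one character was left
        else:
            return False      # empty input, or something other than 'x'
```

Both grammars describe the strings with an odd number of x's, so recognize("xxx") is accepted and recognize("xx") is rejected. The original rule is harder because the parser would have to know where the middle x is before it has seen the end of the input.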
In Antlr, you need to add a syntactic predicate to resolve the ambiguity, like this:
grammar fred;
sentence : ( 'x' 'x' 'x' ) => 'x' sentence 'x'
| 'x'
;
This shouldn't, I think, require more than 1 additional token of lookahead. The syntactic predicate says "if you see an 'x' followed by another 'x', try the first alternative".
Antlr 3.3/Antlrworks 1.4.2 is happy with it.
Another option is to refactor your grammar to eliminate the alternative that introduces the ambiguity:
grammar fred;
start : sentence
;
sentence : 'x' 'x' ('x' 'x')* 'x'
| 'x'
;
This, I believe, will give you the same parse tree (or close) as your original grammar.