Some doubts about BNF grammars and Prolog's DCG grammars - parsing

I am studying grammars in Prolog and I have a litle doubt about conversions from the classic BNF grammars to the Prolog DCG grammars form.
For example I have the following BNF grammar:
<s> ::= a b
<s> ::= a <s> b
that, by rewriting, generates all strings of type:
ab
aabb
aaabbb
aaaabbbb
.....
.....
a^n b^n
Looking on the Ivan Bratko book Programming for Artificial Intelligence he convert this BNF grammar into DCG grammar in this way:
s --> [a],[b].
s --> [a],s,[b].
At a first look this seems to me very similar to the classic BNF grammar form but I have only a doubt related to the , symbol used in the DCG
This is not the symbol of the logical OR in Prolog but it is only a separator from the character in the generated sequence.
Is it right?

You can read the , in DCGs as and then or concatenated with:
s -->
[a],
[b].
and
t -->
[a,b].
is the same:
?- phrase(s,X).
X = [a, b].
?- phrase(t,X).
X = [a, b].
It is different to , in a non-DCG/regular Prolog rule which means logical conjunction (AND):
a.
b.
u :-
a,
b.
?- u.
true.
i.e. u is true if a and b are true (which is the case here).
Another difference is also that the predicate s/0 does not exist:
?- s.
ERROR: Undefined procedure: s/0
ERROR: However, there are definitions for:
ERROR: s/2
false.
The reason for this is that the grammar rule s is translated to a Prolog predicate, but this needs additional arguments. The intended way to evaluate a grammar rule is to use phrase/2 as above (phrase(startrule,List)). If you like, I can add explanations about a translation from DCG to plain rules, but I don’t know if this is too confusing if you are a beginner in Prolog.
Addendum:
An even better example would have been to define t as:
t -->
[b],
[a].
Where the evaluation with phrase results in the list [b,a] (which is definitely different from [a,b]):
?- phrase(t,X).
X = [b, a].
But if we reorder the goals in a rule, the cases in which the predicate is true never changes (*), so in our case, defining
v :-
b,
a.
is equivalent to u.
(*) Because prolog uses depth-first search to find a solution, it might be the case that it might need to try infinitely many candidates before it would find the solution after reordering. (In more technical terms, the solutions don't change but your search might not terminate if you reorder goals).

Related

Recognizing permutations of a finite set of strings in a formal grammar

Goal: find a way to formally define a grammar that recognizes elements from a set 0 or 1 times in any order. Subsequently, I want to parse it and generate an AST as well.
For example: Say the set of valid strings in my language is {A, B, C}. I want to define a grammar that recognizes all valid permutations of any number of those elements.
Syntactically valid strings would include:
(the empty string)
A,
B A, and
C A B
Syntactically invalid strings would include:
A A, and
B A C B
To be clear, defining all possible permutations explicitly in a CFG is unacceptable for my purposes, since larger sets would be impossible to maintain.
From what I understand, such a language fails the pumping lemma for context free languages, so the solution will not be context free or regular.
Update
What I'm after is called a "permutation language", which Benedek Nagy has done some theoretical work on as an extension to context free languages.
Regarding a parser generator, I've only found talk of implementing parsers with a permutation phase (link). Parsers evidently have an exponential lower bound on the size of resulting CFG, and I haven't found any parser generators that support it anyhow.
A sort-of solution to this problem was written in ANTLR. It uses semantic predicates to 'code around' the issue.
Assuming that the set of alternative strings is fixed and known in advance, say of size n, one can come up with a (non context-free) grammar of size O(n!). This is not asymptotically smaller than enumerating all permutations, so I suppose it cannot be considered a good solution. I believe that this grammar can be reformulated as a context-sensitive grammar (although in the form I'm suggesting below it is not).
For the example {a, b, c} mentioned in the question, one such grammar is the following. I'm using lower case letters for terminal symbols and upper case letters for non-terminals, as is customary. S is the initial non-terminal symbol.
S ::= XabcY
XabcY ::= aXbcY | bXacY | cXabY
XabY ::= ab | ba
XacY ::= ac | ca
XbcY ::= bc | cb
Non-terminals X and Y enclose the substring in the production which has not been finalized yet; this substring will eventually be replaced by a permutation of the terminals that are given between X and Y (in some arbitrary order).

Cannot compute minimal length of a parser - uu-parsinglib in Haskell

Lets see the code snippet:
pSegmentBegin p i = pIndentExact i *> ((:) <$> p i <*> ((pEOL *> pSegment p i) <|> pure []))
if I change this code in my parser to:
pSegmentBegin p i = do
pIndentExact i
((:) <$> p i <*> ((pEOL *> pSegment p i) <|> pure []))
I've got an error:
canot compute minmal length of a parser due to occurrence of a moadic bind, use addLength to override
I thought the above parser should behave the same way. Why this error can occur?
EDIT
The above example is very simple (to simplify the question) and as noted below it is not necessary to use do notation here, but the real case I wanted it to use is as follows:
pSegmentBegin p i = do
j <- pIndentAtLast i
(:) <$> p j <*> ((pEOL *> pSegments p j) <|> pure [])
I have noticed that adding "addLength 1" before the do statement solves the problem, but I'm unsure if its a correct solution:
pSegmentBegin p i = addLength 2 $ do
j <- pIndentAtLast i
(:) <$> p j <*> ((pEOL *> pSegments p j) <|> pure [])
As I have mentioned many times the monadic interface should be avoided whenever possible. let me try to explain why the applicative interface is to be preferred.
One of the distinctive features of my library is that it performs error correction by inserting or deleting problems. Of course we can take an umlimited look-ahead here but that would make the process VERY expensive. So we take only a limited lookahead of three steps.
Now suppose we have to insert an expression and one of the expression alternatives is:
expr := "if" expr "then" expr "else" expr
then we want to exclude this alternative since choosing this alternative would necessitate the insertion of another expression etc. So we perform an abstract interpretation of the alternatives and make sure that in case of a draw (i.e. equal costs for the limited lookahead) we take one of the non-recursive alternatives.
Unfortunately this scheme breaks down when one writes monadic parsers, since the length of the right hand side of the bind may depend on the result of the left-hand side. So we issue the error message, and ask some help from the programmer to indicate the number of tokens this alternative might consume. The actual value does not matter so much, as long as you do not provide a finite length for something which is recursive and may lead to infinite insertions. It is only used to select the shortest alternative in case of an insertion.
This abstract interpretation has some costs and if you write all your parsers in monadic style it is unavoidable that this analysis is repeated over an over again. so: DO NOT WRITE MONADIC STYLE PARSERS WHEN USING THIS LIBRARY IF THERE IS AN APPLICATIVE ALTERNATIVE.
It's trying to statically analyze how much input needs to be read in order to optimize performance, but that kind of optimization requires a statically known parser structure—the kind that can be built by Applicatives since the parser effect cannot depend upon the parser value such what (>>=) does.
So that's what goes wrong—when you use do notation it translates to a Monadic bind which breaks the Applicative predictor. It'd be nice if the library only exposed one of the two interfaces so that this kind of error cannot happen, but instead there's some inconsistency if you use both interfaces together in the same parser.
Since this use of do is strictly unnecessary—you're not using the extra power the monadic interface gives you—it's probably better to just avoid it.
I have a workaround I use with monadic parsers in uuparsinglib. Its a self-answer here:
Monadic parse with uu-parsinglib
You may find it useful

Is My Lambda Calculus Grammar Unambiguous?

I am trying to write a small compiler for a language that handles lambda calculus. Here is the ambiguous definition of the language that I've found:
E → ^ v . E | E E | ( E ) | v
The symbols ^, ., (, ) and v are tokens. ^ represents lambda and v represents a variable.
An expression of the form ^v.E is a function definition where v is the formal parameter of the function and E is its body. If f and g are lambda expressions, then the lambda expression fg represents the application of the function f to the argument g.
I'm trying to write an unambiguous grammar for this language, under the assumption that function application is left associative, e.g., fgh = (fg)h, and that function application binds tighter than ., e.g., (^x. ^y. xy) ^z.z = (^x. (^y. xy)) ^z.z
Here is what I have so far, but I'm not sure if it's correct:
E -> ^v.E | T
T -> vF | (E) E
F -> v | epsilon
Could someone help out?
Between reading your question and comments, you seem to be looking more for help with learning and implementing lambda calculus than just the specific question you asked here. If so then I am on the same path so I will share some useful info.
The best book I have, which is not to say the best book possible, is Types and Programming Languages (WorldCat) by Benjamin C. Pierce. I know the title doesn't sound anything like lambda calculus but take a look at λ-Calculus extensions: meaning of extension symbols which list many of the lambda calculi that come from the book. There is code for the book in OCaml and F#.
Try searching in CiteSeerX for research papers on lambda calculus to learn more.
The best λ-Calculus evaluator I have found so far is:
Lambda calculus reduction workbench with info here.
Also, I find that you get much better answers for lambda calculus questions related to programming at CS:StackExchange and math related questions at Math:StackExcahnge.
As for programming languages to implement lambda calculus you will probably need to learn a functional language if you haven't; Yes it's a different beast, but the enlightenment on the other side of the mountain is spectacular. Most of the source code I find uses a functional language such as ML or OCaml, and once you learn one, the rest get easier to learn.
To be more specific, here is the source code for the untyped lambda calculus project, here is the input file to an F# variation of YACC which from reading your previous questions seems to be in your world of knowledge, and here is sample input.
Since the grammar is for implementing a REPL, it starts with toplevel, think command prompt, and accepts multiple commands, which in this case are lambda calculus expressions. Since this grammar is used for many calculi it has parts that are place holders in the earlier examples, thus binding here is more of a place holder.
Finally we get to the part you are after
Note LCID is Lower Case Identifier
Term : AppTerm
| LAMBDA LCID DOT Term
| LAMBDA USCORE DOT Term
AppTerm : ATerm
| AppTerm ATerm
/* Atomic terms are ones that never require extra parentheses */
ATerm : LPAREN Term RPAREN
| LCID
You may find the proof for a particular grammar's ambiguity in sublinear time, but proving that grammar is unambiguous is an NP complete problem. You'd have to generate every possible sentence in the language, and check that there is only one derivation for each.

Example of an LR grammar that cannot be represented by LL?

All LL grammars are LR grammars, but not the other way around, but I still struggle to deal with the distinction. I'm curious about small examples, if any exist, of LR grammars which do not have an equivalent LL representation.
Well, as far as grammars are concerned, its easy -- any simple left-recursive grammar is LR (probably LR(1)) and not LL. So a list grammar like:
list ::= list ',' element | element
is LR(1) (assuming the production for element is) but not LL. Such grammars can be fairly easily converted into LL grammars by left-factoring and such, so this is not too interesting however.
Of more interest is LANGUAGES that are LR but not LL -- that is a language for which there exists an LR(1) grammar but no LL(k) grammar for any k. An example is things that need optional trailing matches. For example, the language of any number of a symbols followed by the same number or fewer b symbols, but not more bs -- { a^i b^j | i >= j }. There's a trivial LR(1) grammar:
S ::= a S | P
P ::= a P b | \epsilon
but no LL(k) grammar. The reason is that an LL grammar needs to decide whether to match an a+b pair or an odd a when looking at an a, while the LR grammar can defer that decision until after it sees the b or the end of the input.
This post on cs.stackechange.com has lots of references about this

Parsing unordered sequence with parsing expression grammar

Is there a (simple) way, within a parsing expression grammar (PEG), to express an "unordered sequence"? A rule such as
Rule <- A B C
requires A, B and C to match in order. A rule such as
Rule <- (A B C) / (B C A) / (C A B) / (A C B) / (C B A) / (B A C)
allows them to match in any order (which is what we want) but it is cumbersome and inapplicable in practice with more terms in the sequence.
Is the only solution to use a syntactically looser rule such as
Rule <- (A / B / C){3}
and semantically check that each rule matches only once?
The fact that, e.g., Relax NG Compact Syntax has an "unordered list" operator to parse XML make me hint that there is no obvious solution.
Last question: do you think the addition of such an operator would bring ambiguity to PEG?
Grammar rules express precisely the sequence of forms that you want, regardless of parsing engine (e.g., PEG, LALR, LL(k), ...) that you choose.
The only way to express that you want all possible sequences of just of something using BNF rules is the big ugly rule you proposed.
The standard solution is to simply define:
rule <- (A | B | C)*
(or whatever syntax your parser generator accepts for lists) and semantically count that only 3 forms are provided and they are unique.
Often people building parser generators add special "extended BNF" notations to let them describe special circumstances; you gave an example use {3} as special syntax implying that you only wanted "3 of" under the assumption the parser generator accepts this notation and does the appropriate enforcement. One can imagine an extension notation {unique} to let you describe your situation. I've never seen a parser generator that implemented that idea.

Resources