Parsing an unordered sequence with a parsing expression grammar

Is there a (simple) way, within a parsing expression grammar (PEG), to express an "unordered sequence"? A rule such as
Rule <- A B C
requires A, B and C to match in order. A rule such as
Rule <- (A B C) / (B C A) / (C A B) / (A C B) / (C B A) / (B A C)
allows them to match in any order (which is what we want) but it is cumbersome and inapplicable in practice with more terms in the sequence.
Is the only solution to use a syntactically looser rule such as
Rule <- (A / B / C){3}
and semantically check that each rule matches only once?
The fact that, e.g., Relax NG Compact Syntax has an "unordered list" operator to parse XML makes me suspect that there is no obvious solution.
Last question: do you think the addition of such an operator would bring ambiguity to PEG?

Grammar rules express precisely the sequence of forms that you want, regardless of parsing engine (e.g., PEG, LALR, LL(k), ...) that you choose.
The only way to express, using plain BNF rules, that you want every possible ordering of a set of elements is the big ugly rule you proposed.
The standard solution is to simply define:
rule <- (A | B | C)*
(or whatever syntax your parser generator accepts for lists) and semantically count that only 3 forms are provided and they are unique.
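The "loose rule plus semantic check" approach can be sketched as follows. This is only an illustration in Python, not tied to any parser generator: the single-letter tokens A, B, C and the function name parse_unordered are assumptions made for the example.

```python
import re

# Loose syntactic rule: (A | B | C)*, with optional whitespace between tokens.
TOKEN = re.compile(r"\s*([ABC])")

def parse_unordered(text, required=("A", "B", "C")):
    """Match the loose rule, then check that each required form occurs exactly once."""
    text = text.strip()
    pos, seen = 0, []
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        seen.append(match.group(1))
        pos = match.end()
    # Semantic check: the right number of forms, and no duplicates.
    if sorted(seen) != sorted(required):
        raise ValueError(f"expected each of {required} exactly once, got {seen}")
    return seen

print(parse_unordered("C A B"))
```

The grammar stays linear in the number of alternatives; the uniqueness constraint lives entirely in the check after matching.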
Often people building parser generators add special "extended BNF" notations to let them describe special circumstances; your example uses {3} as special syntax meaning that you want exactly "3 of", under the assumption that the parser generator accepts this notation and does the appropriate enforcement. One can imagine an extension notation {unique} to let you describe your situation. I've never seen a parser generator that implemented that idea.

Related

Automata and Formal Languages

Showing that the reverse of a word for a regular language L is also regular
I am confused as to how to approach this question; I've been stuck for hours. For a word x, we use x^r to denote its reverse. For a language L, we use L^r to denote {x^r where x is in L}. Show that if L is regular then so is L^r.
If L is regular, then there exists some regular grammar which generates it. It can always be represented as either a left-regular grammar or a right-regular grammar. Let's assume that it's a left-regular grammar G_l (the proof for a right-regular grammar is analogous).
This grammar has productions of two types; the terminating-type:
A -> a, where A is non-terminal and a is either a terminal or empty string (epsilon)
or the chaining type:
B -> Ca, where B, C are non-terminals and a is a terminal
When we apply reverse to a regular language, we basically also apply it to the tails of productions (since heads are just single non-terminals). It's going to be proved later on. So we get a new grammar G_r, with productions:
A -> a, where A is non-terminal and a is either a terminal or empty string (epsilon)
B -> aC, where B, C are non-terminals and a is a terminal
But hey, it's a right-regular grammar! So the language it accepts is also regular.
There is one thing to do - to show that reversing tails actually does the thing it's supposed to. We're going to prove it very simply:
If L contains \epsilon, then there is production 'S -> \epsilon' in G_l. Since we don't touch productions like that, it's also present in G_r.
If L contains a, a word composed of a single terminal, then it's similar to the above
If L contains aZ, where a is a terminal and Z is a word from the language obtained by chopping the first terminal off the words in L, then L^r contains (because of the changes to the chaining productions) (Z^r)a. The language of such words Z is also regular, since it can be constructed by dropping the first "level" of left-productions from G_l, which leaves us with a regular grammar.
I hope it helped. There's also an arguably easier way of doing that by reversing edges of the relevant finite automata and changing accepting and entry states a bit.
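The automaton-based argument can be illustrated with a small Python sketch. The representation (a set of (source, symbol, target) triples) and the function names are assumptions chosen for the example; the reversed machine is an NFA whose start states are the old accepting states and whose only accepting state is the old start state.

```python
from itertools import product

def reverse_nfa(transitions, start, accepting):
    """Flip every edge and swap the roles of start and accepting states."""
    flipped = {(dst, sym, src) for (src, sym, dst) in transitions}
    return flipped, set(accepting), {start}

def accepts(transitions, starts, accepting, word):
    """Simulate an NFA (without epsilon moves) from a set of start states."""
    current = set(starts)
    for ch in word:
        current = {dst for (src, sym, dst) in transitions
                   if src in current and sym == ch}
    return bool(current & set(accepting))

# Example automaton for L = a b* (an 'a' followed by any number of 'b's).
delta = {("q0", "a", "q1"), ("q1", "b", "q1")}
rev_delta, rev_starts, rev_accepting = reverse_nfa(delta, "q0", {"q1"})

# Compare both machines on every word over {a, b} of length at most 4.
for n in range(5):
    for letters in product("ab", repeat=n):
        w = "".join(letters)
        assert accepts(delta, {"q0"}, {"q1"}, w) == \
               accepts(rev_delta, rev_starts, rev_accepting, w[::-1])
```

Since every NFA recognizes a regular language, the reversed machine shows that L^r is regular.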

Recognizing permutations of a finite set of strings in a formal grammar

Goal: find a way to formally define a grammar that recognizes elements from a set 0 or 1 times in any order. Subsequently, I want to parse it and generate an AST as well.
For example: Say the set of valid strings in my language is {A, B, C}. I want to define a grammar that recognizes all valid permutations of any number of those elements.
Syntactically valid strings would include:
(the empty string)
A,
B A, and
C A B
Syntactically invalid strings would include:
A A, and
B A C B
To be clear, defining all possible permutations explicitly in a CFG is unacceptable for my purposes, since larger sets would be impossible to maintain.
From what I understand, such a language fails the pumping lemma for context free languages, so the solution will not be context free or regular.
Update
What I'm after is called a "permutation language", which Benedek Nagy has done some theoretical work on as an extension to context free languages.
Regarding a parser generator, I've only found talk of implementing parsers with a permutation phase (link). Parsers evidently have an exponential lower bound on the size of resulting CFG, and I haven't found any parser generators that support it anyhow.
A sort-of solution to this problem was written in ANTLR. It uses semantic predicates to 'code around' the issue.
Assuming that the set of alternative strings is fixed and known in advance, say of size n, one can come up with a (non context-free) grammar of size O(n!). This is not asymptotically smaller than enumerating all permutations, so I suppose it cannot be considered a good solution. I believe that this grammar can be reformulated as a context-sensitive grammar (although in the form I'm suggesting below it is not).
For the example {a, b, c} mentioned in the question, one such grammar is the following. I'm using lower case letters for terminal symbols and upper case letters for non-terminals, as is customary. S is the initial non-terminal symbol.
S ::= XabcY
XabcY ::= aXbcY | bXacY | cXabY
XabY ::= ab | ba
XacY ::= ac | ca
XbcY ::= bc | cb
Non-terminals X and Y enclose the substring in the production which has not been finalized yet; this substring will eventually be replaced by a permutation of the terminals that are given between X and Y (in some arbitrary order).
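To see how this scheme extends beyond three terminals, here is a small Python sketch that builds the same kind of rules for any set of single-character terminals and expands them. The dictionary encoding and the names build_rules and expand are assumptions for illustration only; each key "abc" stands for the non-terminal XabcY.

```python
from itertools import combinations, permutations

def build_rules(terminals):
    """Map each non-terminal X<subset>Y (keyed by its sorted terminals) to its alternatives.

    Each alternative is (t, rest): emit terminal t, then continue with the
    non-terminal for rest (or with the lone terminal when len(rest) == 1).
    Terminals must be distinct single characters.
    """
    symbols = sorted(terminals)
    rules = {}
    for size in range(2, len(symbols) + 1):
        for subset in combinations(symbols, size):
            name = "".join(subset)
            rules[name] = [(t, name.replace(t, "")) for t in subset]
    return rules

def expand(name, rules):
    """Return every terminal string derivable from the non-terminal X<name>Y."""
    if len(name) <= 1:
        return {name}
    return {t + tail for t, rest in rules[name] for tail in expand(rest, rules)}

rules = build_rules("abc")
for name, alternatives in sorted(rules.items()):
    print(name, alternatives)

# The derived strings are exactly the permutations of the terminals.
assert expand("abc", rules) == {"".join(p) for p in permutations("abc")}
```

For {a, b, c} this yields the non-terminals for abc, ab, ac and bc, matching the rules listed above.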

Cannot compute minimal length of a parser - uu-parsinglib in Haskell

Let's look at this code snippet:
pSegmentBegin p i = pIndentExact i *> ((:) <$> p i <*> ((pEOL *> pSegment p i) <|> pure []))
if I change this code in my parser to:
pSegmentBegin p i = do
    pIndentExact i
    ((:) <$> p i <*> ((pEOL *> pSegment p i) <|> pure []))
I get an error:
Cannot compute minimal length of a parser due to occurrence of a monadic bind, use addLength to override
I thought the above parser should behave the same way. Why does this error occur?
EDIT
The above example is very simple (to simplify the question) and, as noted below, it is not necessary to use do notation here, but the real case where I want to use it is as follows:
pSegmentBegin p i = do
    j <- pIndentAtLast i
    (:) <$> p j <*> ((pEOL *> pSegments p j) <|> pure [])
I have noticed that adding "addLength 1" before the do statement solves the problem, but I'm unsure if it's a correct solution:
pSegmentBegin p i = addLength 2 $ do
    j <- pIndentAtLast i
    (:) <$> p j <*> ((pEOL *> pSegments p j) <|> pure [])
As I have mentioned many times, the monadic interface should be avoided whenever possible. Let me try to explain why the applicative interface is to be preferred.
One of the distinctive features of my library is that it performs error correction by inserting or deleting symbols. Of course we could take an unlimited look-ahead here, but that would make the process VERY expensive. So we take only a limited lookahead of three steps.
Now suppose we have to insert an expression and one of the expression alternatives is:
expr := "if" expr "then" expr "else" expr
then we want to exclude this alternative since choosing this alternative would necessitate the insertion of another expression etc. So we perform an abstract interpretation of the alternatives and make sure that in case of a draw (i.e. equal costs for the limited lookahead) we take one of the non-recursive alternatives.
Unfortunately this scheme breaks down when one writes monadic parsers, since the length of the right hand side of the bind may depend on the result of the left-hand side. So we issue the error message, and ask some help from the programmer to indicate the number of tokens this alternative might consume. The actual value does not matter so much, as long as you do not provide a finite length for something which is recursive and may lead to infinite insertions. It is only used to select the shortest alternative in case of an insertion.
This abstract interpretation has some costs, and if you write all your parsers in monadic style it is unavoidable that this analysis is repeated over and over again. So: DO NOT WRITE MONADIC STYLE PARSERS WHEN USING THIS LIBRARY IF THERE IS AN APPLICATIVE ALTERNATIVE.
It's trying to statically analyze how much input needs to be read in order to optimize performance, but that kind of optimization requires a statically known parser structure. That is the kind of structure built with Applicative, where the parser effect cannot depend on a previously parsed value the way it can with (>>=).
So that's what goes wrong—when you use do notation it translates to a Monadic bind which breaks the Applicative predictor. It'd be nice if the library only exposed one of the two interfaces so that this kind of error cannot happen, but instead there's some inconsistency if you use both interfaces together in the same parser.
Since this use of do is strictly unnecessary—you're not using the extra power the monadic interface gives you—it's probably better to just avoid it.
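Why a monadic bind defeats the static length analysis can be shown with a toy model. The following Python sketch is not uu-parsinglib and does not reproduce its API or the exact semantics of addLength; Parser, symbol, seq, bind and add_length are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Parser:
    run: Callable[[str], Optional[tuple]]  # input -> (value, rest), or None on failure
    min_len: Optional[int]                 # statically known minimal length, None if unknown

def symbol(c):
    """Parser for one fixed character; its minimal length is known to be 1."""
    return Parser(lambda s: (c, s[1:]) if s.startswith(c) else None, 1)

def seq(p, q):
    """Applicative-style sequencing: both parsers exist before any input is read."""
    def run(s):
        first = p.run(s)
        if first is None:
            return None
        second = q.run(first[1])
        if second is None:
            return None
        return ((first[0], second[0]), second[1])
    known = p.min_len is not None and q.min_len is not None
    return Parser(run, p.min_len + q.min_len if known else None)

def bind(p, f):
    """Monadic sequencing: the second parser is f(value), so it cannot be inspected up front."""
    def run(s):
        first = p.run(s)
        return None if first is None else f(first[0]).run(first[1])
    return Parser(run, None)

def add_length(n, p):
    """Toy override: the programmer supplies the length the analysis could not compute."""
    return Parser(p.run, n)

applicative = seq(symbol("a"), symbol("b"))
monadic = bind(symbol("a"), lambda _: symbol("b"))
print(applicative.min_len, monadic.min_len, add_length(2, monadic).min_len)
```

Both parsers accept the same input, but only the applicative one carries a length that can be computed without running it, because in bind the second parser does not exist until the first has produced a value.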
I have a workaround I use with monadic parsers in uu-parsinglib. It's a self-answer here:
Monadic parse with uu-parsinglib
You may find it useful

Some doubts about BNF grammars and Prolog's DCG grammars

I am studying grammars in Prolog and I have a small doubt about converting classic BNF grammars to Prolog's DCG form.
For example I have the following BNF grammar:
<s> ::= a b
<s> ::= a <s> b
that, by rewriting, generates all strings of type:
ab
aabb
aaabbb
aaaabbbb
.....
.....
a^n b^n
In his book Programming for Artificial Intelligence, Ivan Bratko converts this BNF grammar into a DCG grammar in this way:
s --> [a],[b].
s --> [a],s,[b].
At first glance this seems very similar to the classic BNF form, but I have a doubt about the , symbol used in the DCG.
It is not the symbol of the logical OR in Prolog; it is only a separator between the characters in the generated sequence.
Is it right?
You can read the , in DCGs as "and then" or "concatenated with":
s -->
    [a],
    [b].
and
t -->
    [a,b].
is the same:
?- phrase(s,X).
X = [a, b].
?- phrase(t,X).
X = [a, b].
This differs from the , in a regular (non-DCG) Prolog rule, where it means logical conjunction (AND):
a.
b.
u :-
    a,
    b.
?- u.
true.
i.e. u is true if a and b are true (which is the case here).
Another difference is that the predicate s/0 does not exist:
?- s.
ERROR: Undefined procedure: s/0
ERROR: However, there are definitions for:
ERROR: s/2
false.
The reason for this is that the grammar rule s is translated to a Prolog predicate, but this needs additional arguments. The intended way to evaluate a grammar rule is to use phrase/2 as above (phrase(startrule,List)). If you like, I can add explanations about a translation from DCG to plain rules, but I don’t know if this is too confusing if you are a beginner in Prolog.
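To give a rough idea of what the additional arguments do, here is an analogue written in Python rather than Prolog. It is only a sketch of the idea behind s/2: each rule receives the tokens still to be read and produces the tokens left over after it has matched. The helper names terminal and phrase are chosen to mirror the Prolog terms but are otherwise assumptions of this example.

```python
def terminal(tok):
    """Analogue of the DCG terminal [tok]: consume one matching token."""
    def parse(tokens):
        if tokens and tokens[0] == tok:
            yield tokens[1:]
    return parse

a = terminal("a")
b = terminal("b")

def s(tokens):
    """Yield every possible remainder after s has matched a prefix of tokens."""
    # s --> [a],[b].
    for rest1 in a(tokens):
        yield from b(rest1)
    # s --> [a],s,[b].
    for rest1 in a(tokens):
        for rest2 in s(rest1):
            yield from b(rest2)

def phrase(rule, tokens):
    """Succeed when the rule can consume the whole token list."""
    return any(rest == [] for rest in rule(list(tokens)))

print(phrase(s, "aabb"), phrase(s, "aab"))
```

The , of the DCG corresponds to the nesting of the loops: the remainder produced by one step is the input of the next, which is why it reads as "and then".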
Addendum:
An even better example would have been to define t as:
t -->
    [b],
    [a].
Where the evaluation with phrase results in the list [b,a] (which is definitely different from [a,b]):
?- phrase(t,X).
X = [b, a].
But if we reorder the goals in a rule, the cases in which the predicate is true never change (*), so in our case, defining
v :-
    b,
    a.
is equivalent to u.
(*) Because Prolog uses depth-first search to find a solution, it may have to try infinitely many candidates before finding the solution after reordering. (In more technical terms, the solutions don't change, but your search might not terminate if you reorder goals.)

How to enforce associativity rules in a GLR parser?

I'm writing a GLR parser for fun (again, because I have understood a few things since my last try). The parser is working now and I am implementing disambiguation rules. I am handling precedence in a way that seems to work. Now I'm a bit at a loss regarding associativity.
Say I have a grammar like this :
E <- E '+' E (rule 1)
E <- E '-' E (rule 2)
E <- '0' (rule 3)
E <- '1' (rule 4)
Where rules 1) and 2) have the same precedence and left associativity.
Without associativity handling, the string '1-1+0' will generate two parse trees:
    1             2
   / \           / \
  /   \         /   \
 2     3       4     1
 | \                 | \
 4  4                4  3
Where numbers indicate the rule used for reduction. The correct parse tree is the first one and thus I would like to keep only this one.
I'm wondering how to efficiently detect associativity infringements algorithmically.
One approach I tried was to see that in the first tree, at the top node, rule 2 is LEFT of rule 3 in the list of children of rule 1, whereas in the second tree rule 1 is RIGHT of rule 4 and thus since rules 2 and 1 are LEFT associative I keep only the first tree.
This, however, did not get me very far in more complicated examples. A limitation of this solution is that I can only discard trees based on a comparison with another tree.
Do you think I can find a solution using a refined version of this approach? What is the standard way of doing this?
In my opinion this is best expressed by integrating it into the grammar rules, completely resolving the ambiguity:
E <- F
E <- E '+' F
E <- E '-' F
F <- '0'
F <- '1'
As you are set for (G)LR, it should be possible to equally well express left- and right-associativity. The increase in depth of parse trees, due to unit derivations, can be addressed by postprocessing them appropriately.
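To make the effect of the rewritten grammar concrete, here is a small hand-written Python sketch. It is not a GLR parser: it recognizes the same language with a loop that plays the role of the left-recursive rules E <- E '+' F and E <- E '-' F, and the tuple representation of the tree is an assumption of this example.

```python
def parse_expression(text):
    """Parse E <- F | E '+' F | E '-' F with F <- '0' | '1' into a left-leaning tree."""
    tokens = [ch for ch in text if not ch.isspace()]
    pos = 0

    def parse_f():
        nonlocal pos
        if pos < len(tokens) and tokens[pos] in ("0", "1"):
            pos += 1
            return tokens[pos - 1]
        raise ValueError(f"expected '0' or '1' at token {pos}")

    tree = parse_f()                      # E <- F
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        pos += 1
        tree = (tree, op, parse_f())      # E <- E op F: the tree so far becomes the left child
    if pos != len(tokens):
        raise ValueError(f"unexpected token at {pos}")
    return tree

print(parse_expression("1-1+0"))  # (('1', '-', '1'), '+', '0')
```

Because the right operand is always an F, a right-nested tree such as 1-(1+0) cannot be derived, so only one parse remains.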
This will completely avoid inventing a new mechanism, and exploit the expressiveness of the BNF that is used anyway. I think it requires strong arguments to instead favor an ambiguous notation, plus a separate specification of how to resolve.
The XQuery language specification, during its definition process, evolved from using ambiguous EBNF with extra disambiguation rules (see April 30, 2002 draft) to dropping the latter in favor of unambiguous rules incorporating precedence and associativity (see August 16, 2002 draft). As an implementer, I very much appreciated that - it made my life easier.
To do this algorithmically I would make two groups: SIMPLE, which includes rules 3 and 4, and COMPLEX, which includes rules 1 and 2. If the rightmost child of a COMPLEX (sub)root is itself COMPLEX, then remove this tree because it is (partly) right-associative.
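The filter described above can be sketched in Python. The tree encoding (a pair of rule number and list of children) and the function name is_left_associative are assumptions for the example; the two trees are the ones from the question for the input '1-1+0'.

```python
COMPLEX = {1, 2}   # E <- E '+' E and E <- E '-' E
SIMPLE = {3, 4}    # E <- '0' and E <- '1'

def is_left_associative(tree):
    """Reject a tree if any COMPLEX node has a COMPLEX rightmost child."""
    rule, children = tree
    if rule in COMPLEX and children and children[-1][0] in COMPLEX:
        return False
    return all(is_left_associative(child) for child in children)

# (1 - 1) + 0: rule 1 at the root, rule 2 as its left child.
tree_a = (1, [(2, [(4, []), (4, [])]), (3, [])])
# 1 - (1 + 0): rule 2 at the root, rule 1 as its right child.
tree_b = (2, [(4, []), (1, [(4, []), (3, [])])])

kept = [t for t in (tree_a, tree_b) if is_left_associative(t)]
print(kept == [tree_a])  # True
```

Each tree is judged on its own, so no comparison between alternative trees is needed. The check as written assumes all COMPLEX rules share one precedence level and are left-associative, as in the example grammar.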
