Parsing in Prolog without cut? - parsing

I found this nice snippet for parsing lisp in Prolog (from here):
ws --> [W], { code_type(W, space) }, ws.
ws --> [].
parse(String, Expr) :- phrase(expressions(Expr), String).
expressions([E|Es]) -->
    ws, expression(E), ws,
    !, % single solution: longest input match
    expressions(Es).
expressions([]) --> [].
% A number N is represented as n(N), a symbol S as s(S).
expression(s(A)) --> symbol(Cs), { atom_codes(A, Cs) }.
expression(n(N)) --> number(Cs), { number_codes(N, Cs) }.
expression(List) --> "(", expressions(List), ")".
expression([s(quote),Q]) --> "'", expression(Q).
number([D|Ds]) --> digit(D), number(Ds).
number([D]) --> digit(D).
digit(D) --> [D], { code_type(D, digit) }.
symbol([A|As]) -->
    [A],
    { memberchk(A, "+/-*><=") ; code_type(A, alpha) },
    symbolr(As).
symbolr([A|As]) -->
    [A],
    { memberchk(A, "+/-*><=") ; code_type(A, alnum) },
    symbolr(As).
symbolr([]) --> [].
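For reference, a small query of my own (not part of the snippet; like the snippet itself it assumes the double_quotes flag makes "..." a list of character codes, e.g. via :- set_prolog_flag(double_quotes, codes) in current SWI-Prolog):

?- parse("(+ 1 2)", E).
E = [[s(+), n(1), n(2)]]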
However, expressions//1 uses a cut. I'm assuming this is for efficiency. Is it possible to write this code so that it works efficiently without the cut?
I would also be interested in answers that involve Mercury's soft cut / committed choice.

The cut is not used for efficiency, but to commit to the first solution (see the comment next to the !/0: "single solution: longest input match"). If you comment out the !/0, you get for example:
?- parse("abc", E).
E = [s(abc)] ;
E = [s(ab), s(c)] ;
E = [s(a), s(bc)] ;
E = [s(a), s(b), s(c)] ;
false.
It is clear that only the first solution, consisting of the longest sequence of characters that form a token, is desired in such cases. Given the example above, I therefore disagree with "false": expression//1 is ambiguous, because number//1 and symbolr//1 are. In Mercury, you could use the determinism declaration cc_nondet to commit to a solution, if any.
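If you prefer the commitment outside the grammar, one option (my own sketch, not from the original snippet, and only roughly in the spirit of Mercury's cc_nondet) is to leave the cut out of expressions//1 and commit at the call site with once/1:

parse(String, Expr) :- once(phrase(expressions(Expr), String)).

This commits to the first solution, which is the longest-match one shown first above; unlike the in-grammar cut, though, it does not prune the intermediate search, so it may do more backtracking on larger inputs.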

You are touching a quite deep problem here. At the place of the cut you have added the comment "longest input match". But what you actually did was to commit to the first solution, which will produce the "longest input match" for the non-terminal ws//0, but not necessarily for expression//1.
Many programming languages define their tokens based on the longest input match. This often leads to very strange effects. For example, a number may be immediately followed by a letter in many programming languages. That's the case for Pascal, Haskell, Prolog and many other languages. E.g. if a>2then 1 else 2 is valid Haskell. Valid Prolog: X is 2mod 3.
Given that, it might be a good idea to define a programming language such that it does not depend on such features at all. Of course, you would then like to optimize the grammar. But I can only recommend starting with a definition that is unambiguous in the first place.
As for efficiency (and purity):
eos([], []).                                      % succeeds only at the end of the input

nows --> call(eos).
nows, [W] --> [W], { \+ code_type(W, space) }.    % peek: the next character is not a space

ws --> nows.
ws --> [W], { code_type(W, space) }, ws.
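For illustration (my own query, again assuming "..." denotes a code list), this ws//0 consumes the longest run of spaces and succeeds exactly once, with no cut anywhere:

?- phrase((ws, "abc"), "   abc").
true.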

You could use a construct that has already found its place in Parsing Expression Grammars (PEGs) but which is also available in DCGs: namely, the negation of a DCG goal. In PEGs the exclamation mark (!) with an argument is used for negation, i.e. ! e. In DCGs the negation of a DCG goal is expressed by the (\+) operator, which is already used for ordinary negation as failure in ordinary Prolog clauses and queries.
So let's first explain how (\+) works in DCGs. If you have a production rule of the form:
A --> B, \+C, D.
Then this is translated to:
A(I,O) :- B(I,X), \+ C(X,_), D(X,O).
This means an attempt is made to parse the C DCG goal, but without actually consuming the input list. Now this can be used to replace the cut, if desired, and it gives a little more declarative feeling. To explain the idea, let's assume that we have a grammar without ws//0. The original clause set of expressions//1 would then be:
expressions([E|Es]) --> expression(E), !, expressions(Es).
expressions([]) --> [].
With negation we can turn this into the following cut-less form:
expressions([E|Es]) --> expression(E), expressions(Es).
expressions([]) --> \+ expression(_).
Unfortunately the above variant is quite inefficient, since an attempt to parse an expression is made twice: once in the first rule, and then again in the second rule for the negation. But you could do the following and only check for the negation of the beginning of an expression:
expressions([E|Es]) --> expression(E), expressions(Es).
expressions([]) --> \+ symbol(_), \+ number(_), \+ "(", \+ "'".
If you try negation, you will see that you get a relatively strict parser. This is important if you try to parse a maximum prefix of the input and if you want to detect some errors. Try that:
?- phrase(expressions(X),"'",Y).
You should get failure in the negation version, which checks the first symbol of an expression. In both the cut version and the plain cut-free version you will get success with the empty list as a result.
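Concretely (my own rendering of the behaviour just described, using the clause sets shown in this answer; how Y is displayed depends on your double_quotes setting):

% negation version, which checks the first symbol of an expression:
?- phrase(expressions(X), "'", Y).
false.

% cut version and plain cut-free version:
?- phrase(expressions(X), "'", Y).
X = [],
Y = "'".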
But you could also deal with errors in another way; I only made the error example to highlight a little how the negation version works.
In other settings, for example a CYK parser, one can make the negation quite efficient, since it can use the information already placed in the chart.
Best Regards

Related

Prolog - parsing functions with DCG

I need to parse a string representing a function, like this:
<fsignature> "(" <term1>, <term2> ... <termn> ")"
The function's signature and terms also have to be checked further for the function to be accepted.
I've written this DCG in Prolog:
fsign --> {is_leg(A)}, [A].
terms --> (funct, funct; terms).
terms --> (terms, terms; term).
term --> {is_term(T)}, [T].
But it doesn't seem to work: when I try to use phrase(funct, [foo, "(", a, a, ")"]) it goes into overflow. is_leg just checks whether the string is legal (a string starting with a character), while is_term should check whether the term is a literal (either a constant, a variable or a function).
What is it that's not working? I figure it's probably the variables - should I put them as arguments of the nonterminals?
Thanks for any help.
If your expressions all look like this:
<fsignature> "(" <term1> <term2> ... <termn> ")"
Then writing this out in terms of DCG, it should look something like this (minus any string validation predicates you're suggesting):
expression --> fsignature, ['('], terms, [')'].
fsignature --> ... % whatever fsignature looks like
terms --> term. % "terms" is a single term, or...
terms --> terms, term. % ... more terms followed by a single term.
term --> ... % whatever a term looks like
You can also write the definition of terms as:
terms --> term | terms, term.
Note that the non-recursive definition of terms occurs before the recursive one. Also, the definition for terms above assumes that you must have at least one term (this requirement wasn't stated in your question).
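To make this concrete, here is a hypothetical instantiation (the fsignature and term bodies below are placeholders I invented, not part of the answer), together with a query in the spirit of the one in the question; note the parentheses are the atoms '(' and ')', matching the terminals in the rules above:

fsignature --> [foo].   % placeholder: the signature is just the atom foo
term --> [a].           % placeholder: a term is just the atom a

?- phrase(expression, [foo, '(', a, a, ')']).
true.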

Eliminating Epsilon Production for Left Recursion Elimination

I'm following the algorithm for left recursion elimination from a grammar. It says to remove the epsilon productions, if there are any.
I have the following grammar:
S-->Aa/b
A-->Ac/Sd/∈
I can see that after removing the epsilon productions the grammar becomes:
1) S-->Aa/a/b
2) A-->Ac/Sd/c/d
I'm confused about where the a/b comes from in 1) and the c/d in 2).
Can someone explain this?
Let's look at the rule S->Aa: if A->∈, then S->∈a, giving just S->a, so together with the previous rules we get S->Aa|a|b.
Now let's check the rule A->Ac: with A->∈ it becomes A->∈c, which gives us A->c.
What about A->Sd? I don't see how you got A->d as a rule. If that were a rule, then the string "da" would be accepted by this grammar (S->Aa & A->d --> "da"). But try to construct this string with the original grammar: if you start with S and the string finishes with a, you must use S->Aa; but then, in order to have a "d", you must use A->Sd, which forces another "a" or "b". So we cannot construct this string, and the rule A->d is not correct.

Grammar: start: (a b)? a c; Input: a d. Which error correct at position 2? 1. expected "b", "c". OR expected "c"

Grammar:
rule: (a b)? a c ;
Input:
a d
Question: Which error message is correct at position 2 for the given input?
1. expected "b", "c".
2. expected "c".
P.S.
I am writing a parser and I face a dilemma: should I take into account that "b" is expected at that position, or not?
Error #1 (expected "b", "c") suggests that the input "a b" is expected; but because it is optional, it is not strictly expected, only possible.
I don't know whether "possible" is the same as "expected" or not.
Which error message is better and correct, #1 or #2?
Thanks for answers.
P.S.
In the first case I define the testing marker as a position limit:
if (_inputPos > testing) {
    _failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}
The limit is moved inside optional expressions:
OPTIONAL_EXPRESSION:
    testing = _inputPos;
The "b" expression moves _inputPos above the testing position and adds a failure at _inputPos.
In the second case I can define the testing marker as a boolean flag:
if (!testing) {
    _failure(_inputPos, _code[cp + {{OFFSET_RESULT}}]);
}
The "b" expression in this case does not add a failure, because it is being tested (it is inside an optional expression).
What do you think is better and correct?
Option 1: testing is defined as a specific position; if an expression fails above this position (_inputPos > testing), it adds a failure (even if it is inside an optional expression).
Option 2: testing is defined as a flag; if this flag is set, failures are not taken into account. After executing an optional expression, the previous value of testing (true or false) is restored (not reset!).
Also, failures are not taken into account if the rule does not fail; they are only reported if parsing fails as a whole.
P.S.
Changes at 06 Jan 2014:
This question arose because it relates to two different problems.
First problem:
A parsing expression grammar (PEG) describes only three atomic items of input:
terminal symbol
nonterminal symbol
empty string
Such a grammar does not provide an operation such as lexical preprocessing, and thus it does not provide an element such as the token.
Second problem:
What is a grammar? Can two grammars be considered equal if they accept the same input but produce different results?
Assume we have two grammars:
Grammar 1
rule <- type? identifier
Grammar 2
rule <- type identifier / identifier
They both accept the same input but produce (in PEG) different results.
Grammar 1 results:
{type : type, identifier : identifier}
{type : null, identifier : identifier}
Grammar 2 results:
{type : type, identifier : identifier}
{identifier : identifier}
Questions:
Are both grammars equal?
Is it painless to optimize grammars?
My answer to both questions is negative: not equal, and not painless.
But you may ask: "Why does this happen?"
I can answer you: "Because this is not a problem. This is a feature."
In a PEG parser an expression ALWAYS consists of these parts:
ORDERED_CHOICE => SEQUENCE => EXPRESSION
And this explanation is my answer to the question "Why does this happen?".
Another problem:
A PEG parser does not recognize WHITESPACE, because it has no tokens and no token separators.
Now look at this grammar (in short):
program <- WHITESPACE expr EOF
expr <- ruleX
ruleX <- 'X' WHITESPACE
WHITESPACE <- ' '?
EOF <- ! .
All PEG grammars are described in this manner: a first WHITESPACE at the beginning, and another WHITESPACE (often) at the end of each rule. In this case, in PEG, the optional WHITESPACE must be assumed to be expected.
But WHITESPACE does not mean only a space character. It may be more complex, [\t\n\r], and may even include comments.
The main rule of error messages is the following: if it is not possible to display all expected elements (or not possible to display even one element of the whole set of expected elements), then it is more correct not to display anything; more precisely, it is required to display an "unexpected" error message.
How would you display an expected WHITESPACE in PEG?
Parser error: expected WHITESPACE
Parser error: expected ' ', '\t', '\n', '\r'
What about the start characters of comments? They may also be part of WHITESPACE in some grammars.
In this case an optional WHITESPACE will reject all other potential expected elements, because it is not possible to display WHITESPACE correctly in an error message: WHITESPACE is too complex to display.
Is this good or bad? I think this is not bad, but it requires some tricks to hide this nature of PEG parsers.
And in my PEG parser I do not assume that the inner expression at the first position of an optional (optional & zero_or_more) expression must be treated as expected. But all other inner expressions (except the one at the first position) must be treated as expected.
Example 1:
List<int list; // type? ident
Here "List<int" is a "type", but the missing ">" is not at the first position inside the optional "type?". This failure is taken into account and reported as "expected '>'".
This is because we do not skip "type" but enter into it; after the really optional "List" we move the position from the first element to the next real "expected" element (which is already beyond the testing position). "List" was at the "testing" position.
If an inner expression (inside an optional expression) stays within the limit and does not continue past the testing position, then it is not treated as expected input.
The main question above was asked on the basis of this assumption. Just take into account that we are talking about PEG parsers and their error messages.
Here is your grammar again:
rule: (a b)? a c ;
What is clear here is that after the first a there are two possible inputs: b or c. Your error message should not prioritize one over the other.
The basic idea when producing an error message for invalid input is to find the farthest place where you failed (if your grammar were d | (a b)? a c, then d wouldn't be part of the error) and determine all possible inputs that could make you advance, then say "expected '...' but got '...'". There are other approaches that try to recover the parser and force it to continue: if there is only one possible expected token, temporarily insert it into the token stream and continue as if it had been there all along. This can lead to better error detection, as you can find errors beyond the point where the parser first stopped.
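As an illustration of the "farthest failure" idea, here is my own sketch (not from the answer; the names parse_with_error, expect and record_expected are invented, and nb_setval/2 and nb_getval/2 are SWI-Prolog's non-backtrackable global variables): a tiny Prolog recognizer for the grammar rule: (a b)? a c over a list of token atoms. Every failed terminal test records which token was expected at the farthest position reached, so for the input [a, d] it reports both b and c as expected at position 2.

parse_with_error(Tokens, Result) :-
    length(Tokens, N),
    nb_setval(input_length, N),
    nb_setval(farthest, 0-[]),
    (   phrase(rule, Tokens)
    ->  Result = ok
    ;   nb_getval(farthest, Pos-Expected0),
        sort(Expected0, Expected),
        Result = error(Pos, expected(Expected))
    ).

rule --> optional_ab, expect(a), expect(c).

optional_ab --> expect(a), expect(b).   % the (a b)? branch that matches a b
optional_ab --> [].                     % the (a b)? branch that matches nothing

% expect(T) consumes terminal T; if the next token differs (or input is empty),
% it records T as expected at the current 1-based position and fails.
expect(T, S0, S) :-
    (   S0 = [T|S]
    ->  true
    ;   record_expected(T, S0),
        fail
    ).

record_expected(T, Rest) :-
    nb_getval(input_length, N),
    length(Rest, R),
    Pos is N - R + 1,
    nb_getval(farthest, FPos-Ts),
    (   Pos > FPos   -> nb_setval(farthest, Pos-[T])
    ;   Pos =:= FPos -> nb_setval(farthest, Pos-[T|Ts])
    ;   true
    ).

?- parse_with_error([a, d], R).
R = error(2, expected([b, c])).

Note that the recorded failures are only reported when the whole parse fails, which matches the behaviour described in the question.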

Parsing an expression in Prolog and returning an abstract syntax

I have to write parse(Tkns, T) that takes a mathematical expression in the form of a list of tokens and finds T, a term representing the abstract syntax, respecting order of operations and associativity.
For example,
?- parse( [ num(3), plus, num(2), star, num(1) ], T ).
T = add(integer(3), multiply(integer(2), integer(1))) ;
No
I've attempted to implement + and * as follows
parse([num(X)], integer(X)).
parse(Tkns, T) :-
    (   append(E1, [plus|E2], Tkns),
        parse(E1, T1),
        parse(E2, T2),
        T = add(T1,T2)
    ;   append(E1, [star|E2], Tkns),
        parse(E1, T1),
        parse(E2, T2),
        T = multiply(T1,T2)
    ).
Which finds the correct answer, but also returns answers that do not follow associativity or order of operations.
For example,
parse( [ num(3), plus, num(2), star, num(1) ], T ).
also returns
multiply(add(integer(3), integer(2)), integer(1))
and
parse([num(1), plus, num(2), plus, num(3)], T)
returns the equivalent of both (1+2)+3 and 1+(2+3), when it should only return the former.
Is there a way I can get this to work?
Edit: more info: I only need to implement +, -, *, /, and negate (-1, -2, etc.), and all numbers are integers. A hint was given that the code will be structured similarly to the grammar
<expression> ::= <expression> + <term>
| <expression> - <term>
| <term>
<term> ::= <term> * <factor>
| <term> / <factor>
| <factor>
<factor> ::= num
| ( <expression> )
Only with negate implemented as well.
Edit2: I found a grammar parser written in Prolog (http://www.cs.sunysb.edu/~warren/xsbbook/node10.html). Is there a way I could modify it to print a left-hand derivation of a grammar ("print" in the sense that the Prolog interpreter will output "T = [the correct answer]")?
Removing left recursion will drive you towards DCG-based grammars. But there is an interesting alternative: implement bottom-up parsing.
How hard is this in Prolog? Well, as Pereira and Shieber show in their wonderful book 'Prolog and Natural-Language Analysis', it can be really easy. From chapter 6.5:
Prolog supplies by default a top-down, left-to-right, backtrack parsing algorithm for DCGs. It is well known that top-down parsing algorithms of this kind will loop on left-recursive rules (cf. the example of Program 2.3). Although techniques are available to remove left recursion from context-free grammars, these techniques are not readily generalizable to DCGs, and furthermore they can increase grammar size by large factors.
As an alternative, we may consider implementing a bottom-up parsing method directly in Prolog. Of the various possibilities, we will consider here the left-corner method in one of its adaptations to DCGs.
For programming convenience, the input grammar for the left-corner DCG interpreter is represented in a slight variation of the DCG notation. The right-hand sides of rules are given as lists rather than conjunctions of literals. Thus rules are unit clauses of the form, e.g.,
s ---> [np, vp].
or
optrel ---> [].
Terminals are introduced by dictionary unit clauses of the form word(w,PT).
Consider completing that reading before proceeding (look up the free book entry by title on the info page).
Now let's try writing a bottom-up processor:
:- op(150, xfx, ---> ).

parse(Phrase) -->
    leaf(SubPhrase),
    lc(SubPhrase, Phrase).

leaf(Cat) --> [Word], {word(Word,Cat)}.
leaf(Phrase) --> {Phrase ---> []}.

lc(Phrase, Phrase) --> [].
lc(SubPhrase, SuperPhrase) -->
    {Phrase ---> [SubPhrase|Rest]},
    parse_rest(Rest),
    lc(Phrase, SuperPhrase).

parse_rest([]) --> [].
parse_rest([Phrase|Phrases]) -->
    parse(Phrase),
    parse_rest(Phrases).

% that's all! fairly easy, isn't it ?

% here starts the grammar: replace it with your own, don't worry about left recursion
e(sum(L,R)) ---> [e(L),sum,e(R)].
e(num(N)) ---> [num(N)].

word(N, num(N)) :- integer(N).
word(+, sum).
That, for instance, yields:
?- phrase(parse(P), [1,+,3,+,1]).
P = e(sum(sum(num(1), num(3)), num(1)))
Note that the left-recursive grammar used is e ::= e + e | num.
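Since the question's tokens are num(3), plus, star rather than plain integers and +, a couple of extra dictionary clauses would let the same interpreter consume that token format directly (my own hypothetical addition, not part of the original answer):

word(num(N), num(N)) :- integer(N).    % the question's number tokens
word(plus, sum).                       % the question's + token

?- phrase(parse(P), [num(1), plus, num(2)]).
P = e(sum(num(1), num(2)))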
Before fixing your program, look at how you identified the problem! You assumed that a particular sentence will have exactly one syntax tree, but you got two of them. So essentially, Prolog helped you to find the bug!
This is a very useful debugging strategy in Prolog: Look at all the answers.
Next is the specific way how you encoded the grammar. In fact, you did something quite smart: You essentially encoded a left-recursive grammar - nevertheless your program terminates for a list of fixed length! That's because you indicate within each recursion that there has to be at least one element in the middle serving as operator. So for each recursion there has to be at least one element. That is fine. However, this strategy is inherently very inefficient. For, for each application of the rule, it will have to consider all possible partitions.
Another disadvantage is that you can no longer generate a sentence out of a syntax tree. That is, if you use your definition with:
?- parse(S, add(add(integer(1),integer(2)),integer(3))).
There are two reasons why this does not work. The first is that the goals T = add(...,...) come too late; simply put them at the beginning, in front of the append/3 goals. But much more interesting is that now append/3 does not terminate. Here is the relevant failure-slice (see the link for more on this).
parse([num(X)], integer(X)) :- false.
parse(Tkns, T) :-
    (   T = add(T1,T2),
        append(E1, [plus|E2], Tkns), false,
        parse(E1, T1),
        parse(E2, T2)
    ;   false,
        T = multiply(T1,T2),
        append(E1, [star|E2], Tkns),
        parse(E1, T1),
        parse(E2, T2)
    ).
@DanielLyons already gave you the "traditional" solution, which requires all kinds of justification from formal languages. But I will stick to the grammar you encoded in your program, which - translated into DCGs - reads:
expr(integer(X)) --> [num(X)].
expr(add(L,R)) --> expr(L), [plus], expr(R).
expr(multiply(L,R)) --> expr(L), [star], expr(R).
When using this grammar with ?- phrase(expr(T),[num(1),plus,num(2),plus,num(3)]). it will not terminate. Here is the relevant slice:
expr(integer(X)) --> {false}, [num(X)].
expr(add(L,R)) --> expr(L), {false}, [plus], expr(R).
expr(multiply(L,R)) --> {false}, expr(L), [star], expr(R).
So it is this tiny part that has to be changed. Note that the rule "knows" that it wants one terminal symbol, alas, the terminal appears too late. If only it would occur in front of the recursion! But it does not.
There is a general way to fix this: add another pair of arguments to encode the length.
parse(T, L) :-
    phrase(expr(T, L,[]), L).

expr(integer(X), [_|S],S) --> [num(X)].
expr(add(L,R), [_|S0],S) --> expr(L, S0,S1), [plus], expr(R, S1,S).
expr(multiply(L,R), [_|S0],S) --> expr(L, S0,S1), [star], expr(R, S1,S).
This is a very general method that is of particular interest if you have ambiguous grammars, or if you do not know whether or not your grammar is ambiguous. Simply let Prolog do the thinking for you!
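For illustration (my own query, assuming the definitions above), this version terminates and cleanly enumerates both parse trees for 1+2+3:

?- parse(T, [num(1), plus, num(2), plus, num(3)]).
T = add(integer(1), add(integer(2), integer(3))) ;
T = add(add(integer(1), integer(2)), integer(3)) ;
false.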
The correct approach is to use DCGs, but your example grammar is left-recursive, which won't work. Here's what would:
expression(T+E) --> term(T), [plus], expression(E).
expression(T-E) --> term(T), [minus], expression(E).
expression(T) --> term(T).
term(F*T) --> factor(F), [star], term(T).
term(F/T) --> factor(F), [div], term(T).
term(F) --> factor(F).
factor(N) --> num(N).
factor(E) --> ['('], expression(E), [')'].
num(N) --> [num(N)], { number(N) }.
The relationship between this and your sample grammar should be obvious, as should the transformation from left-recursive to right-recursive. I can't recall the details from my automata class about left-most derivations, but I think it only comes into play if the grammar is ambiguous, and I don't think this one is. Hopefully a genuine computer scientist will come along and clarify that point.
I see no point in producing an AST other than what Prolog would use. The code within parentheses on the left-hand side of each production is the AST-building code (e.g. the T+E in the first expression//1 rule). Adjust the code accordingly if this is undesirable.
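The question also asks for unary negation; one hypothetical way to add it (my own sketch, assuming the tokeniser emits the same minus token for the unary sign) is an extra factor//1 rule:

factor(-(F)) --> [minus], factor(F).   % unary minus, e.g. [minus, num(3)] becomes -3

With that, a leading minus token becomes a unary -/1 term, which is/2 can still evaluate.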
From here, presenting your parse/2 API is quite trivial:
parse(L, T) :- phrase(expression(T), L).
Because we're using Prolog's own structures, the result will look a lot less impressive than it is:
?- parse([num(4), star, num(8), div, '(', num(3), plus, num(1), ')'], T).
T = 4* (8/ (3+1)) ;
false.
You can show a more AST-y output if you like using write_canonical/1:
?- parse([num(4), star, num(8), div, '(', num(3), plus, num(1), ')'], T),
write_canonical(T).
*(4,/(8,+(3,1)))
T = 4* (8/ (3+1)) .
The part *(4,/(8,+(3,1))) is the result of write_canonical/1. And you can evaluate that directly with is/2:
?- parse([num(4), star, num(8), div, '(', num(3), plus, num(1), ')'], T),
Result is T.
T = 4* (8/ (3+1)),
Result = 8 ;
false.

How to convert prolog parse tree back to a logical sentence

I managed to build the parse tree for a given sentence; here it is, for the sentence "The man went home.":
T = s(np(det(the), n(man)), vp(v(went), np(n(home))))
1) How do I use phrase/2 on this?
"How to translate a sentence in a logical language using prolog?" is similar to what I need, but its solution doesn't work for me.
2) I want to map this to a grammar pattern and get each word's tag:
Det=the, N(Subject)=man, V=went, N(Object)=home
Is there a way to map this tree against a given set of tree structures and identify the grammar?
How can I use the parse tree to identify the subject, verb and object, the grammar pattern, and then generate the target-language sentence?
Edited later:
I tried this code and it gives a reasonable answer. Any suggestions on this code?
sent("(s(np(n(man))) (vp(v(went)) (np(n(home)))))").
whitespace --> [X], { char_type(X, white) ; char_type(X, space) }, whitespace.
whitespace --> [].
char(C) --> [C], { char_type(C, graph), \+ memberchk(C, "()") }.
chars([C|Rest]) --> char(C), chars(Rest).
chars([C]) --> char(C).
term(T) --> chars(C), { atom_chars(T, C) }.
term(L) --> list(L).
list(T) --> "(", terms(T), ")".
terms([]) --> [].
terms([T|Terms]) --> term(T), whitespace, !, terms(Terms).
simplify([s,[np, [n,[Subject]]], [vp,[v,[Verb]],[np,[n,[Object]]]]],Result) :- Result = [Subject,Verb,Object].
Thanks Mathee
The simpler way to do this is by means of a visit of the tree, 'hardcoded' on the symbols you are interested in.
Here is a more generic utility, that uses (=..)/2 to capture a named part of the tree:
part_of(T, S, R) :-
    T =.. [F|As],
    (   F = S,
        R = T
    ;   member(N, As),
        part_of(N, S, R)
    ).
?- part_of(s(np(det(the), n(man)), vp(v(went), np(n(home)))),np,P).
P = np(det(the), n(man)) ;
P = np(n(home)) ;
false.
It's a kind of member/2, just for trees. BTW, I don't understand the first part of your question: why do you want to use phrase/2 on a syntax tree? Usually a grammar (the first argument to phrase/2) is meant to build a syntax tree from a 'raw' character stream...
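To address point 2) of the question, part_of/3 can be combined with a small pattern-matching wrapper. This is my own hypothetical sketch (subject_verb_object/4 is not from the answer); it assumes, as in the tree shown above, that the subject noun phrase carries a determiner while the object noun phrase does not:

subject_verb_object(Tree, Subj, Verb, Obj) :-
    part_of(Tree, np, np(_, n(Subj))),   % np with a determiner: the subject
    part_of(Tree, v, v(Verb)),           % the verb
    part_of(Tree, np, np(n(Obj))).       % bare np: the object

?- subject_verb_object(s(np(det(the), n(man)), vp(v(went), np(n(home)))), S, V, O).
S = man, V = went, O = home.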
