I have to write parse(Tkns, T) that takes in a mathematical expression in the form of a list of tokens and finds T, and return a statement representing the abstract syntax, respecting order of operations and associativity.
For example,
?- parse( [ num(3), plus, num(2), star, num(1) ], T ).
T = add(integer(3), multiply(integer(2), integer(1))) ;
No
I've attempted to implement + and * as follows
parse([num(X)], integer(X)).
parse(Tkns, T) :-
( append(E1, [plus|E2], Tkns),
parse(E1, T1),
parse(E2, T2),
T = add(T1,T2)
; append(E1, [star|E2], Tkns),
parse(E1, T1),
parse(E2, T2),
T = multiply(T1,T2)
).
Which finds the correct answer, but also returns answers that do not follow associativity or order of operations.
ex)
parse( [ num(3), plus, num(2), star, num(1) ], T ).
also returns
mult(add(integer(3), integer(2)), integer(1))
and
parse([num(1), plus, num(2), plus, num(3)], T)
returns the equivalent of 1+2+3 and 1+(2+3) when it should only return the former.
Is there a way I can get this to work?
Edit: more info: I only need to implement +,-,*,/,negate (-1, -2, etc.) and all numbers are integers. A hint was given that the code will be structured similarly to the grammer
<expression> ::= <expression> + <term>
| <expression> - <term>
| <term>
<term> ::= <term> * <factor>
| <term> / <factor>
| <factor>
<factor> ::= num
| ( <expression> )
Only with negate implemented as well.
Edit2: I found a grammar parser written in Prolog (http://www.cs.sunysb.edu/~warren/xsbbook/node10.html). Is there a way I could modify it to print a left hand derivation of a grammar ("print" in the sense that the Prolog interpreter will output "T=[the correct answer]")
Removing left recursion will drive you towards DCG based grammars.
But there is an interesting alternative way: implement bottom up parsing.
How hard is this in Prolog ? Well, as Pereira and Shieber show in their wonderful book
'Prolog and Natural-Language Analysis', can be really easy: from chapter 6.5
Prolog supplies by default a top-down, left-to-right, backtrack parsing algorithm for
DCGs.
It is well known that top-down parsing algorithms of this kind will loop on
left-recursive rules (cf. the example of Program 2.3).
Although techniques are avail-
able to remove left recursion from context-free grammars, these techniques are not
readily generalizable to DCGs, and furthermore they can increase grammar size by
large factors.
As an alternative, we may consider implementing a bottom-up parsing method
directly in Prolog. Of the various possibilities, we will consider here the left-corner
method in one of its adaptations to DCGs.
For programming convenience, the input grammar for the left-corner DCG interpreter is represented in a slight variation of the DCG notation. The right-hand sides of
rules are given as lists rather than conjunctions of literals. Thus rules are unit clauses
of the form, e.g.,
s ---> [np, vp].
or
optrel ---> [].
Terminals are introduced by dictionary unit clauses of the form word(w,PT).
Consider to complete the lecture before proceeding (lookup the free book entry by title in info page).
Now let's try writing a bottom up processor:
:- op(150, xfx, ---> ).
parse(Phrase) -->
leaf(SubPhrase),
lc(SubPhrase, Phrase).
leaf(Cat) --> [Word], {word(Word,Cat)}.
leaf(Phrase) --> {Phrase ---> []}.
lc(Phrase, Phrase) --> [].
lc(SubPhrase, SuperPhrase) -->
{Phrase ---> [SubPhrase|Rest]},
parse_rest(Rest),
lc(Phrase, SuperPhrase).
parse_rest([]) --> [].
parse_rest([Phrase|Phrases]) -->
parse(Phrase),
parse_rest(Phrases).
% that's all! fairly easy, isn't it ?
% here start the grammar: replace with your one, don't worry about Left Recursion
e(sum(L,R)) ---> [e(L),sum,e(R)].
e(num(N)) ---> [num(N)].
word(N, num(N)) :- integer(N).
word(+, sum).
that for instance yields
phrase(parse(P), [1,+,3,+,1]).
P = e(sum(sum(num(1), num(3)), num(1)))
note the left recursive grammar used is e ::= e + e | num
Before fixing your program, look at how you identified the problem! You assumed that a particular sentence will have exactly one syntax tree, but you got two of them. So essentially, Prolog helped you to find the bug!
This is a very useful debugging strategy in Prolog: Look at all the answers.
Next is the specific way how you encoded the grammar. In fact, you did something quite smart: You essentially encoded a left-recursive grammar - nevertheless your program terminates for a list of fixed length! That's because you indicate within each recursion that there has to be at least one element in the middle serving as operator. So for each recursion there has to be at least one element. That is fine. However, this strategy is inherently very inefficient. For, for each application of the rule, it will have to consider all possible partitions.
Another disadvantage is that you can no longer generate a sentence out of a syntax tree. That is, if you use your definition with:
?- parse(S, add(add(integer(1),integer(2)),integer(3))).
There are two reasons: The first is that the goals T = add(...,...) are too late. Simply put them at the beginning in front of the append/3 goals. But much more interesting is that now append/3 does not terminate. Here is the relevant failure-slice (see the link for more on this).
parse([num(X)], integer(X)) :- false.
parse(Tkns, T) :-
( T = add(T1,T2),
append(E1, [plus|E2], Tkns), false,
parse(E1, T1),
parse(E2, T2),
; false, T = multiply(T1,T2),
append(E1, [star|E2], Tkns),
parse(E1, T1),
parse(E2, T2),
).
#DanielLyons already gave you the "traditional" solution which requires all kinds of justification from formal languages. But I will stick to your grammar you encoded in your program which - translated into DCGs - reads:
expr(integer(X)) --> [num(X)].
expr(add(L,R)) --> expr(L), [plus], expr(R).
expr(multiply(L,R)) --> expr(L), [star], expr(R).
When using this grammar with ?- phrase(expr(T),[num(1),plus,num(2),plus,num(3)]). it will not terminate. Here is the relevant slice:
expr(integer(X)) --> {false}, [num(X)].
expr(add(L,R)) --> expr(L), {false}, [plus], expr(R).
expr(multiply(L,R)) --> {false}expr(L), [star], expr(R).
So it is this tiny part that has to be changed. Note that the rule "knows" that it wants one terminal symbol, alas, the terminal appears too late. If only it would occur in front of the recursion! But it does not.
There is a general way how to fix this: Add another pair of arguments to encode the length.
parse(T, L) :-
phrase(expr(T, L,[]), L).
expr(integer(X), [_|S],S) --> [num(X)].
expr(add(L,R), [_|S0],S) --> expr(L, S0,S1), [plus], expr(R, S1,S).
expr(multiply(L,R), [_|S0],S) --> expr(L, S0,S1), [star], expr(R, S1,S).
This is a very general method that is of particular interest if you have ambiguous grammars, or if you do not know whether or not your grammar is ambiguous. Simply let Prolog do the thinking for you!
The correct approach is to use DCGs, but your example grammar is left-recursive, which won't work. Here's what would:
expression(T+E) --> term(T), [plus], expression(E).
expression(T-E) --> term(T), [minus], expression(E).
expression(T) --> term(T).
term(F*T) --> factor(F), [star], term(T).
term(F/T) --> factor(F), [div], term(T).
term(F) --> factor(F).
factor(N) --> num(N).
factor(E) --> ['('], expression(E), [')'].
num(N) --> [num(N)], { number(N) }.
The relationship between this and your sample grammar should be obvious, as should the transformation from left-recursive to right-recursive. I can't recall the details from my automata class about left-most derivations, but I think it only comes into play if the grammar is ambiguous, and I don't think this one is. Hopefully a genuine computer scientist will come along and clarify that point.
I see no point in producing an AST other than what Prolog would use. The code within parenthesis on the left-hand side of the production is the AST-building code (e.g. the T+E in the first expression//1 rule). Adjust the code accordingly if this is undesirable.
From here, presenting your parse/2 API is quite trivial:
parse(L, T) :- phrase(expression(T), L).
Because we're using Prolog's own structures, the result will look a lot less impressive than it is:
?- parse([num(4), star, num(8), div, '(', num(3), plus, num(1), ')'], T).
T = 4* (8/ (3+1)) ;
false.
You can show a more AST-y output if you like using write_canonical/2:
?- parse([num(4), star, num(8), div, '(', num(3), plus, num(1), ')'], T),
write_canonical(T).
*(4,/(8,+(3,1)))
T = 4* (8/ (3+1)) a
The part *(4,/(8,+(3,1))) is the result of write_canonical/1. And you can evaluate that directly with is/2:
?- parse([num(4), star, num(8), div, '(', num(3), plus, num(1), ')'], T),
Result is T.
T = 4* (8/ (3+1)),
Result = 8 ;
false.
I've started looking into REBOL, just for fun, and as a fan of programming languages, I really like seeing new ideas and even just alternative syntaxes. REBOL is definitely full of these. One thing I noticed is the use of '/' as the path operator which can be used similarly to the '.' operator in most object-oriented programming languages. I have not programmed in REBOL extensively, just looked at some examples and read some documentation, but it isn't clear to me why there's no ambiguity with the '/' operator.
x: 4
y: 2
result: x/y
In my example, this should be division, but it seems like it could just as easily be the path operator if x were an object or function refinement. How does REBOL handle the ambiguity? Is it just a matter of an overloaded operator and the type system so it doesn't know until runtime? Or is it something I'm missing in the grammar and there really is a difference?
UPDATE Found a good piece of example code:
sp: to-integer (100 * 2 * length? buf) / d/3 / 1024 / 1024
It appears that arithmetic division requires whitespace, while the path operator requires no whitespace. Is that it?
This question deserves an answer from the syntactic point of view. In Rebol, there is no "path operator", in fact. The x/y is a syntactic element called path. As opposed to that the standalone / (delimited by spaces) is not a path, it is a word (which is usually interpreted as the division operator). In Rebol you can examine syntactic elements like this:
length? code: [x/y x / y] ; == 4
type? first code ; == path!
type? second code
, etc.
The code guide says:
White-space is used in general for delimiting (for separating symbols).
This is especially important because words may contain characters such as + and -.
http://www.rebol.com/r3/docs/guide/code-syntax.html
One acquired skill of being a REBOler is to get the hang of inserting whitespace in expressions where other languages usually do not require it :)
Spaces are generally needed in Rebol, but there are exceptions here and there for "special" characters, such as those delimiting series. For instance:
[a b c] is the same as [ a b c ]
(a b c) is the same as ( a b c )
[a b c]def is the same as [a b c] def
Some fairly powerful tools for doing introspection of syntactic elements are type?, quote, and probe. The quote operator prevents the interpreter from giving behavior to things. So if you tried something like:
>> data: [x [y 10]]
>> type? data/x/y
>> probe data/x/y
The "live" nature of the code would dig through the path and give you an integer! of value 10. But if you use quote:
>> data: [x [y 10]]
>> type? quote data/x/y
>> probe quote data/x/y
Then you wind up with a path! whose value is simply data/x/y, it never gets evaluated.
In the internal representation, a PATH! is quite similar to a BLOCK! or a PAREN!. It just has this special distinctive lexical type, which allows it to be treated differently. Although you've noticed that it can behave like a "dot" by picking members out of an object or series, that is only how it is used by the DO dialect. You could invent your own ideas, let's say you make the "russell" command:
russell [
x: 10
y: 20
z: 30
x/y/z
(
print x
print y
print z
)
]
Imagine that in my fanciful example, this outputs 30, 10, 20...because what the russell function does is evaluate its block in such a way that a path is treated as an instruction to shift values. So x/y/z means x=>y, y=>z, and z=>x. Then any code in parentheses is run in the DO dialect. Assignments are treated normally.
When you want to make up a fun new riff on how to express yourself, Rebol takes care of a lot of the grunt work. So for example the parentheses are guaranteed to have matched up to get a paren!. You don't have to go looking for all that yourself, you just build your dialect up from the building blocks of all those different types...and hook into existing behaviors (such as the DO dialect for basics like math and general computation, and the mind-bending PARSE dialect for some rather amazing pattern matching muscle).
But speaking of "all those different types", there's yet another weirdo situation for slash that can create another type:
>> type? quote /foo
This is called a refinement!, and happens when you start a lexical element with a slash. You'll see it used in the DO dialect to call out optional parameter sets to a function. But once again, it's just another symbolic LEGO in the parts box. You can ascribe meaning to it in your own dialects that is completely different...
While I didn't find any written definitive clarification, I did also find that +,-,* and others are valid characters in a word, so clearly it requires a space.
x*y
Is a valid identifier
x * y
Performs multiplication. It looks like the path operator is just another case of this.