Build parse tree with Prolog - parsing

I am trying to write a parser using Prolog.
I have a tokenizer which returns a list of tokens. For instance:
Tokens = [key(read), id('N'), sep(:=), int(10), ...]
All I need is to make Prolog return a set of instructions to run a program.
program = [].
program = [Instructions | Program].
The question is: what is the easiest way to build a parse tree for the given tokens and grammar (like bison does)? I would be grateful for any help.

Following up on @mat's suggestion, and using your syntax for a token stream, here's a simple example.
Suppose you have the following BNF defining your language:
<program> ::= program begin <statement_list> end .
<statement_list> ::= <statement> | <statement> ; <statement_list>
<statement> ::= ...
You need to decide how you want to represent your parse tree. I've chosen what seems to be a clunky way to represent the n-ary parse tree as embedded lists (I didn't put a lot of thought into it), but hopefully this will give you the idea of how it works. Your DCG would directly reflect the BNF, and the argument would be the parse tree:
program([key(program), key(begin), AST_statement_list, key(end), sep(.)]) -->
    [key(program), key(begin)],
    statement_list(AST_statement_list),
    [key(end), sep(.)].

statement_list([AST]) --> statement(AST).
statement_list([AST_statement | AST_statement_list]) -->
    statement(AST_statement),
    statement_list(AST_statement_list).
statement(AST) --> ...
You would parse by calling your DCG with the token stream:
phrase(program(ParseTree), TokenStream).
A program parse tree would look like:
[key(program), key(begin), [Statement1_List, Statement2_List, ...], key(end), sep(.)]
Each StatementN_List would itself be an n-ary tree for the statement subtree.
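For instance, assuming hypothetical read and assignment statement forms suggested by the token stream in the question (these forms are not part of the BNF above), statement//1 rules in the same embedded-list style might be sketched as:
% read statement, e.g. the tokens  key(read), id('N')
statement([key(read), id(Name)]) -->
    [key(read), id(Name)].
% assignment statement, e.g. the tokens  id('N'), sep(:=), int(10)
statement([id(Name), sep(:=), int(Value)]) -->
    [id(Name), sep(:=), int(Value)].
A call such as phrase(statement(T), [id('N'), sep(:=), int(10)]) would then bind T to [id('N'), sep(:=), int(10)].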

Related

Parsing Dart | ANTLR | Handle a comma at the end of parameter list

My apologies for the bad title, but I couldn't express it in better words.
I'm writing a parser using ANTLR to calculate complexities in Dart code.
Things seemed to work fine until I tried to parse a file with the following method signature:
Stream<SomeState> mapEventToState(SomeEvent event,) async* {
//someCode to map the State to Event
}
Here mapEventToState(SomeEvent event,) creates an issue because of the trailing comma.
The parser reports two parameters (whereas in reality there is just one) and pulls part of the following code into the parameter list, making the rest of the code unreadable for ANTLR.
Ending a parameter list with a trailing comma is normal in Flutter code.
The grammar corresponding to it is:
initializedVariableDeclaration
: declaredIdentifier ('=' expression)? (',' initializedIdentifier)*
;
initializedIdentifier
: identifier ('=' expression)?
;
initializedIdentifierList
: initializedIdentifier (',' initializedIdentifier)*
;
The full grammar can be checked at https://github.com/antlr/grammars-v4/blob/master/dart2/Dart2.g4
What should I change in the grammar so that I don't face this issue and the parser understands that functionName(Param param1, Param param2,) is the same as functionName(Param param1, Param param2)?
The Dart project maintains a reference ANTLR grammar for the Dart language (mostly as a tool for ourselves, to ensure new language features can be parsed).
It might be useful as a reference.
The "dart2" grammar you are linking to in the ANTLR repository is probably severely outdated. It was not created by a Dart team member, and if it doesn't handle trailing commas in argument lists, it was probably never complete for Dart 2.0. Use with caution.
I do not believe that the rule you mentioned (initializedVariableDeclaration) is the grammar corresponding to the problem. That's for an ordinary variable declaration (with an initializer).
I believe you actually want to change formalParameterList. The Dart grammar is provided by the language specification, and we can compare the grammar listed there to the grammar from the ANTLR repository.
The ANTLR file has:
formalParameterList
: '(' ')'
| '(' normalFormalParameters ')'
...
whereas the Dart 2.10 specification has, from section 9.2 (Formal Parameters):
<formalParameterList> ::= ‘(’ ‘)’
| ‘(’ <normalFormalParameters> ‘,’? ‘)’
...
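A minimal change that mirrors the specification would be to allow an optional trailing comma in that alternative. This is only a sketch of the affected alternatives; the elided parts would stay as they are in the linked Dart2.g4:
formalParameterList
    : '(' ')'
    | '(' normalFormalParameters ','? ')'
    ...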
You should file an issue against ANTLR or create a pull request to fix it.
That file also does not appear to have been substantially updated since May 2019 and seems to be missing some notable changes to the Dart language since that time (e.g. spread collections (spreadElement), collection-if (ifElement), and collection-for (forElement) from Dart 2.3, and the changes for null safety).

Is there a way to insert phases between the lexer and parser in ANTLR

I am writing a lexer/parser for a language that allows abbreviations (and globs) for its keywords, and I am trying to determine the best way to do it.
One thought that occurs to me is to insert a phase between the lexer and the parser: the lexer recognizes the general class (e.g. is this a "command name" or an "option"?) and passes those general tokens to a second phase, which does further analysis, recognizes which command name it is, and passes that on as the token type to the parser.
This will keep the parser simple: it will only have to deal with well-formed command names, and it will be clear what every token means.
It will also keep the lexer simple: it only has to divide things into classes (this is a simple name, this is a glob, this is an option name since it starts with a dash).
The phase in the middle will also be relatively simple. The simple-name (and option) forms only have to deal with strings, and the glob form can use standard glob techniques to match the glob against the legal candidates, which are in the tables for the simple names and options.
The question is how to insert that phase into ANTLR, so that I call the lexer, it creates tokens, the intermediate phase massages them, and the parser then gets the tokens the intermediate phase has categorized.
Is there a known solution for this?
Something like:
lexer grammar simple
letter: [A-Z][a-z];
digit: [0-9];
glob-char: [*?];
name: letter (letter | digit)*;
option: '-'name;
glob: (glob-char|letter)(glob-char|letter|digit)*;
glob-option: '-'glob;
filter grammar name;
end: 'e' | 'end';
generate: 'ge' | 'generate';
goto: 'go' | 'goto';
help: 'h' | 'help';
if: 'i' | 'if';
then: 't' | 'then';
parser grammar simple;
The user (a programmer writing in the language I am parsing) needs to be able to write
g*te and have it match generate.
When the phase between the lexer and the parser sees a glob, it needs to look at the glob (and the list of keywords), check whether exactly one of them matches the glob, and if so return that keyword. The stuff I listed in the "filter grammar" is what builds the list of keywords that globs can match. I have found code on the web that matches globs against a list of names; that part isn't hard.
I've since found in the ANTLR documentation how to run arbitrary code on matching a token and how to change the resulting token's type. (See my answer.)
It looks like you can use lexerCustomActions to achieve the desired effect. Something like the following.
In your lexer:
GLOB: [-A-Za-z0-9_.]* '*' [-A-Za-z0-9_.*]* { setType(lexGlob(getText())); };
and in your Java (or whatever target language you are using) code:
int lexGlob(String origText) {
    return xyzzy; // some code that computes the right kind of token type
}
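To flesh that out a little, here is a rough sketch of what lexGlob could do, assuming a hypothetical KEYWORDS map from full keyword spellings to their lexer token types (GENERATE, GOTO, and HELP are placeholders for whatever your grammar actually defines; only GLOB comes from the rule above):
// would live in an @members { ... } block of the lexer
java.util.Map<String, Integer> KEYWORDS =
        java.util.Map.of("generate", GENERATE, "goto", GOTO, "help", HELP);

int lexGlob(String globText) {
    // turn the glob into a regex: '?' matches one character, '*' any run
    String regex = globText.replace("?", ".").replace("*", ".*");
    java.util.List<String> hits = new java.util.ArrayList<>();
    for (String keyword : KEYWORDS.keySet()) {
        if (keyword.matches(regex)) hits.add(keyword);
    }
    // exactly one keyword matches: re-type the token; otherwise leave it a GLOB
    return hits.size() == 1 ? KEYWORDS.get(hits.get(0)) : GLOB;
}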

How to use context free grammars?

Could someone help me with using context-free grammars? Up until now I've used regular expressions to remove comments, block comments and empty lines from a string so that it can be used to count the PLOC. This seems to be extremely slow, so I was looking for a different, more efficient method.
I saw the following post: What is the best way to ignore comments in a java file with Rascal?
I have no idea how to use this, and the help doesn't get me far either. When I try to define the line used in the post, I immediately get an error:
lexical SingleLineComment = "//" ~[\n] "\n";
Could someone help me out with this and also explain a bit about how to set up such a context-free grammar and then actually extract the wanted data?
Kind regards,
Bob
First, this will help: the ~ is not part of Rascal's CFG notation; the negation of a character class is written like so: ![\n].
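With that fix, and repeating the negated class so the whole rest of the line is skipped, the rule from the post could look like this (a sketch that assumes every comment is terminated by a newline):
lexical SingleLineComment = "//" ![\n]* "\n";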
Using a context-free grammar in Rascal involves three steps:
write it, like for example the syntax definition of the Func language here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func
Use it to parse input, like so:
// This is the basic parse command, but be careful it will not accept spaces and newlines before and after the TopNonTerminal text:
Prog myParseTree = parse(#Prog, "example string");
// you can do the same directly to an input file:
Prog myParseTree = parse(#TopNonTerminal, |home:///myProgram.func|);
// if you need to accept layout before and after the program, use a "start nonterminal":
start[Prog] myParseTree = parse(#start[TopNonTerminal], |home:///myProgram.func|);
Prog myProgram = myParseTree.top;
// shorthand for parsing stuff:
myProgram = [Prog] "example";
myProgram = [Prog] |home:///myLocation.txt|;
Once you have the tree you can start using visit statements and / (deep match) patterns to extract information from it, or write recursive functions if you like. Examples can be found here: http://docs.rascal-mpl.org/unstable/Recipes/#Languages-Func, but here are some common idioms as well for extracting information from a parse tree:
// produces the source location of each node in the tree:
myParseTree@\loc
// produces a set of all nodes of type Stat
{ s | /Stat s := myParseTree }
// pattern match an if-then-else and bind the three expressions and collect them in a set:
{ e1, e2, e3 | (Stat) `if <Exp e1> then <Exp e2> else <Exp e3> end` <- myExpressionList }
// collect the locations of all sub-trees (every parse tree is of a non-terminal type, which is a sub-type of Tree; it uses |unknown:///| for small sub-trees which have not been annotated, for efficiency's sake, like literals and character classes):
[ t@\loc ? |unknown:///| | /Tree t := myParseTree ]
That should give you a start. I'd try out some stuff and look at more examples. Writing a grammar is a nice thing to do, but it does require trial and error, much like writing a regex, only more so.
For the grammar you might be writing, which finds source code comments but leaves the rest as "any character", you will need to use longest-match disambiguation a lot:
lexical Identifier = [a-z]+ !>> [a-z]; // means do not accept an Identifier if there is still [a-z] to add to it; so only the longest possible Identifier will match.
This kind of context-free grammar is called an "Island Grammar" metaphorically, because you will write precise rules for the parts you want to recognize (the comments are "Islands") while leaving the rest as everything else (the rest is "Water"). See https://dl.acm.org/citation.cfm?id=837160

Append text file to lexicon in Rascal

Is it possible to append terminals retrieved from a text file to a lexicon in Rascal? This would happen at run time, and I see no obvious way to achieve this. I would rather keep the data separate from the Rascal project. For example, if I had read in a list of countries from a text file, how would I add these to a lexicon (using the lexical keyword)?
In the data-dependent version of the Rascal parser this is even easier and faster but we haven't released this yet. For now I'd write a generic rule with a post-parse filter, like so:
rascal>set[str] lexicon = {"aap", "noot", "mies"};
set[str]: {"noot","mies","aap"}
rascal>lexical Word = [a-z]+;
ok
rascal>syntax LexiconWord = word: Word w;
ok
rascal>LexiconWord word(Word w) { // called when the LexiconWord.word rule is used to build a tree
>>>>>>> if ("<w>" notin lexicon)
>>>>>>> filter; // remove this parse tree
>>>>>>> else fail; // just build the tree
>>>>>>>}
rascal>[Sentence] "hello"
|prompt:///|(0,18,<1,0>,<1,18>): ParseError(|prompt:///|(0,18,<1,0>,<1,18>))
at $root$(|prompt:///|(0,64,<1,0>,<1,64>))
rascal>[Sentence] "aap"
Sentence: (Sentence) `aap`
rascal>
Because the filter function removed all possible derivations for hello, the parser eventually returns a parse error on hello. It does not do so for aap, which is in the lexicon, so hurray. Of course you can make interestingly complex derivations with this kind of filtering. People sometimes write ambiguous grammars and use filters like this to make them unambiguous.
Parsing and filtering in this way is in cubic worst-case time in terms of the length of the input, if the filtering function is in amortized constant time. If the grammar is linear, then of course the entire process is also linear.
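If you want to keep the data outside the Rascal project, the lexicon set itself can be read from a text file at run time instead of being written down literally. A minimal sketch, assuming one word per line and a hypothetical file location:
import IO;
import String;

set[str] lexicon = { trim(line) | line <- readFileLines(|home:///lexicon.txt|), trim(line) != "" };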
A completely different answer would be to dynamically update the grammar and generate a parser from this. This involves working against the internal grammar representation of Rascal like so:
set[str] lexicon = {"aap", "noot", "mies"};
syntax Word = ; // empty definition
typ = #Word;
grammar = typ.definitions;
grammar[sort("Word")] = { prod(sort("Word"), lit(x), {}) | x <- lexicon };
newType = type(sort("Word"), grammar);
This newType is a reified grammar + type for the definition of the lexicon, which can now be used like so:
import ParseTree;
if (type[Word] staticGrammar := newType) {
parse(staticGrammar, "aap");
}
Now, having written all this, two things:
I think this may trigger unknown bugs since we did not test dynamic parser generation, and
For a lexicon with a reasonable size, this will generate an utterly slow parser since the parser is optimized for keywords in programming languages and not large lexicons.

Parsing an expression in Prolog and returning an abstract syntax

I have to write parse(Tkns, T), which takes in a mathematical expression in the form of a list of tokens and finds T, a term representing the abstract syntax, respecting order of operations and associativity.
For example,
?- parse( [ num(3), plus, num(2), star, num(1) ], T ).
T = add(integer(3), multiply(integer(2), integer(1))) ;
No
I've attempted to implement + and * as follows
parse([num(X)], integer(X)).
parse(Tkns, T) :-
    (   append(E1, [plus|E2], Tkns),
        parse(E1, T1),
        parse(E2, T2),
        T = add(T1,T2)
    ;   append(E1, [star|E2], Tkns),
        parse(E1, T1),
        parse(E2, T2),
        T = multiply(T1,T2)
    ).
This finds the correct answer, but it also returns answers that do not follow associativity or order of operations.
For example,
parse( [ num(3), plus, num(2), star, num(1) ], T ).
also returns
multiply(add(integer(3), integer(2)), integer(1))
and
parse([num(1), plus, num(2), plus, num(3)], T)
returns terms equivalent to both (1+2)+3 and 1+(2+3), when it should only return the former.
Is there a way I can get this to work?
Edit: more info: I only need to implement +, -, *, /, and negate (-1, -2, etc.), and all numbers are integers. A hint was given that the code will be structured similarly to the grammar:
<expression> ::= <expression> + <term>
| <expression> - <term>
| <term>
<term> ::= <term> * <factor>
| <term> / <factor>
| <factor>
<factor> ::= num
| ( <expression> )
Only with negate implemented as well.
Edit 2: I found a grammar parser written in Prolog (http://www.cs.sunysb.edu/~warren/xsbbook/node10.html). Is there a way I could modify it to print a left-hand derivation of a grammar ("print" in the sense that the Prolog interpreter will output "T=[the correct answer]")?
Removing left recursion will drive you towards DCG-based grammars.
But there is an interesting alternative way: implement bottom up parsing.
How hard is this in Prolog? Well, as Pereira and Shieber show in their wonderful book
'Prolog and Natural-Language Analysis', it can be really easy. From chapter 6.5:
Prolog supplies by default a top-down, left-to-right, backtrack parsing algorithm for DCGs. It is well known that top-down parsing algorithms of this kind will loop on left-recursive rules (cf. the example of Program 2.3). Although techniques are available to remove left recursion from context-free grammars, these techniques are not readily generalizable to DCGs, and furthermore they can increase grammar size by large factors.
As an alternative, we may consider implementing a bottom-up parsing method directly in Prolog. Of the various possibilities, we will consider here the left-corner method in one of its adaptations to DCGs.
For programming convenience, the input grammar for the left-corner DCG interpreter is represented in a slight variation of the DCG notation. The right-hand sides of rules are given as lists rather than conjunctions of literals. Thus rules are unit clauses of the form, e.g.,
s ---> [np, vp].
or
optrel ---> [].
Terminals are introduced by dictionary unit clauses of the form word(w,PT).
Consider completing that reading before proceeding (look up the free book entry by title on the info page).
Now let's try writing a bottom up processor:
:- op(150, xfx, ---> ).

parse(Phrase) -->
    leaf(SubPhrase),
    lc(SubPhrase, Phrase).

leaf(Cat) --> [Word], {word(Word, Cat)}.
leaf(Phrase) --> {Phrase ---> []}.

lc(Phrase, Phrase) --> [].
lc(SubPhrase, SuperPhrase) -->
    {Phrase ---> [SubPhrase|Rest]},
    parse_rest(Rest),
    lc(Phrase, SuperPhrase).

parse_rest([]) --> [].
parse_rest([Phrase|Phrases]) -->
    parse(Phrase),
    parse_rest(Phrases).

% that's all! fairly easy, isn't it?
% here starts the grammar: replace it with your own, don't worry about left recursion
e(sum(L,R)) ---> [e(L), sum, e(R)].
e(num(N)) ---> [num(N)].

word(N, num(N)) :- integer(N).
word(+, sum).
which for instance yields:
phrase(parse(P), [1,+,3,+,1]).
P = e(sum(sum(num(1), num(3)), num(1)))
Note that the left-recursive grammar used is e ::= e + e | num.
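Plugging the token stream and operators from the question into the same interpreter is mostly a matter of adding word/2 dictionary clauses and ---> rules; for example (a sketch, using made-up category names plus_op, star_op and n/1):
e(add(L,R))      ---> [e(L), plus_op, e(R)].
e(multiply(L,R)) ---> [e(L), star_op, e(R)].
e(integer(N))    ---> [n(N)].

word(num(N), n(N)) :- integer(N).
word(plus, plus_op).
word(star, star_op).

?- phrase(parse(e(T)), [num(3), plus, num(2), star, num(1)]).
Since this grammar is just as ambiguous as e ::= e + e | num above, backtracking enumerates both associations, add(integer(3), multiply(integer(2), integer(1))) and multiply(add(integer(3), integer(2)), integer(1)); encoding precedence through an expression/term/factor split is still up to you.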
Before fixing your program, look at how you identified the problem! You assumed that a particular sentence will have exactly one syntax tree, but you got two of them. So essentially, Prolog helped you to find the bug!
This is a very useful debugging strategy in Prolog: Look at all the answers.
Next is the specific way you encoded the grammar. In fact, you did something quite smart: you essentially encoded a left-recursive grammar, and nevertheless your program terminates for a list of fixed length! That's because you indicate within each recursion that there has to be at least one element in the middle serving as the operator, so each recursion consumes at least one element. That is fine. However, this strategy is inherently very inefficient: for each application of the rule, it has to consider all possible partitions of the list.
Another disadvantage is that you can no longer generate a sentence from a syntax tree. That is, you cannot use your definition with:
?- parse(S, add(add(integer(1),integer(2)),integer(3))).
There are two reasons: the first is that the goals T = add(...,...) come too late; simply put them at the beginning, in front of the append/3 goals. But much more interesting is that now append/3 does not terminate. Here is the relevant failure-slice (see the link for more on this).
parse([num(X)], integer(X)) :- false.
parse(Tkns, T) :-
    (   T = add(T1,T2),
        append(E1, [plus|E2], Tkns), false,
        parse(E1, T1),
        parse(E2, T2),
    ;   false,
        T = multiply(T1,T2),
        append(E1, [star|E2], Tkns),
        parse(E1, T1),
        parse(E2, T2),
    ).
@DanielLyons already gave you the "traditional" solution, which requires all kinds of justification from formal languages. But I will stick to the grammar you encoded in your program, which, translated into DCGs, reads:
expr(integer(X)) --> [num(X)].
expr(add(L,R)) --> expr(L), [plus], expr(R).
expr(multiply(L,R)) --> expr(L), [star], expr(R).
When using this grammar with ?- phrase(expr(T),[num(1),plus,num(2),plus,num(3)]). it will not terminate. Here is the relevant slice:
expr(integer(X)) --> {false}, [num(X)].
expr(add(L,R)) --> expr(L), {false}, [plus], expr(R).
expr(multiply(L,R)) --> {false}, expr(L), [star], expr(R).
So it is this tiny part that has to be changed. Note that the rule "knows" that it wants one terminal symbol, alas, the terminal appears too late. If only it would occur in front of the recursion! But it does not.
There is a general way to fix this: add another pair of arguments to encode the length.
parse(T, L) :-
    phrase(expr(T, L, []), L).

expr(integer(X), [_|S], S) --> [num(X)].
expr(add(L,R), [_|S0], S) --> expr(L, S0, S1), [plus], expr(R, S1, S).
expr(multiply(L,R), [_|S0], S) --> expr(L, S0, S1), [star], expr(R, S1, S).
This is a very general method that is of particular interest if you have ambiguous grammars, or if you do not know whether or not your grammar is ambiguous. Simply let Prolog do the thinking for you!
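With the extra length argument, parsing the token list from the question now terminates after enumerating the (still ambiguous) parses, yielding something like:
?- parse(T, [num(3), plus, num(2), star, num(1)]).
T = add(integer(3), multiply(integer(2), integer(1))) ;
T = multiply(add(integer(3), integer(2)), integer(1)) ;
false.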
The correct approach is to use DCGs, but your example grammar is left-recursive, which won't work. Here's what would:
expression(T+E) --> term(T), [plus], expression(E).
expression(T-E) --> term(T), [minus], expression(E).
expression(T) --> term(T).
term(F*T) --> factor(F), [star], term(T).
term(F/T) --> factor(F), [div], term(T).
term(F) --> factor(F).
factor(N) --> num(N).
factor(E) --> ['('], expression(E), [')'].
num(N) --> [num(N)], { number(N) }.
The relationship between this and your sample grammar should be obvious, as should the transformation from left-recursive to right-recursive. I can't recall the details from my automata class about left-most derivations, but I think it only comes into play if the grammar is ambiguous, and I don't think this one is. Hopefully a genuine computer scientist will come along and clarify that point.
I see no point in producing an AST other than what Prolog would use. The code within parentheses on the left-hand side of each production is the AST-building code (e.g. the T+E in the first expression//1 rule). Adjust the code accordingly if this is undesirable.
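The question also mentions negate; one way to add it under this scheme is an extra factor//1 rule that reuses the minus token for unary minus (an assumption, since the question does not say how negation is tokenized):
factor(-E) --> [minus], factor(E).
Because the tree is built from Prolog's own operators, Result is T still evaluates such terms directly.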
From here, presenting your parse/2 API is quite trivial:
parse(L, T) :- phrase(expression(T), L).
Because we're using Prolog's own structures, the result will look a lot less impressive than it is:
?- parse([num(4), star, num(8), div, '(', num(3), plus, num(1), ')'], T).
T = 4* (8/ (3+1)) ;
false.
You can show a more AST-y output if you like using write_canonical/1:
?- parse([num(4), star, num(8), div, '(', num(3), plus, num(1), ')'], T),
write_canonical(T).
*(4,/(8,+(3,1)))
T = 4* (8/ (3+1)) a
The part *(4,/(8,+(3,1))) is the result of write_canonical/1. And you can evaluate that directly with is/2:
?- parse([num(4), star, num(8), div, '(', num(3), plus, num(1), ')'], T),
Result is T.
T = 4* (8/ (3+1)),
Result = 8 ;
false.
