I'm curious about Prolog as a parser, so I'm making a little Lisp front-end. I have already made a tokenizer, which you can see here:
base_tokenize([], Buffer, [Buffer]).
base_tokenize([Char | Chars], Buffer, Tokens) :-
(Char = '(' ; Char = ')') ->
base_tokenize(Chars, '', Tail_Tokens),
Tokens = [Buffer, Char | Tail_Tokens];
Char = ' ' ->
base_tokenize(Chars, '', Tail_Tokens),
Tokens = [Buffer | Tail_Tokens];
atom_concat(Buffer, Char, New_Buffer),
base_tokenize(Chars, New_Buffer, Tokens).
filter_empty_blank([], []).
filter_empty_blank([Head | Tail], Result) :-
filter_empty_blank(Tail, Tail_Result),
((Head = [] ; Head = '') ->
Result = Tail_Result;
Result = [Head | Tail_Result]).
tokenize(Expr, Tokens) :-
atom_chars(Expr, Chars),
base_tokenize(Chars, '', Dirty_Tokens),
filter_empty_blank(Dirty_Tokens, Tokens).
I now have a new challenge: construct a parse tree from this. First, I tried making one without a grammar, but that turned out really messy. So I'm using DCGs. Wikipedia's page on it is not very clear - especially the portion Parsing with DCGs. Maybe someone can give me a clearer idea of how I would construct a tree? I was very happy to know that Prolog's lists are untyped, so it's a bit easier now that no sum types are needed. I'm just really confused about inputs to grammar clauses like sentence(s(NP,VP)) or verb(v(eats)) (on the Wiki), why the arguments have such abstruse names, and how I can get started with my parser without too much hassle.
expr --> [foo].
expr --> list.
seq --> expr, seq.
seq --> expr.
list --> ['('], seq, [')'].
Here is a beginning: parsing a LISP list of atoms, which at first is just an unstructured list of tokens:
List = [ '(', '(', foo, bar, ')', baz, ')' ].
First, just accept it.
Write down the grammar directly:
so_list --> ['('], so_list_content, [')'].
so_list_content --> [].
so_list_content --> so_atom, so_list_content.
so_list_content --> so_list, so_list_content.
so_atom --> [X], { \+ member(X,['(',')']),atom(X) }.
Add some test cases (is there plunit in GNU Prolog?)
:- begin_tests(accept_list).
test(1,[fail]) :- phrase(so_list,[]).
test(2,[true,nondet]) :- phrase(so_list,['(',')']).
test(3,[true,nondet]) :- phrase(so_list,['(',foo,')']).
test(4,[true,nondet]) :- phrase(so_list,['(',foo,'(',bar,')',')']).
test(5,[true,nondet]) :- phrase(so_list,['(','(',bar,')',foo,')']).
test(6,[fail]) :- phrase(so_list,['(',foo,'(',bar,')']).
:- end_tests(accept_list).
And so:
?- run_tests.
% PL-Unit: accept_list ...... done
% All 6 tests passed
true.
Cool. Looks like we can accept lists-of-tokens.
Now build a parse tree. This is done by growing a Prolog term through parameters of the "DCG predicates". The term (or multiple terms) in the head collect the terms (or multiple terms) appearing in the body into a larger structure, quite naturally. Once the terminal tokens are reached, the structure starts to fill up with actual content:
so_list(list(Stuff)) --> ['('], so_list_content(Stuff), [')'].
so_list_content([]) --> [].
so_list_content([A|Stuff]) --> so_atom(A), so_list_content(Stuff).
so_list_content([L|Stuff]) --> so_list(L), so_list_content(Stuff).
so_atom(X) --> [X], { \+ member(X,['(',')']),atom(X) }.
Yup, tests (I moved the expected Result out of the test head because it creates too much visual noise):
:- begin_tests(parse_list).
test(1,[fail]) :-
phrase(so_list(_),[]).
test(2,[true(L==Result),nondet]) :-
phrase(so_list(L),['(',')']),
Result = list([]).
test(3,[true(L==Result),nondet]) :-
phrase(so_list(L),['(',foo,')']),
Result = list([foo]).
test(4,[true(L==Result),nondet]) :-
phrase(so_list(L),['(',foo,'(',bar,')',')']),
Result = list([foo,list([bar])]).
test(5,[true(L==Result),nondet]) :-
phrase(so_list(L),['(','(',bar,')',foo,')']),
Result = list([list([bar]),foo]).
test(6,[fail]) :-
phrase(so_list(_),['(',foo,'(',bar,')']).
:- end_tests(parse_list).
And so:
?- run_tests.
% PL-Unit: parse_list ...... done
% All 6 tests passed
true.
I'm making a parser for a DSL in Haskell using Alex + Happy.
My DSL uses dice rolls as part of the possible expressions.
Sometimes I have an expression that I want to parse that looks like:
[some code...] 3D6 [... rest of the code]
Which should translate roughly to:
TokenInt {... value = 3}, TokenD, TokenInt {... value = 6}
My DSL also uses variables (basically, Strings), so I have a special token that handles variable names.
So, with these tokens:
"D" { \pos str -> TokenD pos }
$alpha [$alpha $digit \_ \']* { \pos str -> TokenName pos str}
$digit+ { \pos str -> TokenInt pos (read str) }
The result I'm getting when using my parser now is:
TokenInt {... value = 3}, TokenName { ... , name = "D6"}
Which means that my lexer "reads" an Integer and a Variable named "D6".
I have tried many things; for example, I changed the token D to:
$digit "D" $digit { \pos str -> TokenD pos }
But that just consumes the digits :(
Can I parse the dice roll with the numbers?
Or at least parse TokenInt-TokenD-TokenInt?
PS: I'm using PosN as a wrapper, not sure if relevant.
The way I'd go about it would be to extend TokenD to TokenD Int Int. Then, using the basic wrapper for convenience, I would do:
$digit+ D $digit+ { dice }
...
dice :: String -> Token
dice s = TokenD (read $ head ls) (read $ last ls)
where ls = split 'D' s
split can be found here.
This is an extra step that'd usually be done during syntactic analysis, but it doesn't hurt much here.
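For reference, here is a minimal split with the behaviour the dice helper relies on (splitting on a single delimiter character); the implementation behind the link above may differ in details:

-- Splits a list on every occurrence of the delimiter, so that
-- split 'D' "3D6" == ["3","6"].
split :: Eq a => a -> [a] -> [[a]]
split d = foldr step [[]]
  where
    step c acc@(cur : rest)
      | c == d    = [] : acc
      | otherwise = (c : cur) : rest
    step _ []     = [[]]  -- never reached: the accumulator always holds one chunk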
Also I can't make Alex parse $alpha for TokenD instead of TokenName. If we had Di instead of D that'd be no problem. From Alex's docs:
When the input stream matches more than one rule, the rule which matches the longest prefix of the input stream wins. If there are still several rules which match an equal number of characters, then the rule which appears earliest in the file wins.
But then your code should work. I don't know if this is an issue with Alex.
I decided that I could survive with variables starting with lowercase letters (like Haskell variables), so I changed my lexer to parse variables only if they start with a lowercase letter.
That also solved some possible problems with some other reserved words.
I'm still curious to know if there were other solutions, but the problem in itself was solved.
Thank you all!
I have a grammar describing an assembler dialect. In the code section, the programmer can refer to registers from a certain list and to defined variables. I also have a rule matching both [reg0++413] and [myVariable++413]:
BinaryBiasInsideFetchOperation:
'['
v = (Register|[IntegerVariableDeclaration]) ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
But when I try to compile it, Xtext throws a warning:
Decision can match input such as "'[' '++' 'reg0' ']'" using multiple alternatives: 2, 3. As a result, alternative(s) 3 were disabled for that input
Splitting the rules, I've noticed that
BinaryBiasInsideFetchOperation:
'['
v = Register ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
BinaryBiasInsideFetchOperation:
'['
v = [IntegerVariableDeclaration] ( gbo = GetBiasOperation val = (Register|IntValue|HexValue) )?
']'
;
work well separately, but not at the same time. When I try to compile both of them, Xtext reports a number of errors saying that registers from the list could be processed ambiguously. So:
1) Am I right that the part of the rule v = (Register|[IntegerVariableDeclaration]) matches any IntegerVariable name, including an empty one, while the rule v = [IntegerVariableDeclaration] matches only nonempty names?
2) Is it correct that, when I try to compile the separate rules together, Xtext thinks that [IntegerVariableDeclaration] can overlap with Register?
3) How to resolve this ambiguity?
edit: the definitions
Register:
areg = ('reg0' | 'reg1' | 'reg2' | 'reg3' | 'reg4' | 'reg5' | 'reg6' | 'reg7' )
;
IntegerVariableDeclaration:
section = SectionServiceWord? name=ID ':' type = IntegerType ('[' size = IntValue ']')? ( value = IntegerVariableDefinition )? ';'
;
ID is a standard terminal which parses a single word, i.e. an identifier.
No, (Register|[IntegerVariableDeclaration]) can't match empty input. Actually, [IntegerVariableDeclaration] is the same as [IntegerVariableDeclaration|ID]: it matches the ID rule.
Yes, I think you can't split your rules.
I can't reproduce your problem (I'd need the full grammar), but in order to solve it you should look at this article about Xtext grammar debugging:
Compile the grammar in debug mode by adding the following line to your workflow.mwe2:
fragment = org.eclipse.xtext.generator.parser.antlr.DebugAntlrGeneratorFragment {}
Open the generated ANTLR debug grammar with ANTLRWorks and check the diagram.
In addition to Fabien's answer, I'd like to add that a catch-all rule like
AnyId:
name = ID
;
instead of
(Register|[IntegerVariableDeclaration])
solves the problem. One then needs to dynamically check whether AnyId.name is a Register, a Variable, or something else like a Constant.
I am trying to create a parser for simple chemical formulae. Meaning, they have no states of matter, charge, or anything like that. The formulae only have strings representing compounds, quantities, and parentheses.
Following this answer to a similar question, and some rudimentary knowledge of discrete math, I hoped that I could write a simple Recursive Descent Parser to generate the number of each atom inside of the formula. I already have a really simple answer for this that involves single parentheses, but not nested parentheses.
Here are the productions of the grammar without parentheses:
Compound: Component { Component };
Component: Atom [Quantity]
Atom: 'H' | 'He' | 'Li' | 'Be' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'
[...] is read as optional, and will be an if test in the program (either it is there or missing)
| is alternatives, and becomes an if .. else if .. else or a switch 'test'; it says the input must match one of these
{ ... } is read as repetition of 0 or more, and will be a while loop in the program
Characters between quotes are literal characters which will be in the string. All the other words are names of rules, and for a recursive descent parser, end up being the names of the functions which get called to chop up, and handle the input.
With nested parentheses, I have no idea what to do. By nested parentheses I mean something like (Fe2(OH)2(H2O)8)2, or something fictitious and complicated like (Ab(CD2(Ef(G2H)3)(IJ2)4)3)2
This is because there is now a production that I don't really know how to articulate, but here is my best attempt:
Parenthetical: Compound { Parenthetical } [Quantity]
So the basic rules parse any simple sequence of chemical symbols and quantities without parentheses.
I assume the Quantity is defining the quantity of the whole chunk of stuff between '(' ... ')'
So, '(' ... ')' [Quantity] needs to be parsed as exactly the same thing as the Component, i.e. as an alternative to: Atom [Quantity]
So the only thing to change is the Component rule; it becomes:
Component: Atom [Quantity] | '(' Compound ')' [Quantity]
In the function (or procedure) which parses Component, it will look at the next character (token); if it is an '(', it will consume it, then call the function (or procedure) responsible for parsing Compound, and after that check that the next character (token) is a ')' (if not, it's a syntax error), then handle the optional Quantity, and then it is finished.
I am assuming you are using a programming language which supports recursive function (or procedure) calls. That housekeeping, done by code behind the scenes for your program, will make this 'just work' (TM).
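If it helps to see that control flow written down, here is a small sketch in Haskell (the question doesn't name a language, so the function names, the plain-String plumbing and the map of counts are all made up here; for brevity an Atom is taken to be an uppercase letter followed by lowercase letters, rather than the explicit list from the grammar):

import qualified Data.Map as M
import Data.Char (isDigit, isLower, isUpper)

type Counts = M.Map String Int

-- Compound: Component { Component }
compound :: String -> Maybe (Counts, String)
compound s = do
  (c, rest) <- component s
  case compound rest of
    Just (cs, rest') -> Just (M.unionWith (+) c cs, rest')
    Nothing          -> Just (c, rest)

-- Component: Atom [Quantity] | '(' Compound ')' [Quantity]
component :: String -> Maybe (Counts, String)
component ('(' : rest) = do
  (inner, rest') <- compound rest
  case rest' of
    ')' : rest'' -> let (q, rest''') = quantity rest''
                    in Just (M.map (* q) inner, rest''')
    _            -> Nothing          -- missing ')' is a syntax error
component s = do
  (a, rest) <- atom s
  let (q, rest') = quantity rest
  Just (M.singleton a q, rest')

-- Atom (simplified): one uppercase letter plus trailing lowercase letters
atom :: String -> Maybe (String, String)
atom (c : cs) | isUpper c = let (ls, rest) = span isLower cs in Just (c : ls, rest)
atom _ = Nothing

-- Quantity: Digit { Digit }, defaulting to 1 when absent
quantity :: String -> (Int, String)
quantity s = case span isDigit s of
  ("", rest) -> (1, rest)
  (ds, rest) -> (read ds, rest)

For example, compound "Fe2(OH)2" evaluates to Just (fromList [("Fe",2),("H",2),("O",2)], "").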
Alternatively, you could solve the problem in a different way. Add a new rule, which says:
Stuff: Atom | '(' Compound ')'
Then modify the rule:
Compound: Stuff [Quantity]
Then write a new function (or procedure) for Stuff, and change the Compound code to simply call Stuff, then handle the optional Quantity.
There are good technical reasons for doing this to support some parsing technologies. However, you're using recursive descent, where it won't really matter.
Edit:
The type of grammar which works very well for a recursive descent parser is called LL(1), which means parsing from left to right, creating the left-most derivation, with one token of lookahead. That is a 'natural' way to parse when the code and function calls are the control flow. To find the theory of how to check that grammars are LL(1), search the web for "parsing LL(1)" or "grammar follow sets".
It is pretty uncommon to see nested brackets in chemical formulae. But maybe, for instance, ammonium carbonate and barium nitrate in a 2:3 ratio could be written as "( (NH4)2 CO3)2 ( Ba(NO3)2 )3"
I found a right-to-left parser that pushes the multiplier onto a multiplier stack worked really well for me:
/* Fragment; assumes <ctype.h> and <string.h> are included, and that
   readnum, parseatom and error are helpers defined elsewhere. */
double multiplier[8];            // stack of bracket multipliers (max nesting depth 8)
double num = 1.0;                // most recently read count
int multdepth = 0;               // current nesting depth
multiplier[0] = 1;
char molecule[1024]; // contains molecular formula

//parse the molecular formula right-to-left whilst keeping track of multiplier
for (int i = strlen(molecule) - 1; i >= 0; i--)
{
    if (isdigit(molecule[i]) || molecule[i] == '.')
        i = readnum(i, &num);    // read the number ending at i; presumably returns the index just before it
    if (isalpha(molecule[i]))
    {
        // credit this atom with the count scaled by the current bracket multiplier
        i = parseatom(i, num * multiplier[multdepth]);
        num = 1.0; // need to reset the multiplier here
    }
    if (molecule[i] == ')')      // scanning backwards, ')' opens a nesting level
    {
        multdepth++;
        if (multdepth > 7)
            error("Brackets nested too deeply");
        multiplier[multdepth] = num * multiplier[multdepth - 1];
        num = 1.0;
    }
    if (molecule[i] == '(')      // ... and '(' closes it again
    {
        multdepth--;
        if (multdepth < 0)
            error("Opening bracket not terminated");
    }
}
Sorry if it's a novice question - I want to parse something defined by
Exp ::= Mandatory_Part Optional_Part0 Optional_Part1
I thought I could do this:
proc::Parser String
proc = do {
;str<-parserMandatoryPart
;str0<-optional(parserOptionalPart0) --(1)
;str1<-optional(parserOptionalPart1) --(2)
;return str++str0++str1
}
I want to get str0/str1 if the optional parts are present; otherwise, str0/str1 should be "".
But (1) and (2) won't work, since optional discards the result of the parser it is given (here parserOptionalPart0/parserOptionalPart1) and just returns ().
Now, what would be the proper way to do it?
Many thanks!
Billy R
The function you're looking for is optionMaybe. It returns Nothing if the parser fails, and returns the result wrapped in Just if it succeeds. Since you just want "" as the fallback here, though, plain option is even simpler.
From the docs:
option x p tries to apply parser p. If p fails without consuming input, it returns the value x, otherwise the value returned by p.
So you could do:
proc :: Parser String
proc = do
str <- parserMandatoryPart
str0 <- option "" parserOptionalPart0
str1 <- option "" parserOptionalPart1
return (str++str0++str1)
Watch out for the "without consuming input" part. You may need to wrap either or both optional parsers with try.
I've also adjusted your code style to be more standard, and fixed an error on the last line. return isn't a keyword; it's an ordinary function. So return a ++ b is (return a) ++ b, i.e. almost never what you want.
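To make the optionMaybe route concrete, here is a self-contained sketch; the three sub-parsers are placeholders made up for the example, only their String result type matters:

import Text.Parsec
import Text.Parsec.String (Parser)
import Data.Maybe (fromMaybe)

-- Stand-ins for the real sub-parsers.
parserMandatoryPart, parserOptionalPart0, parserOptionalPart1 :: Parser String
parserMandatoryPart = string "base"
parserOptionalPart0 = string "-opt0"
parserOptionalPart1 = string "-opt1"

-- Same idea as proc above, written with optionMaybe, and with try so that an
-- optional parser failing part-way through backtracks instead of aborting.
proc' :: Parser String
proc' = do
  str  <- parserMandatoryPart
  str0 <- fromMaybe "" <$> optionMaybe (try parserOptionalPart0)
  str1 <- fromMaybe "" <$> optionMaybe (try parserOptionalPart1)
  return (str ++ str0 ++ str1)

Here parse proc' "" "base-opt1" gives Right "base-opt1"; without try, parserOptionalPart0 would fail after consuming "-opt" and the whole parse would fail instead of falling back to "".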