Using Alex in Haskell to make a lexer that parses Dice Rolls - parsing

I'm making a parser for a DSL in Haskell using Alex + Happy.
My DSL uses dice rolls as part of the possible expressions.
Sometimes I have an expression that I want to parse that looks like:
[some code...] 3D6 [... rest of the code]
Which should translate roughly to:
TokenInt {... value = 3}, TokenD, TokenInt {... value = 6}
My DSL also uses variables (basically, Strings), so I have a special token that handle variable names.
So, with this tokens:
"D" { \pos str -> TokenD pos }
$alpha [$alpha $digit \_ \']* { \pos str -> TokenName pos str}
$digit+ { \pos str -> TokenInt pos (read str) }
The result I'm getting when using my parse now is:
TokenInt {... value = 3}, TokenName { ... , name = "D6"}
Which means that my lexer "reads" an Integer and a Variable named "D6".
I have tried many things, for example, i changed the token D to:
$digit "D" $digit { \pos str -> TokenD pos }
But that just consumes the digits :(
Can I parse the dice roll with the numbers?
Or at least parse TokenInt-TokenD-TokenInt?
PS: I'm using PosN as a wrapper, not sure if relevant.

The way I'd go about it would be to extend the TokenD type to TokenD Int Int so using the basic wrapper for convenience I would do
$digit+ D $digit+ { dice }
...
dice :: String -> Token
dice s = TokenD (read $ head ls) (read $ last ls)
where ls = split 'D' s
split can be found here.
This is an extra step that'd usually be done in during syntactic analysis but doesn't hurt much here.
Also I can't make Alex parse $alpha for TokenD instead of TokenName. If we had Di instead of D that'd be no problem. From Alex's docs:
When the input stream matches more than one rule, the rule which matches the longest prefix of the input stream wins. If there are still several rules which match an equal number of characters, then the rule which appears earliest in the file wins.
But then your code should work. I don't know if this is an issue with Alex.

I decided that I could survive with variables starting with lowercase letters (like Haskell variables), so I changed my lexer to parse variables only if they start with a lowercase letter.
That also solved some possible problems with some other reserved words.
I'm still curious to know if there were other solutions, but the problem in itself was solved.
Thank you all!

Related

Parsing simple markup language with Haskell

I'm trying to implement a very simple markup language. I have an intermediate representation that looks like:
data Token = Str Text
| Explode Text
type Rep = [Token]
So, the idea is to turn an arbitrary text of the form:
The quick brown %%fox%% %%jumps%% over the %%lazy%% dog.
into:
[Str "The quick brown", Explode "fox", Explode "jumps", Str "over the", Explode "lazy", Str "dog"]
for further processing. Also, it is important that we treat:
%%fox%% %%jumps%%
differently than
%%fox jumps%%
The latter should (Explode "fox jumps")
I tried to implement this using attoparsec, but I don't think I have the tools I need. But I'm not so good with parsing theory (I studied math, not CS). What kind of grammar is this? What kind of parser combinator library should I use? I considered using Parsec with a stateful monad transformer stack to keep track of the context. Does that sound sensible?
You can take the cheap and easy way, without a proper parser. The important thing to recognise is that this grammar is actually fairly simple – it has no recursion or such. It is just a flat listing of Strs and Explodes.
The easy way
So we can start by breaking the string down into a list containing the text and the separators as separate values. We need a data type to separate the separators (%%) from actual text (everything else.)
data ParserTokens = Sep | T Text
Breaking it down
Then we need to break the list into its constituents.
tokenise = intersperse Sep . map T . Text.splitOn "%%"
This will first split the string on %%, so in your example it'll become
["The quick brown ","fox"," ","jumps"," over the ","lazy"," dog."]
then we map T over it, to turn it from a [Text] to a [ParserTokens]. Finally, we intersperse Sep over it, to reintroduce the %% separators but in a shape that's easier to deal with. The result is, in your example,
[T "The quick brown ",Sep,T "fox",Sep,T " ",Sep,T "jumps",Sep,T " over the ",Sep,T "lazy",Sep,T " dog."]
Building it up
With this done, all that remains is parsing this thing into the shape you want it. Parsing this amounts to finding the 1-2-3 punch of Sep–T "something"–Sep and replacing it with Explode "something". We write a recursive function to do this.
construct [] = []
construct (T s : rest) = Str s : construct rest
construct (Sep : T s : Sep : rest) = Explode s : construct rest
construct _ = error "Mismatched '%%'!"
This converts T s to Str s and the combination of separators and a T s into an Explode s. If the pattern matching fails, it's because there were a stray separator somewhere, so I've just set it to crash the program. You might want better error handling there – such as wrapping the result in Either String or something similar.
With this done, we can create the function
parseTemplate = construct . tokenise
and in the end, if we run your example through parseTemplate, we get the expected result
[Str "The quick brown ",Explode "fox",Str " ",Explode "jumps",Str " over the ",Explode "lazy",Str " dog."]
For such simple parser even Attoparsec seems to be overkill:
parse = map (\w -> case w of
'%':'%':expl -> Explode $ init $ init expl
str -> Str str) . words
Of course, this code needs some sanity checks for Explode case.
This doesn't handle whitespace the way you specified, but it should get you on the right track.
parseMU = zipWith ($) (cycle [Str,Explode]) . splitps where
splitps :: String -> [String]
splitps [] = [[]]
splitps ('%':'%':r) = [] : splitps r
splitps (c:r) = let
(a:r') = splitps r
in ((c:a):r')

Can nested parentheticals be parsed in chemical formulae?

I am trying to create a parser for simple chemical formulae. Meaning, they have no states of matter, charge, or anything like that. The formulae only have strings representing compounds, quantities, and parentheses.
Following this answer to a similar question, and some rudimentary knowledge of discrete math, I hoped that I could write a simple Recursive Descent Parser to generate the number of each atom inside of the formula. I already have a really simple answer for this that involves single parentheses, but not nested parentheses.
Here are the productions of the grammar without parentheses:
Compound: Component { Component };
Component: Atom [Quantity]
Atom: 'H' | 'He' | 'Li' | 'Be' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'
[...] is read as optional, and will be an if test in the program (either it is there or missing)
| is alternatives, and so is an if .. else if .. else or switch 'test', it is saying the input must match one of these
{ ... } is read as repetition of 0 or more, and will be a while loop in the program
Characters between quotes are literal characters which will be in the string. All the other words are names of rules, and for a recursive descent parser, end up being the names of the functions which get called to chop up, and handle the input.
With nested parentheses, I have no idea what to do. By nested parentheses I mean something like (Fe2(OH)2(H2O)8)2, or something fictitious and complicated like (Ab(CD2(Ef(G2H)3)(IJ2)4)3)2
Because now there is a production that I don't really understand how to articulate, but here is my best attempt:
Parenthetical: Compound { Parenthetical } [Quantity]
So the basic rules parse any simple sequence of chemical symbols and quantities without parenthesis.
I assume the Quantity is defining the quantity of the whole chunk of stuff between '(' ... ')'
So, '(' ... ') [Quantity] needs to be parsed as exactly the same thing as the Component, i.e. as an alternative to: Atom [Quantity]
So the only thing to change is the Component rule; it becomes:
Component: Atom [Quantity] | '(' Compound ')' [Quantity]
In the code function (or procedure) which is parsing Component, it will have a look at the next character (token), and if it is an '(', it will consume it, then call the function (or procedure) responsible for parsing Compound, and after that, check the next character (token) is a ')' (if not, it's a syntax error), then handle the optional Quantity, and then it is finished.
I am assuming you are using a programming language which supports recursive function (or procedure) calls. That housekeeping, done by code behind the scenes for your program, will make this 'just work' (TM).
Alternatively, you could solve the problem in a different way. Add a new rule, which says:
Stuff: Atom | '(' Compound ')'
Then modify the rule:
Compound: Stuff [Quantity]
Then write a new function (or procedure) for Stuff, and change the Compound code to simply call Stuff, then handle the optional Quantity.
There are good technical reasons for doing this to support some parsing technology. However you're using recursive descent where it won't really matter.
Edit:
The type of grammar which works very well for a recursive decent parser is called LL(1), which means parse from left-to-right, and create the left-most derivation. That is a 'natural' way to parse when the code and function calls is the control flow. To find the theory of how to check grammars are LL(1) search the web for "parsing LL(1)" or "grammar follow sets".
It is pretty uncommon to see nested brackets in chemical formula. But maybe, for instance ammonium carbonate and barium nitrate in a 2:3 ratio could be written as "( (NH4)2 CO3)2 ( Ba(NO3)2 )3"
I found a right-to-left parser that pushes the multiplier onto a multiplier stack worked really well for me:
double multiplier[8];
double num = 1.0;
int multdepth = 0;
multiplier[0] = 1;
char molecule[1024]; // contains molecular formula
//parse the molecular formula right-to-left whilst keeping track of multiplier
for (int i = strlen(molecule) - 1; i >= 0; i--)
{
if (isdigit(molecule[i]) || molecule[i] == '.')
i = readnum(i, &num);
if (isalpha(molecule[i]))
{
i = parseatom(i, num * multiplier[multdepth]);
num = 1.0; // need to reset the multiplier here
}
if (molecule[i] == ')')
{
multdepth++;
multiplier[multdepth] = num * multiplier[multdepth - 1];
num = 1.0;
}
if (molecule[i] == '(')
{
multdepth--;
if (multdepth < 0)
error("Opening bracket not terminated");
}
}

How to retrieve value from optional parser in Parsec?

Sorry if it's a novice question - I want to parse something defined by
Exp ::= Mandatory_Part Optional_Part0 Optional_Part1
I thought I could do this:
proc::Parser String
proc = do {
;str<-parserMandatoryPart
;str0<-optional(parserOptionalPart0) --(1)
;str1<-optional(parserOptionalPart1) --(2)
;return str++str0++str1
}
I want to get str0/str1 if optional parts are present, otherwise, str0/str1 would be "".
But (1) and (2) won't work since optional() doesn't allow extracting result from its parameters, in this case, parserOptionalPart0/parserOptionalPart1.
Now What would be the proper way to do it?
Many thanks!
Billy R
The function you're looking for is optionMaybe. It returns Nothing if the parser failed, and returns the content in Just if it consumed input.
From the docs:
option x p tries to apply parser p. If p fails without consuming input, it returns the value x, otherwise the value returned by p.
So you could do:
proc :: Parser String
proc = do
str <- parserMandatoryPart
str0 <- option "" parserOptionalPart0
str1 <- option "" parserOptionalPart1
return (str++str0++str1)
Watch out for the "without consuming input" part. You may need to wrap either or both optional parsers with try.
I've also adjusted your code style to be more standard, and fixed an error on the last line. return isn't a keyword; it's an ordinary function. So return a ++ b is (return a) ++ b, i.e. almost never what you want.

String splitting problems in Erlang

I've been playing around with the splitting of atoms and have a problem with strings. The input data will always be an atom that consists of some letters and then some numbers, for instance ms444, r64 or min1. Since the function lists:splitwith/2 takes a list the atom is first converted into a list:
24> lists:splitwith(fun (C) -> is_atom(C) end, [m,s,4,4,4]).
{[m,s],[4,4,4]}
25> lists:splitwith(fun (C) -> is_atom(C) end, atom_to_list(ms444)).
{[],"ms444"}
26> atom_to_list(ms444).
"ms444"
I want to separate the letters from the numbers and I've succeeded in doing that when using a list, but since I start out with an atom I get a "string" as result to put into my splitwith function...
Is it interpreting each item in the list as a string or what is going on?
You might want to have a look at the string module documentation:
http://www.erlang.org/doc/man/string.html
The following function might interest you:
tokens(String, SeparatorList) -> Tokens
Since strings in Erlang are just a list() of integer() the test in the fun will be made if the item is an atom() when it is in fact an integer(). If the test is changed to look for letters it works:
29> lists:splitwith(fun (C) -> (C >= $a) and (C =< $Z) end, atom_to_list(ms444)).
{"ms","444"}
An atom in erlang is a named constant and not a variable (or not like a variable is in an imperative language).
You should really not create atoms in dynamic fashion (that is, don't convert things to atoms at runtime)
They are used more in pattern matching and send recive code.
Pid ! {matchthis, X}
recive
{foobar,Y} -> doY(Y);
{matchthis,X} -> doX(X);
Other -> doother(Other)
end
A variable, like X could be set to an atom. For example X=if 1==1 -> ok; true -> fail end. I could suffer from poor imagination but I can't think of a way why you would like to parse atom. You should be in charge of what atoms you write and not use list_to_atom(CharIntegerList).
Can you perhaps give a more overview of what you like to accomplish?
A "string" in Erlang is not a primitive type: it is just a list() of integers(). So if you want to "separate" the letters from the digits, you'll have to do comparison with the integer representation of the characters.

Gold Parsing System - What can it be used for in programming?

I have read the GOLD Homepage ( http://www.devincook.com/goldparser/ ) docs, FAQ and Wikipedia to find out what practical application there could possibly be for GOLD. I was thinking along the lines of having a programming language (easily) available to my systems such as ABAP on SAP or X++ on Axapta - but it doesn't look feasible to me, at least not easily - even if you use GOLD.
The final use of the parsed result produced by GOLD escapes me - what do you do with the result of the parse?
EDIT: A practical example (description) would be great.
Parsing really consists of two phases. The first is "lexing", which convert the raw strings of character in to something that the program can more readily understand (commonly called tokens).
Simple example, lex would convert:
if (a + b > 2) then
In to:
IF_TOKEN LEFT_PAREN IDENTIFIER(a) PLUS_SIGN IDENTIFIER(b) GREATER_THAN NUMBER(2) RIGHT_PAREN THEN_TOKEN
The parse takes that stream of tokens, and attempts to make yet more sense out of them. In this case, it would try and match up those tokens to an IF_STATEMENT. To the parse, the IF _STATEMENT may well look like this:
IF ( BOOLEAN_EXPRESSION ) THEN
Where the result of the lexing phase is a token stream, the result of the parsing phase is a Parse Tree.
So, a parser could convert the above in to:
if_statement
|
v
boolean_expression.operator = GREATER_THAN
| |
| v
V numeric_constant.string="2"
expression.operator = PLUS_SIGN
| |
| v
v identifier.string = "b"
identifier.string = "a"
Here you see we have an IF_STATEMENT. An IF_STATEMENT has a single argument, which is a BOOLEAN_EXPRESSION. This was explained in some manner to the parser. When the parser is converting the token stream, it "knows" what a IF looks like, and know what a BOOLEAN_EXPRESSION looks like, so it can make the proper assignments when it sees the code.
For example, if you have just:
if (a + b) then
The parser could know that it's not a boolean expression (because the + is arithmetic, not a boolean operator) and the parse could throw an error at this point.
Next, we see that a BOOLEAN_EXPRESSION has 3 components, the operator (GREATER_THAN), and two sides, the left side and the right side.
On the left side, it points to yet another expression, the "a + b", while on the right is points to a NUMERIC_CONSTANT, in this case the string "2". Again, the parser "knows" this is a NUMERIC constant because we told it about strings of numbers. If it wasn't numbers, it would be an IDENTIFIER (like "a" and "b" are).
Note, that if we had something like:
if (a + b > "XYZ") then
That "parses" just fine (expression on the left, string constant on the right). We don't know from looking at this whether this is a valid expression or not. We don't know if "a" or "b" reference Strings or Numbers at this point. So, this is something the parser can't decided for us, can't flag as an error, as it simply doesn't know. That will happen when we evaluate (either execute or try to compile in to code) the IF statement.
If we did:
if [a > b ) then
The parser can readily see that syntax error as a problem, and will throw an error. That string of tokens doesn't look like anything it knows about.
So, the point being that when you get a complete parse tree, you have some assurance that at first cut the "code looks good". Now during execution, other errors may well come up.
To evaluate the parse tree, you just walk the tree. You'll have some code associated with the major nodes of the parse tree during the compile or evaluation part. Let's assuming that we have an interpreter.
public void execute_if_statment(ParseTreeNode node) {
// We already know we have a IF_STATEMENT node
Value value = evaluate_expression(node.getBooleanExpression());
if (value.getBooleanResult() == true) {
// we do the "then" part of the code
}
}
public Value evaluate_expression(ParseTreeNode node) {
Value result = null;
if (node.isConstant()) {
result = evaluate_constant(node);
return result;
}
if (node.isIdentifier()) {
result = lookupIdentifier(node);
return result;
}
Value leftSide = evaluate_expression(node.getLeftSide());
Value rightSide = evaluate_expression(node.getRightSide());
if (node.getOperator() == '+') {
if (!leftSide.isNumber() || !rightSide.isNumber()) {
throw new RuntimeError("Must have numbers for adding");
}
int l = leftSide.getIntValue();
int r = rightSide.getIntValue();
int sum = l + r;
return new Value(sum);
}
if (node.getOperator() == '>') {
if (leftSide.getType() != rightSide.getType()) {
throw new RuntimeError("You can only compare values of the same type");
}
if (leftSide.isNumber()) {
int l = leftSide.getIntValue();
int r = rightSide.getIntValue();
boolean greater = l > r;
return new Value(greater);
} else {
// do string compare instead
}
}
}
So, you can see that we have a recursive evaluator here. You see how we're checking the run time types, and performing the basic evaluations.
What will happen is the execute_if_statement will evaluate it's main expression. Even tho we wanted only BOOLEAN_EXPRESION in the parse, all expressions are mostly the same for our purposes. So, execute_if_statement calls evaluate_expression.
In our system, all expressions have an operator and a left and right side. Each side of an expression is ALSO an expression, so you can see how we immediately try and evaluate those as well to get their real value. The one note is that if the expression consists of a CONSTANT, then we simply return the constants value, if it's an identifier, we look it up as a variable (and that would be a good place to throw a "I can't find the variable 'a'" message), otherwise we're back to the left side/right side thing.
I hope you can see how a simple evaluator can work once you have a token stream from a parser. Note how during evaluation, the major elements of the language are in place, otherwise we'd have got a syntax error and never got to this phase. We can simply expect to "know" that when we have a, for example, PLUS operator, we're going to have 2 expressions, the left and right side. Or when we execute an IF statement, that we already have a boolean expression to evaluate. The parse is what does that heavy lifting for us.
Getting started with a new language can be a challenge, but you'll find once you get rolling, the rest become pretty straightforward and it's almost "magic" that it all works in the end.
Note, pardon the formatting, but underscores are messing things up -- I hope it's still clear.
I would recommend antlr.org for information and the 'free' tool I would use for any parser use.
GOLD can be used for any kind of application where you have to apply context-free grammars to input.
elaboration:
Essentially, CFGs apply to all programming languages. So if you wanted to develop a scripting language for your company, you'd need to write a parser- or get a parsing program. Alternatively, if you wanted to have a semi-natural language for input for non-programmers in the company, you could use a parser to read that input and spit out more "machine-readable" data. Essentially, a context-free grammar allows you to describe far more inputs than a regular expression. The GOLD system apparently makes the parsing problem somewhat easier than lex/yacc(the UNIX standard programs for parsing).

Resources