I have read the GOLD Homepage ( http://www.devincook.com/goldparser/ ) docs, FAQ and Wikipedia to find out what practical application there could possibly be for GOLD. I was thinking along the lines of having a programming language (easily) available to my systems such as ABAP on SAP or X++ on Axapta - but it doesn't look feasible to me, at least not easily - even if you use GOLD.
The final use of the parsed result produced by GOLD escapes me - what do you do with the result of the parse?
EDIT: A practical example (description) would be great.
Parsing really consists of two phases. The first is "lexing", which convert the raw strings of character in to something that the program can more readily understand (commonly called tokens).
Simple example, lex would convert:
if (a + b > 2) then
In to:
IF_TOKEN LEFT_PAREN IDENTIFIER(a) PLUS_SIGN IDENTIFIER(b) GREATER_THAN NUMBER(2) RIGHT_PAREN THEN_TOKEN
The parse takes that stream of tokens, and attempts to make yet more sense out of them. In this case, it would try and match up those tokens to an IF_STATEMENT. To the parse, the IF _STATEMENT may well look like this:
IF ( BOOLEAN_EXPRESSION ) THEN
Where the result of the lexing phase is a token stream, the result of the parsing phase is a Parse Tree.
So, a parser could convert the above in to:
if_statement
|
v
boolean_expression.operator = GREATER_THAN
| |
| v
V numeric_constant.string="2"
expression.operator = PLUS_SIGN
| |
| v
v identifier.string = "b"
identifier.string = "a"
Here you see we have an IF_STATEMENT. An IF_STATEMENT has a single argument, which is a BOOLEAN_EXPRESSION. This was explained in some manner to the parser. When the parser is converting the token stream, it "knows" what a IF looks like, and know what a BOOLEAN_EXPRESSION looks like, so it can make the proper assignments when it sees the code.
For example, if you have just:
if (a + b) then
The parser could know that it's not a boolean expression (because the + is arithmetic, not a boolean operator) and the parse could throw an error at this point.
Next, we see that a BOOLEAN_EXPRESSION has 3 components, the operator (GREATER_THAN), and two sides, the left side and the right side.
On the left side, it points to yet another expression, the "a + b", while on the right is points to a NUMERIC_CONSTANT, in this case the string "2". Again, the parser "knows" this is a NUMERIC constant because we told it about strings of numbers. If it wasn't numbers, it would be an IDENTIFIER (like "a" and "b" are).
Note, that if we had something like:
if (a + b > "XYZ") then
That "parses" just fine (expression on the left, string constant on the right). We don't know from looking at this whether this is a valid expression or not. We don't know if "a" or "b" reference Strings or Numbers at this point. So, this is something the parser can't decided for us, can't flag as an error, as it simply doesn't know. That will happen when we evaluate (either execute or try to compile in to code) the IF statement.
If we did:
if [a > b ) then
The parser can readily see that syntax error as a problem, and will throw an error. That string of tokens doesn't look like anything it knows about.
So, the point being that when you get a complete parse tree, you have some assurance that at first cut the "code looks good". Now during execution, other errors may well come up.
To evaluate the parse tree, you just walk the tree. You'll have some code associated with the major nodes of the parse tree during the compile or evaluation part. Let's assuming that we have an interpreter.
public void execute_if_statment(ParseTreeNode node) {
// We already know we have a IF_STATEMENT node
Value value = evaluate_expression(node.getBooleanExpression());
if (value.getBooleanResult() == true) {
// we do the "then" part of the code
}
}
public Value evaluate_expression(ParseTreeNode node) {
Value result = null;
if (node.isConstant()) {
result = evaluate_constant(node);
return result;
}
if (node.isIdentifier()) {
result = lookupIdentifier(node);
return result;
}
Value leftSide = evaluate_expression(node.getLeftSide());
Value rightSide = evaluate_expression(node.getRightSide());
if (node.getOperator() == '+') {
if (!leftSide.isNumber() || !rightSide.isNumber()) {
throw new RuntimeError("Must have numbers for adding");
}
int l = leftSide.getIntValue();
int r = rightSide.getIntValue();
int sum = l + r;
return new Value(sum);
}
if (node.getOperator() == '>') {
if (leftSide.getType() != rightSide.getType()) {
throw new RuntimeError("You can only compare values of the same type");
}
if (leftSide.isNumber()) {
int l = leftSide.getIntValue();
int r = rightSide.getIntValue();
boolean greater = l > r;
return new Value(greater);
} else {
// do string compare instead
}
}
}
So, you can see that we have a recursive evaluator here. You see how we're checking the run time types, and performing the basic evaluations.
What will happen is the execute_if_statement will evaluate it's main expression. Even tho we wanted only BOOLEAN_EXPRESION in the parse, all expressions are mostly the same for our purposes. So, execute_if_statement calls evaluate_expression.
In our system, all expressions have an operator and a left and right side. Each side of an expression is ALSO an expression, so you can see how we immediately try and evaluate those as well to get their real value. The one note is that if the expression consists of a CONSTANT, then we simply return the constants value, if it's an identifier, we look it up as a variable (and that would be a good place to throw a "I can't find the variable 'a'" message), otherwise we're back to the left side/right side thing.
I hope you can see how a simple evaluator can work once you have a token stream from a parser. Note how during evaluation, the major elements of the language are in place, otherwise we'd have got a syntax error and never got to this phase. We can simply expect to "know" that when we have a, for example, PLUS operator, we're going to have 2 expressions, the left and right side. Or when we execute an IF statement, that we already have a boolean expression to evaluate. The parse is what does that heavy lifting for us.
Getting started with a new language can be a challenge, but you'll find once you get rolling, the rest become pretty straightforward and it's almost "magic" that it all works in the end.
Note, pardon the formatting, but underscores are messing things up -- I hope it's still clear.
I would recommend antlr.org for information and the 'free' tool I would use for any parser use.
GOLD can be used for any kind of application where you have to apply context-free grammars to input.
elaboration:
Essentially, CFGs apply to all programming languages. So if you wanted to develop a scripting language for your company, you'd need to write a parser- or get a parsing program. Alternatively, if you wanted to have a semi-natural language for input for non-programmers in the company, you could use a parser to read that input and spit out more "machine-readable" data. Essentially, a context-free grammar allows you to describe far more inputs than a regular expression. The GOLD system apparently makes the parsing problem somewhat easier than lex/yacc(the UNIX standard programs for parsing).
Related
Here's my grammar to the for statements:
FOR x>0 {
//somthing
}
// or
FOR x = 0; x > 0; x++ {
//somthing
}
it has the same prefix FOR, and I'd want to print the for_begin label after InitExpression,
however the codes right after FOR will become useless because of confliction.
ForStmt
: FOR {
printf("for_begin_%d:\n", n);
} Expression {
printf("ifeq for_exit_%d\n", n);
} ForBlock
| FOR ForClause ForBlock
;
ForClause
: InitExpression ';' {
printf("for_begin_%d:\n", n);
} Expression ';' Expression { printf("ifeq for_exit_%d\n", n); }
;
I had tried to change it to something like:
ForStart
: FOR
| FOR InitExpression
;
or use a flag to mention where to print the for_begin label,
but also fail to resolve the conflict.
How to make it not conflict?
How can the parser know which alternative of the FOR statement it sees?
While it's possible that an InitExpression has identifiable form, such as an assignment statement, which could not be used in a conditional expression. That strikes me as too restrictive for practical purposes -- there are many things you might do to initialise a loop other than a direct assignment -- but leaving that aside, it means that the earliest the InitExpression can be definitively identified is when the assignment operator is seen. If lvalues in your language can only be simple identifiers, that would make it the second lookahead token after the FOR, but in most useful language lvalues can be much more complicated than just simple identifiers, and so it's likely that the InitExpression cannot be definitively identified with finite lookahead.
But it's more likely that the only significant difference between the two forms is that the expression in the first form is followed by a block (which I suppose cannot start with a semicolon) and the first expression in the second form is followed by a semicolon. So the parser knows what it is parsing at the end of the first expression and no earlier.
Normally, that would not cause a problem. Were it not for the MidRule Action which inserts a label, the parser does not have to make a reduction decision until it reaches the end of the first expression, at which point it needs to decide whether to reduce the first expression as an InitExpression or an Expression. But at that point, the lookahead token as either a semicolon or the first token of a block, so the lookahead token can guide the decision.
But the Mid-Rule Action makes that impossible. The Mid-Rule Action must either be reduced or not before shifting the token which immediately follows the FOR token, and -- as your examples show -- the lookahead token could be the same (i) in both cases.
Fundamentally, the issue is that you want to build a one-pass compiler rather than just parsing the input into an AST and then walking the AST to generate assembler code (possibly after doing some other traverses over the AST in order to perform other analyses and allow for code optimisation). The one-pass code generator depends on Mid-Rule Actions, and Mid-Rule Actions in turn can easily generate unresolvable parsing conflicts. This issue is so notorious that there is a chapter in the bison manual dedicated to it, which is well worth reading.
So there is no good solution. But in this case, there is a simple solution, because the action you want to take is just to insert a label, and inserting a label which happens never to be used is not in any way going to affect the code which will ultimately be executed. So you might as well insert a label immediately after the FOR statement, whether you will need it or not, and then insert another label after the InitExpression if it turns out that there was such a thing. You don't need to actually know which label to use until you reach the end of the conditional expression, which is much later.
As explained in the Bison manual chapter I already linked to, this cannot be done using Mid-Rule Actions, because Bison doesn't attempt to compare Mid-Rule Actions with each other. Even if two actions happen to be identical, Bison will still need to decide which one to execute, thereby generating a conflict. So instead of using an MRA, you need to house the action in a marker non-terminal -- a non-terminal with an empty right-hand side, used only to trigger an action.
That would make the grammar look something like this:
ForLabel
: %empty { $$ = n; printf("for_begin_%d:\n", n++); }
ForStmt
: FOR
ForLabel[label]
Expression { printf("ifeq for_exit_%d\n", label); }
ForBlock { printf("jmp for_begin_%d\n", label);
printf("for_exit_%d:\n", label); }
| FOR
ForLabel
InitExpress ';'
ForLabel[label]
Expression ';'
Expression { printf("ifeq for_exit_%d\n", label); }
ForBlock { printf("jmp for_begin_%d\n", label);
printf("for_exit_%d:\n", label); }
;
([label] gives a name to a semantic value, which avoids having to use a rather mysterious and possibly incorrect $2 or $6. See Named References in the handy Bison manual.)
I am trying to create a parser for simple chemical formulae. Meaning, they have no states of matter, charge, or anything like that. The formulae only have strings representing compounds, quantities, and parentheses.
Following this answer to a similar question, and some rudimentary knowledge of discrete math, I hoped that I could write a simple Recursive Descent Parser to generate the number of each atom inside of the formula. I already have a really simple answer for this that involves single parentheses, but not nested parentheses.
Here are the productions of the grammar without parentheses:
Compound: Component { Component };
Component: Atom [Quantity]
Atom: 'H' | 'He' | 'Li' | 'Be' ...
Quantity: Digit { Digit }
Digit: '0' | '1' | ... '9'
[...] is read as optional, and will be an if test in the program (either it is there or missing)
| is alternatives, and so is an if .. else if .. else or switch 'test', it is saying the input must match one of these
{ ... } is read as repetition of 0 or more, and will be a while loop in the program
Characters between quotes are literal characters which will be in the string. All the other words are names of rules, and for a recursive descent parser, end up being the names of the functions which get called to chop up, and handle the input.
With nested parentheses, I have no idea what to do. By nested parentheses I mean something like (Fe2(OH)2(H2O)8)2, or something fictitious and complicated like (Ab(CD2(Ef(G2H)3)(IJ2)4)3)2
Because now there is a production that I don't really understand how to articulate, but here is my best attempt:
Parenthetical: Compound { Parenthetical } [Quantity]
So the basic rules parse any simple sequence of chemical symbols and quantities without parenthesis.
I assume the Quantity is defining the quantity of the whole chunk of stuff between '(' ... ')'
So, '(' ... ') [Quantity] needs to be parsed as exactly the same thing as the Component, i.e. as an alternative to: Atom [Quantity]
So the only thing to change is the Component rule; it becomes:
Component: Atom [Quantity] | '(' Compound ')' [Quantity]
In the code function (or procedure) which is parsing Component, it will have a look at the next character (token), and if it is an '(', it will consume it, then call the function (or procedure) responsible for parsing Compound, and after that, check the next character (token) is a ')' (if not, it's a syntax error), then handle the optional Quantity, and then it is finished.
I am assuming you are using a programming language which supports recursive function (or procedure) calls. That housekeeping, done by code behind the scenes for your program, will make this 'just work' (TM).
Alternatively, you could solve the problem in a different way. Add a new rule, which says:
Stuff: Atom | '(' Compound ')'
Then modify the rule:
Compound: Stuff [Quantity]
Then write a new function (or procedure) for Stuff, and change the Compound code to simply call Stuff, then handle the optional Quantity.
There are good technical reasons for doing this to support some parsing technology. However you're using recursive descent where it won't really matter.
Edit:
The type of grammar which works very well for a recursive decent parser is called LL(1), which means parse from left-to-right, and create the left-most derivation. That is a 'natural' way to parse when the code and function calls is the control flow. To find the theory of how to check grammars are LL(1) search the web for "parsing LL(1)" or "grammar follow sets".
It is pretty uncommon to see nested brackets in chemical formula. But maybe, for instance ammonium carbonate and barium nitrate in a 2:3 ratio could be written as "( (NH4)2 CO3)2 ( Ba(NO3)2 )3"
I found a right-to-left parser that pushes the multiplier onto a multiplier stack worked really well for me:
double multiplier[8];
double num = 1.0;
int multdepth = 0;
multiplier[0] = 1;
char molecule[1024]; // contains molecular formula
//parse the molecular formula right-to-left whilst keeping track of multiplier
for (int i = strlen(molecule) - 1; i >= 0; i--)
{
if (isdigit(molecule[i]) || molecule[i] == '.')
i = readnum(i, &num);
if (isalpha(molecule[i]))
{
i = parseatom(i, num * multiplier[multdepth]);
num = 1.0; // need to reset the multiplier here
}
if (molecule[i] == ')')
{
multdepth++;
multiplier[multdepth] = num * multiplier[multdepth - 1];
num = 1.0;
}
if (molecule[i] == '(')
{
multdepth--;
if (multdepth < 0)
error("Opening bracket not terminated");
}
}
I have not messed with building languages or parsers in a formal way since grad school and have forgotten most of what I knew back then. I now have a project that might benefit from such a thing but I'm not sure how to approach the following situation.
Let's say that in the language I want to parse there is a token that means "generate a random floating point number" in an expression.
exp: NUMBER
{$$ = $1;}
| NUMBER PLUS exp
{$$ = $1 + $3;}
| R PLUS exp
{$$ = random() + $3;}
;
I also want a "list" generating operator that will reevaluate an "exp" some number of times. Maybe like:
listExp: NUMBER COLON exp
{
for (int i = 0; i < $1; i++) {
print $3;
}
}
;
The problem I see is that "exp" will have already been reduced by the time the loop starts. If I have the input
2 : R + 2
then I think the random number will be generated as the "exp" is parsed and 2 added to it -- lets say the result is 2.0055. Then in the list expression I think 2.0055 would be printed out twice.
Is there a way to mark the "exp" before evaluation and then parse it as many times as the list loop count requires? The idea being to get a different random number in each evaluation.
Your best bet is to build an AST and evaluate the entire AST at the end of the parse. In-line evaluation is only possible for very simple (i.e. "calculator-like") projects.
Instead of an AST, you could construct code for a stack- or three-address- virtual machine. That's generally more efficient, particularly if you intend to execute the code frequently, but the AST is a lot simpler to construct, and executing it is a single depth-first scan.
Depending on your language design there are at least 5 different points at which a token in the language could be bound to a value. They are:
Pre-processor (like C #define)
Lexer: recognise tokens
Parser: recognise token structure, output AST
Semantic analysis: analyse AST, assign types and conversions etc
Code generation: output executable code or execute code directly.
If you have a token that can occur multiple times and you want to assign it a different random value each time, then phase 4 is the place to do it. If you generate an AST, walk the tree and assign the values. If you go straight to code generation (or an interpreter) do it then.
I'm having trouble working out how to use any of the functions in the Text.Parsec.Indent module provided by the indents package for Haskell, which is a sort of add-on for Parsec.
What do all these functions do? How are they to be used?
I can understand the brief Haddock description of withBlock, and I've found examples of how to use withBlock, runIndent and the IndentParser type here, here and here. I can also understand the documentation for the four parsers indentBrackets and friends. But many things are still confusing me.
In particular:
What is the difference between withBlock f a p and
do aa <- a
pp <- block p
return f aa pp
Likewise, what's the difference between withBlock' a p and do {a; block p}
In the family of functions indented and friends, what is ‘the level of the reference’? That is, what is ‘the reference’?
Again, with the functions indented and friends, how are they to be used? With the exception of withPos, it looks like they take no arguments and are all of type IParser () (IParser defined like this or this) so I'm guessing that all they can do is to produce an error or not and that they should appear in a do block, but I can't figure out the details.
I did at least find some examples on the usage of withPos in the source code, so I can probably figure that out if I stare at it for long enough.
<+/> comes with the helpful description “<+/> is to indentation sensitive parsers what ap is to monads” which is great if you want to spend several sessions trying to wrap your head around ap and then work out how that's analogous to a parser. The other three combinators are then defined with reference to <+/>, making the whole group unapproachable to a newcomer.
Do I need to use these? Can I just ignore them and use do instead?
The ordinary lexeme combinator and whiteSpace parser from Parsec will happily consume newlines in the middle of a multi-token construct without complaining. But in an indentation-style language, sometimes you want to stop parsing a lexical construct or throw an error if a line is broken and the next line is indented less than it should be. How do I go about doing this in Parsec?
In the language I am trying to parse, ideally the rules for when a lexical structure is allowed to continue on to the next line should depend on what tokens appear at the end of the first line or the beginning of the subsequent line. Is there an easy way to achieve this in Parsec? (If it is difficult then it is not something which I need to concern myself with at this time.)
So, the first hint is to take a look at IndentParser
type IndentParser s u a = ParsecT s u (State SourcePos) a
I.e. it's a ParsecT keeping an extra close watch on SourcePos, an abstract container which can be used to access, among other things, the current column number. So, it's probably storing the current "level of indentation" in SourcePos. That'd be my initial guess as to what "level of reference" means.
In short, indents gives you a new kind of Parsec which is context sensitive—in particular, sensitive to the current indentation. I'll answer your questions out of order.
(2) The "level of reference" is the "belief" referred in the current parser context state of where this indentation level starts. To be more clear, let me give some test cases on (3).
(3) In order to start experimenting with these functions, we'll build a little test runner. It'll run the parser with a string that we give it and then unwrap the inner State part using an initialPos which we get to modify. In code
import Text.Parsec
import Text.Parsec.Pos
import Text.Parsec.Indent
import Control.Monad.State
testParse :: (SourcePos -> SourcePos)
-> IndentParser String () a
-> String -> Either ParseError a
testParse f p src = fst $ flip runState (f $ initialPos "") $ runParserT p () "" src
(Note that this is almost runIndent, except I gave a backdoor to modify the initialPos.)
Now we can take a look at indented. By examining the source, I can tell it does two things. First, it'll fail if the current SourcePos column number is less-than-or-equal-to the "level of reference" stored in the SourcePos stored in the State. Second, it somewhat mysteriously updates the State SourcePos's line counter (not column counter) to be current.
Only the first behavior is important, to my understanding. We can see the difference here.
>>> testParse id indented ""
Left (line 1, column 1): not indented
>>> testParse id (spaces >> indented) " "
Right ()
>>> testParse id (many (char 'x') >> indented) "xxxx"
Right ()
So, in order to have indented succeed, we need to have consumed enough whitespace (or anything else!) to push our column position out past the "reference" column position. Otherwise, it'll fail saying "not indented". Similar behavior exists for the next three functions: same fails unless the current position and reference position are on the same line, sameOrIndented fails if the current column is strictly less than the reference column, unless they are on the same line, and checkIndent fails unless the current and reference columns match.
withPos is slightly different. It's not just a IndentParser, it's an IndentParser-combinator—it transforms the input IndentParser into one that thinks the "reference column" (the SourcePos in the State) is exactly where it was when we called withPos.
This gives us another hint, btw. It lets us know we have the power to change the reference column.
(1) So now let's take a look at how block and withBlock work using our new, lower level reference column operators. withBlock is implemented in terms of block, so we'll start with block.
-- simplified from the actual source
block p = withPos $ many1 (checkIndent >> p)
So, block resets the "reference column" to be whatever the current column is and then consumes at least 1 parses from p so long as each one is indented identically as this newly set "reference column". Now we can take a look at withBlock
withBlock f a p = withPos $ do
r1 <- a
r2 <- option [] (indented >> block p)
return (f r1 r2)
So, it resets the "reference column" to the current column, parses a single a parse, tries to parse an indented block of ps, then combines the results using f. Your implementation is almost correct, except that you need to use withPos to choose the correct "reference column".
Then, once you have withBlock, withBlock' = withBlock (\_ bs -> bs).
(5) So, indented and friends are exactly the tools to doing this: they'll cause a parse to immediately fail if it's indented incorrectly with respect to the "reference position" chosen by withPos.
(4) Yes, don't worry about these guys until you learn how to use Applicative style parsing in base Parsec. It's often a much cleaner, faster, simpler way of specifying parses. Sometimes they're even more powerful, but if you understand Monads then they're almost always completely equivalent.
(6) And this is the crux. The tools mentioned so far can only do indentation failure if you can describe your intended indentation using withPos. Quickly, I don't think it's possible to specify withPos based on the success or failure of other parses... so you'll have to go another level deeper. Fortunately, the mechanism that makes IndentParsers work is obvious—it's just an inner State monad containing SourcePos. You can use lift :: MonadTrans t => m a -> t m a to manipulate this inner state and set the "reference column" however you like.
Cheers!
I want to ask you a question about Flex, the program for parsing code.
Supposing I have an instruction like this one, in the rules part:
"=" BEGIN(attribution);
<attribution>{var_name} { fprintf(yyout, "="); ECHO; }
<attribution>";" BEGIN(INITIAL);
{var_name} is a regular expression that matches a variable's name, and all I want to do is to copy at the output all the attribution instructions, such as
a = 3;
or
b = a;
My rule though cannot write with fprintf the left member of the attribution, but only
= 3;
or
=a;
One solution for that might be that, after I make the match "=" and I am in the attribution state, to go 2 positions back as to get the left operand as well.
How can I do that in Flex?
Why are you using flex for syntactical analysis?
What you are doing sounds like a bison stuff not a flex job.
You'll be able to store previous token.
If you still want to use flex, you can use the / pattern.
Using this may lead to inefficiencies and the lexer can be bogus; it depends of the whole rule set.
{var_name}/"=" { ECHO; BEGIN(attribution); }
See the flex manual.