Here's my grammar to the for statements:
FOR x>0 {
//somthing
}
// or
FOR x = 0; x > 0; x++ {
//somthing
}
it has the same prefix FOR, and I'd want to print the for_begin label after InitExpression,
however the codes right after FOR will become useless because of confliction.
ForStmt
: FOR {
printf("for_begin_%d:\n", n);
} Expression {
printf("ifeq for_exit_%d\n", n);
} ForBlock
| FOR ForClause ForBlock
;
ForClause
: InitExpression ';' {
printf("for_begin_%d:\n", n);
} Expression ';' Expression { printf("ifeq for_exit_%d\n", n); }
;
I had tried to change it to something like:
ForStart
: FOR
| FOR InitExpression
;
or use a flag to mention where to print the for_begin label,
but also fail to resolve the conflict.
How to make it not conflict?
How can the parser know which alternative of the FOR statement it sees?
While it's possible that an InitExpression has identifiable form, such as an assignment statement, which could not be used in a conditional expression. That strikes me as too restrictive for practical purposes -- there are many things you might do to initialise a loop other than a direct assignment -- but leaving that aside, it means that the earliest the InitExpression can be definitively identified is when the assignment operator is seen. If lvalues in your language can only be simple identifiers, that would make it the second lookahead token after the FOR, but in most useful language lvalues can be much more complicated than just simple identifiers, and so it's likely that the InitExpression cannot be definitively identified with finite lookahead.
But it's more likely that the only significant difference between the two forms is that the expression in the first form is followed by a block (which I suppose cannot start with a semicolon) and the first expression in the second form is followed by a semicolon. So the parser knows what it is parsing at the end of the first expression and no earlier.
Normally, that would not cause a problem. Were it not for the MidRule Action which inserts a label, the parser does not have to make a reduction decision until it reaches the end of the first expression, at which point it needs to decide whether to reduce the first expression as an InitExpression or an Expression. But at that point, the lookahead token as either a semicolon or the first token of a block, so the lookahead token can guide the decision.
But the Mid-Rule Action makes that impossible. The Mid-Rule Action must either be reduced or not before shifting the token which immediately follows the FOR token, and -- as your examples show -- the lookahead token could be the same (i) in both cases.
Fundamentally, the issue is that you want to build a one-pass compiler rather than just parsing the input into an AST and then walking the AST to generate assembler code (possibly after doing some other traverses over the AST in order to perform other analyses and allow for code optimisation). The one-pass code generator depends on Mid-Rule Actions, and Mid-Rule Actions in turn can easily generate unresolvable parsing conflicts. This issue is so notorious that there is a chapter in the bison manual dedicated to it, which is well worth reading.
So there is no good solution. But in this case, there is a simple solution, because the action you want to take is just to insert a label, and inserting a label which happens never to be used is not in any way going to affect the code which will ultimately be executed. So you might as well insert a label immediately after the FOR statement, whether you will need it or not, and then insert another label after the InitExpression if it turns out that there was such a thing. You don't need to actually know which label to use until you reach the end of the conditional expression, which is much later.
As explained in the Bison manual chapter I already linked to, this cannot be done using Mid-Rule Actions, because Bison doesn't attempt to compare Mid-Rule Actions with each other. Even if two actions happen to be identical, Bison will still need to decide which one to execute, thereby generating a conflict. So instead of using an MRA, you need to house the action in a marker non-terminal -- a non-terminal with an empty right-hand side, used only to trigger an action.
That would make the grammar look something like this:
ForLabel
: %empty { $$ = n; printf("for_begin_%d:\n", n++); }
ForStmt
: FOR
ForLabel[label]
Expression { printf("ifeq for_exit_%d\n", label); }
ForBlock { printf("jmp for_begin_%d\n", label);
printf("for_exit_%d:\n", label); }
| FOR
ForLabel
InitExpress ';'
ForLabel[label]
Expression ';'
Expression { printf("ifeq for_exit_%d\n", label); }
ForBlock { printf("jmp for_begin_%d\n", label);
printf("for_exit_%d:\n", label); }
;
([label] gives a name to a semantic value, which avoids having to use a rather mysterious and possibly incorrect $2 or $6. See Named References in the handy Bison manual.)
Related
I am trying to make a C-like language using Flex/Bison. My problem is that I can't find a proper way to handle new lines. I have to ignore all new lines so I don't returnt them as a token to Bison because that would make the grammar rules so difficult to make but I am asked in some rules to make a mandatory change of line. For example:
Program "identifier" -> mandatory change of line
Function "identifier"("parameters") -> mandatory change of line
If I return \n as a token to flex then i have to put new lines in all of my grammar rules and that's surely not practical. I tried to make a variable work like a switch or something but it didn't quite work.
Any help or suggestion?
If the required newline is simply aesthetic -- that is, if it isn't required in order to avoid an ambiguity -- then the easiest way to enforce it is often just to track token locations (which is something that bison and flex can help you with) so that you can check in your reduction action that two consecutive tokens were not on the same line:
func_defn: "function" IDENT '(' opt_arg_list ')' body "end" {
if (#5.last_line == #6.first_line) {
yyerror("Body of function must start on a new line");
/* YYABORT; */ /* If you want to kill the parse at this point. */
}
// ...
}
Bison doesn't require any declarations or options in order to use locations; it will insert location support if it notices that you use #N in any action (which is how you refer to the location of a token). However, it is sometimes useful to insert a %locations declaration to force location support. Normally no other change is necessary to your grammar.
You do have to insert a little bit of code in your lexer in order to report the location values to the parser. Locations are communicated through a global variable called yylloc, whose value is of type YYLTYPE. By default, YYLTYPE is a struct with four int members: first_line, first_column, last_line, last_column. (See the Bison manual for more details.) These fields need to be set in your lexer for every token. Fortunately, flex allows you to define the macro YY_USER_ACTION, which contains code executed just before every action (even empty actions), which you can use to populate yylloc. Here's one which will work for many simple lexical analysers; you can put it in the code block at the top of your flex file.
/* Simple YY_USER_ACTION. Will not work if any action includes
* yyless(), yymore(), input() or REJECT.
*/
#define YY_USER_ACTION \
yylloc.first_line = yylloc.last_line; \
yylloc.first_column = yylloc.last_column; \
if (yylloc.last_line == yylineno) \
yylloc.last_column += yyleng; \
else { \
yylloc.last_line = yylineno; \
yylloc.last_column = yytext + yyleng - strrchr(yytext, '\n'); \
}
If the simple location check described above isn't sufficient for your use case, then you can do it through what's called "lexical feedback": a mechanism where the parser not only collects information from the lexical scanner, but also communicates back to the lexer when some kind of lexical change is needed.
Lexical feedback is usually discouraged because it can be fragile. It's always important to remember that the parser and the scanner are not necessarily synchronised. The parser often (but not always) needs to know the next token after the current production, so the lexical state when a production's action is being executed might be the lexical state after the next token, rather than the state after the last token in the production. But it might not; many parser generators, including Bison, try to execute an action immediately if they can figure out that the same action will be executed regardless of the next token. Unfortunately, that's not always predictable. In the case of Bison, for example, changing the parsing algorithm from the default LALR(1) to Canonical LR(1) or to GLR can also change a particular reduction action from immediate to deferred.
So if you're going to try to communicate with the scanner, you should try to do so in a way that will work whether or not the scanner has already been asked for the lookahead token. One way to do this is to put the code which communicates with the scanner in a Mid-Rule Action one token earlier than the token which you want to influence. [Note 1]
In order to make newlines "mostly optional", we need to tell the lexer when it should return a newline instead of ignoring it. One way to do this is to export a function which the lexer can call. We put the definition of that function into the generated parser and its declaration into the generated header file:
/* Anything in code requires and code provides sections is also
* copied into the generated header. So we can use it to declare
* exported functions.
*/
%code requires {
#include <stdbool.h>
bool need_nl(void);
}
%%
// ...
/* See [Note 2], below. */
/* Program directive. */
prog_decl: "program" { need_nl_flag = true; } IDENT '\n'
/* Function definition */
func_defn: "function" IDENT
'(' opt_arg_list { need_nl_flag = true; } ')' '\n'
body
"end"
// ...
%%
static bool need_nl_flag = false;
/* The scanner should call this function when it sees a newline.
* If the function returns true, the newline should be returned as a token.
* The function resets the value of the flag, so it must not be called for any
* other purpose. (This interface allows us to set the flag in a parser action
* without having to worry about clearing it later.)
*/
bool need_nl(void) {
bool temp = need_nl_flag;
need_nl_flag = false;
return temp;
}
// ...
Then we just need a small adjustment to the scanner in order to call that function. This uses Flex's set difference operator {-} to make a character class containing all whitespace other than a newline. Because we put that rule first, the second rule will only be used for whitespace including at least one newline character. Note that we only return one newline token for any sequence of blank lines.
([[:space:]]{-}[\n])+ { /* ignore whitespace */ }
[[:space:]]+ { if (need_nl()) return '\n'; }
Notes
That's not something you can do without thought, either: it might also be an error to change the scanner configuration too soon. In the action, you can check whether or not the lookahead token has already been read by looking at the value of yychar. If yychar is YYEMPTY, then no lookahead token has been read. If it is YYEOF, then an attempt was made to read a lookahead token but the end of input was encountered. Otherwise, the lookahead token has already been read.
It might seem tempting to use two actions, one before the token prior to the one you want to affect, and one just before that token. The first action could execute only if yychar is not YYEMPTY, indicating that the lookahead token has already been read and the scanner is about to read the token you want to change, while the second action will only execute if yychar at that point is YYEMPTY. But it's entirely possible that for a particular parse both of those conditions are true, or that neither is true.
Bison does have one configuration which you can use to make the lookahead decision completely predictable. If you set %define lr.default-reduction accepting, then Bison will always attempt to read a lookahead symbol, and you can be sure that placing the action one token early will work. Unless you are using the parser interactively, there is no real cost for enabling this option. But it won't work with old Bison versions or with other parser generators such as byacc.
For this grammar, we could have put the mid-rule actions just before the '\n' tokens rather than one token earlier (as long as the parser is never converted to a GLR or Canonical-LR parser). That's because in both rules, the MRA will go in between two tokens, and (presumably) there are no other rules which might apply up to the first of these tokens. Under those circumstances Bison can certainly know that the MRA can be reduced without examining the lookahead token to see if it is \n: either the next token is a newline and the reduction was required, or the next token is not a newline, which will be a syntax error. Since Bison does not guarantee to detect syntax errors before reduction actions are run, it can reduce the MRA action before knowing whether the parse will succeed.
There is a pattern called trailing context you cant try : https://people.cs.aau.dk/~marius/sw/flex/Flex-Regular-Expressions.html
"identifier"/[\n]
"function-identifier"/[\n]
"end" { return 'END'; }
...
0[xX][0-9a-fA-F]+ { return 'NUMBER'; }
[A-Za-z_$][A-Za-z0-9_$]* { return 'IDENT'; }
...
Call
: IDENT ArgumentList
{{ $$ = ['CallExpr', $1, $2]; }}
| IDENT
{{ $$ = ['CallExprNoArgs', $1]; }}
;
CallArray
: CallElement
{{ $$ = ['CallArray', $1]; }}
;
CallElement
: CallElement "." Call
{{ $$ = ['CallElement', $1, $3]; }}
| Call
;
Hello! So, in my grammar I want "res.end();" to not detect end as a keyword, but as an ident. I've been thinking for a while about this one but couldn't solve it. Does anyone have any ideas? Thank you!
edit: It's a C-like programming language.
There's not quite enough information in the question to justify the assumptions I'm making here, so this answer may be inexact.
Let's suppose we have a somewhat Lua-like language in which a.b is syntactic sugar for a["b"]. Furthermore, since the . must be followed by a lexical identifier -- in other words, it is never followed by a syntactic keyword -- we'd like to inhibit keyword recognition in this context.
That's a pretty simple rule. It's simple enough that the lexer could implement it without any semantic information at all; all that it says is that the token which follows a . must be an identifier. In this context, keywords should be treated as identifiers, and anything else other than an identifier is an error.
We can do this with start conditions. Specifically, we define a start condition which is only used after a . token:
%x selector
%%
/* White space and comment rules need to explicitly include
* the selector condition
*/
<INITIAL,selector>\s+ ;
/* Other rules, including keywords, are unmodified */
"end" return "END";
/* The dot rule triggers a new start condition */
"." this.begin("selector"); return ".";
/* Outside of the start condition, identifiers don't change state. */
[A-Za-z_]\w* yylval = yytext; return "ID";
/* Only identifiers are valid in this start condition, and if found
* the start condition is changed back. Anything else is an error.
*/
<selector>[A-Za-z_]\w* yylval = yytext; this.popState(); return "ID";
<selector>. parse_error("Expecting identifier");
Modify your parser, so it always knows what it is expecting to read next (that will be some set of tokens, you can compute this using the notion of First(x) for x being any nonterminal).
When lexing, have the lexer ask the parser what set of tokens it expects next.
Your keywork reconizer for 'end' asks the parser, and it either ways "expecting 'end'" at which pointer the lexer simply hands on the 'end' lexeme, or it says "expecting ID" at which point it hands the parser an ID with name text "end".
This may or may not be convenient to get your parser to do. But you need something like this.
We use a GLR parser; our parser accepts multiple tokens in the same place. Our solution is to generate both the 'end' keyword and and the identifier with text "end" and shove them both into the GLR parser. It can handle local ambiguity; the multiple parses caused by this proceed until the parser with the wrong assumption encounters a syntax error, and then it just vanishes, by fiat. The last standing parser is the one with the right set of assumptions. This scheme is somewhat like the first one, just that we hand the parser the choices and it decides rather than making the lexer decide.
You might be able to send your parser a "two-interpretation" lexeme, e.g., a keyword-in-context lexeme, which in essence claims it it both a keyword and/or an identifier. With a single token lookahead internally, the parser can likely decide easily and restamp the lexeme. Not as general as the GLR solution, but probably works in a lot of cases.
I was writing a parser to parse C-like grammars.
First, it could now parse code like:
a = 1;
b = 2;
Now I want to make the semicolon at the end of line optional.
The original YACC rule was:
stmt: expr ';' { ... }
Where the new line is processed by the lexer that written by myself(the code are simplified):
rule(/\r\n|\r|\n/) { increase_lineno(); return :PASS }
the instruction :PASS here is equivalent to return nothing in LEX, which drop current matched text and skip to the next rule, just like what is usually done with whitespaces.
Because of this, I can't just simply change my YACC rule into:
stmt: expr end_of_stmt { ... }
;
end_of_stmt: ';'
| '\n'
;
So I chose to change the lexer's state dynamically by the parser correspondingly.
Like this:
stmt: expr { state = :STATEMENT_END } ';' { ... }
And add a lexer rule that can match new line with the new state:
rule(/\r\n|\r|\n/, :STATEMENT_END) { increase_lineno(); state = nil; return ';' }
Which means when the lexer is under :STATEMENT_END state. it will first increase the line number as usual, and then set the state into initial one, and then pretend itself is a semicolon.
It's strange that it doesn't actually work with following code:
a = 1
b = 2
I debugged it and got it is not actually get a ';' as expect when scanned the newline after the number 1, and the state specified rule is not really executed.
And the code to set the new state is executed after it already scanned the new line and returned nothing, that means, these works is done as following order:
scan a, = and 1
scan newline and skip, so get the next value b
the inserted code({ state = :STATEMENT_END }) is executed
raising error -- unexpected b here
This is what I expect:
scan a, = and 1
found that it matches the rule expr, so reduce into stmt
execute the inserted code to set the new lexer state
scan the newline and return a ; according the new state matching rule
continue to scan & parse the following line
After introspection I found that might caused as YACC uses LALR(1), this parser will read forward for one token first. When it scans to there, the state is not set yet, so it cannot get a correct token.
My question is: how to make it work as expected? I have no idea on this.
Thanks.
The first thing to recognize is that having optional line terminators like this introduces ambiguity into your language, and so you first need to decide which way you want to resolve the ambiguity. In this case, the main ambiguity comes from operators that may be either infix or prefix. For example:
a = b
-c;
Do you want to treat the above as a single expr-statement, or as two separate statements with the first semicolon elided? A similar potential ambiguity occurs with function call syntax in a C-like language:
a = b
(c);
If you want these to resolve as two statements, you can use the approach you've tried; you just need to set the state one token earlier. This gets tricky as you DON'T want to set the state if you have unclosed parenthesis, so you end up needing an additional state var to record the paren nesting depth, and only set the insert-semi-before-newline state when that is 0.
If you want to resolve the above cases as one statement, things get tricky, as you actually need more lookahead to decide when a newline should end a statement -- at the very least you need to look at the token AFTER the newline (and any comments or other ignored stuff). In this case you can have the lexer do the extra lookahead. If you were using flex (which you're apparently not?), I would suggest either using the / operator (which does lookahead directly), or defer returning the semicolon until the lexer rule that matches the next token.
In general, when doing this kind of token state recording, I find it easiest to do it entirely within the lexer where possible, so you don't need to worry about the extra token of lookahead sometimes (but not always) done by the parser. In this specific case, an easy approach would be to have the lexer record the parenthesis seen (+1 for (, -1 for )), and the last token returned. Then, in the newline rule, if the paren level is 0 and the last token was something that could end an expression (ID or constant or ) or postfix-only operator), return the extra ;
An alternate approach is to have the lexer return NEWLINE as its own token. You would then change the parser to accept stmt: expr NEWLINE as well as optional newlines between most other tokens in the grammar. This exposes the ambiguity directly to the parser (its now not LALR(1)), so you need to resolve it either by using yacc's operator precedence rules (tricky and error prone), or using something like bison's %glr-parser option or btyacc's backtracking ability to deal with the ambiguity directly.
What you are attempting is certainly possible.
Ruby, in fact, does exactly this, and it has a yacc parser. Newlines soft-terminate statements, semicolons are optional, and statements are automatically continued on multiple lines "if they need it".
Communicating between the parser and lexical analyzer may be necessary, and yes, legacy yacc is LALR(1).
I don't know exactly how Ruby does it. My guess has always been that it doesn't actually communicate (much) but rather the lexer recognizes constructs that obviously aren't finished and silently just treats newlines as spaces until the parens and brackets balance. It must also notice when lines end with binary operators or commas and eat those newlines too.
Just a guess, but I believe this technique would work. And Ruby is open source... if you want to see exactly how Matz did it.
I want to ask you a question about Flex, the program for parsing code.
Supposing I have an instruction like this one, in the rules part:
"=" BEGIN(attribution);
<attribution>{var_name} { fprintf(yyout, "="); ECHO; }
<attribution>";" BEGIN(INITIAL);
{var_name} is a regular expression that matches a variable's name, and all I want to do is to copy at the output all the attribution instructions, such as
a = 3;
or
b = a;
My rule though cannot write with fprintf the left member of the attribution, but only
= 3;
or
=a;
One solution for that might be that, after I make the match "=" and I am in the attribution state, to go 2 positions back as to get the left operand as well.
How can I do that in Flex?
Why are you using flex for syntactical analysis?
What you are doing sounds like a bison stuff not a flex job.
You'll be able to store previous token.
If you still want to use flex, you can use the / pattern.
Using this may lead to inefficiencies and the lexer can be bogus; it depends of the whole rule set.
{var_name}/"=" { ECHO; BEGIN(attribution); }
See the flex manual.
I have read the GOLD Homepage ( http://www.devincook.com/goldparser/ ) docs, FAQ and Wikipedia to find out what practical application there could possibly be for GOLD. I was thinking along the lines of having a programming language (easily) available to my systems such as ABAP on SAP or X++ on Axapta - but it doesn't look feasible to me, at least not easily - even if you use GOLD.
The final use of the parsed result produced by GOLD escapes me - what do you do with the result of the parse?
EDIT: A practical example (description) would be great.
Parsing really consists of two phases. The first is "lexing", which convert the raw strings of character in to something that the program can more readily understand (commonly called tokens).
Simple example, lex would convert:
if (a + b > 2) then
In to:
IF_TOKEN LEFT_PAREN IDENTIFIER(a) PLUS_SIGN IDENTIFIER(b) GREATER_THAN NUMBER(2) RIGHT_PAREN THEN_TOKEN
The parse takes that stream of tokens, and attempts to make yet more sense out of them. In this case, it would try and match up those tokens to an IF_STATEMENT. To the parse, the IF _STATEMENT may well look like this:
IF ( BOOLEAN_EXPRESSION ) THEN
Where the result of the lexing phase is a token stream, the result of the parsing phase is a Parse Tree.
So, a parser could convert the above in to:
if_statement
|
v
boolean_expression.operator = GREATER_THAN
| |
| v
V numeric_constant.string="2"
expression.operator = PLUS_SIGN
| |
| v
v identifier.string = "b"
identifier.string = "a"
Here you see we have an IF_STATEMENT. An IF_STATEMENT has a single argument, which is a BOOLEAN_EXPRESSION. This was explained in some manner to the parser. When the parser is converting the token stream, it "knows" what a IF looks like, and know what a BOOLEAN_EXPRESSION looks like, so it can make the proper assignments when it sees the code.
For example, if you have just:
if (a + b) then
The parser could know that it's not a boolean expression (because the + is arithmetic, not a boolean operator) and the parse could throw an error at this point.
Next, we see that a BOOLEAN_EXPRESSION has 3 components, the operator (GREATER_THAN), and two sides, the left side and the right side.
On the left side, it points to yet another expression, the "a + b", while on the right is points to a NUMERIC_CONSTANT, in this case the string "2". Again, the parser "knows" this is a NUMERIC constant because we told it about strings of numbers. If it wasn't numbers, it would be an IDENTIFIER (like "a" and "b" are).
Note, that if we had something like:
if (a + b > "XYZ") then
That "parses" just fine (expression on the left, string constant on the right). We don't know from looking at this whether this is a valid expression or not. We don't know if "a" or "b" reference Strings or Numbers at this point. So, this is something the parser can't decided for us, can't flag as an error, as it simply doesn't know. That will happen when we evaluate (either execute or try to compile in to code) the IF statement.
If we did:
if [a > b ) then
The parser can readily see that syntax error as a problem, and will throw an error. That string of tokens doesn't look like anything it knows about.
So, the point being that when you get a complete parse tree, you have some assurance that at first cut the "code looks good". Now during execution, other errors may well come up.
To evaluate the parse tree, you just walk the tree. You'll have some code associated with the major nodes of the parse tree during the compile or evaluation part. Let's assuming that we have an interpreter.
public void execute_if_statment(ParseTreeNode node) {
// We already know we have a IF_STATEMENT node
Value value = evaluate_expression(node.getBooleanExpression());
if (value.getBooleanResult() == true) {
// we do the "then" part of the code
}
}
public Value evaluate_expression(ParseTreeNode node) {
Value result = null;
if (node.isConstant()) {
result = evaluate_constant(node);
return result;
}
if (node.isIdentifier()) {
result = lookupIdentifier(node);
return result;
}
Value leftSide = evaluate_expression(node.getLeftSide());
Value rightSide = evaluate_expression(node.getRightSide());
if (node.getOperator() == '+') {
if (!leftSide.isNumber() || !rightSide.isNumber()) {
throw new RuntimeError("Must have numbers for adding");
}
int l = leftSide.getIntValue();
int r = rightSide.getIntValue();
int sum = l + r;
return new Value(sum);
}
if (node.getOperator() == '>') {
if (leftSide.getType() != rightSide.getType()) {
throw new RuntimeError("You can only compare values of the same type");
}
if (leftSide.isNumber()) {
int l = leftSide.getIntValue();
int r = rightSide.getIntValue();
boolean greater = l > r;
return new Value(greater);
} else {
// do string compare instead
}
}
}
So, you can see that we have a recursive evaluator here. You see how we're checking the run time types, and performing the basic evaluations.
What will happen is the execute_if_statement will evaluate it's main expression. Even tho we wanted only BOOLEAN_EXPRESION in the parse, all expressions are mostly the same for our purposes. So, execute_if_statement calls evaluate_expression.
In our system, all expressions have an operator and a left and right side. Each side of an expression is ALSO an expression, so you can see how we immediately try and evaluate those as well to get their real value. The one note is that if the expression consists of a CONSTANT, then we simply return the constants value, if it's an identifier, we look it up as a variable (and that would be a good place to throw a "I can't find the variable 'a'" message), otherwise we're back to the left side/right side thing.
I hope you can see how a simple evaluator can work once you have a token stream from a parser. Note how during evaluation, the major elements of the language are in place, otherwise we'd have got a syntax error and never got to this phase. We can simply expect to "know" that when we have a, for example, PLUS operator, we're going to have 2 expressions, the left and right side. Or when we execute an IF statement, that we already have a boolean expression to evaluate. The parse is what does that heavy lifting for us.
Getting started with a new language can be a challenge, but you'll find once you get rolling, the rest become pretty straightforward and it's almost "magic" that it all works in the end.
Note, pardon the formatting, but underscores are messing things up -- I hope it's still clear.
I would recommend antlr.org for information and the 'free' tool I would use for any parser use.
GOLD can be used for any kind of application where you have to apply context-free grammars to input.
elaboration:
Essentially, CFGs apply to all programming languages. So if you wanted to develop a scripting language for your company, you'd need to write a parser- or get a parsing program. Alternatively, if you wanted to have a semi-natural language for input for non-programmers in the company, you could use a parser to read that input and spit out more "machine-readable" data. Essentially, a context-free grammar allows you to describe far more inputs than a regular expression. The GOLD system apparently makes the parsing problem somewhat easier than lex/yacc(the UNIX standard programs for parsing).