Section 2.2 of the Happy user manual advises you to use left recursion rather than right recursion, because right recursion is "inefficient". Basically they're saying that if you try to parse a long sequence of items, right recursion will overflow the parse stack, whereas left recursion uses constant stack. The canonical example given is
items : item { $1 : [] }
| items "," item { $3 : $1 }
Unfortunately, this means that the list of items comes out backwards.
Now it's easy enough to apply reverse at the end (although maddeningly annoying that you have to do this everywhere the parser is called, rather than once where it's defined). However, if the list of items is large, surely reverse is also going to overflow the Haskell stack? No?
Basically, how do I make it so that I can parse arbitrarily-large files and still get the results out in the correct order?
If all you want is for the whole list of items to be reversed every time, you can define
items : items_ {reverse $1}
items_ : item { $1 : [] }
| items_ "," item { $3 : $1 }
reverse won't overflow the stack. If you need to convince yourself of this, try evaluating length $ reverse [1..10000000]
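To see why this is safe, here is a minimal sketch of an accumulator-style reverse. The name reverseAcc is made up for illustration; as far as I know GHC's own reverse is written in the same style. Each step moves one element onto the accumulator in a tail call, so no chain of pending calls builds up however long the list is.

```haskell
-- Hypothetical name, for illustration only: reverse a list with an accumulator.
reverseAcc :: [a] -> [a]
reverseAcc = go []
  where
    -- The recursive call is in tail position, so the stack does not grow
    -- with the length of the input.
    go acc []       = acc
    go acc (x : xs) = go (x : acc) xs
```

Loading this in GHCi and evaluating length (reverseAcc [1..10000000]) is a quick way to check it on a large list.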
When you write a Happy description, you have to define all possible types of token that can appear. But you can only match against token types, not individual token values...
This is kind of problematic. Consider, for example, the data keyword. According to the Haskell Report, this token is a "reservedid". So my tokeniser recognises it and marks it as such. However, consider the as keyword. Now it turns out that this is not a reservedid; it's an ordinary varid. It's only special in one context. You can totally declare a normal variable named as, and it's fine.
So here's a question: How do I parse as specifically?
Initially I didn't really think about it. I just defined a new token type which represents any varid token whose text happens to be as.
...and then I spent about 2 hours trying to work out why the hell my grammar doesn't actually work. Yeah, it turns out that since this token type overlaps with an existing token type, the declaration order is significant. (!!!) Literally, changing the order of the declarations made the grammar parse perfectly.
But now I'm worried. I fear that as will never be matched as a varid and will only ever match as itself. So all the grammar rules that say varid will reject the as token — which is completely wrong!
What is the correct way to fix this?
What GHC does in its Parser.y is to define a nonterminal token type special_id that lists many of the special non-keywords like as, and then define the tyvarid and varid (nonterminal) tokens to include that as an option besides the terminal VARID (and some others, although most of them look to me like they should have been put in special_id too).
An excerpt:
varid :: { Located RdrName }
: VARID { sL1 $1 $! mkUnqual varName (getVARID $1) }
| special_id { sL1 $1 $! mkUnqual varName (unLoc $1) }
| 'unsafe' { sL1 $1 $! mkUnqual varName (fsLit "unsafe") }
...
special_id :: { Located FastString }
special_id
: 'as' { sL1 $1 (fsLit "as") }
| 'qualified' { sL1 $1 (fsLit "qualified") }
| 'hiding' { sL1 $1 (fsLit "hiding") }
| 'export' { sL1 $1 (fsLit "export") }
...
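The same idea can be modelled in plain Haskell, outside of Happy. This is only a sketch under assumptions: the Token constructors and the functions classify, specialId and varid are hypothetical names standing in for the lexer and the two nonterminals, not GHC's actual code. The point is that the lexer always gives as its own token, and the varid rule accepts that token as well as ordinary identifiers.

```haskell
-- Hypothetical token type: the lexer gives each special word its own token.
data Token
  = TData            -- a genuine reserved word
  | TAs              -- special only in one context
  | TQualified
  | THiding
  | TVarId String    -- any other identifier
  deriving (Eq, Show)

-- Lexer side: classify one identifier-shaped word.
classify :: String -> Token
classify "data"      = TData
classify "as"        = TAs
classify "qualified" = TQualified
classify "hiding"    = THiding
classify name        = TVarId name

-- Plays the role of the special_id nonterminal.
specialId :: Token -> Maybe String
specialId TAs        = Just "as"
specialId TQualified = Just "qualified"
specialId THiding    = Just "hiding"
specialId _          = Nothing

-- Plays the role of the varid nonterminal: a VARID or any special_id.
varid :: Token -> Maybe String
varid (TVarId name) = Just name
varid tok           = specialId tok
```

With this arrangement varid (classify "as") succeeds, while varid (classify "data") is rejected, so there is no overlap between token types and declaration order stops mattering.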
I am trying to do a top-down visit of an Algebraic Data Type. When I find a node of a certain type, I would also like to bind variables to the child nodes of that node, e.g.
data Script=script(list[Stmt] body) | ...
data Stmt =exprstmt(Expr expr)| ...
data Expr =assign(Expr left, Expr right) | var(str name)| scalar(Type aType)|... ;
Script myScript=someScript(srcFile);
top-down visit(myScript)
{
case (Expr e:assign(left,right), left:=var(_), right :=scalar(_) )
{
str varName=left.name;
Type myType=right.aType;
}
}
So what I'm trying to do in the case statement is to search for a specific kind of node: i.e. of type assign(var(),scalar()) by doing a couple of pattern matches. My intention is to bind the variables left and right to var() and scalar() respectively at the same time that I find the particular kind of node. I hoped to NOT do a nested 'case' statement in order to retrieve information about the sub-nodes. Maybe that is possible, but I'm not sure.
You can just nest the patterns, like so:
top-down visit(myScript) {
case e:assign(l:var(varName),r:scalar(aType)) :
// do something useful
println("<varName> : <aType>");
}
I believe left and right might be reserved keywords so this is why I used l and r instead.
Or simpler (because you don't need names for the nested patterns):
top-down visit(myScript) {
case assign(var(varName),scalar(aType)) :
// do something useful
println("<varName> : <aType>");
}
I have not messed with building languages or parsers in a formal way since grad school and have forgotten most of what I knew back then. I now have a project that might benefit from such a thing but I'm not sure how to approach the following situation.
Let's say that in the language I want to parse there is a token that means "generate a random floating point number" in an expression.
exp: NUMBER
{$$ = $1;}
| NUMBER PLUS exp
{$$ = $1 + $3;}
| R PLUS exp
{$$ = random() + $3;}
;
I also want a "list" generating operator that will reevaluate an "exp" some number of times. Maybe like:
listExp: NUMBER COLON exp
{
for (int i = 0; i < $1; i++) {
print $3;
}
}
;
The problem I see is that "exp" will have already been reduced by the time the loop starts. If I have the input
2 : R + 2
then I think the random number will be generated as the "exp" is parsed and 2 added to it. Let's say the result is 2.0055. Then in the list expression I think 2.0055 would be printed out twice.
Is there a way to mark the "exp" before evaluation and then parse it as many times as the list loop count requires? The idea being to get a different random number in each evaluation.
Your best bet is to build an AST and evaluate the entire AST at the end of the parse. In-line evaluation is only possible for very simple (i.e. "calculator-like") projects.
Instead of an AST, you could construct code for a stack-based or three-address virtual machine. That's generally more efficient, particularly if you intend to execute the code frequently, but the AST is a lot simpler to construct, and executing it is a single depth-first scan.
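As a rough illustration of the AST approach, here is a sketch in Haskell rather than in yacc actions. Everything in it is an assumption made for the example: the Exp type, the Seed type and the functions nextDouble, eval and evalList are hypothetical, and the tiny linear congruential generator is only there so the sketch needs no extra packages. The parser would build an Exp value instead of a number, and the list operator would evaluate that same tree once per iteration.

```haskell
-- Hypothetical AST for the toy grammar in the question.
data Exp
  = Num Double      -- NUMBER
  | Rand            -- R: draw a fresh random number each time it is evaluated
  | Add Exp Exp     -- exp PLUS exp
  deriving Show

-- Tiny linear congruential generator (commonly quoted 32-bit constants).
-- It is NOT a good random source; it only keeps the sketch dependency-free.
newtype Seed = Seed Integer

nextDouble :: Seed -> (Double, Seed)
nextDouble (Seed s) = (fromIntegral s' / 4294967296, Seed s')
  where
    s' = (1664525 * s + 1013904223) `mod` 4294967296

-- Depth-first evaluation, threading the generator state through the tree.
eval :: Exp -> Seed -> (Double, Seed)
eval (Num n)   seed = (n, seed)
eval Rand      seed = nextDouble seed
eval (Add a b) seed =
  let (x, seed')  = eval a seed
      (y, seed'') = eval b seed'
  in (x + y, seed'')

-- listExp: NUMBER COLON exp, evaluating the same tree n times.
evalList :: Int -> Exp -> Seed -> [Double]
evalList n e seed
  | n <= 0    = []
  | otherwise = let (v, seed') = eval e seed in v : evalList (n - 1) e seed'
```

For the input 2 : R + 2, the parser would produce Add Rand (Num 2), and evalList 2 (Add Rand (Num 2)) (Seed 42) yields two different values because Rand is evaluated afresh on each pass.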
Depending on your language design there are at least 5 different points at which a token in the language could be bound to a value. They are:
1. Pre-processor (like C #define)
2. Lexer: recognise tokens
3. Parser: recognise token structure, output AST
4. Semantic analysis: analyse AST, assign types and conversions etc
5. Code generation: output executable code or execute code directly.
If you have a token that can occur multiple times and you want to assign it a different random value each time, then phase 4 is the place to do it. If you generate an AST, walk the tree and assign the values. If you go straight to code generation (or an interpreter) do it then.
I'm trying to write a parser with Happy (the Haskell tool), but I'm getting an error message, "unused rules: 11 and unused terminals: 10", and I don't know what this means. On the other hand, I'm really not sure about the use of the $i parameters in the statements of the rules; I think my error is because of that. If anyone can help me...
It's not an error if you get these messages, it just means that part of your grammar is unused because it is not reachable from the start symbol. To see more information about how Happy understands your grammar, use the --info flag to Happy:
happy --info MyParser.y
which generates a file MyParser.info in addition to the usual MyParser.hs.
Unused rules and terminals are parts of your grammar that cannot be reached from the top-level parse statements, if I recall correctly. To see how to use the $$ parameters, read the Happy user guide:
The $$ symbol is a placeholder that represents the value of this token. Normally the value of a token is the token itself, but by using the $$ symbol you can specify some component of the token object to be the value.
Unused rules and terminals means you have described rules that can't be reached during parsing (pretty much like "if true then 1 else 2", the 2 branch will never be reached).
Check the output of --info for more details.
For the $$ thing, it is a data extractor: let's say you have a lexer that produces tokens of the following type:
data TokenType = INT | SYM
data TokenLex = L TokenType String
where TokenType is here to distinguish useful data and keywords.
In the action of your parser, you can extract the String part by using $$
%token INTEGER {L INT $$ }
%token OTHER {L _ $$}
foo : INTEGER bar INTEGER { read $1 + read $3 }
| ...
In this rule, $1 means "give me the content of the first INTEGER" and $3 "the content of the second INTEGER". $2 means "give me the content of bar" (which may be another complex rule).
Thanks to $$, $1 and $3 are genuine Haskell Strings, because we told Happy that "the content of an INTEGER is the String part of the TokenLex", not the whole token.
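What $$ does can be imitated by hand with ordinary pattern matching. The sketch below reuses the TokenType and TokenLex types from above; integerValue and addIntegers are hypothetical helpers written for illustration, not code that Happy generates. integerValue mirrors the pattern L INT $$, and addIntegers mirrors the action { read $1 + read $3 }.

```haskell
-- Token types from the answer above.
data TokenType = INT | SYM deriving (Eq, Show)
data TokenLex = L TokenType String deriving (Eq, Show)

-- Roughly what INTEGER { L INT $$ } asks for: match the token against
-- the pattern and hand the action only the part marked by $$.
integerValue :: TokenLex -> Maybe String
integerValue (L INT s) = Just s
integerValue _         = Nothing

-- Hand-written stand-in for the action { read $1 + read $3 }.
-- Like the original action, it assumes the String holds a valid number.
addIntegers :: TokenLex -> TokenLex -> Maybe Int
addIntegers t1 t3 = do
  a <- integerValue t1
  b <- integerValue t3
  pure (read a + read b)
```

Here addIntegers (L INT "2") (L INT "40") gives Just 42, while a token that is not an INT, such as L SYM "x", makes the match fail.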
I have read the GOLD Homepage ( http://www.devincook.com/goldparser/ ) docs, FAQ and Wikipedia to find out what practical application there could possibly be for GOLD. I was thinking along the lines of having a programming language (easily) available to my systems such as ABAP on SAP or X++ on Axapta - but it doesn't look feasible to me, at least not easily - even if you use GOLD.
The final use of the parsed result produced by GOLD escapes me - what do you do with the result of the parse?
EDIT: A practical example (description) would be great.
Parsing really consists of two phases. The first is "lexing", which converts the raw string of characters into something that the program can more readily understand (commonly called tokens).
Simple example, lex would convert:
if (a + b > 2) then
Into:
IF_TOKEN LEFT_PAREN IDENTIFIER(a) PLUS_SIGN IDENTIFIER(b) GREATER_THAN NUMBER(2) RIGHT_PAREN THEN_TOKEN
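A hand-written lexer for just this example might look like the following Haskell sketch. It is not how GOLD or lex work internally; the Token type and the functions tokenize and keywordOrIdentifier are hypothetical names, and the lexer only knows the handful of symbols used in the example.

```haskell
import Data.Char (isAlpha, isAlphaNum, isDigit, isSpace)

data Token
  = IfToken | ThenToken | LeftParen | RightParen
  | PlusSign | GreaterThan
  | Identifier String | Number String
  deriving (Eq, Show)

-- Hypothetical lexer for the tiny example above.
-- Returns Left with the offending character if it meets something unknown.
tokenize :: String -> Either Char [Token]
tokenize [] = Right []
tokenize input@(c : cs)
  | isSpace c = tokenize cs
  | c == '('  = (LeftParen :) <$> tokenize cs
  | c == ')'  = (RightParen :) <$> tokenize cs
  | c == '+'  = (PlusSign :) <$> tokenize cs
  | c == '>'  = (GreaterThan :) <$> tokenize cs
  | isDigit c =
      let (digits, rest) = span isDigit input
      in (Number digits :) <$> tokenize rest
  | isAlpha c =
      let (word, rest) = span isAlphaNum input
      in (keywordOrIdentifier word :) <$> tokenize rest
  | otherwise = Left c

-- Keywords are recognised after the whole word has been read.
keywordOrIdentifier :: String -> Token
keywordOrIdentifier "if"   = IfToken
keywordOrIdentifier "then" = ThenToken
keywordOrIdentifier word   = Identifier word
```

Calling tokenize "if (a + b > 2) then" produces the nine tokens listed above, in the same order.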
The parser takes that stream of tokens and attempts to make yet more sense out of them. In this case, it would try to match those tokens up to an IF_STATEMENT. To the parser, the IF_STATEMENT may well look like this:
IF ( BOOLEAN_EXPRESSION ) THEN
Where the result of the lexing phase is a token stream, the result of the parsing phase is a Parse Tree.
So, a parser could convert the above into:
    if_statement
        |
        v
    boolean_expression.operator = GREATER_THAN
       |        |
       |        v
       v       numeric_constant.string = "2"
    expression.operator = PLUS_SIGN
       |     |
       |     v
       v   identifier.string = "b"
    identifier.string = "a"
Here you see we have an IF_STATEMENT. An IF_STATEMENT has a single argument, which is a BOOLEAN_EXPRESSION. This was explained in some manner to the parser. When the parser is converting the token stream, it "knows" what an IF looks like, and knows what a BOOLEAN_EXPRESSION looks like, so it can make the proper assignments when it sees the code.
For example, if you have just:
if (a + b) then
The parser could know that it's not a boolean expression (because the + is arithmetic, not a boolean operator) and could throw an error at this point.
Next, we see that a BOOLEAN_EXPRESSION has 3 components, the operator (GREATER_THAN), and two sides, the left side and the right side.
On the left side, it points to yet another expression, the "a + b", while on the right it points to a NUMERIC_CONSTANT, in this case the string "2". Again, the parser "knows" this is a NUMERIC_CONSTANT because we told it about strings of numbers. If it wasn't numbers, it would be an IDENTIFIER (like "a" and "b" are).
Note, that if we had something like:
if (a + b > "XYZ") then
That "parses" just fine (expression on the left, string constant on the right). We don't know from looking at this whether this is a valid expression or not. We don't know if "a" or "b" reference Strings or Numbers at this point. So, this is something the parser can't decide for us and can't flag as an error, as it simply doesn't know. That will happen when we evaluate (either execute or try to compile into code) the IF statement.
If we did:
if [a > b ) then
The parser can readily see that syntax error as a problem, and will throw an error. That string of tokens doesn't look like anything it knows about.
So, the point being that when you get a complete parse tree, you have some assurance that at first cut the "code looks good". Now during execution, other errors may well come up.
To evaluate the parse tree, you just walk the tree. You'll have some code associated with the major nodes of the parse tree during the compile or evaluation part. Let's assume that we have an interpreter.
public void execute_if_statement(ParseTreeNode node) {
    // We already know we have an IF_STATEMENT node
    Value value = evaluate_expression(node.getBooleanExpression());
    if (value.getBooleanResult()) {
        // we do the "then" part of the code
    }
}
public Value evaluate_expression(ParseTreeNode node) {
    // Leaves of the tree: constants and identifiers
    if (node.isConstant()) {
        return evaluate_constant(node);
    }
    if (node.isIdentifier()) {
        return lookupIdentifier(node);
    }
    // Everything else has an operator with a left and a right side
    Value leftSide = evaluate_expression(node.getLeftSide());
    Value rightSide = evaluate_expression(node.getRightSide());
    if (node.getOperator() == '+') {
        if (!leftSide.isNumber() || !rightSide.isNumber()) {
            throw new RuntimeError("Must have numbers for adding");
        }
        int l = leftSide.getIntValue();
        int r = rightSide.getIntValue();
        int sum = l + r;
        return new Value(sum);
    }
    if (node.getOperator() == '>') {
        if (leftSide.getType() != rightSide.getType()) {
            throw new RuntimeError("You can only compare values of the same type");
        }
        if (leftSide.isNumber()) {
            int l = leftSide.getIntValue();
            int r = rightSide.getIntValue();
            boolean greater = l > r;
            return new Value(greater);
        } else {
            // do string compare instead (assumes Value has a getStringValue() accessor)
            boolean greater = leftSide.getStringValue().compareTo(rightSide.getStringValue()) > 0;
            return new Value(greater);
        }
    }
    // Every path must return or throw, otherwise the method does not compile
    throw new RuntimeError("Unknown operator: " + node.getOperator());
}
So, you can see that we have a recursive evaluator here. You see how we're checking the run time types, and performing the basic evaluations.
What will happen is the execute_if_statement will evaluate its main expression. Even though we wanted only BOOLEAN_EXPRESSION in the parse, all expressions are mostly the same for our purposes. So, execute_if_statement calls evaluate_expression.
In our system, all expressions have an operator and a left and right side. Each side of an expression is ALSO an expression, so you can see how we immediately try and evaluate those as well to get their real value. The one note is that if the expression consists of a CONSTANT, then we simply return the constant's value, if it's an identifier, we look it up as a variable (and that would be a good place to throw a "I can't find the variable 'a'" message), otherwise we're back to the left side/right side thing.
I hope you can see how a simple evaluator can work once you have a parse tree from a parser. Note how during evaluation, the major elements of the language are in place, otherwise we'd have got a syntax error and never got to this phase. We can simply expect to "know" that when we have, for example, a PLUS operator, we're going to have 2 expressions, the left and right side. Or when we execute an IF statement, that we already have a boolean expression to evaluate. The parser is what does that heavy lifting for us.
Getting started with a new language can be a challenge, but you'll find once you get rolling, the rest becomes pretty straightforward and it's almost "magic" that it all works in the end.
I would recommend antlr.org for information; ANTLR is the free tool I would use for any parsing work.
GOLD can be used for any kind of application where you have to apply context-free grammars to input.
Elaboration:
Essentially, CFGs apply to all programming languages. So if you wanted to develop a scripting language for your company, you'd need to write a parser, or get a parsing program. Alternatively, if you wanted to have a semi-natural language for input for non-programmers in the company, you could use a parser to read that input and spit out more "machine-readable" data. Essentially, a context-free grammar allows you to describe far more inputs than a regular expression. The GOLD system apparently makes the parsing problem somewhat easier than lex/yacc (the UNIX standard programs for parsing).