Mid rule actions bison - what is the meaning $n in mid rule? - parsing

In given:
Var: Art { if( $1 == NULL) print("error"); } Asta
Art: /*epsilon*/
Asta: /*epsilon*/
What is the meaning of $1 in line number one (the line of Var: ...).
Because the mid-rule action, I not sure if $1 is meaning to the not-terminal Art or not.

This is explained in more detail in the bison manual section on Midrule actions.
Yes, in your example $1 refers to the value of the terminal Art.

Related

Is this reserve declaration hard to understand or defective?

I've got a problem with using a reserve (backslash) declaration for priority disambiguation. Below is a self-contained example. The production 'Ipv4Address' is a strict subset of 'Domain0'. In parsing URL's, though, you want dotted-quad addresses to be handled differently than domain names, so you want to split 'Domain0' into two parts; 'Domain1' is one of those two parts. The test suite included, however, is failing at 't3()', where 'Domain1' is accepting an IP address, which looks like it should be excluded.
Is this a problem with the reserve declaration, or is this a defect in the current version of Rascal? I'm on the 0.10.x unstable branch at present, per advice to see if that corrected a different problem (with the Tutor). I haven't checked with the stable branch since keeping them both installed means parallel Eclipse environments, which I haven't been motivated to do.
module grammar_test
import ParseTree;
syntax Domain0 = { Subdomain '.' }+;
syntax Domain1 = Domain0 \ IPv4Address ;
lexical Subdomain = [0-9A-Za-z]+ | [0-9A-Za-z]+'-'[a-zA-Z0-9\-]*[a-zA-Z0-9] ;
lexical IPv4Address = DecimalOctet '.' DecimalOctet '.' DecimalOctet '.' DecimalOctet ;
lexical DecimalOctet = [0-9] | [1-9][0-9] | '1'[0-9][0-9] | '2'[0-4][0-9] | '25'[0-5] ;
test bool t1()
{
return parseAccept(#IPv4Address, "192.168.0.1");
}
test bool t2()
{
return parseAccept(#Domain0, "192.168.0.1");
}
test bool t3()
{
return parseReject(#Domain1, "192.168.0.1");
}
bool parseAccept( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return false;
}
return true;
}
bool parseReject( type[&T<:Tree] begin, str input )
{
try
{
parse(begin, input, allowAmbiguity=false);
}
catch ParseError(loc _):
{
return true;
}
return false;
}
This example has been cut down from larger code. I first encountered the error in a larger scope. Using the rule "IPv4Address | Domain1" was throwing an Ambiguity exception, which I tracked down to the behavior that "Domain1" was accepting something it shouldn't be. Curiously "IPv4Address > Domain1" was also throwing Ambiguity, but I'm guessing this has the same root cause as the present isolated example.
The difference operator for keyword reservations currently only works correctly if the right-hand side is a finite language expressed as disjunction of literal keywords like "if" | "then" | "while" or a non-terminal which is defined like that: lexical X = "if" | "then" | "while". And then you can writeA \ X` for some effect.
For other types of non-terminals the parser is just generated but the \ constraint has no effect. You wrote Domain0 \ IPv4Address and IPv3Address does not hold to the above assumption.
(We should either add a warning about that or generate a parser which can implement the full semantics of language difference; but that's for another time).
Admittedly such a powerful difference operator could be used to express an some order of preference between non-terminals. Alas.
Possible (sketches of) solutions:
stage two passes solution: parse the input using the more general Subdomain syntax, then pattern and match rewrite in a single pass all quadruples to IPv4Address
maximal munch solution: adapt the grammar using follow restrictions to implement eager behavior for the IPv4Address, like {Subdomain !>> [.][0-9] "."}+ or something in that vain.

Parsing with Happy: left-recursion versus right-recursion

Section 2.2 of the Happy user manual advises to you use left recursion rather than right recursion, because right recursion is "inefficient". Basically they're saying that if you try to parse a long sequence of items, right recursion will overflow the parse stack, whereas left recursion uses constant stack. The canonical example given is
items : item { $1 : [] }
| items "," item { $3 : $1 }
Unfortunately, this means that the list of items comes out backwards.
Now it's easy enough to apply reverse at the end (although maddeningly annoying that you have to do this everywhere the parser is called, rather than once where it's defined). However, if the list of items is large, surely reverse is also going to overflow the Haskell stack? No?
Basically, how do I make it so that I can parse arbitrarily-large files and still get the results out in the correct order?
If all you want is the entire items to be reversed every time, you can define
items : items_ {reverse $1}
items_ : item { $1 : [] }
| items_ "," item { $3 : $1 }
reverse won't overflow the stack. If you need to convince yourself of this, try evaluating length $ reverse [1..10000000]

Happy: This is not a keyword

When you write a Happy description, you have to define all possible types of token that can appear. But you can only match against token types, not individual token values...
This is kind of problematic. Consider, for example, the data keyword. According to the Haskell Report, this token is a "reservedid". So my tokeniser recognises it and marks it as such. However, consider the as keyword. Now it turns out that this is not a reservedid; it's an ordinary varid. It's only special in one context. You can totally declare a normal variable named as, and it's fine.
So here's a question: How do I parse as specifically?
Initially I didn't really think about it. I just defined a new token type which represents any varid token who's text happens to be as.
...and then I spent about 2 hours trying to work out why the hell my grammar doesn't actually work. Yeah, it turns out that since this token type overlaps with an existing token type, the declaration order is significant. (!!!) Literally, changing the order of the declarations made the grammar parse perfectly.
But now I'm worried. I fear that as will never be matched as a varid and will only ever match as itself. So all the grammar rules that say varid will reject the as token — which is completely wrong!
What is the correct way to fix this?
What GHC does in its Parser.y is to define a nonterminal token type special_id that lists many of the special non-keywords like as, and then define the tyvarid and varid (nonterminal) tokens to include that as an option besides the terminal VARID (and some others, although most of them look to me like they should have been put in special_id too).
An excerpt:
varid :: { Located RdrName }
: VARID { sL1 $1 $! mkUnqual varName (getVARID $1) }
| special_id { sL1 $1 $! mkUnqual varName (unLoc $1) }
| 'unsafe' { sL1 $1 $! mkUnqual varName (fsLit "unsafe") }
...
special_id :: { Located FastString }
special_id
: 'as' { sL1 $1 (fsLit "as") }
| 'qualified' { sL1 $1 (fsLit "qualified") }
| 'hiding' { sL1 $1 (fsLit "hiding") }
| 'export' { sL1 $1 (fsLit "export") }
...

Parser in Happy

I'm trying to do a parser with Happy (Haskell Tool) But I'm getting a message error: "unused ruled: 11 and unused terminals: 10" and I don't know what this means. In other hand I'm really not sure about the use of $i parameters in the statements of the rules, I think my error is because of that. If any can help me...
It's not an error if you get these messages, it just means that part of your grammar is unused because it is not reachable from the start symbol. To see more information about how Happy understands your grammar, use the --info flag to Happy:
happy --info MyParser.y
which generates a file MyParser.info in addition to the usual MyParser.hs.
Unused rules and terminals are parts of your grammar for which there is no way to reach from the top level parse statements, if I recall correctly. To see how to use the $$ parameters, read the happy user guide.
The $$ symbol is a placeholder that
represents the value of this token.
Normally the value of a token is the
token itself, but by using the $$
symbol you can specify some component
of the token object to be the value.
Unused rules and terminals means you have described rules that can't be reached during parsing (pretty much like "if true then 1 else 2", the 2 branch will never be reached).
Check the output of --info for more details.
For the $$ thing, it is a data extractor: let's say you have a lexer that produces token
of the following type:
data TokenType = INT | SYM
data TokenLex = L TokenType String
where TokenType is here to distinguish usefull data and keywords.
In the action of your parser, you can extract the String part by using $$
%token INTEGER {L INT $$ }
%token OTHER {L _ $$}
foo : INTEGER bar INTEGER { read $1 + read $3 }
| ...
In this rule, $1 means "give me the content of the first INTEGER" and $3 "the content of the second INTEGER". $2 means "give me the content of bar (which may be another complex rule).
Thanks to $$, $1 and $3 are geniune Haskell String because we told Happy that "the content of an INTEGER is the "String" part of the TokenLex", not the whole Token.

Start states in Lex / Flex

I'm using Flex and Bison for a parser generator, but having problems with the start states in my scanner.
I'm using exclusive rules to deal with commenting, but this grammar doesn't seem to match quoted tokens:
%x COMMENT
// { BEGIN(COMMENT); }
<COMMENT>[^\n] ;
<COMMENT>\n { BEGIN(INITIAL); }
"==" { return EQUALEQUAL; }
. ;
In this simple example the line:
// a == b
isn't matched entirely as a comment, unless I include this rule:
<COMMENT>"==" ;
How do I get round this without having to add all these tokens into my exclusive rules?
Matching C-style comments in Lex/Flex or whatever is well documented:
in the documentation, as well as various variations around the Internet.
Here is a variation on that found in the Flex documentation:
<INITIAL>{
"//" BEGIN(IN_COMMENT);
}
<IN_COMMENT>{
\n BEGIN(INITIAL);
[^\n]+ // eat comment
"/" // eat the lone /
}
Try adding a "+" after the [^n] rule. I don't know why the exclusive state is still picking up '==' even in an exclusive state, but apparently it is. Flex will normally match the rule that matches the most text, and adding the "+" will at least make the two rules tie in length. Putting the COMMENT rule first will cause it to be used in case of a tie.
The clue is:
The problem is this 'eat comment'
rule doesn't seem to match tokens with
more than one character
so add a * to match zero or more non-newlines. You want Zero otherwise a empty comment will not match.
%x COMMENT
// { BEGIN(COMMENT); }
<COMMENT>[^\n]* ;
<COMMENT>\n { BEGIN(INITIAL); }
"==" { return EQUALEQUAL; }
. ;

Resources