What causes Happy to throw a parse error?

I've written a lexer in Alex and I'm trying to hook it up to a parser written in Happy. I'll try my best to summarize my problem without pasting huge chunks of code.
I know from my unit tests of my lexer that the string "\x7" is lexed to:
[TokenNonPrint '\x7', TokenEOF]
My token type (spit out by the lexer) is Token. I've defined lexWrap and alexEOF as described here, which gives me the following header and token declarations:
%name parseTokens
%tokentype { Token }
%lexer { lexWrap } { alexEOF }
%monad { Alex }
%error { parseError }
%token
NONPRINT { TokenNonPrint $$ }
PLAIN { TokenPlain $$ }
I invoke the parser+lexer combo with the following:
parseExpr :: String -> Either String [Expr]
parseExpr s = runAlex s parseTokens
And here are my first few productions:
exprs :: { [Expr] }
exprs
: {- empty -} { trace "exprs 30" [] }
| exprs expr { trace "exprs 31" $ $2 : $1 }
nonprint :: { Cmd }
: NONPRINT { NonPrint $ parseNonPrint $1}
expr :: { Expr }
expr
: nonprint {trace "expr 44" $ Cmd $ $1}
| PLAIN { trace "expr 37" $ Plain $1 }
I'll leave out the datatype declarations of Expr and NonPrint since they're long and only the constructors Cmd and NonPrint matter here. The function parseNonPrint is defined at the bottom of Parse.y as:
parseNonPrint :: Char -> NonPrint
parseNonPrint '\x7' = Bell
Also, my error handling function looks like:
parseError :: Token -> Alex a
parseError tokens = error ("Error processing token: " ++ show tokens)
Written like this, I expect the following hspec test to pass:
parseExpr "\x7" `shouldBe` Right [Cmd (NonPrint Bell)]
But instead, I see "exprs 30" print once (even though I'm running 5 different unit tests) and all of my tests of parseExpr return Right []. I don't understand why that would be the case, but I changed the exprs production to prevent it:
exprs :: { [Expr] }
exprs
: expr { trace "exprs 30" [$1] }
| exprs expr { trace "exprs 31" $ $2 : $1 }
Now all of my tests fail on the first token they hit --- parseExpr "\x7" fails with:
uncaught exception: ErrorCall (Error processing token: TokenNonPrint '\a')
And I'm thoroughly confused, since I would expect the parser to take the path exprs -> expr -> nonprint -> NONPRINT and succeed. I don't see why this input would put the parser in an error state. None of the trace statements are hit (optimized away?).
What am I doing wrong?

It turns out the cause of this error was the innocuous line
%lexer { lexWrap } { alexEOF }
which was recommended by the linked question about using Alex with Happy (unfortunately, one of the top Google results for queries like "using Alex as a monadic lexer with Happy"). The fix is to change it to the following:
%lexer { lexWrap } { TokenEOF }
I had to dig into the generated code to uncover the issue. It is caused by the code derived from the %token directive, which looks as follows (I commented out all of my token declarations except for TokenNonPrint while trying to track down the error):
happyNewToken action sts stk
    = lexWrap(\tk ->
        let cont i = happyDoAction i tk action sts stk in
        case tk of {
            alexEOF -> happyDoAction 2# tk action sts stk; -- !!!!
            TokenNonPrint happy_dollar_dollar -> cont 1#;
            _ -> happyError' tk
        })
Evidently, Happy transforms each line of the %token directive into one branch of a pattern match. It also inserts a branch for whatever was identified to it as the EOF token in the %lexer directive.
By inserting the name of a value, alexEOF, rather than a data constructor, TokenEOF, this branch of the case statement has the effect of re-binding the name alexEOF to whatever token lexWrap passes in. That shadows the original binding and short-circuits the case statement so that it hits the EOF rule every time, which somehow results in Happy entering an error state.
The mistake isn't caught by the type system, since the identifier alexEOF (or TokenEOF) doesn't appear anywhere else in the generated code. Misusing the %lexer directive like this will cause GHC to emit a warning, but, since the warning appears in generated code, it's impossible to distinguish it from all of the other harmless warnings the code throws out.

Related

Extract Token List from OCamllex lexbuf

I am writing a Python interpreter in OCaml using ocamllex, and in order to handle the indentation-based syntax, I want to
1. tokenize the input using ocamllex
2. iterate through the list of lexed tokens and insert INDENT and DEDENT tokens as needed for the parser
3. parse this list into an AST
However, in ocamllex, the lexing step produces a lexbuf stream which can't be easily iterated through to do the indentation checking. Is there a good way to extract a list of tokens from lexbuf, i.e.
let lexbuf = (Lexing.from_channel stdin) in
let token_list = tokenize lexbuf
where token_list has type Parser.token list? My hack was to define a trivial parser like
tokenize: /* used by the parser to read the input into the indentation function */
| token EOL { $1 @ [EOL] }
| EOL { SEP :: [EOL] }
token:
| COLON { [COLON] }
| TAB { [TAB] }
| RETURN { [RETURN] }
...
| token token %prec RECURSE { $1 @ $2 }
and to call this like
let lexbuf = (Lexing.from_channel stdin) in
let temp = (Parser.tokenize Scanner.token) lexbuf in (* char buffer to token list *)
but this has all sorts of issues with shift-reduce errors and unnecessary complexity. Is there a better way to write a lexbuf -> Parser.token list function in OCaml?

Error: popping nterm

I'm trying to understand the diagnostic messages given by Bison:
Entering state 5
Return for a new token:
Reading a token: Next token is token END_OF_FILE (4.0: )
Shifting token END_OF_FILE (4.0: )
Entering state 43
Reducing stack by rule 143 (line 331):
$1 = nterm syntax (0.0-17: )
$2 = nterm top_levels (0.18-4.0: )
$3 = token END_OF_FILE (4.0: )
-> $$ = nterm s (0.0-4.0: )
Stack now 0
Entering state 3
Return for a new token:
Reading a token: Next token is token END_OF_FILE (4.0: )
4/0: syntax error
Error: popping nterm s (0.0-4.0: )
Stack now 0
Cleanup: discarding lookahead token END_OF_FILE (4.0: )
Stack now 0
I cannot understand why, or what it is trying to do with the EOF token. Below are the Flex rules:
<<EOF>> { return END_OF_FILE; }
And the Bison rules:
top_level : message
| enum
| service
| import { $$ = Py_None; }
| package { $$ = Py_None; }
| option_def { $$ = Py_None; }
| ';' { $$ = Py_None; } ;
top_levels : %empty { $$ = py_list(Py_None); }
| top_levels top_level { $$ = py_append($1, $2); } ;
s : syntax top_levels END_OF_FILE { $$ = $2; } ;
And the output file generated by Bison:
State 3
0 $accept: s . $end
$end shift, and go to state 6
State 5
142 top_levels: top_levels . top_level
143 s: syntax top_levels . END_OF_FILE
BOOL shift, and go to state 9
... bunch of similar rules
END_OF_FILE shift, and go to state 43
';' shift, and go to state 44
import go to state 45
... bunch of similar rules
top_level go to state 55
State 6
0 $accept: s $end .
$default accept
I have no idea what's going on. Why does it report reading the EOF token twice? What exactly was the problem with popping s? To me it seems like it actually accepted the whole thing, and then decided to reject it because it read the token a second time... but the whole report is very confusing.
1. The problem
Don't do this:
<<EOF>> { return END_OF_FILE; }
Yacc/bison parsers augment grammars with an internal rule which produces the start symbol followed by an internal eof token called $end, whose token number is 0. (You can see this rule in states 3 and 6.) That is the only accepting rule in the grammar.
By default, (f)lex scanners return 0 when EOF is detected. So that all Just Works.
When you try to send a different token on EOF, you are attempting to defeat this mechanism, but it won't work, because the rule for your start symbol is not the accepting rule. After the start symbol is reduced, the parser tries to reduce the $accept rule, so it asks the scanner for another token. But the scanner has already hit EOF. In most cases, the scanner will execute the <<EOF>> action again (although this is not guaranteed), but that's not going to produce the $end token it needs. So you get a syntax error.
2. The underlying problem (maybe)
Normally, people try this in order to create a user action which runs when the input is accepted, typically in order to return the result of the parse to yyparse's caller through an "out" parameter. Trying to explicitly recognize an EOF token (or even the $end token) in the start production cannot work, but there is a much simpler solution: an extra unit rule:
%start return
%%
return: s { *out = $1; }
s: syntax top_levels { $$ = $2; }
Note that you could also do this without top_levels:
%start return
%%
return: s { *out = $1; }
s: syntax { $$ = py_list(Py_None); }
| s top_level { $$ = py_append($1, $2); }
An alternative is to use the special YYACCEPT action macro in the action for the start rule. However, I believe the standard solution outlined above is simpler because it doesn't require anything from the scanner.
3. The trace output
Error: popping nterm s (0.0-4.0: )
Means:
1. A syntax error was detected.
2. As part of error recovery, the parser popped the non-terminal s from the stack.
3. That non-terminal's source location extends from 0.0 to 4.0 (line.column).
If s (or its semantic type) had had a registered destructor, that would have run at step 2. You will probably want to register a destructor for semantic types which reference Python values, in order to decrement their reference counts so that you don't leak memory on syntax errors. But perhaps I'm wrong about that.
Also, you could register a %printer for the semantic value, in which case it would have been printed after the colon.

OCaml Menhir parser production is never reduced

I'm in the middle of learning how to parse simple programs.
This is my lexer.
{
open Parser
exception SyntaxError of string
}
let white = [' ' '\t']+
let blank = ' '
let identifier = ['a'-'z']
rule token = parse
| white {token lexbuf} (* skip whitespace *)
| '-' { HYPHEN }
| identifier {
let buf = Buffer.create 64 in
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf;
let content = (Buffer.contents buf) in
STRING(content)
}
| _ { raise (SyntaxError "Unknown stuff here") }
and scan_string buf = parse
| ['a'-'z']+ {
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf
}
| eof { () }
My "ast":
type t =
String of string
| Array of t list
My parser:
%token <string> STRING
%token HYPHEN
%start <Ast.t> yaml
%%
yaml:
| scalar { $1 }
| sequence {$1}
;
sequence:
| sequence_items {
Ast.Array (List.rev $1)
}
;
sequence_items:
(* empty *) { [] }
| sequence_items HYPHEN scalar {
$3::$1
};
scalar:
| STRING { Ast.String $1 }
;
I'm currently at a point where I want to parse either plain 'strings', i.e.
some text
or 'arrays' of 'strings', i.e.
- item1
- item2
When I compile the parser with Menhir I get:
Warning: production sequence -> sequence_items is never reduced.
Warning: in total, 1 productions are never reduced.
I'm pretty new to parsing. Why is this never reduced?
You declare that your entry point to the parser is called main
%start <Ast.t> main
But I can't see the main production in your code. Maybe the entry point is supposed to be yaml? If that is changed, does the error still persist?
Also, try adding an EOF token to your lexer and to the entry-level production, like this:
parse_yaml: yaml EOF { $1 }
See here for example: https://github.com/Virum/compiler/blob/28e807b842bab5dcf11460c8193dd5b16674951f/grammar.mly#L56
The link to Real World OCaml below also discusses how to use EOF; I think this will solve your problem.
By the way, it's really cool that you are writing a YAML parser in OCaml. If it's made open source, it will be really useful to the community. Note that YAML is indentation-sensitive, so to parse it with Menhir you will need your lexer to produce some kind of INDENT and DEDENT tokens. Also, YAML is a strict superset of JSON, which means it might (or might not) make sense to start with a JSON subset and then expand it. Real World OCaml shows how to write a JSON parser using Menhir:
https://dev.realworldocaml.org/16-parsing-with-ocamllex-and-menhir.html

The value of x is undefined here, so this reference is not allowed

I wrote a very simple parser combinator library which seems to work alright (https://github.com/mukeshsoni/tinyparsec).
I then tried writing parser for json with the library. The code for the json parser is here - https://github.com/mukeshsoni/tinyparsec/blob/master/src/example_parsers/JsonParser.purs
The grammar for json is recursive -
data JsonVal
  = JsonInt Int
  | JsonString String
  | JsonBool Boolean
  | JsonObj (List (Tuple String JsonVal))
Which means the parser for json object must again call the parser for jsonVal. The code for jsonObj parser looks like this -
jsonValParser
  = jsonIntParser <|> jsonBoolParser <|> jsonStringParser <|> jsonObjParser

propValParser :: Parser (Tuple String JsonVal)
propValParser = do
  prop <- stringLitParser
  _ <- symb ":"
  val <- jsonValParser
  pure (Tuple prop val)

listOfPropValParser :: Parser (List (Tuple String JsonVal))
listOfPropValParser = sepBy propValParser (symb ",")

jsonObjParser :: Parser JsonVal
jsonObjParser = do
  _ <- symb "{"
  propValList <- listOfPropValParser
  _ <- symb "}"
  pure (JsonObj propValList)
But when I try to build it, I get the following error: The value of propValParser is undefined here. So this reference is not allowed here
I found similar issues on Stack Overflow but could not understand why the error happens or how I should refactor my code so that it takes care of the recursive references from jsonValParser to propValParser.
Any help would be appreciated.
See https://stackoverflow.com/a/36991223/139614 for a similar case - you'll need to make use of the fix function, or introduce Unit -> ... in front of a parser somewhere to break the cyclic definition.
I managed to get rid of the error by wrapping the blocks that were throwing the error inside a do block and starting the do block with a no-op:
listOfPropValParser :: Parser (List (Tuple String JsonVal))
listOfPropValParser = do
  _ <- pure 1 -- does nothing but defer the execution of the second line
  sepBy propValParser (symb ",")
Had to do the same for jsonValParser.
jsonValParser = do
  _ <- pure 1
  jsonIntParser <|> jsonBoolParser <|> jsonStringParser <|> jsonObjParser
The idea is to defer the execution of the code that might lead to a cyclic dependency. The added line, _ <- pure 1, does exactly that. I think it might be doing the same thing as fix from Data.Fix or defer from Data.Lazy.

How to implement a BNF grammar tree for parsing input in GO?

The grammar for the type language is as follows:
TYPE ::= TYPEVAR | PRIMITIVE_TYPE | FUNCTYPE | LISTTYPE;
PRIMITIVE_TYPE ::= ‘int’ | ‘float’ | ‘long’ | ‘string’;
TYPEVAR ::= ‘`’ VARNAME; // Note, the character is a backwards apostrophe!
VARNAME ::= [a-zA-Z][a-zA-Z0-9]*; // Initial letter, then can have numbers
FUNCTYPE ::= ‘(‘ ARGLIST ‘)’ -> TYPE | ‘(‘ ‘)’ -> TYPE;
ARGLIST ::= TYPE ‘,’ ARGLIST | TYPE;
LISTTYPE ::= ‘[‘ TYPE ‘]’;
My input is like this: TYPE
for example, if I input (int,int)->float, this is valid. If I input ( [int] , int), it's a wrong type and invalid.
I need to parse input from the keyboard and decide if it's valid under this grammar (for later type inference). However, I don't know how to build this grammar with Go or how to parse the input byte by byte. Is there any hint or similar implementation? That would be really helpful.
For your purposes, the grammar of types looks simple enough that you should be able to write a recursive descent parser that roughly matches the shape of your grammar.
As a concrete example, let's say that we're recognizing a similar language.
TYPE ::= PRIMITIVETYPE | TUPLETYPE
PRIMITIVETYPE ::= 'int'
TUPLETYPE ::= '(' ARGLIST ')'
ARGLIST ::= TYPE ARGLIST | TYPE
Not quite exactly the same as your original problem, but you should be able to see the similarities.
A recursive descent parser consists of functions for each production rule.
func ParseType(???) error {
    ???
}

func ParsePrimitiveType(???) error {
    ???
}

func ParseTupleType(???) error {
    ???
}

func ParseArgList(???) error {
    ???
}
where we'll write ??? for the things we don't yet know how to fill in, until we get there. For now, we'll at least say that we get an error if we can't parse.
The input into each of the functions is some stream of tokens. In our case, those tokens consist of sequences of:
"int"
"("
")"
and we can imagine a Stream might be something that satisfies:
type Stream interface {
    Peek() string // peek at next token, stay where we are
    Next() string // pick next token, move forward
}
to let us walk sequentially through the token stream.
A lexer is responsible for taking something like a string or io.Reader and producing this stream of string tokens. Lexers are fairly easy to write: you can imagine just using regexps or something similar to break a string into tokens.
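To make that concrete, here is one way such a lexer might look for the three tokens above. The tokenize helper below is not part of the original answer, just a minimal sketch; a real lexer would also record source positions and report bad input rather than silently stopping.
// tokenize is a hypothetical helper that breaks an input string into the toy
// language's tokens: "int", "(" and ")". It skips spaces and stops at the
// first character it does not recognize.
func tokenize(input string) []string {
    var tokens []string
    for i := 0; i < len(input); {
        switch {
        case input[i] == ' ':
            i++ // skip whitespace
        case input[i] == '(' || input[i] == ')':
            tokens = append(tokens, input[i:i+1])
            i++
        case i+3 <= len(input) && input[i:i+3] == "int":
            tokens = append(tokens, "int")
            i += 3
        default:
            return tokens // unknown character: stop here (a real lexer would report an error)
        }
    }
    return tokens
}
For example, tokenize("(int int)") produces the slice ["(", "int", "int", ")"], which can be wrapped in the TokenSlice type defined in the complete program further down.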
Assuming we have a token stream, then a parser then just needs to deal with that stream and a very limited set of possibilities. As mentioned before, each production rule corresponds to a parsing function. Within a production rule, each alternative is a conditional branch. If the grammar is particularly simple (as yours is!), we can figure out which conditional branch to take.
For example, let's look at TYPE and its corresponding ParseType function:
TYPE ::= PRIMITIVETYPE | TUPLETYPE
PRIMITIVETYPE ::= 'int'
TUPLETYPE ::= '(' ARGLIST ')'
How might this corresponds to the definition of ParseType?
The production says that there are two possibilities: it can either be (1) primitive, or (2) tuple. We can peek at the token stream: if we see "int", then we know it's primitive. If we see a "(", then since the only possibility is that it's a tuple type, we can call the tuple-type parser function and let it do the dirty work.
It's important to note: if we see neither a "(" nor an "int", then something has gone horribly wrong! We know this just from looking at the grammar. We can see that every type must FIRST start with one of those two tokens.
Ok, let's write the code.
func ParseType(s Stream) error {
    peeked := s.Peek()
    if peeked == "int" {
        return ParsePrimitiveType(s)
    }
    if peeked == "(" {
        return ParseTupleType(s)
    }
    return fmt.Errorf("ParseType on %#v", peeked)
}
Parsing PRIMITIVETYPE and TUPLETYPE is equally direct.
func ParsePrimitiveType(s Stream) error {
    next := s.Next()
    if next == "int" {
        return nil
    }
    return fmt.Errorf("ParsePrimitiveType on %#v", next)
}

func ParseTupleType(s Stream) error {
    lparen := s.Next()
    if lparen != "(" {
        return fmt.Errorf("ParseTupleType on %#v", lparen)
    }
    err := ParseArgList(s)
    if err != nil {
        return err
    }
    rparen := s.Next()
    if rparen != ")" {
        return fmt.Errorf("ParseTupleType on %#v", rparen)
    }
    return nil
}
The only one that might cause some issues is the parser for argument lists. Let's look at the rule.
ARGLIST ::= TYPE ARGLIST | TYPE
If we try to write the function ParseArgList, we might get stuck because we don't yet know which choice to make. Do we go for the first, or the second choice?
Well, let's at least parse out the part that's common to both alternatives: the TYPE part.
func ParseArgList(s Stream) error {
    err := ParseType(s)
    if err != nil {
        return err
    }
    /// ... FILL ME IN. Do we call ParseArgList() again, or stop?
}
So we've parsed the prefix. If it was the second case, we're done. But what if it were the first case? Then we'd still have to read additional lists of types.
Ah, but if we are continuing to read additional types, then the stream must FIRST start with another type. And we know that all types FIRST start either with "int" or "(". So we can peek at the stream. Our decision whether or not we picked the first or second choice hinges just on this!
func ParseArgList(s Stream) error {
    err := ParseType(s)
    if err != nil {
        return err
    }
    peeked := s.Peek()
    if peeked == "int" || peeked == "(" {
        // alternative 1
        return ParseArgList(s)
    }
    // alternative 2
    return nil
}
Believe it or not, that's pretty much all we need. Here is working code.
package main

import "fmt"

type Stream interface {
    Peek() string
    Next() string
}

type TokenSlice []string

func (s *TokenSlice) Peek() string {
    return (*s)[0]
}

func (s *TokenSlice) Next() string {
    result := (*s)[0]
    *s = (*s)[1:]
    return result
}

func ParseType(s Stream) error {
    peeked := s.Peek()
    if peeked == "int" {
        return ParsePrimitiveType(s)
    }
    if peeked == "(" {
        return ParseTupleType(s)
    }
    return fmt.Errorf("ParseType on %#v", peeked)
}

func ParsePrimitiveType(s Stream) error {
    next := s.Next()
    if next == "int" {
        return nil
    }
    return fmt.Errorf("ParsePrimitiveType on %#v", next)
}

func ParseTupleType(s Stream) error {
    lparen := s.Next()
    if lparen != "(" {
        return fmt.Errorf("ParseTupleType on %#v", lparen)
    }
    err := ParseArgList(s)
    if err != nil {
        return err
    }
    rparen := s.Next()
    if rparen != ")" {
        return fmt.Errorf("ParseTupleType on %#v", rparen)
    }
    return nil
}

func ParseArgList(s Stream) error {
    err := ParseType(s)
    if err != nil {
        return err
    }
    peeked := s.Peek()
    if peeked == "int" || peeked == "(" {
        // alternative 1
        return ParseArgList(s)
    }
    // alternative 2
    return nil
}

func main() {
    fmt.Println(ParseType(&TokenSlice{"int"}))
    fmt.Println(ParseType(&TokenSlice{"(", "int", ")"}))
    fmt.Println(ParseType(&TokenSlice{"(", "int", "int", ")"}))
    fmt.Println(ParseType(&TokenSlice{"(", "(", "int", ")", "(", "int", ")", ")"}))
    // Should show error:
    fmt.Println(ParseType(&TokenSlice{"(", ")"}))
}
This is a toy parser, of course, because it does not handle certain kinds of errors very well (like premature end of input), and tokens should include not only their textual content but also their source location for good error reporting. For your own purposes, you'll also want to expand the parsers so that they don't just return an error, but also some kind of useful result from the parse (a sketch of one way to do this follows at the end of this answer).
This answer is just a sketch of how recursive descent parsers work. But you should really read a good compiler book to get the details, because you need them. The Dragon Book, for example, spends at least a good chapter on how to write recursive descent parsers, with plenty of technical details. In particular, you want to know about the concept of FIRST sets (which I hinted at), because you'll need to understand them to choose which alternative is appropriate when writing each of your parser functions.
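Following up on the note above about returning a useful result: here is one possible extension of the toy code, not part of the original answer, in which the type parser builds a small value describing what it parsed instead of only reporting success or failure. It reuses the Stream interface and the fmt import from the complete program above; the Type struct and ParseTypeValue function are hypothetical names introduced just for this sketch.
// Type is a hypothetical result value for the toy grammar: either the
// primitive "int" or a tuple of element types.
type Type struct {
    Name  string // "int" for the primitive type, "tuple" for a tuple type
    Elems []Type // element types when Name == "tuple"
}

// ParseTypeValue mirrors ParseType above, but returns a Type describing
// what was parsed instead of only an error.
func ParseTypeValue(s Stream) (Type, error) {
    peeked := s.Peek()
    if peeked == "int" {
        s.Next() // consume "int"
        return Type{Name: "int"}, nil
    }
    if peeked == "(" {
        s.Next() // consume "("
        var elems []Type
        for {
            elem, err := ParseTypeValue(s)
            if err != nil {
                return Type{}, err
            }
            elems = append(elems, elem)
            // Same FIRST-set check as ParseArgList: another type follows
            // only if the next token is "int" or "(".
            if p := s.Peek(); p != "int" && p != "(" {
                break
            }
        }
        if rparen := s.Next(); rparen != ")" {
            return Type{}, fmt.Errorf("ParseTypeValue expected ) but saw %#v", rparen)
        }
        return Type{Name: "tuple", Elems: elems}, nil
    }
    return Type{}, fmt.Errorf("ParseTypeValue on %#v", peeked)
}
Calling ParseTypeValue(&TokenSlice{"(", "int", "int", ")"}) would then yield a tuple value with two int elements, which later stages (such as type inference) can work with directly.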

Resources