Extract Token List from OCamllex lexbuf

I am writing a Python interpreter in OCaml using ocamllex, and in order to handle the indentation-based syntax, I want to:
1. tokenize the input using ocamllex,
2. iterate through the list of lexed tokens and insert INDENT and DEDENT tokens as needed for the parser,
3. parse this list into an AST.
However, in ocamllex, the lexing step produces a lexbuf stream which can't be easily iterated through to do the indentation checking. Is there a good way to extract a list of tokens from lexbuf, i.e.
let lexbuf = (Lexing.from_channel stdin) in
let token_list = tokenize lexbuf
where token_list has type Parser.token list? My hack was to define a trivial parser like
tokenize: /* used by the parser to read the input into the indentation function */
| token EOL { $1 @ [EOL] }
| EOL { SEP :: [EOL] }
token:
| COLON { [COLON] }
| TAB { [TAB] }
| RETURN { [RETURN] }
...
| token token %prec RECURSE { $1 @ $2 }
and to call this like
let lexbuf = (Lexing.from_channel stdin) in
let temp = (Parser.tokenize Scanner.token) lexbuf in (* char buffer to token list *)
but this has all sorts of issues with shift/reduce conflicts and unnecessary complexity. Is there a better way to write a lexbuf -> Parser.token list function in OCaml?
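A common alternative is to skip the dummy grammar entirely and drive the lexer in a plain recursive loop. A minimal sketch (assuming the lexer entry point is Scanner.token and the grammar defines an EOF token; substitute whichever end-of-input token your lexer actually emits):

let tokenize lexbuf =
  let rec loop acc =
    match Scanner.token lexbuf with
    | Parser.EOF as tok -> List.rev (tok :: acc)  (* keep EOF for the parser *)
    | tok -> loop (tok :: acc)
  in
  loop []

After the indentation pass has inserted INDENT/DEDENT tokens, the list can be handed back to an ocamlyacc- or Menhir-generated parser through a stateful token supplier. Parser.program below is a placeholder for your actual start symbol:

let parse_token_list tokens =
  let remaining = ref tokens in
  let next _lexbuf =
    match !remaining with
    | [] -> Parser.EOF  (* keep answering EOF once the list is exhausted *)
    | tok :: rest -> remaining := rest; tok
  in
  Parser.program next (Lexing.from_string "")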

ocaml menhir parser production is never reduced

I'm in the middle of learning how to parse simple programs.
This is my lexer.
{
open Parser
exception SyntaxError of string
}
let white = [' ' '\t']+
let blank = ' '
let identifier = ['a'-'z']
rule token = parse
| white {token lexbuf} (* skip whitespace *)
| '-' { HYPHEN }
| identifier {
let buf = Buffer.create 64 in
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf;
let content = (Buffer.contents buf) in
STRING(content)
}
| _ { raise (SyntaxError "Unknown stuff here") }
and scan_string buf = parse
| ['a'-'z']+ {
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf
}
| eof { () }
My "ast":
type t =
String of string
| Array of t list
My parser:
%token <string> STRING
%token HYPHEN
%start <Ast.t> yaml
%%
yaml:
| scalar { $1 }
| sequence {$1}
;
sequence:
| sequence_items {
Ast.Array (List.rev $1)
}
;
sequence_items:
(* empty *) { [] }
| sequence_items HYPHEN scalar {
$3::$1
};
scalar:
| STRING { Ast.String $1 }
;
I'm currently at a point where I want to parse either plain 'strings', e.g. some text, or 'arrays' of 'strings', e.g. - item1 - item2.
When I compile the parser with Menhir I get:
Warning: production sequence -> sequence_items is never reduced.
Warning: in total, 1 productions are never reduced.
I'm pretty new to parsing. Why is this never reduced?
You declare that your entry point to the parser is called main
%start <Ast.t> main
But I can't see a main production in your code. Maybe the entry point is supposed to be yaml? If that is changed, does the error still persist?
Also, try adding an EOF token to your lexer and to the entry-level production, like this:
parse_yaml: yaml EOF { $1 }
See here for an example: https://github.com/Virum/compiler/blob/28e807b842bab5dcf11460c8193dd5b16674951f/grammar.mly#L56
The link to Real World OCaml below also discusses how to use EOF; I think this will solve your problem.
By the way, it is really cool that you are writing a YAML parser in OCaml. If made open source, it will be really useful to the community. Note that YAML is indentation-sensitive, so to parse it with Menhir you will need your lexer to produce some kind of INDENT and DEDENT tokens. Also, YAML is a strict superset of JSON, which means it might (or might not) make sense to start with a JSON subset and then expand it. Real World OCaml shows how to write a JSON parser using Menhir:
https://dev.realworldocaml.org/16-parsing-with-ocamllex-and-menhir.html
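Concretely, the two suggested changes fit together like this (a sketch: the rest of the grammar stays as posted, %start now points at the new wrapper production, and the lexer is assumed to gain an eof case):

In the lexer:

| eof { EOF }

In the parser:

%token <string> STRING
%token HYPHEN
%token EOF
%start <Ast.t> parse_yaml
%%
parse_yaml:
| yaml EOF { $1 }
;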

What causes Happy to throw a parse error?

I've written a lexer in Alex and I'm trying to hook it up to a parser written in Happy. I'll try my best to summarize my problem without pasting huge chunks of code.
I know from my unit tests of my lexer that the string "\x7" is lexed to:
[TokenNonPrint '\x7', TokenEOF]
My token type (spit out by the lexer) is Token. I've defined lexWrap and alexEOF as described here, which gives me the following header and token declarations:
%name parseTokens
%tokentype { Token }
%lexer { lexWrap } { alexEOF }
%monad { Alex }
%error { parseError }
%token
NONPRINT { TokenNonPrint $$ }
PLAIN { TokenPlain $$ }
I invoke the parser+lexer combo with the following:
parseExpr :: String -> Either String [Expr]
parseExpr s = runAlex s parseTokens
And here are my first few productions:
exprs :: { [Expr] }
exprs
: {- empty -} { trace "exprs 30" [] }
| exprs expr { trace "exprs 31" $ $2 : $1 }
nonprint :: { Cmd }
nonprint
: NONPRINT { NonPrint $ parseNonPrint $1 }
expr :: { Expr }
expr
: nonprint {trace "expr 44" $ Cmd $ $1}
| PLAIN { trace "expr 37" $ Plain $1 }
I'll leave out the datatype declarations of Expr and NonPrint since they're long and only the constructors Cmd and NonPrint matter here. The function parseNonPrint is defined at the bottom of Parse.y as:
parseNonPrint :: Char -> NonPrint
parseNonPrint '\x7' = Bell
Also, my error handling function looks like:
parseError :: Token -> Alex a
parseError tokens = error ("Error processing token: " ++ show tokens)
Written like this, I expect the following hspec test to pass:
parseExpr "\x7" `shouldBe` Right [Cmd (NonPrint Bell)]
But instead, I see "exprs 30" print once (even though I'm running 5 different unit tests) and all of my tests of parseExpr return Right []. I don't understand why that would be the case, but I changed the exprs production to prevent it:
exprs :: { [Expr] }
exprs
: expr { trace "exprs 30" [$1] }
| exprs expr { trace "exprs 31" $ $2 : $1 }
Now all of my tests fail on the first token they hit --- parseExpr "\x7" fails with:
uncaught exception: ErrorCall (Error processing token: TokenNonPrint '\a')
And I'm thoroughly confused, since I would expect the parser to take the path exprs -> expr -> nonprint -> NONPRINT and succeed. I don't see why this input would put the parser in an error state. None of the trace statements are hit (optimized away?).
What am I doing wrong?
It turns out the cause of this error was the innocuous line
%lexer { lexWrap } { alexEOF }
which was recommended by the linked question about using Alex with Happy (unfortunately, one of the top Google results for queries like "using Alex as a monadic lexer with Happy"). The fix is to change it to the following:
%lexer { lexWrap } { TokenEOF }
I had to dig into the generated code to uncover the issue. It is caused by the code derived from the %token directive, which looks as follows (I commented out all of my token declarations except for TokenNonPrint while trying to track down the error):
happyNewToken action sts stk
= lexWrap(\tk ->
let cont i = happyDoAction i tk action sts stk in
case tk of {
alexEOF -> happyDoAction 2# tk action sts stk; -- !!!!
TokenNonPrint happy_dollar_dollar -> cont 1#;
_ -> happyError' tk
})
Evidently, Happy transforms each line of the %token directive into one branch of a pattern match. It also inserts a branch for whatever was identified to it as the EOF token in the %lexer directive.
By inserting the name of a value, alexEOF, rather than a data constructor, TokenEOF, this branch of the case statement has the effect of re-binding the name alexEOF to whatever token was passed in to lexWrap. This shadows the original binding and short-circuits the case statement so that it hits the EOF rule every time, which somehow results in Happy entering an error state.
The mistake isn't caught by the type system, since the identifier alexEOF (or TokenEOF) doesn't appear anywhere else in the generated code. Misusing the %lexer directive like this will cause GHC to emit a warning, but, since the warning appears in generated code, it's impossible to distinguish it from all of the other harmless warnings the code throws out.
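The shadowing behaviour is ordinary pattern-matching semantics, not something specific to Happy. The same pitfall can be shown in a few lines of OCaml (an illustration of the rule, not the generated Haskell): a lowercase identifier in a pattern is a fresh binding, never a comparison against an existing value.

let alexEOF = 0

let classify tk =
  match tk with
  | alexEOF -> "eof"   (* binds tk to a fresh alexEOF; matches every input *)
  | _ -> "other"       (* unreachable; the compiler warns about this case *)

Here classify returns "eof" for every argument, just as the generated parser took the EOF branch for every token.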

Creating Simple Parser in F#

I am currently attempting to create an extremely simple parser in F# using FsLex and FsYacc. At first, the only functionality I am trying to achieve is allowing the program to take in a string that represents addition of integers and output a result. For example, I would want the parser to be able to take in "5 + 2" and output the string "7". I am only interested in string arguments and outputs because I would like to import the parser into Excel using Excel DNA once I extend the functionality to support more operations. However, I am currently struggling to get even this simple integer addition to work correctly.
My lexer.fsl file looks like:
{
module lexer
open System
open Microsoft.FSharp.Text.Lexing
open Parser
let lexeme = LexBuffer<_>.LexemeString
let ops = ["+", PLUS;] |> Map.ofList
}
let digit = ['0'-'9']
let operator = "+"
let integ = digit+
rule lang = parse
| integ { INT(Int32.Parse(lexeme lexbuf)) }
| operator { ops.[lexeme lexbuf] }
My parser.fsy file looks like:
%{
open Program
%}
%token <int>INT
%token PLUS
%start input
%type <int> input
%%
input:
exp {$1}
;
exp:
| INT { $1 }
| exp exp PLUS { $1 + $2 }
;
Additionally, I have a Program.fs file that acts like an (extremely small) AST:
module Program
type value =
| Int of int
type op = Plus
Finally, I have the file Main.fs that is supposed to test out the functionality of the interpreter (as well as import the function into Excel).
module Main
open ExcelDna.Integration
open System
open Program
open Parser
[<ExcelFunction(Description = "")>]
let main () =
let x = "5 + 2"
let lexbuf = Microsoft.FSharp.Text.Lexing.LexBuffer<_>.FromString x
let y = input lexer.lang lexbuf
y
However, when I run this function, the parser doesn't work at all. When I build the project, the parser.fs and lexer.fs files are created correctly. I feel that there is something simple I am missing, but I have no idea how to make this work correctly.
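No answer is included for this question, but two problems are visible in the files as posted, so here is a sketch of likely fixes (an inference from the code above, not a verified solution). First, rule lang has no case for whitespace, so the spaces in "5 + 2" match no rule at all; second, exp exp PLUS describes postfix notation ("5 2 +") while the input is infix, and the parser never sees an end-of-input token.

In lexer.fsl, skip blanks and report end of input:

| [' ' '\t']+ { lang lexbuf }
| eof { EOF }

In parser.fsy, declare EOF, make the operator infix, and resolve the resulting ambiguity with an associativity declaration:

%token EOF
%left PLUS
...
input:
| exp EOF { $1 }
;
exp:
| INT { $1 }
| exp PLUS exp { $1 + $3 }
;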

How to make FsLex rules tail-recursive / prevent StackOverflowExceptions

I'm implementing a scripting language with fslex / fsyacc and have a problem with large user inputs (i.e. >1k scripts), specifically this error occurs when the lexer analyzes a very large string.
I scan strings in a custom lexer function (to allow users to escape characters like quotes with a backslash). Here's the function:
and myText pos builder = parse
| '\\' escapable { let char = lexbuf.LexemeChar(1)
builder.Append (escapeSpecialTextChar char) |> ignore
myText pos builder lexbuf
}
| newline { newline lexbuf;
builder.Append Environment.NewLine |> ignore
myText pos builder lexbuf
}
| (whitespace | letter | digit)+ { builder.Append (lexeme lexbuf) |> ignore
myText pos builder lexbuf
} // scan as many regular characters as possible
| '"' { TEXT (builder.ToString()) } // finished reading myText
| eof { raise (LexerError (sprintf "The text started at line %d, column %d has not been closed with \"." pos.pos_lnum (pos.pos_cnum - pos.pos_bol))) }
| _ { builder.Append (lexbuf.LexemeChar 0) |> ignore
myText pos builder lexbuf
} // read all other characters individually
The function just interprets the various characters and then recursively calls itself – or returns the read string when it reads a closing quote. When the input is too large, a StackOverflowException will be thrown either in the _ or (whitespace | ... rule.
The generated Lexer.fs contains this:
(* Rule myText *)
and myText pos builder (lexbuf : Microsoft.FSharp.Text.Lexing.LexBuffer<_>) = _fslex_myText pos builder 21 lexbuf
// [...]
and _fslex_myText pos builder _fslex_state lexbuf =
match _fslex_tables.Interpret(_fslex_state,lexbuf) with
| 0 -> (
# 105 "Lexer.fsl"
let char = lexbuf.LexemeChar(1) in
builder.Append (escapeSpecialTextChar char) |> ignore
myText pos builder lexbuf
// [...]
So the actual rule becomes _fslex_myText and myText calls it with some internal _fslex_state.
I assume this is the problem: myText is not tail-recursive because it is transformed into 2 functions that call each other, which at some point results in a StackOverflowException.
So my question really is:
1) Is my assumption correct? Or can the F# compiler optimize these 2 functions like it does with tail-recursive ones and generate a while-loop instead of recursive calls? (Functional programming is still new to me.)
2) What can I do to make the function tail-recursive / prevent the StackOverflowException? The FsLex documentation is not that extensive, and I couldn't figure out what the official F# compiler does differently – obviously it has no problems with large strings, but its string rule also calls itself recursively, so I'm at a loss here :/
As written, your myText function appears to be OK -- the F# compiler should be able to make that tail-recursive.
One thing to check -- are you compiling/running this in Debug or Release configuration? By default, the F# compiler only does tail-call optimization in the Release configuration. If you want tail-calls while you're debugging, you need to explicitly enable that in the project's properties (on the Build tab).
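If you build from the command line instead, the corresponding switch should be the fsc --tailcalls+ flag (surfaced as the Tailcalls property in the project file).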

Why is this fsyacc input producing F# that does not compile?

My fsyacc code is giving a compiler error saying a variable is not found, but I'm not sure why. I was hoping someone could point out the issue.
%{
open Ast
%}
// The start token becomes a parser function in the compiled code:
%start start
// These are the terminal tokens of the grammar along with the types of
// the data carried by each token:
%token NAME
%token ARROW TICK VOID
%token LPAREN RPAREN
%token EOF
// This is the type of the data produced by a successful reduction of the 'start'
// symbol:
%type < Query > start
%%
// These are the rules of the grammar along with the F# code of the
// actions executed as rules are reduced. In this case the actions
// produce data using F# data construction terms.
start: Query { Terms($1) }
Query:
| Term EOF { $1 }
Term:
| VOID { Void }
| NAME { Conc($1) }
| TICK NAME { Abst($2) }
| LPAREN Term RPAREN { Lmda($2) }
| Term ARROW Term { TermList($1, $3) }
The line | NAME {Conc($1)} and the following line both give this error:
error FS0039: The value or constructor '_1' is not defined
I understand the syntactic issue, but what's wrong with the yacc input?
If it helps, here is the Ast definition:
namespace Ast
open System
type Query =
| Terms of Term
and Term =
| Void
| Conc of String
| Abst of String
| Lmda of Term
| TermList of Term * Term
And the fslex input:
{
module Lexer
open System
open Parser
open Microsoft.FSharp.Text.Lexing
let lexeme lexbuf =
LexBuffer<char>.LexemeString lexbuf
}
// These are some regular expression definitions
let name = ['a'-'z' 'A'-'Z' '0'-'9']
let whitespace = [' ' '\t' ]
let newline = ('\n' | '\r' '\n')
rule tokenize = parse
| whitespace { tokenize lexbuf }
| newline { tokenize lexbuf }
// Operators
| "->" { ARROW }
| "'" { TICK }
| "void" { VOID }
// Misc
| "(" { LPAREN }
| ")" { RPAREN }
// Names
| name+ { NAME }
// EOF
| eof { EOF }
This is not FsYacc's fault. NAME is a valueless token, so there is no value for $1 to refer to; fsyacc only generates a _1 binding for tokens declared with a payload, which is why the error complains about _1.
You'd want to do these fixes:
%token NAME
to
%token <string> NAME
and
| name+ { NAME }
to
| name+ { NAME (lexeme lexbuf) }
Everything should now compile.
