I am currently attempting to create an extremely simple parser in F# using FsLex and FsYacc. At first, the only functionality I am trying to achieve is allowing the program to take in a string that represents addition of integers and output a result. For example, I would want the parser to be able to take in "5 + 2" and output the string "7". I am only interested in string arguments and outputs because I would like to import the parser into Excel using Excel DNA once I extend the functionality to support more operations. However, I am currently struggling to get even this simple integer addition to work correctly.
My lexer.fsl file looks like:
{
module lexer

open System
open Microsoft.FSharp.Text.Lexing
open Parser

let lexeme = LexBuffer<_>.LexemeString
let ops = ["+", PLUS] |> Map.ofList
}

let digit = ['0'-'9']
let operator = "+"
let integ = digit+

rule lang = parse
| integ    { INT (Int32.Parse (lexeme lexbuf)) }
| operator { ops.[lexeme lexbuf] }
My parser.fsy file looks like:
%{
open Program
%}

%token <int> INT
%token PLUS
%start input
%type <int> input

%%

input:
    exp { $1 }
;

exp:
    | INT          { $1 }
    | exp exp PLUS { $1 + $2 }
;
Additionally, I have a Program.fs file that acts like an (extremely small) AST:
module Program
type value =
    | Int of int

type op = Plus
Finally, I have the file Main.fs that is supposed to test out the functionality of the interpreter (as well as import the function into Excel).
module Main
open ExcelDna.Integration
open System
open Program
open Parser
[<ExcelFunction(Description = "")>]
let main () =
    let x = "5 + 2"
    let lexbuf = Microsoft.FSharp.Text.Lexing.LexBuffer<_>.FromString x
    let y = input lexer.lang lexbuf
    y
However, when I run this function, the parser doesn't work at all. When I build the project, the parser.fs and lexer.fs files are generated correctly. I feel there is something simple I am missing, but I have no idea how to make this work correctly.
I'm in the middle of learning how to parse simple programs.
This is my lexer.
{
open Parser
exception SyntaxError of string
}
let white = [' ' '\t']+
let blank = ' '
let identifier = ['a'-'z']
rule token = parse
| white {token lexbuf} (* skip whitespace *)
| '-' { HYPHEN }
| identifier {
let buf = Buffer.create 64 in
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf;
let content = (Buffer.contents buf) in
STRING(content)
}
| _ { raise (SyntaxError "Unknown stuff here") }
and scan_string buf = parse
| ['a'-'z']+ {
Buffer.add_string buf (Lexing.lexeme lexbuf);
scan_string buf lexbuf
}
| eof { () }
My "ast":
type t =
  | String of string
  | Array of t list
My parser:
%token <string> STRING
%token HYPHEN
%start <Ast.t> yaml
%%
yaml:
| scalar { $1 }
| sequence {$1}
;
sequence:
| sequence_items {
Ast.Array (List.rev $1)
}
;
sequence_items:
(* empty *) { [] }
| sequence_items HYPHEN scalar {
$3::$1
};
scalar:
| STRING { Ast.String $1 }
;
I'm currently at a point where I want to parse either plain 'strings', i.e. some text, or 'arrays' of 'strings', i.e. - item1 - item2.
When I compile the parser with Menhir I get:
Warning: production sequence -> sequence_items is never reduced.
Warning: in total, 1 productions are never reduced.
I'm pretty new to parsing. Why is this never reduced?
You declare that your entry point to the parser is called main
%start <Ast.t> main
But I can't see the main production in your code. Maybe the entry point is supposed to be yaml? If that is changed, does the error still persist?
Also, try adding an EOF token to your lexer and to the entry-level production, like this:
parse_yaml: yaml EOF { $1 }
See here for example: https://github.com/Virum/compiler/blob/28e807b842bab5dcf11460c8193dd5b16674951f/grammar.mly#L56
The Real World OCaml link below also discusses how to use EOF; I think this will solve your problem.
By the way, it's really cool that you are writing a YAML parser in OCaml. If made open source, it will be really useful to the community. Note that YAML is indentation-sensitive, so to parse it with Menhir you will need your lexer to produce some kind of INDENT and DEDENT tokens. Also, YAML is a strict superset of JSON, which means it might (or might not) make sense to start with a JSON subset and then expand it. Real World OCaml shows how to write a JSON parser using Menhir:
https://dev.realworldocaml.org/16-parsing-with-ocamllex-and-menhir.html
I'm trying to use FParsec to parse a file that consists of either float or int values. I'm facing two problems that I can't find a good solution for.
1
Both pint32 and pfloat will successfully parse the same string, but give different answers; e.g., pint32 will return 3 when parsing the string "3.0" and pfloat will return 3.0 when parsing the same string. Is it possible to try parsing a floating point value using pint32 and have it fail if the string is "3.0"?
In other words, is there a way to make the following code work:
let parseFloatOrInt lines =
    let rec loop intvalues floatvalues lines =
        match lines with
        | [] -> floatvalues, intvalues
        | line :: rest ->
            match run floatWs line with
            | Success (r, _, _) -> loop intvalues (r :: floatvalues) rest
            | Failure _ ->
                match run intWs line with
                | Success (r, _, _) -> loop (r :: intvalues) floatvalues rest
                | Failure _ -> loop intvalues floatvalues rest
    loop [] [] lines
This piece of code will correctly place all floating point values in the floatvalues list, but because pfloat returns "3.0" when parsing the string "3", all integer values will also be placed in the floatvalues list.
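For reference, here is a quick sketch of the behaviour described above (assuming FParsec's run from FParsec.CharParsers; the inputs are only illustrative):

open FParsec

run pint32 "3.0"  // Success: 3   (pint32 stops at the '.')
run pfloat "3.0"  // Success: 3.0
run pfloat "3"    // Success: 3.0 (an integer literal also parses as a float)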
2
The above code example seems a bit clumsy to me, so I'm guessing there must be a better way to do it. I considered combining them using choice; however, both parsers must return the same type for that to work. I guess I could make a discriminated union with one option for float and one for int and convert the output from pint32 and pfloat using the |>> operator. However, I'm wondering if there is a better solution? A minimal sketch of that idea is shown below.
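For concreteness, here is a minimal sketch of the discriminated-union approach (the names NumValue, IntVal, FloatVal and pNumValue are made up for illustration, and notFollowedByString is just one way to keep "3.0" out of the int branch):

open FParsec

type NumValue =
    | IntVal of int
    | FloatVal of float

let pNumValue : Parser<NumValue, unit> =
    choice [
        // accept an int only when no '.' follows, so "3.0" falls through to the float branch
        attempt ((pint32 .>> notFollowedByString ".") |>> IntVal)
        pfloat |>> FloatVal
    ]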
You're on the right path in thinking about defining domain data and separating the definition of the parsers from their application to source data. This seems to be a good approach, because as your real-life project grows, you will probably need more data types.
Here's how I would write it:
/// The resulting type, or DSL
type MyData =
    | IntValue of int
    | FloatValue of float
    | Error // special case for all parse failures

// Then, let's define the individual parsers:
let pMyInt =
    pint32
    |>> IntValue

// This is an alternative version of the float parser.
// It ensures that the value has a non-zero fractional part.
// Caveat: this naive approach treats values like 42.0 as integers.
let pMyFloat =
    pfloat
    >>= (fun x -> if x % 1.0 = 0.0 then fail "Not a float" else preturn (FloatValue x))

let pError =
    // this parser must consume some input,
    // otherwise, combined with `many`, it would hang in an endless loop
    skipAnyChar
    >>. preturn Error

// whitespace skipping, defined once here instead of inside each individual parser
let ws = spaces

// Now, the combined parser:
let pCombined =
    [ pMyFloat; pMyInt; pError ]    // note: future parsers will be added here;
                                    // mind the order, as the float must precede the int,
                                    // and Error must come last
    |> List.map (fun p -> p .>> ws) // skip trailing whitespace after each parser
    |> List.map attempt             // let a failed alternative backtrack cleanly
    |> choice                       // on each iteration, one of the parsers must succeed
    |> many                         // a loop
Note that the code above is capable of working with any source: strings, streams, or whatever. Your real app may need to work with files, but unit testing can be simplified by using just a string list.
// Now, applying the parser somewhere in the code:
let maybeParseResult =
    match run pCombined myStringData with
    | Success (result, _, _) -> Some result
    | Failure (_, _, _) -> None // or anything that indicates a general parse failure
UPD. I have edited the code according to the comments. pMyFloat was updated to ensure that the parsed value has a non-zero fractional part.
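As a quick illustration of how the combined parser behaves, here is a hypothetical run (the input string is made up):

match run pCombined "42 3.14 foo" with
| Success (result, _, _) -> printfn "%A" result
| Failure (msg, _, _) -> printfn "Parse error: %s" msg

// Expected output (sketch): [IntValue 42; FloatValue 3.14; Error; Error; Error]
// "42" fails pMyFloat (zero fractional part) and falls back to pMyInt,
// while each character of "foo" is consumed individually by pError.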
FParsec has the numberLiteral parser, which can be used to solve the problem.
As a start, you can use the example from the FParsec documentation for numberLiteral:
open FParsec
open FParsec.Primitives
open FParsec.CharParsers

type Number =
    | Int of int64
    | Float of float

// -?[0-9]+(\.[0-9]*)?([eE][+-]?[0-9]+)?
let numberFormat =
    NumberLiteralOptions.AllowMinusSign
    ||| NumberLiteralOptions.AllowFraction
    ||| NumberLiteralOptions.AllowExponent

let pnumber : Parser<Number, unit> =
    numberLiteral numberFormat "number"
    |>> fun nl ->
            if nl.IsInteger then Int (int64 nl.String)
            else Float (float nl.String)
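A quick usage sketch (the inputs are illustrative):

run pnumber "3"    // Success: Int 3L
run pnumber "3.0"  // Success: Float 3.0
run pnumber "-7"   // Success: Int -7L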
I'm implementing a scripting language with fslex / fsyacc and have a problem with large user inputs (scripts larger than about 1k): specifically, a StackOverflowException occurs when the lexer analyzes a very large string.
I scan strings in a custom lexer function (to allow users to escape characters like quotes with a backslash). Here's the function:
and myText pos builder = parse
| '\\' escapable { let char = lexbuf.LexemeChar(1)
                   builder.Append (escapeSpecialTextChar char) |> ignore
                   myText pos builder lexbuf
                 }
| newline        { newline lexbuf;
                   builder.Append Environment.NewLine |> ignore
                   myText pos builder lexbuf
                 }
| (whitespace | letter | digit)+
                 { builder.Append (lexeme lexbuf) |> ignore
                   myText pos builder lexbuf
                 } // scan as many regular characters as possible
| '"'            { TEXT (builder.ToString()) } // finished reading myText
| eof            { raise (LexerError (sprintf "The text started at line %d, column %d has not been closed with \"." pos.pos_lnum (pos.pos_cnum - pos.pos_bol))) }
| _              { builder.Append (lexbuf.LexemeChar 0) |> ignore
                   myText pos builder lexbuf
                 } // read all other characters individually
The function just interprets the various characters and then recursively calls itself, or returns the read string when it reads a closing quote. When the input is too large, a StackOverflowException will be thrown in either the _ rule or the (whitespace | letter | digit)+ rule.
The generated Lexer.fs contains this:
(* Rule myText *)
and myText pos builder (lexbuf : Microsoft.FSharp.Text.Lexing.LexBuffer<_>) = _fslex_myText pos builder 21 lexbuf
// [...]
and _fslex_myText pos builder _fslex_state lexbuf =
match _fslex_tables.Interpret(_fslex_state,lexbuf) with
| 0 -> (
# 105 "Lexer.fsl"
let char = lexbuf.LexemeChar(1) in
builder.Append (escapeSpecialTextChar char) |> ignore
myText pos builder lexbuf
// [...]
So the actual rule becomes _fslex_myText and myText calls it with some internal _fslex_state.
I assume this is the problem: myText is not tail-recursive because it is transformed into 2 functions that call each other, which at some point results in a StackOverflowException.
So my question really is:
1) Is my assumption correct? Or can the F# compiler optimize these two functions like it does with tail-recursive ones and generate a while loop instead of recursive calls? (Functional programming is still new to me.)
2) What can I do to make the function tail-recursive / prevent the StackOverflowException? The FsLex documentation is not that extensive, and I couldn't figure out what the official F# compiler does differently; obviously it has no problems with large strings, but its string rule also calls itself recursively, so I'm at a loss here :/
As written, your myText function appears to be OK -- the F# compiler should be able to make that tail-recursive.
One thing to check -- are you compiling/running this in Debug or Release configuration? By default, the F# compiler only does tail-call optimization in the Release configuration. If you want tail-calls while you're debugging, you need to explicitly enable that in the project's properties (on the Build tab).
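For illustration, here is a minimal sketch (not from the original answer) of mutually recursive functions whose calls are all in tail position; with tail-call emission enabled (Release builds, or the --tailcalls+ compiler flag) the F# compiler can emit these as .tail calls that do not grow the stack:

// Each recursive call is the last thing the function does, i.e. it is a tail call.
let rec isEven n = if n = 0 then true else isOdd (n - 1)
and isOdd n = if n = 0 then false else isEven (n - 1)

// With tail calls emitted, this runs in constant stack space;
// without them, a large n can overflow the stack, just like the lexer rule above.
printfn "%b" (isEven 1000000)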
I'm writing a parser for a language that is sufficiently simple for Genlex + camlp4 stream parsers to take care of it. However, I'd still be interested in having a more or less precise location (i.e. at least a line number) in case of a parsing error.
My idea is to use an intermediate stream, between the original char Stream and the token Stream of Genlex, that takes care of line counts, as in the code below, but I'm wondering whether there's a simpler solution?
let parse_file s =
let num_lines = ref 1 in
let bol = ref 0 in
let print_pos fmt i =
(* Emacs-friendly location *)
Printf.fprintf fmt "File %S, line %d, characters %d-%d:"
s !num_lines (i - !bol) (i - !bol)
in
(* Normal stream *)
let chan =
try open_in s
with
Sys_error e -> Printf.eprintf "Cannot open %s: %s\n%!" s e; exit 1
in
let chrs = Stream.of_channel chan in
(* Capture newlines and move num_lines and bol accordingly *)
let next i =
try
match Stream.next chrs with
| '\n' -> bol := i; incr num_lines; Some '\n'
| c -> Some c
with Stream.Failure -> None
in
let chrs = Stream.from next in
(* Pass that to the Genlex's lexer *)
let toks = lexer chrs in
let error s =
Printf.eprintf "%a\n%s %a\n%!"
print_pos (Stream.count chrs) s print_top toks;
exit 1
in
try
parse toks
with
| Stream.Failure -> error "Failure"
| Stream.Error e -> error ("Error " ^ e)
| Parsing.Parse_error -> error "Unexpected symbol"
A much simpler solution is to use Camlp4 grammars.
Parsers built this way allow one to get decent error messages "for free", unlike the case with stream parsers (which are a low-level tool).
It could be that there is no need to define your own lexer, because OCaml's lexer suits your needs already. But if you really need your own lexer, then you can easily plug in a custom one:
module Camlp4Loc = Camlp4.Struct.Loc
module Lexer = MyLexer.Make(Camlp4Loc)
module Gram = Camlp4.Struct.Grammar.Static.Make(Lexer)
open Lexer
let entry = Gram.Entry.mk "entry"
EXTEND Gram
entry: [ [ ... ] ];
END
let parse file str =
  Gram.parse entry (Loc.mk file) (Stream.of_string str)
If you are new to OCaml, then all this module system trickery might seem at first like black voodoo magic :-) The fact that Camlp4 is a severely underdocumented beast might also contribute to the surreality of the experience.
So never hesitate to ask a question (even a stupid one) on the mailing list.
I am writing a custom read function for one of the data types in my module. For example, when I do read "(1 + 1)" :: Data, I want it to return Plus 1 1. My data declaration is data Data = Plus Int Int. Thanks
This sounds like something better suited to a parser; Parsec is a powerful Haskell parser combinator library, which I would recommend.
I'd like to second the notion of using a parser. However, if you absolutely have to use pattern matching, go like this:
import Data.List

data Expr = Plus Int Int | Minus Int Int deriving Show

test = [ myRead "(1 + 1)", myRead "(2-1)" ]

myRead = match . lexer
  where
    match ["(", a, "+", b, ")"] = Plus (read a) (read b)
    match ["(", a, "-", b, ")"] = Minus (read a) (read b)
    match garbage = error $ "Cannot parse " ++ show garbage

    lexer = unfoldr next_lexeme
      where
        next_lexeme "" = Nothing
        next_lexeme str = Just $ head $ lex str
You could use GHC's ReadP.