Ocaml lexer / parser rules - stream

I wrote a program in ocaml that given an infix expression like 1 + 2, outputs the prefix notation : + 1 2
My problem is I don't find a way to make a rules like : all value, operator and bracket should be always separated by at least one space: 1+ 1 would be wrong 1 + 1 ok. I would like to not use the ocamlp4 grammar.
here is the code:
open Genlex
type tree =
| Leaf of string
| Node of tree * string * tree
let my_lexer str =
let kwds = ["("; ")"; "+"; "-"; "*"; "/"] in
make_lexer kwds (Stream.of_string str)
let make_tree_from_stream stream =
let op_parser operator_l higher_perm =
let rec aux left higher_perm = parser
[<'Kwd op when List.mem op operator_l; right = higher_perm; s >]
-> aux (Node (left, op, right)) higher_perm s
| [< >]
-> left
in
parser [< left = higher_perm; s >] -> aux left higher_perm s
in
let rec high_perm l = op_parser ["*"; "/"] brackets l
and low_perm l = op_parser ["+"; "-"] high_perm l
and brackets = parser
| [< 'Kwd "("; e = low_perm; 'Kwd ")" >] -> e
| [< 'Ident n >] -> Leaf n
| [< 'Int n >] -> Leaf (string_of_int n)
in
low_perm stream
let rec draw_tree = function
| Leaf n -> Printf.printf "%s" n
| Node(fg, r, fd) -> Printf.printf "(%s " (r);
draw_tree fg;
Printf.printf " ";
draw_tree fd;
Printf.printf ")"
let () =
let line = read_line() in
draw_tree (make_tree_from_stream (my_lexer line)); Printf.printf "\n"
Plus if you have some tips about the code or if you notice some error of prog style then I will appreciate that you let it me know. Thanks !

The Genlex provides a ready-made lexer that respects OCaml's lexical convention, and in particular ignore the spaces in the positions you mention. I don't think you can implement what you want on top of it (it is not designed as a flexible solution, but a quick way to get a prototype working).
If you want to keep writing stream parsers, you could write your own lexer for it: define a token type, and lex a char Stream.t into a token Stream.t, which you can then parse as you wish. Otherwise, if you don't want to use Camlp4, you may want to try an LR parser generator, such as menhir (a better ocamlyacc).

Related

An elegant way to parse sexp

sexp is like this: type sexp = Atom of string | List of sexp list, e.g., "((a b) ((c d) e) f)".
I have written a parser to parse a sexp string to the type:
let of_string s =
let len = String.length s in
let empty_buf () = Buffer.create 16 in
let rec parse_atom buf i =
if i >= len then failwith "cannot parse"
else
match s.[i] with
| '(' -> failwith "cannot parse"
| ')' -> Atom (Buffer.contents buf), i-1
| ' ' -> Atom (Buffer.contents buf), i
| c when i = len-1 -> (Buffer.add_char buf c; Atom (Buffer.contents buf), i)
| c -> (Buffer.add_char buf c; parse_atom buf (i+1))
and parse_list acc i =
if i >= len || (i = len-1 && s.[i] <> ')') then failwith "cannot parse"
else
match s.[i] with
| ')' -> List (List.rev acc), i
| '(' ->
let list, j = parse_list [] (i+1) in
parse_list (list::acc) (j+1)
| c ->
let atom, j = parse_atom (empty_buf()) i in
parse_list (atom::acc) (j+1)
in
if s.[0] <> '(' then
let atom, j = parse_atom (empty_buf()) 0 in
if j = len-1 then atom
else failwith "cannot parse"
else
let list, j = parse_list [] 1 in
if j = len-1 then list
else failwith "cannot parse"
But I think it is too verbose and ugly.
Can someone help me with an elegant way to write such a parser?
Actually, I always have problems in writing code of parser and what I could do only is write such a ugly one.
Any tricks for this kind of parsing? How to effectively deal with symbols, such as (, ), that implies recursive parsing?
You can use a lexer+parser discipline to separate the details of lexical syntax (skipping spaces, mostly) from the actual grammar structure. That may seem overkill for such a simple grammar, but it's actually better as soon as the data you parse has the slightest chance of being wrong: you really want error location (and not to implement them yourself).
A technique that is easy and gives short parsers is to use stream parsers (using a Camlp4 extension for them described in the Developping Applications with Objective Caml book); you may even get a lexer for free by using the Genlex module.
If you want to do really do it manually, as in your example above, here is my recommendation to have a nice parser structure. Have mutually recursive parsers, one for each category of your syntax, with the following interface:
parsers take as input the index at which to start parsing
they return a pair of the parsed value and the first index not part of the value
nothing more
Your code does not respect this structure. For example, you parser for atoms will fail if it sees a (. That is not his role and responsibility: it should simply consider that this character is not part of the atom, and return the atom-parsed-so-far, indicating that this position is not in the atom anymore.
Here is a code example in this style for you grammar. I have split the parsers with accumulators in triples (start_foo, parse_foo and finish_foo) to factorize multiple start or return points, but that is only an implementation detail.
I have used a new feature of 4.02 just for fun, match with exception, instead of explicitly testing for the end of the string. It is of course trivial to revert to something less fancy.
Finally, the current parser does not fail if the valid expression ends before the end of the input, it only returns the end of the input on the side. That's helpful for testing but you would do it differently in "production", whatever that means.
let of_string str =
let rec parse i =
match str.[i] with
| exception _ -> failwith "unfinished input"
| ')' -> failwith "extraneous ')'"
| ' ' -> parse (i+1)
| '(' -> start_list (i+1)
| _ -> start_atom i
and start_list i = parse_list [] i
and parse_list acc i =
match str.[i] with
| exception _ -> failwith "unfinished list"
| ')' -> finish_list acc (i+1)
| ' ' -> parse_list acc (i+1)
| _ ->
let elem, j = parse i in
parse_list (elem :: acc) j
and finish_list acc i =
List (List.rev acc), i
and start_atom i = parse_atom (Buffer.create 3) i
and parse_atom acc i =
match str.[i] with
| exception _ -> finish_atom acc i
| ')' | ' ' -> finish_atom acc i
| _ -> parse_atom (Buffer.add_char acc str.[i]; acc) (i + 1)
and finish_atom acc i =
Atom (Buffer.contents acc), i
in
let result, rest = parse 0 in
result, String.sub str rest (String.length str - rest)
Note that it is an error to reach the end of input when parsing a valid expression (you must have read at least one atom or list) or when parsing a list (you must have encountered the closing parenthesis), yet it is valid at the end of an atom.
This parser does not return location information. All real-world parsers should do so, and this is enough of a reason to use a lexer/parser approach (or your preferred monadic parser library) instead of doing it by hand. Returning location information here is not terribly difficult, though, just duplicate the i parameter into the index of the currently parsed character, on one hand, and the first index used for the current AST node, on the other; whenever you produce a result, the location is the pair (first index, last valid index).

F# idiomatic way of transforming text

Myello! So I am looking for a concise, efficient an idiomatic way in F# to parse a file or a string. I have a strong preference to treat the input as a sequence of char (char seq). The idea is that every function is responsible to parse a piece of the input, return the converted text tupled with the unused input and be called by a higher level function that chains the unused input to the following functions and use the results to build a compound type. Every parsing function should therefore have a signature similar to this one: char seq -> char seq * 'a . If, for example, the function's responsibility is simply to extract the first word, then, one approach would be the following:
let parseFirstWord (text: char seq) =
let rec forTailRecursion t acc =
let c = Seq.head t
if c = '\n' then
(t, acc)
else
forTailRecursion (Seq.skip 1 t) (c::acc)
let rest, reversedWord = forTailRecursion text []
(rest, List.reverse reversedWord)
Now, of course the main problem with this approach is that it extracts the word in reverse order and so you have to reverse it. Its main advantages however are that is uses strictly functional features and proper tail recursion. One could avoid the reversing of the extracted value while losing tail recursion:
let rec parseFirstWord (text: char seq) =
let c = Seq.head t
if c = '\n' then
(t, [])
else
let rest, tail = parseFirstWord (Seq.skip 1 t)
(rest, (c::tail))
Or use a fast mutable data structure underneath instead of using purely functional features, such as:
let parseFirstWord (text: char seq) =
let rec forTailRecursion t queue =
let c = Seq.head t
if c = '\n' then
(t, queue)
else
forTailRecursion (Seq.skip 1 t) (queue.Enqueu(c))
forTailRecursion text (new Queue<char>())
I have no idea how to use OO concepts in F# mind you so corrections to the above code are welcome.
Being new to this language, I would like to be guided in terms of the usual compromises that an F# developer makes. Among the suggested approaches and your own, which should I consider more idiomatic and why? Also, in that particular case, how would you encapsulate the return value: char seq * char seq, char seq * char list or evenchar seq * Queue<char>? Or would you even consider char seq * String following a proper conversion?
I would definitely have a look at FSLex. FSYacc, FParsec. However if you just want to tokenize a seq<char> you can use a sequence expression to generate tokens in the right order. Reusing your idea of a recursive inner function, and combinining with a sequence expression, we can stay tail recursive like shown below, and avoid non-idiomatic tools like mutable data structures.
I changed the separator char for easy debugging and the signature of the function. This version produces a seq<string> (your tokens) as result, which is probably easier to consume than a tuple with the current token and the rest of the text. If you just want the first token, you can just take the head. Note that the sequence is generated 'on demand', i.e. the input is parsed only as tokens are consumed through the sequence. Should you need the remainder of the input text next to each token, you can yield a pair in loop instead, but I'm guessing the downstream consumer most likely wouldn't (furthermore, if the input text is itself a lazy sequence, possibly linked to a stream, we don't want to expose it as it should be iterated through only in one place).
let parse (text : char seq) =
let rec loop t acc =
seq {
if Seq.isEmpty t then yield acc
else
let c, rest = Seq.head t, Seq.skip 1 t
if c = ' ' then
yield acc
yield! loop rest ""
else yield! loop rest (acc + string c)
}
loop text ""
parse "The FOX is mine"
val it : seq<string> = seq ["The"; "FOX"; "is"; "mine"]
This is not the only 'idiomatic' way of doing this in F#. Every time we need to process a sequence, we can look at the functions made available in the Seq module. The most general of these is fold which iterates through a sequence once, accumulating a state at each element by running a given function. In the example below accumulate is such a function, that progressively builds the resulting sequence of tokens. Since Seq.fold doesn't run the accumulator function on an empty sequence, we need the last two lines to extract the last token from the function's internal accumulator.
This second implementation keeps the nice characteriestics of the first, i.e. tail recursion (inside the fold implementation, if I'm not mistaken) and processing of the input sequence on demand. It also happens to be shorter, albeit a bit less readable probably.
let parse2 (text : char seq) =
let accumulate (res, acc) c =
if c = ' ' then (Seq.append res (Seq.singleton acc), "")
else (res, acc + string c)
let (acc, last) = text |> Seq.fold accumulate (Seq.empty, "")
Seq.append acc (Seq.singleton last)
parse2 "The FOX is mine"
val it : seq<string> = seq ["The"; "FOX"; "is"; "mine"]
One way of lexing/parsing in a way truly unique to F# is by using active patterns. The following simplified example shows the general idea. It can process a calculation string of arbitrary length without producing a stack overflow.
let rec (|CharOf|_|) set = function
| c :: rest when Set.contains c set -> Some(c, rest)
| ' ' :: CharOf set (c, rest) -> Some(c, rest)
| _ -> None
let rec (|CharsOf|) set = function
| CharOf set (c, CharsOf set (cs, rest)) -> c::cs, rest
| rest -> [], rest
let (|StringOf|_|) set = function
| CharsOf set (_::_ as cs, rest) -> Some(System.String(Array.ofList cs), rest)
| _ -> None
type Token =
| Int of int
| Add | Sub | Mul | Div | Mod
| Unknown
let lex: string -> _ =
let digits = set ['0'..'9']
let ops = Set.ofSeq "+-*/%"
let rec lex chars =
seq { match chars with
| StringOf digits (s, rest) -> yield Int(int s); yield! lex rest
| CharOf ops (c, rest) ->
let op =
match c with
| '+' -> Add | '-' -> Sub | '*' -> Mul | '/' -> Div | '%' -> Mod
| _ -> failwith "invalid operator char"
yield op; yield! lex rest
| [] -> ()
| _ -> yield Unknown }
List.ofSeq >> lex
lex "1234 + 514 / 500"
// seq [Int 1234; Add; Int 514; Div; Int 500]

Create discriminated union data from file/database

I have a discriminated union for expressions like this one (EQ =; GT >; etc)
(AND (OR (EQ X 0)
(GT X 10))
(OR (EQ Y 0)
(GT Y 10)))
I want to create instances of DU from such expressions saved in file/database.
How do i do it? If it is not feasible, what is the best way to approach it in F#?
Daniel: these expressions are saved in prefix format (as above) as text and will be parsed in F#. Thanks.
If you just want to know how to model these expressions using DUs, here's one way:
type BinaryOp =
| EQ
| GT
type Expr =
| And of Expr * Expr
| Or of Expr * Expr
| Binary of BinaryOp * Expr * Expr
| Var of string
| Value of obj
let expr =
And(
Or(
Binary(EQ, Var("X"), Value(0)),
Binary(GT, Var("X"), Value(10))),
Or(
Binary(EQ, Var("Y"), Value(0)),
Binary(GT, Var("Y"), Value(10))))
Now, this may be too "loose," i.e., it permits expressions like And(Value(1), Value(2)), which may not be valid according to your grammar. But this should give you an idea of how to approach it.
There are also some good examples in the F# Programming wikibook.
If you need to parse these expressions, I highly recommend FParsec.
Daniel's answer is good. Here's a similar approach, along with a simple top-down parser built with active patterns:
type BinOp = | And | Or
type Comparison = | Gt | Eq
type Expr =
| BinOp of BinOp * Expr * Expr
| Comp of Comparison * string * int
module private Parsing =
// recognize and strip a leading literal
let (|Lit|_|) lit (s:string) =
if s.StartsWith(lit) then Some(s.Substring lit.Length)
else None
// strip leading whitespace
let (|NoWs|) (s:string) =
s.TrimStart(' ', '\t', '\r', '\n')
// parse a binary operator
let (|BinOp|_|) = function
| Lit "AND" r -> Some(And, r)
| Lit "OR" r -> Some(Or, r)
| _ -> None
// parse a comparison operator
let (|Comparison|_|) = function
| Lit "GT" r -> Some(Gt, r)
| Lit "EQ" r -> Some(Eq, r)
| _ -> None
// parse a variable (alphabetical characters only)
let (|Var|_|) s =
let m = System.Text.RegularExpressions.Regex.Match(s, "^[a-zA-Z]+")
if m.Success then
Some(m.Value, s.Substring m.Value.Length)
else
None
// parse an integer
let (|Int|_|) s =
let m = System.Text.RegularExpressions.Regex.Match(s, #"^-?\d+")
if m.Success then
Some(int m.Value, s.Substring m.Value.Length)
else
None
// parse an expression
let rec (|Expr|_|) = function
| NoWs (Lit "(" (BinOp (b, Expr(e1, Expr(e2, Lit ")" rest))))) ->
Some(BinOp(b, e1, e2), rest)
| NoWs (Lit "(" (Comparison (c, NoWs (Var (v, NoWs (Int (i, Lit ")" rest))))))) ->
Some(Comp(c, v, i), rest)
| _ -> None
let parse = function
| Parsing.Expr(e, "") -> e
| s -> failwith (sprintf "Not a valid expression: %s" s)
let e = parse #"
(AND (OR (EQ X 0)
(GT X 10))
(OR (EQ Y 0)
(GT Y 10)))"

F# How to tokenise user input: separating numbers, units, words?

I am fairly new to F#, but have spent the last few weeks reading reference materials. I wish to process a user-supplied input string, identifying and separating the constituent elements. For example, for this input:
XYZ Hotel: 6 nights at 220EUR / night
plus 17.5% tax
the output should resemble something like a list of tuples:
[ ("XYZ", Word); ("Hotel:", Word);
("6", Number); ("nights", Word);
("at", Operator); ("220", Number);
("EUR", CurrencyCode); ("/",
Operator); ("night", Word);
("plus", Operator); ("17.5",
Number); ("%", PerCent); ("tax",
Word) ]
Since I'm dealing with user input, it could be anything. Thus, expecting users to comply with a grammar is out of the question. I want to identify the numbers (could be integers, floats, negative...), the units of measure (optional, but could include SI or Imperial physical units, currency codes, counts such as "night/s" in my example), mathematical operators (as math symbols or as words including "at" "per", "of", "discount", etc), and all other words.
I have the impression that I should use active pattern matching -- is that correct? -- but I'm not exactly sure how to start. Any pointers to appropriate reference material or similar examples would be great.
I put together an example using the FParsec library. The example is not robust at all but it gives a pretty good picture of how to use FParsec.
type Element =
| Word of string
| Number of string
| Operator of string
| CurrencyCode of string
| PerCent of string
let parsePerCent state =
(parse {
let! r = pstring "%"
return PerCent r
}) state
let currencyCodes = [|
pstring "EUR"
|]
let parseCurrencyCode state =
(parse {
let! r = choice currencyCodes
return CurrencyCode r
}) state
let operators = [|
pstring "at"
pstring "/"
|]
let parseOperator state =
(parse {
let! r = choice operators
return Operator r
}) state
let parseNumber state =
(parse {
let! e1 = many1Chars digit
let! r = opt (pchar '.')
let! e2 = manyChars digit
return Number (e1 + (if r.IsSome then "." else "") + e2)
}) state
let parseWord state =
(parse {
let! r = many1Chars (letter <|> pchar ':')
return Word r
}) state
let elements = [|
parseOperator
parseCurrencyCode
parseWord
parseNumber
parsePerCent
|]
let parseElement state =
(parse {
do! spaces
let! r = choice elements
do! spaces
return r
}) state
let parseElements state =
manyTill parseElement eof state
let parse (input:string) =
let result = run parseElements input
match result with
| Success (v, _, _) -> v
| Failure (m, _, _) -> failwith m
It sounds like what you really want is just a lexer. A good alternative to FSParsec would be FSLex. (Good intro tutorial, albiet somewhat dated, can be found on my old blog here.) Using FSLex you can take your input text:
XYZ Hotel: 6 nights at 220EUR / night plus 17.5% tax
And get it properly tokenized into something like:
[ Word("XYZ"); Hotel; Int(6); Word("nights"); Word("at"); Int(220); EUR; ... ]
The next step, once you have an List of tokens, is to do some form of pattern matching / analysis to extract semantic information (which I assume is what you are really after). With the normalized token stream, it should be as simple as:
let rec processTokenList tokens =
match tokens with
| Float(x) :: Keyword("EUR") :: rest -> // Dollar amount x
| Word(x) :: Keyword("Hotel") :: rest -> // Hotel x
| hd :: rest -> // Couldn't find anything interesting...
processTokenList rest
That should at least get you started. But note that as your input gets more 'formal', so will the usefulness of your lexing. (And if you only accept a very specific input, then you can use a proper parser and be done with it!)

Parsing grammars using OCaml

I have a task to write a (toy) parser for a (toy) grammar using OCaml and not sure how to start (and proceed with) this problem.
Here's a sample Awk grammar:
type ('nonterm, 'term) symbol = N of 'nonterm | T of 'term;;
type awksub_nonterminals = Expr | Term | Lvalue | Incrop | Binop | Num;;
let awksub_grammar =
(Expr,
function
| Expr ->
[[N Term; N Binop; N Expr];
[N Term]]
| Term ->
[[N Num];
[N Lvalue];
[N Incrop; N Lvalue];
[N Lvalue; N Incrop];
[T"("; N Expr; T")"]]
| Lvalue ->
[[T"$"; N Expr]]
| Incrop ->
[[T"++"];
[T"--"]]
| Binop ->
[[T"+"];
[T"-"]]
| Num ->
[[T"0"]; [T"1"]; [T"2"]; [T"3"]; [T"4"];
[T"5"]; [T"6"]; [T"7"]; [T"8"]; [T"9"]]);;
And here's some fragments to parse:
let frag1 = ["4"; "+"; "3"];;
let frag2 = ["9"; "+"; "$"; "1"; "+"];;
What I'm looking for is a rulelist that is the result of the parsing a fragment, such as this one for frag1 ["4"; "+"; "3"]:
[(Expr, [N Term; N Binop; N Expr]);
(Term, [N Num]);
(Num, [T "3"]);
(Binop, [T "+"]);
(Expr, [N Term]);
(Term, [N Num]);
(Num, [T "4"])]
The restriction is to not use any OCaml libraries other than List... :/
Here is a rough sketch - straightforwardly descend into the grammar and try each branch in order. Possible optimization : tail recursion for single non-terminal in a branch.
exception Backtrack
let parse l =
let rules = snd awksub_grammar in
let rec descend gram l =
let rec loop = function
| [] -> raise Backtrack
| x::xs -> try attempt x l with Backtrack -> loop xs
in
loop (rules gram)
and attempt branch (path,tokens) =
match branch, tokens with
| T x :: branch' , h::tokens' when h = x ->
attempt branch' ((T x :: path),tokens')
| N n :: branch' , _ ->
let (path',tokens) = descend n ((N n :: path),tokens) in
attempt branch' (path', tokens)
| [], _ -> path,tokens
| _, _ -> raise Backtrack
in
let (path,tail) = descend (fst awksub_grammar) ([],l) in
tail, List.rev path
Ok, so the first think you should do is write a lexical analyser. That's the
function that takes the ‘raw’ input, like ["3"; "-"; "("; "4"; "+"; "2"; ")"],
and splits it into a list of tokens (that is, representations of terminal symbols).
You can define a token to be
type token =
| TokInt of int (* an integer *)
| TokBinOp of binop (* a binary operator *)
| TokOParen (* an opening parenthesis *)
| TokCParen (* a closing parenthesis *)
and binop = Plus | Minus
The type of the lexer function would be string list -> token list and the ouput of
lexer ["3"; "-"; "("; "4"; "+"; "2"; ")"]
would be something like
[ TokInt 3; TokBinOp Minus; TokOParen; TokInt 4;
TBinOp Plus; TokInt 2; TokCParen ]
This will make the job of writing the parser easier, because you won't have to
worry about recognising what is a integer, what is an operator, etc.
This is a first, not too difficult step because the tokens are already separated.
All the lexer has to do is identify them.
When this is done, you can write a more realistic lexical analyser, of type string -> token list, that takes a actual raw input, such as "3-(4+2)" and turns it into a token list.
I'm not sure if you specifically require the derivation tree, or if this is a just a first step in parsing. I'm assuming the latter.
You could start by defining the structure of the resulting abstract syntax tree by defining types. It could be something like this:
type expr =
| Operation of term * binop * term
| Term of term
and term =
| Num of num
| Lvalue of expr
| Incrop of incrop * expression
and incrop = Incr | Decr
and binop = Plus | Minus
and num = int
Then I'd implement a recursive descent parser. Of course it would be much nicer if you could use streams combined with the preprocessor camlp4of...
By the way, there's a small example about arithmetic expressions in the OCaml documentation here.

Resources