F# How to tokenise user input: separating numbers, units, words?

I am fairly new to F#, but have spent the last few weeks reading reference materials. I wish to process a user-supplied input string, identifying and separating the constituent elements. For example, for this input:
XYZ Hotel: 6 nights at 220EUR / night
plus 17.5% tax
the output should resemble something like a list of tuples:
[ ("XYZ", Word); ("Hotel:", Word);
  ("6", Number); ("nights", Word);
  ("at", Operator); ("220", Number);
  ("EUR", CurrencyCode); ("/", Operator);
  ("night", Word); ("plus", Operator);
  ("17.5", Number); ("%", PerCent);
  ("tax", Word) ]
Since I'm dealing with user input, it could be anything. Thus, expecting users to comply with a grammar is out of the question. I want to identify the numbers (could be integers, floats, negative...), the units of measure (optional, but could include SI or Imperial physical units, currency codes, counts such as "night/s" in my example), mathematical operators (as math symbols or as words including "at" "per", "of", "discount", etc), and all other words.
I have the impression that I should use active pattern matching -- is that correct? -- but I'm not exactly sure how to start. Any pointers to appropriate reference material or similar examples would be great.

I put together an example using the FParsec library. The example is not robust at all but it gives a pretty good picture of how to use FParsec.
open FParsec
type Element =
    | Word of string
    | Number of string
    | Operator of string
    | CurrencyCode of string
    | PerCent of string
let parsePerCent state =
    (parse {
        let! r = pstring "%"
        return PerCent r
    }) state
let currencyCodes = [|
    pstring "EUR"
|]
let parseCurrencyCode state =
    (parse {
        let! r = choice currencyCodes
        return CurrencyCode r
    }) state
let operators = [|
    pstring "at"
    pstring "/"
|]
let parseOperator state =
    (parse {
        let! r = choice operators
        return Operator r
    }) state
let parseNumber state =
    (parse {
        let! e1 = many1Chars digit
        let! r = opt (pchar '.')
        let! e2 = manyChars digit
        return Number (e1 + (if r.IsSome then "." else "") + e2)
    }) state
let parseWord state =
    (parse {
        let! r = many1Chars (letter <|> pchar ':')
        return Word r
    }) state
// Order matters: choice tries the parsers from first to last
let elements = [|
    parseOperator
    parseCurrencyCode
    parseWord
    parseNumber
    parsePerCent
|]
let parseElement state =
    (parse {
        do! spaces
        let! r = choice elements
        do! spaces
        return r
    }) state
let parseElements state =
    manyTill parseElement eof state
let parse (input:string) =
    let result = run parseElements input
    match result with
    | Success (v, _, _) -> v
    | Failure (m, _, _) -> failwith m
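Since the question asks specifically about active patterns, here is a minimal library-free sketch of that route. Everything in it is an assumption made for illustration: the Kind type, the two lookup tables, the IsNumber and OneOf patterns, and the regular expression used to split the input. A real tokeniser would need much larger tables and more careful rules (for example, "5-3" is read here as "5" followed by "-3").

```fsharp
open System.Text.RegularExpressions

type Kind = Word | Number | Operator | CurrencyCode | PerCent

// Hypothetical lookup tables; real ones would be far larger.
let currencyCodes = set [ "EUR"; "USD"; "GBP" ]
let operatorWords = set [ "at"; "per"; "of"; "plus"; "/"; "+"; "-"; "*" ]

// Partial active pattern: succeeds when the whole token is a decimal number.
let (|IsNumber|_|) (s: string) =
    if Regex.IsMatch(s, @"^-?\d+(\.\d+)?$") then Some s else None

// Parameterised partial active pattern: succeeds when the token is in the table.
let (|OneOf|_|) (table: Set<string>) (s: string) =
    if table.Contains s then Some s else None

// Classify one raw token; anything unrecognised falls through to Word.
let classify token =
    match token with
    | IsNumber n -> n, Number
    | "%" -> "%", PerCent
    | OneOf currencyCodes c -> c, CurrencyCode
    | OneOf operatorWords o -> o, Operator
    | w -> w, Word

// Split into signed numbers, letter runs (with an optional trailing ':'),
// or single non-space symbols, then classify each piece.
let tokenise (input: string) =
    Regex.Matches(input, @"-?\d+(?:\.\d+)?|[A-Za-z]+:?|[^\sA-Za-z\d]")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> classify m.Value)
    |> List.ofSeq
```

Calling tokenise "XYZ Hotel: 6 nights at 220EUR / night" splits "220EUR" into a Number and a CurrencyCode without requiring the user to follow any grammar, because unknown text simply becomes a Word.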

It sounds like what you really want is just a lexer. A good alternative to FParsec would be FsLex. (A good intro tutorial, albeit somewhat dated, can be found on my old blog here.) Using FsLex you can take your input text:
XYZ Hotel: 6 nights at 220EUR / night plus 17.5% tax
And get it properly tokenized into something like:
[ Word("XYZ"); Hotel; Int(6); Word("nights"); Word("at"); Int(220); EUR; ... ]
The next step, once you have a list of tokens, is to do some form of pattern matching / analysis to extract semantic information (which I assume is what you are really after). With the normalized token stream, it should be as simple as:
let rec processTokenList tokens =
    match tokens with
    | Float(x) :: Keyword("EUR") :: rest -> // Currency amount x (in EUR)
        processTokenList rest
    | Word(x) :: Keyword("Hotel") :: rest -> // Hotel x
        processTokenList rest
    | hd :: rest -> // Couldn't find anything interesting...
        processTokenList rest
    | [] -> () // End of input
That should at least get you started. But note that as your input gets more 'formal', so will the usefulness of your lexing. (And if you only accept a very specific input, then you can use a proper parser and be done with it!)
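To make the pattern-matching step concrete, here is a small self-contained sketch. The Token type and the findAmounts function are hypothetical names invented for this example; a lexer generated by FsLex would define its own token type, so treat this only as an illustration of matching on adjacent tokens.

```fsharp
// Hypothetical token type; a generated lexer would supply its own.
type Token =
    | Word of string
    | Int of int
    | Float of float
    | Keyword of string

// Collect every "<number> EUR" pair as an amount and skip everything else.
let rec findAmounts tokens =
    match tokens with
    | Float x :: Keyword "EUR" :: rest -> x :: findAmounts rest
    | Int n :: Keyword "EUR" :: rest -> float n :: findAmounts rest
    | _ :: rest -> findAmounts rest
    | [] -> []

let tokens =
    [ Word "XYZ"; Keyword "Hotel"; Int 6; Word "nights"; Word "at"
      Int 220; Keyword "EUR"; Word "/"; Word "night" ]

printfn "%A" (findAmounts tokens)
```

Only the Int 220 that is directly followed by Keyword "EUR" is picked up; the Int 6 in front of "nights" is ignored because no pattern mentions it.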

Related

How to parse recursive grammar in FParsec

Previous questions which I could not use to get this to work
Recursive grammars in FParsec
Seems to be an old question which was asked before createParserForwardedToRef was added to FParsec
AST doesn't seem to be as horribly recursive as mine.
Parsing in to a recursive data structure
Grammar relies on a special character '[' to indicate another nesting level. I don't have this luxury
I want to build a sort of Lexer and project system for a language I have found myself writing lately. The language is called q. It is a fairly simple language and has no operator precedence. For example 1*2+3 is the same as (1*(2+3)). It works a bit like a reverse polish notation calculator: evaluation is right to left.
I am having trouble expressing this in FParsec. I have put together the following simplified demo
open FParsec
type BinaryOperator = BinaryOperator of string
type Number = Number of string
type Element =
    | Number of Number
and Expression =
    | Element of Element
    | BinaryExpression of Element * BinaryOperator * Expression
let number = regex "\d+\.?\d*" |>> Number.Number
let element = [ number ] |> choice |>> Element.Number
let binaryOperator = ["+"; "-"; "*"; "%"] |> Seq.map pstring |> choice |>> BinaryOperator
let binaryExpression expression = pipe3 element binaryOperator expression (fun l o r -> (l,o,r))
let expression =
    let exprDummy, expRef = createParserForwardedToRef()
    let elemExpr = element |>> Element
    let binExpr = binaryExpression exprDummy |>> BinaryExpression
    expRef.Value <- [binExpr; elemExpr; ] |> choice
    expRef
let statement = expression.Value .>> eof
let parseString s =
    printfn "Parsing input: '%s'" s
    match run statement s with
    | Success(result, _, _) -> printfn "Ok: %A" result
    | Failure(errorMsg, _, _) -> printfn "Error: %A" errorMsg
//tests
parseString "1.23"
parseString "1+1"
parseString "1*2+3" // equivalent to (1*(2+3))
So far, I haven't been able to come up with a way to satisfy all 3 test cases. In the above, it tries to parse binExpr first, realises it can't, but then must be consuming the input because it doesn't try to evaluate elemExpr next. Not sure what to do. How do I satisfy the 3 tests?
Meditating on Tomas' answer, I have come up with the following that works
let expr, expRef = createParserForwardedToRef()
let binRightExpr = binaryOperator .>>. expr
expRef.Value <- parse {
    let! first = element
    return! choice [
        binRightExpr |>> (fun (o, r) -> (first, o, r) |> BinaryExpression)
        preturn (first |> Element)
    ]
}
let statement = expRef.Value .>> eof
The reason the first parser failed is given in the FParsec docs:
The behaviour of the <|> combinator has two important characteristics:
<|> only tries the parser on the right side if the parser on the left side fails. It does not implement a longest match rule.
However, it only tries the right parser if the left parser fails without consuming input.
Probably need to clean up a few things like the structure of the AST but I think I am good to go.
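The right-nested shape that the working parser produces (an element first, then optionally an operator and the rest of the expression) can be illustrated without FParsec. The sketch below is only an illustration: the Tok and Expr types and the build and eval functions are made-up names, the input is assumed to be already tokenised, and '%' is treated as division on the assumption that it follows q's convention.

```fsharp
// Hypothetical pre-tokenised input: numbers and single-character operators.
type Tok =
    | N of float
    | Op of char

type Expr =
    | Num of float
    | Bin of float * char * Expr   // left operand, operator, everything to the right

// Read one number, then either stop or recurse on what follows the operator.
let rec build tokens =
    match tokens with
    | [ N x ] -> Num x
    | N x :: Op o :: rest -> Bin (x, o, build rest)
    | _ -> failwith "expected: number (operator number)*"

// Evaluate right to left: the right-hand side is always computed first.
let rec eval expr =
    match expr with
    | Num x -> x
    | Bin (l, '+', r) -> l + eval r
    | Bin (l, '-', r) -> l - eval r
    | Bin (l, '*', r) -> l * eval r
    | Bin (l, '%', r) -> l / eval r   // assumption: '%' means division
    | Bin (_, o, _) -> failwithf "unknown operator %c" o
```

With these definitions, eval (build [ N 1.0; Op '*'; N 2.0; Op '+'; N 3.0 ]) computes 1 * (2 + 3) rather than (1 * 2) + 3, mirroring the "no precedence, right to left" rule from the question.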

fparsec - limit number of characters that a parser is applied to

I have a problem where, during the parsing of a stream, I get to a point where the next N characters need to be parsed by applying a specific parser multiple times (in sequence).
(stripped down toy) Example:
17<tag><anothertag><a42...
  ^
  |- I'm here
Let's say the 17 indicates that the next N=17 characters make up tags, so I need to repeatedly apply my "tagParser" but stop after 17 chars and not consume the rest, even if it looks like a tag, because that has a different meaning and will be parsed by another parser.
I cannot use many or many1 because that would eat the stream beyond those N characters.
Nor can I use parray because I do not know how many successful applications of that parser are there within the N characters.
I was looking into manyMinMaxSatisfy but could not figure out how to make use of it in this case.
Is there a way to cut N chars of a stream and feed them to some parser? Or is there a way to invoke many applications but up to N chars?
Thanks.
You can use getPosition to make sure you don't go past the specified number of characters. I threw this together (using F# 6) and it seems to work, although simpler/faster solutions may be possible:
let manyLimit nChars p =
    parse {
        let! startPos = getPosition
        let rec loop values =
            parse {
                let! curPos = getPosition
                let nRemain = (startPos.Index + nChars) - curPos.Index
                if nRemain = 0 then
                    return values
                elif nRemain > 0 then
                    let! value = p
                    return! loop (value :: values)
                else
                    return! fail $"limit exceeded by {-nRemain} chars"
            }
        let! values = loop []
        return values |> List.rev
    }
Test code:
let ptag =
    between
        (skipChar '<')
        (skipChar '>')
        (manySatisfy (fun c -> c <> '>'))
let parser =
    parse {
        let! nChars = pint64
        let! tags = manyLimit nChars ptag
        let! rest = restOfLine true
        return tags, rest
    }
run parser "17<tag><anothertag><a42..."
|> printfn "%A"
Output is:
Success: (["tag"; "anothertag"], "<a42...")
This is a fairly low-level parser that operates on raw Reply objects. It reads the character count, creates a substring to feed to the tags parser, and consumes the rest. There should be an easier way, but I don't have much experience with FParsec.
open FParsec
type Tag = Tag of string
let pTag = // parses tag string and constructs 'Tag' object
    skipChar '<' >>. many1Satisfy isLetter .>> skipChar '>'
    |>> Tag
let pCountPrefixedTags stream =
    let count = pint32 stream // read chars count
    if count.Status = Ok then
        let count = count.Result
        // take exactly 'count' chars
        let tags = manyMinMaxSatisfy count count (fun _ -> true) stream
        if tags.Status = Ok then
            // parse substring with tags
            let res = run (many1 pTag) tags.Result
            match res with
            | Success (res, _, _) -> Reply(res)
            | Failure (_, error, _) -> Reply(ReplyStatus.Error, error.Messages)
        else
            Reply(tags.Status, tags.Error)
    else
        Reply(count.Status, count.Error)
let consumeStream =
    many1Satisfy (fun _ -> true)
run (pCountPrefixedTags .>>. consumeStream) "17<tag><anothertag><notTag..."
|> printfn "%A" // Success: ([Tag "tag"; Tag "anothertag"], "<notTag...")
You can also do this without going down to stream level.
open FParsec
let ptag =
    between
        (skipChar '<')
        (skipChar '>')
        (manySatisfy (fun c -> c <> '>'))
let tagsFromChars (l: char[]) =
    let s = new System.String(l)
    match run (many ptag) s with
    | Success(result, _, _) -> result
    | Failure(errorMsg, _, _) -> []
let parser =
    parse {
        let! nChars = pint32
        let! tags = parray nChars anyChar |>> tagsFromChars
        let! rest = restOfLine true
        return tags, rest
    }
run parser "17<tag><anothertag><a42..."
|> printfn "%A"
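The idea shared by the last two answers, cutting exactly N characters out of the input and handing only that slice to the tag parser, can also be sketched with plain string functions and no FParsec at all. The names parseTags and parseCounted are hypothetical, and the sketch assumes the count is a plain decimal prefix and that the slice must consist of tags only.

```fsharp
open System

// Parse "<a><b>..." into ["a"; "b"]; None if the text is not made up entirely of tags.
let rec parseTags (s: string) =
    if s = "" then Some []
    elif s.[0] <> '<' then None
    else
        match s.IndexOf('>') with
        | -1 -> None
        | close ->
            parseTags (s.Substring(close + 1))
            |> Option.map (fun tail -> s.Substring(1, close - 1) :: tail)

// Read the decimal length prefix, slice exactly that many characters,
// and run the tag parser on the slice only; the remainder is returned untouched.
let parseCounted (input: string) =
    let digits = input |> Seq.takeWhile Char.IsDigit |> Seq.length
    if digits = 0 then None
    else
        let n = int (input.Substring(0, digits))
        if digits + n > input.Length then None
        else
            let body = input.Substring(digits, n)
            let rest = input.Substring(digits + n)
            parseTags body |> Option.map (fun tags -> tags, rest)
```

Because the tag parser never sees anything beyond the slice, the trailing "<a42..." cannot be consumed by accident; it comes back as the unparsed remainder for whichever parser handles it next.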

Splitting a Seq of Strings Of Variable Length in F#

I am using a .fasta file in F#. When I read it from disk, it is a sequence of strings. Each observation is usually 4-5 strings in length: 1st string is the title, then 2-4 strings of amino acids, and then 1 string of space. For example:
open System.IO
let filePath = @"/Users/XXX/sample_database.fasta"
let fileContents = File.ReadLines(filePath)
fileContents |> Seq.iter(fun x -> printfn "%s" x)
yields:
I am looking for a way to split each observation into its own collection using the out-of-the-box higher-order functions in F#. I do not want to use any mutable variables or for..each syntax. I thought Seq.chunkBySize would work, but the size varies. Is there a Seq.chunkByCharacter?
Mutable variables are totally fine for this, provided their mutability doesn't leak into a wider context. Why exactly do you not want to use them?
But if you really want to go hardcore "functional", then the usual functional way of doing something like that is via fold.
Your folding state would be a pair of "blocks accumulated so far" and "current block".
At each step, if you get a non-empty string, you attach it to the "current block".
And if you get an empty string, that means the current block is over, so you attach the current block to the list of "blocks so far" and make the current block empty.
This way, at the end of folding you'll end up with a pair of "all blocks accumulated except the last one" and "last block", which you can glue together.
Plus, an optimization detail: since I'm going to do a lot of "attach a thing to a list", I'd like to use a linked list for that, because it has constant-time attaching. But then the problem is that it's only constant time for prepending, not appending, which means I'll end up with all the lists reversed. But no matter: I'll just reverse them again at the very end. List reversal is a linear operation, which means my whole thing would still be linear.
let splitEm lines =
    let step (blocks, currentBlock) s =
        match s with
        | "" -> (List.rev currentBlock :: blocks), []
        | _ -> blocks, s :: currentBlock
    let (blocks, lastBlock) = Array.fold step ([], []) lines
    // The last block was built by prepending too, so it must be reversed as well
    List.rev (List.rev lastBlock :: blocks)
Usage:
> splitEm [| "foo"; "bar"; "baz"; ""; "1"; "2"; ""; "4"; "5"; "6"; "7"; ""; "8" |]
[["foo"; "bar"; "baz"]; ["1"; "2"]; ["4"; "5"; "6"; "7"]; ["8"]]
Note 1: You may have to address some edge cases depending on your data and what you want the behavior to be. For example, if there is an empty line at the very end, you'll end up with an empty block at the end.
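One way to handle the edge cases from Note 1 is sketched below. It is a variant of the fold above, not a replacement endorsed by the answer: the name splitNonEmpty is made up, it accepts any seq<string> (such as the result of File.ReadLines), it treats whitespace-only lines as separators, and it silently drops empty blocks, which is an assumption about the desired behaviour.

```fsharp
open System

// Variant of the fold that works on any sequence and never emits an empty block.
let splitNonEmpty (lines: seq<string>) =
    // Move the current block (built in reverse) onto the result, unless it is empty.
    let flush current blocks =
        match current with
        | [] -> blocks
        | _ -> List.rev current :: blocks
    let step (blocks, current) line =
        if String.IsNullOrWhiteSpace line then flush current blocks, []
        else blocks, line :: current
    let blocks, last = Seq.fold step ([], []) lines
    List.rev (flush last blocks)
```

For example, splitNonEmpty [ "a"; "b"; ""; ""; "c"; "" ] gives [["a"; "b"]; ["c"]]: neither the doubled separator nor the trailing one produces an empty block.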
Note 2: You may notice that this is very similar to an imperative algorithm with mutating variables: I'm even talking about things like "attach to list of blocks" and "make current block empty". This is not a coincidence. In this purely functional version the "mutating" is accomplished by calling the same function again with different parameters, while in an equivalent imperative version you would just have those parameters turned into mutable memory cells. Same thing, different view. In general, any imperative iteration can be turned into a fold this way.
For comparison, here's a mechanical translation of the above to imperative mutation-based style:
let splitEm lines =
    let mutable blocks = []
    let mutable currentBlock = []
    for s in lines do
        match s with
        | "" -> blocks <- List.rev currentBlock :: blocks; currentBlock <- []
        | _ -> currentBlock <- s :: currentBlock
    // As in the fold version, the last block must be reversed too
    List.rev (List.rev currentBlock :: blocks)
To illustrate Fyodor's point about contained mutability, here's an example that is mutable as can be while still somewhat reasonable. The outer functional layer is a sequence expression, a common pattern demonstrated by Seq.scan in the F# source.
let chooseFoldSplit
    folding (state : 'State)
    (source : seq<'T>) : seq<'U[]> = seq {
    let sref, zs = ref state, ResizeArray()
    use ie = source.GetEnumerator()
    while ie.MoveNext() do
        let newState, uopt = folding !sref ie.Current
        if newState <> !sref then
            yield zs.ToArray()
            zs.Clear()
            sref := newState
        match uopt with
        | None -> ()
        | Some u -> zs.Add u
    if zs.Count > 0 then
        yield zs.ToArray() }
// val chooseFoldSplit :
//   folding:('State -> 'T -> 'State * 'U option) ->
//     state:'State -> source:seq<'T> -> seq<'U []> when 'State : equality
There is mutability of a ref cell (equivalent to a mutable variable) and there is a mutable data structure: ResizeArray, an alias for System.Collections.Generic.List<'T>, which allows appending at amortized O(1) cost.
The folding function's signature 'State -> 'T -> 'State * 'U option is reminiscent of the folder of fold, except that it causes the result sequence to be split when its state changes. And it also spawns an option that denotes the next member for the current group (or not).
It would work fine without the conversion to a persistent array, as long as you iterate the resulting sequence lazily and only exactly once. Since callers cannot be relied on to do that, we need to isolate the contents of the ResizeArray from the outside world.
The simplest folding for your use case is negation of a boolean, but you could leverage it for more complex tasks like numbering your records:
[| "foo"; "1"; "2"; ""; "bar"; "4"; "5"; "6"; "7"; ""; "baz"; "8"; "" |]
|> chooseFoldSplit (fun b t ->
    if t = "" then not b, None else b, Some t ) false
|> Seq.map (fun a ->
    if a.Length > 1 then
        { Description = a.[0]; Sequence = String.concat "" a.[1..] }
    else failwith "Format error" )
// val it : seq<FastaEntry> =
//   seq [{Description = "foo"; Sequence = "12";};
//        {Description = "bar"; Sequence = "4567";};
//        {Description = "baz"; Sequence = "8";}]
I went with recursion:
open System
type FastaEntry = {Description:String; Sequence:String}
let generateFastaEntry (chunk:String seq) =
    match chunk |> Seq.length with
    | 0 -> None
    | _ ->
        let description = chunk |> Seq.head
        let sequence = chunk |> Seq.tail |> Seq.reduce (fun acc x -> acc + x)
        Some {Description=description; Sequence=sequence}
let rec chunk acc contents =
    let index = contents |> Seq.tryFindIndex(fun x -> String.IsNullOrEmpty(x))
    match index with
    | None ->
        let fastaEntry = generateFastaEntry contents
        match fastaEntry with
        | Some x -> Seq.append acc [x]
        | None -> acc
    | Some x ->
        let currentChunk = contents |> Seq.take x
        let fastaEntry = generateFastaEntry currentChunk
        match fastaEntry with
        | None -> acc
        | Some y ->
            let updatedAcc =
                match Seq.isEmpty acc with
                | true -> seq {y}
                | false -> Seq.append acc (seq {y})
            let remaining = contents |> Seq.skip (x+1)
            chunk updatedAcc remaining
You can also use a regular expression for this kind of thing. Here is a solution that uses a regular expression to extract a whole FASTA block at once.
open System.Text.RegularExpressions
type FastaEntry = {
    Description: string
    Sequence: string
}
let fastaRegexStr =
    @"
    ^>                  # Line Starting with >
      (.*)              # Capture into $1
    \r?\n               # End-of-Line
    (                   # Capturing in $2
        (?:
            ^           # A Line ...
              [A-Z]+    # .. containing A-Z
            \*? \r?\n   # Optional(*) followed by End-of-Line
        )+              # ^ Multiple of those lines
    )
    (?:
        (?: ^ [ \t\v\f]* \r?\n ) # Match an empty (whitespace) line ..
        |                        # or
        \z                       # End-of-String
    )
    "
(* Regex for matching one Fasta Block *)
let fasta = Regex(fastaRegexStr, RegexOptions.IgnorePatternWhitespace ||| RegexOptions.Multiline)
(* Whole file as a string *)
let content = System.IO.File.ReadAllText "fasta.fasta"
let entries = [
    for m in fasta.Matches(content) do
        let desc = m.Groups.[1].Value
        (* Remove *, \r and \n from string *)
        let sequ = Regex.Replace(m.Groups.[2].Value, @"\*|\r|\n", "")
        {Description=desc; Sequence=sequ}
]

F# idiomatic way of transforming text

Myello! So I am looking for a concise, efficient and idiomatic way in F# to parse a file or a string. I have a strong preference to treat the input as a sequence of char (char seq). The idea is that every function is responsible for parsing a piece of the input, returns the converted text tupled with the unused input, and is called by a higher-level function that chains the unused input to the following functions and uses the results to build a compound type. Every parsing function should therefore have a signature similar to this one: char seq -> char seq * 'a . If, for example, the function's responsibility is simply to extract the first word, then one approach would be the following:
let parseFirstWord (text: char seq) =
    let rec forTailRecursion t acc =
        let c = Seq.head t
        if c = '\n' then
            (t, acc)
        else
            forTailRecursion (Seq.skip 1 t) (c::acc)
    let rest, reversedWord = forTailRecursion text []
    (rest, List.rev reversedWord)
Now, of course the main problem with this approach is that it extracts the word in reverse order and so you have to reverse it. Its main advantages however are that it uses strictly functional features and proper tail recursion. One could avoid the reversing of the extracted value while losing tail recursion:
let rec parseFirstWord (text: char seq) =
    let c = Seq.head text
    if c = '\n' then
        (text, [])
    else
        let rest, tail = parseFirstWord (Seq.skip 1 text)
        (rest, (c::tail))
Or use a fast mutable data structure underneath instead of using purely functional features, such as:
open System.Collections.Generic
let parseFirstWord (text: char seq) =
    let rec forTailRecursion t (queue: Queue<char>) =
        let c = Seq.head t
        if c = '\n' then
            (t, queue)
        else
            queue.Enqueue c // Enqueue mutates the queue in place and returns unit
            forTailRecursion (Seq.skip 1 t) queue
    forTailRecursion text (Queue<char>())
I have no idea how to use OO concepts in F# mind you so corrections to the above code are welcome.
Being new to this language, I would like to be guided in terms of the usual compromises that an F# developer makes. Among the suggested approaches and your own, which should I consider more idiomatic and why? Also, in that particular case, how would you encapsulate the return value: char seq * char seq, char seq * char list or even char seq * Queue<char>? Or would you even consider char seq * String following a proper conversion?
I would definitely have a look at FsLex, FsYacc and FParsec. However, if you just want to tokenize a seq<char> you can use a sequence expression to generate tokens in the right order. Reusing your idea of a recursive inner function, and combining it with a sequence expression, we can stay tail recursive as shown below, and avoid non-idiomatic tools like mutable data structures.
I changed the separator char for easy debugging and the signature of the function. This version produces a seq<string> (your tokens) as result, which is probably easier to consume than a tuple with the current token and the rest of the text. If you just want the first token, you can just take the head. Note that the sequence is generated 'on demand', i.e. the input is parsed only as tokens are consumed through the sequence. Should you need the remainder of the input text next to each token, you can yield a pair in loop instead, but I'm guessing the downstream consumer most likely wouldn't (furthermore, if the input text is itself a lazy sequence, possibly linked to a stream, we don't want to expose it as it should be iterated through only in one place).
let parse (text : char seq) =
    let rec loop t acc =
        seq {
            if Seq.isEmpty t then yield acc
            else
                let c, rest = Seq.head t, Seq.skip 1 t
                if c = ' ' then
                    yield acc
                    yield! loop rest ""
                else yield! loop rest (acc + string c)
        }
    loop text ""
parse "The FOX is mine"
val it : seq<string> = seq ["The"; "FOX"; "is"; "mine"]
This is not the only 'idiomatic' way of doing this in F#. Every time we need to process a sequence, we can look at the functions made available in the Seq module. The most general of these is fold, which iterates through a sequence once, accumulating a state at each element by running a given function. In the example below, accumulate is such a function; it progressively builds the resulting sequence of tokens. Since accumulate only emits a token when it meets a separator, we need the last two lines to extract the final token from the fold's accumulator.
This second implementation keeps the tail recursion of the first (inside the fold implementation, if I'm not mistaken), although Seq.fold, unlike the sequence expression, walks the whole input before returning anything. It also happens to be shorter, albeit a bit less readable probably.
let parse2 (text : char seq) =
    let accumulate (res, acc) c =
        if c = ' ' then (Seq.append res (Seq.singleton acc), "")
        else (res, acc + string c)
    let (acc, last) = text |> Seq.fold accumulate (Seq.empty, "")
    Seq.append acc (Seq.singleton last)
parse2 "The FOX is mine"
val it : seq<string> = seq ["The"; "FOX"; "is"; "mine"]
One way of lexing/parsing in a way truly unique to F# is by using active patterns. The following simplified example shows the general idea. It can process a calculation string of arbitrary length without producing a stack overflow.
let rec (|CharOf|_|) set = function
    | c :: rest when Set.contains c set -> Some(c, rest)
    | ' ' :: CharOf set (c, rest) -> Some(c, rest)
    | _ -> None
let rec (|CharsOf|) set = function
    | CharOf set (c, CharsOf set (cs, rest)) -> c::cs, rest
    | rest -> [], rest
let (|StringOf|_|) set = function
    | CharsOf set (_::_ as cs, rest) -> Some(System.String(Array.ofList cs), rest)
    | _ -> None
type Token =
    | Int of int
    | Add | Sub | Mul | Div | Mod
    | Unknown
let lex: string -> _ =
    let digits = set ['0'..'9']
    let ops = Set.ofSeq "+-*/%"
    let rec lex chars =
        seq { match chars with
              | StringOf digits (s, rest) -> yield Int(int s); yield! lex rest
              | CharOf ops (c, rest) ->
                  let op =
                      match c with
                      | '+' -> Add | '-' -> Sub | '*' -> Mul | '/' -> Div | '%' -> Mod
                      | _ -> failwith "invalid operator char"
                  yield op; yield! lex rest
              | [] -> ()
              | _ -> yield Unknown }
    List.ofSeq >> lex
lex "1234 + 514 / 500"
// seq [Int 1234; Add; Int 514; Div; Int 500]
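The chaining scheme described in the question (each function returns a value together with the unused input, and a higher-level function threads the leftovers through) can be sketched over char list instead of char seq. This is only one possible shape: the Parser type alias and the takeWhile1, andThen, word, spaces and twoWords names are all invented for this example, the tuple order is value first and remainder second, and failure is modelled with option.

```fsharp
// A parser takes the remaining input and returns a value plus the unused input.
type Parser<'a> = char list -> ('a * char list) option

// Consume the longest non-empty prefix whose characters satisfy 'pred'.
let takeWhile1 (pred: char -> bool) : Parser<string> =
    fun input ->
        // Tail-recursive: accumulate in reverse, reverse once at the end.
        let rec loop acc rest =
            match rest with
            | c :: tail when pred c -> loop (c :: acc) tail
            | _ -> List.rev acc, rest
        match loop [] input with
        | [], _ -> None
        | cs, rest -> Some (System.String(Array.ofList cs), rest)

// Run 'p', then run 'q' on whatever 'p' left over, and pair the results.
let andThen (p: Parser<'a>) (q: Parser<'b>) : Parser<'a * 'b> =
    fun input ->
        match p input with
        | None -> None
        | Some (a, rest) ->
            match q rest with
            | None -> None
            | Some (b, rest') -> Some ((a, b), rest')

let word = takeWhile1 System.Char.IsLetter
let spaces = takeWhile1 System.Char.IsWhiteSpace
let twoWords = andThen word (andThen spaces word)
```

Applied to List.ofSeq "The FOX is mine", twoWords yields the two words "The" and "FOX" (with the separating space between them) and leaves " is mine" as the unused input. A char list makes the "rest of the input" cheap to hand on, whereas repeated Seq.skip calls on a char seq re-walk the sequence each time.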

Ocaml lexer / parser rules

I wrote a program in OCaml that, given an infix expression like 1 + 2, outputs the prefix notation: + 1 2
My problem is that I can't find a way to enforce a rule like: every value, operator and bracket must always be separated by at least one space, so 1+ 1 would be wrong and 1 + 1 OK. I would like not to use the ocamlp4 grammar.
here is the code:
open Genlex
type tree =
  | Leaf of string
  | Node of tree * string * tree
let my_lexer str =
  let kwds = ["("; ")"; "+"; "-"; "*"; "/"] in
  make_lexer kwds (Stream.of_string str)
let make_tree_from_stream stream =
  let op_parser operator_l higher_perm =
    let rec aux left higher_perm = parser
        [<'Kwd op when List.mem op operator_l; right = higher_perm; s >]
        -> aux (Node (left, op, right)) higher_perm s
      | [< >]
        -> left
    in
    parser [< left = higher_perm; s >] -> aux left higher_perm s
  in
  let rec high_perm l = op_parser ["*"; "/"] brackets l
  and low_perm l = op_parser ["+"; "-"] high_perm l
  and brackets = parser
    | [< 'Kwd "("; e = low_perm; 'Kwd ")" >] -> e
    | [< 'Ident n >] -> Leaf n
    | [< 'Int n >] -> Leaf (string_of_int n)
  in
  low_perm stream
let rec draw_tree = function
  | Leaf n -> Printf.printf "%s" n
  | Node(fg, r, fd) -> Printf.printf "(%s " (r);
    draw_tree fg;
    Printf.printf " ";
    draw_tree fd;
    Printf.printf ")"
let () =
  let line = read_line() in
  draw_tree (make_tree_from_stream (my_lexer line)); Printf.printf "\n"
Also, if you have any tips about the code, or if you notice any programming-style errors, I would appreciate it if you let me know. Thanks!
Genlex provides a ready-made lexer that respects OCaml's lexical conventions, and in particular ignores spaces in the positions you mention. I don't think you can implement what you want on top of it (it is not designed as a flexible solution, but as a quick way to get a prototype working).
If you want to keep writing stream parsers, you could write your own lexer for it: define a token type, and lex a char Stream.t into a token Stream.t, which you can then parse as you wish. Otherwise, if you don't want to use Camlp4, you may want to try an LR parser generator, such as menhir (a better ocamlyacc).
