F# Match Many Regex - f#

I have a string that I know will match one (and only one) of three regexes. I want to try each regex in turn until a match is found. For two of the regexes it is sufficient to know that there is a match. The third regex has a capture group and returns an integer.
I have an active pattern for regex:
open System.Text.RegularExpressions
let (|Regex|_|) pattern input =
    let m = Regex.Match(input, pattern)
    if m.Success then Some(List.tail [ for g in m.Groups -> g.Value ])
    else None
I’m new to F# and struggling to find an idiomatic way to do this. I don’t really want to write a convoluted if-then-else expression.
Any help would be really appreciated.
Thanks

With your handy function all you need to do is use match with your 3 patterns:
let regex1 = "^[1234]+$"
let regex2 = "^[abcd]+$"
let regex3 = "^ab([123])$"
let testText v =
    match v with
    | Regex regex1 _ -> "matched 1!"
    | Regex regex2 _ -> "matched 2!"
    | Regex regex3 [ v ] -> sprintf "matched 3 = %d" (int v)
    | _ -> "no match"
testText "231" |> printfn "%s" // matched 1!
testText "abd" |> printfn "%s" // matched 2!
testText "ab2" |> printfn "%s" // matched 3 = 2
testText "ab5" |> printfn "%s" // no match
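Since the question asks for an integer from the third regex rather than a formatted string, the same match can return a typed result instead. The following is only a sketch: the `MatchResult` type and the `classify` function are hypothetical names introduced for illustration, and `Seq.cast<Group>` is used so the group enumeration is typed on every runtime.

```fsharp
open System.Text.RegularExpressions

// Same idea as the active pattern in the question: the captured groups minus group 0.
let (|Regex|_|) pattern input =
    let m = Regex.Match(input, pattern)
    if m.Success then Some(List.tail [ for g in Seq.cast<Group> m.Groups -> g.Value ])
    else None

// Hypothetical result type, introduced only for this illustration.
type MatchResult =
    | First
    | Second
    | Third of int
    | NoMatch

let classify text =
    match text with
    | Regex "^[1234]+$" _ -> First
    | Regex "^[abcd]+$" _ -> Second
    | Regex "^ab([123])$" [ n ] -> Third (int n)
    | _ -> NoMatch
```

With this shape the caller pattern matches on `Third n` to get the integer, and the compiler warns if a case is forgotten.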

Related

Splitting a Seq of Strings Of Variable Length in F#

I am using a .fasta file in F#. When I read it from disk, it is a sequence of strings. Each observation is usually 4-5 strings in length: 1st string is the title, then 2-4 strings of amino acids, and then 1 string of space. For example:
let filePath = @"/Users/XXX/sample_database.fasta"
let fileContents = File.ReadLines(filePath)
fileContents |> Seq.iter(fun x -> printfn "%s" x)
yields:
I am looking for a way to split each observation into its own collection using the OOB high order functions in F#. I do not want to use any mutable variables or for..each syntax. I thought Seq.chunkBySize would work -> but the size varies. Is there a Seq.chunkByCharacter?
Mutable variables are totally fine for this, provided their mutability doesn't leak into a wider context. Why exactly do you not want to use them?
But if you really want to go hardcore "functional", then the usual functional way of doing something like that is via fold.
Your folding state would be a pair of "blocks accumulated so far" and "current block".
At each step, if you get a non-empty string, you attach it to the "current block".
And if you get an empty string, that means the current block is over, so you attach the current block to the list of "blocks so far" and make the current block empty.
This way, at the end of folding you'll end up with a pair of "all blocks accumulated except the last one" and "last block", which you can glue together.
Plus, an optimization detail: since I'm going to do a lot of "attach a thing to a list", I'd like to use a linked list for that, because it has constant-time attaching. But then the problem is that it's only constant time for prepending, not appending, which means I'll end up with all the lists reversed. But no matter: I'll just reverse them again at the very end. List reversal is a linear operation, which means my whole thing would still be linear.
let splitEm lines =
    let step (blocks, currentBlock) s =
        match s with
        | "" -> (List.rev currentBlock :: blocks), []
        | _ -> blocks, s :: currentBlock
    let (blocks, lastBlock) = Array.fold step ([], []) lines
    // the last block was also built by prepending, so it needs reversing too
    List.rev (List.rev lastBlock :: blocks)
Usage:
> splitEm [| "foo"; "bar"; "baz"; ""; "1"; "2"; ""; "4"; "5"; "6"; "7"; ""; "8" |]
[["foo"; "bar"; "baz"]; ["1"; "2"]; ["4"; "5"; "6"; "7"]; ["8"]]
Note 1: You may have to address some edge cases depending on your data and what you want the behavior to be. For example, if there is an empty line at the very end, you'll end up with an empty block at the end.
Note 2: You may notice that this is very similar to imperative algorithm with mutating variables: I'm even talking about things like "attach to list of blocks" and "make current block empty". This is not a coincidence. In this purely functional version the "mutating" is accomplished by calling the same function again with different parameters, while in an equivalent imperative version you would just have those parameters turned into mutable memory cells. Same thing, different view. In general, any imperative iteration can be turned into a fold this way.
For comparison, here's a mechanical translation of the above to imperative mutation-based style:
let splitEm lines =
    let mutable blocks = []
    let mutable currentBlock = []
    for s in lines do
        match s with
        | "" ->
            blocks <- List.rev currentBlock :: blocks
            currentBlock <- []
        | _ -> currentBlock <- s :: currentBlock
    // the last block was also built by prepending, so it needs reversing too
    List.rev (List.rev currentBlock :: blocks)
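Regarding the edge case from Note 1 (a trailing empty line producing an empty block), here is one possible way to handle it. This is a sketch under assumptions: `splitNonEmpty` is a hypothetical name, it assumes empty blocks should simply be dropped, and it folds over any `seq<string>` (such as the result of `File.ReadLines`) rather than an array.

```fsharp
// Same fold as above, but blank lines never produce an empty block.
let splitNonEmpty (lines: seq<string>) =
    let step (blocks, currentBlock) line =
        match line, currentBlock with
        | "", [] -> blocks, []                          // blank line, nothing pending: ignore it
        | "", _ -> List.rev currentBlock :: blocks, []  // blank line closes the current block
        | _ -> blocks, line :: currentBlock             // ordinary line: prepend to current block
    let blocks, lastBlock = Seq.fold step ([], []) lines
    match lastBlock with
    | [] -> List.rev blocks
    | _ -> List.rev (List.rev lastBlock :: blocks)
```

Usage:

```fsharp
splitNonEmpty [ "a"; "b"; ""; ""; "c"; "d"; "" ]
// [["a"; "b"]; ["c"; "d"]]
```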
To illustrate Fyodor's point about contained mutability, here's an example that is mutable as can be while still somewhat reasonable. The outer functional layer is a sequence expression, a common pattern demonstrated by Seq.scan in the F# source.
let chooseFoldSplit
    folding (state : 'State)
    (source : seq<'T>) : seq<'U[]> = seq {
    let sref, zs = ref state, ResizeArray()
    use ie = source.GetEnumerator()
    while ie.MoveNext() do
        let newState, uopt = folding !sref ie.Current
        if newState <> !sref then
            yield zs.ToArray()
            zs.Clear()
            sref := newState
        match uopt with
        | None -> ()
        | Some u -> zs.Add u
    if zs.Count > 0 then
        yield zs.ToArray() }
// val chooseFoldSplit :
//   folding:('State -> 'T -> 'State * 'U option) ->
//     state:'State -> source:seq<'T> -> seq<'U []> when 'State : equality
There is mutability of a ref cell (equivalent to a mutable variable), and there is a mutable data structure, ResizeArray, which is an alias for System.Collections.Generic.List<'T> and allows appending at amortized O(1) cost.
The folding function's signature 'State -> 'T -> 'State * 'U option is reminiscent of the folder of fold, except that it causes the result sequence to be split when its state changes. And it also spawns an option that denotes the next member for the current group (or not).
It would work fine without the conversion to a persistent array, as long as you iterate the resulting sequence lazily and exactly once. Since that cannot be guaranteed, we need to isolate the contents of the ResizeArray from the outside world.
The simplest folding for your use case is negation of a boolean, but you could leverage it for more complex tasks like numbering your records:
[| "foo"; "1"; "2"; ""; "bar"; "4"; "5"; "6"; "7"; ""; "baz"; "8"; "" |]
|> chooseFoldSplit (fun b t ->
    if t = "" then not b, None else b, Some t ) false
|> Seq.map (fun a ->
    if a.Length > 1 then
        { Description = a.[0]; Sequence = String.concat "" a.[1..] }
    else failwith "Format error" )
// val it : seq<FastaEntry> =
//   seq [{Description = "foo";
//         Sequence = "12";}; {Description = "bar";
//                             Sequence = "4567";}; {Description = "baz";
//                                                   Sequence = "8";}]
I went with recursion:
open System
type FastaEntry = {Description:String; Sequence:String}
let generateFastaEntry (chunk:String seq) =
    match chunk |> Seq.length with
    | 0 -> None
    | _ ->
        let description = chunk |> Seq.head
        let sequence = chunk |> Seq.tail |> Seq.reduce (fun acc x -> acc + x)
        Some {Description=description; Sequence=sequence}
let rec chunk acc contents =
    let index = contents |> Seq.tryFindIndex(fun x -> String.IsNullOrEmpty(x))
    match index with
    | None ->
        let fastaEntry = generateFastaEntry contents
        match fastaEntry with
        | Some x -> Seq.append acc [x]
        | None -> acc
    | Some x ->
        let currentChunk = contents |> Seq.take x
        let fastaEntry = generateFastaEntry currentChunk
        match fastaEntry with
        | None -> acc
        | Some y ->
            let updatedAcc =
                match Seq.isEmpty acc with
                | true -> seq {y}
                | false -> Seq.append acc (seq {y})
            let remaining = contents |> Seq.skip (x+1)
            chunk updatedAcc remaining
You can also use a regular expression for this kind of thing. Here is a solution that uses a regular expression to extract a whole FASTA block at once.
type FastaEntry = {
    Description: string
    Sequence: string
}
let fastaRegexStr =
    @"
    ^>                  # Line Starting with >
    (.*)                # Capture into $1
    \r?\n               # End-of-Line
    (                   # Capturing in $2
        (?:
            ^           # A Line ...
            [A-Z]+      # .. containing A-Z
            \*? \r?\n   # Optional(*) followed by End-of-Line
        )+              # ^ Multiple of those lines
    )
    (?:
        (?: ^ [ \t\v\f]* \r?\n )  # Match an empty (whitespace) line ..
        |                         # or
        \z                        # End-of-String
    )
    "
(* Regex for matching one Fasta Block *)
let fasta = Regex(fastaRegexStr, RegexOptions.IgnorePatternWhitespace ||| RegexOptions.Multiline)
(* Whole file as a string *)
let content = System.IO.File.ReadAllText "fasta.fasta"
let entries = [
    for m in fasta.Matches(content) do
        let desc = m.Groups.[1].Value
        (* Remove *, \r and \n from string *)
        let sequ = Regex.Replace(m.Groups.[2].Value, @"\*|\r|\n", "")
        {Description=desc; Sequence=sequ}
]

Can I make return type vary with parameter a bit like sprintf in F#?

In the F# core libraries there are functions whose signature seemingly changes based on the parameter at compile-time:
> sprintf "Hello %i" ;;
val it : (int -> string) = <fun:it@1>
> sprintf "Hello %s" ;;
val it : (string -> string) = <fun:it@2-1>
Is it possible to implement my own functions that have this property?
For example, could I design a function that matches strings with variable components:
matchPath "/products/:string/:string" (fun (category : string) (sku : string) -> ())
matchPath "/tickets/:int" (fun (id : int) -> ())
Ideally, I would like to avoid dynamic casts.
There are two relevant F# features that make it possible to do something like this.
Printf format strings. The compiler handles format strings like "hi %s" in a special way. They are not limited just to printf and it's possible to use those in your library in a somewhat different way. This does not let you change the syntax, but if you were happy to specify your paths using e.g. "/products/%s/%d", then you could use this. The Giraffe library defines routef function, which uses this trick for request routing:
let webApp =
choose [
routef "/foo/%s/%s/%i" fooHandler
routef "/bar/%O" (fun guid -> text (guid.ToString()))
]
Type providers. Another option is to use F# type providers. With parameterized type providers, you can write a type that is parameterized by a literal string and has members with types that are generated by some F# code you write based on the literal string parameter. An example is the Regex type provider:
type TempRegex = Regex< @"^(?<Temperature>[\d\.]+)\s*°C$", noMethodPrefix = true >
TempRegex().Match("21.3°C").Temperature.TryValue
Here, the regular expression on the first line is static parameter of the Regex type provider. The type provider generates a Match method which returns an object with properties like Temperature that are based on the literal string. You would likely be able to use this and write something like:
MatchPath<"/products/:category/:sku">.Match(fun r ->
printfn "Got category %s and sku %s" r.Category r.Sku)
I tweaked your example so that r is an object with properties that have names matching to those in the string, but you could use a lambda with multiple parameters too. Although, if you wanted to specify types of those matches, you might need a fancier syntax like "/product/[category:int]/[sku:string]" - this is just a string you have to parse in the type provider, so it's completely up to you.
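To make the format-string option more concrete, here is a small sketch using only FSharp.Core. It relies on `Printf.ksprintf`, which takes a continuation and a format string and returns a function whose parameters are dictated by the format. The name `tagged` is hypothetical and chosen only for this example; it shows how a user-defined function gets a format-dependent signature, not how to parse a path.

```fsharp
// A user-defined function whose remaining parameters depend on the format string.
// Inferred type: string -> StringFormat<'a, string> -> 'a
let tagged (tag: string) fmt =
    // ksprintf formats the arguments, then hands the finished string to the continuation.
    Printf.ksprintf (fun s -> sprintf "[%s] %s" tag s) fmt

let a = tagged "info" "Hello %i" 42                 // takes one int
let b = tagged "info" "%s has %d items" "cart" 3    // takes a string, then an int
```

Passing a string where the format says `%i` is rejected at compile time, with no dynamic casts involved. Going the other way, from an input string to typed values as `routef` does, additionally requires inspecting the format text at run time.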
First of all: Tomas's answer is the right answer.
But I had the same question.
And while I could understand it conceptually as "it has to be either the format-string thing or the type-provider stuff",
I could not tell myself that I got it until I tried an implementation.
And it took me a while.
I used FSharp.Core's printf functions and Giraffe's FormatExpressions.fs as guidelines
and came up with this naive gist/implementation, inspired by Giraffe's FormatExpressions.fs.
BTW, the trick is in this bit of magic: fun (format: PrintfFormat<_, _, _, _, 'T>) (handle: 'T -> 'R)
open System.Text.RegularExpressions
// convert format pattern to Regex Pattern
let rec toRegexPattern =
    function
    | '%' :: c :: tail ->
        match c with
        | 'i' ->
            let x, rest = toRegexPattern tail
            "(\d+)" + x, rest
        | 's' ->
            let x, rest = toRegexPattern tail
            "(\w+)" + x, rest
        | x ->
            failwithf "'%%%c' is Not Implemented\n" x
    | c :: tail ->
        let x, rest = toRegexPattern tail
        let r = c.ToString() |> Regex.Escape
        r + x, rest
    | [] -> "", []
// Handler Factory
let inline Handler (format: PrintfFormat<_, _, _, _, 'T>) (handle: 'T -> string) (decode: string list -> 'T) =
    format.Value.ToCharArray()
    |> List.ofArray
    |> toRegexPattern
    |> fst, handle, decode
// Active Patterns
let (|RegexMatch|_|) pattern input =
    let m = Regex.Match(input, pattern)
    if m.Success then
        let values =
            [ for g in Regex(pattern).Match(input).Groups do
                if g.Success && g.Name <> "0" then yield g.Value ]
        Some values
    else
        None
let getPattern (pattern, _, _) = pattern
let gethandler (_, handle, _) = handle
let getDecoder (_, _, decode) = decode
let Router path =
    let route1 =
        Handler "/xyz/%s/%i"
            (fun (category, id) ->
                // process request
                sprintf "handled: route1: %s/%i" category id)
            (fun values ->
                // convert matches
                values |> List.item 0,
                values
                |> List.item 1
                |> int32)
    let route2 =
        Handler "/xyz/%i"
            (fun (id) -> sprintf "handled: route2: id: %i" id) // handle
            (fun values -> values |> List.head |> int32) // decode
    // Router
    (match path with
     | RegexMatch (getPattern route2) values ->
         values
         |> getDecoder route2
         |> gethandler route2
     | RegexMatch (getPattern route1) values ->
         values
         |> getDecoder route1
         |> gethandler route1
     | _ -> failwith "No Match")
    |> printf "routed: %A\n"
let main argv =
    try
        let arg = argv |> Array.skip 1 |> Array.head
        Router arg
        0 // return an integer exit code
    with
    | Failure msg ->
        eprintf "Error: %s\n" msg
        -1

Count occurrences of characters in a string and return an integer list showing how many times each letter in the alphabet appears

I need to make a recursive function which takes a string and counts how many times each letter in the alphabet appears in the string. It should return an integer list such that "abe" returns [1;1;0;0;1;0;0;0;0 and so on]
This is for a school assignment. I have tried making the string to a char list which I can use pattern matching on, but to no avail. I will show in code an example of this.
My 2 BEST tries:
1.
let rec histogram (src:string) : int list =
    let dat = src.ToLower()
    let trt = Seq.toList dat
    match alphabet with
    | []-> []
    | head::tail when List.contains head trt -> |> Seq.countBy (fun x -> x) |> Seq.map snd :: histogram (System.String.Concat(Array.ofList(tail)))
    | _tail -> histogram (System.String.Concat(Array.ofList(tail)))
2.
let rec histogram (src:string) : int list =
    let dat = src.ToLower()
    let trt = Seq.toList dat
    match trt with
    | []-> []
    | head::tail when List.exists ((=) head) alphabet -> List.countBy id ::histogram (System.String.Concat(Array.ofList(tail))
    | _::tail -> convText (System.String.Concat(Array.ofList(tail)))
I don't think you should use recursion here. Besides everything else, it iterates over the string once for each letter, so its complexity is O(n^2). Try this:
let histogram (src:string) : int list =
    let lettersCount = [|for i in 0..25 -> 0|]
    for c in src.ToLower() do
        // skip anything outside 'a'..'z' so the index stays in range
        if c >= 'a' && c <= 'z' then
            let index = int c - int 'a'
            lettersCount.[index]<-lettersCount.[index]+1
    lettersCount |> Seq.toList
I understand that this is not a very functional way, but it is better for this task.
Sorry for the deleted comment. I somehow did not notice that you actually have to write a recursive function.
Look, your problem is that you are trying to match on an alphabet collection which is not passed to the function (it is probably defined in a closure), so it never changes through the recursion.
This is a really crazy task for recursion, but you can try something like this:
let rec histogram (src:string) : int list =
    match src.ToLower()|>Seq.toList with
    |head::tail ->
        System.String.Join("", tail)
        |>histogram
        |>List.mapi(fun i c -> if i=(int head - int 'a') then c+1 else c)
    |[] -> [for i in 1..26 -> 0]
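The version above rebuilds a string and lowercases it on every step. A possible variation, sketched here under the assumption that only the letters 'a' to 'z' should be counted, converts the string to a char list once and threads the counts through a recursive helper. The name `loop` is just a local helper chosen for this example.

```fsharp
let histogram (src: string) : int list =
    // Recursive helper: consume one character per call, carrying the 26 counts along.
    let rec loop (chars: char list) (counts: int list) =
        match chars with
        | [] -> counts
        | c :: rest when c >= 'a' && c <= 'z' ->
            let idx = int c - int 'a'
            loop rest (counts |> List.mapi (fun i n -> if i = idx then n + 1 else n))
        | _ :: rest -> loop rest counts   // ignore anything that is not a letter a-z
    loop (src.ToLower() |> Seq.toList) (List.replicate 26 0)
```

The recursion is still there, as the assignment requires, but it is a tail call, and each step only touches the 26-element count list.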

How to construct a match expression

I am allowing a command-line parameter like this --10GB, where -- and GB are constant, but a number like 1, 10, or 100 could be substituted in between the constant values, like --5GB.
I could easily parse the start and end of the string with substr or written a command line parser, but wanted to use match instead. I am just not sure how to structure the match expression.
let GB1 = cvt_bytes_to_gb(int64(DiskFreeLevels.GB1))
let arg = argv.[0]
let match_head = "--"
let match_tail = "GB"
let parse_min_gb_arg arg =
    match arg with
    | match_head & match_tail -> cvt_gb_arg_to_int arg
    | _ -> volLib.GB1
I get a warning on the _ case saying "This rule will never be matched". How should an AND expression like this be constructed?
You can't match on strings, except matching on the whole value, e.g. match s with | "1" -> 1 | "2" -> 2 ...
Parsing beginning and end would be the most efficient way to do this, there is no need to get clever (this, by the way, is a universally true statement).
But if you really want to use pattern matching, it is definitely possible to do, but you'll have to make yourself some custom matchers (also known as "active patterns").
First, make a custom matcher that would parse out the "middle" part of the string surrounded by prefix and suffix:
let (|StrBetween|_|) starts ends (str: string) =
    if str.StartsWith starts && str.EndsWith ends then
        Some (str.Substring(starts.Length, str.Length - ends.Length - starts.Length))
    else
        None
Usage:
let x = match "abcd" with
        | StrBetween "a" "d" s -> s
        | _ -> "nope"
// x = "bc"
Then make a custom matcher that would parse out an integer:
let (|Int|_|) (s: string) =
    match System.Int32.TryParse s with
    | true, i -> Some i
    | _ -> None
Usage:
let x = match "15" with
        | Int i -> i
        | _ -> 0
// x = 15
Now, combine the two:
let x = match "--10GB" with
        | StrBetween "--" "GB" (Int i) -> i
        | _ -> volLib.GB1
// x = 10
This ability of patterns to combine and nest is their primary power: you get to build a complicated pattern out of small, easily understandable pieces, and have the compiler match it to the input. That's basically why it's called "pattern matching". :-)
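As for the literal question about the AND pattern: in the original attempt, match_head and match_tail inside the pattern are fresh variable bindings, not references to the earlier let values, so that case matches every string and the wildcard case becomes unreachable. The `&` operator itself works fine between active patterns. A minimal sketch, with `StartsWith`, `EndsWith` and `looksLikeGbArg` being hypothetical names made up for this example:

```fsharp
// Partial active patterns that only test the input and carry no value.
let (|StartsWith|_|) (prefix: string) (s: string) =
    if s.StartsWith(prefix) then Some () else None

let (|EndsWith|_|) (suffix: string) (s: string) =
    if s.EndsWith(suffix) then Some () else None

// True when the argument has the --<something>GB shape.
let looksLikeGbArg (arg: string) =
    match arg with
    | StartsWith "--" & EndsWith "GB" -> true   // both patterns must match the same input
    | _ -> false
```

This only checks the shape; to also extract the number in the middle, a pattern such as StrBetween combined with Int is the better fit.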
The best I can come up with is using a partial active pattern:
let (|GbFormat|_|) (x:string) =
    let prefix = "--"
    let suffix = "GB"
    if x.StartsWith(prefix) && x.EndsWith(suffix) then
        let len = x.Length - prefix.Length - suffix.Length
        Some(x.Substring(prefix.Length, len))
    else
        None
let parse_min_gb_arg arg =
    match arg with
    | GbFormat gb -> gb
    | _ -> volLib.GB1
parse_min_gb_arg "--42GB"

Pattern matching numeric strings

I have a function that pattern matches its argument, which is a string:
let processLexime lexime =
    match lexime with
    | "abc" -> ...
    | "bar" -> ...
    | "cat" -> ...
    | _ -> ...
This works as expected. However, I'm now trying to extend this by expressing "match a string containing only the following characters". In my specific example, I want anything containing only digits to be matched.
My question is, how can I express this in F#? I'd prefer to do this without any libraries such as FParsec, since I'm mainly doing this for learning purposes.
You can use active patterns: https://msdn.microsoft.com/en-us/library/dd233248.aspx
let (|Integer|_|) (str: string) =
    let mutable intvalue = 0
    if System.Int32.TryParse(str, &intvalue) then Some(intvalue)
    else None
let parseNumeric str =
    match str with
    | Integer i -> printfn "%d : Integer" i
    | _ -> printfn "%s : Not matched." str
One way would be an active pattern
let (|Digits|_|) (s:string) =
    s.ToCharArray() |> Array.forall (fun c -> System.Char.IsDigit(c)) |> function |true -> Some(s) |false -> None
then you can do
match "1" with
|Digits(t) -> printf "matched"
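Two details of the Digits pattern are worth knowing: `Array.forall` is true for an empty array, so the empty string matches, and `System.Char.IsDigit` also accepts non-ASCII decimal digits. If neither is wanted, a stricter variant could look like the sketch below. `AsciiDigits` and `isAsciiDigits` are hypothetical names, and the rule "non-empty, only '0' to '9'" is an assumption about what "only digits" should mean.

```fsharp
// Matches only non-empty strings made up entirely of the characters '0'..'9'.
let (|AsciiDigits|_|) (s: string) =
    if s.Length > 0 && s |> Seq.forall (fun c -> c >= '0' && c <= '9') then Some s
    else None

// Small wrapper so the pattern can be used as a plain predicate.
let isAsciiDigits (s: string) =
    match s with
    | AsciiDigits _ -> true
    | _ -> false
```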
I would use regular expressions combined with active patterns. With regular expressions you can easily match digits with \d, and active patterns make the syntax nice inside your match.
open System.Text.RegularExpressions
let (|ParseRegex|_|) regex str =
    let m = Regex("^"+regex+"$").Match(str)
    if (m.Success) then Some true else None
let Printmatch s =
    match s with
    | ParseRegex "w+" d -> printfn "only w"
    | ParseRegex "(w+|s+)+" d -> printfn "only w and s"
    | ParseRegex "\d+" d -> printfn "only digits"
    |_ -> printfn "wrong"
[<EntryPoint>]
let main argv =
    Printmatch "www"
    Printmatch "ssswwswwws"
    Printmatch "134554"
    Printmatch "1dwd3ddwwd"
    0
which prints
only w
only w and s
only digits
wrong
