Return the condition in the filter? - F#

I have the following code to filter a Seq and return an error if nothing is returned.
let s =
    nodes
    |> Seq.filter (fun (a, _, _, _) ->
        if a.ToLower().Contains(key1) // condition 1
        then true
        else false) // Error message should have key1
    |> Seq.filter(....) // condition 2
    |> Seq.filter(....) // condition 3
    .....
    |> Seq.filter (function // condition N
        | _, Some date, _, _ -> date >= StartPeriod
        | _ -> false // put StartPeriod in the final message if s is not empty before this step
        )
if Seq.isEmpty s
then sprintf "Failed by condition 1 (%s) or condition 2 (%s) .... or condition N (Date > %s)"
        key1, ...., (StartPeriod.ToShortDateString())
else ....
The final error message from sprintf will contain all the filter conditions. Is there a way to have the code return only the conditions (or just the last one) that made s empty?
Based on rmunn's answer, I modified it to return all the filters that contributed to emptying the list.
let rec filterSeq filterList input msgs =
    match filterList with
    | [] -> input, msgs
    | (label, filter) :: filters ->
        let result = input |> Seq.filter filter
        if result |> Seq.isEmpty then
            printfn "The \"%s\" filter emptied out the input" label
            Seq.empty, (List.append msgs [label])
        else
            filterSeq filters result (List.append msgs [label])
let intFiltersWithLabels = [
    "Odd numbers", fun x -> x % 2 <> 0
    "Not divisible by 3", fun x -> x % 3 <> 0
    "Not divisible by 5", fun x -> x % 5 <> 0
    "Even numbers", fun x -> x % 2 = 0
    "Won't reach here", fun x -> x % 7 <> 0
]
{ 1..20 } |> filterSeq intFiltersWithLabels <| List.empty

What I would do is make a list of filters, and a recursive function that applies them one at a time. If the filter that was just applied returns an empty sequence, then stop, print the filter that just emptied your input, and return that empty sequence. Otherwise keep looping through the recursive function, taking the next filter in the list in turn, until either you end up with no input or you have run through your entire filter list and there's still some input remaining after passing all the filters.
Here's some sample code to illustrate what I mean. Notice how I've put labels in front of each filter function, so that I don't see output like "The <fun:filtersWithLabels@4> filter emptied out the input", but instead I see a sensible human-readable label for each filter.
let rec filterSeq filterList input =
    match filterList with
    | [] -> input
    | (label, filter) :: filters ->
        let result = input |> Seq.filter filter
        if result |> Seq.isEmpty then
            printfn "The \"%s\" filter emptied out the input" label
            Seq.empty
        else
            filterSeq filters result
let filtersWithLabels = [
    "Odd numbers", fun x -> x % 2 <> 0
    "Not divisible by 3", fun x -> x % 3 <> 0
    "Not divisible by 5", fun x -> x % 5 <> 0
    "Even numbers", fun x -> x % 2 = 0
    "Won't reach here", fun x -> x % 7 <> 0
]
{ 1..20 } |> filterSeq filtersWithLabels
// Prints: The "Even numbers" filter emptied out the input
If you want to print all filters until the one that emptied out the input, then you'd just move that printfn call up one line, outside the if expression. The fact that the recursion stops once the input is empty means that you won't see any printfn calls after the filter that emptied out the input.
Note that the way I wrote the function assumes that your original input will not be empty. If your original input was empty, then the function will credit the first filter for emptying the input and will print the first filter's label. You could solve that easily enough by checking for empty input before you check for empty result, but I didn't bother since this is just demo code. Just be aware of this if your real input could ever be empty in your actual use case.
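The empty-input caveat can be handled by checking the input before any filter runs. What follows is only a minimal sketch, assuming the same (label, filter) list shape as above. The name filterSeqChecked and the returned label option are illustrative and not part of the original code.

```fsharp
// Hypothetical variant of filterSeq: returns the surviving items plus the
// label of the filter that emptied the input (None if no filter did).
let filterSeqChecked filterList (input: seq<'T>) =
    let rec loop filters (current: seq<'T>) =
        match filters with
        | [] -> current, None
        | (label, filter) :: rest ->
            let result = current |> Seq.filter filter
            if Seq.isEmpty result then Seq.empty, Some label
            else loop rest result
    // Check the original input first so that no filter is blamed for it.
    if Seq.isEmpty input then Seq.empty, None
    else loop filterList input

let filters = [
    "Odd numbers", fun x -> x % 2 <> 0
    "Even numbers", fun x -> x % 2 = 0
]

let _, culprit = filterSeqChecked filters (seq { 1..20 })
let _, noCulprit = filterSeqChecked filters Seq.empty
```

With this shape, an input that was already empty yields no label at all, so the caller can tell "nothing to filter" apart from "a filter removed everything".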
Update: If you need to return a list of labels, not just print them, then make that a second parameter that you pass through your filterSeq function. Something like this:
let matchingFilters filterList input =
    let rec filterSeq filterList labelsSoFar input =
        match filterList with
        | [] -> input, [] // Note NO labels returned in this case!
        | (label, filter) :: filters ->
            let result = input |> Seq.filter filter
            if result |> Seq.isEmpty then
                Seq.empty, (label :: labelsSoFar)
            else
                filterSeq filters (label :: labelsSoFar) result
    let result, labels = filterSeq filterList [] input
    result, List.rev labels
let filtersWithLabels = [
    "Odd numbers", fun x -> x % 2 <> 0
    "Not divisible by 3", fun x -> x % 3 <> 0
    "Not divisible by 5", fun x -> x % 5 <> 0
    "Even numbers", fun x -> x % 2 = 0
    "Won't reach here", fun x -> x % 7 <> 0
]
{ 1..20 } |> matchingFilters filtersWithLabels
// Returns: (seq [], ["Odd numbers"; "Not divisible by 3"; "Not divisible by 5"; "Even numbers"])
A couple things to note about this version of the function: it sounds like what you want is that if the filters run all the way through without emptying out the input, then you want NO filter labels to be returned. If I've misunderstood you, then replace the | [] -> input, [] line with | [] -> input, labelsSoFar to get all the labels in the output. Second thing to note is that I've changed the "shape" of this function: instead of returning a seq, it returns a 2-tuple of (result seq, list of filter labels). The list of filter labels will be empty if the result seq is not empty, but if the result seq ended up empty, then the list of filter labels will contain all the filters that were applied, not just all the filters that reduced the size of the input.
If what you really need is to check whether the size of the input is reduced and print only the labels of filters that filtered something out, then look at Funk's answer for how to check that, but be aware that Seq.length has to run through the entire original sequence and apply all the filters up to that point, each time. So it's a slow operation. If your input data set is large, then it's best to stick with the Seq.isEmpty logic. Play around with it and decide what best fits your needs.

You can separate your logging / error handling code from your business logic by using a decorator.
First, our logger.
open System.Text
type Logger() =
    let sb = StringBuilder()
    member __.log msg =
        sprintf "Element doesn't contain %s ; " msg |> sb.Append |> ignore
    member __.getMessage() =
        sb.ToString()
Now, we want to wrap Seq.filter so it logs every time we filter out some element(s).
let filterBuilder (logger:Logger) msg f seq =
    let res = Seq.filter f seq
    if Seq.length seq > Seq.length res then logger.log msg
    res
Wrapping up with an example.
let logger = Logger()
let filterLog msg f seq = filterBuilder logger msg f seq
let seq = ["foo" ; "bar"]
let r =
    seq
    |> filterLog "f"
        (fun s -> s.Contains("f"))
    |> filterLog "o"
        (fun s -> s.Contains("o"))
    |> filterLog "b"
        (fun s -> s.Contains("b"))
    |> filterLog "a"
        (fun s -> s.Contains("a"))
logger.getMessage()
val it : string = "Element doesn't contain f ; Element doesn't contain b ; "
"bar" gets filtered out immediately, producing the first message. "foo" goes out the third time around. Also note the second and last pipe in the line don't log any message.

Related

Splitting a Seq of Strings Of Variable Length in F#

I am using a .fasta file in F#. When I read it from disk, it is a sequence of strings. Each observation is usually 4-5 strings in length: 1st string is the title, then 2-4 strings of amino acids, and then 1 string of space. For example:
let filePath = @"/Users/XXX/sample_database.fasta"
let fileContents = File.ReadLines(filePath)
fileContents |> Seq.iter(fun x -> printfn "%s" x)
yields:
I am looking for a way to split each observation into its own collection using the OOB high order functions in F#. I do not want to use any mutable variables or for..each syntax. I thought Seq.chunkBySize would work -> but the size varies. Is there a Seq.chunkByCharacter?
Mutable variables are totally fine for this, provided their mutability doesn't leak into a wider context. Why exactly do you not want to use them?
But if you really want to go hardcore "functional", then the usual functional way of doing something like that is via fold.
Your folding state would be a pair of "blocks accumulated so far" and "current block".
At each step, if you get a non-empty string, you attach it to the "current block".
And if you get an empty string, that means the current block is over, so you attach the current block to the list of "blocks so far" and make the current block empty.
This way, at the end of folding you'll end up with a pair of "all blocks accumulated except the last one" and "last block", which you can glue together.
Plus, an optimization detail: since I'm going to do a lot of "attach a thing to a list", I'd like to use a linked list for that, because it has constant-time attaching. But then the problem is that it's only constant time for prepending, not appending, which means I'll end up with all the lists reversed. But no matter: I'll just reverse them again at the very end. List reversal is a linear operation, which means my whole thing would still be linear.
let splitEm lines =
    let step (blocks, currentBlock) s =
        match s with
        | "" -> (List.rev currentBlock :: blocks), []
        | _ -> blocks, s :: currentBlock
    let (blocks, lastBlock) = Array.fold step ([], []) lines
    // The last block was built by prepending too, so it also needs reversing
    List.rev (List.rev lastBlock :: blocks)
Usage:
> splitEm [| "foo"; "bar"; "baz"; ""; "1"; "2"; ""; "4"; "5"; "6"; "7"; ""; "8" |]
[["foo"; "bar"; "baz"]; ["1"; "2"]; ["4"; "5"; "6"; "7"]; ["8"]]
Note 1: You may have to address some edge cases depending on your data and what you want the behavior to be. For example, if there is an empty line at the very end, you'll end up with an empty block at the end.
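One way to deal with the trailing-blank-line edge case is to drop empty blocks after the fold. The sketch below assumes that empty blocks are never wanted, which may not hold for your data. The name splitNonEmpty is illustrative, and Seq.fold is swapped in for Array.fold so that any sequence of lines is accepted.

```fsharp
// Hypothetical variant of splitEm: same fold, but empty blocks
// (from a trailing or doubled blank line) are filtered out at the end.
let splitNonEmpty (lines: seq<string>) =
    let step (blocks, currentBlock) line =
        match line with
        | "" -> (List.rev currentBlock :: blocks), []
        | _ -> blocks, line :: currentBlock
    let blocks, lastBlock = Seq.fold step ([], []) lines
    // Both the block list and each block are built in reverse order.
    List.rev (List.rev lastBlock :: blocks)
    |> List.filter (fun block -> not (List.isEmpty block))

let blocks = splitNonEmpty [ "a"; "b"; ""; ""; "c"; "d"; "" ]
// blocks = [["a"; "b"]; ["c"; "d"]]
```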
Note 2: You may notice that this is very similar to imperative algorithm with mutating variables: I'm even talking about things like "attach to list of blocks" and "make current block empty". This is not a coincidence. In this purely functional version the "mutating" is accomplished by calling the same function again with different parameters, while in an equivalent imperative version you would just have those parameters turned into mutable memory cells. Same thing, different view. In general, any imperative iteration can be turned into a fold this way.
For comparison, here's a mechanical translation of the above to imperative mutation-based style:
let splitEm lines =
    let mutable blocks = []
    let mutable currentBlock = []
    for s in lines do
        match s with
        | "" -> blocks <- List.rev currentBlock :: blocks; currentBlock <- []
        | _ -> currentBlock <- s :: currentBlock
    // The last block was built by prepending too, so it also needs reversing
    List.rev (List.rev currentBlock :: blocks)
To illustrate Fyodor's point about contained mutability, here's an example that is mutable as can be while still somewhat reasonable. The outer functional layer is a sequence expression, a common pattern demonstrated by Seq.scan in the F# source.
let chooseFoldSplit
        folding (state : 'State)
        (source : seq<'T>) : seq<'U[]> = seq {
    let sref, zs = ref state, ResizeArray()
    use ie = source.GetEnumerator()
    while ie.MoveNext() do
        let newState, uopt = folding !sref ie.Current
        if newState <> !sref then
            yield zs.ToArray()
            zs.Clear()
            sref := newState
        match uopt with
        | None -> ()
        | Some u -> zs.Add u
    if zs.Count > 0 then
        yield zs.ToArray() }
// val chooseFoldSplit :
//   folding:('State -> 'T -> 'State * 'U option) ->
//     state:'State -> source:seq<'T> -> seq<'U []> when 'State : equality
There is mutability of a ref cell (equivalent to a mutable variable) and there is a mutable data structure, ResizeArray, which is an alias for System.Collections.Generic.List<'T> and allows appending at amortized O(1) cost.
The folding function's signature 'State -> 'T -> 'State * 'U option is reminiscent of the folder of fold, except that it causes the result sequence to be split when its state changes. And it also spawns an option that denotes the next member for the current group (or not).
It would work without the conversion to an array only as long as the resulting sequence is iterated lazily and exactly once. Since that cannot be guaranteed, we need to isolate the contents of the ResizeArray from the outside world.
The simplest folding for your use case is negation of a boolean, but you could leverage it for more complex tasks like numbering your records:
[| "foo"; "1"; "2"; ""; "bar"; "4"; "5"; "6"; "7"; ""; "baz"; "8"; "" |]
|> chooseFoldSplit (fun b t ->
    if t = "" then not b, None else b, Some t ) false
|> Seq.map (fun a ->
    if a.Length > 1 then
        { Description = a.[0]; Sequence = String.concat "" a.[1..] }
    else failwith "Format error" )
// val it : seq<FastaEntry> =
//   seq [{Description = "foo";
//         Sequence = "12";}; {Description = "bar";
//                             Sequence = "4567";}; {Description = "baz";
//                                                   Sequence = "8";}]
I went with recursion:
open System
type FastaEntry = {Description:String; Sequence:String}
let generateFastaEntry (chunk:String seq) =
    match chunk |> Seq.length with
    | 0 -> None
    | _ ->
        let description = chunk |> Seq.head
        let sequence = chunk |> Seq.tail |> Seq.reduce (fun acc x -> acc + x)
        Some {Description=description; Sequence=sequence}
let rec chunk acc contents =
    let index = contents |> Seq.tryFindIndex(fun x -> String.IsNullOrEmpty(x))
    match index with
    | None ->
        let fastaEntry = generateFastaEntry contents
        match fastaEntry with
        | Some x -> Seq.append acc [x]
        | None -> acc
    | Some x ->
        let currentChunk = contents |> Seq.take x
        let fastaEntry = generateFastaEntry currentChunk
        match fastaEntry with
        | None -> acc
        | Some y ->
            let updatedAcc =
                match Seq.isEmpty acc with
                | true -> seq {y}
                | false -> Seq.append acc (seq {y})
            let remaining = contents |> Seq.skip (x+1)
            chunk updatedAcc remaining
You can also use a regular expression for this kind of thing. Here is a solution that uses a regular expression to extract a whole FASTA block at once.
open System.Text.RegularExpressions
type FastaEntry = {
    Description: string
    Sequence: string
}
let fastaRegexStr =
    @"
    ^>                            # Line Starting with >
      (.*)                        # Capture into $1
    \r?\n                         # End-of-Line
    (                             # Capturing in $2
        (?:
            ^                     # A Line ...
              [A-Z]+              # .. containing A-Z
            \*? \r?\n             # Optional(*) followed by End-of-Line
        )+                        # ^ Multiple of those lines
    )
    (?:
        (?: ^ [ \t\v\f]* \r?\n )  # Match an empty (whitespace) line ..
        |                         # or
        \z                        # End-of-String
    )
    "
(* Regex for matching one Fasta Block *)
let fasta = Regex(fastaRegexStr, RegexOptions.IgnorePatternWhitespace ||| RegexOptions.Multiline)
(* Whole file as a string *)
let content = System.IO.File.ReadAllText "fasta.fasta"
let entries = [
    for m in fasta.Matches(content) do
        let desc = m.Groups.[1].Value
        (* Remove *, \r and \n from string *)
        let sequ = Regex.Replace(m.Groups.[2].Value, @"\*|\r|\n", "")
        {Description=desc; Sequence=sequ}
]

Break the iteration and got the values and states?

I need to call a function on each item in the list; and quit immediately if the function returns -1. I need to return the sum of results of the function and a string of "Done" or "Error".
let input = seq { 0..4 } // fake input
let calc1 x = // mimic failing after input 3. It's a very expensive function and should stop running after failing
    if x >= 3 then -1 else x * 2
let run input calc =
    input
    |> Seq.map(fun x ->
        let v = calc x
        if v = -1 then .... // Error occurred, stop the execution if gets -1. Calc will not work anymore
        v)
    |> Seq.sum, if hasError then "Error" else "Done"
run input calc1 // should return (6, "Error")
run input id // should return (10, "Done")
The simplest way to achieve exactly what is asked in an idiomatic manner would be to use an inner recursive function for traversing the sequence:
let run input calc =
    let rec inner unprocessed sum =
        match unprocessed with
        | [] -> (sum, "Done")
        | x::xs ->
            let res = calc x
            if res < 0 then (sum, "Error") else inner xs (sum + res)
    inner (input |> Seq.toList) 0
Then run (seq {0..4}) (fun x -> if x >=3 then -1 else x * 2) returns (6,"Error") while
run (seq [0;1;2;1;0;0;1;1;2;2]) (fun x -> if x >=3 then -1 else x * 2) returns (20, "Done")
A more efficient version of the same thing is shown below. This means it is now essentially a copy of @GeneBelitski's answer.
let run input calc =
    let inputList = Seq.toList input
    let rec subrun inp acc =
        match inp with
        | [] -> (acc, "Done")
        | (x :: xs) ->
            let res = calc x
            match res with
            | Some(y) -> subrun xs (acc + y)
            | None -> (acc, "Error")
    subrun inputList 0
Note that this function below is EXTREMELY slow, likely because it uses Seq.tail (I had thought that would be the same as List.tail). I leave it in for posterity.
The easiest way I can think of for doing this in F# would be to use a tail-recursive function. Something like
let run input calc =
    let rec subrun inp acc =
        if Seq.isEmpty inp then
            (acc, "Done")
        else
            let res = calc (Seq.head inp)
            match res with
            | Some(x) -> subrun (Seq.tail inp) (acc + x)
            | None -> (acc, "Error")
    subrun input 0
I'm not 100% sure just how efficient that would be. In my experience, sometimes for some reason, my own tail-recursive functions seem to be considerably slower than using the built-in higher-order functions. This should at least get you to the right result.
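Both costs mentioned here (materializing the whole input with Seq.toList, or re-walking it with Seq.tail) can be avoided by driving the sequence's enumerator directly. This is only a sketch, assuming calc signals failure with -1 as in the question. The name runLazy is illustrative.

```fsharp
// Hypothetical helper: sums the calc results and stops at the first -1,
// pulling items from the sequence one at a time.
let runLazy (input: seq<int>) (calc: int -> int) =
    use e = input.GetEnumerator()
    let rec loop sum =
        if e.MoveNext() then
            let res = calc e.Current
            if res = -1 then sum, "Error" else loop (sum + res)
        else
            sum, "Done"
    loop 0

let calc1 x = if x >= 3 then -1 else x * 2
let failed = runLazy (seq { 0..4 }) calc1   // (6, "Error")
let finished = runLazy (seq { 0..4 }) id    // (10, "Done")
```

Because the loop returns as soon as calc yields -1, the expensive function is never called for the remaining items.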
The below, while apparently not answering the actual question, is left in just in case it is useful to someone.
The typical way to handle this would be to make your calc function return either an Option or Result type, e.g.
let calc1 x = if x = 3 then None else Some(x*2)
and then map that to your input. Afterwards, you can fairly easily do something like
|> Seq.exists Option.isNone
to see if there are Nones in the resulting seq (you can pipe it to not if you want the opposite result).
If you just need to eliminate Nones from the list, you can use
Seq.choose id
which will drop all Nones and unwrap the values inside the remaining Somes.
For summing the list, assuming that you have used choose to be left with just the unwrapped values, then you can do
Seq.sum
Here is a monadic way of doing it using the Result monad.
First we create a function calcR that returns Error if calc returns -1, and otherwise returns Ok with the value:
let calcR f x =
    let r = f x
    if r = -1 then Error "result was = -1" else
    Ok r
Then, we create function sumWhileOk that uses Seq.fold over the input, adding up the results as long as they are Ok.
let sumWhileOk fR =
    Seq.fold(fun totalR v ->
        totalR
        |> Result.bind(fun total ->
            fR v
            |> Result.map (fun r -> total + r)
            |> Result.mapError (fun _ -> total )
        )
    ) (Ok 0)
Result.bind and Result.map only invoke their lambda function if the supplied value is Ok; if it is Error, the lambda is bypassed. Result.mapError is used to replace the error message from calcR with the current total as an error.
It is called this way:
input |> sumWhileOk (calcR id)
// returns: Ok 10
input |> sumWhileOk (calcR calc1)
// return: Error 6
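The bypass behaviour of Result.bind described above can be checked in isolation. The snippet below is a small sketch. The double function and the calls counter are made up for illustration and are not part of the answer's code.

```fsharp
// Counts how often the wrapped function actually runs.
let mutable calls = 0
let double (x: int) : Result<int, string> =
    calls <- calls + 1
    Ok (x * 2)

let fromOk = Ok 3 |> Result.bind double             // double runs: Ok 6
let fromError = Error "boom" |> Result.bind double  // double is skipped: Error "boom"
// calls is 1 here, because only the Ok case invoked double
```

This is why sumWhileOk stops calling the expensive function after the first failure, even though Seq.fold itself still visits the remaining items.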

Sequence of incorrect length generated by function

Why is the following function returning a sequence of incorrect length when the repl variable is set to false?
open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra
let sample (data : seq<float>) (size : int) (repl : bool) =
    let n = data |> Seq.length
    // without replacement
    let rec generateIndex idx =
        let m = size - Seq.length(idx)
        match m > 0 with
        | true ->
            let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
            let idx = (Seq.append idx newIdx) |> Seq.distinct
            generateIndex idx
        | false ->
            idx
    let sample =
        match repl with
        | true ->
            DiscreteUniform.Samples(0, n-1)
            |> Seq.take size
            |> Seq.map (fun index -> Seq.item index data)
        | false ->
            generateIndex (seq [])
            |> Seq.map (fun index -> Seq.item index data)
    sample
Running the function...
let requested = 1000
let dat = Normal.Samples(0., 1.) |> Seq.take 10000
let resultlen = sample dat requested false |> Seq.length
printfn "requested -> %A\nreturned -> %A" requested resultlen
Resulting lengths are wrong.
>
requested -> 1000
returned -> 998
>
requested -> 1000
returned -> 1001
>
requested -> 1000
returned -> 997
Any idea what mistake I'm making?
First, there's a comment I want to make about coding style. Then I'll get to the explanation of why your sequences are coming back with different lengths.
In the comments, I mentioned replacing match (bool) with true -> ... | false -> ... with a simple if ... then ... else expression, but there's another coding style that you're using that I think could be improved. You wrote:
let sample (various_parameters) = // This is a function
    // Other code ...
    let sample = some_calculation // This is a variable
    sample // Return the variable
While F# allows you to reuse names like that, and the name inside the function will "shadow" the name outside the function, it's generally a bad idea for the reused name to have a totally different type than the original name. In other words, this can be a good idea:
let f (a : float option) =
    let a = match a with
            | None -> 0.0
            | Some value -> value
    // Now proceed, knowing that `a` has a real value even if it had been None before
Or, because the above is exactly what F# gives you defaultArg for:
let f (a : float option) =
    let a = defaultArg a 0.0
    // This does exactly the same thing as the previous snippet
Here, we are making the name a inside our function refer to a different type than the parameter named a: the parameter was a float option, and the a inside our function is a float. But they're sort of the "same" type -- that is, there's very little mental difference between "The caller may have specified a floating-point value or they may not" and "Now I definitely have a floating-point value". But there's a very large mental gap between "The name sample is a function that takes three parameters" and "The name sample is a sequence of floats". I strongly recommend using a name like result for the value you're going to return from your function, rather than re-using the function name.
Also, this seems unnecessarily verbose:
let result =
    match repl with
    | true ->
        DiscreteUniform.Samples(0, n-1)
        |> Seq.take size
        |> Seq.map (fun index -> Seq.item index data)
    | false ->
        generateIndex (seq [])
        |> Seq.map (fun index -> Seq.item index data)
result
Anytime I find myself writing "let result = (something) ; result" at the end of my function, I usually just want to replace that whole code block with just the (something). I.e., the above snippet could just become:
match repl with
| true ->
    DiscreteUniform.Samples(0, n-1)
    |> Seq.take size
    |> Seq.map (fun index -> Seq.item index data)
| false ->
    generateIndex (seq [])
    |> Seq.map (fun index -> Seq.item index data)
Which in turn can be replaced with an if...then...else expression:
if repl then
    DiscreteUniform.Samples(0, n-1)
    |> Seq.take size
    |> Seq.map (fun index -> Seq.item index data)
else
    generateIndex (seq [])
    |> Seq.map (fun index -> Seq.item index data)
And that's the last expression in your code. In other words, I would probably rewrite your function as follows (changing ONLY the style, and making no changes to the logic):
open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra
let sample (data : seq<float>) (size : int) (repl : bool) =
    let n = data |> Seq.length
    // without replacement
    let rec generateIndex idx =
        let m = size - Seq.length(idx)
        if m > 0 then
            let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
            let idx = (Seq.append idx newIdx) |> Seq.distinct
            generateIndex idx
        else
            idx
    if repl then
        DiscreteUniform.Samples(0, n-1)
        |> Seq.take size
        |> Seq.map (fun index -> Seq.item index data)
    else
        generateIndex (seq [])
        |> Seq.map (fun index -> Seq.item index data)
If I can figure out why your sequences have the wrong length, I'll update this answer with that information as well.
UPDATE: Okay, I think I see what's happening in your generateIndex function that's giving you unexpected results. There are two things tripping you up: one is sequence laziness, and the other is randomness.
I copied your generateIndex function into VS Code and added some printfn statements to look at what's going on. First, the code I ran, and then the results:
let rec generateIndex n size idx =
    let m = size - Seq.length(idx)
    printfn "m = %d" m
    match m > 0 with
    | true ->
        let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
        printfn "Generating newIdx as %A" (List.ofSeq newIdx)
        let idx = (Seq.append idx newIdx) |> Seq.distinct
        printfn "Now idx is %A" (List.ofSeq idx)
        generateIndex n size idx
    | false ->
        printfn "Done, returning %A" (List.ofSeq idx)
        idx
All those List.ofSeq idx calls are so that F# Interactive would print more than four items of the seq when I print it out (by default, if you try to print a seq with %A, it will only print out four values and then print an ellipsis if there are more values available in the seq). Also, I turned n and size into parameters (that I don't change between calls) so that I could test it easily. I then called it as generateIndex 100 5 (seq []) and got the following result:
m = 5
Generating newIdx as [74; 76; 97; 78; 31]
Now idx is [68; 28; 65; 58; 82]
m = 0
Done, returning [37; 58; 24; 48; 49]
val it : seq<int> = seq [12; 69; 97; 38; ...]
See how the numbers keep changing? That was my first clue that something was up. See, seqs are lazy. They don't evaluate their contents until they have to. You shouldn't think of a seq as a list of numbers. Instead, think of it as a generator that will, when asked for numbers, produce them according to some rule. In your case, the rule is "Choose random integers between 0 and n-1, then take m of those numbers". And the other thing about seqs is that they do not cache their contents (although there's a Seq.cache function available that will cache their contents). Therefore, if you have a seq based on a random number generator, its results will be different each time, as you can see in my output. When I printed out newIdx, it printed out as [74; 76; 97; 78; 31], but when I appended it to an empty seq, the result printed out as [68; 28; 65; 58; 82].
Why this difference? Because Seq.append does not force evaluation. It simply creates a new seq whose rule is "take all items from the first seq, then when that one exhausts, take all items from the second seq. And when that one exhausts, end." And Seq.distinct does not force evaluation either; it simply creates a new seq whose rule is "take the items from the seq handed to you, and start handing them out when asked. But memorize them as you go, and if you've handed one of them out before, don't hand it out again." So what you are passing around between your calls to generateIndex is an object that, when evaluated, will pick a set of random numbers between 0 and n-1 (in my simple case, between 0 and 99) and then reduce that set down to a distinct set of numbers.
Now, here's the thing. Every time you evaluate that seq, it will start from the beginning: first calling DiscreteUniform.Samples(0, n-1) to generate an infinite stream of random numbers, then selecting m numbers from that stream, then throwing out any duplicates. (I'm ignoring the Seq.append for now, because it would create unnecessary mental complexity and it isn't really part of the bug anyway). Now, at the start of each go-round of your function, you check the length of the sequence, which does cause it to be evaluated. That means that it selects (in the case of my sample code) 5 random numbers between 0 and 99, then makes sure that they're all distinct. If they are all distinct, then m = 0 and the function will exit, returning... not the list of numbers, but the seq object. And when that seq object is evaluated, it will start over from the beginning, choosing a different set of 5 random numbers and then throwing out any duplicates. Therefore, there's still a chance that at least one of that set of 5 numbers will end up being a duplicate, because the sequence whose length was tested (which we know contained no duplicates, otherwise m would have been greater than 0) was not the sequence that was returned. The sequence that was returned has a 1.0 * 0.99 * 0.98 * 0.97 * 0.96 chance of not containing any duplicates, which comes to about 0.9035. So there's a just-under-10% chance that even though you checked Seq.length and it was 5, the length of the returned seq ends up being 4 after all -- because it was choosing a different set of random numbers than the one you checked.
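The re-evaluation described above is easy to reproduce without MathNet. In this sketch, System.Random and Seq.initInfinite stand in for DiscreteUniform.Samples; that substitution is an assumption made only to keep the example self-contained, and the names are illustrative.

```fsharp
// Stand-in for DiscreteUniform.Samples: an infinite lazy stream of random
// ints in [0, n). Every enumeration calls the generator function again.
let rng = System.Random()
let samples n = Seq.initInfinite (fun _ -> rng.Next(n))

let lazyIdx = samples 1000000 |> Seq.take 5
// Each enumeration draws fresh numbers, so these two lists almost surely differ.
let first = List.ofSeq lazyIdx
let second = List.ofSeq lazyIdx

// Seq.cache remembers the values produced by the first enumeration.
let cachedIdx = samples 1000000 |> Seq.take 5 |> Seq.cache
let firstCached = List.ofSeq cachedIdx
let secondCached = List.ofSeq cachedIdx
printfn "lazy equal: %b, cached equal: %b" (first = second) (firstCached = secondCached)
```

Checking the length of lazyIdx and then returning lazyIdx has the same flaw as generateIndex: the values that were measured are not the values the caller later sees.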
To prove this, I ran the function again, this time only picking 4 numbers so that the result would be completely shown at the F# Interactive prompt. And my run of generateIndex 100 4 (seq []) produced the following output:
m = 4
Generating newIdx as [36; 63; 97; 31]
Now idx is [39; 93; 53; 94]
m = 0
Done, returning [47; 94; 34]
val it : seq<int> = seq [48; 24; 14; 68]
Notice how when I printed "Done, returning (value of idx)", it had only 3 values? Even though it eventually returned 4 values (because it picked a different selection of random numbers for the actual result, and that selection had no duplicates), that demonstrated the problem.
By the way, there's one other problem with your function, which is that it's far slower than it needs to be. The function Seq.item, in some circumstances, has to run through the sequence from the beginning in order to pick the nth item of the sequence. It would be far better to store your data in an array at the start of your function (let arrData = data |> Array.ofSeq), then replace
|> Seq.map (fun index -> Seq.item index data)
with
|> Seq.map (fun index -> arrData.[index])
Array lookups are done in constant time, so that takes your sample function down from O(N^2) to O(N).
TL;DR: Use Seq.distinct before you take m values from it and the bug will go away. You can just replace your entire generateIndex function with a simple DiscreteUniform.Samples(0, n-1) |> Seq.distinct |> Seq.take size. (And use an array for your data lookups so that your function will run faster). In other words, here's the final almost-final version of how I would rewrite your code:
let sample (data : seq<float>) (size : int) (repl : bool) =
    let arrData = data |> Array.ofSeq
    let n = arrData |> Array.length
    if repl then
        DiscreteUniform.Samples(0, n-1)
        |> Seq.take size
        |> Seq.map (fun index -> arrData.[index])
    else
        DiscreteUniform.Samples(0, n-1)
        |> Seq.distinct
        |> Seq.take size
        |> Seq.map (fun index -> arrData.[index])
That's it! Simple, easy to understand, and (as far as I can tell) bug-free.
Edit: ... but not completely DRY, because there's still a bit of repeated code in that "final" version. (Credit to CaringDev for pointing it out in the comments below). The Seq.take size |> Seq.map is repeated in both branches of the if expression, so there's a way to simplify that expression. We could do this:
let randomIndices =
    if repl then
        DiscreteUniform.Samples(0, n-1)
    else
        DiscreteUniform.Samples(0, n-1) |> Seq.distinct
randomIndices
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])
So here's a truly-final version of my suggestion:
let sample (data : seq<float>) (size : int) (repl : bool) =
    let arrData = data |> Array.ofSeq
    let n = arrData |> Array.length
    let randomIndices =
        if repl then
            DiscreteUniform.Samples(0, n-1)
        else
            DiscreteUniform.Samples(0, n-1) |> Seq.distinct
    randomIndices
    |> Seq.take size
    |> Seq.map (fun index -> arrData.[index])
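One caveat with the distinct-then-take approach: a distinct stream of indices from 0 to n-1 can never produce more than n values, so asking for a larger sample without replacement would leave Seq.take waiting forever. The sketch below adds a guard for that case and forces the result into an array so that it is evaluated only once. It uses System.Random instead of DiscreteUniform.Samples purely to stay self-contained, and the name sampleWithoutReplacement is illustrative.

```fsharp
// Hypothetical sketch: System.Random replaces DiscreteUniform.Samples
// so that the example has no MathNet dependency.
let sampleWithoutReplacement (data: seq<'T>) (size: int) : 'T[] =
    let arrData = Array.ofSeq data
    // Guard against requesting more distinct indices than exist.
    if size > arrData.Length then
        invalidArg "size" "size must not exceed the number of data points"
    let rng = System.Random()
    Seq.initInfinite (fun _ -> rng.Next(arrData.Length))
    |> Seq.distinct
    |> Seq.take size
    |> Seq.map (fun index -> arrData.[index])
    |> Array.ofSeq   // evaluate once; callers always see the same sample

let picked = sampleWithoutReplacement (seq { 1..50 }) 20
```

Returning an array rather than a lazy seq is a design choice: it means Seq.length and later iteration both observe the same values.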

Remove a single non-unique value from a sequence in F#

I have a sequence of integers representing dice in F#.
In the game in question, the player has a pool of dice and can choose to play one (governed by certain rules) and keep the rest.
If, for example, a player rolls a 6, 6 and a 4 and decides to play one of the sixes, is there a simple way to return a sequence with only one 6 removed?
Seq.filter (fun x -> x <> 6) dice
removes all of the sixes, not just one.
Non-trivial operations on sequences are painful to work with, since they don't support pattern matching. I think the simplest solution is as follows:
let filterFirst f s =
    seq {
        let filtered = ref false
        for a in s do
            if filtered.Value = false && f a then
                filtered := true
            else yield a
    }
So long as the mutable implementation is hidden from the client, it's still functional style ;)
If you're going to store data I would use ResizeArray instead of a Sequence. It has a wealth of functions built in, such as the function you asked about. It's simply called Remove. Note: ResizeArray is an abbreviation for the .NET type System.Collections.Generic.List<'T>.
let test = seq [1; 2; 6; 6; 1; 0]
let a = new ResizeArray<int>(test)
a.Remove 6 |> ignore
Seq.toList a |> printf "%A"
// output
> [1; 2; 6; 1; 0]
Other data type options could be Array
let removeOneFromArray v a =
    let i = Array.findIndex ((=)v) a
    Array.append a.[..(i-1)] a.[(i+1)..]
or List
let removeOneFromList v l =
    let rec remove acc = function
        | x::xs when x = v -> List.rev acc @ xs
        | x::xs -> remove (x::acc) xs
        | [] -> List.rev acc // value not found: return the list in its original order
    remove [] l
The code below will work for a list (so not for any seq, but it sounds like the sequence you're using could be a list):
let rec removeOne value list =
    match list with
    | head::tail when head = value -> tail
    | head::tail -> head::(removeOne value tail)
    | _ -> [] // you might want to fail here, since the value was not found in the list
EDIT: code updated based on correct comment below. Thanks P
EDIT: After reading a different answer, I thought a warning would be in order. Don't use the above code for infinite sequences. Since I guess your players don't have infinite dice, that should not be a problem, but for completeness here's an implementation that would work for (almost) any finite sequence:
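open System.Linq // Any, First and Skip are LINQ extension methods
// Call with an empty accumulator: removeOne 6 dice []. The result is a list.
let rec removeOne value (s : seq<_>) acc =
    match s.Any() with
    | true when s.First() = value -> List.rev acc @ List.ofSeq (s.Skip(1))
    | true -> removeOne value (s.Skip(1)) (s.First() :: acc)
    | _ -> List.rev acc // you might want to fail here, since the value was not found in the sequence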
However, I recommend using the first solution, which I'm confident will perform better than the latter even if you have to turn a sequence into a list first (at least for small sequences, or for large sequences with the sought-for value near the end).
I don't think there is any function that would allow you to directly represent the idea that you want to remove just the first element matching the specified criteria from the list (e.g. something like Seq.removeOne).
You can implement the function in a relatively readable way using Seq.fold (if the sequence of numbers is finite):
let removeOne f l =
    Seq.fold (fun (removed, res) v ->
        if removed then true, v::res
        elif f v then true, res
        else false, v::res) (false, []) l
    |> snd |> List.rev
> removeOne (fun x -> x = 6) [ 1; 2; 6; 6; 1 ];;
val it : int list = [1; 2; 6; 1]
The fold function keeps some state - in this case of type bool * list<'a>. The Boolean flag represents whether we already removed some element and the list is used to accumulate the result (which has to be reversed at the end of processing).
If you need to do this for (possibly) infinite seq<int>, then you'll need to use GetEnumerator directly and implement the code as a recursive sequence expression. This is a bit uglier and it would look like this:
let removeOne f (s:seq<_>) =
    // Get enumerator of the input sequence
    let en = s.GetEnumerator()
    let rec loop() = seq {
        // Move to the next element
        if en.MoveNext() then
            // Is this the element to skip?
            if f en.Current then
                // Yes - return all remaining elements without filtering
                while en.MoveNext() do
                    yield en.Current
            else
                // No - return this element and continue looping
                yield en.Current
                yield! loop() }
    loop()
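One caveat with the enumerator-based version: the enumerator is obtained once, outside the sequence expression, so the result can only be walked a single time. The sketch below is a variant (the name `removeOneRestartable` is made up for illustration) that creates the enumerator inside the expression, so every enumeration starts from the beginning; it is shown on an infinite input.

```fsharp
// Sketch: like removeOne above, but safe to enumerate more than once.
let removeOneRestartable f (s : seq<'a>) : seq<'a> =
    seq {
        // A fresh enumerator and flag for each enumeration of the result.
        use en = s.GetEnumerator()
        let searching = ref true
        // Phase 1: pass elements through until the first match, which is dropped.
        while searching.Value && en.MoveNext() do
            if f en.Current then
                searching.Value <- false
            else
                yield en.Current
        // Phase 2: pass everything else through without testing it.
        while en.MoveNext() do
            yield en.Current
    }

let naturals = Seq.initInfinite id
let withoutThree = naturals |> removeOneRestartable (fun x -> x = 3)
let firstPass = withoutThree |> Seq.take 6 |> List.ofSeq
let secondPass = withoutThree |> Seq.take 6 |> List.ofSeq
printfn "%A %A" firstPass secondPass
```

Only as many elements as `Seq.take` asks for are pulled from the input, so the infinite source is never fully evaluated.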
You can try this:
let rec removeFirstOccurrence item screened items =
    items |> function
    | h::tail -> if h = item
                 then screened @ tail
                 else tail |> removeFirstOccurrence item (screened @ [h])
    | _ -> screened // item not found: return the elements seen so far, i.e. the original list
Usage:
let updated = products |> removeFirstOccurrence product []

F#: How do i split up a sequence into a sequence of sequences

Background:
I have a sequence of contiguous, time-stamped data. The data sequence has gaps in it where the data is not contiguous. I want to create a method to split the sequence up into a sequence of sequences so that each subsequence contains contiguous data (split the input sequence at the gaps).
Constraints:
The return value must be a sequence of sequences to ensure that elements are only produced as needed (cannot use list/array/caching)
The solution must NOT be O(n^2), probably ruling out a Seq.take - Seq.skip pattern (cf. Brian's post)
Bonus points for a functionally idiomatic approach (since I want to become more proficient at functional programming), but it's not a requirement.
Method signature
let groupContiguousDataPoints (timeBetweenContiguousDataPoints : TimeSpan) (dataPointsWithHoles : seq<DateTime * float>) : (seq<seq< DateTime * float >>)= ...
On the face of it the problem looked trivial to me, but even employing Seq.pairwise, IEnumerator<_>, sequence comprehensions and yield statements, the solution eludes me. I am sure that this is because I still lack experience with combining F#-idioms, or possibly because there are some language-constructs that I have not yet been exposed to.
// Test data
let numbers = {1.0..1000.0}
let baseTime = DateTime.Now
let contiguousTimeStamps = seq { for n in numbers ->baseTime.AddMinutes(n)}
let dataWithOccationalHoles = Seq.zip contiguousTimeStamps numbers |> Seq.filter (fun (dateTime, num) -> num % 77.0 <> 0.0) // Has a gap in the data every 77 items
let timeBetweenContiguousValues = (new TimeSpan(0,1,0))
dataWithOccationalHoles |> groupContiguousDataPoints timeBetweenContiguousValues |> Seq.iteri (fun i sequence -> printfn "Group %d has %d data-points: Head: %f" i (Seq.length sequence) (snd(Seq.hd sequence)))
I think this does what you want
dataWithOccationalHoles
|> Seq.pairwise
|> Seq.map(fun ((time1,elem1),(time2,elem2)) -> if time2-time1 = timeBetweenContiguousValues then 0, ((time1,elem1),(time2,elem2)) else 1, ((time1,elem1),(time2,elem2)) )
|> Seq.scan(fun (indexres,(t1,e1),(t2,e2)) (index,((time1,elem1),(time2,elem2))) -> (index+indexres,(time1,elem1),(time2,elem2)) ) (0,(baseTime,-1.0),(baseTime,-1.0))
|> Seq.map( fun (index,(time1,elem1),(time2,elem2)) -> index,(time2,elem2) )
|> Seq.filter( fun (_,(_,elem)) -> elem <> -1.0)
|> PSeq.groupBy(fst)
|> Seq.map(snd>>Seq.map(snd))
Thanks for asking this cool question
I translated Alexey's Haskell to F#, but it's not pretty in F#, and still one element too eager.
I expect there is a better way, but I'll have to try again later.
let N = 20
let data = // produce some arbitrary data with holes
    seq {
        for x in 1..N do
            if x % 4 <> 0 && x % 7 <> 0 then
                printfn "producing %d" x
                yield x
    }
let rec GroupBy comp (input:LazyList<'a>) : LazyList<LazyList<'a>> =
    LazyList.delayed (fun () ->
        match input with
        | LazyList.Nil -> LazyList.cons (LazyList.empty()) (LazyList.empty())
        | LazyList.Cons(x,LazyList.Nil) ->
            LazyList.cons (LazyList.cons x (LazyList.empty())) (LazyList.empty())
        | LazyList.Cons(x,(LazyList.Cons(y,_) as xs)) ->
            let groups = GroupBy comp xs
            if comp x y then
                LazyList.consf
                    (LazyList.consf x (fun () ->
                        let (LazyList.Cons(firstGroup,_)) = groups
                        firstGroup))
                    (fun () ->
                        let (LazyList.Cons(_,otherGroups)) = groups
                        otherGroups)
            else
                LazyList.cons (LazyList.cons x (LazyList.empty())) groups)
let result = data |> LazyList.of_seq |> GroupBy (fun x y -> y = x + 1)
printfn "Consuming..."
for group in result do
    printfn "about to do a group"
    for x in group do
        printfn " %d" x
You seem to want a function that has signature
('a -> bool) -> seq<'a> -> seq<seq<'a>>
I.e. a function and a sequence, then break up the input sequence into a sequence of sequences based on the result of the function.
Caching the values into a collection that implements IEnumerable would likely be simplest. It is not exactly purist, but it avoids iterating the input multiple times, at the cost of losing much of the laziness of the input:
let groupBy (f : 'a -> bool) (input : seq<'a>) =
    seq {
        let cache = ref (new System.Collections.Generic.List<'a>())
        for e in input do
            (!cache).Add(e)
            // f returning false marks e as the last element of the current group
            if not (f e) then
                yield !cache
                cache := new System.Collections.Generic.List<'a>()
        if (!cache).Count > 0 then
            yield !cache
    }
An alternative implementation could pass the cache collection (as seq<'a>) to the function so that it can see multiple elements when choosing the break points.
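A minimal sketch of that alternative, under stated assumptions: the name `splitWhen` and the shape of the predicate (current group so far, next element, returns true to start a new group) are invented here for illustration. Like the cached version, it buffers each group eagerly, so it does not meet the question's laziness constraint within a group.

```fsharp
// Sketch: the predicate sees the whole current group plus the incoming element.
let splitWhen (startsNewGroup : seq<'a> -> 'a -> bool) (input : seq<'a>) : seq<'a list> =
    seq {
        let buffer = ResizeArray<'a>()
        for e in input do
            // Ask the predicate only when there is a current group to compare against.
            if buffer.Count > 0 && startsNewGroup (buffer :> seq<'a>) e then
                yield List.ofSeq buffer   // emit a copy, then reuse the buffer
                buffer.Clear()
            buffer.Add e
        if buffer.Count > 0 then
            yield List.ofSeq buffer
    }

// Break whenever the next number is more than 1 above the last one in the group.
let groups =
    [1; 2; 3; 5; 6; 9]
    |> splitWhen (fun group next -> next - Seq.last group > 1)
    |> List.ofSeq
printfn "%A" groups
```

The outer sequence is still produced on demand: a group is only built when the consumer asks for it.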
A Haskell solution, because I don't know F# syntax well, but it should be easy enough to translate:
type TimeStamp = Integer -- ticks
type TimeSpan = Integer -- difference between TimeStamps
groupContiguousDataPoints :: TimeSpan -> [(TimeStamp, a)] -> [[(TimeStamp, a)]]
There is a function groupBy :: (a -> a -> Bool) -> [a] -> [[a]] in Data.List:
The group function takes a list and returns a list of lists such that the concatenation of the result is equal to the argument. Moreover, each sublist in the result contains only equal elements. For example,
group "Mississippi" = ["M","i","ss","i","ss","i","pp","i"]
It is a special case of groupBy, which allows the programmer to supply their own equality test.
It isn't quite what we want, because it compares each element in the list with the first element of the current group, and we need to compare consecutive elements. If we had such a function groupBy1, we could write groupContiguousDataPoints easily:
groupContiguousDataPoints maxTimeDiff list = groupBy1 (\(t1, _) (t2, _) -> t2 - t1 <= maxTimeDiff) list
So let's write it!
groupBy1 :: (a -> a -> Bool) -> [a] -> [[a]]
groupBy1 _ [] = [[]]
groupBy1 _ [x] = [[x]]
groupBy1 comp (x : xs@(y : _))
  | comp x y = (x : firstGroup) : otherGroups
  | otherwise = [x] : groups
  where groups@(firstGroup : otherGroups) = groupBy1 comp xs
UPDATE: it looks like F# doesn't let you pattern match on seq, so it isn't too easy to translate after all. However, this thread on HubFS shows a way to pattern match sequences by converting them to LazyList when needed.
UPDATE2: Haskell lists are lazy and generated as needed, so they correspond to F#'s LazyList (not to seq, because the generated data is cached (and garbage collected, of course, if you no longer hold a reference to it)).
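For comparison, here is a rough F# counterpart of groupBy1 on ordinary lists, where pattern matching is available. It is only a sketch: the name `groupAdjacent` is made up, it returns no groups for empty input (the Haskell version returns one empty group), and because F# lists are eager it does not satisfy the laziness requirement from the question. It can still serve as a reference to check lazy implementations against.

```fsharp
// Sketch: group consecutive elements for which comp previous next holds.
let groupAdjacent (comp : 'a -> 'a -> bool) (items : 'a list) : 'a list list =
    // Walk from the right; x joins the group to its right when comp x y holds,
    // where y is the first element of that group.
    let folder x groups =
        match groups with
        | ((y :: _) as current) :: rest when comp x y -> (x :: current) :: rest
        | _ -> [x] :: groups
    List.foldBack folder items []

let runs = groupAdjacent (fun x y -> y = x + 1) [1; 2; 3; 5; 6; 9]
printfn "%A" runs
```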
(EDIT: This suffers from a similar problem to Brian's solution, in that iterating the outer sequence without iterating over each inner sequence will mess things up badly!)
Here's a solution that nests sequence expressions. The imperative nature of .NET's IEnumerable<T> is pretty apparent here, which makes it a bit harder to write idiomatic F# code for this problem, but hopefully it's still clear what's going on.
let groupBy cmp (sq:seq<_>) =
    let en = sq.GetEnumerator()
    let rec partitions (first:option<_>) =
        seq {
            match first with
            | Some first' ->
                (* The following value is always overwritten;
                   it represents the first element of the next subsequence to output, if any *)
                let next = ref None
                (* This function generates a subsequence to output,
                   setting next appropriately as it goes *)
                let rec iter item =
                    seq {
                        yield item
                        if (en.MoveNext()) then
                            let curr = en.Current
                            if (cmp item curr) then
                                yield! iter curr
                            else // consumed one too many - pass it on as the start of the next sequence
                                next := Some curr
                        else
                            next := None
                    }
                yield iter first' (* generate the first sequence *)
                yield! partitions !next (* recursively generate all remaining sequences *)
            | None -> () // return an empty sequence if there are no more values
        }
    let first = if en.MoveNext() then Some en.Current else None
    partitions first
let groupContiguousDataPoints (time:TimeSpan) : (seq<DateTime*_> -> _) =
    groupBy (fun (t,_) (t',_) -> t' - t <= time)
Okay, trying again. Achieving the optimal amount of laziness turns out to be a bit difficult in F#... On the bright side, this is somewhat more functional than my last attempt, in that it doesn't use any ref cells.
let groupBy cmp (sq:seq<_>) =
    let en = sq.GetEnumerator()
    let next() = if en.MoveNext() then Some en.Current else None
    (* this function returns a pair containing the first sequence and a lazy option
       indicating the first element in the next sequence (if any) *)
    let rec seqStartingWith start =
        match next() with
        | Some y when cmp start y ->
            // delay evaluation until forced - stores the rest of this sequence and the start of the next one as a pair
            let rest_next = lazy seqStartingWith y
            seq { yield start; yield! fst (Lazy.force rest_next) },
            lazy Lazy.force (snd (Lazy.force rest_next))
        | next -> seq { yield start }, lazy next
    let rec iter start =
        seq {
            match (Lazy.force start) with
            | None -> ()
            | Some start ->
                let (first,next) = seqStartingWith start
                yield first
                yield! iter next
        }
    Seq.cache (iter (lazy next()))
Below is some code that does what I think you want. It is not idiomatic F#.
(It may be similar to Brian's answer, though I can't tell because I'm not familiar with the LazyList semantics.)
But it doesn't exactly match your test specification: Seq.length enumerates its entire input. Your "test code" calls Seq.length and then calls Seq.hd. That will generate an enumerator twice, and since there is no caching, things get messed up. I'm not sure if there is any clean way to allow multiple enumerators without caching. Frankly, seq<seq<'a>> may not be the best data structure for this problem.
Anyway, here's the code:
type State<'a> = Unstarted | InnerOkay of 'a | NeedNewInner of 'a | Finished
// f() = true means the neighbors should be kept together
// f() = false means they should be split
let split_up (f : 'a -> 'a -> bool) (input : seq<'a>) =
    // simple unfold that assumes f captured a mutable variable
    let iter f = Seq.unfold (fun _ ->
        match f() with
        | Some(x) -> Some(x,())
        | None -> None) ()
    seq {
        let state = ref (Unstarted)
        use ie = input.GetEnumerator()
        let innerMoveNext() =
            match !state with
            | Unstarted ->
                if ie.MoveNext()
                then let cur = ie.Current
                     state := InnerOkay(cur); Some(cur)
                else state := Finished; None
            | InnerOkay(last) ->
                if ie.MoveNext()
                then let cur = ie.Current
                     if f last cur
                     then state := InnerOkay(cur); Some(cur)
                     else state := NeedNewInner(cur); None
                else state := Finished; None
            | NeedNewInner(last) -> state := InnerOkay(last); Some(last)
            | Finished -> None
        let outerMoveNext() =
            match !state with
            | Unstarted | NeedNewInner(_) -> Some(iter innerMoveNext)
            | InnerOkay(_) -> failwith "Move to next inner seq when current is active: undefined behavior."
            | Finished -> None
        yield! iter outerMoveNext }
open System
let groupContigs (contigTime : TimeSpan) (holey : seq<DateTime * int>) =
    split_up (fun (t1,_) (t2,_) -> (t2 - t1) <= contigTime) holey
// Test data
let numbers = {1 .. 15}
let contiguousTimeStamps =
    let baseTime = DateTime.Now
    seq { for n in numbers -> baseTime.AddMinutes(float n)}
let holeyData =
    Seq.zip contiguousTimeStamps numbers
    |> Seq.filter (fun (dateTime, num) -> num % 7 <> 0)
let grouped_data = groupContigs (new TimeSpan(0,1,0)) holeyData
printfn "Consuming..."
for group in grouped_data do
    printfn "about to do a group"
    for x in group do
        printfn " %A" x
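The double-enumeration problem described before the code (taking the length of a group and then its head) can be reproduced in isolation. The sketch below uses a deliberately single-pass sequence; `makeOnePass` is a made-up helper standing in for an inner group that shares one underlying enumerator, not part of the answer's code.

```fsharp
// A deliberately single-pass sequence: every enumeration shares one enumerator.
let makeOnePass (source : seq<'a>) : seq<'a> =
    let en = source.GetEnumerator()
    seq { while en.MoveNext() do yield en.Current }

let group = makeOnePass [10; 20; 30]
let firstLength = Seq.length group    // walks the shared enumerator to the end
let secondLength = Seq.length group   // the enumerator is already exhausted

// Materialising the group once makes repeated queries safe.
let stable = makeOnePass [10; 20; 30] |> List.ofSeq
printfn "%d %d %d %d" firstLength secondLength (List.length stable) (List.head stable)
```

Copying each inner group into a list as soon as it is reached trades away laziness inside the group, which is the tension the answer points out.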
Ok, here's an answer I'm not unhappy with.
(EDIT: I am unhappy - it's wrong! No time to try to fix right now though.)
It uses a bit of imperative state, but it is not too difficult to follow (provided you recall that '!' is the F# dereference operator, and not 'not'). It is as lazy as possible, and takes a seq as input and returns a seq of seqs as output.
let N = 20
let data = // produce some arbitrary data with holes
    seq {
        for x in 1..N do
            if x % 4 <> 0 && x % 7 <> 0 then
                printfn "producing %d" x
                yield x
    }
let rec GroupBy comp (input:seq<_>) = seq {
    let doneWithThisGroup = ref false
    let areMore = ref true
    use e = input.GetEnumerator()
    let Next() = areMore := e.MoveNext(); !areMore
    // deal with length 0 or 1, seed 'prev'
    if not(e.MoveNext()) then () else
    let prev = ref e.Current
    while !areMore do
        yield seq {
            while not(!doneWithThisGroup) do
                if Next() then
                    let next = e.Current
                    doneWithThisGroup := not(comp !prev next)
                    yield !prev
                    prev := next
                else
                    // end of list, yield final value
                    yield !prev
                    doneWithThisGroup := true }
        doneWithThisGroup := false }
let result = data |> GroupBy (fun x y -> y = x + 1)
printfn "Consuming..."
for group in result do
    printfn "about to do a group"
    for x in group do
        printfn " %d" x

Resources