F# sequences with BigInteger indices - f#

I am looking for a type similar to sequences in F# where indices could be big integers, rather that being restricted to int. Does there exist anything like this?
By "big integer indices" I mean a type which allows for something equivalent to that:
let s = Seq.initInfinite (fun i -> i + 10I)

The following will generate an infinite series of bigints:
let s = Seq.initInfinite (fun i -> bigint i + 10I)
What i suspect you actually want though is a Map<'Key, 'Value>.
This lets you efficiently use a bigint as an index to look up whatever value it is you care about:
let map =
seq {
1I, "one"
2I, "two"
3I, "three"
}
|> Map.ofSeq
// val map : Map<System.Numerics.BigInteger,string> =
// map [(1, "one"); (2, "two"); (3, "three")]
map.TryFind 1I |> (printfn "%A") // Some "one"
map.TryFind 4I |> (printfn "%A") // None

The equivalent of initInfinite for BigIntegers would be
let inf = Seq.unfold (fun i -> let n = i + bigint.One in Some(n, n)) bigint.Zero
let biggerThanAnInt = inf |> Seq.skip (Int32.MaxValue) |> Seq.head // 2147483648
which takes ~2 min to run on my machine.
However, I doubt this is of any practical use :-) That is unless you start at some known value > Int32.MaxValue and stop reasonably soon (generating less than Int32.MaxValue items), which then could be solved by offsetting the BigInt indexes into the Int32 domain.
Theoretically you could amend the Seq module with functions working with BigIntegers to skip / window / ... an amount of items > Int32.MaxValue (e.g. by repeatedly performing the corresponding Int32 variant)

Since you want to index into a sequence, I assume you want a version of Seq.item that takes a BigInteger as index. There's nothing like that built into F#, but it's easy to define your own:
open System.Numerics
module Seq =
let itemI (index : BigInteger) source =
source |> Seq.item (int index)
Note that no new type is needed unless you're planning to create sequences that are longer than 2,147,483,647 items, which would probably not be practical anyway.
Usage:
let items = [| "moo"; "baa"; "oink" |]
items
|> Seq.itemI 2I
|> printfn "%A" // output: "oink"

Related

Trying to filter out values in a sequence that are not in another sequence

I am trying to filter out values from a sequence, that are not in another sequence. I was pretty sure my code worked, but it is taking a long time to run on my computer and because of this I am not sure, so I am here to see what the community thinks.
Code is below:
let statezip =
StateCsv.GetSample().Rows
|> Seq.map (fun row -> row.State)
|> Seq.distinct
type State = State of string
let unwrapstate (State s) = s
let neededstates (row:StateCsv) = Seq.contains (unwrapstate row.State) statezip
I am filtering by the neededstates function. Is there something wrong with the way I am doing this?
let datafilter =
StateCsv1.GetSample().Rows
|> Seq.map (fun row -> row.State,row.Income,row.Family)
|> Seq.filter neededstates
|> List.ofSeq
I believe that it should filter the sequence by the values that are true, since neededstates function is a bool. StateCsv and StateCsv1 have the same exact structure, although from different years.
Evaluation of contains on sequences and lists can be slow. For a case where you want to check for the existence of an element in a collection, the F# Set type is ideal. You can convert your sequences to sets using Set.ofSeq, and then run the logic over the sets instead. The following example uses the numbers from 1 to 10000 and then uses both sequences and sets to filter the result to only the odd numbers by checking that the values are not in a collection of even numbers.
Using Sequences:
let numberSeq = {0..10000}
let evenNumberSeq = seq { for n in numberSeq do if (n % 2 = 0) then yield n }
#time
numberSeq |> Seq.filter (fun n -> evenNumberSeq |> Seq.contains n |> not) |> Seq.toList
#time
This runs in about 1.9 seconds for me.
Using sets:
let numberSet = numberSeq |> Set.ofSeq
let evenNumberSet = evenNumberSeq |> Set.ofSeq
#time
numberSet |> Set.filter (fun n -> evenNumberSet |> Set.contains n |> not)
#time
This runs in only 0.005 seconds. Hopefully you can materialize your sequences to sets before performing your contains operation, thereby getting this level of speedup.

Sequence of incorrect length generated by function

Why is the following function returning a sequence of incorrect length when the repl variable is set to false?
open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra
let sample (data : seq<float>) (size : int) (repl : bool) =
let n = data |> Seq.length
// without replacement
let rec generateIndex idx =
let m = size - Seq.length(idx)
match m > 0 with
| true ->
let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
let idx = (Seq.append idx newIdx) |> Seq.distinct
generateIndex idx
| false ->
idx
let sample =
match repl with
| true ->
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
| false ->
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
sample
Running the function...
let requested = 1000
let dat = Normal.Samples(0., 1.) |> Seq.take 10000
let resultlen = sample dat requested false |> Seq.length
printfn "requested -> %A\nreturned -> %A" requested resultlen
Resulting lengths are wrong.
>
requested -> 1000
returned -> 998
>
requested -> 1000
returned -> 1001
>
requested -> 1000
returned -> 997
Any idea what mistake I'm making?
First, there's a comment I want to make about coding style. Then I'll get to the explanation of why your sequences are coming back with different lengths.
In the comments, I mentioned replacing match (bool) with true -> ... | false -> ... with a simple if ... then ... else expression, but there's another coding style that you're using that I think could be improved. You wrote:
let sample (various_parameters) = // This is a function
// Other code ...
let sample = some_calculation // This is a variable
sample // Return the variable
While F# allows you to reuse names like that, and the name inside the function will "shadow" the name outside the function, it's generally a bad idea for the reused name to have a totally different type than the original name. In other words, this can be a good idea:
let f (a : float option) =
let a = match a with
| None -> 0.0
| Some value -> value
// Now proceed, knowing that `a` has a real value even if had been None before
Or, because the above is exactly what F# gives you defaultArg for:
let f (a : float option) =
let a = defaultArg a 0.0
// This does exactly the same thing as the previous snippet
Here, we are making the name a inside our function refer to a different type than the parameter named a: the parameter was a float option, and the a inside our function is a float. But they're sort of the "same" type -- that is, there's very little mental difference between "The caller may have specified a floating-point value or they may not" and "Now I definitely have a floating-point value". But there's a very large mental gap between "The name sample is a function that takes three parameters" and "The name sample is a sequence of floats". I strongly recommend using a name like result for the value you're going to return from your function, rather than re-using the function name.
Also, this seems unnecessarily verbose:
let result =
match repl with
| true ->
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
| false ->
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
result
Anytime I find myself writing "let result = (something) ; result" at the end of my function, I usually just want to replace that whole code block with just the (something). I.e., the above snippet could just become:
match repl with
| true ->
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
| false ->
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
Which in turn can be replaced with an if...then...else expression:
if repl then
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
else
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
And that's the last expression in your code. In other words, I would probably rewrite your function as follows (changing ONLY the style, and making no changes to the logic):
open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra
let sample (data : seq<float>) (size : int) (repl : bool) =
let n = data |> Seq.length
// without replacement
let rec generateIndex idx =
let m = size - Seq.length(idx)
if m > 0 then
let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
let idx = (Seq.append idx newIdx) |> Seq.distinct
generateIndex idx
else
idx
if repl then
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
else
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
If I can figure out why your sequences have the wrong length, I'll update this answer with that information as well.
UPDATE: Okay, I think I see what's happening in your generateIndex function that's giving you unexpected results. There are two things tripping you up: one is sequence laziness, and the other is randomness.
I copied your generateIndex function into VS Code and added some printfn statements to look at what's going on. First, the code I ran, and then the results:
let rec generateIndex n size idx =
let m = size - Seq.length(idx)
printfn "m = %d" m
match m > 0 with
| true ->
let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
printfn "Generating newIdx as %A" (List.ofSeq newIdx)
let idx = (Seq.append idx newIdx) |> Seq.distinct
printfn "Now idx is %A" (List.ofSeq idx)
generateIndex n size idx
| false ->
printfn "Done, returning %A" (List.ofSeq idx)
idx
All those List.ofSeq idx calls are so that F# Interactive would print more than four items of the seq when I print it out (by default, if you try to print a seq with %A, it will only print out four values and then print an ellipsis if there are more values available in the seq). Also, I turned n and size into parameters (that I don't change between calls) so that I could test it easily. I then called it as generateIndex 100 5 (seq []) and got the following result:
m = 5
Generating newIdx as [74; 76; 97; 78; 31]
Now idx is [68; 28; 65; 58; 82]
m = 0
Done, returning [37; 58; 24; 48; 49]
val it : seq<int> = seq [12; 69; 97; 38; ...]
See how the numbers keep changing? That was my first clue that something was up. See, seqs are lazy. They don't evaluate their contents until they have to. You shouldn't think of a seq as a list of numbers. Instead, think of it as a generator that will, when asked for numbers, produce them according to some rule. In your case, the rule is "Choose random integers between 0 and n-1, then take m of those numbers". And the other thing about seqs is that they do not cache their contents (although there's a Seq.cache function available that will cache their contents). Therefore, if you have a seq based on a random number generator, its results will be different each time, as you can see in my output. When I printed out newIdx, it printed out as [74; 76; 97; 78; 31], but when I appended it to an empty seq, the result printed out as [68; 28; 65; 58; 82].
Why this difference? Because Seq.append does not force evaluation. It simply creates a new seq whose rule is "take all items from the first seq, then when that one exhausts, take all items from the second seq. And when that one exhausts, end." And Seq.distinct does not force evaluation either; it simply creates a new seq whose rule is "take the items from the seq handed to you, and start handing them out when asked. But memorize them as you go, and if you've handed one of them out before, don't hand it out again." So what you are passing around between your calls to generateIdx is an object that, when evaluated, will pick a set of random numbers between 0 and n-1 (in my simple case, between 0 and 100) and then reduce that set down to a distinct set of numbers.
Now, here's the thing. Every time you evaluate that seq, it will start from the beginning: first calling DiscreteUniform.Samples(0, n-1) to generate an infinite stream of random numbers, then selecting m numbers from that stream, then throwing out any duplicates. (I'm ignoring the Seq.append for now, because it would create unnecessary mental complexity and it isn't really part of the bug anyway). Now, at the start of each go-round of your function, you check the length of the sequence, which does cause it to be evaluated. That means that it selects (in the case of my sample code) 5 random numbers between 0 and 99, then makes sure that they're all distinct. If they are all distinct, then m = 0 and the function will exit, returning... not the list of numbers, but the seq object. And when that seq object is evaluated, it will start over from the beginning, choosing a different set of 5 random numbers and then throwing out any duplicates. Therefore, there's still a chance that at least one of that set of 5 numbers will end up being a duplicate, because the sequence whose length was tested (which we know contained no duplicates, otherwise m would have been greater than 0) was not the sequence that was returned. The sequence that was returned has a 1.0 * 0.99 * 0.98 * 0.97 * 0.96 chance of not containing any duplicates, which comes to about 0.9035. So there's a just-under-10% chance that even though you checked Seq.length and it was 5, the length of the returned seq ends up being 4 after all -- because it was choosing a different set of random numbers than the one you checked.
To prove this, I ran the function again, this time only picking 4 numbers so that the result would be completely shown at the F# Interactive prompt. And my run of generateIndex 100 4 (seq []) produced the following output:
m = 4
Generating newIdx as [36; 63; 97; 31]
Now idx is [39; 93; 53; 94]
m = 0
Done, returning [47; 94; 34]
val it : seq<int> = seq [48; 24; 14; 68]
Notice how when I printed "Done, returning (value of idx)", it had only 3 values? Even though it eventually returned 4 values (because it picked a different selection of random numbers for the actual result, and that selection had no duplicates), that demonstrated the problem.
By the way, there's one other problem with your function, which is that it's far slower than it needs to be. The function Seq.item, in some circumstances, has to run through the sequence from the beginning in order to pick the nth item of the sequence. It would be far better to store your data in an array at the start of your function (let arrData = data |> Array.ofSeq), then replace
|> Seq.map (fun index -> Seq.item index data)
with
|> Seq.map (fun index -> arrData.[index])
Array lookups are done in constant time, so that takes your sample function down from O(N^2) to O(N).
TL;DR: Use Seq.distinct before you take m values from it and the bug will go away. You can just replace your entire generateIdx function with a simple DiscreteUniform.Samples(0, n-1) |> Seq.distinct |> Seq.take size. (And use an array for your data lookups so that your function will run faster). In other words, here's the final almost-final version of how I would rewrite your code:
let sample (data : seq<float>) (size : int) (repl : bool) =
let arrData = data |> Array.ofSeq
let n = arrData |> Array.length
if repl then
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])
else
DiscreteUniform.Samples(0, n-1)
|> Seq.distinct
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])
That's it! Simple, easy to understand, and (as far as I can tell) bug-free.
Edit: ... but not completely DRY, because there's still a bit of repeated code in that "final" version. (Credit to CaringDev for pointing it out in the comments below). The Seq.take size |> Seq.map is repeated in both branches of the if expression, so there's a way to simplify that expression. We could do this:
let randomIndices =
if repl then
DiscreteUniform.Samples(0, n-1)
else
DiscreteUniform.Samples(0, n-1) |> Seq.distinct
randomIndices
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])
So here's a truly-final version of my suggestion:
let sample (data : seq<float>) (size : int) (repl : bool) =
let arrData = data |> Array.ofSeq
let n = arrData |> Array.length
let randomIndices =
if repl then
DiscreteUniform.Samples(0, n-1)
else
DiscreteUniform.Samples(0, n-1) |> Seq.distinct
randomIndices
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])

Does Seq.groupBy preserve order within groups?

I want to group a sequence and then take the first occurrence of each element in the group. When I try this
Seq.groupBy f inSeq
|> Seq.map (fun (k,s) -> (k,s|>Seq.take 1|>Seq.exactlyOne))
I find that sometimes I get a different element from s. Is this expected?
Looking at the source of the groupBy implementation -
here's the relevant bit:
// Build the groupings
seq |> iter (fun v ->
let safeKey = keyf v
let mutable prev = Unchecked.defaultof<_>
match dict.TryGetValue (safeKey, &prev) with
| true -> prev.Add v
| false ->
let prev = ResizeArray ()
dict.[safeKey] <- prev
prev.Add v)
It iterates through the source array and adds the values to the corresponding list for the key. The order of subsequences is directly affected by the order of the input sequence. For the same input sequence, we can expect groupBy to return identical output sequences. This is how tests are coded for groupBy.
If you're seeing variations in the resulting sequences, check the input sequence.
Yes, this is expected. Sequences (seq) aren't guaranteed to be pure. You can define a sequence that will yield different values every time you iterate over them. If you call Seq.take 1 twice, you can get different results.
Consider, as an example, this sequence:
open System
let r = Random ()
let s = seq { yield r.Next(0, 9) }
If you call Seq.take 1 on that, you may get different results:
> s |> Seq.take 1;;
val it : seq<int> = seq [4]
> s |> Seq.take 1;;
val it : seq<int> = seq [1]
Using Seq.head isn't going to help you either:
> s |> Seq.head;;
val it : int = 2
> s |> Seq.head;;
val it : int = 6
If you want to guarantee deterministic behaviour, use a List instead.

How do I do in F# what would be called compression in APL?

In APL one can use a bit vector to select out elements of another vector; this is called compression. For example 1 0 1/3 5 7 would yield 3 7.
Is there a accepted term for this in functional programming in general and F# in particular?
Here is my F# program:
let list1 = [|"Bob"; "Mary"; "Sue"|]
let list2 = [|1; 0; 1|]
[<EntryPoint>]
let main argv =
0 // return an integer exit code
What I would like to do is compute a new string[] which would be [|"Bob"; Sue"|]
How would one do this in F#?
Array.zip list1 list2 // [|("Bob",1); ("Mary",0); ("Sue",1)|]
|> Array.filter (fun (_,x) -> x = 1) // [|("Bob", 1); ("Sue", 1)|]
|> Array.map fst // [|"Bob"; "Sue"|]
The pipe operator |> does function application syntactically reversed, i.e., x |> f is equivalent to f x. As mentioned in another answer, replace Array with Seq to avoid the construction of intermediate arrays.
I expect you'll find many APL primitives missing from F#. For lists and sequences, many can be constructed by stringing together primitives from the Seq, Array, or List modules, like the above. For reference, here is an overview of the Seq module.
I think the easiest is to use an array sequence expression, something like this:
let compress bits values =
[|
for i = 0 to bits.Length - 1 do
if bits.[i] = 1 then
yield values.[i]
|]
If you only want to use combinators, this is what I would do:
Seq.zip bits values
|> Seq.choose (fun (bit, value) ->
if bit = 1 then Some value else None)
|> Array.ofSeq
I use Seq functions instead of Array in order to avoid building intermediary arrays, but it would be correct too.
One might say this is more idiomatic:
Seq.map2 (fun l1 l2 -> if l2 = 1 then Some(l1) else None) list1 list2
|> Seq.choose id
|> Seq.toArray
EDIT (for the pipe lovers)
(list1, list2)
||> Seq.map2 (fun l1 l2 -> if l2 = 1 then Some(l1) else None)
|> Seq.choose id
|> Seq.toArray
Søren Debois' solution is good but, as he pointed out, but we can do better. Let's define a function, based on Søren's code:
let compressArray vals idx =
Array.zip vals idx
|> Array.filter (fun (_, x) -> x = 1)
|> Array.map fst
compressArray ends up creating a new array in each of the 3 lines. This can take some time, if the input arrays are long (1.4 seconds for 10M values in my quick test).
We can save some time by working on sequences and creating an array at the end only:
let compressSeq vals idx =
Seq.zip vals idx
|> Seq.filter (fun (_, x) -> x = 1)
|> Seq.map fst
This function is generic and will work on arrays, lists, etc. To generate an array as output:
compressSeq sq idx |> Seq.toArray
The latter saves about 40% of computation time (0.8s in my test).
As ildjarn commented, the function argument to filter can be rewritten to snd >> (=) 1, although that causes a slight performance drop (< 10%), probably because of the extra function call that is generated.

Return value in F# - incomplete construct

I've trying to learn F#. I'm a complete beginner, so this might be a walkover for you guys :)
I have the following function:
let removeEven l =
let n = List.length l;
let list_ = [];
let seq_ = seq { for x in 1..n do if x % 2 <> 0 then yield List.nth l (x-1)}
for x in seq_ do
let list_ = list_ # [x];
list_;
It takes a list, and return a new list containing all the numbers, which is placed at an odd index in the original list, so removeEven [x1;x2;x3] = [x1;x3]
However, I get my already favourite error-message: Incomplete construct at or before this point in expression...
If I add a print to the end of the line, instead of list_:
...
print_any list_;
the problem is fixed. But I do not want to print the list, I want to return it!
What causes this? Why can't I return my list?
To answer your question first, the compiler complains because there is a problem inside the for loop. In F#, let serves to declare values (that are immutable and cannot be changed later in the program). It isn't a statement as in C# - let can be only used as part of another expression. For example:
let n = 10
n + n
Actually means that you want the n symbol to refer to the value 10 in the expression n + n. The problem with your code is that you're using let without any expression (probably because you want to use mutable variables):
for x in seq_ do
let list_ = list_ # [x] // This isn't assignment!
list_
The problematic line is an incomplete expression - using let in this way isn't allowed, because it doesn't contain any expression (the list_ value will not be accessed from any code). You can use mutable variable to correct your code:
let mutable list_ = [] // declared as 'mutable'
let seq_ = seq { for x in 1..n do if x % 2 <> 0 then yield List.nth l (x-1)}
for x in seq_ do
list_ <- list_ # [x] // assignment using '<-'
Now, this should work, but it isn't really functional, because you're using imperative mutation. Moreover, appending elements using # is really inefficient thing to do in functional languages. So, if you want to make your code functional, you'll probably need to use different approach. Both of the other answers show a great approach, although I prefer the example by Joel, because indexing into a list (in the solution by Chaos) also isn't very functional (there is no pointer arithmetic, so it will be also slower).
Probably the most classical functional solution would be to use the List.fold function, which aggregates all elements of the list into a single result, walking from the left to the right:
[1;2;3;4;5]
|> List.fold (fun (flag, res) el ->
if flag then (not flag, el::res) else (not flag, res)) (true, [])
|> snd |> List.rev
Here, the state used during the aggregation is a Boolean flag specifying whether to include the next element (during each step, we flip the flag by returning not flag). The second element is the list aggregated so far (we add element by el::res only when the flag is set. After fold returns, we use snd to get the second element of the tuple (the aggregated list) and reverse it using List.rev, because it was collected in the reversed order (this is more efficient than appending to the end using res#[el]).
Edit: If I understand your requirements correctly, here's a version of your function done functional rather than imperative style, that removes elements with odd indexes.
let removeEven list =
list
|> Seq.mapi (fun i x -> (i, x))
|> Seq.filter (fun (i, x) -> i % 2 = 0)
|> Seq.map snd
|> List.ofSeq
> removeEven ['a'; 'b'; 'c'; 'd'];;
val it : char list = ['a'; 'c']
I think this is what you are looking for.
let removeEven list =
let maxIndex = (List.length list) - 1;
seq { for i in 0..2..maxIndex -> list.[i] }
|> Seq.toList
Tests
val removeEven : 'a list -> 'a list
> removeEven [1;2;3;4;5;6];;
val it : int list = [1; 3; 5]
> removeEven [1;2;3;4;5];;
val it : int list = [1; 3; 5]
> removeEven [1;2;3;4];;
val it : int list = [1; 3]
> removeEven [1;2;3];;
val it : int list = [1; 3]
> removeEven [1;2];;
val it : int list = [1]
> removeEven [1];;
val it : int list = [1]
You can try a pattern-matching approach. I haven't used F# in a while and I can't test things right now, but it would be something like this:
let rec curse sofar ls =
match ls with
| even :: odd :: tl -> curse (even :: sofar) tl
| even :: [] -> curse (even :: sofar) []
| [] -> List.rev sofar
curse [] [ 1; 2; 3; 4; 5 ]
This recursively picks off the even elements. I think. I would probably use Joel Mueller's approach though. I don't remember if there is an index-based filter function, but that would probably be the ideal to use, or to make if it doesn't exist in the libraries.
But in general lists aren't really meant as index-type things. That's what arrays are for. If you consider what kind of algorithm would require a list having its even elements removed, maybe it's possible that in the steps prior to this requirement, the elements can be paired up in tuples, like this:
[ (1,2); (3,4) ]
That would make it trivial to get the even-"indexed" elements out:
thelist |> List.map fst // take first element from each tuple
There's a variety of options if the input list isn't guaranteed to have an even number of elements.
Yet another alternative, which (by my reckoning) is slightly slower than Joel's, but it's shorter :)
let removeEven list =
list
|> Seq.mapi (fun i x -> (i, x))
|> Seq.choose (fun (i,x) -> if i % 2 = 0 then Some(x) else None)
|> List.ofSeq

Resources