Sampling in F# : is Set adequate?

Sampling in F# : is Set adequate? - f#

I have an array of items, from which I'd like to sample.
I was under the impression that a Set would the a good structure to sample from, in a fold where I'd give back the original or a modified set with the retrieved element missing depending if I want replacement of not.
However, there seems to no method to retrieve an element directly from a Set.
Is there something I am missing ? or should I use Set of indices, along with a surrogate function that starts at some random position < Set.count and goes up until it finds a member ?
That is, something along this line
module Seq =
let modulo (n:int) start =
let rec next i = seq { yield (i + 1)%n ; yield! next (i+1)}
next start
module Array =
let Sample (withReplacement:bool) seed (entries:'T array) =
let prng, indexes = new Random(seed), Set(Seq.init (entries |> Array.length) id)
Seq.unfold (fun set -> let N = set |> Set.count
let next = Seq.modulo N (prng.Next(N)) |> Seq.truncate N |> Seq.tryFind(fun i -> set |> Set.exists ((=) i))
if next.IsSome then
Some(entries.[next.Value], if withReplacement then set else Set.remove next.Value set)
else
None)
Edit : Tracking positively what I gave, instead of tracking what I still can give would make it simpler and more efficient.

For sampling without replacement, you could just permute the source seq and take however many elements you want to sample
let sampleWithoutReplacement n s =
let a = Array.ofSeq s
seq { for i = a.Length downto 1 do
let j = rnd.Next i
yield a.[j]
a.[j] <- a.[i - 1] }
|> Seq.take n
To sample with replacement, just pick a random element n times from the source seq
let sampleWithReplacement n s =
let a = Array.ofSeq s
Seq.init n (fun _ -> a.[rnd.Next(a.Length)])
These may not be the most efficient methods with huge data sets however

Continuing our comments...if you want to randomly sample a sequence without slurping the entire thing into memory you could generate a set of random indices the size of your desired sample (not too different from what you already have):
let rand count max =
System.Random()
|> Seq.unfold (fun r -> Some(r.Next(max), r))
|> Seq.distinct
|> Seq.take count
|> set
let takeSample sampleSize inputSize input =
let indices = rand sampleSize inputSize
input
|> Seq.mapi (fun idx x ->
if Set.contains idx indices then Some x else None)
|> Seq.choose id
let inputSize = 100000
let input = Seq.init inputSize id
let sample = takeSample 50 inputSize input
printfn "%A" (Seq.toList sample)

Related

Sequence of incorrect length generated by function

Why is the following function returning a sequence of incorrect length when the repl variable is set to false?
open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra
let sample (data : seq<float>) (size : int) (repl : bool) =
let n = data |> Seq.length
// without replacement
let rec generateIndex idx =
let m = size - Seq.length(idx)
match m > 0 with
| true ->
let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
let idx = (Seq.append idx newIdx) |> Seq.distinct
generateIndex idx
| false ->
idx
let sample =
match repl with
| true ->
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
| false ->
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
sample
Running the function...
let requested = 1000
let dat = Normal.Samples(0., 1.) |> Seq.take 10000
let resultlen = sample dat requested false |> Seq.length
printfn "requested -> %A\nreturned -> %A" requested resultlen
Resulting lengths are wrong.
>
requested -> 1000
returned -> 998
>
requested -> 1000
returned -> 1001
>
requested -> 1000
returned -> 997
Any idea what mistake I'm making?

First, there's a comment I want to make about coding style. Then I'll get to the explanation of why your sequences are coming back with different lengths.
In the comments, I mentioned replacing match (bool) with true -> ... | false -> ... with a simple if ... then ... else expression, but there's another coding style that you're using that I think could be improved. You wrote:
let sample (various_parameters) = // This is a function
// Other code ...
let sample = some_calculation // This is a variable
sample // Return the variable
While F# allows you to reuse names like that, and the name inside the function will "shadow" the name outside the function, it's generally a bad idea for the reused name to have a totally different type than the original name. In other words, this can be a good idea:
let f (a : float option) =
let a = match a with
| None -> 0.0
| Some value -> value
// Now proceed, knowing that `a` has a real value even if had been None before
Or, because the above is exactly what F# gives you defaultArg for:
let f (a : float option) =
let a = defaultArg a 0.0
// This does exactly the same thing as the previous snippet
Here, we are making the name a inside our function refer to a different type than the parameter named a: the parameter was a float option, and the a inside our function is a float. But they're sort of the "same" type -- that is, there's very little mental difference between "The caller may have specified a floating-point value or they may not" and "Now I definitely have a floating-point value". But there's a very large mental gap between "The name sample is a function that takes three parameters" and "The name sample is a sequence of floats". I strongly recommend using a name like result for the value you're going to return from your function, rather than re-using the function name.
Also, this seems unnecessarily verbose:
let result =
match repl with
| true ->
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
| false ->
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
result
Anytime I find myself writing "let result = (something) ; result" at the end of my function, I usually just want to replace that whole code block with just the (something). I.e., the above snippet could just become:
match repl with
| true ->
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
| false ->
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
Which in turn can be replaced with an if...then...else expression:
if repl then
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
else
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
And that's the last expression in your code. In other words, I would probably rewrite your function as follows (changing ONLY the style, and making no changes to the logic):
open MathNet.Numerics.Distributions
open MathNet.Numerics.LinearAlgebra
let sample (data : seq<float>) (size : int) (repl : bool) =
let n = data |> Seq.length
// without replacement
let rec generateIndex idx =
let m = size - Seq.length(idx)
if m > 0 then
let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
let idx = (Seq.append idx newIdx) |> Seq.distinct
generateIndex idx
else
idx
if repl then
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> Seq.item index data)
else
generateIndex (seq [])
|> Seq.map (fun index -> Seq.item index data)
If I can figure out why your sequences have the wrong length, I'll update this answer with that information as well.
UPDATE: Okay, I think I see what's happening in your generateIndex function that's giving you unexpected results. There are two things tripping you up: one is sequence laziness, and the other is randomness.
I copied your generateIndex function into VS Code and added some printfn statements to look at what's going on. First, the code I ran, and then the results:
let rec generateIndex n size idx =
let m = size - Seq.length(idx)
printfn "m = %d" m
match m > 0 with
| true ->
let newIdx = DiscreteUniform.Samples(0, n-1) |> Seq.take m
printfn "Generating newIdx as %A" (List.ofSeq newIdx)
let idx = (Seq.append idx newIdx) |> Seq.distinct
printfn "Now idx is %A" (List.ofSeq idx)
generateIndex n size idx
| false ->
printfn "Done, returning %A" (List.ofSeq idx)
idx
All those List.ofSeq idx calls are so that F# Interactive would print more than four items of the seq when I print it out (by default, if you try to print a seq with %A, it will only print out four values and then print an ellipsis if there are more values available in the seq). Also, I turned n and size into parameters (that I don't change between calls) so that I could test it easily. I then called it as generateIndex 100 5 (seq []) and got the following result:
m = 5
Generating newIdx as [74; 76; 97; 78; 31]
Now idx is [68; 28; 65; 58; 82]
m = 0
Done, returning [37; 58; 24; 48; 49]
val it : seq<int> = seq [12; 69; 97; 38; ...]
See how the numbers keep changing? That was my first clue that something was up. See, seqs are lazy. They don't evaluate their contents until they have to. You shouldn't think of a seq as a list of numbers. Instead, think of it as a generator that will, when asked for numbers, produce them according to some rule. In your case, the rule is "Choose random integers between 0 and n-1, then take m of those numbers". And the other thing about seqs is that they do not cache their contents (although there's a Seq.cache function available that will cache their contents). Therefore, if you have a seq based on a random number generator, its results will be different each time, as you can see in my output. When I printed out newIdx, it printed out as [74; 76; 97; 78; 31], but when I appended it to an empty seq, the result printed out as [68; 28; 65; 58; 82].
Why this difference? Because Seq.append does not force evaluation. It simply creates a new seq whose rule is "take all items from the first seq, then when that one exhausts, take all items from the second seq. And when that one exhausts, end." And Seq.distinct does not force evaluation either; it simply creates a new seq whose rule is "take the items from the seq handed to you, and start handing them out when asked. But memorize them as you go, and if you've handed one of them out before, don't hand it out again." So what you are passing around between your calls to generateIdx is an object that, when evaluated, will pick a set of random numbers between 0 and n-1 (in my simple case, between 0 and 100) and then reduce that set down to a distinct set of numbers.
Now, here's the thing. Every time you evaluate that seq, it will start from the beginning: first calling DiscreteUniform.Samples(0, n-1) to generate an infinite stream of random numbers, then selecting m numbers from that stream, then throwing out any duplicates. (I'm ignoring the Seq.append for now, because it would create unnecessary mental complexity and it isn't really part of the bug anyway). Now, at the start of each go-round of your function, you check the length of the sequence, which does cause it to be evaluated. That means that it selects (in the case of my sample code) 5 random numbers between 0 and 99, then makes sure that they're all distinct. If they are all distinct, then m = 0 and the function will exit, returning... not the list of numbers, but the seq object. And when that seq object is evaluated, it will start over from the beginning, choosing a different set of 5 random numbers and then throwing out any duplicates. Therefore, there's still a chance that at least one of that set of 5 numbers will end up being a duplicate, because the sequence whose length was tested (which we know contained no duplicates, otherwise m would have been greater than 0) was not the sequence that was returned. The sequence that was returned has a 1.0 * 0.99 * 0.98 * 0.97 * 0.96 chance of not containing any duplicates, which comes to about 0.9035. So there's a just-under-10% chance that even though you checked Seq.length and it was 5, the length of the returned seq ends up being 4 after all -- because it was choosing a different set of random numbers than the one you checked.
To prove this, I ran the function again, this time only picking 4 numbers so that the result would be completely shown at the F# Interactive prompt. And my run of generateIndex 100 4 (seq []) produced the following output:
m = 4
Generating newIdx as [36; 63; 97; 31]
Now idx is [39; 93; 53; 94]
m = 0
Done, returning [47; 94; 34]
val it : seq<int> = seq [48; 24; 14; 68]
Notice how when I printed "Done, returning (value of idx)", it had only 3 values? Even though it eventually returned 4 values (because it picked a different selection of random numbers for the actual result, and that selection had no duplicates), that demonstrated the problem.
By the way, there's one other problem with your function, which is that it's far slower than it needs to be. The function Seq.item, in some circumstances, has to run through the sequence from the beginning in order to pick the nth item of the sequence. It would be far better to store your data in an array at the start of your function (let arrData = data |> Array.ofSeq), then replace
|> Seq.map (fun index -> Seq.item index data)
with
|> Seq.map (fun index -> arrData.[index])
Array lookups are done in constant time, so that takes your sample function down from O(N^2) to O(N).
TL;DR: Use Seq.distinct before you take m values from it and the bug will go away. You can just replace your entire generateIdx function with a simple DiscreteUniform.Samples(0, n-1) |> Seq.distinct |> Seq.take size. (And use an array for your data lookups so that your function will run faster). In other words, here's the final almost-final version of how I would rewrite your code:
let sample (data : seq<float>) (size : int) (repl : bool) =
let arrData = data |> Array.ofSeq
let n = arrData |> Array.length
if repl then
DiscreteUniform.Samples(0, n-1)
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])
else
DiscreteUniform.Samples(0, n-1)
|> Seq.distinct
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])
That's it! Simple, easy to understand, and (as far as I can tell) bug-free.
Edit: ... but not completely DRY, because there's still a bit of repeated code in that "final" version. (Credit to CaringDev for pointing it out in the comments below). The Seq.take size |> Seq.map is repeated in both branches of the if expression, so there's a way to simplify that expression. We could do this:
let randomIndices =
if repl then
DiscreteUniform.Samples(0, n-1)
else
DiscreteUniform.Samples(0, n-1) |> Seq.distinct
randomIndices
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])
So here's a truly-final version of my suggestion:
let sample (data : seq<float>) (size : int) (repl : bool) =
let arrData = data |> Array.ofSeq
let n = arrData |> Array.length
let randomIndices =
if repl then
DiscreteUniform.Samples(0, n-1)
else
DiscreteUniform.Samples(0, n-1) |> Seq.distinct
randomIndices
|> Seq.take size
|> Seq.map (fun index -> arrData.[index])

Understanding Mutability in F# : case study

I'm a beginner in F#, and this is my first attempt at programming something serious. I'm sorry the code is a bit long, but there are some issues with mutability that I don't understand.
This is an implementation of the Karger MinCut Algorithm to calculate the mincut on a non-directed graph component. I won't discuss here how the algo works,
for more info https://en.wikipedia.org/wiki/Karger%27s_algorithm
What is important is it's a randomized algorithm, which is running a determined number of trial runs, and taking the "best" run.
I realize now that I could avoid a lot of the problems below if I did construct a specific function for each random trial, but I'd like to understand EXACTLY what is wrong in the implementation below.
I'm running the code on this simple graph (the mincut is 2 when we cut the graph
into 2 components (1,2,3,4) and (5,6,7,8) with only 2 edges between those 2 components)
3--4-----5--6
|\/| |\/|
|/\| |/\|
2--1-----7--8
the file simplegraph.txt should encode this graph as follow
(1st column = node number, other columns = links)
1 2 3 4 7
2 1 3 4
3 1 2 4
4 1 2 3 5
5 4 6 7 8
6 5 7 8
7 1 5 6 8
8 5 6 7
This code may look too much as imperative programming yet, I'm sorry for that.
So There is a main for i loop calling each trial.
the first execution, (when i=1) looks smooth and perfect,
but I have runtime error execution when i=2, because it looks some variables,
like WG are not reinitialized correctly, causing out of bound errors.
WG, WG1 and WGmin are type wgraphobj, which are a record of Dictionary objects
WG1 is defined outside the main loop and i make no new assignments to WG1.
[but its type is mutable though, alas]
I defined first WG with the instruction
let mutable WG = WG1
then at the beginning of the for i loop,
i write
WG <- WG1
and then later, i modify the WG object in each trial to make some calculations.
when the trial is finished and we go to the next trial (i is increased) i want to reset WG to its initial state being like WG1.
but it seems its not working, and I don't get why...
Here is the full code
MyModule.fs [some functions not necessary for execution]
namespace MyModule
module Dict =
open System.Collections.Generic
let toSeq d = d |> Seq.map (fun (KeyValue(k,v)) -> (k,v))
let toArray (d:IDictionary<_,_>) = d |> toSeq |> Seq.toArray
let toList (d:IDictionary<_,_>) = d |> toSeq |> Seq.toList
let ofMap (m:Map<'k,'v>) = new Dictionary<'k,'v>(m) :> IDictionary<'k,'v>
let ofList (l:('k * 'v) list) = new Dictionary<'k,'v>(l |> Map.ofList) :> IDictionary<'k,'v>
let ofSeq (s:('k * 'v) seq) = new Dictionary<'k,'v>(s |> Map.ofSeq) :> IDictionary<'k,'v>
let ofArray (a:('k * 'v) []) = new Dictionary<'k,'v>(a |> Map.ofArray) :> IDictionary<'k,'v>
Karger.fs
open MyModule.Dict
open System.IO
let x = File.ReadAllLines "\..\simplegraph.txt";;
// val x : string [] =
let splitAtTab (text:string)=
text.Split [|'\t';' '|]
let splitIntoKeyValue (s:seq<'T>) =
(Seq.head s, Seq.tail s)
let parseLine (line:string)=
line
|> splitAtTab
|> Array.filter (fun s -> not(s=""))
|> Array.map (fun s-> (int s))
|> Array.toSeq
|> splitIntoKeyValue
let y =
x |> Array.map parseLine
open System.Collections.Generic
// let graph = new Map <int, int array>
let graphD = new Dictionary<int,int seq>()
y |> Array.iter graphD.Add
let graphM = y |> Map.ofArray //immutable
let N = y.Length // number of nodes
let Nruns = 2
let remove_table = new Dictionary<int,bool>()
[for i in 1..N do yield (i,false)] |> List.iter remove_table.Add
// let remove_table = seq [|for a in 1 ..N -> false|] // plus court
let label_head_table = new Dictionary<int,int>()
[for i in 1..N do yield (i,i)] |> List.iter label_head_table.Add
let label = new Dictionary<int,int seq>()
[for i in 1..N do yield (i,[i])] |> List.iter label.Add
let mutable min_cut = 1000000
type wgraphobj =
{ Graph : Dictionary<int,int seq>
RemoveTable : Dictionary<int,bool>
Label : Dictionary<int,int seq>
LabelHead : Dictionary<int,int> }
let WG1 = {Graph = graphD;
RemoveTable = remove_table;
Label = label;
LabelHead = label_head_table}
let mutable WGmin = WG1
let IsNotRemoved x = //
match x with
| (i,false) -> true
| (i,true) -> false
let IsNotRemoved1 WG i = //
(i,WG.RemoveTable.[i]) |>IsNotRemoved
let GetLiveNode d =
let myfun x =
match x with
| (i,b) -> i
d |> toList |> List.filter IsNotRemoved |> List.map myfun
let rand = System.Random()
// subsets a dictionary given a sub_list of keys
let D_Subset (dict:Dictionary<'T,'U>) (sub_list:list<'T>) =
let z = Dictionary<'T,'U>() // create new empty dictionary
sub_list |> List.filter (fun k -> dict.ContainsKey k)
|> List.map (fun k -> (k, dict.[k]))
|> List.iter (fun s -> z.Add s)
z
// subsets a dictionary given a sub_list of keys to remove
let D_SubsetC (dict:Dictionary<'T,'U>) (sub_list:list<'T>) =
let z = dict
sub_list |> List.filter (fun k -> dict.ContainsKey k)
|> List.map (fun k -> (dict.Remove k)) |>ignore
z
// subsets a sequence by values in a sequence
let S_Subset (S:seq<'T>)(sub_list:seq<'T>) =
S |> Seq.filter (fun s-> Seq.exists (fun elem -> elem = s) sub_list)
let S_SubsetC (S:seq<'T>)(sub_list:seq<'T>) =
S |> Seq.filter (fun s-> not(Seq.exists (fun elem -> elem = s) sub_list))
[<EntryPoint>]
let main argv =
let mutable u = 0
let mutable v = 0
let mutable r = 0
let mutable N_cut = 1000000
let mutable cluster_A_min = seq [0]
let mutable cluster_B_min = seq [0]
let mutable WG = WG1
let mutable LiveNodeList = [0]
// when i = 2, i encounter problems with mutability
for i in 1 .. Nruns do
WG <- WG1
printfn "%d" i
for k in 1..(N-2) do
LiveNodeList <- GetLiveNode WG.RemoveTable
r <- rand.Next(0,N-k)
u <- LiveNodeList.[r] //selecting a live node
let uuu = WG.Graph.[u] |> Seq.map (fun s -> WG.LabelHead.[s] )
|> Seq.filter (IsNotRemoved1 WG)
|> Seq.distinct
let n_edge = uuu |> Seq.length
let x = rand.Next(1,n_edge)
let mutable ok = false //maybe we can take this out
while not(ok) do
// selecting the edge from node u
v <- WG.LabelHead.[Array.get (uuu |> Seq.toArray) (x-1)]
let vvv = WG.Graph.[v] |> Seq.map (fun s -> WG.LabelHead.[s] )
|> Seq.filter (IsNotRemoved1 WG)
|> Seq.distinct
let zzz = S_SubsetC (Seq.concat [uuu;vvv] |> Seq.distinct) [u;v]
WG.Graph.[u] <- zzz
let lab_u = WG.Label.[u]
let lab_v = WG.Label.[v]
WG.Label.[u] <- Seq.concat [lab_u;lab_v] |> Seq.distinct
if (k<N-1) then
WG.RemoveTable.[v]<-true
//updating Label_head for all members of Label.[v]
WG.LabelHead.[v]<- u
for j in WG.Label.[v] do
WG.LabelHead.[j]<- u
ok <- true
printfn "u= %d v=%d" u v
// end of for k in 1..(N-2)
// counting cuts
// u,v contain the 2 indexes of groupings
let cluster_A = WG.Label.[u]
let cluster_B = S_SubsetC (seq[for i in 1..N do yield i]) cluster_A // defined as complementary of A
// let WG2 = {Graph = D_Subset WG1.Graph (cluster_A |> Seq.toList)
// RemoveTable = remove_table
// Label = D_Subset WG1.Graph (cluster_A |> Seq.toList)
// LabelHead = label_head_table}
let cross_edge = // returns keyvalue pair (k,S')
let IsInCluster cluster (k,S) =
(k,S_Subset S cluster)
graphM |> toSeq |> Seq.map (IsInCluster cluster_B)
N_cut <-
cross_edge |> Seq.map (fun (k:int,v:int seq)-> Seq.length v)
|> Seq.sum
if (N_cut<min_cut) then
min_cut <- N_cut
WGmin <- WG
cluster_A_min <- cluster_A
cluster_B_min <- cluster_B
// end of for i in 1..Nruns
0 // return an integer exit code
Description of the algo: (i don't think its too essential to solve my problem)
at each trial, there are several steps. at each step, we merge 2 nodes into 1, (removing effectively 1) updating the graph. we do that 6 times until there are only 2 nodes left, which we define as 2 clusters, and we look at the number of cross edges between those 2 clusters. if we are "lucky" those 2 clusters would be (1,2,3,4) and (5,6,7,8) and find the right number of cuts.
at each step, the object WG is updated with the effects of merging 2 nodes
with only LiveNodes (the ones which are not eliminated as a result of merging 2 nodes) being perfectly kept up to date.
WG.Graph is the updated graph
WG.Label contains the labels of the nodes which have been merged into the current node
WG.LabelHead contains the label of the node into which that node has been merged
WG.RemoveTable says if the node has been removed or not.
Thanks in advance for anyone willing to take a look at it !

"It seems not working", because wgraphobj is a reference type, which is allocated on the stack, which means that when you're mutating the innards of WG, you're also mutating the innards of WG1, because they're the same innards.
This is precisely the kind of mess you get yourself into if you use mutable state. This is why people recommend to not use it. In particular, your use of mutable dictionaries undermines the robustness of your algorithm. I recommend using the F#'s own efficient immutable dictionary (called Map) instead.
Now, in response to your comment about WG.Graph <- GraphD giving compile error.
WG is mutable, but WG.Graph is not (but the contents of WG.Graph are again mutable). There is a difference, let me try to explain it.
WG is mutable in the sense that it points to some object of type wgraphobj, but you can make it, in the course of your program, to point at another object of the same type.
WG.Graph, on the other hand, is a field packed inside WG. It points to some object of type Dictionary<_,_>. And you cannot make it point to another object. You can create a different wgraphobj, in which the field Graph point to a different dictionary, but you cannot change where the field Graph of the original wgraphobj points.
In order to make the field Graph itself mutable, you can declare it as such:
type wgraphobj = {
mutable Graph: Dictionary<int, int seq>
...
Then you will be able to mutate that field:
WG.Graph <- GraphD
Note that in this case you do not need to declare the value WG itself as mutable.
However, it seems to me that for your purposes you can actually go the way of creating a new instance wgraphobj with the field Graph changed, and assigning it to the mutable reference WG:
WG.Graph <- { WG with Graph = GraphD }

Tail Recursion in F# : Stack Overflow

I'm trying to implement Kosaraju's algorithm on a large graph
as part of an assignment [MOOC Algo I Stanford on Coursera]
https://en.wikipedia.org/wiki/Kosaraju%27s_algorithm
The current code works on a small graph, but I'm hitting Stack Overflow during runtime execution.
Despite having read the relevant chapter in Expert in F#, or other available examples on websites and SO, i still don't get how to use continuation to solve this problem
Below is the full code for general purpose, but it will already fail when executing DFSLoop1 and the recursive function DFSsub inside. I think I'm not making the function tail recursive [because of the instructions
t<-t+1
G.[n].finishingtime <- t
?]
but i don't understand how i can implement the continuation properly.
When considering only the part that fails, DFSLoop1 is taking as argument a graph to which we will apply Depth-First Search. We need to record the finishing time as part of the algo to proceed to the second part of the algo in a second DFS Loop (DFSLoop2) [of course we are failing before that].
open System
open System.Collections.Generic
open System.IO
let x = File.ReadAllLines "C:\Users\Fagui\Documents\GitHub\Learning Fsharp\Algo Stanford I\PA 4 - SCC.txt";;
// let x = File.ReadAllLines "C:\Users\Fagui\Documents\GitHub\Learning Fsharp\Algo Stanford I\PA 4 - test1.txt";;
// val x : string [] =
let splitAtTab (text:string)=
text.Split [|'\t';' '|]
let splitIntoKeyValue (A: int[]) =
(A.[0], A.[1])
let parseLine (line:string)=
line
|> splitAtTab
|> Array.filter (fun s -> not(s=""))
|> Array.map (fun s-> (int s))
|> splitIntoKeyValue
let y =
x |> Array.map parseLine
//val it : (int * int) []
type Children = int[]
type Node1 =
{children : Children ;
mutable finishingtime : int ;
mutable explored1 : bool ;
}
type Node2 =
{children : Children ;
mutable leader : int ;
mutable explored2 : bool ;
}
type DFSgraphcore = Dictionary<int,Children>
let directgraphcore = new DFSgraphcore()
let reversegraphcore = new DFSgraphcore()
type DFSgraph1 = Dictionary<int,Node1>
let reversegraph1 = new DFSgraph1()
type DFSgraph2 = Dictionary<int,Node2>
let directgraph2 = new DFSgraph2()
let AddtoGraph (G:DFSgraphcore) (n,c) =
if not(G.ContainsKey n) then
let node = [|c|]
G.Add(n,node)
else
let c'= G.[n]
G.Remove(n) |> ignore
G.Add (n, Array.append c' [|c|])
let inline swaptuple (a,b) = (b,a)
y|> Array.iter (AddtoGraph directgraphcore)
y|> Array.map swaptuple |> Array.iter (AddtoGraph reversegraphcore)
for i in directgraphcore.Keys do
if reversegraphcore.ContainsKey(i) then do
let node = {children = reversegraphcore.[i] ;
finishingtime = -1 ;
explored1 = false ;
}
reversegraph1.Add (i,node)
else
let node = {children = [||] ;
finishingtime = -1 ;
explored1 = false ;
}
reversegraph1.Add (i,node)
directgraphcore.Clear |> ignore
reversegraphcore.Clear |> ignore
// for i in reversegraph1.Keys do printfn "%d %A" i reversegraph1.[i].children
printfn "pause"
Console.ReadKey() |> ignore
let num_nodes =
directgraphcore |> Seq.length
let DFSLoop1 (G:DFSgraph1) =
let mutable t = 0
let mutable s = -1
let mutable k = num_nodes
let rec DFSsub (G:DFSgraph1)(n:int) (cont:int->int) =
//how to make it tail recursive ???
G.[n].explored1 <- true
// G.[n].leader <- s
for j in G.[n].children do
if not(G.[j].explored1) then DFSsub G j cont
t<-t+1
G.[n].finishingtime <- t
// end of DFSsub
for i in num_nodes .. -1 .. 1 do
printfn "%d" i
if not(G.[i].explored1) then do
s <- i
( DFSsub G i (fun s -> s) ) |> ignore
// printfn "%d %d" i G.[i].finishingtime
DFSLoop1 reversegraph1
printfn "pause"
Console.ReadKey() |> ignore
for i in directgraphcore.Keys do
let node = {children =
directgraphcore.[i]
|> Array.map (fun k -> reversegraph1.[k].finishingtime) ;
leader = -1 ;
explored2= false ;
}
directgraph2.Add (reversegraph1.[i].finishingtime,node)
let z = 0
let DFSLoop2 (G:DFSgraph2) =
let mutable t = 0
let mutable s = -1
let mutable k = num_nodes
let rec DFSsub (G:DFSgraph2)(n:int) (cont:int->int) =
G.[n].explored2 <- true
G.[n].leader <- s
for j in G.[n].children do
if not(G.[j].explored2) then DFSsub G j cont
t<-t+1
// G.[n].finishingtime <- t
// end of DFSsub
for i in num_nodes .. -1 .. 1 do
if not(G.[i].explored2) then do
s <- i
( DFSsub G i (fun s -> s) ) |> ignore
// printfn "%d %d" i G.[i].leader
DFSLoop2 directgraph2
printfn "pause"
Console.ReadKey() |> ignore
let table = [for i in directgraph2.Keys do yield directgraph2.[i].leader]
let results = table |> Seq.countBy id |> Seq.map snd |> Seq.toList |> List.sort |> List.rev
printfn "%A" results
printfn "pause"
Console.ReadKey() |> ignore
Here is a text file with a simple graph example
1 4
2 8
3 6
4 7
5 2
6 9
7 1
8 5
8 6
9 7
9 3
(the one which is causing overflow is 70Mo big with around 900,000 nodes)
EDIT
to clarify a few things first
Here is the "pseudo code"
Input: a directed graph G = (V,E), in adjacency list representation. Assume that the vertices V are labeled
1, 2, 3, . . . , n.
1. Let Grev denote the graph G after the orientation of all arcs have been reversed.
2. Run the DFS-Loop subroutine on Grev, processing vertices according to the given order, to obtain a
finishing time f(v) for each vertex v ∈ V .
3. Run the DFS-Loop subroutine on G, processing vertices in decreasing order of f(v), to assign a leader
to each vertex v ∈ V .
4. The strongly connected components of G correspond to vertices of G that share a common leader.
Figure 2: The top level of our SCC algorithm. The f-values and leaders are computed in the first and second
calls to DFS-Loop, respectively (see below).
Input: a directed graph G = (V,E), in adjacency list representation.
1. Initialize a global variable t to 0.
[This keeps track of the number of vertices that have been fully explored.]
2. Initialize a global variable s to NULL.
[This keeps track of the vertex from which the last DFS call was invoked.]
3. For i = n downto 1:
[In the first call, vertices are labeled 1, 2, . . . , n arbitrarily. In the second call, vertices are labeled by
their f(v)-values from the first call.]
(a) if i not yet explored:
i. set s := i
ii. DFS(G, i)
Figure 3: The DFS-Loop subroutine.
Input: a directed graph G = (V,E), in adjacency list representation, and a source vertex i ∈ V .
1. Mark i as explored.
[It remains explored for the entire duration of the DFS-Loop call.]
2. Set leader(i) := s
3. For each arc (i, j) ∈ G:
(a) if j not yet explored:
i. DFS(G, j)
4. t + +
5. Set f(i) := t
Figure 4: The DFS subroutine. The f-values only need to be computed during the first call to DFS-Loop, and
the leader values only need to be computed during the second call to DFS-Loop.
EDIT
i have amended the code, with the help of an experienced programmer (a lisper but who has no experience in F#) simplifying somewhat the first part to have more quickly an example without bothering about non-relevant code for this discussion.
The code focuses only on half of the algo, running DFS once to get finishing times of the reversed tree.
This is the first part of the code just to create a small example
y is the original tree. the first element of a tuple is the parent, the second is the child. But we will be working with the reverse tree
open System
open System.Collections.Generic
open System.IO
let x = File.ReadAllLines "C:\Users\Fagui\Documents\GitHub\Learning Fsharp\Algo Stanford I\PA 4 - SCC.txt";;
// let x = File.ReadAllLines "C:\Users\Fagui\Documents\GitHub\Learning Fsharp\Algo Stanford I\PA 4 - test1.txt";;
// val x : string [] =
let splitAtTab (text:string)=
text.Split [|'\t';' '|]
let splitIntoKeyValue (A: int[]) =
(A.[0], A.[1])
let parseLine (line:string)=
line
|> splitAtTab
|> Array.filter (fun s -> not(s=""))
|> Array.map (fun s-> (int s))
|> splitIntoKeyValue
// let y =
// x |> Array.map parseLine
//let y =
// [|(1, 4); (2, 8); (3, 6); (4, 7); (5, 2); (6, 9); (7, 1); (8, 5); (8, 6);
// (9, 7); (9, 3)|]
// let y = Array.append [|(1,1);(1,2);(2,3);(3,1)|] [|for i in 4 .. 10000 do yield (i,4)|]
let y = Array.append [|(1,1);(1,2);(2,3);(3,1)|] [|for i in 4 .. 99999 do yield (i,i+1)|]
//val it : (int * int) []
type Children = int list
type Node1 =
{children : Children ;
mutable finishingtime : int ;
mutable explored1 : bool ;
}
type Node2 =
{children : Children ;
mutable leader : int ;
mutable explored2 : bool ;
}
type DFSgraphcore = Dictionary<int,Children>
let directgraphcore = new DFSgraphcore()
let reversegraphcore = new DFSgraphcore()
type DFSgraph1 = Dictionary<int,Node1>
let reversegraph1 = new DFSgraph1()
let AddtoGraph (G:DFSgraphcore) (n,c) =
if not(G.ContainsKey n) then
let node = [c]
G.Add(n,node)
else
let c'= G.[n]
G.Remove(n) |> ignore
G.Add (n, List.append c' [c])
let inline swaptuple (a,b) = (b,a)
y|> Array.iter (AddtoGraph directgraphcore)
y|> Array.map swaptuple |> Array.iter (AddtoGraph reversegraphcore)
// définir reversegraph1 = ... with....
for i in reversegraphcore.Keys do
let node = {children = reversegraphcore.[i] ;
finishingtime = -1 ;
explored1 = false ;
}
reversegraph1.Add (i,node)
for i in directgraphcore.Keys do
if not(reversegraphcore.ContainsKey(i)) then do
let node = {children = [] ;
finishingtime = -1 ;
explored1 = false ;
}
reversegraph1.Add (i,node)
directgraphcore.Clear |> ignore
reversegraphcore.Clear |> ignore
// for i in reversegraph1.Keys do printfn "%d %A" i reversegraph1.[i].children
printfn "pause"
Console.ReadKey() |> ignore
let num_nodes =
directgraphcore |> Seq.length
So basically the graph is (1->2->3->1)::(4->5->6->7->8->....->99999->10000)
and the reverse graph is (1->3->2->1)::(10000->9999->....->4)
here is the main code written in direct style
//////////////////// main code is below ///////////////////
let DFSLoop1 (G:DFSgraph1) =
let mutable t = 0
let mutable s = -1
let rec iter (n:int) (f:'a->unit) (list:'a list) : unit =
match list with
| [] -> (t <- t+1) ; (G.[n].finishingtime <- t)
| x::xs -> f x ; iter n f xs
let rec DFSsub (G:DFSgraph1) (n:int) : unit =
let my_f (j:int) : unit = if not(G.[j].explored1) then (DFSsub G j)
G.[n].explored1 <- true
iter n my_f G.[n].children
for i in num_nodes .. -1 .. 1 do
// printfn "%d" i
if not(G.[i].explored1) then do
s <- i
DFSsub G i
printfn "%d %d" i G.[i].finishingtime
// End of DFSLoop1
DFSLoop1 reversegraph1
printfn "pause"
Console.ReadKey() |> ignore
its not tail recursive, so we use continuations, here is the same code adapted to CPS style:
//////////////////// main code is below ///////////////////
let DFSLoop1 (G:DFSgraph1) =
let mutable t = 0
let mutable s = -1
let rec iter_c (n:int) (f_c:'a->(unit->'r)->'r) (list:'a list) (cont: unit->'r) : 'r =
match list with
| [] -> (t <- t+1) ; (G.[n].finishingtime <- t) ; cont()
| x::xs -> f_c x (fun ()-> iter_c n f_c xs cont)
let rec DFSsub (G:DFSgraph1) (n:int) (cont: unit->'r) : 'r=
let my_f_c (j:int)(cont:unit->'r):'r = if not(G.[j].explored1) then (DFSsub G j cont) else cont()
G.[n].explored1 <- true
iter_c n my_f_c G.[n].children cont
for i in maxnum_nodes .. -1 .. 1 do
// printfn "%d" i
if not(G.[i].explored1) then do
s <- i
DFSsub G i id
printfn "%d %d" i G.[i].finishingtime
DFSLoop1 reversegraph1
printfn "faré"
printfn "pause"
Console.ReadKey() |> ignore
both codes compile and give the same results for the small example (the one in comment) or the same tree that we are using , with a smaller size (1000 instead of 100000)
so i don't think its a bug in the algo here, we've got the same tree structure, just a bigger tree is causing problems. it looks to us the continuations are well written. we've typed the code explicitly. and all calls end with a continuation in all cases...
We are looking for expert advice !!! thanks !!!

I did not try to understand the whole code snippet, because it is fairly long, but you'll certainly need to replace the for loop with an iteration implemented using continuation passing style. Something like:
let rec iterc f cont list =
match list with
| [] -> cont ()
| x::xs -> f x (fun () -> iterc f cont xs)
I didn't understand the purpose of cont in your DFSub function (it is never called, is it?), but the continuation based version would look roughly like this:
let rec DFSsub (G:DFSgraph2)(n:int) cont =
G.[n].explored2 <- true
G.[n].leader <- s
G.[n].children
|> iterc
(fun j cont -> if not(G.[j].explored2) then DFSsub G j cont else cont ())
(fun () -> t <- t + 1)

Overflowing the stack when you recurse through hundreds of thousands of entries isn't bad at all, really. A lot of programming language implementations will choke on much shorter recursions than that. You're having serious programmer problems — nothing to be ashamed of!
Now if you want to do deeper recursions than your implementation will handle, you need to transform your algorithm so it is iterative and/or tail-recursive (the two are isomorphic — except that tail-recursion allows for decentralization and modularity, whereas iteration is centralized and non-modular).
To transform an algorithm from recursive to tail-recursive, which is an important skill to possess, you need to understand the state that is implicitly stored in a stack frame, i.e. those free variables in the function body that change across the recursion, and explicitly store them in a FIFO queue (a data structure that replicates your stack, and can be implemented trivially as a linked list). Then you can pass that linked list of reified frame variables as an argument to your tail recursive functions.
In more advanced cases where you have many tail recursive functions each with a different kind of frame, instead of simple self-recursion, you may need to define some mutually recursive data types for the reified stack frames, instead of using a list. But I believe Kosaraju's algorithm only involves self-recursive functions.

OK, so the code given above was the RIGHT code !
the problem lies with the compiler of F#
here is some words about it from Microsoft
http://blogs.msdn.com/b/fsharpteam/archive/2011/07/08/tail-calls-in-fsharp.aspx
Basically, be careful with the settings, in default mode, the compiler may NOT make automatically the tail calls. To do so, in VS2015, go to the Solution Explorer, right click with the mouse and click on "Properties" (the last element of the scrolling list)
Then in the new window, click on "Build" and tick the box "Generate tail calls"
It is also to check if the compiler did its job looking at the disassembly using
ILDASM.exe
you can find the source code for the whole algo in my github repository
https://github.com/FaguiCurtain/Learning-Fsharp/blob/master/Algo%20Stanford/Algo%20Stanford/Kosaraju_cont.fs
on a performance point of view, i'm not very satisfied. The code runs on 36 seconds on my laptop. From the forum with other fellow MOOCers, C/C++/C# typically executes in subsecond to 5s, Java around 10-15, Python around 20-30s.
So my implementation is clearly not optimized. I am now happy to hear about tricks to make it faster !!! thanks !!!!

Second Taxicab Number Generator

I am attempting to generate a series of guesses for the second Taxicab number. What I want to do is is call the Attempt function on a series of integers in a finite sequence. I have my two questions about implementation in the comments.
A taxi cab number, in case your wondering, is the least number that satisfied the sum of 2 unique cubes in for n unique sets of 2 unique cubes. Ta(2) is 1729.
[<EntryPoint>]
let main argv =
let Attempt (start : int) =
let stop = start+20
let integerList = [start..stop]
let list = List.init 3 (fun x -> integerList.[x])
//Is there a simple way to make initialize the list with random indices of integerList?
let Cube x = x*x*x
let newlist = list |> List.map (fun x -> Cube x)
let partitionList (x : List<int>) (y : int) = List.sum [x.[y];x.[y+1]]
let intLIST = [0..2]
let partitionList' = [for i in intLIST do yield partitionList newlist i]
let x = Set.ofList partitionList'
let y = Set.ofList partitionList'
//I was going to try to use some kind of equality operator to determine whether the two sets were equal, which could tell me whether we had actually found a Taxicab number by the weakened definition.
System.Console.Write(list)
System.Console.Write(newlist)
let rnd = System.Random()
//My primary question is how can I convert a random to an integer to use in start for the function Attempt?
System.Console.ReadKey() |> ignore
printfn("%A") argv
0

Dirty way to initialize list with random indexes of another list:
let randomIndexes count myList =
let rand = System.Random()
seq {
for n = 1 to count do
yield rand.Next(List.length myList) }
|> Seq.distinct
//|> Seq.sort // if you need them sorted
|> List.ofSeq
let result = randomIndexes 5 [3;2;4;5]
printfn "%A" result

Get a random subset from a set in F#

I am trying to think of an elegant way of getting a random subset from a set in F#
Any thoughts on this?
Perhaps this would work: say we have a set of 2x elements and we need to pick a subset of y elements. Then if we could generate an x sized bit random number that contains exactly y 2n powers we effectively have a random mask with y holes in it. We could keep generating new random numbers until we get the first one satisfying this constraint but is there a better way?

If you don't want to convert to an array you could do something like this. This is O(n*m) where m is the size of the set.
open System
let rnd = Random(0);
let set = Array.init 10 (fun i -> i) |> Set.of_array
let randomSubSet n set =
seq {
let i = set |> Set.to_seq |> Seq.nth (rnd.Next(set.Count))
yield i
yield! set |> Set.remove i
}
|> Seq.take n
|> Set.of_seq
let result = set |> randomSubSet 3
for x in result do
printfn "%A" x

Agree with #JohannesRossel. There's an F# shuffle-an-array algorithm here you can modify suitably. Convert the Set into an array, and then loop until you've selected enough random elements for the new subset.

Not having a really good grasp of F# and what might be considered elegant there, you could just do a shuffle on the list of elements and select the first y. A Fisher-Yates shuffle even helps you in this respect as you also only need to shuffle y elements.

rnd must be out of subset function.
let rnd = new Random()
let rec subset xs =
let removeAt n xs = ( Seq.nth (n-1) xs, Seq.append (Seq.take (n-1) xs) (Seq.skip n xs) )
match xs with
| [] -> []
| _ -> let (rem, left) = removeAt (rnd.Next( List.length xs ) + 1) xs
let next = subset (List.of_seq left)
if rnd.Next(2) = 0 then rem :: next else next

Do you mean a random subset of any size?
For the case of a random subset of a specific size, there's a very elegant answer here:
Select N random elements from a List<T> in C#
Here it is in pseudocode:
RandomKSubset(list, k):
n = len(list)
needed = k
result = {}
for i = 0 to n:
if rand() < needed / (n-i)
push(list[i], result)
needed--
return result

Using Seq.fold to construct using lazy evaluation random sub-set:
let rnd = new Random()
let subset2 xs = let insertAt n xs x = Seq.concat [Seq.take n xs; seq [x]; Seq.skip n xs]
let randomInsert xs = insertAt (rnd.Next( (Seq.length xs) + 1 )) xs
xs |> Seq.fold randomInsert Seq.empty |> Seq.take (rnd.Next( Seq.length xs ) + 1)

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart