How do I just generate data with fscheck? - f#

Is it possible to generate data, specifically a list, with fscheck for use outside of fscheck? I'm unable to debug a situation in fscheck testing where it looks like the comparison results are equal, but fscheck says they are not.
I have this generator for a list of objects. How do I generate a list I can use from this generator?
let genListObj min max = Gen.listOf Arb.generate<obj> |> Gen.suchThat (fun l -> (l.Length >= min) && (l.Length <= max))

Edit: this function is now part of the FsCheck API (Gen.sample) so you don't need the below anymore...
Here is a sample function to generate n samples from a given generator:
let sample n gn =
    let rec sample i seed samples =
        if i = 0 then samples
        else sample (i-1) (Random.stdSplit seed |> snd) (Gen.eval 1000 seed gn :: samples)
    sample n (Random.newSeed()) []
Edit: the 1000 magic number in there represents the size of the generated values. 1000 is pretty big - e.g. sequences will be between 0 and 1000 elements long, and so will strings, for example. If generation takes a long time, you may want to tweak that value (or take it in as a parameter of the function).
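For reference, here is a minimal sketch of calling the built-in Gen.sample from an F# script. It assumes FsCheck 2.x, where the arguments are size, number of samples and generator (FsCheck 3.x changed this signature, so check the version you have installed). The name genListInt and the element type int are my substitutions for the question's generator, purely to keep the output readable.

```fsharp
#r "nuget: FsCheck, 2.16.6"   // assumption: a 2.x release; 3.x has a different Gen.sample
open FsCheck

// Same shape as the generator in the question, but with int elements (hypothetical name).
let genListInt min max =
    Gen.listOf Arb.generate<int>
    |> Gen.suchThat (fun l -> l.Length >= min && l.Length <= max)

// FsCheck 2.x: Gen.sample size count generator
let samples : int list list = Gen.sample 20 5 (genListInt 2 6)

// Plain F# lists from here on, usable outside of any property test.
samples |> List.iter (printfn "%A")
```

The resulting lists can then be fed directly to the code under test in a debugger or in F# Interactive.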

Related

F# - fsc.exe hangs up on huge file

I run some organic chemistry models. A model is described by a generated ModelData.fs file, e.g.: https://github.com/kkkmail/ClmFSharp/blob/master/Clm/Model/ModelData.fs . The file has a very simple structure and using a generated model file is the only way that it can possibly work.
The referenced file is just for tests, but the real models are huge and may approach 60 - 70 MB / 1.5M LOC. When I try to compile such files, the F# compiler, fsc.exe, just hangs and never comes back. It "eats" about 1.5 GB of memory and then does something forever at near 100% CPU. It can clearly handle smaller models of about 10 MB, which compile in under about a minute. So somewhere between 10 MB and 70 MB something breaks down badly in fsc.
I wonder if there are some parameter tweaks that I could make to the way the fsc compiles the project in order to make it capable of handling such huge models.
The huge models that I am referring to have one parameter set as follows: let numberOfSubstances = 65643. This results in various generated arrays of that size. I wonder if this could be the source of the problem.
Thanks a lot!
I don't think you need to autogenerate all of that.
From your comments, I understand that the functions d0, d1, ... are generated from a big sparse matrix in a way that sums up all of the input array x (with coefficients), but crucially skips summing up zero coefficients, which gives you a great performance gain, because the matrix is huge. Would that be a correct assessment?
If so, I still don't think you need to generate code to do that.
Let's take a look. I will assume that your giant sparse matrix has an interface for obtaining cell values, and it looks something like this:
let getMatrixCell (i: int) (j: int) : double
let maxI: int
let maxJ: int
Then your autogeneration code might look something like this:
let generateDFunction (i: int) =
    printfn "let d%d (x: double[]) =" i
    printfn "    [|"
    for j in 0..maxJ do
        let cell = getMatrixCell i j
        if cell <> 0.0 then
            printfn "        %f * x.[%d]" cell j
    printfn "    |]"
    printfn "    |> Array.sum"
Which would result in something like this:
let d25 (x : array<double>) =
    [|
        -1.0 * x.[25]
        1.0 * x.[3]
    |]
    |> Array.sum
Note that I am simplifying here: in your example file, it looks like the functions also multiply negative coefficients by x.[i]. But maybe I'm also overcomplicating, because it looks like all the coefficients are always either 1 or -1. But that is all nonessential to my point.
Now, in the comments, it has been proposed that you don't generate functions d0, d1, ... but instead work directly with the matrix. For example, this would be a naive implementation of such suggestion:
let calculateDFunction (i: int) (x: double[]) =
    [| for j in 0..maxJ -> (getMatrixCell i j) * x.[j] |] |> Array.sum
You then argued that this solution would be prohibitively slow, because it always iterates over the whole array x, which is huge, but most of the coefficients are zero, so it doesn't have to.
And then your way of solving this issue was to use an intermediate step of generated code: you generate functions that only touch the non-zero indices, and then you compile and use those functions.
But here's the point: yes, you do need that intermediate step to skip the zero coefficients, but it doesn't have to be generated-and-compiled code!
Instead, you can prepare lists/arrays of the non-zero indices ahead of time:
let indicies =
    [| for i in 0..maxI ->
        [ for j in 0..maxJ do
            let cell = getMatrixCell i j
            if cell <> 0.0 then yield (j, cell)
        ]
    |]
This will yield an array indicies : (int * double) list [], where each index k corresponds to your autogenerated function dk, and the element at that index is a list of the non-zero column indices paired with their values in the matrix. For example, the function d25 I gave above would be represented by element 25 of indicies:
indicies.[25] = [ (3, 1.0); (25, -1.0) ]
Based on this intermediate structure, you can then calculate any function dk:
let calculateDFunction (k: int) (x: double[]) =
    [| for (j, coeff) in indicies.[k] -> coeff * x.[j] |] |> Array.sum
In fact, if performance is crucial to you (as it seems to be from the comments), you should probably do away with all those intermediate arrays: hundreds or thousands of heap allocations on each iteration are definitely not helping. You can sum with a mutable variable instead:
let calculateDFunction (k: int) (x: double[]) =
    let mutable sum = 0.0
    for (j, coeff) in indicies.[k] do
        sum <- sum + coeff * x.[j]
    sum
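To make the whole idea concrete, here is a small self-contained sketch. The 3x4 matrix dense, the name nonZero and the input vector are made up for illustration; your real getMatrixCell, maxI and maxJ would come from wherever the sparse matrix lives. I also store each row as an array rather than a list, which is an assumption on my part that indexed loops suit a hot path better.

```fsharp
// Hypothetical 3x4 matrix standing in for the real (huge, sparse) one.
let dense =
    [| [| 0.0; 2.0; 0.0; -1.0 |]
       [| 0.0; 0.0; 0.0;  0.0 |]
       [| 1.0; 0.0; 3.0;  0.0 |] |]

let getMatrixCell (i: int) (j: int) : double = dense.[i].[j]
let maxI = dense.Length - 1
let maxJ = dense.[0].Length - 1

// Built once up front: the non-zero (column, coefficient) pairs of every row.
let nonZero : (int * double)[][] =
    [| for i in 0..maxI ->
         [| for j in 0..maxJ do
              let cell = getMatrixCell i j
              if cell <> 0.0 then yield (j, cell) |] |]

// Evaluates row k against x, touching only the non-zero cells.
let calculateDFunction (k: int) (x: double[]) =
    let row = nonZero.[k]
    let mutable sum = 0.0
    for idx in 0 .. row.Length - 1 do
        let (j, coeff) = row.[idx]
        sum <- sum + coeff * x.[j]
    sum

let x = [| 1.0; 2.0; 3.0; 5.0 |]
printfn "%A" [ for k in 0..maxI -> calculateDFunction k x ]   // [-1.0; 0.0; 10.0]
```

The precomputation walks the full matrix once; after that, each evaluation costs only as much as the number of non-zero cells in that row.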

Split Dataset into smallers sets randomly f#

I am learning F# and I would like to learn how to split a data set into 10 smaller sets randomly. Does anyone have any ideas on where to start? What topic should I read about? I need help to continue. Thank you.
A lot depends on what exactly you want to achieve. You can use the permute function available on most collections. Here is an example that uses MathNet.Numerics to generate random indexes and then shuffles the data. Of course, you can split first and then shuffle the data as well, or use Array.permute instead. Just install MathNet.Numerics and MathNet.Numerics.FSharp from NuGet.
#if INTERACTIVE
#r @"../packages/MathNet.Numerics/lib/net461/MathNet.Numerics.dll"
#r @"../packages/MathNet.Numerics.FSharp/lib/net45/MathNet.Numerics.FSharp.dll"
#endif
open System
open MathNet.Numerics
let rnd = System.Random()
let randomData = Array.init 100 (fun _ -> rnd.Next()) // generate the initial data
let randomIndex = (Combinatorics.GeneratePermutation 100) // create a random index
randomIndex
|> Array.map (fun x -> randomData.[x]) //shuffle the data
|> Array.splitInto 10 //split it into 10 subsets
Your result will be, in this case, an int array of arrays. It's more idiomatic to use Lists in F#. Also if your data is very large you might consider using Seq which is lazy.
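If you would rather not pull in a library just for this, a sketch along the following lines should also work. It swaps in a hand-written Fisher-Yates shuffle for the MathNet permutation; the function names shuffle and splitRandomly and the fixed seed 42 are my own choices for illustration.

```fsharp
// Fisher-Yates shuffle on a copy, so the input array is left untouched.
let shuffle (rnd: System.Random) (data: 'a[]) =
    let a = Array.copy data
    for i in a.Length - 1 .. -1 .. 1 do
        let j = rnd.Next(i + 1)          // 0 <= j <= i
        let tmp = a.[i]
        a.[i] <- a.[j]
        a.[j] <- tmp
    a

// Shuffle first, then cut into (at most) the requested number of chunks.
let splitRandomly (rnd: System.Random) (parts: int) (data: 'a[]) =
    data |> shuffle rnd |> Array.splitInto parts

let subsets = splitRandomly (System.Random(42)) 10 [| 1 .. 100 |]
printfn "%d subsets, sizes %A" subsets.Length (subsets |> Array.map Array.length)
```

Passing the Random instance in explicitly makes the split reproducible when you use a fixed seed, which tends to help when debugging.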

How do I print out the entire Fibonacci sequence up to a user inputted value in F#?

So I have a program that, currently, finds the fibonacci equivalent of a user inputted value, e.g. 6 would be 5 (or 13 depending on whether or not you start with a 0). Personally I prefer the sequence starting with 0.
open System
let rec fib (n1 : bigint) (n2 : bigint) c =
    if c = 1 then
        n2
    else
        fib n2 (n1+n2) (c-1);;
let GetFib n =
    (fib 1I 1I n);;
let input = Console.ReadLine()
Console.WriteLine(GetFib (Int32.Parse input))
The problem is that ALL it does is find the equivalent number in the sequence. I am trying to get it to print out all the values up to that user inputted value, e.g. 6 would print out 0,1,1,2,3,5. If anyone could help me figure out how to print out the whole sequence, that would be very helpful. Also if anyone can look at my code and tell me how to make it start at 0 when printing out the whole sequence, that would also be very much appreciated.
Thank you in advance for any help.
Take a look at the link s952163 gave you in the comments - it shows ways of generating a Fibonacci sequence using Seq expressions and also explains why these are useful.
The following will print every Fibonacci number up to and including the specified value:
open System
let fibsTo n =
    Seq.unfold (fun (m,n) -> Some (m, (n,n+m))) (0I,1I)
    |> Seq.takeWhile (fun x -> x <= n)
let input = Console.ReadLine()
(fibsTo (Numerics.BigInteger.Parse input)) |> Seq.iter (printfn "%A")
Note the use of printfn rather than Console.WriteLine; the former is more idiomatic.
Also, you may want to validate the input: anything that is not an integer makes BigInteger.Parse throw, and a negative value simply prints nothing.
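If what you want is the first N terms (so that 6 gives 0,1,1,2,3,5, as in the question) rather than all values up to a limit, a variation like this should do it. The names fibs and firstFibs are mine, and the hard-coded "6" stands in for Console.ReadLine().

```fsharp
// Infinite, lazily evaluated Fibonacci sequence starting at 0.
let fibs = Seq.unfold (fun (a, b) -> Some (a, (b, a + b))) (0I, 1I)

// Takes at most `count` terms; Seq.truncate does not throw on short input.
let firstFibs count = fibs |> Seq.truncate count |> List.ofSeq

// TryParse avoids an exception on non-numeric input.
match System.Int32.TryParse("6") with
| true, n when n >= 0 -> firstFibs n |> List.iter (printfn "%A")
| _ -> printfn "Please enter a non-negative integer."
```

Because the sequence is lazy, only the requested terms are ever computed.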

Apply several aggregate functions with one enumeration

Let's assume I have a series of functions that work on a sequence, and I want to use them together in the following fashion:
let meanAndStandardDeviation data =
    let m = mean data
    let sd = standardDeviation data
    (m, sd)
The code above is going to enumerate the sequence twice. I am interested in a function that will give the same result but enumerate the sequence only once. This function will be something like this:
magicFunction (mean, standardDeviation) data
where the input is a tuple of functions and a sequence, and the output is the same as that of the function above.
Is this possible if the functions mean and standardDeviation are black boxes and I cannot change their implementation?
If I wrote mean and standardDeviation myself, is there a way to make them work together? Maybe somehow making them keep yielding the input to the next function and hand over the result when they are done?
The only way to do this using just a single iteration when the functions are black boxes is to use the Seq.cache function (which evaluates the sequence once and stores the results in memory) or to convert the sequence to other in-memory representation.
When a function takes seq<T> as an argument, you don't even have a guarantee that it will evaluate it just once - and usual implementations of standard deviation would first calculate the average and then iterate over the sequence again to calculate the squares of errors.
I'm not sure if you can calculate standard deviation with just a single pass. However, it is possible to do that if the functions are expressed using fold. For example, calculating maximum and minimum using two passes looks like this:
let maxv = Seq.fold max Int32.MinValue input
let minv = Seq.fold min Int32.MaxValue input
You can do that using a single pass like this:
Seq.fold (fun (s1, s2) v ->
    (max s1 v, min s2 v)) (Int32.MinValue, Int32.MaxValue) input
The lambda function is a bit ugly, but you can define a combinator to compose two functions:
let par f g (i, j) v = (f i v, g j v)
Seq.fold (par max min) (Int32.MinValue, Int32.MaxValue) input
This approach works for functions that can be defined using fold, which means that they consist of some initial value (Int32.MinValue in the first example) and then some function that is used to update the initial (previous) state when it gets the next value (and then possibly some post-processing of the result). In general, it should be possible to rewrite single-pass functions in this style, but I'm not sure if this can be done for standard deviation. It can be definitely done for mean:
let (count, sum) =
    Seq.fold (fun (count, sum) v ->
        (count + 1.0, sum + v)) (0.0, 0.0) input
let mean = sum / count
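For what it's worth, standard deviation can be written in the same fold style. One well-known way is Welford's online algorithm, sketched below; the names step and meanAndStandardDeviation are only illustrative, and this computes the population standard deviation (divide by n - 1.0 instead for the sample version).

```fsharp
// Fold state: (count, running mean, sum of squared deviations from the mean).
let step (n, mean, m2) (x: float) =
    let n' = n + 1.0
    let delta = x - mean
    let mean' = mean + delta / n'
    (n', mean', m2 + delta * (x - mean'))

// One enumeration of `data` yields both results.
let meanAndStandardDeviation (data: seq<float>) =
    let (n, mean, m2) = Seq.fold step (0.0, 0.0, 0.0) data
    (mean, sqrt (m2 / n))   // NaN for an empty sequence

// Mean is 5 and population standard deviation is 2 for this data (up to rounding).
printfn "%A" (meanAndStandardDeviation [ 2.0; 4.0; 4.0; 4.0; 5.0; 5.0; 7.0; 9.0 ])
```

This only helps when you control the implementations; black-box functions taking seq<'a> still need the caching approach.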
What we're talking about here is a function with the following signature:
(seq<'a> -> 'b) * (seq<'a> -> 'c) -> seq<'a> -> ('b * 'c)
There is no straightforward way that I can think of that will achieve the above using a single iteration of the sequence if that is the signature of the functions. Well, no way that is more efficient than:
let magicFunc (f1:seq<'a>->'b, f2:seq<'a>->'c) (s:seq<'a>) =
    let cached = s |> Seq.cache
    (f1 cached, f2 cached)
That ensures a single iteration of the sequence itself (perhaps there are side effects, or it's slow), but does so by essentially caching the results. The cache is still iterated another time. Is there anything wrong with that? What are you trying to achieve?

N-gram split function for string similarity comparison

As part of an exercise to better understand F#, which I am currently learning, I wrote a function to split a given string into n-grams.
1) I would like to receive feedback about my function: can this be written more simply or more efficiently?
2) My overall goal is to write a function that returns string similarity (on a 0.0 .. 1.0 scale) based on n-gram similarity. Does this approach work well for short string comparisons, or can this method reliably be used to compare large strings (like articles, for example)?
3) I am aware of the fact that n-gram comparisons ignore the context of the two strings. What method would you suggest to accomplish my goal?
//s:string - target string to split into n-grams
//n:int - n-gram size to split string into
let ngram_split (s:string, n:int) =
    let ngram_count = s.Length - (s.Length % n)
    let ngram_list =
        List.init ngram_count (fun i ->
            if i + n >= s.Length then
                s.Substring(i, s.Length - i) + String.init ((i + n) - s.Length) (fun _ -> "#")
            else
                s.Substring(i, n))
    let ngram_array_unique =
        ngram_list
        |> Seq.ofList
        |> Seq.distinct
        |> Array.ofSeq
    //produce tuples of ngrams (ngram string, number of occurrences in original string)
    Seq.init ngram_array_unique.Length (fun i ->
        (ngram_array_unique.[i],
         ngram_list
         |> List.filter (fun item -> item = ngram_array_unique.[i])
         |> List.length))
I don't know much about evaluating similarity of strings, so I can't give you much feedback regarding points 2 and 3. However, here are a few suggestions that may help to make your implementation simpler.
Many of the operations that you need to do are already available in some F# library function for working with sequences (lists, arrays, etc.). Strings are also sequences (of characters), so you can write the following:
open System
let ngramSplit n (s:string) =
    let ngrams = Seq.windowed n s
    let grouped = Seq.groupBy id ngrams
    Seq.map (fun (ngram, occurrences) ->
        String(ngram), Seq.length occurrences) grouped
The Seq.windowed function implements a sliding window, which is exactly what you need to extract the n-grams of your string. The Seq.groupBy function collects the elements of a sequence (n-grams) into a sequence of groups that contain values with the same key. We use id to calculate the key, which means that the n-gram is itself the key (and so we get groups, where each group contains the same n-grams). Then we just convert n-gram to string and count the number of elements in the group.
Alternatively, you can write the entire function as a single processing pipeline like this:
let ngramSplit n (s:string) =
    s
    |> Seq.windowed n
    |> Seq.groupBy id
    |> Seq.map (fun (ngram, occurrences) ->
        String(ngram), Seq.length occurrences)
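As a quick sanity check, here is the pipeline version repeated in self-contained form and applied to a short string. The input "banana" and the binding name bigrams are just for illustration.

```fsharp
open System

let ngramSplit n (s: string) =
    s
    |> Seq.windowed n                       // sliding window of n characters
    |> Seq.groupBy id                       // arrays are compared structurally
    |> Seq.map (fun (ngram, occurrences) ->
        String(ngram), Seq.length occurrences)

let bigrams = ngramSplit 2 "banana" |> List.ofSeq
printfn "%A" bigrams   // [("ba", 1); ("an", 2); ("na", 2)]
```

Note that, unlike the original function, this does not pad the end of the string with "#", so a string shorter than n produces no n-grams at all.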
Your code looks OK to me. Since n-gram extraction and similarity comparison are used very often, you should consider some efficiency issues here.
The MapReduce pattern is very suitable for your frequency counting problem:
1) take a string and emit (word, 1) pairs;
2) group the pairs by word and add all the counts together.
let wordCntReducer (wseq: seq<int*int>) =
    wseq
    |> Seq.groupBy (fun (id,cnt) -> id)
    |> Seq.map (fun (id, idseq) ->
        (id, idseq |> Seq.sumBy(fun (id,cnt) -> cnt)))
(* test: wordCntReducer [1,1; 2,1; 1,1; 2,1; 2,2;] *)
You also need to maintain a <word,int> map while building n-grams for a set of strings, as it is much more efficient to handle integers rather than strings during later processing.
(2) To compare the distance between two short strings, a common practice is to use edit distance, computed with simple dynamic programming. To compute the similarity between articles, a state-of-the-art method is to use a TF-IDF feature representation. Actually, the code above is for term frequency counting, extracted from my data mining library.
(3) There are complex NLP methods, e.g. tree kernels based on the parse tree, to incorporate the context information.
I think you have some good answers for question (1).
Question (2):
You probably want cosine similarity to compare two arbitrary collections of n-grams (the larger the better). This gives you a range of 0.0 - 1.0 without any scaling needed. The Wikipedia page gives an equation, and the F# translation is pretty straightforward:
open System
let cos a b =
    let dot = Seq.sum (Seq.map2 ( * ) a b)
    let magnitude v = Math.Sqrt (Seq.sum (Seq.map2 ( * ) v v))
    dot / (magnitude a * magnitude b)
For input, you need to run something like Tomas' answer to get two maps, then remove keys that only exist in one:
let values map = map |> Map.toSeq |> Seq.map (snd >> float) // counts are ints; cos works on floats
let desparse map1 map2 = Map.filter (fun k _ -> Map.containsKey k map2) map1
let distance textA textB =
    let a = ngramSplit 3 textA |> Map.ofSeq
    let b = ngramSplit 3 textB |> Map.ofSeq
    let aValues = desparse a b |> values
    let bValues = desparse b a |> values
    cos aValues bValues
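Here is a self-contained variant for comparison. Be aware that it is not quite the same computation as above: instead of dropping n-grams that occur in only one text, it treats them as zero counts in the other, so they still contribute to the magnitudes. The names ngramCounts and cosineSimilarity are mine, and returning 0.0 when either text has no n-grams is an arbitrary choice.

```fsharp
open System

// Counts of each character n-gram in s, keyed by the n-gram string.
let ngramCounts n (s: string) =
    s
    |> Seq.windowed n
    |> Seq.countBy (fun chars -> String(chars))
    |> Map.ofSeq

// Cosine of the angle between two count vectors; missing keys count as zero.
let cosineSimilarity (a: Map<string, int>) (b: Map<string, int>) =
    let dot =
        a
        |> Map.toSeq
        |> Seq.sumBy (fun (k, va) ->
            match Map.tryFind k b with
            | Some vb -> float va * float vb
            | None -> 0.0)
    let magnitude (m: Map<string, int>) =
        m |> Map.toSeq |> Seq.sumBy (fun (_, v) -> float v * float v) |> sqrt
    let denom = magnitude a * magnitude b
    if denom = 0.0 then 0.0 else dot / denom

let similarity n textA textB =
    cosineSimilarity (ngramCounts n textA) (ngramCounts n textB)

printfn "%f" (similarity 3 "the quick brown fox" "the quick brown dog")
```

With this variant, two texts that share only a few n-grams score well below 1.0 even if the shared counts happen to match exactly.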
With character-based n-grams, I don't know how good your results will be. It depends on what kind of features of the text you are interested in. I do natural language processing, so usually my first step is part-of-speech tagging. Then I compare over n-grams of the parts of speech. I use T'n'T for this, but it has bizarro licencing issues. Some of my colleagues use ACOPOST instead, a Free alternative (as in beer AND freedom). I don't know how good the accuracy is, but POS tagging is a well-understood problem these days, at least for English and related languages.
Question (3):
The best way to compare two strings that are nearly identical is Levenshtein distance. I don't know if that is your case here, although you can relax the assumptions in a number of ways, eg for comparing DNA strings.
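For a rough idea of what that looks like, here is a sketch of the textbook dynamic-programming Levenshtein distance, keeping only two rows of the table. The similarity function that rescales it to 0.0 .. 1.0 by dividing by the longer length is one possible normalisation I picked to match the scale asked about in the question, not a standard definition.

```fsharp
// Classic edit distance: insertions, deletions and substitutions all cost 1.
let levenshtein (a: string) (b: string) =
    // prev.[j] = distance between the first (i - 1) chars of a and the first j chars of b
    let mutable prev : int[] = Array.init (b.Length + 1) id
    for i in 1 .. a.Length do
        let curr : int[] = Array.zeroCreate (b.Length + 1)
        curr.[0] <- i
        for j in 1 .. b.Length do
            let cost = if a.[i - 1] = b.[j - 1] then 0 else 1
            curr.[j] <- min (min (prev.[j] + 1) (curr.[j - 1] + 1)) (prev.[j - 1] + cost)
        prev <- curr
    prev.[b.Length]

// Hypothetical normalisation: 1.0 for identical strings, 0.0 for completely different ones.
let similarity (a: string) (b: string) =
    let longest = max a.Length b.Length
    if longest = 0 then 1.0
    else 1.0 - float (levenshtein a b) / float longest

printfn "%d" (levenshtein "kitten" "sitting")   // 3
```

The running time is proportional to the product of the two lengths, which is fine for short strings but gets expensive for whole articles.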
The standard book on this subject is Sankoff and Kruskal's "Time Warps, String Edits, and Macromolecules". It's pretty old (1983), but gives good examples of how to adapt the basic algorithm to a number of applications.
Question 3:
My reference book is Computing Patterns in Strings by Bill Smyth
