Split Dataset into smallers sets randomly f#

Split Dataset into smallers sets randomly f# - f#

I am learning F # and I would like to learn how to split a data set into 10 smaller sets randomly. Anyone have any ideas to start ??? What topic should I read ??? I need help to continue. Thank you.

A lot depends on what exactly it is that you want to achive. You can use the Permute function of most collections. Here is an example that takes advantage of the MathNet.Numerics to generate the random indexes and then shuffles the data. Of course you can first split and then shuffle the date as well. And use Array.permute instead. Just nuget MathNet.Numerics and MathNet.Numerics.FSharp.
#if INTERACTIVE
#r #"../packages/MathNet.Numerics/lib/net461/MathNet.Numerics.dll"
#r #"../packages/MathNet.Numerics.FSharp/lib/net45/MathNet.Numerics.FSharp.dll"
#endif
open System
open MathNet.Numerics
let rnd = System.Random()
let randomData = Array.init 100 (fun _ -> rnd.Next()) // generate the initial data
let randomIndex = (Combinatorics.GeneratePermutation 100) // create a random index
randomIndex
|> Array.map (fun x -> randomData.[x]) //shuffle the data
|> Array.splitInto 10 //split it into 10 subsets
Your result will be, in this case, an int array of arrays. It's more idiomatic to use Lists in F#. Also if your data is very large you might consider using Seq which is lazy.

Related

Release the processed data in sequence

I am doing data processing with F#. First I got all files in a directory, then process each file to generate some data structure. Finally I will store the processed data into SQLite. I known that if I using Seq to store the file name and then pipe-forward to Seq.map that will do lazy process for each file. But how about there are so many files that contain all of them in memory is impossible. Then in imperative programming language, I could read one file, process it, store it and release the intermedia data and do next file. Of course F# could do imperative programming, but I want to know if there are some chances to do it in Functional programming style?
files
|> Seq.map readFile
|> Seq.map processContent
|> Seq.map storeProcessResult
code above shows my opinion. files contains a sequence of file names, then I read the content of file, process it into some structure and finally store the result into database. I know that because of the lazy behaviour, file will be read and processed one by one. But when will the final data released?

Obviously only you know what happens inside your readFile, processContent and storeProcessResult functions. As #FuleSnabel says in his comment you can map and then use fold (recursion) to process the file.
Here's a simple test you can perform to see the difference in memory consumption: create a List of lists with 10 million elements and sum the list, then create a Seq of lists with 10 million elements, and sum the list. I'm using 64-bit FSI.
This will use about 1GB of memory:
let z = [for i in 1..3 -> List.init 10000000 (fun _ -> 1)]
let w = z |> List.map (fun x -> System.GC.Collect();List.sum x)
This will only use a few MB of memory, much less than even one list with 10 million 1s in it:
let x = seq {for i in 1..3 -> List.init 10000000 (fun _ -> 1 ) }
let y = x |> Seq.map (fun x -> System.GC.Collect(); List.sum x)
This is just one (and probably easy) part in the workflow. If you're opening files, you have to make sure to close them as well, hence my suggestion of use above. However I do recognize that accessing the filesystem, and processing large amounts of data in a lazy sequence might cause some problems, in that case you can always profile it and see where the bottleneck is.
By the way, you don't need to call the GC directly in the code, I just did so the intermediate results don't pollute the memory count in the test.

formatting Composite function in f#

I have a recursive function in f# that iterates a string[] of commands that need to be run, each command runs a new command to generate a map to be passed to the next function.
The commands run correctly but are large and cumbersome to read, I believe that there is a better way to order / format these composite functions using pipe syntax however coming from c# as a lot of us do i for the life of me cannot seem to get it to work.
my command is :
let rec iterateCommands (map:Map<int,string array>) commandPosition =
if commandPosition < commands.Length then
match splitCommand(commands.[0]).[0] with
|"comOne" ->
iterateCommands (map.Add(commandPosition,create(splitCommand commands.[commandPosition])))(commandPosition+1)
The closest i have managed is by indenting the function but this is messy :
iterateCommands
(map.Add
(commandPosition,create
(splitCommand commands.[commandPosition])
)
)
(commandPosition+1)
Is it even possible to reformat this in f#? From what i have read i believe it possible, any help would be greatly appreciated
The command/variable types are:
commandPosition - int
commands - string[]
splitCommand string -> string[]
create string[] -> string[]
map : Map<int,string[]>
and of course the map.add map -> map + x

It's often hard to make out what is going on in a big statement with multiple inputs. I'd give names to the individual expressions, so that a reader can jump into any position and have a rough idea what's in the values used in a calculation, e.g.
let inCommands = splitCommand commands.[commandPosition]
let map' = map.Add (commandPosition, inCommands)
iterateCommands map' inCommands
Since I don't know what is being done here, the names aren't very meaningful. Ideally, they'd help to understand the individual steps of the calculation.

It'd be a bit easier to compose the call if you changed the arguments around:
let rec iterateCommands commandPosition (map:Map<int,string array>) =
// ...
That would enable you to write something like:
splitCommand commands.[commandPosition]
|> create
|> (fun x -> commandPosition, x)
|> map.Add
|> iterateCommands (commandPosition + 1)
The fact that commandPosition appears thrice in the composition is, in my opinion, a design smell, as is the fact that the type of this entire expression is unit. It doesn't look particularly functional, but since I don't understand exactly what this function attempts to do, I can't suggest a better design.
If you don't control iterateCommands, and hence can't change the order of arguments, you can always define a standard functional programming utility function:
let flip f x y = f y x
This enables you to write the following against the original version of iterateCommands:
splitCommand commands.[commandPosition]
|> create
|> (fun x -> commandPosition, x)
|> map.Add
|> (flip iterateCommands) (commandPosition + 1)

Custom Operator for Lag or Standard Deviation

What is the proper way to extend the available operators when using RX?
I'd like to build out some operations that I think would be useful.
The first operation is simply the standard deviation of a series.
The second operation is the nth lagged value i.e. if we are lagging 2 and our series is A B C D E F when F is pushed the lag would be D when A is pushed the lag would be null/empty when B is pushed the lag would be null/empty when C is pushed the Lag would be A
Would it make sense to base these types of operators off of the built-ins from rx.codeplex.com or is there an easier way?

In idiomatic Rx, arbitrary delays can be composed by Zip.
let lag (count : int) o =
let someo = Observable.map Some o
let delayed = Observable.Repeat(None, count).Concat(someo)
Observable.Zip(someo, delayed, (fun c d -> d))
As for a rolling buffer, the most efficient way is to simply use a Queue/ResizeArray of fixed size.
let rollingBuffer (size : int) o =
Observable.Create(fun (observer : IObserver<_>) ->
let buffer = new Queue<_>(size)
o |> Observable.subscribe(fun v ->
buffer.Enqueue(v)
if buffer.Count = size then
observer.OnNext(buffer.ToArray())
buffer.Dequeue() |> ignore
)
)
For numbers |> rollingBuffer 3 |> log:
seq [0L; 1L; 2L]
seq [1L; 2L; 3L]
seq [2L; 3L; 4L]
seq [3L; 4L; 5L]
...
For pairing adjacent values, you can just use Observable.pairwise
let delta (a, b) = b - a
let deltaStream = numbers |> Observable.pairwise |> Observable.map(delta)
Observable.Scan is more concise if you want to apply a rolling calculation .

Some of these are easier than others (as usual). For a 'lag' by count (rather than time) you just create a sliding window by using Observable.Buffer equivalent to the size of 'lag', then take the first element of the result list.
So far lag = 3, the function is:
obs.Buffer(3,1).Select(l => l.[0])
This is pretty straightforward to turn into an extension function. I don't know if it is efficient in that it reuses the same list, but in most cases that shouldn't matter. I know you want F#, the translation is straightforward.
For running aggregates, you can usually use Observable.Scan to get a 'running' value. This is calculated based on all values seen so far (and is pretty straightforward to implement) - ie all you have to implement each subsequent element is the previous aggregate and the new element.
If for whatever reason you need a running aggregate based on a sliding window, then we get into more difficult domain. Here you first need an operation that can give you a sliding window - this is covered by Buffer above. However, then you need to know which values have been removed from this window, and which have been added.
As such, I recommend a new Observable function that maintains an internal window based on existing window + new value, and returns new window + removed value + added value. You can write this using Observable.Scan (I recommend an internal Queue for efficient implementation). It should take a function for determining which values to remove given a new value (this way it can be parameterised for sliding by time or by count).
At that point, Observable.Scan can again be used to take the old aggregate + window + removed values + added value and give a new aggregate.
Hope this helps, I do realise it's a lot of words. If you can confirm the requirement, I can help out with the actual extension method for that specific use case.

For lag, you could do something like
module Observable =
let lag n obs =
let buf = System.Collections.Generic.Queue()
obs |> Observable.map (fun x ->
buf.Enqueue(x)
if buf.Count > n then Some(buf.Dequeue())
else None)
This:
Observable.Range(1, 9)
|> Observable.lag 2
|> Observable.subscribe (printfn "%A")
|> ignore
prints:
<null>
<null>
Some 1
Some 2
Some 3
Some 4
Some 5
Some 6
Some 7

How do I just generate data with fscheck?

Is it possible to generate data, specifically a list, with fscheck for use outside of fscheck? I'm unable to debug a situation in fscheck testing where it looks like the comparison results are equal, but fscheck says they are not.
I have this generator for a list of objects. How do I generate a list I can use from this generator?
let genListObj min max = Gen.listOf Arb.generate<obj> |> Gen.suchThat (fun l -> (l.Length >= min) && (l.Length <= max))

Edit: this function is now part of the FsCheck API (Gen.sample) so you don't need the below anymore...
Here is a sample function to generate n samples from a given generator:
let sample n gn =
let rec sample i seed samples =
if i = 0 then samples
else sample (i-1) (Random.stdSplit seed |> snd) (Gen.eval 1000 seed gn :: samples)
sample n (Random.newSeed()) []
Edit: the 1000 magic number in there represents the size of the generated values. 1000 is pretty big - e.g. sequences will be between 0 and 1000 elements long, and so will strings, for example. If generation takes a long time, you may want to tweak that value (or take it in as a parameter of the function).

N-gram split function for string similarity comparison

As part of excersise to better understand F# which I am currently learning , I wrote function to
split given string into n-grams.
1) I would like to receive feedback about my function : can this be written simpler or in more efficient way?
2) My overall goal is to write function that returns string similarity (on 0.0 .. 1.0 scale) based on n-gram similarity; Does this approach works well for short strings comparisons , or can this method reliably be used to compare large strings (like articles for example).
3) I am aware of the fact that n-gram comparisons ignore context of two strings. What method would you suggest to accomplish my goal?
//s:string - target string to split into n-grams
//n:int - n-gram size to split string into
let ngram_split (s:string, n:int) =
let ngram_count = s.Length - (s.Length % n)
let ngram_list = List.init ngram_count (fun i ->
if( i + n >= s.Length ) then
s.Substring(i,s.Length - i) + String.init ((i + n) - s.Length)
(fun i -> "#")
else
s.Substring(i,n)
)
let ngram_array_unique = ngram_list
|> Seq.ofList
|> Seq.distinct
|> Array.ofSeq
//produce tuples of ngrams (ngram string,how much occurrences in original string)
Seq.init ngram_array_unique.Length (fun i -> (ngram_array_unique.[i],
ngram_list |> List.filter(fun item -> item = ngram_array_unique.[i])
|> List.length)
)

I don't know much about evaluating similarity of strings, so I can't give you much feedback regarding points 2 and 3. However, here are a few suggestions that may help to make your implementation simpler.
Many of the operations that you need to do are already available in some F# library function for working with sequences (lists, arrays, etc.). Strings are also sequences (of characters), so you can write the following:
open System
let ngramSplit n (s:string) =
let ngrams = Seq.windowed n s
let grouped = Seq.groupBy id ngrams
Seq.map (fun (ngram, occurrences) ->
String(ngram), Seq.length occurrences) grouped
The Seq.windowed function implements a sliding window, which is exactly what you need to extract the n-grams of your string. The Seq.groupBy function collects the elements of a sequence (n-grams) into a sequence of groups that contain values with the same key. We use id to calculate the key, which means that the n-gram is itself the key (and so we get groups, where each group contains the same n-grams). Then we just convert n-gram to string and count the number of elements in the group.
Alternatively, you can write the entire function as a single processing pipeline like this:
let ngramSplit n (s:string) =
s |> Seq.windowed n
|> Seq.groupBy id
|> Seq.map (fun (ngram, occurrences) ->
String(ngram), Seq.length occurrences)

Your code looks OK to me. Since ngram extraction and similarity comparison are used very often. You should consider some efficiency issues here.
The MapReduce pattern is very suitable for your frequency counting problem:
get a string, emit (word, 1) pairs out
do a grouping of the words and adds all the counting together.
let wordCntReducer (wseq: seq<int*int>) =
wseq
|> Seq.groupBy (fun (id,cnt) -> id)
|> Seq.map (fun (id, idseq) ->
(id, idseq |> Seq.sumBy(fun (id,cnt) -> cnt)))
(* test: wordCntReducer [1,1; 2,1; 1,1; 2,1; 2,2;] *)
You also need to maintain a <word,int> map during your ngram building for a set of strings. As it is much more efficient to handle integers rather than strings during later processing.
(2) to compare the distance between two short strings. A common practice is to use Edit Distance using a simple dynamic programming. To compute the similarity between articles, a state-of-the-art method is to use TFIDF feature representation. Actuallym the code above is for term frequency counting, extracted from my data mining library.
(3) There are complex NLP methods, e.g. tree kernels based on the parse tree, to in-cooperate the context information in.

I think you have some good answers for question (1).
Question (2):
You probably want cosine similarity to compare two arbitrary collections of n-grams (the larger better). This gives you a range of 0.0 - 1.0 without any scaling needed. The Wikipedia page gives an equation, and the F# translation is pretty straightforward:
let cos a b =
let dot = Seq.sum (Seq.map2 ( * ) a b)
let magnitude v = Math.Sqrt (Seq.sum (Seq.map2 ( * ) v v))
dot / (magnitude a * magnitude b)
For input, you need to run something like Tomas' answer to get two maps, then remove keys that only exist in one:
let values map = map |> Map.toSeq |> Seq.map snd
let desparse map1 map2 = Map.filter (fun k _ -> Map.containsKey k map2) map1
let distance textA textB =
let a = ngramSplit 3 textA |> Map.ofSeq
let b = ngramSplit 3 textB |> Map.ofSeq
let aValues = desparse a b |> values
let bValues = desparse b a |> values
cos aValues bValues
With character-based n-grams, I don't know how good your results will be. It depends on what kind of features of the text you are interested in. I do natural language processing, so usually my first step is part-of-speech tagging. Then I compare over n-grams of the parts of speech. I use T'n'T for this, but it has bizarro licencing issues. Some of my colleagues use ACOPOST instead, a Free alternative (as in beer AND freedom). I don't know how good the accuracy is, but POS tagging is a well-understood problem these days, at least for English and related languages.
Question (3):
The best way to compare two strings that are nearly identical is Levenshtein distance. I don't know if that is your case here, although you can relax the assumptions in a number of ways, eg for comparing DNA strings.
The standard book on this subject is Sankoff and Kruskal's "Time Warps, String Edits, and Maromolecules". It's pretty old (1983), but gives good examples of how to adapt the basic algorithm to a number of applications.

Question 3:
My reference book is Computing Patterns in Strings by Bill Smyth

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Split Dataset into smallers sets randomly f# - f#

I am learning F # and I would like to learn how to split a data set into 10 smaller sets randomly. Anyone have any ideas to start ??? What topic should I read ??? I need help to continue. Thank you.

Related

Release the processed data in sequence

formatting Composite function in f#

Custom Operator for Lag or Standard Deviation

How do I just generate data with fscheck?

N-gram split function for string similarity comparison

Categories

Resources