Apply several aggregate functions with one enumeration - f#

Let's assume I have a series of functions that work on a sequence, and I want to use them together in the following fashion:
let meanAndStandardDeviation data =
let m = mean data
let sd = standardDeviation data
(m, sd)
The code above is going to enumerate the sequence twice. I am interested in a function that will give the same result but enumerate the sequence only once. This function will be something like this:
magicFunction (mean, standardDeviation) data
where the input is a tuple of functions and a sequence and the ouput is the same with the function above.
Is this possible if the functions mean and stadardDeviation are black boxes and I cannot change their implementation?
If I wrote mean and standardDeviation myself, is there a way to make them work together? Maybe somehow making them keep yielding the input to the next function and hand over the result when they are done?

The only way to do this using just a single iteration when the functions are black boxes is to use the Seq.cache function (which evaluates the sequence once and stores the results in memory) or to convert the sequence to other in-memory representation.
When a function takes seq<T> as an argument, you don't even have a guarantee that it will evaluate it just once - and usual implementations of standard deviation would first calculate the average and then iterate over the sequence again to calculate the squares of errors.
I'm not sure if you can calculate standard deviation with just a single pass. However, it is possible to do that if the functions are expressed using fold. For example, calculating maximum and average using two passes looks like this:
let maxv = Seq.fold max Int32.MinValue input
let minv = Seq.fold min Int32.MaxValue input
You can do that using a single pass like this:
Seq.fold (fun (s1, s2) v ->
(max s1 v, min s2 v)) (Int32.MinValue, Int32.MaxValue) input
The lambda function is a bit ugly, but you can define a combinator to compose two functions:
let par f g (i, j) v = (f i v, g j v)
Seq.fold (par max min) (Int32.MinValue, Int32.MaxValue) input
This approach works for functions that can be defined using fold, which means that they consist of some initial value (Int32.MinValue in the first example) and then some function that is used to update the initial (previous) state when it gets the next value (and then possibly some post-processing of the result). In general, it should be possible to rewrite single-pass functions in this style, but I'm not sure if this can be done for standard deviation. It can be definitely done for mean:
let (count, sum) = Seq.fold (fun (count, sum) v ->
(count + 1.0, sum + v)) (0.0, 0.0) input
let mean = sum / count

What we're talking about here is a function with the following signature:
(seq<'a> -> 'b) * (seq<'a> -> 'c) -> seq<'a> -> ('b * 'c)
There is no straightforward way that I can think of that will achieve the above using a single iteration of the sequence if that is the signature of the functions. Well, no way that is more efficient than:
let magicFunc (f1:seq<'a>->'b, f2:seq<'a>->'c) (s:seq<'a>) =
let cached = s |> Seq.cache
(f1 cached, f2 cached)
That ensures a single iteration of the sequence itself (perhaps there are side effects, or it's slow), but does so by essentially caching the results. The cache is still iterated another time. Is there anything wrong with that? What are you trying to achieve?

Related

unexpected return type from list comprehension

I am teaching myself a bit of F# by doing a bit of simple matrix mathematics. I decided to write a set of simple functions for combining two matrices as I thought that this would be a good way of learning list comprehensions. However when I compile it my unit tests produce a type mismatch exception.
//return a column from the matrix as a list
let getColumn(matrix: list<list<double>>, column:int) =
[for row in matrix do yield row.Item(column)]
//return a row from the matrix as a list
let getRow(matrix: list<list<double>>, column:int) =
matrix.Item(column)
//find the minimum width of the matrices in order to avoid index out of range exceptions
let minWidth(matrix1: list<list<double>>,matrix2: list<list<double>>) =
let width1 = [for row in matrix1 do yield row.Length] |> List.min
let width2 = [for row in matrix2 do yield row.Length] |> List.min
if width1 > width2 then width2 else width1
//find the minimum height of the matrices in order to avoid index out of range exceptions
let minHeight(matrix1: list<list<double>>,matrix2: list<list<double>>) =
let height1 = matrix1.Length
let height2 = matrix2.Length
if height1 > height2 then height2 else height1
//combine the two matrices
let concat(matrix1: list<list<double>>,matrix2: list<list<double>>) =
let width = minWidth(matrix1, matrix2)
let height = minHeight(matrix1, matrix2)
[for y in 0 .. height do yield [for x in 0 .. width do yield (List.fold2 (fun acc a b -> acc + (a*b)), getRow(matrix1, y), getColumn(matrix2, x))]]
I was expecting the function to return a list of lists of type
double list list
However what it actually returns looks more like some kind of lambda expression
((int -> int list -> int list -> int) * double list * double list) list list
Can somebody tell me what is being returned, and how to force it to be evaluated into the list of lists that I originally expected?
There's a short answer and a long answer to your question.
The short answer
The short version is that F# functions (like List.fold2) take multiple parameters not with commas the way you think they do, but with spaces in between. I.e., you should NOT call List.fold2 like this:
List.fold2 (function, list1, list2)
but rather like this:
List.fold2 function list1 list2
Now, if you just remove the commas in your List.fold2 call, you'll see that the compiler complains about your getRow(matrix1, y) call, and tells you to put parentheses around them. (And the outer pair of parentheses around List.fold2 isn't actually needed). So this:
(List.fold2 (fun acc a b -> acc + (a*b)), getRow(matrix1, y), getColumn(matrix2, x))
Needs to turn into this:
List.fold2 (fun acc a b -> acc + (a*b)) (getRow(matrix1, y)) (getColumn(matrix2, x))
The long answer
The way F# functions take multiple parameters is actually very different from most other languages such as C#. In fact, all F# functions take exactly one parameter! "But wait," you're probably thinking right now, "you just now showed me the syntax for F# functions taking multiple parameters!" Yes, I did. What's going on under the hood is a combination of currying and partial application. I'd write a long explanation, but Scott Wlaschin has already written one, that's much better than I could have written, so I'll just point you to the https://fsharpforfunandprofit.com/series/thinking-functionally.html series to help you understand what's going on here. (The sections on currying and partial application are the ones you want, but I'd recommend reading the series in order because the later parts build on concepts introduced in earlier parts).
And yes, this "long" answer appears shorter than the "short" answer, but if you go read that series (and then the rest of Scott Wlaschin's excellent site), you'll find that it's much longer than the short answer. :-)
If you have more questions, I'll be happy to try to explain.

What's the most "functional" way to select a subset from this array?

I'd like to get more comfortable with functional programming, and the first educational task I've set myself is converting a program that computes audio frequencies from C# to F#. The meat of the original application is a big "for" loop that selects a subset of the values in a large array; which values are taken depends on the last accepted value and a ranked list of the values seen since then. There are a few variables that persist between iterations to track progress toward determining the next value.
My first attempt at making this loop more "functional" involved a tail-recursive function whose arguments included the array, the result set so far, the ranked list of values recently seen, and a few other items that need to persist between executions. This seems clunky, and I don't feel like I've gained anything by turning everything that used to be a variable into a parameter on this recursive function.
How would a functional programming master approach this kind of task? Is this an exceptional situation in which a "pure" functional approach doesn't quite fit, and am I wrong for eschewing mutable variables just because I feel they reduce the "purity" of my function? Maybe they don't make it less pure since they only exist inside that function's scope. I don't have a feel for that yet.
Here's an attempted distillation of the code, with some "let" statements and the actual components of state removed ("temp" is the intermediate result array that needs to be processed):
let fif (_,_,_,_,fif) = fif
temp
|> Array.fold (fun (a, b, c, tentativeNextVals, acc) curVal ->
if (hasProperty curVal c) then
// do not consider current value
(a, b, c, Seq.empty, acc)
else
if (hasOtherProperty curVal b) then
// add current value to tentative list
(a, b, c, tentativeNextVals.Concat [curVal], acc)
else
// accept a new value
let newAcceptedVal = chooseNextVal (tentativeNextVals.Concat [curVal])
(newC, newB, newC, Seq.empty, acc.Concat [newAcceptedVal])
) (0,0,0,Seq.empty,Seq.empty)
|> fif
Something like this using fold?
let filter list =
List.fold (fun statevar element -> if condition statevar then statevar else element) initialvalue list
Try using Seq.skip and Seq.take:
let subset (min, max) seq =
seq
|> Seq.skip (min)
|> Seq.take (max - min)
This function will accept arrays but return a sequence, so you can convert it back using Array.ofSeq.
PS: If your goal is to keep your program functional, the most important rule is this: avoid mutability as much as you can. This means that you probably shouldn't be using arrays; use lists which are immutable. If you're using an array for it's fast random access, go for it; just be sure to never set indices.

Custom Operator for Lag or Standard Deviation

What is the proper way to extend the available operators when using RX?
I'd like to build out some operations that I think would be useful.
The first operation is simply the standard deviation of a series.
The second operation is the nth lagged value i.e. if we are lagging 2 and our series is A B C D E F when F is pushed the lag would be D when A is pushed the lag would be null/empty when B is pushed the lag would be null/empty when C is pushed the Lag would be A
Would it make sense to base these types of operators off of the built-ins from rx.codeplex.com or is there an easier way?
In idiomatic Rx, arbitrary delays can be composed by Zip.
let lag (count : int) o =
let someo = Observable.map Some o
let delayed = Observable.Repeat(None, count).Concat(someo)
Observable.Zip(someo, delayed, (fun c d -> d))
As for a rolling buffer, the most efficient way is to simply use a Queue/ResizeArray of fixed size.
let rollingBuffer (size : int) o =
Observable.Create(fun (observer : IObserver<_>) ->
let buffer = new Queue<_>(size)
o |> Observable.subscribe(fun v ->
buffer.Enqueue(v)
if buffer.Count = size then
observer.OnNext(buffer.ToArray())
buffer.Dequeue() |> ignore
)
)
For numbers |> rollingBuffer 3 |> log:
seq [0L; 1L; 2L]
seq [1L; 2L; 3L]
seq [2L; 3L; 4L]
seq [3L; 4L; 5L]
...
For pairing adjacent values, you can just use Observable.pairwise
let delta (a, b) = b - a
let deltaStream = numbers |> Observable.pairwise |> Observable.map(delta)
Observable.Scan is more concise if you want to apply a rolling calculation .
Some of these are easier than others (as usual). For a 'lag' by count (rather than time) you just create a sliding window by using Observable.Buffer equivalent to the size of 'lag', then take the first element of the result list.
So far lag = 3, the function is:
obs.Buffer(3,1).Select(l => l.[0])
This is pretty straightforward to turn into an extension function. I don't know if it is efficient in that it reuses the same list, but in most cases that shouldn't matter. I know you want F#, the translation is straightforward.
For running aggregates, you can usually use Observable.Scan to get a 'running' value. This is calculated based on all values seen so far (and is pretty straightforward to implement) - ie all you have to implement each subsequent element is the previous aggregate and the new element.
If for whatever reason you need a running aggregate based on a sliding window, then we get into more difficult domain. Here you first need an operation that can give you a sliding window - this is covered by Buffer above. However, then you need to know which values have been removed from this window, and which have been added.
As such, I recommend a new Observable function that maintains an internal window based on existing window + new value, and returns new window + removed value + added value. You can write this using Observable.Scan (I recommend an internal Queue for efficient implementation). It should take a function for determining which values to remove given a new value (this way it can be parameterised for sliding by time or by count).
At that point, Observable.Scan can again be used to take the old aggregate + window + removed values + added value and give a new aggregate.
Hope this helps, I do realise it's a lot of words. If you can confirm the requirement, I can help out with the actual extension method for that specific use case.
For lag, you could do something like
module Observable =
let lag n obs =
let buf = System.Collections.Generic.Queue()
obs |> Observable.map (fun x ->
buf.Enqueue(x)
if buf.Count > n then Some(buf.Dequeue())
else None)
This:
Observable.Range(1, 9)
|> Observable.lag 2
|> Observable.subscribe (printfn "%A")
|> ignore
prints:
<null>
<null>
Some 1
Some 2
Some 3
Some 4
Some 5
Some 6
Some 7

N-gram split function for string similarity comparison

As part of excersise to better understand F# which I am currently learning , I wrote function to
split given string into n-grams.
1) I would like to receive feedback about my function : can this be written simpler or in more efficient way?
2) My overall goal is to write function that returns string similarity (on 0.0 .. 1.0 scale) based on n-gram similarity; Does this approach works well for short strings comparisons , or can this method reliably be used to compare large strings (like articles for example).
3) I am aware of the fact that n-gram comparisons ignore context of two strings. What method would you suggest to accomplish my goal?
//s:string - target string to split into n-grams
//n:int - n-gram size to split string into
let ngram_split (s:string, n:int) =
let ngram_count = s.Length - (s.Length % n)
let ngram_list = List.init ngram_count (fun i ->
if( i + n >= s.Length ) then
s.Substring(i,s.Length - i) + String.init ((i + n) - s.Length)
(fun i -> "#")
else
s.Substring(i,n)
)
let ngram_array_unique = ngram_list
|> Seq.ofList
|> Seq.distinct
|> Array.ofSeq
//produce tuples of ngrams (ngram string,how much occurrences in original string)
Seq.init ngram_array_unique.Length (fun i -> (ngram_array_unique.[i],
ngram_list |> List.filter(fun item -> item = ngram_array_unique.[i])
|> List.length)
)
I don't know much about evaluating similarity of strings, so I can't give you much feedback regarding points 2 and 3. However, here are a few suggestions that may help to make your implementation simpler.
Many of the operations that you need to do are already available in some F# library function for working with sequences (lists, arrays, etc.). Strings are also sequences (of characters), so you can write the following:
open System
let ngramSplit n (s:string) =
let ngrams = Seq.windowed n s
let grouped = Seq.groupBy id ngrams
Seq.map (fun (ngram, occurrences) ->
String(ngram), Seq.length occurrences) grouped
The Seq.windowed function implements a sliding window, which is exactly what you need to extract the n-grams of your string. The Seq.groupBy function collects the elements of a sequence (n-grams) into a sequence of groups that contain values with the same key. We use id to calculate the key, which means that the n-gram is itself the key (and so we get groups, where each group contains the same n-grams). Then we just convert n-gram to string and count the number of elements in the group.
Alternatively, you can write the entire function as a single processing pipeline like this:
let ngramSplit n (s:string) =
s |> Seq.windowed n
|> Seq.groupBy id
|> Seq.map (fun (ngram, occurrences) ->
String(ngram), Seq.length occurrences)
Your code looks OK to me. Since ngram extraction and similarity comparison are used very often. You should consider some efficiency issues here.
The MapReduce pattern is very suitable for your frequency counting problem:
get a string, emit (word, 1) pairs out
do a grouping of the words and adds all the counting together.
let wordCntReducer (wseq: seq<int*int>) =
wseq
|> Seq.groupBy (fun (id,cnt) -> id)
|> Seq.map (fun (id, idseq) ->
(id, idseq |> Seq.sumBy(fun (id,cnt) -> cnt)))
(* test: wordCntReducer [1,1; 2,1; 1,1; 2,1; 2,2;] *)
You also need to maintain a <word,int> map during your ngram building for a set of strings. As it is much more efficient to handle integers rather than strings during later processing.
(2) to compare the distance between two short strings. A common practice is to use Edit Distance using a simple dynamic programming. To compute the similarity between articles, a state-of-the-art method is to use TFIDF feature representation. Actuallym the code above is for term frequency counting, extracted from my data mining library.
(3) There are complex NLP methods, e.g. tree kernels based on the parse tree, to in-cooperate the context information in.
I think you have some good answers for question (1).
Question (2):
You probably want cosine similarity to compare two arbitrary collections of n-grams (the larger better). This gives you a range of 0.0 - 1.0 without any scaling needed. The Wikipedia page gives an equation, and the F# translation is pretty straightforward:
let cos a b =
let dot = Seq.sum (Seq.map2 ( * ) a b)
let magnitude v = Math.Sqrt (Seq.sum (Seq.map2 ( * ) v v))
dot / (magnitude a * magnitude b)
For input, you need to run something like Tomas' answer to get two maps, then remove keys that only exist in one:
let values map = map |> Map.toSeq |> Seq.map snd
let desparse map1 map2 = Map.filter (fun k _ -> Map.containsKey k map2) map1
let distance textA textB =
let a = ngramSplit 3 textA |> Map.ofSeq
let b = ngramSplit 3 textB |> Map.ofSeq
let aValues = desparse a b |> values
let bValues = desparse b a |> values
cos aValues bValues
With character-based n-grams, I don't know how good your results will be. It depends on what kind of features of the text you are interested in. I do natural language processing, so usually my first step is part-of-speech tagging. Then I compare over n-grams of the parts of speech. I use T'n'T for this, but it has bizarro licencing issues. Some of my colleagues use ACOPOST instead, a Free alternative (as in beer AND freedom). I don't know how good the accuracy is, but POS tagging is a well-understood problem these days, at least for English and related languages.
Question (3):
The best way to compare two strings that are nearly identical is Levenshtein distance. I don't know if that is your case here, although you can relax the assumptions in a number of ways, eg for comparing DNA strings.
The standard book on this subject is Sankoff and Kruskal's "Time Warps, String Edits, and Maromolecules". It's pretty old (1983), but gives good examples of how to adapt the basic algorithm to a number of applications.
Question 3:
My reference book is Computing Patterns in Strings by Bill Smyth

How To Apply a Function to an Array of float Arrays?

Let's suppose I have n arrays, where n is a variable (some number greater than 2, usually less than 10).
Each array has k elements.
I also have an array of length n that contains a set of weights that dictate how I would like to linearly combine all the arrays.
I am trying to create a high performance higher order function to combine these arrays in F#.
How can I do this, so that I get a function that takes an array of arrays (arrs is a sample), a weights array (weights), and then computed a weighted sum based on the weights?
let weights = [|.6;;.3;.1|]
let arrs = [| [|.0453;.065345;.07566;1.562;356.6|] ;
[|.0873;.075565;.07666;1.562222;3.66|] ;
[|.06753;.075675;.04566;1.452;3.4556|] |]
thanks for any ideas.
Here's one solution:
let combine weights arrs =
Array.map2 (fun w -> Array.map ((*) w)) weights arrs
|> Array.reduce (Array.map2 (+))
EDIT
Here's some (much needed) explanation of how this works. Logically, we want to do the following:
Apply each weight to its corresponding row.
Add together the weight-adjusted rows.
The two lines above do just that.
We use the Array.map2 function to combine corresponding weights and rows; the way that we combine them is to multiply each element in the row by the weight, which is accomplished via the inner Array.map.
Now we have an array of weighted rows and need to add them together. We can do this one step at a time by keeping a running sum, adding each array in turn. The way we sum two arrays pointwise is to use Array.map2 again, using (+) as the function for combining the elements from each. We wrap this in an Array.reduce to apply this addition function to each row in turn, starting with the first row.
Hopefully this is a reasonably elegant approach to the problem, though the point-free style admittedly makes it a bit tricky to follow. However, note that it's not especially performant; doing in-place updates rather than creating new arrays with each application of map, map2, and reduce would be more efficient. Unfortunately, the standard library doesn't contain nice analogues of these operations which work in-place. It would be relatively easy to create such analogues, though, and they could be used in almost exactly the same way as I've done here.
Something like this did it for me:
let weights = [|0.6;0.3;0.1|]
let arrs = [| [|0.0453;0.065345;0.07566;1.562;356.6|] ;
[|0.0873;0.075565;0.07666;1.562222;3.66|] ;
[|0.06753;0.075675;0.04566;1.452;3.4556|] |]
let applyWeight x y = x * y
let rotate (arr:'a[][]) =
Array.map (fun y -> (Array.map (fun x -> arr.[x].[y])) [|0..arr.Length - 1|]) [|0..arr.[0].Length - 1|]
let weightedarray = Array.map (fun x -> Array.map(applyWeight (fst x)) (snd x)) (Array.zip weights arrs)
let newarrs = Array.map Array.sum (rotate weightedarray)
printfn "%A" newarrs
By the way.. the 0 preceding a float value is necessary.

Resources