Query Expressions and Lazy Evaluation - f#

I am hoping to understand how query expressions are really evaluated. I have a situation where I'm using a query expression to access a large amount of data from a database. I then interact with this data via a GUI. For example the user might supply an additive factor that I want to apply to one column and then plot. What I'm not clear on is how to structure this so that the same data isn't being pulled from the database each time the GUI updates.
For example:
let a state= query{...}
let results = a "ALASKA"
let calcoutput y = results |> Seq.map (fun x -> x.Temperature + y)
or
let calcoutput state y = a state |> Seq.map (fun x -> x.Temperature + y)
I am not clear whether these two versions are actually equivalent, and if so, whether I am pulling data from the DB each time I execute calcoutput with a different y (it appears so). Should I be converting the "results" sequence to a List and then using that to avoid this?

You can use the Seq.cache function.
http://msdn.microsoft.com/en-us/library/ee370430.aspx
Quote: "This result sequence will have the same elements as the input sequence. The result can be enumerated multiple times. The input sequence is enumerated at most once and only as far as is necessary. Caching a sequence is typically useful when repeatedly evaluating items in the original sequence is computationally expensive or if iterating the sequence causes side-effects that the user does not want to be repeated multiple times."
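A minimal sketch of the effect, assuming a hypothetical fakeQuery that stands in for the database query and counts how often it is enumerated (the counter and the sample temperatures are made up for illustration):

```fsharp
// Hypothetical stand-in for the database query; not part of the original question.
let enumerations = ref 0

let fakeQuery (state: string) : seq<float> =
    seq {
        // Runs every time the sequence is enumerated from the start.
        enumerations.Value <- enumerations.Value + 1
        yield! [ 10.0; 12.5; 8.0 ]
    }

// Cache once; later enumerations read from the cache instead of the source.
let results = fakeQuery "ALASKA" |> Seq.cache

let calcoutput y = results |> Seq.map (fun t -> t + y) |> Seq.toList

let first = calcoutput 1.0
let second = calcoutput 2.0
printfn "source enumerated %d time(s)" enumerations.Value
```

Converting once with List.ofSeq, as the question suggests, should have the same effect. Seq.cache differs only in that it fills the cache lazily, as far as consumers actually read.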

Related

How to combine two differently typed sequences into tuples in f#?

I'm a bit new to F#, I have mostly c# background.
I'm working with two lists/sequences that represent the same thing, but from different data sources (one is a local file, the other is a group of items in an online system).
I need to report the mismatches between the two datasets.
So far I have filtered down the two lists to contain only items that have mismatches in the other dataset.
Now I want to pair them up into tuples (or anything else really) based on one of the properties so I can log the differences.
So far what I've tried is this:
let printDiff (offlinePackages: oP seq) (onLinePackages: onP seq) =
    Seq.map2(fun x y -> (x, y)) offlinePackages onLinePackages
This does pair up my data, however I'd still need some logic to do the pairup based on their matching property value.
I'm looking for something like:
if offLinePackage.Label = OnlinePackage.Label then /do the matchup/
else /don't do anything/
I know I'm still stuck in my object oriented thinking, that's also why I'm asking.
Thanks in advance!
Matching sequence elements based on some equivalency function - that's called "join".
The most "straightforward" way to do a join is to get a Cartesian product and filter it down, like this:
let matchup seq1 seq2 =
    Seq.allPairs seq1 seq2
    |> Seq.filter (fun (x, y) -> x.someProp = y.someProp)
Or in a computation expression form:
let matchup seq1 seq2 =
    seq {
        for x in seq1 do
            for y in seq2 do
                if x.someProp = y.someProp then yield (x,y)
    }
But this is a bit inefficient. The complexity would be O(n*m), because it iterates over all possible pairs. Fine if the lists are short, but will bite you the more you scale.
To do this more efficiently, you can use the query computation builder and its operation join, which does a hash-join (i.e. it builds a hashtable first and then matches the elements based on that):
let matchup seq1 seq2 =
    query {
        for x in seq1 do
        join y in seq2 on (x.someProp = y.someProp)
        select (x,y)
    }
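To make the join concrete, here is a small self-contained sketch. The record types, field names other than Label, and sample data are hypothetical, chosen only to mirror the offline/online packages in the question:

```fsharp
// Hypothetical record types standing in for the asker's oP and onP.
type OfflinePackage = { Label: string; Version: string }
type OnlinePackage = { Label: string; Revision: string }

let offline =
    [ { OfflinePackage.Label = "alpha"; Version = "1.0" }
      { OfflinePackage.Label = "beta"; Version = "2.0" } ]

let online =
    [ { OnlinePackage.Label = "beta"; Revision = "2.1" }
      { OnlinePackage.Label = "gamma"; Revision = "0.3" } ]

// Pair up only the items whose Label matches on both sides.
let matchup (seq1: seq<OfflinePackage>) (seq2: seq<OnlinePackage>) =
    query {
        for x in seq1 do
        join y in seq2 on (x.Label = y.Label)
        select (x, y)
    }

let pairs = matchup offline online |> Seq.toList

for (x, y) in pairs do
    printfn "%s: offline %s, online %s" x.Label x.Version y.Revision
```

With this sample data only "beta" appears on both sides, so a single pair comes back. Items without a partner are dropped, which is what an inner join does.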

This prime factorization code works for small numbers but fails with an OutOfMemoryException for large numbers?

I'm trying to get the prime factors for a large number..
let factors (x:int64) =
    [1L..x]
    |> Seq.filter(fun n -> x%n = 0L)
let isPrime (x:int64) =
    factors x
    |> Seq.length = 2
let primeFactors (x:int64)=
    factors x
    |> Seq.filter isPrime
This works for say 13195 but fails with an OutOfMemoryException for 600851475143?
Sorry if i'm missing something obvious, it's only my third day on F# and I didn't know what a prime factor was until this morning.
The expression [1L..x] creates a list, which in your example gets too large to be stored in memory.
Sequences, in contrast, are lazy, so if used with care you can avoid computing the whole intermediate list. Your code already uses the Seq functions, but as said before it starts from a list. To avoid converting from a list you can use curly brackets: {1L..x} (or the explicit form seq {1L..x}, which newer compilers prefer).
Using sequence expressions is another option:
let factors (x:int64) = seq {
    // "for i = a to b" loops only accept int, so iterate over an int64 range instead
    for i in 1L .. x do
        if x%i = 0L then yield i }
With the OutOfMemoryException solved, your isPrime function is still very slow. As suggested in the comments, you can optimise it by returning false as soon as you find a divisor greater than 1 and no larger than the number's square root. Further optimisations may be achieved by dividing the number by the prime factors as you find them and by using a sieve for the primes. You can also have a look at some efficient algorithms here.
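As a rough sketch of the "divide the number by the factors as you find them" idea, here is a trial-division version. It is a rewrite rather than the code from the question, and unlike the original it reports repeated factors (for example 12 gives 2, 2, 3):

```fsharp
// Trial division: divide each factor out as soon as it is found,
// and stop once the candidate divisor exceeds the square root of what is left.
let primeFactors (x: int64) : int64 list =
    let rec loop (n: int64) (d: int64) (acc: int64 list) =
        if n < 2L then List.rev acc
        elif d * d > n then List.rev (n :: acc)        // the remainder is prime
        elif n % d = 0L then loop (n / d) d (d :: acc) // divide the factor out
        else loop n (d + 1L) acc
    loop x 2L []

printfn "%A" (primeFactors 13195L)         // [5L; 7L; 13L; 29L]
printfn "%A" (primeFactors 600851475143L)  // [71L; 839L; 1471L; 6857L]
```

Because the number shrinks with every factor found, the loop for 600851475143 stops after a few thousand candidate divisors, with no intermediate list or sequence of candidates at all.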
The expression [...] creates a list of the items specified. In F#, a List can be defined something like this:
type List<'t> =
    | Empty
    | Item of 't * List<'t>
As an example, [1..5] would become a structure looking like this:
Item(1, Item(2, Item(3, Item(4, Item(5, Empty)))))
As you can see, this will not be a problem for small numbers of items, but for larger numbers of items this will eventually use up all the available memory and cause an OutOfMemoryException. As Gustavo mentioned, to avoid this, you can use a sequence, which will create each item on demand rather than all at the beginning. This reduces the number of things in memory at one time and thus avoids an OutOfMemoryException.
Since you're already using the Seq module instead of the List module (i.e. Seq.filter vs List.filter etc), you can simply use a sequence instead of a list which would look like this: {1L..x}.

What's the most "functional" way to select a subset from this array?

I'd like to get more comfortable with functional programming, and the first educational task I've set myself is converting a program that computes audio frequencies from C# to F#. The meat of the original application is a big "for" loop that selects a subset of the values in a large array; which values are taken depends on the last accepted value and a ranked list of the values seen since then. There are a few variables that persist between iterations to track progress toward determining the next value.
My first attempt at making this loop more "functional" involved a tail-recursive function whose arguments included the array, the result set so far, the ranked list of values recently seen, and a few other items that need to persist between executions. This seems clunky, and I don't feel like I've gained anything by turning everything that used to be a variable into a parameter on this recursive function.
How would a functional programming master approach this kind of task? Is this an exceptional situation in which a "pure" functional approach doesn't quite fit, and am I wrong for eschewing mutable variables just because I feel they reduce the "purity" of my function? Maybe they don't make it less pure since they only exist inside that function's scope. I don't have a feel for that yet.
Here's an attempted distillation of the code, with some "let" statements and the actual components of state removed ("temp" is the intermediate result array that needs to be processed):
let fif (_,_,_,_,fif) = fif
temp
|> Array.fold (fun (a, b, c, tentativeNextVals, acc) curVal ->
    if (hasProperty curVal c) then
        // do not consider current value
        (a, b, c, Seq.empty, acc)
    else
        if (hasOtherProperty curVal b) then
            // add current value to tentative list
            (a, b, c, tentativeNextVals.Concat [curVal], acc)
        else
            // accept a new value
            let newAcceptedVal = chooseNextVal (tentativeNextVals.Concat [curVal])
            (newC, newB, newC, Seq.empty, acc.Concat [newAcceptedVal])
    ) (0,0,0,Seq.empty,Seq.empty)
|> fif
Something like this using fold?
let filter list =
    List.fold (fun statevar element -> if condition statevar then statevar else element) initialvalue list
Try using Seq.skip and Seq.take:
let subset (min, max) seq =
    seq
    |> Seq.skip (min)
    |> Seq.take (max - min)
This function will accept arrays but return a sequence, so you can convert it back using Array.ofSeq.
PS: If your goal is to keep your program functional, the most important rule is this: avoid mutability as much as you can. This means that you probably shouldn't be using arrays; use lists, which are immutable. If you're using an array for its fast random access, go for it; just be sure to never set indices.

Apply several aggregate functions with one enumeration

Let's assume I have a series of functions that work on a sequence, and I want to use them together in the following fashion:
let meanAndStandardDeviation data =
    let m = mean data
    let sd = standardDeviation data
    (m, sd)
The code above is going to enumerate the sequence twice. I am interested in a function that will give the same result but enumerate the sequence only once. This function will be something like this:
magicFunction (mean, standardDeviation) data
where the input is a tuple of functions and a sequence, and the output is the same as that of the function above.
Is this possible if the functions mean and standardDeviation are black boxes and I cannot change their implementation?
If I wrote mean and standardDeviation myself, is there a way to make them work together? Maybe somehow making them keep yielding the input to the next function and hand over the result when they are done?
The only way to do this using just a single iteration when the functions are black boxes is to use the Seq.cache function (which evaluates the sequence once and stores the results in memory) or to convert the sequence to other in-memory representation.
When a function takes seq<T> as an argument, you don't even have a guarantee that it will evaluate it just once - and usual implementations of standard deviation would first calculate the average and then iterate over the sequence again to calculate the squares of errors.
I'm not sure if you can calculate standard deviation with just a single pass. However, it is possible to do that if the functions are expressed using fold. For example, calculating maximum and minimum using two passes looks like this:
let maxv = Seq.fold max Int32.MinValue input
let minv = Seq.fold min Int32.MaxValue input
You can do that using a single pass like this:
Seq.fold (fun (s1, s2) v ->
    (max s1 v, min s2 v)) (Int32.MinValue, Int32.MaxValue) input
The lambda function is a bit ugly, but you can define a combinator to compose two functions:
let par f g (i, j) v = (f i v, g j v)
Seq.fold (par max min) (Int32.MinValue, Int32.MaxValue) input
This approach works for functions that can be defined using fold, which means that they consist of some initial value (Int32.MinValue in the first example) and then some function that is used to update the initial (previous) state when it gets the next value (and then possibly some post-processing of the result). In general, it should be possible to rewrite single-pass functions in this style, but I'm not sure if this can be done for standard deviation. It can definitely be done for mean:
let (count, sum) = Seq.fold (fun (count, sum) v ->
    (count + 1.0, sum + v)) (0.0, 0.0) input
let mean = sum / count
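On the open question above: standard deviation can also be accumulated in one pass. A sketch using Welford's online algorithm, expressed as a fold in the same style, follows. The names step and meanAndStandardDeviation are illustrative, and the result is the population standard deviation (divide by count - 1.0 instead for the sample version):

```fsharp
// Fold state: (count, running mean, sum of squared deviations from the mean).
let step (n: float, mean: float, m2: float) (v: float) =
    let n' = n + 1.0
    let delta = v - mean
    let mean' = mean + delta / n'
    (n', mean', m2 + delta * (v - mean'))

// One enumeration of the data yields both statistics.
let meanAndStandardDeviation (data: seq<float>) =
    let count, mean, m2 = Seq.fold step (0.0, 0.0, 0.0) data
    (mean, sqrt (m2 / count))   // NaN for an empty sequence

let m, sd = meanAndStandardDeviation [ 2.0; 4.0; 4.0; 4.0; 5.0; 5.0; 7.0; 9.0 ]
printfn "mean = %f, standard deviation = %f" m sd
```

For the sample data the mean is 5 and the population standard deviation is 2, up to floating-point rounding. This only helps when you control the implementations; it does nothing for the black-box case.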
What we're talking about here is a function with the following signature:
(seq<'a> -> 'b) * (seq<'a> -> 'c) -> seq<'a> -> ('b * 'c)
There is no straightforward way that I can think of that will achieve the above using a single iteration of the sequence if that is the signature of the functions. Well, no way that is more efficient than:
let magicFunc (f1:seq<'a>->'b, f2:seq<'a>->'c) (s:seq<'a>) =
    let cached = s |> Seq.cache
    (f1 cached, f2 cached)
That ensures a single iteration of the sequence itself (perhaps there are side effects, or it's slow), but does so by essentially caching the results. The cache is still iterated another time. Is there anything wrong with that? What are you trying to achieve?

How to efficiently find out if a sequence has at least n items?

Naively using Seq.length may not be good enough, as it will blow up on infinite sequences.
Getting fancier with something like ss |> Seq.truncate n |> Seq.length will work, but behind the scenes it would involve traversing the chunk of the argument sequence twice via IEnumerator's MoveNext().
The best approach I was able to come up with so far is:
let hasAtLeast n (ss: seq<_>) =
    let mutable result = true
    use e = ss.GetEnumerator()
    for _ in 1 .. n do result <- e.MoveNext()
    result
This involves only a single traversal of the sequence (more accurately, calling e.MoveNext() n times) and correctly handles the boundary cases of empty and infinite sequences. I could throw in a few small improvements, such as explicit handling of lists, arrays, and ICollections, or cutting the traversal short, but I wonder whether a more effective approach to the problem exists that I may be missing.
Thank you for your help.
EDIT: Having on hand five implementation variants of the hasAtLeast function (two of my own, two suggested by Daniel, and one suggested by Ankur), I arranged a marathon between them. The results, a tie across all implementations, prove that Guvante is right: the simplest composition of existing algorithms is best, and there is no point in overengineering here.
Further throwing in the readability factor I'd use either my own pure F#-based
let hasAtLeast n (ss: seq<_>) =
    Seq.length (Seq.truncate n ss) >= n
or suggested by Ankur the fully equivalent Linq-based one that capitalizes on .NET integration
let hasAtLeast n (ss: seq<_>) =
    ss.Take(n).Count() >= n
Here's a short, functional solution:
let hasAtLeast n items =
    items
    |> Seq.mapi (fun i x -> (i + 1), x)
    |> Seq.exists (fun (i, _) -> i = n)
Example:
let items = Seq.initInfinite id
items |> hasAtLeast 10000
And here's an optimally efficient one:
let hasAtLeast n (items:seq<_>) =
    use e = items.GetEnumerator()
    let rec loop n =
        if n = 0 then true
        elif e.MoveNext() then loop (n - 1)
        else false
    loop n
Functional programming breaks up work loads into small chunks that do very generic tasks that do one simple thing. Determining if there are at least n items in a sequence is not a simple task.
You already found both the solutions to this "problem", composition of existing algorithms, which works for the majority of cases, and creating your own algorithm to solve the issue.
However I have to wonder whether your first solution wouldn't work. MoveNext() is only called n times on the original method for certain, Current is never called, and even if MoveNext() is called on some wrapper class the performance implications are likely tiny unless n is huge.
EDIT:
I was curious, so I wrote a simple program to test the timing of the two methods. The truncate method was quicker both for a simple infinite sequence and for one that had a Sleep(1). It looks like I was right that your correction amounted to overengineering.
I think clarification is needed to explain what is happening in those methods. Seq.truncate takes a sequence and returns a sequence. Other than saving the value of n it doesn't do anything until enumeration. During enumeration it counts and stops after n values. Seq.length takes an enumeration and counts, returning the count when it ends. So the enumeration is only enumerated once, and the amount of overhead is a couple of method calls and two counters.
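One way to check this claim is to count how many elements are actually pulled from the source. In this sketch the pulled counter and the instrumented source are assumptions added for illustration:

```fsharp
// Counts how many elements the consumer pulls from the underlying sequence.
let pulled = ref 0

let source =
    Seq.initInfinite id
    |> Seq.map (fun i ->
        pulled.Value <- pulled.Value + 1
        i)

let hasAtLeast n (ss: seq<_>) =
    Seq.length (Seq.truncate n ss) >= n

let answer = hasAtLeast 5 source
printfn "%b, elements pulled from the source: %d" answer pulled.Value
```

The counter should end at 5, not 10: Seq.length drives the truncating enumerator, which in turn drives the source, so each element is produced once even though two functions are composed.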
Using Linq this would be as simple as:
open System.Linq

let hasAtLeast n (ss: seq<_>) =
    ss.Take(n).Count() >= n
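Note that F#'s own Seq.take throws an exception if there are not enough elements, which is why Take from Linq is used here.
Example usage, showing that it traverses the sequence only once and only as far as the required number of elements: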
seq { for i = 0 to 5 do
        printfn "Generating %d" i
        yield i }
|> hasAtLeast 4 |> printfn "%A"
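The remark about Seq.take in this answer can be seen in a small sketch that contrasts it with Seq.truncate on a sequence that is too short. The sample sequence is made up for illustration:

```fsharp
let short = seq { 1 .. 3 }

// Seq.truncate returns at most n elements and tolerates a short input.
let truncated = short |> Seq.truncate 5 |> Seq.toList

// Seq.take insists on exactly n elements and throws once the input runs out.
let takeFailed =
    try
        short |> Seq.take 5 |> Seq.toList |> ignore
        false
    with :? System.InvalidOperationException -> true

printfn "truncate gave %A, take failed: %b" truncated takeFailed
```

This is why the pure F# variant of hasAtLeast is built on Seq.truncate rather than Seq.take.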
