Pairwise Sequence Processing to compare db tables - f#

Consider the following Use case:
I want to iterate through 2 db tables in parallel and find differences and gaps/missing records in either table. Assume that 1) pk of table is an Int ID field; 2) the tables are read in ID order; 3) records may be missing from either table (with corresponding sequence gaps).
I'd like to do this in a single pass over each db, using lazy reads. (My initial version of this program uses sequence objects and the data reader, but unfortunately it makes multiple passes over each db.)
I've thought of using pairwise sequence processing with Seq.skip inside the iterations to keep the two tables in sync. However, this is apparently very slow, because Seq.skip has a high overhead (it creates new sequences under the hood), so it could be a problem with a large table (say 200k records).
I imagine this is a common design pattern (compare concurrent data streams from different sources) and am interested in feedback/comments/links to similar projects.
Anyone care to comment?

Here's my (completely untested) take, doing a single pass over both tables:
let findDifferences readerA readerB =
    let idsA, idsB =
        let getIds (reader:System.Data.Common.DbDataReader) =
            reader |> LazyList.unfold (fun reader ->
                if reader.Read ()
                then Some (reader.GetInt32 0, reader)
                else None)
        getIds readerA, getIds readerB
    let onlyInA, onlyInB = ResizeArray<_>(), ResizeArray<_>()
    let rec impl a b =
        let inline handleOnlyInA idA as' = onlyInA.Add idA; impl as' b
        let inline handleOnlyInB idB bs' = onlyInB.Add idB; impl a bs'
        match a, b with
        | LazyList.Cons (idA, as'), LazyList.Cons (idB, bs') ->
            if idA < idB then handleOnlyInA idA as'
            elif idA > idB then handleOnlyInB idB bs'
            else impl as' bs'
        | LazyList.Nil, LazyList.Nil -> () // termination condition
        | LazyList.Cons (idA, as'), _ -> handleOnlyInA idA as'
        | _, LazyList.Cons (idB, bs') -> handleOnlyInB idB bs'
    impl idsA idsB
    onlyInA.ToArray (), onlyInB.ToArray ()
This takes two DataReaders (one for each table) and returns two int[]s which indicate the IDs that were only present in their respective table. The code assumes that the ID field is of type int and is at ordinal index 0.
Also note that this code uses LazyList from the F# PowerPack, so you'll need to get that if you don't already have it. If you're targeting .NET 4.0 then I strongly recommend getting the .NET 4.0 binaries which I've built and hosted here, as the binaries from the F# PowerPack site only target .NET 2.0 and sometimes don't play nice with VS2010 SP1 (see this thread for more info: Problem with F# Powerpack. Method not found error).

When you use sequences, any lazy function adds some overhead on the sequence. Calling Seq.skip thousands of times on the same sequence will clearly be slow.
You can use Seq.zip or Seq.map2 to process two sequences at a time:
> Seq.map2 (+) [1..3] [10..12];;
val it : seq<int> = seq [11; 13; 15]
If the Seq module is not enough, you might need to write your own function.
I'm not sure I understand what you're trying to do, but this sample function might help:
let fct (s1: seq<_>) (s2: seq<_>) =
    use e1 = s1.GetEnumerator()
    use e2 = s2.GetEnumerator()
    // cond1, cond2 and cond3 are placeholders for your own conditions
    let rec walk () =
        // do some stuff with the element of both sequences
        printfn "%d %d" e1.Current e2.Current
        if cond1 then // move in both sequences
            if e1.MoveNext() && e2.MoveNext() then walk ()
            else () // end of a sequence
        elif cond2 then // move to the next element of s1
            if e1.MoveNext() then walk()
            else () // end of s1
        elif cond3 then // move to the next element of s2
            if e2.MoveNext() then walk ()
            else () // end of s2
    // we need at least one element in each sequence
    if e1.MoveNext() && e2.MoveNext() then walk()
Edit :
The previous function was meant to extend the functionality of the Seq module, and you'll probably want to make it a higher-order function. As ildjarn said, using LazyList can lead to cleaner code:
let rec merge (l1: LazyList<_>) (l2: LazyList<_>) =
    match l1, l2 with
    | LazyList.Cons(h1, t1), LazyList.Cons(h2, t2) ->
        if h1 <= h2 then LazyList.cons h1 (merge t1 l2)
        else LazyList.cons h2 (merge l1 t2)
    | LazyList.Nil, l2 -> l2
    | _ -> l1

merge (LazyList.ofSeq [1; 4; 5; 7]) (LazyList.ofSeq [1; 2; 3; 6; 8; 9])
But I still think you should separate the iteration over your data from the processing. Writing a higher-order function to iterate is a good idea (in the end, it doesn't matter if the iterator function uses mutable enumerators internally).
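For the concrete case in the question (two ID streams, each already in ascending order), the enumerator walk above can be specialised into a single-pass diff. This is a minimal sketch rather than anyone's posted code: the name diffSorted is made up for illustration, and it assumes both inputs are sorted ascending and contain no duplicate IDs.

```fsharp
/// Walks two ascending ID sequences once each and returns the IDs found
/// only in the first and only in the second.
/// Assumption: both inputs are sorted ascending and contain no duplicates.
let diffSorted (idsA: seq<int>) (idsB: seq<int>) =
    use eA = idsA.GetEnumerator()
    use eB = idsB.GetEnumerator()
    let onlyInA = ResizeArray<int>()
    let onlyInB = ResizeArray<int>()
    let mutable hasA = eA.MoveNext()
    let mutable hasB = eB.MoveNext()
    while hasA || hasB do
        if hasA && (not hasB || eA.Current < eB.Current) then
            // ID exists only in A (or B is exhausted)
            onlyInA.Add eA.Current
            hasA <- eA.MoveNext()
        elif hasB && (not hasA || eB.Current < eA.Current) then
            // ID exists only in B (or A is exhausted)
            onlyInB.Add eB.Current
            hasB <- eB.MoveNext()
        else
            // same ID on both sides: advance both
            hasA <- eA.MoveNext()
            hasB <- eB.MoveNext()
    onlyInA.ToArray(), onlyInB.ToArray()

let onlyInA, onlyInB = diffSorted [1; 2; 4; 7] [2; 3; 4; 8; 9]
// onlyInA = [|1; 7|], onlyInB = [|3; 8; 9|]
```

Each input is enumerated exactly once, so a lazily read source works too. For instance, a data reader could be exposed as seq { while reader.Read() do yield reader.GetInt32 0 }, assuming the ID sits at ordinal 0.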

How to split F# result type list into lists of inner type

I have a list/sequence as follows Result<DataEntry, exn> []. This list is populated by calling multiple API endpoints in parallel based on some user inputs.
I don't care if some of the calls fail as long as at least 1 succeeds. I then need to perform multiple operations on the success list.
My question is how to partition the Result list into exn [] and DataEntry [] lists. I tried the following:
// allData is Result<DataEntry, exn> []
let filterOutErrors (input: Result<DataEntry, exn>) =
    match input with
    | Ok v -> true
    | _ -> false

let values, err = allData |> Array.partition filterOutErrors
This in principle meets the requirement, since values contains all the success cases, but understandably the compiler can't narrow the types, so both values and err still contain Result<DataEntry, exn>.
Is there any way to split a list of result Result<Success, Err> such that you end up with separate lists of the inner type?
Remember that Seq / List / Array are foldable, so you can use fold to convert a Seq / List / Array of 'Ts into any other type 'S. Here you want to go from Result<DataEntry, exn> [] to, e.g., the tuple list<DataEntry> * list<exn>. We can define the following folder function, which takes a state s of type list<'a> * list<'b> and a Result<'a, 'b> and returns the updated tuple list<'a> * list<'b>:
let listFolder s r =
    match r with
    | Ok data -> (data :: (fst s), snd s)
    | Error err -> (fst s, err :: (snd s))
then you can fold over your array as follows:
let (values, err) = Seq.fold listFolder ([], []) allData
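One detail worth noting about the left fold above: because each element is consed onto the front, both result lists come out in reverse input order. If order matters, a right fold avoids an extra List.rev. A minimal sketch, where splitResults is a hypothetical name and Array.foldBack is the standard FSharp.Core function:

```fsharp
/// Splits an array of results into (successes, errors), preserving input order.
/// splitResults is an illustrative name, not part of any library.
let splitResults (results: Result<'a, 'b>[]) : 'a list * 'b list =
    Array.foldBack
        (fun r (oks, errs) ->
            match r with
            | Ok v -> (v :: oks, errs)
            | Error e -> (oks, e :: errs))
        results
        ([], [])

let oks, errs = splitResults [| Ok 1; Error "a"; Ok 2; Error "b" |]
// oks = [1; 2], errs = ["a"; "b"]
```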
You can extract the good and the bad like this.
let values =
    allData
    |> Array.choose (fun r ->
        match r with
        | Result.Ok ok -> Some ok
        | Result.Error _ -> None)

let err =
    allData
    |> Array.choose (fun r ->
        match r with
        | Result.Ok _ -> None
        | Result.Error error -> Some error)
You seem confused about whether you have arrays or lists. The F# code you use, in the snippet and in your question text, all points to use of arrays, in spite of you several times mentioning lists.
It has recently been recommended that we use array instead of the [] symbol in types, since there are inconsistencies in the way F# uses the symbol [] to mean list in some places, and array in other places. There is also the symbol [||] for arrays, which may add more confusion.
So that would be recommending Result<DataEntry,exn> array in this case.
The answer from Víctor G. Adán is functional, but it's a downside that the API requires you to pass in two empty lists, exposing the internal implementation.
You could wrap this into a "starter" function, but then the code grows, requires nested functions or using modules and the intention is obscured.
The answer from Bent Tranberg, while more readable, requires two passes over the data, and it seems inefficient to map into an Option just to be able to filter on it with Array.choose.
I propose KISS'ing it with some good old mutation.
open System.Collections.Generic

let splitByOkAndErrors xs =
    let oks = List<'T>()
    let errors = List<'V>()
    for x in xs do
        match x with
        | Ok v -> oks.Add v
        | Error e -> errors.Add e
    (oks |> seq, errors |> seq)
I know, I know: mutation, yuck, right? I believe you should not shy away from it even in F#; use the right tool for each situation. The mutation is kept local to the function, so from the outside the function is still pure. The API is clean, taking just the list of Result values to split. There are no concepts like folding, recursive calls, or list cons pattern matching to understand, and the function won't reverse the input list. You also have the option to return an array or a seq, so you are not confined to a linked list that can only be extended in O(1) at the head, which in my experience seldom fits a business case well. Win, win, win in my book.
In general, I hope to see F# grow into a more multi-paradigm programming language in the community's mind. It's nice to see these functional solutions, but I fear they scare some people away unnecessarily, as F# is already multi-paradigm!

summing elements from a user defined datatype

Upon covering the predefined datatypes in f# (i.e lists) and how to sum elements of a list or a sequence, I'm trying to learn how I can work with user defined datatypes. Say I create a data type, call it list1:
type list1 =
    | A
    | B of int * list1
Where:
A stands for an empty list
B builds a new list by adding an int in front of another list
so 1,2,3,4, will be represented with the list1 value:
B(1, B(2, B(3, B(4, A))))
From the wikibook I learned that with a list I can sum the elements by doing:
List.sum [1; 2; 3; 4]
But how do I go about summing the elements of a user defined datatype? Any hints would be greatly appreciated.
Edit: I'm able to take advantage of the match operator:
let rec sumit (l: list1) : int =
    match l with
    | (B(x1, A)) -> x1
    | (B(x1, B(x2, A))) -> (x1+x2)

sumit (B(3, B(4, A)))
I get:
val it : int = 7
How can I make it so that if I have more than 2 ints it still sums the elements (i.e. B(3, B(4, B(5, A))) gives 12)?
One good general approach to questions like this is to write out your algorithm in word form or pseudocode form, then once you've figured out your algorithm, convert it to F#. In this case where you want to sum the lists, that would look like this:
The first step in figuring out an algorithm is to carefully define the specifications of the problem. I want an algorithm to sum my custom list type. What exactly does that mean? Or, to be more specific, what exactly does that mean for the two different kinds of values (A and B) that my custom list type can have? Well, let's look at them one at a time. If a list is of type A, then that represents an empty list, so I need to decide what the sum of an empty list should be. The most sensible value for the sum of an empty list is 0, so the rule is "If the list is of type A, then the sum is 0". Now, if the list is of type B, then what does the sum of that list mean? Well, the sum of a list of type B would be its int value, plus the sum of the sublist.
So now we have a "sum" rule for each of the two types that list1 can have. If A, the sum is 0. If B, the sum is (value + sum of sublist). And that rule translates almost verbatim into F# code!
let rec sum (lst : list1) =
    match lst with
    | A -> 0
    | B (value, sublist) -> value + sum sublist
A couple of things I want to note about this code. First, one thing you may or may not have seen before (since you seem to be an F# beginner) is the rec keyword. This is required when you're writing a recursive function: in F#, a function's name is not in scope inside its own body unless the function is declared with rec, so if a function is going to call itself, you have to say so when you declare it. Second, this is not the best way to write a sum function, because it is not tail-recursive, which means that it might throw a StackOverflowException if you try to sum a really, really long list. At this point in your learning F# you maybe shouldn't worry about that just yet, but eventually you will learn a useful technique for turning a non-tail-recursive function into a tail-recursive one. It involves adding an extra parameter usually called an "accumulator" (often spelled acc for short), and a properly tail-recursive version of the above sum function looks like this:
let sum (lst : list1) =
    let rec tailRecursiveSum (acc : int) (lst : list1) =
        match lst with
        | A -> acc
        | B (value, sublist) -> tailRecursiveSum (acc + value) sublist
    tailRecursiveSum 0 lst
If you're already at the point where you can understand this, great! If you're not at that point yet, bookmark this answer and come back to it once you've studied tail recursion, because this technique (turning a non-tail-recursive function into a tail-recursive one with the use of an inner function and an accumulator parameter) is a very valuable one that has all sorts of applications in F# programming.
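To see the accumulator technique at work, the sketch below builds a long list1 value and sums it with an accumulator-based function. The helper buildOnes and the length of one million are arbitrary choices for illustration; the naive recursive sum may overflow the stack on an input this deep, depending on the runtime's stack size.

```fsharp
type list1 =
    | A
    | B of int * list1

// Tail-recursive sum: the running total travels in the accumulator,
// so the recursive call is the last thing the function does.
let sum (lst : list1) =
    let rec loop (acc : int) (lst : list1) =
        match lst with
        | A -> acc
        | B (value, sublist) -> loop (acc + value) sublist
    loop 0 lst

// Hypothetical helper: builds B(1, B(1, ... A)) with n elements using a loop.
let buildOnes (n : int) : list1 =
    let mutable result = A
    for _ in 1 .. n do
        result <- B (1, result)
    result

let total = sum (buildOnes 1000000)
// total = 1000000
```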
Besides tail-recursion, generic programming may be a concept of importance for the functional learner. Why go to the trouble of creating a custom data type, if it only can hold integer values?
The sum of all elements of a list can be abstracted as the repeated application of the addition operator to all elements of the list and an accumulator primed with an initial state. This can be generalized as a functional fold:
type 'a list1 = A | B of 'a * 'a list1

let fold folder (state : 'State) list =
    let rec loop s = function
        | A -> s
        | B(x : 'T, xs) -> loop (folder s x) xs
    loop state list
// val fold :
//   folder:('State -> 'T -> 'State) -> state:'State -> list:'T list1 -> 'State

B(1, B(2, B(3, B(4, A))))
|> fold (+) 0
// val it : int = 10
Making also the sum function generic needs a little black magic called statically resolved type parameters. The signature isn't pretty, it essentially tells you that it expects the (+) operator on a type to successfully compile.
let inline sum xs = fold (+) Unchecked.defaultof<_> xs
// val inline sum :
//   xs: ^a list1 -> ^b
//     when ( ^b or ^a) : (static member ( + ) : ^b * ^a -> ^b)

B(1, B(2, B(3, B(4, A))))
|> sum
// val it : int = 10

How to write efficient list/seq functions in F#? (mapFoldWhile)

I was trying to write a generic mapFoldWhile function, which is just mapFold but requires the state to be an option and stops as soon as it encounters a None state.
I don't want to use mapFold because it will transform the entire list, but I want it to stop as soon as an invalid state (i.e. None) is found.
This was my first attempt:
let mapFoldWhile (f : 'State option -> 'T -> 'Result * 'State option) (state : 'State option) (list : 'T list) =
    let rec mapRec f state list results =
        match list with
        | [] -> (List.rev results, state)
        | item :: tail ->
            let (result, newState) = f state item
            match newState with
            | Some x -> mapRec f newState tail (result :: results)
            | None -> ([], None)
    mapRec f state list []
The List.rev irked me, since the point of the exercise was to exit early and constructing a new list ought to be even slower.
So I looked up what F#'s very own map does, which was:
let map f list = Microsoft.FSharp.Primitives.Basics.List.map f list
The ominous Microsoft.FSharp.Primitives.Basics.List.map can be found here and looks like this:
let map f x =
    match x with
    | [] -> []
    | [h] -> [f h]
    | (h::t) ->
        let cons = freshConsNoTail (f h)
        mapToFreshConsTail cons f t
        cons
The consNoTail stuff is also in this file:
// optimized mutation-based implementation. This code is only valid in fslib, where mutation of private
// tail cons cells is permitted in carefully written library code.
let inline setFreshConsTail cons t = cons.(::).1 <- t
let inline freshConsNoTail h = h :: (# "ldnull" : 'T list #)
So I guess it turns out that F#'s immutable lists are actually mutable because performance? I'm a bit worried about this, having used the prepend-then-reverse list approach as I thought it was the "way to go" in F#.
I'm not very experienced with F# or functional programming in general, so maybe (probably) the whole idea of creating a new mapFoldWhile function is the wrong thing to do, but then what am I to do instead?
I often find myself in situations where I need to "exit early" because a collection item is "invalid" and I know that I don't have to look at the rest. I'm using List.pick or Seq.takeWhile in some cases, but in other instances I need to do more (mapFold).
Is there an efficient solution to this kind of problem (mapFoldWhile in particular and "exit early" in general) with functional programming concepts, or do I have to switch to an imperative solution / use a System.Collections.Generic.List?
In most cases, using List.rev is a perfectly sufficient solution.
You are right that the F# core library uses mutation and other dirty hacks to squeeze some more performance out of the F# list operations, but I think the micro-optimizations done there are not a particularly good example to follow. F# list functions are used almost everywhere, so there it might be a good trade-off, but I would not copy it in most situations.
Running your function with the following:
let l = [ 1 .. 1000000 ]
#time
mapFoldWhile (fun s v -> 0, s) (Some 1) l
I get ~240ms on the second line when I run the function without changes. When I just drop List.rev (so that it returns the data in the other order), I get around ~190ms. If you are really calling the function frequently enough that this matters, then you'd have to use mutation (actually, your own mutable list type), but I think that is rarely worth it.
For general "exit early" problems, you can often write the code as a composition of Seq.scan and Seq.takeWhile. For example, say you want to sum numbers from a sequence until you reach 1000. You can write:
input
|> Seq.scan (fun sum v -> v + sum) 0
|> Seq.takeWhile (fun sum -> sum < 1000)
Using Seq.scan generates a sequence of sums that is over the whole input, but since this is lazily generated, using Seq.takeWhile stops the computation as soon as the exit condition happens.
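As a self-contained check of the laziness claim, the sketch below feeds an infinite sequence through the same pipeline; it only terminates because Seq.takeWhile stops pulling elements once the running sum reaches 1000. The input 1, 2, 3, ... is an arbitrary choice for illustration.

```fsharp
// Infinite input: 1, 2, 3, ...
let input = Seq.initInfinite (fun i -> i + 1)

// Running sums, cut off as soon as one reaches 1000.
let partialSums =
    input
    |> Seq.scan (fun sum v -> v + sum) 0
    |> Seq.takeWhile (fun sum -> sum < 1000)
    |> Seq.toList

let lastSum = List.last partialSums
// lastSum = 990 (1 + 2 + ... + 44); the next partial sum, 1035, is cut off
```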

Return item at position x in a list

I was reading the post "While or Tail Recursion in F#, what to use when?", where several people say that the 'functional way' of doing things is to use maps/folds and higher-order functions instead of recursing and looping.
I have this function that returns the item at position x in a list:
let rec getPos l c = if c = 0 then List.head l else getPos (List.tail l) (c - 1)
how can it be converted to be more functional?
This is a primitive list function (also known as List.nth).
It is okay to use recursion, especially when creating the basic building blocks. Although it would be nicer with pattern matching instead of if-else, like this:
let rec getPos l c =
    match l with
    | h::_ when c = 0 -> h
    | _::t -> getPos t (c-1)
    | [] -> failwith "list too short"
It is possible to express this function with List.fold, however the result is less clear than the recursive version.
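For comparison, here is roughly what a List.fold version could look like. This is a sketch with a made-up name (getPosFold); it returns an option rather than throwing, and unlike the recursive version it always walks the entire list, which illustrates why the fold is the less attractive choice here.

```fsharp
/// Returns the element at index c, or None if the list is too short.
/// The fold state is (current index, element found so far).
let getPosFold (l : 'a list) (c : int) : 'a option =
    l
    |> List.fold
        (fun (i, found) x ->
            if i = c then (i + 1, Some x) else (i + 1, found))
        (0, None)
    |> snd

let second = getPosFold [10; 20; 30] 1
// second = Some 20
let missing = getPosFold [10; 20; 30] 5
// missing = None
```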
I'm not sure what you mean by more functional.
Are you rolling this yourself as a learning exercise?
If not, you could just try this:
> let mylist = [1;2;3;4];;
> let n = 2;;
> mylist.[n];;
Your definition is already pretty functional since it uses a tail-recursive function instead of an imperative loop construct. However, it also looks like something a Scheme programmer might have written because you're using head and tail.
I suspect you're really asking how to write it in a more idiomatic ML style. The answer is to use pattern matching:
let rec getPos list n =
    match list with
    | hd::tl ->
        if n = 0 then hd
        else getPos tl (n - 1)
    | [] -> failwith "Index out of range."
The recursion on the structure of the list is now revealed in the code. You also get a warning if the pattern matching is non-exhaustive so you're forced to deal with the index too big error.
You're right that functional programming also encourages the use of combinators like map or fold (so-called point-free style). But too much of it just leads to unreadable code. I don't think it's warranted in this case.
Of course, Benjol is right, in practice you would just write mylist.[n].
If you'd like to use higher-order functions for this, you could do either of the following:
let nth n = Seq.take (n+1) >> Seq.fold (fun _ x -> Some x) None
let nth n = Seq.take (n+1) >> Seq.reduce (fun _ x -> x)
But the idea is really to have basic constructions and combine them to build whatever you want. Getting the nth element of a sequence is clearly a basic block that you should use. If you want the nth item, as Benjol mentioned, do myList.[n].
For building basic constructions, there's nothing wrong with using recursion or mutable loops (and often, you have to do it this way).
Not as a practical solution, but as an exercise, here is one of the ways to express nth via foldr or, in F# terms, List.foldBack:
let myNth n xs =
    let step e f = function |0 -> e |n -> f (n-1)
    let error _ = failwith "List is too short"
    List.foldBack step xs error n

cons operator (::) in F#

The :: operator in F# always prepends elements to the list. Is there an operator that appends to the list? I'm guessing that using the @ operator
[1; 2; 3] @ [4]
would be less efficient than appending one element.
As others said, there is no such operator, because it wouldn't make much sense. I actually think that this is a good thing, because it makes it easier to realize that the operation will not be efficient. In practice, you shouldn't need the operator - there is usually a better way to write the same thing.
Typical scenario: the situation in which you might think you need to append elements to the end is common enough that it is worth describing.
Adding elements to the end seems necessary when you're writing a tail-recursive version of a function using the accumulator parameter. For example, an (inefficient) implementation of the filter function for lists would look like this:
let filter f l =
    let rec filterUtil acc l =
        match l with
        | [] -> acc
        | x::xs when f x -> filterUtil (acc @ [x]) xs
        | x::xs -> filterUtil acc xs
    filterUtil [] l
In each step, we need to append one element to the accumulator (which stores elements to be returned as the result). This code can be easily modified to use the :: operator instead of appending elements to the end of the acc list:
let filter f l =
    let rec filterUtil acc l =
        match l with
        | [] -> List.rev acc // (1)
        | x::xs when f x -> filterUtil (x::acc) xs // (2)
        | x::xs -> filterUtil acc xs
    filterUtil [] l
In (2), we're now adding elements to the front of the accumulator and when the function is about to return the result, we reverse the list (1), which is a lot more efficient than appending elements one by one.
Lists in F# are singly-linked and immutable. This means consing onto the front is O(1) (create an element and have it point to an existing list), whereas snocing onto the back is O(N) (as the entire list must be replicated; you can't change the existing final pointer, you must create a whole new list).
If you do need to "append one element to the back", then e.g.
l @ [42]
is the way to do it, but this is a code smell.
The cost of appending two standard lists is proportional to the length of the list on the left. In particular, the cost of
xs @ [x]
is proportional to the length of xs—it is not a constant cost.
If you want a list-like abstraction with a constant-time append, you can use John Hughes's function representation, which I'll call hlist. I'll try to use OCaml syntax, which I hope is close enough to F#:
type 'a hlist = 'a list -> 'a list (* a John Hughes list *)
let empty : 'a hlist = let id xs = xs in id
let append xs ys = fun tail -> xs (ys tail)
let singleton x = fun tail -> x :: tail
let cons x xs = append (singleton x) xs
let snoc xs x = append xs (singleton x)
let to_list : 'a hlist -> 'a list = fun xs -> xs []
The idea is that you represent a list functionally as a function from "the rest of the elements" to "the final list". This works great if you are going to build up the whole list before you look at any of the elements. Otherwise you'll have to deal with the linear cost of append or use another data structure entirely.
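In F# the same idea could be written roughly as follows. This is a direct translation sketch of the OCaml above; the names HList and toList are adapted to F# naming conventions and do not come from any library.

```fsharp
// A Hughes list: a function from "the rest of the elements" to the final list.
type HList<'a> = 'a list -> 'a list

let empty : HList<'a> = fun tail -> tail
let append (xs : HList<'a>) (ys : HList<'a>) : HList<'a> = fun tail -> xs (ys tail)
let singleton (x : 'a) : HList<'a> = fun tail -> x :: tail
let cons x xs = append (singleton x) xs
let snoc xs x = append xs (singleton x)   // constant-time "append at the back"
let toList (xs : HList<'a>) : 'a list = xs []

// Build by appending at the back, then prepend one element and materialise.
let built = [1; 2; 3] |> List.fold snoc empty |> cons 0 |> toList
// built = [0; 1; 2; 3]
```

Each snoc only wraps one more closure around the existing function, so the linear work is deferred until toList is called once at the end.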
I'm guessing that using @ operator [...] would be less efficient, than appending one element.
If it is, it will be a negligible difference. Both appending a single item and concatenating a list to the end are O(n) operations. As a matter of fact, I can't think of a single thing that @ has to do which a single-item append function wouldn't.
Maybe you want to use another data structure. We have double-ended queues (or short "Deques") in fsharpx. You can read more about them at http://jackfoxy.com/double-ended-queues-for-fsharp
The efficiency (or lack of) comes from iterating through the list to find the final element. So declaring a new list with [4] is going to be negligible for all but the most trivial scenarios.
Try using a double-ended queue instead of a list. I recently added 4 versions of deques (Okasaki's spelling) to FSharpx.Core (available through NuGet; source code at FSharpx.Core.Datastructures). See my article about using deques, Double-ended queues for F#.
I've suggested to the F# team that the cons operator, ::, and the active pattern discriminator be made available for other data structures with a head/tail signature.
