While writing up examples of memoization and continuation-passing style (CPS) functions in a functional language, I ended up using Fibonacci as the example for both. However, Fibonacci doesn't really benefit from CPS: the recursion still runs exponentially many times, while with memoization it's only O(n) the first time and O(1) every following time.
Combining CPS with memoization gives a slight benefit for Fibonacci, but are there examples where CPS is the best approach, preventing you from running out of stack and improving performance, and where memoization is not a solution?
Or: is there any guideline for when to choose one over the other or both?
On CPS
While CPS is useful as an intermediate language in a compiler, at the source level it is mostly a device to (1) encode sophisticated control flow (not really performance-related) and (2) transform a non-tail call that consumes stack space into a continuation-allocating tail call that consumes heap space. For example, when you write (code untested)
let rec fib = function
  | 0 | 1 -> 1
  | n -> fib (n-1) + fib (n-2)

let rec fib_cps n k = match n with
  | 0 | 1 -> k 1
  | n -> fib_cps (n-1) (fun a -> fib_cps (n-2) (fun b -> k (a+b)))
The previous non-tail call fib (n-2), which allocated a new stack frame, is turned into the tail call fib_cps (n-2) (fun b -> k (a+b)), which instead allocates the closure fun b -> k (a+b) on the heap and passes it as an argument.
This does not asymptotically reduce the memory usage of your program (some further domain-specific optimizations might, but that's another story). You're just trading stack space for heap space, which is interesting on systems where stack space is severely limited by the OS (not the case with some implementations of ML such as SML/NJ, which track their call stack on the heap instead of using the system stack), and potentially performance-degrading because of the additional GC costs and potential locality decrease.
CPS transformation is unlikely to improve performance much (though details of your implementation and runtime system might make it so); it is a generally applicable transformation that lets you avoid the dreaded stack overflow of recursive functions with a deep call stack.
On Memoization
Memoization is useful to introduce sharing of subcalls of recursive functions. A recursive function typically solves a "problem" ("compute fibonacci of n", etc.) by decomposing it into several strictly simpler "sub-problems" (the recursive subcalls), with some base cases for which the problem is solvable right away.
For any recursive function (or recursive formulation of a problem), you can observe the structure of the subproblem space. Which simpler instances of Fib(k) will Fib(n) need to return its result? Which simpler instances will those instances in turn need?
In the general case, this space of subproblems is a graph (generally acyclic, for termination purposes): some nodes have several parents, that is, they are subproblems of several distinct problems. For example, Fib(n-2) is a subproblem of both Fib(n) and Fib(n-1). The amount of node sharing in this graph depends on the particular problem / recursive function. In the case of Fibonacci, almost all nodes are shared between two parents, so there is a lot of sharing.
A direct recursive call without memoization will not be able to observe this sharing. The structure of the calls of a recursive function is a tree, not a graph, so shared subproblems such as Fib(n-2) will be fully visited several times (as many times as there are paths from the root to the subproblem node in the graph). Memoization introduces sharing by letting some subcalls return directly with "we've already computed this node and here is the result". For problems that have a lot of sharing, this can result in a dramatic reduction of (useless) computation: Fibonacci moves from exponential to linear complexity when memoization is introduced. Note that there are other ways to write the function, with a different subcall structure rather than memoization, that also have linear complexity:
let rec fib_pair = function
  | 0 -> (1,1)
  | n -> let (u,v) = fib_pair (n-1) in (v,u+v)
(* fib n is then one projection of the pair: let fib n = fst (fib_pair n) *)
The technique of using some form of sharing (usually through large tables storing the results) to avoid useless duplication of subcomputations is well known in the algorithmics community; it is called dynamic programming. When you recognize that a problem is amenable to this treatment (you notice the sharing among the subproblems), it can provide large performance benefits.
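For completeness, here is a minimal sketch of what memoizing Fibonacci could look like in F#, assuming a plain Dictionary as the memo table (the name fibMemo is illustrative, not from the original):

open System.Collections.Generic

// Hedged sketch: cache each computed subproblem so that every Fib(k)
// node in the subproblem graph is computed at most once.
let fibMemo =
    let cache = Dictionary<int, bigint>()
    let rec fib n =
        match n with
        | 0 | 1 -> 1I
        | n ->
            match cache.TryGetValue(n) with
            | true, v -> v                      // shared subproblem: reuse it
            | false, _ ->
                let v = fib (n - 1) + fib (n - 2)
                cache.[n] <- v                  // record it for later callers
                v
    fib

The first call to fibMemo n is O(n); any subsequent call for a k <= n is a single table lookup.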
Does a comparison make sense?
The two techniques seem mostly independent of each other.
There are a lot of problems where memoization is not applicable, because the subproblem graph has no sharing (it is a tree). Conversely, CPS transformation is applicable to all recursive functions, but does not by itself lead to performance benefits (other than potential constant factors due to the particular implementation and runtime system you're using, and they're likely to make the code slower rather than faster).
Improving performance by inspecting non-tail contexts
There is an optimization technique related to CPS that can improve the performance of recursive functions. It consists in looking at the computations "left to be done" after the recursive calls (what would be turned into a function in straightforward CPS style) and finding an alternate, more efficient representation for them that does not result in systematic closure allocation. Consider for example:
let rec length = function
  | [] -> 0
  | _::t -> 1 + length t

let rec length_cps li k = match li with
  | [] -> k 0
  | _::t -> length_cps t (fun a -> k (a + 1))
You can notice that the context of the non-tail call, namely [_ + 1], has a simple structure: it adds an integer. Instead of representing this as a function fun a -> k (a+1), you may just store the integer to be added, corresponding to several applications of this function, making k an integer instead of a function.
let rec length_acc li k = match li with
  | [] -> k + 0
  | _::t -> length_acc t (k + 1)
This function runs in constant, rather than linear, space. By turning the representation of the non-tail contexts from functions into integers, we have eliminated the allocation step that made memory usage linear.
Close inspection of the order in which the additions are performed reveals that they now happen in a different direction: we are adding the 1's corresponding to the beginning of the list first, while the CPS version added them last. This order reversal is valid because _ + 1 is an associative operation: if you have several nested contexts, foo + 1 + 1 + 1, it is valid to start computing them either from the inside, ((foo+1)+1)+1, or from the outside, foo+(1+(1+1)). The optimization above can be used for all such "associative" contexts around a non-tail call.
There are certainly other optimizations available based on the same idea (I'm no expert on such optimizations), which is to look at the structure of the continuations involved and represent them in a more efficient form than functions allocated on the heap.
This is related to the "defunctionalization" transformation, which changes the representation of continuations from functions to data structures, without changing the memory consumption (a defunctionalized program allocates a data node where the original program would have allocated a closure). It makes it possible to express the tail-recursive CPS version in a first-order language (without first-class functions), and it can be more efficient on systems where data structures and pattern matching are more efficient than closure allocation and indirect calls.
type length_cont =
  | Linit
  | Lcons of length_cont

let rec length_cps_defun li k = match li with
  | [] -> length_cont_eval 0 k
  | _::t -> length_cps_defun t (Lcons k)
and length_cont_eval acc = function
  | Linit -> acc
  | Lcons k -> length_cont_eval (acc+1) k

let length li = length_cps_defun li Linit
type fib_cont =
  | Finit
  | Fminus1 of int * fib_cont
  | Fminus2 of fib_cont * int

let rec fib_cps_defun n k = match n with
  | 0 | 1 -> fib_cont_eval 1 k
  | n -> fib_cps_defun (n-1) (Fminus1 (n, k))
and fib_cont_eval acc = function
  | Finit -> acc
  | Fminus1 (n, k) -> fib_cps_defun (n-2) (Fminus2 (k, acc))
  | Fminus2 (k, acc') -> fib_cont_eval (acc+acc') k

let fib n = fib_cps_defun n Finit
One benefit of CPS is error handling. If you need to fail, you just call your failure continuation.
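For instance, a minimal sketch with explicit success and failure continuations (safeDiv and the continuation names are illustrative, not from the original answer):

// Failing is just calling the other continuation; no exceptions involved.
let safeDiv x y onOk onErr =
    if y = 0 then onErr "division by zero"
    else onOk (x / y)

safeDiv 10 2 (printfn "result: %d") (printfn "error: %s")   // result: 5
safeDiv 10 0 (printfn "result: %d") (printfn "error: %s")   // error: division by zero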
I think the biggest case is when you are not talking about pure calculations, where memoization is great. If you are instead talking about IO or other side-effecting operations, the benefits of CPS are still there, but memoization doesn't work.
As to an instance where CPS and memoization are both applicable and CPS is better, I am not sure since I consider them different pieces of functionality.
Finally, CPS is somewhat diminished in F#, since tail recursion already makes the whole stack-overflow issue a non-issue.
Related
Consider the binary tree algebraic datatype
type btree = Empty | Node of btree * int * btree
and a new datatype defined as follows:
type finding = NotFound | Found of int
Here's my code so far:
let s = Node (Node(Empty, 5, Node(Empty, 2, Empty)), 3, Node (Empty, 6, Empty))
(*
(3)
/ \
(5) (6)
/ \ | \
() (2) () ()
/ \
() ()
*)
(* size: btree -> int *)
let rec size t =
    match t with
    Empty -> false
    | Node (t1, m, t2) -> if (m != Empty) then sum+1 || (size t1) || (size t2)
let num = occurs s
printfn "There are %i nodes in the tree" num
This probably isn't close, I took a function that would find if an integer existed in a tree and tried changing the code for what I was trying to do.
I am very new to using F# and would appreciate any help. I am trying to count all non empty nodes in the tree. For example the tree I'm using should print the value 4.
I did not run the compiler on your code, but I believe it does not even compile.
However, your idea to use a pattern match in a recursive function is good.
As rmunn commented, you want to determine the number of nodes in each case:
An empty tree has no nodes, hence the result is zero.
A non-empty tree has the root node plus the counts of its left and right subtrees.
So something along the lines of the following should work:
let rec size t =
    match t with
    | Empty -> 0
    | Node (t1, _, t2) -> 1 + (size t1) + (size t2)
The most important detail here is that you do not need a global variable sum to store any intermediate values. The whole idea of a recursive function is that those intermediate values are the results of the recursive calls.
As a remark, your tree in the comment should look like this, I believe.
(*
(3)
/ \
(5) (6)
/ \ | \
() (2) () ()
/ \
() ()
*)
Edit: I misread the misaligned () as leaves of an empty tree, where in fact they are leaves of the subtree (2). So it was just an ASCII art issue :-)
Friedrich already posted a simple version of the size function that will work for most trees. However, the solution is not "tail-recursive", so it can cause a stack overflow for large trees. In functional programming languages like F#, recursion is often the preferred technique for things like counting and other aggregate functions. However, recursive functions generally consume a stack frame for each recursive call. This means that for large structures, the call stack can be exhausted before the function completes. In order to avoid this problem, compilers can optimize functions that are considered "tail-recursive" so that they use only one stack frame regardless of how many times they recurse. Unfortunately, this optimization cannot be applied to just any recursive algorithm. It requires that the recursive call be the last thing that the function does, thereby ensuring that the compiler does not have to worry about jumping back into the function after the call, allowing it to overwrite the stack frame instead of adding another one.
In order to change the size function to be tail-recursive, we need some way to avoid having to call it twice in the case of a non-empty node, so that the call can be the last step of the function, instead of the addition between the two calls in Friedrich's solution. This can be accomplished using a couple of different techniques, generally either an accumulator or Continuation Passing Style. The simpler solution is often to use an accumulator to keep track of the total size instead of having it be the return value, while Continuation Passing Style is a more general solution that can handle more complex recursive algorithms.
In order to make an accumulator pattern work for a tree where we have to sum both the left and right sub-trees, we need some way to make one tail-call at the end of the function, while still making sure that both sub-trees are evaluated. A simple way to do that is to also accumulate the right sub-trees in addition to the total count, so we can make subsequent tail-calls to evaluate those trees while evaluating the left sub-trees first. That solution might look something like this:
let size t =
    let rec size acc ts = function
        | Empty ->
            match ts with
            | [] -> acc
            | head :: tail -> head |> size acc tail
        | Node (t1, _, t2) ->
            t1 |> size (acc + 1) (t2 :: ts)
    t |> size 0 []
This adds the acc parameter and the ts parameter to represent the total count and the remaining unevaluated sub-trees. When we hit a populated node, we evaluate the left sub-tree while adding the right sub-tree to our list of trees to evaluate later. When we hit an empty node, we start evaluating any ts we've accumulated, until we have no further populated nodes or unevaluated sub-trees. This isn't the best possible solution for computing the tree size, and most real solutions would use Continuation Passing Style to make it tail-recursive, but it should make a good exercise as you get more familiar with the language.
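For comparison, here is a minimal sketch of the Continuation Passing Style version mentioned above (sizeCps is an illustrative name, not from the original answer):

// Each call takes a continuation representing the work left to do; both
// recursive calls become tail calls, and the closures live on the heap.
let sizeCps t =
    let rec go t k =
        match t with
        | Empty -> k 0
        | Node (l, _, r) ->
            go l (fun nl -> go r (fun nr -> k (nl + nr + 1)))
    go t id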
I have the following code to perform the Sieve of Eratosthenes in F#:
let sieveOutPrime p numbers =
    numbers
    |> Seq.filter (fun n -> n % p <> 0)

let primesLessThan n =
    let removeFirstPrime = function
        | s when Seq.isEmpty s -> None
        | s -> Some(Seq.head s, sieveOutPrime (Seq.head s) (Seq.tail s))
    let remainingPrimes =
        seq {3..2..n}
        |> Seq.unfold removeFirstPrime
    seq { yield 2; yield! remainingPrimes }
This is excruciatingly slow when the input to primesLessThan is remotely large: primesLessThan 1000000 |> Seq.skip 1000;; takes nearly a minute for me, though primesLessThan 1000000 itself is naturally very fast because it's just a lazy sequence.
I did some playing around, and I think that the culprit has to be that Seq.tail (in my removeFirstPrime) is doing something intensive. According to the docs, it's generating a completely new sequence, which I could imagine being slow.
If this were Python and the sequence object were a generator, it would be trivial to ensure that nothing expensive happens at this point: just yield from the sequence, and we've cheaply dropped its first element.
LazyList in F# doesn't seem to have the unfold method (or, for that matter, the filter method); otherwise I think LazyList would be the thing I wanted.
How can I make this implementation fast by preventing unnecessary duplications/recomputations? Ideally primesLessThan n |> Seq.skip 1000 would take the same amount of time regardless of how large n was.
Recursive solutions and sequences don't go well together (compare the answers here; it's very much the same pattern you are using). You might want to inspect the generated code, but I'd just consider this a rule of thumb.
LazyList (as defined in FSharpx) does of course come with unfold and filter defined; it would have been quite bizarre if it didn't. Typically in F# code this sort of functionality is provided in separate modules rather than as instance members on the type itself, a convention that does seem to confuse most documentation systems.
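As a hedged sketch of that module-based style, assuming the FSharpx.Collections package and the conventional signatures of its LazyList module functions:

open FSharpx.Collections

// unfold and filter are module functions, not members on the type itself
let odds = LazyList.unfold (fun s -> if s > 99 then None else Some (s, s + 2)) 3
let filtered = odds |> LazyList.filter (fun n -> n % 3 <> 0)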
As you probably know, Seq is a lazily evaluated collection. The sieve algorithm is about filtering out non-primes from a sequence so that you don't have to consider them again.
However, when you combine the sieve with a lazily evaluated collection, you end up doing the filtering of the same non-primes over and over again.
You see much better performance if you switch from Seq to Array or List, because the non-lazy nature of those collections means that you only filter non-primes once.
One way to improve performance in your code is to introduce caching.
let removeFirstPrime s =
    let s = s |> Seq.cache
    match s with
    | s when Seq.isEmpty s -> None
    | s -> Some(Seq.head s, sieveOutPrime (Seq.head s) (Seq.tail s))
I implemented a LazyList that works a lot like Seq and that allows me to count the number of evaluations:
For all primes up to 2000.
Without caching: 14753706 evaluations
With caching: 97260 evaluations
Of course, if you really need performance, you use a mutable array implementation.
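A minimal sketch of such a mutable implementation, assuming the classic boolean "is composite" array is what's meant (primesLessThanArray is an illustrative name):

let primesLessThanArray n =
    // isComposite.[i] becomes true once i has been struck out as a multiple
    let isComposite : bool[] = Array.zeroCreate (n + 1)
    for p in 2 .. int (sqrt (float n)) do
        if not isComposite.[p] then
            let mutable m = p * p          // smaller multiples were already struck
            while m <= n do
                isComposite.[m] <- true
                m <- m + p
    [ for i in 2 .. n do if not isComposite.[i] then yield i ]

Each composite is struck out a bounded number of times, instead of being re-filtered on every pull from a lazy sequence.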
PS. Performance metrics
Running 'seq' ...
it took 271 ms with cc (16, 4, 0), the result is: 1013507
Running 'list' ...
it took 14 ms with cc (16, 0, 0), the result is: 1013507
Running 'array' ...
it took 14 ms with cc (10, 0, 0), the result is: 1013507
Running 'mutable' ...
it took 0 ms with cc (0, 0, 0), the result is: 1013507
The 'seq' measurement is Seq with caching. Seq in F# has rather high overhead; there are interesting lazy alternatives to Seq, like Nessos.
List and Array run roughly similarly, but because of the more compact memory layout the GC metrics are better for Array (10 cc0 collections for Array vs 16 cc0 collections for List). Seq has worse GC metrics in that it also forced 4 cc1 collections.
The mutable implementation of sieve algorithm has better memory and performance metrics by a large margin.
I am trying to implement a tail-recursive function that calculates the powers of 2:
let rec po2 x =
    match x with
    | 1 -> 1I
    | _ -> po2 (x-1) * 2I
This works as intended, but since I multiply the result of my recursive call by 2, this code isn't tail-recursive.
Any ideas on how to make this tail recursive?
Any linearly recursive function can be turned into a tail-recursive one by using an "accumulator" - a thread-through parameter that accumulates the "computed so far" value. In general, you'd be trading stack memory for heap memory (need to store that "so far" value somewhere), but in some cases you can save by not storing the whole "so far" value, but only a part of it. In your case, all you really need to store is the result of the last multiplication, not the whole history of multiplications:
let rec po2Cached acc x =
    match x with
    | 0 -> acc
    | x -> po2Cached (acc * 2I) (x - 1)   // 2I keeps the bigint type used by po2
(I've omitted the memoization part for simplicity)
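A hedged usage note: seeding the accumulator with 1I, po2Cached 1I x computes 2^x (the base case here is 0, while the original po2 bottomed out at 1, so the two differ by one doubling):

po2Cached 1I 10 |> printfn "%A"   // 1024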
P.S. Note that this function will produce a wrong result for negative powers.
When I compare the IL code that F# generates for seq{} expressions with that for user-defined computation workflows, it's quite obvious that seq{} is implemented very differently: it generates a state machine similar to the one C# uses for its iterator methods. User-defined workflows, on the other hand, use the corresponding builder object, as you'd expect.
So I am wondering - why the difference?
Is this for historical reasons, e.g. "seq was there before workflows"?
Or, is there significant performance to be gained?
Some other reason?
This is an optimization performed by the F# compiler. As far as I know, it was actually implemented later - the F# compiler first had list comprehensions, then a general-purpose version of computation expressions (also used for seq { ... }), but that was less efficient, so the optimization was added in some later version.
The main reason is that this removes many allocations and indirections. Let's say you have something like:
seq { for i in input do
          yield i
          yield i * 10 }
When using computation expressions, this gets translated to something like:
seq.Delay(fun () -> seq.For(input, fun i ->
    seq.Combine(seq.Yield(i), seq.Delay(fun () -> seq.Yield(i * 10)))))
There are a couple of function allocations, and the For loop always needs to invoke the lambda function. The optimization turns this into a state machine (similar to the C# state machine), so the MoveNext() operation on the generated enumerator just mutates some state of the class and then returns.
You can easily compare the performance by defining a custom computation builder for sequences:
type MSeqBuilder() =
    member x.For(en, f) = Seq.collect f en
    member x.Yield(v) = Seq.singleton v
    member x.Delay(f) = Seq.delay f
    member x.Combine(a, b) = Seq.concat [a; b]

let mseq = MSeqBuilder()
let input = [| 1 .. 100 |]
Now we can test this (using #time in F# interactive):
for i in 0 .. 10000 do
    mseq { for x in input do
               yield x
               yield x * 10 }
    |> Seq.length |> ignore
On my computer, this takes 2.644sec when using the custom mseq builder but only 0.065sec when using the built-in optimized seq expression. So the optimization makes sequence expressions significantly more efficient.
Historically, computations expressions ("workflows") were a generalization of sequence expressions: http://blogs.msdn.com/b/dsyme/archive/2007/09/22/some-details-on-f-computation-expressions-aka-monadic-or-workflow-syntax.aspx.
But, the answer is certainly that there is significant performance to be gained. I can't turn up any solid links (though there is a mention of "optimizations related to 'when' filters in sequence expressions" in http://blogs.msdn.com/b/dsyme/archive/2007/11/30/full-release-notes-for-f-1-9-3-7.aspx), but I do recall that this was an optimization that made its way in at some point in time. I'd like to say that the benefit is self-evident: sequence expressions are a "core" language feature and deserving of any optimizations that can be made.
Similarly, you'll see that certain tail-recursive functions will be optimized into loops, rather than tail calls.
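As a hedged illustration of that last point (a made-up example, not from the linked notes), a directly self-tail-recursive function such as the following can be emitted as a branch back to the top of the method body rather than as an actual call:

let rec countdown n =
    if n <= 0 then ()              // done: fall through
    else countdown (n - 1)         // self tail call, compiled to a loop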
It is easy to implement the algorithm (finding the largest element of a list) using a single process; however, how can I use multiple processes to do the job?
Here is what I have done so far.
find_largest([H], _) -> H;
find_largest([H, Q | T], R) ->
    if H > Q -> find_largest([H | T], [Q | R]);
       true -> find_largest([Q | T], [H | R])
    end.
Thanks
Given how Erlang represents lists, this is probably not a good idea to try to do in parallel. Partitioning the list implies a lot of copying (since they are linked lists), and so does sending these partitions to other processes. I expect the comparisons to be far cheaper than copying everything twice and then combining the results.
The implementation is also not correct; you can find a good one in lists.erl as max/1:
%% max(L) -> returns the maximum element of the list L
-spec max([T,...]) -> T.
max([H|T]) -> max(T, H).
max([H|T], Max) when H > Max -> max(T, H);
max([_|T], Max) -> max(T, Max);
max([], Max) -> Max.
If by some chance your data are already in separate processes, simply take the lists:max/1 of each of the lists and send them to a single place, and then take the lists:max/1 of the result list. You could also do the comparison as you receive the results, to avoid building this intermediate list.
The single process version of your code should be replaced by lists:max/1. A useful function for parallelizing code is as follows:
pmap(Fun, List) ->
    Parent = self(),
    P = fun(Elem) ->
            Ref = make_ref(),
            spawn_link(fun() -> Parent ! {Ref, Fun(Elem)} end),
            Ref
        end,
    Refs = [P(Elem) || Elem <- List],
    lists:map(fun(Ref) -> receive {Ref, Elem} -> Elem end end, Refs).
pmap/2 applies Fun to each member of List in parallel and collects the results in input order. To use pmap with this problem, you would need to segment your original list into a list of lists and pass that to pmap. e.g. lists:max(pmap(fun lists:max/1, ListOfLists)). Of course, the act of segmenting the lists would be more expensive than simply calling lists:max/1, so this solution would require that the list be pre-segmented. Even then, it's likely that the overhead of copying the lists outweighs any benefit of parallelization - especially on a single node.
The inherent problem with your situation is that the computation of each sub-task is tiny compared with the overhead of managing the data. Tasks which are more computationally intensive (e.g. factoring a list of large numbers) are more easily parallelized.
This isn't to say that finding a max value can't be parallelized, but I believe it would require that your data be pre-segmented or segmented in a way that didn't require iterating over every value.