Recursive sync faster than Recursive async - f#

How come Solution 2 is more efficient than Solution 1?
(The times are averages over 100 runs, and both traverse 13217 folders in total.)
// Solution 1 (2608.9ms)
let rec folderCollector path =
    async { let! dirs = Directory.AsyncGetDirectories path
            do! [for z in dirs -> folderCollector z]
                |> Async.Parallel |> Async.Ignore }
// Solution 2 (2510.9ms)
let rec folderCollector path =
    let dirs = Directory.GetDirectories path
    for z in dirs do folderCollector z
I would have thought that Solution 1 would be faster because it's async and runs in parallel. What am I missing?

As Daniel and Brian already clearly explained, your solution is probably creating too many short-lived asynchronous computations (so the overhead is more than the gains from parallelism). The AsyncGetDirectories operation also probably isn't really non-blocking as it is not doing much work. I don't see a truly async version of this operation anywhere - how is it defined?
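For reference, if AsyncGetDirectories is just the synchronous call wrapped in an async (this definition is an assumption on my part, not necessarily the actual one), it would look something like this and do no truly non-blocking IO:

```fsharp
open System.IO

type Directory with
    // Hypothetical definition: this merely runs the blocking call on a
    // thread-pool thread; the directory enumeration itself still blocks.
    static member AsyncGetDirectories (path: string) =
        async { return Directory.GetDirectories path }
```
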
Anyway, using the ordinary GetDirectories, I tried the following version (which creates only a small number of parallel asyncs):
// Synchronous version
let rec folderCollectorSync path =
    let dirs = Directory.GetDirectories path
    for z in dirs do folderCollectorSync z

// Asynchronous version that uses the synchronous one when 'nesting <= 0'
let rec folderCollector path nesting =
    async { if nesting <= 0 then return folderCollectorSync path
            else let dirs = Directory.GetDirectories path
                 do! [for z in dirs -> folderCollector z (nesting - 1)]
                     |> Async.Parallel |> Async.Ignore }
Calling a simple synchronous version after a certain number of recursive calls is a common trick - it is used when parallelizing any tree-like structure that is very deep. Using folderCollector path 2, this will start only tens of parallel tasks (as opposed to thousands), so it will be more efficient.
On a sample directory I used (with 4800 sub-dirs and 27000 files), I get:
folderCollectorSync path takes 1 second
folderCollector path 2 takes 600ms (the result is similar for any nesting between 1 and 4)

From the comments:
Your function incurs the cost of async without any of the benefits, because:
- you're creating too many asyncs for the short amount of work to be done
- your function is not CPU-bound, but rather IO-bound

I expect for a problem like this, you may have the best results if at the top-level you do async/parallel work, but then have the sub-calls be sync. (Or if the trees are very deep, maybe have the first two levels be async, and then sync after that.)
The keys are load-balancing and granularity. Too tiny a piece of work, and the overhead of async outweighs the benefits of parallelism. So you want big enough chunks of work to leverage parallel and overcome the overheads. But if the work pieces are too large and unbalanced (e.g. one top-level dir has 10000 files, and 3 other top-level dirs have 1000 each), then you also suffer because one guy is busy while the rest finish quickly, and you don't maximize parallelism.
If you can estimate the work for each sub-tree beforehand, you can do even better scheduling.
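A sketch of the first suggestion, reusing the synchronous folderCollectorSync from the earlier answer: parallelize only across the top-level directories, so each async is a large enough chunk of work to pay for its own overhead.

```fsharp
open System.IO

// Synchronous traversal of a single subtree
let rec folderCollectorSync path =
    for z in Directory.GetDirectories path do
        folderCollectorSync z

// Parallel only at the top level; each top-level subtree becomes one
// async, so only a handful of parallel tasks are ever created.
let collectParallelTopLevel path =
    [ for z in Directory.GetDirectories path ->
        async { folderCollectorSync z } ]
    |> Async.Parallel
    |> Async.Ignore
```

Run it with collectParallelTopLevel path |> Async.RunSynchronously. If the top-level subtrees are badly unbalanced, the nesting-parameter version spreads the work more evenly.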

Apparently, your code is IO-bound. Keep in mind how HDDs work: when you use Async to issue multiple reads, the read heads of the HDD have to jump back and forth to serve the different read commands at the same time, which introduces latency. This will likely become much worse if the data on disk is heavily fragmented.


working on sequences with lagged operators

A machine is turning on and off.
seqStartStop is a seq<DateTime*DateTime> that collects the start and end time of the execution of the machine tasks.
I would like to produce the sequence of periods where the machine is idle. In order to do so, I would like to build a sequence of tuples (beginIdle, endIdle).
beginIdle corresponds to the stopping time of the machine during
the previous cycle.
endIdle corresponds to the start time of the current production
cycle.
In practice, I have to build (beginIdle, endIdle) by taking the second element of tuple i-1 and the first element of the following tuple i.
I was wondering how I could get this task done without converting seqStartStop to an array and then looping through the array in an imperative fashion.
Another idea is creating two copies of seqStartStop: one where the last element is removed, and one where the head is removed (shifting the elements backwards); and then applying map2.
I could use skip and take as described here.
All these seem rather cumbersome. Is there something more straightforward?
In general, I was wondering how to execute calculations on elements with different lags in a sequence.
You can implement this pretty easily with Seq.pairwise and Seq.map:
let idleTimes (startStopTimes : seq<DateTime * DateTime>) =
    startStopTimes
    |> Seq.pairwise
    |> Seq.map (fun ((_, stop), (start, _)) -> stop, start)
As for the more general question of executing on sequences with different lag periods, you could implement that by using Seq.skip and Seq.zip to produce a combined sequence with whatever lag period you require.
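That skip-and-zip idea can be wrapped up as a tiny combinator (lagPairs is a name invented here for illustration):

```fsharp
// Pair each element with the element 'lag' positions later.
// Note: Seq.skip throws on enumeration if the input has fewer than
// 'lag' elements, so this assumes the input is at least that long.
let lagPairs lag (xs: seq<'a>) =
    Seq.zip xs (Seq.skip lag xs)
```

With lag = 1 this yields the same pairs as Seq.pairwise.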
The idea of using map2 with two copies of the sequence, one slightly shifted by taking the tail of the original sequence, is quite a standard one in functional programming, so I would recommend that route.
The Seq.map2 function is fine working with sequences of different lengths - it just stops when it reaches the end of the shorter one - so you don't need to chop the last element off the original copy.
One thing to be careful of is how your original seq<DateTime*DateTime> is calculated. It will be recalculated each time it is enumerated, so with the map2 idea it will be calculated twice. If it's cheap to calculate and doesn't involve side-effects, this is fine. Otherwise, convert it to a list first with List.ofSeq.
You can still use Seq.map2 on lists, as a list is an IEnumerable (i.e. a seq). Don't use List.map2 unless the lists are the same length, though, as it is more picky than Seq.map2.
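Applied to the idle-time problem, the shifted-copies approach might look like this (equivalent to the Seq.pairwise version; the List.ofSeq guards against re-running a side-effecting source twice):

```fsharp
open System

let idleTimesViaMap2 (startStopTimes: seq<DateTime * DateTime>) =
    let times = List.ofSeq startStopTimes  // force the source once
    // Pair each stop time with the next cycle's start time; Seq.map2
    // stops at the end of the shorter (shifted) sequence. Assumes a
    // non-empty input, since Seq.skip 1 throws on an empty sequence.
    Seq.map2 (fun (_, stop) (nextStart, _) -> stop, nextStart)
             times (Seq.skip 1 times)
```
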

Using Partitioner with PLINQ in F#

I use the following PLINQ-implemented parallel map function.
let parmap f (xs:list<_>) = xs.AsParallel().Select(fun x -> f x) |> Seq.toList
I want to improve my speedup on 4 cores, which I'm not able to get above 2. I found that one can do custom partitioning to improve parallel performance, but I have mainly seen C# examples and am not sure how to get it to work in F#. The following doesn't change anything, but then I think it is just the default partitioning used by the TPL? How can I use the different (static, dynamic, ...) partitioning options here?
let pmap_plinqlst_parts f (xs:list<_>) =
    let parts = Partitioner.Create(xs)
    parts.AsParallel().Select(fun x -> f x) |> Seq.toList
Typically a custom partitioner would be used if the work units were extremely small. When faced with this problem you may be better off switching to Task rather than Async, as it's generally more suited to smaller but more numerous amounts of work, with Async being more suited to IO-type operations where the latency is typically longer.
For example you would batch up calculations in sequences amongst parallel threads. The yield would vary depending on the size of the work units and also the number of total items.
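To connect this back to the partitioning question, a static range partitioner is one way to sketch the batching (pmapRanges is an invented name; note that PLINQ gives no ordering guarantee here unless AsOrdered() is added):

```fsharp
open System.Collections.Concurrent
open System.Linq

// Sketch: hand each PLINQ task a contiguous (lo, hi) index range instead
// of individual elements, so the per-item scheduling overhead is paid
// once per range rather than once per element.
let pmapRanges f (xs: list<_>) =
    let arr = List.toArray xs
    Partitioner.Create(0, arr.Length)
        .AsParallel()
        .SelectMany(fun (lo, hi) -> seq { for i in lo .. hi - 1 -> f arr.[i] })
    |> Seq.toList
```
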
There is no limit to scaling in any of the methods you mentioned. I parallelised a Black Scholes calculation and managed to get around 6.8x on an 8 core machine using Async.Parallel. Although not a perfect mechanism I used a simple division of work amongst the initial sequences that were passed to Async.Parallel.
Are you sure you have a true four core machine or a two core machine with hyper threading?

F# PSeq.iter does not seem to be using all cores

I've been doing some computationally intensive work in F#. Functions like Array.Parallel.map, which use the .NET Task Parallel Library, have sped up my code dramatically for really quite minimal effort.
However, due to memory concerns, I remade a section of my code so that it can be lazily evaluated inside a sequence expression (this means I have to store and pass less information). When it came time to evaluate I used:
// processor and memory intensive task, results are not stored
let calculations : seq<Calculation> = seq { ...yield one thing at a time... }
// extract results from calculations for summary data
PSeq.iter someFuncToExtractResults calculations
Instead of:
// processor and memory intensive task, storing these results is an unnecessary task
let calculations : Calculation[] = ...do all the things...
// extract results from calculations for summary data
Array.Parallel.map someFuncToExtractResults calculations
When using any of the Array.Parallel functions I can clearly see all the cores on my computer kick into gear (~100% CPU usage). However, the extra memory required means the program never finishes.
With the PSeq.iter version when I run the program, there's only about 8% CPU usage (and minimal RAM usage).
So: Is there some reason why the PSeq version runs so much slower? Is it because of the lazy evaluation? Is there some magic "be parallel" stuff I am missing?
Thanks,
Other resources, source code implementations of both (they seem to use different Parallel libraries in .NET):
https://github.com/fsharp/fsharp/blob/master/src/fsharp/FSharp.Core/array.fs
https://github.com/fsharp/powerpack/blob/master/src/FSharp.PowerPack.Parallel.Seq/pseq.fs
EDIT: Added more detail to code examples and details
Code:
Seq
// processor and memory intensive task, results are not stored
let calculations : seq<Calculation> =
    seq {
        for index in 0 .. data.Length - 1 do
            yield calculationFunc data.[index]
    }
// extract results from calculations for summary data (different module)
PSeq.iter someFuncToExtractResults calculations
Array
// processor and memory intensive task, storing these results is an unnecessary task
let calculations : Calculation[] =
    Array.Parallel.map calculationFunc data
// extract results from calculations for summary data (different module)
Array.Parallel.map someFuncToExtractResults calculations
Details:
The version storing the intermediate array runs quickly (as far as it gets before crashing), in under 10 minutes, but uses ~70GB RAM before it crashes (64GB physical, the rest paged)
The seq version takes over 34 mins and uses a fraction of the RAM (only around 30GB)
There are about a billion values being calculated; a billion doubles (at 64 bits each) is roughly 7.45GB. There are more complex forms of data... and a few unnecessary copies I'm cleaning up, hence the current massive RAM usage.
Yes the architecture isn't great, the lazy evaluation is the first part of me attempting to optimize the program and/or batch up the data into smaller chunks
With a smaller dataset, both chunks of code output the same results.
#pad, I tried what you suggested, the PSeq.iter seemed to work properly (all cores active) when fed the Calculation[], but there is still the matter of RAM (it eventually crashed)
both the summary part of the code and the calculation part are CPU intensive (mainly because of large data sets)
With the Seq version I just aim to parallelize once
Based on your updated information, I'm shortening my answer to just the relevant part. You just need this instead of what you currently have:
let result = data |> PSeq.map (calculationFunc >> someFuncToExtractResults)
And this will work the same whether you use PSeq.map or Array.Parallel.map.
However, your real problem is not going to be solved. This problem can be stated as: when the desired degree of parallel work is reached in order to get to 100% CPU usage, there is not enough memory to support the processes.
Can you see how this will not be solved? You can either process things sequentially (less CPU efficient, but memory efficient) or you can process things in parallel (more CPU efficient, but runs out of memory).
The options then are:
Change the degree of parallelism to be used by these functions to something that won't blow your memory:
let result =
    data
    |> PSeq.withDegreeOfParallelism 2
    |> PSeq.map (calculationFunc >> someFuncToExtractResults)
Change the underlying logic for calculationFunc >> someFuncToExtractResults so that it is a single function that is more efficient and streams data through to results. Without knowing more detail, it's not simple to see how this could be done. But internally, certainly some lazy loading may be possible.
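One way to sketch that streaming idea: push the data through in fixed-size batches, so parallelism happens within a batch while only one batch of intermediate values is alive at a time (batchSize is a knob to tune, not a prescribed value, and the helper name is invented here):

```fsharp
// Sketch: laziness across batches, parallelism within each batch.
// Seq.chunkBySize keeps only one 'a[] chunk of intermediates in flight.
let processInBatches batchSize calculationFunc extractResult (data: 'a[]) =
    seq {
        for chunk in Seq.chunkBySize batchSize data do
            yield! Array.Parallel.map (calculationFunc >> extractResult) chunk
    }
```
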
Array.Parallel.map uses Parallel.For under the hood, while PSeq is a thin wrapper around PLINQ. The reason they behave differently here is that there is not enough workload for PSeq.iter when seq<Calculation> is sequential and too slow in yielding new results.
I do not get the idea of using an intermediate seq or array. Supposing data to be the input array, moving all calculations into one place is the way to go:
// Should use PSeq.map to match with Array.Parallel.map
PSeq.map (calculationFunc >> someFuncToExtractResults) data
and
Array.Parallel.map (calculationFunc >> someFuncToExtractResults) data
You avoid consuming too much memory and have intensive computation in one place which leads to better efficiency in parallel execution.
I had a problem similar to yours and solved it by adding the following to the solution's App.config file:
<runtime>
  <gcServer enabled="true" />
  <gcConcurrent enabled="true" />
</runtime>
A calculation that was taking 5'49'' and showing roughly 22% CPU utilization on Process Lasso took 1'36'' showing roughly 80% CPU utilization.
Another factor that may influence the speed of parallelized code is whether hyperthreading (Intel) or SMT (AMD) is enabled in the BIOS. I have seen cases where disabling leads to faster execution.

What are the advantages of the "apply" functions? When are they better to use than "for" loops, and when are they not? [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Is R's apply family more than syntactic sugar
Just what the title says. Stupid question, perhaps, but my understanding has been that when using an "apply" function, the iteration is performed in compiled code rather than in the R parser. This would seem to imply that lapply, for instance, is only faster than a "for" loop if there are a great many iterations and each operation is relatively simple. For instance, if a single call to a function wrapped up in lapply takes 10 seconds, and there are only, say, 12 iterations of it, I would imagine that there's virtually no difference at all between using "for" and "lapply".
Now that I think of it, if the function inside the "lapply" has to be parsed anyway, why should there be ANY performance benefit from using "lapply" instead of "for" unless you're doing something that there are compiled functions for (like summing or multiplying, etc)?
Thanks in advance!
Josh
There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.
Firstly, for(), apply(), and sapply() will generally be just as quick as each other if executed correctly. lapply() does more of its operating in compiled code within the R internals than the others, so can be faster than those functions. The speed advantage appears greatest when the act of "looping" over the data is a significant part of the compute time; in many general day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, all of these will be calling R functions, so they need to be interpreted and then run.
for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family functions. However, to use for() loops properly, you need to do some extra work to set up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:
IN <- runif(10)
OUT <- logical(length = length(IN))
for(i in seq_along(IN)) {
    OUT[i] <- IN[i] > 0.5
}
That is a silly example, as > is a vectorised operator, but I wanted something to make the point that you have to manage the output. The main thing is that with for() loops you should always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, allocate a reasonable chunk, then in the loop check whether you have exhausted it, and bolt on another big chunk of storage.
The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above) we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and will result in simple, easy to understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!
The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.
As to the last point, about why lapply() can possibly be faster than for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if you are doing the looping and calling directly from compiled C code (as lapply() does), then that is where the performance gain can come from over, say, apply(), which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:
> lapply
function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<environment: namespace:base>
and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. The .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from a manipulation, and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().
From Burns' R Inferno (pdf), p25:
Use an explicit for loop when each iteration is a non-trivial task. But a simple loop can be more clearly and compactly expressed using an apply function. There is at least one exception to this rule ... if the result will be a list and some of the components can be NULL, then a for loop is trouble (big trouble) and lapply gives the expected answer.

Do recursive sequences leak memory?

I like to define sequences recursively as follows:
let rec startFrom x =
    seq {
        yield x
        yield! startFrom (x + 1)
    }
I'm not sure if recursive sequences like this should be used in practice. The yield! appears to be tail recursive, but I'm not 100% sure since it's being called from inside another IEnumerable. From my point of view, the code creates an instance of IEnumerable on each call without closing it, which would actually make this function leak memory as well.
Will this function leak memory? For that matter is it even "tail recursive"?
[Edit to add]: I'm fumbling around with NProf for an answer, but I think it would be helpful to get a technical explanation regarding the implementation of recursive sequences on SO.
I'm at work now so I'm looking at slightly newer bits than Beta1, but on my box in Release mode and then looking at the compiled code with .Net Reflector, it appears that these two
let rec startFromA x =
    seq {
        yield x
        yield! startFromA (x + 1)
    }

let startFromB x =
    let z = ref x
    seq {
        while true do
            yield !z
            incr z
    }
generate almost identical MSIL code when compiled in 'Release' mode. And they run at about the same speed as this C# code:
public class CSharpExample
{
    public static IEnumerable<int> StartFrom(int x)
    {
        while (true)
        {
            yield return x;
            x++;
        }
    }
}
(e.g. I ran all three versions on my box and printed the millionth result, and each version took about 1.3s, +/- 1s). (I did not do any memory profiling; it's possible I'm missing something important.)
In short, I would not sweat too much thinking about issues like this unless you measure and see a problem.
EDIT
I realize I didn't really answer the question... I think the short answer is "no, it does not leak". (There is a special sense in which all 'infinite' IEnumerables with a cached backing store 'leak', depending on how you define 'leak'; see
Avoiding stack overflow (with F# infinite sequences of sequences)
for an interesting discussion of IEnumerable (aka 'seq') versus LazyList, and how the consumer can eagerly consume LazyLists to 'forget' old results and prevent a certain kind of 'leak'.)
.NET applications don't "leak" memory in this way. Even if you are creating many objects a garbage collection will free any objects that have no roots to the application itself.
Memory leaks in .NET usually come in the form of unmanaged resources that you are using in your application (database connections, memory streams, etc.). Instances like this in which you create multiple objects and then abandon them are not considered a memory leak since the garbage collector is able to free the memory.
It won't leak any memory, it'll just generate an infinite sequence, but since sequences are IEnumerables you can enumerate them without memory concerns.
The fact that the recursion happens inside the sequence generation function doesn't impact the safety of the recursion.
Just mind that in debug mode tail call optimization may be disabled in order to allow full debugging, but in release there'll be no issue whatsoever.
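To make the "no leak" point concrete: enumeration is lazy, so a consumer only ever materializes the elements it asks for, even though the sequence is conceptually infinite.

```fsharp
let rec startFrom x =
    seq {
        yield x
        yield! startFrom (x + 1)
    }

// Only five elements are ever produced.
let firstFive = startFrom 0 |> Seq.take 5 |> Seq.toList  // [0; 1; 2; 3; 4]
```
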
