I use the following PLINQ-implemented parallel map function.
open System.Linq

let parmap f (xs: list<_>) = xs.AsParallel().Select(fun x -> f x) |> Seq.toList
I want to improve my speedup on 4 cores, but I can't get it above 2x. I found that custom partitioning can improve parallel performance, but the examples I have seen are mostly in C# and I'm not sure how to make it work from F#. The following doesn't change anything, but I suspect that's because it is just the default partitioning the TPL would use anyway. How can I use the different (static, dynamic, ...) partitioning options here?
open System.Collections.Concurrent

let pmap_plinqlst_parts f (xs: list<_>) =
    let parts = Partitioner.Create(xs)
    parts.AsParallel().Select(fun x -> f x) |> Seq.toList
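For what it's worth, here is a minimal sketch of how the other Partitioner.Create overloads can be called from F# (the function names are my own, and both versions first copy the list to an array, since those overloads need one):

open System.Linq
open System.Collections.Concurrent

// Load-balancing ("dynamic") partitioning: elements are handed out to the
// worker threads in small batches on demand, which helps when the cost per
// element varies a lot.
let pmap_loadbalanced f (xs: list<_>) =
    let arr = List.toArray xs
    Partitioner.Create(arr, true).AsParallel().Select(fun x -> f x) |> Seq.toList

// Static range partitioning: Partitioner.Create(0, length) yields (lo, hi)
// index ranges, and each range is processed as one unit of work.
let pmap_ranges f (xs: list<_>) =
    let arr = List.toArray xs
    let ranges = Partitioner.Create(0, arr.Length)
    ranges.AsParallel()
          .SelectMany(fun range ->
              let lo, hi = range
              seq { for i in lo .. hi - 1 -> f arr.[i] })
    |> Seq.toList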
Typically a custom partitioner would be used if the work units were extremely small. When faced with this problem you may be better off switching to Task rather than Async, as it's generally more suited to smaller but more numerous pieces of work, with Async being better suited to IO-type operations where the latency is typically longer.
For example, you would batch up the calculations into sequences and spread those amongst parallel threads. The gain would vary depending on the size of the work units and also the total number of items.
There is no hard limit to scaling in any of the methods you mentioned. I parallelised a Black-Scholes calculation and managed around a 6.8x speedup on an 8-core machine using Async.Parallel. Although not a perfect mechanism, I used a simple division of the work amongst the initial sequences that were passed to Async.Parallel.
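As a rough illustration of that kind of chunked division of work (a sketch of my own, not the actual Black-Scholes code; the function name and the use of Array.splitInto, available from F# 4.0, are my choices):

// Split the input into one chunk per core and run each chunk as a single
// async computation, so the per-async overhead is paid only a few times.
let parmapChunked f (xs: 'a list) =
    xs
    |> List.toArray
    |> Array.splitInto System.Environment.ProcessorCount
    |> Array.map (fun chunk -> async { return Array.map f chunk })
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Array.concat
    |> Array.toList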
Are you sure you have a true four-core machine, and not a two-core machine with hyper-threading?
I have a function that returns a dataframe. I am trying to run this function in parallel using dask.
I append the delayed objects for the dataframes to a list. However, the run-time of my code is the same with and without dask.delayed.
I use the reduce function from functools along with pd.merge to merge my dataframes.
Any suggestions on how to improve the run-time?
The visualized graph and code are as below.
from functools import reduce
from dask import delayed

d = []
for lot in lots:
    lot_data = data[data["LOTID"] == lot]
    trmat = delayed(LOT)(lot, lot_data).transition_matrix(lot)
    d.append(trmat)

df = delayed(reduce)(lambda x, y: x.merge(y, how='outer', on=['from', 'to']), d)
[Visualized graph of the operations]
General rule: if your data comfortably fits into memory (including the base size times a small multiplier for possible intermediates), then there is a good chance that plain Pandas is fast and efficient for your use case.
Specifically for your case, there is a good chance that the tasks you are trying to parallelise do not release Python's internal lock, the GIL, in which case, although you have independent threads, only one can run at a time. The solution would be to use the "distributed" scheduler instead, which can use any mix of threads and processes; however, using processes comes at a cost for moving data between the client and the processes, and you may find that this extra cost dominates any time saving. You would certainly want to ensure that you load the data within the workers rather than passing it from the client.
Short story: you should do some experimentation, measure well, and read the dataframe and distributed-scheduler documentation carefully.
It is nice to have a wrapper for every primitive value, so that there is no way to misuse it. I suspect this convenience comes at a price. Is there a performance drop? Should I rather use bare primitive values instead if the performance is a concern?
Yes, there's going to be a performance drop when using single-case union types to wrap primitive values. Union cases are compiled into classes, so you'll pay the price of allocating (and later, collecting) the class and you'll also have an additional indirection each time you fetch the value held inside the union case.
Depending on the specifics of your application, and how often you'll incur these additional overheads, it may still be worth doing if it makes your code safer and more modular.
I've written a lot of performance-sensitive code in F#, and my personal preference is to use F# unit-of-measure types whenever possible to "tag" primitive types (e.g., ints). This keeps them from being misused (thanks to the F# compiler's type checker) but also avoids any additional run-time overhead, since the measure types are erased when the code is compiled. If you want some examples of this, I've used this design pattern extensively in my fsharp-tools projects.
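For illustration (a sketch of my own, not taken from fsharp-tools; the measure names are made up), tagging ints with units of measure looks like this:

[<Measure>] type customerId
[<Measure>] type orderId

// Measures are erased at compile time, so at run time these are plain ints.
let describeCustomer (id: int<customerId>) = sprintf "customer %d" (int id)

let cid = 42<customerId>
let oid = 7<orderId>

describeCustomer cid      // fine
// describeCustomer oid   // compile-time error: the measures don't match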
Jack has much more experience with writing high-performance F# code than I do, so I think his answer is absolutely right (I also think the idea to use units of measure is pretty interesting).
Just to put things in context, I wrote a really basic test (using just F# Interactive, so things may differ in Release mode) to compare the performance. It allocates an array of wrapped (vs. non-wrapped) int values. This is probably a scenario where non-wrapped types are a really good choice, because the array will be just one contiguous block of memory.
#time

// Using a simple wrapped `int` type
type Foo = Foo of int
let foos = Array.init 1000000 Foo

// Add all 'foos' 1k times and ignore the results
for i in 0 .. 1000 do
    let mutable total = 0
    for Foo f in foos do total <- total + f
On my machine, the for loop takes on average something around 1050ms. Now, the unwrapped version:
let bars = Array.init 1000000 id

for i in 0 .. 1000 do
    let mutable total = 0
    for b in bars do total <- total + b
On my machine, this takes about 700ms.
So, there is certainly some performance penalty, but perhaps smaller than one would expect (some 33%). And this is looking at a test that does virtually nothing else than unwrap the values in a loop. In code that does something useful, the overhead would be a lot smaller.
This may be an issue if you're writing high-performance code, something that will process lots of data, or something that takes noticeable time and that users will run frequently (like compilers and tools). On the other hand, if your application is not performance-critical, then this is not likely to be a problem.
From F# 4.1 onwards adding the [<Struct>] attribute to suitable single case discriminated unions will increase the performance and reduce the number of memory allocations performed.
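A minimal sketch of that, redefining the Foo type from the test above as a struct:

// F# 4.1+: the wrapper is now a value type, so wrapping an int no longer
// allocates a separate object on the heap for every value.
[<Struct>]
type Foo = Foo of int

let foos = Array.init 1000000 Foo   // an array of structs, one contiguous block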
I've been doing some computationally intensive work in F#. Functions like Array.Parallel.map, which use the .NET Task Parallel Library, have sped up my code dramatically for really quite minimal effort.
However, due to memory concerns, I remade a section of my code so that it can be lazily evaluated inside a sequence expression (this means I have to store and pass less information). When it came time to evaluate I used:
// processor and memory intensive task, results are not stored
let calculations : seq<Calculation> = seq { ...yield one thing at a time... }
// extract results from calculations for summary data
PSeq.iter someFuncToExtractResults calculations
Instead of:
// processor and memory intensive task, storing these results is an unnecessary task
let calculations : Calculation[] = ...do all the things...
// extract results from calculations for summary data
Array.Parallel.map someFuncToExtractResults calculations
When using any of the Array.Parallel functions I can clearly see all the cores on my computer kick into gear (~100% CPU usage). However, the extra memory required means the program never finishes.
With the PSeq.iter version, when I run the program there's only about 8% CPU usage (and minimal RAM usage).
So: Is there some reason why the PSeq version runs so much slower? Is it because of the lazy evaluation? Is there some magic "be parallel" stuff I am missing?
Thanks,
Other resources, source code implementations of both (they seem to use different Parallel libraries in .NET):
https://github.com/fsharp/fsharp/blob/master/src/fsharp/FSharp.Core/array.fs
https://github.com/fsharp/powerpack/blob/master/src/FSharp.PowerPack.Parallel.Seq/pseq.fs
EDIT: Added more detail to code examples and details
Code:
Seq
// processor and memory intensive task, results are not stored
let calculations : seq<Calculation> =
    seq {
        for index in 0 .. data.Length - 1 do
            yield calculationFunc data.[index]
    }
// extract results from calculations for summary data (different module)
PSeq.iter someFuncToExtractResults calculations
Array
// processor and memory intensive task, storing these results is an unnecessary task
let calculations : Calculation[] =
    Array.Parallel.map calculationFunc data
// extract results from calculations for summary data (different module)
Array.Parallel.map someFuncToExtractResults calculations
Details:
The version that stores the intermediate array runs quickly, in under 10 minutes (as far as it gets before crashing), but uses ~70GB of RAM before it crashes (64GB physical, the rest paged)
The seq version takes over 34 minutes but uses a fraction of the RAM (only around 30GB)
There are roughly a billion values being calculated, hence a billion doubles (at 64 bits each) = 7.45GB. There are more complex forms of data... and a few unnecessary copies I'm cleaning up, hence the current massive RAM usage.
Yes, the architecture isn't great; the lazy evaluation is the first part of my attempt to optimise the program and/or batch the data into smaller chunks
With a smaller dataset, both chunks of code output the same results.
@pad, I tried what you suggested; PSeq.iter seemed to work properly (all cores active) when fed the Calculation[], but there is still the matter of RAM (it eventually crashed)
Both the summary part of the code and the calculation part are CPU-intensive (mainly because of the large data sets)
With the seq version I just aim to parallelise once
Based on your updated information, I'm shortening my answer to just the relevant part. You just need this instead of what you currently have:
let result = data |> PSeq.map (calculationFunc >> someFuncToExtractResults)
And this will work the same whether you use PSeq.map or Array.Parallel.map.
However, your real problem is not going to be solved by that alone. The problem can be stated as: by the time the degree of parallelism is high enough to reach 100% CPU usage, there is not enough memory to support all of that work in flight.
Can you see how this will not be solved? You can either process things sequentially (less CPU efficient, but memory efficient) or you can process things in parallel (more CPU efficient, but runs out of memory).
The options then are:
Change the degree of parallelism to be used by these functions to something that won't blow your memory:
let result =
    data
    |> PSeq.withDegreeOfParallelism 2
    |> PSeq.map (calculationFunc >> someFuncToExtractResults)
Change the underlying logic for calculationFunc >> someFuncToExtractResults so that it is a single function that is more efficient and streams data through to results. Without knowing more detail, it's not simple to see how this could be done. But internally, certainly some lazy loading may be possible.
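One possible shape of that streaming approach (a sketch only; the chunk size is arbitrary and Seq.chunkBySize requires F# 4.0 or later) is to push the data through in fixed-size chunks, so that only one chunk's intermediate Calculation values are alive at a time:

let results =
    data
    |> Seq.chunkBySize 100000
    |> Seq.collect (fun chunk ->
        chunk |> Array.Parallel.map (calculationFunc >> someFuncToExtractResults))
    |> Seq.toArray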
Array.Parallel.map uses Parallel.For under the hood, while PSeq is a thin wrapper around PLINQ. But the reason they behave differently here is that PSeq.iter doesn't get enough work: the seq<Calculation> is produced sequentially, and it is too slow at yielding new results.
I don't see the point of the intermediate seq or array. Supposing data is the input array, moving all the calculations into one place is the way to go:
// Should use PSeq.map to match with Array.Parallel.map
PSeq.map (calculationFunc >> someFuncToExtractResults) data
and
Array.Parallel.map (calculationFunc >> someFuncToExtractResults) data
You avoid consuming too much memory and have intensive computation in one place which leads to better efficiency in parallel execution.
I had a problem similar to yours and solved it by adding the following to the solution's App.config file:
<configuration>
  <runtime>
    <gcServer enabled="true" />
    <gcConcurrent enabled="true" />
  </runtime>
</configuration>
A calculation that was taking 5 min 49 s and showing roughly 22% CPU utilization in Process Lasso took 1 min 36 s and showed roughly 80% CPU utilization.
Another factor that may influence the speed of parallelized code is whether hyper-threading (Intel) or SMT (AMD) is enabled in the BIOS. I have seen cases where disabling it leads to faster execution.
How come that Solution 2 is more efficient than Solution 1?
(The times are averages over 100 runs, and the total number of folders traversed is 13,217.)
// Solution 1 (2608.9 ms)
let rec folderCollector path =
    async { let! dirs = Directory.AsyncGetDirectories path
            do! [for z in dirs -> folderCollector z]
                |> Async.Parallel |> Async.Ignore }
// Solution 2 (2510.9 ms)
let rec folderCollector path =
    let dirs = Directory.GetDirectories path
    for z in dirs do folderCollector z
I would have thought that Solution 1 would be faster because it's async and I run it in parallel. What am I missing?
As Daniel and Brian already clearly explained, your solution is probably creating too many short-lived asynchronous computations (so the overhead is more than the gains from parallelism). The AsyncGetDirectories operation also probably isn't really non-blocking as it is not doing much work. I don't see a truly async version of this operation anywhere - how is it defined?
Anyway, using the ordinary GetDirectories, I tried the following version (which creates only a small number of parallel asyncs):
// Synchronous version
let rec folderCollectorSync path =
let dirs = Directory.GetDirectories path
for z in dirs do folderCollectorSync z
// Asynchronous version that uses synchronous when 'nesting <= 0'
let rec folderCollector path nesting =
async { if nesting <= 0 then return folderCollectorSync path
else let dirs = Directory.GetDirectories path
do! [for z in dirs -> folderCollector z (nesting - 1) ]
|> Async.Parallel |> Async.Ignore }
Calling a simple synchronous version after a certain number of recursive calls is a common trick - it is used when parallelizing any tree-like structure that is very deep. Using folderCollector path 2, this will start only tens of parallel tasks (as opposed to thousands), so it will be more efficient.
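For clarity, calling it looks like this (the directory path here is just a placeholder):

// Parallel for the top two levels of the tree, synchronous below that.
folderCollector @"C:\SomeFolder" 2 |> Async.RunSynchronously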
On a sample directory I used (with 4800 sub-dirs and 27000 files), I get:
folderCollectorSync path takes 1 second
folderCollector path 2 takes 600ms (the result is similar for any nesting between 1 and 4)
From the comments:
Your function incurs the cost of async without any of the benefits, because:
- you're creating too many asyncs for the short amount of work to be done
- your function is not CPU-bound, but rather IO-bound
I expect for a problem like this, you may have the best results if at the top-level you do async/parallel work, but then have the sub-calls be sync. (Or if the trees are very deep, maybe have the first two levels be async, and then sync after that.)
The keys are load-balancing and granularity. Too tiny a piece of work, and the overhead of async outweighs the benefits of parallelism. So you want big enough chunks of work to leverage parallel and overcome the overheads. But if the work pieces are too large and unbalanced (e.g. one top-level dir has 10000 files, and 3 other top-level dirs have 1000 each), then you also suffer because one guy is busy while the rest finish quickly, and you don't maximize parallelism.
If you can estimate the work for each sub-tree beforehand, you can do even better scheduling.
Apparently, your code is IO-bound. Keep in mind how HDDs work: when you use Async to issue multiple reads, the read heads of the HDD have to jump back and forth to serve the different read commands at the same time, which introduces latency. This will likely become much worse if the data on disk is heavily fragmented.
Possible Duplicate:
Is R's apply family more than syntactic sugar
Just what the title says. Stupid question, perhaps, but my understanding has been that when using an "apply" function, the iteration is performed in compiled code rather than in the R parser. This would seem to imply that lapply, for instance, is only faster than a "for" loop if there are a great many iterations and each operation is relatively simple. For instance, if a single call to a function wrapped up in lapply takes 10 seconds, and there are only, say, 12 iterations of it, I would imagine that there's virtually no difference at all between using "for" and "lapply".
Now that I think of it, if the function inside the "lapply" has to be parsed anyway, why should there be ANY performance benefit from using "lapply" instead of "for" unless you're doing something that there are compiled functions for (like summing or multiplying, etc)?
Thanks in advance!
Josh
There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.
Firstly, for(), apply(), and sapply() will generally be just as quick as each other if executed correctly. lapply() does more of its operating in compiled code within the R internals than the others, so it can be faster than those functions. The speed advantage appears to be greatest when the act of "looping" over the data is a significant part of the compute time; in many general day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, all of these will be calling R functions, so those functions still need to be interpreted and then run.
for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family functions. However, to use for() loops properly, you need to do some extra work to set-up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:
IN <- runif(10)
OUT <- logical(length = length(IN))
for(i in seq_along(IN)) {
    OUT[i] <- IN[i] > 0.5
}
That is a silly example, as > is a vectorised operator, but I wanted something to make a point, namely that you have to manage the output yourself. The main thing is that with for() loops you should always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, allocate a reasonable chunk, and then in the loop check whether you have exhausted that storage and bolt on another big chunk.
The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above) we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and will result in simple, easy to understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!
The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.
As to the last point, about why lapply() can possibly be faster than for() or apply(), you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if the looping and calling are done directly from compiled C code (as in lapply()), then that is where the performance gain can come from, compared with apply(), which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:
> lapply
function (X, FUN, ...)
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X))
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<environment: namespace:base>
and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. The .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from some argument manipulation and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().
From Burns' R Inferno (PDF), p. 25:
Use an explicit for loop when each iteration is a non-trivial task. But a simple loop can be more clearly and compactly expressed using an apply function. There is at least one exception to this rule ... if the result will be a list and some of the components can be NULL, then a for loop is trouble (big trouble) and lapply gives the expected answer.