Best practices to parallelize using async workflows - F#

Let's say I wanted to scrape a web page and extract some data. I'd most likely write something like this:
let getAllHyperlinks(url:string) =
    async { let req = WebRequest.Create(url)
            let! rsp = req.GetResponseAsync()
            use stream = rsp.GetResponseStream()             // depends on rsp
            use reader = new System.IO.StreamReader(stream)  // depends on stream
            let! data = reader.AsyncReadToEnd()              // depends on reader
            return extractAllUrls(data) }                    // depends on data
The let! tells F# to execute the code on another thread, bind the result to a variable, and continue processing. The sample above uses two let! statements: one to get the response, and one to read all the data, so it spawns at least two threads (please correct me if I'm wrong).
Although the workflow above spawns several threads, the order of execution is serial because each item in the workflow depends on the previous item. It's not really possible to evaluate any items further down the workflow until the other threads return.
Is there any benefit to having more than one let! in the code above?
If not, how would this code need to change to take advantage of multiple let! statements?

The key is that we are not spawning any new threads. During the whole course of the workflow, there are 1 or 0 active threads being consumed from the ThreadPool. (One exception: up until the first '!', the code runs on the user thread that called Async.Run.) "let!" lets go of a thread while the async operation is at sea, and then picks up a thread from the ThreadPool when the operation returns. The (performance) advantage is less pressure against the ThreadPool (and of course the major user advantage is the simple programming model - a million times better than all the BeginFoo/EndFoo/callback stuff you would otherwise write).
See also http://cs.hubfs.net/forums/thread/8262.aspx
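To actually overlap several downloads, the usual approach is to build one workflow per URL and compose them with Async.Parallel. A minimal sketch reusing getAllHyperlinks from the question (Async.RunSynchronously is the current name for what this answer calls Async.Run):

let getAllLinksFromSites (urls: string list) =
    urls
    |> List.map getAllHyperlinks   // one workflow per URL; nothing runs yet
    |> Async.Parallel              // compose the workflows so they run concurrently
    |> Async.RunSynchronously      // start them and wait for all the results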

I was writing an answer but Brian beat me to it. I fully agree with him.
I'd like to add that if you want to parallelize synchronous code, the right tool is PLINQ, not async workflows, as Don Syme explains.
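For example, a minimal PLINQ sketch from F#, where inputs and expensiveWork are hypothetical placeholders for a collection and a CPU-bound synchronous function:

open System.Linq

// hedged sketch: PLINQ parallelizes synchronous, CPU-bound work across cores
let results = inputs.AsParallel().Select(fun x -> expensiveWork x).ToArray()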

Related

Does let!/do! always run the async object in a new thread?

The wikibook on F# has a small section that says:
What does let! do?
let! runs an async<'a> object on its own thread, then it immediately
releases the current thread back to the threadpool. When let! returns,
execution of the workflow will continue on the new thread, which may
or may not be the same thread that the workflow started out on.
I have not found this claim (that let! runs the async object on its own thread and releases the current thread) stated anywhere else, in books or on the web.
Is this true for all let!/do! regardless of what the async object contains (e.g. Thread.Sleep()) and how it is started (e.g. Async.Start)?
Looking at the F# source code on GitHub, I wasn't able to find the place where a call to bind executes on a new ThreadPool thread. Where in the code is the magic happening?
Which part of that statement do you find surprising? That parts of a single async can execute on different threadpool threads, or that a threadpool thread is necessarily being released and obtained on each bind?
If it's the latter, then I agree - it sounds wrong. Looking at the code, there are only a few places where a new work item is being queued on the threadpool (namely, the few Async module functions that use queueAsync internally), and Async.SwitchToNewThread spawns a non-threadpool thread and runs the continuation there. A bind alone doesn't seem to be enough to switch threads.
The spirit of the statement however seems to be about the former - no guarantees are made that parts of an async block will run on the same thread. The exact thread that you run on should be treated as an implementation detail, and when you yield control and await some result, you can be pretty sure that you'll land on a different thread at least some of the time.
No. An async operation might execute synchronously on the current thread, or it might wind up completing on a different thread. It depends entirely on how the async API in question is implemented.
See Do the new C# 5.0 'async' and 'await' keywords use multiple cores? for a decent explanation. The implementation details of F# and C# async are different, but the overall principles are the same.
The builder that implements the F# async computation expression is here.
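To make both points concrete, here is a minimal sketch (an illustration, not taken from the answers above): a bind on a workflow that completes synchronously can continue on the current thread, while a bind that awaits a timer resumes on whichever ThreadPool thread runs the callback.

open System.Threading

let sample =
    async {
        printfn "start:      thread %d" Thread.CurrentThread.ManagedThreadId
        let! x = async { return 1 }   // completes synchronously: no thread switch is required here
        printfn "after let!: thread %d" Thread.CurrentThread.ManagedThreadId
        do! Async.Sleep 100           // awaits a timer: the continuation resumes on a ThreadPool thread
        printfn "after do!:  thread %d" Thread.CurrentThread.ManagedThreadId
        return x
    }

Async.RunSynchronously sample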

Is it better for an API to dispatch itself to a queue and invoke a callback, or for the API caller to do the dispatching?

Examples:
Asynchronous method with its own dispatching:
// Library
func asyncAPI(callback: Result -> Void) {
    dispatch_async(self.queue) {
        ...
        callback(result)
    }
}
// Caller
asyncAPI() { result in
    ...
}
Synchronous method with exposed dispatch queue:
// Library
func syncAPI() -> Result {
    assert(isRunningOnCorrectQueue())
    ...
    return result
}
// Caller
dispatch_async(api.queue) {
    let result = api.syncAPI()
    ...
}
These two examples behave the same, but I am looking to learn whether one of them ends up complicating a larger codebase more than the other, especially when there is a lot of asynchrony.
I would argue against both of the patterns you propose.
For the first pattern (where the API manages its own backgrounding) I see little or no benefit to doing it this way, as opposed to leaving it to the caller. If you want to use a private, serial queue to protect data (or any other sort of critical section) internal to your API, that's fine, but that queue should be private, and it should specifically not target any public, non-global-concurrent queue (Note: it should especially not target the main queue). Ideally, the primary implementation of your API would also take a second parameter, so callers can specify on which queue to invoke the callback. (People can work around the lack of such a parameter by passing a callback block that re-dispatches to their desired queue, but I think that's clunkier than having an extra, optional parameter.) This puts the API consumer in complete control of the concurrency, while preserving your freedom to use queues internally to protect state.
As to the second approach, it's my opinion that we all should avoid creating new synchronous, blocking API. When you provide a synchronous, blocking API and don't provide a callback-based version, that means that you have denied consumers of your API any opportunity to avoid blocking. When you only provide synchronous, blocking API, then if someone wants to call your API in the background, at least one thread (in addition to any additional threads that your API consumes behind the scenes) will be consumed from the finite number of threads available to each process. (In the worst case this can lead to starvation conditions that are effectively deadlocks.)
Another red flag with this second example is that it vends a queue; any time an API vends a queue, something is amiss. As mentioned, if you want to use a private serial queue to protect state or other critical sections internal to your API, go for it, but don't expose that queue to the outside world. If nothing else, it unnecessarily exposes details of your implementation. In looking at the system framework headers, I couldn't find a single case where a dispatch_queue_t was vended where it wasn't immediately obvious that the intent was for the API consumer to pass in the queue, and not read it out.
It's also worth mentioning that these patterns are problematic regardless of whether your workload is CPU-bound or IO-bound. If it's CPU-bound, then not managing your own dispatch gives consumers of the API explicit control over how this CPU work is executed. If your workload is IO-bound, then you should use the OS- and libdispatch-provided asynchronous IO mechanisms (dispatch_io, dispatch_sources, kevent, etc) to avoid consuming a thread (or more than one) for the duration of your work.
Another answer here implied that forcing consumers to manage their own concurrency leads to "boilerplate" code. If you feel that the burden of API consumers potentially having to wrap calls to your API with dispatch_async is too great, then feel free to provide a convenience overload that dispatches to the default global concurrent queue, but please always leave the version that allows API consumers the ability to explicitly manage their own concurrency.
If, on the other hand, all this is internal to the implementation, and not part of the public API, then do whatever is most expedient, knowing that you can refactor the implementation behind the public API any time in the future.
As you said, the two generally accomplish the same thing, but the first is preferable in most scenarios. There are several benefits to using the first method.
The API is simpler. You simply call the method and provide code for the callback block.
Less boilerplate code: no typing dispatch_async every time you want to call it, since the dispatch is already included in the method itself.
Less room for bugs/errors. By wrapping the asynchronous logic inside the method itself, you ensure that it is called on the right queue internally without the caller having to worry about any of that.
Touching on the last point, you also have finer control over the queue itself. Let's say you are trying to perform certain tasks on a particular queue. It is far simpler to wrap the code in a GCD call on that queue a single time rather than having to remember to reuse that same queue every time you want to call the method.

How to use FSharpx TaskBuilder with functions taking parameters

I have lately been programming with the FSharpx library, especially its TaskBuilder. Now I wonder whether it is possible to define a function that takes parameters and returns a result, such as:
let doTask(parameter:int) =
    let task = TaskBuilder(scheduler = TaskScheduler.Current)
    task {
        return! Task.Factory.StartNew(fun() -> parameter + 1)
    }

match FSharpx.Task.run doTask(1) with
| _ -> ()
Looking at the source code, I see run expects a function taking no parameters and returning a Task<'a>. There don't appear to be examples in the FSharpx TaskTests either.
I'd appreciate it if someone could advise how I should get a scenario like this going with FSharpx, or whether one isn't supposed to use the library like this for a reason I haven't quite grasped yet.
<edit: I believe I could wrap doTask as follows

let wrapperDoTask() = doTask(101)

match FSharpx.Task.run wrapperDoTask with
| _ -> ()
And it might work. I'm not at a compiler currently, so this is a bit of handwaving. Does anyone have an opinion either way, or did I just answer my own question? :)
<edit2:
I think I need to edit this one more time based on MisterMetaphor's answer. His P.S. in particular was very informative. I use the FSharpx TaskBuilder to interop with C#, in which, as noted, tasks are returned hot (with some minor exceptions), i.e. already running. This is in connection with my recent question Translating async-await C# code to F# with respect to the scheduler, and relates to Orleans (I'll add some tags to beef up the context; maybe someone else is pondering these too).
When thinking in C# terms, what I try to achieve is to await the task result before returning, but without blocking. The behaviour I'm after is specifically that of await, not .Result. The difference is explained, for instance, in
Await, and UI, and deadlocks! Oh my!
Don't Block on Async Code.
Trying to work out which context or scheduler or behaviour (or something) is in play, in C# terms, is somewhat fuzzy for me. Unfortunately it looks like I can't ignore all the details when it comes to interop. :)
You need to use Task.run only if you want to wait for the task completion synchronously on the current thread. It takes a single parameter and you can consider that parameter a task factory -- i.e. a means to create a Task<_>. Unlike Async<_>, the Task<_> starts running as soon as it is created. That is not always a desirable behavior.
You could achieve similar results (a blocking wait for task completion) with (doTask 101).Result, but I think Task.run is more idiomatic in F#, in that it uses a Result return type to signal an error instead of raising an exception. It is arguable which is better, depending on the situation, but in my experience, in simpler cases a special result type is more composable than exceptions.
Another point here is that you should avoid blocking waits (Task.run, .Wait(), .Result) as much as you can. (Ideally, you'd have one of those only at the top level of your program.)
P.S. This is out of the scope of the question, but your doTask function looks funny. task { return! Task.Factory.StartNew( ... ) } is equivalent to Task.Factory.StartNew( ... ). What you probably wanted to do is task { return parameter + 1 }.
EDIT
So, in response to the OP's question edit :) If you need the await behavior from C#, you just need to use let!, like this:
task {
    let! x = someTask 1 2 3
    return x + 5
}
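Putting that together with the question's code, a minimal hedged sketch of awaiting doTask from another task block (this assumes the TaskBuilder instance named task is defined at module level rather than inside doTask):

// hedged sketch: awaits doTask's result (like C# await) instead of blocking on .Result
let doMore (n: int) =
    task {
        let! x = doTask n   // resumes here when the task completes, without blocking a thread
        return x + 5
    }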

.NET async on the web - must it be with an async controller?

If I'm looking to use TPL async at my data layer, must I also use Task<T> in my MVC controller?
In other words, for async to work with .NET MVC, must it be implemented from the time the request begins in order for it to work on deeper execution layers? Or is there still a benefit to having Task<T> at my DAL/web request level even if I'm using a sync controller?
If you don't use an async controller, you will have to wait on a task at some point. At that point the main advantage is gone: reducing the number of blocked threads.
Of course this is not true if you run multiple async activities at the same time. That would reduce the number of blocked threads from N to one. (If N = 1 there is no benefit, just damage).
Note that async is not faster by default. Its main purpose in ASP.NET is to gain scalability at the extreme end - with 100s of concurrent requests. Only then will it be faster or scale higher.
So if you have a "usual" number of concurrent requests (like < 100), just go synchronous and don't worry about all of this.

Async.Parallel or Array.Parallel.Map?

I'm trying to implement a pattern I read from Don Syme's blog
(https://blogs.msdn.microsoft.com/dsyme/2010/01/09/async-and-parallel-design-patterns-in-f-parallelizing-cpu-and-io-computations/)
which suggests that there are opportunities for massive performance improvements from leveraging asynchronous I/O. I am currently trying to take a piece of code that "works" one way, using Array.Parallel.Map, and see if I can somehow achieve the same result using Async.Parallel, but I really don't understand Async.Parallel, and cannot get anything to work.
I have a piece of code (simplified below to illustrate the point) that successfully retrieves an array of data for one cusip (a price series, for example).
let getStockData cusip =
    let D = DataProvider()
    let arr = D.GetPriceSeries(cusip)
    arr

let data = Array.Parallel.map (fun x -> getStockData x) stockCusips
So this approach makes a connection over the internet to my data vendor for each stock (which could be as many as 3000) and returns an array of arrays (one per stock, with a price series for each one). I admittedly don't understand what goes on underneath Array.Parallel.map, but am wondering whether this is a scenario where resources are wasted under the hood, and whether it could actually be faster using asynchronous I/O. To test this out, I have attempted to write the function using asyncs, and I think the function below follows the pattern in Don Syme's article using URLs, but it won't compile with let!.
let getStockDataAsync cusip =
    async {
        let D = DataProvider()
        let! arr = D.GetData(cusip)
        return arr
    }
The error I get is:
This expression was expected to have type Async<'a> but here has type obj
It compiles fine with "let" instead of "let!", but I had thought the whole point was that you need the exclamation point in order for the command to run without blocking a thread.
So the first question really is: what's wrong with my syntax above, in getStockDataAsync? And then, at a higher level, can anyone offer some additional insight about asynchronous I/O and whether the scenario I have presented would benefit from it, making it potentially much, much faster than Array.Parallel.map? Thanks so much.
F# asynchronous workflows allow you to implement asynchronous computations; however, F# makes a distinction between ordinary computations and asynchronous computations. This difference is tracked by the type system. For example, a method that downloads a web page synchronously has the type string -> string (taking a URL and returning the HTML), but a method that does the same thing asynchronously has the type string -> Async<string>. In an async block, you can use let! to call asynchronous operations, but all other (standard synchronous) methods have to be called using let. Now, the problem with your example is that the GetData operation is an ordinary synchronous method, so you cannot invoke it with let!.
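To make the signature distinction concrete, here is a hedged sketch with two hypothetical helpers; AsyncDownloadString is the WebClient extension defined in FSharp.Core's Microsoft.FSharp.Control.WebExtensions module:

open System.Net
open Microsoft.FSharp.Control.WebExtensions   // provides AsyncDownloadString

// synchronous: string -> string, so bind the result with a plain let
let downloadSync (url: string) : string =
    use wc = new WebClient()
    wc.DownloadString(url)

// asynchronous: string -> Async<string>, so bind the result with let! inside an async block
let downloadAsync (url: string) : Async<string> =
    async {
        use wc = new WebClient()
        return! wc.AsyncDownloadString(System.Uri(url))
    }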
In the typical F# scenario, if you want to make the GetData member asynchronous, you'll need to implement it using an asynchronous workflow, so you'll also need to wrap it in an async block. At some point, you will reach a place where you really need to run some primitive operation asynchronously (for example, downloading data from a web site). F# provides several primitive asynchronous operations that you can call from an async block using let!, such as AsyncGetResponse (an asynchronous version of the GetResponse method). So, in your GetData method, you'd for example write something like this:
let GetData (url:string) = async {
    let req = WebRequest.Create(url)
    let! rsp = req.AsyncGetResponse()
    use stream = rsp.GetResponseStream()
    use reader = new System.IO.StreamReader(stream)
    let! html = reader.AsyncReadToEnd()
    return CalculateResult(html) }
The summary is that you need to identify some primitive asynchronous operations (such as waiting for the web server or for the file system), use primitive asynchronous operations at that point and wrap all the code that uses these operations in async blocks. If there are no primitive operations that could be run asynchronously, then your code is CPU-bound and you can just use Parallel.map.
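Once GetData is asynchronous and getStockDataAsync compiles, running it for every cusip in parallel looks roughly like this (a sketch reusing the names from the question):

let data =
    stockCusips
    |> Array.map getStockDataAsync   // one Async<_> per cusip; nothing runs yet
    |> Async.Parallel                // compose them so they run concurrently
    |> Async.RunSynchronously        // execute and collect the array of results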
I hope this helps you understand how F# asynchronous workflows work. For more information, you can for example take a look at Don Syme's blog post, a series about asynchronous programming by Robert Pickering, or my F# webcast.
Tomas already has a great answer. I'll just add a couple of bits.
The idiom for F# asyncs is to name the method with an "Async" prefix (AsyncFoo, not FooAsync; the latter is an idiom already used by another .NET technology). So your functions should be getStockData and asyncGetStockData.
Inside an async workflow, whenever you use let! instead of let or do! instead of do, the thing on the right should have type Async<T> instead of T. Basically you need an existing async computation in order to 'go async' at this point in the workflow. Each Async<T> will itself be either some other async{...} workflow, or else an async "primitive". The primitives are defined in the F# library or created in user code via Async.FromBeginEnd or Async.FromContinuations which enable defining the low-level details of starting a computation, registering an I/O callback, releasing the thread, and then restarting the computation when getting called back. So you have to 'plumb' async all the way down to some truly-async-I/O-primitive in order to get the full benefits of async I/O.
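As a hedged illustration of how such a primitive can be built (the library already ships an AsyncGetResponse, so the name below is deliberately different), Async.FromBeginEnd can wrap a Begin/End pair as an extension member:

open System.Net

type System.Net.WebRequest with
    // hedged sketch: wrap BeginGetResponse/EndGetResponse into an Async<WebResponse> primitive
    member req.AsyncGetResponseSketch() =
        Async.FromBeginEnd(
            (fun (callback: System.AsyncCallback, state: obj) -> req.BeginGetResponse(callback, state)),
            (fun (asyncResult: System.IAsyncResult) -> req.EndGetResponse(asyncResult)))

Inside a workflow this would be consumed with let! rsp = req.AsyncGetResponseSketch(), releasing the thread between BeginGetResponse and the callback.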
