I'm trying to scrape some websites that need to run their JavaScript before the document has all the data I'm interested in. I'm trying to open a WebBrowser and wait for the document to load, but I can't get the data when I try to switch back to the thread the WebBrowser is on. Trying to run it without switching back to the thread gives casting errors. = (
What's stopping the async from switching threads? How do I fix this problem?
Script
open System
open System.Windows.Forms
open System.Threading
let step a = do printfn "%A" a
let downloadWebSite (address : Uri) (cont : HtmlDocument -> 'a) =
    let browser = new WebBrowser()
    let ctx = SynchronizationContext.Current
    browser.DocumentCompleted.Add (fun _ ->
        printfn "Document Loaded" )
    async {
        do step 1
        do browser.Navigate(address)
        do step 2
        let! _ = Async.AwaitEvent browser.DocumentCompleted
        do step 3
        do! Async.SwitchToContext ctx
        do step 4
        return cont browser.Document }
let test =
    downloadWebSite (Uri "http://www.google.com") Some
    |> Async.RunSynchronously
Output
>
1
2
Document Loaded
3
# It just hangs here. I have to manually interrupt fsi.
- Interrupt
>
4
The problem with your approach is that RunSynchronously blocks the thread that you are trying to use to run the rest of the asynchronous computation using Async.SwitchToContext ctx.
When using F# Interactive, there is one main thread, which runs F# Interactive itself and handles user interaction. This is the thread that can use Windows Forms controls, so you correctly create the WebBrowser outside of the async. Waiting for DocumentCompleted happens on a thread pool thread (which runs the async workflow), but when you try to switch back to the main thread, it is already blocked by Async.RunSynchronously.
You can avoid blocking the thread by running a loop that calls Application.DoEvents to process events on the main thread (which will also allow it to run the rest of your async). Your downloadWebSite stays the same, but now you wait using:
let test =
    downloadWebSite (Uri "http://www.google.com") Some
    |> Async.Ignore
    |> Async.StartAsTask
while not test.IsCompleted do
    System.Threading.Thread.Sleep(100)
    System.Windows.Forms.Application.DoEvents()
This is a bit of a hack - and there might be a better way of structuring this if you do not really need to wait for the result (e.g. just return a task and wait before running the next command), but this should do the trick.
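If you do not need to block at all, one option is to start the workflow with callbacks instead of waiting on it. The following is a minimal sketch using Async.StartWithContinuations from FSharp.Core. Here `work` is a hypothetical stand-in for `downloadWebSite (Uri "...") Some`, so the snippet runs without Windows Forms, and `finished`, `outcome` and `report` exist only so the demo can observe completion.

```fsharp
open System.Threading

// Hypothetical stand-in for `downloadWebSite (Uri "...") Some`;
// any Async<'a> can be started the same way.
let work =
    async {
        do! Async.Sleep 100
        return 42 }

// Demo-only plumbing: lets this script observe that a callback ran.
let finished = new ManualResetEventSlim(false)
let outcome = ref "pending"

let report (text : string) =
    outcome.Value <- text
    finished.Set()

// Returns as soon as the workflow reaches its first suspension point,
// so the calling thread is not blocked while the work is in progress.
Async.StartWithContinuations(
    work,
    (fun result -> report (sprintf "result %d" result)),
    (fun ex -> report (sprintf "failed: %s" ex.Message)),
    (fun _ -> report "cancelled"))

// Only the demo waits here; in F# Interactive you would return to the prompt.
finished.Wait()
printfn "%s" outcome.Value
```

In F# Interactive the call should return to the prompt right away, which leaves the main thread free to run the part of the workflow that follows Async.SwitchToContext.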
Related
I am using Playwright in F# for web scraping and I noticed that results are returned inconsistently.
I have this code.
let getContent (url:string) =
    task{
        use! playwright = Playwright.CreateAsync()
        let! browser = playwright.Chromium.LaunchAsync()
        printfn "URL %A" url
        let! page = browser.NewPageAsync()
        page.SetDefaultTimeout(15000f)
        let! goto = page.GotoAsync(url)
        let! price = page.Locator("//span[@class='norm-price ng-binding']").AllInnerTextsAsync()
        printfn "Price %A" price
    }
When I run the console program, it sometimes returns a result (a list of prices), but sometimes it just finishes with an empty result.
I really don't know what could be wrong. I also tried using the async wrapper instead of task, but the output is the same.
I increased the timeout to 15 s, but that doesn't help either.
Could it be that you do not await the task returned by getContent?
Maybe the program terminates before writing to the console. If the calling code is not asynchronous (and cannot propagate the task), you could try:
let printContent (url : string) =
    task { ... } |> Async.AwaitTask |> Async.RunSynchronously
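As a rough illustration of blocking on the task from non-asynchronous calling code, here is a self-contained sketch. The `getContent` below is a hypothetical stand-in for the Playwright version: the delay simulates browser work, and the returned string is made up for the demo.

```fsharp
open System.Threading.Tasks

// Hypothetical stand-in for the Playwright-based getContent:
// the delay simulates work that finishes after the call has returned.
let getContent (url : string) =
    task {
        do! Task.Delay 200
        return sprintf "content of %s" url }

// Blocking at the outermost, non-async caller keeps the process alive
// until the task has produced its value.
let content =
    getContent "http://example.com"
    |> Async.AwaitTask
    |> Async.RunSynchronously

printfn "%s" content
```

Without the blocking call, a console program can reach the end of its entry point while the task is still running, and nothing gets printed.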
Update 1:
Probably the page loads its price data asynchronously. The default timeout on the page specifies a maximum wait; it does not make the page wait that long for data to arrive in the controlled browser instance. Most likely you'll have to wait for some request to finish or for some element to appear on the page. Can you share the URL publicly?
I've spent hours combing through documentation and tutorials, but can't figure out how to use ReactiveX to poll an external resource, or anything for that matter, at an interval. Below is some code I wrote to get information from a REST API at an interval.
open System
open System.Net.Http
open System.Reactive.Linq
module MyObservable =
    let getResources =
        async {
            use client = new HttpClient()
            let! response = client.GetStringAsync("http://localhost:8080/items") |> Async.AwaitTask
            return response
        } |> Async.StartAsTask
    let getObservable (interval: TimeSpan) =
        let f () = getResources.Result
        Observable.Interval(interval)
        |> Observable.map(fun _ -> f ())
To test this out, I tried subscribing to the Observable and waiting five seconds. It does receive something every second for five seconds, but the getResources is only called the first time and then the result is just used at each interval. How can I modify this to make the REST call at each interval instead of just the result of the first call being used over and over again?
let mutable res = Seq.empty
getObservable (new TimeSpan(0,0,1))
|> Observable.subscribe(fun (x: seq<string>) -> res <- res |> Seq.append x;)
|> ignore
Threading.Thread.Sleep(5000)
Don't use a Task. Tasks are what we call "hot", meaning that if you have a value of type Task in your hand, it means that the task is already running, and there is nothing you can do about it. In particular, this means you cannot restart it, or start a second instance of it. Once a Task is created, it's too late.
In your particular case it means that getResources is not "a way to start a task", but just "a task". Already started, already running.
If you want to start a new task every time, you have two alternatives:
First (the worse alternative), you could make getResources a function rather than a value, which you can do by giving it a parameter:
let getResources () =
    async { ...
And then call it with that parameter:
let f () = getResources().Result
This will run the getResources function afresh every time you call f(), which will create a new Task every time and start it.
Second (a better option), don't use a Task at all. You're creating a perfectly good async computation and then turning it into a Task only to block on getting its result. Why? You can block on an async's result just as well!
let getResources = async { ... }
let getObservable interval =
    let f () = getResources |> Async.RunSynchronously
    ...
This works, even though getResources is not a function, because asyncs, unlike Tasks, are what we call "cold". This means that, if you have an async in your hand, it doesn't mean that it's already running. async, unlike Task, represents not an "already running" computation, but rather "a way to start a computation". A corollary is that you can start it multiple times from the same async value.
One way to start it is via Async.RunSynchronously as I'm doing in my example above. This is not the best way, because it blocks the current thread until the computation is done, but it's equivalent to what you were doing with accessing the Task.Result property, which also blocks until the Task is done.
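The hot/cold difference can be seen in a small self-contained experiment. This is only a sketch: the counters `taskRuns` and `asyncRuns` are hypothetical and stand in for the REST call, so each "request" simply returns how many times its body has run so far.

```fsharp
// Count how many times each body actually executes.
let taskRuns = ref 0
let asyncRuns = ref 0

// Hot: Async.StartAsTask starts the body right away, exactly once.
let hot =
    async {
        taskRuns.Value <- taskRuns.Value + 1
        return taskRuns.Value }
    |> Async.StartAsTask

// Cold: nothing runs until the async value is started.
let cold =
    async {
        asyncRuns.Value <- asyncRuns.Value + 1
        return asyncRuns.Value }

// Reading .Result repeatedly returns the one stored result.
let hotResults = [ for _ in 1 .. 3 -> hot.Result ]

// Each RunSynchronously call executes the body again.
let coldResults = [ for _ in 1 .. 3 -> Async.RunSynchronously cold ]

printfn "task results:  %A (body ran %d time(s))" hotResults taskRuns.Value
printfn "async results: %A (body ran %d time(s))" coldResults asyncRuns.Value
```

The task yields the same value three times because its body ran once, while the async yields 1, 2 and 3 because it ran once per start. This is the behavior you want for a poll that must hit the server on every tick.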
I found a variety of SO questions on this but couldn't figure out an F# solution. I need to block wait for an event to fire at me to check the data it returns. I am using Rx to receive event 3 times:
let disposable =
    Observable.take 3 ackNack
    |> Observable.subscribe (
        fun (sender, data) ->
            Console.WriteLine("{0}", data.AckNack)
            Assert.True(data.TotalAckCount > 0u)
    )
I would like to either turn results into a list, so they can be checked later on by the test framework (xUnit), or wait for all 3 events to complete and pass the Assert.True.
How would I wait for 3 events to fire before continuing? I can see there's an Observable.wait; other sources suggest Async.RunSynchronously.
I think the easiest option is to use the Async.AwaitObservable function. Sadly, this is not yet available in the F# core library, but you can get it from the FSharpx.Async package, or just copy the function source from GitHub.
Using the function, you should be able to write something like:
let _, data =
    Observable.take 3 ackNack
    |> Async.AwaitObservable
    |> Async.RunSynchronously
Console.WriteLine("{0}", data.AckNack)
Assert.True(data.TotalAckCount > 0u)
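If the goal is the list of results mentioned in the question, something along the following lines could work with FSharp.Core alone. This is a sketch under assumptions: `collect` is a hypothetical helper (not a library function), and an `Event<int>` stands in for `ackNack` so that the snippet is self-contained.

```fsharp
open System
open System.Threading.Tasks

/// Hypothetical helper: an async that completes with the first n values
/// observed on the source, in arrival order.
let collect (n : int) (source : IObservable<'a>) : Async<'a list> =
    async {
        let tcs = TaskCompletionSource<'a list>()
        let items = ResizeArray<'a>()
        // Subscribed when the async starts, disposed when it finishes.
        use _subscription =
            source
            |> Observable.subscribe (fun x ->
                lock items (fun () ->
                    if items.Count < n then
                        items.Add x
                        if items.Count = n then
                            tcs.TrySetResult(List.ofSeq items) |> ignore))
        return! Async.AwaitTask tcs.Task }

// Stand-in for ackNack; in the real test this is the event under test.
let ackNack = Event<int>()

// StartImmediateAsTask runs up to the first suspension on this thread,
// so the subscription is in place before any event is raised.
let pending = collect 3 ackNack.Publish |> Async.StartImmediateAsTask

for i in 1 .. 5 do
    ackNack.Trigger i

let received = pending.Result
printfn "%A" received
```

The resulting list can then be checked with ordinary assertions once the wait is over, which keeps the Assert calls out of the subscription callback.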
I have a COM object, which I connect to, and I should receive an event confirming that the connection is established. I write code and test it in F# Interactive, and for some reason it doesn't catch COM events when I use Async.RunSynchronously.
/// This class wraps COM event into F# Async-compatible event
type EikonWatcher(eikon : EikonDesktopDataAPI) =
    let changed = new Event<_>()
    do eikon.add_OnStatusChanged (fun e -> changed.Trigger true)
    member self.StatusChanged = changed.Publish
/// My method
let ``will that connection work?`` () =
    let eikon = EikonDesktopDataAPIClass() :> EikonDesktopDataAPI // create COM object
    let a = async {
        let watcher = EikonWatcher eikon // wrap it
        eikon.Initialize() |> ignore // send connection request
        let! result = Async.AwaitEvent watcher.StatusChanged // waiting event
        printfn "%A" result // printing result
        return result
    }
    // I use either first or second line of code, not both of them
    Async.Start (Async.Ignore a) // does not hang, result prints
    Async.RunSynchronously (Async.Ignore a) // hangs!!!
/// Running
``will that connection work?`` ()
At the same time, code works perfectly well with RunSynchronously when I insert it into console app.
What should I do to prevent that nasty behavior?
The code we write within a single thread (as in an STA) feels like it is made of independent pieces, each with a life of its own, but this is a fallacy: everything is mediated by a common event loop, which "linearizes" the various calls.
So everything we do, unless explicitly specified otherwise, is essentially single-threaded, and you cannot wait for yourself without creating a deadlock.
When you call Async.Start, it starts a new, independent computation that runs on its own, a "thread".
Whereas when you call Async.RunSynchronously, it waits on the same "thread".
Now if the event you are waiting for, which feels like an independent thing, is actually "linearized" by the same event loop, you are waiting for yourself, hence the deadlock.
If you want to wait "asynchronously" (that is, wait for an event without blocking, leaving other tasks the opportunity to do work), you can use the following code within your async block:
async {
    ....
    let! token = myAsyncTask |> Async.StartChild
    let! result = token
    ....
}
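A filled-in version of that pattern might look like the sketch below. `slowOperation` is a hypothetical stand-in for `myAsyncTask`, and the values are made up for the demo. Note that the final Async.RunSynchronously is there only so the script can print a result: it still blocks the calling thread, so in F# Interactive the outermost start would need to be non-blocking if the event is delivered on that same thread.

```fsharp
// Hypothetical stand-in for the operation being awaited.
let slowOperation =
    async {
        do! Async.Sleep 200
        return 42 }

let workflow =
    async {
        // StartChild begins slowOperation and hands back an async
        // that completes when the child does.
        let! pending = Async.StartChild slowOperation
        // Other work can run here while the child is in progress.
        let other = 1
        // Awaiting the handle suspends the workflow without blocking a thread.
        let! result = pending
        return result + other }

// Demo only: block here so the script can print the combined value.
let total = Async.RunSynchronously workflow
printfn "%d" total
```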
Knowing an RPC call to a server method that returns unit is a message passing call, I want to force the call to be asynchronous and be able to fire the next server call only after the first one has gone to the server.
Server code:
[<Rpc>]
let FirstCall value =
    printfn "%s" value
    async.Zero()
[<Rpc>]
let SecondCall() =
    "test"
Client code:
|>! OnClick (fun _ _ -> async {
    do! Server.FirstCall "test"
    do Server.SecondCall() |> ignore
} |> Async.Start)
This seems to crash on the client because the call returns unit. Replacing the server and client code with:
[<Rpc>]
let FirstCall value =
    printfn "%s" value
    async { return () }
let! _ = Server.FirstCall "test"
Didn't fix the problem, while the following did:
[<Rpc>]
let FirstCall value =
    printfn "%s" value
    async { return "" }
let! _ = Server.FirstCall "test"
Is there another way to force a message passing call to be asynchronous instead?
This is most definitely a bug. I added it here:
https://bugs.intellifactory.com/websharper/show_bug.cgi?id=468
Your approach is completely legit. Your workaround is also probably the best for now, i.e. instead of returning Async<unit>, return Async<int> with a zero and ignore it.
We are busy with preparing the 2.4 release due next week and the fix will make it there. Thanks!
Also, in 2.4 we'll be dropping synchronous calls, so you will have to use Async throughout for RPC, as discussed in https://bugs.intellifactory.com/websharper/show_bug.cgi?id=467 -- primarily motivated by new targets (Android and WP7) that do not support sync AJAX.