I am using Playwright from F# for web scraping, and I noticed that the result is returned only some of the time.
Here is my code:
let getContent (url: string) =
    task {
        use! playwright = Playwright.CreateAsync()
        let! browser = playwright.Chromium.LaunchAsync()
        printfn "URL %A" url
        let! page = browser.NewPageAsync()
        page.SetDefaultTimeout(15000f)
        let! goto = page.GotoAsync(url)
        let! price = page.Locator("//span[@class='norm-price ng-binding']").AllInnerTextsAsync()
        printfn "Price %A" price
    }
When I run the console program, it sometimes prints a result (a list of prices), but sometimes it just finishes with an empty result.
I really don't know what could be wrong. I also tried using an async wrapper instead of task, but the output is the same.
I increased the timeout to 15 s, but that doesn't help either.
Could it be that you do not await the task returned by getContent?
Maybe the program terminates before writing to the console. If the calling code is not asynchronous (and cannot propagate the task), you could try:
let printContent (url : string) =
    task { ... } |> Async.AwaitTask |> Async.RunSynchronously
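Here is a minimal, self-contained sketch of the effect, with Task.Delay standing in for the browser work. The name slowPrint and the 200 ms delay are made up for illustration. Without the blocking wait on the last line, a console program can reach the end of main and exit before the task has printed anything.

```fsharp
open System.Threading.Tasks

// Hypothetical stand-in for getContent: does some asynchronous work, then yields a value.
let slowPrint (label: string) : Task<string> =
    task {
        do! Task.Delay(200)          // simulates waiting on the browser
        printfn "Done: %s" label
        return label
    }

// Blocks the calling thread until the task completes, so the process cannot exit early.
let result = (slowPrint "prices").GetAwaiter().GetResult()
```

If the caller is itself a task or async workflow, binding the result with let! is preferable to blocking.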
Update 1:
The page probably loads its price data asynchronously. The default timeout on the page specifies a maximum wait; it does not make the browser wait that long for data to arrive. Most likely you'll have to wait for some request to finish or for some element to appear on the page. Can you share the URL publicly?
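As a rough sketch of waiting for an element, assuming the Microsoft.Playwright package is referenced and that the price spans really match the selector from the question, one could wait on the locator before reading it. The function name getPrices is made up for illustration.

```fsharp
open Microsoft.Playwright

let getPrices (url: string) =
    task {
        use! playwright = Playwright.CreateAsync()
        let! browser = playwright.Chromium.LaunchAsync()
        let! page = browser.NewPageAsync()
        let! _ = page.GotoAsync(url)
        let prices = page.Locator("//span[@class='norm-price ng-binding']")
        // Wait until the first matching element is visible; this fails with a
        // timeout error if nothing appears within the page's default timeout.
        do! prices.First.WaitForAsync()
        let! texts = prices.AllInnerTextsAsync()
        return List.ofSeq texts
    }
```

The caller still has to await or block on the task returned by getPrices, otherwise the program can exit before the prices arrive.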
Related
I've spent hours combing through documentation and tutorials, but can't figure out how to use ReactiveX to poll an external resource, or anything for that matter, at a fixed interval. Below is some code I wrote to get information from a REST API at an interval.
open System
open System.Net.Http
open System.Reactive.Linq
module MyObservable =
    let getResources =
        async {
            use client = new HttpClient()
            let! response = client.GetStringAsync("http://localhost:8080/items") |> Async.AwaitTask
            return response
        } |> Async.StartAsTask
    let getObservable (interval: TimeSpan) =
        let f () = getResources.Result
        Observable.Interval(interval)
        |> Observable.map(fun _ -> f ())
To test this out, I tried subscribing to the Observable and waiting five seconds. It does receive something every second for five seconds, but the getResources is only called the first time and then the result is just used at each interval. How can I modify this to make the REST call at each interval instead of just the result of the first call being used over and over again?
let mutable res = Seq.empty
getObservable (new TimeSpan(0,0,1))
|> Observable.subscribe(fun (x: seq<string>) -> res <- res |> Seq.append x;)
|> ignore
Threading.Thread.Sleep(5000)
Don't use a Task. Tasks are what we call "hot", meaning that if you have a value of type Task in your hand, it means that the task is already running, and there is nothing you can do about it. In particular, this means you cannot restart it, or start a second instance of it. Once a Task is created, it's too late.
In your particular case it means that getResources is not "a way to start a task", but just "a task". Already started, already running.
If you want to start a new task every time, you have two alternatives:
First (the worse alternative), you could make getResources a function rather than a value, which you can do by giving it a parameter:
let getResources () =
    async { ...
And then call it with that parameter:
let f () = getResources().Result
This will run the getResources function afresh every time you call f(), which will create a new Task every time and start it.
Second (a better option), don't use a Task at all. You're creating a perfectly good async computation and then turning it into a Task only to block on getting its result. Why? You can block on an async's result just as well!
let getResources = async { ... }
let getObservable interval =
    let f () = getResources |> Async.RunSynchronously
    ...
This works, even though getResources is not a function, because asyncs, unlike Tasks, are what we call "cold". This means that, if you have an async in your hand, it doesn't mean that it's already running. async, unlike Task, represents not an "already running" computation, but rather "a way to start a computation". A corollary is that you can start it multiple times from the same async value.
One way to start it is via Async.RunSynchronously as I'm doing in my example above. This is not the best way, because it blocks the current thread until the computation is done, but it's equivalent to what you were doing with accessing the Task.Result property, which also blocks until the Task is done.
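The hot/cold difference can be seen in a small self-contained sketch. The counters and the names hotTask and coldAsync are made up for illustration; each body just increments a counter and returns it.

```fsharp
let taskRuns = ref 0
let asyncRuns = ref 0

// "Hot": the task body starts running as soon as the value is created, exactly once.
let hotTask =
    task {
        taskRuns.Value <- taskRuns.Value + 1
        return taskRuns.Value
    }

// "Cold": nothing runs until the async is started, and each start runs the body again.
let coldAsync =
    async {
        asyncRuns.Value <- asyncRuns.Value + 1
        return asyncRuns.Value
    }

let t1 = hotTask.Result                        // 1
let t2 = hotTask.Result                        // still 1: same completed task
let a1 = coldAsync |> Async.RunSynchronously   // 1
let a2 = coldAsync |> Async.RunSynchronously   // 2: the body ran again

printfn "task: %d %d, async: %d %d" t1 t2 a1 a2   // prints "task: 1 1, async: 1 2"
```

This is why reading getResources.Result at every interval keeps returning the first response, while running the async value at every interval issues a fresh request.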
I'm trying to scrape some websites that need to run their JavaScript before the document has all the data I'm interested in. I'm trying to open a WebBrowser and wait for the document to load, but I can't get the data when I try to switch back to the thread the WebBrowser is on. Trying to run it without switching back to the thread gives casting errors. = (
What's stopping the async from switching threads? How do I fix this problem?
Script
open System
open System.Windows.Forms
open System.Threading
let step a = do printfn "%A" a
let downloadWebSite (address : Uri) (cont : HtmlDocument -> 'a) =
    let browser = new WebBrowser()
    let ctx = SynchronizationContext.Current
    browser.DocumentCompleted.Add (fun _ ->
        printfn "Document Loaded" )
    async {
        do step 1
        do browser.Navigate(address)
        do step 2
        let! _ = Async.AwaitEvent browser.DocumentCompleted
        do step 3
        do! Async.SwitchToContext ctx
        do step 4
        return cont browser.Document }
let test =
    downloadWebSite (Uri "http://www.google.com") Some
    |> Async.RunSynchronously
Output
>
1
2
Document Loaded
3
# It just hangs here. I have to manually interrupt fsi.
- Interrupt
>
4
The problem with your approach is that RunSynchronously blocks the thread that you are trying to use to run the rest of the asynchronous computation using Async.SwitchToContext ctx.
When using F# Interactive, there is one main thread which runs in the F# Interactive and handles the user interactions. This is the thread that can use Windows Forms controls, so you correctly create WebBrowser outside of async. The waiting for DocumentCompleted happens on a thread pool thread (which runs the async workflow), but when you try to switch back to the main thread, it is already blocked by Async.RunSynchronously.
You can avoid blocking the thread by running a loop that calls Application.DoEvents to process events on the main thread (which will also allow it to run the rest of your async). Your downloadWebSite stays the same, but now you wait using:
let test =
    downloadWebSite (Uri "http://www.google.com") Some
    |> Async.Ignore
    |> Async.StartAsTask
while not test.IsCompleted do
    System.Threading.Thread.Sleep(100)
    System.Windows.Forms.Application.DoEvents()
This is a bit of a hack - and there might be a better way of structuring this if you do not really need to wait for the result (e.g. just return a task and wait before running the next command), but this should do the trick.
I have a COM object that I connect to, and I should receive an event confirming that the connection is established. I write the code and test it in F# Interactive, and for some reason it doesn't catch COM events when I use Async.RunSynchronously.
/// This class wraps a COM event into an F# Async-compatible event
type EikonWatcher(eikon : EikonDesktopDataAPI) =
    let changed = new Event<_>()
    do eikon.add_OnStatusChanged (fun e -> changed.Trigger true)
    member self.StatusChanged = changed.Publish
/// My method
let ``will that connection work?`` () =
    let eikon = EikonDesktopDataAPIClass() :> EikonDesktopDataAPI // create COM object
    let a = async {
        let watcher = EikonWatcher eikon // wrap it
        eikon.Initialize() |> ignore // send connection request
        let! result = Async.AwaitEvent watcher.StatusChanged // waiting event
        printfn "%A" result // printing result
        return result
    }
    // I use either first or second line of code, not both of them
    Async.Start (Async.Ignore a) // does not hang, result prints
    Async.RunSynchronously (Async.Ignore a) // hangs!!!
/// Running
``will that connection work?`` ()
At the same time, code works perfectly well with RunSynchronously when I insert it into console app.
What should I do to prevent this nasty behavior?
The code we write within a single thread (as in an STA) feels like it is made of independent pieces, each with a life of its own, but this is a fallacy: everything is mediated by a common event loop, which "linearizes" the various calls.
So everything we do, unless explicitly specified otherwise, is essentially single-threaded, and you cannot wait for yourself without creating a deadlock.
When you use Async.Start, it starts a new, independent computation that runs on its own, a "thread".
Whereas when you call Async.RunSynchronously, it waits on the same "thread".
Now if the event you are waiting for, which feels like an independent thing, is actually "linearized" by the same event loop, you are in fact waiting for yourself, hence the deadlock.
If you want to wait "asynchronously" (that is, wait for an event without blocking, leaving other tasks the opportunity to do work), you can use the following code within your async block:
async {
    ....
    let! token = myAsyncTask |> Async.StartChild
    let! result = token
    ....
}
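A small self-contained sketch of this pattern follows. Here myAsyncTask is a made-up stand-in that simply sleeps and returns a number, not the COM event wait from the question.

```fsharp
// Hypothetical stand-in for the event wait: completes after a short delay.
let myAsyncTask =
    async {
        do! Async.Sleep 100
        return 42
    }

let workflow =
    async {
        // StartChild starts the computation and immediately returns an Async
        // that can be awaited later to obtain the child's result.
        let! token = myAsyncTask |> Async.StartChild
        // Other work can run here while the child is in progress.
        let other = "did other work"
        let! result = token
        return other, result
    }

let outcome = workflow |> Async.RunSynchronously
printfn "%A" outcome
```

Note that the final Async.RunSynchronously in this sketch still blocks the calling thread; only the waiting inside the workflow is non-blocking.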
I have the following F# program that retrieves a webpage from the internet:
open System.Net
[<EntryPoint>]
let main argv =
    let mutable pageData : byte[] = [| |]
    let fullURI = "http://www.badaddress.xyz"
    let wc = new WebClient()
    try
        pageData <- wc.DownloadData(fullURI)
        ()
    with
    | :? System.Net.WebException as err -> printfn "Web error: \n%s" err.Message
    | exn -> printfn "Unknown exception:\n%s" exn.Message
    0 // return an integer exit code
This works fine if the URI is valid and the machine has an internet connection and the web server responds properly etc. In an ideal functional programming world the results of a function would not depend on external variables not passed as arguments (side effects).
What I would like to know is what is the appropriate F# design pattern to deal with operations which might require the function to deal with recoverable external errors. For example if the website is down one might want to wait 5 minutes and try again. Should parameters like how many times to retry and delays between retries be passed explicitly or is it OK to embed these variables in the function?
In F#, when you want to handle recoverable errors you almost universally want to use the option or the Choice<_,_> type. In practice the only difference between them is that Choice allows you to return some information about the error while option does not. In other words, option is best when it doesn't matter how or why something failed (only that it did fail); Choice<_,_> is used when having information about how or why something failed is important. For example, you might want to write the error information to a log; or perhaps you want to handle an error situation differently based on why something failed -- a great use case for this is providing accurate error messages to help users diagnose a problem.
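To make the distinction concrete, here is a small sketch unrelated to web requests. The function names tryParseInt, parseInt and describe are made up for illustration.

```fsharp
open System

// option: the caller learns only that parsing failed.
let tryParseInt (s: string) : int option =
    match Int32.TryParse(s) with
    | true, n -> Some n
    | false, _ -> None

// Choice: the failure case carries a description of what went wrong.
let parseInt (s: string) : Choice<int, string> =
    match Int32.TryParse(s) with
    | true, n -> Choice1Of2 n
    | false, _ -> Choice2Of2 (sprintf "'%s' is not a valid integer" s)

// The caller can report the reason for a failure, not just the fact of it.
let describe (input: string) =
    match parseInt input with
    | Choice1Of2 n -> sprintf "parsed %d" n
    | Choice2Of2 msg -> sprintf "failed: %s" msg
```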
With that in mind, here's how I'd refactor your code to handle failures in a clean, functional style:
open System
open System.Net
/// checkNonNull is not defined in this snippet; this is a minimal stand-in
/// that raises an ArgumentNullException if the argument is null.
let checkNonNull (argName : string) (value : 'T) =
    if isNull (box value) then nullArg argName
/// Retrieves the content at the given URI.
let retrievePage (client : WebClient) (uri : Uri) =
    // Preconditions
    checkNonNull "uri" uri
    if not <| uri.IsAbsoluteUri then
        invalidArg "uri" "The URI must be an absolute URI."
    try
        // If the data is retrieved successfully, return it.
        client.DownloadData uri
        |> Choice1Of2
    with
    | :? System.Net.WebException as webExn ->
        // Return the URI and WebException so they can be used to diagnose the problem.
        Choice2Of2 (uri, webExn)
    | _ ->
        // Reraise any other exceptions -- we don't want to handle them here.
        reraise ()
/// Retrieves the content at the given URI.
/// If a WebException is raised when retrieving the content, the request
/// will be retried up to a specified number of times.
let rec retrievePageRetry (retryWaitTime : TimeSpan) remainingRetries (client : WebClient) (uri : Uri) =
    // Preconditions
    checkNonNull "uri" uri
    if not <| uri.IsAbsoluteUri then
        invalidArg "uri" "The URI must be an absolute URI."
    elif remainingRetries = 0u then
        invalidArg "remainingRetries" "The number of retries must be greater than zero (0)."
    // Try to retrieve the page.
    match retrievePage client uri with
    | Choice1Of2 _ as result ->
        // Successfully retrieved the page. Return the result.
        result
    | Choice2Of2 _ as error ->
        // Decrement the number of retries.
        let retries = remainingRetries - 1u
        // If there are no retries left, return the error along with the URI
        // for diagnostic purposes; otherwise, wait a bit and try again.
        if retries = 0u then error
        else
            // NOTE : If this is modified to use 'async', you MUST
            // change this to use 'Async.Sleep' here instead!
            System.Threading.Thread.Sleep retryWaitTime
            // Try retrieving the page again.
            retrievePageRetry retryWaitTime retries client uri
[<EntryPoint>]
let main argv =
    /// WebClient used for retrieving content.
    use wc = new WebClient ()
    /// The amount of time to wait before re-attempting to fetch a page.
    let retryWaitTime = TimeSpan.FromSeconds 2.0
    /// The maximum number of times we'll try to fetch each page.
    let maxPageRetries = 3u
    /// The URI to fetch.
    let fullURI = Uri ("http://www.badaddress.xyz", UriKind.Absolute)
    // Fetch the page data.
    match retrievePageRetry retryWaitTime maxPageRetries wc fullURI with
    | Choice1Of2 pageData ->
        printfn "Retrieved %u bytes from: %O" (Array.length pageData) fullURI
        0 // Success
    | Choice2Of2 (uri, error) ->
        printfn "Unable to retrieve the content from: %O" uri
        printfn "HTTP Status: (%i) %O" (int error.Status) error.Status
        printfn "Message: %s" error.Message
        1 // Failure
Basically, I split your code out into two functions, plus the original main:
One function that attempts to retrieve the content from a specified URI.
One function containing the logic for retrying attempts; this 'wraps' the first function which performs the actual requests.
The original main function now only handles 'settings' (which you could easily pull from an app.config or web.config) and printing the final results. In other words, it's oblivious to the retrying logic -- you could modify the single line of code with the match statement and use the non-retrying request function instead if you wanted.
If you want to pull content from multiple URIs AND wait for a significant amount of time (e.g., 5 minutes) between retries, you should modify the retrying logic to use a priority queue or something instead of using Thread.Sleep or Async.Sleep.
Shameless plug: my ExtCore library contains some things to make your life significantly easier when building something like this, especially if you want to make it all asynchronous. Most importantly, it provides an asyncChoice workflow and collections functions designed to work with it.
As for your question about passing in parameters (like the retry timeout and number of retries) -- I don't think there's a hard-and-fast rule for deciding whether to pass them in or hard-code them within the function. In most cases, I prefer to pass them in, though if you have more than a few parameters to pass in, you're better off creating a record to hold them all and passing that instead. Another approach I've used is to make the parameters option values, where the defaults are pulled from a configuration file (though you'll want to pull them from the file once and assign them to some private field to avoid re-parsing the configuration file each time your function is called); this makes it easy to modify the default values you've used in your code, but also gives you the flexibility of overriding them when necessary.
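As a sketch of the record approach combined with an asynchronous retry loop: the RetrySettings type, retryAsync, and the flaky test operation below are all made up for illustration, and the operation is any function returning an Async of Choice rather than the WebClient call above.

```fsharp
open System

/// Hypothetical record grouping the retry parameters.
type RetrySettings =
    { WaitTime : TimeSpan
      MaxAttempts : int }

/// Runs 'operation' until it returns Choice1Of2 or the attempts are used up,
/// pausing with Async.Sleep (which does not block a thread) between attempts.
let rec retryAsync (settings : RetrySettings) (operation : unit -> Async<Choice<'T, 'Error>>) =
    async {
        let! result = operation ()
        match result with
        | Choice1Of2 _ -> return result
        | Choice2Of2 _ when settings.MaxAttempts <= 1 -> return result
        | Choice2Of2 _ ->
            do! Async.Sleep (int settings.WaitTime.TotalMilliseconds)
            return! retryAsync { settings with MaxAttempts = settings.MaxAttempts - 1 } operation
    }

// Usage: a fake operation that fails twice and succeeds on the third attempt.
let attempts = ref 0
let flaky () =
    async {
        attempts.Value <- attempts.Value + 1
        if attempts.Value < 3 then return Choice2Of2 "not yet"
        else return Choice1Of2 attempts.Value
    }

let settings = { WaitTime = TimeSpan.FromMilliseconds 10.0; MaxAttempts = 5 }
let final = retryAsync settings flaky |> Async.RunSynchronously
```

Passing a single record keeps the function signature short, and the copy-and-update expression makes it easy to count down the remaining attempts without mutable state.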
Knowing an RPC call to a server method that returns unit is a message passing call, I want to force the call to be asynchronous and be able to fire the next server call only after the first one has gone to the server.
Server code:
[<Rpc>]
let FirstCall value =
    printfn "%s" value
    async.Zero()
[<Rpc>]
let SecondCall() =
    "test"
Client code:
|>! OnClick (fun _ _ -> async {
    do! Server.FirstCall "test"
    do Server.SecondCall() |> ignore
} |> Async.Start)
This seems to crash on the client because the call returns unit. Replacing the server and client code with:
[<Rpc>]
let FirstCall value =
    printfn "%s" value
    async { return () }
let! _ = Server.FirstCall "test"
didn't fix the problem, while the following did:
[<Rpc>]
let FirstCall value =
    printfn "%s" value
    async { return "" }
let! _ = Server.FirstCall "test"
Is there another way to force a message passing call to be asynchronous instead?
This is most definitely a bug. I added it here:
https://bugs.intellifactory.com/websharper/show_bug.cgi?id=468
Your approach is completely legit. Your workaround is also probably the best for now, i.e., instead of returning Async<unit>, return Async<int> with a zero and ignore it.
We are busy with preparing the 2.4 release due next week and the fix will make it there. Thanks!
Also, in 2.4 we'll be dropping synchronous calls, so you will have to use Async throughout for RPC, as discussed in https://bugs.intellifactory.com/websharper/show_bug.cgi?id=467 -- primarily motivated by new targets (Android and WP7) that do not support sync AJAX.