Webcrawler - Fetch links - f#

I'm trying to crawl a webpage, collect all of its links, and add them to a list<string> that the function returns at the end.
My code:
let getUrls s : seq<string> =
    let doc = new HtmlDocument() in
    doc.LoadHtml s
    doc.DocumentNode.SelectNodes "//a[@href]"
    |> Seq.map(fun z -> (string z.Attributes.["href"]))

let crawler uri : seq<string> =
    let rec crawl url =
        let web = new WebClient()
        let data = web.DownloadString url
        getUrls data |> Seq.map crawl (* <-- ERROR HERE *)
    crawl uri
The problem is that the last line of the crawl function (getUrls data |> Seq.map crawl) fails to compile with this error:
Type mismatch. Expecting a string -> 'a but given a string -> seq<'a>
The resulting type would be infinite when unifying ''a' and 'seq<'a>'

crawl is expected to return seq<string>, but Seq.map crawl makes its result a sequence of its own results, which is the infinite type in the error. I think you want something like:
let crawler uri =
    let rec crawl url =
        seq {
            let web = new WebClient()
            let data = web.DownloadString url
            for url in getUrls data do
                yield url
                yield! crawl url
        }
    crawl uri
Adding a type annotation to crawl should point out the issue.
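As a side note, the sequence-based crawl above follows every link it finds, so pages that link back to each other would be fetched again and again. A minimal sketch of one way to avoid that, assuming an in-memory map (fakeWeb, a made-up stand-in for real downloads) and a hypothetical crawlOnce helper that remembers visited URLs:

```fsharp
open System.Collections.Generic

// Hypothetical in-memory "web": page -> links found on that page.
// It replaces WebClient and getUrls so the sketch runs on its own.
let fakeWeb =
    Map.ofList [
        "a", [ "b"; "c" ]
        "b", [ "a"; "d" ]
        "c", []
        "d", []
    ]

let getLinks url =
    fakeWeb |> Map.tryFind url |> Option.defaultValue []

// Depth-first crawl that yields each URL at most once.
let crawlOnce start : string list =
    let visited = HashSet<string>([ start ])
    let rec crawl url =
        seq {
            for link in getLinks url do
                // HashSet.Add returns false when the link was already seen.
                if visited.Add link then
                    yield link
                    yield! crawl link
        }
    // Materialize the lazy sequence so the mutable set is used only once.
    crawl start |> Seq.toList

printfn "%A" (crawlOnce "a")
```

With this sample map, crawlOnce "a" evaluates to ["b"; "d"; "c"], even though "b" links back to "a".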

I think something like this:
let crawler (uri : seq<string>) =
    let rec crawl url =
        let data = Seq.empty
        getUrls data
        |> Seq.toList
        |> function
            | h :: t ->
                crawl h
                t |> List.iter crawl
            | _-> ()
    crawl uri

In order to fetch links:
open System.Net
open System.IO
open System.Text.RegularExpressions

type Url(x:string)=
    member this.tostring = sprintf "%A" x
    member this.request = System.Net.WebRequest.Create(x)
    member this.response = this.request.GetResponse()
    member this.stream = this.response.GetResponseStream()
    member this.reader = new System.IO.StreamReader(this.stream)
    member this.html = this.reader.ReadToEnd()

let linkex = "href=\s*\"[^\"h]*(http://[^&\"]*)\""

let getLinks (txt:string) = [
    for m in Regex.Matches(txt,linkex)
        -> m.Groups.Item(1).Value
    ]

let collectLinks (url:Url) =
    url.html
    |> getLinks

Related

F# Fable Fetch Correctly Unwrap Promise to another Promise type

I need to request data from several URLs and then use the results.
I am using plain Fable 3 with the Fable-Promise and Fable-Fetch libraries.
I have worked out how to fetch from multiple URLs and combine the results into a single Promise that I can then use to update the UI (the multiple results need to be drawn only once).
But if one of the fetches fails, the whole thing falls over. Ideally I'd like to use tryFetch and then propagate the Result<TermData, None | Exception>, but nothing I try compiles.
How exactly do I use tryFetch and then unwrap the result with a second let! in the CE? (The comments explain more)
module App

open Browser.Dom
open App
open System.Collections.Generic
open System.Text.RegularExpressions
open Fetch
open System

type TermData =
    abstract counts : int []
    abstract scores : int []
    abstract term : string
    abstract allWords : bool

type QueryTerm =
    { mutable Term: string
      mutable AllWords: bool }

let loadSingleSeries (term: QueryTerm) =
    promise {
        let url =
            $"/api/plot/{term.Term}?allWords={term.AllWords}"
        // Works but doesn't handle errors.
        let! plotData = fetch url [] // type of plotData: Response
        // let plotDataResult = tryFetch url []
        // This ideally becomes Promise<Result<TermData, None>>
        // let unwrapped = match plotDataResult with
        // | Ok res -> Ok (res.json<TermData>()) // type: Promise<TermData>
        // | Error err -> ??? // tried Error (Promise.create(fun resolve reject -> resolve None)) among others
        let! result = plotData.json<TermData>() // type of result: TermData
        return result
    }

let dataArrays =
    parsed // type Dictionary<int, TermData>
    |> Seq.map (fun term -> loadSingleSeries term.Value)
    |> Promise.Parallel
    |> Promise.map (fun allData -> console.log(allData))
    // Here we will handle None when we have it
I don't have much Fable experience, but if I understand your question correctly, I think the following should work:
let loadSingleSeries (term: QueryTerm) =
    promise {
        let url =
            $"/api/plot/{term.Term}?allWords={term.AllWords}"
        let! plotDataResult = tryFetch url []
        match plotDataResult with
        | Ok resp ->
            let! termData = resp.json<TermData>()
            return Ok termData
        | Error ex ->
            return Error ex
    }
The idea here is that if you get an error, you simply propagate that error in the new Result value. This returns a Promise<Result<TermData, Exception>>, as you requested.
Update: Fixed return type using a second let!.
I haven't run this code, but looking at the docs, it looks like you need to use Promise.catch:
let loadSingleSeries (term: QueryTerm) =
    promise {
        let url =
            $"/api/plot/{term.Term}?allWords={term.AllWords}"
        let! plotDataResult =
            fetch url []
            |> Promise.map Ok // Wraps the "happy path" in a Result.Ok
            |> Promise.catch (fun err ->
                //handle the error
                Error err)
        return
            match plotDataResult with
            | Ok res -> ...
            | Error err -> ...
    }
I ended up having to use the pipeline rather than CE approach for this as follows:
let loadSingleSeries (term: QueryTerm) =
    let url =
        $"/api/plot/{term.Term}?allWords={term.AllWords}"
    let resultPromise =
        fetch url []
        |> Promise.bind (fun response ->
            let arr = response.json<TermData> ()
            arr)
        |> Promise.map (Ok)
        |> Promise.catch (Error)
    resultPromise
The key was using Promise.bind to chain the promise that yields the Response into the Promise<TermData> returned by json. The map and catch then convert that into a Promise<Result<TermData, exn>>.
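Once every request yields a Result, the array produced by Promise.Parallel can be split into successes and failures before anything is drawn. A small sketch in plain F#, where splitResults is a hypothetical helper name and the integer sample values merely stand in for TermData:

```fsharp
// Hypothetical helper: separate successful values from errors.
let splitResults (results: Result<'T, exn> []) =
    let values =
        results |> Array.choose (function Ok v -> Some v | Error _ -> None)
    let errors =
        results |> Array.choose (function Error e -> Some e | Ok _ -> None)
    values, errors

// Sample data standing in for the resolved promise results.
let sample : Result<int, exn> [] =
    [| Ok 1; Error (System.Exception("request failed")); Ok 3 |]

let values, errors = splitResults sample
printfn "%d succeeded, %d failed" values.Length errors.Length
```

In the Fable code this function could be passed to Promise.map after Promise.Parallel, so the UI is drawn once from the values while the errors are reported separately.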

how can I combine / compose computation expressions, in F#?

This is not for a practical need, but rather to try to learn something.
I am using FsToolkit's asyncResult expression, which is very handy, and I would like to know whether there is a way to 'combine' expressions, such as async and result here, or whether a custom expression has to be written.
Here is an example of my function to set the ip to a subdomain, with CloudFlare:
let setSubdomainToIpAsync zoneName url ip =

    let decodeResult (r: CloudFlareResult<'a>) =
        match r.Success with
        | true -> Ok r.Result
        | false -> Error r.Errors.[0].Message

    let getZoneAsync (client: CloudFlareClient) =
        asyncResult {
            let! r = client.Zones.GetAsync()
            let! d = decodeResult r
            return!
                match d |> Seq.filter (fun x -> x.Name = zoneName) |> Seq.toList with
                | z::_ -> Ok z // take the first one
                | _ -> Error $"zone '{zoneName}' not found"
        }

    let getRecordsAsync (client: CloudFlareClient) zoneId =
        asyncResult {
            let! r = client.Zones.DnsRecords.GetAsync(zoneId)
            return! decodeResult r
        }

    let updateRecordAsync (client: CloudFlareClient) zoneId (records: DnsRecord seq) =
        asyncResult {
            return!
                match records |> Seq.filter (fun x -> x.Name = url) |> Seq.toList with
                | r::_ -> client.Zones.DnsRecords.UpdateAsync(zoneId, r.Id, ModifiedDnsRecord(Name = url, Content = ip, Type = DnsRecordType.A, Proxied = true))
                | [] -> client.Zones.DnsRecords.AddAsync(zoneId, NewDnsRecord(Name = url, Content = ip, Proxied = true))
        }

    asyncResult {
        use client = new CloudFlareClient(Credentials.CloudFlare.Email, Credentials.CloudFlare.Key)
        let! zone = getZoneAsync client
        let! records = getRecordsAsync client zone.Id
        let! update = updateRecordAsync client zone.Id records
        return! decodeResult update
    }
It is interfacing with a C# lib that handles all the calls to the CloudFlare API and returns a CloudFlareResult object which has a success flag, a result and an error.
I remapped that type to a Result<'a, string> type:
let decodeResult (r: CloudFlareResult<'a>) =
    match r.Success with
    | true -> Ok r.Result
    | false -> Error r.Errors.[0].Message
And I could write an expression for it (hypothetically since I've been using them but haven't written my own yet), but then I would be happy to have an asyncCloudFlareResult expression, or even an asyncCloudFlareResultOrResult expression, if that makes sense.
I am wondering if there is a mechanism to combine expressions together, the same way FsToolkit does (although I suspect it's just custom code there).
Again, this is a question to learn something, not about the practicality since it would probably add more code than it's worth.
Following Gus' comment, I realized it would be good to illustrate the point with some simpler code:
function DoA : int -> Async<AWSCallResult<int, string>>
function DoB : int -> Async<Result<int, string>>
AWSCallResultAndResult {
    let! a = DoA 3
    let! b = DoB a
    return b
}
In this example I would end up with two types that can take an int and return an error string, but they are different. Both have their own expressions, so I can chain them as needed.
And the original question is about how these can be combined together.
It's possible to extend CEs with overloads.
The example below makes it possible to use the CustomResult type with a usual result builder.
open FsToolkit.ErrorHandling

type CustomResult<'T, 'TError> =
    { IsError: bool
      Error: 'TError
      Value: 'T }

type ResultBuilder with
    member inline _.Source(result : CustomResult<'T, 'TError>) =
        if result.IsError then
            Error result.Error
        else
            Ok result.Value

let computeA () = Ok 42
let computeB () = Ok 23
let computeC () =
    { CustomResult.Error = "oops. This went wrong"
      CustomResult.IsError = true
      CustomResult.Value = 64 }

let computedResult =
    result {
        let! a = computeA ()
        let! b = computeB ()
        let! c = computeC ()
        return a + b + c
    }
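The same Source idea can be tried without FsToolkit. A minimal sketch, assuming a hand-rolled builder (MiniResultBuilder) and a made-up ApiResult record; neither name comes from a library, and the error type is fixed to string to keep it short:

```fsharp
// Hypothetical record standing in for a third-party result type.
type ApiResult<'T> =
    { Success: bool
      Payload: 'T
      Message: string }

// Bare-bones result builder, written only for this illustration.
type MiniResultBuilder() =
    member _.Return(x: 'T) : Result<'T, string> = Ok x
    member _.ReturnFrom(r: Result<'T, string>) = r
    member _.Bind(r: Result<'T, string>, f: 'T -> Result<'U, string>) =
        Result.bind f r
    // The compiler passes the right-hand side of each let! through Source,
    // so every overload decides how one input type becomes a Result.
    member _.Source(r: Result<'T, string>) = r
    member _.Source(a: ApiResult<'T>) : Result<'T, string> =
        if a.Success then Ok a.Payload else Error a.Message

let miniResult = MiniResultBuilder()

let computeA () : Result<int, string> = Ok 42
let computeB () = { Success = true; Payload = 23; Message = "" }
let computeFailing () = { Success = false; Payload = 0; Message = "oops" }

// Mixes a plain Result and an ApiResult in one expression.
let total =
    miniResult {
        let! a = computeA ()
        let! b = computeB ()
        return a + b
    }

// The first failing step short-circuits the rest.
let failed =
    miniResult {
        let! a = computeA ()
        let! c = computeFailing ()
        return a + c
    }
```

Here total is Ok 65 and failed is Error "oops". Combining with async would need further overloads that lift each input into Async<Result<_, _>>, which is roughly the custom code the question suspects.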

Convert String to Key Value Pair in F#

Given a string such as
one:1.0|two:2.0|three:3.0
how do we create a dictionary of the form string: float?
open System
open System.Collections.Generic

let ofSeq (src:seq<'a * 'b>) =
    // from fssnip
    let d = new Dictionary<'a, 'b>()
    for (k,v) in src do
        d.Add(k,v)
    d

let msg = "one:1.0|two:2.0|three:3.0"
let msgseq = msg.Split[|'|'|] |> Array.toSeq |> Seq.map (fun i -> i.Split(':'))
let d = ofSeq msgseq // The type ''a * 'b' does not match the type 'string []'
This operation would be inside a tight loop so efficiency would be a plus. Although I'd like to see a simple solution as well just to get my F# bearings.
Thanks.
How about something like this:
let msg = "one:1.0|two:2.0|three:3.0"

let splitKeyVal (str : string) =
    match str.Split(':') with
    |[|key; value|] -> (key, System.Double.Parse(value))
    |_ -> invalidArg "str" "str must have the format key:value"

let createDictionary (str : string) =
    str.Split('|')
    |> Array.map (splitKeyVal)
    |> dict
    |> System.Collections.Generic.Dictionary
You could drop the System.Collections.Generic.Dictionary if you don't mind an IDictionary return type.
If you expect the splitKeyVal function to fail then you'd be better off expressing it as a function that returns option, e.g.:
let splitKeyVal (str : string) =
    match str.Split(':') with
    |[|key; valueStr|] ->
        match System.Double.TryParse(valueStr) with
        |true, value -> Some (key, value)
        |false, _ -> None
    |_ -> None
But then you'd also have to decide how you wanted to handle failure in the createDictionary function.
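One way to settle that is to make createDictionary all-or-nothing. A sketch under that assumption, with hypothetical names tryParsePair and tryCreateDictionary, and with invariant-culture parsing added because Double.Parse and the two-argument-free TryParse otherwise follow the current culture's decimal separator:

```fsharp
open System
open System.Globalization

// Hypothetical variant of splitKeyVal that parses with the invariant
// culture, so "1.0" reads the same on every machine.
let tryParsePair (str: string) =
    match str.Split(':') with
    | [| key; valueStr |] ->
        match Double.TryParse(valueStr, NumberStyles.Float, CultureInfo.InvariantCulture) with
        | true, value -> Some (key, value)
        | false, _ -> None
    | _ -> None

// All-or-nothing: returns None as soon as any pair is malformed.
let tryCreateDictionary (str: string) =
    let parsed = str.Split('|') |> Array.map tryParsePair
    if parsed |> Array.forall Option.isSome then
        parsed |> Array.choose id |> dict |> Some
    else
        None

match tryCreateDictionary "one:1.0|two:2.0|three:3.0" with
| Some d -> printfn "two = %f" d.["two"]
| None -> printfn "malformed input"
```

Replacing the forall check with a plain Array.choose would give the other common policy, which silently skips bad entries instead of rejecting the whole string.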
Not sure about the perf side, but if you're sure of your input and can "afford" a warning, you can go with:
let d =
    msg.Split '|'
    |> Array.map (fun s -> let [|key; value|] (*warning here*) = s.Split ':' in key, value)
    |> dict
    |> System.Collections.Generic.Dictionary // optional if an IDictionary<string, string> suffices

Loop through a string array to match a pattern

I have a log file that I'm trying to parse with Regex.
I create an array of rows from the log file like this:
let loadLog =
    File.ReadAllLines "c:/access.log"
    |> Seq.filter (fun l -> not (l.StartsWith("#")))
    |> Seq.map (fun s -> s.Split())
    |> Seq.map (fun l -> l.[7],1)
    |> Seq.toArray
I then need to loop through this array. But I don't think this will work because line needs to be a string.
Is there a special way to handle something like this in f#?
type ActorDetails =
    {
        Date: DateTime
        Name: string
        Email: string
    }

for line in loadLog do
    let line queryString =
        match queryString with
        | Regex @"[\?|&]system=([^&]+)" [json] ->
            let jsonValue = JValue.Parse(Uri.UnescapeDataString(json))
            {
                Date = DateTime.UtcNow (* replace with parsed date *)
                Name = jsonValue.Value<JArray>("name").[0].Value<string>()
                Email = jsonValue.Value<JArray>("mbox").[0].Value<string>().[7..]
            }
Use a partial active pattern (|Regex|_|) to do that:
open System.Text.RegularExpressions

let (|Regex|_|) regexPattern input =
    let regex = new Regex(regexPattern)
    let regexMatch = regex.Match(input)
    if regexMatch.Success
    then Some regexMatch.Value
    else None

let queryString input =
    match input with
    | Regex @"[\?|&]system=([^&]+)" s -> s
    | _ -> sprintf "other: %s" input

parse log files with f#

I'm trying to parse data from iis log files.
Each row has a date that I need like this:
u_ex15090503.log:3040:2015-09-05 03:57:45
And a name and email address I need in here:
&actor=%7B%22name%22%3A%5B%22James%2C%20Smith%22%5D%2C%22mbox%22%3A%5B%22mailto%3AJames.Smith%40student.colled.edu%22%5D%7D&
I start off by getting the correct column like this. This part works fine.
//get the correct column
let getCol =
    let line = fileReader inputFile
    line
    |> Seq.filter (fun line -> not (line.StartsWith("#")))
    |> Seq.map (fun line -> line.Split())
    |> Seq.map (fun line -> line.[7],1)
    |> Seq.toArray
getCol
Now I need to parse the above and get the date, name, and email, but I'm having a hard time figuring out how to do that.
So far I have this, which gives me 2 errors (below):
//split the above column at every "&"
let getDataInCol =
    let line = getCol
    line
    |> Seq.map (fun line -> line.Split('&'))
    |> Seq.map (fun line -> line.[5], 1)
    |> Seq.toArray
getDataInCol
The errors:
Seq.map (fun line -> line.Split('&'))
the field constructor 'Split' is not defined
Seq.map (fun line -> line.[5], 1)
the operator 'expr.[idx]' has been used on an object of indeterminate type based on information prior to this program point.
Maybe I'm going about this all wrong. I'm very new to f# so I apologize for the sloppy code.
Something like this would get the name and email. You'll still need to parse the date.
#r "Newtonsoft.Json.dll"

open System
open System.Text.RegularExpressions
open Newtonsoft.Json.Linq

let (|Regex|_|) pattern input =
    let m = Regex.Match(input, pattern)
    if m.Success then Some(List.tail [ for g in m.Groups -> g.Value ])
    else None

type ActorDetails =
    {
        Date: DateTime
        Name: string
        Email: string
    }

let parseActorDetails queryString =
    match queryString with
    | Regex @"[\?|&]actor=([^&]+)" [json] ->
        let jsonValue = JValue.Parse(Uri.UnescapeDataString(json))
        {
            Date = DateTime.UtcNow (* replace with parsed date *)
            Name = jsonValue.Value<JArray>("name").[0].Value<string>()
            Email = jsonValue.Value<JArray>("mbox").[0].Value<string>().[7..]
        }
    | _ -> invalidArg "queryString" "Invalid format"

parseActorDetails "&actor=%7B%22name%22%3A%5B%22James%2C%20Smith%22%5D%2C%22mbox%22%3A%5B%22mailto%3AJames.Smith%40student.colled.edu%22%5D%7D&"
val it : ActorDetails = {Date = 11/10/2015 9:14:25 PM;
                         Name = "James, Smith";
                         Email = "James.Smith@student.colled.edu";}
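For the date that is still missing, a minimal sketch that pulls the timestamp out of a row like the sample in the question. The helper name tryParseLogDate is made up, and treating the timestamp as UTC is an assumption (IIS W3C logs normally record UTC):

```fsharp
open System
open System.Globalization
open System.Text.RegularExpressions

// Hypothetical helper: finds the first "yyyy-MM-dd HH:mm:ss" timestamp
// in a line and parses it, returning None when nothing usable is found.
let tryParseLogDate (line: string) =
    let m = Regex.Match(line, @"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
    if m.Success then
        // Assumption: the logged time is UTC, so keep it as UTC.
        let styles = DateTimeStyles.AssumeUniversal ||| DateTimeStyles.AdjustToUniversal
        match DateTime.TryParseExact(m.Groups.[1].Value, "yyyy-MM-dd HH:mm:ss",
                                     CultureInfo.InvariantCulture, styles) with
        | true, date -> Some date
        | false, _ -> None
    else
        None

printfn "%A" (tryParseLogDate "u_ex15090503.log:3040:2015-09-05 03:57:45")
```

The result could then replace the DateTime.UtcNow placeholder in parseActorDetails, for example by passing the whole log row in alongside the query string.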
