Site scraping with F# and Canopy - f#

I am trying to write a simple scraper using F# and Canopy (see http://lefthandedgoat.github.io/canopy/). I am trying to extract the text from all elements with the class ".application-tile". However, the code below gives me the following build error, and I don't understand it.
This expression was expected to have type
OpenQA.Selenium.IWebElement -> 'a
but here has type
OpenQA.Selenium.IWebElement
Any idea why this is happening? Thanks!
open canopy
open runner
open System
[<EntryPoint>]
let main argv =
    start firefox
    "taking canopy for a spin" &&& fun _ ->
        url "https://abc.com/"
        // Login Page
        "#i0116" << "abc#abc.com"
        "#i0118" << "abc"
        click "#abcButton"
        // Get the Application Tiles -- BUILD ERROR HAPPENS HERE
        elements ".application-tile" |> List.map (fun tile -> (tile |> (element ".application-name breakWordWrap"))) |> ignore
    run()

open canopy
open runner
start firefox
"taking canopy for a spin" &&& fun _ ->
    url "http://lefthandedgoat.github.io/canopy/testpages/"
    // Get the tds in tr
    let results = elements "#value_list td" |> List.map read
    // or print them using iter
    elements "#value_list td"
    |> List.iter (fun element -> System.Console.WriteLine(read element))
run()
That should do what you want.
canopy has a function called 'read' that takes either a selector or an element. Since 'elements "selector"' already gives you all of the elements, you can map read over the list.
List.map takes a function, applies it to each item, and returns a list of the results. (In C# it's like elements.Select(x => read(x)).)
List.iter is like C#'s .ForEach(x => System.Console.WriteLine(read(x))).

I believe the error is in the projection lambda inside your List.map call. According to the canopy documentation, elements returns all elements that match a CSS selector or text, while element gets a single element with the given CSS selector or text.
So here you are obtaining a list of elements that match the selector ".application-tile". List.map requires a function that takes an IWebElement (the type contained in the list) and projects it into a new form (the generic 'a).
I don't know much about this framework, but I'm not sure why you're taking an element and then piping it into another call to element. The expression element ".application-name breakWordWrap" is already fully applied, so it evaluates to an IWebElement rather than a function. The pipe operator needs a function of type IWebElement -> 'a on its right-hand side, which is exactly what the error message says.
Looking further through the documentation we find the read function:
"Read the text (or value or selected option) of an element." Is this what you want?

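The mismatch can be reproduced without a browser. The sketch below uses hypothetical stand-ins (FakeElement, findElement, readText) in place of canopy's IWebElement, element and read; they are assumptions made for illustration only and are not canopy's API.

```fsharp
// Hypothetical stand-ins for illustration; not canopy's real types or functions.
type FakeElement = { Text : string }

// Like `element`: takes a selector and returns an element (not a function).
let findElement (selector : string) : FakeElement = { Text = "found " + selector }

// Like `read`: takes an element and returns its text.
let readText (e : FakeElement) : string = e.Text

let tiles = [ { Text = "Mail" }; { Text = "Calendar" } ]

// Does not compile: `findElement ".name"` is already a FakeElement, but the
// right-hand side of |> must be a function of type FakeElement -> 'a.
// let broken = tiles |> List.map (fun tile -> tile |> (findElement ".name"))

// Compiles: readText is a function of type FakeElement -> string.
let names = tiles |> List.map readText
printfn "%A" names
```

The commented-out line has the same shape as the failing line in the question: a value sits where the pipe expects a function.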
Related

F#: How exactly is ToString supposed to be used?

I am learning F# but I just don't understand how I am supposed to use ToString. Below are a few attempts. The syntax errors say that type string was expected but the expression actually has type unit -> string. So it doesn't actually appear to be invoking a function? Could this be explained? This seems like such a simple thing to do, but I can't figure it out.
open System
open System.IO
open FSharp.Data
[<EntryPoint>]
let main (args: string[]) =
    let htmlPage = HtmlDocument.Load("https://scrapethissite.com/")
    printfn "%s" htmlPage.ToString // This causes a syntax error
    htmlPage.ToString
    |> (fun x -> printfn "%s" x) // This also causes a syntax error
    0
.ToString is a method, not a value. In F# every method and every function has a parameter. In fact, that's how functions differ from values (and methods from properties): by having a parameter.
Unlike in C#, F# methods and functions cannot be parameterless. If there is nothing meaningful that you'd want to pass to the method, that method would still have one parameter of type unit. See how this is visible in the error message? unit -> string is the type.
To call such a method, you have to pass it the parameter. The sole value of type unit is denoted (). So to call the method you should do:
htmlPage.ToString ()
|> printfn "%s"
Your first example is a bit more complicated. The following would not work:
printfn "%s" htmlPage.ToString ()
Why? Because according to F# syntax this looks like calling printfn and passing it three parameters: first "%s", then htmlPage.ToString, and finally (). To get the correct order of calls you have to use parentheses:
printfn "%s" (htmlPage.ToString ())
And finally, a general piece of advice: when possible, try to avoid methods and classes in F# code. Most things can be done with functions. In this particular case, the ToString method can be replaced with the equivalent function string:
printfn "%s" (string htmlPage)
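As a minimal sketch of the three forms side by side, the snippet below uses a small hypothetical Page class (an assumption for illustration, standing in for HtmlDocument) so that it runs without FSharp.Data.

```fsharp
// Hypothetical class used only for illustration; stands in for HtmlDocument.
type Page(title : string) =
    override this.ToString() = "Page: " + title

let page = Page("home")

// Without (), ToString is a function value of type unit -> string.
let toText : unit -> string = page.ToString

// Passing () actually invokes it.
let viaFunctionValue = toText ()
let viaMethodCall = page.ToString ()

// The built-in `string` function avoids the method call syntax altogether.
let viaStringFunction = string page

// Parentheses make the method call a single argument to printfn.
printfn "%s" (page.ToString ())
printfn "%s" viaStringFunction
```

All three bindings hold the same text; the only difference is how the call is written.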

Why does iterating previously read-in sequence trigger a new read?

In this SO post, adding
inSeq
|> Seq.length
|> printfn "%d lines read"
caused the lazy sequence in inSeq to be read in.
OK, I've expanded on that code and want to first print out that sequence (see new program below).
When the Visual Studio (2012) debugger gets to
inSeq |> Seq.iter (fun x -> printfn "%A" x)
the read process starts over again. When I examine inSeq using the debugger, inSeq appears to have no elements in it.
If I have first read elements into inSeq, how can I see (examine) those elements and why won't they print out with the call to Seq.iter?
open System
open System.Collections.Generic
open System.Text
open System.IO
#nowarn "40"
let rec readlines () =
    seq {
        let line = Console.ReadLine()
        if not (line.Equals("")) then
            yield line
            yield! readlines ()
    }
[<EntryPoint>]
let main argv =
    let inSeq = readlines ()
    inSeq
    |> Seq.length
    |> printfn "%d lines read"
    inSeq |> Seq.iter (fun x -> printfn "%A" x)
    // This will keep it alive enough to read your output
    Console.ReadKey() |> ignore
    0
I've read somewhere that results of lazy evaluation are not cached. Is that what is going on here? How can I cache the results?
A sequence is not a "container" of items; rather, it's a "promise" to deliver items sometime in the future. You can think of it as a function that you call, except that it returns its result in chunks, not all at once. If you call that function once, it returns the result once. If you call it a second time, it computes and returns the result a second time.
Because your particular sequence is not pure, you can compare it to a non-pure function: you call it once, it returns a result; you call it a second time, it may return something different.
Sequences do not automatically "remember" their items after the first read, in exactly the same way that functions do not automatically "remember" their result after the first call. If you want that from a function, you can wrap it in a special "caching" wrapper, and you can do the same for a sequence.
The general technique of "caching return value" is usually called "memoization". For F# sequences in particular, it is implemented in the Seq.cache function.
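A minimal sketch of the difference, using a counter in place of Console.ReadLine so that it runs without input. The names reads, source and cached are assumptions introduced for this illustration.

```fsharp
// Counts how many times the sequence body produces an item.
let reads = ref 0

// An impure sequence: every enumeration re-runs the body.
let source =
    seq {
        for i in 1 .. 3 do
            reads.Value <- reads.Value + 1
            yield i
    }

// Two enumerations of the uncached sequence run the body twice.
let firstLength = Seq.length source
let firstSum = Seq.sum source
let readsUncached = reads.Value

// Seq.cache remembers items once they have been produced.
let cached = Seq.cache source
let cachedLength = Seq.length cached   // runs the body once more, filling the cache
let cachedSum = Seq.sum cached         // served from the cache, body not re-run
let readsAfterCache = reads.Value

printfn "uncached: %d reads, after cache: %d reads" readsUncached readsAfterCache
```

In the question's program, the equivalent change would be to bind inSeq to readlines () |> Seq.cache, so that Seq.iter prints the lines already read instead of prompting again.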

Generating an infinite list of items does not "pause" when getting external input

I have some code which I'm expecting to pause when it asks for user input. It only does this however, if the last expression is Seq.initInfinite.
let consoleaction (i : int) =
    Console.WriteLine ("Enter Input: ")
    (Console.ReadLine().Trim(), i)
Seq.initInfinite (fun i -> consoleaction i) |> Seq.map (fun f -> printfn "%A" f)
printfn "foo" // program will not pause unless this line is commented out.
Very new to F# and I've spent way too much time on this already. Would like to know what is going on :)
If you try that piece of code in F# Interactive, you will see different effects depending on how you execute it.
For instance, if you execute it all in one shot, it will create the values but nothing will run: the Seq.initInfinite expression is 'lost', that is, it is not let-bound to anything, and because it is a lazy expression its side effects are never executed. If you remove the last instruction it will start prompting, because fsi binds the last expression to the identifier it, and to show you its value it has to start evaluating the seq expression.
Things are different if you put this in a function, for example:
open System
let myProgram() =
    let consoleaction ...
Now you will get a warning on the Seq.initInfinite:
warning FS0020: This expression should have type 'unit', but has type
'seq<unit>'. Use 'ignore' to discard the result of the expression, or
'let' to bind the result to a name.
Which is very clear. In addition to using ignore, as the warning suggests, you can change Seq.map to Seq.iter, since you are not interested in the result of the map, which would be a seq of units.
But your program still will not execute (try myProgram()) unless you remove the last line, the printfn. The reason is clear: the function returns its last expression, which is not the Seq.initInfinite expression; that one is lost, since it is lazy and ignored.
If you remove the printfn, the sequence becomes the 'return value' of your function, so it will be evaluated when the caller consumes the result.
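The laziness can be observed without console input. This is a minimal sketch in which a counter replaces Console.ReadLine; the names calls, action and pending are assumptions introduced for this illustration.

```fsharp
// Counts how many times the generator function has actually run.
let calls = ref 0

let action (i : int) =
    calls.Value <- calls.Value + 1
    i * 2

// Building the pipeline runs nothing: Seq.initInfinite and Seq.map are lazy.
let pending = Seq.initInfinite action |> Seq.map (fun x -> x + 1)
let callsBefore = calls.Value

// Consuming items forces evaluation. Seq.take bounds the infinite sequence,
// and Seq.iter is eager, so the side effects happen here.
pending |> Seq.take 3 |> Seq.iter (printfn "%d")
let callsAfter = calls.Value

printfn "calls before consuming: %d, after: %d" callsBefore callsAfter
```

The same reasoning applies to the question's code: the prompt only appears once something enumerates the sequence.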

f# signature matching explained

I am running into difficulty with F# in numerous scenarios. I believe I'm not grasping some fundamental concepts. I'm hoping someone can track my reasoning and figure out the (probably many) things I'm missing.
Say I'm using Xunit. What I'd like to do is, provided two lists, apply the Assert.Equal method pairwise. For instance:
open Xunit
let test1 = [1;2;3]
let test2 = [1;2;4]
List.map2 Assert.Equal test1 test2
The compiler complains that the function Equal does not take one parameter. As far as I can tell, shouldn't map2 be providing it 2 parameters?
As a sanity check, I use the following code in F# Interactive:
let doequal = fun x y -> printf "result: %b\n" (x = y)
let test1 = [1;2;3]
let test2 = [1;2;4]
List.map2 doequal test1 test2;;
This seems identical. doequal is a lambda taking two generic parameters and returning unit. List.map2 hands each argument pairwise into the lambda and I get exactly what I expected as output:
result: true
result: true
result: false
So what gives? The source shows that Xunit's Assert.Equal has the signature public static void Equal<T>(T expected, T actual). Why won't my parameters map right over the method signature?
EDIT ONE
I thought two variables x and y vs a tuple (x, y) could construct and deconstruct interchangeably. So I tried two options and got different results. It seems the second may be further along than the first.
List.map2 Assert.Equal(test1, test2)
The compiler now complains that 'Successive arguments should be separated by spaces or tupled'
List.map2(Assert.Equal(test1, test2))
The compiler now complains that 'A unique overload method could not be determined... A type annotation may be needed'
I think that part of the problem comes from mixing methods (OO style) and functions (FP style).
FP style functions have multiple parameters separated by spaces.
OO style methods have parens and parameters separated by commas.
Methods in other .NET libraries are always called using "tuple" syntax (actually subtly different from tuples though) and a tuple is considered to be one parameter.
The F# compiler tries to handle both approaches, but needs some help occasionally.
One approach is to "wrap" the OO method with an FP function.
// wrap method call with function
let assertEqual x y = Assert.Equal(x,y)
// all FP-style functions
List.map2 assertEqual test1 test2
If you don't create a helper function, you will often need to convert multiple function parameters to one tuple when calling a method "inline" with a lambda:
List.map2 (fun x y -> Assert.Equal(x,y)) test1 test2
When you mix methods and functions in one line, you often get the "Successive arguments should be separated" error.
printfn "%s" "hello".ToUpper()
// Error: Successive arguments should be separated
// by spaces or tupled
That's telling you that the compiler is having problems and needs some help!
You can solve this with extra parens around the method call:
printfn "%s" ("hello".ToUpper()) // ok
Or sometimes, with a reverse pipe:
printfn "%s" <| "hello".ToUpper() // ok
The wrapping approach is often worth doing anyway so that you can swap the parameters to make it more suitable for partial application:
// wrap method call with function AND swap params
let contains searchFor (s:string) = s.Contains(searchFor)
// all FP-style functions
["a"; "b"; "c"]
|> List.filter (contains "a")
Note that in the last line I had to use parens to give precedence to contains "a" over List.filter.
public static void Equal<T>(T expected, T actual)
doesn't take two parameters - it takes one parameter, which is a tuple with two elements: (T expected, T actual).
Try wrapping the call in a lambda that builds the tuple instead:
List.map2 (fun x y -> Assert.Equal(x, y)) test1 test2
It's all there in the type signatures.
The signature for Assert.Equal is something along the lines of 'a * 'a -> unit. List.map2 expects an 'a -> 'b -> 'c.
They just don't fit together.
List.map2 (fun x y -> Assert.Equal(x,y)) test1 test2 - works because the lambda wrapping Equal has the expected signature.
List.zip test1 test2 |> List.map Assert.Equal - works because you now have a single list of tuples, and since List.map wants an 'a -> 'b function (where 'a is now a tuple), Assert.Equal is now fair game.
It's simply not true that two values and a tuple are implicitly interchangeable. At least not as far as F# the language is concerned, or the underlying IL representation is concerned. You can think that it's that way when you call into F# code from, say, C# - an 'a -> 'b -> 'c function there is indeed called the same way syntactically as an 'a * 'b -> 'c function - but this is more of an exception than a rule.
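Both working forms can be sketched without Xunit. The Checker type below is a hypothetical stand-in for a .NET-style method such as Assert.Equal (an assumption for illustration); it returns a bool instead of throwing so that the results are visible.

```fsharp
// Hypothetical stand-in for a .NET-style method that takes tupled arguments.
type Checker =
    static member AreEqual(expected : int, actual : int) : bool = expected = actual

let test1 = [1; 2; 3]
let test2 = [1; 2; 4]

// Curried wrapper of type int -> int -> bool, the shape List.map2 expects.
let areEqual x y = Checker.AreEqual(x, y)
let viaMap2 = List.map2 areEqual test1 test2

// Zipping gives a list of pairs, so the tupled method can be passed directly.
let viaZip = List.zip test1 test2 |> List.map Checker.AreEqual

printfn "%A" viaMap2
printfn "%A" viaZip
```

The two results are identical; the choice is between adapting the function to the curried shape or adapting the data to the tupled shape.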
According to its signature, Xunit.Assert.Equal() takes a single parameter: a tuple of two values.

Working with large text files?

I need to import a large text file (55MB) (525000 * 25) and manipulate the data and produce some output. As usual I started exploring with F# Interactive, and I get some really strange behaviours.
Is this file too large, or is my code wrong?
The first test was to import the file and simply compute the sum over one column (not the end goal, just a first test):
let calctest =
    let reader = new StreamReader(path)
    let csv = reader.ReadToEnd()
    csv.Split([|'\n'|])
    |> Seq.skip 1
    |> Seq.map (fun line -> line.Split([|','|]))
    |> Seq.filter (fun a -> a.[11] = "M")
    |> Seq.map (fun values -> float(values.[14]))
As expected, this produces a seq of float both in the type check and in interactive. If I now add:
|> Seq.sum
Type check works and says this function should return a float but if I run it in interactive I get this error:
System.IndexOutOfRangeException: Index was outside the bounds of the array
Then I removed the last line again and thought I would look at the seq of float in a text file:
let writetest =
    let str = calctest |> Seq.map (fun i -> i.ToString())
    System.IO.File.WriteAllLines("test.txt", str )
Again, this passes the type check but throws errors in interactive.
Can the standard StreamReader not handle that amount of data, or am I going wrong somewhere? Should I use a different function than StreamReader?
Thanks.
Seq is lazy, which means that the mapping and filtering are only actually performed once you add the Seq.sum; that's why you don't see the error before adding that line. Are you sure you have at least 15 columns on all rows? That's probably the problem. For example, a file that ends with a newline produces a final empty line, which splits into a single empty field.
I would advise you to use the CSV Type Provider instead of just doing a string.Split. That way you'll be sure not to get an accidental IndexOutOfRangeException, and you'll handle escaping of commas correctly.
Additionally, you're reading the whole CSV file into memory by calling reader.ReadToEnd(). The CsvProvider supports streaming if you set the CacheRows parameter to false. It's not a problem with a 55MB file, but with something much larger it might be.
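If you stay with manual splitting, a length check on each row avoids the exception. The following is a minimal sketch under assumptions: the helper names sumColumn and row and the in-memory sample data are hypothetical, and the column indices 11 and 14 are taken from the question.

```fsharp
open System
open System.Globalization

// Sums column 14 over rows whose column 11 equals "M", skipping the header
// and any row that is too short (such as a trailing empty line).
let sumColumn (lines : seq<string>) : float =
    lines
    |> Seq.skip 1
    |> Seq.map (fun line -> line.TrimEnd('\r').Split(','))
    |> Seq.filter (fun cols -> cols.Length > 14 && cols.[11] = "M")
    |> Seq.sumBy (fun cols -> Double.Parse(cols.[14], CultureInfo.InvariantCulture))

// Hypothetical sample data: 15 columns per row, plus a trailing empty line.
let row (flag : string) (value : string) =
    String.concat "," [ for i in 0 .. 14 -> if i = 11 then flag elif i = 14 then value else "x" ]

let sample = [ "header"; row "M" "1.5"; row "F" "2.0"; row "M" "2.5"; "" ]

let total = sumColumn sample
printfn "%f" total
```

With a real file, System.IO.File.ReadLines path could supply the lines lazily instead of loading everything with ReadToEnd. This sketch still does not handle quoted fields that contain commas, which is where the type provider helps.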

Resources