F# Api crawler using recursion or loop - f#

I am trying to build a simple API crawler that goes through paginated content and returns the aggregate result. This was my first attempt at writing a recursive function:
namespace MyProject

open FSharp.Data

type MyJson = JsonProvider<"./Resources/Example.json">

module ApiClient =
    let rec getAllPagesRec baseUrl (acc: MyJson.ItemList[]) (page: int) =
        async {
            let pageUrl = baseUrl + "&page=" + page.ToString()
            let! response = Http.AsyncRequestString(pageUrl)
            let content = MyJson.Parse(response)
            let mergedArray = Array.concat [ content.ItemList; acc ]
            if content.PageNumber < content.PageCount
            then return! getAllPagesRec baseUrl mergedArray (page+1)
            else return mergedArray
        }
However, then I read in Microsoft Docs that code like this is wasteful of memory and processor time. So I re-wrote my function into a loop:
    let getAllPagesLoop baseUrl =
        async {
            let mutable continueLoop = true
            let mutable page = 1
            let mutable acc = [||]
            while continueLoop do
                let pageUrl = baseUrl + "&page=" + page.ToString()
                let! response = Http.AsyncRequestString(pageUrl)
                let content = MyJson.Parse(response)
                acc <- Array.concat [ content.ItemList; acc ]
                if content.PageNumber < content.PageCount
                then page <- page + 1
                else continueLoop <- false
            return acc
        }
The second function looks much more C#-y and it contains a lot of mutation, which seems to contradict the philosophy of F#. Is there another way to "optimize" the first function? Maybe using the yield/yield! keywords? Or are both functions good enough?

Your first piece of code is fine*. What the Microsoft docs warn against is recursion that recomputes values it has already computed. The classic example is Fibonacci, which is defined as
Fib n = Fib (n-1) + Fib (n-2)
So if you calculate Fib 5 you do
Fib 4 + Fib 3
But when you do Fib 4 you do
Fib 3 + Fib 2
This means you're calculating Fib 3 again, and the same goes for Fib 2. For a large n there would be lots of recalculation. What they're saying is that in these scenarios you want to cache those results.
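To make the caching point concrete, here is a minimal sketch that contrasts the naive definition with a cached one. The names `fibNaive` and `fibMemo` are illustrative only and do not come from the Microsoft docs; the cache is a plain `Dictionary` hidden inside a closure.

```fsharp
open System.Collections.Generic

// Naive version: recomputes the same subproblems many times.
let rec fibNaive n =
    if n < 2 then int64 n
    else fibNaive (n - 1) + fibNaive (n - 2)

// Cached version: each value is computed once and then looked up.
let fibMemo : int -> int64 =
    let cache = Dictionary<int, int64>()
    let rec fib n =
        match cache.TryGetValue n with
        | true, v -> v
        | false, _ ->
            let v = if n < 2 then int64 n else fib (n - 1) + fib (n - 2)
            cache.[n] <- v
            v
    fib

printfn "%d" (fibMemo 50) // 12586269025
```

The naive version makes an exponential number of calls, while the cached version makes a number of calls that grows linearly with n.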
In your scenario, presumably page 1 has links to page 10,11,12 but not to page 1 and page 10 has links to page 100,101,102 but not to pages 1.., 10... etc. Then there's no wasted computation as you're not recalculating anything.
If this is not the case and there might be cyclic links, then you need to keep track of the visited pages, e.g. pass along a list of the visited pages, add each page to it when you visit it, and avoid fetching a page if it's already in the visited list.
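As a rough sketch of that bookkeeping, the example below threads a set of visited pages through the recursion. The `links` map is a made-up stand-in for real HTTP calls (it is an assumption for illustration, not part of the question's API), and a `Set` is used instead of a list so that the membership check stays cheap.

```fsharp
// Hypothetical link graph standing in for real HTTP calls: page -> linked pages.
let links =
    Map.ofList [ (1, [ 2; 3 ])
                 (2, [ 3; 1 ])   // links back to page 1: a cycle
                 (3, [ 1 ]) ]

// Depth-first crawl that passes the set of visited pages along.
let rec crawl (visited: Set<int>) (page: int) : Set<int> =
    if Set.contains page visited then
        visited                          // already fetched, skip it
    else
        let visited' = Set.add page visited
        links
        |> Map.tryFind page
        |> Option.defaultValue []
        |> List.fold crawl visited'      // visit each linked page in turn

let allPages = crawl Set.empty 1
printfn "%A" allPages // set [1; 2; 3]
```

Each page is fetched at most once, even though pages 2 and 3 both link back to page 1.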
On another note, your code may not be taking advantage of parallelization. If you could get the number of pages in advance (maybe with a single call for the first page), you could parallelize the downloading of the pages like this:
let getPage baseUrl (page: int) : Async<MyJson.ItemList[]> =
    async {
        let pageUrl = baseUrl + "&page=" + page.ToString()
        let! response = Http.AsyncRequestString(pageUrl)
        let content = MyJson.Parse(response)
        return content.ItemList
    }

[| 1..nPages |]
|> Array.map (getPage baseUrl)
|> Async.Parallel
|> Async.RunSynchronously
|> Array.concat

Related

F# hidden mutation

Anyone have a decent example, preferably practical/useful, they could post demonstrating the concept?
I came across this term somewhere that I'm unable to find. It probably has something to do with a function returning a function while closing over some mutable variable, so there's no visible mutation.
The Haskell community probably originated the idea, where mutation happens in another area not visible to the calling scope. I may be vague here, so I am seeking help to understand more.
It's a good idea to hide mutation, so that consumers of the API won't inadvertently change something. This just means that you have to encapsulate your mutable data/state. This can be done via objects (yes, objects), but what you are referring to in your question can be done with a closure. The canonical example is a counter:
let countUp =
    let mutable count = 0
    (fun () -> count <- count + 1
               count)
countUp() // 1
countUp() // 2
countUp() // 3
You cannot access the mutable count variable directly.
Another example would be using mutable state within a function so that you cannot observe it, and the function is, for all intents and purposes, referentially transparent. Take for example the following function that reverses a string not character-wise, but rather by taking individual text elements (which, depending on language, can be more than one character):
let reverseStringU s =
    if System.String.IsNullOrEmpty s then s else
    // Collect the text elements; prepending to acc reverses their order.
    let rec iter acc (ee : System.Globalization.TextElementEnumerator) =
        if not <| ee.MoveNext () then acc else
        let e = ee.GetTextElement ()
        iter (e :: acc) ee
    let inline append x s = (^s : (member Append : ^x -> ^s) (s, x))
    let sb = System.Text.StringBuilder s.Length
    System.Globalization.StringInfo.GetTextElementEnumerator s
    |> iter []
    |> List.fold (fun a e -> append e a) sb
    |> string
It uses a StringBuilder internally but you cannot observe this externally.

F# break from while loop

Is there any way to do it like in C/C#?
For example (C# style)
for (int i = 0; i < 100; i++)
{
    if (i == 66)
        break;
}
The short answer is no. You would generally use some higher-order function to express the same functionality. There are a number of functions that let you do this, corresponding to different patterns (so if you describe what exactly you need, someone might give you a better answer).
For example, tryFind function returns the first value from a sequence for which a given predicate returns true, which lets you write something like this:
seq { 0 .. 100 } |> Seq.tryFind (fun i ->
    printfn "%d" i
    i = 66)
In practice, this is the best way to go if you are expressing some high-level logic and there is a corresponding function. If you really need to express something like break, you can use a recursive function:
let rec loop n =
    if n < 66 then
        printfn "%d" n
        loop (n + 1)
loop 0
A more exotic option (that is not as efficient, but may be nice for DSLs) is that you can define a computation expression that lets you write break and continue. Here is an example, but as I said, this is not as efficient.
This is really ugly, but in my case it worked.
let mutable Break = false
while not Break do
    //doStuff
    if breakCondition then
        Break <- true
done
This is useful for do-while loops, because it guarantees that the loop is executed at least once.
I hope there's a more elegant solution. I don't like the recursive one, because I'm afraid of stack overflows. :-(
You have to change it to a while loop.
let (i, ans) = (ref 0, ref -1)
while !i < 100 && !ans < 0 do
    if !i = 66 then
        ans := !i
    i := !i + 1   // advance the counter, otherwise the loop never ends
!ans
(This breaks when i gets to 66--but yes the syntax is quite different, another variable is introduced, etc.)
seq {
    for i = 0 to 99 do
        if i = 66 then yield ()
}
|> Seq.tryItem 0
|> ignore
Try this:
exception BreakException

try
    for i = 0 to 99 do
        if i = 66 then
            raise BreakException
with BreakException -> ()
I know that some folks don't like to use exceptions, but this approach has merits:
You don't have to think about a complicated recursive function. Of course you can write one, but sometimes it is unnecessarily bothersome and using an exception is simpler.
This method allows you to break halfway through the loop body. (The break "flag" method is simple too, but it only allows you to break at the end of the loop body.)
You can easily escape from a nested loop.
For these kind of problems you could use a recursive function.
let rec IfEqualsNumber start finish num =
    if start = finish then false
    elif start = num then true
    else
        let start2 = start + 1
        IfEqualsNumber start2 finish num
Recently I tried to solve a similar situation:
A list of, say, 10 pieces of data. Each of them must be queried against a Restful server, then get a result for each.
let lst = [4;6;1;8]
The problem:
If there is a failed API call (e.g. network issue), there is no point making further calls as we need all the 10 results available. The entire process should stop ASAP when an API call fails.
The naive approach: use List.map()
lst |> List.map (fun x ->
    try
        use sqlComd = ...
        sqlComd.Parameters.Add("@Id", SqlDbType.BigInt).Value <- x
        sqlComd.ExecuteScalar() |> Some
    with
    | :? System.Data.SqlClient.SqlException as ex -> None
    )
But as said, it's not optimal. When a failed API occurs, the remaining items keep being processed. They do something that is ignored at the end anyway.
The hacky approach: use List.tryFindIndex()
Unlike map(), we must store the results somewhere in the lambda function. A reasonable choice is to use a mutable list. So when tryFindIndex() returns None, we know that everything was OK and can start making use of the mutable list.
let myList = System.Collections.Generic.List<obj>()
let res = lst |> List.tryFindIndex (fun x ->
    try
        use sqlComd = ...
        sqlComd.Parameters.Add("@Id", SqlDbType.BigInt).Value <- x
        myList.Add(sqlComd.ExecuteScalar())
        false
    with
    | :? System.Data.SqlClient.SqlException as ex -> true
    )
match res with
| Some _ -> printfn "Something went wrong"
| None -> printfn "Here is the 10 results..."
The idiomatic approach: use recursion
Not very idiomatic as it uses Exception to stop the operation.
exception MyException of string
let rec makeCall lstLocal =
    match lstLocal with
    | [] -> []
    | head::tail ->
        try
            use sqlComd = ...
            sqlComd.Parameters.Add("@Id", SqlDbType.BigInt).Value <- head
            let temp = sqlComd.ExecuteScalar()
            temp :: makeCall tail
        with
        | :? System.Data.SqlClient.SqlException as ex -> raise (MyException ex.Message)
try
    let res = makeCall lst
    printfn "Here is the 10 results..."
with
| MyException _ -> printfn "Something went wrong"
The old-fashioned imperative approach: while... do
This still involves a mutable list.
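The answer does not show code for this last approach. A minimal sketch of how it might look is below; `callApi` and `fetchAll` are hypothetical names, and `callApi` is a stand-in for the real SQL/API call that deliberately fails on the value 1 so that the early exit can be seen.

```fsharp
// Hypothetical stand-in for the real API/SQL call: fails on the value 1.
let callApi (x: int) : string =
    if x = 1 then failwith "network issue" else sprintf "result %d" x

// Stops at the first failed call; returns Some results only if every call succeeded.
let fetchAll (items: int list) : string list option =
    let results = ResizeArray<string>()   // the mutable list
    let mutable remaining = items
    let mutable failed = false
    while not failed && not (List.isEmpty remaining) do
        try
            results.Add(callApi (List.head remaining))
            remaining <- List.tail remaining
        with _ ->
            failed <- true                // leave the loop, no further calls are made
    if failed then None else Some (List.ofSeq results)

printfn "%A" (fetchAll [ 4; 6; 1; 8 ]) // None (the call for 1 fails, 8 is never tried)
printfn "%A" (fetchAll [ 4; 6; 8 ])    // Some ["result 4"; "result 6"; "result 8"]
```

The mutation stays local to `fetchAll`, so callers only see a function from a list to an option.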

Lazy.. but eager data loader in F#

Does anyone know of 'prior art' regarding the following subject:
I have data that takes a decent amount of time to load. It is historical levels for various stocks.
I would like to preload it somehow, to avoid latency when using my app.
However, preloading it in one chunk at startup makes my app unresponsive at first, which is not user friendly.
So I would like to hold off loading my data while the user is requesting something, and load it little by little whenever the user is not requesting anything and is playing with what he already has. So it is neither 'lazy' nor 'eager', more 'lazy when you need' and 'eager when you can', hence the acronym LWYNEWYC.
I have made the following, which seems to work, but I wonder if there is a recognized and blessed approach for such a thing?
let r = LoggingFakeRepo () :> IQuoteRepository
r.getHisto "1" |> ignore //prints Getting histo for 1 when called
let rc = RepoCached (r) :> IQuoteRepository
rc.getHisto "1" |> ignore //prints Getting histo for 1 the first time only
let rcc = RepoCachedEager (r) :> IQuoteRepository
rcc.getHisto "100" |> ignore //prints Getting histo 1..100 by itself BUT
//prints Getting histo 100 immediately when called
And the classes
type IQuoteRepository =
    abstract getUnderlyings : string seq
    abstract getHisto : string -> string

type LoggingFakeRepo () =
    interface IQuoteRepository with
        member x.getUnderlyings =
            printfn "getting underlyings"
            [1 .. 100] |> List.map string :> _
        member x.getHisto udl =
            printfn "getting histo for %A" udl
            "I am a historical dataset in a disguised party"

type RepoCached (rep : IQuoteRepository) =
    let memoize f =
        let cache = new System.Collections.Generic.Dictionary<_, _>()
        fun x ->
            if cache.ContainsKey(x) then cache.[x]
            else
                let res = f x
                cache.[x] <- res
                res
    let udls = lazy (rep.getUnderlyings)
    let gethistom = memoize rep.getHisto
    interface IQuoteRepository with
        member x.getUnderlyings = udls.Force()
        member x.getHisto udl = gethistom udl

// The reply carries the string returned by getHisto.
type Message = string * AsyncReplyChannel<string>

type RepoCachedEager (rep : IQuoteRepository) =
    let udls = rep.getUnderlyings
    let agent = MailboxProcessor<Message>.Start(fun inbox ->
        let repocached = RepoCached (rep) :> IQuoteRepository
        let rec loop l =
            async {
                try
                    // Wait forever once everything is preloaded, otherwise 50 ms.
                    let timeout = if l |> List.isEmpty then -1 else 50
                    let! (udl, replyChannel) = inbox.Receive(timeout)
                    replyChannel.Reply(repocached.getHisto udl)
                    do! loop l
                with
                | :? System.TimeoutException ->
                    // No request arrived in time: preload the next underlying.
                    let udl::xs = l
                    repocached.getHisto udl |> ignore
                    do! loop xs
            }
        loop (udls |> Seq.toList))
    interface IQuoteRepository with
        member x.getUnderlyings = udls
        member x.getHisto udl = agent.PostAndReply(fun reply -> udl, reply)
I like your solution. I think using agent to implement some background loading with a timeout is a great way to go - agents can nicely encapsulate mutable state, so it is clearly safe and you can encode the behaviour you want quite easily.
I think asynchronous sequences might be another useful abstraction (if I'm correct, they are available in FSharpX these days). An asynchronous sequence represents a computation that asynchronously produces more values, so they might be a good way to separate the data loader from the rest of the code.
I think you'll still need an agent to synchronize at some point, but you can nicely separate different concerns using async sequences.
The code to load the data might look something like this:
let loadStockPrices repo = asyncSeq {
    // TODO: Not sure how you detect that the repository has no more data...
    while true do
        // Get next item from the repository, preferably asynchronously!
        let! data = repo.AsyncGetNextHistoricalValue()
        // Return the value to the caller...
        yield data }
This code represents the data loader, and it separates it from the code that uses it. From the agent that consumes the data source, you can use AsyncSeq.iterAsync to consume the values and do something with them.
With iterAsync, the function that you specify as a consumer is asynchronous. It may block (e.g. using Sleep) and when it blocks, the source (that is, your loader) is also blocked. This is quite a nice implicit way to control the loader from the code that consumes the data.
A feature that is not in the library yet (but would be useful) is a partially eager evaluator that takes AsyncSeq<'T> and returns a new AsyncSeq<'T> but obtains a certain number of elements from the source as soon as possible and caches them (so that the consumer does not have to wait when it asks for a value, as long as the source can produce values fast enough).

retrieving all results in RavenDB

I thought I could force to retrieve all results through multiple page and skip, using the statistics function
type Linq.IRavenQueryable<'T> with
    member q.getAll() =
        let mutable stat = Linq.RavenQueryStatistics()
        let total = stat.TotalResults
        let a = q.Statistics(&stat)
        let rec addone n = seq { yield q.Skip(n*1024).Take(1024).ToArray()
                                 if n*1024 < total then
                                     yield! addone (n + 1) }
        addone 0 |> Array.concat
It works when you do
let q = session.Query<productypfield>()
let r = q.getAll()
but breaks with
let q = session.Query<productypfield>().Where(System.Func ....)
let r = q.getAll()
The reason is that the type Linq.IRavenQueryable is not preserved through Linq composition: if I use Linq, I get an IEnumerable on which no q.Statistics(&stat) is defined.
I read the docs, and I don't see any way to keep the type through Linq composition.
Is the only option to loop a fixed (high) number of times, or to set a high server page size and Take a lot of elements?
Edit: actually, even the code above does not work, as apparently you need to run the query once to get a valid count. One has to call Take(0) to trigger it.
use session = store.OpenSession()
let q = session.Query<productypfield>()
let mutable s = Linq.RavenQueryStatistics()
let a = q.Statistics(&s)
s.TotalResults = 0 //true
printfn "%A" a //triggers evaluation
s.TotalResults = 0 //false
Can you change your 2nd code sample to this (I'm not familiar with F#):
let q = session.Query<productypfield>().Where(Expression<System.Func<....>>)
let r = q.getAll()
that should let you keep the IQueryable that you need

How to Translate Dictionary of Functions

I struggled to think of a good title here, but hopefully my description makes up for it.
As a hobby side-project, I'm attempting to port an interpreter for a toy language (you have to pay for the book, just linking to show where I am coming from) from Go to F#.
This has all gone fine until the point where I am needing to call other functions in a dictionary of functions.
Here's a very simplified example of what the Go code tries to do written in F#:
let processA a remainingCharacters =
    1
let processB b remainingCharacters =
    2
let processC c remainingCharacters =
    // this doesn't work, obviously, as funcMap is declared below
    let problem = funcMap.[remainingCharacters.Head]
    3 + problem
// i assume there is a better way of doing this, I'm just not sure what it is
let funcMap = dict[('a', processA); ('b', processB); ('c', processC)]
let processCharacter currentCharacter remainingCharacters =
    let processFunc = funcMap.[currentCharacter]
    processFunc currentCharacter remainingCharacters
let input = ['a'; 'b'; 'a'; 'c']
let processInput() =
    let rec processInputRec currentCharacter (remainingCharacters: char list) sum =
        if remainingCharacters.IsEmpty then
            sum
        else
            let currentValue = processCharacter currentCharacter remainingCharacters
            processInputRec remainingCharacters.Head remainingCharacters.Tail (sum + currentValue)
    processInputRec input.Head input.Tail 0
let result = processInput()
sprintf "%i" result |> ignore
So basically, it is trying to map given input values to different functions and in certain cases, needing to refer back to that mapping (or at least getting at another one of those mapped functions) inside those functions.
How would I go about doing that in F#?
The order of compilation in F# is a feature, not a bug. It helps to make sure that your code is not spaghetti, that all dependencies are nice and linear.
The "loopback" scenarios like this one are usually solved via parametrization.
So in this particular case, if you want the specific functions to call processCharacter recursively, just pass it in:
let processA a remainingCharacters _ = // an extra unused parameter here
    1
let processB b remainingCharacters _ = // and here
    2
let processC c remainingCharacters procChar = // here is where the extra parameter is used
    let problem = procChar (List.head remainingCharacters) (List.tail remainingCharacters)
    3 + problem
...
let rec processCharacter currentCharacter remainingCharacters =
    let processFunc = funcMap.[currentCharacter]
    processFunc currentCharacter remainingCharacters processCharacter
Note, however, that, although this will solve your immediate problem, this will (probably) not work all the way, because you're not keeping track of which characters got consumed from the input. So that, if processC decides to process one more character, the surrounding code won't know about it, and upon return from processC will process the same character again. I'm not sure if this was your intent (hard to tell from the code), and if it was, please disregard this warning.
The usual approach to parsing a stream of inputs like this is to have each processing function return a pair - the result of the processing plus the tail of remaining inputs, e.g.:
let processA chars =
    1, (List.tail chars)
Then the surrounding "driver" function would thread the returned tail of the list to the next processing function. This way, each processing function can consume not necessarily one, but any number of inputs - from zero to all of them.
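A small sketch of such a driver, under assumptions made up for illustration: every processor receives the remaining input (including the current character) and returns the value plus the unconsumed tail, and `processC` is given an invented rule that it also swallows the character after it. The `run` function is hypothetical and not part of the question's code.

```fsharp
// Each processor takes the remaining input and returns (value, unconsumed input).
let processA (chars: char list) = 1, List.tail chars
let processB (chars: char list) = 2, List.tail chars

// Invented rule for illustration: 'c' also consumes the character after it, if any.
let processC (chars: char list) =
    match chars with
    | _ :: _ :: rest -> 3, rest
    | _ -> 3, []

let funcMap = dict [ ('a', processA); ('b', processB); ('c', processC) ]

// The driver threads the tail returned by one processor into the next step.
let rec run sum (chars: char list) =
    match chars with
    | [] -> sum
    | c :: _ ->
        let processFunc = funcMap.[c]
        let value, rest = processFunc chars
        run (sum + value) rest

printfn "%d" (run 0 [ 'a'; 'b'; 'a'; 'c' ]) // 7
```

Because the driver only ever continues from the returned tail, a processor that consumes extra input cannot cause a character to be handled twice.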
This approach has been implemented in libraries, too. Take a look at FParsec.
Another note: your code seems very un-F#-y. You're not using many F# features, making your code longer and more complicated than it needs to be. For example, instead of accessing .Tail and .Head, it is customary to pattern-match on the list:
let rec processInputRec current rest sum =
    match rest with
    | [] -> sum
    | next :: rest' ->
        let currentValue = processCharacter current rest
        processInputRec next rest' (sum + currentValue)

Resources