F#: How to enumerate through multiple files correctly?

I have a bunch of files several MiB in size which are very simple:
Their size is a multiple of 8 bytes
They only contain doubles in little endian, so can be read with BinaryReader's ReadDouble() method
When lexicographically sorted, they contain all values in the sequence they need to be.
I can't keep everything in memory as a float list or float array so I need a float seq that goes through the necessary files when actually being accessed. The portion that goes through the sequence actually does it in imperative style using GetEnumerator() because I don't want any resource leaks and want to close all files correctly.
My first functional approach was:
let readFile file =
    let rec readReader (maybeReader : BinaryReader option) =
        match maybeReader with
        | None ->
            let openFile() =
                printfn "Opening the file"
                new BinaryReader(new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.Read))
                |> Some
                |> readReader
            seq { yield! openFile() }
        | Some reader when reader.BaseStream.Position >= reader.BaseStream.Length ->
            printfn "Closing the file"
            reader.Dispose()
            Seq.empty
        | Some reader ->
            reader.BaseStream.Position |> printfn "Reading from position %d"
            let bytesToRead = Math.Min(1048576L, reader.BaseStream.Length - reader.BaseStream.Position) |> int
            let bytes = reader.ReadBytes bytesToRead
            let doubles = Array.zeroCreate<float> (bytesToRead / 8)
            Buffer.BlockCopy(bytes, 0, doubles, 0, bytesToRead)
            seq {
                yield! doubles
                yield! readReader maybeReader
            }
    readReader None
And then, when I have a string list containing all the files, I can say something like:
let values = files |> Seq.collect readFile
use ve = values.GetEnumerator()
// Do stuff that only gets partial data from one file
However, this only closes the files when the reader reaches its end (which is clear when looking at the function). So as a second approach I implemented the file enumerating imperatively:
type FileEnumerator(file : string) =
    let reader = new BinaryReader(new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.Read))
    let mutable _current : float = Double.NaN
    do file |> printfn "Enumerator active for %s"
    interface IDisposable with
        member this.Dispose() =
            reader.Dispose()
            file |> printfn "Enumerator disposed for %s"
    interface IEnumerator with
        member this.Current = _current :> obj
        member this.Reset() = reader.BaseStream.Position <- 0L
        member this.MoveNext() =
            let stream = reader.BaseStream
            if stream.Position >= stream.Length then false
            else
                _current <- reader.ReadDouble()
                true
    interface IEnumerator<float> with
        member this.Current = _current
type FileEnumerable(file : string) =
    interface IEnumerable with
        member this.GetEnumerator() = new FileEnumerator(file) :> IEnumerator
    interface IEnumerable<float> with
        member this.GetEnumerator() = new FileEnumerator(file) :> IEnumerator<float>
let readFile' file = new FileEnumerable(file) :> float seq
Now, when I say
let values = files |> Seq.collect readFile'
use ve = values.GetEnumerator()
// do stuff with the enumerator
disposing the enumerator correctly bubbles through to my imperative enumerator.
While this is a feasible solution for what I want to achieve (I could make it faster by reading it blockwise like the first functional approach but for brevity I didn't do it here) I wonder if there is a truly functional approach for this avoiding the mutable state in the enumerator.

I don't quite get what you mean when you say that using GetEnumerator() will prevent resource leaks and allow you to close all files correctly. Below is my attempt at this (ignoring the block copy part for demonstration purposes); I think it results in the files being properly closed.
let eof (br : BinaryReader) =
    br.BaseStream.Position = br.BaseStream.Length
let readFileAsFloats filePath =
    seq {
        use file = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read)
        use reader = new BinaryReader(file)
        while (not (eof reader)) do
            yield reader.ReadDouble()
    }
let readFilesAsFloats filePaths =
    filePaths |> Seq.collect readFileAsFloats
let floats = readFilesAsFloats ["D:\\floatFile1.txt"; "D:\\floatFile2.txt"]
Is that what you had in mind?
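If the block-wise reading from the question is still wanted, it can probably be folded into the same use-inside-seq pattern. The following is only a sketch under a few assumptions: readFileBlockwise and blockBytes are names I made up, blockBytes is expected to be a positive multiple of 8, and Buffer.BlockCopy reinterprets the bytes in host order, so (like the question's code) it assumes a little-endian machine.

```fsharp
open System
open System.IO

// Hypothetical helper (not from the question): yields the doubles of one file,
// reading up to blockBytes bytes at a time.
// Assumption: blockBytes is a positive multiple of 8.
let readFileBlockwise (blockBytes : int) (path : string) : seq<float> =
    seq {
        // 'use' inside seq { } ties disposal to the enumerator, so the file is
        // closed when enumeration finishes or the enumerator is disposed early.
        use stream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read)
        use reader = new BinaryReader(stream)
        while stream.Position < stream.Length do
            let remaining = stream.Length - stream.Position
            let count = int (min (int64 blockBytes) remaining)
            let bytes = reader.ReadBytes count
            let doubles = Array.zeroCreate<float> (bytes.Length / 8)
            // Reinterpret the raw bytes as doubles (host byte order).
            Buffer.BlockCopy(bytes, 0, doubles, 0, doubles.Length * 8)
            yield! doubles
    }
```

Usage would look like files |> Seq.collect (readFileBlockwise 1048576). There is no mutable field to maintain by hand; the compiler-generated enumerator carries the state.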

Related

Shared domain types and single responsibility principle, how do they work together?

I think my problem is best explained on a hypothetical example.
Suppose you want to design a software for several banks. Two use cases are: calculating a credit score and evaluating taxes. The score does not depend on the taxes and vice versa.
The return type of the credit scoring function is Score and the taxes are returned in TaxAmount. You want the solution as clean and extensible as possible.
Where would you put the shared/not so shared types?
Putting Score and TaxAmount in one shared type module would violate the single responsibility principle.
Putting Score and TaxAmount in different modules increases the number of files/modules/git submodules very quickly as the code base grows. And the problem spreads through all the layers higher up, which results in a lot of open module statements.
Having each function its own type module results in repetition.
Not using custom domain types results in losing the benefits of custom domain types.
My constructed onion architecture example:
namespace Domain.Shared
module Types =
    type Score =
        | GOOD
        | BAD
    type TaxAmount = int
namespace Domain.Scoring1
module Main =
    open Domain.Shared.Types
    let go input :Score = Score.BAD
namespace Domain.Scoring2
module Main =
    open Domain.Shared.Types
    let go input :Score = Score.GOOD
namespace Domain.ComputeTaxes1
module Main =
    open Domain.Shared.Types
    let go input :TaxAmount = 1
namespace Application
module BankA =
    let standardize x = "a"
    let score input =
        input
        |> standardize
        |> Domain.Scoring1.Main.go
    let computeTaxes input =
        input
        |> standardize
        |> Domain.ComputeTaxes1.Main.go
module BankB =
    let standardize x = "b"
    let score input =
        input
        |> standardize
        |> Domain.Scoring2.Main.go
namespace Infrastructure
module BankA =
    let read filename = "A"
    let score filename =
        filename
        |> read
        |> Application.BankA.score
    let computeTaxes (filename:string) =
        filename
        |> read
        |> Application.BankA.computeTaxes
module BankB =
    let read filename = "B"
    let score (filename:string) =
        filename
        |> read
        |> Application.BankB.score
module Shell =
    [<EntryPoint>]
    let main (args: array<string>) =
        match args with
        | [| "A"; "score"; filename |] -> printfn "%A" (BankA.score filename)
        | [| "A"; "tax"; filename |] -> printfn "%A" (BankA.computeTaxes filename)
        | [| "B"; "score"; filename |] -> printfn "%A" (BankB.score filename)
        | _ -> printfn "Not implemented"
        0

Using TaskCompletionSource in F# for use in a .NET library

Considering the following
type MyClass () =
    member x.ReadStreamAsync(stream:Stream) =
        async {
            let tcs = new TaskCompletionSource<int>()
            let buffer = Array.create 2048 0uy
            let! bytesReadCount = stream.ReadAsync(buffer, 0, buffer.Length) |> Async.AwaitTask
            if bytesReadCount > 0 then
                // Only indices 0 .. bytesReadCount - 1 hold data that was actually read
                for i in 0 .. bytesReadCount - 1 do
                    if buffer.[i] = 10uy then
                        tcs.SetResult(i)
            // Omitted more code to handle the case if 10uy is not found..
            return tcs.Task
        }
The code reads from a stream until it meets a certain character (represented by a byte value), at which point the task returned by the method completes.
The function signature of DoSomethingAsync is unit -> Async<Task<int>>, but I would like it to be unit -> Task<int> such that it can be used more generally in .NET.
Can this be done in F# using an asynchronous expression, or do I have to rely more on the Task constructs of .NET?

Given that you don't actually use the async workflow for anything in your example, the easiest solution would be to forgo it entirely:
member x.DoSomethingAsync() =
    let tcs = new TaskCompletionSource<int>()
    Task.Delay(100).Wait()
    tcs.SetResult(10)
    tcs.Task
This implementation of DoSomethingAsync has the type unit -> Task<int>.
It's not clear to me exactly what you're trying to do, but why don't you just do the following?
member x.DoSomethingAsync() =
    async {
        do! Async.Sleep 100
        return 10 } |> Async.StartAsTask
This implementation also has the type unit -> Task<int>.
Based on the updated question, here's a way to do it:
member x.DoSomethingAsync(stream:Stream) =
    async {
        let buffer = Array.create 2048 0uy
        let! bytesReadCount =
            stream.ReadAsync(buffer, 0, buffer.Length) |> Async.AwaitTask
        if bytesReadCount > 0
        then
            let res =
                // Only indices 0 .. bytesReadCount - 1 hold data that was actually read
                [0 .. bytesReadCount - 1]
                |> List.tryFind (fun i -> buffer.[i] = 10uy)
            return defaultArg res -1
        else return -1
    }
    |> Async.StartAsTask
The DoSomethingAsync function has the type Stream -> System.Threading.Tasks.Task<int>. I didn't know what to do in the else case, so I just put -1, but I'm sure you can replace it with something more correct.
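On F# 6 or later, the built-in task { } computation expression can probably produce the Task<int> directly, without going through Async.StartAsTask. This is only a sketch under that version assumption; searching with System.Array.IndexOf over just the bytes that were read is my own substitution for the list-based search, and it keeps the -1 placeholder for "not found".

```fsharp
open System.IO
open System.Threading.Tasks

type MyClass () =
    // Returns the index of the first 10uy (line feed) among the bytes read,
    // or -1 if none was found or nothing was read.
    member x.DoSomethingAsync(stream : Stream) : Task<int> =
        task {
            let buffer = Array.zeroCreate<byte> 2048
            let! bytesReadCount = stream.ReadAsync(buffer, 0, buffer.Length)
            // Search only the first bytesReadCount bytes of the buffer.
            return System.Array.IndexOf(buffer, 10uy, 0, bytesReadCount)
        }
```

From C#, this can be awaited like any other Task<int>-returning method.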

Mutable states in F# object expressions

I would like to have a mutable state in an F# object expression.
The first approach is to use ref cells as follows:
type PP =
    abstract member A : int
let foo =
    let a = ref 0
    { new PP with
        member x.A =
            let ret = !a
            a := !a + 1
            ret
    }
printfn "%A" foo.A
printfn "%A" foo.A
printfn "%A" foo.A
printfn "%A" foo.A
A different approach would be as follows:
type State(s : int) =
    let mutable intState = s
    member x.state
        with get () = intState
        and set v = intState <- v
[<AbstractClass>]
type PPP(state : State) =
    abstract member A : int
    member x.state
        with get () = state.state
        and set v = state.state <- v
let bar n =
    { new PPP(State(n)) with
        member x.A =
            let ret = x.state
            x.state <- ret + 1
            ret
    }
let barA1 = bar 0
printfn "%A" barA1.A
printfn "%A" barA1.A
printfn "%A" barA1.A
printfn "%A" barA1.A
Which version is likely to perform better (I need the state update x.state <- ret + 1 in performance-critical sections)? My guess is that the State object is also allocated on the heap, so there is no reason why the second version should be faster. However, it is slightly more appealing to use.
Thanks for any feedback and suggestions

As Daniel said, the last approach is essentially equivalent to using built-in ref.
When using ref, you're allocating two objects - the one that you're returning and the reference cell itself. You can reduce this to just a single allocated object by using a concrete implementation (but I don't think this will matter in practice):
type Stateful(initial:int) =
    let mutable state = initial
    interface PP with
        member x.A =
            let ret = state
            state <- state + 1
            ret
let foo =
    Stateful(0) :> PP // Creates a single object that keeps the state as mutable field
As an aside, you are using a read-only property that modifies the internal state of the object and returns a new value each time. This is a dangerous pattern that could be quite confusing: properties with only a getter shouldn't modify state, so you should probably use a method (unit -> int) instead.
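To illustrate the method-based shape, here is a minimal sketch. Counter, Next and makeCounter are hypothetical names chosen for the example, not part of the question; the state lives in a ref cell, accessed through its Value property.

```fsharp
// Hypothetical interface: a method makes the side effect visible at the call site.
type Counter =
    abstract member Next : unit -> int

let makeCounter (start : int) : Counter =
    let state = ref start
    { new Counter with
        member x.Next() =
            // Return the current value, then advance the counter.
            let ret = state.Value
            state.Value <- ret + 1
            ret }
```

A caller then writes counter.Next(), which reads as an action rather than as a plain value lookup.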

Your State class is identical to ref. They're both reference types (you can't capture a mutable value type from an object expression). I would prefer a built-in type when possible. ref is the idiomatic way to represent a heap-allocated mutable value.
If ever in doubt about performance, benchmark it.

How read a file into a seq of lines in F#

This is C# version:
public static IEnumerable<string> ReadLinesEnumerable(string path) {
    using ( var reader = new StreamReader(path) ) {
        var line = reader.ReadLine();
        while ( line != null ) {
            yield return line;
            line = reader.ReadLine();
        }
    }
}
But directly translating needs a mutable variable.

If you're using .NET 4.0, you can just use File.ReadLines.
> let readLines filePath = System.IO.File.ReadLines(filePath);;
val readLines : string -> seq<string>

open System.IO
let readLines (filePath:string) = seq {
    use sr = new StreamReader (filePath)
    while not sr.EndOfStream do
        yield sr.ReadLine ()
}

To answer the question whether there is a library function for encapsulating this pattern - there isn't a function exactly for this, but there is a function that allows you to generate sequence from some state called Seq.unfold. You can use it to implement the functionality above like this:
new StreamReader(filePath) |> Seq.unfold (fun sr ->
    match sr.ReadLine() with
    | null -> sr.Dispose(); None
    | str -> Some(str, sr))
The sr value represents the stream reader and is passed as the state. As long as it gives you non-null values, you can return Some containing an element to generate and the state (which could change if you wanted). When it reads null, we dispose it and return None to end the sequence. This isn't a direct equivalent, because it doesn't properly dispose StreamReader when an exception is thrown.
In this case, I would definitely use sequence expression (which is more elegant and more readable in most of the cases), but it's useful to know that it could be also written using a higher-order function.
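The two styles can also be combined: if the Seq.unfold call is wrapped in a sequence expression that owns the reader through use, disposal no longer depends on reaching the null case. This is just a sketch; readLinesUnfold is a hypothetical name.

```fsharp
open System.IO

// Hypothetical variant: Seq.unfold generates the lines, while the enclosing
// seq { } owns the StreamReader and disposes it even if enumeration stops early.
let readLinesUnfold (filePath : string) : seq<string> =
    seq {
        use sr = new StreamReader(filePath)
        let lines =
            sr |> Seq.unfold (fun reader ->
                match reader.ReadLine() with
                | null -> None
                | line -> Some (line, reader))
        yield! lines
    }
```

Because the reader is created inside the sequence expression, each enumeration opens the file afresh, so the result can be iterated more than once.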

let lines = File.ReadLines(path)
// To check
lines |> Seq.iter(fun x -> printfn "%s" x)

On .NET 2/3 you can do:
let readLines filePath = File.ReadAllLines(filePath) |> Seq.cast<string>
and on .NET 4:
let readLines filePath = File.ReadLines(filePath);;

In order to avoid the "System.ObjectDisposedException: Cannot read from a closed TextReader." exception, use:
let lines = seq { yield! System.IO.File.ReadLines "/path/to/file.txt" }

f# byte[] -> hex -> string conversion

I have byte array as input. I would like to convert that array to string that contains hexadecimal representation of array values. This is F# code:
let ByteToHex bytes =
bytes
|> Array.map (fun (x : byte) -> String.Format("{0:X2}", x))
let ConcatArray stringArray = String.Join(null, (ByteToHex stringArray))
This produces result I need, but I would like to make it more compact so that I have only one function.
I could not find function that would concat string representation of each byte at the end
of ByteToHex.
I tried Array.concat, concat_map, I tried with lists, but the best I could get is array or list of strings.
Questions:
What would be simplest, most elegant way to do this?
Is there string formatting construct in F# so that I can replace String.Format from System assembly?
Example input: [|0x24uy; 0xA1uy; 0x00uy; 0x1Cuy|] should produce string "24A1001C"

There is nothing inherently wrong with your example. If you'd like to get it down to a single expression, use the String.concat function.
let ByteToHex bytes =
    bytes
    |> Array.map (fun (x : byte) -> System.String.Format("{0:X2}", x))
    |> String.concat System.String.Empty
Under the hood, String.concat will just call into String.Join. Your code may have to be altered slightly, though, because based on your sample you open System. This may create a name resolution conflict between the F# String module and System.String.

If you want to transform and accumulate in one step, fold is your answer. sprintf is the F# string format function.
let ByteToHex (bytes:byte[]) =
    bytes |> Array.fold (fun state x-> state + sprintf "%02X" x) ""
This can also be done with a StringBuilder
open System.Text
let ByteToHex (bytes:byte[]) =
    (StringBuilder(), bytes)
    ||> Array.fold (fun state -> sprintf "%02X" >> state.Append)
    |> string
produces:
[|0x24uy; 0xA1uy; 0x00uy; 0x1Cuy|] |> ByteToHex;;
val it : string = "24A1001C"

Here's another answer:
let hashFormat (h : byte[]) =
    let sb = StringBuilder(h.Length * 2)
    let rec hashFormat' = function
        | _ as currIndex when currIndex = h.Length -> sb.ToString()
        | _ as currIndex ->
            sb.AppendFormat("{0:X2}", h.[currIndex]) |> ignore
            hashFormat' (currIndex + 1)
    hashFormat' 0
The upside of this one is that it's tail-recursive and that it pre-allocates the exact amount of space in the string builder as will be required to convert the byte array to a hex-string.
For context, I have it in this module:
module EncodingUtils
open System
open System.Text
open System.Security.Cryptography
open Newtonsoft.Json
let private hmacmd5 = new HMACMD5()
let private encoding = System.Text.Encoding.UTF8
let private enc (str : string) = encoding.GetBytes str
let private json o = JsonConvert.SerializeObject o
let md5 a = a |> (json >> enc >> hmacmd5.ComputeHash >> hashFormat)
Meaning I can pass md5 any object and get back a JSON hash of it.

Here's another. I'm learning F#, so feel free to correct me with more idiomatic ways of doing this:
let bytesToHexString (bytes : byte[]) : string =
    bytes
    |> Seq.map (fun c -> c.ToString("X2"))
    |> Seq.reduce (+)
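One thing to watch for with Seq.reduce is that it throws an ArgumentException on an empty input. A possible variant that returns an empty string instead, using String.concat as in the earlier answer (bytesToHex is a hypothetical name for this sketch):

```fsharp
// Hypothetical variant: String.concat accepts an empty sequence and returns "".
let bytesToHex (bytes : byte[]) : string =
    bytes
    |> Seq.map (fun b -> b.ToString("X2"))
    |> String.concat ""
```

With the example input from the question, bytesToHex [|0x24uy; 0xA1uy; 0x00uy; 0x1Cuy|] gives "24A1001C".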

Looks fine to me. Just to point out another, in my opinion, very helpful function in the Printf module, have a look at ksprintf. It passes the result of a formatted string into a function of your choice (in this case, the identity function).
val ksprintf : (string -> 'd) -> StringFormat<'a,'d> -> 'a
sprintf, but call the given 'final' function to generate the result.
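As a rough illustration of ksprintf with a continuation other than the identity function, the formatted text can be handed straight to a StringBuilder. This is only a sketch; bytesToHexK is a hypothetical name.

```fsharp
open System.Text

// Hypothetical example: ksprintf formats each byte and passes the resulting
// string to the continuation, which appends it to the builder.
let bytesToHexK (bytes : byte[]) : string =
    let sb = StringBuilder(bytes.Length * 2)
    for b in bytes do
        Printf.ksprintf (fun s -> sb.Append(s) |> ignore) "%02X" b
    sb.ToString()
```

For the question's example input this yields "24A1001C".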

To be honest, that doesn't look terrible (although I also have very little F# experience). Does F# offer an easy way to iterate (foreach)? If this was C#, I might use something like (where raw is a byte[] argument):
StringBuilder sb = new StringBuilder();
foreach (byte b in raw) {
    sb.Append(b.ToString("x2"));
}
return sb.ToString();
I wonder how that translates to F#...
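A fairly direct translation might look like the sketch below; F# has for ... in ... do for this kind of iteration. toHexLower is a hypothetical name, and the "x2" format is kept from the C# snippet, so the output is lowercase.

```fsharp
open System.Text

// Hypothetical direct translation of the C# loop above.
let toHexLower (raw : byte[]) : string =
    let sb = StringBuilder(raw.Length * 2)
    for b in raw do
        // Append returns the builder itself, so the result is discarded.
        sb.Append(b.ToString("x2")) |> ignore
    sb.ToString()
```

For the question's example input this produces "24a1001c".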
