As a F# training, I'm trying to decompress a gzip file. Here is the code I wrote :
let decompress (inputStream: Stream) (outputStream : Stream) = async {
use gs = new GZipStream(inputStream, CompressionMode.Decompress)
let buffer = Array.zeroCreate<byte> 4096
let rec decompressInternal() = async {
let! read = gs.ReadAsync(buffer, 0, buffer.Length) |> Async.AwaitTask
match read with
| 0 ->
inputStream.Dispose()
return ()
| _ ->
do! outputStream.WriteAsync(buffer, 0, read) |> Async.AwaitTask |> Async.Ignore
return! decompressInternal()
}
return! decompressInternal()
}
I also wrote a unit test, which fails for the moment (I get an empty string, as if the assertion occurs after the decompress operation) :
[<Fact>]
let ``Decompress a gzipped file`` () =
use fs = new FileStream("Input/test.txt.gz", FileMode.Open)
use ms = new MemoryStream()
decompress fs ms |> Async.RunSynchronously
use sr = new StreamReader(ms)
Assert.Equal ("This is a test", sr.ReadToEnd())
I think I misued one of the F# async capablibilites, but I cannot see my mistake...
Your use of Async is perfectly fine. You just have a MemoryStream whose position is still after the write, so there is nothing to read from there. Use the following before creating the StreamReader to reset the position and it will work:
ms.Seek(0L, SeekOrigin.Begin) |> ignore
Related
I have a bunch of files several MiB in size which are very simple:
They have a size of multiples of 8
They only contain doubles in little endian, so can be read with BinaryReader's ReadDouble() method
When lexicographically sorted, they contain all values in the sequence they need to be.
I can't keep everything in memory as a float list or float array so I need a float seq that goes through the necessary files when actually being accessed. The portion that goes through the sequence actually does it in imperative style using GetEnumerator() because I don't want any resource leaks and want to close all files correctly.
My first functional approach was:
let readFile file =
let rec readReader (maybeReader : BinaryReader option) =
match maybeReader with
| None ->
let openFile() =
printfn "Opening the file"
new BinaryReader(new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.Read))
|> Some
|> readReader
seq { yield! openFile() }
| Some reader when reader.BaseStream.Position >= reader.BaseStream.Length ->
printfn "Closing the file"
reader.Dispose()
Seq.empty
| Some reader ->
reader.BaseStream.Position |> printfn "Reading from position %d"
let bytesToRead = Math.Min(1048576L, reader.BaseStream.Length - reader.BaseStream.Position) |> int
let bytes = reader.ReadBytes bytesToRead
let doubles = Array.zeroCreate<float> (bytesToRead / 8)
Buffer.BlockCopy(bytes, 0, doubles, 0, bytesToRead)
seq {
yield! doubles
yield! readReader maybeReader
}
readReader None
And then, when I have a string list containing all the files, I can say something like:
let values = files |> Seq.collect readFile
use ve = values.GetEnumerator()
// Do stuff that only gets partial data from one file
However, this only closes the files when the reader reaches its end (which is clear when looking at the function). So as a second approach I implemented the file enumerating imperatively:
type FileEnumerator(file : string) =
let reader = new BinaryReader(new FileStream(file, FileMode.Open, FileAccess.Read, FileShare.Read))
let mutable _current : float = Double.NaN
do file |> printfn "Enumerator active for %s"
interface IDisposable with
member this.Dispose() =
reader.Dispose()
file |> printfn "Enumerator disposed for %s"
interface IEnumerator with
member this.Current = _current :> obj
member this.Reset() = reader.BaseStream.Position <- 0L
member this.MoveNext() =
let stream = reader.BaseStream
if stream.Position >= stream.Length then false
else
_current <- reader.ReadDouble()
true
interface IEnumerator<float> with
member this.Current = _current
type FileEnumerable(file : string) =
interface IEnumerable with
member this.GetEnumerator() = new FileEnumerator(file) :> IEnumerator
interface IEnumerable<float> with
member this.GetEnumerator() = new FileEnumerator(file) :> IEnumerator<float>
let readFile' file = new FileEnumerable(file) :> float seq
now, when I say
let values = files |> Seq.collect readFile'
use ve = values.GetEnumerator()
// do stuff with the enumerator
disposing the enumerator correctly bubbles through to my imperative enumerator.
While this is a feasible solution for what I want to achieve (I could make it faster by reading it blockwise like the first functional approach but for brevity I didn't do it here) I wonder if there is a truly functional approach for this avoiding the mutable state in the enumerator.
I don't quite get what you mean when you say that using GetEnumerator() will prevent resource leaks and allow to close all files correctly. The below would be my attempt at this (ignoring block copy part for demonstration purposes) and I think it results in the files properly closed.
let eof (br : BinaryReader) =
br.BaseStream.Position = br.BaseStream.Length
let readFileAsFloats filePath =
seq{
use file = new FileStream(filePath, FileMode.Open, FileAccess.Read, FileShare.Read)
use reader = new BinaryReader(file)
while (not (eof reader)) do
yield reader.ReadDouble()
}
let readFilesAsFloats filePaths =
filePaths |> Seq.collect readFileAsFloats
let floats = readFilesAsFloats ["D:\\floatFile1.txt"; "D:\\floatFile2.txt"]
Is that what you had in mind?
I have the following code
open FSharp.Data
let downloadFile link =
......
use os = File.Create(...)
Http.RequestStream(....).ReponseStream.CopyTo(os)
let rec consume() = async {
......
|> Seq.iter (fun x ->
xxx |> Seq.iter(fun link ->
downloadFile link
))
}
I found that the sync downloading makes the code not run concurrently. So I'm trying to do somthing like the following. How to change it to use the FSharp.Data http AsyncRequestStream? Maybe the CopyTo can be async too?
open FSharp.Data
let downloadFile link = async {
......
use os = File.Create(...)
Http.AsyncRequestStream(....).ReponseStream.CopyTo(os) // Error
}
let rec consume() = async {
......
|> Seq.iter (fun x ->
xxx |> Seq.iter(fun link ->
downloadFile link |> Async.Start // do! downloadFile link????
))
}
consume() |> Async.RunSynchronously
Here's a skeleton solution, worthy of all the blank spots in your example:
let downloadFile link =
async {
......
use os = File.Create(...)
let! resp = Http.AsyncRequestStream(....)
return resp.ReponseStream.CopyTo(os)
}
let consume link =
async {
let comps : Async<unit> [] =
xxx
|> Seq.map (fun link -> downloadFile link)
|> Array.ofSeq
return! Async.Parallel comps
}
I think you should read up on asynchronicity and concurrency in general, as well as how to use it in F# in particular. From the OP it seems the whole thing is a bit hazy to you.
Edit: to answer the question in the comment:
With return! (or let!, or do!) you execute the nested workflow asynchronously, then pick up executing the current workflow from that point. That is, everything "below" the do! is put into a continuation that gets called once the thing "after" the do! finishes.
Whereas Async.Start fires up the workflow on (another) background thread and returns immediately without waiting for it to finish.
Considering the following
type MyClass () =
member x.ReadStreamAsync(stream:Stream) =
async {
let tcs = new TaskCompletionSource<int>()
let buffer = Array.create 2048 0uy
let! bytesReadCount = stream.ReadAsync(buffer, 0, buffer.Length) |> Async.AwaitTask
if bytesReadCount > 0 then
for i in 0..bytesReadCount do
if buffer.[i] = 10uy then
tcs.SetResult(i)
// Omitted more code to handle the case if 10uy is not found..
return tcs.Task
}
The code reads from a stream until in meets a certain character (represented by a byte value) at which point the task returned by the method completes.
The function signature of DoSomethingAsync is unit -> Async<Task<int>>, but I would like it to be unit -> Task<int> such that it can be used more generally in .NET.
Can this be done in F# using an asynchronous expression, or do I can to rely more on the Task constructs of .NET?
Given that you don't actually use the async workflow for anything in your example, the easiest solution would be to forgo it entirely:
member x.DoSomethingAsync() =
let tcs = new TaskCompletionSource<int>()
Task.Delay(100).Wait()
tcs.SetResult(10)
tcs.Task
This implementation of DoSomethingAsync has the type unit -> Task<int>.
It's not clear to me exactly what you're trying to do, but why don't you just do the following?
member x.DoSomethingAsync() =
async {
do! Async.Sleep 100
return 10 } |> Async.StartAsTask
This implementation also has the type unit -> Task<int>.
Based on the updated question, here's a way to do it:
member x.DoSomethingAsync(stream:Stream) =
async {
let buffer = Array.create 2048 0uy
let! bytesReadCount =
stream.ReadAsync(buffer, 0, buffer.Length) |> Async.AwaitTask
if bytesReadCount > 0
then
let res =
[0..bytesReadCount]
|> List.tryFind (fun i -> buffer.[i] = 10uy)
return defaultArg res -1
else return -1
}
|> Async.StartAsTask
The DoSomethingAsync function has the type Stream -> System.Task<int>. I didn't know what to do in the else case, so I just put -1, but I'm sure you can replace it with something more correct.
I am using the csv type provider to collect some data from a series of files I have on Azure blob storage:
#r "../packages/FSharp.Data.2.0.9/lib/portable-net40+sl5+wp8+win8/FSharp.Data.dll"
open FSharp.Data
type censusDataContext = CsvProvider<"https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/AK.TXT">
type stateCodeContext = CsvProvider<"https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv">
let stateCodes = stateCodeContext.Load("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/states.csv");
let fetchStateData (stateCode:string)=
let uri = System.String.Format("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/{0}.TXT",stateCode)
censusDataContext.Load(uri).Rows
let usaData = stateCodes.Rows
|> Seq.collect(fun r -> fetchStateData(r.Abbreviation))
|> Seq.length
I now want to run these async and I am running into a problem with AsyncLoad:
let fetchStateDataAsync(stateCode:string)=
async{
let uri = System.String.Format("https://portalvhdspgzl51prtcpfj.blob.core.windows.net/censuschicken/{0}.TXT",stateCode)
let! stateData = censusDataContext.AsyncLoad(uri)
return stateData.Rows
}
let usaData = stateCodes.Rows
|> Seq.collect(fun r -> fetchStateDataAsync(r.Abbreviation))
|> Seq.length
The error message is
The type 'Async<seq<CsvProvider<...>.Row>>' is not compatible with the type 'seq<'a>'
Forgive my lack of async knowledge, but do I have to use something other than Seq.Collect when applying async functions?
Thanks in advance
The problem is that turning code to asynchronous (by wrapping it in the async { .. } block) changes the result from seq<Row> to Async<seq<Row>> - that is, you now get an asynchronous computation that will eventually complete and return the sequence.
To fix this, you need to somehow start the computation and wait for the result. There is a number of choices - like running one by one sequentially. Probably the easiest option (and maybe the best - depending on what you want to do) is to run the computations in parallel:
let getAll =
stateCodes.Rows
|> Seq.map(fun r -> fetchStateDataAsync(r.Abbreviation))
|> Async.Parallel
This gives you an asynchronous computation that runs all the downloads and returns an array of results. You can run this synchronously (and block) and get the results:
getAll |> Async.RunSynchronously
|> Seq.collect id
|> Seq.length
If you want to run the downloads asynchronously in the background you can do that to, but you need to specify what to do with the result. For example:
async {
let! all = getAll
all |> Seq.collect id |> Seq.length |> printfn "Length %d" }
|> Async.Start
I just finish writing my first F# program. Functionality wise the code works the way I wanted, but not sure if the code is efficient. I would much appreciate if someone could review the code for me and point out the areas where the code can be improved.
Thanks
Sudaly
open System
open System.IO
open System.IO.Pipes
open System.Text
open System.Collections.Generic
open System.Runtime.Serialization
[<DataContract>]
type Quote = {
[<field: DataMember(Name="securityIdentifier") >]
RicCode:string
[<field: DataMember(Name="madeOn") >]
MadeOn:DateTime
[<field: DataMember(Name="closePrice") >]
Price:float
}
let m_cache = new Dictionary<string, Quote>()
let ParseQuoteString (quoteString:string) =
let data = Encoding.Unicode.GetBytes(quoteString)
let stream = new MemoryStream()
stream.Write(data, 0, data.Length);
stream.Position <- 0L
let ser = Json.DataContractJsonSerializer(typeof<Quote array>)
let results:Quote array = ser.ReadObject(stream) :?> Quote array
results
let RefreshCache quoteList =
m_cache.Clear()
quoteList |> Array.iter(fun result->m_cache.Add(result.RicCode, result))
let EstablishConnection() =
let pipeServer = new NamedPipeServerStream("testpipe", PipeDirection.InOut, 4)
let mutable sr = null
printfn "[F#] NamedPipeServerStream thread created, Wait for a client to connect"
pipeServer.WaitForConnection()
printfn "[F#] Client connected."
try
// Stream for the request.
sr <- new StreamReader(pipeServer)
with
| _ as e -> printfn "[F#]ERROR: %s" e.Message
sr
while true do
let sr = EstablishConnection()
// Read request from the stream.
printfn "[F#] Ready to Receive data"
sr.ReadLine()
|> ParseQuoteString
|> RefreshCache
printfn "[F#]Quot Size, %d" m_cache.Count
let quot = m_cache.["MSFT.OQ"]
printfn "[F#]RIC: %s" quot.RicCode
printfn "[F#]MadeOn: %s" (String.Format("{0:T}",quot.MadeOn))
printfn "[F#]Price: %f" quot.Price
In general, you should try using immutable data types and avoid imperative constructs such as global variables and imperative loops - although using them in F# is fine in many cases, they should be used only when there is a good reason for doing so. Here are a couple of examples where you could use functional approach:
First of all, to make the code more functional, you should avoid using global mutable cache. Instead, your RefreshCache function should return the data as the result (preferably using some functional data structure, such as F# Map type):
let PopulateCache quoteList =
quoteList
// Generate a sequence of tuples containing key and value
|> Seq.map (fun result -> result.RicCode, result)
// Turn the sequence into an F# immutable map (replacement for hashtable)
|> Map.ofSeq
The code that uses it would be changed like this:
let cache =
sr.ReadLine()
|> ParseQuoteString
|> PopulateCache
printfn "[F#]Quot Size, %d" m_cache.Count
let quot = m_cache.["MSFT.OQ"]
// The rest of the sample stays the same
In the EstablishConnection function, you definitely don't need to declare a mutable variable sr, because in case of an exception, the function will return null. I would instead use option type to make sure that this case is handled:
let EstablishConnection() =
let pipeServer =
new NamedPipeServerStream("testpipe", PipeDirection.InOut, 4)
printfn "[F#] NamedPipeServerStream thread created..."
pipeServer.WaitForConnection()
printfn "[F#] Client connected."
try // Wrap the result in 'Some' to denote success
Some(new StreamReader(pipeServer))
with e ->
printfn "[F#]ERROR: %s" e.Message
// Return 'None' to denote a failure
None
The main loop can be written using a recursive function that stops when EstablishConnection fails:
let rec loop() =
match EstablishConnection() with
| Some(conn) ->
printfn "[F#] Ready to Receive data"
// rest of the code
loop() // continue looping
| _ -> () // Quit
Just a couple thoughts...
You probably want a 'use' rather than a 'let' in a few places, as I think some of the objects in the program are IDisposable.
You may consider wrapping the EstablishConnection method and the final while loop in async blocks (and make other minor changes), so that e.g. you can wait asynchronously for connections without blocking a thread.
At first glance it is written in imperative style rather than functional style, which does make sense given that most of the program involves side effects (i.e. I/O). Line for line, it almost looks like a C# program.
Given the amount of I/O that is taking place, I don't know that there is much you can do to this particular program to make it more of a functional style of coding.