I am trying to figure out how to manage multiple lazy sequences from a single function in F#.
For example, in the code below, I am trying to get two sequences - one that returns all files in the directories, and one that returns a sequence of tuples of any directories that could not be accessed (for example due to permissions) with the exception.
While the code below compiles and runs, errorSeq never has any elements when used by other code, even though I know that UnauthorizedAccessException exceptions have occurred.
I am using F# 2.0.
#light
open System.IO
open System

let rec allFiles errorSeq dir =
    Seq.append
        (try
            dir |> Directory.GetFiles
         with
            e -> Seq.append errorSeq [|(dir, e)|]
                 |> ignore
                 [||]
        )
        (try
            dir
            |> Directory.GetDirectories
            |> Seq.map (allFiles errorSeq)
            |> Seq.concat
         with
            e -> Seq.append errorSeq [|(dir, e)|]
                 |> ignore
                 Seq.empty
        )

[<EntryPoint>]
let main args =
    printfn "Arguments passed to function : %A" args
    let errorSeq = Seq.empty
    allFiles errorSeq args.[0]
    |> Seq.filter (fun x -> (Path.GetExtension x).ToLowerInvariant() = ".jpg")
    |> Seq.iter Console.WriteLine
    errorSeq
    |> Seq.iter (fun x ->
                     Console.WriteLine("Error")
                     x)
    0
If you wanted to take a more functional approach, here's one way to do it:
let rec allFiles (errorSeq, fileSeq) dir =
    let files, errs =
        try
            Seq.append (dir |> Directory.GetFiles) fileSeq, errorSeq
        with
            e -> fileSeq, Seq.append [dir,e] errorSeq
    let subdirs, errs =
        try
            dir |> Directory.GetDirectories, errs
        with
            e -> [||], Seq.append [dir,e] errs
    Seq.fold allFiles (errs, files) subdirs
Now we pass the sequence of errors and the sequence of files into the function each time and return new sequences created by appending to them within the function. I think that the imperative approach is a bit easier to follow in this case, though.
Seq.append returns a new sequence, so this
    Seq.append errorSeq [|(dir, e)|]
    |> ignore
    [||]
has no effect. Perhaps you want your function to return a tuple of two sequences? Or use some kind of mutable collection to write errors as you encounter them?
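As a rough illustration of the mutable-collection option, here is a minimal sketch. It assumes a ResizeArray (System.Collections.Generic.List) is an acceptable place to record failures; the parameter name errors is my own choice, not something from the question. Because the result is still a lazy sequence, errors is only filled in as the files are actually enumerated.

```fsharp
open System
open System.IO

// Sketch only: 'errors' is a caller-supplied mutable list (an assumption,
// not part of the original code) that records each failing directory.
let rec allFiles (errors: ResizeArray<string * exn>) (dir: string) : seq<string> =
    seq {
        // The try/with sits inside an ordinary 'let', so it is a plain
        // expression and does not wrap any 'yield'.
        let files =
            try
                Directory.GetFiles dir
            with e ->
                errors.Add((dir, e))
                [||]
        yield! files
        let subdirs =
            try
                Directory.GetDirectories dir
            with e ->
                errors.Add((dir, e))
                [||]
        for sub in subdirs do
            yield! allFiles errors sub
    }
```

The caller would create the list first (let errors = ResizeArray()), fully iterate the returned sequence, and only then inspect errors.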
In some languages after one goes through a lazy sequence it becomes exhausted. That is not the case with F#:
let mySeq = seq [1..5]
mySeq |> Seq.iter (fun x -> printfn "%A" <| x)
mySeq |> Seq.iter (fun x -> printfn "%A" <| x)
1
2
3
4
5
1
2
3
4
5
However, it looks like one can go only once through the rows of a CSV provider:
open FSharp.Data

[<Literal>]
let foldr = __SOURCE_DIRECTORY__ + @"\data\"
[<Literal>]
let csvPath = foldr + @"AssetInfoFS.csv"

type AssetsInfo = CsvProvider<Sample=csvPath,
                              HasHeaders=true,
                              ResolutionFolder=csvPath,
                              AssumeMissingValues=false,
                              CacheRows=false>

let assetInfo = AssetsInfo.Load(csvPath)
assetInfo.Rows |> Seq.iter (fun x -> printfn "%A" <| x) // Works fine 1st time
assetInfo.Rows |> Seq.iter (fun x -> printfn "%A" <| x) // 2nd time exception
Why does that happen?
According to the documentation on the CSV Parser, the CSV Type Provider is built on top of the CSV Parser. The CSV Parser works in streaming mode, most likely by calling a method like File.ReadLines, which will throw an exception if the enumerator is enumerated a second time. The CSV Parser also has a Cache method. Try setting CacheRows=true (or leaving it out of the declaration, since its default value is true) to avoid this issue:
CsvProvider<Sample=csvPath,
            HasHeaders=true,
            ResolutionFolder=csvPath,
            AssumeMissingValues=false,
            CacheRows=true>
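The difference between a streaming source and a cached one can be sketched without the type provider. In the example below, streamLines is a hypothetical helper that reads from a shared StringReader as a stand-in for a streaming CSV source; it is not how FSharp.Data is implemented. Note that this stand-in simply comes back empty on the second pass instead of throwing.

```fsharp
open System.IO

// Hypothetical stand-in for a streaming row source: every enumeration
// pulls from the same reader, whose position is never rewound.
let streamLines (reader: StringReader) : seq<string> =
    Seq.unfold (fun () ->
        match reader.ReadLine() with
        | null -> None
        | line -> Some (line, ())) ()

let streamed = streamLines (new StringReader("a\nb\nc"))
let firstPass = Seq.toList streamed     // consumes the reader
let secondPass = Seq.toList streamed    // reader is already at the end

// Seq.cache remembers the items seen during the first enumeration.
let cached = streamLines (new StringReader("a\nb\nc")) |> Seq.cache
let cachedFirst = Seq.toList cached     // reads and stores the rows
let cachedSecond = Seq.toList cached    // replayed from the cache
```

Here firstPass holds the three lines and secondPass is empty, while both cached passes hold all three lines. That is roughly the trade-off CacheRows controls: memory for the stored rows in exchange for being able to iterate more than once.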
The sequence iterator stays put where you point it; after the first loop, that is the end of the sequence.
If you want it to go back to the beginning, you have to set it there.
I am currently working on a beginner's project to implement my own duplicate file finder. This is my first time working with a .NET language, so I am still extremely unfamiliar with .NET APIs.
Here is the code that I have written so far:
open System
open System.IO
open System.Collections // needed for the non-generic Hashtable
open System.Collections.Generic

let directory = @somePath

let getAllFiles (directory : string) =
    Directory.GetFiles(directory)

let getFileInfo (directory : string) =
    directory
    |> getAllFiles
    |> Seq.map (fun eachFile -> (eachFile, new FileInfo(eachFile)))

let getFileLengths (directory: string) =
    directory
    |> getFileInfo
    |> Seq.map (fun (eachFile, eachFileInfo : FileInfo) -> (eachFile, eachFileInfo.Length))

// If two files have the same lengths, they might be duplicates of each other.
let groupByFileLengths (directory: string) =
    directory
    |> getFileLengths
    |> Seq.groupBy snd
    |> Seq.map (fun (fileLength, files) -> fileLength, files |> Seq.map fst |> List.ofSeq)

let findGroupsOfTwoOrMore (directory: string) =
    directory
    |> groupByFileLengths
    |> Seq.filter (snd >> List.length >> (<>) 1)

let constructHashtable (someTuple) =
    let hashtable = new Hashtable()
    someTuple
    |> Seq.iter hashtable.Add
    hashtable

let readAllBytes (tupleOfFileLengthsAndFiles) =
    tupleOfFileLengthsAndFiles
    |> snd
    |> Seq.map (fun eachFile -> (File.ReadAllBytes eachFile, eachFile))
    |> constructHashtable
What I want to do is to construct a hash table with the byte array of each file as the key, and the file name itself as the value. If multiple files with different file names share the same byte array, then they are duplicates, and my goal is to remove the duplicate files.
I have looked through the Hashtable class documentation on MSDN, but there is no method for identifying hashtable keys that contain multiple values.
Edit: Here is my attempt at implementing MD5:
let readAllBytesMD5 (tupleOfFileLengthsAndFiles) =
    let md5 = MD5.Create()
    tupleOfFileLengthsAndFiles
    |> snd
    |> Seq.map (fun eachFile -> (File.ReadAllBytes eachFile, eachFile))
    |> Seq.map (fun (byteArray, eachFile) -> (md5.ComputeHash(byteArray), eachFile))
    |> Seq.map (fun (hashCode, eachFile) -> (hashCode.ToString, eachFile))
Please advise on how I may improve and continue, because I am stuck here due to not having a firm grasp of how MD5 works. Thank you.
Hashtable doesn't support multiple values for the same key: you'll get an exception when you try to add a second entry with the same key. It is also untyped, so you should almost always prefer a typed mutable System.Collections.Generic.Dictionary or an immutable F# Map.
What you're looking for is a Map<byte array, Set<string>>. Here's my take on it:
let buildMap (paths: string array) =
    paths
    |> Seq.map (fun eachFile -> (File.ReadAllBytes eachFile, eachFile))
    |> Seq.groupBy fst
    |> Seq.map (fun (key, items) ->
        key, items |> Seq.map snd |> Set.ofSeq)
    |> Map.ofSeq
As an aside, unless you're comparing very, very small files, using the entire file contents as a key won't get you very far. You will probably want to look into generating checksums for those files and using them instead.
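To make the checksum idea concrete, here is a minimal sketch that groups files by an MD5 digest rendered as a hex string. The names md5Hex and findDuplicates are hypothetical, and MD5 is used only because the question mentions it; any hash from System.Security.Cryptography could be swapped in. Since different files can in principle share a digest, a byte-for-byte comparison within each group is a sensible final check before deleting anything.

```fsharp
open System
open System.IO
open System.Security.Cryptography

// Hypothetical helper: hex-encoded MD5 digest of a file's contents.
// Hashing the stream avoids loading the whole file into memory.
let md5Hex (path: string) : string =
    use md5 = MD5.Create()
    use stream = File.OpenRead path
    md5.ComputeHash(stream)
    |> Array.map (fun b -> b.ToString("x2"))
    |> String.concat ""

// Groups paths by digest and keeps only digests shared by two or more files.
let findDuplicates (paths: seq<string>) : Map<string, string list> =
    paths
    |> Seq.groupBy md5Hex
    |> Seq.map (fun (hash, files) -> hash, List.ofSeq files)
    |> Seq.filter (fun (_, files) -> List.length files > 1)
    |> Map.ofSeq
```

Note that calling ToString on a byte array directly (as in the attempt in the question) yields the type name rather than the digest, which is why each byte is formatted individually here.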
I wrote this function which merges two lists together but as I'm fairly new to functional programming I was wondering whether there is a better (simpler) way to do it?
let a = ["a"; "b"; "c"]
let b = ["d"; "b"; "a"]

let merge a b =
    // take all a and add b
    List.fold (fun acc elem ->
        let alreadyContains = acc |> List.exists (fun item -> item = elem)
        if alreadyContains = true then
            acc
        else
            elem :: acc |> List.rev
        ) b a

let test = merge a b
The expected result is ["a"; "b"; "c"; "d"]; I'm reversing the list in order to keep the original order. I thought I would be able to achieve the same using List.foldBack (and dropping List.rev), but it results in an error:
Type mismatch. Expecting a
'a
but given a
'a list
The resulting type would be infinite when unifying ''a' and ''a list'
Why is there a difference when using foldBack?
You could use something like the following
let merge a b =
    a @ b
    |> Seq.distinct
    |> List.ofSeq
Note that this will preserve order and remove any duplicates.
In F# 4.0 this will be simplified to
let merge a b = a @ b |> List.distinct
If I wanted to write this in a way that is similar to your original version (using fold), then the main change I would make is to move List.rev outside of the function (you are calling List.rev every time you add a new element, which is wrong if you're adding an even number of elements!)
So, a solution very similar to yours would be:
let merge a b =
    (b, a)
    ||> List.fold (fun acc elem ->
        let alreadyContains = acc |> List.exists (fun item -> item = elem)
        if alreadyContains = true then acc
        else elem :: acc)
    |> List.rev
This uses the double-pipe operator ||> to pass two parameters to the fold function (this is not necessary, but I find it a bit nicer) and then passes the result to List.rev.
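On the foldBack part of the question: List.fold takes a folder of the form (fun acc elem -> ...) followed by the state and then the list, whereas List.foldBack takes (fun elem acc -> ...) followed by the list and then the state. Reusing the fold lambda unchanged therefore makes the compiler try to unify an element with a list, which is the "infinite type" error. A possible foldBack version is sketched below; the helper name folder is my own, and unlike Seq.distinct it keeps the last occurrence if b itself contains repeats.

```fsharp
let merge a b =
    // foldBack hands the folder the element first and the accumulator second.
    let folder elem acc =
        if List.exists ((=) elem) a || List.exists ((=) elem) acc then acc
        else elem :: acc
    // Walk b from right to left, keeping only what is not already in a,
    // then put those leftovers after a so the original order is preserved.
    a @ List.foldBack folder b []

let test = merge ["a"; "b"; "c"] ["d"; "b"; "a"]
```

With the lists from the question, test evaluates to ["a"; "b"; "c"; "d"], and no List.rev is needed because foldBack builds the result from the right.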
I have written a function like this
let GetAllDirectAssignmentsforLists (spWeb : SPWeb) =
    spWeb.Lists
    |> Seq.cast<SPList>
    |> Seq.filter(fun l -> l.HasUniqueRoleAssignments)
    |> Seq.collect (fun l -> l.RoleAssignments
                             |> Seq.cast<SPRoleAssignment>
                             |> Seq.map(fun ra -> ra.Member)
                   )
    |> Seq.filter (fun p -> p.GetType().Name = "SPUser")
    |> Seq.map(fun m -> m.LoginName.ToLower())
I want to return a tuple which contains the list name (taken from l.Title in the second stage of the pipe) and the m.LoginName.ToLower().
Is there a clean way for me to get something from the earlier pipe elements?
One way, of course, would be to tuple the return value in the 2nd stage of the pipe and then pass the Title all the way down, but that would pollute the code: all subsequent stages would then have to accept and return tuple values just so that the last stage can get the value.
I wonder if there is a clean and easy way.
Also, in stage 4 of the pipeline (fun p -> p.GetType().Name = "SPUser"), could I use if here to compare the types, rather than converting the type name to a string and then matching strings?
We exploit the fact that Seq.filter and Seq.map can be pushed inside Seq.collect without changing the results. In this case, l is still available to access.
The last filter function is also more idiomatically written with the type test operator :?.
let GetAllDirectAssignmentsforLists(spWeb: SPWeb) =
    spWeb.Lists
    |> Seq.cast<SPList>
    |> Seq.filter (fun l -> l.HasUniqueRoleAssignments)
    |> Seq.collect (fun l -> l.RoleAssignments
                             |> Seq.cast<SPRoleAssignment>
                             |> Seq.map (fun ra -> ra.Member)
                             |> Seq.filter (fun p -> match box p with
                                                     | :? SPUser -> true
                                                     | _ -> false)
                             |> Seq.map (fun m -> l.Title, m.LoginName.ToLower()))
To simplify further, you could change the series of Seq.map and Seq.filter to Seq.choose:
    Seq.choose (fun ra -> match box ra.Member with
                          | :? SPUser -> Some (l.Title, ra.Member.LoginName.ToLower())
                          | _ -> None)
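Since the SharePoint assemblies are not always at hand, the same collect-plus-choose shape can be tried out with stand-in types. Everything below (Principal, User, Group and the sample data) is made up for illustration and only mimics the shape of SPPrincipal and SPUser.

```fsharp
// Hypothetical stand-ins for the SharePoint principal types.
type Principal(login: string) =
    member this.LoginName = login

type User(login: string) =
    inherit Principal(login)

type Group(login: string) =
    inherit Principal(login)

// Each "list" is a title plus its members.
let lists : (string * Principal list) list =
    [ "Docs",  [ User "ALICE" :> Principal; Group "Admins" :> Principal ]
      "Tasks", [ User "Bob" :> Principal ] ]

let userLogins =
    lists
    |> Seq.collect (fun (title, members) ->
        members
        // choose filters and maps in one step; title is still in scope here.
        |> Seq.choose (fun m ->
            match m with
            | :? User -> Some (title, m.LoginName.ToLower())
            | _ -> None))
    |> List.ofSeq
```

userLogins evaluates to [("Docs", "alice"); ("Tasks", "bob")]: the group is dropped by the type test, and each login is paired with the title of the list it came from.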
While you can solve the problem by lifting the rest of the computation inside collect, I think that you could make the code more readable by using sequence expressions instead of pipelining.
I could not run the code to test it, but this should be equivalent:
let GetAllDirectAssignmentsforLists (spWeb : SPWeb) = seq {
    // Corresponds to your 'filter' and 'collect'
    for l in Seq.cast<SPList> spWeb.Lists do
        if l.HasUniqueRoleAssignments then
            // Corresponds to nested 'map' and 'filter'
            for ra in Seq.cast<SPRoleAssignment> l.RoleAssignments do
                let m = ra.Member
                if m.GetType().Name = "SPUser" then
                    // This implements the last 'map' operation
                    yield l.Title, m.LoginName.ToLower() }
The code above corresponds more closely to the version by @pad than to your original code, because the rest of the computation is nested under for (which corresponds to nesting under collect) and so you can see all variables that are already in scope - like l which you need.
The nice thing about sequence expressions is that you can use F# constructs like if (instead of filter), for (instead of collect) etc. Also, I think it is more suitable for writing nested operations (which you need here to keep variables in scope), because it remains quite readable and keeps familiar code structure.
In the following code Seq.generateUnique is constrained to be of type ((Assembly -> seq<Assembly>) -> seq<Assembly> -> seq<Assembly>).
open System
open System.Collections.Generic
open System.Reflection

module Seq =
    let generateUnique =
        let known = HashSet()
        fun f initial ->
            let rec loop items =
                seq {
                    let cachedSeq = items |> Seq.filter known.Add |> Seq.cache
                    if not (cachedSeq |> Seq.isEmpty) then
                        yield! cachedSeq
                        yield! loop (cachedSeq |> Seq.collect f)
                }
            loop initial

let discoverAssemblies() =
    AppDomain.CurrentDomain.GetAssemblies() :> seq<_>
    |> Seq.generateUnique (fun asm -> asm.GetReferencedAssemblies() |> Seq.map Assembly.Load)

let test() = printfn "%A" (discoverAssemblies() |> Seq.truncate 2 |> Seq.map (fun asm -> asm.GetName().Name) |> Seq.toList)
for _ in 1 .. 5 do test()
System.Console.Read() |> ignore
I'd like it to be generic, but putting it into a file apart from its usage yields a value restriction error:
Value restriction. The value
'generateUnique' has been inferred to
have generic type val
generateUnique : (('_a -> '_b) -> '_c
-> seq<'_a>) when '_b :> seq<'_a> and '_c :> seq<'_a> Either make the
arguments to 'generateUnique' explicit
or, if you do not intend for it to be
generic, add a type annotation.
Adding an explicit type parameter (let generateUnique<'T> = ...) eliminates the error, but now it returns different results.
Output without type parameter (desired/correct behavior):
["mscorlib"; "TEST"]
["FSharp.Core"; "System"]
["System.Core"; "System.Security"]
[]
[]
And with:
["mscorlib"; "TEST"]
["mscorlib"; "TEST"]
["mscorlib"; "TEST"]
["mscorlib"; "TEST"]
["mscorlib"; "TEST"]
Why does the behavior change? How could I make the function generic and achieve the desired behavior?
generateUnique is a lot like the standard memoize pattern: it should be used to calculate memoized functions from normal functions, not do the actual caching itself.
@kvb was right about the change in the definition required for this shift, but then you need to change the definition of discoverAssemblies as follows:
let discoverAssemblies =
    //"memoize"
    let generator = Seq.generateUnique (fun (asm:Assembly) -> asm.GetReferencedAssemblies() |> Seq.map Assembly.Load)
    fun () ->
        AppDomain.CurrentDomain.GetAssemblies() :> seq<_>
        |> generator
I don't think that your definition is quite correct: it seems to me that f needs to be a syntactic argument to generateUnique (that is, I don't believe that it makes sense to use the same HashSet for different fs). Therefore, a simple fix is:
let generateUnique f =
    let known = HashSet()
    fun initial ->
        let rec loop items =
            seq {
                let cachedSeq = items |> Seq.filter known.Add |> Seq.cache
                if not (cachedSeq |> Seq.isEmpty) then
                    yield! cachedSeq
                    yield! loop (cachedSeq |> Seq.collect f)
            }
        loop initial
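As for why adding <'T> changes the output: as far as I understand it, a let-bound value with an explicit type parameter is treated as a type function, so its body runs again every time the name is used, and each use gets a brand-new HashSet. The cut-down sketch below shows the same effect with two hypothetical definitions, addOnce and addFresh, that are not part of the original code.

```fsharp
open System.Collections.Generic

// An ordinary value: the HashSet is created once, when the binding is
// initialised, and every call shares it.
let addOnce : int -> bool =
    let known = HashSet<int>()
    fun x -> known.Add x

// With an explicit type parameter the binding is re-evaluated at each
// use, so every mention of addFresh builds its own empty HashSet.
let addFresh<'T> : 'T -> bool =
    let known = HashSet<'T>()
    fun x -> known.Add x

let sharedResults = [ addOnce 1; addOnce 1 ]    // second Add sees the first
let freshResults = [ addFresh 1; addFresh 1 ]   // each Add hits a new set
```

sharedResults is [true; false], while freshResults is [true; true]. That matches the two outputs in the question: without the type parameter the known set persists across calls to test, and with it the set is empty again on every call.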