F#: How to identify keys that have multiple values in a HashTable?

I am currently working on a beginner's project to implement my own duplicate file finder. This is my first time working with a .NET language, so I am still extremely unfamiliar with .NET APIs.
Here is the code that I have written so far:
open System
open System.IO
open System.Collections.Generic
let directory = #somePath
let getAllFiles (directory : string) =
    Directory.GetFiles(directory)
let getFileInfo (directory : string) =
    directory
    |> getAllFiles
    |> Seq.map (fun eachFile -> (eachFile, new FileInfo(eachFile)))
let getFileLengths (directory: string) =
    directory
    |> getFileInfo
    |> Seq.map (fun (eachFile, eachFileInfo : FileInfo) -> (eachFile, eachFileInfo.Length))
// If two files have the same lengths, they might be duplicates of each other.
let groupByFileLengths (directory: string) =
    directory
    |> getFileLengths
    |> Seq.groupBy snd
    |> Seq.map (fun (fileLength, files) -> fileLength, files |> Seq.map fst |> List.ofSeq)
let findGroupsOfTwoOrMore (directory: string) =
    directory
    |> groupByFileLengths
    |> Seq.filter (snd >> List.length >> (<>) 1)
let constructHashtable (someTuple) =
    let hashtable = new Hashtable()
    someTuple
    |> Seq.iter hashtable.Add
    hashtable
let readAllBytes (tupleOfFileLengthsAndFiles) =
    tupleOfFileLengthsAndFiles
    |> snd
    |> Seq.map (fun eachFile -> (File.ReadAllBytes eachFile, eachFile))
    |> constructHashtable
What I want to do is to construct a hash table with the byte array of each file as the key, and the file name itself as the value. If multiple files with different file names share the same byte array, then they are duplicates, and my goal is to remove the duplicate files.
I have looked through the Hashtable class documentation on MSDN, but there is no method for identifying hashtable keys that contain multiple values.
Edit: Here is my attempt at implementing MD5:
let readAllBytesMD5 (tupleOfFileLengthsAndFiles) =
    let md5 = MD5.Create()
    tupleOfFileLengthsAndFiles
    |> snd
    |> Seq.map (fun eachFile -> (File.ReadAllBytes eachFile, eachFile))
    |> Seq.map (fun (byteArray, eachFile) -> (md5.ComputeHash(byteArray), eachFile))
    |> Seq.map (fun (hashCode, eachFile) -> (hashCode.ToString, eachFile))
Please advise on how I may improve and continue, because I am stuck here due to not having a firm grasp of how MD5 works. Thank you.

Hashtable doesn't support multiple values for the same key: you'll get an exception when you try to add a second entry with the same key. It is also untyped, so you should almost always prefer a typed, mutable System.Collections.Generic.Dictionary or an immutable F# Map.
What you're looking for is a Map<byte array, Set<string>>. Here's my take on it:
let buildMap (paths: string array) =
    paths
    |> Seq.map (fun eachFile -> (File.ReadAllBytes eachFile, eachFile))
    |> Seq.groupBy fst
    |> Seq.map (fun (key, items) ->
        key, items |> Seq.map snd |> Set.ofSeq)
    |> Map.ofSeq
As an aside, unless you're comparing very, very small files, using the entire file contents as a key won't get you very far. You will probably want to look into generating checksums for those files and using them instead.
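The checksum idea can be sketched as follows. This is only an illustration, not the answerer's code: the names fileHash and findDuplicates are made up for this example, and SHA-256 is used as the checksum (MD5.Create() from the question could be swapped in the same way). Each file is hashed from a stream, so the whole content never has to sit in memory as a key, and files are then grouped by the hex string of the digest.

```fsharp
open System
open System.IO
open System.Security.Cryptography

// Hypothetical helper: hex string of the SHA-256 digest of a file's contents.
let fileHash (path: string) : string =
    use sha = SHA256.Create()
    use stream = File.OpenRead path
    BitConverter.ToString(sha.ComputeHash(stream))

// Hypothetical helper: groups paths by checksum and keeps only the groups
// that contain more than one file, i.e. the likely duplicates.
let findDuplicates (paths: seq<string>) : (string * string list) list =
    paths
    |> Seq.groupBy fileHash
    |> Seq.map (fun (hash, files) -> hash, List.ofSeq files)
    |> Seq.filter (fun (_, files) -> List.length files > 1)
    |> List.ofSeq
```

Equal checksums make two files very likely, but not provably, identical, so a byte-by-byte comparison before deleting anything is a reasonable extra safety step.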

Related

F#.Data HTML Parser Extracting Strings From Nodes

I am trying to use FSharp.Data's HTML Parser to extract a list of link strings from href attributes.
I can get the links printed out to the console; however, I'm struggling to get them into a list.
Here is a working snippet that prints the wanted links:
let results = HtmlDocument.Load(myUrl)
let links =
    results.Descendants("td")
    |> Seq.filter (fun x -> x.HasClass("pagenav"))
    |> Seq.map (fun x -> x.Elements("a"))
    |> Seq.iter (fun x -> x |> Seq.iter (fun y -> y.AttributeValue("href") |> printf "%A"))
How do I store those strings in the variable links instead of printing them out?
Cheers,
On the very last line, you end up with a sequence of sequences - for each td.pagenav you have a bunch of <a>, each of which has a href. That's why you have to have two nested Seq.iters - first you iterate over the outer sequence, and on each iteration you iterate over the inner sequence.
To flatten a sequence of sequences, use Seq.collect. Further, to convert a sequence to a list, use Seq.toList or List.ofSeq (they're equivalent):
let a = [ [1;2;3]; [4;5;6] ]
let b = a |> Seq.collect id |> Seq.toList
> val b : int list = [1; 2; 3; 4; 5; 6]
Applying this to your code:
let links =
    results.Descendants("td")
    |> Seq.filter (fun x -> x.HasClass("pagenav"))
    |> Seq.map (fun x -> x.Elements("a"))
    |> Seq.collect (fun x -> x |> Seq.map (fun y -> y.AttributeValue("href")))
    |> Seq.toList
Or you could make it a bit cleaner by applying Seq.collect at the point where you first encounter a nested sequence:
let links =
    results.Descendants("td")
    |> Seq.filter (fun x -> x.HasClass("pagenav"))
    |> Seq.collect (fun x -> x.Elements("a"))
    |> Seq.map (fun y -> y.AttributeValue("href"))
    |> Seq.toList
That said, I would rather rewrite this as a list comprehension. Looks even cleaner:
let links = [ for td in results.Descendants "td" do
                if td.HasClass "pagenav" then
                  for a in td.Elements "a" ->
                    a.AttributeValue "href"
            ]

FAKE: Get all projects referenced by a solution file

How do I get hold of the projects that are referenced by a solution file?
Here I have a concrete use case. I have stolen this target CopyBinaries from ProjectScaffold. It copies the output of the project builds into a separate folder. It is not very choosy and copies the output of every project it finds.
Target "CopyBinaries" (fun _ ->
    !! "src/**/*.??proj"
    -- "src/**/*.shproj"
    |> Seq.map (fun f ->
        ((System.IO.Path.GetDirectoryName f) </> "bin/Release",
         binDir </> (System.IO.Path.GetFileNameWithoutExtension f)))
    |> Seq.iter (fun (fromDir, toDir) ->
        CopyDir toDir fromDir (fun _ -> true))
)
What if I want to copy only the output of projects that are referenced explicitly in a solution file? I am thinking of something like this:
Target "CopyBinaries" (fun _ ->
    !! solutionFile
    |> GetProjectFiles
    |> Seq.map (fun f ->
        ((System.IO.Path.GetDirectoryName f) </> "bin/Release",
         binDir </> (System.IO.Path.GetFileNameWithoutExtension f)))
    |> Seq.iter (fun (fromDir, toDir) ->
        CopyDir toDir fromDir (fun _ -> true))
)
The function GetProjectFiles takes a solution file and extracts the referenced project files.
Is there anything like this hypothetical function in FAKE available?
There is nothing that I have found which gives this kind of information about solution files out of the box. With a small number of functions, an alternative is possible:
let root = directoryInfo "."
let solutionReferences baseDir = baseDir |> filesInDirMatching "*.sln" |> Seq.ofArray
let solutionNames paths = paths |> Seq.map (fun (f:System.IO.FileInfo) -> f.FullName)
let projectsInSolution solutions = solutions |> Seq.collect ReadFile |> Seq.filter (fun line -> line.StartsWith("Project"))
let projectNames projects = projects |> Seq.map (fun (line:string) -> (line.Split [|','|]).[1])
Target "SLN" ( fun () -> root |> solutionReferences |> solutionNames |> projectsInSolution |> projectNames |> Seq.iter (printfn "%s"))
These functions can be consolidated or regrouped in any way you like. You may have already found this or another solution, but it is good for everyone to know that there are options. Thank you. Good day.
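Note that splitting a solution line on commas leaves the surrounding space and double quotes around the path. A minimal sketch of the clean-up in plain F#, without any FAKE helpers, could look like this. The names projectPath and projectPaths are hypothetical, and the filter on a "proj" suffix is an assumption made here to skip solution folders, which also appear as Project lines.

```fsharp
open System

// Hypothetical helper: extracts the project path from one line of a .sln file.
// A project line looks like:
//   Project("{type-guid}") = "Name", "relative\path\Name.fsproj", "{project-guid}"
let projectPath (line: string) : string option =
    if line.StartsWith("Project(", StringComparison.Ordinal) then
        let parts = line.Split([| ',' |])
        if parts.Length >= 2 then
            // Drop the surrounding whitespace and double quotes.
            Some (parts.[1].Trim().Trim('"'))
        else
            None
    else
        None

// Hypothetical helper: keeps only entries that look like project files
// (assumption: their extension ends in "proj", e.g. .csproj or .fsproj).
let projectPaths (lines: seq<string>) : string list =
    lines
    |> Seq.choose projectPath
    |> Seq.filter (fun p -> p.EndsWith("proj", StringComparison.OrdinalIgnoreCase))
    |> List.ofSeq
```

The resulting paths are relative to the directory of the solution file, so they would still need to be combined with that directory before being passed to something like CopyDir.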

XML scan from C# to F#

I am trying to learn F#, and I tried to reimplement the following C# function in F#:
private string[] GetSynonyms(string synonyms)
{
    var items = Enumerable.Repeat(synonyms, 1)
        .Where(s => s != null)
        .Select(XDocument.Parse)
        .Select(doc => doc.Root)
        .Where(root => root != null)
        .SelectMany(e => e.Elements(SynonymsNamespace + "synonym"))
        .Select(e => e.Value)
        .ToArray();
    return items;
}
I got this far by myself
let xname = XNamespace.Get "http://localuri"
let syn = "<synonyms xmlns=\"http://localuri\"><synonym>a word</synonym><synonym>another word</synonym></synonyms>"
let synonyms str =
    let items = [str]
    items
    |> List.map System.Xml.Linq.XDocument.Parse
    |> List.map (fun x -> x.Root)
    |> List.map (fun x -> x.Elements(xname + "synonym") |> Seq.cast<System.Xml.Linq.XElement>)
    |> Seq.collect (fun x -> x)
    |> Seq.map (fun x -> x.Value)
let a = synonyms syn
Dump a
Now I'm wondering if there is a more-functional way to write the same code.
By extracting the access to the properties to standalone functions I got this version
let xname = XNamespace.Get "http://localuri"
let syn = "<synonyms xmlns=\"http://localuri\"><synonym>a word</synonym><synonym>another word</synonym></synonyms>"
let getRoot (doc:System.Xml.Linq.XDocument) = doc.Root
let getValue (element:System.Xml.Linq.XElement) = element.Value
let getElements (element:System.Xml.Linq.XElement) =
    element.Elements(xname + "synonym")
    |> Seq.cast<System.Xml.Linq.XElement>
let synonyms str =
    let items = [str]
    items
    |> List.map System.Xml.Linq.XDocument.Parse
    |> List.map getRoot
    |> List.map getElements
    |> Seq.collect (fun x -> x)
    |> Seq.map getValue
let a = synonyms syn
Dump a
But I still have a few concerns
Can I rewrite that Seq.collect (fun x -> x) in another way? It looks redundant.
Can I remove all those (fun x -> x.Property) lambdas without creating new functions?
How do I actually return an array and not a Seq<'a> (which I understand is an IEnumerable<'a>)?
Thanks
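Seq.collect (fun x -> x) can be rewritten with the predefined id function as Seq.collect id.
The (fun x -> x.Property) lambdas cannot be removed in general: as of F# 4.0, only constructors can be used as first-class functions, not property getters.
To return an array, use Seq.toArray (or Seq.toList for a list).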
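Put together, these points might look like the sketch below. It is one possible arrangement rather than the definitive translation: it keeps the null checks of the C# original, flattens with Seq.collect directly instead of mapping and then collecting with id, and ends with Seq.toArray. It assumes a script host where System.Xml.Linq is already referenced (older F# Interactive versions need #r "System.Xml.Linq" first).

```fsharp
open System.Xml.Linq

let ns = XNamespace.Get "http://localuri"

// Returns the values of all <synonym> children of the root element,
// or an empty array when the input is null or has no root.
let synonyms (xml: string) : string[] =
    [ xml ]
    |> Seq.filter (fun s -> not (isNull s))
    |> Seq.map (fun s -> XDocument.Parse s)
    |> Seq.map (fun doc -> doc.Root)
    |> Seq.filter (fun root -> not (isNull root))
    // Seq.collect maps and flattens in one step, so no separate
    // Seq.collect id is needed afterwards.
    |> Seq.collect (fun root -> root.Elements(ns + "synonym"))
    |> Seq.map (fun e -> e.Value)
    |> Seq.toArray
```

Elements already returns a sequence of XElement, so the Seq.cast from the question is not needed here.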
Would it be very wrong to drop the C# code and go all in with the XML type provider in F#? In my world it's always wrong to parse XML by hand when other solutions exist (unless I'm trying to create octagonal wheels or moist gunpowder that others have already made better before me).
In this regard I would even have used a transformation (XSLT) or a selection language (XPath/XQuery), unless I could use the XmlProvider or an XSD (in C#) to generate code.
If for some reason the XML is so unstructured that you actually need parsing, then the XML is arguably wrong...
If you use the XmlProvider, you get namespacing, types etc. for free...
#r @"..\correct\this\path\to\packages\FSharp.Data.2.2.5\lib\net40\FSharp.Data.dll"
#r "System.Xml.Linq"
open FSharp.Data
[<Literal>]
let syn = "<synonyms xmlns=\"http://localuri\"><synonym>a word</synonym><synonym>another word</synonym></synonyms>"
type Synonyms = XmlProvider<syn>
let a = Synonyms.GetSample()
a.Synonyms |> Seq.iter (printfn "%A")
Note that the XmlProvider can also take a file or a URL as the sample for inferring types, and that you can keep the sample above and then use
let a = Synonyms.Load(stuff)
where stuff is a stream, a TextReader or a URI whose content is read and typed according to your sample. The sample and stuff might even point to the same file or URI if the data lives in a standard location.
See also: http://fsharp.github.io/FSharp.Data/library/XmlProvider.html

GroupBy Year then take Pairwise diffs except for the head value then Flatten Using Deedle and F#

I have the following variable:
data:seq<(DateTime*float)>
and I want to do something like the following F# code but using Deedle:
data
|> Seq.groupBy (fun (k,v) -> k.Year)
|> Seq.map (fun (k,v) ->
    let vals = v |> Seq.pairwise
    let first = seq { yield v |> Seq.head }
    let diffs = vals |> Seq.map (fun ((t0,v0),(t1,v1)) -> (t1, v1 - v0))
    (k, diffs |> Seq.append first))
|> Seq.collect snd
This works fine using F# sequences but I want to do it using Deedle series. I know I can do something like:
(data: Series<DateTime,float>) |> Series.groupBy (fun k v -> k.Year)...
But then I need to take the within-year differences, except for the head value, which should just be the value itself, and then flatten the results into one series. I am a bit confused by the Deedle syntax.
Thanks!
I think the following might be doing what you need:
ts
|> Series.groupInto
    (fun k _ -> k.Month)
    (fun m s ->
        let first = series [ fst s.KeyRange => s.[fst s.KeyRange]]
        Series.merge first (Series.diff 1 s))
|> Series.values
|> Series.mergeAll
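The groupInto function lets you specify a function that should be called on each of the groups. Note that the snippet groups by k.Month; for the yearly grouping asked about in the question, use k.Year as the key selector.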
For each group, we create series with the differences using Series.diff and append a series with the first value at the beginning using Series.merge.
At the end, we get all the nested series & flatten them using Series.mergeAll.

F# Manage multiple lazy sequences from a single method?

I am trying to figure out how to manage multiple lazy sequences from a single function in F#.
For example, in the code below, I am trying to get two sequences - one that returns all files in the directories, and one that returns a sequence of tuples of any directories that could not be accessed (for example due to permissions) with the exception.
While the below code compiles and runs, errorSeq never has any elements when used by other code, even though I know that UnauthorizedAccess exceptions have occurred.
I am using F# 2.0.
#light
open System.IO
open System
let rec allFiles errorSeq dir =
    Seq.append
        (try
            dir |> Directory.GetFiles
         with
            e -> Seq.append errorSeq [|(dir, e)|]
                 |> ignore
                 [||]
        )
        (try
            dir
            |> Directory.GetDirectories
            |> Seq.map (allFiles errorSeq)
            |> Seq.concat
         with
            e -> Seq.append errorSeq [|(dir, e)|]
                 |> ignore
                 Seq.empty
        )
[<EntryPoint>]
let main args =
    printfn "Arguments passed to function : %A" args
    let errorSeq = Seq.empty
    allFiles errorSeq args.[0]
    |> Seq.filter (fun x -> (Path.GetExtension x).ToLowerInvariant() = ".jpg")
    |> Seq.iter Console.WriteLine
    errorSeq
    |> Seq.iter (fun x ->
        Console.WriteLine("Error")
        x)
    0
If you wanted to take a more functional approach, here's one way to do it:
let rec allFiles (errorSeq, fileSeq) dir =
    let files, errs =
        try
            Seq.append (dir |> Directory.GetFiles) fileSeq, errorSeq
        with
            e -> fileSeq, Seq.append [dir,e] errorSeq
    let subdirs, errs =
        try
            dir |> Directory.GetDirectories, errs
        with
            e -> [||], Seq.append [dir,e] errs
    Seq.fold allFiles (errs, files) subdirs
Now we pass the sequence of errors and the sequence of files into the function each time and return new sequences created by appending to them within the function. I think that the imperative approach is a bit easier to follow in this case, though.
Seq.append returns a new sequence, so this
Seq.append errorSeq [|(dir, e)|]
|> ignore
[||]
has no effect. Perhaps you want your function to return a tuple of two sequences? Or use some kind of mutable collection to write errors as you encounter them?
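The mutable-collection option could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: errors are collected in a ResizeArray passed in by the caller, and the file sequence stays lazy, so the error list is only filled in as the files are actually enumerated.

```fsharp
open System
open System.IO

// Lazily yields all files under `dir`. Directories that cannot be read are
// recorded in `errors` together with the exception, instead of being lost.
let rec allFiles (errors: ResizeArray<string * exn>) (dir: string) : seq<string> =
    seq {
        let files =
            try
                Directory.GetFiles dir
            with e ->
                errors.Add((dir, e))
                [||]
        yield! files
        let subdirs =
            try
                Directory.GetDirectories dir
            with e ->
                errors.Add((dir, e))
                [||]
        for sub in subdirs do
            yield! allFiles errors sub
    }
```

Because the sequence is lazy, the errors collection should be read only after the file sequence has been consumed (for example after Seq.iter or Seq.toList); before that it may still be empty.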
