Idiomatic way to get directory size in F#

Idiomatic way to get directory size in F# - f#

I want to get the total directory size, including all sub-directories and files. There are two ways I can think to do this, the first is to get a list of all the file attributes (FileInfo in .NET) of all the files, and then just sum together all the Length properties like this:
listOfFileInfos |> List.sumBy (fun f -> f.Length)
The other way would be to add the file size of each file to an accumulator as we traverse the directory structure. So we wouldn't be building a big list, we'd just be adding to the total as we go.
I'm not really sure how to do this in F# though!
Which way is more idiomatic? The first one looks easier to me but I'm only a beginner in functional programming.

The issue here is not so much about how to do the sum (sumBy works fine for that), but how to iterate over all files - you say that you want to do this over all sub-folders, which means that plain Directory.GetFiles will not do.
To handle sub-directories, you can us an overload that takes SearchOptions:
Directory.GetFiles("C:/temp", "*", SearchOption.AllDirectories)
This returns an array, which may be wasteful for directories with a lot of files, so a better option is EnumerateFiles, which returns a lazy sequence:
Directory.EnumerateFiles("C:/temp", "*", SearchOption.AllDirectories)
If you then sum the lengths using Seq.sumBy, the computation will iterate over all files without ever building a large data structure to keep them all in memory:
Directory.EnumerateFiles("C:/temp", "*", SearchOption.AllDirectories)
|> Seq.sumBy (fun f -> FileInfo(f).Length)

Related

Why does F#'s Seq.windowed return seq of array

Seq.windowed in F# returns a sequence where each window within is an array. Is there a reason why each window is returned as an array (a very concrete type) as opposed to say, another sequence or IList<'T>? An IList<'T>, for example, would be sufficient if the purpose was to communicate that the items of the window can be randomly accessed but an array says two things: elements are mutable and randomly accessible. If you can rationalise the choice of array, how is windowed different from Seq.groupBy? Why does that latter (or operators in the same vein) not also return the members of a group as an array?
I'm wondering if this is simply a design oversight or is there a deeper, contractual reason for an array?

I do not know what is the design principle behind this. I suppose it might just be an accidental aspect of the implementation - Seq.windowed can be quite easily implemented by storing items in arrays, while Seq.groupBy probably needs to use some more complicated structure.
In general, I think that F# APIs either use 'T[] if using array is the natural efficient implementation, or return seq<'T> when the data source may be infinite, lazy, or when the implementation would have to convert the data to an array explicitly (then this can be left to the caller).
For Seq.windowed, I think that array makes a good sense, because you know the length of the array and so you are likely to use indexing. For example, assuming that prices is a sequence of date-price tuples (seq<DateTime * float>) you can write:
prices
|> Seq.windowed 5
|> Seq.map (fun win -> fst (win.[2]), Seq.averageBy snd win)
The sample calculates floating average and uses indexing to get the date in the middle.
In summary, I do not really have a good explanation for the design rationale, but I'm quite happy with the choices made - they seem to work really well with the usual use cases for the functions.

A couple of thoughts.
First, know that in their current version, both Seq.windowed and Seq.groupBy use non-lazy collections in their implementation. windowed uses arrays and returns you arrays. groupBy builds up a Dictionary<'tkey, ResizeArray<'tvalue>>, but keeps that a secret and returns the group values back as a seq instead of ResizeArray.
Returning ResizeArrays from groupBy wouldn't fit with anything else, so that obviously needs to be hidden. The other alternative is to give back the ToArray() of the data. This would require another copy of the data to be created, which is a downside. And there isn't really much upside, since you don't know ahead of time how big your group will be anyways, so you aren't expecting to do random access or any of the other special stuff arrays enable. So simply wrapping in a seq seems like a good option.
For windowed, it's a different story. You want an array back in this case. Why? Because you already know how big that array is going to be, so you can safely do random access or, better yet, pattern matching. That's a big upside. The downside remains, though - the data needs to be re-copied into a newly-allocated array for every window.
seq{1 .. 100} |> Seq.windowed 3 |> Seq.map (fun [|x; _; y|] -> x + y)
There is still the open question - "but couldn't we avoid the array allocation/copy downside by internally only using true lazy seqs, and returning them as such? Isn't that more in the 'spirit of the seq'?" It would be kind of tricky (would need some fancy cloning of enumerators?), but sure, probably with some careful coding. There's a huge downside to this, though. You'd need to cache the entire unspooled seq in memory to make it work, which kind of negates the whole goal of doing things lazily. Unlike lists or arrays, enumerating a seq multiple times is not guaranteed to yield the same results (e.g. a seq that returns random numbers), so the backing data for these seq windows you are returning needs to be cached somewhere. When that window is eventually accessed, you can't just tap in and re-enumerate through the original source seq - you might get back different data, or the seq might end at a different place. This points to the other upside of using arrays in Seq.windowed - only windowSize elements need to be kept in memory at once.

This is of course pure guess. I think this is related to the way both functions are implemented.
As already mentioned, in Seq.groupBy the groups are of variable length and in Seq.windowed they are of a fixed size.
So in the implementation from Seq.windowed it makes more sense to use a fixed size array, as opposed to the Generic.List used in Seq.groupBy, which btw in F# is called ResizeArray.
Now to the outside world, an Array though mutable is widely used in F# code and libraries and F# provides syntactic support for creating, initializing and manipulating arrays, whereas the ResizeArray is not that widely used in F# code and the language provides no syntactic support apart from the type alias, so I think that's why they decided to expose it as a Seq.

F#'s underscore: why not just create a variable name?

Reading about F# today and I'm not clear on one thing:
From: http://msdn.microsoft.com/en-us/library/dd233200.aspx
you need only one element of the tuple, the wildcard character (the underscore) can be used to avoid creating a new name for a variable that you do not need
let (a, _) = (1, 2)
I can't think of a time that I've been in this situation. Why would you avoid creating a variable name?

Because you don't need the value. I use this often. It documents the fact that a value is unused and saves naming variables unused, dummy, etc. Great feature if you ask me.

Interesting question. There are many trade-offs involved here.
Your comparisons have been with the Ruby programming language so perhaps the first trade-off you should consider is static typing. If you use the pattern x, _, _ then F# knows you are referring to the first element of a triple of exactly three elements and will enforce this constraint at compile time. Ruby cannot. F# also checks patterns for exhaustiveness and redundancy. Again, Ruby cannot.
Your comparisons have also used only flat patterns. Consider the patterns _, (x, _) or x, None | _, Some x or [] | [_] and so on. These are not so easily translated.
Finally, I'd mention that Standard ML is a programming language related to F# and it does provide operators called #1 etc. to extract the first element of a tuple with an arbitrary number of elements (see here) so this idea was implemented and discarded decades ago. I believe this is because SML's #n notation culminates in incomprehensible error messages within the constraints of the type system. For example, a function that uses #n is not making it clear what the arity of the tuple is but functions cannot be generic over tuple arity so this must result in an error message saying that you must give more type information but many users found that confusing. With the CAML/OCaml/F# approach there is no such confusion.

The let-binding you've given is an example of a language facility called pattern matching, which can be used to destructure many types, not just tuples. In pattern matches, underscores are the idiomatic way to express that you won't refer to a value.
Directly accessing the elements of a tuple can be more concise, but it's less general. Pattern matching allows you to look at the structure of some data and dispatch to an approprate handling case.
match x with
| (x, _, 20) -> x
| (_, y, _) -> y
This pattern match will return the first item in x only if the third element is 20. Otherwise it returns the second element. Once you get beyond trivial cases, the underscores are an important readability aid. Compare the above with:
match x with
| (x, y, 20) -> x
| (x, y, z) -> y
In the first code sample, it's much easier to tell which bindings you care about in the pattern.

Sometimes a method will return multiple values but the code you're writing is only interested in a select few (or one) of them. You can use multiple underscores to essentially ignore the values you don't need, rather than having a bunch of variables hanging around in local scope.

Charting with F# does not take in Seq?

Following this from the Real-World Functional Programming blog about drawing bar and column charts, I was trying to draw a histogram for my data which is stored at a set of tuples (data_value, frequency) in a lazy sequence.
It does not work unless I convert the sequence into a List, the error message being that in case of sequence "the IEnumerable 'T does not support the Reset function". Is there any way to generate a histogram/chart etc. using the .NET library from a lazily-evaluated sequence?
Also (ok newbie query alert), is there any way to make the chart persist when the program is run from the console? The usual System.Console.ReadKey() |> ignore makes the chart window hang, and otherwise it disappears in an instant. I've been using "Send to Interactive" to see results til now.

The problem is that sequences (of type seq<T>, which is just an alias for IEnumerable<T>) generated using the F# sequence expression notation do not support the Reset method. The method is required by the charting library (because it needs to obtain the data each time it redraws the screen).
This means that, for example, the following will not work:
seq { for x in data -> x } |> FSharpChart.Line
Many of the standard library functions from the Seq module are implemented using sequence expressions, so the result will not support Reset. You can fix that by converting the data to a list (using List.ofSeq) or to an array (using Array.ofSeq) or by writing the code using lists:
[ for x in data -> x ] |> FSharpChart.Line
... and if you're using some function, you can take the one from List (not all of the Seq functions are available for List, so sometimes you will need to use conversion):
[ for x in data -> x ] |> List.choose op |> FSharpChart.Line

No, it does not accept sequences.
That said, there is a good reason for not supporting seq. it is about the structure itself : A seq is just that, a seq, and therefore does not and should not, support the kind of operations needed in drawing a graph. That said, I really wish this stack was more advanced and supported more use style.
So the answer is to
|> Seq.toArray
or
|> Seq.toList
before sending to the chart library

Comparing arrays elements in Erlang

I'm trying to learn how to think in a functional programming way, for this, I'm trying to learn Erlang and solving easy problems from codingbat. I came with the common problem of comparing elements inside a list. For example, compare a value of the i-th position element with the value of the i+1-th position of the list. So, I have been thinking and searching how to do this in a functional way in Erlang (or any functional language).
Please, be gentle with me, I'm very newb in this functional world, but I want to learn
Thanks in advance

Define a list:
L = [1,2,3,4,4,5,6]
Define a function f, which takes a list
If it matches a list of one element or an empty list, return the empty list
If it matches the first element and the second element then take the first element and construct a new list by calling the rest of the list recursivly
Otherwise skip the first element of the list.
In Erlang code
f ([]) -> [];
f ([_]) -> [];
f ([X, X|Rest]) -> [X | f(Rest)];
f ([_|Rest]) -> f(Rest).
Apply function
f(L)
This should work... haven't compiled and run it but it should get you started. Also in case you need to do modifications to it to behave differently.
Welcome to Erlang ;)

I try to be gentle ;-) So main thing in functional approach is thinking in terms: What is input? What should be output? There is nothing like comparing the i-th element with the i+1-th element alone. There have to be always purpose of it which will lead to data transformation. Even Mazen Harake's example doing it. In this example there is function which returns only elements which are followed by same value i.e. filters given list. Typically there are very different ways how do similar thing which depends of purpose of it. List is basic functional structure and you can do amazing things with it as Lisp shows us but you have to remember it is not array.
Each time you need access i-th element repeatable it indicates you are using wrong data structure. You can build up different data structures form lists and tuples in Erlang which can serve your purposes better. So when you face problem to compare i-th with i+1-th element you should stop and think. What is purpose of it? Do you need perform some stream data transformation as Mazen Harake does or You need random access? If second you should use different data structure (array for example). Even then you should think about your task characteristics. If you will be mostly read and almost never write then you can use list_to_tuple(L) and then read using element/2. When you need write occasionally you will start thinking about partition it to several tuples and as your write ratio will grow you will end up with array implementation.
So you can use lists:nth/2 if you will do it only once or several times but on short list and you are not performance freak as I'm. You can improve it using [X1,X2|_] = lists:nthtail(I-1, L) (L = lists:nthtail(0,L) works as expected). If you are facing bigger lists and you want call it many times you have to rethink your approach.
P.S.: There are many other fascinating data structures except lists and trees. Zippers for example.

Erlang: Can this be done without lists:reverse?

I am a beginner learning Erlang. After reading about list comprehensions and recursion in Erlang, I wanted to try to implement my own map function, which turned out like this:
% Map: Map all elements in a list by a function
map(List,Fun) -> map(List,Fun,[]).
map([],_,Acc) -> lists:reverse(Acc);
map([H|T],Fun,Acc) -> map(T,Fun,[Fun(H)|Acc]).
My question is: It feels wrong to build up a list through the recursive function, and then reverse it at the end. Is there any way to build up the list in the right order, so we don't need the reverse ?

In oder to understand why accumulating and reversing is quite fast you have to understand how lists are build in Erlang. Erlangs lists like those in Lisp are build out of cons cells (look at the picture in the link).
In a singly linked list like the Erlang lists it is very cheap to prepend a element (or a short list). This is was the List = [H|T] construct does.
Reversing a singly linked list made of cons cells is quite fast since you only need one pass along the list, just prepending the next element to you already reversed partial result. And as was already mentioned, it is also implemented in C in Erlang.
Also building a result in reverse order often can be accomplished by a tail recursive function which means that no stack is built up and (in old versions of Erlang only!) therefore some memory can be saved.
Having said all this: it is one of The Eight Myths of Erlang Performance that it is always better to build in reverse in a tail recursive function and call lists:reverse/1 at the end.
Here is a body recursive version without lists:reverse/1 this would use more memory on Erlang versions before 12 but not so on current versions:
map([H|T], Fun) ->
[ Fun(H) | map(T,Fun) ];
map([],_) -> [].
And here is a version of map using list comprehensions:
map(List, Fun) ->
[ Fun(X) || X <- List ].
As you can see this is so simple because map is just a built in part of the list comprehension, so you would use the list comprehension directly and have no need for map anymore.
As an extra a pure Erlang implementation that shows how reversing a list of cons cells wors efficiently (in Erlang its always faster to call lists:reverse/1 because it is in C, but does the same thing).
reverse(List) ->
reverse(List, []).
reverse([H|T], Acc) ->
reverse(T, [H|Acc]);
reverse([], Acc) ->
Acc.
As you can see there are only [A|B] operations on the list, taking cons cells apart (when pattern matching) and building new when doing the [H|Acc].

Such construction [ Element | Acc] and then lists:reverse(Acc) is quite common pattern in functional programming.
You can write code with an accumulator storing data in correct order, but this code will be slower than the one in your question (hint: use Acc ++ [Fun(H)]).
Actually within Erlang lists:reverse is implemented in C and it works quite fast.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart