F#: Generating a word count summary


I am new to programming and F# is my first .NET language.
I would like to read the contents of a text file, count the number of occurrences of each word, and then return the 10 most common words and the number of times each of them appears.
My questions are: Is using a dictionary encouraged in F#? How would I write the code if I wish to use a dictionary? (I have browsed through the Dictionary class on MSDN, but I am still puzzling over how I can update the value to a key.) Do I always have to resort to using Map in functional programming?

While there's nothing wrong with the other answers, I'd like to point out that there's already a specialized function for counting how many times each key occurs in a sequence: Seq.countBy. Plumbing the relevant parts of Reed's and torbonde's answers together:
    let countWordsTopTen (s : string) =
        s.Split([|','|])
        |> Seq.countBy (fun s -> s.Trim())
        |> Seq.sortBy (snd >> (~-))
        |> Seq.truncate 10
    "one, two, one, three, four, one, two, four, five"
    |> countWordsTopTen
    |> printfn "%A" // seq [("one", 3); ("two", 2); ("four", 2); ("three", 1); ...]
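Since the question starts from a text file rather than a comma-separated string, here is a minimal sketch of the same Seq.countBy pipeline applied to free text. The function name topWords, the separator list, and the lower-casing are my own assumptions, not something from the answers above:

```fsharp
open System

// Sketch only: split on whitespace and a few punctuation marks (an assumed,
// incomplete separator list) and count words case-insensitively.
let topWords (n : int) (text : string) =
    text.Split([| ' '; '\t'; '\r'; '\n'; ','; '.'; ';'; ':'; '!'; '?' |],
               StringSplitOptions.RemoveEmptyEntries)
    |> Seq.countBy (fun w -> w.ToLowerInvariant())
    |> Seq.sortByDescending snd
    |> Seq.truncate n
    |> Seq.toList

printfn "%A" (topWords 2 "the cat and the dog and the bird")
// [("the", 3); ("and", 2)]
```

With a real file you would presumably feed it something like System.IO.File.ReadAllText "words.txt" |> topWords 10, where "words.txt" is a placeholder path.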

My questions are: Is using a dictionary encouraged in F#?
Using a Dictionary is fine from F#, though it does use mutability, so it's not quite as common.
How would I write the code if I wish to use a dictionary?
If you read the file and have a string of comma-separated values, you could parse it using something similar to:
    open System.Collections.Generic
    // Just an example of input - this would come from your file...
    let strings = "one, two, one, three, four, one, two, four, five"
    let words =
        strings.Split([|','|])
        |> Array.map (fun s -> s.Trim())
    let dict = Dictionary<_,_>()
    words
    |> Array.iter (fun w ->
        match dict.TryGetValue w with
        | true, v -> dict.[w] <- v + 1
        | false, _ -> dict.[w] <- 1)
    // Creates a sequence of (word, count) tuples, most frequent first
    let topTen =
        dict
        |> Seq.sortBy (fun kvp -> -kvp.Value)
        |> Seq.truncate 10
        |> Seq.map (fun kvp -> kvp.Key, kvp.Value)
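The question also asks about Map. For comparison, here is a rough sketch of the same counting done with an immutable Map and a fold. The helper names addWord and countWithMap are made up for this illustration:

```fsharp
// Sketch: each step returns a new Map instead of mutating a Dictionary.
let addWord (counts : Map<string, int>) (word : string) =
    // Current count, or 0 if the word has not been seen yet.
    let current = counts |> Map.tryFind word |> Option.defaultValue 0
    counts |> Map.add word (current + 1)

let countWithMap (words : seq<string>) =
    Seq.fold addWord Map.empty words

let sampleWords =
    "one, two, one, three, four, one, two, four, five".Split([|','|])
    |> Array.map (fun s -> s.Trim())

let counts = countWithMap sampleWords
printfn "%A" (Map.toList counts)
```

So no, you don't have to use Map; it is simply the immutable counterpart, and which one reads better is mostly a matter of taste for a task this small.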

I would say an obvious choice for this task is to use the Seq module, which is really one of the major workhorses in F#. As Reed said, using a dictionary is not as common, since it is mutable. Sequences, on the other hand, are immutable. An example of how to do this using sequences is
    let strings = "one, two, one, three, four, one, two, four, five"
    let words =
        strings.Split([|','|])
        |> Array.map (fun s -> s.Trim())
    let topTen =
        words
        |> Seq.groupBy id
        |> Seq.map (fun (w, ws) -> (w, Seq.length ws))
        |> Seq.sortBy (snd >> (~-))
        |> Seq.truncate 10
I think the code speaks pretty much for itself, although maybe the second-to-last line requires a short explanation:
The snd function gives the second entry in a pair (i.e. snd (a,b) is b), >> is the function composition operator (i.e. (f >> g) a is the same as g (f a)) and ~- is the unary minus operator. Note here that operators are essentially functions, but when using (and declaring) them as functions, you have to wrap them in parentheses. That is, -3 is the same as (~-) 3, where in the latter case we have used the operator as a function.
In total, what the second-to-last line does is sort the sequence by the negative value of the second entry in the pair (the number of occurrences), which puts the most frequent words first.
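To make the operator explanation concrete, here is a small sketch that evaluates the pieces separately. The names negated, sortKey and sampleCounts are only for illustration, and List.sortByDescending is shown as an alternative I believe does the same job in F# 4.0 and later:

```fsharp
// (~-) is unary minus used as an ordinary function.
let negated = (~-) 3

// snd picks the count out of the pair, then >> passes it on to (~-).
let sortKey (pair : string * int) = (snd >> (~-)) pair
let keyForOne = sortKey ("one", 3)

let sampleCounts = [ ("three", 1); ("one", 3); ("two", 2) ]

// Sorting ascending by the negated count ...
let viaNegation = sampleCounts |> List.sortBy (snd >> (~-))
// ... gives the same order here as sorting descending by the count.
let viaDescending = sampleCounts |> List.sortByDescending snd

printfn "%d %d %A" negated keyForOne viaNegation
```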

Related

is this a good use of Seq.cache, in F#

I'm going through a mutable ConcurrentDictionary to remove old entries.
    let private cache = ConcurrentDictionary<Instrument * DateTimeOffset, SmallSet>()
and since I can't remove entries while iterating through the keys, I was wondering if this would be a good use for Seq.cache:
    let old = DateTimeOffset.UtcNow.AddHours(-1.)
    cache.Keys
    |> Seq.filter (fun x -> snd x <= old)
    |> Seq.cache
    |> Seq.iter (fun x -> cache.TryRemove x |> ignore)
I have never used Seq.cache, and I assume it creates a separation between the two loops. Am I understanding how it works correctly?

In the scenario you described I don't see any reason why you would need to iterate the collection multiple times. You can just go over the KeyValuePairs inside the dictionary and check whether each one matches your condition.
So, something like this should do the job:
    cache |> Seq.iter (function
        | x when snd x.Key <= old -> cache.TryRemove(x.Key) |> ignore
        | _ -> ())
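Here is a self-contained sketch of that single pass. It relies on the documented behaviour that a ConcurrentDictionary enumerator is safe to use while the dictionary is being modified. Instrument and SmallSet are not shown in the question, so a string and an int stand in for them here as assumptions:

```fsharp
open System
open System.Collections.Concurrent

// Hypothetical stand-ins: string replaces Instrument, int replaces SmallSet.
let cache = ConcurrentDictionary<string * DateTimeOffset, int>()

let now = DateTimeOffset.UtcNow
cache.TryAdd(("stale", now.AddHours(-2.0)), 1) |> ignore
cache.TryAdd(("fresh", now), 2) |> ignore

let old = now.AddHours(-1.0)

// Removing entries while enumerating does not throw for ConcurrentDictionary.
for kvp in cache do
    if snd kvp.Key <= old then
        cache.TryRemove(kvp.Key) |> ignore

printfn "%d entries left" cache.Count
```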

A more functional way to create tuples from two arrays

I've created a function that gets all integers from 1 to n and then combines with the same sequence to create a sequence of tuples of all combinations. So passing it the integer 2 would give you [(1,1);(1,2);(2,1);(2,2)]:
    let allTuplesUntil x =
        let primary = seq { 1 .. x }
        let secondary = seq { 1 .. x }
        [for x in primary do
            for y in secondary do
                yield (x,y)]
This implementation works, but it uses an inner and an outer for loop, similar to what I would do in C#.
Could this be achieved in a more idiomatic functional way? Would a more functional way typically be more desirable or is this acceptable in a functional language because of its brevity and clarity?
I'm relatively new to F# and looking for some feedback.

These loops are part of what's called a computation expression, which is quite idiomatic in F#. It's just made to look like familiar loops. I can't see any problem with your code being written this way. If what you want is to get rid of the loops, you could hide them in functions:
    let cartesianProduct xs ys =
        xs |> Seq.collect (fun x -> ys |> Seq.map (fun y -> x, y))
    cartesianProduct [1;2;3] ['a';'b';'c']
    // val it : seq<int * char> = seq [(1, 'a'); (1, 'b'); (1, 'c'); (2, 'a'); ...]

First, just because there is a for doesn't mean it's not functional. In this example you go over each element and yield a new element that becomes part of a new immutable list. This feature is also called a "list comprehension" and is part of languages like Haskell. The imperative approach would be to loop over a list and mutate it.
Second, remember that other functions like map, fold, filter also just loop over each element, like a for expression. They are just less powerful than a for loop.
Third, even if it were "not 100% functional", who cares? Code should be easily readable and understandable. The intention of two for loops is easy to understand.
Fourth, the function equivalent of the for expression is usually bind, or in this case Seq.collect. You could also write this code:
    [for x in primary do
        for y in secondary do
            yield (x,y)]
like this:
    primary |> Seq.collect (fun x ->
        secondary |> Seq.collect (fun y ->
            [x,y]
    ))
I prefer the for loops for readability!
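For completeness, newer versions of the core library (F# 4.1 onward, as far as I know) ship a function for exactly this, List.allPairs. A minimal sketch of the question's function built on it:

```fsharp
// Sketch: pair every number from 1 to n with every number from 1 to n.
let allTuplesUntil n =
    let numbers = [ 1 .. n ]
    List.allPairs numbers numbers

printfn "%A" (allTuplesUntil 2) // [(1, 1); (1, 2); (2, 1); (2, 2)]
```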

What are the essential functions to find duplicate elements within a list?

What are the essential functions to find duplicate elements within a list?
Translated, how can I simplify the following function:
    let numbers = [ 3;5;5;8;9;9;9 ]
    let getDuplicates =
        numbers
        |> List.groupBy id
        |> List.map snd
        |> List.filter (fun set -> set.Length > 1)
        |> List.map (fun set -> set.[0])
I'm sure this is a duplicate. However, I am unable to locate the question on this site.
UPDATE
    let getDuplicates numbers =
        numbers
        |> List.groupBy id
        |> List.choose (fun (k, v) ->
            match v.Length with
            | x when x > 1 -> Some k
            | _ -> None)

Simplifying your function:
Whenever you have a filter followed by a map, you can probably replace the pair with a choose. The purpose of choose is to run a function for each value in the list, and return only the items which return Some value (None values are removed, which is the filter portion). Whatever value you put inside Some is the map portion:
    let getDuplicates =
        numbers
        |> List.groupBy id
        |> List.map snd
        |> List.choose (fun set ->
            if set.Length > 1
            then Some set.[0]
            else None)
We can take it one additional step by removing the map. In this case, keeping the tuple which contains the key is helpful, because it eliminates the need to get the first item of the list:
    let getDuplicates =
        numbers
        |> List.groupBy id
        |> List.choose (fun (key, set) ->
            if set.Length > 1
            then Some key
            else None)
Is this simpler than the original? Perhaps. Because choose combines two purposes, it is by necessity more complex than those purposes kept separate (the filter and the map), and this makes it harder to understand at a glance, perhaps undoing the more "simplified" code. More on this later.
Decomposing the concept
Simplifying the code wasn't the direct question, though. You asked about functions useful in finding duplicates. At a high level, how do you find a duplicate? It depends on your algorithm and specific needs:
Your given algorithm uses the "put items in buckets based on their value", and "look for buckets with more than one item". This is a direct match to List.groupBy and List.choose (or filter/map)
A different algorithm could be to "iterate through all items", "modify an accumulator as we see each", then "report all items which have been seen multiple times". This is kind of like the first algorithm, where something like List.fold is replacing List.groupBy, but if you need to drag some other kind of state along, it may be helpful.
Perhaps you need to know how many times each item is duplicated. A different algorithm satisfying this requirement may be "sort the items so they are always ascending", and "flag if the next item is the same as the current item". In this case, you have a List.sort followed by a List.toSeq then Seq.windowed:
    let getDuplicates =
        numbers
        |> List.sort
        |> List.toSeq
        |> Seq.windowed 2
        |> Seq.choose (function
            | [|x; y|] when x = y -> Some x
            | _ -> None)
Note that this returns a sequence with [5; 9; 9], informing you that 9 is duplicated twice.
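If the count itself is what you are after, another option (my own sketch, not one of the approaches listed above) is List.countBy, which pairs each distinct value with its number of occurrences:

```fsharp
let numbers = [ 3; 5; 5; 8; 9; 9; 9 ]

// Pair every distinct value with its number of occurrences,
// then keep only the values seen more than once.
let duplicateCounts =
    numbers
    |> List.countBy id
    |> List.filter (fun (_, count) -> count > 1)

printfn "%A" duplicateCounts // [(5, 2); (9, 3)]
```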
These were algorithms mostly based on List functions. There are already two answers, one mutable, the other not, which are based on sets and existence.
My point is, a complete list of functions helpful to finding duplicates would read like a who's who list of existing collection functions -- it all depends on what you're trying to do and your specific requirements. I think your choice of List.groupBy and List.choose is probably about as simple as it gets.
Simplifying for maintainability
The last thought on simplification is to remember that simplifying code will improve the readability of your code to a certain extent. "Simplifying" beyond that point will most likely involve tricks, or obscure intent. If I were to look back at a sample of code I wrote even several weeks and a couple of projects ago, the shortest and perhaps simplest code would probably not be the easiest to understand. Thus the last point -- simplifying future code maintainability may be your goal. If this is the case, your original algorithm modified only keeping the groupBy tuple and adding comments as to what each step of the pipeline is doing may be your best bet:
    let getDuplicates =
        numbers
        // combine numbers into common buckets specified by the number itself
        |> List.groupBy id
        // only look at buckets with more than one item
        |> List.filter (fun (_, set) -> set.Length > 1)
        // change each bucket to only its key
        |> List.map (fun (key, _) -> key)
The original question comments already show that your code was unclear to people unfamiliar with it. Is this a question of experience? Definitely. But, regardless of whether we work on a team, or are lone wolves, optimizing code (where possible) for quick understanding should probably be close to everyone's top priority. (climbing down off soapbox...) :)
Regardless, best of luck.

If you don't mind using a mutable collection in a local scope, this could do it:
    open System.Collections.Generic
    let getDuplicates numbers =
        let known = HashSet()
        numbers |> List.filter (known.Add >> not) |> set
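In case the known.Add >> not part looks cryptic: HashSet.Add returns false when the value is already present, so negating it keeps only the repeat occurrences. A spelled-out sketch of the same idea, specialised to int for the sake of the example:

```fsharp
open System.Collections.Generic

let getDuplicates (numbers : int list) =
    let known = HashSet<int>()
    numbers
    // Add returns false when the value is already in the set,
    // so the filter keeps only repeat occurrences.
    |> List.filter (fun n -> not (known.Add n))
    // Collapse the repeats into an immutable F# set.
    |> set

printfn "%A" (getDuplicates [ 3; 5; 5; 8; 9; 9; 9 ]) // set [5; 9]
```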

You can wrap the last three operations in a List.choose:
    let duplicates =
        numbers
        |> List.groupBy id
        |> List.choose (function
            | _, x::_::_ -> Some x
            | _ -> None)

Here's a solution which uses only basic functions and immutable data structures:
    let findDups elems =
        let findDupsHelper (oneOccurrence, manyOccurrences) elem =
            if oneOccurrence |> Set.contains elem
            then (oneOccurrence, manyOccurrences |> Set.add elem)
            else (oneOccurrence |> Set.add elem, manyOccurrences)
        List.fold findDupsHelper (Set.empty, Set.empty) elems |> snd

f# iterating over two arrays, using function from a c# library

I have a list of words and a list of associated part of speech tags. I want to iterate over both, simultaneously (matched index) using each indexed tuple as input to a .NET function. Is this the best way (it works, but doesn't feel natural to me):
    let taggingModel =
        SeqLabeler.loadModel(lthPath + "models\penn_00_18_split_dict.model")
    let lemmatizer = new Lemmatizer(lthPath + "v_n_a.txt")
    let input = "the rain in spain falls on the plain"
    let words = Preprocessor.tokenizeSentence( input )
    let tags = SeqLabeler.tagSentence( taggingModel, words )
    let lemmas = Array.map2 (fun x y -> lemmatizer.lookup(x,y)) words tags

Your code looks quite good to me - most of it deals with some loading and initialization, so there isn't much you could do to simplify that part. Alternatively to Array.map2, you could use Seq.zip combined with Seq.map - the zip function combines two sequences into a single one that contains pairs of elements with matching indices:
    let lemmas = Seq.zip words tags
                 |> Seq.map (fun (x, y) -> lemmatizer.lookup (x, y))
Since the lookup function takes a tuple, which is exactly what zip produces, you could write:
    // standard syntax using the pipelining operator
    let lemmas = Seq.zip words tags |> Seq.map lemmatizer.lookup
    // .. an alternative syntax doing exactly the same thing
    let lemmas = (words, tags) ||> Seq.zip |> Seq.map lemmatizer.lookup
The ||> operator used in the second version takes a tuple containing two values and passes them to the function on the right side as two arguments, meaning that (a, b) ||> f means f a b. The |> operator takes only a single value on the left, so (a, b) |> f would mean f (a, b) (which would work if the function f expected a tuple instead of two space-separated parameters).
If you need lemmas to be an array at the end, you'll need to add Array.ofSeq to the end of the processing pipeline (all Seq functions work with sequences, which correspond to IEnumerable<T>).
One more alternative is to use sequence expressions (you can use [| .. |] to construct an array directly if that's what you need):
    let lemmas = [| for wt in Seq.zip words tags do // wt is tuple (string * string)
                      yield lemmatizer.lookup wt |]
Whether to use sequence expressions or not is just a personal preference. The first option seems to be more succinct in this case, but sequence expressions may be more readable for people less familiar with things like partial function application (in the shorter version using Seq.map).
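The variants discussed in this answer can be compared side by side in a runnable sketch. The lemmatizer library isn't available here, so lookup below is a made-up stand-in that just glues a lower-cased word and its tag together, and the word and tag arrays are invented sample data:

```fsharp
// Hypothetical stand-in for lemmatizer.lookup: takes a (word, tag) tuple.
let lookup (word : string, tag : string) = word.ToLowerInvariant() + "/" + tag

let words = [| "The"; "rain"; "falls" |]
let tags = [| "DT"; "NN"; "VBZ" |]

// Array.map2 walks both arrays in step.
let viaMap2 = Array.map2 (fun w t -> lookup (w, t)) words tags

// (a, b) ||> f means f a b, so this calls Seq.zip words tags.
let viaZip = (words, tags) ||> Seq.zip |> Seq.map lookup |> Array.ofSeq

// An array expression builds the same array directly.
let viaExpression = [| for wt in Seq.zip words tags do yield lookup wt |]

printfn "%A" viaMap2
```

All three produce the same array, so the choice really does come down to which reads best to you.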

f# - looping through array

I have decided to take up F# as my functional language.
My problem: Given a bunch of 50-digit numbers in a file, get the first 10 digits of the sum of each line. (It's a Project Euler problem, for those who know it.)
For example (simplified):
1234567890
The sum is 45.
The first "ten" digits, or in our case the "first" digit, is 4.
Here's my problem:
I read my file of numbers, and I can split it using "\n" so that I have each line. Then I try to convert each line to a char array, but that's where the problem comes in: I can't access each element of that array.
    let total =
        lines.Split([|'\n'|])
        |> Seq.map (fun line -> line.ToCharArray())
        |> Seq.take 1
        |> Seq.toList
        |> Seq.length
I get each line, convert it to an array, take the first array (for testing only), try to convert it to a list, and then get the length of the list. But this length is the number of arrays I have (i.e., 1). It should be 50, as that's how many elements there are in the array.
Does anyone know how to pipeline it to access each char?

Seq.take 1 still returns a seq<char array> (a sequence holding a single array). To get the first array itself you could use Seq.head or Seq.item 0 (called Seq.nth 0 in older versions of F#).
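A small sketch of the difference, using a made-up two-line input with 10 digits per line instead of the real 50-digit data:

```fsharp
let lines = "1234567890\n0987654321"

let arrays =
    lines.Split([|'\n'|])
    |> Seq.map (fun line -> line.ToCharArray())

// Seq.take 1 keeps the outer sequence: one element, which is an array.
let outerLength = arrays |> Seq.take 1 |> Seq.length

// Seq.head returns the first array itself, so its chars are reachable.
let firstArray = arrays |> Seq.head
let digitSum = firstArray |> Array.sumBy (fun c -> int c - int '0')

printfn "%d %d %d" outerLength firstArray.Length digitSum // 1 10 45
```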

My final answer:
    open System
    let total =
        lines.Split([|'\n'|])
        |> Seq.map (fun line -> line.ToCharArray() |> Array.toSeq)
        |> Seq.map (fun eachSeq ->
            eachSeq
            |> Seq.take 50 // get rid of the \r
            |> Seq.map (fun c -> Double.Parse(c.ToString()))
            |> Seq.skip 10
            |> Seq.sum)
        |> Seq.average
is what I finally got, and it's working :).
Basically, after I convert each line to a char array, I make it a sequence, so now I have a sequence of sequences. Then I can loop through each sequence.

I'm not 100% sure what you're asking for, but I believe you're trying to write something like this:
    lines.Split([|'\n'|]) |> Seq.map (fun line -> line.Length)
This maps the lines to a sequence of integers, each one the length of the corresponding line.

Here's my solution:
    string(Seq.sumBy bigint.Parse (data.Split[|'\n'|])).Substring(0, 10)

I copied the data into a string, each line separated by x. Then the answer is one line (wrapped for SO):
    let ans13 = data |> String.split ['x'] |> Seq.map Math.BigInt.Parse
                     |> Seq.reduce (+)
If you are reading it from a file, you'd add the file reading code:
    let ans13 = IO.File.ReadAllLines("filename") |> Seq.map Math.BigInt.Parse
                                                 |> Seq.reduce (+)
Edit: Actually, I'm not sure we're talking about the same Euler problem -- this is for 13, but your description sounds slightly different. To get the first 10 digits after the summing, do:
    printfn "%s" <| String.sub (string ans13) 0 10
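String.split, String.sub and Math.BigInt in the snippets above appear to come from an older F# release, so they may not compile as-is today. A rough equivalent using System.Numerics.BigInteger might look like the sketch below. The two sample numbers are invented and only 20 digits long; the real input would be the lines of the data file:

```fsharp
open System.Numerics

// Hypothetical sample input: two 20-digit numbers instead of the real data file.
let lines = [| "12345678901234567890"; "98765432109876543210" |]

// Parse each line as an arbitrary-precision integer and add them up.
let total : BigInteger =
    lines |> Array.sumBy (fun (line : string) -> BigInteger.Parse(line))

// Render the sum as text and keep its first ten digits.
let firstTen = (string total).Substring(0, 10)

printfn "%s" firstTen // 1111111110
```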
