How to make word freq counter more efficient? - f#

I have written this F# code to count word frequencies in a list and return a tuple to C#. Could you tell me how can I make the code more efficient or shorter?
let rec internal countword2 (tail : string list) wrd ((last : string list), count) =
    match tail with
    | [] -> last, wrd, count
    | h::t -> countword2 t wrd (if h = wrd then last, count+1 else last @ [h], count)
let internal countword1 (str : string list) wrd =
    let temp, wrd, count = countword2 str wrd ([], 0) in
    temp, wrd, count
let rec public countword (str : string list) =
    match str with
    | [] -> []
    | h::_ ->
        let temp, wrd, count = countword1 str h in
        [(wrd, count)] @ countword temp

Even pad's version can be made more efficient and concise:
let countWords = Seq.countBy id
Example:
countWords ["a"; "a"; "b"; "c"] //returns: seq [("a", 2); ("b", 1); ("c", 1)]

If you want to count word frequencies in a string list, your approach seems to be overkill. Seq.groupBy is well-fitted for this purpose:
let public countWords (words: string list) =
    words
    |> Seq.groupBy id
    |> Seq.map (fun (word, sq) -> word, Seq.length sq)
    |> Seq.toList

Your solution iterates over the input list several times, once for every new word that it finds. Instead of doing that, you could iterate over the list just once and build a dictionary that holds the number of occurrences of every word.
To do this in a functional style, you can use F# Map, which is an immutable dictionary:
let countWords words =
    // Increment the number of occurrences of 'word' in the map 'counts'
    // If it isn't already in the dictionary, add it with count 1
    let increment counts word =
        match Map.tryFind word counts with
        | Some count -> Map.add word (count + 1) counts
        | _ -> Map.add word 1 counts
    // Start with an empty map and call 'increment'
    // to add all words to the dictionary
    words |> List.fold increment Map.empty
You can also implement the same thing in an imperative style, which is going to be more efficient, but less elegant (and you don't get all benefits of functional style). However, standard mutable Dictionary can be used nicely from F# too (this is going to be similar to C# version, so I won't write it here).
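For reference, the Dictionary-based version mentioned above might look roughly like the following. This is only a minimal sketch: the name countWordsImperative is made up for illustration, and it returns a list of (word, count) pairs in no particular order.

```fsharp
open System.Collections.Generic

// Hypothetical name; counts occurrences with a mutable Dictionary in a single pass.
let countWordsImperative (words: seq<string>) =
    let counts = Dictionary<string, int>()
    for word in words do
        match counts.TryGetValue word with
        | true, n -> counts.[word] <- n + 1   // seen before: bump the count
        | _ -> counts.[word] <- 1             // first occurrence
    // Copy the result into an immutable list of (word, count) pairs
    [ for KeyValue(word, n) in counts -> word, n ]
```

Sorting the result (for example with List.sort) gives a stable order if you need to compare it against the output of the other versions.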
Finally, if you want a simple solution using just standard F# functions, you can use Seq.groupBy as suggested by pad. This would be probably almost as efficient as the Dictionary based version. But then, if you're just learning F# then writing a few recursive functions like countWords yourself is a great way to learn!
To give you some comments about your code - the complexity of your approach is slightly higher, but that should probably be fine. There are however some common issues:
In your countword2 function, you have if h = wrd then ... else last @ [h], count. The call last @ [h] is inefficient, because it needs to clone the entire list last. Instead of this, you could just write h::last to add the word to the beginning, because the order does not matter.
On the last line, you're using @ again in [(wrd, count)] @ countword temp. This is not necessary. If you're adding a single element to the beginning of a list, you should use: (wrd,count)::(countword temp).
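Applying both suggestions to the original recursive approach could look something like this. It is a sketch rather than a drop-in replacement: the inner helper split is a name introduced here, and because the leftover words are collected with h :: others, the order of the pairs in the result may differ from the original.

```fsharp
let rec countword (words: string list) =
    // Walk the list once, counting 'wrd' and collecting all other words
    // by prepending (so the leftover list comes out reversed).
    let rec split rest wrd others count =
        match rest with
        | [] -> others, count
        | h :: t when h = wrd -> split t wrd others (count + 1)
        | h :: t -> split t wrd (h :: others) count
    match words with
    | [] -> []
    | h :: _ ->
        let others, count = split words h [] 0
        // Prepend a single element with '::' instead of '@'
        (h, count) :: countword others
```

This still makes one pass per distinct word, so the Map or Seq.countBy versions remain the better choice for large inputs.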

Related

Times a str is shown

I've made a function to read a .txt file and turn it into a string.
From here I need help with collecting how many times a word is shown.
But I'm not sure where to go from here and any kind of help with any of the bulletpoints would be greatly appreciated.
Let's go through this step by step then, creating a function for each bit:
Convert words starting with an upper-case to a lower-case word so that all words are lower case.
Split the string into a sequence of words:
let getWords (s: string) =
    s.Split(' ')
Turns "hello world" into [|"hello"; "world"|]
Sort the amount of times a word is shown. A word in this sense is a sequence of characters without whitespaces or punctuation (!#= etc)
Part #1: Format a word in lower without punctuation:
open System
let isNotPunctuation c =
    not (Char.IsPunctuation(c))
let formatWord (s: string) =
    let chars =
        s.ToLowerInvariant()
        |> Seq.filter isNotPunctuation
        |> Seq.toArray
    new String(chars)
Turns "Hello!" into "hello".
Part #2: Group the list of words by the formatted version of it.
let groupWords (words: string seq) =
    words
    |> Seq.groupBy formatWord
This returns a sequence of tuples: the first part of each tuple is the key (the result of formatWord) and the second part is the sequence of words that share that key.
Turns ["hello"; "world"; "hello"] into
[("hello", ["hello"; "hello"]);
("world", ["world"])]
Sort from most frequent word shown and to less frequent word.
let sortWords group =
    group
    |> Seq.sortByDescending (fun g -> Seq.length (snd g))
Sort the list descending (biggest first) by the length (count) of items in the second part - see the above representation.
Now we just need to clean up the output:
let output group =
    group
    |> Seq.map fst
This picks the first part of the tuple from the group:
Turns ("hello", ["hello"; "hello"]) into "hello".
Now we have all the functions, we can stick them together into one chain:
let s = "some long string with some repeated words again and some other words"
let finished =
    s
    |> getWords
    |> groupWords
    |> sortWords
    |> output
printfn "%A" finished
//seq ["some"; "words"; "long"; "string"; ...]
Here's another way using Regex
open System.Text.RegularExpressions
let str = "Some (very) long string with some repeated words again, and some other words, and some punctuation too."
str
|> (Regex @"\W+").Split
|> Seq.choose (fun s -> if s = "" then None else Some (s.ToLower()))
|> Seq.countBy id
|> Seq.sortByDescending snd

In F#, how to get head/tail of a seq without re-evaluating the seq

I'm reading a file and I want to do something with the first line, and something else with all the other lines
let lines = System.IO.File.ReadLines "filename.txt" |> Seq.map (fun r -> r.Trim())
let head = Seq.head lines
let tail = Seq.tail lines
Problem: the call to tail fails because the TextReader is closed.
What it means is that the Seq is evaluated twice: once to get the head and once to get the tail.
How can I get the first line and the remaining lines, while keeping a Seq and without re-evaluating the Seq?
The signature could be, for example:
let fn: ('a -> Seq<'a> -> 'b) -> Seq<'a> -> 'b
The easiest thing to do is probably just using Seq.cache to wrap your lines sequence:
let lines =
    System.IO.File.ReadLines "filename.txt"
    |> Seq.map (fun r -> r.Trim())
    |> Seq.cache
Of note from the documentation:
This result sequence will have the same elements as the input sequence. The result can be enumerated multiple times. The input sequence is enumerated at most once and only as far as is necessary. Caching a sequence is typically useful when repeatedly evaluating items in the original sequence is computationally expensive or if iterating the sequence causes side-effects that the user does not want to be repeated multiple times.
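A helper with the signature asked for in the question could be sketched on top of Seq.cache as follows. The name withHeadAndTail is hypothetical, and the sketch assumes the input has at least one element (Seq.head throws on an empty sequence).

```fsharp
// Hypothetical helper: apply 'f' to the first element and to the remaining
// elements. Seq.cache ensures the underlying source is enumerated at most once.
let withHeadAndTail (f: 'a -> seq<'a> -> 'b) (source: seq<'a>) : 'b =
    let cached = Seq.cache source
    f (Seq.head cached) (Seq.tail cached)

// Example: count how often the source actually produces an element.
let evaluations = ref 0
let source =
    seq {
        for i in 1 .. 3 do
            evaluations.Value <- evaluations.Value + 1
            yield i
    }

let first, rest = withHeadAndTail (fun h t -> h, List.ofSeq t) source
```

Here each source element is produced only once even though both Seq.head and Seq.tail read from the sequence. Note that the tail stays lazy, so f should consume it while the underlying resource (such as the file) is still usable.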
I generally use a seq expression in which the Stream is scoped inside the expression. That will allow you to enumerate the sequence fully before the stream is disposed. I usually use a function like this:
open System.IO
let readLines file =
    seq {
        use stream = File.OpenText file
        while not stream.EndOfStream do
            yield stream.ReadLine().Trim()
    }
Then you should be able to call Seq.head and get the first line in the file, and Seq.last to get the last line in the file. I think this will technically create two different enumerators though. If you want to only read the file exactly one time, then materializing the sequence to a list or using a function like Seq.cache will be your best option.
I had an important use case for this, where I am using Seq.unfold to read a large number of blocks with REST reads, and sequentially processing each block, with further REST reads.
The reading of the sequence had to be both "lazy" but also cached to avoid duplicate re-evaluation (with every Seq.tail operation).
Hence finding this question and the accepted answer (Seq.cache). Thanks!
I experimented with Seq.cache and discovered that it worked as claimed (ie, lazy and avoid re-evaluation), but with one noteworthy condition - the first five elements of the sequence are always read first (and retained with 'cache'), so experiments on five or smaller numbers won't show lazy evaluation. However, after five, lazy evaluation kicks in for each element.
This code can be used to experiment. Try it for 5, and see no lazy evaluation, and then 10, and see each element after 5 being 'lazy' read, as required. Also remove Seq.cache to see the problem we are addressing (re-evaluation)
// Get a Sequence of numbers.
let getNums n = seq { for i in 1..n do printfn "Yield { %d }" i; yield i}
// Unfold a sequence of numbers
let unfoldNums (nums : int seq) =
    nums
    |> Seq.unfold
        (fun (nums : int seq) ->
            printfn "unfold: nums = { %A }" nums
            if Seq.isEmpty nums then
                printfn "Done"
                None
            else
                let num = Seq.head nums // Value to yield
                let tl = Seq.tail nums // Next State. CAUSES RE-EVALUATION!
                printfn "Yield: < %d >, tl = { %A }" num tl
                Some (num,tl))
// Get n numbers as a sequence, then unfold them as a sequence
// Observe that with 'Seq.cache' input is not re-evaluated unnecessarily,
// and also that lazy evaluation kicks in for n > 5
let experiment n =
    getNums n
    |> Seq.cache
    // Without cache, Seq.tail causes the sequence to be re-evaluated
    |> unfoldNums
    |> Seq.iter (fun x -> printfn "Process: %d" x)

F# Canopy - Generate Random Letters and or Numbers and use in a variable

I am using F# Canopy to complete some web testing. I am trying to generate a random number (with or without letters, that part is not important) and paste it into a field on my website.
The code I am currently using is
let genRandomNumbers count =
    let rnd = System.Random()
    List.init count
let l = genRandomNumbers 1
"#CompanyName" << l()
The #CompanyName is the ID of the element I am trying to pass l into. As it stands I am receiving the error 'The expression was expected to have type string but here it has type 'a list'.
Any help would be greatly appreciated.
The << operator in canopy writes a string to the selector (I haven't used it but the documentation looks pretty clear), but your function returns a list. If you want the random string to work, you could do something like this (not tested code)
let randomNumString n = genRandomNumbers n |> List.map string |> List.reduce (+)
This maps your random list to strings then concats all the strings together using the first element as the accumulator seed. You could also do a fold
let randomNumString n =
    genRandomNumbers n
    |> List.fold (fun acc i -> acc + (string i)) ""
Putting it all together
let rand = new System.Random()
let genRandomNumbers count = List.init count (fun _ -> rand.Next())
let randomNumString n = genRandomNumbers n |> List.map string |> List.reduce (+)
"#CompanyName" << (randomNumString 1)
In general, F# won't do any type promotion for you. Since the << operator wants a string on the right hand side, you need to map your list to a string somehow. That means iterating over each element, converting the number to a string, and adding all the elements together into one final string.
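As a variation on the conversion step, String.concat can join the pieces and, unlike List.reduce, does not throw when the list is empty. The sketch below reuses the names from the answer above and leaves out the canopy call, so it only shows how the string is built.

```fsharp
let rand = System.Random()

// Same generator as above: 'count' non-negative random integers.
let genRandomNumbers count = List.init count (fun _ -> rand.Next())

// String.concat joins the strings with the given separator ("" here)
// and returns "" for an empty list, where List.reduce would throw.
let randomNumString n =
    genRandomNumbers n
    |> List.map string
    |> String.concat ""
```

The result of randomNumString 1 can then be passed to the selector in the same way as before.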

Getting every nth Element of a Sequence

I am looking for a way to create a sequence consisting of every nth element of another sequence, but don't seem to find a way to do that in an elegant way. I can of course hack something, but I wonder if there is a library function that I'm not seeing.
The sequence functions whose names end in -i seem to be quite good for the purpose of figuring out when an element is the nth one or (multiple of n)th one, but I can only see iteri and mapi, neither of which really lends itself to the task.
Example:
let someseq = [1;2;3;4;5;6]
let partial = Seq.magicfunction 3 someseq
Then partial should be [3;6]. Is there anything like it out there?
Edit:
If I am not quite as ambitious and allow for the n to be constant/known, then I've just found that the following should work:
let rec thirds lst =
    match lst with
    | _::_::x::t -> x::thirds t // corrected after Tomas' comment
    | _ -> []
Would there be a way to write this shorter?
Seq.choose works nicely in these situations because it allows you to do the filter work within the mapi lambda.
let everyNth n elements =
    elements
    |> Seq.mapi (fun i e -> if i % n = n - 1 then Some(e) else None)
    |> Seq.choose id
Similar to here.
You can get the behavior by composing mapi with other functions:
let everyNth n seq =
    seq
    |> Seq.mapi (fun i el -> el, i) // Add index to element
    |> Seq.filter (fun (el, i) -> i % n = n - 1) // Take every nth element
    |> Seq.map fst // Drop index from the result
The solution using options and choose as suggested by Annon would use only two functions, but the body of the first one would be slightly more complicated (but the principle is essentially the same).
A more efficient version using the IEnumerator object directly isn't too difficult to write:
let everyNth n (input:seq<_>) =
    seq { use en = input.GetEnumerator()
          // Call MoveNext at most 'n' times (or return false earlier)
          let rec nextN n =
              if n = 0 then true
              else en.MoveNext() && (nextN (n - 1))
          // While we can move n elements forward...
          while nextN n do
              // Return each nth element
              yield en.Current }
EDIT: The snippet is also available here: http://fssnip.net/1R
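Another way to express the same idea, assuming a FSharp.Core version that provides Seq.chunkBySize, is to split the input into chunks of n elements and take the last element of every full chunk. The name everyNthChunked is made up for this sketch; it allocates an array per chunk, so it is unlikely to beat the enumerator version above.

```fsharp
// Hypothetical alternative built on Seq.chunkBySize.
let everyNthChunked n (elements: seq<'a>) =
    elements
    |> Seq.chunkBySize n                            // arrays of up to n elements
    |> Seq.filter (fun chunk -> chunk.Length = n)   // drop a short final chunk
    |> Seq.map Array.last                           // keep the nth element of each chunk
```

For the example from the question, everyNthChunked 3 [1;2;3;4;5;6] yields the elements 3 and 6.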

Return value in F# - incomplete construct

I'm trying to learn F#. I'm a complete beginner, so this might be a walkover for you guys :)
I have the following function:
let removeEven l =
    let n = List.length l;
    let list_ = [];
    let seq_ = seq { for x in 1..n do if x % 2 <> 0 then yield List.nth l (x-1)}
    for x in seq_ do
        let list_ = list_ @ [x];
    list_;
It takes a list and returns a new list containing all the elements that are placed at an odd position (counting from 1) in the original list, so removeEven [x1;x2;x3] = [x1;x3]
However, I get my already favourite error-message: Incomplete construct at or before this point in expression...
If I add a print to the end of the line, instead of list_:
...
print_any list_;
the problem is fixed. But I do not want to print the list, I want to return it!
What causes this? Why can't I return my list?
To answer your question first, the compiler complains because there is a problem inside the for loop. In F#, let serves to declare values (that are immutable and cannot be changed later in the program). It isn't a statement as in C# - let can be only used as part of another expression. For example:
let n = 10
n + n
Actually means that you want the n symbol to refer to the value 10 in the expression n + n. The problem with your code is that you're using let without any expression (probably because you want to use mutable variables):
for x in seq_ do
    let list_ = list_ @ [x] // This isn't assignment!
list_
The problematic line is an incomplete expression - using let in this way isn't allowed, because it doesn't contain any expression (the list_ value will not be accessed from any code). You can use a mutable variable to correct your code:
let mutable list_ = [] // declared as 'mutable'
let seq_ = seq { for x in 1..n do if x % 2 <> 0 then yield List.nth l (x-1)}
for x in seq_ do
    list_ <- list_ @ [x] // assignment using '<-'
Now, this should work, but it isn't really functional, because you're using imperative mutation. Moreover, appending elements using @ is a really inefficient thing to do in functional languages. So, if you want to make your code functional, you'll probably need to use a different approach. Both of the other answers show a great approach, although I prefer the example by Joel, because indexing into a list (in the solution by Chaos) also isn't very functional (there is no pointer arithmetic, so it will also be slower).
Probably the most classical functional solution would be to use the List.fold function, which aggregates all elements of the list into a single result, walking from the left to the right:
[1;2;3;4;5]
|> List.fold (fun (flag, res) el ->
    if flag then (not flag, el::res) else (not flag, res)) (true, [])
|> snd |> List.rev
Here, the state used during the aggregation is a tuple. Its first element is a Boolean flag specifying whether to include the next element (during each step, we flip the flag by returning not flag). The second element is the list aggregated so far (we add an element by el::res only when the flag is set). After fold returns, we use snd to get the second element of the tuple (the aggregated list) and reverse it using List.rev, because it was collected in reversed order (this is more efficient than appending to the end using res@[el]).
Edit: If I understand your requirements correctly, here's a version of your function done functional rather than imperative style, that removes elements with odd indexes.
let removeEven list =
    list
    |> Seq.mapi (fun i x -> (i, x))
    |> Seq.filter (fun (i, x) -> i % 2 = 0)
    |> Seq.map snd
    |> List.ofSeq
> removeEven ['a'; 'b'; 'c'; 'd'];;
val it : char list = ['a'; 'c']
I think this is what you are looking for.
let removeEven list =
    let maxIndex = (List.length list) - 1;
    seq { for i in 0..2..maxIndex -> list.[i] }
    |> Seq.toList
Tests
val removeEven : 'a list -> 'a list
> removeEven [1;2;3;4;5;6];;
val it : int list = [1; 3; 5]
> removeEven [1;2;3;4;5];;
val it : int list = [1; 3; 5]
> removeEven [1;2;3;4];;
val it : int list = [1; 3]
> removeEven [1;2;3];;
val it : int list = [1; 3]
> removeEven [1;2];;
val it : int list = [1]
> removeEven [1];;
val it : int list = [1]
You can try a pattern-matching approach. I haven't used F# in a while and I can't test things right now, but it would be something like this:
let rec curse sofar ls =
    match ls with
    | even :: odd :: tl -> curse (even :: sofar) tl
    | even :: [] -> curse (even :: sofar) []
    | [] -> List.rev sofar
curse [] [ 1; 2; 3; 4; 5 ]
This recursively picks off the even elements. I think. I would probably use Joel Mueller's approach though. I don't remember if there is an index-based filter function, but that would probably be the ideal to use, or to make if it doesn't exist in the libraries.
But in general lists aren't really meant as index-type things. That's what arrays are for. If you consider what kind of algorithm would require a list having its even elements removed, maybe it's possible that in the steps prior to this requirement, the elements can be paired up in tuples, like this:
[ (1,2); (3,4) ]
That would make it trivial to get the even-"indexed" elements out:
thelist |> List.map fst // take first element from each tuple
There's a variety of options if the input list isn't guaranteed to have an even number of elements.
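The index-based filter function mentioned above is easy to write yourself if you don't have one at hand. The following is a sketch under that assumption: filteri is a hypothetical helper name, built on List.indexed, and removeEven is then a one-liner on top of it.

```fsharp
// Hypothetical helper: keep the elements whose zero-based index satisfies 'predicate'.
let filteri (predicate: int -> bool) (items: 'a list) : 'a list =
    items
    |> List.indexed                              // pair each element with its index
    |> List.filter (fun (i, _) -> predicate i)   // test the index only
    |> List.map snd                              // drop the index again

// Keep the elements at even zero-based indexes (the 1st, 3rd, 5th, ... element).
let removeEven items = filteri (fun i -> i % 2 = 0) items
```

With this definition, removeEven [1;2;3;4;5] gives [1; 3; 5], which matches the test output shown in the earlier answer.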
Yet another alternative, which (by my reckoning) is slightly slower than Joel's, but it's shorter :)
let removeEven list =
    list
    |> Seq.mapi (fun i x -> (i, x))
    |> Seq.choose (fun (i,x) -> if i % 2 = 0 then Some(x) else None)
    |> List.ofSeq

Resources