I would like to sort some tab separated data that is of the following form.
Marketing, Advertising, PR Graduate, Trainees Oil, Gas, Alternative Energy
Marketing, Advertising, PR Graduate, Trainees Public Sector & Services
Marketing, Advertising, PR Graduate, Trainees Recruitment Sales
Marketing, Advertising, PR Graduate, Trainees Secretarial, PAs, Administration
Marketing, Advertising, PR Graduate, Trainees Senior Appointments
Marketing, Advertising, PR Graduate, Trainees Telecommunications
Marketing, Advertising, PR Graduate, Trainees Transport, Logistics
Other Graduate, Trainees Banking, Insurance, Finance
Other Graduate, Trainees Customer Services
Other Graduate, Trainees Education
Other Graduate, Trainees Health, Nursing
Other Graduate, Trainees Legal
Other Graduate, Trainees Management Consultancy
The data mixes single-word and multi-word phrases. Words within a phrase are separated by commas, and the phrases themselves are tab delimited.
I need to compare it with another set of data where the text cells have been helpfully sorted alphabetically.
Obviously this makes direct comparison difficult (impossible).
Following ovastus's suggestion below I have the following code
open System;;
open System.IO;;
#load "BigDataModule.fs";;
open BigDataModule;;
let sample = "TruncatedData.txt";;
let outputFile = "SortedOutput.csv";;
let sortWithinRow (row:string) =
    let columns = row.Split([|'\t'|])
    let sortedColumns =
        Seq.append
            (columns |> Seq.take (columns.Length) |> Seq.sort)
            [ columns.[columns.Length - 1] ]
    sortedColumns |> String.concat ",";;
sample |> readLines |> Seq.map sortWithinRow |> saveTo (outputFile);;
Where readLines and saveTo are functions in my own Big Data module for reading in files and saving outputs.
When I get the output from this script, unfortunately the sort has not produced the desired result and the rows are still not sorted alphabetically.
If anyone can help me to further refine my script I will be very grateful.
I apologise for wasting time, having originally underdetermined the problem by oversimplifying the format of the input.
EDIT 1: Clarified I have saved the data as a csv file and will do this in F#.
EDIT 2: I have gotten rid of all of the extraneous parts of the data set, I just need to sort within these rows. I have also given further details of some code I have tried.
EDIT 3:
This was the original data frame I entered, which was an oversimplification
Alpha Bravo Tango Delta 15.00
Bravo Delta Tango 20.30
Delta Alpha Tango 6.17
Charlie Tango Foxtrot Alpha 19.13
I'm not sure if I understand correctly what you want, but if you want to generate this output:
Alpha Bravo Delta Tango 15.00
Bravo Delta Tango 20.30
Alpha Delta Tango 6.17
Alpha Charlie Foxtrot Tango 19.13
You can do it like this:
open System
let sample = """Alpha Bravo Tango Delta 15.00
Bravo Delta Tango 20.30
Delta Alpha Tango 6.17
Charlie Tango Foxtrot Alpha 19.13""".Split [|'\n'|]
let sortWithinRow (row:string) =
    let columns = row.Split([|' '|], StringSplitOptions.RemoveEmptyEntries)
    let sortedColumns =
        Seq.append
            (columns |> Seq.take (columns.Length - 1) |> Seq.sort)
            [ columns.[columns.Length - 1] ]
    sortedColumns |> String.concat " "
sample |> Seq.map sortWithinRow |> String.concat "\n"
What about the following?
sample |>
    Seq.map (fun x -> x.Split('\t')) |>
    Seq.map (Seq.map (fun x -> x.Trim())) |>
    Seq.map (Seq.filter (fun x -> not (String.IsNullOrEmpty(x)))) |>
    Seq.map Seq.sort |>
    Seq.map (String.concat "\t") |>   // String.concat takes a string separator, not a char
    String.concat "\n";;
I can't type \t in a way that will paste for an example, so for an executable example I had to switch field delimiters to spaces
open System
let sample2 = """Alpha Bravo Tango Delta 15.00
Bravo Delta Tango 20.30
Delta Alpha Tango 6.17
Charlie Tango Foxtrot Alpha 19.13""".Split [|'\n'|]
sample2 |>
    Seq.map (fun x -> x.Split([|" "|], StringSplitOptions.None)) |>
    Seq.map (Seq.map (fun x -> x.Trim())) |>
    Seq.map (Seq.filter (fun x -> not (String.IsNullOrEmpty(x)))) |>
    Seq.map Seq.sort |>
    Seq.map (String.concat "\t") |>   // String.concat takes a string separator, not a char
    String.concat "\n";;
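For the tab-delimited phrases in the revised question, the tab can be written as the "\t" escape inside an ordinary string literal, which avoids the pasting problem. The following is only a sketch: the two sample rows are made up from phrases in the question, and sortPhrases is a hypothetical helper name.

```fsharp
open System

// Hypothetical sample rows; each phrase is separated from the next by a tab.
let rows =
    [ "Other\tGraduate, Trainees\tEducation"
      "Marketing, Advertising, PR\tGraduate, Trainees\tLegal" ]

// Split a row into phrases, sort the phrases, and join them with tabs again.
// The commas inside a phrase are left untouched.
let sortPhrases (row: string) =
    row.Split('\t')
    |> Array.map (fun phrase -> phrase.Trim())
    |> Array.filter (fun phrase -> phrase <> "")
    |> Array.sort
    |> String.concat "\t"

let sorted = rows |> List.map sortPhrases
```

If the words inside each phrase also need sorting, the same split, sort, and join steps could be applied to each phrase with a comma as the separator.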
Try using F# Data
[<Literal>]
let sample = """Text1,Text2,Text3,Text4,ValueField
Alpha,Bravo,Tango,Delta,15.00
Bravo,Delta,Tango,,20.30
Delta,Alpha,Tango,,6.17
Charlie,Tango,Foxtrot,Alpha,19.13"""
open FSharp.Data
let csv = CsvProvider<sample, Separator = ",">.Load("input.csv")
let sortedData =
csv.Data
|> Seq.sortBy (fun row -> row.Text1)
|> Seq.map (fun row -> row.Columns |> String.concat ",")
System.IO.File.WriteAllLines("output.csv", sortedData)
If you want to sort by multiple fields you can just tuple them in the sorting function:
|> Seq.sortBy (fun row -> row.Text1, row.Text3)
Related
I've looked at a few examples in Stack Overflow etc and I can't seem to get my specific scenario working.
I want to iterate over a list of 1500 items, but each item will hit the API, so I want to limit the number of concurrent threads to about 6.
Is this the right way to do this? I'm just afraid it won't limit the actual threads I need to.
let fullOrders =
orders.AsParallel().WithDegreeOfParallelism(6) |>
Seq.map (fun (order) -> getOrderinfo(order) )
You can easily test this by using something like:
open System.Linq   // needed for AsParallel, WithDegreeOfParallelism and Select
let orders = [ 0 .. 10 ]
let getOrderInfo a =
    printfn "Starting: %d" a
    System.Threading.Thread.Sleep(1000)
    printfn "Finished: %d" a
The first issue here is that AsParallel does not work with Seq.map (which is just normal synchronous iteration over a collection), so the following runs the tasks sequentially with no parallelism:
let fullOrders =
    orders.AsParallel().WithDegreeOfParallelism(6)
    |> Seq.map (fun (order) -> getOrderInfo(order) )
fullOrders |> Seq.length
To make it parallel, you'd need to use the Select method on ParallelQuery instead, which does exactly what you wanted:
let fullOrders =
    [0 .. 10].AsParallel().WithDegreeOfParallelism(6).Select(fun order ->
        getOrderInfo(order) )
fullOrders |> Seq.length
Note that I added Seq.length to the end just to force the evaluation of the (lazy) sequence.
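If you want to check the limit rather than read interleaved console output, one option is to count how many calls are in flight at once. This is only a sketch: fakeGetOrderInfo is a hypothetical stand-in for the real API call, and the counters are my own additions.

```fsharp
open System.Linq
open System.Threading

let running = ref 0   // calls currently in flight
let peak = ref 0      // highest number of calls seen in flight at the same time
let gate = obj ()

// Hypothetical stand-in for the real API call.
let fakeGetOrderInfo (order: int) =
    let now = Interlocked.Increment(running)
    lock gate (fun () -> if now > peak.Value then peak.Value <- now)
    Thread.Sleep 100
    Interlocked.Decrement(running) |> ignore
    order * 2

let orders = [ 0 .. 10 ]

// ToArray forces evaluation of the parallel query.
let results =
    orders.AsParallel().WithDegreeOfParallelism(6).Select(fun order -> fakeGetOrderInfo order).ToArray()

printfn "Peak concurrency: %d" peak.Value
```

WithDegreeOfParallelism sets an upper bound, so the reported peak should never exceed 6, though it may be lower. The results are not guaranteed to come back in input order unless AsOrdered is added to the query.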
What are the essential functions to find duplicate elements within a list?
Translated, how can I simplify the following function:
let numbers = [ 3;5;5;8;9;9;9 ]
let getDuplicates = numbers |> List.groupBy id
                            |> List.map snd
                            |> List.filter (fun set -> set.Length > 1)
                            |> List.map (fun set -> set.[0])
I'm sure this is a duplicate. However, I am unable to locate the question on this site.
UPDATE
let getDuplicates numbers =
    numbers |> List.groupBy id
            |> List.choose (fun (k,v) -> match v.Length with
                                         | x when x > 1 -> Some k
                                         | _ -> None)
Simplifying your function:
Whenever you have a filter followed by a map, you can probably replace the pair with a choose. The purpose of choose is to run a function for each value in the list, and return only the items which return Some value (None values are removed, which is the filter portion). Whatever value you put inside Some is the map portion:
let getDuplicates = numbers |> List.groupBy id
                            |> List.map snd
                            |> List.choose( fun( set ) ->
                                if set.Length > 1
                                    then Some( set.[0] )
                                    else None )
We can take it one additional step by removing the map. In this case, keeping the tuple which contains the key is helpful, because it eliminates the need to get the first item of the list:
let getDuplicates = numbers |> List.groupBy id
                            |> List.choose( fun( key, set ) ->
                                if set.Length > 1
                                    then Some key
                                    else None )
Is this simpler than the original? Perhaps. Because choose combines two purposes, it is by necessity more complex than those purposes kept separate (the filter and the map), and this makes it harder to understand at a glance, perhaps undoing the more "simplified" code. More on this later.
Decomposing the concept
Simplifying the code wasn't the direct question, though. You asked about functions useful in finding duplicates. At a high level, how do you find a duplicate? It depends on your algorithm and specific needs:
Your given algorithm uses the "put items in buckets based on their value", and "look for buckets with more than one item". This is a direct match to List.groupBy and List.choose (or filter/map)
A different algorithm could be to "iterate through all items", "modify an accumulator as we see each", then "report all items which have been seen multiple times". This is kind of like the first algorithm, where something like List.fold is replacing List.groupBy, but if you need to drag some other kind of state along, it may be helpful.
Perhaps you need to know how many times there are duplicates. A different algorithm satisfying these requirements may be "sort the items so they are always ascending", and "flag if the next item is the same as the current item". In this case, you have a List.sort followed by a List.toSeq then Seq.windowed:
let getDuplicates = numbers |> List.sort
                            |> List.toSeq
                            |> Seq.windowed 2
                            |> Seq.choose( function
                                | [|x; y|] when x = y -> Some x
                                | _ -> None )
Note that this returns a sequence with [5; 9; 9], informing you that 9 is duplicated twice.
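The accumulator-based approach described above has no code attached, so here is a minimal sketch of one way it could look. It assumes the accumulator is a Map from each value to the number of times it has been seen; countOccurrences and duplicatesWithCounts are hypothetical names.

```fsharp
let numbers = [ 3; 5; 5; 8; 9; 9; 9 ]

// Fold over the items, carrying a Map of value -> occurrences seen so far.
let countOccurrences items =
    (Map.empty, items)
    ||> List.fold (fun counts item ->
        let seen = counts |> Map.tryFind item |> Option.defaultValue 0
        counts |> Map.add item (seen + 1))

// Keep only the values that were seen more than once.
let duplicatesWithCounts =
    countOccurrences numbers
    |> Map.filter (fun _ count -> count > 1)
    |> Map.toList
// [(5, 2); (9, 3)]
```

Because the accumulator is an ordinary value, it can be widened to a tuple or record if other state needs to travel along with the counts.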
These were algorithms mostly based on List functions. There are already two answers, one mutable, the other not, which are based on sets and existence.
My point is, a complete list of functions helpful to finding duplicates would read like a who's who list of existing collection functions -- it all depends on what you're trying to do and your specific requirements. I think your choice of List.groupBy and List.choose is probably about as simple as it gets.
Simplifying for maintainability
The last thought on simplification is to remember that simplifying code will improve the readability of your code to a certain extent. "Simplifying" beyond that point will most likely involve tricks, or obscure intent. If I were to look back at a sample of code I wrote even several weeks and a couple of projects ago, the shortest and perhaps simplest code would probably not be the easiest to understand. Thus the last point -- simplifying future code maintainability may be your goal. If this is the case, your original algorithm modified only keeping the groupBy tuple and adding comments as to what each step of the pipeline is doing may be your best bet:
// combine numbers into common buckets specified by the number itself
let getDuplicates = numbers |> List.groupBy id
                            // only look at buckets with more than one item
                            |> List.filter( fun (_,set) -> set.Length > 1)
                            // change each bucket to only its key
                            |> List.map( fun (key,_) -> key )
The original question comments already show that your code was unclear to people unfamiliar with it. Is this a question of experience? Definitely. But, regardless of whether we work on a team, or are lone wolves, optimizing code (where possible) for quick understanding should probably be close to everyone's top priority. (climbing down off soapbox...) :)
Regardless, best of luck.
If you don't mind using a mutable collection in a local scope, this could do it:
open System.Collections.Generic
let getDuplicates numbers =
    let known = HashSet()
    numbers |> List.filter (known.Add >> not) |> set
You can wrap the last three operations in a List.choose:
let duplicates =
    numbers
    |> List.groupBy id
    |> List.choose ( function
        | _, x::_::_ -> Some x
        | _ -> None )
Here's a solution which uses only basic functions and immutable data structures:
let findDups elems =
    let findDupsHelper (oneOccurrence, manyOccurrences) elem =
        if oneOccurrence |> Set.contains elem
        then (oneOccurrence, manyOccurrences |> Set.add elem)
        else (oneOccurrence |> Set.add elem, manyOccurrences)
    List.fold findDupsHelper (Set.empty, Set.empty) elems |> snd
I'm an R developer who is interested in getting good at F#, so this question is part of a broader theme of how to shape and reshape data.
Question:
There are three months in the NYC Flight Delays dataset where there were more than 7000 weather delays. I would like to filter out all other months so that I have only those three months to analyze. How would this be done in F#? Is the long-term F# solution just to call R? Or are there robust data libraries in .NET that can already do these sorts of tasks?
You can use the CSV Type Provider from FSharp.Data to get strongly typed access to your data, even directly from the internet address:
#r "../packages/FSharp.Data.2.2.5/lib/net40/FSharp.Data.dll"
open System
open FSharp.Data
type FlightDelays =
    CsvProvider<"https://raw.githubusercontent.com/wiki/arunsrinivasan/flights/NYCflights14/delays14.csv">
This gives you strongly typed access to the data source. As an example, to find all the months with weather delays more than 7000, you can do something like this:
let monthsWithDelaysOver7k =
    FlightDelays.GetSample().Rows
    |> Seq.filter (fun r -> not (Double.IsNaN r.Weather_delay))
    |> Seq.groupBy (fun r -> r.Year, r.Month)
    |> Seq.map (fun ((y, m), rs) -> y, m, rs |> Seq.sumBy (fun r -> r.Weather_delay))
    |> Seq.filter (fun (y, m, d) -> d >= 7000.)
Converted to a list, the data looks like this:
> monthsWithDelaysOver7k |> Seq.toList;;
val it : (int * int * float) list =
[(2014, 1, 118753.0); (2014, 2, 59567.0); (2014, 4, 7618.0);
(2014, 5, 11594.0); (2014, 6, 15928.0); (2014, 7, 54298.0);
(2014, 10, 7241.0)]
You can now use monthsWithDelaysOver7k to get all the rows in those months.
You can probably write some more efficient queries than the above, but this should give you an idea about how to approach the problem.
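As a rough sketch of that last step, the selected (year, month) pairs can be put in a set and used to filter the rows. To keep the example runnable without the type provider or a network connection, it uses a hypothetical Flight record and made-up numbers in place of the provider's rows.

```fsharp
// Hypothetical stand-in for a provider row; only the fields used here.
type Flight = { Year: int; Month: int; WeatherDelay: float }

// Made-up sample data, not taken from the real dataset.
let flights =
    [ { Year = 2014; Month = 1; WeatherDelay = 5000.0 }
      { Year = 2014; Month = 1; WeatherDelay = 3000.0 }
      { Year = 2014; Month = 2; WeatherDelay = 100.0 }
      { Year = 2014; Month = 3; WeatherDelay = 7500.0 } ]

// (year, month) pairs whose total weather delay exceeds 7000.
let heavyMonths =
    flights
    |> List.groupBy (fun f -> f.Year, f.Month)
    |> List.filter (fun (_, rows) -> (rows |> List.sumBy (fun f -> f.WeatherDelay)) > 7000.0)
    |> List.map fst
    |> set

// Keep only the rows that fall in one of those months.
let flightsInHeavyMonths =
    flights |> List.filter (fun f -> Set.contains (f.Year, f.Month) heavyMonths)
```

With the real provider, the same filter would presumably run over the Rows sequence, using Seq functions and the Year and Month properties of each row.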
I am new to programming and F# is my first .NET language.
I would like to read the contents of a text file, count the number of occurrences of each word, and then return the 10 most common words and the number of times each of them appears.
My questions are: Is using a dictionary encouraged in F#? How would I write the code if I wish to use a dictionary? (I have browsed through the Dictionary class on MSDN, but I am still puzzling over how I can update the value to a key.) Do I always have to resort to using Map in functional programming?
While there's nothing wrong with the other answers, I'd like to point out that there's already a specialized function that counts the occurrences of each unique key in a sequence: Seq.countBy. Plumbing the relevant parts of Reed's and torbonde's answers together:
let countWordsTopTen (s : string) =
    s.Split([|','|])
    |> Seq.countBy (fun s -> s.Trim())
    |> Seq.sortBy (snd >> (~-))
    |> Seq.truncate 10
"one, two, one, three, four, one, two, four, five"
|> countWordsTopTen
|> printfn "%A" // seq [("one", 3); ("two", 2); ("four", 2); ("three", 1); ...]
My questions are: Is using a dictionary encouraged in F#?
Using a Dictionary is fine from F#, though it does use mutability, so it's not quite as common.
How would I write the code if I wish to use a dictionary?
If you read the file and have a string with comma-separated values, you could parse it using something similar to:
open System.Collections.Generic
// Just an example of input - this would come from your file...
let strings = "one, two, one, three, four, one, two, four, five"
let words =
    strings.Split([|','|])
    |> Array.map (fun s -> s.Trim())
let dict = Dictionary<_,_>()
words
|> Array.iter (fun w ->
    match dict.TryGetValue w with
    | true, v -> dict.[w] <- v + 1
    | false, _ -> dict.[w] <- 1)
// Creates a sequence of tuples, with (word,count) in order
let topTen =
    dict
    |> Seq.sortBy (fun kvp -> -kvp.Value)
    |> Seq.truncate 10
    |> Seq.map (fun kvp -> kvp.Key, kvp.Value)
I would say an obvious choice for this task is to use the Seq module, which is really one of the major workhorses in F#. As Reed said, using dictionary is not as common, since it is mutable. Sequences, on the other hand, are immutable. An example of how to do this using sequences is
let strings = "one, two, one, three, four, one, two, four, five"
let words =
    strings.Split([|','|])
    |> Array.map (fun s -> s.Trim())
let topTen =
    words
    |> Seq.groupBy id
    |> Seq.map (fun (w, ws) -> (w, Seq.length ws))
    |> Seq.sortBy (snd >> (~-))
    |> Seq.truncate 10
I think the code speaks pretty much for itself, although maybe the second last line requires a short explanation:
The snd-function gives the second entry in a pair (i.e. snd (a,b) is b), >> is the functional composition operator (i.e. (f >> g) a is the same as g (f a)) and ~- is the unary minus operator. Note here that operators are essentially functions, but when using (and declaring) them as functions, you have to wrap them in parentheses. That is, -3 is the same as (~-) 3, where in the last case we have used the operator as a function.
In total, what the second last line does, is sort the sequence by the negative value of the second entry in the pair (the number of occurrences).
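To make the composition concrete, here is a small sketch comparing it with two spellings that should order the data the same way: an explicit lambda, and Seq.sortByDescending (available from F# 4.0). The sample pairs are made up and deliberately have no tied counts.

```fsharp
// Made-up (word, count) pairs with distinct counts.
let counts = [ "five", 1; "one", 3; "two", 2 ]

// Sort by the negated second element, written three ways.
let viaComposition = counts |> Seq.sortBy (snd >> (~-)) |> Seq.toList
let viaLambda = counts |> Seq.sortBy (fun (_, n) -> -n) |> Seq.toList
let viaDescending = counts |> Seq.sortByDescending snd |> Seq.toList
// each is [("one", 3); ("two", 2); ("five", 1)]
```

Which one reads best is a matter of taste; the lambda and sortByDescending versions may be easier on readers who have not met (~-) before.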
Given a dataset, for example a CSV file that might look like this:
x,y
1,2
1,5
2,1
2,2
1,1
...
I wish to create a map of lists containing the y's for a given x... The result could look like this:
{1:[2,5,1], 2:[1,2]}
In Python this would be straightforward to do in an imperative manner, and would probably look something like this:
d = defaultdict(list)
for x,y in csv_data:
d[x].append(y)
How would you go about achieving the same using functional programming techniques in F#?
Is it possible to do it as short, efficient and concise (and readable) as in the given Python example, using only functional style? Or would you have to fall back to an imperative programming style with mutable data structures?
Note: this is not a homework assignment, just me trying to wrap my head around functional programming
EDIT: My conclusion based on answers thus far
I tried timing each of the provided answers on a relatively big CSV file, just to get a feeling for the performance. I also did a small test with the imperative approach:
let res = new Dictionary<string, List<string>>()
for row in l do
    if (res.ContainsKey(fst row) = false) then
        res.[fst row] <- new List<string>()
    res.[fst row].Add(snd row)
The imperative approach completed in ~0.34 sec.
I think that the answer provided by Lee is the most general FP one, however the running time was ~4sec.
The answer given by Daniel ran in ~1.55sec.
And at last the answer given by jbtule ran in ~0.26. (I find it very interesting that it beat the imperative approach)
I used 'System.Diagnostics.Stopwatch()' for timing, and the code is executed as F# 3.0 in .Net 4.5
EDIT2: fixed stupid error in imperative f# code, and ensured that it uses the same list as the other solutions
[
    1,2
    1,5
    2,1
    2,2
    1,1
]
|> Seq.groupBy fst
|> Seq.map (fun (x, ys) -> x, [for _, y in ys -> y])
|> Map.ofSeq
let addPair m (x, y) =
    match Map.tryFind x m with
    | Some(l) -> Map.add x (y::l) m
    | None -> Map.add x [y] m
let csv (pairs : (int * int) list) = List.fold addPair Map.empty pairs
Note this adds the y values to the list in reverse order
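If the original order matters, one possible variation is to fold from the right with List.foldBack, so that prepending each y rebuilds the lists in input order. This is only a sketch; addPairBack and groupInOrder are hypothetical names, and the argument order of the helper is flipped to match what List.foldBack expects.

```fsharp
// List.foldBack passes the element first and the accumulator second.
let addPairBack (x, y) m =
    match Map.tryFind x m with
    | Some ys -> Map.add x (y :: ys) m
    | None -> Map.add x [ y ] m

// Folding from the right means the last pair is added first,
// so prepending keeps each list in the original input order.
let groupInOrder (pairs: (int * int) list) =
    List.foldBack addPairBack pairs Map.empty

let grouped = groupInOrder [ 1, 2; 1, 5; 2, 1; 2, 2; 1, 1 ]
// map [(1, [2; 5; 1]); (2, [1; 2])]
```

Another option is to keep the left fold and reverse each list at the end with Map.map (fun _ ys -> List.rev ys).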
Use LINQ in F#; LINQ is functional.
open System.Linq
let data =
    [
        1,2
        1,5
        2,1
        2,2
        1,1
    ]
let lookup = data.ToLookup(fst,snd)
lookup.[1] //seq [2;5;1]
lookup.[2] //seq [1;2]
For fun, an implementation using a query expression:
let res =
    query { for (k, v) in data do
            groupValBy v k into g
            select (g.Key, List.ofSeq g) }
    |> Map.ofSeq