Good evening! I am a very new programmer getting my feet wet with F#. I am attempting to do some simple data analysis and plotting, but I cannot figure out how to access the data properly. I get everything set up and use the CsvProvider, and it works perfectly:
#load #"packages\FsLab\FsLab.fsx"
#load #"packages\FSharp.Charting\FSharp.Charting.fsx"
open Deedle
open FSharp.Data
type Pt = CsvProvider<"C:/Users/berkl/Test10/CGC.csv">
let data = Pt.Load("C:/Users/berkl/Test10/CGC.csv")
Then, I pull out the data for a specific entry:
let test = data.Rows |> Seq.filter (fun r -> r.``Patient number`` = 2104)
This works as expected and prints the following to FSI:
test;;
val it : seq<CsvProvider<...>.Row> =
seq
[(2104, "Cita 1", "Nuevo", "Femenino", nan, nan, nan);
(2104, "Cita 2", "Establecido", "", 18.85191818, 44.0, 103.0);
(2104, "Cita 3", "Establecido", "Femenino", 17.92617533, 46.0, 108.0);
(2104, "Cita 4", "Establecido", "Femenino", nan, nan, nan); ...]
Here is where I'm at a loss. I want to take the fifth column and plot it against the sixth column, but I don't know how to access them.
What I can do so far is access a single value in one of the columns:
let Finally = Seq.item 1 test
let PtHt = Finally.Ht_cm
Any help is much appreciated!!
I would probably recommend using the XPlot library instead of F# Charting, because that is the one that's going to be available in FsLab in the long term (it is cross-platform).
To create a chart using XPlot, you need to give it a sequence of pairs with X and Y values:
#load "packages/FsLab/FsLab.fsx"
open XPlot.Plotly
Chart.Scatter [ for x in 0.0 .. 0.1 .. 10.0 -> x, sin x ]
In your example, you can get the required format using sequence comprehensions (as in the above example) or using Seq.map as in the existing answer - both options do the same thing:
// Using sequence comprehensions
Chart.Scatter [ for row in test -> row.Ht_cm, row.Wt_kg ]
// Using Seq.map and piping
test |> Seq.map (fun row -> row.Ht_cm, row.Wt_kg) |> Chart.Scatter
The key thing is that you need to produce one sequence (or a list) containing the X and Y values as a tuple (rather than producing two separate sequences).
What you want to do is transform your sequence of rows to a sequence of values from a column. You use Seq.map for any such transformation.
In your case, you could do (modulo the correct column names, which I don't have):
let col5 =
    test
    |> Seq.map (fun row -> row.Ht_cm)

let col6 =
    test
    |> Seq.map (fun row -> row.Wt_kg)
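If you then want the two columns paired up for plotting (a minimal sketch, assuming the same placeholder column names as above), you can zip them:

// pair the two projected columns into (x, y) tuples for charting
let points = Seq.zip col5 col6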
Good morning everyone,
I have to do a programming exercise, but I'm stuck!
The exercise requires a function that, given a non-empty list of integers, returns the first number with the maximum number of occurrences.
For example:
mode [1;2;5;1;2;3;4;5;5;4;5;5] ==> 5
mode [2;1;2;1;1;2] ==> 2
mode [-1;2;1;2;5;-1;5;5;2] ==> 2
mode [7] ==> 7
Important: the exercise must be solved in a functional style.
My idea is:
let rec occurences_counter xs i =
    match xs with
    | [] -> failwith "Error"
    | x :: xs when x = i -> 1 + occurences_counter xs i
    | x :: xs -> occurences_counter xs i;;
This is the function where I'm stuck:
let rec mode (l : int list) : int =
    match l with
    | [] -> failwith "Error"
    | [x] -> x
    | x :: y :: l when occurences_counter l x >= occurences_counter l y -> x :: mode l
    | x :: y :: l when occurences_counter l y > occurences_counter l x -> y :: mode l;;
Thanks in advance; I'm a newbie in programming and on Stack Overflow.
Sorry for my English.
One solution: first compute a list of pairs (number, occurrences).
Hint: use List.assoc.
Then loop over that list of pairs to find the maximum occurrence count, and return the corresponding number.
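In F#, a minimal sketch of this counting approach (quadratic, like the assoc-list hint; mode here is just the exercise's required name):

let mode (xs: int list) =
    // count the occurrences of one value in the whole list (O(n) per call)
    let count x = xs |> List.filter ((=) x) |> List.length
    // List.maxBy keeps the first element among ties, which matches the
    // "first number with the maximum number of occurrences" requirement
    xs |> List.maxBy count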
One suggestion:
Your algorithm could be simplified if you sort the list first. This has O(N log N) complexity. Then measure the longest run of identical numbers.
This is a good strategy because you delegate the hard part of the work to a well-known algorithm.
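A hedged F# sketch of that strategy (modeBySort is an illustrative name; note that sorting discards input order, so ties between equally frequent numbers are broken by value rather than by first appearance):

let modeBySort (xs: int list) =
    xs
    |> List.sort                                     // O(N log N); equal numbers become adjacent
    |> List.fold (fun (best, bestLen, prev, runLen) x ->
        // extend the current run, or start a new run of length 1
        let runLen = if prev = Some x then runLen + 1 else 1
        if runLen > bestLen then (x, runLen, Some x, runLen)
        else (best, bestLen, Some x, runLen)) (0, 0, None, 0)
    |> fun (best, _, _, _) -> best                   // returns 0 for an empty list; the exercise guarantees non-empty input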
It is probably not the most beautiful code, but here is what I came up with (F#). First I transform every element into an intermediate format. This format contains the element itself, the position of its first occurrence, and the number of times it occurred.
type T<'a> = {
    Element: 'a
    Position: int
    Occurred: int
}
The idea is that those records can be added. So you can first transform every element and then add them together. A list like
[1;3]
will be first transformed to
[{Element=1;Position=0;Occurred=1}; {Element=3;Position=1;Occurred=1}]
When adding two together, you can only add those with the same Element. The lower Position of the two is taken, and the Occurred counts are just added together. So if you for example have
{Element=3;Position=1;Occurred=2} {Element=3;Position=3;Occurred=2}
the result will be
{Element=3;Position=1;Occurred=4}
The idea I had in mind was a monoid. But in a real monoid you would also have to define how to add different Elements together. After trying some things out, I felt that the restriction of only adding the same Element was much easier. I created a small module with the type, including some helper functions for creating, adding, and comparing.
module Occurred =
    type T<'a> = {
        Element: 'a
        Position: int
        Occurred: int
    }

    let create x pos occ = {Element=x; Position=pos; Occurred=occ}

    let sameElements x y = x.Element = y.Element

    let add x y =
        if not <| sameElements x y then failwith "Cannot add two different Occurred"
        create x.Element (min x.Position y.Position) (x.Occurred + y.Occurred)

    let compareOccurredPosition x y =
        let occ = compare x.Occurred y.Occurred
        let pos = compare x.Position y.Position
        match occ, pos with
        | 0, x -> x * -1
        | x, _ -> x
With this setup I then wrote two additional functions. One is an aggregate function that first turns every element into an Occurred.T and groups them by Element (the result is a list of lists). It then uses List.reduce on the inner lists to add the Occurred values with the same Element together. The result is a list that contains a single Occurred.T for every Element, carrying its first Position and the total number of occurrences.
let aggregate =
    List.mapi (fun i x -> Occurred.create x i 1)
    >> List.groupBy (fun occ -> occ.Element)
    >> List.map (fun (x, occ) -> List.reduce Occurred.add occ)
You can use that aggregate function to implement different aggregation logic. In your case you only wanted the element with the highest occurrence count and the lowest position, so I wrote another function that does that.
let firstMostOccurred =
    List.sortWith (fun x y -> (Occurred.compareOccurredPosition x y) * -1)
    >> List.head
    >> (fun x -> x.Element)
One note: Occurred.compareOccurredPosition is written so that it sorts everything in ascending order; I think people expect that by default, going from the smallest to the biggest element. So by default the first element would be the one with the lowest occurrence count and the biggest position. Multiplying its result by -1 turns it into a descending sort. The reason I did that is so I could use List.head. I could also have used List.last to get the last element, but I felt it was better not to traverse the whole list again just to fetch the last element. On top of that, you didn't want an Occurred.T, you wanted the element itself, so I unwrap Element to get the number.
Here is everything in action
let ll = [
    [1;2;5;1;2;3;4;5;5;4;5;5]
    [2;1;2;1;1;2]
    [-1;2;1;2;5;-1;5;5;2]
    [7]
]
ll
|> List.map aggregate
|> List.map firstMostOccurred
|> List.iter (printfn "%d")
This code will now print
5
2
2
7
It still has some rough edges:
Occurred.add throws an exception if you try to add Occurred values with different Elements.
List.head throws an exception for empty lists.
In both cases, no code is written to handle those cases or to make sure an exception is not raised.
You need to process your input list while maintaining a state that stores the number of occurrences of each number. Basically, the state can be a map, where keys are drawn from the domain of list elements and values are natural numbers. If you use a Map, the algorithm will have O(N log N) complexity. You can also use an association list (i.e., a list of type ('key,'value) list) to implement the map; this leads to quadratic complexity. Another approach is to use a hash table or an array whose length equals the size of the input domain. Both will give you linear complexity.
After you have collected the statistics (i.e., a mapping from each element to the number of its occurrences), you need to go through the set of winners and choose the one that came first in the list.
In OCaml the solution would look like this:
open Core_kernel.Std
let mode xs : int =
List.fold xs ~init:Int.Map.empty ~f:(fun stat x ->
Map.change stat x (function
| None -> Some 1
| Some n -> Some (n+1))) |>
Map.fold ~init:Int.Map.empty ~f:(fun ~key:x ~data:n modes ->
Map.add_multi modes ~key:n ~data:x) |>
Map.max_elt |> function
| None -> invalid_arg "mode: empty list"
| Some (_,ms) -> List.find_exn xs ~f:(List.mem ms)
The algorithm is the following:
Run through input and compute frequency of each element
Run through statistics and compute spectrum (i.e., a mapping from frequency to elements).
Get the set of elements that has the highest frequency, and find an element in the input list, that is in this set.
For example, if we take sample [1;2;5;1;2;3;4;5;5;4;5;5],
stats = {1 => 2; 2 => 2; 3 => 1; 4 => 2; 5 => 5}
modes = {1 => [3]; 2 => [1;2;4]; 5 => [5]}
You need to install the core library to play with it. Use coretop to try this function in the toplevel, or corebuild to compile it, like this:
corebuild test.byte --
if the source code is stored in test.ml
I am new to programming and F# is my first .NET language.
I would like to read the contents of a text file, count the number of occurrences of each word, and then return the 10 most common words and the number of times each of them appears.
My questions are: Is using a dictionary encouraged in F#? How would I write the code if I wish to use a dictionary? (I have browsed through the Dictionary class on MSDN, but I am still puzzling over how I can update the value for a key.) Do I always have to resort to using Map in functional programming?
While there's nothing wrong with the other answers, I'd like to point out that there's already a specialized function for counting the occurrences of each unique key in a sequence: Seq.countBy. Plumbing the relevant parts of Reed's and torbonde's answers together:
let countWordsTopTen (s : string) =
    s.Split([|','|])
    |> Seq.countBy (fun s -> s.Trim())
    |> Seq.sortBy (snd >> (~-))
    |> Seq.truncate 10
"one, two, one, three, four, one, two, four, five"
|> countWordsTopTen
|> printfn "%A" // seq [("one", 3); ("two", 2); ("four", 2); ("three", 1); ...]
My questions are: Is using a dictionary encouraged in F#?
Using a Dictionary is fine from F#, though it does use mutability, so it's not quite as common.
How would I write the code if I wish to use a dictionary?
If you read the file and have a string with comma-separated values, you could parse it using something similar to:
open System.Collections.Generic

// Just an example of input - this would come from your file...
let strings = "one, two, one, three, four, one, two, four, five"

let words =
    strings.Split([|','|])
    |> Array.map (fun s -> s.Trim())

let dict = Dictionary<_,_>()
words
|> Array.iter (fun w ->
    match dict.TryGetValue w with
    | true, v -> dict.[w] <- v + 1
    | false, _ -> dict.[w] <- 1)

// Creates a sequence of tuples, with (word, count) in order
let topTen =
    dict
    |> Seq.sortBy (fun kvp -> -kvp.Value)
    |> Seq.truncate 10
    |> Seq.map (fun kvp -> kvp.Key, kvp.Value)
I would say an obvious choice for this task is to use the Seq module, which is really one of the major workhorses in F#. As Reed said, using a dictionary is not as common, since it is mutable. Sequences, on the other hand, are immutable. An example of how to do this using sequences is
let strings = "one, two, one, three, four, one, two, four, five"

let words =
    strings.Split([|','|])
    |> Array.map (fun s -> s.Trim())

let topTen =
    words
    |> Seq.groupBy id
    |> Seq.map (fun (w, ws) -> (w, Seq.length ws))
    |> Seq.sortBy (snd >> (~-))
    |> Seq.truncate 10
I think the code speaks pretty much for itself, although the second-to-last line maybe requires a short explanation:
The snd function gives the second entry in a pair (i.e. snd (a,b) is b), >> is the function composition operator (i.e. (f >> g) a is the same as g (f a)), and ~- is the unary minus operator. Note here that operators are essentially functions, but when using (and declaring) them as functions, you have to wrap them in parentheses. That is, -3 is the same as (~-) 3, where in the last case we have used the operator as a function.
In total, what the second-to-last line does is sort the sequence by the negated value of the second entry in each pair (the number of occurrences).
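For example, on a small list of pairs (Seq.sortByDescending snd, available in later F# versions, expresses the same intent more directly):

[("a", 3); ("b", 1); ("c", 2)]
|> Seq.sortBy (snd >> (~-))
|> List.ofSeq   // [("a", 3); ("c", 2); ("b", 1)]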
Given a dataset, for example a CSV file that might look like this:
x,y
1,2
1,5
2,1
2,2
1,1
...
I wish to create a map of lists containing the y's for a given x... The result could look like this:
{1:[2,5,1], 2:[1,2]}
In python this would be straight forward to do in an imperative manner.. and would probably look somewhat like this:
d = defaultdict(list)
for x,y in csv_data:
d[x].append(y)
How would you go about achieving the same using functional programming techniques in F#?
Is it possible to do it as short, efficient, and concise (and readable) as in the given Python example, using only a functional style? Or would you have to fall back to an imperative style with mutable data structures?
Note: this is not a homework assignment, just me trying to wrap my head around functional programming
EDIT: My conclusion based on answers thus far
I tried timing each of the provided answers on a relatively big CSV file, just to get a feeling for the performance. Furthermore, I did a small test with the imperative approach:
open System.Collections.Generic

let res = new Dictionary<string, List<string>>()
for row in l do
    if not (res.ContainsKey(fst row)) then
        res.[fst row] <- new List<string>()
    res.[fst row].Add(snd row)
The imperative approach completed in ~0.34 sec.
I think that the answer provided by Lee is the most general FP one; however, its running time was ~4 sec.
The answer given by Daniel ran in ~1.55 sec.
And at last the answer given by jbtule ran in ~0.26 sec. (I find it very interesting that it beat the imperative approach.)
I used 'System.Diagnostics.Stopwatch()' for timing, and the code is executed as F# 3.0 in .Net 4.5
EDIT2: fixed a stupid error in the imperative F# code and ensured that it uses the same list as the other solutions
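The timing itself looked something like this (a sketch; the exact harness wasn't posted, and runSolution is a hypothetical stand-in for whichever answer is being measured):

let sw = System.Diagnostics.Stopwatch.StartNew()
let result = runSolution l   // hypothetical: the solution under test, applied to the list l
sw.Stop()
printfn "%.2f sec" sw.Elapsed.TotalSeconds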
[
    1,2
    1,5
    2,1
    2,2
    1,1
]
|> Seq.groupBy fst
|> Seq.map (fun (x, ys) -> x, [for _, y in ys -> y])
|> Map.ofSeq
let addPair m (x, y) =
    match Map.tryFind x m with
    | Some(l) -> Map.add x (y::l) m
    | None -> Map.add x [y] m

let csv (pairs : (int * int) list) = List.fold addPair Map.empty pairs
Note this adds the y values to each list in reverse order.
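If the original order matters, one small sketch of a fix is to reverse each accumulated list once at the end:

// restore input order by reversing each list once after the fold
let csvOrdered pairs =
    csv pairs |> Map.map (fun _ ys -> List.rev ys)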
Use LINQ from F#; LINQ is functional.
open System.Linq
let data = [
    1,2
    1,5
    2,1
    2,2
    1,1
]
let lookup = data.ToLookup(fst,snd)
lookup.[1] //seq [2;5;1]
lookup.[2] //seq [1;2]
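If you specifically want the Map of lists from the question, the lookup converts easily (a sketch):

// an ILookup is a sequence of groupings, each with a Key and its values
let asMap =
    lookup
    |> Seq.map (fun g -> g.Key, List.ofSeq g)
    |> Map.ofSeq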
For fun, an implementation using a query expression:
let res =
    query { for (k, v) in data do
            groupValBy v k into g
            select (g.Key, List.ofSeq g) }
    |> Map.ofSeq
I have a list of words and a list of associated part of speech tags. I want to iterate over both, simultaneously (matched index) using each indexed tuple as input to a .NET function. Is this the best way (it works, but doesn't feel natural to me):
let taggingModel = SeqLabeler.loadModel(lthPath + @"models\penn_00_18_split_dict.model")
let lemmatizer = new Lemmatizer(lthPath + "v_n_a.txt")
let input = "the rain in spain falls on the plain"
let words = Preprocessor.tokenizeSentence( input )
let tags = SeqLabeler.tagSentence( taggingModel, words )
let lemmas = Array.map2 (fun x y -> lemmatizer.lookup(x,y)) words tags
Your code looks quite good to me - most of it deals with some loading and initialization, so there isn't much you could do to simplify that part. Alternatively to Array.map2, you could use Seq.zip combined with Seq.map - the zip function combines two sequences into a single one that contains pairs of elements with matching indices:
let lemmas =
    Seq.zip words tags
    |> Seq.map (fun (x, y) -> lemmatizer.lookup (x, y))
Since the lookup function takes a tuple, which is exactly what you get as an argument here, you could write:
// standard syntax using the pipelining operator
let lemmas = Seq.zip words tags |> Seq.map lemmatizer.lookup
// .. an alternative syntax doing exactly the same thing
let lemmas = (words, tags) ||> Seq.zip |> Seq.map lemmatizer.lookup
The ||> operator used in the second version takes a tuple containing two values and passes them to the function on the right side as two arguments, meaning that (a, b) ||> f means f a b. The |> operator takes only a single value on the left, so (a, b) |> f would mean f (a, b) (which would work if the function f expected a tuple instead of two space-separated parameters).
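A tiny illustration of the difference on a curried function (f here is just a throwaway example):

let f a b = a + b
(1, 2) ||> f       // same as f 1 2, giving 3
// (1, 2) |> f     // would not compile: |> passes the tuple as a single
//                 // argument, but f expects two curried arguments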
If you need lemmas to be an array at the end, you'll need to add Array.ofSeq to the end of the processing pipeline (all Seq functions work with sequences, which correspond to IEnumerable<T>).
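That is, the full pipeline would look something like this:

let lemmas = Seq.zip words tags |> Seq.map lemmatizer.lookup |> Array.ofSeq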
One more alternative is to use sequence expressions (you can use [| .. |] to construct an array directly if that's what you need):
let lemmas = [| for wt in Seq.zip words tags do // wt is tuple (string * string)
yield lemmatizer.lookup wt |]
Whether to use sequence expressions or not - that's just a personal preference. The first option seems to be more succinct in this case, but sequence expressions may be more readable for people less familiar with things like partial function application (in the shorter version using Seq.map)
As part of an exercise to better understand F#, which I am currently learning, I wrote a function to split a given string into n-grams.
1) I would like to receive feedback about my function: can this be written more simply or more efficiently?
2) My overall goal is to write a function that returns string similarity (on a 0.0 .. 1.0 scale) based on n-gram similarity. Does this approach work well for short string comparisons, or can this method reliably be used to compare larger strings (like articles, for example)?
3) I am aware of the fact that n-gram comparisons ignore the context of the two strings. What method would you suggest to accomplish my goal?
//s:string - target string to split into n-grams
//n:int - n-gram size to split string into
let ngram_split (s:string, n:int) =
    let ngram_count = s.Length - (s.Length % n)
    let ngram_list =
        List.init ngram_count (fun i ->
            if i + n >= s.Length then
                s.Substring(i, s.Length - i) + String.init ((i + n) - s.Length) (fun i -> "#")
            else
                s.Substring(i, n))
    let ngram_array_unique =
        ngram_list
        |> Seq.ofList
        |> Seq.distinct
        |> Array.ofSeq
    //produce tuples of ngrams (ngram string, how many occurrences in original string)
    Seq.init ngram_array_unique.Length (fun i ->
        (ngram_array_unique.[i],
         ngram_list
         |> List.filter (fun item -> item = ngram_array_unique.[i])
         |> List.length))
I don't know much about evaluating similarity of strings, so I can't give you much feedback regarding points 2 and 3. However, here are a few suggestions that may help to make your implementation simpler.
Many of the operations that you need to do are already available in some F# library function for working with sequences (lists, arrays, etc.). Strings are also sequences (of characters), so you can write the following:
open System

let ngramSplit n (s:string) =
    let ngrams = Seq.windowed n s
    let grouped = Seq.groupBy id ngrams
    Seq.map (fun (ngram, occurrences) ->
        String(ngram), Seq.length occurrences) grouped
The Seq.windowed function implements a sliding window, which is exactly what you need to extract the n-grams of your string. The Seq.groupBy function collects the elements of a sequence (n-grams) into a sequence of groups that contain values with the same key. We use id to calculate the key, which means that the n-gram is itself the key (and so we get groups, where each group contains the same n-grams). Then we just convert n-gram to string and count the number of elements in the group.
Alternatively, you can write the entire function as a single processing pipeline like this:
let ngramSplit n (s:string) =
    s
    |> Seq.windowed n
    |> Seq.groupBy id
    |> Seq.map (fun (ngram, occurrences) ->
        String(ngram), Seq.length occurrences)
Your code looks OK to me. Since n-gram extraction and similarity comparison are used very often, you should consider some efficiency issues here.
The MapReduce pattern is very suitable for your frequency-counting problem:
take a string and emit (word, 1) pairs;
group the words and add all the counts together.
let wordCntReducer (wseq: seq<int*int>) =
    wseq
    |> Seq.groupBy (fun (id, cnt) -> id)
    |> Seq.map (fun (id, idseq) ->
        (id, idseq |> Seq.sumBy (fun (id, cnt) -> cnt)))

(* test: wordCntReducer [1,1; 2,1; 1,1; 2,1; 2,2;] *)
You also need to maintain a <word, int> map during your n-gram building for a set of strings, as it is much more efficient to handle integers rather than strings during later processing.
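A hedged sketch of such an interning map (makeInterner is a hypothetical helper, not a library function; it hands out the next integer id for each previously unseen string):

open System.Collections.Generic

// map each distinct string to a small integer id, assigned on first sight
let makeInterner () =
    let ids = Dictionary<string, int>()
    fun (s: string) ->
        match ids.TryGetValue s with
        | true, id -> id
        | false, _ ->
            let id = ids.Count
            ids.[s] <- id
            id

// usage: let intern = makeInterner ()
//        intern "abc"   // 0
//        intern "abd"   // 1
//        intern "abc"   // 0 again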
(2) To compare the distance between two short strings, a common practice is to use edit distance with simple dynamic programming. To compute the similarity between articles, a state-of-the-art method is to use a TFIDF feature representation. Actually, the code above is for term-frequency counting, extracted from my data mining library.
(3) There are complex NLP methods, e.g. tree kernels based on the parse tree, to incorporate context information.
I think you have some good answers for question (1).
Question (2):
You probably want cosine similarity to compare two arbitrary collections of n-grams (the larger, the better). This gives you a range of 0.0 - 1.0 without any scaling needed. The Wikipedia page gives an equation, and the F# translation is pretty straightforward:
open System

let cos a b =
    let dot = Seq.sum (Seq.map2 ( * ) a b)
    let magnitude v = Math.Sqrt (Seq.sum (Seq.map2 ( * ) v v))
    dot / (magnitude a * magnitude b)
For input, you need to run something like Tomas' answer to get two maps, then remove keys that only exist in one:
let values map = map |> Map.toSeq |> Seq.map snd
let desparse map1 map2 = Map.filter (fun k _ -> Map.containsKey k map2) map1

let distance textA textB =
    let a = ngramSplit 3 textA |> Map.ofSeq
    let b = ngramSplit 3 textB |> Map.ofSeq
    let aValues = desparse a b |> values
    let bValues = desparse b a |> values
    cos aValues bValues
With character-based n-grams, I don't know how good your results will be. It depends on what kind of features of the text you are interested in. I do natural language processing, so usually my first step is part-of-speech tagging. Then I compare over n-grams of the parts of speech. I use T'n'T for this, but it has bizarro licensing issues. Some of my colleagues use ACOPOST instead, a Free alternative (as in beer AND freedom). I don't know how good the accuracy is, but POS tagging is a well-understood problem these days, at least for English and related languages.
Question (3):
The best way to compare two strings that are nearly identical is Levenshtein distance. I don't know if that is your case here, although you can relax the assumptions in a number of ways, eg for comparing DNA strings.
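For reference, a small sketch of the standard dynamic-programming recurrence (textbook Levenshtein, not code from any particular source):

// dp.[i, j] = edit distance between the first i chars of a and the first j chars of b
let levenshtein (a: string) (b: string) =
    let m, n = a.Length, b.Length
    let dp = Array2D.init (m + 1) (n + 1) (fun i j ->
        if i = 0 then j elif j = 0 then i else 0)
    for i in 1 .. m do
        for j in 1 .. n do
            let cost = if a.[i-1] = b.[j-1] then 0 else 1
            dp.[i, j] <- List.min [dp.[i-1, j] + 1        // deletion
                                   dp.[i, j-1] + 1        // insertion
                                   dp.[i-1, j-1] + cost]  // substitution
    dp.[m, n]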
The standard book on this subject is Sankoff and Kruskal's "Time Warps, String Edits, and Macromolecules". It's pretty old (1983), but gives good examples of how to adapt the basic algorithm to a number of applications.
Question 3:
My reference book is Computing Patterns in Strings by Bill Smyth