F# read from file vs download string and split by newline

F# read from file vs download string and split by newline - f#

I am new to F# and trying to solve this Kata
The file football.dat contains the results from the English Premier League for 2001/2. The columns labeled ‘F’ and ‘A’ contain the total number of goals scored for and against each team in that season (so Arsenal scored 79 goals against opponents, and had 36 goals scored against them). Write a program to print the name of the team with the smallest difference in ‘for’ and ‘against’ goals.
When save the file and and the read it using File.ReadAllLines, my solutons works:
open System.IO
open System
let split (s:string) =
let cells = Array.ofSeq(s.Split([|' '|],StringSplitOptions.RemoveEmptyEntries))
(cells.[1], int cells.[6], int cells.[8])
let balance t =
let (_,f,a) = t
-(f-a)
let lines = List.ofSeq(File.ReadAllLines(#"F:\Users\Igor\Downloads\football.dat"));;
lines
|> Seq.skip 5
|> Seq.filter (fun (s:string) -> s.Split([|' '|],StringSplitOptions.RemoveEmptyEntries).Length = 10)
|> Seq.map split
|> Seq.sortBy balance
|> Seq.take 1
|> Seq.map (fun (n,_,_) -> printfn "%s" n)
but when instead of reading the file I download it using WebClient and split lines the rest of the code does not work. The sequence is the same length but F# Interactive does not show the elements and prints no output. The code is
open System.Net
open System
let split (s:string) =
let cells = Array.ofSeq(s.Split([|' '|],StringSplitOptions.RemoveEmptyEntries))
(cells.[1], int cells.[6], int cells.[8])
let balance t =
let (_,f,a) = t
-(f-a)
let splitLines (s:string) =
List.ofSeq(s.Split([|'\n'|]))
let wc = new WebClient()
let lines = wc.DownloadString("http://pragdave.pragprog.com/data/football.dat")
lines
|> splitLines
|> Seq.skip 5
|> Seq.filter (fun (s:string) -> s.Split([|' '|],StringSplitOptions.RemoveEmptyEntries).Length = 10)
|> Seq.map split
|> Seq.sortBy balance
|> Seq.take 1
|> Seq.map (fun (n,_,_) -> printfn "%s" n)
What is the difference? List.ofSeq(File.ReadAllLines..) retuns a sequence and downloading the file from the internet and splitting it by \n returns the same sequence

The last line should use Seq.iter instead of Seq.map and there are too many spaces in the expression that splits each line.
With these corrections it works ok:
open System.Net
open System
open System.IO
let split (s:string) =
let cells = Array.ofSeq(s.Split([|' '|],StringSplitOptions.RemoveEmptyEntries))
(cells.[1], int cells.[6], int cells.[8])
let balance t =
let (_,f,a) = t
-(f-a)
let splitLines (s:string) =
List.ofSeq(s.Split([|'\n'|]))
let wc = new WebClient()
let lines = wc.DownloadString("http://pragdave.pragprog.com/data/football.dat") |> splitLines
let output =
lines
|> Seq.skip 5
|> Seq.filter (fun (s:string) -> s.Split([|' '|],StringSplitOptions.RemoveEmptyEntries).Length = 10)
|> Seq.map split
|> Seq.sortBy balance
|> Seq.take 1
|> Seq.iter (fun (n,_,_) -> Console.Write(n))
let stop = Console.ReadKey()

That URL is returning an HTML page, not a raw data file, could that be what is causing your problems?
Also it is usually good to verify what delimiter the page is using for newlines. That one is using 0x0A which is \n, but sometimes you will find \r\n or rarely \r.
EDIT:
Also you appear to be using map to handle printing, this isn't a good way to do it. I know I am having difficulty in general getting your sample to show an output when executing all at once.
I would recommend mapping to n and print some other way, such as using Seq.head.

Related

Trying to filter out values in a sequence that are not in another sequence

I am trying to filter out values from a sequence, that are not in another sequence. I was pretty sure my code worked, but it is taking a long time to run on my computer and because of this I am not sure, so I am here to see what the community thinks.
Code is below:
let statezip =
StateCsv.GetSample().Rows
|> Seq.map (fun row -> row.State)
|> Seq.distinct
type State = State of string
let unwrapstate (State s) = s
let neededstates (row:StateCsv) = Seq.contains (unwrapstate row.State) statezip
I am filtering by the neededstates function. Is there something wrong with the way I am doing this?
let datafilter =
StateCsv1.GetSample().Rows
|> Seq.map (fun row -> row.State,row.Income,row.Family)
|> Seq.filter neededstates
|> List.ofSeq
I believe that it should filter the sequence by the values that are true, since neededstates function is a bool. StateCsv and StateCsv1 have the same exact structure, although from different years.

Evaluation of contains on sequences and lists can be slow. For a case where you want to check for the existence of an element in a collection, the F# Set type is ideal. You can convert your sequences to sets using Set.ofSeq, and then run the logic over the sets instead. The following example uses the numbers from 1 to 10000 and then uses both sequences and sets to filter the result to only the odd numbers by checking that the values are not in a collection of even numbers.
Using Sequences:
let numberSeq = {0..10000}
let evenNumberSeq = seq { for n in numberSeq do if (n % 2 = 0) then yield n }
#time
numberSeq |> Seq.filter (fun n -> evenNumberSeq |> Seq.contains n |> not) |> Seq.toList
#time
This runs in about 1.9 seconds for me.
Using sets:
let numberSet = numberSeq |> Set.ofSeq
let evenNumberSet = evenNumberSeq |> Set.ofSeq
#time
numberSet |> Set.filter (fun n -> evenNumberSet |> Set.contains n |> not)
#time
This runs in only 0.005 seconds. Hopefully you can materialize your sequences to sets before performing your contains operation, thereby getting this level of speedup.

Trying to create a function, and then filtering a sequence by "not that function" in F#

My data is a SEQUENCE of:
[(40,"TX");(48,"MO");(15,"TX");(78,"TN");(41,"VT")]
My code is as follows:
type Csvfile = CsvProvider<somefile>
let data = Csvfile.GetSample().Rows
let nullid row =
row.Id = 15
let otherid row =
row.Id= 40
let iddata =
data
|> Seq.filter (not nullid)
|> Seq.filter (not otherid)
I create the functions.
Then I want to call the "not" of those functions to filter them out of a sequence.
But the issue is that I am getting errors for "row.Id" in the first two functions, because you can only do that with a type.
How do I solve this problem so I can accomplish this successfully.
My result should be a SEQUENCE of:
[(48,"MO);(78,"TN");(41,"VT")]

You can use >> operator to compose the two functions:
let iddata =
data
|> Seq.filter (nullid >> not)
|> Seq.filter (othered >> not)
See Function Composition and Pipelining.
Or you can make it more explicit:
let iddata =
data
|> Seq.filter (fun x -> not (nullid x))
|> Seq.filter (fun x -> not (othered x))
You can see that in action:
let input = [|1;2;3;4;5;6;7;8;9;10|];;
let is3 value =
value = 3;;
input |> Seq.filter (fun x -> not (is3 x));;
input |> Seq.filter (not >> is3);;
They both print val it : seq<int> = seq [1; 2; 4; 5; ...]

Please see below what an MCVE might look in your case, for an fsx file you can reference the Fsharp.Data dll with #r, for a compiled project just reference the dll an open it.
#if INTERACTIVE
#r #"..\..\SO2018\packages\FSharp.Data\lib\net45\FSharp.Data.dll"
#endif
open FSharp.Data
[<Literal>]
let datafile = #"C:\tmp\data.csv"
type CsvFile = CsvProvider<datafile>
let data = CsvFile.GetSample().Rows
In the end this is what you want to achieve:
data
|> Seq.filter (fun x -> x.Id <> 15)
|> Seq.filter (fun x -> x.Id <> 40)
//val it : seq<CsvProvider<...>.Row> = seq [(48, "MO"); (78, "TN"); (41, "VT")]
One way to do this is with SRTP, as they allow a way to do structural typing, where the type depends on its shape, for example in this case having the Id property. If you want you can define helper function for the two numbers 15 and 40, and use that in your filter, just like in the second example. However SRTP syntax is a bit strange, and it's designed for a use case where you need to apply a function to different types that have some similarity (basically like interfaces).
let inline getId row =
(^T : (member Id : int) row)
data
|> Seq.filter (fun x -> (getId x <> 15 ))
|> Seq.filter (fun x -> (getId x <> 40))
//val it : seq<CsvProvider<...>.Row> = seq [(48, "MO"); (78, "TN"); (41, "VT")]
Now back to your original post, as you correctly point out your function will show an error, as you define it to be generic, but it needs to operate on a specific Csv row type (that has the Id property). This is very easy to fix, just add a type annotation to the row parameter. In this case your type is CsvFile.Row, and since CsvFile.Row has the Id property we can access that in the function. Now this function returns a Boolean. You could make it return the actual row as well.
let nullid (row: CsvFile.Row) =
row.Id = 15
let otherid (row: CsvFile.Row) =
row.Id = 40
Then what is left is applying this inside a Seq.filter and negating it:
let iddata =
data
|> Seq.filter (not << nullid)
|> Seq.filter (not << otherid)
|> Seq.toList
//val iddata : CsvProvider<...>.Row list = [(48, "MO"); (78, "TN"); (41, "VT")]

Why can't I go twice through the rows of a CSV provider?

In some languages after one goes through a lazy sequence it becomes exhausted. That is not the case with F#:
let mySeq = seq [1..5]
mySeq |> Seq.iter (fun x -> printfn "%A" <| x)
mySeq |> Seq.iter (fun x -> printfn "%A" <| x)
1
2
3
4
5
1
2
3
4
5
However, it looks like one can go only once through the rows of a CSV provider:
open FSharp.Data
[<Literal>]
let foldr = __SOURCE_DIRECTORY__ + #"\data\"
[<Literal>]
let csvPath = foldr + #"AssetInfoFS.csv"
type AssetsInfo = CsvProvider<Sample=csvPath,
HasHeaders=true,
ResolutionFolder=csvPath,
AssumeMissingValues=false,
CacheRows=false>
let assetInfo = AssetsInfo.Load(csvPath)
assetInfo.Rows |> Seq.iter (fun x -> printfn "%A" <| x) // Works fine 1st time
assetInfo.Rows |> Seq.iter (fun x -> printfn "%A" <| x) // 2nd time exception
Why does that happen?

From this link on the CSV Parser, the CSV Type Provider is built on top of the CSV Parser. The CSV Parser works in streaming mode, most likely by calling a method like File.ReadLines, which will throw an exception if the enumerator is enumerated a second time. The CSV Parser also has a Cache method. Try setting CacheRows=true (or leaving it out of the declaration since its default value is true) to avoid this issue
CsvProvider<Sample=csvPath,
HasHeaders=true,
ResolutionFolder=csvPath,
AssumeMissingValues=false,
CacheRows=true>

The sequence iterator stays put where you point it; after the first loop, that is the end of the sequence.
If you want it to go back to the beginning, you have to set it there.

F# Writing to file changes behavior on return type

I have the following function that convert csv files to a specific txt schema (expected by CNTKTextFormat Reader):
open System.IO
open FSharp.Data;
open Deedle;
let convert (inFileName : string) =
let data = Frame.ReadCsv(inFileName)
let outFileName = inFileName.Substring(0, (inFileName.Length - 4)) + ".txt"
use outFile = new StreamWriter(outFileName, false)
data.Rows.Observations
|> Seq.map(fun kvp ->
let row = kvp.Value |> Series.observations |> Seq.map(fun (k,v) -> v) |> Seq.toList
match row with
| label::data ->
let body = data |> List.map string |> String.concat " "
outFile.WriteLine(sprintf "|labels %A |features %s" label body)
printf "%A" label
| _ ->
failwith "Bad data."
)
|> ignore
Strangely, the output file is empty after running in the F# interactive panel and that printf yields no printing at all.
If I remove the ignore to make sure that there are actual rows being processed (evidenced by returning a seq of nulls), instead of an empty file I get:
val it : seq<unit> = Error: Cannot write to a closed TextWriter.
Before, I was declaring the StreamWriter using let and disposing it manually, but I also generated empty files or just a few lines (say 5 out of thousands).
What is happening here? Also, how to fix the file writing?

Seq.map returns a lazy sequence which is not evaluated until it is iterated over. You are not currently iterating over it within convert so no rows are processed. If you return a Seq<unit> and iterate over it outside convert, outFile will already be closed which is why you see the exception.
You should use Seq.iter instead:
data.Rows.Observations
|> Seq.iter (fun kvp -> ...)

Apart from the solutions already mentioned, you could also avoid the StreamWriter altogether, and use one of the standard .Net functions, File.WriteAllLines. You would prepare a sequence of converted lines, and then write that to the file:
let convert (inFileName : string) =
let lines =
Frame.ReadCsv(inFileName).Rows.Observations
|> Seq.map(fun kvp ->
let row = kvp.Value |> Series.observations |> Seq.map snd |> Seq.toList
match row with
| label::data ->
let body = data |> List.map string |> String.concat " "
printf "%A" label
sprintf "|labels %A |features %s" label body
| _ ->
failwith "Bad data."
)
let outFileName = inFileName.Substring(0, (inFileName.Length - 4)) + ".txt"
File.WriteAllLines(outFileName, lines)
Update based on the discussion in the comments: Here's a solution that avoids Deedle altogether. I'm making some assumptions about your input file format here, based on another question you posted today: Label is in column 1, features follow.
let lines =
File.ReadLines inFileName
|> Seq.map (fun line ->
match Seq.toList(line.Split ',') with
| label::data ->
let body = data |> List.map string |> String.concat " "
printf "%A" label
sprintf "|labels %A |features %s" label body
| _ ->
failwith "Bad data."
)

As Lee already mentioned, Seq.map is lazy. And that's also why you were getting "Cannot write to a closed TextWriter": the use keyword disposes of its IDisposable when it goes out of scope. In this case, that's at the end of your function. Since Seq.map is lazy, your function was returning an unevaluated sequence object, which had closed over the StreamWriter in your use statement -- but by the time you evaluated that sequence (in whatever part of your code checked for the Seq of nulls, or in the F# Interactive window), the StreamWriter had already been disposed by going out of scope.
Change Seq.map to Seq.iter and both of your problems will be solved.

XML scan from C# to F#

Trying to learning F# and I tried to reimplement the following function in F#
private string[] GetSynonyms(string synonyms)
{
var items = Enumerable.Repeat(synonyms, 1)
.Where(s => s != null)
.Select(XDocument.Parse)
.Select(doc => doc.Root)
.Where(root => root != null)
.SelectMany(e => e.Elements(SynonymsNamespace + "synonym"))
.Select(e => e.Value)
.ToArray();
return items;
}
I got this far by myself
let xname = XNamespace.Get "http://localuri"
let syn = "<synonyms xmlns=\"http://localuri\"><synonym>a word</synonym><synonym>another word</synonym></synonyms>"
let synonyms str =
let items = [str]
items
|> List.map System.Xml.Linq.XDocument.Parse
|> List.map (fun x -> x.Root)
|> List.map (fun x -> x.Elements(xname + "synonym") |> Seq.cast<System.Xml.Linq.XElement>)
|> Seq.collect (fun x -> x)
|> Seq.map (fun x -> x.Value)
let a = synonyms syn
Dump a
Now I'm wondering if there is a more-functional way to write the same code.
By extracting the access to the properties to standalone functions I got this version
let xname = XNamespace.Get "http://localuri"
let syn = "<synonyms xmlns=\"http://localuri\"><synonym>a word</synonym><synonym>another word</synonym></synonyms>"
let getRoot (doc:System.Xml.Linq.XDocument) = doc.Root
let getValue (element:System.Xml.Linq.XElement) = element.Value
let getElements (element:System.Xml.Linq.XElement) =
element.Elements(xname + "synonym")
|> Seq.cast<System.Xml.Linq.XElement>
let synonyms str =
let items = [str]
items
|> List.map System.Xml.Linq.XDocument.Parse
|> List.map getRoot
|> List.map getElements
|> Seq.collect (fun x -> x)
|> Seq.map getValue
let a = synonyms syn
Dump a
But I still have a few concerns
Can I rewrite that Seq.collect (fun x -> x) in another way? It sounds redundant
Can I remove all those (fun x -> x.Property) without creating new functions?
How to actually return an array and not a Seq<'a> (which I understand is an IEnumerable<'a>)
Thanks

Seq.collect (fun x -> x) can be rewritten with the predefined id function to Seq.collect id
In F# 4.0 it can be removed for constructors only.
use Seq.toArray or Seq.toList

Would it be very wrong to drop the C#-code and go all in with the XML-provider in F#? In my world its always wrong to parse XML when there exists other solutions (unless Im trying to create octogonal wheels or moist gun powders other have made better before me).
In this regard I would even have used some transformation (XSLT) or selection (XPATH/XQUERY) unless I could use the XML-provider or some XSD (c#) for generating code.
If for some reason the XML is so unstructured that you actually need parsing, then the XML is arguably wrong...
If using the XmlProvider you get namespacing, types etc for free...
#r #"..\correct\this\path\to\packages\FSharp.Data.2.2.5\lib\net40\FSharp.Data.dll"
#r "System.Xml.Linq"
open FSharp.Data
[<Literal>]
let syn = "<synonyms xmlns=\"http://localuri\"><synonym>a word</synonym><synonym>another word</synonym></synonyms>"
type Synonyms = XmlProvider<syn>
let a = Synonyms.GetSample()
a.Synonyms |> Seq.iter (printfn "%A")
Mind that the XmlProvider also can take files or url as examples for inferring types etc, and that you also can have this code as the example and then use
let a = Synonyms.Load(stuff)
where stuff is a read from stream, textreader or URI and inferred according to your example. The sample and the stuff might even point to same file/Uri if this is some standard placing of data.
See also: http://fsharp.github.io/FSharp.Data/library/XmlProvider.html

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

F# read from file vs download string and split by newline - f#

Related

Trying to filter out values in a sequence that are not in another sequence

Trying to create a function, and then filtering a sequence by "not that function" in F#

Why can't I go twice through the rows of a CSV provider?

F# Writing to file changes behavior on return type

XML scan from C# to F#

Categories

Resources