FAOCropsLivestock.csv contains more than 14 million rows. In my .fs file I have declared
type FAO = CsvProvider<"c:\FAOCropsLivestock.csv">
and tried to work with the following code
FAO.GetSample().Rows.Where(fun x -> x.Country = country) |> ....
FAO.GetSample().Filter(fun x -> x.Country = country) |> ....
In both cases, an exception was thrown.
I have also tried the following code after loading the CSV file into MSSQL Server
type Schema = SqlDataConnection<conStr>
let db = Schema.GetDataContext()
db.FAOCropsLivestock.Where(fun x-> x.Country = country) |> ....
This works. It also works if I issue the query using an OleDb connection, but it is slow.
How can I get a sequence out of it using CsvProvider?
If you refer to the bottom of the CSV Type Provider documentation, you will see a section on handling large datasets. As explained there, you can set CacheRows = false which will aid you when it comes to handling large datasets.
type FAO = CsvProvider<"c:\FAOCropsLivestock.csv", CacheRows = false>
You can then use standard sequence operations over the rows of the CSV as a sequence without loading the entire file into memory. e.g.
FAO.GetSample().Rows |> Seq.filter (fun x -> x.Country = country) |> ....
You should, however, take care to only enumerate the contents once.
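As a rough illustration of the "enumerate once" advice, here is a minimal sketch that uses plain File.ReadLines rather than CsvProvider (the temp file, column layout and country names are assumptions made up for the example). It computes two aggregates in a single pass with a fold instead of walking the lazy sequence twice:

```fsharp
open System.IO

// Hypothetical stand-in for the large CSV: a tiny temp file with two columns.
let path = Path.GetTempFileName()
File.WriteAllLines(path, [| "Country,Value"; "Italy,1"; "France,2"; "Italy,3" |])

// File.ReadLines streams lazily; nothing is read until the sequence is enumerated,
// and every new enumeration would re-read the file from the start.
let rows =
    File.ReadLines path
    |> Seq.skip 1                                   // skip the header line
    |> Seq.map (fun line -> line.Split(','))

// One pass: count the matching rows and sum their values together.
let count, total =
    rows
    |> Seq.filter (fun fields -> fields.[0] = "Italy")
    |> Seq.fold (fun (c, s) fields -> c + 1, s + int fields.[1]) (0, 0)

printfn "%d rows, total %d" count total
File.Delete path
```

The same shape should carry over to the rows of a non-cached CsvProvider: pipe them through one chain of Seq operations and collect everything you need in that single traversal.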
I am trying to experiment with live data from the Coronavirus pandemic (unfortunately and good luck to all of us).
I have developed a small script and I am transitioning into a console application: it uses CSV type providers.
I have the following issue. Suppose we want to filter by region the Italian spread we can use this code into a .fsx file:
open FSharp.Data
let provinceData = CsvProvider< @"https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-province/dpc-covid19-ita-province.csv" , IgnoreErrors = true>.GetSample()
let filterDataByProvince province =
    provinceData.Rows
    |> Seq.filter (fun x -> x.Sigla_provincia = province)
Since sequences are lazy, suppose I force the data for the province of Rome to be loaded into memory by adding:
let romeProvince = filterDataByProvince "RM" |> Seq.toArray
This works fine, run by FSI, locally.
Now, if I transition this code into a console application using a .fs file; I declare exactly the same functions and using exactly the same type provider loader; but instead of using the last line to gather the data, I put it into a main function:
[<EntryPoint>]
let main _ =
    let romeProvince = filterDataByProvince "RM" |> Seq.toArray
    Console.Read() |> ignore
    0
This results in the following runtime exception:
System.Exception
HResult=0x80131500
Message=totale_casi is missing
Source=FSharp.Data
StackTrace:
at <StartupCode$FSharp-Data>.$TextRuntime.GetNonOptionalValue@139-4.Invoke(String message)
at CoronaSchiatta.Evoluzione.provinceData@10.Invoke(Object parent, String[] row) in C:\Users\glddm\source\repos\CoronaSchiatta\CoronaSchiatta\CoronaEvolution.fs:line 10
at FSharp.Data.Runtime.CsvHelpers.parseIntoTypedRows@174.GenerateNext(IEnumerable`1& next)
Can you explain that?
Some rows have an odd format, possibly, but the FSI session is robust to those, whilst the console version is fragile; why? How can I fix that?
I am using VS2019 Community Edition, targeting .NET Framework 4.7.2, F# runtime: 4.7.0.0;
as FSI, I am using the following: FSI Microsoft (R) F# Interactive version 10.7.0.0 for F# 4.7
PS: Please also be aware that if I use CsvFile, instead of type providers, as in:
let test = @"https://raw.githubusercontent.com/pcm-dpc/COVID-19/master/dati-province/dpc-covid19-ita-province.csv"
           |> CsvFile.Load |> (fun x -> x.Rows ) |> Seq.filter ( fun x-> x.[6 ] = "RM")
           |> Seq.iter ( fun x -> x.[9] |> Console.WriteLine )
Then it works like a charm also in the console application. Of course I would like to use type providers otherwise I have to add type definition, mapping the schema to the columns (and it will be more fragile). The last line was just a quick test.
Fragility
CSV Type Providers can be fragile if you don't have a good schema or sample.
A runtime error like this almost certainly means that some of your data doesn't match the schema inferred from the sample.
How do you figure it out? One way is to run through your data first:
provinceData.Rows |> Seq.iteri (fun i x -> printfn "Row %d: %A" (i + 1) x)
This runs up to Row 2150. And sure enough, the next row:
2020-03-11 17:00:00,ITA,19,Sicilia,994,In fase di definizione/aggiornamento,,0,0,
You can see the last value (totale_casi) is missing.
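If printing every row is too noisy, a quick check with plain string operations can point at the offending lines directly. This is only a sketch: the naive Split(',') assumes no field contains a quoted comma, and the three sample lines are copied from this thread rather than downloaded.

```fsharp
open System

// Header plus two data rows taken from the thread; the second data row lacks totale_casi.
let sampleLines =
    [ "data,stato,codice_regione,denominazione_regione,codice_provincia,denominazione_provincia,sigla_provincia,lat,long,totale_casi"
      "2020-02-24 18:00:00,ITA,13,Abruzzo,069,Chieti,CH,42.35103167,14.16754574,0"
      "2020-03-11 17:00:00,ITA,19,Sicilia,994,In fase di definizione/aggiornamento,,0,0," ]

// Returns the 1-based line numbers (header = line 1) whose last field is empty.
let linesMissingLastField (lines: seq<string>) =
    lines
    |> Seq.mapi (fun i line -> i + 1, line.Split(','))
    |> Seq.skip 1                                   // ignore the header
    |> Seq.filter (fun (_, fields) -> String.IsNullOrWhiteSpace fields.[fields.Length - 1])
    |> Seq.map fst
    |> Seq.toList

printfn "%A" (linesMissingLastField sampleLines)
```

Run against the full file (for example via File.ReadLines on a local copy), this would list every line that is likely to trip a non-nullable totale_casi column.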
One of CsvProvider's options is InferRows. This is the number of rows the provider scans in order to build up a schema - and its default value happens to be 1000.
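The row with the missing value lies beyond the first 1000 rows, so it never influences the inferred schema. Setting InferRows to 0 makes the provider use all rows for inference: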
type COVID = CsvProvider<uri, InferRows = 0>
A better way to prevent this from happening in the future is to manually define a sample from a sub-set of data:
type COVID = CsvProvider<"sample-dpc-covid19-ita-province.csv">
and sample-dpc-covid19-ita-province.csv is:
data,stato,codice_regione,denominazione_regione,codice_provincia,denominazione_provincia,sigla_provincia,lat,long,totale_casi
2020-02-24 18:00:00,ITA,13,Abruzzo,069,Chieti,CH,42.35103167,14.16754574,0
2020-02-24 18:00:00,ITA,13,Abruzzo,066,L'Aquila,AQ,42.35122196,13.39843823,
2020-02-24 18:00:00,ITA,13,Abruzzo,068,Pescara,PE,42.46458398,14.21364822,0
2020-02-24 18:00:00,ITA,13,Abruzzo,067,Teramo,TE,42.6589177,13.70439971,0
With this the type of totale_casi is now Nullable<int>.
If you don't mind NaN values, you can also use:
CsvProvider<..., AssumeMissingValues = true>
Why does FSI seem more robust?
FSI isn't more robust. This is my best guess:
Your schema source is being regularly updated.
Type Providers cache the schema so that it isn't regenerated every time you compile your code, which can be impractical. When you restart an FSI session you end up regenerating your Type Provider, but not so with the console application. So FSI might sometimes appear less error-prone simply because it has worked with a newer source.
I'm a C# developer and this is my first attempt at writing F#.
I'm trying to read a Dashlane exported database in the CSV format. These files have no headers and a dynamic number of columns for each possible type of entry. The following file is an example of dummy data that I use to test my software. It only contains password entries and yet they have between 5 and 7 columns (I'll decide how to handle other types of data later)
The first line of the exported file (in this case, but not always) is the email address that was used to create the Dashlane account, which makes this line only one column wide.
"accountCreation@email.fr"
"Nom0","siteweb0","Identifiant0","",""
"Nom1","siteweb1","identifiant1","email1@email.email","",""
"Nom2","siteweb2","email2@email.email","",""
"Nom3","siteweb3","Identifiant3","password3",""
"Nom4","siteweb4","Identifiant4","email4@email.email","password4",""
"Nom5","siteweb5","Identifiant5","email5@email.email","SecondIdentifiant5","password5",""
"Nom6","siteweb6","Identifiant6","email6@email.email","SecondIdentifiant6","password6","this is a single-line note"
"Nom7","siteweb7","Identifiant7","email7@email.email","SecondIdentifiant7","password7","this is a
multi
line note"
"Nom8","siteweb8","Identifiant8","email8@email.email","SecondIdentifiant8","password8","single line note"
I'm trying to print the first column of each row to the console as a start
let rawCsv = CsvFile.Load("path\to\file.csv", ",", '"', false)
for row in rawCsv.Rows do
    printfn "value %s" row.[0]
This code gives me the following error on the for line
Couldn't parse row 2 according to schema: Expected 1 columns, got 5
I haven't given the CsvFile any schema, and I couldn't find anything online about how to specify one.
I could remove the first line dynamically if I wanted to, but it wouldn't change anything since the other lines have different column counts too.
Is there any way to parse this awkward CSV file in F#?
Note: For each password row, only the column right before the last one matters to me (the password column)
I do not think that a CSV file with a structure as irregular as yours is a good candidate for processing with the CSV Type Provider or the CSV Parser.
At the same time, it does not seem difficult to parse this file to your liking with a few lines of custom logic. The following snippet:
open System
open System.IO
File.ReadAllLines("Sample.csv") // Get data
|> Array.filter(fun x -> x.StartsWith("\"Nom")) // Only lines starting with "Nom may contain password
|> Array.map (fun x -> x.Split(',') |> Array.map (fun x -> x.[1..(x.Length-2)])) // Split each line into "cells"
|> Array.filter(fun x -> x.[x.Length-2] |> String.IsNullOrEmpty |> not) // Take only those having non-empty cell before the last one
|> Array.map (fun x -> x.[0],x.[x.Length-2]) // show the line key and the password
after parsing your sample file produces
>
val it : (string * string) [] =
[|("Nom3", "password3"); ("Nom4", "password4"); ("Nom5", "password5");
("Nom6", "password6"); ("Nom7", "password7"); ("Nom8", "password8")|]
>
It may be a good starting point for further improving the parsing logic to perfection.
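One possible refinement, sketched below under my own assumptions (it is not part of the answer above): splitting on ',' and reading line by line breaks on commas inside quoted cells and on multi-line notes such as the one in the Nom7 entry. A small hand-written state machine that tracks whether it is inside quotes copes with both. The name parseCsv and the inline sample string are hypothetical.

```fsharp
open System.Text

/// Splits CSV text into records of fields, honouring double quotes so that
/// commas and line breaks inside a quoted field do not end the field.
let parseCsv (text: string) : string list list =
    let records = ResizeArray<string list>()
    let fields = ResizeArray<string>()
    let current = StringBuilder()
    let endField () =
        fields.Add(current.ToString())
        current.Clear() |> ignore
    let endRecord () =
        endField ()
        records.Add(List.ofSeq fields)
        fields.Clear()
    let mutable inQuotes = false
    let mutable i = 0
    while i < text.Length do
        let c = text.[i]
        if inQuotes then
            if c = '"' && i + 1 < text.Length && text.[i + 1] = '"' then
                current.Append('"') |> ignore      // doubled quote = literal quote
                i <- i + 1
            elif c = '"' then inQuotes <- false    // closing quote
            else current.Append(c) |> ignore       // includes commas and newlines
        else
            match c with
            | '"' -> inQuotes <- true
            | ',' -> endField ()
            | '\r' -> ()                           // ignore CR of a CRLF pair
            | '\n' -> endRecord ()
            | _ -> current.Append(c) |> ignore
        i <- i + 1
    // Flush a final record that is not terminated by a newline.
    if current.Length > 0 || fields.Count > 0 then endRecord ()
    List.ofSeq records

// A cut-down version of the sample file, including the multi-line note.
let sample =
    "\"accountCreation@email.fr\"\n" +
    "\"Nom3\",\"siteweb3\",\"Identifiant3\",\"password3\",\"\"\n" +
    "\"Nom7\",\"siteweb7\",\"Identifiant7\",\"email7@email.email\",\"SecondIdentifiant7\",\"password7\",\"this is a\nmulti\nline note\"\n"

// Keep password entries (5 to 7 columns) and take the cell before the last one.
let passwords =
    parseCsv sample
    |> List.filter (fun r -> r.Length >= 5)
    |> List.map (fun r -> r.[0], r.[r.Length - 2])

printfn "%A" passwords
```

For a real export you would pass File.ReadAllText of the file to parseCsv instead of the inline sample.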
I propose to read the csv file as a text file. I read the file line by line and form a list and then parse each line with CsvFile.Parse. But the problem is that the elements are found in Headers and not in Rows which is of type string [] option
open FSharp.Data
open System.IO
let readLines (filePath:string) = seq {
    use sr = new StreamReader(filePath)
    while not sr.EndOfStream do
        yield sr.ReadLine ()
}
[<EntryPoint>]
let main argv =
    let lines = readLines "c:\path_to_file\example.csv"
    let rows = List.map (fun str -> CsvFile.Parse(str)) (Seq.toList lines)
    for row in List.toArray(rows) do
        printfn "New Line"
        if row.Headers.IsSome then
            for r in row.Headers.Value do
                printfn "value %s" (r)
    printfn "%A" argv
    0 // return an integer exit code
I have a sqlite table with a mix of integer and float columns. I'm trying to get the max and min values of each column. For integer columns the following code works but I get a cast error when using the same code on float columns:
let numCats = query{for row in db do minBy row.NumCats}
For float columns I'm using the following code but it's slow.
let CatHight = query{for row in db do select row.CatHeight} |> Seq.toArray |> Array.max
I have 8 integer columns and 9 float columns and the behavior has been consistent across all columns so that's why I think it's an issue with the column type. But I'm new to F# and don't know anything so I'm hoping you can help me.
Thank you for taking the time to help, it's much appreciated.
SQLProvider version: 1.0.41
System.Data.SQLite.Core version: 1.0.104
The error is: System.InvalidCastException occurred in FSharp.Core.dll
Added Information
I created a new table with one column of type float and inserted the values 2.2 and 4.2. Using SQLProvider and System.Data.SQLite.Core, I connected to and queried the database using minBy or maxBy, and I get the cast exception. If the column type is integer, it works correctly.
More Added Information
Exception detail:
System.Exception was unhandled
Message: An unhandled exception of type 'System.Exception' occurred in FSharp.Core.dll
Additional information: Unsupported execution expression value(FSharp.Data.Sql.Runtime.QueryImplementation+SqlQueryable`1[FSharp.Data.Sql.Common.SqlEntity]).Min(row => Convert(Convert(row.GetColumn("X"))))
Code that fails:
open FSharp.Data.Sql
[<Literal>]
let ConnectionString =
    "Data Source=c:\MyDB.db;" +
    "Version=3;foreign keys=true"
type Sql = SqlDataProvider<Common.DatabaseProviderTypes.SQLITE,
                           ConnectionString,
                           //ResolutionPath = resolutionPath,
                           CaseSensitivityChange = Common.CaseSensitivityChange.ORIGINAL>
let ctx = Sql.GetDataContext()
let Db = ctx.Main.Test
let x = query{for row in Db do minBy row.X}
printfn "x: %A" x
Update 2/1/17
Another user was able to reproduce the issue, so I filed an issue with SQLProvider. I'm now looking at workarounds, and while the following code works and is fast, I know there's a better way to do it; I just can't find it. If somebody answers with better code I'll accept that answer. Thanks again for all the help.
let x = query {for row in db do
               sortBy row.Column
               take 1
               select row.Column } |> Seq.toArray |> Array.min
This is my workaround that @s952163 and the good people in the SO F# chat room helped me with. Thanks again to everyone who helped.
let x = query {for row in db do
               sortBy row.Column
               take 1
               select row.Column } |> Seq.head
You need to coerce the output column to int or float (whichever you need or is giving trouble to you). You also need to take care in case any of your columns are nullable. The example below will coerce the column to float first (to take care of being nullable), then convert it to int, and finally get the minimum:
let x = query { for row in MYTABLE do
                minBy (int (float row.MYCOLUMN))}
You might want to change the order of course, or just say float Mycolumn.
Update:
With Sqlite it indeed causes an error. You might want to do query { ... } |> Seq.minBy to extract the smallest number.
I'm currently trying out the SqlDataConnection type provider, and was wondering how to display these types.
Is there some way I can have my calls to printfn "%A" display something more meaningful than the type name?
I do not think there is a way to intercept the behaviour of printfn "%A" for existing types generated by a type provider (that you cannot modify). If you could modify the type provider, then you could change it to generate the StructuredFormatDisplay attribute for the generated types, but that's not possible for SqlDataConnection.
If you're using it in F# Interactive, then you can use fsi.AddPrintTransformer to define how individual values are printed when they are a result of some computation. For example:
// Using Northwind database as a sample
type DB = SqlDataConnection<"Data Source=.\\SQLExpress;Initial Catalog=Northwind;...">
let db = DB.GetDataContext()
// A simple formatter that creates a list with property names and values
let formatAny (o:obj) =
    [ for p in o.GetType().GetProperties() ->
        p.Name, p.GetValue(o) ]
// Print all Northwind products using the formatter
fsi.AddPrintTransformer(fun (p:DB.ServiceTypes.Products) ->
    formatAny p |> box)
// Take the first product - will be printed using custom formatter
query { for p in db.Products do head }
The specified PrintTransformer is only used when you get the value as a result in F# interactive. It will also work when you write query { .. } |> List.ofSeq for queries that return multiple objects. But for printfn "%A", you'll have to call the conversion function (like formatAny) explicitly...
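For comparison, when the type is one you define yourself, the StructuredFormatDisplay attribute mentioned above can be applied directly. The sketch below uses a hypothetical Product record (not the provided Northwind type) to show how "%A" then picks up the custom text:

```fsharp
// Hypothetical record standing in for a type you control; provided types cannot be annotated like this.
[<StructuredFormatDisplay("{Display}")>]
type Product =
    { Name: string
      UnitsInStock: int }
    // The property named inside the braces supplies the text used by "%A".
    member this.Display = sprintf "%s (%d in stock)" this.Name this.UnitsInStock

let chai = { Name = "Chai"; UnitsInStock = 39 }

// Goes through the Display property rather than the default record layout.
let rendered = sprintf "%A" chai
printfn "%s" rendered
```

For types generated by SqlDataConnection this route is closed, which is why an explicit helper such as formatAny is still needed with printfn.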
Google yields plenty of examples of adding and deleting entries in an F# dictionary (or other collection). But I don't see examples of the equivalent of
myDict["Key"] = MyValue;
I've tried
myDict.["Key"] <- MyValue
I have also attempted to declare the Dictionary as
Dictionary<string, mutable string>
as well as several variants on this. However, I haven't hit on the correct combination yet... if it is actually possible in F#.
Edit: The offending code is:
type Config(?fileName : string) =
    let fileName = defaultArg fileName @"C:\path\myConfigs.ini"
    static let settings =
        dict[ "Setting1", "1";
              "Setting2", "2";
              "Debug", "0";
              "State", "Disarray";]
    let settingRegex = new Regex(@"\s*(?<key>([^;#=]*[^;#= ]))\s*=\s*(?<value>([^;#]*[^;# ]))")
    do File.ReadAllLines(fileName)
       |> Seq.map(fun line -> settingRegex.Match(line))
       |> Seq.filter(fun mtch -> mtch.Success)
       |> Seq.iter(fun mtch -> settings.[mtch.Groups.Item("key").Value] <- mtch.Groups.Item("value").Value)
The error I'm getting is:
System.NotSupportedException: This value may not be mutated
at Microsoft.FSharp.Core.ExtraTopLevelOperators.dict@37-2.set_Item(K key, V value)
at <StartupCode$FSI_0036>.$FSI_0036_Config.$ctor@25-6.Invoke(Match mtch)
at Microsoft.FSharp.Collections.SeqModule.iter[T](FastFunc`2 action, IEnumerable`1 sequence)
at FSI_0036.Utilities.Config..ctor(Option`1 fileName)
at <StartupCode$FSI_0041>.$FSI_0041.main@()
stopped due to error
F# has two common associative data structures:
The one you are most used to is the mutable Dictionary, which F# inherits thanks to its presence in the BCL and which uses a hashtable under the hood.
let dict = new System.Collections.Generic.Dictionary<string,int>()
dict.["everything"] <- 42
The other is known as Map and is, in common functional style, immutable and implemented with binary trees.
Instead of operations that would change a Dictionary, maps provide operations which return a new map which is the result of whatever change was requested. In many cases, under the hood there is no need to make an entirely new copy of the entire map, so those parts that can be shared normally are. For example:
let withDouglasAdams = Map.add "everything" 42 Map.empty
The value withDouglasAdams will remain forever as an association of "everything" to 42, so if you later do:
let soLong = Map.remove "everything" withDouglasAdams
Then the effect of this 'removal' is only visible via the soLong value.
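A small self-contained check of that behaviour, reusing the two bindings from above and adding only standard Map functions:

```fsharp
let withDouglasAdams = Map.add "everything" 42 Map.empty
let soLong = Map.remove "everything" withDouglasAdams

// The original map is untouched by the "removal".
let stillThere = Map.containsKey "everything" withDouglasAdams
// Only the new map lacks the key.
let goneInNew = not (Map.containsKey "everything" soLong)

printfn "original has key: %b, new map lacks key: %b" stillThere goneInNew
printfn "lookup in original: %A" (Map.tryFind "everything" withDouglasAdams)
```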
F#'s Map is, as mentioned, implemented as a binary tree. Lookup is therefore O(log n), whereas a (well-behaved) dictionary should be O(1). In practice a hash-based dictionary will tend to outperform the tree-based one in almost all simple cases (low number of elements, low probability of collision), and as such it is commonly used. That said, the immutable aspect of the Map may allow you to use it in situations where the dictionary would instead require more complex locking, or to write more 'elegant' code with fewer side effects, and thus it remains a useful alternative.
This is not, however, the source of your problem. The dict 'operator' returns an explicitly immutable IDictionary<K,T> implementation (despite not indicating this in its documentation).
From fslib-extra-pervasives.fs (note also the use of options on the keys):
let dict l =
    // Use a dictionary (this requires hashing and equality on the key type)
    // Wrap keys in an Some(_) option in case they are null
    // (when System.Collections.Generic.Dictionary fails). Sad but true.
    let t = new Dictionary<Option<_>,_>(HashIdentity.Structural)
    for (k,v) in l do
        t.[Some(k)] <- v
    let d = (t :> IDictionary<_,_>)
    let c = (t :> ICollection<_>)
    let ieg = (t :> IEnumerable<_>)
    let ie = (t :> System.Collections.IEnumerable)
    // Give a read-only view of the dictionary
    { new IDictionary<'key, 'a> with
        member s.Item
            with get x = d.[Some(x)]
            and set (x,v) = raise (NotSupportedException(
                                    "This value may not be mutated"))
      ...
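Given that, one way to keep the convenient dict literal for the defaults while still being able to overwrite entries is to copy it into a BCL Dictionary. This is a sketch rather than the only fix; the setting names are taken from the question and the variable names are my own:

```fsharp
open System
open System.Collections.Generic

// Defaults built with `dict`, which yields a read-only IDictionary.
let defaults = dict [ "Setting1", "1"; "Debug", "0" ]

// Copying into a BCL Dictionary gives a mutable collection with the same entries.
let settings = Dictionary<string, string>(defaults)
settings.["Debug"] <- "1"          // overwrite an existing key
settings.["State"] <- "Disarray"   // add a new key

// The original read-only view still rejects writes.
let defaultsRejectWrites =
    try
        defaults.["Debug"] <- "1"
        false
    with :? NotSupportedException -> true

printfn "Debug = %s, defaults read-only: %b" settings.["Debug"] defaultsRejectWrites
```

In the Config type from the question, this would mean wrapping the dict [...] expression in Dictionary<string, string>(...) when defining settings.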
What error do you get? I tried the following and it compiles just fine
let map = new System.Collections.Generic.Dictionary<string,int>()
map.["foo"] <- 42
EDIT: Verified that this code runs just fine as well.