F# Deedle cast column datatype

I have loaded a CSV file into a Frame. Deedle automatically infers one column as decimal, which in fact should be int.
I have used the following line to cast it to the correct type:
df?ColumnName <- df.GetColumn<int>("ColumnName")
I am wondering if this is the right way.

You can control the types of columns when reading the .csv file.
ReadCsv(...) has a schema parameter for this:
schema - A string that specifies CSV schema. See the documentation for
information about the schema format.
More information can be found here (in the section Controlling the column types)
Example:
.csv:
Name,Age,Comp1,Comp2
"Joe", 51, 12.1, 20.3
"Tomas", 28, 1.1, 29.3
"Eve", 2, 2.1, 40.3
"Suzanne", 15, 12.4, 26.3
F#:
let pathToCSV = "0.csv"
let schema = "Name,Age(int),Comp1,Comp2"
let loadFrame = Frame.ReadCsv(pathToCSV, schema=schema)
loadFrame.Format() |> printfn "%s"
loadFrame.ColumnTypes |> Seq.iter(printfn "%A")
Print:
Name Age Comp1 Comp2
0 -> Joe 51 12,1 20,3
1 -> Tomas 28 1,1 29,3
2 -> Eve 2 2,1 40,3
3 -> Suzanne 15 12,4 26,3
System.String
System.Int32
System.Decimal
System.Decimal
Although, for me, the Frame has the correct column types even without specifying the schema.

Well, if it works for you... This gets the column and overwrites it with the type specified.
You have another option, which is to specify the schema of the CSV file:
•inferTypes - Specifies whether the method should attempt to infer types of columns automatically (set this to false if you want to specify schema)
•inferRows - If inferTypes=true, this parameter specifies the number of rows to use for type inference. The default value is 0, meaning all rows.
•schema - A string that specifies CSV schema. See the documentation for information about the schema format.
You could probably check the CSV type provider docs for ReadCsv.

Related

Excel DNA UDF obtain unprocessed values as inputs

I have written several helper functions in F# that enable me to deal with the dynamic nature of Excel over the COM/PIA interface. However when I go to use these functions in an Excel-DNA UDF they do not work as expected as Excel-DNA is pre-processing the values in the array from excel.
e.g. null is turned into ExcelDna.Integration.ExcelEmpty
This interferes with my own validation code that was anticipating a null. I am able to work around this by adding an additional case to my pattern matching:
let (|XlEmpty|_|) (x: obj) =
    match x with
    | null -> Some XlEmpty
    | :? ExcelDna.Integration.ExcelEmpty -> Some XlEmpty
    | _ -> None
However it feels like a waste to convert and then convert again. Is there a way to tell Excel-DNA not to do additional processing of the range values in a UDF and supply them equivalent to the COM/PIA interface? i.e. Range.Value XlRangeValueDataType.xlRangeValueDefault
EDIT:
I declare my arguments as obj like this:
[<ExcelFunction(Description = "Validates a Test Table Row")>]
let isTestRow (headings: obj) (row: obj) =
    let validator = TestTable.validator
    let headingsList = TestTable.testHeadings
    validateRow validator headingsList headings row
I have done some more digging, and the question suggested by @Jim Foye also confirms this. For UDFs, Excel-DNA works over the C API rather than COM and therefore has to do its own marshaling. The possible values are shown in this file:
https://github.com/Excel-DNA/ExcelDna/blob/2aa1bd9afaf76084c1d59e2330584edddb888eb1/Distribution/Reference.txt
The reason for ExcelEmpty (the user supplied an empty cell) is that a UDF argument can also be ExcelMissing (the user supplied no argument). Both could reasonably be represented as null, so there is a need to disambiguate.
I will adjust my pattern matching to be compatible with both the COM marshaling and the ExcelDNA marshaling.
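A minimal sketch of such a combined pattern, under stated assumptions: ExcelEmpty and ExcelMissing below are hypothetical local classes that mimic the Excel-DNA marker types so the snippet runs without the Excel-DNA package. In a real add-in they would be replaced by ExcelDna.Integration.ExcelEmpty and ExcelDna.Integration.ExcelMissing. The names XlMissing, XlValue and describe are illustrative only.

```fsharp
// Stand-ins (assumption) for ExcelDna.Integration.ExcelEmpty / ExcelMissing,
// defined locally so the sketch runs without referencing Excel-DNA.
type ExcelEmpty() = class end
type ExcelMissing() = class end

// One total active pattern covering both marshaling styles:
// COM/PIA delivers null for an empty cell, Excel-DNA delivers a marker object.
let (|XlEmpty|XlMissing|XlValue|) (x: obj) =
    match x with
    | null -> XlEmpty               // COM-style empty cell
    | :? ExcelEmpty -> XlEmpty      // Excel-DNA-style empty cell
    | :? ExcelMissing -> XlMissing  // argument omitted by the caller
    | v -> XlValue v                // anything else is passed through

let describe (x: obj) =
    match x with
    | XlEmpty -> "empty"
    | XlMissing -> "missing"
    | XlValue v -> sprintf "value %O" v

printfn "%s" (describe null)
printfn "%s" (describe (box (ExcelEmpty())))
printfn "%s" (describe (box (ExcelMissing())))
printfn "%s" (describe (box 42))
```

Because the pattern is total, the validation code matches on XlEmpty once and does not need to know which interface produced the value.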

How do I convert missing values into strings?

I have a Deedle DataFrame of type Frame<int,string> that contains some missing values. I would like to convert the missing values into empty strings "". I tried to use the valueOr function but that did not help. Is there a way to do this?
Here is my DataFrame:
let s1 = Series.ofOptionalObservations [ 1 => Some("A"); 2 => None ]
let s2 = Series.ofOptionalObservations [ 1 => Some("B"); 2 => Some("C") ]
let df = Frame.ofColumns ["A", s1; "BC", s2]
Typing df;; in FSI yields some information including
ColumnTypes = seq [System.String; System.String];. So the values of df are of type string and not string option.
This is the function valueOr:
let valueOr (someDefault: 'a) (xo: 'a option) : 'a =
    match xo with
    | Some v -> v
    | None -> someDefault
I defined an auxiliary function emptyFoo as:
let emptyFoo = valueOr ""
The signature of emptyFoo is string option -> string. This means emptyFoo should not be acceptable to the compiler in the following command:
let df' = Frame.mapValues emptyFoo df
This is because the values of df are of type string and not string option.
Still, the compiler does not complain and the code runs. However, df' still has a missing value.
Is there a way to transform the missing value into the empty string?
The Deedle documentation for Frame.mapValues:
Builds a new data frame whose values are the results of applying the specified function on these values, but only for those columns which can be converted to the appropriate type for input to the mapping function
So the mapping does nothing because strings are found, rather than string options.
I noticed another function that seems to do exactly what you want.
let df' = Frame.fillMissingWith "" df
The key thing I noticed was that Deedle shows those missing values as <missing>, suggesting that it uses its own representation (as opposed to option, for example). With that knowledge I guessed that the library would provide some way of manipulating missing values, so I explored the API by typing Frame. in my IDE and browsing the list of available functions and their documentation.
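A short script-style sketch of the difference, assuming Deedle is pulled in through the `#r "nuget: Deedle"` directive of a recent F# Interactive. The names before and after are introduced here for illustration. Series.tryGet is used to inspect the cell, on the understanding that it yields None for a missing value.

```fsharp
#r "nuget: Deedle"
open Deedle

// Same frame as in the question: key 2 of column "A" is missing.
let s1 = Series.ofOptionalObservations [ 1 => Some "A"; 2 => None ]
let s2 = Series.ofOptionalObservations [ 1 => Some "B"; 2 => Some "C" ]
let df = Frame.ofColumns [ "A", s1; "BC", s2 ]

// Inspect the missing cell before filling.
let before = df.GetColumn<string>("A") |> Series.tryGet 2

// Replace every missing string value with the empty string.
let df' = Frame.fillMissingWith "" df

// Inspect the same cell after filling.
let after = df'.GetColumn<string>("A") |> Series.tryGet 2

printfn "before: %A, after: %A" before after
```

The missing value is handled by the frame itself rather than by an option-typed mapping function, which is why the string option -> string function passed to Frame.mapValues never ran.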

f# sqlite sqlprovider minBy maxBy using float

I have a sqlite table with a mix of integer and float columns. I'm trying to get the max and min values of each column. For integer columns the following code works but I get a cast error when using the same code on float columns:
let numCats = query{for row in db do minBy row.NumCats}
For float columns I'm using the following code but it's slow.
let CatHight = query{for row in db do select row.CatHeight} |> Seq.toArray |> Array.max
I have 8 integer columns and 9 float columns and the behavior has been consistent across all columns so that's why I think it's an issue with the column type. But I'm new to F# and don't know anything so I'm hoping you can help me.
Thank you for taking the time to help, it's much appreciated.
SQLProvider version: 1.0.41
System.Data.SQLite.Core version: 1.0.104
The error is: System.InvalidCastException occurred in FSharp.Core.dll
Added Information
I created a new table with one column of type float and inserted the values 2.2 and 4.2. Using SQLProvider and System.Data.SQLite.Core, I connected and queried the database using minBy or maxBy, and I get the cast exception. If the column type is integer it works correctly.
More Added Information
Exception detail:
System.Exception was unhandled
Message: An unhandled exception of type 'System.Exception' occurred in FSharp.Core.dll
Additional information: Unsupported execution expression value(FSharp.Data.Sql.Runtime.QueryImplementation+SqlQueryable`1[FSharp.Data.Sql.Common.SqlEntity]).Min(row => Convert(Convert(row.GetColumn("X"))))
Code that fails:
open FSharp.Data.Sql
[<Literal>]
let ConnectionString =
    "Data Source=c:\MyDB.db;" +
    "Version=3;foreign keys=true"
type Sql = SqlDataProvider<Common.DatabaseProviderTypes.SQLITE,
                           ConnectionString,
                           //ResolutionPath = resolutionPath,
                           CaseSensitivityChange = Common.CaseSensitivityChange.ORIGINAL>
let ctx = Sql.GetDataContext()
let Db = ctx.Main.Test
let x = query{for row in Db do minBy row.X}
printfn "x: %A" x
Update 2/1/17
Another user was able to reproduce the issue so I filed an Issue with SQLProvider. I'm now looking at workarounds and while the following code works and is fast, I know there's a better way to do it - I just can't find the correct way. If somebody answers with better code I'll accept that answer. Thanks again for all the help.
let x = query {for row in db do
               sortBy row.Column
               take 1
               select row.Column } |> Seq.toArray |> Array.min
This is my workaround, which @s952163 and the good people in the SO F# chat room helped me with. Thanks again to everyone who helped.
let x = query {for row in db do
               sortBy row.Column
               take 1
               select row.Column } |> Seq.head
You need to coerce the output column to int or float (whichever you need or is giving trouble to you). You also need to take care in case any of your columns are nullable. The example below will coerce the column to float first (to take care of being nullable), then convert it to int, and finally get the minimum:
let x = query { for row in MYTABLE do
                minBy (int (float row.MYCOLUMN))}
You might want to change the order of course, or just say float Mycolumn.
Update:
With Sqlite it indeed causes an error. You might want to do query { ... } |> Seq.minBy to extract the smallest number.
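The two workarounds discussed above can be compared side by side in a small sketch. It assumes an in-memory list in place of the SQLite table: Row, db and the value names are hypothetical, and no SQLProvider translation is involved, so it only illustrates the shape of the queries, not the original failure.

```fsharp
// Hypothetical row type and in-memory data standing in for the SQLite table.
type Row = { X: float }
let db = [ { X = 4.2 }; { X = 2.2 }; { X = 3.1 } ]

// Workaround 1: select the column, then aggregate on the client.
// Against a real database this transfers every value of the column.
let minClient = query { for row in db do select row.X } |> Seq.min
let maxClient = query { for row in db do select row.X } |> Seq.max

// Workaround 2: let the query sort and return a single row.
let minSorted =
    query {
        for row in db do
        sortBy row.X
        take 1
        select row.X
    }
    |> Seq.head

// The maximum uses the descending sort operator instead.
let maxSorted =
    query {
        for row in db do
        sortByDescending row.X
        take 1
        select row.X
    }
    |> Seq.head

printfn "min: %g / %g, max: %g / %g" minClient minSorted maxClient maxSorted
```

With a real provider, the second form keeps the sorting on the database side and returns one value, which matches the asker's observation that it is fast.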

Pattern matching on provided types

Firstly, obtain a schema and parse:
type desc = JsonProvider< """[{"name": "", "age": 1}]""", InferTypesFromValues=true >
let json = """[{"name": "Kitten", "age": 322}]"""
let typedJson = desc.Parse(json)
Now we can access typedJson.[0] .Age and .Name properties, however, I'd like to pattern match on them at compile-time to get an error if the schema is changed.
Since those properties are erased and we cannot obtain them at run-time:
let ``returns false``() =
    typedJson.[0].GetType()
        .FindMembers(MemberTypes.All, BindingFlags.Public ||| BindingFlags.Instance,
                     MemberFilter(fun _ _ -> true), null)
    |> Array.exists (fun m -> m.ToString().Contains("Age"))
...I've made a runtime-check version using active patterns:
let (|Name|Age|) k =
    let toID = NameUtils.uniqueGenerator NameUtils.nicePascalName
    let idk = toID k
    match idk with
    | _ when idk.Equals("Age") -> Age
    | _ when idk.Equals("Name") -> Name
    | ex_val -> failwith (sprintf "\"%s\" shouldn't even compile!" ex_val)
typedJson.[0].JsonValue.Properties()
|> Array.map (fun (k, v) ->
    match k with
    | Age -> v.AsInteger().ToString() // ...
    | Name -> v.AsString()) // ...
|> Array.iter (printfn "%A")
In theory, if FSharp.Data weren't open source, I wouldn't be able to implement toID. Generally, the whole approach seems wrong, as it redoes work the type provider has already done.
I know that discriminated unions can't be generated using type providers, but maybe there's a better way to do all this checking at compile-time?
As far as I know, it is not possible to find out whether the "Json schema has changed" at compile time using the given TP.
That's why:
JsonProvider<sample> is exactly what kicks in at compile time, providing a type for manipulating Json contents at run time. This provided erased type has a couple of run-time static methods common to any sample, and a type Root extending IJsonDocument with a few instance properties, including ones based on the compile-time sample (in your case, the properties Name and Age). There is exactly one, very relaxed, implicit Json "schema" behind the JsonProvider-provided type, and no other such entity to compare it with for changes at compile time.
At run time, only the provided type desc with its static methods and its Root type with the corresponding instance methods are at your service for manipulating arbitrary Json contents. All this jazz is pretty much agnostic with regard to the "Json schema": in your case, as long as the run-time Json contents represent an array, its elements may be pretty much anything.
For example,
type desc = JsonProvider<"""[{"name": "", "age": 1}]"""> // # compile-time
let ``kinda "typed" json`` = desc.Parse("""[]""") // # run-time
let ``another kinda "typed" json`` =
    desc.Parse("""[{"contents":"whatever", "text":"blah-blah-blah"},[{"extra":42}]]""")
Both will be happily parsed at run time as "typed Json" conforming to the "schema" derived by the TP from the given sample, although Name and Age are apparently missing and exceptions will be raised if they are accessed.
It is feasible to build another Json TP that relies upon a formal Json schema. It may consume a reference to the given schema from a schema repository upon type creation and allow manipulating elements of the parsed Json payload only via provided accessors derived from the schema at compile time. In this case, a change to the referenced schema may break compilation if the provided accessors used in the code are incompatible with the change. Such an arrangement, accompanied by a run-time Json payload validator or validating parser, may provide reliable enterprise-quality Json schema change management.
The JsonProvider TP from FSharp.Data lacks such Json schema handling abilities, so payload validation has to be done at run time only.
Quoting your comments which explain a little better what you are trying to achieve:
Thank you! But what I'm trying to achieve is to get a compiler error
if I add a new field e.g. Color to the json schema and then ignore it
while later processing. In case of unions it would be instant FS0025.
and:
yes, I must process all fields, so I can't rely on _. I want it so
when the schema changes, my F# program won't compile without adding
necessary handling functionality(and not just ignoring new field or
crashing at runtime).
The simplest solution for your purpose is to construct a "test" object.
The provided type comes with two constructors: one takes a JsonValue and parses it - effectively the same as JsonValue.Parse - while the other requires every field to be filled in.
That's the one that interests us.
We're also going to invoke it using named parameters, so that we'll be safe not only if fields are added or removed, but also if they are renamed or changed.
type desc = JsonProvider< """[{"name": "SomeName", "age": 1}]""", InferTypesFromValues=true >
let TestObjectPleaseIgnore = new desc.Root (name = "Kitten", age = 322)
// compiles
(Note that I changed the value of name in the sample to "SomeName", because "" was being inferred as a generic JsonValue.)
Now if more fields suddenly appear in the sample used by the type provider, the constructor will become incomplete and fail to compile.
type desc = JsonProvider< """[{"name": "SomeName", "age": 1, "color" : "Red"}]""", InferTypesFromValues=true >
let TestObjectPleaseIgnore = new desc.Root (name = "Kitten", age = 322)
// compilation error: The member or object constructor 'Root' taking 1 arguments are not accessible from this code location. All accessible versions of method 'Root' take 1 arguments.
Obviously, the error refers to the 1-argument constructor because that's the one it tried to fit, but you'll see that now the provided type has a 3-parameter constructor replacing the 2-parameter one.
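The same compile-time tripwire can be illustrated without a type provider. The sketch below is an analogy only: Person is a hypothetical hand-written record standing in for the provided desc.Root, and describe is an illustrative consumer.

```fsharp
// Hypothetical hand-written record standing in for the provided type desc.Root.
type Person = { Name: string; Age: int }

// A record expression must assign every field. If a field such as Color is
// later added to Person, this line stops compiling (error FS0764), which is
// the effect the "test object" achieves with the provided constructor.
let testObjectPleaseIgnore = { Name = "Kitten"; Age = 322 }

// Naming the fields also catches renames and removals. A record pattern does
// not have to list every field, so it does not catch additions by itself.
let describe { Name = name; Age = age } = sprintf "%s is %d" name age

printfn "%s" (describe testObjectPleaseIgnore)
```

The test object is never used at run time; it exists only so that a change in the shape of the type surfaces as a compiler error at a known location.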
If you use InferTypesFromValues=false, you get a strong type back:
type desc = JsonProvider< """[{"name": "", "age": 1}]""", InferTypesFromValues=false >
You can use the desc type to define active patterns over the properties you care about:
let (|Name|_|) target (candidate : desc.Root) =
    if candidate.Name = target then Some target else None
let (|Age|_|) target (candidate : desc.Root) =
    if candidate.Age = target then Some target else None
These active patterns can be used like this:
let json = """[{"name": "Kitten", "age": 322}]"""
let typedJson = desc.Parse(json)
match typedJson.[0] with
| Name "Kitten" n -> printfn "Name is %s" n
| Age 322m a -> printfn "Age is %M" a
| _ -> printfn "Nothing matched"
Given the typedJson value here, that match is going to print out "Name is Kitten".
<tldr>
Build some parser to handle your issues. http://www.quanttec.com/fparsec/
</tldr>
So...
You want something that can read something and do something with it, without knowing a priori what any of those somethings is.
Good luck with that one.
You do not want a type provider to do this for you. Type providers are made for the very purpose of "at compile time this is what I saw, and that's what I will use".
With that said:
You want some other type of parser, where you are able to check the "schema" (some definition of what you know is going to come or you saw last time vs. what actually came). In effect some dynamic parser getting the data to some dynamic structure, with dynamic types.
And mind you, dynamic is not static. F# has static types and a lot is based on that, type providers even more so. Fighting that will make your head ache. It is of course possible, and it might even be possible to fight the type providers into working with such an approach, but then again it's not really a type provider any more, nor is it what type providers are for.
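The "what you expected vs. what actually came" check can be sketched at run time with plain sets of field names. This is only an illustration under assumptions: SchemaDrift, expectedKeys and diffKeys are names invented here, and the actual keys are passed in directly instead of being extracted from a parsed document.

```fsharp
// Result of comparing the expected field names with the ones that arrived.
type SchemaDrift = { Added: Set<string>; Removed: Set<string> }

// Field names the program knows how to handle (assumption for this sketch).
let expectedKeys = set [ "name"; "age" ]

// Compare expected names against the names found in an incoming object.
let diffKeys (expected: Set<string>) (actual: seq<string>) =
    let actualSet = Set.ofSeq actual
    { Added = actualSet - expected      // fields the program does not handle yet
      Removed = expected - actualSet }  // fields the program relies on but did not get

// Example: the payload gained a "color" field.
let drift = diffKeys expectedKeys [ "name"; "age"; "color" ]

if not drift.Added.IsEmpty || not drift.Removed.IsEmpty then
    printfn "Schema drift detected: added %A, removed %A" drift.Added drift.Removed
```

Such a check fails at run time rather than at compile time, which is the trade-off this answer describes.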

Deedle IndexRows type annotations

I was trying to implement a Deedle solution for the little challenge from @migueldeicaza to achieve in F# what was done in http://t.co/4YFXk8PQaU with Python and R. The csv source data is available from the link.
The start is simple but now, while trying to order based upon a column series of float values I'm struggling to understand the syntax for the IndexRows type annotation.
#I "../packages/FSharp.Charting.0.90.5"
#I "../packages/Deedle.0.9.12"
#load "FSharp.Charting.fsx"
#load "Deedle.fsx"
open System
open Deedle
open FSharp.Charting
let bodyCountData = Frame.ReadCsv(__SOURCE_DIRECTORY__ + "/film_death_counts.csv")
bodyCountData?DeathsPerMinute <- bodyCountData?Body_Count / bodyCountData?Length_Minutes
// select top 3 rows based upon default ordinal indexer
bodyCountData.Rows.[0..3]
// create a new frame indexed and ordered by descending number of screen deaths per minute
let bodyCountDataOrdered =
    bodyCountData
    |> Frame.indexRows <float>"DeathsPerMinute" // uh oh error here - I'm confused
And because I can't figure that syntax out... various messages like:
Error 1 The type '('a -> Frame<'c,Frame<int,string>>)' does not support the 'comparison' constraint. For example, it does not support the 'System.IComparable' interface. See also c:\wd\RPythonFSharpDFChallenge\RPythonFSharpDFChallenge\EvilMovieQuery.fsx(18,4)-(19,22). c:\wd\RPythonFSharpDFChallenge\RPythonFSharpDFChallenge\EvilMovieQuery.fsx 19 8 RPythonFSharpDFChallenge
Error 2 Type mismatch. Expecting a
'a -> Frame<'c,Frame<int,string>>
but given a
'a -> float
The type 'Frame<'a,Frame<int,string>>' does not match the type 'float' c:\wd\RPythonFSharpDFChallenge\RPythonFSharpDFChallenge\EvilMovieQuery.fsx 19 25 RPythonFSharpDFChallenge
Error 3 This expression was expected to have type
bool
but here has type
string c:\wd\RPythonFSharpDFChallenge\RPythonFSharpDFChallenge\EvilMovieQuery.fsx 19 31 RPythonFSharpDFChallenge
Edit: Just thinking about this... indexing on a measured float is a silly thing to do anyway - duplicates and missing values in real world data. So, I wonder what a more sensible approach to this would be. I still need to find the 25 max values... Maybe I can work this out for myself...
With Deedle 1.0, you can sort on an arbitrary column.
See: http://bluemountaincapital.github.io/Deedle/reference/deedle-framemodule.html#section7
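A sketch of what sorting on a column might look like, assuming Deedle 1.0 or later loaded through `#r "nuget: Deedle"`. The frame below uses made-up values in place of film_death_counts.csv, and the names films and top2 are illustrative. Frame.sortRowsBy and Frame.take are used on the assumption that they behave as documented in the Frame module reference.

```fsharp
#r "nuget: Deedle"
open Deedle

// Made-up sample values standing in for film_death_counts.csv.
let films =
    Frame.ofColumns [
        "Body_Count"     => series [ "A" => 10.0; "B" => 300.0; "C" => 45.0 ]
        "Length_Minutes" => series [ "A" => 100.0; "B" => 120.0; "C" => 90.0 ] ]

// Derived column, as in the question.
films?DeathsPerMinute <- films?Body_Count / films?Length_Minutes

// Sort rows by the derived column in descending order (negated sort key)
// and keep the first two rows; the row index itself is left untouched.
let top2 =
    films
    |> Frame.sortRowsBy "DeathsPerMinute" (fun (v: float) -> -v)
    |> Frame.take 2

top2.Print()
```

Sorting by a column avoids re-indexing the rows on a float, so duplicate or missing values in DeathsPerMinute do not have to be valid row keys.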
