I am doing some data mining using F#. To make sure that no blank values slip through the parsing step, I store each value as a double option. A masked version of the data is shown below:
type structure = {
    time: double option
    pressure: double option
    force: double option }
let rawData =
    [| { time = Some 15.0; pressure = Some 50.0; force = Some 100.0 }
       { time = Some 16.0; pressure = Some 55.0; force = Some 110.0 }
       { time = Some 17.0; pressure = None; force = Some 110.0 }
       { time = Some 16.0; pressure = Some 65.0; force = None }
       { time = Some 18.0; pressure = Some 70.0; force = Some 120.0 } |]
I am currently saving this into a Deedle Data frame and saving to a .csv. However when I do this the values have "Some()" associated with them. They also have blank values for the None values.
How would I be able to take the "Some()" around the numbers away and turn the None values to "NaN" then save this to a .csv?
How would I be able to take the "Some()" around the numbers away and turn the None values to "NaN"
Create a function with signature double option -> string. You can process an option using a match expression.
let doubleOptionToStringFlattened =
    function
    | Some(d:double) -> d.ToString()
    | None -> "NaN"
https://learn.microsoft.com/en-us/dotnet/fsharp/language-reference/options
then save this to a .csv
open System
let csvLineFromStructure(s:structure) =
    [| s.time; s.pressure; s.force |]
    |> Array.map doubleOptionToStringFlattened
    |> String.concat ","
// then create a line per structure:
let csvFromStructures(structures:structure[]) =
    structures |> Array.map csvLineFromStructure |> String.concat Environment.NewLine
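Putting the two steps together, a minimal end-to-end sketch could look like the following. The type name `Reading`, the header row, the invariant-culture formatting and the output file name `readings.csv` are assumptions made for illustration; they are not part of the question.

```fsharp
open System
open System.Globalization
open System.IO

// Hypothetical record mirroring the question's `structure` type
type Reading =
    { time: double option
      pressure: double option
      force: double option }

// Some x -> "x" (formatted culture-independently), None -> "NaN"
let cell (value: double option) =
    match value with
    | Some d -> d.ToString(CultureInfo.InvariantCulture)
    | None -> "NaN"

let toCsvLine (r: Reading) =
    [ r.time; r.pressure; r.force ]
    |> List.map cell
    |> String.concat ","

// Header row followed by one line per reading
let toCsv (rows: Reading[]) =
    rows
    |> Array.map toCsvLine
    |> Array.append [| "time,pressure,force" |]
    |> String.concat Environment.NewLine

let sample =
    [| { time = Some 15.0; pressure = Some 50.0; force = Some 100.0 }
       { time = Some 17.0; pressure = None; force = Some 110.0 } |]

// Assumed output path; adjust as needed
File.WriteAllText("readings.csv", toCsv sample)
```

Formatting with the invariant culture keeps the decimal separator a dot, so it cannot collide with the comma used as the column separator.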
Related
I need to write a Deedle FrameData (including "ID" column and additional "Delta" column with blank entries) to CSV. While I can generate a 2D array of the FrameData, I am unable to write it correctly to a CSV file.
module SOQN =
    open System
    open System.IO
    open Deedle
    open FSharp.Data
    // TestInput.csv
    // ID,Alpha,Beta,Gamma
    // 1,no,1,hi
    // ...
    // TestOutput.csv
    // ID,Alpha,Beta,Gamma,Delta
    // 1,"no","1","hi",""
    // ...
    let inputCsv = @"D:\TestInput.csv"
    let outputCsv = @"D:\TestOutput.csv"
    let (df:Frame<obj,string>) = Frame.ReadCsv(inputCsv, hasHeaders=true, inferTypes=false, separators=",", indexCol="ID")
    // See http://www.fssnip.net/sj/title/Insert-Deedle-frame-into-Excel
    let data4Frame (frame:Frame<_,_>) = frame.GetFrameData()
    // See http://www.fssnip.net/sj/title/Insert-Deedle-frame-into-Excel
    let boxOptional obj =
        match obj with
        | Deedle.OptionalValue.Present obj -> box (obj.ToString())
        | _ -> box ""
    // See http://www.fssnip.net/sj/title/Insert-Deedle-frame-into-Excel
    let frameToArray (data:FrameData) =
        let transpose (array:'T[,]) =
            Array2D.init (array.GetLength(1)) (array.GetLength(0)) (fun i j -> array.[j, i])
        data.Columns
        |> Seq.map (fun (typ, vctr) -> vctr.ObjectSequence |> Seq.map boxOptional |> Array.ofSeq)
        |> array2D
        |> transpose
    let main =
        printfn ""
        printfn "Output Deedle FrameData To CSV"
        printfn ""
        let dff = data4Frame df
        let rzlt = frameToArray dff
        printfn "rzlt: %A" rzlt
        do
            use writer = new StreamWriter(outputCsv)
            writer.WriteLine("ID,Alpha,Beta,Gamma,Delta")
            // writer.WriteLine rzlt
        0
    [<EntryPoint>]
    main
    |> ignore
What am I missing?
I would not use FrameData to do this - frame data is mostly internal and while there are some legitimate uses for it, I don't think it makes sense for this task.
If you simply want to add an empty Delta column to your input CSV, then you can do this:
let df : Frame<int, _> = Frame.ReadCsv("C:/temp/test-input.csv", indexCol="ID")
df.AddColumn("Delta", [])
df.SaveCsv("C:/temp/test-output.csv", ["ID"])
This does almost everything you need - it writes the ID column and the extra Delta column.
The only caveat is that it does not add the extra quotes around the data. This is not required by the CSV specification unless you need to escape a comma in a column and I don't think there is an easy way to get Deedle to do this.
So I think you'd have to write your own CSV writer. The following shows how to do this, but it does not correctly escape quotes and commas (which is why you should use SaveCsv even if it does not put in the quotes when they're not needed):
use writer = new StreamWriter("C:/temp/test-output.csv")
writer.WriteLine("ID,Alpha,Beta,Gamma,Delta")
for key, row in Series.observations df.Rows do
    writer.Write(key)
    for value in Series.valuesAll row do
        writer.Write(",")
        writer.Write(sprintf "\"%O\"" (if value.IsSome then value.Value else box ""))
    writer.WriteLine()
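The missing escaping can be handled with a small helper. The sketch below follows the common CSV convention of wrapping a field in double quotes and doubling any quote characters inside it; the names `quoteCsvField` and `csvRow` are made up for this example and are not part of Deedle.

```fsharp
// Wrap a value in double quotes and double any embedded quote characters,
// so that commas and quotes inside the value survive a round trip.
let quoteCsvField (value: obj) =
    let text = if isNull value then "" else string value
    "\"" + text.Replace("\"", "\"\"") + "\""

// Join already-boxed values into one CSV line
let csvRow (fields: seq<obj>) =
    fields |> Seq.map quoteCsvField |> String.concat ","

// Example: one row containing a number, embedded quotes and a comma
let line = csvRow [ box 1; box "say \"hi\""; box "a,b" ]
printfn "%s" line
```

In the loop above, the `sprintf "\"%O\""` call could then be replaced by `quoteCsvField` applied to the same value.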
You can find an example of writing to CSV in the source of the library (it uses FrameData there).
After adding a wrapper:
type FrameData with
    member frameData.SaveCsv(path:string, ?includeRowKeys, ?keyNames, ?separator, ?culture) =
        use writer = new StreamWriter(path)
        writeCsv writer (Some path) separator culture includeRowKeys keyNames frameData
you could write like this:
dff.SaveCsv outputCsv
I'm writing a small console application in F#.
[<EntryPoint>]
let main argv =
    high_lvl_funcs.print_opt
    let opt = Console.ReadLine()
    match opt with
    | "0" -> printfn "%A" (high_lvl_funcs.calculate_NDL)
    | "1" -> printfn ("not implemented yet")
    | _ -> printfn "%A is not an option" opt
    0 // an entry point has to return an exit code
from module high_lvl_funcs
let print_opt =
    let options = [|"NDL"; "Deco"|]
    printfn "Enter the number of the option you want"
    Array.iteri (fun i x -> printfn "%A: %A" i x) options
let calculate_NDL =
    printfn ("enter Depth in m")
    let depth = lfuncs.m_to_absolute(float (Console.ReadLine()))
    printfn ("enter amount of N2 in gas (assuming o2 is the rest)")
    let fn2 = float (Console.ReadLine())
    let table = lfuncs.read_table
    let tissue = lfuncs.create_initialise_Tissues ATM WATERVAPOUR
    lfuncs.calc_NDL depth fn2 table lfuncs.loading_constantpressure tissue 0.0
lfuncs.calc_NDL returns a float
this produces this
Enter the number of the option you want
0: "NDL"
1: "Deco"
enter Depth in m
which means it prints what it's supposed to, then jumps straight to high_lvl_funcs.calculate_NDL
I wanted it to produce
Enter the number of the option you want
0: "NDL"
1: "Deco"
then let's assume 0 is entered, and then calculate high_lvl_funcs.calculate_NDL
After some thinking and searching, I assume this is because F# wants to assign all values before it starts the rest. Then I thought that I need to declare a variable without assigning it, but people seem to agree that this is bad in functional programming. From another question: Declaring a variable without assigning
So my question is: is it possible to rewrite the code such that I get the flow I want and avoid declaring variables without assigning them?
You can fix this by making calculate_NDL a function that takes unit, instead of a value whose body is evaluated once, as soon as the binding is initialised:
let calculate_NDL () =
Then call it as a function in your match like this:
match opt with
| "0" -> printfn "%A" (high_lvl_funcs.calculate_NDL())
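The difference between a plain value and a function taking unit can be seen in isolation with a small sketch. The names `eagerAnswer`, `deferredAnswer` and `log` are invented for this illustration.

```fsharp
// Records the order in which bodies are evaluated
let log = ResizeArray<string>()

// A value: the right-hand side runs once, as soon as the binding is initialised
let eagerAnswer =
    log.Add "eagerAnswer evaluated"
    42

// A function taking unit: the body runs only when it is applied to ()
let deferredAnswer () =
    log.Add "deferredAnswer evaluated"
    42

// At this point only the value's body has run
printfn "%A" (List.ofSeq log)

// Each application runs the body again
let first = deferredAnswer ()
let second = deferredAnswer ()
printfn "%A" (List.ofSeq log)
```

This is the same effect as in the question: the prompts inside calculate_NDL are printed when the value is initialised, not when the match branch is reached.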
However I'd suggest refactoring this code so that calculate_NDL takes any necessary inputs as arguments rather than reading them from the console i.e. read the inputs from the console separately and pass them to calculate_NDL.
let calculate_NDL depth fn2 =
    let absDepth = lfuncs.m_to_absolute(depth)
    let table = lfuncs.read_table
    let tissue = lfuncs.create_initialise_Tissues ATM WATERVAPOUR
    lfuncs.calc_NDL absDepth fn2 table lfuncs.loading_constantpressure tissue 0.0
It's generally a good idea to write as much code as possible as pure functions that don't rely on I/O (like reading from stdin).
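One way to arrange this is a pure core with a thin I/O shell around it. In the sketch below, `calculate` is only a placeholder standing in for the real `lfuncs.calc_NDL` call, and `tryParseFloat` and `run` are hypothetical names; passing the line reader in as an argument is an assumption made so that the flow can be exercised without a console.

```fsharp
open System
open System.Globalization

// Pure: turns raw text into a number, no console access
let tryParseFloat (text: string) =
    match Double.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture) with
    | true, value -> Some value
    | _ -> None

// Pure placeholder for the real calculation (hypothetical; not the actual NDL formula)
let calculate (depth: float) (fn2: float) = depth * fn2

// Impure shell: all prompting and reading happens here.
// `readLine` is a parameter so a test can supply canned input.
let run (readLine: unit -> string) =
    printfn "enter Depth in m"
    let depth = tryParseFloat (readLine ())
    printfn "enter amount of N2 in gas (assuming o2 is the rest)"
    let fn2 = tryParseFloat (readLine ())
    match depth, fn2 with
    | Some d, Some f -> Some (calculate d f)
    | _ -> None
```

In the console application this would be called as `run Console.ReadLine`; invalid input then yields `None` instead of an exception from `float`.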
Please consider this dataset, composed of men and women, which I later filter according to a few variables:
type ls = JsonProvider<"...">
let dt = ls.GetSamples()
let dt2 =
    dt |> Seq.filter (fun c -> c.Sex = "male" && c.Height > Some 150)
dt2
[{"sex":"male","height":180,"weight":85},
{"sex":"male","height":160,"weight":60},
{"sex":"male","height":180,"weight":85}]
Let's suppose that I would like to add a fourth key "body mass index" or "bmi", and that its value is roughly given by "weight"/"height". Hence I expect:
[{"sex":"male","height":180,"weight":85, "bmi":(180/85)},
{"sex":"male","height":160,"weight":60, "bmi":(160/60)},
{"sex":"male","height":180,"weight":85, "bmi":(180/85)}]
I thought that map.Add may help.
let dt3 = dt2.Add("bmi", (dt2.Height/dt2.Weight))
Unfortunately, it returns an error:
error FS0039: The field, constructor or member 'Add' is not defined
I am sure there are further errors in my code, but without this function I cannot actually look for them. Am I, at least, approaching the problem correctly?
Creating modified versions of the JSON is sadly one thing that the F# Data type provider does not make particularly easy. What makes that hard is the fact that we can infer the type from the source JSON, but we cannot "predict" what kind of fields people might want to add.
To do this, you'll need to access the underlying representation of the JSON value and operate on that. For example:
type ls = JsonProvider<"""
  [{"sex":"male","height":180,"weight":85},
   {"sex":"male","height":160,"weight":60},
   {"sex":"male","height":180,"weight":85}]""">
let dt = ls.GetSamples()
let newJson =
    dt
    |> Array.map (fun recd ->
        // To do the calculation, you can access the fields via inferred types
        let bmi = float recd.Height / float recd.Weight
        // But now we need to look at the underlying value, check that it is
        // a record and extract the properties, which is an array of key-value pairs
        match recd.JsonValue with
        | JsonValue.Record props ->
            // Append the new property to the existing properties & re-create record
            Array.append [| "bmi", JsonValue.Float bmi |] props
            |> JsonValue.Record
        | _ -> failwith "Unexpected format" )
// Re-create a new JSON array and format it as JSON
JsonValue.Array(newJson).ToString()
Hi I'm looking to find the best way to read in a fixed width text file using F#. The file will be plain text, from one to a couple of thousand lines long and around 1000 characters wide. Each line contains around 50 fields, each with varying lengths. My initial thoughts were to have something like the following
type MyRecord = {
    Name : string
    Address : string
    Postcode : string
    Tel : string
}
let format = [
    (0,10)
    (10,40)
    (50,7)
    (57,20)
]
and read each line one by one, assigning each field by the format tuple (where the first item is the start character and the second is the number of characters wide).
Any pointers would be appreciated.
The hardest part is probably to split a single line according to the column format. It can be done something like this:
let splitLine format (line : string) =
    format |> List.map (fun (index, length) -> line.Substring(index, length))
This function has the type (int * int) list -> string -> string list. In other words, format is an (int * int) list. This corresponds exactly to your format list. The line argument is a string, and the function returns a string list.
You can map a list of lines like this:
let result = lines |> List.map (splitLine format)
You can also use Seq.map or Array.map, depending on how lines is defined. Such a result will be a string list list, and you can now map over such a list to produce a MyRecord list.
You can use File.ReadLines to get a lazily evaluated sequence of strings from a file.
Please note that the above is only an outline of a possible solution. I left out boundary checks, error handling, and such. The above code may contain off-by-one errors.
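To make the outline a little more concrete, here is a sketch that adds a simple boundary check and maps the split fields to records. The field widths are illustrative, the helper names (`field`, `toRecord`, `parseLines`) are invented for this example, and silently dropping rows with an unexpected field count is just one possible policy.

```fsharp
type MyRecord =
    { Name : string
      Address : string
      Postcode : string
      Tel : string }

// (start index, width) per field; the numbers are illustrative
let format = [ (0, 10); (10, 40); (50, 7); (57, 20) ]

// Cut one field out of the line, tolerating lines that are too short
let field (line : string) (start, width) =
    if start >= line.Length then ""
    else line.Substring(start, min width (line.Length - start)).Trim()

let splitLine format (line : string) =
    format |> List.map (field line)

// Map the split fields to a record; None if the field count is unexpected
let toRecord fields =
    match fields with
    | [ name; address; postcode; tel ] ->
        Some { Name = name; Address = address; Postcode = postcode; Tel = tel }
    | _ -> None

let parseLines format (lines : seq<string>) =
    lines
    |> Seq.map (splitLine format)
    |> Seq.choose toRecord
    |> List.ofSeq
```

With a real file, `System.IO.File.ReadLines path` can be passed as `lines`. A 50-field record would need a 50-element pattern in `toRecord`, which is where this approach starts to become unwieldy.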
Here's a solution with a focus on custom validation and error handling for each field. This might be overkill for a data file consisting of just numeric data!
First, for these kinds of things, I like to use the parser in Microsoft.VisualBasic.dll as it's already available without using NuGet.
For each row, we can return the array of fields, and the line number (for error reporting)
#r "Microsoft.VisualBasic.dll"
// for each row, return the line number and the fields
let parserReadAllFields fieldWidths textReader =
    let parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader=textReader)
    parser.SetFieldWidths fieldWidths
    parser.TextFieldType <- Microsoft.VisualBasic.FileIO.FieldType.FixedWidth
    seq {while not parser.EndOfData do
            yield parser.LineNumber,parser.ReadFields() }
Next, we need a little error handling library (see http://fsharpforfunandprofit.com/rop/ for more)
type Result<'a> =
    | Success of 'a
    | Failure of string list
module Result =
    let succeedR x =
        Success x
    let failR err =
        Failure [err]
    let mapR f xR =
        match xR with
        | Success a -> Success (f a)
        | Failure errs -> Failure errs
    let applyR fR xR =
        match fR,xR with
        | Success f,Success x -> Success (f x)
        | Failure errs,Success _ -> Failure errs
        | Success _,Failure errs -> Failure errs
        | Failure errs1, Failure errs2 -> Failure (errs1 @ errs2)
Then define your domain model. In this case, it is the record type with a field for each field in the file.
type MyRecord =
    {id:int; name:string; description:string}
And then you can define your domain-specific parsing code. For each field I have created a validation function (validateId, validateName, etc).
Fields that don't need validation can pass through the raw data (validateDescription).
In fieldsToRecord the various fields are combined using applicative style (<!> and <*>).
For more on this, see http://fsharpforfunandprofit.com/posts/elevated-world-3/#validation.
Finally, readRecords maps each input row to a record Result and chooses the successful ones only. The failed ones are written to a log in handleResult.
module MyFileParser =
    open Result
    let createRecord id name description =
        {id=id; name=name; description=description}
    let validateId (lineNo:int64) (fields:string[]) =
        let rawId = fields.[0]
        match System.Int32.TryParse(rawId) with
        | true, id -> succeedR id
        | false, _ -> failR (sprintf "[%i] Can't parse id '%s'" lineNo rawId)
    let validateName (lineNo:int64) (fields:string[]) =
        let rawName = fields.[1]
        if System.String.IsNullOrWhiteSpace rawName then
            failR (sprintf "[%i] Name cannot be blank" lineNo )
        else
            succeedR rawName
    let validateDescription (lineNo:int64) (fields:string[]) =
        let rawDescription = fields.[2]
        succeedR rawDescription // no validation
    let fieldsToRecord (lineNo,fields) =
        let (<!>) = mapR
        let (<*>) = applyR
        let validatedId = validateId lineNo fields
        let validatedName = validateName lineNo fields
        let validatedDescription = validateDescription lineNo fields
        createRecord <!> validatedId <*> validatedName <*> validatedDescription
    /// print any errors and only return good results
    let handleResult result =
        match result with
        | Success record -> Some record
        | Failure errs -> printfn "ERRORS %A" errs; None
    /// return a sequence of records
    let readRecords parserOutput =
        parserOutput
        |> Seq.map fieldsToRecord
        |> Seq.choose handleResult
Here's an example of the parsing in practice:
// Set up some sample text
let text = """01name1description1
02name2description2
xxname3badid-------
yy badidandname
"""
// create a low-level parser
let textReader = new System.IO.StringReader(text)
let fieldWidths = [| 2; 5; 11 |]
let parserOutput = parserReadAllFields fieldWidths textReader
// convert to records in my domain
let records =
    parserOutput
    |> MyFileParser.readRecords
    |> Seq.iter (printfn "RECORD %A") // print each record
The output will look like:
RECORD {id = 1;
name = "name1";
description = "description";}
RECORD {id = 2;
name = "name2";
description = "description";}
ERRORS ["[3] Can't parse id 'xx'"]
ERRORS ["[4] Can't parse id 'yy'"; "[4] Name cannot be blank"]
By no means is this the most efficient way to parse a file (I think there are some CSV parsing libraries available on NuGet that can do validation while parsing) but it does show how you can have complete control over validation and error handling if you need it.
A record of 50 fields is a bit unwieldy, therefore alternate approaches which allow dynamic generation of the data structure may be preferable (eg. System.Data.DataRow).
If it has to be a record anyway, you could spare at least the manual assignment to each record field and populate it with the help of Reflection instead. This trick relies on the field order as they are defined. I am assuming that every column of fixed width represents a record field, so that start indices are implied.
open Microsoft.FSharp.Reflection
type MyRecord = {
    Name : string
    Address : string
    City : string
    Postcode : string
    Tel : string } with
    static member CreateFromFixedWidth format (line : string) =
        let fields =
            format
            |> List.fold (fun (index, acc) length ->
                let str = line.[index .. index + length - 1].Trim()
                index + length, box str :: acc )
                (0, [])
            |> snd
            |> List.rev
            |> List.toArray
        FSharpValue.MakeRecord(
            typeof<MyRecord>,
            fields ) :?> MyRecord
Example data:
"Postman Pat     " +
"Farringdon Road " +
"London          " +
"EC1A 1BB" +
"+44 20 7946 0813"
|> MyRecord.CreateFromFixedWidth [16; 16; 16; 8; 16]
// val it : MyRecord = {Name = "Postman Pat";
// Address = "Farringdon Road";
// City = "London";
// Postcode = "EC1A 1BB";
// Tel = "+44 20 7946 0813";}
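Because the trick depends on the widths lining up with the fields in declaration order, it can help to check that up front. The sketch below uses `FSharpType.GetRecordFields` for this; the `Person` type and the `fieldNames` and `checkWidths` helpers are hypothetical and only serve the illustration.

```fsharp
open Microsoft.FSharp.Reflection

// Hypothetical two-field record for the example
type Person = { Name : string; City : string }

// Names of the record fields, in declaration order
let fieldNames (recordType : System.Type) =
    FSharpType.GetRecordFields(recordType)
    |> Array.map (fun p -> p.Name)
    |> List.ofArray

// Pair each field with its width, or report a mismatch before any parsing happens
let checkWidths<'T> (widths : int list) =
    let names = fieldNames typeof<'T>
    if List.length names = List.length widths then
        Ok (List.zip names widths)
    else
        Error (sprintf "%d widths given for %d fields" widths.Length names.Length)

let good = checkWidths<Person> [ 16; 16 ]
let bad = checkWidths<Person> [ 16 ]
```

Running such a check once at start-up gives a clearer error than the exception `FSharpValue.MakeRecord` would raise later for a wrong number of values.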
What would be the most efficient way in F# to remove items in one list based on items in another list?
example:
seq1 = ["blue"; "green"; "red"; "green" ...]
seq2 = ["soda"; "green"; "pop" ...]
seq1 has 50,000 items initially
seq2 has 12 and continues to grow in size over time
What I want to do is remove ALL instances of seq1 if that value is in seq2
I have the following code which is as slow as I can make it - not what I want.
let result = seq1 |> Seq.filter(fun a -> (Seq.exists(fun name -> name = a) seq2) = false)
I am trying to find the fastest way to do this functionally (no loops, etc.)
Thanks :-)
If seq1 is relatively long and seq2 is relatively short, then you can create a set from the elements of seq2 and then use the Contains method of the set to check if it contains the specified element. Lookup in a set is much faster than lookup in a sequence using Seq.exists.
I was testing this using a simple script based on your numbers:
#time
let seq1 = Array.init 50000 (fun i -> ["blue"; "green"; "red"].[i%3])
let seq2 = Array.init 12 (fun i -> [ "soda"; "green"; "pop"].[i%3])
Now, here are a few options (I wrap them in for i in 1 .. 10 do to get more reasonable numbers and then divided this by 10):
// 15ms - this is the original version, but I added `Array.ofSeq` to materialize it
let result = seq1 |> Seq.filter(fun a ->
    (Seq.exists(fun name -> name = a) seq2) = false) |> Array.ofSeq
// 12ms - this is using `Array.filter` directly, which turns out to be as slow
let result = seq1 |> Array.filter(fun a ->
    (Seq.exists(fun name -> name = a) seq2) = false)
// 2ms - using `set.Contains` is much faster, even when we create the set each time
let l = set seq2
let result = seq1 |> Array.filter(fun a -> l.Contains a = false)
Note that I did not push the set seq2 call out of the loop - if you do that, it is even faster (you only need to create the set when changing seq2 and then you can keep it).
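A sketch of that last point, with the set built once and extended as seq2 grows. The helper name `removeExcluded` and the sample data are assumptions for the example; no timings are claimed for it.

```fsharp
let colours = [| "blue"; "green"; "red" |]
let seq1 = Array.init 50000 (fun i -> colours.[i % 3])
let seq2 = [ "soda"; "green"; "pop" ]

// Build the lookup once, outside any loop
let excluded = Set.ofList seq2

// Keep only the items that are not in the excluded set
let removeExcluded (excluded : Set<string>) (items : string[]) =
    items |> Array.filter (fun item -> not (excluded.Contains item))

let result = removeExcluded excluded seq1

// When seq2 grows, extend the existing (immutable) set instead of rebuilding it
let excluded' = excluded.Add "red"
let result' = removeExcluded excluded' seq1
```

F#'s `Set` is an immutable tree-based set, so `Add` returns a new set and leaves the old one usable. A mutable `System.Collections.Generic.HashSet<string>` would be another option if in-place updates are acceptable.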