How to get .xls data line by line using NPOI in F#

I am trying to parse a .xls file and I need to gather all of the data line by line. I am able to call an individual cell.
let example = sheet1.GetRow(5).GetCell(1) |> string
I am trying to figure out how to use a recursive function to get data from every line from row 5 until the end of the Excel sheet, stopping as soon as I hit a row with no values. How would I do that? I come from a Python background and thinking in a functional language has been a little bit challenging.
Thanks in advance for the help.
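For reference, here is a minimal sketch of the recursive approach described above using NPOI directly (the answer below switches to the ExcelProvider type provider instead). The file name, the sheet index, and the stop-at-the-first-empty-row check are assumptions:
open System.IO
open NPOI.HSSF.UserModel   // HSSFWorkbook reads the .xls format
open NPOI.SS.UserModel

// Hypothetical workbook; adjust the path and sheet index to your file.
let workbook = new HSSFWorkbook(new FileStream("SomeFile.xls", FileMode.Open, FileAccess.Read))
let sheet1 = workbook.GetSheetAt(0)

// Recursively collect rows (as lists of cell strings) starting at rowIndex,
// stopping at the first missing or completely empty row.
let rec collectRows (sheet: ISheet) rowIndex acc =
    let row = sheet.GetRow(rowIndex)
    if isNull row || row.Cells |> Seq.forall (fun c -> System.String.IsNullOrWhiteSpace(c.ToString())) then
        List.rev acc
    else
        let values = row.Cells |> Seq.map string |> List.ofSeq
        collectRows sheet (rowIndex + 1) (values :: acc)

let allRows = collectRows sheet1 5 []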

You can use the ExcelProvider type provider to read both .xls and .xlsx files and treat the data as a sequence. You can skip the first 5 rows and then access the rest the same way you would any sequence, using the header names as field names. You can use Seq.takeWhile to stop reading rows when, e.g., the first cell is empty:
type TheFile = ExcelFile<"SomeFile.xls">

let file = new TheFile()
let rows =
    file.Data
    |> Seq.skip 5
    |> Seq.takeWhile (fun row -> not (String.IsNullOrWhiteSpace row.SomeName))
    |> Seq.map (fun row -> row.OrderTotal)
    // ...
Fields can be accessed by position as well

Related

Parsing awkward CSV file with a dynamic number of columns gives error

I'm a C# developer and this is my first attempt at writing F#.
I'm trying to read a Dashlane exported database in the CSV format. These files have no headers and a dynamic number of columns for each possible type of entry. The following file is an example of dummy data that I use to test my software. It only contains password entries and yet they have between 5 and 7 columns (I'll decide how to handle other types of data later)
The first line of the exported file (in this case, but not always) is the email address that was used to create the dashlane account which makes this line only one column wide.
"accountCreation#email.fr"
"Nom0","siteweb0","Identifiant0","",""
"Nom1","siteweb1","identifiant1","email1#email.email","",""
"Nom2","siteweb2","email2#email.email","",""
"Nom3","siteweb3","Identifiant3","password3",""
"Nom4","siteweb4","Identifiant4","email4#email.email","password4",""
"Nom5","siteweb5","Identifiant5","email5#email.email","SecondIdentifiant5","password5",""
"Nom6","siteweb6","Identifiant6","email6#email.email","SecondIdentifiant6","password6","this is a single-line note"
"Nom7","siteweb7","Identifiant7","email7#email.email","SecondIdentifiant7","password7","this is a
multi
line note"
"Nom8","siteweb8","Identifiant8","email8#email.email","SecondIdentifiant8","password8","single line note"
I'm trying to print the first column of each row to the console as a start
let rawCsv = CsvFile.Load("path\to\file.csv", ",", '"', false)
for row in rawCsv.Rows do
    printfn "value %s" row.[0]
This code gives me the following error on the for line:
Couldn't parse row 2 according to schema: Expected 1 columns, got 5
I haven't given the CsvFile any schema, and I couldn't find on the internet how to specify one.
I would be able to remove the first line dynamically if I wanted to but it wouldn't change anything since the other lines have different column counts too.
Is there any way to parse this awkward CSV file in F#?
Note: for each password row, only the column right before the last one matters to me (the password column).
I do not think that a CSV file with as irregular a structure as yours is a good candidate for processing with the CSV Type Provider or CSV Parser.
At the same time, it does not seem difficult to parse this file to your liking with a few lines of custom logic. The following snippet:
open System
open System.IO
File.ReadAllLines("Sample.csv") // Get data
|> Array.filter(fun x -> x.StartsWith("\"Nom")) // Only lines starting with "Nom may contain password
|> Array.map (fun x -> x.Split(',') |> Array.map (fun x -> x.[1..(x.Length-2)])) // Split each line into "cells"
|> Array.filter(fun x -> x.[x.Length-2] |> String.IsNullOrEmpty |> not) // Take only those having non-empty cell before the last one
|> Array.map (fun x -> x.[0],x.[x.Length-2]) // show the line key and the password
after parsing your sample file produces
>
val it : (string * string) [] =
  [|("Nom3", "password3"); ("Nom4", "password4"); ("Nom5", "password5");
    ("Nom6", "password6"); ("Nom7", "password7"); ("Nom8", "password8")|]
>
It may be a good starting point for further improving the parsing logic to perfection.
I propose reading the CSV file as a text file: read it line by line to form a list and then parse each line with CsvFile.Parse. The problem is that the parsed elements end up in Headers (which is of type string [] option) rather than in Rows.
open FSharp.Data
open System.IO
let readLines (filePath: string) = seq {
    use sr = new StreamReader(filePath)
    while not sr.EndOfStream do
        yield sr.ReadLine ()
}

[<EntryPoint>]
let main argv =
    let lines = readLines "c:\path_to_file\example.csv"
    let rows = List.map (fun str -> CsvFile.Parse(str)) (Seq.toList lines)
    for row in List.toArray(rows) do
        printfn "New Line"
        if row.Headers.IsSome then
            for r in row.Headers.Value do
                printfn "value %s" r
    printfn "%A" argv
    0 // return an integer exit code
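The Headers-vs-Rows behaviour most likely comes from CsvFile.Parse assuming a header row by default. A hedged sketch of the same loop, reusing the readLines helper above and passing the optional hasHeaders argument as false so each parsed line lands in Rows:
// hasHeaders = false stops CsvFile.Parse from treating each single line as a
// header row, so the values end up in Rows; row.[0] is then the first column.
let parsedRows =
    readLines "c:\path_to_file\example.csv"
    |> Seq.map (fun str -> CsvFile.Parse(str, hasHeaders = false))
    |> Seq.collect (fun csv -> csv.Rows)

for row in parsedRows do
    printfn "value %s" row.[0]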

CsvProvider throws OutOfMemoryException

FAOCropsLivestock.csv contains more than 14 million rows. In my .fs file I have declared
type FAO = CsvProvider<"c:\FAOCropsLivestock.csv">
and tried to work with the following code
FAO.GetSample().Rows.Where(fun x -> x.Country = country) |> ....
FAO.GetSample().Filter(fun x -> x.Country = country) |> ....
In both cases, an exception was thrown.
I have also tried the following code after loading the CSV file into MS SQL Server:
type Schema = SqlDataConnection<conStr>
let db = Schema.GetDataContext()
db.FAOCropsLivestock.Where(fun x-> x.Country = country) |> ....
It works. It also works if I issue a query over an OleDb connection, but it is slow.
How can I get a sequence out of it using CsvProvider?
If you refer to the bottom of the CSV Type Provider documentation, you will see a section on handling large datasets. As explained there, you can set CacheRows = false so that rows are not cached in memory as they are read.
type FAO = CsvProvider<"c:\FAOCropsLivestock.csv", CacheRows = false>
You can then use standard sequence operations over the rows of the CSV without loading the entire file into memory, e.g.
FAO.GetSample().Rows |> Seq.filter (fun x -> x.Country = country) |> ....
You should, however, take care to only enumerate the contents once.
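Because the rows are streamed rather than cached, enumerating them twice would re-read (or fail on) the underlying file. A minimal sketch of one way to stay safe, using the same hypothetical country filter as above: filter while streaming, then materialize only the small result:
// Filter while streaming, then materialize the filtered subset so it can be
// enumerated as many times as needed without re-reading the 14-million-row file.
let countryRows =
    FAO.GetSample().Rows
    |> Seq.filter (fun x -> x.Country = country)
    |> Seq.toList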

How to format strings to print in a file with F#

This code prints float numbers to the file in the format f,ffffff (with a comma) and all on one line, but I need them printed as f.ffffff (with a dot) and with a line break after each number, so each number has its own line. Any ideas on how to do it?
CODE EDITED
module writeFiles =
    open System.IO

    let (w: float[]) = [|-1.3231725; 1.052134922; 1.23082055; 1.457748868; -0.3481141253; -0.06886428466; -1.473392229; 0.1103078722; -1.047231857; -2.641890652; -1.335060286; -0.9839854216; 0.1844535984; 3.087001584; -0.008467130841; 1.175365466; 1.637297522; 5.557832631; -0.2906445452; -0.4052301538; 1.766454088; -2.604325471; -1.807107036; -2.471407376; -2.204730614;|]

    let write secfilePath =
        // Appends each number to the file in turn, all on a single line.
        for j in 0 .. 24 do
            let z = w.[j].ToString()
            File.AppendAllText(secfilePath, z)
            //File.AppendAllLines(secfilePath, z)
There are a couple of things that could be done better in your code:
You're opening the file over and over again every time you add a number
z does not need to be mutable
You can pass a format pattern and/or a culture to the ToString call
You can iterate over filterMod.y instead of using a for loop and array indexer access
I would probably go with something more like
module writeFiles =
    open System.Globalization
    open System.IO

    let write secfilePath =
        let data =
            filterMod.y
            |> Array.map (fun x -> x.ToString(CultureInfo.InvariantCulture))
        File.AppendAllLines(secfilePath, data)
It prepares an array of strings, where every number in filterMod.y is formatted using CultureInfo.InvariantCulture, which makes it use . as the decimal separator. It then uses AppendAllLines to write the whole array to the file at once, with every element on its own line.
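If you also want to fix the number of decimal places (the format pattern mentioned in the notes above), here is a hedged variant assuming six decimals is what is wanted:
// "F6" gives six digits after the decimal point; InvariantCulture keeps '.' as the separator.
let data6 =
    filterMod.y
    |> Array.map (fun x -> x.ToString("F6", CultureInfo.InvariantCulture))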

f# deedle filter data frame based on a list

I wanted to filter a Deedle data frame based on a list of values. How would I go about doing this?
I had an idea to use the following code below:
let d = df1 |> filterRowValues (fun row -> row.GetAs<float>("ts") = timex)
However, the issue with this is that it only matches a single value. I then thought of combining it with a for loop and an append function:
for i in 0 .. recd.Length - 1 do
    df2.Append(df1 |> filterRowValues (fun row -> row.GetAs<float>("ts") = recd.[i]))
This does not work either, however, and there must be a better way of doing this without using a for loop. In R I could, for instance, use %in%.
You can use the F# set type to create a set of the values that you are interested in. In the filtering, you can then check whether the set contains the actual value for the row.
For example, say that you have recd of type seq<float>. Then you should be able to write:
let recdSet = set recd

let d = df1 |> Frame.filterRowValues (fun row ->
    recdSet.Contains(row.GetAs<float>("ts")))
Some other things that might be useful:
You can replace row.GetAs<float>("ts") with just row?ts (which always returns float and only works when you have a fixed name, like "ts", but it makes the code nicer); a sketch of this variant follows these notes.
Comparing float values might not be the best thing to do (because of floating-point imprecision, this might not always work as expected).
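A minimal sketch of the row?ts variant mentioned above, under the same assumptions (a frame df1 with a ts column and the recdSet built earlier):
// Same filter using the dynamic ? operator; row?ts returns a float.
let d2 =
    df1
    |> Frame.filterRowValues (fun row -> recdSet.Contains(row?ts))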

File transform in F#

I am just starting to work with F# and trying to understand typical idioms and effective ways of thinking and working.
The task at hand is a simple transform of a tab-delimited file to one which is comma-delimited. A typical input line will look like:
let line = "#ES# 01/31/2006 13:31:00 1303.00 1303.00 1302.00 1302.00 2514 0"
I started out with looping code like this:
// inFile and outFile defined in preceding code not shown here
for line in File.ReadLines(inFile) do
    let typicalArray = line.Split '\t'
    let transformedLine = typicalArray |> String.concat ","
    outFile.WriteLine(transformedLine)
I then replaced the split/concat pair of operations with a single Regex.Replace():
for line in File.ReadLines(inFile) do
    let transformedLine = Regex.Replace(line, "\t", ",")
    outFile.WriteLine(transformedLine)
And now, finally, have replaced the looping with a pipeline:
File.ReadLines(inFile)
|> Seq.map (fun x -> Regex.Replace(x, "\t", ","))
|> Seq.iter (fun y -> outFile.WriteLine(y))
// other housekeeping code below here not shown
While all versions work, the final version seems to me the most intuitive. Is this how a more experienced F# programmer would accomplish this task?
I think all three versions are perfectly fine, idiomatic code that F# experts would write.
I generally prefer writing code using built-in language features (like for loops and if conditions) if they let me solve the problem I have. These are imperative, but I think using them is a good idea when the API requires imperative code (like outFile.WriteLine). As you mentioned - you started with this version (and I would do the same).
Using higher-order functions is nice too - although I would probably do that only if I wanted to write a data transformation and get a new sequence or list of lines - this would be handy if you were using File.WriteAllLines instead of writing lines one by one. That could also be done by simply wrapping your second version in a sequence expression:
let transformed =
    seq { for line in File.ReadLines(inFile) -> Regex.Replace(line, "\t", ",") }
File.WriteAllLines(outFilePath, transformed)
I do not think there is any objective reason to prefer one of the versions. My personal stylistic preference is to use for and refactor to sequence expressions (if needed), but others will likely disagree.
A side note: if you want to write to the same file that you are reading from, you need to remember that Seq does lazy evaluation.
Using Array as opposed to Seq makes sure the file is closed for reading by the time it is needed for writing.
This works:
let lines =
    file
    |> File.ReadAllLines
    |> Array.map (fun line -> ..modify line..)
File.WriteAllLines(file, lines)
This does not work (it causes a file access violation):
let lines =
    file
    |> File.ReadLines
    |> Seq.map (fun line -> ..modify line..)
File.WriteAllLines(file, lines)
(potential overlap with another discussion, where an intermediate variable helps with the same problem)
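A minimal sketch of that intermediate-variable idea, assuming the same tab-to-comma transform as in the question: keep the lazy pipeline but force evaluation before writing, so the read handle is released first:
open System.IO
open System.Text.RegularExpressions

// Seq.toArray forces the whole pipeline to run, which closes the file opened by
// File.ReadLines before File.WriteAllLines tries to open it for writing.
let lines =
    file
    |> File.ReadLines
    |> Seq.map (fun line -> Regex.Replace(line, "\t", ","))
    |> Seq.toArray
File.WriteAllLines(file, lines)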
