Working with large text files? - f#

I need to import a large text file (55MB) (525000 * 25), manipulate the data, and produce some output. As usual I started exploring with F# Interactive, and I get some really strange behaviours.
Is this file too large, or is my code wrong?
The first test was to import the file and simply compute the sum over one column (not the end goal, just a first test):
open System.IO
let calctest =
    let reader = new StreamReader(path)
    let csv = reader.ReadToEnd()
    csv.Split([|'\n'|])
    |> Seq.skip 1
    |> Seq.map (fun line -> line.Split([|','|]))
    |> Seq.filter (fun a -> a.[11] = "M")
    |> Seq.map (fun values -> float(values.[14]))
As expected, this produces a seq of float both in the type check and in interactive. If I now add:
    |> Seq.sum
Type check works and says this function should return a float but if I run it in interactive I get this error:
System.IndexOutOfRangeException: Index was outside the bounds of the array
Then I removed the last line again and thought I would look at the seq of float in a text file:
let writetest =
    let str = calctest |> Seq.map (fun i -> i.ToString())
    System.IO.File.WriteAllLines("test.txt", str )
Again, this passes the type check but throws errors in interactive.
Can the standard StreamReader not handle that amount of data? Or am I going wrong somewhere? Should I use a different function than StreamReader?
Thanks.

Seq is lazy, which means the mapping and filtering only actually run once you add Seq.sum; that's why you don't see the error before adding that line. Are you sure every row has at least 15 columns? That's probably the problem.
I would advise you to use the CSV Type Provider instead of just doing a string.Split; that way you can be sure not to get an accidental IndexOutOfRangeException, and comma escaping is handled correctly.
Additionally, you're reading the whole CSV file into memory by calling reader.ReadToEnd(). The CsvProvider supports streaming if you set the CacheRows parameter to false. That's not a problem with a 55MB file, but it might be with something much larger.
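To make the laziness point concrete, here is a minimal sketch on invented in-memory data (three columns instead of the asker's 25, so the column indices differ). It assumes one plausible cause of a short row: the file ends with a newline, so Split produces a final empty string. The real file may of course be short on some other row.

```fsharp
// Invented sample: header, two rows, and a trailing newline.
let csv = "id,flag,value\n1,M,2.5\n2,F,4.0\n"

// The trailing '\n' yields a final empty string in the array.
let lines = csv.Split([|'\n'|])

// Building the pipeline runs nothing: Seq.map and Seq.filter are lazy.
let values =
    lines
    |> Seq.skip 1
    |> Seq.map (fun line -> line.Split([|','|]))
    |> Seq.filter (fun a -> a.[1] = "M")
    |> Seq.map (fun a -> float a.[2])

// Forcing the sequence evaluates every row, including the empty last one,
// whose split result has a single element, so a.[1] is out of range.
let failed =
    try
        values |> Seq.sum |> ignore
        false
    with
    | :? System.IndexOutOfRangeException -> true

// Guarding on the column count skips short rows instead of throwing.
let safeSum =
    lines
    |> Seq.skip 1
    |> Seq.map (fun line -> line.Split([|','|]))
    |> Seq.filter (fun a -> a.Length > 2 && a.[1] = "M")
    |> Seq.map (fun a -> float a.[2])
    |> Seq.sum
```

The length guard is only a stopgap; it silently drops malformed rows, which may or may not be what you want.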

Related

formatting Composite function in f#

I have a recursive function in F# that iterates over a string[] of commands that need to be run. Each command runs a new command to generate a map that is passed to the next function.
The commands run correctly but are large and cumbersome to read. I believe there is a better way to order and format these composite functions using pipe syntax, but coming from C#, as a lot of us do, I cannot for the life of me get it to work.
My command is:
let rec iterateCommands (map:Map<int,string array>) commandPosition =
    if commandPosition < commands.Length then
        match splitCommand(commands.[0]).[0] with
        |"comOne" ->
            iterateCommands (map.Add(commandPosition,create(splitCommand commands.[commandPosition])))(commandPosition+1)
The closest I have managed is indenting the function call, but this is messy:
iterateCommands
    (map.Add
        (commandPosition,create
            (splitCommand commands.[commandPosition])
        )
    )
    (commandPosition+1)
Is it even possible to reformat this in F#? From what I have read I believe it is possible. Any help would be greatly appreciated.
The command/variable types are:
commandPosition - int
commands - string[]
splitCommand string -> string[]
create string[] -> string[]
map : Map<int,string[]>
and of course map.Add, which takes a map and returns that map plus the new entry
It's often hard to make out what is going on in a big statement with multiple inputs. I'd give names to the individual expressions, so that a reader can jump into any position and have a rough idea what's in the values used in a calculation, e.g.
let inCommands = splitCommand commands.[commandPosition]
let map' = map.Add (commandPosition, create inCommands)
iterateCommands map' (commandPosition + 1)
Since I don't know what is being done here, the names aren't very meaningful. Ideally, they'd help to understand the individual steps of the calculation.
It'd be a bit easier to compose the call if you changed the arguments around:
let rec iterateCommands commandPosition (map:Map<int,string array>) =
    // ...
That would enable you to write something like:
splitCommand commands.[commandPosition]
|> create
|> (fun x -> commandPosition, x)
|> map.Add
|> iterateCommands (commandPosition + 1)
The fact that commandPosition appears thrice in the composition is, in my opinion, a design smell, as is the fact that the type of this entire expression is unit. It doesn't look particularly functional, but since I don't understand exactly what this function attempts to do, I can't suggest a better design.
If you don't control iterateCommands, and hence can't change the order of arguments, you can always define a standard functional programming utility function:
let flip f x y = f y x
This enables you to write the following against the original version of iterateCommands:
splitCommand commands.[commandPosition]
|> create
|> (fun x -> commandPosition, x)
|> map.Add
|> (flip iterateCommands) (commandPosition + 1)
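As a self-contained illustration of the flip approach, the sketch below uses a hypothetical stand-in function, describe, with the same argument order as the original iterateCommands (map first, position second). The names describe and start and the sample values are invented for this example.

```fsharp
// Hypothetical stand-in: takes the map first and the position second,
// like the original iterateCommands.
let describe (map: Map<int, string>) (position: int) =
    sprintf "%d entries, next position %d" map.Count position

// Swaps the first two arguments of a curried function.
let flip f x y = f y x

let start : Map<int, string> = Map.empty

let result =
    "comOne"
    |> (fun x -> 0, x)      // build the (key, value) tuple
    |> start.Add            // a method accepts its tupled arguments through a pipe
    |> (flip describe) 1    // supply the position first, then pipe the map in
```

Function application binds tighter than |>, so (flip describe) 1 is evaluated first and yields a function that is still waiting for the map.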

Avoid mutation in this example in F#

Coming from an OO background, I am having trouble wrapping my head around how to solve simple issues with FP when trying to avoid mutation.
let mutable run = true
let player1List = ["he"; "ho"; "ha"]
let addValue lst value =
    value :: lst
while run do
    let input = Console.ReadLine()
    addValue player1List input |> printfn "%A"
    if player1List.Length > 5 then
        run <- false
        printfn "all done" // daz never gunna happen
I know it is ok to use mutation in certain cases, but I am trying to train myself to avoid mutation as the default. With that said, can someone please show me an example of the above w/o using mutation in F#?
The final result should be that player1List continues to grow until it contains 6 items; then the program exits and prints 'all done'.
The easiest way is to use recursion
open System
let rec makelist l =
    match l |> List.length with
    |6 -> printfn "all done"; l
    | _ -> makelist ((Console.ReadLine())::l)
makelist []
I also removed the addValue function, as it is far more idiomatic to just use :: in typical F# code.
Your original code also had a problem common among new F# coders: it used run = false where you wanted run <- false. In an F# expression, = is always a comparison; assignment to a mutable value uses <-. The compiler does actually warn about this.
As others already explained, you can rewrite imperative loops using recursion. This is useful because it is an approach that always works and is quite fundamental to functional programming.
Alternatively, F# provides a rich set of library functions for working with collections, which can actually nicely express the logic that you need. So, you could write something like:
open System
let player1List = ["he"; "ho"; "ha"]
let player2List = Seq.initInfinite (fun _ -> Console.ReadLine())
let listOf6 = Seq.append player1List player2List |> Seq.take 6 |> List.ofSeq
The idea here is that you create an infinite lazy sequence that reads inputs from the console, append it to the end of your initial player1List, and then take the first 6 elements.
Depending on what your actual logic is, you might do this a bit differently, but the nice thing is that this is probably closer to the logic that you want to implement...
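The same idea can be tried without a console. In the sketch below, a queue of canned strings stands in for Console.ReadLine(); cannedInput and readNext are names invented for this example.

```fsharp
open System.Collections.Generic

let player1List = ["he"; "ho"; "ha"]

// Assumption: canned strings replace interactive console input.
let cannedInput = Queue<string>(["a"; "b"; "c"; "d"; "e"])
let readNext () = cannedInput.Dequeue()

// Infinite lazy sequence: readNext is only called when an element is demanded.
let inputs = Seq.initInfinite (fun _ -> readNext ())

// The first three elements come from player1List, the rest from the input source.
let listOf6 = Seq.append player1List inputs |> Seq.take 6 |> List.ofSeq
```

Because the sequence is lazy, Seq.take 6 stops asking for input as soon as the list is full, so no mutable flag is needed to end the loop.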
In F#, we use recursion to loop. However, if you know how many times you need to iterate, you could use List.fold like this to hide the recursion:
[1..6] |> List.fold (fun acc _ -> Console.ReadLine()::acc) []
I would remove the pipe from match for readability but use it in the last expression to avoid extra brackets:
open System
let rec makelist l =
    match List.length l with
    | 6 -> printfn "all done"; l
    | _ -> Console.ReadLine()::l |> makelist
makelist []

Array indexer within choose() won't work

Please pardon my dust here while I am trying to learn F#.
I have a function that gives me a Seq of arrays read from a CSV file. Each element of those arrays represents one column's data.
let file = readFile("""C:\path\to\file.csv""")
The first column contains dates, which I am trying to fetch. Here is my code:
let dates =
    file
    |> Seq.skip(1)
    |> Seq.choose(fun x -> x.[0])
I am getting the following compile error
error FS0001: This expression was expected to have type 'a option
Am I using it wrong? When I hover the mouse over 'x', IntelliSense tells me x is of type string[].
What you actually wanted was
let dates =
    file
    |> Seq.skip(1)
    |> Seq.map(fun x -> x.[0])
Seq.choose filters as well as maps, but since you don't use the filtering, you only need map.
I got it fixed. Some() is what I wanted.
let dates =
    file
    |> Seq.skip(1)
    |> Seq.choose(fun x -> Some(x.[0]))
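For comparison, here is a small sketch of where Seq.choose earns its keep: the function returns None for rows to drop and Some for rows to keep. The sample rows are invented and stand in for whatever readFile returns.

```fsharp
open System

// Invented sample rows (header first) standing in for the parsed CSV.
let file : seq<string[]> =
    Seq.ofList
        [ [| "Date"; "Close" |]
          [| "2014-01-02"; "37.16" |]
          [| ""; "0" |]
          [| "2014-01-03"; "36.91" |] ]

// Seq.map: exactly one output per input row, blank cells included.
let allDates =
    file
    |> Seq.skip 1
    |> Seq.map (fun x -> x.[0])
    |> List.ofSeq

// Seq.choose: map and filter in one pass; rows yielding None are dropped.
let nonBlankDates =
    file
    |> Seq.skip 1
    |> Seq.choose (fun x ->
        if String.IsNullOrWhiteSpace x.[0] then None else Some x.[0])
    |> List.ofSeq
```

Wrapping every result in Some, as in the fix above, compiles but behaves exactly like Seq.map.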

Get a column by name as array from CsvFile.Load (or create dictionary of arrays from csv)

I have the following code to load a csv. What is the best way to get a column from "msft" (preferably by name) as an array? Or should I be loading the data in a different way to do this?
#r "FSharp.Data.dll"
open FSharp.Data.Csv
let msft = CsvFile.Load("http://ichart.finance.yahoo.com/table.csv?s=MSFT").Cache()
Edit: Alternatively, what would be an efficient way to import a CSV into a dictionary of arrays keyed by column name? If I should really be creating a new question for this, please let me know. I am not yet familiar with all Stack Overflow standards.
Building on Latkin's answer, this seems like the more functional or F# way of doing what you want.
let getVector columnAccessor msft =
    [| yield! msft.Data |> Seq.map columnAccessor |]
(* Now we can get the column all at once *)
let closes = getVector (fun x -> x.Close) msft
(* Or we can create an accessor and pipe our data to it. *)
let getCloses = getVector (fun x -> x.Close)
let closes = msft |> getCloses
I hope that this helps.
I went through this example as well. Something like the following should do it.
let data =
    msft.Data
    |> Seq.fold (fun acc row -> row.Date :: acc) List.empty<DateTime>
Here I am piping the msft.Data sequence of records into a fold that collects one field from each record into a list (in reverse order, since each item is prepended). Please check the documentation for all functions mentioned. I have not run this.
When you say you want the column "by name", it's not clear whether you mean "someone passes me the column name as a string" or "I use the column name in my code." Type providers are perfect for the latter case, but do not really help with the former.
For the latter case, you could use this:
let closes = [| yield! msft.Data |> Seq.map (fun x -> x.Close) |]
If the former, you might want to consider reading in the data some other way, perhaps to a dictionary keyed by column names.
The whole point of type providers is to make all of this strongly typed and code-focused, and to move away from passing column names as strings which might or might not be valid.
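For the string-keyed case, one rough sketch is to skip the type provider and build a map from column name to column values by hand. The sample lines below are invented, and the plain Split on commas is an assumption that no field contains quoted or escaped commas; a real CSV parser would be needed otherwise.

```fsharp
// Invented sample data; a real file could be read with System.IO.File.ReadAllLines.
let lines =
    [| "Date,Open,Close"
       "2014-01-02,37.35,37.16"
       "2014-01-03,37.20,36.91" |]

// Naive parsing: assumes no quoted fields or embedded commas.
let header = lines.[0].Split(',')
let rows = lines.[1..] |> Array.map (fun line -> line.Split(','))

// Map from column name to that column's raw string values.
let columns : Map<string, string[]> =
    header
    |> Array.mapi (fun i name -> name, rows |> Array.map (fun r -> r.[i]))
    |> Map.ofArray

// Look a column up by a name held in a string, then convert as needed.
let closes = columns.["Close"] |> Array.map float
```

This trades away the compile-time checking that the type provider gives you: a misspelled column name only fails at run time.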

Using Array.map omitting the first element of the array in F#

I have just started playing with F#, so this question will probably be quite basic...
I would like to read a text file line by line, and then ignore the first line and process the other lines using a given function. So I was thinking of using something along the lines of:
File.ReadAllLines(path)
|> Array.(ignore first element)
|> Array.map processLine
What would be an elegant yet efficient way to accomplish it?
There is no simple function to skip the first line in an array, because the operation is not efficient (it would have to copy the whole array), but you can do that easily if you use lazy sequences instead:
File.ReadAllLines(path)
|> Seq.skip 1
|> Seq.map processLine
If you need the result in an array (as opposed to seq<'T>, which is an F# alias for IEnumerable<'T>), then you can add Seq.toArray to the end. However, if you just want to iterate over the lines later on, then you can probably just use sequences.
This is an addition to Tomas' answer, which I generally agree with. One thing to watch is what happens if your array or sequence contains no lines. (Or fewer lines than you want to skip.) In that case, Seq.skip will throw an exception. The most concise way around this that I can think of is:
System.IO.File.ReadAllLines fileName
|> Seq.mapi (fun i elem -> i, elem)
|> Seq.choose (fun (i, elem) -> if i > 0 then Some(elem) else None)
You skip the first element in an F# array simply by myArray.[1..]. Gotta love how elegant this language is.
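The approaches from these answers can be compared side by side on in-memory arrays. In this sketch, processLine is a hypothetical placeholder for the asker's function, and skipFirst is a name invented here for the mapi/choose variant.

```fsharp
// Hypothetical line processor standing in for the asker's function.
let processLine (line: string) = line.ToUpperInvariant()

let lines = [| "header"; "a"; "b" |]
let noLines : string[] = [||]

// Lazy skip, converted back to an array at the end.
let viaSeq = lines |> Seq.skip 1 |> Seq.map processLine |> Seq.toArray

// Array slicing: copies everything from index 1 onwards.
let viaSlice = lines.[1..] |> Array.map processLine

// Index-based variant that tolerates an empty input.
let skipFirst (source: seq<'T>) =
    source
    |> Seq.mapi (fun i elem -> i, elem)
    |> Seq.choose (fun (i, elem) -> if i > 0 then Some elem else None)

let safeOnEmpty = noLines |> skipFirst |> Seq.map processLine |> Seq.toArray

// Seq.skip fails once the sequence is enumerated and runs out of elements.
let skipThrows =
    try
        noLines |> Seq.skip 1 |> Seq.toArray |> ignore
        false
    with
    | _ -> true
```

Which one fits depends on whether an empty file is a legitimate input or an error you would rather hear about.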

Resources