How to traverse String[][] in F#

Context: Microsoft Visual Studio 2015 Community; F#
I've been learning F# for about 1/2 a day. I do have a vague idea of how to do functional programming from a year spent fiddling with mLite.
The following script traverses a folder tree and pulls in log files. The entries in each file are delimited by ~, and a file may contain one or more entries.
open System
open System.IO
let files =
    System.IO.Directory.GetFiles("C:\\scratch\\snapshots\\", "*.log", SearchOption.AllDirectories)
let readFile (file: string) =
    //Console.WriteLine(file)
    let text = File.ReadAllText(file)
    text
let dataLines (line: string) =
    line.Split('~')
let data =
    files |> Array.map readFile |> Array.map dataLines
So at this point data contains a String[][] and I'm at a bit of a loss to figure out how to turn it into a String[], the idea being that I want to convert all the logs into one long vector so that I can do some other transformations on it. For example, each log line begins with a datetime so having turned it all into one long list I can then sort on the datetime.
Where to from here?

As stated in the comments, you can use Array.concat:
files |> Array.map readFile |> Array.map dataLines |> Array.concat
Now some refactoring: composing two maps is equivalent to a single map over the composition of the two functions.
files |> Array.map (readFile >> dataLines) |> Array.concat
Finally, map >> concat is equivalent to collect, so your code becomes:
files |> Array.collect (readFile >> dataLines)
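To check the equivalence without touching the file system, here is a minimal sketch that runs both forms over in-memory strings. The sample contents are made up, and the timestamp format is an assumption (the question only says each entry begins with a datetime); it is chosen so that a plain string sort also orders the entries by time.

```fsharp
// Made-up log contents standing in for the text returned by readFile.
let contents =
    [| "2015-03-02 10:00 second~2015-03-01 09:00 first"
       "2015-03-03 08:30 third" |]

let dataLines (line: string) = line.Split('~')

// map followed by concat ...
let viaConcat = contents |> Array.map dataLines |> Array.concat
// ... yields the same flat string[] as a single collect.
let viaCollect = contents |> Array.collect dataLines

// With year-first timestamps (an assumption), sorting the strings sorts by time.
let sorted = viaCollect |> Array.sort
```

If the real logs use a different datetime format, parsing the prefix with DateTime.Parse and using Array.sortBy on the parsed value would be the safer route.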

Related

Release the processed data in sequence

I am doing data processing with F#. First I get all the files in a directory, then I process each file to generate some data structure. Finally I store the processed data in SQLite. I know that if I use a Seq to hold the file names and pipe it forward to Seq.map, each file will be processed lazily. But what if there are so many files that holding all of them in memory is impossible? In an imperative language I could read one file, process it, store it, release the intermediate data, and move on to the next file. Of course F# can do imperative programming, but I want to know whether this can be done in a functional style.
files
|> Seq.map readFile
|> Seq.map processContent
|> Seq.map storeProcessResult
The code above shows what I have in mind. files contains a sequence of file names; I read the content of each file, process it into some structure, and finally store the result in the database. I know that because of the lazy behaviour, the files will be read and processed one by one. But when will the final data be released?
Obviously only you know what happens inside your readFile, processContent and storeProcessResult functions. As @FuleSnabel says in his comment, you can map and then use fold (recursion) to process the file.
Here's a simple test you can perform to see the difference in memory consumption: create a List of lists with 10 million elements and sum the list, then create a Seq of lists with 10 million elements, and sum the list. I'm using 64-bit FSI.
This will use about 1GB of memory:
let z = [for i in 1..3 -> List.init 10000000 (fun _ -> 1)]
let w = z |> List.map (fun x -> System.GC.Collect();List.sum x)
This will only use a few MB of memory, much less than even one list with 10 million 1s in it:
let x = seq {for i in 1..3 -> List.init 10000000 (fun _ -> 1 ) }
let y = x |> Seq.map (fun x -> System.GC.Collect(); List.sum x)
This is just one (and probably the easy) part of the workflow. If you're opening files, you have to make sure to close them as well, hence my suggestion of use above. However, I do recognize that accessing the filesystem and processing large amounts of data in a lazy sequence might cause some problems. In that case you can always profile it and see where the bottleneck is.
By the way, you don't need to call the GC directly in your code; I only did so to keep the intermediate results from polluting the memory count in the test.
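As a rough sketch of the whole pipeline, the snippet below processes files one at a time and closes each reader with use. The bodies of readFile, processContent and storeProcessResult are hypothetical stand-ins (the question does not show them), and a ResizeArray stands in for the SQLite table.

```fsharp
open System.IO

// Hypothetical stand-ins for the three functions named in the question.
let readFile (path: string) =
    use reader = new StreamReader(path)   // disposed when readFile returns
    reader.ReadToEnd()

let processContent (content: string) = content.Length

let stored = ResizeArray<int>()           // stands in for the database
let storeProcessResult (n: int) = stored.Add n

// Two small temp files so the sketch runs anywhere.
let files =
    [ "abc"; "hello" ]
    |> List.map (fun text ->
        let path = Path.GetTempFileName()
        File.WriteAllText(path, text)
        path)

files
|> Seq.map readFile
|> Seq.map processContent
|> Seq.iter storeProcessResult   // Seq.iter forces the pipeline, one file at a time

files |> List.iter File.Delete
```

Because the final step is Seq.iter rather than another Seq.map, nothing holds on to earlier elements, so each file's content becomes eligible for garbage collection once its result has been stored.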

Avoid mutation in this example in F#

Coming from an OO background, I am having trouble wrapping my head around how to solve simple issues with FP when trying to avoid mutation.
let mutable run = true
let player1List = ["he"; "ho"; "ha"]
let addValue lst value =
    value :: lst
while run do
    let input = Console.ReadLine()
    addValue player1List input |> printfn "%A"
    if player1List.Length > 5 then
        run <- false
        printfn "all done" // daz never gunna happen
I know it is ok to use mutation in certain cases, but I am trying to train myself to avoid mutation as the default. With that said, can someone please show me an example of the above w/o using mutation in F#?
The final result should be that player1List continues to grow until it contains 6 items, at which point the program exits and prints 'all done'.
The easiest way is to use recursion
open System
let rec makelist l =
    match l |> List.length with
    |6 -> printfn "all done"; l
    | _ -> makelist ((Console.ReadLine())::l)
makelist []
I also removed the addValue function, as it is far more idiomatic to just use :: in typical F# code.
Your original code also had a problem that is common among new F# coders: it used run = false where you wanted run <- false. In an F# expression, = is always a comparison. The compiler does actually warn about this.
As others already explained, you can rewrite imperative loops using recursion. This is useful because it is an approach that always works and is quite fundamental to functional programming.
Alternatively, F# provides a rich set of library functions for working with collections, which can actually nicely express the logic that you need. So, you could write something like:
let player1List = ["he"; "ho"; "ha"]
let player2List = Seq.initInfinite (fun _ -> Console.ReadLine())
let listOf6 = Seq.append player1List player2List |> Seq.take 6 |> List.ofSeq
The idea here is that you create an infinite lazy sequence that reads inputs from the console, append it to the end of your initial player1List, and then take the first 6 elements.
Depending on what your actual logic is, you might do this a bit differently, but the nice thing is that this is probably closer to the logic that you want to implement...
In F#, we use recursion to express loops. However, if you know how many times you need to iterate, you can use List.fold like this to hide the recursion.
[1..6] |> List.fold (fun acc _ -> Console.ReadLine()::acc) []
I would remove the pipe from match for readability but use it in the last expression to avoid extra brackets:
open System
let rec makelist l =
    match List.length l with
    | 6 -> printfn "all done"; l
    | _ -> Console.ReadLine()::l |> makelist
makelist []
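The recursive versions above read straight from the console, which makes them awkward to try out non-interactively. One possible variation, sketched here, passes the input source in as a function. The names fillTo and readLine are made up for this illustration, and the queue of canned strings stands in for Console.ReadLine.

```fsharp
// Hypothetical variant of makelist: the input source is a parameter,
// so the same recursion works with Console.ReadLine or with canned input.
let rec fillTo target (readLine: unit -> string) lst =
    if List.length lst >= target then lst
    else fillTo target readLine (readLine () :: lst)

// Canned input standing in for the console.
let inputs = System.Collections.Generic.Queue<string>([ "a"; "b"; "c"; "d" ])

// Start from the question's initial list and grow it to 6 items.
let result = fillTo 6 (fun () -> inputs.Dequeue()) [ "he"; "ho"; "ha" ]
// result = ["c"; "b"; "a"; "he"; "ho"; "ha"]; "d" is never read
```

In a real script the second argument would simply be Console.ReadLine, wrapped as (fun () -> Console.ReadLine()).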

Working with large text files?

I need to import a large text file (55MB) (525000 * 25), manipulate the data, and produce some output. As usual I started exploring with F# Interactive, and I get some really strange behaviours.
Is this file too large, or is my code wrong?
The first test was to import the file and simply compute the sum over one column (not the end goal, but a first test):
let calctest =
    let reader = new StreamReader(path)
    let csv = reader.ReadToEnd()
    csv.Split([|'\n'|])
    |> Seq.skip 1
    |> Seq.map (fun line -> line.Split([|','|]))
    |> Seq.filter (fun a -> a.[11] = "M")
    |> Seq.map (fun values -> float(values.[14]))
As expected, this produces a seq of float both in the type check and in interactive. If I now add:
|> Seq.sum
Type check works and says this function should return a float but if I run it in interactive I get this error:
System.IndexOutOfRangeException: Index was outside the bounds of the array
Then I removed the last line again and thought I would look at the seq of float in a text file:
let writetest =
    let str = calctest |> Seq.map (fun i -> i.ToString())
    System.IO.File.WriteAllLines("test.txt", str )
Again, this passes the type check but throws errors in interactive.
Can the standard StreamReader not handle that amount of data, or am I going wrong somewhere? Should I use a different function than StreamReader?
Thanks.
Seq is lazy, which means that the mapping and filtering only actually run once you add the Seq.sum. That's why you don't see the error before adding that line. Are you sure you have 15 columns on all rows? That's probably the problem.
I would advise you to use the CSV Type Provider instead of just doing a string.Split. That way you'll be sure not to get an accidental IndexOutOfRangeException, and you'll handle , escaping correctly.
Additionally, you're reading the whole CSV file into memory by calling reader.ReadToEnd(). The CsvProvider supports streaming if you set the Cache parameter to false. It's not a problem with a 55MB file, but it might be if you have something much larger.
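To illustrate the deferred failure without the original file, here is a small sketch over made-up in-memory data with three columns instead of 25. One likely culprit, though this is only a guess about the asker's file, is a trailing newline: splitting on '\n' then yields an empty last line, which splits into a single-element array.

```fsharp
// Made-up sample: a header, one good row, one short row, and a trailing newline.
let csv = "h0,h1,h2\na,M,1.5\nb,M\n"

// Building the pipeline runs nothing yet, so no exception is raised here.
let rows =
    csv.Split([| '\n' |])
    |> Seq.skip 1
    |> Seq.map (fun line -> line.Split([| ',' |]))

// Checking the column count first skips rows that are too short,
// including the empty line produced by the trailing newline.
let total =
    rows
    |> Seq.filter (fun a -> a.Length > 2 && a.[1] = "M")
    |> Seq.sumBy (fun a -> float a.[2])
```

Dropping the a.Length > 2 check makes the same pipeline throw IndexOutOfRangeException, but only once Seq.sumBy forces evaluation.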

How to evaluate only a part of a lazy sequence?

Lazy evaluation is a great boon for stuff like processing huge files that will not fit in main memory at one go. However, suppose there are some elements in the sequence that I want evaluated immediately, while the rest can be lazily computed - is there any way to specify that?
Specific problem: (in case that helps to answer the question)
Specifically, I am using a series of IEnumerables as iterators for multiple sequences - these sequences are data read from files opened using BinaryReader streams (each sequence is responsible for the reading in of data from one of the files). The MoveNext() on these is to be called in a specific order. Eg. iter0 then iter1 then iter5 then iter3 .... and so on. This order is specified in another sequence index = {0,1,5,3,....}. However sequences being lazy, the evaluation is naturally done only when required. Hence, the file reads (for the sequences right at the beginning that read from files on disk) happens as the IEnumerables for a sequence are moving. This is causing an illegal file access - a file that is being read by one process is accessed again (as per the error msg).
True, the illegal file access could be for other reasons, and after having tried my best to debug other causes a partially lazy evaluation might be worth trying out.
While I agree with Tomas' comment: you shouldn't need this if file sharing is handled properly, here's one way to eagerly evaluate the first N elements:
let cacheFirst n (items: seq<_>) =
    seq {
        use e = items.GetEnumerator()
        let i = ref 0
        yield!
            [
                while !i < n && e.MoveNext() do
                    yield e.Current
                    incr i
            ]
        while e.MoveNext() do
            yield e.Current
    }
Example
let items = Seq.initInfinite (fun i -> printfn "%d" i; i)
items
|> Seq.take 10
|> cacheFirst 5
|> Seq.take 3
|> Seq.toList
Output
0
1
2
3
4
val it : int list = [0; 1; 2]
Daniel's solution is sound, but I don't think we need another operator, just Seq.cache for most cases.
First cache your sequence:
let items = Seq.initInfinite (fun i -> printfn "%d" i; i) |> Seq.cache
Eager evaluation followed by lazy access from the beginning:
let eager = items |> Seq.take 5 |> Seq.toList
let cached = items |> Seq.take 3 |> Seq.toList
This evaluates the first 5 elements once (while computing eager) and keeps them cached for any later access.
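A quick way to confirm that the cached elements are not recomputed is to count generator calls instead of printing. This is a sketch of that check; the evaluations counter is added purely for illustration.

```fsharp
// Counts how many times the generator function actually runs.
let evaluations = ref 0

let items =
    Seq.initInfinite (fun i ->
        evaluations.Value <- evaluations.Value + 1
        i)
    |> Seq.cache

let eager = items |> Seq.take 5 |> Seq.toList    // forces elements 0..4
let cached = items |> Seq.take 3 |> Seq.toList   // served from the cache

// evaluations.Value is still 5 after the second traversal
```

Without the Seq.cache step, the second traversal would call the generator again and the counter would end at 8.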

Useful F# Scripts

I have been investigating the use of F# for development and have found (for my situations) building scripts to help me simplify some complex tasks is where I can get value from it (at the moment).
My most common complex task is concatenating files for many tasks (mostly SQL related).
I do this often and every time I try to improve on my F# script to do this.
This is my best effort so far:
open System.IO
let path = "C:\\FSharp\\"
let pattern = "*.txt"
let out_path = path + "concat.out"
File.Delete(out_path)
Directory.GetFiles(path, pattern)
|> Array.collect (fun file -> File.ReadAllLines(file))
|> (fun content -> File.WriteAllLines(out_path, content) )
I'm sure others have scripts which make their sometimes complex or boring tasks easier.
What F# scripts have you used to do this or what other purposes for F# scripts have you found useful?
I found the best way for me to improve my F# was to browse other scripts to get ideas on how to tackle specific situations. Hopefully this question will help me and others in the future. :)
I have found an article on generating F# scripts that may be of interest:
http://blogs.msdn.com/chrsmith/archive/2008/09/12/scripting-in-f.aspx
I use F# in a similar way when I need to quickly pre-process some data or convert data between various formats. F# has a great advantage that you can create higher-order functions for doing all sorts of similar tasks.
For example, I needed to load some data from SQL database and generate Matlab script files that load the data. I needed to do this for a couple of different SQL queries, so I wrote these two functions:
open System.Data
open System.Data.SqlClient
open System.IO
// Runs the specified query 'str' and reads results using 'f'
let query str f = seq {
    let conn = new SqlConnection("<conn.str>")
    let cmd = new SqlCommand(str, conn)
    conn.Open()
    use rdr = cmd.ExecuteReader(CommandBehavior.CloseConnection)
    while rdr.Read() do yield f(rdr) }
// Simple function to save all data to the specified file
let save file data =
    File.WriteAllLines(@"C:\...\" + file, data |> Array.ofSeq)
Now I could easily write specific calls to read the data I need, convert them to F# data types, do some pre-processing (if needed) and print the outputs to a file. For example, for processing companies I had something like:
let comps =
    query "SELECT [ID], [Name] FROM [Companies] ORDER BY [ID]"
          (fun rdr -> rdr.GetString(1) )
let cdata =
    seq { yield "function res = companies()"
          yield " res = {"
          for name in comps do yield sprintf " %s" name
          yield " };"
          yield "end" }
save "companies.m" cdata
Generating output as a sequence of strings is also pretty neat, though you could probably write a more efficient computation builder using StringBuilder.
Another example of using F# interactively is described in Chapter 13 of my functional programming book (you can get the source code here). It connects to the World Bank database (which contains a lot of information about various countries), extracts some data, explores the structure of the data, converts it to F# data types, and calculates some results (and visualizes them). I think this is (one of many) kinds of tasks that can be very nicely done in F#.
Sometimes, if I want a brief overview of an XML structure (or a recursive list to use in other forms such as searches), I can print out a tabbed list of the nodes in the XML using the following script:
open System
open System.Xml
let path = "C:\\XML\\"
let xml_file = path + "Test.xml"
let print_element level (node:XmlNode) =
    [ for tabs in 1..level -> " " ] @ [node.Name]
    |> (String.concat "" >> printfn "%s")
let rec print_tree level (element:XmlNode) =
    element
    |> print_element level
    |> (fun _ -> [ for child in element.ChildNodes -> print_tree (level+1) child ])
    |> ignore
new XmlDocument()
|> (fun doc -> doc.Load(xml_file); doc)
|> (fun doc -> print_tree 0 doc.DocumentElement)
I am sure it can be optimised or cut down, and I would welcome others' improvements on this code. :)
(For an alternative snippet see the answer below.)
This snippet transforms an XML document using an XSLT. I wasn't sure of the best way to use the XslCompiledTransform and XmlDocument objects in F#, but it seemed to work. I am sure there are better ways and would be happy to hear about them.
(* Transforms an XML document given an XSLT. *)
open System.IO
open System.Text
open System.Xml
open System.Xml.Xsl
let path = "C:\\XSL\\"
let file_xml = path + "test.xml"
let file_xsl = path + "xml-to-xhtml.xsl"
(* Compile XSL file to allow transforms *)
let compile_xsl (xsl_file:string) = new XslCompiledTransform() |> (fun compiled -> compiled.Load(xsl_file); compiled)
let load_xml (xml_file:string) = new XmlDocument() |> (fun doc -> doc.Load(xml_file); doc)
(* Transform an XML document given a compiled XSL *)
let transform (xsl_file:string) (xml_file:string) =
    new MemoryStream()
    |> (fun mem -> (compile_xsl xsl_file).Transform((load_xml xml_file), new XmlTextWriter(mem, Encoding.UTF8)); mem)
    |> (fun mem -> mem.Position <- (int64)0; mem.ToArray())
(* Return an XML document that has been transformed *)
transform file_xsl file_xml
|> (fun bytes -> File.WriteAllBytes(path + "out.html", bytes))
After clarifying approaches to writing F# code with existing .net classes, the following useful code came up for transforming xml documents given xsl documents. The function also allows you to create a custom function to transform xml documents with a specific xsl document (see example):
let transform =
    (fun xsl ->
        let xsl_doc = new XslCompiledTransform()
        xsl_doc.Load(string xsl)
        (fun xml ->
            let doc = new XmlDocument()
            doc.Load(string xml)
            let mem = new MemoryStream()
            xsl_doc.Transform(doc.CreateNavigator(), null, mem)
            mem
        )
    )
This allows you to transform docs this way (the XSL file comes first, then the XML file):
let result = transform "report.xsl" "report.xml"
or you can create another function which can be used multiple times:
let transform_report = transform "report.xsl"
let reports = [| "report1.xml"; "report2.xml" |]
let results = [ for report in reports -> transform_report report ]

Resources