Hi, I'm looking for the best way to read a fixed-width text file in F#. The file is plain text, from one to a couple of thousand lines long and around 1000 characters wide. Each line contains around 50 fields, each of varying length. My initial thought was to have something like the following
type MyRecord = {
    Name : string
    Address : string
    Postcode : string
    Tel : string
}
let format = [
    (0,10)
    (10,50)
    (50,7)
    (57,20)
]
and read each line one by one, assigning each field according to the format tuple (where the first item is the start index and the second is the width in characters).
Any pointers would be appreciated.
The hardest part is probably to split a single line according to the column format. It can be done something like this:
let splitLine format (line : string) =
    format |> List.map (fun (index, length) -> line.Substring(index, length))
This function has the type (int * int) list -> string -> string list. In other words, format is an (int * int) list. This corresponds exactly to your format list. The line argument is a string, and the function returns a string list.
You can map a list of lines like this:
let result = lines |> List.map (splitLine format)
You can also use Seq.map or Array.map, depending on how lines is defined. Such a result will be a string list list, and you can now map over such a list to produce a MyRecord list.
You can use File.ReadLines to get a lazily evaluated sequence of strings from a file.
Please note that the above is only an outline of a possible solution. I left out boundary checks, error handling, and such. The above code may contain off-by-one errors.
Here's a solution with a focus on custom validation and error handling for each field. This might be overkill for a data file consisting of just numeric data!
First, for these kinds of things, I like to use the parser in Microsoft.VisualBasic.dll, as it's already available without using NuGet.
For each row, we can return the array of fields and the line number (for error reporting):
#r "Microsoft.VisualBasic.dll"
// for each row, return the line number and the fields
let parserReadAllFields fieldWidths textReader =
    let parser = new Microsoft.VisualBasic.FileIO.TextFieldParser(reader=textReader)
    parser.SetFieldWidths fieldWidths
    parser.TextFieldType <- Microsoft.VisualBasic.FileIO.FieldType.FixedWidth
    seq { while not parser.EndOfData do
            yield parser.LineNumber, parser.ReadFields() }
Next, we need a little error handling library (see http://fsharpforfunandprofit.com/rop/ for more)
type Result<'a> =
    | Success of 'a
    | Failure of string list

module Result =

    let succeedR x =
        Success x

    let failR err =
        Failure [err]

    let mapR f xR =
        match xR with
        | Success a -> Success (f a)
        | Failure errs -> Failure errs

    let applyR fR xR =
        match fR, xR with
        | Success f, Success x -> Success (f x)
        | Failure errs, Success _ -> Failure errs
        | Success _, Failure errs -> Failure errs
        | Failure errs1, Failure errs2 -> Failure (errs1 @ errs2)
Then define your domain model. In this case, it is the record type with a field for each field in the file.
type MyRecord =
    {id:int; name:string; description:string}
And then you can define your domain-specific parsing code. For each field I have created a validation function (validateId, validateName, etc).
Fields that don't need validation can pass through the raw data (validateDescription).
In fieldsToRecord the various fields are combined using applicative style (<!> and <*>).
For more on this, see http://fsharpforfunandprofit.com/posts/elevated-world-3/#validation.
Finally, readRecords maps each input row to a record Result and keeps only the successful ones. The failed ones are written to a log in handleResult.
module MyFileParser =
    open Result

    let createRecord id name description =
        {id=id; name=name; description=description}

    let validateId (lineNo:int64) (fields:string[]) =
        let rawId = fields.[0]
        match System.Int32.TryParse(rawId) with
        | true, id -> succeedR id
        | false, _ -> failR (sprintf "[%i] Can't parse id '%s'" lineNo rawId)

    let validateName (lineNo:int64) (fields:string[]) =
        let rawName = fields.[1]
        if System.String.IsNullOrWhiteSpace rawName then
            failR (sprintf "[%i] Name cannot be blank" lineNo)
        else
            succeedR rawName

    let validateDescription (lineNo:int64) (fields:string[]) =
        let rawDescription = fields.[2]
        succeedR rawDescription // no validation

    let fieldsToRecord (lineNo, fields) =
        let (<!>) = mapR
        let (<*>) = applyR
        let validatedId = validateId lineNo fields
        let validatedName = validateName lineNo fields
        let validatedDescription = validateDescription lineNo fields
        createRecord <!> validatedId <*> validatedName <*> validatedDescription

    /// print any errors and only return good results
    let handleResult result =
        match result with
        | Success record -> Some record
        | Failure errs -> printfn "ERRORS %A" errs; None

    /// return a sequence of records
    let readRecords parserOutput =
        parserOutput
        |> Seq.map fieldsToRecord
        |> Seq.choose handleResult
Here's an example of the parsing in practice:
// Set up some sample text
let text = """01name1description1
02name2description2
xxname3badid-------
yy badidandname
"""
// create a low-level parser
let textReader = new System.IO.StringReader(text)
let fieldWidths = [| 2; 5; 11 |]
let parserOutput = parserReadAllFields fieldWidths textReader
// convert to records in my domain
parserOutput
|> MyFileParser.readRecords
|> Seq.iter (printfn "RECORD %A") // print each record
The output will look like:
RECORD {id = 1;
        name = "name1";
        description = "description";}
RECORD {id = 2;
        name = "name2";
        description = "description";}
ERRORS ["[3] Can't parse id 'xx'"]
ERRORS ["[4] Can't parse id 'yy'"; "[4] Name cannot be blank"]
By no means is this the most efficient way to parse a file (I think there are some CSV parsing libraries available on NuGet that can do validation while parsing), but it does show how you can have complete control over validation and error handling if you need it.
A record of 50 fields is a bit unwieldy, so alternative approaches that allow dynamic generation of the data structure may be preferable (e.g. System.Data.DataRow).
If it has to be a record anyway, you could at least spare yourself the manual assignment to each record field and populate it with the help of reflection instead. This trick relies on the order of the fields as they are defined. I am assuming that every fixed-width column represents a record field, so that start indices are implied.
open Microsoft.FSharp.Reflection
type MyRecord = {
    Name : string
    Address : string
    City : string
    Postcode : string
    Tel : string } with
    static member CreateFromFixedWidth format (line : string) =
        let fields =
            format
            |> List.fold (fun (index, acc) length ->
                let str = line.[index .. index + length - 1].Trim()
                index + length, box str :: acc )
                (0, [])
            |> snd
            |> List.rev
            |> List.toArray
        FSharpValue.MakeRecord(
            typeof<MyRecord>,
            fields ) :?> MyRecord
Example data:
"Postman Pat " +
"Farringdon Road " +
"London " +
"EC1A 1BB" +
"+44 20 7946 0813"
|> MyRecord.CreateFromFixedWidth [16; 16; 16; 8; 16]
// val it : MyRecord = {Name = "Postman Pat";
//                      Address = "Farringdon Road";
//                      City = "London";
//                      Postcode = "EC1A 1BB";
//                      Tel = "+44 20 7946 0813";}
I need to write a Deedle FrameData (including "ID" column and additional "Delta" column with blank entries) to CSV. While I can generate a 2D array of the FrameData, I am unable to write it correctly to a CSV file.
module SOQN =
    open System
    open System.IO
    open Deedle
    open FSharp.Data

    // TestInput.csv
    // ID,Alpha,Beta,Gamma
    // 1,no,1,hi
    // ...

    // TestOutput.csv
    // ID,Alpha,Beta,Gamma,Delta
    // 1,"no","1","hi",""
    // ...

    let inputCsv = @"D:\TestInput.csv"
    let outputCsv = @"D:\TestOutput.csv"

    let (df:Frame<obj,string>) = Frame.ReadCsv(inputCsv, hasHeaders=true, inferTypes=false, separators=",", indexCol="ID")

    // See http://www.fssnip.net/sj/title/Insert-Deedle-frame-into-Excel
    let data4Frame (frame:Frame<_,_>) = frame.GetFrameData()

    // See http://www.fssnip.net/sj/title/Insert-Deedle-frame-into-Excel
    let boxOptional obj =
        match obj with
        | Deedle.OptionalValue.Present obj -> box (obj.ToString())
        | _ -> box ""

    // See http://www.fssnip.net/sj/title/Insert-Deedle-frame-into-Excel
    let frameToArray (data:FrameData) =
        let transpose (array:'T[,]) =
            Array2D.init (array.GetLength(1)) (array.GetLength(0)) (fun i j -> array.[j, i])
        data.Columns
        |> Seq.map (fun (typ, vctr) -> vctr.ObjectSequence |> Seq.map boxOptional |> Array.ofSeq)
        |> array2D
        |> transpose

    let main =
        printfn ""
        printfn "Output Deedle FrameData To CSV"
        printfn ""
        let dff = data4Frame df
        let rzlt = frameToArray dff
        printfn "rzlt: %A" rzlt
        do
            use writer = new StreamWriter(outputCsv)
            writer.WriteLine("ID,Alpha,Beta,Gamma,Delta")
            // writer.WriteLine rzlt
        0

    [<EntryPoint>]
    main
    |> ignore
What am I missing?
I would not use FrameData to do this - frame data is mostly internal and while there are some legitimate uses for it, I don't think it makes sense for this task.
If you simply want to add an empty Delta column to your input CSV, then you can do this:
let df : Frame<int, _> = Frame.ReadCsv("C:/temp/test-input.csv", indexCol="ID")
df.AddColumn("Delta", [])
df.SaveCsv("C:/temp/test-output.csv", ["ID"])
This does almost everything you need - it writes the ID column and the extra Delta column.
The only caveat is that it does not add the extra quotes around the data. This is not required by the CSV specification unless you need to escape a comma in a column, and I don't think there is an easy way to get Deedle to do this.
So I think you'd then have to write your own code for writing the CSV file. The following shows how to do this, but it does not correctly escape quotes and commas (which is why you should use SaveCsv even if it does not put in the quotes when they're not needed):
use writer = new StreamWriter("C:/temp/test-output.csv")
writer.WriteLine("ID,Alpha,Beta,Gamma,Delta")
for key, row in Series.observations df.Rows do
    writer.Write(key)
    for value in Series.valuesAll row do
        writer.Write(",")
        writer.Write(sprintf "\"%O\"" (if value.IsSome then value.Value else box ""))
    writer.WriteLine()
You can get the example of writing to CSV from the source of the library (it uses FrameData there).
After adding a wrapper:
type FrameData with
    member frameData.SaveCsv(path:string, ?includeRowKeys, ?keyNames, ?separator, ?culture) =
        use writer = new StreamWriter(path)
        writeCsv writer (Some path) separator culture includeRowKeys keyNames frameData
you could write like this:
dff.SaveCsv outputCsv
In "In F# best way to set up a SQLCommand with parameters" some very neat solutions were given for constructing SqlCommand input parameters. Now I need to do some output parameters for calling a stored procedure that returns two output parameters.
So far I have:
let cmd = (createSqlCommand query conn)
let pec = (new SqlParameter("@errorCode", SqlDbType.Int))
pec.Direction <- ParameterDirection.Output
ignore (cmd.Parameters.Add(pec))
let pet = new SqlParameter("@errorMessage", SqlDbType.VarChar, 2000)
pet.Direction <- ParameterDirection.Output
ignore (cmd.Parameters.Add(pet))
let rc = cmd.ExecuteNonQuery()
let errorCode = cmd.Parameters.Item("@errorCode").Value.ToString()
let errorText = cmd.Parameters.Item("@errorMessage").Value.ToString()
Which works, but I find it ugly and too imperative. How can I expand the solutions in the previous example (especially Tomas's, which I'm now using) to handle output parameters too? So input and output in the same command to be issued.
So I tried this:
type Command =
    { Query : string
      Timeout : int
      Parameters : (string * Parameter) list
      OutParameters : Option<(string * OutParameter)> list }
followed by this:
let createSqlCommand cmd connection =
    let sql = new SqlCommand(cmd.Query, connection)
    sql.CommandTimeout <- cmd.Timeout
    for name, par in cmd.Parameters do
        let sqlTyp, value =
            match par with
            | Int n -> SqlDbType.Int, box n
            | VarChar s -> SqlDbType.VarChar, box s
            | Text s -> SqlDbType.Text, box s
            | DateTime dt -> SqlDbType.DateTime, box dt
        sql.Parameters.Add(name, sqlTyp).Value <- value
    match cmd.OutParameters with
    | Some <string * OutParameter> list ->
        for name, par in list do
            let sqlParameter =
                match par with
                | OutInt -> new SqlParameter(name, SqlDbType.Int)
                | OutVarChar len -> new SqlParameter(name, SqlDbType.VarChar, len)
            sqlParameter.Direction <- ParameterDirection.Output
            sql.Parameters.Add sqlParameter |> ignore
    | _ -> ()
But I can't work out the syntax for the match near the end. I tried:
| Some list ->
and got:
Error 52 This expression was expected to have type
    Option<(string * OutParameter)> list
but here has type
    'a option
Then I tried:
| Some Option<string * OutParameter> list ->
got the same error, so I tried:
| Some <string * OutParameter> list ->
got a different error:
Error 53 Unexpected identifier in pattern. Expected infix operator,
quote symbol or other token.
Then tried:
| Some <(string * OutParameter)> list ->
got the error:
Error 53 Unexpected symbol '(' in pattern. Expected infix operator,
quote symbol or other token.
Finally tried:
| Some (string * OutParameter) list ->
and got the first error again.
Then, I gave up.
What syntax is needed here?
Thought up a new one:
| Some list : (string * OutParameter) ->
for name, par in list do
but that errors on "for"
Error 53 Unexpected keyword 'for' in type
New Attempt:
I thought maybe I could define a function to build a sql command expecting output parameters and still use the first createSqlCommand function. I tried this:
type OutCommand =
    { Query : string
      Timeout : int
      Parameters : (string * Parameter) list
      OutParameters : (string * OutParameter) list }
let createSqlCommandOut (cmd : OutCommand) connection =
    let sql = createSqlCommand {cmd.Query; cmd.Timeout; cmd.Parameters} connection
    for name, par in cmd.OutParameters do
        let sqlParameter =
            match par with
            | OutInt -> new SqlParameter(name, SqlDbType.Int)
            | OutVarChar len -> new SqlParameter(name, SqlDbType.VarChar, len)
        sqlParameter.Direction <- ParameterDirection.Output
        sql.Parameters.Add sqlParameter |> ignore
    sql
The idea is to grab the parameters passed in and send them on to the original function to do the work. You probably guessed that this doesn't work. I get the error:
Error 53 Invalid object, sequence or record expression
on the call to createSqlCommand in the new function. Is this kind of thing possible? Can I make a Command record from the members of an OutCommand record? If so, how do I do the casting? (It seems to be neither an upcast nor a downcast.)
Tomas is of course much better qualified to answer this, but I'll give it a try. If he does answer, it'll be interesting to see if I'm on the right track. I guess I'm slightly off.
Bear with me if this doesn't quite run well, since I won't test it. I will base this on the code Tomas gave us.
I think we need a new OutParameter type.
type OutParameter =
    | OutInt
    | OutVarChar of int // the length is needed?
In the Command type we add an extra field named OutParameters.
type Command =
    { Query : string
      Timeout : int
      Parameters : (string * Parameter) list
      OutParameters : (string * OutParameter) list }
In the cmd function, this must be added.
OutParameters =
    [ "@errorCode", OutInt
      "@errorMessage", OutVarChar 2000 ]
The function createSqlCommand must now also handle OutParameters. The last for-loop is the only modification here.
let createSqlCommand cmd connection =
    let sql = new SqlCommand(cmd.Query, connection)
    sql.CommandTimeout <- cmd.Timeout
    for name, par in cmd.Parameters do
        let sqlTyp, value =
            match par with
            | Int n -> SqlDbType.Int, box n
            | VarChar s -> SqlDbType.VarChar, box s
            | Text s -> SqlDbType.Text, box s
            | DateTime dt -> SqlDbType.DateTime, box dt
        sql.Parameters.Add(name, sqlTyp).Value <- value
    for name, par in cmd.OutParameters do
        let sqlParameter =
            match par with
            | OutInt -> new SqlParameter(name, SqlDbType.Int)
            | OutVarChar len -> new SqlParameter(name, SqlDbType.VarChar, len)
        sqlParameter.Direction <- ParameterDirection.Output
        sql.Parameters.Add sqlParameter |> ignore
    sql
After you have run your ExecuteNonQuery, you can again take advantage of your list of OutParameters to parse the output.
Now a function to extract the values.
let extractOutParameters (cmd: SqlCommand) (outParms: (string * OutParameter) list) =
    outParms
    |> List.map (fun (name, outType) ->
        match outType with
        | OutInt -> cmd.Parameters.Item(name).Value :?> int |> Int
        | OutVarChar _ -> cmd.Parameters.Item(name).Value.ToString() |> VarChar
    )
I am not at all sure that casting the values like this is good; you probably should match on the type instead, to handle errors properly. Test it. But that's a minor issue, not much related to what I'm trying to demonstrate.
Notice that this function uses the Parameter type for returning the values, rather than the OutParameter type. At this point I would consider changing the names of one or both types, to better reflect their use.
UPDATE
You can use this to create specific functions for commands and queries. Here is a slightly pseudo-codish F# snippet.
type UserInfo = { UserName: string; Name: string; LastLogin: DateTime }

let getUserInfo con userName : UserInfo =
    let cmd = {
        Query = "some sql to get the data"
        Timeout = 1000
        Parameters = ... the user name here
        OutParameters = ... the userName, Name and LastLogin here
    }
    let sqlCommand = createSqlCommand cmd con
    ... run the ExecuteNonQuery or whatever here
    let outs = extractOutParameters sqlCommand cmd.OutParameters
    {
        UserName = getValOfParam outs "@userName"
        Name = getValOfParam outs "@name"
        LastLogin = getValOfParam outs "@lastLogin"
    }
You will have to create the function getValOfParam, which just searches outs for the parameter with the correct name and returns its value; one possible sketch follows.
You can then use getUserInfo like this.
let userInfo = getUserInfo con "john_smith"
Even if there were ten fields returned, you'd get them in one record, so it's simple to ignore the fields you don't want.
And if you had built another function whose results you weren't interested in at all when calling it, you'd call it like this.
startEngineAndGetStatus con "mainEngine" |> ignore
Please consider this dataset, composed of men and women, which I then filter according to a few variables:
type ls = JsonProvider<"...">
let dt = ls.GetSamples()
let dt2 =
    dt |> Seq.filter (fun c -> c.Sex = "male" && c.Height > Some 150)
dt2
[{"sex":"male","height":180,"weight":85},
 {"sex":"male","height":160,"weight":60},
 {"sex":"male","height":180,"weight":85}]
Let's suppose that I would like to add a fourth key "body mass index" or "bmi", whose value is roughly given by "weight"/"height". Hence I expect:
[{"sex":"male","height":180,"weight":85, "bmi":(180/85)},
{"sex":"male","height":160" "weight":60, "bmi":(160/60},
{"sex":"male","height":180,"weight":85, "bmi":(180/85)}]
I thought that map.Add may help.
let dt3 = dt2.Add("bmi", (dt2.Height/dt2.Weight))
Unfortunately, it returns an error:
error FS0039: The field, constructor or member 'Add' is not defined
I am sure there are further errors in my code, but without this function I cannot actually look for them. Am I, at least, approaching the problem correctly?
Creating modified versions of the JSON is sadly one thing that the F# Data type provider does not make particularly easy. What makes that hard is the fact that we can infer the type from the source JSON, but we cannot "predict" what kind of fields people might want to add.
To do this, you'll need to access the underlying representation of the JSON value and operate on that. For example:
type ls = JsonProvider<"""
  [{"sex":"male","height":180,"weight":85},
   {"sex":"male","height":160,"weight":60},
   {"sex":"male","height":180,"weight":85}]""">

let dt = ls.GetSamples()

let newJson =
    dt
    |> Array.map (fun recd ->
        // To do the calculation, you can access the fields via inferred types
        let bmi = float recd.Height / float recd.Weight
        // But now we need to look at the underlying value, check that it is
        // a record and extract the properties, which is an array of key-value pairs
        match recd.JsonValue with
        | JsonValue.Record props ->
            // Append the new property to the existing properties & re-create record
            Array.append [| "bmi", JsonValue.Float bmi |] props
            |> JsonValue.Record
        | _ -> failwith "Unexpected format" )

// Re-create a new JSON array and format it as JSON
JsonValue.Array(newJson).ToString()
I want to write an application that reads IP addresses from an XML file. The file looks like:
<range>
<start>192.168.40.1</start>
<end>192.168.50.255</end>
<subnet>255.255.255.0</subnet>
<gateway>192.168.50.1</gateway>
</range>
I created a record type to store the IP addresses:
type Scope = { Start: IPAddress; End: IPAddress; Subnetmask: IPAddress; Gateway: IPAddress }
I wrote a unit function that prints the IPs:
loc
|> Seq.iter (fun e ->
    match e.Name.LocalName with
    | "start" -> printfn "Start %s" e.Value
    | "end" -> printfn "End %s" e.Value
    | "subnet" -> printfn "Subnet %s" e.Value
    | "gateway" -> printfn "Gateway %s" e.Value
    | _ -> ())
How can I return the Scope record type instead of unit?
As mentioned in the comments, the XML type provider makes this a lot easier. You can just point it at a sample file and it will infer the structure and let you read the file easily:
type RangeFile = XmlProvider<"sample.xml">
let range = RangeFile.Load("file-you-want-to-read.xml")
let scope =
    { Start = IPAddress.Parse(range.Start)
      End = IPAddress.Parse(range.End)
      Subnetmask = IPAddress.Parse(range.Subnet)
      Gateway = IPAddress.Parse(range.Gateway) }
That said, you can certainly implement this yourself too. The code you wrote is a good start - there are a number of ways to do this, but in any case, you'll need to do some lookup based on the local name of the element (to find start, end, etc.).
One option is to load all the properties into a dictionary:
let lookup =
    loc
    |> Seq.map (fun e -> e.Name.LocalName, IPAddress.Parse(e.Value))
    |> dict
Now you have a lookup table that contains an IPAddress for each of the keys, so you can create the Scope value using just:
let scope =
    { Start = lookup.["start"]; End = lookup.["end"];
      Subnetmask = lookup.["subnet"]; Gateway = lookup.["gateway"] }
That said, the nice thing about the XML type provider is that it removes the need to do lookup based on string values and so you are less likely to make mistakes caused by typos.
I processed some HTML to extract various information from a website (no proper API exists there), and generated a list of tokens using an F# discriminated union. I have simplified my code to the essence:
type tokens =
    | A of string
    | B of int
    | C of string

let input = [A "1"; B 2; C "2.1"; C "2.2"; B 3; C "3.1"]

// how to transform the input to the following ???
let desiredOutput = [A "1", [[ B 2, [ C "2.1"; C "2.2" ]]; [B 3, [ C "3.1" ]]]]
This roughly corresponds to parsing the grammar: g -> A b*; b -> B c*; c -> C
The key thing is my token list is flat, but I want to work with the hierarchy implied by the grammar.
Perhaps there is another representation of my desiredOutput which would be better; what I really want to do is process exactly one A followed by zero or more Bs, each of which may contain zero or more Cs.
I've looked at parser combinator articles, e.g. about FParsec, but I couldn't find a good solution that allows me to start from a list of tokens rather than a stream of characters. I'm familiar with imperative techniques for parsing, but I don't know what is idiomatic F#.
Progress made due to Answer
Thanks to the answer from Vandroiy, I was able to write the following to move forward with a hobby project I am working on to learn idiomatic F# (and also to scrape quiz websites).
// transform flat data scraped from a Quiz website into a hierarchical data structure
type ScrapedQuiz =
    | Title of string
    | Description of string
    | Blurb of string * picture: string
    | QuizId of string
    | Question of num: string * text: string * picture: string
    | Answer of text: string
    | Error of exn

let input =
    [Title "Example Quiz Scraped from a website";
     Description "What the Quiz is about";
     Blurb ("more details", "and a URL for a picture");
     Question ("#1", "How good is F#", "URL to picture of new F# logo");
     Answer ("we likes it");
     Answer ("we very likes it");
     Question ("#2", "How useful is Stack Overflow", "URL to picture of Stack Overflow logo");
     Answer ("very good today");
     Answer ("lobsters");
    ]
type Quiz =
    { Title : string
      Description : string
      Blurb : string * PictureURL
      Questions : Quest list }
and Quest =
    { Number : string
      Text : string
      Pic : PictureURL
      Answers : string list }
and PictureURL = string
let errorMessage = "unexpected input format"

let parseList reader input =
    let rec run acc inp =
        match reader inp with
        | Some(o, inp') -> run (o :: acc) inp'
        | None -> List.rev acc, inp
    run [] input

let readAnswer = function Answer(a) :: t -> Some(a, t) | _ -> None
let readDescription =
    function Description(a) :: t -> (a, t) | _ -> failwith errorMessage
let readBlurb = function Blurb(a,b) :: t -> ((a,b),t) | _ -> failwith errorMessage

let readQuests = function
    | Question(n,txt,pic) :: t ->
        let answers, input' = parseList readAnswer t
        Some( { Number=n; Text=txt; Pic=pic; Answers=answers }, input')
    | _ -> None

let readQuiz = function
    | Title(s) :: t ->
        let d, input' = readDescription t
        let b, input'' = readBlurb input'
        let qs, input''' = parseList readQuests input''
        Some( { Title=s; Description=d; Blurb=b; Questions=qs }, input''')
    | _ -> None

match readQuiz input with
| Some(a, []) -> a
| _ -> failwith errorMessage
I could not have written this yesterday; neither the target data type, nor the parsing code. I see room for improvement, but I think I have started to meet my goal of not writing C# in F#.
Indeed, it might help to first find a good representation.
Original output format
I presume the suggested output form, in standard printing, would be:
[(A "1", [(B 2, [C "2.1"; C "2.2"]); (B 3, [C "3.1"])])]
(This differs from the one in the question in the number of list levels.) The code I used to get there is ugly. In part, this is because it abstracts at an awkward position, constraining input and output types very far without giving them a well-defined type. I'm posting it for the sake of completeness, but I recommend skipping over it.
let rec readBranch checkOne readInner acc = function
    | h :: t when checkOne h ->
        let dat, inp' = readInner t
        readBranch checkOne readInner ((h, dat) :: acc) inp'
    | l -> List.rev acc, l

let rec readCs acc = function
    | C(s) :: t -> readCs (C(s) :: acc) t
    | l -> List.rev acc, l

let readBs = readBranch (function B _ -> true | _ -> false) (readCs []) []
let readAs = readBranch (function A _ -> true | _ -> false) readBs []

input |> readAs |> fst
Surely, other people can do this more sensibly, but I doubt it would tackle the main problem: we're just projecting one weird data structure to the next. If it is difficult to read or formulate a parser's output format, there is probably something going wrong.
Strongly typed output
Rather than focus on how we are parsing, I prefer to first pay attention to what we are parsing into. These A B C things don't mean anything to me. Let's say they represent objects:
type Bravo =
    { ID : int
      Charlies : string list }

type Alpha =
    { Name : string
      Bravos : Bravo list }
There are two places where sequences of objects of the same type are parsed. Let's create a helper that repeatedly uses a specific parser to read a list of objects:
/// Parses objects into a list. reader takes an input and returns either
/// Some(parsed item, new input state), or None if the list is finished.
/// Returns a list of parsed objects and the remaining input.
let parseList reader input =
    let rec run acc inp =
        match reader inp with
        | Some(o, inp') -> run (o :: acc) inp'
        | None -> List.rev acc, inp
    run [] input
Note that this is quite generic in the type of input. This helper could be used with strings, sequences, or whatever; a tiny illustration follows.
Now, we add concrete parsers. The following functions have the signature used in reader in the helper; they either return the parsed object and the remaining input, or None if parsing wasn't possible.
let readC = function C(s) :: t -> Some(s, t) | _ -> None

let readB = function
    | B(i) :: t ->
        let charlies, input' = parseList readC t
        Some( { ID = i; Charlies = charlies }, input' )
    | _ -> None

let readA = function
    | A(s) :: t ->
        let bravos, input' = parseList readB t
        Some( { Name = s; Bravos = bravos }, input' )
    | _ -> None
The code for reading Alphas and Bravos is practically a duplicate. If that happens in production code, I would recommend again to check whether the data structure is optimal, and only look at improving the algorithm afterwards. (One possible factoring is sketched below.)
We request to read one A into one Alpha, which was the goal after all:
match readA input with
| Some(a, []) -> a
| _ -> failwith "Unexpected input format"
There may be many better ways to do the parsing, especially when knowing more about the exact problem. The important fact is not how the parser works, but what the output looks like, which will be the focus when actual work is done in the program. The second version's output should be much easier to navigate in both code and debugger:
val it : Alpha =
  { Name = "1";
    Bravos = [ { ID = 2; Charlies = ["2.1"; "2.2"] }
               { ID = 3; Charlies = ["3.1"] } ] }
One could take this a step further and replace the tokenized data structure with DOM (Document Object Model). Then, the first step would be to read HTML into DOM using a standard parsing library. In a second step, the concrete parsers would construct objects, using the DOM representation as input, calling one another top-down.
To work with a structured hierarchy, you have to create a matching structure of types. Something like:
type RootType = Level1 list
and Level1 =
    | A of string
    | B of Level2 list
    | C of string
and Level2 =
    { b: int; c: string list }