File transform in F# - f#

I am just starting to work with F# and trying to understand typical idoms and effective ways of thinking and working.
The task at hand is a simple transform of a tab-delimited file to one which is comma-delimited. A typical input line will look like:
let line = "#ES# 01/31/2006 13:31:00 1303.00 1303.00 1302.00 1302.00 2514 0"
I started out with looping code like this:
// inFile and outFile defined in preceding code not shown here
for line in File.ReadLines(inFile) do
let typicalArray = line.Split '\t'
let transformedLine = typicalArray |> String.concat ","
outFile.WriteLine(transformedLine)
I then replaced the split/concat pair of operations with a single Regex.Replace():
for line in File.ReadLines(inFile) do
let transformedLine = Regex.Replace(line, "\t",",")
outFile.WriteLine(transformedLine)
And now, finally, have replaced the looping with a pipeline:
File.ReadLines(inFile)
|> Seq.map (fun x -> Regex.Replace(x, "\t", ","))
|> Seq.iter (fun y -> outFile.WriteLine(y))
// other housekeeping code below here not shown
While all versions work, the final version seems to me the most intuitive. Is this how a more experienced F# programmer would accomplish this task?

I think all three versions are perfectly fine, idiomatic code that F# experts would write.
I generally prefer writing code using built-in language features (like for loops and if conditions) if they let me solve the problem I have. These are imperative, but I think using them is a good idea when the API requires imperative code (like outFile.WriteLine). As you mentioned - you started with this version (and I would do the same).
Using higher-order functions is nice too - although I would probably do that only if I wanted to write data transformation and get a new sequence or list of lines - this would be handy if you were using File.WriteAllLines instead of writing lines one-by-one. Although, that could be also done by simply wrapping your second version with sequence expression:
let transformed =
seq { for line in File.ReadLines(inFile) -> Regex.Replace(line, "\t",",") }
File.WriteAllLines(outFilePath, transformed)
I do not think there is any objective reason to prefer one of the versions. My personal stylistic preference is to use for and refactor to sequence expressions (if needed), but others will likely disagree.

A side note that if you want to write to the same file that you are reading from, you need to remember that Seq is doing lazy evaluation.
Using Array as opposed to Seq makes sure file is closed for reading when it is needed for writing.
This works:
let lines =
file |> File.ReadAllLines
|> Array.map(fun line -> ..modify line..)
File.WriteAllLines(file, lines)
This does not (causes file access file violation)
let lines =
file |> File.ReadLines
|> Seq.map(fun line -> ..modify line..)
File.WriteAllLines(file, lines)
(potential overlap with another discussion here, where intermediate variable helps with the same problem)

Related

FsCheck: Override generator for a type, but only in the context of a single parent generator

I seem to often run into cases where I want to generate some complex structure, but a special variation with a member type generated differently.
For example, consider this tree
type Tree<'LeafData,'INodeData> =
| LeafNode of 'LeafData
| InternalNode of 'INodeData * Tree<'LeafData,'INodeData> list
I want to generate cases like
No internal node is childless
There are no leaf-type nodes
Only a limited subset of leaf types are used
These are simple to do if I override all generation of a corresponding child type.
The problem is that it seems register is inherently a thread-level action, and there is no gen-local alternative.
For example, what I want could look like
let limitedLeafs =
gen {
let leafGen = Arb.generate<LeafType> |> Gen.filter isAllowedLeaf
do! registerContextualArb (leafGen |> Arb.fromGen)
return! Arb.generate<Tree<NodeType, LeafType>>
}
This Tree example specifically can work around with some creative type shuffling, but that's not always possible.
It's also possible to use some sort of recursive map that enforces assumptions, but that seems relatively complex if the above is possible. I might be misunderstanding the nature of FsCheck generators though.
Does anyone know how to accomplish this kind of gen-local override?
There's a few options here - I'm assuming you're on FsCheck 2.x but keep scrolling for an option in FsCheck 3.
The first is the most natural one but is more work, which is to break down the generator explicitly to the level you need, and then put humpty dumpty together again. I.e don't rely on the type-based generator derivation so much - if I understand your example correctly that would mean implementing a recursive generator - relying on Arb.generate<LeafType> for the generic types.
Second option - Config has an Arbitrary field which you can use to override Arbitrary instances. These overrides will take effect even if the overridden types are part of the automatically generated ones. So as a sketch you could try:
Check.One ({Config.Quick with Arbitrary = [| typeof<MyLeafArbitrary>) |]) (fun safeTree -> ...)
More extensive example which uses FsCheck.Xunit's PropertyAttribute but the principle is the same, set on the Config instead.
Final option! :) In FsCheck 3 (prerelease) you can configure this via a new (as of yet undocumented) concept ArbMap which makes the map from type to Arbitrary instance explicit, instead of this static global nonsense in 2.x (my bad of course. seemed like a good idea at the time.) The implementation is here which may not tell you all that much - the idea is that you put an ArbMap instance together which contains your "safe" generators for the subparts, then you ArbMap.mergeWith that safe map with ArbMap.defaults (thus overriding the default generators with your safe ones, in the resulting ArbMap) and then you use ArbMap.arbitrary or ArbMap.generate with the resulting map.
Sorry for the long winded explanation - but all in all that should give you the best of both worlds - you can reuse the generic union type generator in FsCheck, while surgically overriding certain types in that context.
FsCheck guidance on this is:
To define a generator that generates a subset of the normal range of values for an existing type, say all the even ints, it makes properties more readable if you define a single-case union case, and register a generator for the new type:
As an example, they suggest you could define arbitrary even integers like this:
type EvenInt = EvenInt of int with
static member op_Explicit(EvenInt i) = i
type ArbitraryModifiers =
static member EvenInt() =
Arb.from<int>
|> Arb.filter (fun i -> i % 2 = 0)
|> Arb.convert EvenInt int
Arb.register<ArbitraryModifiers>() |> ignore
You could then generate and test trees whose leaves are even integers like this:
let ``leaves are even`` (tree : Tree<EvenInt, string>) =
let rec leaves = function
| LeafNode leaf -> [leaf]
| InternalNode (_, children) ->
children |> List.collect leaves
leaves tree
|> Seq.forall (fun (EvenInt leaf) ->
leaf % 2 = 0)
Check.Quick ``leaves are even`` // output: Ok, passed 100 tests.
To be honest, I like your idea of a "gen-local override" better, but I don't think FsCheck supports it.

F# Compiler Services incorrectly parses program

UPDATE:
I now realize that the question was stupid, I should have just filed the issue. In hindsight, I don't see why I even asked this question.
The issue is here: https://github.com/fsharp/FSharp.Compiler.Service/issues/544
Original question:
I'm using FSharp Compiler Services for parsing some F# code.
The particular piece of code that I'm facing right now is this:
let f x y = x+y
let g = f 1
let h = (g 2) + 3
This program yields a TAST without the (+) call on the last line. That is, the compiler service returns TAST as if the last line was just let h = g 2.
The question is: is this is a legitimate bug that I ought to report or am I missing something?
Some notes
Here is a repo containing minimal repro (I didn't want to include it in this question, because Compiler Services require quite a bit of dancing around).
Adding more statements after the let h line does not change the outcome.
When compiled to IL (as opposed to parsed with Compiler Services), it seems to work as expected (e.g. see fiddle)
If I make g a value, the program parses correctly.
If I make g a normal function (rather than partially applied one), the program parses correctly.
I have no priori experience with FSharp.Compiler.Services but nevertheless I did a small investigation using Visual Studio's debugger. I analyzed abstract syntax tree of following string:
"""
module X
let f x y = x+y
let g = f 1
let h = (g 2) + 3
"""
I've found out that there's following object inside it:
App (Val (op_Addition,NormalValUse,D:\file.fs (6,32--6,33) IsSynthetic=false),TType_forall ([T1; T2; T3],TType_fun (TType_var T1,TType_fun (...,...))),...,...,...)
As you can see, there's an addition in 6th line between characters 32 and 33.
The most likely explanation why F# Interactive doesn't display it properly is a bug in a library (maybe AST is in an inconsistent state or pretty-printing is broken). I think that you should file a bug in project's issue tracker.
UPDATE:
Aforementioned object can be obtained in a debbuger in a following way:
error.[0]
(option of Microsoft.FSharp.Compiler.SourceCodeServices.FSharpImplementationFileDeclaration.Entity)
.Item2
.[2]
(option of Microsoft.FSharp.Compiler.SourceCodeServices.FSharpImplementationFileDeclaration.MemberOrFunctionOrValue)
.Item3
.f (private member)
.Value
(option of Microsoft.FSharp.Compiler.SourceCodeServices.FSharpExprConvert.ConvExprOnDemand#903)
.expr

Writing F# code to parse "2 + 2" into code

Extremely just-started-yesterday new to F#.
What I want: To write code that parses the string "2 + 2" into (using as an example code from the tutorial project) Expr.Add(Expr.Num 2, Expr.Num 2) for evaluation. Some help to at least point me in the right direction or tell me it's too complex for my first F# project. (This is how I learn things: By bashing my head against stuff that's hard)
What I have: My best guess at code to extract the numbers. Probably horribly off base. Also, a lack of clue.
let script = "2 + 2";
let rec scriptParse xs =
match xs with
| [] -> (double)0
| y::ys -> (double)y
let split = (script.Split([|' '|]))
let f x = (split[x]) // "This code is not a function and cannot be applied."
let list = [ for x in 0..script.Length -> f x ]
let result = scriptParse
Thanks.
The immediate issue that you're running into is that split is an array of strings. To access an element of this array, the syntax is split.[x], not split[x] (which would apply split to the singleton list [x], assuming it were a function).
Here are a few other issues:
Your definition of list is probably wrong: x ranges up to the length of script, not the length of the array split. If you want to convert an array or other sequence to a list you can just use List.ofSeq or Seq.toList instead of an explicit list comprehension [...].
Your "casts" to double are a bit odd - that's not the right syntax for performing conversions in F#, although it will work in this case. double is a function, so the parentheses are unnecessary and what you are doing is really calling double 0 and double y. You should just use 0.0 for the first case, and in the second case, it's unclear what you are converting from.
In general, it would probably be better to do a bit more design up front to decide what your overall strategy will be, since it's not clear to me that you'll be able to piece together a working parser based on your current approach. There are several well known techniques for writing a parser - are you trying to use a particular approach?

Creating a valid function declaration from a complex tuple/list structure

Is there a generic way, given a complex object in Erlang, to come up with a valid function declaration for it besides eyeballing it? I'm maintaining some code previously written by someone who was a big fan of giant structures, and it's proving to be error prone doing it manually.
I don't need to iterate the whole thing, just grab the top level, per se.
For example, I'm working on this right now -
[[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"],
[{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.5","5060"},
[{'via-branch',"z9hG4bKb561e4f03a40c4439ba375b2ac3c9f91.0"}]}]},
{'Via',
[{'via-parm',
{'sent-protocol',"SIP","2.0","UDP"},
{'sent-by',"172.20.10.15","5060"},
[{'via-branch',"12dee0b2f48309f40b7857b9c73be9ac"}]}]},
{'From',
{'from-spec',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"003018CFE4EF"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"b7226ffa86c46af7bf6e32969ad16940"}]}},
{'To',
{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3966"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[{tag,"a830c764"}]},
{'Call-ID',"90df0e4968c9a4545a009b1adf268605#172.20.10.15"},
{'CSeq',1358286,"SUBSCRIBE"},
["date",'HCOLON',
["Mon",44,32,["13",32,"Jun",32,"2011"],32,["17",58,"03",58,"55"],32,"GMT"]],
{'Contact',
[[{'name-addr',
[[]],
{'SIP-URI',
[{userinfo,{user,"3ComCallProcessor"},[]}],
{hostport,"172.20.10.11",[]},
{'uri-parameters',[]},
[]}},
[]],
[]]},
["expires",'HCOLON',3600],
["user-agent",'HCOLON',
["3Com",[]],
[['LWS',["VCX",[]]],
['LWS',["7210",[]]],
['LWS',["IP",[]]],
['LWS',["CallProcessor",[['SLASH',"v10.0.8"]]]]]],
["proxy-authenticate",'HCOLON',
["Digest",'LWS',
["realm",'EQUAL',['SWS',34,"3Com",34]],
[['COMMA',["domain",'EQUAL',['SWS',34,"3Com",34]]],
['COMMA',
["nonce",'EQUAL',
['SWS',34,"btbvbsbzbBbAbwbybvbxbCbtbzbubqbubsbqbtbsbqbtbxbCbxbsbybs",
34]]],
['COMMA',["stale",'EQUAL',"FALSE"]],
['COMMA',["algorithm",'EQUAL',"MD5"]]]]],
{'Content-Length',0}],
"\r\n",
["\n"]]
Maybe https://github.com/etrepum/kvc
I noticed your clarifying comment. I'd prefer to add a comment myself, but don't have enough karma. Anyway, the trick I use for that is to experiment in the shell. I'll iterate a pattern against a sample data structure until I've found the simplest form. You can use the _ match-all variable. I use an erlang shell inside an emacs shell window.
First, bind a sample to a variable:
A = [{a,b},[{c,d}, {e,f}]].
Now set the original structure against the variable:
[{a,b},[{c,d},{e,f}]] = A.
If you hit enter, you'll see they match. Hit alt-p (forget what emacs calls alt, but it's alt on my keyboard) to bring back the previous line. Replace some tuple or list item with an underscore:
[_,[{c,d},{e,f}]].
Hit enter to make sure you did it right and they still match. This example is trivial, but for deeply nested, multiline structures it's trickier, so it's handy to be able to just quickly match to test. Sometimes you'll want to try to guess at whole huge swaths, like using an underscore to match a tuple list inside a tuple that's the third element of a list. If you place it right, you can match the whole thing at once, but it's easy to misread it.
Anyway, repeat to explore the essential shape of the structure and place real variables where you want to pull out values:
[_, [_, _]] = A.
[_, _] = A.
[_, MyTupleList] = A. %% let's grab this tuple list
[{MyAtom,b}, [{c,d}, MyTuple]] = A. %% or maybe we want this atom and tuple
That's how I efficiently dissect and pattern match complex data structures.
However, I don't know what you're doing. I'd be inclined to have a wrapper function that uses KVC to pull out exactly what you need and then distributes to helper functions from there for each type of structure.
If I understand you correctly you want to pattern match some large datastructures of unknown formatting.
Example:
Input: {a, b} {a,b,c,d} {a,[],{},{b,c}}
function({A, B}) -> do_something;
function({A, B, C, D}) when is_atom(B) -> do_something_else;
function({A, B, C, D}) when is_list(B) -> more_doing.
The generic answer is of course that it is undecidable from just data to know how to categorize that data.
First you should probably be aware of iolists. They are created by functions such as io_lib:format/2 and in many other places in the code.
One example is that
[["SIP",47,"2",46,"0"],32,"407",32,"Proxy Authentication Required","\r\n"]
will print as
SIP/2.0 407 Proxy Authentication Required
So, I'd start with flattening all those lists, using a function such as
flatten_io(List) when is_list(List) ->
Flat = lists:map(fun flatten_io/1, List),
maybe_flatten(Flat);
flatten_io(Tuple) when is_tuple(Tuple) ->
list_to_tuple([flatten_io(Element) || Element <- tuple_to_list(Tuple)];
flatten_io(Other) -> Other.
maybe_flatten(L) when is_list(L) ->
case lists:all(fun(Ch) when Ch > 0 andalso Ch < 256 -> true;
(List) when is_list(List) ->
lists:all(fun(X) -> X > 0 andalso X < 256 end, List);
(_) -> false
end, L) of
true -> lists:flatten(L);
false -> L
end.
(Caveat: completely untested and quite inefficient. Will also crash for inproper lists, but you shouldn't have those in your data structures anyway.)
On second thought, I can't help you. Any data structure that uses the atom 'COMMA' for a comma in a string should be taken out and shot.
You should be able to flatten those things as well and start to get a view of what you are looking at.
I know that this is not a complete answer. Hope it helps.
Its hard to recommend something for handling this.
Transforming all the structures in a more sane and also more minimal format looks like its worth it. This depends mainly on the similarities in these structures.
Rather than having a special function for each of the 100 there must be some automatic reformatting that can be done, maybe even put the parts in records.
Once you have records its much easier to write functions for it since you don't need to know the actual number of elements in the record. More important: your code won't break when the number of elements changes.
To summarize: make a barrier between your code and the insanity of these structures by somehow sanitizing them by the most generic code possible. It will be probably a mix of generic reformatting with structure speicific stuff.
As an example already visible in this struct: the 'name-addr' tuples look like they have a uniform structure. So you can recurse over your structures (over all elements of tuples and lists) and match for "things" that have a common structure like 'name-addr' and replace these with nice records.
In order to help you eyeballing you can write yourself helper functions along this example:
eyeball(List) when is_list(List) ->
io:format("List with length ~b\n", [length(List)]);
eyeball(Tuple) when is_tuple(Tuple) ->
io:format("Tuple with ~b elements\n", [tuple_size(Tuple)]).
So you would get output like this:
2> eyeball({a,b,c}).
Tuple with 3 elements
ok
3> eyeball([a,b,c]).
List with length 3
ok
expansion of this in a useful tool for your use is left as an exercise. You could handle multiple levels by recursing over the elements and indenting the output.
Use pattern matching and functions that work on lists to extract only what you need.
Look at http://www.erlang.org/doc/man/lists.html:
keyfind, keyreplace, L = [H|T], ...

Is this a better (more functional way) to write the following fsharp code?

I have pieces of code like this in a project and I realize it's not
written in a functional way:
let data = Array.zeroCreate(3 + (int)firmwareVersions.Count * 27)
data.[0] <- 0x09uy //drcode
data.[1..2] <- firmwareVersionBytes //Number of firmware versions
let mutable index = 0
let loops = firmwareVersions.Count - 1
for i = 0 to loops do
let nameBytes = ASCIIEncoding.ASCII.GetBytes(firmwareVersions.[i].Name)
let timestampBytes = this.getTimeStampBytes firmwareVersions.[i].Timestamp
let sizeBytes = BitConverter.GetBytes(firmwareVersions.[i].Size) |> Array.rev
data.[index + 3 .. index + 10] <- nameBytes
data.[index + 11 .. index + 24] <- timestampBytes
data.[index + 25 .. index + 28] <- sizeBytes
data.[index + 29] <- firmwareVersions.[i].Status
index <- index + 27
firmwareVersions is a List which is part of a csharp library.
It has (and should not have) any knowledge of how it will be converted into
an array of bytes. I realize the code above is very non-functional, so I tried
changing it like this:
let headerData = Array.zeroCreate(3)
headerData.[0] <- 0x09uy
headerData.[1..2] <- firmwareVersionBytes
let getFirmwareVersionBytes (firmware : FirmwareVersion) =
let nameBytes = ASCIIEncoding.ASCII.GetBytes(firmware.Name)
let timestampBytes = this.getTimeStampBytes firmware.Timestamp
let sizeBytes = BitConverter.GetBytes(firmware.Size) |> Array.rev
Array.concat [nameBytes; timestampBytes; sizeBytes]
let data =
firmwareVersions.ToArray()
|> Array.map (fun f -> getFirmwareVersionBytes f)
|> Array.reduce (fun acc b -> Array.concat [acc; b])
let fullData = Array.concat [headerData;data]
So now I'm wondering if this is a better (more functional) way
to write the code. If so... why and what improvements should I make,
if not, why not and what should I do instead?
Suggestions, feedback, remarks?
Thank you
Update
Just wanted to add some more information.
This is part of some library that handles the data for a binary communication
protocol. The only upside I see of the first version of the code is that
people implementing the protocol in a different language (which is the case
in our situation as well) might get a better idea of how many bytes every
part takes up and where exactly they are located in the byte stream... just a remark.
(As not everybody understand english, but all our partners can read code)
I'd be inclined to inline everything because the whole program becomes so much shorter:
let fullData =
[|yield! [0x09uy; firmwareVersionBytes; firmwareVersionBytes]
for firmware in firmwareVersions do
yield! ASCIIEncoding.ASCII.GetBytes(firmware.Name)
yield! this.getTimeStampBytes firmware.Timestamp
yield! BitConverter.GetBytes(firmware.Size) |> Array.rev|]
If you want to convey the positions of the bytes, I'd put them in comments at the end of each line.
I like your first version better because the indexing gives a better picture of the offsets, which are an important piece of the problem (I assume). The imperative code features the byte offsets prominently, which might be important if your partners can't/don't read the documentation. The functional code emphasises sticking together structures, which would be OK if the byte offsets are not important enough to be mentioned in the documentation either.
Indexing is normally accidental complexity, in which case it should be avoided. For example, your first version's loop could be for firmwareVersion in firmwareVersion instead of for i = 0 to loops.
Also, like Brian says, using constants for the offsets would make the imperative version even more readable.
How often does the code run?
The advantage of 'array concatenation' is that it does make it easier to 'see' the logical portions. The disadvantage is that it creates a lot of garbage (allocating temporary arrays) and may also be slower if used in a tight loop.
Also, I think perhaps your "Array.reduce(...)" can just be "Array.concat".
Overall I prefer the first way (just create one huge array), though I would factor it differently to make the logic more apparent (e.g. have a named constant HEADER_SIZE, etc.).
While we're at it, I'd probably add some asserts to ensure that e.g. nameBytes has the expected length.

Resources