Deedle missing values after grouping - f#

I have two frames, each of which contains some IDs and zero to many measures for each ID. I want to get the average measure per ID for each frame and combine to a larger frame.
The problem is that when an ID does not appear in one of the two frames, after grouping it results in a missing value in the combined frame. Here is an example. Notice ID "Chris" does not appear in frame A.
let aF = frame [ "AID" =?> Series.ofValues [ "Andrew"; "Andrew"; "Andrew"]; "AMES" =?> Series.ofValues [ 2; 4; 3]]
let bF = frame [ "BID" =?> Series.ofValues [ "Andrew"; "Chris"; "Andrew"]; "BMES" =?> Series.ofValues [ 1; 6; 7]]
let groupF = frame [ "AG" => (aF |> Frame.groupRowsByString "AID" |> Frame.getCol "AMES") ; "BG" => (bF |> Frame.groupRowsByString "BID" |> Frame.getCol "BMES") ]
let groupFMean = groupF |> Frame.getNumericCols |> Series.mapValues (Stats.levelMean fst) |> Frame.ofColumns |> Frame.fillMissingWith 0
groupFMean.SaveCsv( "tgroupFMean.csv", includeRowKeys=true, keyNames=["Id"] )
The resulting table looks like this:
Id      AG  BG
Andrew  3   4
Chris       6
And the blank cell is "". I've tried variations with fillMissingWith 0 (at series and frame level) without success.

The answer is not very obvious - the problem is that fillMissingWith only touches columns that have the same type as the value you are using to fill the data - so for example, fillMissingWith "Unknown" would only fill missing values in columns that are string.
In your case, Frame.fillMissingWith 0 is only applied to columns of type int, and there are no such columns because the computed means are floats. If you use Frame.fillMissingWith 0.0, things work as expected!
PS: If you have any thoughts on how this could be done better, please let us know. I'm really not sure what the right behavior is here!

Related

How to avoid connecting all the points of a function graph with Plotly

In a program that revolves around maths, I find myself using Plotly.NET (F#) to display user-defined functions. This works quite well, but there are cases where a function has discontinuities or even chunks defined over certain regions. For example, for the function f(x) defined by 0 if x <= 0 and 10 elsewhere, the expected graph (I used Wolfram Alpha here) is:
With Plotly and the code below,
let fn x = if x <= 0.0 then 0.0 else 10.0
let xs = [ -10.0 .. 0.1 .. 10.0 ]
let ys = Seq.map fn xs
Chart.Line(xs, ys, UseDefaults = false)
|> Chart.withTitle @"$f(x)$"
|> Chart.savePNG("example")
I get this graph:
As you can see, Plotly connects two points that shouldn't be connected (and I don't blame it, that's how the lib works). I wonder then how to avoid this kind of behaviour, which often happens with piecewise defined functions.
If possible, I would like a solution that is general enough to be applied to all functions / graphs, as my program does not encode functions in advance, the user enters them. The research I've done doesn't lead me anywhere, unfortunately, and the documentation doesn't show an example for what I want.
PS: also, as you may have noticed, Plotly doesn't display the LaTeX in the exported image. According to my research this is a known issue with Python, but if you know how to solve this with the .NET version of the lib, I'm also interested!
I don't think there's any way for Plotly to know that the function is discontinuous. Note that the vertical portion of your chart isn't truly vertical, because x jumps from 0.0 to 0.1.
However, you can still achieve the effect you're looking for by creating a separate chart for each piece of the function, and then combining them:
let color = Color.fromString "Blue"
let xsA = [ -10.0 .. 0.0 ]
let ysA = xsA |> Seq.map (fun _ -> 0.0)
let chartA = Chart.Line(xsA, ysA, LineColor = color)
let xsB = [ 0.0 .. 10.0 ]
let ysB = xsB |> Seq.map (fun _ -> 10.0)
let chartB = Chart.Line(xsB, ysB, LineColor = color)
[ chartA; chartB ]
|> Chart.combine
|> Chart.withLegend false
|> Chart.show
Note that there are actually two distinct points for x = 0 in the combined chart, so it's technically not a function. (Perhaps there's some way to show that the top piece is open, while the bottom piece is closed in Plotly, but I don't know how.) Result is:

If you can combine 3+ arbitrarily sized integers and still be able to deconstruct it back

Say you have 3 integers:
13105
705016
13
I'm wondering if you could combine these into one integer in any way, so that you can still get back to the original 3 integers.
var startingSet = [ 13105, 705016, 13 ]
var combined = combineIntoOneInteger(startingSet)
// 15158958589285958925895292589 perhaps, I have no idea.
var originalIntegers = deconstructInteger(combined, 3)
// [ 13105, 705016, 13 ]
function combineIntoOneInteger(integers) {
  // some sort of hashing-like function...
}
function deconstructInteger(integer, arraySize) {
  // perhaps pass it some other parameters
  // like how many to deconstruct to, or other params.
}
It doesn't need to technically be an "integer". It is just a string using only the integer characters, though perhaps I might want to use the hex characters instead. But I ask in terms of integers because underneath I do have integers of a bounded size that will be used to construct the combined object.
Some other notes....
The combined value should be unique, so no matter what values you combine, you will always get a different result. That is, there are absolutely no conflicts. Or if that's not possible, perhaps an explanation why and a potential workaround.
The mathematical "set" containing all possible outputs can be composed of different amounts of components. That is to say, you might have the output/combined set containing [ 100, 200, 300, 400 ] but the input set is these 4 arrays: [ [ 1, 2, 3 ], [ 5 ], [ 91010, 132 ], [ 500, 600, 700 ] ]. That is, the input arrays can be of wildly different lengths and wildly different sized integers.
One way to accomplish this more generically is to just use a "separator" character, which makes it super easy. So it would be like 13105:705016:13. But this is cheating, I want it to only use the characters in the integer set (or perhaps the hex set, or some other arbitrary set, but for this case just the integer set or hex).
Another idea for a potential way to accomplish this is to somehow hide a separator in there by doing some hashing or permutation jiu jitsu so that [ 13105, 705016, 13 ] becomes some integer-looking thing like 95918155193915183, where 155 and 5 are some separator-like interpolator values based on the preceding input or some other tricks. A simpler approach to this would be like saying "anything following three zeroes 000, like 410001414, means it's a new integer". So basically 000 is a separator. But this specifically is ugly and brittle. Maybe it could get more tricky and work though, like an "if the value is odd and followed by a multiple of 3 of itself, then it's a separator" sort of thing. But I can see that also having brittle edge cases.
But basically, given a set of integers n (or strings of integer characters), how to convert that into a single integer (or single integer-charactered string), and then convert it back into the original set of integers n.
Sure, there are lots of ways to do this.
To start with, it's only necessary to have a reversible function which combines two values into one. (For it to be reversible, there must be another function which takes the output value and recreates the two input values.)
Let's call the function which combines two values combine and the reverse function separate. Then we have:
separate(combine(a, b)) == [a, b]
for any values a and b. That means that combine(a, b) == combine(c, d)
can only be true if both a == c and b == d; in other words, every pair of inputs produces a different output.
Encoding arbitrary vectors
Once we have that function, we can encode arbitrary-length input vectors. The simplest case is when we know in advance what the length of the vector is. For example, we could define:
combine3 = (a, b, c) => combine(combine(a, b), c)
combine4 = (a, b, c, d) => combine(combine(combine(a, b), c), d)
and so on. To reverse that computation, we only have to repeatedly call separate the correct number of times, each time keeping the second returned value. For example, if we previously had computed:
m = combine4(a, b, c, d)
we could get the four input values back as follows:
c3, d = separate(m)
c2, c = separate(c3)
a, b = separate(c2)
But your question asks for a way to combine an arbitrary number of values. To do that, we just need to do one final combine, which mixes in the number of values. That lets us get the original vector back out: first, we call separate to get the value count back out, and then we call separate enough times to extract each successive input value.
combine_n = v => combine(v.reduce(combine), v.length)
function separate_n(m) {
  let [r, n] = separate(m)
  let a = Array(n)
  for (let i = n - 1; i > 0; --i) [r, a[i]] = separate(r);
  a[0] = r;
  return a;
}
Note that the above two functions do not work on the empty vector, which should code to 0. Adding the correct checks for this case is left as an exercise. Also note the warning towards the bottom of this answer, about integer overflow.
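One possible way to add those checks is sketched below. It is only a sketch: the names combineN and separateN are made up for this illustration, and it assumes the diagonal combine and separate functions defined later in this answer (repeated here so the snippet runs on its own). It also assumes the inputs are small enough for exact Number arithmetic.

```javascript
// Diagonal pairing and its inverse, as defined later in this answer.
const combine = (a, b) => (a + b) * (a + b + 1) / 2 + a;

function separate(m) {
  const s = Math.floor((-1 + Math.sqrt(1 + 8 * m)) / 2);
  const a = m - combine(0, s);
  return [a, s - a];
}

// Hypothetical variant of combine_n: the empty vector encodes to 0.
const combineN = v =>
  v.length === 0 ? 0 : combine(v.reduce(combine), v.length);

// Hypothetical variant of separate_n: a decoded length of 0 is only
// valid when the remainder is also 0 (that is, for the empty vector).
function separateN(m) {
  let [r, n] = separate(m);
  if (n === 0) {
    if (r !== 0) throw new RangeError(`${m} does not encode any vector`);
    return [];
  }
  const a = Array(n);
  for (let i = n - 1; i > 0; --i) [r, a[i]] = separate(r);
  a[0] = r;
  return a;
}

console.log(separateN(combineN([])));        // []
console.log(separateN(combineN([3, 1, 2]))); // [ 3, 1, 2 ]
```

Rejecting a zero length with a non-zero remainder is a design choice made for this sketch; it makes invalid inputs fail loudly instead of decoding to a misleading one-element vector.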
A simple combine function: diagonalization
With that done, let's look at how to implement combine. There are actually many solutions, but one pretty simple one is to use the diagonalization function:
diag(a, b) = (a + b)(a + b + 1) / 2 + a
This basically assigns positions in the infinite square by tracing successive diagonals:
      <-- b -->
   0  1  3  6 10 15 21 ...
^  2  4  7 11 16 22 ...
|  5  8 12 17 23 ...
a  9 13 18 24 ...
| 14 19 25 ...
v 20 26 ...
  27 ...
(In an earlier version of this answer, I had reversed a and b, but this version seems to have slightly more intuitive output values.)
Note that the top row, where a == 0, is exactly the triangular numbers, which is not surprising because the already enumerated positions are the top left triangle of the square.
To reverse the transformation, we start by solving the equation which defines the triangular numbers, m = s(s + 1)/2, which is the same as
0 = s² + s - 2m
whose solution can be found using the standard quadratic formula, resulting in:
s = floor((-1 + sqrt(1 + 8 * m)) / 2)
(s here is the original a+b; that is, the index of the diagonal.)
I should explain the call to floor which snuck in there. s will only be precisely an integer on the top row of the square, where a is 0. But, of course, a will usually not be 0, and m will usually be a little more than the triangular number we're looking for, so when we solve for s, we'll get some fractional value. Floor just discards the fractional part, so the result is the diagonal index.
Now we just have to recover a and b, which is straight-forward:
a = m - combine(0, s)
b = s - a
So we now have the definitions of combine and separate:
let combine = (a, b) => (a + b) * (a + b + 1) / 2 + a
function separate(m) {
  let s = Math.floor((-1 + Math.sqrt(1 + 8 * m)) / 2);
  let a = m - combine(0, s);
  let b = s - a;
  return [a, b];
}
One cool feature of this particular encoding is that combine is a bijection: every non-negative integer corresponds to exactly one pair (a, b). As a result, combine_n leaves very few integers unused. Many other encoding schemes do not have this property; their possible return values are only a subset of the set of non-negative integers.
Example encodings
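For reference, here is what separate_n returns for the first 30 integers. The entries for 2, 5, 9, 14, 20, and 27 decode to a length of 0 with a non-zero remainder, so combine_n never produces them; separate_n as written returns a spurious one-element vector for those inputs.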
> for (let i = 1; i <= 30; ++i) console.log(i, separate_n(i));
1 [ 0 ]
2 [ 1 ]
3 [ 0, 0 ]
4 [ 1 ]
5 [ 2 ]
6 [ 0, 0, 0 ]
7 [ 0, 1 ]
8 [ 2 ]
9 [ 3 ]
10 [ 0, 0, 0, 0 ]
11 [ 0, 0, 1 ]
12 [ 1, 0 ]
13 [ 3 ]
14 [ 4 ]
15 [ 0, 0, 0, 0, 0 ]
16 [ 0, 0, 0, 1 ]
17 [ 0, 1, 0 ]
18 [ 0, 2 ]
19 [ 4 ]
20 [ 5 ]
21 [ 0, 0, 0, 0, 0, 0 ]
22 [ 0, 0, 0, 0, 1 ]
23 [ 0, 0, 1, 0 ]
24 [ 0, 0, 2 ]
25 [ 1, 1 ]
26 [ 5 ]
27 [ 6 ]
28 [ 0, 0, 0, 0, 0, 0, 0 ]
29 [ 0, 0, 0, 0, 0, 1 ]
30 [ 0, 0, 0, 1, 0 ]
Warning!
Observe that all of the unencoded values are pretty small. The encoded value is similar in size to the concatenation of all the input values, and so it does grow pretty rapidly; you have to be careful not to exceed JavaScript's limit on exact integer computation. Once the encoded value exceeds this limit (2^53) it will no longer be possible to reverse the encoding. If your input vectors are long and/or the encoded values are large, you'll need to find some kind of bignum support in order to do precise integer computations.
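As a rough illustration of the bignum route, here is a sketch of the same diagonal scheme using JavaScript's built-in BigInt. The helper isqrt and the names combineN and separateN are assumptions made for this example, not part of the code above; isqrt replaces Math.sqrt because Math.sqrt does not accept BigInt values. Like the original functions, the sketch does not handle the empty vector.

```javascript
// Integer square root of a non-negative BigInt, using Newton's method.
function isqrt(n) {
  if (n < 2n) return n;
  let x = n;
  let y = (x + 1n) / 2n;
  while (y < x) {
    x = y;
    y = (x + n / x) / 2n;
  }
  return x;
}

// Diagonal pairing on BigInt values.
const combine = (a, b) => (a + b) * (a + b + 1n) / 2n + a;

// Inverse of combine: s is the diagonal index, floor((sqrt(8m + 1) - 1) / 2).
function separate(m) {
  const s = (isqrt(8n * m + 1n) - 1n) / 2n;
  const a = m - s * (s + 1n) / 2n;
  return [a, s - a];
}

// Same structure as combine_n and separate_n, with the length as a BigInt.
const combineN = v => combine(v.reduce(combine), BigInt(v.length));

function separateN(m) {
  let [r, n] = separate(m);
  const out = new Array(Number(n));
  for (let i = out.length - 1; i > 0; --i) [r, out[i]] = separate(r);
  out[0] = r;
  return out;
}

// The three integers from the question; the encoded value is far above 2^53.
const encoded = combineN([13105n, 705016n, 13n]);
const decoded = separateN(encoded);
console.log(encoded > BigInt(Number.MAX_SAFE_INTEGER), decoded);
```

The inputs must be BigInt literals (note the n suffix), since BigInt and Number values cannot be mixed in arithmetic.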
Alternative combine functions
Another possible implementation of combine is:
let combine = (a, b) => 2**a * 3**b
In fact, using powers of primes, we could dispense with the combine_n sequence, and just produce the combination directly:
combine(a, b, c, d, e, ...) = 2^a * 3^b * 5^c * 7^d * 11^e * ...
(That assumes that the encoded values are strictly positive; if they could be 0, we'd have no way of knowing how long the sequence was because the encoded value does not distinguish between a vector and the same vector with a 0 appended. But that's not a big issue, because if we needed to deal with 0s, we would just add one to all used exponents, as in the formula below.)
combine(a, b, c, d, e, ...) = 2^(a+1) * 3^(b+1) * 5^(c+1) * 7^(d+1) * 11^(e+1) * ...
That is certainly correct and it's very elegant in a theoretical sense. It's the solution which you will find in theoretical CS textbooks because it is much easier to prove uniqueness and reversibility. However, in the real world it is really not practical. Reversing the combination depends on finding the prime factors of the encoded value, and the encoded values are truly enormous, well out of the range of easily representable numbers.
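To make the prime-power idea concrete, here is a small sketch of the variant that adds one to every exponent. The names primes, encodePrimes, and decodePrimes are invented for this example, and BigInt is used because the products outgrow ordinary numbers almost immediately. Decoding uses plain trial division, which is only reasonable for tiny inputs.

```javascript
// Yield successive primes 2, 3, 5, ... as BigInts (trial division).
function* primes() {
  const found = [];
  for (let n = 2n; ; ++n) {
    if (found.every(p => n % p !== 0n)) {
      found.push(n);
      yield n;
    }
  }
}

// Encode non-negative integers as 2^(a+1) * 3^(b+1) * 5^(c+1) * ...
function encodePrimes(values) {
  const gen = primes();
  let m = 1n;
  for (const v of values) m *= gen.next().value ** (BigInt(v) + 1n);
  return m;
}

// Decode by dividing out each successive prime and counting its exponent.
function decodePrimes(m) {
  if (m < 1n) throw new RangeError("not a valid encoding");
  const out = [];
  for (const p of primes()) {
    if (m === 1n) break;
    let e = 0;
    while (m % p === 0n) {
      m /= p;
      ++e;
    }
    // Every prime up to the last one used must appear at least once.
    if (e === 0) throw new RangeError("not a valid encoding");
    out.push(e - 1);
  }
  return out;
}

console.log(encodePrimes([2, 0, 1])); // 600n, that is 2^3 * 3^1 * 5^2
console.log(decodePrimes(600n));      // [ 2, 0, 1 ]
```

Even this three-element example with single-digit values already produces 600, which hints at how quickly the encoded values grow.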
Another possibility is precisely the one you mention in the question: simply put a separator between successive values. One simple way to do this is to rewrite the values to encode in base 9 (or base 15) and then increment all the digit values, so that the digit 0 is not present in any encoded value. Then we can put 0s between the encoded values and read the result in base 10 (or base 16).
Neither of these solutions has the property that every non-negative integer is the encoding of some vector. (The second one almost has that property, and it's a useful exercise to figure out which integers are not possible encodings, and then fix the encoding algorithm to avoid that problem.)
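The separator scheme described above could look roughly like this for the base 9 / base 10 case. The names encodeSep and decodeSep are made up for this sketch, the values are assumed to be non-negative integers within the exact Number range, and the empty vector is deliberately not handled. The result is kept as a string of decimal digits, so its length is not limited by 2^53.

```javascript
// Write each value in base 9, add 1 to every digit (so no digit is 0),
// then join the pieces with "0" as the separator.
const encodeSep = values =>
  values
    .map(v => [...v.toString(9)].map(d => Number(d) + 1).join(""))
    .join("0");

// Split on "0", subtract 1 from every digit, and read each piece in base 9.
const decodeSep = text =>
  text
    .split("0")
    .map(part => parseInt([...part].map(d => Number(d) - 1).join(""), 9));

console.log(encodeSep([13]));   // "25", because 13 is "14" in base 9
console.log(encodeSep([0, 8])); // "109"
console.log(decodeSep(encodeSep([13105, 705016, 13])));
```

Strings such as "100" or "01" can never be produced by encodeSep, which is one way to start on the exercise mentioned above.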

F# Charting No Automatic Axis Range Setting?

It's not really an issue, more of a question. I am using FSharp.Charting to graph a few quick things. One thing I noticed is that the chart doesn't automatically set the axis limits for you. Say I have a list of numbers that has values between 100,000 and 200,000. The y-axis will still be based at 0. It doesn't scale to give you a good view of the data. You have to do this yourself. Or maybe there is a way, and I just haven't figured it out yet. Has anyone else run into this issue? Any suggestions?
I have searched the FSharp charting code on GitHub and found nothing built in that can do automatic alignment of any axis. The best one can do is do it manually or write a function to look at all the values and then set them based on that.
Since you did not show in your question how to set it manually I will state it here for those that don't know how it is done.
To manually set the Y axis use WithYAxis
let xs1 = [ for x in (double)(-100.0) .. 1.0 .. 100.0 do yield x]
let ys1 = xs1 |> List.map (fun x -> x**4.00)
let values1 = List.zip xs1 ys1
Chart.Line(values1)
    .WithXAxis(Min=(-30.0), Max=(30.0), MajorTickMark = ChartTypes.TickMark(Interval=10.0, IntervalOffset = 5.0, LineWidth = 2))
    .WithYAxis(Min=(100000.0), Max=(200000.0), MajorTickMark = ChartTypes.TickMark(Interval=20000.0, IntervalOffset = 10000.0, LineWidth = 2))

Understanding F# memory consumption

I've been toying around with F# lately and wrote the little snippet below. It just creates a number of randomized 3D vectors, puts them into a list, maps each vector to its length and sums up all those values.
Running the program (as a Release build .exe, not interactive), the binary consumes in this particular case (10 mio vectors) roughly 550 MB RAM. One Vec3 object should account for 12 bytes (or 16 assuming some alignment takes place). Even if you do the rough math with 32 bytes to account for some book-keeping overhead ((bytes per object * 10 mio) / 1024 / 1024), you're still 200 MB off the actual consumption. Naively I'd assume to have 10 mio * 4 bytes per single in the end, since the Vec3 objects are 'mapped away'.
My guess so far: either I keep one (or several) copies of my list somewhere and I'm not aware of that, or some intermediate results never get garbage collected? I can't imagine that inheriting from System.Object brings in so much overhead.
Could someone point me in the right direction with this?
TiA
type Vec3(x: single, y: single, z:single) =
    let mag = sqrt(x*x + y*y + z*z)
    member self.Magnitude = mag
    override self.ToString() = sprintf "[%f %f %f]" x y z
let how_much = 10000000
let mutable rng = System.Random()
let sw = new System.Diagnostics.Stopwatch()
sw.Start()
let random_vec_iter len =
    let mutable result = []
    for x = 1 to len do
        let mutable accum = []
        for i = 1 to 3 do
            accum <- single(rng.NextDouble())::accum
        result <- Vec3(accum.[0], accum.[1], accum.[2])::result
    result
let sum_len_func = List.reduce (fun x y -> x+y)
let map_to_mag_func = List.map (fun (x:Vec3) -> x.Magnitude)
[<EntryPoint>]
let main argv =
    printfn "Hello, World"
    let res = sum_len_func (map_to_mag_func (random_vec_iter(how_much)))
    printfn "doing stuff with %i items took %i, result is %f" how_much (sw.ElapsedMilliseconds) res
    System.Console.ReadKey() |> ignore
    0 // return an integer exit code
First, your Vec3 is a reference type, not a value type (not a struct), so you hold a pointer on top of your 12 bytes (12+16). Then the list is a singly linked list, so that is another 16 bytes for a .NET reference. Finally, your List.map will create an intermediate list.

Filling missing data after outer join

I have two time series which are at the same sampling rate. I would like to perform an outer join and then fill in any missing data (post outer join, there can be points in time where data exists in one series but not the other even though they are the same sampling rate) with the most recent previous value.
How can I perform this operation using Deedle?
Edit:
Based on this, I suppose you can re-sample before the join like so:
// Get the most recent value, sampled at 2 hour intervals
someSeries
|> Series.sampleTimeInto (TimeSpan(2, 0, 0)) Direction.Backward Series.lastValue
After doing this you can safely Join. Perhaps there is another way?
You should be able to perform the outer join on the original series (it is better to turn them into frames, because then you'll get a nice multi-column frame) and then fill the missing values with Frame.fillMissing.
// Note that s1[2] is undefined and s2[3] is undefined
let s1 = series [ 1=>1.0; 3=>3.0; 5=>5.0 ]
let s2 = series [ 1=>1.1; 2=>2.2; 5=>5.5 ]
// Build frames to make joining easier
let f1, f2 = frame [ "S1" => s1 ], frame [ "S2" => s2 ]
// Perform outer join and then fill the missing data
let f = f1.Join(f2, JoinKind.Outer)
let res = f |> Frame.fillMissing Direction.Forward
The intermediate frame with missing values and the final result look like this:
val it : Frame<int,string> =
     S1        S2
1 -> 1         1.1
2 -> <missing> 2.2
3 -> 3         <missing>
5 -> 5         5.5
>
val it : Frame<int,string> =
     S1 S2
1 -> 1  1.1
2 -> 1  2.2
3 -> 3  2.2
5 -> 5  5.5
Note that the result can still contain missing values - if the first value is missing, the fillMissing function has no previous value to propagate and so the series may start with some missing values.
