Understanding F# memory consumption

I've been toying around with F# lately and wrote the little snippet below. It creates a number of randomized 3D vectors, puts them into a list, maps each vector to its length, and sums up all those values.
Running the program (as a Release build .exe, not interactive), the binary consumes in this particular case (10 million vectors) roughly 550 MB of RAM. One Vec3 object should account for 12 bytes (or 16, assuming some alignment takes place). Even doing the rough math with 32 bytes per object to allow for some book-keeping overhead ((32 * 10 million) / 1024 / 1024), you're still 200 MB short of the actual consumption. Naively, I'd expect to end up with 10 million * 4 bytes per single, since the Vec3 objects are 'mapped away'.
My guess so far: either I keep one or more copies of my list somewhere without being aware of it, or some intermediate results never get garbage collected. I can't imagine that inheriting from System.Object brings in so much overhead.
Could someone point me in the right direction?
TiA
type Vec3(x: single, y: single, z: single) =
    let mag = sqrt (x*x + y*y + z*z)
    member self.Magnitude = mag
    override self.ToString() = sprintf "[%f %f %f]" x y z

let how_much = 10000000
let rng = System.Random()
let sw = new System.Diagnostics.Stopwatch()
sw.Start()

let random_vec_iter len =
    let mutable result = []
    for x = 1 to len do
        let mutable accum = []
        for i = 1 to 3 do
            accum <- single (rng.NextDouble()) :: accum
        result <- Vec3(accum.[0], accum.[1], accum.[2]) :: result
    result

let sum_len_func = List.reduce (fun (x: single) y -> x + y)
let map_to_mag_func = List.map (fun (x: Vec3) -> x.Magnitude)

[<EntryPoint>]
let main argv =
    printfn "Hello, World"
    let res = sum_len_func (map_to_mag_func (random_vec_iter how_much))
    printfn "doing stuff with %i items took %i, result is %f" how_much sw.ElapsedMilliseconds res
    System.Console.ReadKey() |> ignore
    0 // return an integer exit code

First, your Vec3 is a reference type, not a value type (not a struct). So on top of the 12 bytes of payload, each instance carries object-header overhead and is reached through a pointer (12 + 16 bytes, roughly). Then the list is a singly-linked list, so each element costs another cons cell on the heap, about 16 bytes more for the .NET references. Finally, your List.map creates a whole intermediate list rather than transforming the values in place.
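If the goal is just the sum of the magnitudes, here is a minimal sketch of a lower-allocation variant (the names are illustrative, not from the original code): make Vec3 a struct so each vector is stored inline rather than as a separate heap object, and stream with Seq so no 10-million-element list is ever materialized.

//Sketch: struct Vec3 (12 bytes of fields stored inline, no per-instance
//header) plus a streaming pipeline, so no large list is ever built.
[<Struct>]
type Vec3 =
    val x: single
    val y: single
    val z: single
    new(x, y, z) = { x = x; y = y; z = z }
    member v.Magnitude = sqrt (v.x * v.x + v.y * v.y + v.z * v.z)

let sumOfMagnitudes count =
    let rng = System.Random()
    Seq.init count (fun _ ->
        Vec3(single (rng.NextDouble()),
             single (rng.NextDouble()),
             single (rng.NextDouble())))
    |> Seq.sumBy (fun v -> v.Magnitude)    //streams; nothing is retained between elements

With this shape the working set stays flat as the count grows, because each vector is consumed as soon as it is produced.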

Related

Why does F#'s auto-generated GetHashCode produce so many collisions?

In our F# code, the auto-generated GetHashCode implementation shows very bad performance and a high collision rate. Is this a problem in F#'s GetHashCode generation, or just an edge case?
open System
open System.Collections.Generic

let check keys e name =
    let dict = new Dictionary<_,_>(Array.length keys, e) //, HashIdentity.Structural)
    let stopWatch = System.Diagnostics.Stopwatch.StartNew()
    let add k = dict.Add(k, 1.02)
    Array.iter add keys
    stopWatch.Stop()
    let hashes = new HashSet<int>()
    let add_hash x = hashes.Add(e.GetHashCode(x)) |> not
    let collisions = Array.filter add_hash keys |> Array.length
    printfn "%s %f sec %f collisions" name stopWatch.Elapsed.TotalSeconds (double collisions / double keys.Length)

type StructTuple<'T,'T2> =
    struct
        val fst: 'T
        val snd: 'T2
        new(fst: 'T, snd: 'T2) = { fst = fst; snd = snd }
    end

let bad_keys = seq {
    let rnd = new Random()
    while true do
        let j = uint32 (rnd.Next(0, 3346862))
        let k = uint16 (rnd.Next(0, 658))
        yield StructTuple(j, k)
}

let good_keys = seq {
    for k in 0us .. 658us do
        for j in 0u .. 3346862u do
            yield StructTuple(j, k)
}

module CmpHelpers =
    let inline combine (h1: int) (h2: int) = (h1 <<< 5) + h1 ^^^ h2

type StructTupleComparer<'T,'T2>() =
    let cmparer = EqualityComparer<Object>.Default
    interface IEqualityComparer<StructTuple<'T,'T2>> with
        member this.Equals(a, b) = cmparer.Equals(a.fst, b.fst) && cmparer.Equals(a.snd, b.snd)
        member this.GetHashCode(x) = CmpHelpers.combine (cmparer.GetHashCode(x.fst)) (cmparer.GetHashCode(x.snd))

type AutoGeneratedStructTupleComparer<'T,'T2>() =
    let cmparer = LanguagePrimitives.GenericEqualityComparer
    interface IEqualityComparer<StructTuple<'T,'T2>> with
        member this.Equals(a: StructTuple<'T,'T2>, b: StructTuple<'T,'T2>) =
            LanguagePrimitives.HashCompare.GenericEqualityERIntrinsic<'T> a.fst b.fst
            && LanguagePrimitives.HashCompare.GenericEqualityERIntrinsic<'T2> a.snd b.snd
        member this.GetHashCode(x: StructTuple<'T,'T2>) =
            let mutable num = 0
            num <- -1640531527 + (LanguagePrimitives.HashCompare.GenericHashWithComparerIntrinsic<'T2> cmparer x.snd + ((num <<< 6) + (num >>> 2)))
            -1640531527 + (LanguagePrimitives.HashCompare.GenericHashWithComparerIntrinsic<'T> cmparer x.fst + ((num <<< 6) + (num >>> 2)))

let uniq (sq: seq<'a>) = Array.ofSeq (new HashSet<_>(sq))

[<EntryPoint>]
let main argv =
    let count = 15000000
    let keys = good_keys |> Seq.take count |> uniq
    printfn "good keys"
    check keys (new StructTupleComparer<_,_>()) "struct custom"
    check keys HashIdentity.Structural "struct auto"
    check keys (new AutoGeneratedStructTupleComparer<_,_>()) "struct auto explicit"
    let keys = bad_keys |> Seq.take count |> uniq
    printfn "bad keys"
    check keys (new StructTupleComparer<_,_>()) "struct custom"
    check keys HashIdentity.Structural "struct auto"
    check keys (new AutoGeneratedStructTupleComparer<_,_>()) "struct auto explicit"
    Console.ReadLine() |> ignore
    0 // return an integer exit code
Output:
good keys
struct custom 1.506934 sec 0.000000 collisions
struct auto 4.832881 sec 0.776863 collisions
struct auto explicit 3.166931 sec 0.776863 collisions
bad keys
struct custom 3.631251 sec 0.061893 collisions
struct auto 10.340693 sec 0.777034 collisions
struct auto explicit 8.893612 sec 0.777034 collisions
I am no expert on the overall algorithm used to produce auto-generated Equals and GetHashCode, but it just seems to produce something non-optimal here. I don't know offhand if that is normal for a general-purpose auto-generated implementation, or if there are practical ways of auto-generating close-to-optimal implementations reliably.
It's worth noting that if you just use the standard tuple, the auto-generated hashing and comparison give the same collision rate and equal performance as your custom implementation (see the sketch after my numbers below). And using the latest F# 4.0 bits (where there has recently been a significant perf improvement in this area), the auto-generated code becomes significantly faster than the custom implementation.
My numbers:
// F# 3.1, struct tuples
good keys
custom 0.951254 sec 0.000000 collisions
auto 2.737166 sec 0.776863 collisions
bad keys
custom 2.923103 sec 0.061869 collisions
auto 7.706678 sec 0.777040 collisions
// F# 3.1, standard tuples
good keys
custom 0.995701 sec 0.000000 collisions
auto 0.965949 sec 0.000000 collisions
bad keys
custom 3.091821 sec 0.061869 collisions
auto 2.924721 sec 0.061869 collisions
// F# 4.0, standard tuples
good keys
custom 1.018672 sec 0.000000 collisions
auto 0.619066 sec 0.000000 collisions
bad keys
custom 3.082988 sec 0.061869 collisions
auto 1.829720 sec 0.061869 collisions
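For reference, switching to standard tuples is just a change of key type; a minimal sketch, assuming the same check and uniq helpers from the question:

//Standard (reference) tuples as keys; the auto-generated structural
//hashing handles these well, per the measurements above.
let good_keys_tuple = seq {
    for k in 0us .. 658us do
        for j in 0u .. 3346862u do
            yield (j, k)
}

//let keys = good_keys_tuple |> Seq.take count |> uniq
//check keys HashIdentity.Structural "tuple auto"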
I opened an issue in the F# issue tracker; it was accepted as a bug: https://github.com/fsharp/fsharp/issues/343

Huge performance difference running the same F# code on fsi 4.0.30319.1 and 2.0.0.0

I am running the same F# code with the two versions of fsi.exe that I can find under my FSharp-2.0.0.0 install:
C:\Program Files\FSharp-2.0.0.0\bin\fsi.exe - Microsoft (R) F# 2.0 Interactive build 2.0.0
C:\Program Files\FSharp-2.0.0.0\v4.0\bin\fsi.exe - Microsoft (R) F# 2.0 Interactive build 4.0.30319.1
What I find is that the same code runs about three times faster on the 2.0.0.0 build. Does this make any sense? Is there something messed up with my environment, or possibly with my code?
Incidentally, the reason I am trying to use the v4.0 build is to be able to use the TPL and compare sequential and parallel implementations of my code. When my parallel implementation was much slower than the sequential one, after much head-scratching I realized that the parallel version was running under a different fsi.exe, and that's when I realized that the same (sequential) version of the code is much slower under version 4.0.
Thanks in advance for any help
IS
The code:
module Options

//Gaussian module is from http://fssnip.net/3g, by Tony Lee
open Gaussian

//The European Option type
type EuropeanOption =
    { StockCode: string
      StockPrice: float
      ExercisePrice: float
      NoRiskReturn: float
      Volatility: float
      Time: float }

//Read one row from the file and return a European Option
//File format is:
//StockCode<TAB>StockPrice,ExercisePrice,NoRiskReturn,Volatility,Time
let convertDataRow (line: string) =
    let option = List.ofSeq (line.Split('\t'))
    match option with
    | code :: data :: _ ->
        let dataValues = data.Split(',')
        { StockCode = code
          StockPrice = float dataValues.[0]
          ExercisePrice = float dataValues.[1]
          NoRiskReturn = float dataValues.[2]
          Volatility = float dataValues.[3]
          Time = float dataValues.[4] }
    | _ -> failwith "Incorrect Data Format"

//Returns the future value of an option.
//0 if the exercise price is greater than the sum of the stock price and the calculated asset price at expiration.
let futureValue sp ep nrr vol t =
    //TODO: Is there no better way to get the value from a one-element sequence?
    let assetPriceAtExpiration =
        sp + sp*nrr*t + sp*sqrt(t)*vol*(Gaussian.whiteNoise |> Seq.take 1 |> List.ofSeq |> List.max)
    [0.0; assetPriceAtExpiration - ep] |> List.max

//Sequence to hold the values generated by the Monte Carlo iterations
//50,000 iterations is the minimum for a good approximation to the Black-Scholes equation
let priceValues count sp ep nrr vol t =
    seq { for i in 1 .. count -> futureValue sp ep nrr vol t }

//Discount a future value to a present value given the risk-free rate and the time in years
let discount value noriskreturn time =
    value * exp (-1.0 * noriskreturn * time)

//Get the price for a European Option and a given number of Monte Carlo iterations (use numIters >= 50000)
let priceOption europeanOption numIters =
    let futureValuesSeq =
        priceValues numIters europeanOption.StockPrice europeanOption.ExercisePrice
                    europeanOption.NoRiskReturn europeanOption.Volatility europeanOption.Time
    //The simulated future value is just the average of all the Monte Carlo runs
    let presentValue =
        discount (futureValuesSeq |> List.ofSeq |> List.average) europeanOption.NoRiskReturn europeanOption.Time
    //Return the stock code and the calculated present value
    europeanOption.StockCode + "_to_" + string europeanOption.Time + "_years \t" + string presentValue

module Program =
    open Options
    open System
    open System.Diagnostics
    open System.IO

    //Write to a file
    let writeFile path contentsArray =
        File.WriteAllLines(path, contentsArray |> Array.ofList)

    //TODO: This whole "method" is sooooo procedural.... is there a more functional way?
    //Unique code for each run
    //TODO: Something shorter, please
    let runcode =
        string DateTime.Now.Month + "_" + string DateTime.Now.Day + "_" + string DateTime.Now.Hour
        + "_" + string DateTime.Now.Minute + "_" + string DateTime.Now.Second
    let outputFile = @"C:\TMP\optionpricer_results_" + runcode + ".txt"
    let statsfile = @"C:\TMP\optionpricer_stats_" + runcode + ".txt"

    printf "Starting"
    let mutable stats = ["Starting at: [" + string DateTime.Now + "]"]
    let stopWatch = Stopwatch.StartNew()

    //Read the file
    let lines = List.ofSeq (File.ReadAllLines(@"C:\tmp\9000.txt"))
    stats <- ("Read input file done at: [" + string stopWatch.Elapsed.TotalMilliseconds + "]") :: stats
    printfn "%f" stopWatch.Elapsed.TotalMilliseconds

    //Build the list of European Options
    let options = lines |> List.map convertDataRow
    stats <- ("Created Options done at: [" + string stopWatch.Elapsed.TotalMilliseconds + "]") :: stats
    printfn "%f" stopWatch.Elapsed.TotalMilliseconds

    //Calculate the option prices
    let results = List.map (fun o -> priceOption o 50000) options
    stats <- ("Option prices calculated at: [" + string stopWatch.Elapsed.TotalMilliseconds + "]") :: stats
    printfn "%f" stopWatch.Elapsed.TotalMilliseconds

    //Write results and statistics
    writeFile outputFile results
    stats <- ("Output file written at: [" + string stopWatch.Elapsed.TotalMilliseconds + "]") :: stats
    stats <- ("Total Elapsed Time (minus stats file write): [" + string (stopWatch.Elapsed.TotalMilliseconds / 60000.0) + "] minutes") :: stats
    printfn "%f" stopWatch.Elapsed.TotalMilliseconds
    writeFile statsfile (stats |> List.rev)
    stopWatch.Stop()
    ignore (Console.ReadLine())
I haven't run your code, but it looks like you're creating lots of linked lists. That is very inefficient. Note also that the representation of F# lists was changed in recent years, and the new representation is slower.
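If the lists really are the bottleneck, the hot path can be streamed instead; a minimal sketch, assuming the futureValue function from the question (the function name here is mine):

//Average the Monte Carlo draws in a single streaming pass; no
//intermediate list is ever materialized, unlike priceValues + List.ofSeq.
let priceValuesAvg count sp ep nrr vol t =
    Seq.init count (fun _ -> futureValue sp ep nrr vol t)
    |> Seq.average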

List comprehensions with float iterator in F#

Consider the following code:
let dl = 9.5 / 11.
let min = 21.5 + dl
let max = 40.5 - dl
let a = [ for z in min .. dl .. max -> z ] // should have 21 elements
let b = a.Length
"a" should have 21 elements but has got only 20 elements. The "max - dl" value is missing. I understand that float numbers are not precise, but I hoped that F# could work with that. If not then why F# supports List comprehensions with float iterator? To me, it is a source of bugs.
Online trial: http://tryfs.net/snippets/snippet-3H
Converting to decimals and looking at the numbers, it seems the 21st item would 'overshoot' max:
let dl = 9.5m / 11.m
let min = 21.5m + dl
let max = 40.5m - dl
let a = [ for z in min .. dl .. max -> z ] // should have 21 elements
let b = a.Length
let lastelement = List.nth a 19
let onemore = lastelement + dl
let overshoot = onemore - max
That is probably due to the limited precision of let dl = 9.5m / 11.m.
To get rid of this compounding error, you'll have to use another number system, i.e. rationals. The F# PowerPack comes with a BigRational type that can be used like so:
let dl = 95N / 110N
let min = 215N / 10N + dl
let max = 405N / 10N - dl
let a = [ for z in min .. dl .. max -> z ] // Has 21 elements
let b = a.Length
Properly handling float precision issues can be tricky. You should not rely on float equality, which is implicitly what the list comprehension does for the last element. List comprehensions on floats are useful when you generate an infinite stream; in other cases, you should pay attention to the last comparison.
If you want a fixed number of elements, and to include both the lower and upper endpoints, I suggest you write this kind of function:
let range from to_ count =
    assert (count > 1)
    let count = count - 1
    [ for i = 0 to count do yield from + float i * (to_ - from) / float count ]
range 21.5 40.5 21
When I know the last element should be included, I sometimes do:
let a = [ for z in min .. dl .. max + dl*0.5 -> z ]
I suspect the problem is the precision of floating-point values. F# adds dl to the current value each time and checks whether current <= max. Because of precision problems, it can jump past max and then check whether max + ε <= max, which yields false. So the result has only 20 items, not 21.
After running your code, if you do:
> compare a.[19] max;;
val it : int = -1
It means max is greater than a.[19]
If we do the calculations the same way the range operator does, but group them in two different ways, and then compare the results:
> compare (21.5+dl+dl+dl+dl+dl+dl+dl+dl) ((21.5+dl)+(dl+dl+dl+dl+dl+dl+dl));;
val it : int = 0
> compare (21.5+dl+dl+dl+dl+dl+dl+dl+dl+dl) ((21.5+dl)+(dl+dl+dl+dl+dl+dl+dl+dl));;
val it : int = -1
In this sample you can see how adding dl eight times gives exactly the same value regardless of grouping, but with nine additions the result changes depending on the grouping.
You're doing it 20 times.
So if you use the range operator with floats, you should be aware of this precision problem; the same applies to any other calculation with floats.
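Putting those pieces together with the question's own values (a, dl, and max as defined at the top of the question):

let lastGenerated = a.[a.Length - 1]              //a.[19]: the list stopped at 20 elements
compare lastGenerated max |> printfn "%d"         //-1: the last generated value is still below max
compare (lastGenerated + dl) max |> printfn "%d"  //1: the next step overshoots max, so enumeration stops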

Using Array.Parallel.map to decrease running time

Hello everyone
I have converted a project that paints the Mandelbrot set from C# to F#.
Unfortunately it takes around one minute to render a full screen, so I am trying to find ways to speed it up.
One call takes almost all of the time:
Array.map (fun x -> this.colorArray.[CalcZ x]) xyArray
xyArray : (double * double) [] (an array of tuples of doubles)
colorArray is an array of int32, length = 255
CalcZ is defined as:
let CalcZ (coord: double * double) =
    let maxIterations = 255
    let rec CalcZHelper (xCoord: double) (yCoord: double) // line break inserted
                        (x: double) (y: double) iters =
        let newx = x * x + xCoord - y * y
        let newy = 2.0 * x * y + yCoord
        match newx, newy, iters with
        | _ when Math.Abs newx > 2.0 -> iters
        | _ when Math.Abs newy > 2.0 -> iters
        | _ when iters = maxIterations -> iters
        | _ -> CalcZHelper xCoord yCoord newx newy (iters + 1)
    CalcZHelper (fst coord) (snd coord) (fst coord) (snd coord) 0
Since I am only using around half of the processor capacity, one idea is to use more threads, specifically Array.Parallel.map, which translates to System.Threading.Tasks.Parallel.
Now my question:
A naive solution would be:
Array.Parallel.map (fun x -> this.colorArray.[CalcZ x]) xyArray
but that took twice the time. How can I rewrite this to take less time, or is there some other way to utilize the processor better?
Thanks in advance
Gorgen
---edit---
the function that is calling CalcZ looks like this:
let GetMatrix =
    let halfX = double bitmap.PixelWidth * scale / 2.0
    let halfY = double bitmap.PixelHeight * scale / 2.0
    let rect: Mandelbrot.Rectangle =
        { xMax = centerX + halfX; xMin = centerX - halfX;
          yMax = centerY + halfY; yMin = centerY - halfY }
    let size: Mandelbrot.Size =
        { x = bitmap.PixelWidth; y = bitmap.PixelHeight }
    let xyList = GenerateXYTuple rect size
    let xyArray = Array.ofList xyList
    Array.map (fun x -> this.colorArray.[CalcZ x]) xyArray

let region: Int32Rect = new Int32Rect(0, 0, bitmap.PixelWidth, bitmap.PixelHeight)
bitmap.WritePixels(region, GetMatrix, bitmap.PixelWidth * 4, region.X, region.Y)
GenerateXYTuple:
let GenerateXYTuple (rect: Rectangle) (pixels: Size) =
    let xStep = (rect.xMax - rect.xMin) / double pixels.x
    let yStep = (rect.yMax - rect.yMin) / double pixels.y
    [ for column in 0 .. pixels.y - 1 do
        for row in 0 .. pixels.x - 1 do
            yield (rect.xMin + xStep * double row,
                   rect.yMax - yStep * double column) ]
---edit---
Following a suggestion from kvb (thanks a lot!) in a comment on my question, I built the program in Release mode, which generally sped things up.
Just building in Release took me from 50 s to around 30 s; moving all the transforms on the array together so everything happens in one pass made it around 10 seconds faster; finally, using Array.Parallel.init brought me to just over 11 seconds.
What I learnt from this: use Release mode when timing things and when using parallel constructs.
One more time, thanks for the help I have received.
--edit--
By using SSE assembler from a native DLL, I have been able to slash the time from around 12 seconds to 1.2 seconds for a full screen of the most computationally intensive points. Unfortunately I don't have a graphics processor...
Gorgen
Per the comment on the original post, here is the code I wrote to test the function. The fast version only takes a few seconds on my average workstation. It is fully sequential, with no parallel code.
It's moderately long, so I posted it on another site: http://pastebin.com/Rjj8EzCA
I suspect that the slowdown you are seeing is in the rendering code.
I don't think the Array.Parallel.map function (which uses Parallel.For from .NET 4.0 under the covers) should have trouble parallelizing the operation if it runs a simple function ~1 million times. However, I have encountered some weird performance behavior in a similar case where F# didn't optimize the call to the lambda function (in some way).
I'd try taking a copy of the Parallel.map function from the F# sources and marking it inline. Add the following map function to your code and use it instead of the one from the F# libraries:
open System.Threading.Tasks

let inline map (f: 'T -> 'U) (array: 'T[]) : 'U[] =
    let inputLength = array.Length
    let result = Array.zeroCreate inputLength
    Parallel.For(0, inputLength, fun i ->
        result.[i] <- f array.[i]) |> ignore
    result
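The hot call from the question would then read the same as before, just against this inlined map (assuming the same this.colorArray and xyArray as above):

let colors = map (fun x -> this.colorArray.[CalcZ x]) xyArray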
As an aside, it looks like you're generating an array of coordinates and then mapping it to an array of results. You don't need to create the coordinate array if you use the init function instead of map: Array.Parallel.init 1000 (fun y -> Array.init 1000 (fun x -> this.colorArray.[CalcZ (x, y)]))
EDIT: The following may be inaccurate:
Your problem could be that you call a tiny function a million times, causing the scheduling overhead to overwhelm the actual work you're doing. You should partition the array into much larger chunks so that each individual task takes a millisecond or so. You can use an array of arrays, so that you call Array.Parallel.map on the outer array and Array.map on the inner arrays; that way each parallel operation works on a whole row of pixels instead of a single pixel (see the sketch below).
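A sketch of that row-chunking idea, reusing the Rectangle/Size records and the coordinate math from GenerateXYTuple in the question (the function name is mine, not the poster's):

//Parallelize across rows (outer array); each task colors a whole row.
let renderChunked (rect: Rectangle) (pixels: Size) (colorArray: int[]) =
    let xStep = (rect.xMax - rect.xMin) / double pixels.x
    let yStep = (rect.yMax - rect.yMin) / double pixels.y
    Array.Parallel.init pixels.y (fun column ->
        Array.init pixels.x (fun row ->
            colorArray.[CalcZ (rect.xMin + xStep * double row,
                               rect.yMax - yStep * double column)]))
    |> Array.concat    //flatten back into one pixel buffer for WritePixels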

F#/"Accelerator v2" DFT algorithm implementation probably incorrect

I'm trying to experiment with software-defined radio concepts. From this article I've tried to implement a GPU-parallel Discrete Fourier Transform.
I'm pretty sure I could pre-calculate 90 degrees of the sin(i) and cos(i) values and then just flip and repeat rather than doing what this code does, and that this would speed it up. But so far, I don't even think I'm getting correct answers. An all-zeros input gives a 0 result, as I'd expect, but an input of all 0.5 gives 78.9985886f (I'd expect a 0 result in this case too). Basically, I'm just generally confused. I don't have any good input data, and I don't know what to do with the result or how to verify it.
This question is related to my other post here.
open Microsoft.ParallelArrays
open System
// X64MulticoreTarget is faster on my machine, unexpectedly
let target = new DX9Target() // new X64MulticoreTarget()
ignore(target.ToArray1D(new FloatParallelArray([| 0.0f |]))) // Dummy operation to warm up the GPU
let stopwatch = new System.Diagnostics.Stopwatch() // For benchmarking
let Hz = 50.0f
let fStep = (2.0f * float32(Math.PI)) / Hz
let shift = 0.0f // offset, once we have to adjust for the last batch of samples of a stream
// If I knew that the periodic function is periodic
// at whole-number intervals, I think I could keep
// shift within a smaller range to support streams
// without overflowing shift - but I haven't
// figured that out
//let elements = 8192 // maximum for a 1D array - makes sense as 2^13
//let elements = 7240 // maximum on my machine for a 2D array, but why?
let elements = 7240
// need good data!!
let buffer : float32[,] = Array2D.init<float32> elements elements (fun i j -> 0.5f) //(float32(i * elements) + float32(j)))
let input = new FloatParallelArray(buffer)
let seqN : float32[,] = Array2D.init<float32> elements elements (fun i j -> (float32(i * elements) + float32(j)))
let steps = new FloatParallelArray(seqN)
let shiftedSteps = ParallelArrays.Add(shift, steps)
let increments = ParallelArrays.Multiply(fStep, steps)
let cos_i = ParallelArrays.Cos(increments) // Real component series
let sin_i = ParallelArrays.Sin(increments) // Imaginary component series
stopwatch.Start()
// From the documentation, I think ParallelArrays.Multiply does standard element by
// element multiplication, not matrix multiplication
// Then we sum each element for each complex component (I don't understand the relationship
// of this, or the importance of the generalization to complex numbers)
let real = target.ToArray1D(ParallelArrays.Sum(ParallelArrays.Multiply(input, cos_i))).[0]
let imag = target.ToArray1D(ParallelArrays.Sum(ParallelArrays.Multiply(input, sin_i))).[0]
printf "%A in " ((real * real) + (imag * imag)) // sum the squares for the presence of the frequency
stopwatch.Stop()
printfn "%A" stopwatch.ElapsedMilliseconds
ignore (System.Console.ReadKey())
I share your surprise that your answer is not closer to zero. I'd suggest writing naive code to perform your DFT in F# and seeing if you can track down the source of the discrepancy.
Here's what I think you're trying to do:
let N = 7240
let F = 1.0f / 50.0f
let pi = single System.Math.PI
let signal = [| for i in 1 .. N*N -> 0.5f |]
let real =
    seq { for i in 0 .. N*N - 1 -> signal.[i] * cos (2.0f * pi * F * single i) }
    |> Seq.sum
let img =
    seq { for i in 0 .. N*N - 1 -> signal.[i] * sin (2.0f * pi * F * single i) }
    |> Seq.sum
let power = real*real + img*img
Hopefully you can use this naive code to get a better intuition for how the Accelerator code ought to behave, which can guide your testing of it. Keep in mind that part of the reason for the discrepancy may simply be the precision of the calculations: there are ~52 million elements in your arrays, so accumulating a total error of 79 may not actually be too bad. FWIW, I get a power of ~0.05 when running the above single-precision code, but a power of ~4e-18 when using equivalent code with double-precision numbers.
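For comparison, a sketch of that double-precision variant (the assumption being that only the numeric types change from the snippet above):

//Same naive DFT check, but accumulated in double precision.
let N = 7240
let F = 1.0 / 50.0
let pi = System.Math.PI
let signal = [| for i in 1 .. N*N -> 0.5 |]
let real = seq { for i in 0 .. N*N - 1 -> signal.[i] * cos (2.0 * pi * F * float i) } |> Seq.sum
let img  = seq { for i in 0 .. N*N - 1 -> signal.[i] * sin (2.0 * pi * F * float i) } |> Seq.sum
let power = real*real + img*img    //~4e-18 here, vs ~0.05 in single precision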
Two suggestions:
ensure you're not somehow confusing degrees with radians
try doing it sans parallelism, or just with F#'s asyncs for parallelism
(In F#, if you have an array of floats
let a : float[] = ...
then you can 'add a step to all of them in parallel' to produce a new array with
let aShift =
    a |> Array.map (fun x -> async { return x + shift })
      |> Async.Parallel
      |> Async.RunSynchronously
(though I expect this might be slower than just doing a synchronous loop).)
