Unwrapping nested loops in F# - f#

I've been struggling with the following code. It's an F# implementation of the Forward-Euler algorithm used for modelling stars moving in a gravitational field.
let force (b1:Body) (b2:Body) =
let r = (b2.Position - b1.Position)
let rm = (float32)r.MagnitudeSquared + softeningLengthSquared
if (b1 = b2) then
VectorFloat.Zero
else
r * (b1.Mass * b2.Mass) / (Math.Sqrt((float)rm) * (float)rm)
member this.Integrate(dT, (bodies:Body[])) =
for i = 0 to bodies.Length - 1 do
for j = (i + 1) to bodies.Length - 1 do
let f = force bodies.[i] bodies.[j]
bodies.[i].Acceleration <- bodies.[i].Acceleration + (f / bodies.[i].Mass)
bodies.[j].Acceleration <- bodies.[j].Acceleration - (f / bodies.[j].Mass)
bodies.[i].Position <- bodies.[i].Position + bodies.[i].Velocity * dT
bodies.[i].Velocity <- bodies.[i].Velocity + bodies.[i].Acceleration * dT
While this works it isn't exactly "functional". It also suffers from horrible performance, it's 2.5 times slower than the equivalent c# code. bodies is an array of structs of type Body.
The thing I'm struggling with is that force() is an expensive function so usually you calculate it once for each pair and rely on the fact that Fij = -Fji. But this really messes up any loop unfolding etc.
Suggestions gratefully received! No this isn't homework...
Thanks,
Ade
UPDATED: To clarify Body and VectorFloat are defined as C# structs. This is because the program interops between F#/C# and C++/CLI. Eventually I'm going to get the code up on BitBucket but it's a work in progress I have some issues to sort out before I can put it up.
[StructLayout(LayoutKind.Sequential)]
public struct Body
{
public VectorFloat Position;
public float Size;
public uint Color;
public VectorFloat Velocity;
public VectorFloat Acceleration;
'''
}
[StructLayout(LayoutKind.Sequential)]
public partial struct VectorFloat
{
public System.Single X { get; set; }
public System.Single Y { get; set; }
public System.Single Z { get; set; }
}
The vector defines the sort of operators you'd expect for a standard Vector class. You could probably use the Vector3D class from the .NET framework for this case (I'm actually investigating cutting over to it).
UPDATE 2: Improved code based on the first two replies below:
for i = 0 to bodies.Length - 1 do
for j = (i + 1) to bodies.Length - 1 do
let r = ( bodies.[j].Position - bodies.[i].Position)
let rm = (float32)r.MagnitudeSquared + softeningLengthSquared
let f = r / (Math.Sqrt((float)rm) * (float)rm)
bodies.[i].Acceleration <- bodies.[i].Acceleration + (f * bodies.[j].Mass)
bodies.[j].Acceleration <- bodies.[j].Acceleration - (f * bodies.[i].Mass)
bodies.[i].Position <- bodies.[i].Position + bodies.[i].Velocity * dT
bodies.[i].Velocity <- bodies.[i].Velocity + bodies.[i].Acceleration * dT
The branch in the force function to cover the b1 == b2 case is the worst offender. You do't need this if softeningLength is always non-zero, even if it's very small (Epsilon). This optimization was in the C# code but not the F# version (doh!).
Math.Pow(x, -1.5) seems to be a lot slower than 1/ (Math.Sqrt(x) * x). Essentially this algorithm is slightly odd in that it's perfromance is dictated by the cost of this one step.
Moving the force calculation inline and getting rid of some divides also gives some improvement, but the performance was really being killed by the branching and is dominated by the cost of Sqrt.
WRT using classes over structs: There are cases (CUDA and native C++ implementations of this code and a DX9 renderer) where I need to get the array of bodies into unmanaged code or onto a GPU. In these scenarios being able to memcpy a contiguous block of memory seems like the way to go. Not something I'd get from an array of class Body.

I'm not sure if it's wise to rewrite this code in a functional style. I've seen some attempts to write pair interaction calculations in a functional manner and each one of them was harder to follow than two nested loops.
Before looking at structs vs. classes (I'm sure someone else has something smart to say about this), maybe you can try optimizing the calculation itself?
You're calculating two acceleration deltas, let's call them dAi and dAj:
dAi = r*m1*m2/(rm*sqrt(rm)) / m1
dAj = r*m1*m2/(rm*sqrt(rm)) / m2
[note: m1 = bodies.[i].mass, m2=bodies.[j].mass]]
The division by mass cancels out like this:
dAi = rm2 / (rmsqrt(rm))
dAj = rm1 / (rmsqrt(rm))
Now you only have to calculate r/(rmsqrt(rm)) for each pair (i,j).
This can be optimized further, because 1/(rmsqrt(rm)) = 1/(rm^1.5) = rm^-1.5, so if you let r' = r * (rm ** -1.5), then Edit: no it can't, that's premature optimization talking right there (see comment). Calculating r' = 1.0 / (r * sqrt r) is fastest.
dAi = m2 * r'
dAj = m1 * r'
Your code would then become something like
member this.Integrate(dT, (bodies:Body[])) =
for i = 0 to bodies.Length - 1 do
for j = (i + 1) to bodies.Length - 1 do
let r = (b2.Position - b1.Position)
let rm = (float32)r.MagnitudeSquared + softeningLengthSquared
let r' = r * (rm ** -1.5)
bodies.[i].Acceleration <- bodies.[i].Acceleration + r' * bodies.[j].Mass
bodies.[j].Acceleration <- bodies.[j].Acceleration - r' * bodies.[i].Mass
bodies.[i].Position <- bodies.[i].Position + bodies.[i].Velocity * dT
bodies.[i].Velocity <- bodies.[i].Velocity + bodies.[i].Acceleration * dT
Look, ma, no more divisions!
Warning: untested code. Try at your own risk.

I'd like to play arround with your code, but it's difficult since the definition of Body and FloatVector is missing and they also seem to be missing from the orginal blog post you point to.
I'd hazard a guess that you could improve your performance and rewrite in a more functional style using F#'s lazy computations:
http://msdn.microsoft.com/en-us/library/dd233247(VS.100).aspx
The idea is fairly simple you wrap any expensive computation that could be repeatedly calculated in a lazy ( ... ) expression then you can force the computation as many times as you like and it will only ever be calculated once.

Related

Minimizing memory usage in Julia function

This function is a workhorse which I want to optimize. Any idea on how its memory usage can be limited would be great.
function F(len, rNo, n, ratio = 0.5)
s = zeros(len); m = copy(s); d = copy(s);
s[rNo]=1
rNo ≤ len-1 && (m[rNo + 1] = s[rNo+1] = -n[rNo])
rNo > 1 && (m[rNo - 1] = s[rowNo-1] = n[rowNo-1])
r=1
while true
for i ∈ 2:len-1
d[i] = (n[i]*m[i+1] - n[i-1]*m[i-1])/(r+1)
end
d[1] = n[1]*m[2]/(r+1);
d[len] = -n[len-1]*m[len-1]/(r+1);
for i ∈ 1:len
s[i]+=d[i]
end
sum(abs.(d))/sum(abs.(m)) < ratio && break #converged
m = copy(d); r+=1
end
return reshape(s, 1, :)
end
It calculates rows of a special matrix exponential which I stack later.
Although the full method is quite faster than built in exp thanks to the special properties, it takes up far more memory as measured by #time.
Since I am a noob in memory management and also in Julia, I am sure it can be optimized quite a bit..
Am I doing something obviously wrong?
I think most of your allocations come from sum(abs.(d))/sum(abs.(m)) < ratio && break #converged. If you replace it with sum(abs, d)/sum(abs,m) < ratio && break #converged those allocations should go away. (it also will be a speed boost).
Your other allocations can be removed by replacing m = copy(d) with m .= d which does an element-wise copy.
There are also a couple of style things where I think you could make this a nicer function to read and use. My changes would be as follows
function F(rNo, v, ratio = 0.5)
len = length(v)
s = zeros(len+1); m = copy(s); d = copy(s);
s[rNo]=1
rNo ≤ len && (m[rNo + 1] = s[rNo+1] = -v[rNo])
rNo > 1 && (m[rNo - 1] = s[rowNo-1] = v[rowNo-1])
r=1
while true
for i ∈ 2:len
d[i] = (v[i]*m[i+1] - v[i-1]*m[i-1]) / (r+1)
end
d[1] = v[1]*m[2]/(r+1);
d[end] = -v[end]*m[end]/(r+1);
s .+= d
sum(abs, d)/sum(abs, m) < ratio && break #converged
m .= d; r+=1
end
return reshape(s, 1, :)
end
The most notable change is removing len from the arguments. Including an array length argument is common in C (and probably others) where finding the length of an array is hard, but in Julia length is cheap (O(1)), and adding extra arguments is just more clutter and confusion for the people using it. I also made use of the fact that julia is able to turn s[end] into s[length(x)] to make this a little cleaner. Also, in general when using Julia you should look for ways to use dotted operations rather than writing for loops. The for loops will be fast, but why take 3 lines to do what you could in 1 shorter line? (I also renamed n to v since to me n is a number and v is a vector, but that is pure preference).
I hope this helps.

Why doesn't this Fibonacci Number function work in O(log N)?

So the Fibonacci number for log (N) — without matrices.
Ni // i-th Fibonacci number
= Ni-1 + Ni-2 // by definition
= (Ni-2 + Ni-3) + Ni-2 // unwrap Ni-1
= 2*Ni-2 + Ni-3 // reduce the equation
= 2*(Ni-3 + Ni-4) + Ni-3 //unwrap Ni-2
// And so on
= 3*Ni-3 + 2*Ni-4
= 5*Ni-4 + 3*Ni-5
= 8*Ni-5 + 5*Ni-6
= Nk*Ni-k + Nk-1*Ni-k-1
Now we write a recursive function, where at each step we take k~=I/2.
static long N(long i)
{
if (i < 2) return 1;
long k=i/2;
return N(k) * N(i - k) + N(k - 1) * N(i - k - 1);
}
Where is the fault?
You get a recursion formula for the effort: T(n) = 4T(n/2) + O(1). (disregarding the fact that the numbers get bigger, so the O(1) does not even hold). It's clear from this that T(n) is not in O(log(n)). Instead one gets by the master theorem T(n) is in O(n^2).
Btw, this is even slower than the trivial algorithm to calculate all Fibonacci numbers up to n.
The four N calls inside the function each have an argument of around i/2. So the length of the stack of N calls in total is roughly equal to log2N, but because each call generates four more, the bottom 'layer' of calls has 4^log2N = O(n2) Thus, the fault is that N calls itself four times. With only two calls, as in the conventional iterative method, it would be O(n). I don't know of any way to do this with only one call, which could be O(log n).
An O(n) version based on this formula would be:
static long N(long i) {
if (i<2) {
return 1;
}
long k = i/2;
long val1;
long val2;
val1 = N(k-1);
val2 = N(k);
if (i%2==0) {
return val2*val2+val1*val1;
}
return val2*(val2+val1)+val1*val2;
}
which makes 2 N calls per function, making it O(n).
public class fibonacci {
public static int count=0;
public static void main(String[] args) {
Scanner scan = new Scanner(System.in);
int i = scan.nextInt();
System.out.println("value of i ="+ i);
int result = fun(i);
System.out.println("final result is " +result);
}
public static int fun(int i) {
count++;
System.out.println("fun is called and count is "+count);
if(i < 2) {
System.out.println("function returned");
return 1;
}
int k = i/2;
int part1 = fun(k);
int part2 = fun(i-k);
int part3 = fun(k-1);
int part4 = fun(i-k-1);
return ((part1*part2) + (part3*part4)); /*RESULT WILL BE SAME FOR BOTH METHODS*/
//return ((fun(k)*fun(i-k))+(fun(k-1)*fun(i-k-1)));
}
}
I tried to code to problem defined by you in java. What i observed is that complexity of above code is not completely O(N^2) but less than that.But as per conventions and standards the worst case complexity is O(N^2) including some other factors like computation(division,multiplication) and comparison time analysis.
The output of above code gives me information about how many times the function
fun(int i) computes and is being called.
OUTPUT
So including the time taken for comparison and division, multiplication operations, the worst case time complexity is O(N^2) not O(LogN).
Ok if we use Analysis of the recursive Fibonacci program technique.Then we end up getting a simple equation
T(N) = 4* T(N/2) + O(1)
where O(1) is some constant time.
So let's apply Master's method on this equation.
According to Master's method
T(n) = aT(n/b) + f(n) where a >= 1 and b > 1
There are following three cases:
If f(n) = Θ(nc) where c < Logba then T(n) = Θ(nLogba)
If f(n) = Θ(nc) where c = Logba then T(n) = Θ(ncLog n)
If f(n) = Θ(nc) where c > Logba then T(n) = Θ(f(n))
And in our equation a=4 , b=2 & c=0.
As case 1 c < logba => 0 < 2 (which is log base 2 and equals to 2) is satisfied
hence T(n) = O(n^2).
For more information about how master's algorithm works please visit: Analysis of Algorithms
Your idea is correct, and it will perform in O(log n) provided you don't compute the same formula
over and over again. The whole point of having N(k) * N(i-k) is to have (k = i - k) so you only have to compute one instead of two. But if you only call recursively, you are performing the computation twice.
What you need is called memoization. That is, store every value that you already have computed, and
if it comes up again, then you get it in O(1).
Here's an example
const int MAX = 10000;
// memoization array
int f[MAX] = {0};
// Return nth fibonacci number using memoization
int fib(int n) {
// Base case
if (n == 0)
return 0;
if (n == 1 || n == 2)
return (f[n] = 1);
// If fib(n) is already computed
if (f[n]) return f[n];
// (n & 1) is 1 iff n is odd
int k = n/2;
// Applying your formula
f[n] = fib(k) * fib(n - k) + fib(k - 1) * fib(n - k - 1);
return f[n];
}

How can I fix this issue with my Mandelbrot fractal generator?

I've been working on a project that renders a Mandelbrot fractal. For those of you who know, it is generated by iterating through the following function where c is the point on a complex plane:
function f(c, z) return z^2 + c end
Iterating through that function produces the following fractal (ignore the color):
When you change the function to this, (z raised to the third power)
function f(c, z) return z^3 + c end
the fractal should render like so (again, the color doesn't matter):
(source: uoguelph.ca)
However, when I raised z to the power of 3, I got an image extremely similar as to when you raise z to the power of 2. How can I make the fractal render correctly? This is the code where the iterations are done: (the variables real and imaginary simply scale the screen from -2 to 2)
--loop through each pixel, col = column, row = row
local real = (col - zoomCol) * 4 / width
local imaginary = (row - zoomRow) * 4 / width
local z, c, iter = 0, 0, 0
while math.sqrt(z^2 + c^2) <= 2 and iter < maxIter do
local zNew = z^2 - c^2 + real
c = 2*z*c + imaginary
z = zNew
iter = iter + 1
end
So I recently decided to remake a Mandelbrot fractal generator, and it was MUCH more successful than my attempt last time, as my programming skills have increased with practice.
I decided to generalize the mandelbrot function using recursion for anyone who wants it. So, for example, you can do f(z, c) z^2 + c or f(z, c) z^3 + c
Here it is for anyone that may need it:
function raise(r, i, cr, ci, pow)
if pow == 1 then
return r + cr, i + ci
end
return raise(r*r-i*i, 2*r*i, cr, ci, pow - 1)
end
and it's used like this:
r, i = raise(r, i, CONSTANT_REAL_PART, CONSTANT_IMAG_PART, POWER)

using Array.Parallel.map for decreasing running time

Hello everyone
I have converted a project in C# to F# that paints the Mandelbrot set.
Unfortunately does it take around one minute to render a full screen so I am try to find some ways to speed it up.
It is one call that take almost all of the time:
Array.map (fun x -> this.colorArray.[CalcZ x]) xyArray
xyArray (double * double) [] => (array of tuple of double)
colorArray is an array of int32 length = 255
CalcZ is defined as:
let CalcZ (coord:double * double) =
let maxIterations = 255
let rec CalcZHelper (xCoord:double) (yCoord:double) // line break inserted
(x:double) (y:double) iters =
let newx = x * x + xCoord - y * y
let newy = 2.0 * x * y + yCoord
match newx, newy, iters with
| _ when Math.Abs newx > 2.0 -> iters
| _ when Math.Abs newy > 2.0 -> iters
| _ when iters = maxIterations -> iters
| _ -> CalcZHelper xCoord yCoord newx newy (iters + 1)
CalcZHelper (fst coord) (snd coord) (fst coord) (snd coord) 0
As I only use around half of the processor capacity is an idea to use more threads and specifically Array.Parallel.map, translates to system.threading.tasks.parallel
Now my question
A naive solution, would be:
Array.Parallel.map (fun x -> this.colorArray.[CalcZ x]) xyArray
but that took twice the time, how can I rewrite this to take less time, or can I take some other way to utilize the processor better?
Thanks in advance
Gorgen
---edit---
the function that is calling CalcZ looks like this:
let GetMatrix =
let halfX = double bitmap.PixelWidth * scale / 2.0
let halfY = double bitmap.PixelHeight * scale / 2.0
let rect:Mandelbrot.Rectangle =
{xMax = centerX + halfX; xMin = centerX - halfX;
yMax = centerY + halfY; yMin = centerY - halfY;}
let size:Mandelbrot.Size =
{x = bitmap.PixelWidth; y = bitmap.PixelHeight}
let xyList = GenerateXYTuple rect size
let xyArray = Array.ofList xyList
Array.map (fun x -> this.colorArray.[CalcZ x]) xyArray
let region:Int32Rect = new Int32Rect(0,0,bitmap.PixelWidth,bitmap.PixelHeight)
bitmap.WritePixels(region, GetMatrix, bitmap.PixelWidth * 4, region.X, region.Y);
GenerateXYTuple:
let GenerateXYTuple (rect:Rectangle) (pixels:Size) =
let xStep = (rect.xMax - rect.xMin)/double pixels.x
let yStep = (rect.yMax - rect.yMin)/double pixels.y
[for column in 0..pixels.y - 1 do
for row in 0..pixels.x - 1 do
yield (rect.xMin + xStep * double row,
rect.yMax - yStep * double column)]
---edit---
Following a suggestion from kvb (thanks a lot!) in a comment to my question, I built the program in Release mode. Building in the Relase mode generally speeded up things.
Just building in Release took me from 50s to around 30s, moving in all transforms on the array so it all happens in one pass made it around 10 seconds faster. At last using the Array.Parallel.init brought me to just over 11 seconds.
What I learnt from this is.... Use the release mode when timing things and using parallel constructs...
One more time, thanks for the help I have recieved.
--edit--
by using SSE assember from a native dll I have been able to slash the time from around 12 seconds to 1.2 seconds for a full screen of the most computational intensive points. Unfortunately I don't have a graphics processor...
Gorgen
Per the comment on the original post, here is the code I wrote to test the function. The fast version only takes a few seconds on my average workstation. It is fully sequential, and has no parallel code.
It's moderately long, so I posted it on another site: http://pastebin.com/Rjj8EzCA
I'm suspecting that the slowdown you are seeing is in the rendering code.
I don't think that the Array.Parallel.map function (which uses Parallel.For from .NET 4.0 under the cover) should have trouble parallelizing the operation if it runs a simple function ~1 million times. However, I encountered some weird performance behavior in a similar case when F# didn't optimize the call to the lambda function (in some way).
I'd try taking a copy of the Parallel.map function from the F# sources and adding inline. Try adding the following map function to your code and use it instead of the one from F# libraries:
let inline map (f: 'T -> 'U) (array : 'T[]) : 'U[]=
let inputLength = array.Length
let result = Array.zeroCreate inputLength
Parallel.For(0, inputLength, fun i ->
result.[i] <- f array.[i]) |> ignore
result
As an aside, it looks like you're generating an array of coordinates and then mapping it to an array of results. You don't need to create the coordinate array if you use the init function instead of map: Array.Parallel.init 1000 (fun y -> Array.init 1000 (fun x -> this.colorArray.[CalcZ (x, y)]))
EDIT: The following may be inaccurate:
Your problem could be that you call a tiny function a million times, causing the scheduling overhead to overwhelm that actual work you're doing. You should partition the array into much larger chunks so that each individual task takes a millisecond or so. You can use an array of arrays so that you would call Array.Parallel.map on the outer arrays and Array.map on the inner arrays. That way each parallel operation will operate on a whole row of pixels instead of just a single pixel.

F#/"Accelerator v2" DFT algorithm implementation probably incorrect

I'm trying to experiment with software defined radio concepts. From this article I've tried to implement a GPU-parallelism Discrete Fourier Transform.
I'm pretty sure I could pre-calculate 90 degrees of the sin(i) cos(i) and then just flip and repeat rather than what I'm doing in this code and that that would speed it up. But so far, I don't even think I'm getting correct answers. An all-zeros input gives a 0 result as I'd expect, but all 0.5 as inputs gives 78.9985886f (I'd expect a 0 result in this case too). Basically, I'm just generally confused. I don't have any good input data and I don't know what to do with the result or how to verify it.
This question is related to my other post here
open Microsoft.ParallelArrays
open System
// X64MulticoreTarget is faster on my machine, unexpectedly
let target = new DX9Target() // new X64MulticoreTarget()
ignore(target.ToArray1D(new FloatParallelArray([| 0.0f |]))) // Dummy operation to warm up the GPU
let stopwatch = new System.Diagnostics.Stopwatch() // For benchmarking
let Hz = 50.0f
let fStep = (2.0f * float32(Math.PI)) / Hz
let shift = 0.0f // offset, once we have to adjust for the last batch of samples of a stream
// If I knew that the periodic function is periodic
// at whole-number intervals, I think I could keep
// shift within a smaller range to support streams
// without overflowing shift - but I haven't
// figured that out
//let elements = 8192 // maximum for a 1D array - makes sense as 2^13
//let elements = 7240 // maximum on my machine for a 2D array, but why?
let elements = 7240
// need good data!!
let buffer : float32[,] = Array2D.init<float32> elements elements (fun i j -> 0.5f) //(float32(i * elements) + float32(j)))
let input = new FloatParallelArray(buffer)
let seqN : float32[,] = Array2D.init<float32> elements elements (fun i j -> (float32(i * elements) + float32(j)))
let steps = new FloatParallelArray(seqN)
let shiftedSteps = ParallelArrays.Add(shift, steps)
let increments = ParallelArrays.Multiply(fStep, steps)
let cos_i = ParallelArrays.Cos(increments) // Real component series
let sin_i = ParallelArrays.Sin(increments) // Imaginary component series
stopwatch.Start()
// From the documentation, I think ParallelArrays.Multiply does standard element by
// element multiplication, not matrix multiplication
// Then we sum each element for each complex component (I don't understand the relationship
// of this, or the importance of the generalization to complex numbers)
let real = target.ToArray1D(ParallelArrays.Sum(ParallelArrays.Multiply(input, cos_i))).[0]
let imag = target.ToArray1D(ParallelArrays.Sum(ParallelArrays.Multiply(input, sin_i))).[0]
printf "%A in " ((real * real) + (imag * imag)) // sum the squares for the presence of the frequency
stopwatch.Stop()
printfn "%A" stopwatch.ElapsedMilliseconds
ignore (System.Console.ReadKey())
I share your surprise that your answer is not closer to zero. I'd suggest writing naive code to perform your DFT in F# and seeing if you can track down the source of the discrepancy.
Here's what I think you're trying to do:
let N = 7240
let F = 1.0f/50.0f
let pi = single System.Math.PI
let signal = [| for i in 1 .. N*N -> 0.5f |]
let real =
seq { for i in 0 .. N*N-1 -> signal.[i] * (cos (2.0f * pi * F * (single i))) }
|> Seq.sum
let img =
seq { for i in 0 .. N*N-1 -> signal.[i] * (sin (2.0f * pi * F * (single i))) }
|> Seq.sum
let power = real*real + img*img
Hopefully you can use this naive code to get a better intuition for how the accelerator code ought to behave, which could guide you in your testing of the accelerator code. Keep in mind that part of the reason for the discrepancy may simply be the precision of the calculations - there are ~52 million elements in your arrays, so accumulating a total error of 79 may not actually be too bad. FWIW, I get a power of ~0.05 when running the above single precision code, but a power of ~4e-18 when using equivalent code with double precision numbers.
Two suggestions:
ensure you're not somehow confusing degrees with radians
try doing it sans-parallelism, or just with F#'s asyncs for parallelism
(In F#, if you have an array of floats
let a : float[] = ...
then you can 'add a step to all of them in parallel' to produce a new array with
let aShift = a |> (fun x -> async { return x + shift })
|> Async.Parallel |> Async.RunSynchronously
(though I expect this might be slower that just doing a synchronous loop).)

Resources