Parallelizing the element-wise multiplication of two matrices in F#

I'm trying to parallelize the element-by-element multiplication of two matrices in F#. I can't quite figure it out, though. I keep trying to create tasks, but it never compiles. My non-working, messy code is the following:
let myBigElemMultiply (m:matrix) (n:matrix) =
    let AddTwoRows (row:int) (destination:matrix) (source1:matrix) (source2:matrix) =
        for i in 0 .. destination.NumCols
            destination.[row, i] <- source1.[row,i] + source2.[row,i]
        destination
    let result = Matrix.zero(m.NumRows)
    let operations = [ for i in 0 .. m.NumRows -> AddTwoRows i result m n ]
    let parallelTasks = Async.Parallel operations
    Async.RunSynchronously parallelTasks
    result

You have made several small mistakes; e.g., you haven't figured out how to do the matrix multiplication itself.
let myBigElemMultiply (m:matrix) (n:matrix) =
    let AddTwoRows (row:int) (destination:matrix) (source1:matrix) (source2:matrix) =
        for col=0 to destination.NumCols-1 do
            let mutable sum = 0.0
            for k=0 to m.NumCols-1 do
                sum <- sum + source1.[row,k] * source2.[k,col]
            destination.[row,col] <- sum
    let result = Matrix.zero m.NumRows n.NumCols
    let operations = [ for i=0 to m.NumRows-1 do yield async { AddTwoRows i result m n } ]
    let parallelTasks = Async.Parallel operations
    Async.RunSynchronously parallelTasks |> ignore
    result
One thing to notice is that this code would perform very badly, because m.[i,j] is an inefficient way to access elements in a matrix. You'd be better off working with the underlying 2D array:
let myBigElemMultiply2 (m:matrix) (n:matrix) =
    let AddTwoRows (row:int) (destination:matrix) (source1:matrix) (source2:matrix) =
        let destination = destination.InternalDenseValues
        let source1 = source1.InternalDenseValues
        let source2 = source2.InternalDenseValues
        for col=0 to Array2D.length2 destination - 1 do
            let mutable sum = 0.0
            for k=0 to Array2D.length1 source2 - 1 do
                sum <- sum + source1.[row,k] * source2.[k,col]
            destination.[row,col] <- sum
    let result = Matrix.zero m.NumRows n.NumCols
    let operations = [ for i=0 to m.NumRows-1 do yield async { AddTwoRows i result m n } ]
    let parallelTasks = Async.Parallel operations
    Async.RunSynchronously parallelTasks |> ignore
    result
testing:
let r = new Random()
let A = Matrix.init 280 10340 (fun i j -> r.NextDouble() )
let B = A.Transpose
some timing:
> myBigElemMultiply A B;;
Real: 00:00:22.111, CPU: 00:00:41.777, GC gen0: 0, gen1: 0, gen2: 0
val it : unit = ()
> myBigElemMultiply2 A B;;
Real: 00:00:08.736, CPU: 00:00:15.303, GC gen0: 0, gen1: 0, gen2: 0
val it : unit = ()
> A*B;;
Real: 00:00:13.635, CPU: 00:00:13.166, GC gen0: 0, gen1: 0, gen2: 0
val it : unit = ()
>
Also consider using Parallel.For (from System.Threading.Tasks), which should give better performance than async here; a sketch follows below.
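For illustration, here is a minimal, untested sketch of that Parallel.For variant (assuming the same F# PowerPack matrix type and InternalDenseValues access used above); take it as the shape of the idea rather than a benchmarked implementation:
open System.Threading.Tasks
// Sketch: parallelize over rows with Parallel.For instead of async.
let myBigElemMultiply3 (m:matrix) (n:matrix) =
    let result = Matrix.zero m.NumRows n.NumCols
    let dest = result.InternalDenseValues
    let a = m.InternalDenseValues
    let b = n.InternalDenseValues
    Parallel.For(0, m.NumRows, fun row ->
        for col = 0 to Array2D.length2 dest - 1 do
            let mutable sum = 0.0
            for k = 0 to Array2D.length1 b - 1 do
                sum <- sum + a.[row,k] * b.[k,col]
            dest.[row,col] <- sum) |> ignore
    result
Parallel.For partitions the row range across the thread pool, so there is no per-row Async allocation.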

Here's at least some code that compiles, perhaps this will get you headed in the right direction?
let myBigElemMultiply (m:matrix) (n:matrix) =
    let AddTwoRows (row:int) (destination:matrix) (source1:matrix) (source2:matrix) =
        async {
            // Note: indices run to NumCols - 1 / NumRows - 1 to stay in bounds.
            for i in 0 .. destination.NumCols - 1 do
                destination.[row, i] <- source1.[row,i] + source2.[row,i]
        }
    let result = Matrix.zero m.NumRows m.NumCols
    let operations = [ for i in 0 .. m.NumRows - 1 -> AddTwoRows i result m n ]
    let parallelTasks = Async.Parallel operations
    Async.RunSynchronously parallelTasks |> ignore
    result

There's no point. Out-of-place element-wise multiplication of a pair of matrices is little more than copying, at which point a single core will happily max out the entire memory bandwidth of your machine, and adding more cores will not improve performance. So it is almost certainly a waste of time.

Related

Understanding F# memory consumption

I've been toying around with F# lately and wrote the little snippet below. It just creates a number of randomized 3D vectors, puts them into a list, maps each vector to its length, and sums up all those values.
Running the program (as a Release build .exe, not interactive), the binary consumes roughly 550 MB of RAM in this particular case (10 million vectors). One Vec3 object should account for 12 bytes (or 16, assuming some alignment takes place). Even with a rough estimate of 32 bytes per object to account for some book-keeping overhead (32 bytes * 10 million / 1024 / 1024), you're still 200 MB off the actual consumption. Naively I'd expect to end up with 10 million * 4 bytes per single, since the Vec3 objects are 'mapped away'.
My guess so far: either I keep one (or several) copies of my list somewhere without being aware of it, or some intermediate results never get garbage collected. I can't imagine that inheriting from System.Object brings in so much overhead.
Could someone point me in the right direction with this?
TiA
type Vec3(x: single, y: single, z:single) =
    let mag = sqrt(x*x + y*y + z*z)
    member self.Magnitude = mag
    override self.ToString() = sprintf "[%f %f %f]" x y z

let how_much = 10000000
let mutable rng = System.Random()

let sw = new System.Diagnostics.Stopwatch()
sw.Start()

let random_vec_iter len =
    let mutable result = []
    for x = 1 to len do
        let mutable accum = []
        for i = 1 to 3 do
            accum <- single(rng.NextDouble())::accum
        result <- Vec3(accum.[0], accum.[1], accum.[2])::result
    result

let sum_len_func = List.reduce (fun x y -> x+y)
let map_to_mag_func = List.map (fun (x:Vec3) -> x.Magnitude)

[<EntryPoint>]
let main argv =
    printfn "Hello, World"
    let res = sum_len_func (map_to_mag_func (random_vec_iter(how_much)))
    printfn "doing stuff with %i items took %i, result is %f" how_much (sw.ElapsedMilliseconds) res
    System.Console.ReadKey() |> ignore
    0 // return an integer exit code
First, your Vec3 is a reference type, not a value type (i.e., not a struct), so you hold a pointer on top of your 12 bytes of fields (12 + 16). Then the list is a singly-linked list, so each cons cell costs another 16 bytes for a .NET reference. Finally, your List.map will create an intermediate list.
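As a sketch of where that overhead goes (this is illustrative, not part of the original answer; the names are made up): making the vector a struct and keeping the data in an array avoids both the per-object headers and the intermediate lists:
// Sketch only: a struct vector plus an array-based pipeline, so the magnitudes
// are summed without allocating list nodes or an intermediate list of lengths.
[<Struct>]
type Vec3S(x: single, y: single, z: single) =
    member this.Magnitude = sqrt (x*x + y*y + z*z)
let sumOfMagnitudes count =
    let rng = System.Random()
    // Array of structs: roughly count * 12 bytes of payload, no per-element header.
    let vecs =
        Array.init count (fun _ ->
            Vec3S(single (rng.NextDouble()),
                  single (rng.NextDouble()),
                  single (rng.NextDouble())))
    // Fold directly instead of List.map followed by List.reduce.
    vecs |> Array.sumBy (fun v -> v.Magnitude)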

Why does F#'s auto-generated GetHashCode produce so many collisions?

In our F# code, the auto-generated GetHashCode implementation shows very poor performance and a high collision rate. Is this a problem with the F# GetHashCode generation, or just an edge case?
open System
open System.Collections.Generic

let check keys e name =
    let dict = new Dictionary<_,_>(Array.length keys, e)//, HashIdentity.Structural)
    let stopWatch = System.Diagnostics.Stopwatch.StartNew()
    let add k = dict.Add(k, 1.02)
    Array.iter add keys
    stopWatch.Stop()
    let hashes = new HashSet<int>()
    let add_hash x = hashes.Add(e.GetHashCode(x)) |> not
    let collisions = Array.filter add_hash keys |> Array.length
    printfn "%s %f sec %f collisions" name stopWatch.Elapsed.TotalSeconds (double(collisions) / double(keys.Length))

type StructTuple<'T,'T2> =
    struct
        val fst: 'T
        val snd : 'T2
        new(fst: 'T, snd : 'T2) = {fst = fst; snd = snd}
    end

let bad_keys = seq {
    let rnd = new Random()
    while true do
        let j = uint32(rnd.Next(0, 3346862))
        let k = uint16 (rnd.Next(0, 658))
        yield StructTuple(j,k)
}

let good_keys = seq {
    for k in 0us..658us do
        for j in 0u.. 3346862u do
            yield StructTuple(j,k)
}

module CmpHelpers =
    let inline combine (h1:int) (h2:int) = (h1 <<< 5) + h1 ^^^ h2

type StructTupleComparer<'T,'T2>() =
    let cmparer = EqualityComparer<Object>.Default
    interface IEqualityComparer<StructTuple<'T,'T2>> with
        member this.Equals (a,b) = cmparer.Equals(a.fst, b.fst) && cmparer.Equals(a.snd, b.snd)
        member this.GetHashCode (x) = CmpHelpers.combine (cmparer.GetHashCode(x.fst)) (cmparer.GetHashCode(x.snd))

type AutoGeneratedStructTupleComparer<'T,'T2>() =
    let cmparer = LanguagePrimitives.GenericEqualityComparer
    interface IEqualityComparer<StructTuple<'T,'T2>> with
        member this.Equals (a:StructTuple<'T,'T2>, b:StructTuple<'T,'T2>) =
            LanguagePrimitives.HashCompare.GenericEqualityERIntrinsic<'T> a.fst b.fst
            && LanguagePrimitives.HashCompare.GenericEqualityERIntrinsic<'T2> a.snd b.snd
        member this.GetHashCode (x:StructTuple<'T,'T2>) =
            let mutable num = 0
            num <- -1640531527 + (LanguagePrimitives.HashCompare.GenericHashWithComparerIntrinsic<'T2> cmparer x.snd + ((num <<< 6) + (num >>> 2)))
            -1640531527 + (LanguagePrimitives.HashCompare.GenericHashWithComparerIntrinsic<'T> cmparer x.fst + ((num <<< 6) + (num >>> 2)))

let uniq (sq:seq<'a>) = Array.ofSeq (new HashSet<_>(sq))

[<EntryPoint>]
let main argv =
    let count = 15000000
    let keys = good_keys |> Seq.take count |> uniq
    printfn "good keys"
    check keys (new StructTupleComparer<_,_>()) "struct custom"
    check keys HashIdentity.Structural "struct auto"
    check keys (new AutoGeneratedStructTupleComparer<_,_>()) "struct auto explicit"
    let keys = bad_keys |> Seq.take count |> uniq
    printfn "bad keys"
    check keys (new StructTupleComparer<_,_>()) "struct custom"
    check keys HashIdentity.Structural "struct auto"
    check keys (new AutoGeneratedStructTupleComparer<_,_>()) "struct auto explicit"
    Console.ReadLine() |> ignore
    0 // return an integer exit code
output
good keys
struct custom 1.506934 sec 0.000000 collisions
struct auto 4.832881 sec 0.776863 collisions
struct auto explicit 3.166931 sec 0.776863 collisions
bad keys
struct custom 3.631251 sec 0.061893 collisions
struct auto 10.340693 sec 0.777034 collisions
struct auto explicit 8.893612 sec 0.777034 collisions
I am no expert on the overall algorithm used to produce auto-generated Equals and GetHashCode, but it just seems to produce something non-optimal here. I don't know offhand whether that is normal for a general-purpose auto-generated implementation, or whether there are practical ways of reliably auto-generating close-to-optimal implementations.
It's worth noting that if you just use the standard tuple, the auto-generated hashing and comparison give the same collision rate and equal performance as your custom implementation (a sketch of the standard-tuple variant follows after the numbers below). And using the latest F# 4.0 bits (where there has recently been a significant perf improvement in this area), the auto-generated implementation becomes significantly faster than the custom one.
My numbers:
// F# 3.1, struct tuples
good keys
custom 0.951254 sec 0.000000 collisions
auto 2.737166 sec 0.776863 collisions
bad keys
custom 2.923103 sec 0.061869 collisions
auto 7.706678 sec 0.777040 collisions
// F# 3.1, standard tuples
good keys
custom 0.995701 sec 0.000000 collisions
auto 0.965949 sec 0.000000 collisions
bad keys
custom 3.091821 sec 0.061869 collisions
auto 2.924721 sec 0.061869 collisions
// F# 4.0, standard tuples
good keys
custom 1.018672 sec 0.000000 collisions
auto 0.619066 sec 0.000000 collisions
bad keys
custom 3.082988 sec 0.061869 collisions
auto 1.829720 sec 0.061869 collisions
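For reference, here is a minimal sketch of the standard-tuple variant mentioned above, reusing the check and uniq helpers from the question (the key generator is just re-declared with plain tuples for illustration):
// Sketch: same benchmark, but with ordinary F# tuples instead of StructTuple.
// HashIdentity.Structural then uses the tuple's auto-generated structural hashing.
let good_tuple_keys = seq {
    for k in 0us..658us do
        for j in 0u..3346862u do
            yield (j, k)
}
let runTupleCheck () =
    let count = 15000000
    let keys = good_tuple_keys |> Seq.take count |> uniq
    check keys HashIdentity.Structural "tuple auto"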
I opened an issue in the F# issue tracker; it was accepted as a bug: https://github.com/fsharp/fsharp/issues/343

Trouble With Kinect Skeleton Tracking in F#

I am using the F# skeleton tracking template provided by KinectContrib. The C# template that does the same thing works, so I know the hardware is OK.
I am using Windows Kinect SDK v1.8.
The program will track once in a rare while, but with no consistent pattern. I have been playing with the code since last night, so I am looking for someone either to confirm the same behavior on another system or to offer pointers on how to change the code.
Thanks in advance.
This is the template code:
#light
open System
open System.Windows
open System.Windows.Media.Imaging
open Microsoft.Kinect
open System.Diagnostics
let sensor = KinectSensor.KinectSensors.[0]
//The main canvas that is handling the ellipses
let canvas = new System.Windows.Controls.Canvas()
canvas.Background <- System.Windows.Media.Brushes.Transparent
let ds : byte = Convert.ToByte(1)
let dummySkeleton : Skeleton = new Skeleton(TrackingState = SkeletonTrackingState.Tracked)
// Thanks to Richard Minerich (#rickasaurus) for helping me figure out
// some array concepts in F#.
let mutable pixelData : byte array = [| |]
let mutable skeletons : Skeleton array = [| |]
//Right hand ellipse
let rhEllipse = new System.Windows.Shapes.Ellipse()
rhEllipse.Height <- 20.0
rhEllipse.Width <- 20.0
rhEllipse.Fill <- System.Windows.Media.Brushes.Red
rhEllipse.Stroke <- System.Windows.Media.Brushes.White
//Left hand ellipse
let lhEllipse = new System.Windows.Shapes.Ellipse()
lhEllipse.Height <- 20.0
lhEllipse.Width <- 20.0
lhEllipse.Fill <- System.Windows.Media.Brushes.Red
lhEllipse.Stroke <- System.Windows.Media.Brushes.White
//Head ellipse
let hEllipse = new System.Windows.Shapes.Ellipse()
hEllipse.Height <- 20.0
hEllipse.Width <- 20.0
hEllipse.Fill <- System.Windows.Media.Brushes.Red
hEllipse.Stroke <- System.Windows.Media.Brushes.White
canvas.Children.Add(rhEllipse) |> ignore
canvas.Children.Add(lhEllipse) |> ignore
canvas.Children.Add(hEllipse) |> ignore
let grid = new System.Windows.Controls.Grid()
let winImage = new System.Windows.Controls.Image()
winImage.Height <- 600.0
winImage.Width <- 800.0
grid.Children.Add(winImage) |> ignore
grid.Children.Add(canvas) |> ignore
//Video frame is ready to be processed.
let VideoFrameReady (sender : obj) (args: ColorImageFrameReadyEventArgs) =
    let receivedData = ref false
    using (args.OpenColorImageFrame()) (fun r ->
        if (r <> null) then
            (
                pixelData <- Array.create r.PixelDataLength ds
                //Array.Resize(ref pixelData, r.PixelDataLength)
                r.CopyPixelDataTo(pixelData)
                receivedData := true
            )
        if (receivedData <> ref false) then
            (
                winImage.Source <- BitmapSource.Create(640, 480, 96.0, 96.0, Media.PixelFormats.Bgr32, null, pixelData, 640 * 4)
            )
    )
//Required to correlate the skeleton data to the PC screen
//IMPORTANT NOTE: Code for vector scaling was imported from the Coding4Fun Kinect Toolkit
//available here: http://c4fkinect.codeplex.com/
//I only used this part to avoid adding an extra reference.
let ScaleVector (length : float32, position : float32) =
    let value = (((length / 1.0f) / 2.0f) * position) + (length / 2.0f)
    if value > length then
        length
    elif value < 0.0f then
        0.0f
    else
        value
//This will set the ellipse positions depending on the passed instance and joint
let SetEllipsePosition (ellipse : System.Windows.Shapes.Ellipse, joint : Joint) =
    let vector = new Microsoft.Kinect.SkeletonPoint(X = ScaleVector(640.0f, joint.Position.X), Y=ScaleVector(480.0f, -joint.Position.Y),Z=joint.Position.Z)
    let mutable uJoint = joint
    uJoint.TrackingState <- JointTrackingState.Tracked
    uJoint.Position <- vector
    System.Windows.Controls.Canvas.SetLeft(ellipse,(float uJoint.Position.X))
    System.Windows.Controls.Canvas.SetTop(ellipse,(float uJoint.Position.Y))
//Triggered when a new skeleton frame is ready for processing
let SkeletonFrameReady (sender : obj) (args: SkeletonFrameReadyEventArgs) =
    let receivedData = ref false
    using (args.OpenSkeletonFrame()) (fun r ->
        if (r <> null) then
            (
                skeletons <- Array.create r.SkeletonArrayLength dummySkeleton
                r.CopySkeletonDataTo(skeletons)
                for i in skeletons do
                    Debug.WriteLine(i.TrackingState.ToString())
                receivedData := true
            )
        if (receivedData <> ref false) then
            (
                for i in skeletons do
                    if i.TrackingState <> SkeletonTrackingState.NotTracked then
                        (
                            let currentSkeleton = i
                            SetEllipsePosition(hEllipse, currentSkeleton.Joints.[JointType.Head])
                            SetEllipsePosition(lhEllipse, currentSkeleton.Joints.[JointType.HandLeft])
                            SetEllipsePosition(rhEllipse, currentSkeleton.Joints.[JointType.HandRight])
                        )
            )
    )
let WindowLoaded (sender : obj) (args: RoutedEventArgs) =
    sensor.Start()
    sensor.ColorStream.Enable()
    sensor.SkeletonStream.Enable()
    sensor.ColorFrameReady.AddHandler(new EventHandler<ColorImageFrameReadyEventArgs>(VideoFrameReady))
    sensor.SkeletonFrameReady.AddHandler(new EventHandler<SkeletonFrameReadyEventArgs>(SkeletonFrameReady))

let WindowUnloaded (sender : obj) (args: RoutedEventArgs) =
    sensor.Stop()
//Defining the structure of the test window
let window = new Window()
window.Width <- 800.0
window.Height <- 600.0
window.Title <- "Kinect Skeleton Application"
window.Loaded.AddHandler(new RoutedEventHandler(WindowLoaded))
window.Unloaded.AddHandler(new RoutedEventHandler(WindowUnloaded))
window.Content <- grid
window.Show()
[<STAThread()>]
do
    let app = new Application() in
    app.Run(window) |> ignore
I ended up rewriting it based on this post http://channel9.msdn.com/coding4fun/kinect/Kinecting-with-F and the skeleton tracking is now working. I'm still interested in why the original code doesn't work, though.
// Learn more about F# at http://fsharp.net
#light
open System
open System.Windows
open System.Windows.Media.Imaging
open System.Windows.Threading
open Microsoft.Kinect
open System.Diagnostics
[<STAThread>]
do
    let sensor = KinectSensor.KinectSensors.[0]
    sensor.SkeletonStream.Enable()
    sensor.Start()

    // Set-up the WPF window and its contents
    let width = 1024.
    let height = 768.
    let w = Window(Width=width, Height=height)
    let g = Controls.Grid()
    let c = Controls.Canvas()
    let hd = Shapes.Rectangle(Fill=Media.Brushes.Red, Width=15., Height=15.)
    let rh = Shapes.Rectangle(Fill=Media.Brushes.Blue, Width=15., Height=15.)
    let lh = Shapes.Rectangle(Fill=Media.Brushes.Green, Width=15., Height=15.)
    ignore <| c.Children.Add hd
    ignore <| c.Children.Add rh
    ignore <| c.Children.Add lh
    ignore <| g.Children.Add c
    w.Content <- g
    w.Unloaded.Add(fun args -> sensor.Stop())

    let getDisplayPosition w h (joint : Joint) =
        let x = ((w * (float)joint.Position.X + 2.0) / 4.0) + (w/2.0)
        let y = ((h * -(float)joint.Position.Y + 2.0) / 4.0) + (h/2.0)
        System.Console.WriteLine("X:" + x.ToString() + " Y:" + y.ToString())
        new Point(x,y)

    let draw (joint : Joint) (sh : System.Windows.Shapes.Shape) =
        let p = getDisplayPosition width height joint
        sh.Dispatcher.Invoke(DispatcherPriority.Render, Action(fun () -> System.Windows.Controls.Canvas.SetLeft(sh, p.X))) |> ignore
        sh.Dispatcher.Invoke(DispatcherPriority.Render, Action(fun () -> System.Windows.Controls.Canvas.SetTop(sh, p.Y))) |> ignore

    let drawJoints (sk : Skeleton) =
        draw (sk.Joints.Item(JointType.Head)) hd
        draw (sk.Joints.Item(JointType.WristRight)) rh
        draw (sk.Joints.Item(JointType.WristLeft)) lh

    let skeleton (sensor : KinectSensor) =
        let rec loop () =
            async {
                let! args = Async.AwaitEvent sensor.SkeletonFrameReady
                use frame = args.OpenSkeletonFrame()
                let skeletons : Skeleton[] = Array.zeroCreate(frame.SkeletonArrayLength)
                frame.CopySkeletonDataTo(skeletons)
                skeletons
                |> Seq.filter (fun s -> s.TrackingState <> SkeletonTrackingState.NotTracked)
                |> Seq.iter (fun s -> drawJoints s)
                return! loop ()
            }
        loop ()

    skeleton sensor |> Async.Start

    let a = Application()
    ignore <| a.Run(w)
In F#, any value bindings (e.g., let or do) you declare within a module itself will be executed the first time the module is opened or accessed from another module. If you're familiar with C#, you can think of these value bindings as executing within a type constructor (i.e., a static constructor).
I suspect the reason the second version of your code works, but not the first, is that in the second version you're creating the Window and drawing the shapes into it from within the STA thread that runs the application's message loop. In the first version, I'd guess that code is executing on some other thread, which is why it isn't working as expected.
There's nothing wrong with the second version of your code, but a more canonical F# approach would be to lift your functions (getDisplayPosition, draw, etc.) out of the top-level do binding, as in the sketch below. That makes the code a bit easier to read by making it obvious that those functions aren't capturing any of the local values created within the do.
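A minimal sketch of that restructuring (only the shape matters; the Kinect wiring, dispatcher calls, and most of the window setup are elided, and the helper bodies are the ones from the code above):
open System
open System.Windows
open Microsoft.Kinect
// Helpers lifted to module level: they take everything they need as parameters,
// so nothing created inside the do-block is captured.
let getDisplayPosition (w : float) (h : float) (joint : Joint) =
    let x = ((w * float joint.Position.X + 2.0) / 4.0) + (w / 2.0)
    let y = ((h * -(float joint.Position.Y) + 2.0) / 4.0) + (h / 2.0)
    Point(x, y)
let draw (w : float) (h : float) (joint : Joint) (sh : Shapes.Shape) =
    // Dispatching back to the UI thread, as in the original draw, is omitted here.
    let p = getDisplayPosition w h joint
    Controls.Canvas.SetLeft(sh, p.X)
    Controls.Canvas.SetTop(sh, p.Y)
[<STAThread>]
do
    // The sensor, window, shapes, and skeleton loop still live here,
    // on the STA thread that runs the message loop.
    ()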

using Array.Parallel.map for decreasing running time

Hello everyone
I have converted a C# project that paints the Mandelbrot set to F#.
Unfortunately it takes around one minute to render a full screen, so I am trying to find ways to speed it up.
There is one call that takes almost all of the time:
Array.map (fun x -> this.colorArray.[CalcZ x]) xyArray
xyArray has type (double * double) [] (an array of tuples of doubles)
colorArray is an array of int32 of length 255
CalcZ is defined as:
let CalcZ (coord:double * double) =
    let maxIterations = 255
    let rec CalcZHelper (xCoord:double) (yCoord:double) // line break inserted
                        (x:double) (y:double) iters =
        let newx = x * x + xCoord - y * y
        let newy = 2.0 * x * y + yCoord
        match newx, newy, iters with
        | _ when Math.Abs newx > 2.0 -> iters
        | _ when Math.Abs newy > 2.0 -> iters
        | _ when iters = maxIterations -> iters
        | _ -> CalcZHelper xCoord yCoord newx newy (iters + 1)
    CalcZHelper (fst coord) (snd coord) (fst coord) (snd coord) 0
As I only use around half of the processor capacity, one idea is to use more threads, specifically Array.Parallel.map, which translates to System.Threading.Tasks.Parallel.
Now my question
A naive solution would be:
Array.Parallel.map (fun x -> this.colorArray.[CalcZ x]) xyArray
but that took twice the time. How can I rewrite this to take less time, or is there some other way to utilize the processor better?
Thanks in advance
Gorgen
---edit---
the function that is calling CalcZ looks like this:
let GetMatrix =
    let halfX = double bitmap.PixelWidth * scale / 2.0
    let halfY = double bitmap.PixelHeight * scale / 2.0
    let rect:Mandelbrot.Rectangle =
        {xMax = centerX + halfX; xMin = centerX - halfX;
         yMax = centerY + halfY; yMin = centerY - halfY;}
    let size:Mandelbrot.Size =
        {x = bitmap.PixelWidth; y = bitmap.PixelHeight}
    let xyList = GenerateXYTuple rect size
    let xyArray = Array.ofList xyList
    Array.map (fun x -> this.colorArray.[CalcZ x]) xyArray

let region:Int32Rect = new Int32Rect(0,0,bitmap.PixelWidth,bitmap.PixelHeight)
bitmap.WritePixels(region, GetMatrix, bitmap.PixelWidth * 4, region.X, region.Y)
GenerateXYTuple:
let GenerateXYTuple (rect:Rectangle) (pixels:Size) =
    let xStep = (rect.xMax - rect.xMin)/double pixels.x
    let yStep = (rect.yMax - rect.yMin)/double pixels.y
    [for column in 0..pixels.y - 1 do
        for row in 0..pixels.x - 1 do
            yield (rect.xMin + xStep * double row,
                   rect.yMax - yStep * double column)]
---edit---
Following a suggestion from kvb (thanks a lot!) in a comment on my question, I built the program in Release mode. Building in Release mode generally sped things up.
Just building in Release took me from 50 s to around 30 s; moving all the transforms on the array together so that everything happens in one pass made it around 10 seconds faster; and finally, using Array.Parallel.init brought me to just over 11 seconds.
What I learned from this: use Release mode when timing things and when using parallel constructs.
One more time, thanks for the help I have received.
--edit--
By using SSE assembler from a native dll, I have been able to slash the time from around 12 seconds to 1.2 seconds for a full screen of the most computationally intensive points. Unfortunately I don't have a graphics processor...
Gorgen
Per the comment on the original post, here is the code I wrote to test the function. The fast version takes only a few seconds on my average workstation. It is fully sequential and has no parallel code.
It's moderately long, so I posted it on another site: http://pastebin.com/Rjj8EzCA
I suspect that the slowdown you are seeing is in the rendering code.
I don't think the Array.Parallel.map function (which uses Parallel.For from .NET 4.0 under the covers) should have trouble parallelizing the operation if it runs a simple function ~1 million times. However, I encountered some weird performance behavior in a similar case when F# didn't optimize the call to the lambda function (in some way).
I'd try taking a copy of the Parallel.map function from the F# sources and adding inline. Try adding the following map function to your code and use it instead of the one from the F# libraries:
// Requires: open System.Threading.Tasks
let inline map (f: 'T -> 'U) (array : 'T[]) : 'U[] =
    let inputLength = array.Length
    let result = Array.zeroCreate inputLength
    Parallel.For(0, inputLength, fun i ->
        result.[i] <- f array.[i]) |> ignore
    result
As an aside, it looks like you're generating an array of coordinates and then mapping it to an array of results. You don't need to create the coordinate array if you use the init function instead of map: Array.Parallel.init 1000 (fun y -> Array.init 1000 (fun x -> this.colorArray.[CalcZ (x, y)]))
EDIT: The following may be inaccurate:
Your problem could be that you call a tiny function a million times, causing the scheduling overhead to overwhelm the actual work you're doing. You should partition the array into much larger chunks so that each individual task takes a millisecond or so. You can use an array of arrays, so that you call Array.Parallel.map on the outer array and Array.map on the inner arrays. That way each parallel operation will operate on a whole row of pixels instead of just a single pixel; a sketch of this follows below.
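A minimal sketch of that chunking idea, assuming the xyArray and CalcZ from the question (the colorArray lookup is folded into the mapped function):
// Sketch: split the flat coordinate array into one chunk per row of pixels,
// run the chunks in parallel, and map sequentially within each chunk, so each
// parallel task does a whole row's worth of work rather than a single pixel.
// Assumes xyArray.Length is a multiple of rowLength.
let mapByRows (rowLength : int) (f : (double * double) -> int) (xyArray : (double * double)[]) =
    let rowCount = xyArray.Length / rowLength
    let rows = Array.init rowCount (fun r -> Array.sub xyArray (r * rowLength) rowLength)
    rows
    |> Array.Parallel.map (Array.map f)   // parallel over rows, sequential within each row
    |> Array.concat                       // flatten back to the original pixel order
Used as something like: xyArray |> mapByRows bitmap.PixelWidth (fun xy -> this.colorArray.[CalcZ xy]).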

F#/"Accelerator v2" DFT algorithm implementation probably incorrect

I'm trying to experiment with software-defined radio concepts. Starting from this article, I've tried to implement a GPU-parallel Discrete Fourier Transform.
I'm pretty sure I could pre-calculate 90 degrees of the sin(i) and cos(i) values and then just flip and repeat rather than doing what this code does, and that this would speed it up. But so far, I don't even think I'm getting correct answers. An all-zeros input gives a 0 result as I'd expect, but all 0.5 as inputs gives 78.9985886f (I'd expect a 0 result in this case too). Basically, I'm just generally confused: I don't have any good input data, and I don't know what to do with the result or how to verify it.
This question is related to my other post here
open Microsoft.ParallelArrays
open System
// X64MulticoreTarget is faster on my machine, unexpectedly
let target = new DX9Target() // new X64MulticoreTarget()
ignore(target.ToArray1D(new FloatParallelArray([| 0.0f |]))) // Dummy operation to warm up the GPU
let stopwatch = new System.Diagnostics.Stopwatch() // For benchmarking
let Hz = 50.0f
let fStep = (2.0f * float32(Math.PI)) / Hz
let shift = 0.0f // offset, once we have to adjust for the last batch of samples of a stream
// If I knew that the periodic function is periodic
// at whole-number intervals, I think I could keep
// shift within a smaller range to support streams
// without overflowing shift - but I haven't
// figured that out
//let elements = 8192 // maximum for a 1D array - makes sense as 2^13
//let elements = 7240 // maximum on my machine for a 2D array, but why?
let elements = 7240
// need good data!!
let buffer : float32[,] = Array2D.init<float32> elements elements (fun i j -> 0.5f) //(float32(i * elements) + float32(j)))
let input = new FloatParallelArray(buffer)
let seqN : float32[,] = Array2D.init<float32> elements elements (fun i j -> (float32(i * elements) + float32(j)))
let steps = new FloatParallelArray(seqN)
let shiftedSteps = ParallelArrays.Add(shift, steps)
let increments = ParallelArrays.Multiply(fStep, steps)
let cos_i = ParallelArrays.Cos(increments) // Real component series
let sin_i = ParallelArrays.Sin(increments) // Imaginary component series
stopwatch.Start()
// From the documentation, I think ParallelArrays.Multiply does standard element by
// element multiplication, not matrix multiplication
// Then we sum each element for each complex component (I don't understand the relationship
// of this, or the importance of the generalization to complex numbers)
let real = target.ToArray1D(ParallelArrays.Sum(ParallelArrays.Multiply(input, cos_i))).[0]
let imag = target.ToArray1D(ParallelArrays.Sum(ParallelArrays.Multiply(input, sin_i))).[0]
printf "%A in " ((real * real) + (imag * imag)) // sum the squares for the presence of the frequency
stopwatch.Stop()
printfn "%A" stopwatch.ElapsedMilliseconds
ignore (System.Console.ReadKey())
I share your surprise that your answer is not closer to zero. I'd suggest writing naive code to perform your DFT in F# and seeing if you can track down the source of the discrepancy.
Here's what I think you're trying to do:
let N = 7240
let F = 1.0f/50.0f
let pi = single System.Math.PI
let signal = [| for i in 1 .. N*N -> 0.5f |]

let real =
    seq { for i in 0 .. N*N-1 -> signal.[i] * (cos (2.0f * pi * F * (single i))) }
    |> Seq.sum

let img =
    seq { for i in 0 .. N*N-1 -> signal.[i] * (sin (2.0f * pi * F * (single i))) }
    |> Seq.sum

let power = real*real + img*img
Hopefully you can use this naive code to get a better intuition for how the Accelerator code ought to behave, which could guide you in your testing of it. Keep in mind that part of the reason for the discrepancy may simply be the precision of the calculations: there are ~52 million elements in your arrays, so accumulating a total error of 79 may not actually be too bad. FWIW, I get a power of ~0.05 when running the above single-precision code, but a power of ~4e-18 when using equivalent code with double-precision numbers (sketched below).
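For comparison, here is a sketch of the double-precision equivalent referred to above (same structure as the naive code, just float instead of float32; names are suffixed to avoid clashing with the single-precision version):
let Nd = 7240
let Fd = 1.0 / 50.0
let pid = System.Math.PI
let signald = [| for i in 1 .. Nd*Nd -> 0.5 |]
let reald =
    seq { for i in 0 .. Nd*Nd-1 -> signald.[i] * cos (2.0 * pid * Fd * float i) }
    |> Seq.sum
let imgd =
    seq { for i in 0 .. Nd*Nd-1 -> signald.[i] * sin (2.0 * pid * Fd * float i) }
    |> Seq.sum
// With doubles the accumulated rounding error is far smaller, so the power
// comes out very close to zero.
let powerd = reald*reald + imgd*imgd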
Two suggestions:
ensure you're not somehow confusing degrees with radians
try doing it sans-parallelism, or just with F#'s asyncs for parallelism
(In F#, if you have an array of floats
let a : float[] = ...
then you can 'add a step to all of them in parallel' to produce a new array with
let aShift =
    a
    |> Array.map (fun x -> async { return x + shift })
    |> Async.Parallel
    |> Async.RunSynchronously
(though I expect this might be slower than just doing a synchronous loop).)
