I am looking to convert the following C code into F# (this is the fast inverse square root algorithm):
float Q_rsqrt( float number )
{
    long i;
    float x2, y;

    x2 = number * 0.5F;
    y = number;
    i = * ( long * ) &y;        // Extract bit pattern
    i = 0x5f3759df - ( i >> 1 );
    y = * ( float * ) &i;       // Convert back to float.
    y = y * ( 1.5F - ( x2 * y * y ) );
    return y;
}
First of all, you should do some research. Then, if you get stuck, specify what you are having trouble with.
Here is a solution by Kit Eason.
let fastInvSqrt (n : float32) : float32 =
    let MAGIC_NUMBER : int32 = 0x5f3759df
    let THREE_HALVES = 1.5f
    let x2 = n * 0.5f
    let i = MAGIC_NUMBER - (System.BitConverter.ToInt32(System.BitConverter.GetBytes(n), 0) >>> 1)
    let y = System.BitConverter.ToSingle(System.BitConverter.GetBytes(i), 0)
    y * (THREE_HALVES - (x2 * y * y))
// Examples:
let x = fastInvSqrt 4.0f
// Output: val x : float32 = 0.499153584f
let x' = 1. / sqrt(4.0)
// Output: val x' : float = 0.5
When it comes to performance and low-level optimization, it is often a good idea to measure before and after. The fast inverse square root trick is very cool, but it only approximates the inverse square root, and the question is whether tricky code like this is truly necessary these days (back in the DOOM days, when floating-point performance was poor, the trick was amazing).
Anyway, I built a simple performance test bench in order to compare the trivial implementation with the solution provided by Kit Eason/lad2025 and another one that doesn't allocate byte arrays.
open System
open System.Diagnostics
open System.Runtime.InteropServices

[<Literal>]
let MAGIC_NUMBER : int32 = 0x5f3759df

[<Literal>]
let THREE_HALVES = 1.5F

[<Literal>]
let HALF = 0.5F

[<Literal>]
let OUTER = 1000

[<Literal>]
let INNER = 10000

let inline invSqr (x : float32) : float32 = 1.F / sqrt x

let fInvSqr (x : float32) : float32 =
    let x2 = x * 0.5f
    // Allocates two byte arrays, creating GC pressure ==> hurts performance
    let i = MAGIC_NUMBER - (BitConverter.ToInt32(BitConverter.GetBytes(x), 0) >>> 1)
    let y = BitConverter.ToSingle(BitConverter.GetBytes(i), 0)
    y * (THREE_HALVES - (x2 * y * y))

// Susceptible to race conditions & endianness issues
[<StructLayout (LayoutKind.Explicit)>]
type Bits =
    struct
        [<FieldOffset(0)>]
        val mutable f: float32
        [<FieldOffset(0)>]
        val mutable i: int32
    end

let mutable bits = Bits ()

let fInvSqr2 (x : float32) : float32 =
    let x2 = x * 0.5F
    bits.f <- x
    let i = MAGIC_NUMBER - (bits.i >>> 1)
    bits.i <- i
    let y = bits.f
    y * (THREE_HALVES - (x2 * y * y))

let timeIt n (a : unit -> 'T) : int64 * 'T =
    let r = a ()
    let sw = Stopwatch ()
    sw.Start ()
    for i = 1 to n do
        ignore <| a ()
    sw.Stop ()
    sw.ElapsedMilliseconds, r

[<EntryPoint>]
let main argv =
    let testCases =
        [|
            "invSqr"  , fun () ->
                let mutable sum = 0.F
                for x = 1 to INNER do
                    sum <- sum + invSqr (float32 x)
                sum
            "fInvSqr" , fun () ->
                let mutable sum = 0.F
                for x = 1 to INNER do
                    sum <- sum + fInvSqr (float32 x)
                sum
            "fInvSqr2", fun () ->
                let mutable sum = 0.F
                for x = 1 to INNER do
                    sum <- sum + fInvSqr2 (float32 x)
                sum
        |]

    for name, action in testCases do
        printfn "Running %s %d times..." name (OUTER*INNER)
        let elapsed, result = timeIt OUTER action
        printfn "... it took %d ms product result: %f" elapsed result
    0
The performance test result on my machine:
Running invSqr 10000000 times...
... it took 78 ms product result: 198.544600
Running fInvSqr 10000000 times...
... it took 311 ms product result: 198.358200
Running fInvSqr2 10000000 times...
... it took 49 ms product result: 198.358200
Press any key to continue . . .
So we see that fInvSqr is actually about four times slower than the trivial solution (311 ms vs. 78 ms), most likely because of the byte-array allocations. In addition, the cost of GC is hidden in these numbers and might add non-deterministic performance degradation.
fInvSqr2 seems to perform somewhat better than the trivial solution, but there are drawbacks here as well:
The result is off by about 0.1%.
The Bits trick is susceptible to race conditions (fixable).
The Bits trick is susceptible to endianness issues (if you run the program on a CPU with a different endianness it might break).
Are the performance gains worth the drawbacks? Since a program is probably not built up from inverse square root operations alone, the effective performance gain might be much smaller in reality. I have a hard time imagining a scenario where I would be so pressed for performance that I would opt for the fast inverse square root trick today, but it all depends on your context.
The XYZ color space encompasses all possible colors, not just those which can be generated by a particular device like a monitor. Not all XYZ triplets represent a color that is physically possible. Is there a way, given an XYZ triplet, to determine if it represents a real color?
I wanted to generate a CIE 1931 chromaticity diagram (seen below) for myself, but wasn't sure how to go about it. It's easy, for example, to take all combinations of sRGB triplets, transform them into the xy coordinates of the chromaticity diagram, and plot them. You cannot use this same approach in the XYZ color space, though, since not all combinations are valid colors. So far the best I have come up with is a stochastic approach, where I generate a random spectral distribution by summing a random number of random Gaussians and then convert it to XYZ using the standard observer functions.
Having thought about it a little more, I felt the obvious solution is to generate a list of xy points around the edge of the spectral locus, corresponding to pure monochromatic colors. It seems to me that this can be done by directly inputting the visible wavelengths (~380-780 nm) into the CIE XYZ standard observer color matching functions. Treating these points like a convex polygon, you could determine if a point is within the spectral locus using one algorithm or another. In my case, since what I really wanted was simply to generate the chromaticity diagram, I input these points into a graphics library's polygon drawing routine, and then for each pixel of the polygon I can transform it into sRGB.
I believe this solution is similar to the one used by the library that Kel linked in a comment. I'm not entirely sure, as I am not familiar with Python.
function RGBfromXYZ(X, Y, Z) {
    const R = 3.2404542 * X - 1.5371385 * Y - 0.4985314 * Z
    const G = -0.969266 * X + 1.8760108 * Y + 0.0415560 * Z
    const B = 0.0556434 * X - 0.2040259 * Y + 1.0572252 * Z
    return [R, G, B]
}

function XYZfromYxy(Y, x, y) {
    const X = Y / y * x
    const Z = Y / y * (1 - x - y)
    return [X, Y, Z]
}

function srgb_from_linear(x) {
    if (x <= 0.0031308) {
        return x * 12.92
    } else {
        return 1.055 * Math.pow(x, 1 / 2.4) - 0.055
    }
}

// Analytic approximations to the CIE XYZ color matching functions
// from Sloan http://jcgt.org/published/0002/02/01/paper.pdf
function xFit_1931(x) {
    const t1 = (x - 442) * (x < 442 ? 0.0624 : 0.0374)
    const t2 = (x - 599.8) * (x < 599.8 ? 0.0264 : 0.0323)
    const t3 = (x - 501.1) * (x < 501.1 ? 0.0490 : 0.0382)
    return 0.362 * Math.exp(-0.5 * t1 * t1) + 1.056 * Math.exp(-0.5 * t2 * t2) - 0.065 * Math.exp(-0.5 * t3 * t3)
}

function yFit_1931(x) {
    const t1 = (x - 568.8) * (x < 568.8 ? 0.0213 : 0.0247)
    const t2 = (x - 530.9) * (x < 530.9 ? 0.0613 : 0.0322)
    return 0.821 * Math.exp(-0.5 * t1 * t1) + 0.286 * Math.exp(-0.5 * t2 * t2)
}

function zFit_1931(x) {
    const t1 = (x - 437) * (x < 437 ? 0.0845 : 0.0278)
    const t2 = (x - 459) * (x < 459 ? 0.0385 : 0.0725)
    return 1.217 * Math.exp(-0.5 * t1 * t1) + 0.681 * Math.exp(-0.5 * t2 * t2)
}

const canvas = document.createElement("canvas")
document.body.append(canvas)
canvas.width = canvas.height = 512
const ctx = canvas.getContext("2d")

// Trace the spectral locus by sampling the color matching functions.
const locus_points = []
for (let i = 440; i < 650; ++i) {
    const [X, Y, Z] = [xFit_1931(i), yFit_1931(i), zFit_1931(i)]
    const x = (X / (X + Y + Z)) * canvas.width
    const y = (Y / (X + Y + Z)) * canvas.height
    locus_points.push([x, y])
}

// Fill the locus polygon, then color every covered pixel.
ctx.beginPath()
ctx.moveTo(...locus_points[0])
locus_points.slice(1).forEach(point => ctx.lineTo(...point))
ctx.closePath()
ctx.fill()

const imageData = ctx.getImageData(0, 0, canvas.width, canvas.height)
for (let y = 0; y < canvas.height; ++y) {
    for (let x = 0; x < canvas.width; ++x) {
        const alpha = imageData.data[(y * canvas.width + x) * 4 + 3]
        if (alpha > 0) {
            const [X, Y, Z] = XYZfromYxy(1, x / canvas.width, y / canvas.height)
            const [R, G, B] = RGBfromXYZ(X, Y, Z)
            const norm = Math.sqrt(R ** 2 + G ** 2 + B ** 2)
            const r = Math.round(srgb_from_linear(R / norm) * 255)
            const g = Math.round(srgb_from_linear(G / norm) * 255)
            const b = Math.round(srgb_from_linear(B / norm) * 255)
            imageData.data[(y * canvas.width + x) * 4 + 0] = r
            imageData.data[(y * canvas.width + x) * 4 + 1] = g
            imageData.data[(y * canvas.width + x) * 4 + 2] = b
        }
    }
}
ctx.putImageData(imageData, 0, 0)
I have written code for the cost function and it is giving an incorrect answer.
I have read the code many times but I cannot find the mistake.
Here is my code:
function J = computeCost(X, y, theta)
    m = length(y); % number of training examples
    s = 0;
    h = 0;
    sq = 0;
    J = 0;
    for i = 1:m
        h = theta' * X(i, :)';
        sq = (h - y(i))^2;
        s = s + sq;
    end
    J = (1/2*m) * s;
end
Example:
computeCost( [1 2; 1 3; 1 4; 1 5], [7;6;5;4], [0.1;0.2] )
ans = 11.9450
Here the answer should be 11.9450, but my code is giving me this:
ans = 191.12
I have checked the matrix multiplication and the code is calculating it correctly.
It seems you misunderstood the operator precedence; / and * have equal precedence and evaluate left to right, so
1/2*m ~= 1/(2*m)
Your J = (1/2*m) * s therefore computes (m/2)*s instead of s/(2*m), which is m^2 = 16 times too large for m = 4; indeed, 191.12 / 11.9450 is roughly 16.
With this in mind, it seems you're computing an average. Instead of reinventing the wheel, it is usually a good idea to use the built-in functions to do the job, which results in a much clearer (and less error-prone) implementation:
function J = computeCost(X, y, theta)
    h = X * theta;
    sq = (h - y).^2;
    J = 1/2 * mean(sq);
end

computeCost( [1,2;1,3;1,4;1,5], [7;6;5;4], [0.1;0.2] )
% ans = 11.9450
Try it online!
EDIT: Resolved, I answered the question below.
I am using the following to get metadata for PHAssets:
let data = NSData.init(contentsOf: url!)!
if let imageSource = CGImageSourceCreateWithData(data, nil) {
    let metadata = CGImageSourceCopyPropertiesAtIndex(imageSource, 0, nil)! as NSDictionary
}
The metadata dictionary has all the values I am looking for. However, a few fields like ShutterSpeedValue and ExposureTime, which hold fractions, get printed as decimals:
ExposureTime = "0.05"
ShutterSpeedValue = "4.321956769055745"
When I look at this data in my Mac's Preview app and in exiftool, it shows:
ExposureTime = 1/20
ShutterSpeedValue = 1/20
How can I get the correct fraction string instead of the decimal string?
EDIT: I tried simply converting the decimal to a fraction string using this code from SO, but it isn't correct:
func rationalApproximation(of x0 : Double, withPrecision eps : Double = 1.0E-6) -> String {
    var x = x0
    var a = x.rounded(.down)
    var (h1, k1, h, k) = (1, 0, Int(a), 1)
    while x - a > eps * Double(k) * Double(k) {
        x = 1.0 / (x - a)
        a = x.rounded(.down)
        (h1, k1, h, k) = (h, k, h1 + Int(a) * h, k1 + Int(a) * k)
    }
    return "\(h)/\(k)"
}
As you notice, the decimal value of ShutterSpeedValue printed as 4.321956769055745 isn't even equal to 1/20.
Resolved.
As per
https://www.dpreview.com/forums/post/54376235
ShutterSpeedValue is defined as an APEX value, where:
ShutterSpeed = -log2(ExposureTime)
So -log2(1/20) is 4.3219, just as I observed.
So to get the ShutterSpeedValue, I use the following:
"1/\(ceil(pow(2, Double(4.321956769055745))))"
I tested 3 different photos and 1/20, 1/15 and 1/1919 were all correctly calculated using your formula.
I am trying to speed up a process that slows down my main thread by distributing it at least across two different cores.
The reason I think I can pull this off is that each of the individual operations is independent, requiring only two points and a float.
However, my first stab at it has the code running significantly slower when doing queue.async vs. queue.sync, and I have no clue why!
Here is the code running synchronously
var block = UnsafeMutablePointer<Datas>.allocate(capacity: 0)
var outblock = UnsafeMutablePointer<Decimal>.allocate(capacity: 0)

func initialise()
{
    outblock = UnsafeMutablePointer<Decimal>.allocate(capacity: testWith * 4 * 2)
    block = UnsafeMutablePointer<Datas>.allocate(capacity: particles.count)
}

func update()
{
    var i = 0
    for part in particles
    {
        part.update()
        let x1 = part.data.p1.x; let y1 = part.data.p1.y
        let x2 = part.data.p2.x; let y2 = part.data.p2.y
        let w = part.data.size * rectScale
        let w2 = part.data.size * rectScale
        let dy = y2 - y1; let dx = x2 - x1
        let length = sqrt(dy * dy + dx * dx)
        let calcx = (-(y2 - y1) / length)
        let calcy = ((x2 - x1) / length)
        let calcx1 = calcx * w
        let calcy1 = calcy * w
        let calcx2 = calcx * w2
        let calcy2 = calcy * w2
        outblock[i] = x1 + calcx1
        outblock[i+1] = y1 + calcy1
        outblock[i+2] = x1 - calcx1
        outblock[i+3] = y1 - calcy1
        outblock[i+4] = x2 + calcx2
        outblock[i+5] = y2 + calcy2
        outblock[i+6] = x2 - calcx2
        outblock[i+7] = y2 - calcy2
        i += 8
    }
}
Here is my attempt at distributing the workload among multiple cores
let queue = DispatchQueue(label: "construction_worker_1", attributes: .concurrent)
let blocky = block
let oblocky = outblock

for i in 0..<particles.count
{
    particles[i].update()
    block[i] = particles[i].data // Copy the raw data into a thread-safe format
    queue.async {
        let x1 = blocky[i].p1.x; let y1 = blocky[i].p1.y
        let x2 = blocky[i].p2.x; let y2 = blocky[i].p2.y
        let w = blocky[i].size * rectScale
        let w2 = blocky[i].size * rectScale
        let dy = y2 - y1; let dx = x2 - x1
        let length = sqrt(dy * dy + dx * dx)
        let calcx = (-(y2 - y1) / length)
        let calcy = ((x2 - x1) / length)
        let calcx1 = calcx * w
        let calcy1 = calcy * w
        let calcx2 = calcx * w2
        let calcy2 = calcy * w2
        let writeIndex = i * 8
        oblocky[writeIndex] = x1 + calcx1
        oblocky[writeIndex+1] = y1 + calcy1
        oblocky[writeIndex+2] = x1 - calcx1
        oblocky[writeIndex+3] = y1 - calcy1
        oblocky[writeIndex+4] = x2 + calcx2
        oblocky[writeIndex+5] = y2 + calcy2
        oblocky[writeIndex+6] = x2 - calcx2
        oblocky[writeIndex+7] = y2 - calcy2
    }
}
I really have no clue why this slowdown is happening! I am using UnsafeMutablePointer so the data is thread-safe, and I am ensuring that no variable can ever be read or written by multiple threads at the same time.
What is going on here?
As described in Performing Loop Iterations Concurrently, there is overhead with each block dispatched to some background queue. So you will want to “stride” through your array, letting each iteration process multiple data points, not just one.
Also, dispatch_apply, called concurrentPerform in Swift 3 and later, is designed for performing loops in parallel and it’s optimized for the particular device’s cores. Combined with striding, you should achieve some performance benefit:
DispatchQueue.global(qos: .userInitiated).async {
    let stride = 100
    let chunks = (particles.count + stride - 1) / stride // round up so the final partial chunk is processed
    DispatchQueue.concurrentPerform(iterations: chunks) { iteration in
        let start = iteration * stride
        let end = min(start + stride, particles.count)
        for i in start ..< end {
            particles[i].update()
            block[i] = particles[i].data // Copy the raw data into a thread-safe format
            let x1 = blocky[i].p1.x; let y1 = blocky[i].p1.y
            let x2 = blocky[i].p2.x; let y2 = blocky[i].p2.y
            let w = blocky[i].size * rectScale
            let w2 = blocky[i].size * rectScale
            let dy = y2 - y1; let dx = x2 - x1
            let length = hypot(dy, dx)
            let calcx = -dy / length
            let calcy = dx / length
            let calcx1 = calcx * w
            let calcy1 = calcy * w
            let calcx2 = calcx * w2
            let calcy2 = calcy * w2
            let writeIndex = i * 8
            oblocky[writeIndex] = x1 + calcx1
            oblocky[writeIndex+1] = y1 + calcy1
            oblocky[writeIndex+2] = x1 - calcx1
            oblocky[writeIndex+3] = y1 - calcy1
            oblocky[writeIndex+4] = x2 + calcx2
            oblocky[writeIndex+5] = y2 + calcy2
            oblocky[writeIndex+6] = x2 - calcx2
            oblocky[writeIndex+7] = y2 - calcy2
        }
    }
}
You should experiment with different stride values and see how the performance changes.
I can't run this code (I don't have sample data, I don't have the definition of Datas, etc.), so I apologize if I introduced any issues. But don't focus on the code here; instead, focus on the broader issues of using concurrentPerform for performing concurrent loops, and striding to ensure that you've got enough work on each thread so that the threading overhead doesn't outweigh the broader benefits of running threads in parallel.
See https://stackoverflow.com/a/22850936/1271826 for a broader discussion of the issues here.
Your expectations may be wrong. Your goal was to free up the main thread, and you did that. That is what is now faster: the main thread!
But async on a background thread means "please do this any old time you please, allowing it to pause so other code can run in the middle of it" — it doesn't mean "do it fast", not at all. And I don't see any qos specification in your code, so it's not like you are asking for special attention or anything.
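If you do want the system to prioritize that background work, you can request a quality of service when you create the queue. A minimal sketch, reusing the label from your code; everything else is stock Dispatch API:
import Dispatch

// Same concurrent queue as before, but with an explicit QoS so the work
// isn't scheduled at default priority.
let queue = DispatchQueue(label: "construction_worker_1",
                          qos: .userInitiated,
                          attributes: .concurrent)
Even then, each tiny async block still carries its own dispatch overhead, so a higher QoS by itself won't make this approach faster than the synchronous loop.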
I've got the following code to do a bilinear interpolation from a matrix of 2D vectors. Each cell has the x and y values of a vector, and the function receives the k and l indices of the nearest bottom-left position in the matrix.
// p[1] returns the interpolated values
// fieldLinePointsVerts: the raw data array of fieldNumHorizontalPoints x fieldNumVerticalPoints
// only fieldNumHorizontalPoints matters to determine the index to access the raw data
// k and l: horizontal and vertical indices of the point just below p[0] in the raw data
void interpolate( vertex2d* p, vertex2d* fieldLinePointsVerts, int fieldNumHorizontalPoints, int k, int l ) {
    int index = (l * fieldNumHorizontalPoints + k) * 2;
    vertex2d p11;
    p11.x = fieldLinePointsVerts[index].x;
    p11.y = fieldLinePointsVerts[index].y;
    vertex2d q11;
    q11.x = fieldLinePointsVerts[index+1].x;
    q11.y = fieldLinePointsVerts[index+1].y;

    index = (l * fieldNumHorizontalPoints + k + 1) * 2;
    vertex2d q21;
    q21.x = fieldLinePointsVerts[index+1].x;
    q21.y = fieldLinePointsVerts[index+1].y;

    index = ( (l + 1) * fieldNumHorizontalPoints + k) * 2;
    vertex2d q12;
    q12.x = fieldLinePointsVerts[index+1].x;
    q12.y = fieldLinePointsVerts[index+1].y;

    index = ( (l + 1) * fieldNumHorizontalPoints + k + 1 ) * 2;
    vertex2d p22;
    p22.x = fieldLinePointsVerts[index].x;
    p22.y = fieldLinePointsVerts[index].y;
    vertex2d q22;
    q22.x = fieldLinePointsVerts[index+1].x;
    q22.y = fieldLinePointsVerts[index+1].y;

    float fx = 1.0 / (p22.x - p11.x);
    float fx1 = (p22.x - p[0].x) * fx;
    float fx2 = (p[0].x - p11.x) * fx;

    vertex2d r1;
    r1.x = fx1 * q11.x + fx2 * q21.x;
    r1.y = fx1 * q11.y + fx2 * q21.y;
    vertex2d r2;
    r2.x = fx1 * q12.x + fx2 * q22.x;
    r2.y = fx1 * q12.y + fx2 * q22.y;

    float fy = 1.0 / (p22.y - p11.y);
    float fy1 = (p22.y - p[0].y) * fy;
    float fy2 = (p[0].y - p11.y) * fy;

    p[1].x = fy1 * r1.x + fy2 * r2.x;
    p[1].y = fy1 * r1.y + fy2 * r2.y;
}
Currently this code needs to run every single frame on old iOS devices, say devices with ARMv6 processors.
I've taken the numeric sub-indices from Wikipedia's equations: http://en.wikipedia.org/wiki/Bilinear_interpolation
I'd appreciate any comments on optimizing this for performance, even plain asm code.
This code should not be causing your slowdown if it's only run once per frame. However, if it's run multiple times per frame, it easily could be.
I'd run your app with a profiler to see where the true performance problem lies.
There is some room for optimization here: a) certain index calculations could be factored out and re-used in subsequent calculations, and b) you could dereference your fieldLinePointsVerts array to a pointer once and re-use that, instead of indexing it twice for each index value...
but in general those things won't help a great deal unless this function is being called many, many times per frame, in which case every little thing will help.