I'm experimenting with the C++ AMP library in F# as a way of using the GPU to do work in parallel. However, the results I'm getting don't seem intuitive.
In C++, I made a library with one function that squares all the numbers in an array, using AMP:
extern "C" __declspec ( dllexport ) void _stdcall square_array(double* arr, int n)
{
// Create a view over the data on the CPU
array_view<double,1> dataView(n, &arr[0]);
// Run code on the GPU
parallel_for_each(dataView.extent, [=] (index<1> idx) restrict(amp)
{
dataView[idx] = dataView[idx] * dataView[idx];
});
// Copy data from GPU to CPU
dataView.synchronize();
}
(Code adapted from Igor Ostrovsky's blog on MSDN.)
I then wrote the following F# to compare the Task Parallel Library (TPL) to AMP:
open System
open System.Diagnostics
open System.Runtime.InteropServices
open System.Threading.Tasks

// Print the time needed to run the given function
let time f =
    let s = new Stopwatch()
    s.Start()
    f ()
    s.Stop()
    printfn "elapsed: %d" s.ElapsedTicks

module CInterop =
    [<DllImport("CPlus", CallingConvention = CallingConvention.StdCall)>]
    extern void square_array(float[] array, int length)

let options = new ParallelOptions()
let size = 1000.0
let arr = [| 1.0 .. size |]

// Square the number at the given index of the array
let sq i =
    arr.[i] <- arr.[i] * arr.[i]

// Square every number in the array using TPL
// (the upper bound is exclusive, so use arr.Length, not arr.Length - 1)
time (fun () -> Parallel.For(0, arr.Length, options, new Action<int>(sq)) |> ignore)

let arr2 = [| 1.0 .. size |]

// Square every number in the array using AMP
time (fun () -> CInterop.square_array(arr2, arr2.Length))
If I set the array size to a trivial number like 10, it takes the TPL ~22K ticks to finish, and AMP ~10K ticks. That's what I expect. As I understand it, a GPU (hence AMP) should be better suited to this situation, where the work is broken into very small pieces, than the TPL.
However, if I increase the array size to 1000, the TPL now takes ~30K ticks and AMP takes ~70K ticks. And it just gets worse from there. For an array of size 1 million, AMP takes nearly 1000x as long as the TPL.
Since I expect the GPU (i.e. AMP) to be better at this kind of task, I'm wondering what I'm missing here.
My graphics card is a GeForce 550 Ti with 1GB, not a slouch as far as I know. I know there's overhead in using PInvoke to call into the AMP code, but I expect that to be a flat cost that is amortized over larger array sizes. I believe the array is passed by reference (though I could be wrong), so I don't expect any cost associated with copying that.
Thank you to everyone for your advice.
Transferring data back and forth between the GPU and the CPU takes time. You are most likely measuring your PCI Express bus bandwidth here; squaring 1M floats is a piece of cake for a GPU.
It's also not a good idea to use the Stopwatch class to measure performance for AMP, because GPU calls can happen asynchronously. In your case it is OK, since the synchronize() call forces the work to complete, but if you measure only the compute part (the parallel_for_each) this won't work. I think you can use D3D11 performance counters for that.
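For what it's worth, here is a rough sketch of how one could split the measurement (square_array_timed is a hypothetical variant, not code from the question): run the kernel against an explicit accelerator_view, wait on it, and only then synchronize. Note that the first call also pays a one-time JIT cost for compiling the kernel.

#include <amp.h>
#include <chrono>
#include <cstdio>
using namespace concurrency;

void square_array_timed(double* arr, int n)
{
    accelerator_view av = accelerator().default_view;
    array_view<double, 1> dataView(n, &arr[0]);

    auto t0 = std::chrono::high_resolution_clock::now();
    parallel_for_each(av, dataView.extent, [=](index<1> idx) restrict(amp)
    {
        dataView[idx] = dataView[idx] * dataView[idx];
    });
    av.wait(); // block until the GPU has actually finished the kernel
    auto t1 = std::chrono::high_resolution_clock::now();

    dataView.synchronize(); // copy the results back over PCIe
    auto t2 = std::chrono::high_resolution_clock::now();

    long long kernel_ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    long long copy_ms   = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    std::printf("kernel: %lld ms, copy-back: %lld ms\n", kernel_ms, copy_ms);
}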
Recently I faced a problem where my program's memory use kept increasing while it ran, and returned to a normal level once the program was closed. Obviously, it was a memory leak. After some work I located the responsible code, but I don't know why it leaks. The program's workflow is simple:
first use the lidar API to get point cloud and image data;
then hand the data to a TBB flow graph for processing;
finally use the Open3D API to visualize it.
In the first step, the lidar's own API uses asio to asynchronously invoke callback functions that deliver the data, so I created some tbb::concurrent_queue instances to store it, plus an align function to match clouds and images by timestamp. The problem is in the align function. In it, I create a vector<shared_ptr<open3d::..::PointCloud>> and use an iterator to store the point cloud elements. However, I found that when the function completes, the shared_ptr use count doesn't decrease. A similar but simpler example:
#include <iostream>
#include <memory>
#include <utility>
#include <vector>

std::pair<std::shared_ptr<int>, int> helper() {
    auto a = std::make_shared<int>(90);
    auto c = 100;
    std::vector<std::pair<std::shared_ptr<int>, int>> container;
    container.reserve(5);
    auto iter = container.begin();
    for (int i = 0; i < 3; i++) {
        *iter = std::make_pair(a, c);
        iter++;
    }
    return *(iter - 1);
}

int main() {
    auto b = helper();
    std::cout << "shared_ptr use count: " << std::get<0>(b).use_count() << std::endl;
    return 0;
}
On Ubuntu 20.04 with gcc 9.4, the printed result is: shared_ptr use count: 4.
Why can't the vector be automatically destroyed when the function completes? I hope someone can kindly explain this.
Thanks @Retired Ninja! The root of the problem is that vector::reserve only reserves capacity; it does not construct elements or change the vector's size. So the vector's size after reserve is still 0, and the subsequent iterator writes go into unconstructed memory, which is undefined behavior. While the returned value happens to reach main without a visible error, the shared_ptr copies written through the invalid iterator are never destroyed by the vector, so the use count cannot drop back to 1 after the function call.
To solve the problem, one can change reserve to resize, which actually constructs the elements so the iterators point at valid objects. Or avoid iterators altogether: just use push_back and return back().
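For reference, a minimal sketch of the push_back variant: the vector's size now grows as elements are constructed, its destructor releases the three shared_ptr copies when helper() returns, and main prints a use count of 1.

#include <iostream>
#include <memory>
#include <utility>
#include <vector>

std::pair<std::shared_ptr<int>, int> helper() {
    auto a = std::make_shared<int>(90);
    auto c = 100;
    std::vector<std::pair<std::shared_ptr<int>, int>> container;
    container.reserve(5);                          // capacity only; size stays 0
    for (int i = 0; i < 3; i++)
        container.push_back(std::make_pair(a, c)); // grows the size; elements are real
    return container.back();
}

int main() {
    auto b = helper();
    std::cout << "shared_ptr use count: " << b.first.use_count() << std::endl; // prints 1
    return 0;
}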
I have a nested loop iteration scheme that's taking up a lot of memory. I understand that I should not use globals, but even after I moved everything into functions, the memory situation didn't improve. It just accumulates after each iteration as if there were no garbage collection.
Here is a workable example that's similar to my code.
I have two files. First, functions.jl:
##functions.jl
module functions
function getMatrix(A)
    L = rand(A,A);
    return L;
end

function loopOne(A, B)
    res = 0;
    for i = 1:B
        res = inv(getMatrix(A));
    end
    return(res);
end
end
Second file main.jl:
##main.jl
include("functions.jl")

function main(C)
    res = 0;
    A = 50;
    B = 30;
    for i = 1:C
        mat = functions.loopOne(A,B);
        res = mat .+ 1;
    end
    return res;
end

main(100)
When I execute julia main.jl, the memory use increases as I increase C in main(C) (sometimes to millions of allocations and more than 10 GiB when I increase C to 1000000).
I know that the example looks useless but it resembles the structure that I have. Can someone please help? Thank you.
UPDATE:
Michael K. Borregaard gave an answer which is very helpful:
module Functions #1
# (on Julia >= 0.7, rand! requires: using Random)
function loopOne!(res, mymatrix, B) #2
    for i = 1:B
        res .= inv(rand!(mymatrix)) #3
    end
    return res #4
end
end

function some_descriptive_name(C) #5
    A, B = 50, 30 #6
    res, mymat = zeros(A,A), zeros(A,A)
    for i = 1:C
        res .= Functions.loopOne!(res, mymat, B) .+ 1
    end
    return res
end
However, when I time it, allocations and memory still increase as I dial up C.
@time some_descriptive_name(30)
  0.057177 seconds (11.77 k allocations: 58.278 MiB, 9.58% gc time)
@time some_descriptive_name(60)
  0.113808 seconds (23.53 k allocations: 116.518 MiB, 9.63% gc time)
I believe that the problem comes from the inv function. If I change the code to:
function some_descriptive_name(C) #5
    A, B = 50, 30 #6
    res, mymat = zeros(A,A), zeros(A,A)
    for i = 1:C
        res .= res .+ 1
    end
    return res
end
The memory and allocations will then stay constant:
@time some_descriptive_name(3)
  0.000007 seconds (8 allocations: 39.438 KiB)
@time some_descriptive_name(60)
  0.000037 seconds (8 allocations: 39.438 KiB)
Is there a way to "clear" the memory after using inv? Since I'm not creating anything new or storing anything new, the memory usage should stay constant.
A few pointers at least:
The getMatrix function allocates a new AxA matrix every time. That will certainly consume memory. It is better to avoid the allocations if you can, e.g. by using rand! to fill an existing array with random values.
The res = 0 line defines res as an Int, but you subsequently assign a Matrix{Float64} to it (the result of inv(getMatrix)). Changing the type of a variable makes it hard for the compiler to figure out what the type is, which makes for slow code.
It seems you have a module called functions, but you never bring it in with using or import - you rely on include and a fully qualified call.
The res = inv(...) line overwrites res on every pass, so the inner loop achieves nothing beyond its final iteration!
The structure and code look like C++. Try looking at the Julia style guide.
Here's how the code would look in a more idiomatic way that avoids allocations:
module Functions #1
function loopOne!(res, mymatrix, B) #2
    for i = 1:B
        res .= inv(rand!(mymatrix)) #3
    end
    return res #4
end
end

function some_descriptive_name(C) #5
    A, B = 50, 30 #6
    res, mymat = zeros(A,A), zeros(A,A)
    for i = 1:C
        res .= Functions.loopOne!(res, mymat, B) .+ 1
    end
    return res
end
Comments:
Use a module if you like - it's up to you whether to put things in different files. Module names are conventionally capitalized.
If you can, it's an advantage to use functions that overwrite the values of an existing container. Such functions end with ! to signal that they will modify their arguments (like passing a variable by reference, without making it const, in C++).
Use the .= operator to indicate that you're not creating a new container, you're overwriting the elements of the existing one. The rand! function overwrites mymatrix.
The return keyword is not strictly needed, but DNF suggested in a comment that it is better style.
The main convention isn't used in Julia, as most code gets called by the user, not by execution of a program.
Compact assignment format for multiple variables.
Note that in this case, none of these optimisations matter much, as 99% of the calculation time is spent in the expensive inv function.
RESPONSE TO THE UPDATE:
There's nothing wrong with the inv function, it just is a costly operation to do. But again I think you may be misunderstanding what the memory counting does. It is not that the memory use is increasing, like you would be looking for in C++ if you had a pointer to an object that was never released (a memory leak). The memory use is constant, but the total sum of allocations increase, because the inv function has to make some internal allocations.
Consider this example:
for i in 1:n
    b = [1, 2, 3, 4] # Here a length-4 Array{Int64} is initialized in memory; cost is 32 bytes
end                  # Here, that memory is released.
For each run through the for loop, 32 bytes is allocated, and 32 bytes is released. When the loop ends, regardless of n, 0 bytes will be allocated from this operation. But Julia's memory tracking only adds the allocations - so after running the code you will see allocation of 32*n bytes.
The reason Julia does this is that allocating space in RAM is one of the costliest operations in computing - so reducing allocations is a good way to speed up code. But you cannot avoid allocations entirely.
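To make the C++ comparison above concrete, a genuine leak looks like this hypothetical sketch - resident memory really does grow with n and is never returned - whereas in the Julia loop above, the 32*n counted bytes have all been freed by the time the loop ends:

#include <cstddef>

// A real leak, for contrast: every iteration allocates and nothing is freed.
void leaky(std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        int* p = new int[4];        // 16 bytes allocated per iteration
        p[0] = static_cast<int>(i); // touch the memory so it is actually used
        // no delete[] p - this is the bug a C++ programmer would hunt for
    }
}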
There is thus nothing wrong with your code (in the new format) - the memory allocation and time taken you see is just a result of doing a big (expensive) operation.
I've been using F# for nearly six months and was so sure that F# Interactive should have the same performance as compiled code that, when I finally bothered to benchmark it, I was convinced it was some kind of compiler bug. Though now it occurs to me that I should have checked here first before opening an issue.
For me it is roughly 3x slower and the optimization switch does not seem to be doing anything at all.
Is this supposed to be standard behavior? If so, I really got trolled by the #time directive. I have the timings for how long it takes to sum 100M elements on this Reddit thread.
Update:
Thanks to FuleSnabel, I uncovered some things.
I tried running the example script from both fsianycpu.exe (which is the default F# Interactive) and fsi.exe, and I am getting different timings for the two runs: 134 ms for the former and 78 ms for the latter. Those two timings also correspond to the timings of the unoptimized and optimized binaries, respectively.
What makes the matter even more confusing is that the first project I used to compile the thing - part of the game library (in script form) I am making - refuses to compile the optimized binary, instead silently falling back to the unoptimized one. I had to start a fresh project to get it to compile properly. It is a wonder the other test compiled properly.
So basically, something funky is going on here and I should look into switching fsianycpu.exe to fsi.exe as the default interpreter.
I tried the example code from the pastebin, and I don't see the behavior you describe. This is the result from my performance run:
.\bin\Release\ConsoleApplication3.exe
Total iterations: 300000000, Outer: 10000, Inner: 30000
reduce sequence of list, result 450015000, time 2836 ms
reduce array, result 450015000, time 594 ms
for loop array, result 450015000, time 180 ms
reduce list, result 450015000, time 593 ms
fsi -O --exec .\Interactive.fsx
Total iterations: 300000000, Outer: 10000, Inner: 30000
reduce sequence of list, result 450015000, time 2617 ms
reduce array, result 450015000, time 589 ms
for loop array, result 450015000, time 168 ms
reduce list, result 450015000, time 603 ms
It's expected that Seq.reduce would be the slowest, the for loop the fastest and that the reduce on list/array is roughly similar (this assumes locality of list elements which isn't guaranteed).
I rewrote your code to allow for longer runs without running out of memory, and to improve the cache locality of the data. With short runs, the uncertainty of measurement makes it hard to compare the data.
Program.fs:
module fs

let stopWatch =
    let sw = new System.Diagnostics.Stopwatch()
    sw.Start ()
    sw

let total = 300000000
let outer = 10000
let inner = total / outer

let timeIt (name : string) (a : unit -> 'T) : unit =
    let t = stopWatch.ElapsedMilliseconds
    let v = a ()
    for i = 2 to outer do
        a () |> ignore
    let d = stopWatch.ElapsedMilliseconds - t
    printfn "%s, result %A, time %d ms" name v d

[<EntryPoint>]
let sumTest(args) =
    let numsList = [1..inner]
    let numsArray = [|1..inner|]
    printfn "Total iterations: %d, Outer: %d, Inner: %d" total outer inner
    let sumsSeqReduce () = Seq.reduce (+) numsList
    timeIt "reduce sequence of list" sumsSeqReduce
    let sumsArray () = Array.reduce (+) numsArray
    timeIt "reduce array" sumsArray
    let sumsLoop () =
        let mutable total = 0
        for i in 0 .. inner - 1 do
            total <- total + numsArray.[i]
        total
    timeIt "for loop array" sumsLoop
    let sumsListReduce () = List.reduce (+) numsList
    timeIt "reduce list" sumsListReduce
    0
Interactive.fsx:
#load "Program.fs"
fs.sumTest [||]
PS. I am running on Windows with Visual Studio 2015; 32-bit vs 64-bit seemed to make only a marginal difference.
I have a simple F# function cost that receives a single parameter amount, which is used for some calculations. It is a float, so I need to pass in something like cost 33.0, which in math is the same as cost 33. The compiler complains about it, and I understand why, but I would like to be able to call it like that. I tried creating another function with the same name, and used type annotations for both, but I also get compiler warnings. Is there a way to do this like C# does?
There are two mechanisms in F# to achieve this, and neither relies on implicit casts "like C#":
(A) Method overloading
type Sample =
    // 'calculations' stands in for the asker's existing float computation
    static member cost (amount: float) =
        amount |> calculations
    static member cost (amount: int) =
        (amount |> float) |> calculations

Sample.cost 10 // compiles OK
Sample.cost 10. // compiles OK
(B) Using inlining
let inline cost amount =
    amount + amount

cost 10 // compiles OK
cost 10. // compiles OK
F# doesn't allow overloading of let-bound functions, but you can overload methods on classes like in C#.
Sometimes, you can change the model to work on a Discriminated Union instead of a set of overloaded primitives, but I don't think it would be particularly sensible to do just to be able to distinguish between floats and integers.
If you want to use an int at the call site but have a float inside the function body, why not simply cast it?
let cost amount =
    // cast amount to float (reusing the name amount to shadow the first one)
    let amount = float amount
    // rest of your function
Maps, filters, folds and more : http://learnyousomeerlang.com/higher-order-functions#maps-filters-folds
The more I read, the more I get confused.
Can anybody help simplify these concepts?
I am not able to understand the significance of these concepts. In what use cases will they be needed?
I think it is mainly because of the syntax that I find it difficult to follow the flow.
The concepts of mapping, filtering and folding prevalent in functional programming actually are simplifications - or stereotypes - of different operations you perform on collections of data. In imperative languages you usually do these operations with loops.
Let's take map for an example. These three loops all take a sequence of elements and return a sequence of squares of the elements:
// C - a lot of bookkeeping
int data[] = {1,2,3,4,5};
int squares_1_to_5[sizeof(data) / sizeof(data[0])];
for (int i = 0; i < sizeof(data) / sizeof(data[0]); ++i)
    squares_1_to_5[i] = data[i] * data[i];

// C++11 - less bookkeeping, still not obvious
std::vector<int> data{1,2,3,4,5};
std::vector<int> squares_1_to_5;
for (auto i = begin(data); i < end(data); i++)
    squares_1_to_5.push_back((*i) * (*i));

// Python - quite readable, though still not obvious
data = [1,2,3,4,5]
squares_1_to_5 = []
for x in data:
    squares_1_to_5.append(x * x)
The property of a map is that it takes a collection of elements and returns the same number of somehow modified elements. No more, no less. Is it obvious at first sight in the above snippets? No, at least not until we read loop bodies. What if there were some ifs inside the loops? Let's take the last example and modify it a bit:
data = [1,2,3,4,5]
squares_1_to_5 = []
for x in data:
    if x % 2 == 0:
        squares_1_to_5.append(x * x)
This is no longer a map, though it's not obvious before reading the body of the loop. It's not clearly visible that the resulting collection might have less elements (maybe none?) than the input collection.
We filtered the input collection, performing the action only on some elements from the input. This loop is actually a map combined with a filter.
Tackling this in C would be even more noisy due to allocation details (how much space to allocate for the output array?) - the core idea of the operation on data would be drowned in all the bookkeeping.
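To make that concrete, here is a sketch of the filtered squaring in the C style of the first snippet - note the two passes, one of which exists purely to size the output:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int data[] = {1, 2, 3, 4, 5};
    int n = sizeof(data) / sizeof(data[0]);

    // Pass 1: pure bookkeeping - count survivors to size the output array.
    int count = 0;
    for (int i = 0; i < n; ++i)
        if (data[i] % 2 == 0)
            ++count;

    // Pass 2: the actual filter + map.
    int* squares = (int*)malloc(count * sizeof(int));
    int j = 0;
    for (int i = 0; i < n; ++i)
        if (data[i] % 2 == 0)
            squares[j++] = data[i] * data[i];

    for (int i = 0; i < count; ++i)
        printf("%d\n", squares[i]);
    free(squares);
    return 0;
}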
A fold is the most generic one, where the result doesn't have to contain any of the input elements, but somehow depends on (possibly only some of) them.
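In the C++11 style used above, the stock fold is std::accumulate - a sketch: the whole collection collapses to a single value that need not contain any of the inputs.

#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> data{1, 2, 3, 4, 5};
    // Fold: start from the seed 0 and combine each element with +.
    int sum = std::accumulate(data.begin(), data.end(), 0);
    std::cout << sum << '\n'; // 15 - none of the inputs survive as elements
    return 0;
}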
Let's rewrite the first Python loop in Erlang:
lists:map(fun (E) -> E * E end, [1,2,3,4,5]).
It's explicit. We see a map, so we know that this call will return a list as long as the input.
And the second one:
lists:map(fun (E) -> E * E end,
          lists:filter(fun (E) when E rem 2 == 0 -> true;
                           (_) -> false end,
                       [1,2,3,4,5])).
Again, filter will return a list at most as long as the input, map will modify each element in some way.
The latter of the Erlang examples also shows another useful property - the ability to compose maps, filters and folds to express more complicated data transformations. It's not possible with imperative loops.
They are used in almost every application, because they abstract different kinds of iteration over lists.
map is used to transform one list into another. Let's say you have a list of key-value tuples and you want just the keys. You could write:
keys([]) -> [];
keys([{Key, _Value} | T]) ->
    [Key | keys(T)].
Then you want to have values:
values([]) -> [];
values([{_Key, Value} | T]) ->
    [Value | values(T)].
Or a list of only the third element of each tuple:
third([]) -> [];
third([{_First, _Second, Third} | T]) ->
    [Third | third(T)].
Can you see the pattern? The only difference is what you take from the element, so instead of repeating the code, you can simply write what you do for one element and use map.
GetThird = fun({_First, _Second, Third}) -> Third end,
lists:map(GetThird, List).
This is much shorter, and the shorter your code is, the fewer bugs it has. Simple as that.
You don't have to think about corner cases (what if the list is empty?), and it is much easier for an experienced developer to read.
filter searches lists. You give it a function that takes an element: if it returns true, the element will be in the returned list; if it returns false, it will not. For example, filtering the logged-in users out of a list.
foldl and foldr are used when you have to do additional bookkeeping while iterating over the list - for example summing all the elements, or counting something.
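For comparison with the C++ snippets in the other answer, that "bookkeeping" is just an accumulator threaded through the traversal, which is what std::accumulate expresses; a sketch counting the even elements:

#include <iostream>
#include <numeric>
#include <vector>

int main()
{
    std::vector<int> xs{1, 2, 3, 4, 5};
    // The accumulator is the extra bookkeeping: here, a running count.
    // The Erlang equivalent would be:
    //   lists:foldl(fun(X, Acc) when X rem 2 == 0 -> Acc + 1;
    //                  (_, Acc) -> Acc end, 0, Xs).
    int evens = std::accumulate(xs.begin(), xs.end(), 0,
        [](int acc, int x) { return acc + (x % 2 == 0 ? 1 : 0); });
    std::cout << evens << '\n'; // 2
    return 0;
}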
The best explanations I've found of those functions are in books about Lisp: "Structure and Interpretation of Computer Programs" and "On Lisp", Chapter 4.