Julia - nested loops are consuming a lot of memory

I have a nested loop iteration scheme that's taking up a lot of memory. I understand that I should not use globals, but even after I put everything inside a function, the memory situation didn't improve. It just accumulates after each iteration as if there were no garbage collection.
Here is a minimal working example, similar to my code.
I have two files. First, functions.jl:
##functions.jl
module functions

function getMatrix(A)
    L = rand(A, A);
    return L;
end

function loopOne(A, B)
    res = 0;
    for i = 1:B
        res = inv(getMatrix(A));
    end
    return(res);
end

end
Second file main.jl:
##main.jl
include("functions.jl")

function main(C)
    res = 0;
    A = 50;
    B = 30;
    for i = 1:C
        mat = functions.loopOne(A, B);
        res = mat .+ 1;
    end
    return res;
end

main(100)
When I execute julia main.jl, memory use grows as I increase C in main(C) (sometimes to millions of allocations and more than 10 GiB when I raise C to 1000000).
I know the example looks useless, but it resembles the structure of my actual code. Can someone please help? Thank you.
UPDATE:
Michael K. Borregaard gave a very helpful answer:
module Functions #1
using Random # import added for rand! (needed on Julia 1.0+)

function loopOne!(res, mymatrix, B) #2
    for i = 1:B
        res .= inv(rand!(mymatrix)) #3
    end
    return res #4
end
end

function some_descriptive_name(C) #5
    A, B = 50, 30 #6
    res, mymat = zeros(A, A), zeros(A, A)
    for i = 1:C
        res .= Functions.loopOne!(res, mymat, B) .+ 1
    end
    return res
end
However, when I time it, allocations and memory still increase as I dial up C.
@time some_descriptive_name(30)
  0.057177 seconds (11.77 k allocations: 58.278 MiB, 9.58% gc time)
@time some_descriptive_name(60)
  0.113808 seconds (23.53 k allocations: 116.518 MiB, 9.63% gc time)
I believe that the problem comes from the inv function. If I change the code to:
function some_descriptive_name(C) #5
    A, B = 50, 30 #6
    res, mymat = zeros(A, A), zeros(A, A)
    for i = 1:C
        res .= res .+ 1
    end
    return res
end
The memory and allocations will then stay constant:
@time some_descriptive_name(3)
  0.000007 seconds (8 allocations: 39.438 KiB)
@time some_descriptive_name(60)
  0.000037 seconds (8 allocations: 39.438 KiB)
Is there a way to "clear" the memory after using inv? Since I'm not creating anything new or storing anything new, the memory usage should stay constant.

A few pointers at least:
The getMatrix function allocates a new AxA matrix every time. That will certainly consume memory. It is better to avoid the allocations if you can, e.g. by using rand! to fill an existing array with random values.
The res = 0 line defines res as an Int, but you subsequently assign a Matrix{Float64} to it (the result of inv(getMatrix(A))). Changing the type of a variable in the code makes it hard for the compiler to figure out what the type is, which makes for slow code.
It seems you have a module called functions, but you don't really need it here.
The res = inv(...) line overwrites the value on every iteration, so the loop achieves nothing!
The structure and code look like C++. Try looking at the Julia style guide.
Here's how the code would look written in a more idiomatic way that avoids these allocations:
module Functions #1
using Random # import added for rand! (needed on Julia 1.0+)

function loopOne!(res, mymatrix, B) #2
    for i = 1:B
        res .= inv(rand!(mymatrix)) #3
    end
    return res #4
end
end

function some_descriptive_name(C) #5
    A, B = 50, 30 #6
    res, mymat = zeros(A, A), zeros(A, A)
    for i = 1:C
        res .= Functions.loopOne!(res, mymat, B) .+ 1
    end
    return res
end
Comments:
Use a module if you like - it's up to you whether to put things in different files. Module name in caps.
If you can, it's an advantage to use functions that overwrite the values of an existing container. Such functions end with ! to signal that they will modify their arguments (like passing a variable by reference without making it const in C++).
Use the .= operator to indicate that you're not creating a new container, you're overwriting the elements of the existing one. The rand! function overwrites mymatrix.
The return keyword is not strictly needed, but as DNF suggested in a comment, it is better style.
The main convention isn't used in Julia, as most code gets called by the user, not by execution of a program.
Compact assignment format for multiple variables.
Note that in this case, none of these optimisations matter much, as 99% of the calculation time is spent in the expensive inv function.
RESPONSE TO THE UPDATE:
There's nothing wrong with the inv function; it is just a costly operation. But again, I think you may be misunderstanding what the memory counting does. It is not that memory use keeps increasing, the way it would in C++ if you had a pointer to an object that was never released (a memory leak). The memory use is constant, but the total sum of allocations increases, because inv has to make some internal allocations each time it is called.
Consider this example:
for i in 1:n
    b = [1, 2, 3, 4] # Here a length-4 Array{Int64} is allocated; cost is 32 bytes
end                  # Here, that memory is released.
For each pass through the for loop, 32 bytes are allocated and 32 bytes are released. When the loop ends, regardless of n, 0 bytes remain allocated from this operation. But Julia's memory tracking only adds up the allocations - so after running the code you will see 32*n bytes of reported allocations.
The reason Julia counts this is that allocating space in RAM is one of the costliest operations in computing - so reducing allocations is a good way to speed up code. But you cannot always avoid allocating.
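To make the C++ analogy concrete, here is a minimal illustrative sketch (mine, not part of the original answer) of the difference between an actual leak and this kind of harmless allocation churn:

#include <vector>

void leak_like_cpp() {
    // A real leak: memory that is never released.
    // Resident memory grows with every call -- this is what you would
    // hunt for in C++, and it is NOT what the Julia code is doing.
    int* p = new int[4];
    (void)p; // never deleted
}

void allocation_churn(int n) {
    // Harmless churn: each iteration allocates and frees a small vector.
    // Resident memory stays flat, but an allocation profiler would still
    // report n allocations in total -- just like Julia's allocation counter.
    for (int i = 0; i < n; ++i) {
        std::vector<int> b = {1, 2, 3, 4}; // allocated here
    } // freed here, at the end of every iteration
}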
There is thus nothing wrong with your code (in the new format) - the memory allocations and the time you see are simply the result of doing a big (expensive) operation.

Related

C++ vector with iterator can't be destroyed?

Recently I have faced a problem where the program's memory keeps increasing while it runs, and returns to a normal level when the program is closed. Obviously, it's a memory leak. After some work, I have located the code responsible, but I don't know why it leaks. The program's workflow is simple:
first, use the lidar API to get point cloud and image data;
then transport the data to the next tbb flow graph to process it;
finally, use the open3d API to visualize it.
In the first step, the lidar's own API uses asio to asynchronously invoke callback functions that transport the data, so I created some tbb concurrent_queues to store the data, and an align function to match cloud and image by timestamp. The problem is in the align function. In the function, I create a vector<shared_ptr<open3d::..::PointCloud>> and use an iterator to store point cloud elements. However, I found that when the function completes, the shared_ptr use count doesn't decrease. A similar but simpler example looks like this:
#include <iostream>
#include <memory>
#include <utility>
#include <vector>

std::pair<std::shared_ptr<int>, int> helper() {
    auto a = std::make_shared<int>(90);
    auto c = 100;
    std::vector<std::pair<std::shared_ptr<int>, int>> container;
    container.reserve(5);
    auto iter = container.begin();
    for (int i = 0; i < 3; i++) {
        *iter = std::make_pair(a, c); // writes into reserved but unconstructed storage
        iter++;
    }
    return *(iter - 1);
}

int main() {
    auto b = helper();
    std::cout << "shared_ptr use count: " << std::get<0>(b).use_count() << std::endl;
    return 0;
}
On Ubuntu 20.04 with gcc 9.4, this prints shared_ptr use count: 4.
Why isn't the vector automatically destroyed when the function completes? I hope someone can kindly explain this problem.
Thanks @Retired Ninja! The root of the problem is that vector::reserve only reserves capacity, it doesn't construct elements, so the vector's size after reserve is still 0. The subsequent iterator writes therefore go into undefined (unconstructed) memory. The result happens to reach main without a visible error, but the vector never knows it owns those elements, so the shared_ptr copies placed in its raw storage are never destroyed and the use count can't drop back to 1 after the function call.
To solve the problem, one can simply change reserve to resize, which actually constructs the elements so the iterators point at defined objects. Or avoid the iterator altogether: just use push_back and return container.back().
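For illustration, here is a corrected version of the example using push_back (a sketch of the fix described above, assuming the same surrounding code):

#include <iostream>
#include <memory>
#include <utility>
#include <vector>

std::pair<std::shared_ptr<int>, int> helper() {
    auto a = std::make_shared<int>(90);
    auto c = 100;
    std::vector<std::pair<std::shared_ptr<int>, int>> container;
    container.reserve(5);            // capacity only; size stays 0
    for (int i = 0; i < 3; i++) {
        container.push_back({a, c}); // constructs real elements the vector owns
    }
    return container.back();         // copy out the last element
}   // container (and its 3 shared_ptr copies) is properly destroyed here

int main() {
    auto b = helper();
    // Only b's copy remains, so this prints: shared_ptr use count: 1
    std::cout << "shared_ptr use count: " << std::get<0>(b).use_count() << std::endl;
    return 0;
}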

F# seq behavior

I'm a little baffled by the inner workings of sequence expressions in F#.
Suppose we make a sequential file reader with seq, with no intentional caching of data:
seq {
    let mutable current = file.Read()
    while current <> -1 do
        yield current
}
We will end up with some weird behavior if we try to re-iterate or backtrack. My idea of this was: since Read() relies on mutable state, we can't expect the output to be correct if we re-iterate. But then the following behaves nicely, even on boundary reading?
let Read path =
    seq {
        use fp = System.IO.File.OpenRead path
        let buf = [| for _ in 0 .. 1024 -> 0uy |]
        let mutable pos = 1
        let mutable current = 0
        while pos <> 0 do
            if current = 0 then
                pos <- fp.Read(buf, 0, 1024)
            if pos > 0 && current < pos then
                yield buf.[current]
            current <- (current + 1) % 1024
    }

let content = Read "some path"
We clearly use the same buffer to enhance performance, but suppose we read the 1025th byte; that triggers an update of the buffer. If we then try to read any byte at a position < 1025 afterwards, we still get the correct output. How can that be, and what is the difference?
Your question is a bit unclear, so I'll try to guess.
When you create a seq { }, you're essentially creating a state machine which will run only as far as it needs to. When you request the very first element from it, it'll start at the top and run until your first yield instruction. Then, when you request another value, it'll run from that point until the next yield, and so on.
Keep in mind that a seq { } produces an IEnumerable<'T>, which is like a "plan of execution". Each time you start to iterate the sequence (for example by calling Seq.head), a call to GetEnumerator is made behind the scenes, which causes a new IEnumerator<'T> to be created. It is the IEnumerator which does the actual providing of values. You can think of it in more classical terms as having an array over which you can iterate (an iterable or enumerable) and many pointers over that array, each of which are at different points in the array (many iterators or enumerators).
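To put that "array plus many pointers" analogy in concrete terms, here is a rough C++ sketch (an analogy only; F# enumerators come from GetEnumerator rather than begin()):

#include <iostream>
#include <vector>

int main() {
    // The vector is the "enumerable": a source you can iterate over.
    std::vector<int> xs = {10, 20, 30};

    // Each iterator is an independent "enumerator" with its own position.
    auto it1 = xs.begin();
    auto it2 = xs.begin();

    ++it1; ++it1; // advance the first iterator two steps; it2 is unaffected
    std::cout << *it1 << ' ' << *it2 << '\n'; // prints "30 10"
    return 0;
}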
In your first code, file is most likely external to the seq block. This means the file handle you are reading from is baked into the plan of execution; no matter how many times you start to iterate the sequence, you'll always be reading from the same handle, with its single shared read position. This is obviously going to cause unpredictable behaviour.
However, in your second code, the file is opened as part of the seq block's definition. This means that you'll get a new file handle each time you iterate the sequence or, essentially, a new file handle per enumerator. The reason this code works is that you can't reverse an enumerator or iterate over it multiple times, not with a single thread at least.
(Now, if you were to manually get an enumerator and advance it over multiple threads, you'd probably run into problems very quickly. But that is a different topic.)

Erlang has no shared memory. So what happens with the sum function, for example?

Erlang has no shared memory. Look at the sum function:
sum([H|T]) -> H + sum(T);
sum([]) -> 0.
So
sum([1,2,3]) = 1 + 2 + 3 + 0
Now what happens? Does Erlang create an array [1, 1+2, 1+2+3, 1+2+3+0]?
This is what happens:
sum([1,2,3]) = 1 + sum([2,3])
=> sum([2, 3]) = 2 + sum([3])
=> sum([3]) = 3 + sum([])
=> sum([]) = 0
Now sum([3]) can be evaluated:
sum([3]) = 3 + sum([]) = 3 + 0 = 3
which means that sum([2, 3]) can be evaluated:
sum([2, 3]) = 2 + sum([3]) = 2 + 3 = 5
which means that sum([1, 2, 3]) can be evaluated:
sum([1,2,3]) = 1 + sum([2,3]) = 1 + 5 = 6
Response to comment:
Okay, I figured what you were really asking about was immutable variables. Suppose you have the following C code:
int x = 0;
x += 1;
Does that code somehow demonstrate shared memory? If not, then C does not use shared memory for int variables...and neither does Erlang.
In C you introduce a variable, sum, give it an initial value, 0, and
after that you add values to it. Erlang does not do this. What does
Erlang do?
Erlang allocates a new frame on the stack for each recursive function call. Each frame stores the local variables and their values, e.g. the parameter variables, for that particular function call. There can be multiple frames on the stack each storing a variable named X, but they are separate variables, so none of the X variables is ever mutated--instead a new X variable is created for each new frame, and the new X is given a new value.
Now, if the stack really worked like that in Erlang, then a recursive function that executed millions of times would add millions of frames to the stack and in the process would probably use up its allocated memory and crash your program. To avoid using excessive amounts of memory, Erlang employs tail call optimization, which allows the amount of memory that a function uses to remain constant. Tail call optimization allows Erlang to replace the current frame on the stack with a subsequent frame of the same size, which keeps the memory usage constant. In addition, even when a function is not defined in a tail recursive format, like your sum() function, Erlang can optimize the code so that it uses constant memory (see The Seven Myths of Erlang Performance).
In your sum() function, no variables are mutated and no memory is shared. In effect, though, function parameter variables do act like mutable variables.
My first diagram above is a representation of the stack adding a new frame for each recursive function call. If you redefine sum() to be tail recursive, like this:
sum(List) ->
    sum(List, 0).

sum([H|T], Total) ->
    sum(T, Total+H);
sum([], Total) ->
    Total.
then below is a diagram of a recursive function executing that represents frames being replaced on the stack to keep the memory usage constant:
sum([1, 2, 3]) => sum([1, 2, 3], 0)   [H=1, T=[2,3], Total=0]
               => sum([2,3], 1)       [H=2, T=[3], Total=1]
               => sum([3], 3)         [H=3, T=[], Total=3]
               => sum([], 6)          [Total=6]
               => 6
You're making recursive calls. The scope of each function body is not terminated until it returns something, so the immutable variable H of every call is kept until the base case is reached.
It could be made tail recursive with the help of an accumulator in the function arguments, which is lighter on memory: the H part is calculated first, then the recursive successor is called with the accumulated value passed along.
So either way, nothing is used outside your function scopes.

Is the C++ AMP library useful from F#?

I'm experimenting with the C++ AMP library in F# as a way of using the GPU to do work in parallel. However, the results I'm getting don't seem intuitive.
In C++, I made a library with one function that squares all the numbers in an array, using AMP:
extern "C" __declspec ( dllexport ) void _stdcall square_array(double* arr, int n)
{
// Create a view over the data on the CPU
array_view<double,1> dataView(n, &arr[0]);
// Run code on the GPU
parallel_for_each(dataView.extent, [=] (index<1> idx) restrict(amp)
{
dataView[idx] = dataView[idx] * dataView[idx];
});
// Copy data from GPU to CPU
dataView.synchronize();
}
(Code adapted from Igor Ostrovsky's blog on MSDN.)
I then wrote the following F# to compare the Task Parallel Library (TPL) to AMP:
open System
open System.Diagnostics
open System.Runtime.InteropServices
open System.Threading.Tasks

// Print the time needed to run the given function
let time f =
    let s = new Stopwatch()
    s.Start()
    f ()
    s.Stop()
    printfn "elapsed: %d" s.ElapsedTicks
module CInterop =
    [<DllImport("CPlus", CallingConvention = CallingConvention.StdCall)>]
    extern void square_array(float[] array, int length)

let options = new ParallelOptions()
let size = 1000.0
let arr = [|1.0 .. size|]

// Square the number at the given index of the array
let sq i =
    do arr.[i] <- arr.[i] * arr.[i]
    ()

// Square every number in the array using TPL
time (fun () -> Parallel.For(0, arr.Length - 1, options, new Action<int>(sq)) |> ignore)

let arr2 = [|1.0 .. size|]
// Square every number in the array using AMP
time (fun () -> CInterop.square_array(arr2, arr2.Length))
If I set the array size to a trivial number like 10, it takes the TPL ~22K ticks to finish, and AMP ~10K ticks. That's what I expect. As I understand it, a GPU (hence AMP) should be better suited to this situation, where the work is broken into very small pieces, than the TPL.
However, if I increase the array size to 1000, the TPL now takes ~30K ticks and AMP takes ~70K ticks. And it just gets worse from there. For an array of size 1 million, AMP takes nearly 1000x as long as the TPL.
Since I expect the GPU (i.e. AMP) to be better at this kind of task, I'm wondering what I'm missing here.
My graphics card is a GeForce 550 Ti with 1GB, not a slouch as far as I know. I know there's overhead in using PInvoke to call into the AMP code, but I expect that to be a flat cost that is amortized over larger array sizes. I believe the array is passed by reference (though I could be wrong), so I don't expect any cost associated with copying that.
Thank you to everyone for your advice.
Transferring data back and forth between the GPU and the CPU takes time. You are most likely measuring your PCI Express bus bandwidth here. Squaring 1M floats is a piece of cake for a GPU.
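To put rough numbers on that (illustrative figures, not measurements from your setup): 1 million doubles is 8 MB, and the array_view crosses the bus twice (to the GPU and back), so about 16 MB moves per call. At an assumed effective PCIe bandwidth of around 6 GB/s, that is on the order of 2.7 ms of pure transfer time, while 1 million multiplications take a GPU of this class a small fraction of that. The transfer therefore dominates the measurement, which matches the trend you saw as the array grew.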
It's also not a good idea to use the Stopwatch class to measure performance for AMP, because GPU calls can happen asynchronously. In your case it is OK, but if you measured the compute part only (the parallel_for_each) this wouldn't work. I think you can use D3D11 performance counters for that.

CUDA kernels and memory access (one kernel doesn't execute entirely and the next doesn't get launched)

I'm having trouble here. I launch two kernels and check if some value is the one expected (via a memcpy to the host); if it is, I stop, and if it isn't, I launch the two kernels again.
The first kernel:
__global__ void aco_step(const KPDeviceData* data)
{
    int obj = threadIdx.x;
    int ant = blockIdx.x;
    int id = threadIdx.x + blockIdx.x * blockDim.x;

    *(data->added) = 1;

    while(*(data->added) == 1)
    {
        *(data->added) = 0;

        //check if obj fits
        int fits = (data->obj_weights[obj] + data->weight[ant] <= data->max_weight);
        fits = fits * !(getElement(data->selections, data->selections_pitch, ant, obj));

        if(obj == 0)
            printf("ant %d going..\n", ant);

        __syncthreads();
        ...
The code goes on after this, but that printf never gets printed; the __syncthreads is there just for debugging purposes.
The "added" variable was in shared memory, but since shared memory is a PITA and usually introduces bugs in the code, I just removed it for now. This "added" variable isn't the smartest thing to do, but it's faster than the alternative, which is checking on the host whether any variable within an array has some value and deciding whether to keep iterating.
getElement simply does the pitched-matrix address calculation to access the right position and returns the element there (wrapped here in a plausible device-function signature matching the call site):

__device__ int getElement(int* mat, size_t pitch, int row, int col)
{
    int* el = (int*)((char*)mat + row * pitch) + col;
    return *el;
}
The obj_weights array has the right size, n*sizeof(int). So does the weight array, ants*sizeof(float). So they aren't out of bounds.
The kernel after this one has a printf right at the beginning, and it doesn't get printed either. After the printf, it sets a variable in device memory, and this memory is copied to the CPU after the kernel finishes, but it isn't the right value when I print it in the CPU code. So I think this kernel is doing something illegal and the second one doesn't even get launched.
I'm testing some instances: when I launch 8 blocks and 512 threads, it runs OK; 32 blocks and 512 threads, OK. But with 8 blocks and 1024 threads this happens: the kernel doesn't work, and neither does 32 blocks and 1024 threads.
Am I doing something wrong? Memory access? Am I launching too many threads?
edit: I tried removing the "added" variable and the while loop, so it should execute just once. It still doesn't work; nothing gets printed, even when the printf is right after the three initial lines, and the next kernel also doesn't print anything.
edit: another thing: I'm using a GTX 570, so the "Maximum number of threads per block" is 1024 according to http://en.wikipedia.org/wiki/CUDA. Maybe I'll just stick with a maximum of 512, or check how high I can set this value.
__syncthreads() inside conditional code is only allowed if the condition evaluates identically on all threads of a block.
In your case the condition suffers from a race condition and is nondeterministic, so it most probably evaluates differently on different threads.
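As a minimal sketch of that rule (illustrative code, not from the post):

__global__ void sync_rules(const int* flags, int* out)
{
    int x = 0;
    // Divergent work is fine as long as the barrier itself is reached
    // by every thread of the block:
    if (threadIdx.x == 0)
        x = flags[blockIdx.x];
    __syncthreads(); // SAFE: unconditional, all threads arrive here

    // By contrast, a barrier inside divergent control flow is invalid:
    // if (threadIdx.x < 16)
    //     __syncthreads(); // UNSAFE: hangs or undefined behaviour

    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}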
printf() output is only displayed after the kernel finishes successfully. In this case it doesn't, due to the problem mentioned above, so the output never shows up. You could have figured this out by testing the return codes of all CUDA function calls for errors.
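For reference, a common error-checking pattern looks something like this (a sketch; the macro name is my own, but the runtime calls are standard CUDA):

#include <cstdio>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call so failures are reported immediately.
#define CUDA_CHECK(call)                                          \
    do {                                                          \
        cudaError_t err = (call);                                 \
        if (err != cudaSuccess) {                                 \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",          \
                    __FILE__, __LINE__, cudaGetErrorString(err)); \
        }                                                         \
    } while (0)

// After a kernel launch, check both the launch and its completion:
//   my_kernel<<<blocks, threads>>>(args);
//   CUDA_CHECK(cudaGetLastError());      // launch/configuration errors
//   CUDA_CHECK(cudaDeviceSynchronize()); // errors during execution

This would have flagged, for example, a launch that fails because the requested block size exceeds what the kernel's resource usage allows.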
