Memory allocation when solving linear systems in Julia

I'm using Julia 1.5.0. Consider the following code:
using LinearAlgebra
using Distributions
using BenchmarkTools

function solve_b!(A, tol_iters)
    b = [1.0 2.0]'
    luA = lu!(A)
    x = [0.0; 0.0]
    for i = 1:tol_iters
        A[1,1] += 0.001
        A[2,2] += 0.001
        luA = lu!(A)
        ldiv!(x, luA, b)
    end
end

A = rand(2,2)
solve_b!(A, 1000)
If I run this with julia --track-allocation=user, I see that most of the memory allocation comes from b = [1.0 2.0]' and x = [0.0; 0.0]. That is, when I open the .mem file, I see the following:
96 b = [1.0 2.0]'
0 luA = lu!(A)
96 x = [0.0; 0.0]
The memory allocation increases as I increase tol_iters.
Can someone explain why? I'm using lu! and ldiv!, so I would expect the update to be in-place. Therefore there should not be any additional memory allocation associated with the number of iterations.


Memory handling with dask-cuda on a Windows machine

I am currently exploring how to handle memory in dask-cuda in order to write a function that will interpolate values along lines that cross an image.
My machine is a very basic Windows 10 laptop with a single GPU (GeForce GTX 1050, 4 GB memory) and 16 GB of RAM. I am using the following packages:
cupy 10.2.0
cudatoolkit 11.6.0
dask 2022.1.0
dask-cuda 22.2
A minimal version of my code is as follows:
import cupy as cp
import numpy as np
import dask.array as da
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from dask import compute
from math import pi
def bilinear_interpolate(im, x, y):
    x0 = np.floor(x).astype(np.int32)
    x1 = x0 + 1
    y0 = np.floor(y).astype(np.int32)
    y1 = y0 + 1
    # keep coordinates within bounds
    x0 = np.clip(x0, 0, im.shape[1]-1)
    x1 = np.clip(x1, 0, im.shape[1]-1)
    y0 = np.clip(y0, 0, im.shape[0]-1)
    y1 = np.clip(y1, 0, im.shape[0]-1)
    # retrieve values
    Ia = im[ y0, x0 ]
    Ib = im[ y1, x0 ]
    Ic = im[ y0, x1 ]
    Id = im[ y1, x1 ]
    # calculate weights
    wa = (x1-x) * (y1-y)
    wb = (x1-x) * (y-y0)
    wc = (x-x0) * (y1-y)
    wd = (x-x0) * (y-y0)
    return (wa*Ia + wb*Ib + wc*Ic + wd*Id).astype(np.float32)
# image used 5000x5000
im = cp.asarray(im)
# Set cuda-dask Client
cluster = LocalCUDACluster(
    device_memory_limit = 0.4,
    jit_unspill = True,
)
client = Client(cluster)
# location
loc = im.shape[0] // 2, im.shape[1] //2
# radius
radius = [1,3000]
# pts per line
pts_line = radius[1]-radius[0]
# lines per degree factor
factor = 200
# radial chunks
rchunks = 180
# generate lines
r = cp.linspace(radius[0], radius[1], pts_line , dtype=np.int32)
psi = cp.linspace(0, 2*pi, 360*factor, dtype=np.float32)
R, PSI = cp.meshgrid(r, psi, sparse= True)
dR = da.from_array(R, chunks=(-1,pts_line), asarray=False).astype(np.float32)
dPSI = da.from_array(PSI, chunks=(rchunks,-1), asarray=False).astype(np.float32)
# Use polar coordinate to generate lines
rr = loc[0] + dR*np.cos(dPSI)
cc = loc[1] + dR* np.sin(dPSI)
rslt= da.map_blocks(bilinear_interpolate, im, cc, rr)
z = cp.asnumpy(rslt.compute().T)
I have managed, through trial and error, to interpolate over 90k lines (each with 3000 points) across a single-band image of 5000x5000 pixels. However, as mentioned above, I am trying to write this function so that it will compute regardless of the image size and the number of lines to interpolate.
Much of the information I have found does not cover Windows machines (e.g. it appears that one cannot set rmm_pool_limit in a LocalCUDACluster, as the RAPIDS rmm package only works on Linux). I am also familiar, thanks to the video by Mads Kristensen, with the limitations of the device_memory_limit parameter (i.e., it being a soft target) and with the use of jit_unspill = True as a way to minimize GPU memory spikes. Yet, despite keeping chunk sizes far below the GPU memory I have available, I am still running into out-of-memory errors.
What is the best way to determine memory requirements? I had imagined that chunk size was the most crucial aspect (I thought that, regardless of the original size of the array, I would be okay as long as the chunk size was far below the GPU memory limit).
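One rough way to approach the question is to count bytes: how much one chunk needs while bilinear_interpolate runs on it, and how much the whole result needs once it is gathered. The sketch below is back-of-the-envelope arithmetic in plain Python, not a measurement. The helper name array_bytes, the count of 13 same-shaped temporaries, and the 4-byte element size for the image are assumptions, and it ignores CuPy's memory pool, fragmentation, and whatever else is using the GPU.

```python
from math import prod

def array_bytes(shape, itemsize=4):
    """Bytes needed for one dense array of the given shape (hypothetical helper)."""
    return prod(shape) * itemsize

n_lines = 360 * 200      # number of psi values (360 * factor)
pts_line = 3000 - 1      # radius[1] - radius[0]
rchunks = 180            # rows per chunk

chunk = array_bytes((rchunks, pts_line))

# Assumed count: x0, x1, y0, y1, Ia..Id, wa..wd and the returned array,
# all with the shape of one chunk and all alive at the same time.
temporaries = 13
per_task = chunk * temporaries

# The complete interpolated result, if it ends up as one array on the device.
full_result = array_bytes((n_lines, pts_line))

# The image itself, assuming 4-byte pixels (the post does not state the dtype).
image = array_bytes((5000, 5000))

mib = 1024 ** 2
for label, nbytes in [("one chunk", chunk),
                      ("one task (chunk + temporaries)", per_task),
                      ("image", image),
                      ("full result", full_result)]:
    print(f"{label:32s}{nbytes / mib:10.1f} MiB")
```

Under these assumptions a single task is small, while the full result is hundreds of times larger than one chunk. That suggests, without proving it, that the place to look first is what is held at once when the result is gathered (plus any copy made while concatenating), rather than the chunk size alone.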

Minimizing memory usage in Julia function

This function is a workhorse which I want to optimize. Any idea on how its memory usage can be limited would be great.
function F(len, rNo, n, ratio = 0.5)
    s = zeros(len); m = copy(s); d = copy(s);
    rNo ≤ len-1 && (m[rNo + 1] = s[rNo+1] = -n[rNo])
    rNo > 1 && (m[rNo - 1] = s[rNo-1] = n[rNo-1])
    r = 1  # assumed starting value; the post does not show where r is initialised
    while true
        for i ∈ 2:len-1
            d[i] = (n[i]*m[i+1] - n[i-1]*m[i-1])/(r+1)
        end
        d[1] = n[1]*m[2]/(r+1);
        d[len] = -n[len-1]*m[len-1]/(r+1);
        for i ∈ 1:len
            s[i] += d[i]
        end
        sum(abs.(d))/sum(abs.(m)) < ratio && break #converged
        m = copy(d); r+=1
    end
    return reshape(s, 1, :)
end
It calculates rows of a special matrix exponential, which I stack later.
Although the full method is considerably faster than the built-in exp thanks to the special properties of the matrix, it takes up far more memory, as measured by @time.
Since I am a noob in memory management and also in Julia, I am sure it can be optimized quite a bit.
Am I doing something obviously wrong?
I think most of your allocations come from sum(abs.(d))/sum(abs.(m)) < ratio && break #converged, because abs.(d) and abs.(m) each allocate a temporary array. If you replace it with sum(abs, d)/sum(abs, m) < ratio && break #converged, those allocations should go away (and it will be faster too).
Your other allocations can be removed by replacing m = copy(d) with m .= d, which does an element-wise copy into the existing array.
There are also a couple of style changes that I think would make this function nicer to read and use. My changes would be as follows:
function F(rNo, v, ratio = 0.5)
    len = length(v)
    s = zeros(len+1); m = copy(s); d = copy(s);
    rNo ≤ len && (m[rNo + 1] = s[rNo+1] = -v[rNo])
    rNo > 1 && (m[rNo - 1] = s[rNo-1] = v[rNo-1])
    r = 1  # assumed starting value; the post does not show where r is initialised
    while true
        for i ∈ 2:len
            d[i] = (v[i]*m[i+1] - v[i-1]*m[i-1]) / (r+1)
        end
        d[1] = v[1]*m[2]/(r+1);
        d[end] = -v[end]*m[end-1]/(r+1);  # m[end-1] matches m[len-1] in the original
        s .+= d
        sum(abs, d)/sum(abs, m) < ratio && break #converged
        m .= d; r+=1
    end
    return reshape(s, 1, :)
end
The most notable change is removing len from the arguments. Passing an array length as an argument is common in C (and probably other languages) where finding the length of an array is hard, but in Julia length is cheap (O(1)), and extra arguments are just more clutter and confusion for the people using the function. I also made use of the fact that Julia turns s[end] into s[length(s)] to make this a little cleaner. Also, in general, when using Julia you should look for ways to use dotted operations rather than writing for loops. The for loops will be fast, but why take 3 lines to do what you could do in 1 shorter line? (I also renamed n to v, since to me n is a number and v is a vector, but that is pure preference.)
I hope this helps.

Need a vectorized solution in pytorch

I'm doing an experiment using face images in PyTorch framework. The input x is the given face image of size 5 * 5 (height * width) and there are 192 channels.
Objective: To obtain patches of x of patch_size(given as argument).
I have obtained the required result with the help of two for loops. But I want a better-vectorized solution so that the computation cost will be very less than using two for loops.
Used: PyTorch 0.4.1, (12 GB) Nvidia TitanX GPU.
The following is my implementation using two for loops
import torch

def extractpatches(x, patch_size): # x is bsx192x5x5
    patches = x.unfold(2, patch_size, 1).unfold(3, patch_size, 1)
    bs, c, pi, pj, _, _ = patches.size() #bs,192,
    cnt = 0
    # allocate on the same device as x (the original used an undefined global `device`)
    p = torch.empty((bs, pi*pj, c, patch_size, patch_size), device=x.device)
    s = torch.empty((bs, pi*pj, c*patch_size*patch_size), device=x.device)
    # Want a vectorized method instead of two for loops below
    for i in range(pi):
        for j in range(pj):
            p[:,cnt,:,:,:] = patches[:,:,i,j,:,:]
            s[:,cnt,:] = p[:,cnt,:,:,:].view(-1, c*patch_size*patch_size)
            cnt = cnt+1
    return s
Thanks for your help in advance.
You can try the following. I reused parts of your code for my experiment and it worked for me. Here l and f are lists of tensor patches:
l = [patches[:,:,i // pj, i % pj,:,:] for i in range(pi * pj)]
f = [l[i].contiguous().view(-1,c*patch_size*patch_size) for i in range(pi * pj)]
You can verify the above code using toy input values.
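The list comprehensions still loop in Python. One way to drop the loops altogether is to reorder the dimensions of the unfolded tensor and flatten it in a single step. This is a sketch under the shapes stated in the question; the function names extract_patches_vectorized and extract_patches_loop are made up for illustration, and the loop version is included only to check that both give the same result.

```python
import torch

def extract_patches_vectorized(x, patch_size):
    # x: (bs, c, H, W) -> patches: (bs, c, pi, pj, patch_size, patch_size)
    patches = x.unfold(2, patch_size, 1).unfold(3, patch_size, 1)
    bs, c, pi, pj, _, _ = patches.size()
    # Put the patch-grid dims (pi, pj) ahead of the channel dim, then flatten
    # (pi, pj) into pi*pj and (c, patch_size, patch_size) into one feature dim.
    return (patches.permute(0, 2, 3, 1, 4, 5)
                   .contiguous()
                   .view(bs, pi * pj, c * patch_size * patch_size))

def extract_patches_loop(x, patch_size):
    # Reference: same ordering as the two for loops in the question.
    patches = x.unfold(2, patch_size, 1).unfold(3, patch_size, 1)
    bs, c, pi, pj, _, _ = patches.size()
    s = torch.empty((bs, pi * pj, c * patch_size * patch_size), device=x.device)
    cnt = 0
    for i in range(pi):
        for j in range(pj):
            s[:, cnt, :] = patches[:, :, i, j, :, :].contiguous().view(bs, -1)
            cnt += 1
    return s

x = torch.randn(2, 192, 5, 5)          # toy input: batch of 2
out = extract_patches_vectorized(x, 3)
print(out.shape)
print(torch.equal(out, extract_patches_loop(x, 3)))
```

The permute only changes strides; the actual copy happens once in contiguous(), instead of once per patch position.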

Julia - Preallocating for sparse matrices

I was reading about preallocation from Performance Tips and it has this example:
function xinc!(ret::AbstractVector{T}, x::T) where T
    ret[1] = x
    ret[2] = x+1
    ret[3] = x+2
    nothing
end

function loopinc_prealloc()
    ret = Array{Int}(undef, 3)  # written as Array{Int}(3) before Julia 0.7
    y = 0
    for i = 1:10^7
        xinc!(ret, i)
        y += ret[2]
    end
    y
end
I see that the example is trying to change ret which is preallocated. However, when I tried the following:
using SparseArrays  # needed on Julia 0.7 and later

function addSparse!(sp1, sp2)
    sp1 = 2*sp2
end

function loopinc_prealloc()
    sp1 = spzeros(3, 3)
    y = 0
    for i = 1:10^7
        sp2 = sparse([1, 2], [1, 2], [2 * i, 2 * i], 3, 3)
        addSparse!(sp1, sp2)
        y += sp1[1,1]
    end
    y
end
I don't think sp1 is updated by addSparse!. In the example from the Julia docs, the function xinc! modifies ret element by element. How can I do the same to a sparse matrix?
In my actual code, I need to update a big sparse matrix in a loop, so preallocating makes sense for me in order to save memory.
The issue is not that the matrix is sparse. The issue is that when you use the assignment operator =, you bind the name sp1 to a new object (with value 2*sp2) rather than updating the matrix that sp1 referred to. Consider the example from the performance tips: ret[1] = x does not reassign ret, it just modifies its elements.
Use the .= operator instead to overwrite all the elements of a container, for example sp1 .= 2 .* sp2.

Theano gradient doesn't work with .sum(), only .mean()?

I'm trying to learn Theano and decided to implement linear regression (using their Logistic Regression from the tutorial as a template). I'm getting a weird result where T.grad doesn't work if my cost function uses .sum(), but does work if my cost function uses .mean(). Code snippet:
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x, w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.sum()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
    pred, err = train(D[0], D[1])
With .mean() as the cost, the same code works:
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x, w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.mean()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
    pred, err = train(D[0], D[1])
The only difference is in the cost = single_error.sum() vs single_error.mean(). What I don't understand is that the gradient should be the exact same in both cases (one is just a scaled version of the other). So what gives?
The learning rate (0.1) is way too big. Using mean divides the gradient by the batch size, which helps, but I'm pretty sure you should make the learning rate much smaller still, not just divide it by the batch size (which is what using mean amounts to).
Try a learning rate of 0.001.
Try dividing your gradient descent step size by the number of training examples.
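To see why the two costs behave differently at the same learning rate: the gradient of the summed cost is N times the gradient of the mean cost, so a step size of 0.1 on the sum acts like a step size of 0.1*N on the mean. The sketch below shows this in plain Python (no Theano) on a made-up one-feature dataset. The data, N = 100, and the helper names grads and train_gd are illustrative assumptions, not taken from the question.

```python
import random

random.seed(0)
N = 100
xs = [random.uniform(-1.0, 1.0) for _ in range(N)]
ys = [3.0 * x + 0.5 for x in xs]          # made-up targets: y = 3x + 0.5

def grads(w, b, reduce):
    """Gradient of 0.5*(w*x + b - y)**2, reduced over the data by 'sum' or 'mean'."""
    gw = sum((w * x + b - y) * x for x, y in zip(xs, ys))
    gb = sum((w * x + b - y) for x, y in zip(xs, ys))
    scale = 1.0 if reduce == "sum" else 1.0 / N
    return gw * scale, gb * scale

def train_gd(reduce, lr, steps=50):
    """Plain gradient descent from w = b = 0."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw, gb = grads(w, b, reduce)
        w, b = w - lr * gw, b - lr * gb
    return w, b

gw_sum, _ = grads(0.0, 0.0, "sum")
gw_mean, _ = grads(0.0, 0.0, "mean")
print(gw_sum / gw_mean)                   # ratio of the two gradients: N, up to rounding

print(train_gd("mean", 0.1))              # heads toward (3.0, 0.5)
print(train_gd("sum", 0.1))               # same step size on the summed cost: diverges
print(train_gd("sum", 0.1 / N))           # step size divided by N: matches the 'mean' run
```

So the two gradients do point in the same direction, as the question expects; only the effective step size differs, and with the summed cost it is large enough here for the updates to overshoot further on every step.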
