Memory handling with dask-cuda on a Windows machine - dask

I am currently exploring how to handle memory in dask-cuda in order to write a function that interpolates values along lines crossing an image.
My machine is a very basic Windows 10 laptop with a single GPU (GeForce GTX 1050, 4 GB memory) and 16 GB of RAM. I am using the following packages:
cupy 10.2.0
cudatoolkit 11.6.0
dask 2022.1.0
dask-cuda 22.2
A minimal version of my code is as follows:
import cupy as cp
import numpy as np
import dask.array as da
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from dask import compute
from math import pi

def bilinear_interpolate(im, x, y):
    x0 = np.floor(x).astype(np.int32)
    x1 = x0 + 1
    y0 = np.floor(y).astype(np.int32)
    y1 = y0 + 1
    # keep coordinates within bounds
    x0 = np.clip(x0, 0, im.shape[1] - 1)
    x1 = np.clip(x1, 0, im.shape[1] - 1)
    y0 = np.clip(y0, 0, im.shape[0] - 1)
    y1 = np.clip(y1, 0, im.shape[0] - 1)
    # retrieve values
    Ia = im[y0, x0]
    Ib = im[y1, x0]
    Ic = im[y0, x1]
    Id = im[y1, x1]
    # calculate weights
    wa = (x1 - x) * (y1 - y)
    wb = (x1 - x) * (y - y0)
    wc = (x - x0) * (y1 - y)
    wd = (x - x0) * (y - y0)
    return (wa * Ia + wb * Ib + wc * Ic + wd * Id).astype(np.float32)

# image used 5000x5000
im = cp.asarray(im)

# Set cuda-dask Client
cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES="0",
    device_memory_limit=0.4,
    jit_unspill=True,
)
client = Client(cluster)

# location
loc = im.shape[0] // 2, im.shape[1] // 2
# radius
radius = [1, 3000]
# pts per line
pts_line = radius[1] - radius[0]
# lines per degree factor
factor = 200
# radial chunks
rchunks = 180

# generate lines
r = cp.linspace(radius[0], radius[1], pts_line, dtype=np.int32)
psi = cp.linspace(0, 2 * pi, 360 * factor, dtype=np.float32)
R, PSI = cp.meshgrid(r, psi, sparse=True)
dR = da.from_array(R, chunks=(-1, pts_line), asarray=False).astype(np.float32)
dPSI = da.from_array(PSI, chunks=(rchunks, -1), asarray=False).astype(np.float32)

# Use polar coordinates to generate lines
rr = loc[0] + dR * np.cos(dPSI)
cc = loc[1] + dR * np.sin(dPSI)

rslt = da.map_blocks(bilinear_interpolate, im, cc, rr)
z = cp.asnumpy(rslt.compute().T)
I have managed, through trial and error, to interpolate over 90k lines (each with 3000 points) across a single-band image of 5000x5000 pixels. However, as mentioned above, I am trying to write this function so that it will compute regardless of the image size and the number of lines to interpolate.
Much of the information I have found does not refer to Windows machines (e.g. it appears that one cannot set rmm_pool_limit in a LocalCUDACluster because the RAPIDS rmm package only works on Linux). I am also familiar, thanks to the video by Mads Kristensen, with the limitations of the device_memory_limit parameter (i.e., it is a soft target) and with the use of the jit_unspill=True parameter as a way to minimize GPU memory spikes, etc. Yet, despite keeping chunk sizes well below the available GPU memory, I am still running into out-of-memory errors.
What is the best way to determine memory requirements? I had imagined that chunk size was the most crucial aspect (I thought that, regardless of the original size of the array, I would be okay provided the chunk size stayed well below the GPU memory limit).
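For reference, this is the rough back-of-the-envelope estimate I have been working from (my own sketch, ignoring any dask/CuPy overheads): each map_blocks task needs its cc and rr input chunks, the full im array, one output chunk, and roughly a dozen chunk-sized temporaries created inside bilinear_interpolate.
import numpy as np

def rough_task_footprint_mb(chunk_shape, image_shape,
                            n_chunk_arrays=3, n_temporaries=12,
                            dtype=np.float32):
    # chunk_shape: shape of one cc/rr/output chunk
    # n_chunk_arrays: cc chunk + rr chunk + output chunk
    # n_temporaries: intermediates created inside bilinear_interpolate
    itemsize = np.dtype(dtype).itemsize
    chunk_bytes = np.prod(chunk_shape) * itemsize
    image_bytes = np.prod(image_shape) * itemsize
    return (chunk_bytes * (n_chunk_arrays + n_temporaries) + image_bytes) / 1e6

# chunks broadcast to (180, 2999) in the code above, image is 5000x5000
print(rough_task_footprint_mb((180, 2999), (5000, 5000)))  # roughly 130 MB per task
Is this the right way to think about it, or are there other per-task costs I should be budgeting for?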

Related

Memory allocation and solve linear systems in Julia

I'm using Julia 1.5.0. Consider the following code:
using LinearAlgebra
using Distributions
using BenchmarkTools

function solve_b!(A, tol_iters)
    b = [1.0 2.0]'
    luA = lu!(A)
    x = [0.0; 0.0]
    for i = 1:tol_iters
        A[1,1] += 0.001
        A[2,2] += 0.001
        luA = lu!(A)
        ldiv!(x, luA, b)
    end
end

A = rand(2,2)
solve_b!(A, 1000)
If I run this with julia --track-allocation=user, I see that most of the memory allocation comes from b = [1.0 2.0]' and x = [0.0; 0.0]. That is, when I look at the .mem file, I see the following:
96 b = [1.0 2.0]'
0 luA = lu!(A)
96 x = [0.0; 0.0]
The memory allocation increases as I increase tol_iters.
Can someone explain why? I'm using lu! and ldiv!, so I would expect the update to be in-place. Therefore there should not be any additional memory allocation associated with the number of iterations.

How to publish odom (nav_msgs/Odometry) from MD49 encoders outputs?

I am using MD49 Motor Drive with its motors
https://www.robot-electronics.co.uk/htm/md49tech.htm
http://wiki.ros.org/md49_base_controller
How can I subscribe to (encoder_l and encoder_r) from the md49_base_controller package and publish (vx, vy, and vth) as an odom (nav_msgs/Odometry) message?
There are two problems:
1- The first is that the encoder outputs are not correct, so the package needs to be modified.
2- The second is that I want to create a package that subscribes to the right and left wheel encoder counts (encoder_l and encoder_r) and publishes (vx, vy, and vth) as an odom (nav_msgs/Odometry) message, to be compatible with the MPU9250 IMU
http://wiki.ros.org/robot_pose_ekf
The proposed package is:
1- We have to convert (encoder_l and encoder_r) to (RPM_l and RPM_r) as follows:
if (speed_l>128){newposition1 = encoder_l;}
else if (speed_l<128){ newposition1 = 0xFFFFFFFF-encoder_l;}
else if (speed_l==128) {newposition1=0;}
newtime1 = millis();
RPM_l = ((newposition1-oldposition1)*1000*60)/((newtime1-oldtime1)*980);
oldposition1 = newposition1;
oldtime1 = newtime1;
delay(250);
if (speed_r>128){ newposition2 = encoder_r;}
else if (speed_r<128){ newposition2 = 0xFFFFFFFF-encoder_r;}
else if (speed_r==128) { newposition2=0;}
newtime2 = millis();
RPM_r = ((newposition2-oldposition2)*1000*60)/((newtime2-oldtime2)*980);
oldposition2 = newposition2;
oldtime2= newtime2;
delay(250);
2- We have to convert (RPM_l and RPM_r) to (vx, vy, and vth) as follows (a minimal Python sketch of this step is shown after the node code below):
vx=(r/2)*RPM_l*math.cos(th)+(r/2)*RPM_r*math.cos(th);
vy=(r/2)*RPM_l*math.sin(th)+(r/2)*RPM_r*math.sin(th);
vth=(r/B)*omega_l-(r/B)*omega_r;
Hint: r and B are the wheel radius and vehicle width, respectively.
3- The odom (nav_msgs/Odometry) package is:
#!/usr/bin/env python
import math
from math import sin, cos, pi

import rospy
import tf
from nav_msgs.msg import Odometry
from geometry_msgs.msg import Point, Pose, Quaternion, Twist, Vector3
from md49_messages.msg import md49_encoders

rospy.init_node('odometry_publisher')

odom_pub = rospy.Publisher("odom", Odometry, queue_size=50)
odom_broadcaster = tf.TransformBroadcaster()

x = 0.0
y = 0.0
th = 0.0

vx = 0.1
vy = -0.1
vth = 0.1

current_time = rospy.Time.now()
last_time = rospy.Time.now()

r = rospy.Rate(1.0)
while not rospy.is_shutdown():
    current_time = rospy.Time.now()

    # compute odometry in a typical way given the velocities of the robot
    dt = (current_time - last_time).to_sec()
    delta_x = (vx * cos(th) - vy * sin(th)) * dt
    delta_y = (vx * sin(th) + vy * cos(th)) * dt
    delta_th = vth * dt

    x += delta_x
    y += delta_y
    th += delta_th

    # since all odometry is 6DOF we'll need a quaternion created from yaw
    odom_quat = tf.transformations.quaternion_from_euler(0, 0, th)

    # first, we'll publish the transform over tf
    odom_broadcaster.sendTransform(
        (x, y, 0.),
        odom_quat,
        current_time,
        "base_link",
        "odom"
    )

    # next, we'll publish the odometry message over ROS
    odom = Odometry()
    odom.header.stamp = current_time
    odom.header.frame_id = "odom"

    # set the position
    odom.pose.pose = Pose(Point(x, y, 0.), Quaternion(*odom_quat))

    # set the velocity
    odom.child_frame_id = "base_link"
    odom.twist.twist = Twist(Vector3(vx, vy, 0), Vector3(0, 0, vth))

    # publish the message
    odom_pub.publish(odom)

    last_time = current_time
    r.sleep()
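For step 2, here is a minimal Python sketch of the RPM-to-velocity conversion (illustrative only; the function name rpm_to_twist is mine, and I convert RPM to rad/s first so the velocities come out in m/s):
from math import sin, cos, pi

def rpm_to_twist(rpm_l, rpm_r, th, r, B):
    # r = wheel radius, B = vehicle width, th = current heading (rad)
    omega_l = rpm_l * 2.0 * pi / 60.0   # wheel angular speed in rad/s
    omega_r = rpm_r * 2.0 * pi / 60.0
    vx = (r / 2.0) * (omega_l + omega_r) * cos(th)
    vy = (r / 2.0) * (omega_l + omega_r) * sin(th)
    vth = (r / B) * (omega_l - omega_r)  # same sign convention as the formula above
    return vx, vy, vth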
The problem was in the serial communication setup, not in the code.
Setting up UART on the Raspberry Pi 3 GPIO
For some strange reason the default for the Pi 3 using the latest 4.4.9 kernel is to DISABLE UART. To enable it you need to set enable_uart=1 in /boot/config.txt. (It is no longer necessary to add core_freq=250 to fix the core frequency to get a stable baud rate.)
If you don't use Bluetooth (or have undemanding uses) it is possible to swap the ports back in the Device Tree. There are pi3-miniuart-bt and pi3-disable-bt overlays, which are described in /boot/overlays/README.
As mentioned, by default the new GPIO UART is not enabled, so the first thing to do is to edit the config.txt file to enable the serial UART:
sudo nano /boot/config.txt
and add the line (at the bottom):
enable_uart=1
You have to reboot for the changes to take effect.
To check where your serial ports are pointing you can use the list command with the long “-l” listing format:
ls -l /dev
In the long listing you will find:
serial0 -> ttyS0
serial1 -> ttyAMA0
Thus, on a Raspberry Pi 3 and Raspberry Pi Zero W, serial0 will point to GPIO J8 pins 8 and 10 and use /dev/ttyS0. On other Raspberry Pis it will point to /dev/ttyAMA0.
So, where possible, refer to the serial port via its alias of "serial0" and your code should work on both the Raspberry Pi 3 and other Raspberry Pis.
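For example, from Python with pyserial (just an illustrative sketch; the baud rate is a placeholder for whatever the MD49 is configured to use):
import serial  # pyserial

# open the UART via its stable serial0 alias instead of hard-coding
# ttyS0 or ttyAMA0; baud rate is a placeholder for your MD49 setup
with serial.Serial('/dev/serial0', baudrate=38400, timeout=0.1) as port:
    print(port.name)  # the resolved device
    # port.write(...) / port.read(...) with the MD49 command bytes go here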
You also need to remove the console from the cmdline.txt. If you edit this with:
sudo nano /boot/cmdline.txt
you will see something like this:
dwc_otg.lpm_enable=0 console=serial0,115200 console=tty1 root=/dev/mmcblk0p2 rootfstype=ext4 elevator=deadline fsck.repair=yes rootwait
Remove the part console=serial0,115200, then save and reboot for the changes to take effect:
reboot
First off, you need to import nav_msgs/Odometry as follows:
from nav_msgs.msg import Odometry
You must have a function that performs those conversions, and then register it as the callback in rospy.Subscriber, like this:
def example(data):
    data.vx = <conversion>
    data.vth = <conversion>

def listener():
    rospy.Subscriber('*topic*', Odometry, example)
    rospy.spin()

if __name__ == '__main__':
    listener()
I think this would work

Theano gradient doesn't work with .sum(), only .mean()?

I'm trying to learn Theano and decided to implement linear regression (using their Logistic Regression example from the tutorial as a template). I'm getting a weird thing where T.grad doesn't work if my cost function uses .sum(), but does work if it uses .mean(). Code snippet:
(THIS DOESN'T WORK, RESULTS IN A W VECTOR FULL OF NANs):
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x,w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.sum()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
    pred, err = train(D[0], D[1])
(THIS DOES WORK, PERFECTLY):
x = T.matrix('x')
y = T.vector('y')
w = theano.shared(rng.randn(feats), name='w')
b = theano.shared(0., name="b")
# now we do the actual expressions
h = T.dot(x,w) + b # prediction is dot product plus bias
single_error = .5 * ((h - y)**2)
cost = single_error.mean()
gw, gb = T.grad(cost, [w,b])
train = theano.function(inputs=[x,y], outputs=[h, single_error], updates = ((w, w - .1*gw), (b, b - .1*gb)))
predict = theano.function(inputs=[x], outputs=h)
for i in range(training_steps):
    pred, err = train(D[0], D[1])
The only difference is in the cost = single_error.sum() vs single_error.mean(). What I don't understand is that the gradient should be the exact same in both cases (one is just a scaled version of the other). So what gives?
The learning rate (0.1) is way too big. Using mean divides the cost by the batch size, which helps, but I'm pretty sure you should make the learning rate much smaller than that, not just divide it by the batch size (which is what using mean is equivalent to).
Try a learning rate of 0.001.
Try dividing your gradient descent step size by the number of training examples.
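To see the scale factor concretely, here is a small plain-numpy sketch (not Theano, purely illustrative): the gradient of the summed cost is exactly batch_size times the gradient of the mean cost, so .sum() effectively multiplies the 0.1 learning rate by the number of training examples.
import numpy as np

rng = np.random.default_rng(0)
N, feats = 400, 3
x = rng.standard_normal((N, feats))
y = rng.standard_normal(N)
w = rng.standard_normal(feats)
b = 0.0

h = x.dot(w) + b
grad_w_mean = x.T.dot(h - y) / N   # gradient of 0.5*mean((h - y)**2) w.r.t. w
grad_w_sum = x.T.dot(h - y)        # gradient of 0.5*sum((h - y)**2) w.r.t. w

print(np.allclose(grad_w_sum, N * grad_w_mean))  # True: the sum-gradient is N times larger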

Generating a spectrogram for a sequence of 2D movie frames

I have some data that consists of a sequence of video frames which represent changes in luminance over time relative to a moving baseline. In these videos there are two kinds of 'event' that can occur: 'localised' events, which consist of luminance changes in small groups of clustered pixels, and contaminating 'diffuse' events, which affect most of the pixels in the frame.
I'd like to be able to isolate local changes in luminance from diffuse events. I'm planning on doing this by subtracting an appropriately low-pass filtered version of each frame. In order to design an optimal filter, I'd like to know which spatial frequencies of my frames are modulated during diffuse and local events, i.e. I'd like to generate a spectrogram of my movie over time.
I can find lots of information about generating spectrograms for 1D data (e.g. audio), but I haven't come across much on generating spectrograms for 2D data. What I've tried so far is to generate a 2D power spectrum from the Fourier transform of the frame, then perform a polar transformation about the DC component and average across angles to get a 1D power spectrum.
I then apply this to every frame of my movie and generate a raster plot of spectral power over time.
Does this seem like a sensible approach to take? Is there a more 'standard' approach to doing spectral analysis on 2D data?
Here's my code:
import numpy as np
# from pyfftw.interfaces.scipy_fftpack import fft2, fftshift, fftfreq
from scipy.fftpack import fft2, fftshift, fftfreq
from matplotlib import pyplot as pp
from matplotlib.colors import LogNorm
from scipy.signal import windows
from scipy.ndimage.interpolation import map_coordinates


def compute_2d_psd(img, doplot=True, winfun=windows.hamming, winfunargs={}):

    nr, nc = img.shape
    win = make2DWindow((nr, nc), winfun, **winfunargs)

    f2 = fftshift(fft2(img*win))
    psd = np.abs(f2*f2)
    pol_psd = polar_transform(psd, centre=(nr//2, nc//2))

    mpow = np.nanmean(pol_psd, 0)
    stdpow = np.nanstd(pol_psd, 0)

    freq_r = fftshift(fftfreq(nr))
    freq_c = fftshift(fftfreq(nc))
    pos_freq = np.linspace(0, np.hypot(freq_r[-1], freq_c[-1]),
                           pol_psd.shape[1])

    if doplot:
        fig, ax = pp.subplots(2, 2)

        im0 = ax[0,0].imshow(img*win, cmap=pp.cm.gray)
        ax[0,0].set_axis_off()
        ax[0,0].set_title('Windowed image')

        lnorm = LogNorm(vmin=psd.min(), vmax=psd.max())
        ax[0,1].set_axis_bgcolor('k')
        im1 = ax[0,1].imshow(psd, extent=(freq_c[0], freq_c[-1],
                             freq_r[0], freq_r[-1]), aspect='auto',
                             cmap=pp.cm.hot, norm=lnorm)
        # cb1 = pp.colorbar(im1, ax=ax[0,1], use_gridspec=True)
        # cb1.set_label('Power (A.U.)')
        ax[0,1].set_title('2D power spectrum')

        ax[1,0].set_axis_bgcolor('k')
        im2 = ax[1,0].imshow(pol_psd, cmap=pp.cm.hot, norm=lnorm,
                             extent=(pos_freq[0], pos_freq[-1], 0, 360),
                             aspect='auto')
        ax[1,0].set_ylabel('Angle (deg)')
        ax[1,0].set_xlabel('Frequency (cycles/px)')
        # cb2 = pp.colorbar(im2, ax=(ax[0,1],ax[1,1]), use_gridspec=True)
        # cb2.set_label('Power (A.U.)')
        ax[1,0].set_title('Polar-transformed power spectrum')

        ax[1,1].hold(True)
        # ax[1,1].fill_between(pos_freq, mpow - stdpow, mpow + stdpow,
        #                      color='r', alpha=0.3)
        ax[1,1].axvline(0, c='k', ls='--', alpha=0.3)
        ax[1,1].plot(pos_freq, mpow, lw=3, c='r')
        ax[1,1].set_xlabel('Frequency (cycles/px)')
        ax[1,1].set_ylabel('Power (A.U.)')
        ax[1,1].set_yscale('log')
        ax[1,1].set_xlim(-0.05, None)
        ax[1,1].set_title('1D power spectrum')

        fig.tight_layout()

    return mpow, stdpow, pos_freq


def make2DWindow(shape, winfunc, *args, **kwargs):
    assert callable(winfunc)
    r, c = shape
    rvec = winfunc(r, *args, **kwargs)
    cvec = winfunc(c, *args, **kwargs)
    return np.outer(rvec, cvec)


def polar_transform(image, centre=(0,0), n_angles=None, n_radii=None):
    """
    Polar transformation of an image about the specified centre coordinate
    """
    shape = image.shape
    if n_angles is None:
        n_angles = shape[0]
    if n_radii is None:
        n_radii = shape[1]
    theta = -np.linspace(0, 2*np.pi, n_angles, endpoint=False).reshape(-1, 1)
    d = np.hypot(shape[0]-centre[0], shape[1]-centre[1])
    radius = np.linspace(0, d, n_radii).reshape(1, -1)
    x = radius * np.sin(theta) + centre[0]
    y = radius * np.cos(theta) + centre[1]
    # nb: map_coordinates can give crazy negative values using higher order
    # interpolation, which introduce nans when you take the log later on
    output = map_coordinates(image, [x, y], order=1, cval=np.nan,
                             prefilter=True)
    return output
I believe that the approach you describe is in general the best way to do this analysis.
However, I did spot an error in your code:
np.abs(f2*f2)
is not the PSD of the complex array f2; you need to multiply f2 by its complex conjugate rather than by itself (|f2^2| is not the same as |f2|^2).
Instead you should do something like
(f2*np.conjugate(f2)).astype(float)
or, more cleanly:
np.abs(f2)**2.
The oscillations in the 2D power spectrum are a tell-tale sign of this kind of error (I've done this before myself!).

Emgu CV - Anisotropic Diffusion

Can anybody guide me to some existing implementations of anisotropic diffusion, preferably Perona-Malik diffusion? For example, a translation of the following MATLAB code:
% pm2.m - Anisotropic Diffusion routines
function ZN = pm2(ZN, K, iterate);
[m, n] = size(ZN);
% lambda = 0.250;
lambda = .025;
% K = 16;
rowC = [1:m]; rowN = [1 1:m-1]; rowS = [2:m m];
colC = [1:n]; colE = [2:n n]; colW = [1 1:n-1];
result_save = 0;
for i = 1:iterate,
    % i;
    % result = PSNR(Z, ZN);
    % if result > result_save
    %     result_save = result;
    % else
    %     break;
    % end
    deltaN = ZN(rowN, colC) - ZN(rowC, colC);
    deltaS = ZN(rowS, colC) - ZN(rowC, colC);
    deltaE = ZN(rowC, colE) - ZN(rowC, colC);
    deltaW = ZN(rowC, colW) - ZN(rowC, colC);
    % deltaN = deltaN .* abs(deltaN < K);
    % deltaS = deltaS .* abs(deltaS < K);
    % deltaE = deltaE .* abs(deltaE < K);
    % deltaW = deltaW .* abs(deltaW < K);
    fluxN = deltaN .* exp(-((abs(deltaN) ./ K).^2));
    fluxS = deltaS .* exp(-((abs(deltaS) ./ K).^2));
    fluxE = deltaE .* exp(-((abs(deltaE) ./ K).^2));
    fluxW = deltaW .* exp(-((abs(deltaW) ./ K).^2));
    ZN = ZN + lambda*(fluxN + fluxS + fluxE + fluxW);
    % ZN = max(0, ZN); ZN = min(255, ZN);
end
The code is not mine; it was taken from: http://www.csee.wvu.edu/~xinl/code/pm2.m
OpenCV implementation (it needs a 3-channel image):
from cv2.ximgproc import anisotropicDiffusion
ultrasound_ad_cv2 = anisotropicDiffusion(im, 0.075, 80, 100)
From scratch in Python (for grayscale images only):
import scipy.ndimage.filters as flt
import numpy as np
import warnings

def anisodiff(img, niter=1, kappa=50, gamma=0.1, step=(1., 1.), sigma=0, option=1, ploton=False):
    """
    Anisotropic diffusion.

    Usage:
        imgout = anisodiff(im, niter, kappa, gamma, option)

    Arguments:
        img    - input image
        niter  - number of iterations
        kappa  - conduction coefficient 20-100 ?
        gamma  - max value of .25 for stability
        step   - tuple, the distance between adjacent pixels in (y,x)
        option - 1 Perona Malik diffusion equation No 1
                 2 Perona Malik diffusion equation No 2
        ploton - if True, the image will be plotted on every iteration

    Returns:
        imgout - diffused image.

    kappa controls conduction as a function of gradient. If kappa is low
    small intensity gradients are able to block conduction and hence diffusion
    across step edges. A large value reduces the influence of intensity
    gradients on conduction.

    gamma controls speed of diffusion (you usually want it at a maximum of
    0.25)

    step is used to scale the gradients in case the spacing between adjacent
    pixels differs in the x and y axes

    Diffusion equation 1 favours high contrast edges over low contrast ones.
    Diffusion equation 2 favours wide regions over smaller ones.
    """

    # ...you could always diffuse each color channel independently if you
    # really want
    if img.ndim == 3:
        warnings.warn("Only grayscale images allowed, converting to 2D matrix")
        img = img.mean(2)

    # initialize output array
    img = img.astype('float32')
    imgout = img.copy()

    # initialize some internal variables
    deltaS = np.zeros_like(imgout)
    deltaE = deltaS.copy()
    NS = deltaS.copy()
    EW = deltaS.copy()
    gS = np.ones_like(imgout)
    gE = gS.copy()

    # create the plot figure, if requested
    if ploton:
        import pylab as pl
        from time import sleep

        fig = pl.figure(figsize=(20, 5.5), num="Anisotropic diffusion")
        ax1, ax2 = fig.add_subplot(1, 2, 1), fig.add_subplot(1, 2, 2)

        ax1.imshow(img, interpolation='nearest')
        ih = ax2.imshow(imgout, interpolation='nearest', animated=True)
        ax1.set_title("Original image")
        ax2.set_title("Iteration 0")

        fig.canvas.draw()

    for ii in np.arange(1, niter):

        # calculate the diffs
        deltaS[:-1, :] = np.diff(imgout, axis=0)
        deltaE[:, :-1] = np.diff(imgout, axis=1)

        if 0 < sigma:
            deltaSf = flt.gaussian_filter(deltaS, sigma)
            deltaEf = flt.gaussian_filter(deltaE, sigma)
        else:
            deltaSf = deltaS
            deltaEf = deltaE

        # conduction gradients (only need to compute one per dim!)
        if option == 1:
            gS = np.exp(-(deltaSf/kappa)**2.)/step[0]
            gE = np.exp(-(deltaEf/kappa)**2.)/step[1]
        elif option == 2:
            gS = 1./(1.+(deltaSf/kappa)**2.)/step[0]
            gE = 1./(1.+(deltaEf/kappa)**2.)/step[1]

        # update matrices
        E = gE*deltaE
        S = gS*deltaS

        # subtract a copy that has been shifted 'North/West' by one
        # pixel. don't ask questions. just do it. trust me.
        NS[:] = S
        EW[:] = E
        NS[1:, :] -= S[:-1, :]
        EW[:, 1:] -= E[:, :-1]

        # update the image
        imgout += gamma*(NS+EW)

        if ploton:
            iterstring = "Iteration %i" % (ii+1)
            ih.set_data(imgout)
            ax2.set_title(iterstring)
            fig.canvas.draw()
            # sleep(0.01)

    return imgout
Usage:
# anisodiff(img, niter=1, kappa=50, gamma=0.1, step=(1.,1.), sigma=0, option=1, ploton=False)
us_im_ad = anisodiff(ultrasound, 100, 80, 0.075, (1, 1), 2.5, 1)
