Calculate least squares in batches - machine-learning

I want to calculate a Gaussian process. For this I need to calculate the following equation:
f = X # Z^-1 # y
To handle the inverse I use c = Z^-1 # y which is the solution of the problem Z # c = y. So, I use a solver for linear systems (least squares) to calculate c on my GPU (using cupy).
The problem now is, that I reach the memory capacity (16000x16000 matrix). Is it possible to calculate c in batches? Or how can I solve the out of memory issue?
I searched the internet but I found nothing.

The GPU will help you with the arithmetic, but will penalize data access.
Solving a linear system will require O(n^2) arithmetic operations, and the amount of data accessed is O(n^2). So you are probably not going to have a big gain.
You will find efficient linear solvers CPU implementations in pytorch, or scipy for instance, if you can use float32 instead of float64 even better.
You can have a matrix of 16000 x 16000 if you have a recent decent GPU, it would take 2GB using float64 elements. Make sure to release all other allocated memory before trying to allocate Z, and when solving try not to allocate more data.
If the matrix is well conditioned, on GPU you could try to use float16 that would place a 16000 x 16000 matrix in 512MB memory.
If you want to use GPU pytorch is my preferred option, and if you have an algorithm running on CPU you can run it on GPU with very little changes.

Related

Resampling signal to get N amount of points from signal

I have a signal and I would like to get n equally spaced points from it.
I was thinking of using resample as a way to do this but I wasn't sure of the correct values to use.
Example: I have a sine wave signal sampled at 8000 Hz but I want to get just 4 equally spaced points from the signal.
fs=8000
len_of_sig=1.0; %length of signal in seconds
t=linspace(0,len_of_sig,fs*len_of_sig);
y=1*sin(1*(2*pi)*t);
spaced_points=resample(y,)
I'm not sure how to calculate the correct values to use to get just n equally spaced points (like 4,5,6...points).
Any ideas? I don't really need to use resample, I just thought that would be the quickest.
I'm using Octave 4.2.2 on Ubuntu 64bit.
The documentation for the resample function doesn't need anything aside from the resampling factor itself:
y = resample (x, p, q, h)
Change the sample rate of x by a factor of p/q. This is performed using a polyphase algorithm. The impulse response h of the antialiasing filter is either specified or either designed with a Kaiser-windowed sinecard.
Suppose you have the variable ndesired_samples, which specifies how many samples you want in the end. Let nsamp = fs*len_of_sig.
The resampling factor is given by ndesired_samples/nsamp, hence p is the number of desired samples, and q is the number of total samples. Note that resample divides p and q by their GCD internally.
Beware of issues stemming from the polyphase structure and the Kaiser interpolation window. IIRC these are especially bad if p and q end up large after GCD (i.e. resampling 10000 samples to 8000 samples is OK, resampling 10000 points to 8001 warrants further caution).

Is the following system memory or memoryless and linear or nonlinear?

Is this system memory or memoryless and linear or nonlinear? I can't understand whether it is memory or memoryless and linear or nonlinear when u[n] is involved. Please help me out.
y[n] = x[n]+3u[n+1]
TIA.
This is a linear system since output is computed with linear combination (addition and scaling of two signals i.e. input and unit-step). This is not memoryless system since in order to compute output (y[n]) at time-instance n you need unit-step u[n+1] i.e. a time-point in future n+1

about CUFFT input sizes

It's written that CUFFT library supports algorithms that higly optimized for input sizes can be written in the folowing form: 2^a X 3^b X 5^c X 7^d.
How could they managed to do that?
For as far as I know, FFT must provide best perfomance only for 2^a input size.
This means that input sizes with prime factors larger than 7 would go slower.
The Cooley-Tukey algorithm can operate on a variety of DFT lengths which can be expressed as N = N_1*N_2. The algorithm recursively expresses a DFT of length N into N_1 smaller DFTs of length N_2.
As you note, the fastest is generally the radix-2 factorization, which recursively breaks a DFT of length N into 2 smaller DFTs of length N/2, running in O(NlogN).
However, the actual performance will depend on hardware and implementation. For example, if we are considering the cuFFT with a thread warp size of 32 then DFTs that have a length of some multiple of 32 would be optimal (note: just an example, I'm not aware of the actual optimizations that exist under the hood of the cuFFT.)
Short answer: the underlying code is optimized for any prime factorization up to 7 based on the Cooley-Tukey radix-n algorithm.
http://mathworld.wolfram.com/FastFourierTransform.html
https://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm

How to find an eigenvector given eigenvalue 1, minimising memory use

I'd be grateful if people could help me find an efficient way (probably low memory algorithm) to tackle the following problem.
I need to find the stationary distribution x of a transition matrix P. The transition matrix is an extremely large, extremely sparse matrix, constructed such that all the columns sum to 1. Since the stationary distribution is given by the equation Px = x, then x is simply the eigenvector of P associated with eigenvalue 1.
I'm currently using GNU Octave to both generate the transition matrix, find the stationary distribution, and plot the results. I'm using the function eigs(), which calculates both eigenvalues and eigenvectors, and it is possible to return just one eigenvector, where the eigenvalue is 1 (I actually had to specify 1.1, to prevent an error). Construction of the transition matrix (using a sparse matrix) is fairly quick, but finding the eigenvector gets increasingly slow as I increase the size, and I'm running out of memory before I can examine even moderately sized problems.
My current code is
[v l] = eigs(P, 1, 1.01);
x = v / sum(v);
Given that I know that 1 is the eigenvalue, I'm wondering if there is either a better method to calculate the eigenvector, or a way that makes more efficient use of memory, given that I don't really need an intermediate large dense matrix. I naively tried
n = size(P,1); % number of states
Q = P - speye(n,n);
x = Q\zeros(n,1); % solve (P-I)x = 0
which fails, since Q is singular (by definition).
I would be very grateful if anyone has any ideas on how I should approach this, as it's a calculation I have to perform a great number of times, and I'd like to try it on larger and more complex models if possible.
As background to this problem, I'm solving for the equilibrium distribution of the number of infectives in a cattle herd in a stochastic SIR model. Unfortunately the transition matrix is very large for even moderately sized herds. For example: in an SIR model with an average of 20 individuals (95% of the time the population is between 12 and 28 individuals), P is 21169 by 21169 with 20340 non-zero values (i.e. 0.0005% dense), and uses up 321 Kb (a full matrix of that size would be 3.3 Gb), while for around 50 individuals P uses 3 Mb. x itself should be pretty small. I suspect that eigs() has a dense matrix somewhere, which is causing me to run out of memory, so I should be okay if I can avoid using full matrices.
Power iteration is a standard way to find the dominant eigenvalue of a matrix. You pick a random vector v, then hit it with P repeatedly until you stop seeing it change very much. You want to periodically divide v by sqrt(v^T v) to normalise it.
The rate of convergence here is proportional to the separation between the largest eigenvalue and the second largest eigenvalue. Each iteration takes just a couple of matrix multiplies.
There are fancier-pants ways to do this ("PageRank" is one good thing to search for here) that improve speed for really huge sparse matrices, but I don't know that they're necessary or useful here.
Your approach seems like a good one. However, what you're calling x, is the null space of Q. null(Q) would work if it supported sparse matrices, but it doesn't. There's a bunch of stuff on the web for finding the null space of a sparse matrix. For example:
http://www.mathworks.co.uk/matlabcentral/newsreader/view_thread/249467
http://www.mathworks.com/matlabcentral/fileexchange/42922-null-space-for-sparse-matrix/content/nulls.m
http://www.mathworks.com/matlabcentral/fileexchange/11120-null-space-of-a-sparse-matrix
It seems the best solution is to use the Power Iteration method, as suggested by tmyklebu.
The method is to iterate x = Px; x /= sum(x), until x converges. I'm assuming convergence if the d1 norm between successive iterations is less than 1e-5, as that seems to give good results.
Convergence can take a while, since the largest two eigenvalues are fairly close (the number of iterations needed to converge can vary considerably, from around 200 to 2000 depending on the model used and population sizes, but it gets there in the end). However, the memory requirements are low, and it's very easy to implement.

Is a CUDA-programmed GPU suitable for implementation of OpenCV adaptive threshold?

On my system, for a 5 MP image with a large window size (75px) it takes a whopping 140 ms (roughly 20 times as much as linear operations) to complete and I am looking to optimize it. I have noticed that the OpenCV gpu module does not implement a gpu version of the adaptiveThreshold so I have been thinking of implementing that algorithm for the GPU myself.
Can I hope for any speedup if I implement an adaptive threshold algorithm in CUDA, based on a large window size (50px+) and a large image (5 MP+), ignoring the overhead for loading memory into the GPU?
adaptiveThreshold documentation on opencv.org:
http://docs.opencv.org/modules/imgproc/doc/miscellaneous_transformations.html#adaptivethreshold
Building on Eric's answer:
The Npp CUDA library does not implement adaptiveThreshold but it seems beneficial to getting an adaptive threshold in a VERY straightforward way (just tested it and anecdotally works):
Run a box filter on src (i.e. compute mean window value for every pixel),
store in an intermediate image tmp.
Subtract a number K from each pixel in tmp
Run a compare function between src and
tmp into dst. The end.
The code may look like this (here K=0, 2nd step omitted):
nppiFilterBox_8u_C1R(oDeviceSrc.data(), oDeviceSrc.pitch(),
oDeviceIntermediate.data(), oDeviceDst.pitch(),
oSizeROI, oAdapThreshWindowSize,oAnchor);
nppiCompare_8u_C1R(oDeviceSrc.data(),oDeviceSrc.pitch(),
oDeviceDst.data(),oDeviceDst.pitch(),
oDeviceResult.data(),oDeviceResult.pitch(),
oSizeROI,NPP_CMP_LESS);
Also, wikipedia claims that applying a box filter 3 times in a row approximates a Gaussian filter to 97% accuracy.
Yes, this algorithm can be optimized on the GPU. I would expect to see an excellent speedup.
For ADAPTIVE_THRESH_MEAN_C, you could use a standard parallel reduction to calculate the arithmetic mean. For ADAPTIVE_THRESH_GAUSSIAN_C, you might use a kernel that performs per-pixel gaussian attenuation combined with a standard parallel reduction for the sum.
Implementation by CUDA should give you a satisfied performance gain.
Since your window size is large, this operation should be compute-bounded. The theoretical peak performance of a 5 MP image with 75px window on a Tesla K20X GPU should be about
5e6 * 75 * 75 / 3.95 Tflops = 7ms
Here's a white paper about image convolution. It shows how to implement a high performance box filer with CUDA.
http://docs.nvidia.com/cuda/samples/3_Imaging/convolutionSeparable/doc/convolutionSeparable.pdf
Nvidia cuNPP library also provides a function nppiFilterBox(), which can be used to implement ADAPTIVE_THRESH_MEAN_C directly.
http://docs.nvidia.com/cuda/cuda-samples/index.html#box-filter-with-npp
For ADAPTIVE_THRESH_GAUSSIAN_C, the function nppiFilter() with a proper mask could be used.
NPP doc pp.1009 http://docs.nvidia.com/cuda/pdf/NPP_Library.pdf

Resources