LS-SVM training: Out of memory

I am trying to train an LS-SVM classifier on a dataset with the following sizes:
Training dataset: TS = 48000x12 (double)
Groups: G = 48000x1 (double)
The MATLAB training code is:
class = svmtrain(TS,G,'method','LS',...
'kernel_function','rbf','boxconstraint',C,'rbf_sigma',sigma);
Then I get this error message:
Error using svmtrain (line 516)
Error evaluating kernel function 'rbf_kernel'.
Caused by:
Error using repmat
Out of memory. Type HELP MEMORY for your options.
Note that the machine has 4 GB of physical memory, and training works when I decrease the training-set size. Is there any solution that keeps the same data size and, of course, does not require adding physical memory?

It seems that the implementation requires computing the whole Gram matrix, which has size N x N (where N is the number of samples). In your case that is 48000^2 = 2,304,000,000 entries; even stored as 32-bit floats (4 bytes each), that requires 9,216,000,000 bytes, roughly 9 GB, just for the Gram (kernel) matrix - and MATLAB doubles would need twice that.
There are two options:
Find an implementation that, for the RBF kernel, does not compute the full kernel (Gram) matrix, but instead uses a callable to compute each kernel value on the fly
You can try some kind of LS-SVM approximation, such as the Fast Sparse Approximation of Least Squares Support Vector Machine: http://homes.cs.washington.edu/~lfb/software/FSALS-SVM.htm
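As a sketch of the first option: the RBF Gram matrix can be applied to a vector in row chunks, so only a chunk x N slice ever exists in memory. This is a hypothetical numpy helper, not part of any LS-SVM library; an iterative solver (e.g. conjugate gradients, since LS-SVM training reduces to a linear system) can train using only such products:

```python
import numpy as np

def rbf_kernel_matvec(X, v, sigma, chunk=1000):
    """Compute K @ v for the RBF Gram matrix K without ever storing K.

    Rows of K are built in blocks of `chunk` samples, used, and discarded,
    so peak memory is O(chunk * N) instead of O(N^2).
    """
    N = X.shape[0]
    sq = np.sum(X ** 2, axis=1)        # precomputed squared norms
    out = np.empty(N)
    for start in range(0, N, chunk):
        stop = min(start + chunk, N)
        # squared Euclidean distances between this block and all samples
        d2 = sq[start:stop, None] + sq[None, :] - 2.0 * (X[start:stop] @ X.T)
        np.maximum(d2, 0.0, out=d2)    # clip tiny negatives from roundoff
        out[start:stop] = np.exp(-d2 / (2.0 * sigma ** 2)) @ v
    return out
```

With 48000 samples and chunk=1000, each block needs about 1000 x 48000 x 8 bytes, roughly 384 MB, instead of ~18 GB for the full double-precision Gram matrix.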

Calculate least squares in batches

I want to calculate a Gaussian process. For this I need to calculate the following equation:
f = X @ Z^-1 @ y
To handle the inverse I use c = Z^-1 @ y, which is the solution of the linear system Z @ c = y. So I use a solver for linear systems (least squares) to compute c on my GPU (using cupy).
The problem is that I hit the memory capacity (a 16000x16000 matrix). Is it possible to compute c in batches? Or how else can I solve the out-of-memory issue?
I searched the internet but I found nothing.
The GPU will help you with the arithmetic, but penalizes data access.
Solving a dense linear system takes O(n^3) arithmetic operations on O(n^2) data; in principle that is compute-bound, but for moderate n the memory traffic and host-device transfers can still dominate, so you may not see a big gain.
You will find efficient CPU implementations of linear solvers in pytorch or scipy, for instance; if you can use float32 instead of float64, even better.
You can hold a 16000 x 16000 matrix on a recent decent GPU; it takes 2 GB with float64 elements. Make sure to release all other allocated memory before allocating Z, and avoid allocating more data while solving.
If the matrix is well conditioned, you could try float16 on the GPU, which would fit a 16000 x 16000 matrix in 512 MB of memory.
If you want to use the GPU, pytorch is my preferred option; if you have an algorithm running on the CPU, you can usually move it to the GPU with very few changes.
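One concrete way to solve Z @ c = y without allocating anything beyond Z itself is an iterative solver such as conjugate gradients, which touches Z only through matrix-vector products. A small CPU sketch with scipy (the 200x200 matrix is a stand-in for the 16000x16000 one; cupy offers a similar interface for the GPU). This relies on Z being symmetric positive definite, which a Gaussian-process covariance matrix is, possibly after adding a small jitter to the diagonal:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n = 200                                          # stand-in size; same idea at 16000
A = rng.standard_normal((n, n)).astype(np.float32)
Z = A @ A.T + n * np.eye(n, dtype=np.float32)    # symmetric positive definite
y = rng.standard_normal(n).astype(np.float32)

# CG only ever asks for Z @ v, so Z could also be applied block by block
# (or streamed from disk) instead of being held as one dense array.
op = LinearOperator(Z.shape, matvec=lambda v: Z @ v, dtype=np.float32)
c, info = cg(op, y)                              # info == 0 means it converged
```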

SelectKBest gives memory error even with sparse matrix

I have spreadsheet data that is over a GB in size and I want to use a random forest. Following some other questions on here, I was able to tune the algorithm to work with my data, but unfortunately, to get the best performance I needed to one-hot encode a categorical feature, and now my input matrix has over 3000 features, resulting in a memory error.
I'm trying to reduce these features using SelectKBest with chi2, which according to the docs handles sparse matrices, but I'm still getting a memory error.
I tried using to_sparse with fill_value=0, which seems to reduce the memory footprint, but when I call fit_transform I get a MemoryError:
MemoryError Traceback (most recent call last)
in ()
4 Y_sparse = df_processed.loc[:,'Purchase'].to_sparse(fill_value=0)
5
----> 6 X_new = kbest.fit_transform(X_sparse, Y_sparse)
kbest = SelectKBest(mutual_info_regression, k = 5)
X_sparse = df_processed.loc[:,df_processed.columns != 'Purchase'].to_sparse(fill_value=0)
Y_sparse = df_processed.loc[:,'Purchase'].to_sparse(fill_value=0)
X_new = kbest.fit_transform(X_sparse, Y_sparse)
I simply want to reduce the 3000 features to something more manageable, say 20, that correlate well with my Y values (a continuous response).
The reason you are getting an error on everything is that, to do anything in Pandas or sklearn, the entire dataset has to be loaded into memory along with all the data from intermediate steps.
Instead of one-hot encoding, try binary encoding or hashing encoding. One-hot encoding grows linearly: n new columns for a categorical feature with n categories. Binary encoding grows as log_2(n), so you should be able to avoid the memory error; if not, try hashing encoding.
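For hashing encoding specifically, scikit-learn's FeatureHasher maps an arbitrary number of category values into a fixed number of sparse columns, so memory no longer grows with the number of categories (the column count and category strings below are made up for illustration):

```python
from sklearn.feature_extraction import FeatureHasher

# 32 output columns regardless of how many distinct categories appear
hasher = FeatureHasher(n_features=32, input_type="string")
rows = [["city=London"], ["city=Paris"], ["city=London"], ["city=Tokyo"]]
X = hasher.transform(rows)   # scipy.sparse CSR matrix of shape (4, 32)
```

The trade-off is that hash collisions can fold different categories into the same column, which usually costs little accuracy but makes the features harder to interpret.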

How to calculate 512 point FFT using 2048 point FFT hardware module

I have a 2048-point FFT IP. How can I use it to calculate a 512-point FFT?
There are different ways to accomplish this, but the simplest is to replicate the input data 4 times, to obtain a signal of 2048 samples. Note that the DFT (which is what the FFT computes) can be seen as assuming the input signal being replicated infinitely. Thus, we are just providing a larger "view" of this infinitely long periodic signal.
The resulting FFT will have 512 non-zero values, with zeros in between. Each of the non-zero values will also be four times as large as the 512-point FFT would have produced, because there are four times as many input samples (that is, if the normalization is as commonly applied, with no normalization in the forward transform and 1/N normalization in the inverse transform).
Here is a proof of principle in MATLAB:
data = randn(1,512);
ft = fft(data); % 512-point FFT
data = repmat(data,1,4);
ft2 = fft(data); % 2048-point FFT
ft2 = ft2(1:4:end) / 4; % 512-point FFT
assert(all(ft2==ft))
(Very surprising that the values were exactly equal, no differences due to numerical precision appeared in this case!)
An alternative to the correct solution provided by Cris Luengo, which does not require any rescaling, is to pad the data with zeros to the required length of 2048 samples. You then get your result by reading every 2048/512 = 4th output (i.e. output[0], output[4], ... in a 0-based indexing system).
Since you mention making use of a hardware module, this could be implemented in hardware by connecting the first 512 input pins and grounding all other inputs, and reading every 4th output pin (ignoring all other output pins).
Note that this works because the FFT of the zero-padded signal is an interpolation in the frequency domain of the original signal's FFT. In this case you do not need the interpolated values, so you can just ignore them. Here's an example computing a 4-point FFT using a 16-point module (I've reduced the size of the FFT for brevity, but kept the same ratio of 4 between the two):
x = [1,2,3,4]
fft(x)
ans> 10.+0.j,
-2.+2.j,
-2.+0.j,
-2.-2.j
x = [1,2,3,4,0,0,0,0,0,0,0,0,0,0,0,0]
fft(x)
ans> 10.+0.j, 6.499-6.582j, -0.414-7.242j, -4.051-2.438j,
-2.+2.j, 1.808+1.804j, 2.414-1.242j, -0.257-2.3395j,
-2.+0.j, -0.257+2.339j, 2.414+1.2426j, 1.808-1.8042j,
-2.-2.j, -4.051+2.438j, -0.414+7.2426j, 6.499+6.5822j
As you can see in the second output, the first column (which corresponds to outputs 0, 4, 8 and 12) is identical to the desired output from the first, smaller-sized FFT.
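The same check can be run in a couple of lines of numpy, padding a 4-point signal to 16 points and comparing every 4th bin of the large FFT against the small one:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
small = np.fft.fft(x)                                    # 4-point FFT
padded = np.fft.fft(np.concatenate([x, np.zeros(12)]))   # 16-point FFT of the zero-padded input
# every 4th bin of the large FFT equals the small FFT, no rescaling needed
assert np.allclose(padded[::4], small)
```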

Neural net fails to generalize a simple bitwise AND

After taking a bunch of online courses and reading many papers, I started playing with neural nets, but to my surprise they fail to generalize a simple bitwise AND operation.
Inputs:
Inp#1 - randomly generated number between 0-15, scaled down to (0,1)
Inp#2 - 16 bit randomly generated unsigned int scaled down to (0,1)
# Code snippet
int in1 = (int)rand()%16;
int in2 = (int)rand()%(0x0010000);
in[0] = (fann_type)(in1/100.0); // not to worry about float roundup
in[1] = (fann_type)(in2/100000.0); // not to worry about float roundup
Outputs:
Out#1 = -1 if the bit of inp#2 at the index given by inp#1 is 0, otherwise 1
# Code snippet
int out1 = (in2 & (1<<in1)) ? 1 : -1;
out[0] = (fann_type)out1;
Network: tried many different variations, below is example
A. 1 hidden layer with 30 neurons,
Activation Function (hidden): sigmoid,
Activation Function (output): sigmoid_symmetric (tanh),
Training method: RPROP
Learning rate: 0.7 (default)
Momentum: 0.0 (default)
RPROP Increase factor: 1.2 (default)
RPROP Decrease factor: 0.5 (default)
RPROP Minimum Step-size: 0 (default)
RPROP Maximum Step-size: 50 (default)
B. 3 hidden layers each having 30 neurons, with the same params as in A
C. tried the same networks also with scaling inputs to (-1,1) and using tanh for also hidden layer.
Data Sets: 5000 samples for training, 5000 for testing and 5000 for validation. Tried even bigger datasets, no success
# examples from training set
0.040000 0.321600
-1
0.140000 0.625890
1
0.140000 0.039210
-1
0.010000 0.432830
1
0.100000 0.102220
1
Process: the network was trained on the training set while the MSE on the test data was monitored in parallel to avoid possible overfitting.
Libraries: used multiple, but mostly tried with fann and used fanntool for gui.
Any ideas? Can upload the datasets if any particular interest.
If I understand your setup, you are trying to do something like:
have a network with architecture 2-X-X-X-1 (where X is the number of hidden units) - thus 2 inputs, one output
model a bitwise function over the inputs
If this is true, it is an extremely peculiar problem and a very bad choice of architecture. Neural networks are not magical hats; they are a very big family of models. What you are trying to do has none of the characteristics expected of a function to be modeled by a NN: it is completely non-smooth in the input, it has lots of discontinuities, and it is really just a bunch of if-else clauses.
What should you do? Express your inputs as bits: with 32 inputs, 16 binary inputs per number, the network will learn your function without any problems. You encoded the inputs in a very specific manner (their decimal representation) and expect your network to model the decomposition to binary and then an operation on top of it. A NN can learn that, but you would need quite a complex network to achieve it - the whole reason is that you gave your network a suboptimal representation and built a very simple network of a kind originally designed to approximate smooth functions.
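A sketch of the suggested input representation, in pure Python, following the 16-bits-per-number layout described above (the helper names are made up):

```python
def to_bits(value, width):
    """Binary features for `value`, most significant bit first."""
    return [(value >> i) & 1 for i in reversed(range(width))]

def make_sample(in1, in2):
    # 16 binary inputs per number -> 32 inputs total, as suggested above
    x = to_bits(in1, 16) + to_bits(in2, 16)
    y = 1 if in2 & (1 << in1) else -1   # same target as the original code
    return x, y
```

With this encoding, the target depends on the inputs through simple boolean structure instead of a decimal-to-binary decomposition, which even a small network can pick up.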

svm scaling input values

I am using libSVM.
Say my feature values are in the following format:
instance1 : f11, f12, f13, f14
instance2 : f21, f22, f23, f24
instance3 : f31, f32, f33, f34
instance4 : f41, f42, f43, f44
..............................
instanceN : fN1, fN2, fN3, fN4
I think there are two kinds of scaling that can be applied.
scale each instance vector so that it has zero mean and unit variance:
( (f11, f12, f13, f14) - mean((f11, f12, f13, f14)) ) ./ std((f11, f12, f13, f14))
scale each column of the above matrix to a range, for example [-1, 1]
In my experiments with the RBF kernel (libSVM), I found that the second scaling (2) improves the results by about 10%, and I did not understand why.
Could anybody explain the reason for applying scaling, and why the second option gives improved results?
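For reference, the two variants can be written out in a few lines of numpy on a toy matrix: (1) standardizes each instance row, (2) rescales each column to [-1, 1]:

```python
import numpy as np

X = np.array([[1.0, 200.0, 3.0, 40.0],
              [2.0, 100.0, 6.0, 80.0],
              [3.0, 300.0, 9.0, 20.0]])

# (1) per-instance: each row ends up with zero mean and unit variance
rows = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# (2) per-column: each column is mapped linearly onto [-1, 1]
lo, hi = X.min(axis=0), X.max(axis=0)
cols = 2.0 * (X - lo) / (hi - lo) - 1.0
```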
The standard thing to do is to make each dimension (or attribute, or column (in your example)) have zero mean and unit variance.
This brings each dimension of the SVM input into the same magnitude. From http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf:
The main advantage of scaling is to avoid attributes in greater numeric
ranges dominating those in smaller numeric ranges. Another advantage is to avoid
numerical difficulties during the calculation. Because kernel values usually depend on
the inner products of feature vectors, e.g. the linear kernel and the polynomial
kernel, large attribute values might cause numerical problems. We recommend linearly
scaling each attribute to the range [-1,+1] or [0,1].
I believe that it comes down to your original data a lot.
If your original data has SOME extreme values in some columns, then in my opinion you lose some definition when scaling linearly, for example to the range [-1, 1].
Let's say that you have a column where 90% of values are between 100-500 and in the remaining 10% the values are as low as -2000 and as high as +2500.
If you scale this data linearly, then you'll have:
-2000 -> -1 ## <- The min in your scaled data
+2500 -> +1 ## <- The max in your scaled data
100 -> -0.06666666666666665
234 -> -0.007111111111111068
500 -> 0.11111111111111116
You could argue that the discernibility between what was originally 100 and 500 is smaller in the scaled data in comparison to what it was in the original data.
In the end, I believe it very much comes down to the specifics of your data. I believe the 10% performance improvement is largely coincidental; you will certainly not see a difference of this magnitude in every dataset you try both scaling methods on.
At the same time, in the paper linked in the other answer, you can clearly see that the authors recommend scaling the data linearly.
I hope someone finds this useful!
The accepted answer describes "standard scaling", which is not efficient for high-dimensional data stored in sparse matrices (text data is a typical use case); in such cases you can resort to "max scaling" and its variants, which work with sparse matrices.
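A minimal sketch with scikit-learn's MaxAbsScaler, which divides each column by its maximum absolute value and therefore keeps zeros as zeros, preserving sparsity (the tiny matrix is made up for illustration):

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

X = sparse.csr_matrix(np.array([[0.0, -4.0],
                                [2.0,  0.0],
                                [1.0,  8.0]]))
# unlike standard scaling, no mean is subtracted, so the result stays sparse
Xs = MaxAbsScaler().fit_transform(X)   # each column now lies within [-1, 1]
```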
