Tuning vlfeat SVM - machine-learning

I have 6 samples of 1-dimensional data as an example and I'm trying to train VLFeat's SVM on them.
The first 3 samples are positive and the last 3 samples are negative.
I get these weights (the last value is the bias):
w: -0.6220197226 -0.0002974511
The problem is that all samples are predicted as negative, even though they are clearly linearly separable.
For learning I use the solver type VlSvmSolverSgd and lambda 0.01.
I'm using the C API, if it matters.
Minimum working example:
#include <iostream>
#include <vl/svm.h>
using namespace std;

void vlfeat_svm_test()
{
    vl_size const numData = 6 ;
    vl_size const dimension = 1 ;
    //double x[dimension * numData] = {188.0,168.0,191.0,150.0, 154.0, 124.0};
    double x[dimension * numData] = {188.0/255,168.0/255,191.0/255,150.0/255, 154.0/255, 124.0/255};
    double y[numData] = {1, 1, 1, -1, -1, -1} ;
    double lambda = 0.01;
    VlSvm *svm = vl_svm_new(VlSvmSolverSgd, x, dimension, numData, y, lambda);
    vl_svm_train(svm); // run the solver before reading the model
    double const * w= vl_svm_get_model(svm);
    double bias= vl_svm_get_bias(svm);
    // Predict each training sample with the learned model
    for(int k=0;k<numData;++k)
    {
        double res= 0.0;
        for(int i=0;i<dimension;++i)
            res+= x[k*dimension+i]*w[i];
        int pred= ((res+bias)>0)?1:-1;
        cout<< pred <<endl;
    }
    // Print the weights followed by the bias
    cout << "w: ";
    for(int i=0;i<dimension;++i)
        cout<< w[i] <<" ";
    cout<< bias <<endl;
    vl_svm_delete(svm);
}
I also tried scaling the input data by dividing by 255, but it has no effect.
Update 2:
An extremely low lambda = 0.000001 seems to solve the problem.

This happens because the SVM solvers in VLFeat do not estimate the model and bias directly, but use the workaround of adding a constant component to the data (as mentioned in http://www.vlfeat.org/api/svm-fundamentals.html) and return the corresponding model weight as the bias.
The bias term is thus a part of the regularizer and models with higher bias are "penalized" in terms of energy. This effect is especially strong in your case, since your data are extremely low dimensional :)
Therefore you need to choose a small value of the regularization parameter LAMBDA to lower the importance of the regularizer.
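To make the effect of the regularizer concrete, here is a minimal sketch that evaluates the energy for two candidate models on the data from the question. It assumes the objective has the form lambda/2 * ||w||^2 plus the mean hinge loss, with the bias treated as the weight of a constant feature equal to 1; the separating model (w_sep, b_sep) is derived by hand for illustration and is not what the solver returns.

```python
# Assumed objective: E = lambda/2 * (w^2 + b^2) + mean hinge loss.
# The bias b is the weight of a constant feature, so it is regularized too.

xs = [188 / 255, 168 / 255, 191 / 255, 150 / 255, 154 / 255, 124 / 255]
ys = [1, 1, 1, -1, -1, -1]


def energy(w, b, lam):
    """Regularized hinge-loss energy with the bias inside the regularizer."""
    hinge = sum(max(0.0, 1.0 - y * (w * x + b)) for x, y in zip(xs, ys)) / len(xs)
    return 0.5 * lam * (w * w + b * b) + hinge


# Hand-derived model that separates the data with margin 1:
# w * 168/255 + b = 1 and w * 154/255 + b = -1.
w_sep, b_sep = 510.0 / 14.0, -23.0

for lam in (0.01, 1e-6):
    print(lam, energy(0.0, 0.0, lam), energy(w_sep, b_sep, lam))
```

With lambda = 0.01 the separating model costs about 9.28 because of its large weight and bias, while the all-zero model costs 1.0, so the regularizer pushes the solution away from the separating model. With lambda = 0.000001 the ordering flips.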


How to calculate the "energy" of a signal from DCT coefficients?

I want to compute the proportion of energy of a 2D signal/image that is represented by the n largest DCT (Discrete cosine transform) coefficients.
I found the snippet below, but I don't quite understand why I can just use the L2 norm. I also can't find another source for it.
X = dct(x);
[XX,ind] = sort(abs(X),'descend');
i = 1;
while norm(X(ind(1:i)))/norm(X) < 0.99
    i = i + 1;
end
Assume the DCT is an orthonormal transformation, i.e. each basis vector is normalized and orthogonal (uncorrelated) to all the other basis vectors.
If two vectors x[i], x[j] are orthogonal, the energy of (a[i] * x[i] + a[j] * x[j]) is the sum of the energies of the individual parts (a[i] * x[i]) and (a[j] * x[j]).
If x[i] is normalized, the energy of a[i] * x[i] is simply a[i]**2.
Combining these two facts, you conclude that the energy of sum a[i] * x[i] is simply sum a[i]**2.
This is why you can use the L2 norm of the coefficients in any orthonormal basis to compute the L2 norm of the signal.
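The argument above can be checked numerically. The sketch below is an illustration, not the MATLAB implementation: it evaluates an orthonormal DCT-II directly, compares the energy of a made-up signal with the energy of its coefficients, and counts how many of the largest coefficients are needed for a given energy fraction. Note that it thresholds the ratio of squared norms; the loop in the question compares the ratio of norms with 0.99, which corresponds to an energy fraction of 0.99**2.

```python
import math


def dct_orthonormal(x):
    """Orthonormal DCT-II of a 1-D sequence (direct O(N^2) evaluation)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                               for i in range(n)))
    return out


def coefficients_for_energy(coeffs, fraction):
    """Smallest count of largest-magnitude coefficients holding `fraction` of the energy."""
    total = sum(c * c for c in coeffs)
    running = 0.0
    for count, c in enumerate(sorted(coeffs, key=abs, reverse=True), start=1):
        running += c * c
        if running >= fraction * total:
            return count
    return len(coeffs)


signal = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]   # made-up test signal
coeffs = dct_orthonormal(signal)
signal_energy = sum(v * v for v in signal)
coeff_energy = sum(c * c for c in coeffs)
needed = coefficients_for_energy(coeffs, 0.99)
print(signal_energy, coeff_energy, needed)
```

The two energies agree up to rounding, which is the property the L2-norm shortcut relies on.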

Batch gradient descent for polynomial regression

I am trying to move on from simple linear single-variable gradient descent to something more advanced: the best polynomial fit for a set of points. I created a simple Octave test script which lets me set the points visually in a 2D space, then start the gradient descent algorithm and watch how it gradually approaches the best fit.
Unfortunately, it doesn't work as well as it did with simple single-variable linear regression: the results I get (when I get them) are inconsistent with the polynomial I expect!
Here is the code:
dim = 5; % assumed plot half-width; the original value is not shown
h = figure();
axis([-dim dim -dim dim]);
hold on
index = 1;
data = zeros(1,2);
% Collect points with the mouse until the input is ended without a button press
while (1)
    [x,y,b] = ginput(1);
    if( length(b) == 0 )
        break;
    end
    plot(x, y, "b+");
    data(index, :) = [x y];
    index++;
end
y = data(:, 2);
m = length(y);
X = data(:, 1);
% Design matrix for a cubic: [1, x, x^2, x^3]
X = [ones(m, 1), data(:,1), data(:,1).^2, data(:,1).^3 ];
theta = zeros(4, 1);
iterations = 100;
alpha = 0.001;
J = zeros(1,iterations);
for iter = 1:iterations
    theta -= ( (1/m) * ((X * theta) - y)' * X)' * alpha;
    plot(-dim:0.01:dim, theta(1) + (-dim:0.01:dim).*theta(2) + (-dim:0.01:dim).^2.*theta(3) + (-dim:0.01:dim).^3.*theta(4), "g-");
    % note: this stores the sum of the gradient components, not the squared-error cost
    J(iter) = sum( (1/m) * ((X * theta) - y)' * X);
end
plot(-dim:0.01:dim, theta(1) + (-dim:0.01:dim).*theta(2) + (-dim:0.01:dim).^2.*theta(3) + (-dim:0.01:dim).^3.*theta(4), "r-");
plot(1:iter, J);
I keep getting wrong results, even though it would seem that J is minimized correctly. I checked the plotting function against the normal equation (which works correctly, of course), and although I believe the error lies somewhere in the theta equation, I cannot figure out what it is.
I ran your code and it seems to be fine. The reason you do not get the results you want is that linear regression, or polynomial regression in your case, suffers from local minima when you try to minimize the objective function. The algorithm gets trapped in a local minimum during execution. I ran your code with a different step size (alpha) and saw that a smaller step fits the data better, but you are still trapped in a local minimum.
When I choose a random initialization point for theta, I get trapped in a different local minimum each time. If you are lucky, you will find better initial points for theta and fit the data better. I think there are algorithms that find good initial points.
Below I attach the results for random initial points and the results from MATLAB's polyfit.
In the above plot, read "Polynomial Regression" instead of "Linear Regression"; that is a typo.
If you look more closely at the plot, you will see that by chance (using rand()) I chose some initial points that led to the best fit compared with the other initial points; I mark that with a pointer.
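The experiment described above, running the same batch gradient descent from several random starting points and comparing the final cost, can be sketched roughly as follows. The data points, the target cubic, the learning rate and the number of starts are all hypothetical choices made for this illustration, not values from the question.

```python
import random


def design_row(x):
    """Cubic feature vector [1, x, x^2, x^3]."""
    return [1.0, x, x * x, x ** 3]


def cost(rows, ys, theta):
    """Squared-error cost J = 1/(2m) * sum((X*theta - y)^2)."""
    m = len(ys)
    return sum((sum(t * f for t, f in zip(theta, r)) - y) ** 2
               for r, y in zip(rows, ys)) / (2 * m)


def gradient_step(rows, ys, theta, alpha):
    """One batch update: theta - (alpha/m) * X' * (X*theta - y)."""
    m = len(ys)
    errors = [sum(t * f for t, f in zip(theta, r)) - y for r, y in zip(rows, ys)]
    grad = [sum(e * r[j] for e, r in zip(errors, rows)) / m
            for j in range(len(theta))]
    return [t - alpha * g for t, g in zip(theta, grad)]


random.seed(0)
xs = [i / 10.0 - 1.0 for i in range(21)]        # 21 points in [-1, 1]
ys = [0.5 - x + 2.0 * x ** 3 for x in xs]       # hypothetical cubic target
rows = [design_row(x) for x in xs]
iterations = 100

results = []
for start in range(3):
    theta = [random.uniform(-1, 1) for _ in range(4)]
    start_cost = cost(rows, ys, theta)
    for _ in range(iterations):
        theta = gradient_step(rows, ys, theta, alpha=0.1)
    results.append((start_cost, cost(rows, ys, theta)))
    print(start, results[-1])
```

Each run prints its initial and final cost, so the starts can be ranked by how well they fit after the same number of steps. Whether the differences come from the starting point or from the limited number of iterations can be checked by raising `iterations`.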

iOS Accelerate low-pass FFT filter mirroring result

I am trying to port an existing FFT based low-pass filter to iOS using the Accelerate vDSP framework.
It seems like the FFT works as expected for about the first 1/4 of the sample. After that the results seem wrong and, even more oddly, are mirrored (the last half of the signal mirrors most of the first half).
You can see the results from a test application below. The first plot is the original sampled data, the second is an example of the expected filtered results (filtering out signal higher than 15Hz), and the last is the result of my current FFT code (note that the desired results and example FFT result are at a different scale than the original data):
The actual code for my low-pass filter is as follows:
double *lowpassFilterVector(double *accell, uint32_t sampleCount, double lowPassFreq, double sampleRate )
{
    vDSP_Stride stride = 1;
    int ln = log2f(sampleCount);
    int n = 1 << ln;
    // So that we get an FFT of the whole data set, we pad out the array to the next highest power of 2.
    int fullPadN = n * 2;
    double *padAccell = malloc(sizeof(double) * fullPadN);
    memset(padAccell, 0, sizeof(double) * fullPadN);
    memcpy(padAccell, accell, sizeof(double) * sampleCount);
    ln = log2f(fullPadN);
    n = 1 << ln;
    int nOver2 = n/2;
    DSPDoubleSplitComplex A;
    // One extra element each: the Nyquist term is unpacked into index nOver2 below.
    A.realp = (double *)malloc(sizeof(double) * (nOver2 + 1));
    A.imagp = (double *)malloc(sizeof(double) * (nOver2 + 1));
    // This can be reused, just including it here for simplicity.
    FFTSetupD setupReal = vDSP_create_fftsetupD(ln, FFT_RADIX2);
    // Pack the real input into the split-complex layout that vDSP_fft_zripD expects.
    vDSP_ctozD((DSPDoubleComplex*)padAccell, 2, &A, 1, nOver2);
    // Use the FFT to get frequency counts
    vDSP_fft_zripD(setupReal, &A, stride, ln, FFT_FORWARD);
    const double factor = 0.5f;
    vDSP_vsmulD(A.realp, 1, &factor, A.realp, 1, nOver2);
    vDSP_vsmulD(A.imagp, 1, &factor, A.imagp, 1, nOver2);
    A.realp[nOver2] = A.imagp[0];
    A.imagp[0] = 0.0f;
    A.imagp[nOver2] = 0.0f;
    // Set frequencies above target to 0.
    // This tells us which bin the frequencies over the minimum desired correspond to
    NSInteger binLocation = (lowPassFreq * n) / sampleRate;
    // We add 2 because bin 0 holds special FFT meta data, so bins really start at "1" - and we want to filter out anything OVER the target frequency
    for ( NSInteger i = binLocation+2; i < nOver2; i++ )
    {
        A.realp[i] = 0;
    }
    // Clear out all imaginary parts
    bzero(A.imagp, (nOver2) * sizeof(double));
    //A.imagp[0] = A.realp[nOver2];
    // Now shift back all of the values
    vDSP_fft_zripD(setupReal, &A, stride, ln, FFT_INVERSE);
    double *filteredAccell = (double *)malloc(sizeof(double) * fullPadN);
    // Converts the split-complex vector back into an interleaved array
    vDSP_ztocD(&A, stride, (DSPDoubleComplex*)filteredAccell, 2, nOver2);
    // Have to scale results to account for Apple's FFT library algorithm, see:
    // http://developer.apple.com/library/ios/#documentation/Performance/Conceptual/vDSP_Programming_Guide/UsingFourierTransforms/UsingFourierTransforms.html#//apple_ref/doc/uid/TP40005147-CH202-15952
    double scale = (float)1.0f / fullPadN;//(2.0f * (float)n);
    vDSP_vsmulD(filteredAccell, 1, &scale, filteredAccell, 1, fullPadN);
    // Tracks results of conversion
    printf("\nInput & output:\n");
    for (int k = 0; k < sampleCount; k++)
    {
        printf("%3d\t%6.2f\t%6.2f\t%6.2f\n", k, accell[k], padAccell[k], filteredAccell[k]);
    }
    // Release the temporary buffers; the caller owns the returned array.
    vDSP_destroy_fftsetupD(setupReal);
    free(A.realp);
    free(A.imagp);
    free(padAccell);
    return filteredAccell;
}
In the original code the library was handling non power-of-two sizes of input data; in my Accelerate code I am padding out the input to the nearest power of two. In the case of the sample test below the original sample data is 1000 samples so it's padded to 1024. I don't think that would affect results but I include that for the sake of possible differences.
If you want to experiment with a solution, you can download the sample project that generates the graphs here (in the FFTTest folder):
FFT Example Project code
Thanks for any insight. I've not worked with FFTs before, so I feel like I am missing something critical.
If you want a strictly real (not complex) result, then the data before the IFFT must be conjugate symmetric. If you don't want the result to be mirror symmetric, then don't zero the imaginary component before the IFFT. Merely zeroing bins before the IFFT creates a filter with a huge amount of ripple in the passband.
The Accelerate framework also supports more FFT lengths than just powers of 2.
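The mirroring can be reproduced without vDSP. The sketch below is an illustration that uses a plain full-length DFT written out by hand rather than vDSP's packed real format, and the test signal is made up. It shows that zeroing every imaginary part makes the inverse transform mirror symmetric, while zeroing bins k and N-k together keeps the result real without forcing symmetry.

```python
import cmath


def dft(x, sign=-1):
    """Unscaled DFT (sign=-1) or unscaled inverse DFT (sign=+1), O(N^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


signal = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]   # deliberately not symmetric
n = len(signal)
spectrum = dft(signal)

# Zero every imaginary part, as the code in the question does.
real_only = [complex(c.real, 0.0) for c in spectrum]
mirrored = [v.real / n for v in dft(real_only, sign=+1)]

# Zero bins k and n-k together so the spectrum stays conjugate symmetric.
cutoff = 2
filtered_spec = [c if (k <= cutoff or k >= n - cutoff) else 0j
                 for k, c in enumerate(spectrum)]
filtered = [v / n for v in dft(filtered_spec, sign=+1)]

print([round(v, 3) for v in mirrored])
print(max(abs(v.imag) for v in filtered))
```

Here `mirrored[t]` equals the average of `signal[t]` and `signal[-t]`, so `mirrored[1]` and `mirrored[7]` are both 25. The second output is real up to rounding, but it still has the passband ripple mentioned above because the bins are simply zeroed.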

gradient descent seems to fail

I implemented a gradient descent algorithm to minimize a cost function in order to obtain a hypothesis for determining whether an image has good quality. I did that in Octave. The idea is loosely based on the algorithm from the machine learning class by Andrew Ng.
I have 880 values in "y" that range from 0.5 to ~12, and 880 values from 50 to 300 in "X" that should predict the image's quality.
Sadly the algorithm seems to fail: after some iterations theta0 and theta1 become "NaN", and my linear regression curve has strange values...
Here is the code for the gradient descent algorithm:
(theta = zeros(2, 1), alpha = 0.01, iterations = 1500)
function [theta, J_history] = gradientDescent(X, y, theta, alpha, num_iters)
    m = length(y); % number of training examples
    J_history = zeros(num_iters, 1);
    for iter = 1:num_iters
        % accumulate the gradient with respect to theta(1)
        tmp_j1 = 0;
        for i=1:m,
            tmp_j1 = tmp_j1+ ((theta (1,1) + theta (2,1)*X(i,2)) - y(i));
        end
        % accumulate the gradient with respect to theta(2)
        tmp_j2 = 0;
        for i=1:m,
            tmp_j2 = tmp_j2+ (((theta (1,1) + theta (2,1)*X(i,2)) - y(i)) *X(i,2));
        end
        tmp1= theta(1,1) - (alpha * ((1/m) * tmp_j1));
        tmp2= theta(2,1) - (alpha * ((1/m) * tmp_j2));
        % update both parameters simultaneously
        theta = [tmp1; tmp2];
        % ============================================================
        % Save the cost J in every iteration
        J_history(iter) = computeCost(X, y, theta);
    end
end
And here is the computation of the cost function:
function J = computeCost(X, y, theta)
    m = length(y); % number of training examples
    tmp = 0;
    for i=1:m,
        tmp = tmp+ (theta (1,1) + theta (2,1)*X(i,2) - y(i))^2; % squared difference
    end
    J= (1/(2*m)) * tmp;
end
If you are wondering how the seemingly complex for loop can be vectorized and crammed into a single one-line expression, then please read on. The vectorized form is:
theta = theta - (alpha/m) * (X' * (X * theta - y))
Given below is a detailed explanation for how we arrive at this vectorized expression using gradient descent algorithm:
This is the gradient descent algorithm to fine tune the value of θ:
θj := θj - (α/m) * Σ(i=1..m) (h(xi) - yi) * xij    (for every j, updated simultaneously)
Assume that the following values of X, y and θ are given:
m = number of training examples (for example, m = 5)
n = number of features + 1 (for example, n = 4)
X = m x n matrix
y = m x 1 vector
θ = n x 1 vector
xi is the ith training example
xij is the jth feature of the ith training example
h(x) = ([X] * [θ]) (m x 1 vector of predicted values for our training set)
h(x) - y = ([X] * [θ] - [y]) (m x 1 vector of errors in our predictions)
The whole objective of machine learning is to minimize errors in predictions. Based on the above, our errors matrix E is an m x 1 vector:
[E] = [X] * [θ] - [y] = [e1; e2; ...; em], where ei = h(xi) - yi
To calculate the new value of θj, we have to get a summation of all errors (m rows) multiplied by the jth feature value of the training set X. That is, take all the values in E, individually multiply them with the jth feature of the corresponding training example, and add them all together. This gives us the new (and hopefully better) value of θj. Repeat this process for all j, i.e. for every feature. In matrix form, this can be written as:
θ = θ - (α/m) * [Σ ei * xi1; Σ ei * xi2; ...; Σ ei * xin]    (each sum runs over i = 1..m)
This can be simplified as:
θ = θ - (α/m) * ([E]' * [X])'
[E]' * [X] gives a row vector, since E' is a 1 x m matrix and X is an m x n matrix. But we want a column vector, hence we transpose the result.
More succinctly, it can be written as:
θ = θ - (α/m) * (([X] * [θ] - [y])' * [X])'
Since (A * B)' = (B' * A') and A'' = A, we can also write the above as:
θ = θ - (α/m) * ([X]' * ([X] * [θ] - [y]))
This is the original expression we started out with:
theta = theta - (alpha/m) * (X' * (X * theta - y))
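As a quick sanity check of the derivation, the sketch below compares one explicit per-parameter update with the matrix form theta - (alpha/m) * X' * (X*theta - y). It is written in plain Python with small hand-rolled matrix helpers, and the 5 x 4 data set is made up to match the m = 5, n = 4 example.

```python
def step_loop(X, y, theta, alpha):
    """Per-parameter update with explicit sums; all theta_j use the old theta."""
    m, n = len(X), len(theta)
    errors = [sum(X[i][j] * theta[j] for j in range(n)) - y[i] for i in range(m)]
    return [theta[j] - (alpha / m) * sum(errors[i] * X[i][j] for i in range(m))
            for j in range(n)]


def matvec(A, v):
    """Matrix-vector product A * v."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    """Matrix transpose A'."""
    return [list(col) for col in zip(*A)]


def step_vectorized(X, y, theta, alpha):
    """theta - (alpha/m) * X' * (X * theta - y), using the helpers above."""
    m = len(X)
    errors = [p - t for p, t in zip(matvec(X, theta), y)]
    grad = matvec(transpose(X), errors)
    return [t - (alpha / m) * g for t, g in zip(theta, grad)]


# Made-up data: m = 5 examples, n = 4 columns (first column is all ones).
X = [[1.0, 2.0, 0.5, 1.5],
     [1.0, 1.0, 3.0, 0.0],
     [1.0, 0.0, 1.0, 2.0],
     [1.0, 4.0, 2.0, 1.0],
     [1.0, 3.0, 0.0, 0.5]]
y = [3.0, 1.0, 2.0, 5.0, 4.0]
theta = [0.1, -0.2, 0.3, 0.0]

loop_result = step_loop(X, y, theta, alpha=0.01)
vec_result = step_vectorized(X, y, theta, alpha=0.01)
print(loop_result)
print(vec_result)
```

Both functions return the same four values up to rounding, which is what the derivation claims.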
I vectorized the theta update.
Maybe this helps somebody:
theta = theta - (alpha/m * (X * theta-y)' * X)';
I think that your computeCost function is wrong.
I attended Andrew Ng's class last year and I have the following (vectorized) implementation:
m = length(y);
J = 0;
predictions = X * theta;
sqrErrors = (predictions-y).^2;
J = 1/(2*m) * sum(sqrErrors);
The rest of the implementation seems fine to me, although you could also vectorize it:
theta_1 = theta(1) - alpha * (1/m) * sum((X*theta-y).*X(:,1));
theta_2 = theta(2) - alpha * (1/m) * sum((X*theta-y).*X(:,2));
Afterwards, assign the temporary thetas (here called theta_1 and theta_2) back to the "real" theta, as you already do.
Generally it is more useful to vectorize than to write loops; vectorized code is less annoying to read and to debug.
If you are OK with using a least-squares cost function, then you could try using the normal equation instead of gradient descent. It's much simpler -- only one line -- and computationally faster.
Here is the normal equation:
θ = (X' * X)^(-1) * X' * y
And in Octave form:
theta = (pinv(X' * X )) * X' * y
Here is a tutorial that explains how to use the normal equation: http://www.lauradhamilton.com/tutorial-linear-regression-with-octave
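For the two-parameter case in the question, the normal equation can be written out by hand. The sketch below is a plain-Python illustration that solves (X' * X) * theta = X' * y with the 2 x 2 closed form instead of pinv, so unlike the Octave line it fails when X' * X is singular. The sample data are made up to lie on a known line.

```python
def normal_equation_line(xs, ys):
    """Solve (X'X) theta = X'y for X = [1, x] using the 2x2 closed form."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = m * sxx - sx * sx              # determinant of X'X
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (m * sxy - sx * sy) / det
    return theta0, theta1


# Made-up data on the line y = 0.5 + 0.04 * x, with x in the 50..300 range.
xs = [50.0, 100.0, 150.0, 200.0, 250.0, 300.0]
ys = [0.5 + 0.04 * x for x in xs]
theta0, theta1 = normal_equation_line(xs, ys)
print(theta0, theta1)
```

Because the data lie exactly on a line, the recovered intercept and slope match the ones used to generate them, up to rounding. No learning rate is involved, which sidesteps the divergence problem entirely.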
While not as scalable as a vectorized version, a loop-based computation of gradient descent should generate the same results. In the example above, the most probable cause of gradient descent failing to compute the correct theta is the value of alpha.
With a verified set of cost and gradient descent functions and a data set similar to the one described in the question, theta ends up with NaN values after just a few iterations if alpha = 0.01. However, with alpha = 0.000001, gradient descent works as expected, even after 100 iterations.
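The effect of alpha can be reproduced with a small sketch. The data below are synthetic and only mimic the ranges described in the question (x from 50 to 300, y from 0.5 to 12); they are an assumption made for illustration, not the asker's data.

```python
import math


def gradient_descent(xs, ys, alpha, iterations):
    """Batch gradient descent for y ~ theta0 + theta1 * x, starting from zeros."""
    m = len(xs)
    theta0 = theta1 = 0.0
    for _ in range(iterations):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


def cost(xs, ys, theta0, theta1):
    """Squared-error cost J = 1/(2m) * sum((theta0 + theta1*x - y)^2)."""
    return sum((theta0 + theta1 * x - y) ** 2
               for x, y in zip(xs, ys)) / (2 * len(xs))


# Synthetic data in the ranges from the question.
m = 50
xs = [50.0 + 250.0 * i / (m - 1) for i in range(m)]
ys = [0.5 + 11.5 * i / (m - 1) for i in range(m)]

diverged = gradient_descent(xs, ys, alpha=0.01, iterations=1500)
converged = gradient_descent(xs, ys, alpha=1e-6, iterations=1500)
print(diverged)
print(converged, cost(xs, ys, *converged))
```

With alpha = 0.01 each step overshoots by a growing amount, so the parameters overflow and end up as inf or NaN. With alpha = 0.000001 the parameters stay finite and the cost drops below its starting value. Scaling the feature to a smaller range would allow a larger alpha.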
Using only vectors, here is a compact implementation of linear regression with gradient descent in Mathematica:
Theta = {0, 0};
alpha = 0.0001;
iteration = 1500;
m = Length[Y]; (* number of training examples *)
Jhist = Table[0, {i, iteration}];
Table[
  Theta = Theta -
    alpha * Dot[Transpose[X], (Dot[X, Theta] - Y)]/m;
  Jhist[[k]] =
    Total[(Dot[X, Theta] - Y[[All]])^2]/(2*m); Theta, {k, iteration}]
Note: Of course, this assumes that X is an n * 2 matrix, with X[[All, 1]] containing only 1s.
This should work:
err = X*theta - y; % compute the error once so both updates use the same theta
theta(1,1) = theta(1,1) - (alpha*(1/m))*(err' * X(:,1));
theta(2,1) = theta(2,1) - (alpha*(1/m))*(err' * X(:,2));
It's cleaner this way, and also vectorized:
predictions = X * theta;
errorsVector = predictions - y;
theta = theta - (alpha/m) * (X' * errorsVector);
If you remember the first PDF file on gradient descent from the Machine Learning course, you would pay attention to the learning rate. Here is the note from that PDF:
Implementation Note: If your learning rate is too large, J(theta) can diverge and 'blow up', resulting in values which are too large for computer calculations. In these situations, Octave/MATLAB will tend to return NaNs. NaN stands for 'not a number' and is often caused by undefined operations that involve -infinity and +infinity.

kissfft scaling

I am looking to compute a fast correlation using FFTs and the kissfft library, and scaling needs to be precise. What scaling is necessary (forward and backwards) and what value do I use to scale my data?
The 3 most common FFT scaling factors are:
1.0 forward FFT, 1.0/N inverse FFT
1.0/N forward FFT, 1.0 inverse FFT
1.0/sqrt(N) in both directions, FFT & IFFT
Given any possible ambiguity in the documentation, and whatever scaling you expect to be "correct" for your purposes, it is best to feed a pure sine wave of known amplitude (1.0 float or 255 integer) that is exactly periodic in the FFT length to the FFT (and/or IFFT) in question. Then see whether the scaling matches one of the above, differs from one of the above by a factor of 2 or sqrt(2), or is something completely different.
e.g. Write a unit test for kissfft in your environment for your data types.
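Such a test might look like the sketch below. It uses a hand-written unscaled DFT as a stand-in for the library under test (an assumption made so the example is self-contained); in a real test the call to `dft` would be replaced by the kissfft call being checked.

```python
import cmath
import math


def dft(x):
    """Unscaled forward DFT, standing in for the FFT under test."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


n, cycles = 64, 4
# Unit-amplitude sine with exactly `cycles` periods in the transform length.
sine = [math.sin(2 * math.pi * cycles * t / n) for t in range(n)]
peak = abs(dft(sine)[cycles])
print(peak, n / 2)
```

For an unscaled forward transform the peak magnitude is N/2. A forward transform scaled by 1/N would give 0.5, and one scaled by 1/sqrt(N) would give sqrt(N)/2, so the measured peak identifies which convention is in use.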
Multiply each frequency response by 1/sqrt(N), for an overall scaling of 1/N.
In pseudocode:
ifft( fft(x)*conj( fft(y) )/N ) == circular_correlation(x,y)
At least this is true for kissfft with floating point types.
The output of the following C++ example code should be something like
the circular correlation of [1, 3i, 0 0 ....] with itself = (10,0),(1.19796e-10,3),(-4.91499e-08,1.11519e-15),(1.77301e-08,-1.19588e-08) ...
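#include <complex>
#include <iostream>
#include "kiss_fft.h"
using namespace std;
int main()
{
    const int nfft=256;
    kiss_fft_cfg fwd = kiss_fft_alloc(nfft,0,NULL,NULL);
    kiss_fft_cfg inv = kiss_fft_alloc(nfft,1,NULL,NULL);
    // std::complex<float> value-initializes to zero; the casts below assume kiss_fft_scalar is float
    std::complex<float> x[nfft];
    std::complex<float> fx[nfft];
    x[0] = 1;
    x[1] = std::complex<float>(0,3);
    // Forward transform (unscaled)
    kiss_fft(fwd,(kiss_fft_cpx*)x,(kiss_fft_cpx*)fx);
    for (int k=0;k<nfft;++k) {
        fx[k] = fx[k] * conj(fx[k]);
        fx[k] *= 1.0f/nfft;
    }
    // Inverse transform (also unscaled), so the 1/nfft above is the only scaling
    kiss_fft(inv,(kiss_fft_cpx*)fx,(kiss_fft_cpx*)x);
    cout << "the circular correlation of [1, 3i, 0 0 ....] with itself = ";
    cout
        << x[0] << ","
        << x[1] << ","
        << x[2] << ","
        << x[3] << " ... " << endl;
    kiss_fft_free(fwd);
    kiss_fft_free(inv);
    return 0;
}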
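The same identity can be checked against a direct computation. The sketch below does not use kissfft; it relies on a hand-written unscaled DFT, which matches the "1.0 forward, 1.0 inverse" convention assumed above, and compares the result with a circular correlation computed from its definition.

```python
import cmath


def dft(x, sign=-1):
    """Unscaled DFT (sign=-1) or unscaled inverse DFT (sign=+1), O(N^2)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]


def circular_correlation(x, y):
    """r[lag] = sum over t of x[(t + lag) mod N] * conj(y[t])."""
    n = len(x)
    return [sum(x[(t + lag) % n] * y[t].conjugate() for t in range(n))
            for lag in range(n)]


n = 8
x = [0j] * n
x[0], x[1] = 1 + 0j, 3j          # the signal [1, 3i, 0, 0, ...]

fx = dft(x)
product = [a * a.conjugate() / n for a in fx]   # the only 1/N scaling
via_fft = dft(product, sign=+1)
direct = circular_correlation(x, x)
print(via_fft[:4])
print(direct[:4])
```

Both lists start with 10 at lag 0 and 3i at lag 1, in line with the output quoted above.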
