Can someone vectorize the diffusion() function? - vectorization

I am unable to vectorize this code. The given information is:

vectorizing diffusion()
The diffusion workload has a true vector dependency over timesteps, so it cannot be vectorized over timesteps. One solution to this issue is to implement a loop interchange to make the particle loop the inner loop.
In order to implement the interchange, you must create a temporary buffer to store the positions of the particles. Furthermore, the random number generator can't be called inside the vectorized loop, so you must generate and store multiple random numbers before it (a sketch of the full interchanged function follows the given code below). You can generate n_particles random numbers with:
float rn[n_particles];
vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD,
             rnStream, n_particles, rn, -1.0, 1.0);
The given code to vectorize is:
int diffusion(
        const int n_particles,      // num of particles
        const int n_steps,          // num of timesteps
        const float x_threshold,    // x cutoff
        const float alpha,          // for dist_func
        VSLStreamStatePtr rnStream  // RNG
) {
    int n_escaped = 0;
    for (int i = 0; i < n_particles; i++) {
        float x = 0.0f;
        for (int j = 0; j < n_steps; j++) {
            float rn;
            // Intel(R) MKL RNG
            vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD,
                         rnStream, 1, &rn, -1.0, 1.0);
            x += dist_func(alpha, rn);
        }
        if (x > x_threshold) n_escaped++;
    }
    return n_escaped;
}
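Not part of the assignment text, but here is a minimal sketch of the interchanged version built from the hints above. The name diffusion_vec, the aligned mkl_malloc buffers, and the #pragma omp simd hint are my own choices; dist_func is assumed to be inlinable/SIMD-enabled so the compiler can actually vectorize the particle loop:

#include <mkl.h>

int diffusion_vec(
        const int n_particles,
        const int n_steps,
        const float x_threshold,
        const float alpha,
        VSLStreamStatePtr rnStream
) {
    int n_escaped = 0;
    // temporary buffers: one position and one random number per particle
    float *x  = (float*)mkl_malloc(n_particles * sizeof(float), 64);
    float *rn = (float*)mkl_malloc(n_particles * sizeof(float), 64);
    for (int i = 0; i < n_particles; i++)
        x[i] = 0.0f;
    // loop interchange: timesteps outside, particles inside
    for (int j = 0; j < n_steps; j++) {
        // generate this timestep's random numbers before the vector loop
        vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD,
                     rnStream, n_particles, rn, -1.0f, 1.0f);
#pragma omp simd
        for (int i = 0; i < n_particles; i++)   // no dependency between particles
            x[i] += dist_func(alpha, rn[i]);
    }
    for (int i = 0; i < n_particles; i++)
        if (x[i] > x_threshold) n_escaped++;
    mkl_free(rn);
    mkl_free(x);
    return n_escaped;
}

The escape test is done in a separate pass here; with OpenMP it could also be folded into the simd loop with a reduction clause.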

Related

How can I get the Kernel of an SVM classifier in OpenCV?

I developed a multi-class SVM with OpenCV 3.0 and I want to compute the distance between each class and data (input) in order to estimate the confidence of the prediction.
I used the code below from link1, but I get errors when I try to get the kernel of my SVM!
Thanks for your help.
Mat sv = svm->getSupportVectors();
Ptr<SVM::Kernel> kernel = svm->getKernel(); // ??
Mat buffer(1, sv.rows, CV_32F);
// apply kernel on data (CV_32F vector) and support vectors
kernel->calc(sv.rows, sv.cols, sv.ptr<float>(), data.ptr<float>(), buffer.ptr<float>());
Mat alpha, svidx;
int N = 11;
vector<int> votes(N, 0); // results of majority vote will be stored here (N is number of classes)
int i, j, dfi;
for (i = dfi = 0; i < N; i++)
{
    for (j = i + 1; j < N; j++, dfi++)
    {
        // compute score for each binary svm
        double rho = svm->getDecisionFunction(dfi, alpha, svidx);
        double sum = -rho;
        for (int k = 0; k < sv.rows; k++)
            sum += alpha.at<float>(k) * buffer.at<float>(svidx.at<int>(k)); // index via svidx, not sv
        // majority vote
        votes[sum > 0 ? i : j]++;
    }
}
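For reference: a one-vs-one multi-class SVM trains one binary classifier per pair of classes, so the nested loop above walks dfi over N*(N-1)/2 decision functions - for N = 11 that is 11*10/2 = 55.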

Why are my frequency values for iPhone FFT incorrect?

I've been trying to get exact frequencies using the FFT in Apple's Accelerate framework, but I'm having trouble working out why my values are off the true frequency.
I have been using this article http://www.dspdimension.com/admin/pitch-shifting-using-the-ft/ as the basis for my implementation, and after really struggling to get to the point I'm at now, I am totally stumped.
So far I've got audio in -> Hanning window -> FFT -> phase calculation -> weird final output. I think there must be a problem with my maths somewhere, but I'm really out of ideas by now.
The outputs are a lot lower than they should be, e.g. I input 440Hz and it prints out 190Hz, or I input 880Hz and it prints out 400Hz. For the most part these results are consistent, but not always, and there doesn't seem to be any common factor between anything either...
Here is my code:
enum
{
    sampleRate = 44100,
    osamp = 4,
    samples = 4096,
    range = samples * 7 / 16,
    step = samples / osamp
};

NSMutableArray *fftResults;
static COMPLEX_SPLIT A;
static FFTSetup setupReal;
static uint32_t log2n, n, nOver2;
static int32_t stride;
static float expct = 2*M_PI*((double)step/(double)samples);
static float phase1[range];
static float phase2[range];
static float dPhase[range];

- (void)fftSetup
{
    // Declaring integers
    log2n = 12;
    n = 1 << log2n;
    stride = 1;
    nOver2 = n / 2;
    // Allocating memory for complex vectors
    A.realp = (float *) malloc(nOver2 * sizeof(float));
    A.imagp = (float *) malloc(nOver2 * sizeof(float));
    // Allocating memory for FFT
    setupReal = vDSP_create_fftsetup(log2n, FFT_RADIX2);
    // Setting phase
    memset(phase2, 0, range * sizeof(float));
}
// For each sample in buffer...
for (int bufferCount = 0; bufferCount < audioBufferList.mNumberBuffers; bufferCount++)
{
    // Declaring samples from audio buffer list
    SInt16 *samples = (SInt16*)audioBufferList.mBuffers[bufferCount].mData;
    // Creating Hann window function
    for (int i = 0; i < nOver2; i++)
    {
        double hannMultiplier = 0.5 * (1 - cos((2 * M_PI * i) / (nOver2 - 1)));
        // Applying window to each sample
        A.realp[i] = hannMultiplier * samples[i];
        A.imagp[i] = 0;
    }
    // Applying FFT
    vDSP_fft_zrip(setupReal, &A, stride, log2n, FFT_FORWARD);
    // Detecting phase
    vDSP_zvphas(&A, stride, phase1, stride, range);
    // Calculating phase difference
    vDSP_vsub(phase2, stride, phase1, stride, dPhase, stride, range);
    // Saving phase
    memcpy(phase2, phase1, range * sizeof(float));
    // Extracting DSP outputs
    for (size_t j = 0; j < nOver2; j++)
    {
        NSNumber *realNumbers = [NSNumber numberWithFloat:A.realp[j]];
        NSNumber *imagNumbers = [NSNumber numberWithFloat:A.imagp[j]];
        [real addObject:realNumbers];
        [imag addObject:imagNumbers];
    }
    // Combining real and imaginary parts
    [resultsCombined addObject:real];
    [resultsCombined addObject:imag];
    // Filling FFT output array
    [fftResults addObject:resultsCombined];
}
}
int fftCount = [fftResults count];
NSLog(@"FFT Count: %d", fftCount);
// For each FFT...
for (int i = 0; i < fftCount; i++)
{
    // Declaring floats for peak detection
    float peak = 0;
    float binNumber = 0;
    // Declaring floats for phase detection
    float deltaPhase;
    static float trueFrequency[range];
    for (size_t j = 1; j < range; j++)
    {
        // Calculating bin magnitude
        float realVal = [[[[fftResults objectAtIndex:i] objectAtIndex:0] objectAtIndex:j] floatValue];
        float imagVal = [[[[fftResults objectAtIndex:i] objectAtIndex:1] objectAtIndex:j] floatValue];
        float magnitude = sqrtf(realVal*realVal + imagVal*imagVal);
        // Peak detection
        if (magnitude > peak)
        {
            peak = magnitude;
            binNumber = (float)j;
        }
        // Getting phase difference
        deltaPhase = dPhase[j];
        // Subtract expected difference
        deltaPhase -= (float)j*expct;
        // Map phase difference into +/- pi interval
        int qpd = deltaPhase / M_PI;
        if (qpd >= 0)
            qpd += qpd&1;
        else
            qpd -= qpd&1;
        deltaPhase -= M_PI * (float)qpd;
        // Getting bin deviation from the +/- pi interval
        float deltaFrequency = osamp * deltaPhase / (2 * M_PI);
        // Calculating true frequency at the j-th partial
        trueFrequency[j] = (j * (sampleRate/samples)) + (deltaFrequency * (sampleRate/samples));
    }
    UInt32 mag;
    mag = binNumber;
    // Extracting frequency at bin peak
    float f = trueFrequency[mag];
    NSLog(@"True frequency = %fHz", f);
    float b = roundf(binNumber*(sampleRate/nOver2));
    NSLog(@" Bin frequency = %fHz", b);
}
Note that the expected phase difference (even for a bin-centered frequency) depends on both the window offset or overlap of the FFT pairs, and the bin number or frequency of the FFT result. For example, if you offset the windows by very little (1 sample), then the unwrapped phase difference between 2 FFTs will be smaller than with a larger offset. At the same offset, if the frequency is higher, the expected phase difference between the same bin of two FFTs will be greater (or it will wrap more).
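A sketch of that relationship, using the names from the code above (j is the bin, step the hop size, samples the FFT length):

// expected per-frame phase advance of bin j for a hop of `step` samples;
// this is exactly the j*expct term the loop above subtracts before wrapping
float expectedDelta = 2.0f * M_PI * (float)step * (float)j / (float)samples;

So a 1-sample hop gives a small expected advance, a larger hop or a higher bin gives a larger one (with more wrapping).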

Fast Gaussian Blur image filter with ARM NEON

I'm trying to make a mobile fast version of Gaussian Blur image filter.
I've read other questions, like: Fast Gaussian blur on unsigned char image- ARM Neon Intrinsics- iOS Dev
For my purposes I need only a fixed-size (7x7), fixed-sigma (2) Gaussian filter.
So, before optimizing for ARM NEON, I'm implementing a 1D Gaussian kernel in C++ and comparing performance with the OpenCV GaussianBlur() method directly in a mobile environment (Android with NDK). This way the code will be much simpler to optimize.
However, the result is that my implementation is 10 times slower than the OpenCV4Android version. I've read that OpenCV4Tegra has an optimized GaussianBlur implementation, but I don't think the standard OpenCV4Android has that kind of optimization, so why is my code so slow?
Here is my implementation (note: reflect101 is used for pixel reflection when applying the filter near borders):
Mat myGaussianBlur(Mat src){
    Mat dst(src.rows, src.cols, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_8UC1);
    float sum, x1, y1;
    // coefficients of 1D gaussian kernel with sigma = 2
    double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402, 0.1760326634, 0.1209853623, 0.06475879783};
    // Normalize coeffs
    float coeffs_sum = 0.9230247873f;
    for (int i = 0; i < 7; i++){
        coeffs[i] /= coeffs_sum;
    }
    // filter vertically
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0.0;
            for(int i = -3; i <= 3; i++){
                y1 = reflect101(src.rows, y - i);
                sum += coeffs[i + 3]*src.at<uchar>(y1, x);
            }
            temp.at<uchar>(y,x) = sum;
        }
    }
    // filter horizontally
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0.0;
            for(int i = -3; i <= 3; i++){
                x1 = reflect101(src.cols, x - i); // note: src.cols, not src.rows
                sum += coeffs[i + 3]*temp.at<uchar>(y, x1);
            }
            dst.at<uchar>(y,x) = sum;
        }
    }
    return dst;
}
A big part of the problem here is that the algorithm is overly precise, as @PaulR pointed out. It's usually best to keep your coefficient table no more precise than your data. In this case, since you appear to be processing uchar data, you would use roughly an 8-bit coefficient table.
Keeping these weights small will matter particularly in your NEON implementation, because the narrower your arithmetic is, the more lanes you can process at once.
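As an illustrative sketch (not a drop-in replacement): with 8-bit weights scaled to sum to 256, one horizontal pass over the interior of a row can produce eight output pixels per iteration. The function name, the example weight values, and the fixed-point scale are my own choices here:

#include <arm_neon.h>

// Sketch: 7-tap horizontal pass, interior pixels only, 8 outputs per iteration.
// coeffs8[7] are the kernel weights in 8-bit fixed point summing to 256,
// e.g. roughly {18, 33, 49, 56, 49, 33, 18} for the sigma = 2 table above.
// Edge pixels (first and last 3) and any leftover tail still need scalar code.
static void blurRow7_u8(const uint8_t *src, uint8_t *dst, int width,
                        const uint8_t coeffs8[7])
{
    for (int x = 3; x + 8 <= width - 3; x += 8) {
        uint16x8_t acc = vdupq_n_u16(0);
        for (int t = -3; t <= 3; t++) {
            uint8x8_t px = vld1_u8(src + x + t);                // 8 neighbouring pixels
            acc = vmlal_u8(acc, px, vdup_n_u8(coeffs8[t + 3])); // acc += px * weight
        }
        // max acc = 255 * 256 = 65280, so the u16 accumulator cannot overflow;
        // the rounding narrowing shift by 8 divides by the fixed-point scale (256)
        vst1_u8(dst + x, vqrshrn_n_u16(acc, 8));
    }
}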
Beyond that, the first major slowdown that stands out is having the image edge reflection code inside the main loop. That makes the bulk of the work less efficient, because in the common case it doesn't need to do anything special.
It might work out better if you use a special version of the loop near the edges, and then, when you're safe from that, use a simplified inner loop that doesn't call the reflect101() function.
Second (and more relevant to the prototype code), it's possible to add the wings of the window together before applying the weighting function, because the table contains the same coefficients on both sides. For interior pixels (where no reflection is needed) that looks like:

sum = src.at<uchar>(y, x) * coeffs[3];  // centre tap
for(int i = -3; i < 0; i++) {
    // the pair of taps at equal distance from the centre share one coefficient
    int tmp = src.at<uchar>(y + i, x) + src.at<uchar>(y - i, x);
    sum += coeffs[i + 3] * tmp;
}

This saves you three multiplies per pixel in each pass (six across the two passes), and it's a step towards some other optimisations around controlling overflow conditions.
Then there are a couple of other problems related to the memory system.
The two-pass approach is good in principle, because it saves you from performing a lot of recomputation. Unfortunately it can push the useful data out of L1 cache, which can make everything quite a lot slower. It also means that when you write the result out to memory, you're quantising the intermediate sum, which can reduce precision.
When you convert this code to NEON, one of the things you will want to focus on is trying to keep your working set inside the register file, but without discarding calculations before they've been fully utilised.
When people do use two passes, it's usual for the intermediate data to be transposed -- that is, a column of input becomes a row of output.
This is because the CPU will really not like fetching small amounts of data across multiple lines of the input image. It works out much more efficiently (because of the way the cache works) if you collect together a bunch of horizontal pixels and filter those. If the temporary buffer is transposed, then the second pass also collects together a bunch of horizontal points (which would be vertical in the original orientation) and it transposes its output again so it comes out the right way.
If you optimise to keep your working set localised, then you might not need this transposition trick, but it's worth knowing about so that you can set yourself a healthy baseline performance. Unfortunately, localisation like this does force you to go back to the non-optimal memory fetches, but with the wider data types that penalty can be mitigated.
If this is specifically for 8-bit images then you really don't want floating-point coefficients, especially not double precision. Also, you don't want to use floats for x1 and y1. You should just use integers for coordinates, and you can use fixed-point (i.e. integer) coefficients to keep all the filter arithmetic in the integer domain, e.g.
Mat myGaussianBlur(Mat src){
    Mat dst(src.rows, src.cols, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_16UC1); // <<<
    int sum, x1, y1; // <<<
    // coefficients of 1D gaussian kernel with sigma = 2
    double coeffs[] = {0.06475879783, 0.1209853623, 0.1760326634, 0.1994711402, 0.1760326634, 0.1209853623, 0.06475879783};
    int coeffs_i[7] = { 0 }; // <<<
    // Normalize coeffs and convert to 8.8 fixed point
    float coeffs_sum = 0.9230247873f;
    for (int i = 0; i < 7; i++){
        coeffs_i[i] = (int)(coeffs[i] / coeffs_sum * 256); // <<<
    }
    // filter vertically
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0; // <<<
            for(int i = -3; i <= 3; i++){
                y1 = reflect101(src.rows, y - i);
                sum += coeffs_i[i + 3]*src.at<uchar>(y1, x); // <<<
            }
            temp.at<ushort>(y,x) = sum; // <<< 16-bit intermediate, still scaled by 256
        }
    }
    // filter horizontally
    for(int y = 0; y < src.rows; y++){
        for(int x = 0; x < src.cols; x++){
            sum = 0; // <<<
            for(int i = -3; i <= 3; i++){
                x1 = reflect101(src.cols, x - i);
                sum += coeffs_i[i + 3]*temp.at<ushort>(y, x1); // <<<
            }
            dst.at<uchar>(y,x) = sum / (256 * 256); // <<<
        }
    }
    return dst;
}
This is the code after implementing all the suggestions of @Paul R and @sh1, summarized as follows:
1) use only integer arithmetic (with precision to taste)
2) add the values of the pixels at the same distance from the mask centre before applying the multiplications, to reduce the number of multiplications
3) apply only horizontal filters, transposing the matrix between the two passes to take advantage of the row-major storage of the matrices
4) separate the loops over the edges from those over the image interior, to avoid unnecessary calls to reflection functions; I removed the reflection functions entirely, inlining them in the loops along the edges
5) in addition, as a personal observation: to improve rounding without calling a (slow) function like round or cvRound, I add 0.5 (= 32768 in this fixed-point scale) to both the temporary and the final pixel results, which reduces the error/difference compared to OpenCV
Now the performance is much better: from about 15 times slower than OpenCV down to about 6 times slower.
However, the resulting matrix is not perfectly identical to the one produced by OpenCV's Gaussian blur. This is not caused by the arithmetic width (which is sufficient), and the difference remains even after the rounding fix. It is a minimal difference, between 0 and 2 (in absolute value) of pixel intensity, between the matrices produced by the two versions. The coefficients are the same ones OpenCV uses, obtained with getGaussianKernel with the same size and sigma.
Mat myGaussianBlur(Mat src){
    Mat dst(src.rows, src.cols, CV_8UC1);
    Mat temp(src.rows, src.cols, CV_8UC1);
    int sum;
    int x1;
    double coeffs[] = {0.070159, 0.131075, 0.190713, 0.216106, 0.190713, 0.131075, 0.070159};
    int coeffs_i[7] = { 0 };
    for (int i = 0; i < 7; i++){
        coeffs_i[i] = (int)(coeffs[i] * 65536); // 16.16 fixed point
    }

    // filter horizontally - inside the image
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = src.ptr<uchar>(y);
        for(int x = 3; x < (src.cols - 3); x++){
            sum = ptr[x] * coeffs_i[3];
            for(int i = -3; i < 0; i++){
                int tmp = ptr[x+i] + ptr[x-i];
                sum += coeffs_i[i + 3]*tmp;
            }
            temp.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    // filter horizontally - edges - needs reflect
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = src.ptr<uchar>(y);
        for(int x = 0; x <= 2; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 < 0){
                    x1 = -x1;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            temp.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = src.ptr<uchar>(y);
        for(int x = (src.cols - 3); x < src.cols; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 >= src.cols){
                    x1 = 2*src.cols - x1 - 2;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            temp.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }

    // transpose so the second pass can also run as a horizontal filter - better cache data locality
    // (note: as written, the loops below assume a square image, i.e. src.rows == src.cols)
    transpose(temp, temp);

    // filter horizontally - inside the image
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = temp.ptr<uchar>(y);
        for(int x = 3; x < (src.cols - 3); x++){
            sum = ptr[x] * coeffs_i[3];
            for(int i = -3; i < 0; i++){
                int tmp = ptr[x+i] + ptr[x-i];
                sum += coeffs_i[i + 3]*tmp;
            }
            dst.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    // filter horizontally - edges - needs reflect
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = temp.ptr<uchar>(y);
        for(int x = 0; x <= 2; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 < 0){
                    x1 = -x1;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            dst.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }
    for(int y = 0; y < src.rows; y++){
        uchar *ptr = temp.ptr<uchar>(y);
        for(int x = (src.cols - 3); x < src.cols; x++){
            sum = 0;
            for(int i = -3; i <= 3; i++){
                x1 = x + i;
                if(x1 >= src.cols){
                    x1 = 2*src.cols - x1 - 2;
                }
                sum += coeffs_i[i + 3]*ptr[x1];
            }
            dst.at<uchar>(y,x) = (sum + 32768) / 65536;
        }
    }

    transpose(dst, dst);
    return dst;
}
According to Google's documentation, on Android devices using float/double is about twice as slow as using int/uchar.
You may find more tips for speeding up your C++ code in Android's performance documentation:
https://developer.android.com/training/articles/perf-tips

Cepstrum and Formant Tracking Using Apple Accelerate Framework

I've been using this web page as a guideline for formant tracking of speech...
http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
It all seems to be going pretty well, except for the last step, which is converting the cepstrum into a smoothed representation for simple peak picking for the formant tracking. The spectrograph looks good, and the cepstrograph (can I say that? :P) also looks good, from what I can tell, but in the final stage the results (the smoothed formant representation) are not what I expected.
I uploaded a sample of each stage as visual images to...
http://imgur.com/a/62duS
This sample is for the speech of the sound 'i' as in 'beed'. According to this site...
http://home.cc.umanitoba.ca/~robh/howto.html#formants
the first formant should come in around 500 Hz, and the second and third around 2200 Hz and 2800 Hz respectively. The spectrograph shows something very similar, but at the last stage I am getting results similar to...
F1 - 891
F2 - 1550
F3 - 2329
Any insight would be greatly appreciated. I've been going round in circles on this for some time. My code looks as follows...
// set up fft parameters
UInt32 log2n = 9;
UInt32 n = 512;
UInt32 window = n;
UInt32 halfN = n/2;
UInt32 stride = 1;
FFTSetup setupReal = [appDelegate getFftSetup];
int stepSize = (hpBuffer.sampleCount-window) / quantizeCount;

// calculate volume from raw samples, because it seems more reliable than fft
UInt32 volumeWindow = 128;
volumeBuffer = malloc(sizeof(float)*quantizeCount);
int windowPos = 0;
for (int i=0; i < quantizeCount; i++) {
    windowPos += stepSize;
    float total = 0.0f;
    float max = 0.0f;
    for (int p=windowPos; p < windowPos+volumeWindow; p++) {
        total += sampleBuffer.buffer[p];
        if (sampleBuffer.buffer[p] > max)
            max = sampleBuffer.buffer[p];
    }
    volumeBuffer[i] = max;
}
// normalize volumebuffer
[FloatAudioBuffer normalizePositiveBuffer:volumeBuffer ofSize:quantizeCount];

// allocate memory for complex array
COMPLEX_SPLIT complexArray;
complexArray.realp = (float*)malloc(4096*sizeof(float));
complexArray.imagp = (float*)malloc(4096*sizeof(float));
// allocate some space for temporary hamming buffer
float *hamBuffer = malloc(n*sizeof(float));

// create spectrum and feature buffers
spectrumBuffer = malloc(sizeof(float)*halfN*quantizeCount);
formantBuffer = malloc(sizeof(float)*4096*quantizeCount);
cepstrumBuffer = malloc(sizeof(float)*halfN*quantizeCount);
lowCepstrumBuffer = malloc(sizeof(float)*featureCount*quantizeCount);
featureBuffer = malloc(sizeof(float)*featureCount*quantizeCount);
// create data point for each quantize segment
float TWOPI = 2.0f * M_PI;
for (int s=0; s < quantizeCount; s++) {
    // copy buffer data into a separate array and apply hamming window
    int offset = (int)(s * stepSize);
    for (int i=0; i < n; i++)
        hamBuffer[i] = hpBuffer.buffer[offset+i] * ((1.0f-0.46f) - 0.46f*cos(TWOPI*i/((float)n-1.0f)));
    // configure float array into acceptable input array format (interleaved)
    vDSP_ctoz((COMPLEX*)hamBuffer, 2, &complexArray, 1, halfN);
    // run FFT
    vDSP_fft_zrip(setupReal, &complexArray, stride, log2n, FFT_FORWARD);
    // Absolute square (equivalent to mag^2)
    complexArray.imagp[0] = 0.0f;
    vDSP_zvmags(&complexArray, 1, complexArray.realp, 1, halfN);
    bzero(complexArray.imagp, (halfN) * sizeof(float));
    // scale
    float scale = 1.0f / (2.0f*(float)n);
    vDSP_vsmul(complexArray.realp, 1, &scale, complexArray.realp, 1, halfN);
    // get log of absolute values for passing to inverse FFT for cepstrum
    for (int i=0; i < halfN; i++)
        complexArray.realp[i] = logf(sqrtf(complexArray.realp[i]));
    // save this into spectrum buffer
    memcpy(&spectrumBuffer[s*halfN], complexArray.realp, halfN*sizeof(float));
    // convert spectrum to interleaved ready for inverse fft
    vDSP_ctoz((COMPLEX*)&spectrumBuffer[s*halfN], 2, &complexArray, 1, halfN/2);
    // create cepstrum
    vDSP_fft_zrip(setupReal, &complexArray, stride, log2n-1, FFT_INVERSE);
    // convert interleaved to real and straight into cepstrum buffer
    vDSP_ztoc(&complexArray, 1, (COMPLEX*)&cepstrumBuffer[s*halfN], 2, halfN/2);
    // copy first part of cepstrum into low cepstrum buffer
    memcpy(&lowCepstrumBuffer[s*featureCount], &cepstrumBuffer[s*halfN], featureCount*sizeof(float));
    // make an 8192-point array based on the first 15 cepstral values
    float *tempArray = malloc(8192*sizeof(float));
    for (int i=0; i < 8192; i++) {
        if (i < 15)
            tempArray[i] = cepstrumBuffer[s*halfN+i];
        else
            tempArray[i] = 0.0f;
    }
    vDSP_ctoz((COMPLEX*)tempArray, 2, &complexArray, 1, 4096);
    float newLog2n = log2f(8192.0f);
    complexArray.imagp[0] = 0.0f;
    vDSP_fft_zrip(setupReal, &complexArray, stride, newLog2n, FFT_FORWARD);
    vDSP_zvmags(&complexArray, 1, complexArray.realp, 1, 4096);
    bzero(complexArray.imagp, (4096) * sizeof(float));
    // scale
    scale = 1.0f / (2.0f*(float)8192);
    vDSP_vsmul(complexArray.realp, 1, &scale, complexArray.realp, 1, 4096);
    // get magnitude
    for (int i=0; i < 4096; i++)
        complexArray.realp[i] = sqrtf(complexArray.realp[i]);
    // write to formant buffer
    memcpy(&formantBuffer[s*4096], complexArray.realp, 4096*sizeof(float));
    // complex array now contains formant spectrum
    // it's large, so get features here!
    // try simple peak picking algorithm for first 3 formants
    int formantIndex = 0;
    float *peaks = malloc(6*sizeof(float));
    for (int i=0; i < 6; i++)
        peaks[i] = 0.0f;
    for (int i=1; i < 4096-1 && formantIndex < 6; i++) {
        if (complexArray.realp[i-1] < complexArray.realp[i] &&
            complexArray.realp[i+1] < complexArray.realp[i])
            peaks[formantIndex++] = i;
    }

Input matrix to opencv kmeans clustering

This question is specific to opencv:
The kmeans example given in the OpenCV documentation has a 2-channel matrix - one channel for each dimension of the feature vector. But some of the other examples seem to say that it should be a one-channel matrix with features along the columns and one row for each sample. Which of these is right?
If I have a 5-dimensional feature vector, which input matrix should I use?
This one:
cv::Mat inputSamples(numSamples, 1, CV_32FC(numFeatures))
or this one:
cv::Mat inputSamples(numSamples, numFeatures, CV_32F)
The correct answer is cv::Mat inputSamples(numSamples, numFeatures, CV_32F).
The OpenCV Documentation about kmeans says:
samples - Floating-point matrix of input samples, one row per sample
So it is not a floating-point vector of n-dimensional floats as in the other option. Which examples suggested such behaviour?
Here is also a small example by me that shows how kmeans can be used. It clusters the pixels of an image and displays the result:
#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"
using namespace cv;
int main( int argc, char** argv )
{
Mat src = imread( argv[1], 1 );
Mat samples(src.rows * src.cols, 3, CV_32F);
for( int y = 0; y < src.rows; y++ )
for( int x = 0; x < src.cols; x++ )
for( int z = 0; z < 3; z++)
samples.at<float>(y + x*src.rows, z) = src.at<Vec3b>(y,x)[z];
int clusterCount = 15;
Mat labels;
int attempts = 5;
Mat centers;
kmeans(samples, clusterCount, labels, TermCriteria(CV_TERMCRIT_ITER|CV_TERMCRIT_EPS, 10000, 0.0001), attempts, KMEANS_PP_CENTERS, centers );
Mat new_image( src.size(), src.type() );
for( int y = 0; y < src.rows; y++ )
for( int x = 0; x < src.cols; x++ )
{
int cluster_idx = labels.at<int>(y + x*src.rows,0);
new_image.at<Vec3b>(y,x)[0] = centers.at<float>(cluster_idx, 0);
new_image.at<Vec3b>(y,x)[1] = centers.at<float>(cluster_idx, 1);
new_image.at<Vec3b>(y,x)[2] = centers.at<float>(cluster_idx, 2);
}
imshow( "clustered image", new_image );
waitKey( 0 );
}
As an alternative to reshaping the input matrix manually, you can use the OpenCV reshape function to achieve a similar result with less code. Here is my working implementation of reducing the color count with the K-Means method (in Java):
private final static int MAX_ITER = 10;
private final static int CLUSTERS = 16;

public static Mat colorMapKMeans(Mat img, int K, int maxIterations) {
    Mat m = img.reshape(1, img.rows() * img.cols());
    m.convertTo(m, CvType.CV_32F);

    Mat bestLabels = new Mat(m.rows(), 1, CvType.CV_8U);
    Mat centroids = new Mat(K, 1, CvType.CV_32F);
    Core.kmeans(m, K, bestLabels,
            new TermCriteria(TermCriteria.COUNT | TermCriteria.EPS, maxIterations, 1E-5),
            1, Core.KMEANS_RANDOM_CENTERS, centroids);

    List<Integer> idx = new ArrayList<>(m.rows());
    Converters.Mat_to_vector_int(bestLabels, idx);

    Mat imgMapped = new Mat(m.size(), m.type());
    for(int i = 0; i < idx.size(); i++) {
        Mat row = imgMapped.row(i);
        centroids.row(idx.get(i)).copyTo(row);
    }

    return imgMapped.reshape(3, img.rows());
}

public static void main(String[] args) {
    System.loadLibrary(Core.NATIVE_LIBRARY_NAME);
    Highgui.imwrite("result.png",
            colorMapKMeans(Highgui.imread(args[0], Highgui.CV_LOAD_IMAGE_COLOR),
                    CLUSTERS, MAX_ITER));
}
OpenCV reads an image into a 2-dimensional, 3-channel matrix. The first call to reshape - img.reshape(1, img.rows() * img.cols()) - essentially unrolls the 3 channels into columns. In the resulting matrix, one row corresponds to one pixel of the input image, and the 3 columns correspond to the color components.
After the K-Means algorithm has finished its work, and the color mapping has been applied, we call reshape again - imgMapped.reshape(3, img.rows()) - but now rolling the columns back into channels and reducing the row count to the original image's row count, thus getting back the original matrix format, but only with reduced colors.
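For comparison, roughly the same round trip with the C++ API might look like this (a sketch; the kmeans call itself is elided and the function name is mine):

#include <opencv2/opencv.hpp>

// Sketch: unroll a 3-channel image into an N x 3 float sample matrix and back
cv::Mat reshapeRoundTrip(const cv::Mat& img)              // img: rows x cols, CV_8UC3
{
    cv::Mat flat = img.reshape(1, img.rows * img.cols);   // (rows*cols) x 3, one channel
    flat.convertTo(flat, CV_32F);                         // kmeans needs float samples
    // ... run cv::kmeans on flat, then overwrite each row with its centroid ...
    cv::Mat back = flat.reshape(3, img.rows);             // back to rows x cols, 3 channels
    back.convertTo(back, CV_8U);                          // restore 8-bit depth
    return back;
}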
