Vector Matrix multiplication via ARM NEON - ios

I have a task - to multiply big row vector (10 000 elements) via big column-major matrix (10 000 rows, 400 columns). I decided to go with ARM NEON since I'm curious about this technology and would like to learn more about it.
Here's a working example of vector matrix multiplication I wrote:
//float* vec_ptr - a pointer to vector
//float* mat_ptr - a pointer to matrix
//float* out_ptr - a pointer to output vector
//int matCols - matrix columns
//int vecRows - vector rows, the same as matrix
for (int i = 0, max_i = matCols; i < max_i; i++) {
for (int j = 0, max_j = vecRows - 3; j < max_j; j+=4, mat_ptr+=4, vec_ptr+=4) {
float32x4_t mat_val = vld1q_f32(mat_ptr); //get 4 elements from matrix
float32x4_t vec_val = vld1q_f32(vec_ptr); //get 4 elements from vector
float32x4_t out_val = vmulq_f32(mat_val, vec_val); //multiply vectors
float32_t total_sum = vaddvq_f32(out_val); //sum elements of vector together
out_ptr[i] += total_sum;
}
vec_ptr = &myVec[0]; //switch ptr back again to zero element
}
The problem is that it's taking very long time to compute - 30 ms on iPhone 7+ when my goal is 1 ms or even less if it's possible. Current execution time is understandable since I launch multiplication iteration 400 * (10000 / 4) = 1 000 000 times.
Also, I tried to process 8 elements instead of 4. It seems to help, but numbers still very far from my goal.
I understand that I might make some horrible mistakes since I'm newbie with ARM NEON. And I would be happy if someone can give me some tip how I can optimize my code.
Also - is it worth doing big vector-matrix multiplication via ARM NEON? Does this technology fit well for such purpose?

Your code is completely flawed: it iterates 16 times assuming both matCols and vecRows are 4. What's the point of SIMD then?
And the major performance problem lies in float32_t total_sum = vaddvq_f32(out_val);:
You should never convert a vector to a scalar inside a loop since it causes a pipeline hazard that costs around 15 cycles everytime.
The solution:
float32x4x4_t myMat;
float32x2_t myVecLow, myVecHigh;
myVecLow = vld1_f32(&pVec[0]);
myVecHigh = vld1_f32(&pVec[2]);
myMat = vld4q_f32(pMat);
myMat.val[0] = vmulq_lane_f32(myMat.val[0], myVecLow, 0);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[1], myVecLow, 1);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[2], myVecHigh, 0);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[3], myVecHigh, 1);
vst1q_f32(pDst, myMat.val[0]);
Compute all the four rows in a single pass
Do a matrix transpose (rotation) on-the-fly by vld4
Do vector-scalar multiply-accumulate instead of vector-vector multiply and horizontal add that causes the pipeline hazards.
You were asking if SIMD is suitable for matrix operations? A simple "yes" would be a monumental understatement. You don't even need a loop for this.

Related

Q: What is the mathematical relationship between the amplitudes of the cos and sine waves of an FFT output

I have decomposed some time series data using a custom FFT implementation. By design my FFT implementation gives me a set of cos and sine waves that I can then sum together to regenerate the original signal. This works well without issue, so I know that the extracted sine and cos waves are correct in terms of amplitude, period and phase.
The data I am using has 1024 samples which gives me the properties of 512 cos waves and 512 sine waves (eg the amplitude, phase and period data for each wave).
To save on data storage I am trying to find/understand the mathematical relationship between the amplitudes of the waves. Instead of having to save every amplitude for every sine and cos wave I would like to simply save some coefficients that I can later use to rebuild the amplitudes in code.
FFT Sine Waves with Amplitudes
From the above image you can see that there is a set of Power curve coefficients that roughly fit the amplitude data, however for my use case this is not accurate enough.
As I have all the source data along with the generated properties of each wave, is there a simple formula that I can use or a transform I can perform to generate the amplitudes in code after I have performed the FFT? I know that the amplitudes are related to the real and imaginary values however I cannot store all the real and imaginary values either due to space requirements.
As an example of how I am saving this issue for the period data, I have found that the period of each wave is simply Math.Power(waveIndex, -1). So for the wave periods I do not have to store the data, I can simply regenerate in code.
I cannot currently find a relationship between the amplitudes within the sine wave or even a relationship between cos and sine amplitudes, however the theory and math behind FFT is beyond me so I am hoping that there is a simply formula or concept I can implement.
Following the replies I have added the below code that I use to get the sine and cos wave values, this code snippet may help those replying.
internal void GetSineAndCosWavesBasic(double[] outReal, double[] outImag, int numWaves, out double[,] sineValues, out double[,] cosValues)
{
// the real and imag values from Cooley-Tukey decimation-in-time radix-2 FFT are passed in
// and we want to generate the cos and sine values for each sample for each wave
var length = outReal.Length;
var lengthDouble = (double)length;
var halfLength = lengthDouble / 2.0;
sineValues = new double[numWaves, length];
cosValues = new double[numWaves, length];
var Pi2 = 2 * Math.PI;
for (var waveIdx = 0; waveIdx < numWaves; waveIdx++)
{
for (var sampleIdx = 0; sampleIdx < length; sampleIdx++)
{
// first value case and middle value case
var reX = outReal[waveIdx] / halfLength;
if (sampleIdx == 0)
{
reX = outReal[waveIdx] / lengthDouble;
}
else if (sampleIdx == halfLength)
{
reX = outReal[waveIdx] / lengthDouble;
}
// precompute the value that gets sine/cos applied
var tmp = (Pi2 * waveIdx * sampleIdx) / lengthDouble;
// get the instant cos and sine values
var valueCos = Math.Cos(tmp) * reX;
var valueSin = Math.Sin(tmp) * (-outImag[waveIdx] / halfLength);
// update the sine and cos values for this wave for this sample
cosValues[waveIdx, sampleIdx] = valueCos;
sineValues[waveIdx, sampleIdx] = valueSin;
}
}
}
And the below is how I get the magnitude and phase values, although I do not currently use those anywhere.
internal void CalculateMagAndPhaseBasic(double[] outReal, double[] outImag, out double[] mag, out double[] phase)
{
// the real and imag values from Cooley-Tukey decimation-in-time radix-2 FFT are passed in
// and we want to generate the magnitude and phase values
var length = outReal.Length;
mag = new double[(length / 2) +1];
phase = new double[(length / 2) + 1];
for (var i = 0; i <= length / 2; i++)
{
mag[i] = Math.Pow((outReal[i] * outReal[i]) + (outImag[i] * outImag[i]), 0.5);
phase[i] = Math.Atan2(outImag[i], outReal[i]);
}
}
Actually the fft just returns you complex coefficients S(w)=a+jb
For an N point fft, abs(S(w)) * 2/N will be (close to) the amplitude of the sinusoidal component at frequency w.
This assumes that the sinusoidal component has a frequency close to the center of the fft bin, otherwise the power will be "split" between two adjacent bins.
And that the frequency you're interested in is present through all the fft window.
The output of an FFT has the same number of degrees of freedom as the input. There is no simple formula (other than the FFT itself) that relates the FFT results to just each other, as all of the FFT outputs can change if any of the FFT inputs changes.
The relationship between the sine and cosine of each FFT complex bin result is related to the phase of the sinusoidal input component at that frequency (of the bin center), circularly relative to the start and end. If the phase changes, so can both the sine and cosine component. See: atan2()

reducing bpp of an image using Scilab

Hi guys,
I'm trying to reduce the number of bits per pixel to below 8, on gray scale images using Scilab
Is this possible?
If so, how can I do this?
Thank you.
I think it is not possible. The integer types available in Scilab are one or multiple bytes, see types here.
If you are looking to loose the high frequency information, you could shift out information.
Pseudo implementation
for x=1:width
for y=1:height
// Get pixel and make a 1 byte integer
pixel = int8(picture(x,y))
//Display bits
disp( dec2bin(pixel) )
// We start out with 8 bits - 4 = 4 bits info
bits_to_shift = 4
shifted_down_pixel = pixel/(2^bits_to_shift)
//Display shifted down
disp( dec2bin(shifted_down_pixel))
//Shift it back
shifted_back_pixel = pixel*(2^bits_to_shift)
disp( dec2bin(shifted_back_pixel))
// Replace old pixel with new
picture(x,y) = shifted_back_pixel
end
end
Of course you can do the above code much faster with one big matrix operation, but it is to show the concept.
Working example
rgb = imread('your_image.png')
gry = rgb2gray(rgb)
gry8bit = im2uint8(gry)
function result = reduce_bits(img, bits)
reduced = img / (2^bits);
result = reduced * (2^bits);
return result;
endfunction
gry2bit = reduce_bits(gry8bit, 6)
imshow(gry2bit)

More precise frequency from FFT with pure sine tones

I'm currently using FFT code from here:
https://github.com/syedhali/EZAudio/tree/master/EZAudioExamples/iOS/EZAudioFFTExample
Here's the code from the 2 relevant methods:
-(void)createFFTWithBufferSize:(float)bufferSize withAudioData:(float*)data {
// Setup the length
_log2n = log2f(bufferSize);
// Calculate the weights array. This is a one-off operation.
_FFTSetup = vDSP_create_fftsetup(_log2n, FFT_RADIX2);
// For an FFT, numSamples must be a power of 2, i.e. is always even
int nOver2 = bufferSize/2;
// Populate *window with the values for a hamming window function
float *window = (float *)malloc(sizeof(float)*bufferSize);
vDSP_hamm_window(window, bufferSize, 0);
// Window the samples
vDSP_vmul(data, 1, window, 1, data, 1, bufferSize);
free(window);
// Define complex buffer
_A.realp = (float *) malloc(nOver2*sizeof(float));
_A.imagp = (float *) malloc(nOver2*sizeof(float));
}
-(void)updateFFTWithBufferSize:(float)bufferSize withAudioData:(float*)data {
// For an FFT, numSamples must be a power of 2, i.e. is always even
int nOver2 = bufferSize/2;
// Pack samples:
// C(re) -> A[n], C(im) -> A[n+1]
vDSP_ctoz((COMPLEX*)data, 2, &_A, 1, nOver2);
// Perform a forward FFT using fftSetup and A
// Results are returned in A
vDSP_fft_zrip(_FFTSetup, &_A, 1, _log2n, FFT_FORWARD);
// Convert COMPLEX_SPLIT A result to magnitudes
float amp[nOver2];
float maxMag = 0;
for(int i=0; i<nOver2; i++) {
// Calculate the magnitude
float mag = _A.realp[i]*_A.realp[i]+_A.imagp[i]*_A.imagp[i];
maxMag = mag > maxMag ? mag : maxMag;
}
for(int i=0; i<nOver2; i++) {
// Calculate the magnitude
float mag = _A.realp[i]*_A.realp[i]+_A.imagp[i]*_A.imagp[i];
// Bind the value to be less than 1.0 to fit in the graph
amp[i] = [EZAudio MAP:mag leftMin:0.0 leftMax:maxMag rightMin:0.0 rightMax:1.0];
}
I've modified the updateFFTWithBufferSize method above so that I could get the frequency in Hz like this:
for(int i=0; i<nOver2; i++) {
// Calculate the magnitude
float mag = _A.realp[i]*_A.realp[i]+_A.imagp[i]*_A.imagp[i];
if(maxMag < mag) {
_i_max = i;
}
maxMag = mag > maxMag ? mag : maxMag;
}
float frequency = _i_max / bufferSize * 44100;
NSLog(#"FREQUENCY: %f", frequency);
I've generated a few pure sine tones with Audacity at different frequencies to test with. The issue I'm seeing is that the code is returning the same frequency for two different sine tones that are relatively close in value.
For example:
A sine tone generated at 19255Hz will show up from FFT as 19293.750000Hz. So will a sine tone generated at 19330Hz. Something must be off in the calculations.
Any assistance in how I can modify the above code to get a more precise FFT frequency reading for pure sine tones is greatly appreciated.
Thank you!
You can get a rough frequency estimate by fitting a parabolic curve to the 3 FFT bin magnitudes around the peak magnitude bin, and then finding the extrema of that parabola.
A better estimate can be created by using the transform of your FFT window as an interpolation kernel, and doing successive approximation to refine an estimate of the maxima of the interpolated points. (Zero padding and using a much longer FFT will give you a similar type of interpolated estimate.)
The easy way for a stationary signal is, if possible, to just use a longer FFT with more samples that span a longer time interval.
You've got a number of problems going on here:
1) Your frequency axis spacing is fmax/N, or about 80Hz, so you're not going to get a resolution much better than that.
2) You're signal is very close to the Nyquist frequency (ie, 20KHz/44.1KHz is almost 0.5), and when you're this close to the Nyquist limit you need to be very careful if you want accurate results. (That is, at 20KHz, you're only recording about two data points for each full oscillation cycle.)
3) Since 20KHz is at the edge of human hearing (and higher for most people), many microphones don't really worry about it. Here's a measurement for the iPhone.
Perhaps your sampling frequency isn't high enough?
The FFT is a very good method to get a spectrum if you don't know anything about the input. If you know that the input is a pure sine wave, you can do much better. Start off by calculating the FFT to get a rough idea where the sine is. Get the minimum and maximum to estimate the amplitude [or get that from the FFT - square all inputs, add them, take square root] , get the phase at the begin and end given the estimated frequency and amplitude.
In general, you'll find that the phase does not match. That's because the phase at the end is off by 2*Δf * N. f - Δf is a better estimate of the frequency. Keep in mind that such a method is super noise sensitive. The method works because the input is a pure sine wave, and noise is everything but that. Using this method iteratively blows up quickly; you even hit rounding errors (not sinusoidal either)
Another similar trick is subtracting the estimated wave. The difference between two sines is the product of two sines, one with the frequencies added (in your case, ±38.5 kHz) and one with the frequencies subtracted (Δ_f_, less than 100 Hz). See also Heterodyne detection

Fast Gaussian blur on unsigned char image- ARM Neon Intrinsics- iOS Dev

Can someone tell me a fast function to find the gaussian blur of an image using a 5x5 mask. I need it for iOS app dev. I am working directly on the memory of the image defined as
unsigned char *image_sqr_Baseaaddr = (unsigned char *) malloc(noOfPixels);
for (row = 2; row < H-2; row++)
{
for (col = 2; col < W-2; col++)
{
newPixel = 0;
for (rowOffset=-2; rowOffset<=2; rowOffset++)
{
for (colOffset=-2; colOffset<=2; colOffset++)
{
rowTotal = row + rowOffset;
colTotal = col + colOffset;
iOffset = (unsigned long)(rowTotal*W + colTotal);
newPixel += (*(imgData + iOffset)) * gaussianMask[2 + rowOffset][2 + colOffset];
}
}
i = (unsigned long)(row*W + col);
*(imgData + i) = newPixel / 159;
}
}
This is obviously the slowest function possible. I heard that ARM Neon intrinsics on the iOS can be used to make several operations in 1 cycle. Maybe that's the way to go ?
The problem is that I am not very familiar and don't have enough time to learn assembly language at the moment. So it would be great if anyone can post a Neon intrinsics code for the problem mentioned above or any other fast implementation in C/C++.
Before you get into SIMD optimisation with NEON you should first improve your scalar implementation. The biggest problem with your code as it stands is that it has been implemented as if it were a non-separable filter, whereas a Gaussian kernel is separable. By switching to a separable implementation you reduce the number of operations form N^2 to 2N, which in your case of a 5x5 kernel would be a reduction from 25 multiply-adds to 10, i.e. a 2.5x speed up for very little effort.
It may be that a sufficiently optimised scalar implementation will meet your needs without the need to resort to SIMD. If not then you can at least carry these scalar optimisations over into a vectorized implementation.
http://en.wikipedia.org/wiki/Gaussian_blur
http://blogs.mathworks.com/steve/2006/11/28/separable-convolution-part-2/
Separate your kernel, as described by Paul R.
Don't re-invent the wheel. Use vImage, which is part of the Accelerate framework, and implements a vectorized, multi-threaded convolution for you. Specifically, it seems like you want the function vImageConvolve_Planar8.

Laplacian of gaussian filter use

This is a formula for LoG filtering:
(source: ed.ac.uk)
Also in applications with LoG filtering I see that function is called with only one parameter:
sigma(σ).
I want to try LoG filtering using that formula (previous attempt was by gaussian filter and then laplacian filter with some filter-window size )
But looking at that formula I can't understand how the size of filter is connected with this formula, does it mean that the filter size is fixed?
Can you explain how to use it?
As you've probably figured out by now from the other answers and links, LoG filter detects edges and lines in the image. What is still missing is an explanation of what σ is.
σ is the scale of the filter. Is a one-pixel-wide line a line or noise? Is a line 6 pixels wide a line or an object with two distinct parallel edges? Is a gradient that changes from black to white across 6 or 8 pixels an edge or just a gradient? It's something you have to decide, and the value of σ reflects your decision — the larger σ is the wider are the lines, the smoother the edges, and more noise is ignored.
Do not get confused between the scale of the filter (σ) and the size of the discrete approximation (usually called stencil). In Paul's link σ=1.4 and the stencil size is 9. While it is usually reasonable to use stencil size of 4σ to 6σ, these two quantities are quite independent. A larger stencil provides better approximation of the filter, but in most cases you don't need a very good approximation.
This was something that confused me too, and it wasn't until I had to do the same as you for a uni project that I understood what you were supposed to do with the formula!
You can use this formula to generate a discrete LoG filter. If you write a bit of code to implement that formula, you can then to generate a filter for use in image convolution. To generate, say a 5x5 template, simply call the code with x and y ranging from -2 to +2.
This will generate the values to use in a LoG template. If you graph the values this produces you should see the "mexican hat" shape typical of this filter, like so:
(source: ed.ac.uk)
You can fine tune the template by changing how wide it is (the size) and the sigma value (how broad the peak is). The wider and broader the template the less affected by noise the result will be because it will operate over a wider area.
Once you have the filter, you can apply it to the image by convolving the template with the image. If you've not done this before, check out these few tutorials.
java applet tutorials more mathsy.
Essentially, at each pixel location, you "place" your convolution template, centred at that pixel. You then multiply the surrounding pixel values by the corresponding "pixel" in the template and add up the result. This is then the new pixel value at that location (typically you also have to normalise (scale) the output to bring it back into the correct value range).
The code below gives a rough idea of how you might implement this. Please forgive any mistakes / typos etc. as it hasn't been tested.
I hope this helps.
private float LoG(float x, float y, float sigma)
{
// implement formula here
return (1 / (Math.PI * sigma*sigma*sigma*sigma)) * //etc etc - also, can't remember the code for "to the power of" off hand
}
private void GenerateTemplate(int templateSize, float sigma)
{
// Make sure it's an odd number for convenience
if(templateSize % 2 == 1)
{
// Create the data array
float[][] template = new float[templateSize][templatesize];
// Work out the "min and max" values. Log is centered around 0, 0
// so, for a size 5 template (say) we want to get the values from
// -2 to +2, ie: -2, -1, 0, +1, +2 and feed those into the formula.
int min = Math.Ceil(-templateSize / 2) - 1;
int max = Math.Floor(templateSize / 2) + 1;
// We also need a count to index into the data array...
int xCount = 0;
int yCount = 0;
for(int x = min; x <= max; ++x)
{
for(int y = min; y <= max; ++y)
{
// Get the LoG value for this (x,y) pair
template[xCount][yCount] = LoG(x, y, sigma);
++yCount;
}
++xCount;
}
}
}
Just for visualization purposes, here is a simple Matlab 3D colored plot of the Laplacian of Gaussian (Mexican Hat) wavelet. You can change the sigma(σ) parameter and see its effect on the shape of the graph:
sigmaSq = 0.5 % Square of σ parameter
[x y] = meshgrid(linspace(-3,3), linspace(-3,3));
z = (-1/(pi*(sigmaSq^2))) .* (1-((x.^2+y.^2)/(2*sigmaSq))) .*exp(-(x.^2+y.^2)/(2*sigmaSq));
surf(x,y,z)
You could also compare the effects of the sigma parameter on the Mexican Hat doing the following:
t = -5:0.01:5;
sigma = 0.5;
mexhat05 = exp(-t.*t/(2*sigma*sigma)) * 2 .*(t.*t/(sigma*sigma) - 1) / (pi^(1/4)*sqrt(3*sigma));
sigma = 1;
mexhat1 = exp(-t.*t/(2*sigma*sigma)) * 2 .*(t.*t/(sigma*sigma) - 1) / (pi^(1/4)*sqrt(3*sigma));
sigma = 2;
mexhat2 = exp(-t.*t/(2*sigma*sigma)) * 2 .*(t.*t/(sigma*sigma) - 1) / (pi^(1/4)*sqrt(3*sigma));
plot(t, mexhat05, 'r', ...
t, mexhat1, 'b', ...
t, mexhat2, 'g');
Or simply use the Wavelet toolbox provided by Matlab as follows:
lb = -5; ub = 5; n = 1000;
[psi,x] = mexihat(lb,ub,n);
plot(x,psi), title('Mexican hat wavelet')
I found this useful when implementing this for edge detection in computer vision. Although not the exact answer, hope this helps.
It appears to be a continuous circular filter whose radius is sqrt(2) * sigma. If you want to implement this for image processing you'll need to approximate it.
There's an example for sigma = 1.4 here: http://homepages.inf.ed.ac.uk/rbf/HIPR2/log.htm

Resources