Audio Processing with Accelerate and vDSP_desamp() - ios

I am totally new to the vDSP framework and I am trying to learn by building. My goal is for the signal to be processed in the following way:
100th order Band Pass FIR
Downsampling by factor: 2
From what I could understand from Apple's documentation the function vDSP_desamp() is what I am looking for (it can do both steps at the same time, right?)
How would I use this correctly?
Here are my thoughts:
Given an AudioBufferList *audio and an array of filter coefficients filterCoeffs with length [101]:
vDSP_desamp((float*)audio->mBuffers[0].mData, 2, &filterCoeffs, (float*)audio->mBuffers[0].mData, frames, 101);
would this be a correct use of the method?
Do I need to implement a circular buffer for this process?
Any guidance /direction /pointer to something to read would be very welcome.
thanks

Reading the documentation, vDSP_desamp() is indeed a compound decimation and FIR operation. Doing both together is a good idea as it reduces memory access and there is scope for eliminating a lot of computation.
The assumption here is that the FIR filter has been recast with a (P-1)/2 group delay. The consequence of this is that to calculate C(n) the function needs access to A(n*I+p) for every p from 0 to P-1,
Where (using the terminology of the documentation):
`A[0..x-1]`: input sample array
`C[0..n-1]`: output sample array
`P`: number of filter coefficients
`I`: Decimation factor
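Clearly, if you pass a CoreAudio buffer to this, it'll run off the end of the buffer: producing frames/2 outputs needs (frames/2 - 1)*2 + 101 = frames + 99 input samples, so it overruns by roughly 100 input samples (and by far more if you ask for frames outputs, as in your call). At best this yields about 50 garbage output samples, and at worst a SIGSEGV.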
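To make the indexing concrete, here is a pure-Python reference model of the sum described above, C(n) = sum over p of A(n*I+p) * F(p). It is only a sketch for reasoning about buffer lengths, not the vDSP API; the names desamp_ref and required_input_length are made up for illustration.

```python
def required_input_length(n_out, decimation, p_len):
    """Highest index read is (n_out - 1) * I + (P - 1); the length is that plus 1."""
    if n_out <= 0:
        return 0
    return (n_out - 1) * decimation + p_len


def desamp_ref(a, decimation, coeffs, n_out):
    """Reference model of C[n] = sum_p A[n*I + p] * F[p] for p = 0..P-1."""
    p_len = len(coeffs)
    needed = required_input_length(n_out, decimation, p_len)
    if len(a) < needed:
        # The real routine would simply read past the end of the buffer here.
        raise ValueError(f"need {needed} input samples, got {len(a)}")
    return [
        sum(a[n * decimation + p] * coeffs[p] for p in range(p_len))
        for n in range(n_out)
    ]
```

With these definitions, 256 outputs at I = 2 and P = 101 need 611 input samples, which is more than a 512-frame buffer holds.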
So, the simple answer is NO. You cannot use vDSP_desamp() alone.
Your options are:
Assemble the samples needed into a buffer and then call vDSP_desamp() for N output samples. This involves copying samples from two CoreAudio buffers. If you're worried about latency, recast the FIR to use the 100 previous samples; alternatively, they could come from the next buffer.
Use vDSP_desamp() for what you can, and calculate the more complex case when the filter wraps over the two buffers.
Two calls to vDSP_desamp(): one with the easy case, and another with an assembled input buffer where samples wrap adjacent CoreAudio buffers.
I don't see how you can use a circular buffer to solve this problem: you still have the case where the buffer wraps to deal with, and still need to copy all samples into it.
Which is faster rather depends on the size of the audio buffers presented by CoreAudio. My hunch is that for small buffers, and a small filter length, vDSP_desamp() possibly isn't worth it, but you're going to need to measure to be sure.
When I've implemented this kind of thing in the past on iOS, I've found a hand-rolled decimation and filter operation to be fairly insignificant in the grand scheme of things, and didn't bother optimizing further.
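The first option (assembling the needed samples into one buffer) can be sketched in pure Python as below. This is a model of the bookkeeping only, under the assumption that leftover input samples are carried over to the next callback; StreamingDecimator is a hypothetical name, and in a real app the inner sum is the part you would hand to vDSP_desamp().

```python
class StreamingDecimator:
    """Filter-and-decimate across buffer boundaries by carrying leftover input."""

    def __init__(self, coeffs, decimation):
        if len(coeffs) < decimation:
            # Simplification for this sketch: the filter spans at least one step.
            raise ValueError("sketch assumes len(coeffs) >= decimation")
        self.coeffs = list(coeffs)
        self.decimation = decimation
        self.pending = []  # input samples still needed by future outputs

    def process(self, block):
        buf = self.pending + list(block)  # assembled input: old tail + new block
        p_len = len(self.coeffs)
        step = self.decimation
        # Largest n such that (n - 1) * step + p_len <= len(buf).
        n_out = (len(buf) - p_len) // step + 1 if len(buf) >= p_len else 0
        out = [
            sum(buf[n * step + p] * self.coeffs[p] for p in range(p_len))
            for n in range(n_out)
        ]
        # Keep everything from the start of the next output's window onward.
        self.pending = buf[n_out * step:]
        return out
```

Feeding the same signal in small blocks or in one go gives the same outputs; the price is that the outputs depending on the end of a block only appear once the next block arrives.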

Related

How to generate a waveform table for quicker realtime audio synthesis

I developed an app a few months back for iOS devices that generates real-time, harmonic-rich drones. It works fine on newer devices, but it's running into buffer underruns on slower devices. I need to optimize this thing and need some mental help. Here's a super basic overview of what I'm currently doing:
Create an "Oscillator Bank" that consists of X number of harmonics (simply calculated from a given fundamental frequency. Nothing fancy here.)
Inside my DAC function that spits out samples to an iOS audio buffer, I call a "GetNextSample()" function that goes through the bank of sine oscillators, calculates the sample for each one and adds them up. Some simple additive synthesis.
Enjoy the beauty of the drone.
Again, it works great, until it doesn't. I'd like to optimize this thing so I'm not using brute additive synthesis of real-time calculated sine waves. If I limit the number of harmonics ("banks") to 2, it'll work on the older devices. Not cool. On the newer devices, it underruns around 50 harmonics. Not too bad. But if I want to play multiple drones at once to create some chords, that's too much processing power.... so...
Should I generate waveform tables to just loop through instead of constant calculation? (I assume yes...)
Should I convert my usage of double-precision floating point to integer based calculations? (I assume yes...)
And my big algorithmic question (being pretty non-mathematical):
If I use a waveform table, how do I accurately determine how long the wave/table should be?? In my experience developing this app, if I just go to the end of a period (2*PI) and start over again, resetting the phase back to 0, I get a sound artifact, since I'm force offsetting the phase. In other words, I can't guarantee that one period will give me the right results...
Maybe I'm over complicating things... What's the standard way of doing quick, processor friendly real-time synth of multiple added sines?
I'll keep poking around in the meantime.
Thanks!
Have you increased the buffer size (or can you? I'm not an iOS person)? That might give you enough headroom that you do not need this. Otherwise, yes, wavetable synthesis is a viable approach. You could calculate a new wavetable from the sum of all the harmonics only when a parameter changes.
I have written such a beast in golang on server side ... for starters yes use single precision floating point
To address table population, I would ensure your implementation is solid by having it synthesize a square wave. Visualize the output for each run as you give it each additional frequency (with its corresponding parameters of amplitude and phase shift) ... by definition a single cycle is enough as long as you are correctly using enough cycles to cover the time period of a sample
It's important to leverage the knowledge that generating an output curve from an input set of sine waves (each with a frequency, amplitude and phase shift) lends itself to doing the reverse ... namely, perform an FFT on that output curve to have the API give you its version of the underlying sine waves (again each with a frequency, amplitude and phase) ... this will confirm your system is accurate
The name of the process you are implementing is the inverse Fourier transform, and there are libraries for this; however, I too prefer rolling my own
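As a rough illustration of the wavetable idea discussed above, here is a minimal Python sketch. It assumes a single-cycle table of 2048 entries (an arbitrary choice), linear interpolation between entries, and a fractional phase that wraps by subtraction rather than being reset to 0, which is one common way to avoid the click described in the question. The names build_wavetable and WavetableOscillator are invented for this example.

```python
import math


def build_wavetable(harmonic_amps, table_size=2048):
    """One cycle of sum_k amp_k * sin(2*pi*k*i/table_size), k = 1, 2, ..."""
    return [
        sum(amp * math.sin(2.0 * math.pi * (k + 1) * i / table_size)
            for k, amp in enumerate(harmonic_amps))
        for i in range(table_size)
    ]


class WavetableOscillator:
    def __init__(self, table, frequency, sample_rate):
        self.table = table
        self.phase = 0.0  # fractional read position, in table entries
        self.increment = frequency * len(table) / sample_rate

    def next_sample(self):
        size = len(self.table)
        i0 = int(self.phase)
        frac = self.phase - i0
        i1 = (i0 + 1) % size  # the neighbour of the last entry is the first one
        value = self.table[i0] + frac * (self.table[i1] - self.table[i0])
        # Wrap but keep the fractional remainder; resetting to 0 would
        # introduce a small phase jump on every cycle.
        self.phase = (self.phase + self.increment) % size
        return value
```

The table only needs rebuilding when the harmonic amplitudes change; per sample, the cost is one interpolated lookup regardless of how many harmonics went into the table.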

How do the cuSPARSE and cuBLAS libraries deal with memory allocated using cudaMallocPitch?

I am implementing a simple routine that performs sparse matrix - dense matrix multiplication using cusparseScsrmm from cuSPARSE. This is part of a bigger application that could allocate memory on GPU using cudaMalloc (more than 99% of the time) or cudaMallocPitch (very rarely used). I have a couple of questions regarding how cuSPARSE deals with pitched memory:
1) I passed in pitched memory into the cuSPARSE routine but the results were incorrect (as expected, since there is no way to pass in the pitch as an argument). Is there a way to get these libraries working with memory allocated using cudaMallocPitch?
2) What is the best way to deal with this? Should I just add a check in the calling function, to enforce that the memory not be allocated using pitched mode?
For sparse matrix operations, the concept of pitched data has no relevance anyway.
For dense matrix operations most operations don't directly support a "pitch" to the data per se, however various operations can operate on a sub-matrix. With a particular caveat, it should be possible for such operations to handle pitched or unpitched data. Any time you see a CUBLAS (or CUSPARSE) operation that accepts "leading dimension" arguments, those arguments could be used to encompass a pitch in the data.
Since the "leading dimension" parameter is specified in matrix elements, and the pitch is (usually) specified in bytes, the caveat here is that the pitch is evenly divisible by the size of the matrix element in question, so that the pitch (in bytes) can be converted to a "leading dimension" parameter specified in matrix elements. I would expect that this would be typically possible for char, int, float, double and similar types, as I believe the pitch quantity returned by cudaMallocPitch will usually be evenly divisible by 16. But there is no stated guarantee of this, so proper run-time checking is advised, if you intend to use this approach.
For example, it should be possible to perform a CUBLAS matrix-matrix multiply (gemm) on pitched data, with appropriate specification of the lda, ldb and ldc parameters.
The operation you indicate does offer such leading dimension parameters for the dense matrices involved.
If 99% of your use-cases don't use pitched data, I would either not support pitched data at all, or else, for operations where no leading dimension parameters are available, copy the pitched data to an unpitched buffer for use in the desired operation. A device-to-device pitched to unpitched copy can run at approximately the rate of memory bandwidth, so it might be fast enough to not be a serious issue for 1% of the use cases.
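The run-time check mentioned above is simple arithmetic, sketched here in Python rather than CUDA. The function names are hypothetical; the second function only models what a pitched-to-unpitched copy does to the layout (in real code that copy would be done on the device, for example with cudaMemcpy2D).

```python
def pitch_to_leading_dimension(pitch_bytes, element_size):
    """Convert a pitch in bytes to a leading dimension in elements, or fail."""
    if pitch_bytes % element_size != 0:
        raise ValueError(
            f"pitch {pitch_bytes} is not a multiple of element size {element_size}"
        )
    return pitch_bytes // element_size


def copy_pitched_to_unpitched(pitched, pitch_elems, width, height):
    """Keep the first `width` elements of each `pitch_elems`-long row."""
    if pitch_elems < width:
        raise ValueError("pitch must be at least the row width")
    return [
        pitched[row * pitch_elems + col]
        for row in range(height)
        for col in range(width)
    ]
```

For example, a 512-byte pitch with 4-byte floats corresponds to a leading dimension of 128 elements, while a pitch that is not a multiple of the element size has to take the copy path instead.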

Simulink: Convert Continuous Signal to Discrete

I am very new to simulink, so this question may seem simple. I am looking for a way to sample a continuous signal every X number of seconds.
Essentially, what I am doing is simulating the principle of a data acquisition unit for a demonstration I am running, but I can't seem to find a block to do this; the nearest thing I can find is the Zero-Order Hold.
What you may need is a combination of two blocks. First, a Quantizer block to discretize the input to a chosen resolution. Second, a Zero-Order Hold block to sample and hold at the chosen sampling rate.
The ordering doesn't seem to be of much importance here.
Also, you can use the Rate Transition block.
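Outside Simulink, the combined effect of the two blocks can be modelled in a few lines of Python. This is only a sketch of the idea (sample every sample_period seconds, round to the nearest multiple of a quantization interval, hold the value until the next sample); the function names are made up and the rounding rule is an assumption, not a statement about the exact behaviour of the Simulink blocks.

```python
import math


def quantize(value, interval):
    """Round to the nearest multiple of `interval` (ties round up)."""
    return interval * math.floor(value / interval + 0.5)


def sample_and_quantize(signal, sample_period, n_samples, quant_interval):
    """Evaluate `signal(t)` at t = 0, Ts, 2*Ts, ... and quantize each sample."""
    return [
        quantize(signal(k * sample_period), quant_interval)
        for k in range(n_samples)
    ]


def held_value(samples, sample_period, t):
    """Zero-order hold: the output at time t is the most recent sample."""
    index = min(int(t // sample_period), len(samples) - 1)
    return samples[index]
```

For instance, sampling the ramp 0.3*t every 0.5 s with a 0.25 resolution gives a staircase whose value between t = 0.5 and t = 1.0 stays at the sample taken at t = 0.5.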

Simulink - Finding index of vector element where accumulation crosses a threshold

I'm looking to improve the delay estimation portion of a Simulink model. The input is an estimated impulse response for the system. I want the index of the first sample of the impulse response where the sum of the absolute values of it and the previous elements exceeds a certain fraction of the total across the whole vector.
Here's my current solution:
The matrix sum runs along dimension 2. The prelookup block is set to clip. This is finding the element (possibly one off, I haven't thought that through yet) where 1% of the total is reached.
This seems overly complicated, and it isn't clear what it is trying to do without some explanation. I tried coming up with a solution based on the discrete integrator/accumulator block but couldn't come up with something better. It certainly does a lot more addition than it needs to with this solution, although performance isn't really an issue right now.
Is there a simpler way to get the running sum across a vector that I could put in place of the Toeplitz->Triangular->Sum section? Is there a better way overall to perform the whole lookup?
If you have DSP System Toolbox, there is a "Cumulative Sum" block which should be able to replace your Toeplitz, triangular matrix and matrix sum blocks.
http://www.mathworks.com/help/dsp/ref/cumulativesum.html
If you do not have DSP System toolbox, I suggest coding this in MATLAB Function block where it should be a one liner.
y = cumsum(x);
While you are there you may also want to code the entire logic in MATLAB Function block which in cases like this is easier to code and understand.
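For reference, the whole lookup (running sum of absolute values, then the first index where it exceeds a fraction of the total) can be sketched as follows. The sketch is in Python rather than MATLAB, so the returned index is 0-based, and first_index_exceeding is a hypothetical name.

```python
from itertools import accumulate


def first_index_exceeding(impulse_response, fraction):
    """0-based index where the running sum of |h| first exceeds fraction * total."""
    running = list(accumulate(abs(v) for v in impulse_response))
    if not running:
        return None
    threshold = fraction * running[-1]
    for index, total in enumerate(running):
        if total > threshold:
            return index
    return None  # e.g. an all-zero vector never exceeds the threshold
```

This does one pass for the cumulative sum and one for the search, which avoids the extra additions of the matrix-based approach.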

Can GLSL perform a recursion formula calculation? Or how can I speed up this formula

I want to implement this formula in my iOS app. Is there any way to use GLSL to speed this formula up? Or can I use Metal or something else to speed it up?
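imageOut[0] = imageIn[0] * b;  /* assumes a zero initial state; imageOut[k-1] is out of bounds at k = 0 */
for (k = 1; k < imageSize; k++) {
    imageOut[k] = imageOut[k-1] * a + imageIn[k] * b;
}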
OpenCL is not available.
This is a classic IIR filter, and the data dependencies cause problems when converting it to SIMD code. This means that you can't do the operation as a simple transform feedback or render-to-texture operation. In other words, the GPU is designed to work on a bunch of data in parallel, but your formula forces the output to be computed serially (you can't compute out[k] without computing out[k-1] first).
I see two ways to optimize this:
You can use SIMD on the CPU. For iOS, this means ARM NEON. See articles like Optimising IIR Filters Using ARM NEON or Parallelization of IIR Filters using SIMD Extensions.
You can redesign the filter as an FIR filter, completely eliminating data dependencies.
Unfortunately, there is no easy translation to GLSL. Maybe you could use Metal instead of NEON, I'm not sure.
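To illustrate the second option, here is a small Python sketch that compares the recursive filter with a truncated FIR version. It assumes a zero initial state and |a| < 1, and picks the FIR length so that the dropped tail of the impulse response b * a^j is below a tolerance; all function names are invented for the example.

```python
import math


def iir_one_pole(x, a, b):
    """out[k] = out[k-1] * a + x[k] * b, starting from a zero state."""
    out, prev = [], 0.0
    for v in x:
        prev = prev * a + v * b
        out.append(prev)
    return out


def truncated_impulse_response(a, b, tol=1e-6):
    """Taps b * a**j, kept until |a|**j drops below `tol`."""
    if not 0.0 < abs(a) < 1.0:
        raise ValueError("sketch assumes 0 < |a| < 1")
    length = math.ceil(math.log(tol) / math.log(abs(a)))
    return [b * a ** j for j in range(length)]


def fir_filter(x, h):
    """out[k] = sum_j h[j] * x[k-j]; every out[k] is independent of the others."""
    return [
        sum(h[j] * x[k - j] for j in range(min(len(h), k + 1)))
        for k in range(len(x))
    ]
```

Each FIR output depends only on inputs, so the outputs can be computed in parallel; the trade-off is more multiply-adds per output, and the number of taps grows quickly as |a| approaches 1.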
What you have there, as Dietrich Epp already pointed out, is an IIR filter. Now, on a computer there's no such thing as "infinite"; you're always limited by number precision, memory, available computational time, etc. Even if you executed that loop ad infinitum, the limited precision of your typical number representation means you'll lose anything meaningful to roundoff errors quite early on.
So let's be honest about it and call it an FIR filter with a very long response time. Can those be parallelized? Yes, they can, but for that we have to leave the time domain and look at it from the frequency domain.
Assume you can model the response of a system (= filter) to all the possible signals there are; then "playing back" that response based on the signal gives you the desired output. In the frequency domain that would be a "recording" of the system's response to a broadband signal covering all frequencies. But that signal is just a simple impulse. That's where the terms FIR and IIR get their middle I from.
Now, applying the impulse response of the system to an arbitrary signal by means of a convolution gives you the system's response to that signal. However, calculating a convolution in the time domain is the same as multiplying the Fourier transform of the signal with the Fourier transform of the impulse response and transforming the result back, i.e.
s * r = F^-1(F(s) · F(r))
And Fourier transforms are one of the things that can be well parallelized and GPUs are really quite good at doing.
Now there are GLSL based Fourier transform codes, but normally these are written in OpenCL or CUDA to run on GPUs.
Anyway, here's the recipe for you:
Determine the cutoff k at which your IIR becomes indistinguishable from an FIR. Determine the Fourier transform of the impulse response (= complex spectral response, CSR). Fourier transform the signal (= image), multiply it by the CSR, and transform back.
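The recipe can be tried out on the CPU with a small pure-Python sketch. It uses a textbook recursive radix-2 FFT (so lengths are padded to a power of two) and zero-pads both inputs to at least the full convolution length so that the circular convolution implied by the FFT does not wrap around. The names fft and fft_convolve are local to this example and not taken from any library.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT; len(x) must be a power of two. No 1/N scaling."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


def fft_convolve(signal, impulse_response):
    """Linear convolution via F^-1(F(s) * F(r)) with zero-padding."""
    n_full = len(signal) + len(impulse_response) - 1
    size = 1
    while size < n_full:
        size *= 2
    s = [complex(v) for v in signal] + [0j] * (size - len(signal))
    r = [complex(v) for v in impulse_response] + [0j] * (size - len(impulse_response))
    spectrum = [p * q for p, q in zip(fft(s), fft(r))]
    result = fft(spectrum, inverse=True)
    return [v.real / size for v in result[:n_full]]
```

With the impulse response set to the truncated b * a^j sequence, the first len(signal) outputs approximate what the recursive loop produces; on a GPU the same three transforms would be done by a parallel FFT.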

Resources