adding "audible ticks" to a waveform for onset detection debugging - waveform

I'm playing around with some onset/beat detection algorithms on my own. My input is a .wav file and my output is a .wav file; I have access to the entire waveform in chunks of float[] arrays.
I'm having trouble coming up with a good way to debug and evaluate my algorithms. Since my input and output are both auditory, I thought it'd make the most sense if my debugging facility was also auditory, eg. by means of adding audible "ticks" or "beeps" to the .wav file at onset points.
Does anyone have any ideas on how to do this? Ideally, it would be a simple for-loop that I'd run a couple hundred or couple thousand samples through.

float * sample = first sample where beep is to be mixed in
float const beep_duration = desired beep duration in seconds
float const sample_rate = sampling rate in samples per second
float const frequency = desired beep frequency, Hz
float const PI = 3.1415926..
float const volume = desired beep volume
for( int index = 0; index < (int)(beep_duration * sample_rate); index++ )
{
sample[index] +=
sin( float(index) * 2.f * PI * sample_rate / frequency ) * volume;
}

Poor man's answer: find a recording of a tick or beep, then mix it with the original waveform at each desired moment. You mix by simply averaging the values of the beep and the input waveform for the duration of the beep.

Figure out where in your sample you want to insert your tick (include the length of the tick, so this is a range, not a point). Take the FFT of that section of the waveform. Add to the frequency domain representation whatever frequency components you desire for your "tick" sound (simplest would be just a single frequency tone). Perform the inverse FFT on the result and voila, you have your tone mixed into the original signal. I think (it's been a while since I've done this).

Related

Creating a phase vocoder with the webaudio API using analyzer and periodic wave

I was wondering if it would be possible to poll the AnalyzerNode from the WebAudio API and use it to construct a PeriodicWave that is synthesized via an OscillatorNode?
My intuition is that something about the difference in amplitudes between analyzer frames can help calculate the right phase for a PeriodicWave, but I'm not sure how to go about implementing it. Any help on the right algorithm to use would be appreciated!
As luck would have it, I was working on a similar project just a few weeks ago. I put together a JSFiddle to explore the idea of reconstructing a phase-randomized version of a waveform using frequency data from an AnalyserNode. You can find that experiment here:
https://jsfiddle.net/mattdiamond/w4u7x8zk/
Here's the code that takes in the frequency data output from an AnalyserNode and generates a PeriodicWave:
function generatePeriodicWave(freqData) {
const real = [];
const imag = [];
freqData.forEach((x, i) => {
const amp = fromDecibels(x);
const phase = getRandomPhase();
real.push(amp * Math.cos(phase));
imag.push(amp * Math.sin(phase));
});
return context.createPeriodicWave(real, imag);
}
function fromDecibels(x) {
return 10 ** (x / 20);
}
function getRandomPhase() {
return Math.random() * 2 * Math.PI - Math.PI;
}
Since the AnalyserNode converts the FFT amplitude values to decibels, we need to recover those original values first (which we do by simply using the inverse of the formula that was used to convert them to decibels). We also need to provide a phase for each frequency, which we select at random from the range -π to π.
Now that we have an amplitude and phase, we construct a complex number by multiplying the amplitude by the cosine and sine of the phase. This is because the amplitude and phase correspond to a polar coordinate, and createPeriodicWave expects a list of real and imaginary numbers corresponding to Cartesian coordinates in the complex plane. (See here for more information on the mathematics behind this conversion.)
Once we've generated the PeriodicWave, all that's left to do is load it into an OscillatorNode, set the desired frequency, and start the oscillator. You'll notice that the default frequency is set to context.sampleRate / FFT_SIZE (you can ignore the toFixed, that was just for the sake of the UI). This causes the oscillator to play the wave at the same rate as the original samples. Increasing or decreasing the frequency from this value will pitch-shift the audio up or down, respectively.
You'll also notice that I chose 2^15 as the FFT size, which is the maximum size that the AnalyserNode allows. For my purposes -- creating interesting looped drones -- a larger FFT results in a more interesting and less "loopy" drone. (A while back I created a webpage that allowed users to generate drones from much larger FFTs... that experiment utilized a third-party FFT library instead of the AnalyserNode.) I'm not sure if this is the right FFT size for your purposes, but it's something to consider.
Anyway, I think that covers the core of the algorithm. Hope this helps! (And feel free to ask more questions in the comments if anything's unclear.)

AurioTouch2 sound frequency

Is there a way to get the frequency value in Apple's example project AurioTouch2? I've been looking all over for an answer to this and couldn't find one. Thanks
Maybe it helps
1.
The try-it-out way
FFTs use frequency bins and the bin frequency width is based on the FFT parameters. To find a frequency you will need to record it sampled at a rate at least twice the highest frequency present in the sample. Then find the time between the cycles. If it is not a pure frequency this will of course be harder.
2.
I heard of the frequency of each bin is
yFract * hwSampleRate * 1/2
yFract is the half of fftLength and the last bin of the FFT corresponds with half of the sampling rate. Using it like this way
NSLog(#"The magnitude for %f Hz is %f.", (yFract * hwSampleRate * .5), (interpVal * 120));

What actually does the size of FFT mean

While using FFT sample code from Apple documentation, what actually does the N, log2n, n and nOver2 mean?
Does N refer to the window size of the fft or the whole number of samples in a given audio, and
how do I calculate N from an audio file?
how are they related to the audio sampling rate i.e. 44.1kHz?
What would be the FFT frame size in this code?
Code:
/* Set the size of FFT. */
log2n = N;
n = 1 << log2n;
stride = 1;
nOver2 = n / 2;
printf("1D real FFT of length log2 ( %d ) = %d\n\n", n, log2n);
/* Allocate memory for the input operands and check its availability,
* use the vector version to get 16-byte alignment. */
A.realp = (float *) malloc(nOver2 * sizeof(float));
A.imagp = (float *) malloc(nOver2 * sizeof(float));
originalReal = (float *) malloc(n * sizeof(float));
obtainedReal = (float *) malloc(n * sizeof(float));
N or n typically refers to the number of elements. log2n is the base-two logarithm of n. (The base-two logarithm of 32 is 5.) nOver2 is n/2, n divided by two.
In the context of an FFT, n is the number of samples being fed into the FFT.
n is usually determined by a variety of constraints. You want more samples to provide a better quality result, but you do not want so many samples that processing takes up a lot of computer time or that the result is not available until so late that the user notices a lag. Usually, it is not the length of an audio file that determines the size. Rather, you design a “window” that you will use for processing, then you read samples from the audio file into a buffer big enough to hold your window, then you process the buffer, then you repeat with more samples from the file. Repetitions continue until the entire file is processed.
A higher audio sampling rate means there will be more samples in a given period of time. E.g., if you want to keep your window under 1/30th of a second, then a 44.1 kHz sampling rate will have less than 44.1•1000/30 = 1470 samples. A higher sampling rate means you have more work to do, so you may need to adjust your window size to keep the processing within limits.
That code uses N for log2n, which is unfortunate, since it may confuse people. Otherwise, the code is as I described above, and the FFT frame size is n.
There can be some confusion about FFT size or length when a mix of real data and complex data is involved. Typically, for a real-to-complex FFT, the number of real elements is said to be the length. When doing a complex-to-complex FFT, the number of complex elements is the length.
'N' is the number of samples, i.e., your vector size. Corresponding, 'log2N' is the logarithm of 'N' with the base 2, and 'nOver2' is the half of 'N'.
To answer the other questions, one must know, what do you want to do with FFT. This document, even it is written with a specific system in mind, can serve as an survey about the relation and the meaning of the parameters in (D)FFT.

Does FFT neccessary to find peaks and pits on audio files

I'm able to read a wav files and its values. I need to find peaks and pits positions and their values. First time, i tried to smooth it by (i-1 + i + i +1) / 3 formula then searching on array as array[i-1] > array[i] & direction == 'up' --> pits style solution but because of noise and other reasons of future calculations of project, I'm tring to find better working area. Since couple days, I'm researching FFT. As my understanding, fft translates the audio files to series of sines and cosines. After fft operation the given values is a0's and a1's for a0 + ak * cos(k*x) + bk * sin(k*x) which k++ and x++ as this picture
http://zone.ni.com/images/reference/en-XX/help/371361E-01/loc_eps_sigadd3freqcomp.gif
My question is, does fft helps to me find peaks and pits on audio? Does anybody has a experience for this kind of problems?
It depends on exactly what you are trying to do, which you haven't really made clear. "finding the peaks and pits" is one thing, but since there might be various reasons for doing this there might be various methods. You already tried the straightforward thing of actually looking for the local maximum and minima, it sounds like. Here are some tips:
you do not need the FFT.
audio data usually swings above and below zero (there are exceptions, including 8-bit wavs, which are unsigned, but these are exceptions), so you must be aware of positive and negative values. Generally, large positive and large negative values carry large amounts of energy, though, so you want to count those as the same.
due to #2, if you want to average, you might want to take the average of the absolute value, or more commonly, the average of the square. Once you find the average of the squares, take the square root of that value and this gives the RMS, which is related to the power of the signal, so you might do something like this is you are trying to indicate signal loudness, intensity or approximate an analog meter. The average of absolutes may be more robust against extreme values, but is less commonly used.
another approach is to simply look for the peak of the absolute value over some number of samples, this is commonly done when drawing waveforms, and for digital "peak" meters. It makes less sense to look at the minimum absolute.
Once you've done something like the above, yes you may want to compute the log of the value you've found in order to display the signal in dB, but make sure you use the right formula. 10 * log_10( amplitude ) is not it. Rule of thumb: usually when computing logs from amplitude you will see a 20, not a 10. If you want to compute dBFS (the amount of "headroom" before clipping, which is the standard measurement for digital meters), the formula is -20 * log_10( |amplitude| ), where amplitude is normalize to +/- 1. Watch out for amplitude = 0, which gives an infinite headroom in dB.
If I understand you correctly, you just want to estimate the relative loudness/quietness of an audio digital sample at a given point.
For this estimation, you don't need to use FFT. However your method of averaging the signal does not produce the appropiate picture neither.
The digital signal is the value of the audio wave at a given moment. You need to find the overall amplitude of the signal at that given moment. You can somewhat see it as the local maximum value for a given interval around the moment you want to calculate. You may have a moving max for the signal and get your amplitude estimation.
At a 16 bit sound sample, the sound signal value can go from 0 up to 32767. At a 44.1 kHz sample rate, you can find peaks and pits of around 0.01 secs by finding the max value of 441 samples around a given t moment.
max=1;
for (i=0; i<441; i++) if (array[t*44100+i]>max) max=array[t*44100+i];
then for representing it on a 0 to 1 scale you (not really 0, because we used a minimum of 1)
amplitude = max / 32767;
or you might represent it in relative dB logarithmic scale (here you see why we used 1 for the minimum value)
dB = 20 * log10(amplitude);
all you need to do is take dy/dx, which can getapproximately by just scanning through the wave and and subtracting the previous value from the current one and look at where it goes to zero or changes from positive to negative
in this code I made it really brief and unintelligent for sake of brevity, of course you could handle cases of dy being zero better, find the 'centre' of a long section of a flat peak, that kind of thing. But if all you need is basic peaks and troughs, this will find them.
lastY=0;
bool goingup=true;
for( i=0; i < wave.length; i++ ) {
y = wave[i];
dy = y - lastY;
bool stillgoingup = (dy>0);
if( goingup != direction ) {
// changed direction - note value of i(place) and 'y'(height)
stillgoingup = goingup;
}
}

Audio Analysis for Sheet Music

I'm currently working on a program that analyses a wav file of a solo musician playing an instrument and detects the notes within it. To do this it performs an FFT and then looks at the data produced. The goal is to (at some point) produce the sheet music by writing a midi file.
I just wanted to get a few opinions on what might be difficult about it, whether anyones tried it before, maybe a few things it would be good to research. At the moment my biggest struggle is that not all notes are purely one frequency and I cannot yet detect chords; just single notes. Also there has to be a pause between the notes I am detecting so I know for sure one has ended and the other started. Any comments on this would also be very welcome!
This is the code I use when A new frame comes in from the signal. it looks for the frequency that is most dominant in the sample:
//Get frequency vector for power match
double[] frequencyVectorDoubleArray = Accord.Audio.Tools.GetFrequencyVector(waveSignal.Length, waveSignal.SampleRate);
powerSpectrumDoubleArray[0] = 0; // zero DC
double[,] frequencyPowerDoubleArray = new double[powerSpectrumDoubleArray.Length, 2];
for (int i = 0; i < powerSpectrumDoubleArray.Length; i++)
{
if (frequencyVectorDoubleArray[i] > 15.00)
{
frequencyPowerDoubleArray[i, 0] = frequencyVectorDoubleArray[i];
frequencyPowerDoubleArray[i, 1] = powerSpectrumDoubleArray[i];
}
}
//Method for finding the highest frequency in a sample of frequency domain data
//But I want to filter out stuff
pulsePowerDouble = lowestPowerAcceptedDouble;//0;//lowestPowerAccepted;
int frequencyIndexAtPulseInt = 0;
int oldFrequencyIndexAtPulse = 0;
for (int j = 0; j < frequencyPowerDoubleArray.Length / 2; j++)
{
if (frequencyPowerDoubleArray[j, 1] > pulsePowerDouble)
{
oldPulsePowerDouble = pulsePowerDouble;
pulsePowerDouble = frequencyPowerDoubleArray[j, 1];
oldFrequencyIndexAtPulse = frequencyIndexAtPulseInt;
frequencyIndexAtPulseInt = j;
}
}
foundFreq = frequencyPowerDoubleArray[frequencyIndexAtPulseInt, 0];
1) There is a lot (several decades worth) of research literature on frequency estimation and pitch estimation (which are two different subjects).
2) Peak FFT frequency is not the same as the musical pitch. Some solo musical instruments can produces well over a dozen frequency peaks for just one note, let alone a chord, and with none of the largest peaks anywhere near the musical pitch. For some common instruments, the peaks might not even be mathematically exact harmonics.
3) Using the peak bin of a short unwindowed FFT isn't a great frequency estimator.
4) Note onset detection might require some sophisticated pattern matching, depending on the instrument.
You don't want to focus on the highest frequency, but rather the lowest. Every note from any musical instrument is full of harmonics. Expect to hear the fundamental, and every octave above it. Plus all the second and third harmonics.
Harmonics is what makes a trumpet sound different from a trombone when they are both playing the same note.
Unfortunately this is an extremely hard problem, some of the reasons have already been given. I would start with a literature search (Google Scholar, for instance) for "musical note identification".
If this isn't a spare time project, beware - I have seen masters theses founder on this particular shoal without getting any useful results.

Resources