Audio Analysis for Sheet Music

I'm currently working on a program that analyses a wav file of a solo musician playing an instrument and detects the notes within it. To do this it performs an FFT and then looks at the data produced. The goal is to (at some point) produce the sheet music by writing a midi file.
I just wanted to get a few opinions on what might be difficult about it, whether anyone's tried it before, and maybe a few things it would be good to research. At the moment my biggest struggle is that not all notes are purely one frequency, and I cannot yet detect chords, just single notes. Also, there has to be a pause between the notes I am detecting so I know for sure that one has ended and another has started. Any comments on this would be very welcome!
This is the code I use when a new frame comes in from the signal; it looks for the frequency that is most dominant in the sample:
// powerSpectrumDoubleArray, pulsePowerDouble, oldPulsePowerDouble,
// lowestPowerAcceptedDouble and foundFreq are fields defined elsewhere.

// Get the frequency vector for power matching
double[] frequencyVectorDoubleArray = Accord.Audio.Tools.GetFrequencyVector(waveSignal.Length, waveSignal.SampleRate);
powerSpectrumDoubleArray[0] = 0; // zero out the DC component

// Pair each frequency with its power, ignoring anything below 15 Hz
double[,] frequencyPowerDoubleArray = new double[powerSpectrumDoubleArray.Length, 2];
for (int i = 0; i < powerSpectrumDoubleArray.Length; i++)
{
    if (frequencyVectorDoubleArray[i] > 15.00)
    {
        frequencyPowerDoubleArray[i, 0] = frequencyVectorDoubleArray[i];
        frequencyPowerDoubleArray[i, 1] = powerSpectrumDoubleArray[i];
    }
}

// Find the most powerful frequency in the frequency-domain data,
// filtering out anything weaker than the lowest accepted power
pulsePowerDouble = lowestPowerAcceptedDouble;
int frequencyIndexAtPulseInt = 0;
int oldFrequencyIndexAtPulse = 0;
for (int j = 0; j < frequencyPowerDoubleArray.GetLength(0); j++) // GetLength(0) = number of rows
{
    if (frequencyPowerDoubleArray[j, 1] > pulsePowerDouble)
    {
        oldPulsePowerDouble = pulsePowerDouble;
        pulsePowerDouble = frequencyPowerDoubleArray[j, 1];
        oldFrequencyIndexAtPulse = frequencyIndexAtPulseInt;
        frequencyIndexAtPulseInt = j;
    }
}
foundFreq = frequencyPowerDoubleArray[frequencyIndexAtPulseInt, 0];

1) There is a lot (several decades' worth) of research literature on frequency estimation and pitch estimation (which are two different subjects).
2) The peak FFT frequency is not the same as the musical pitch. Some solo musical instruments can produce well over a dozen frequency peaks for just one note, let alone a chord, and with none of the largest peaks anywhere near the musical pitch. For some common instruments, the peaks might not even be mathematically exact harmonics.
3) Using the peak bin of a short unwindowed FFT isn't a great frequency estimator (see the sketch after this list).
4) Note onset detection might require some sophisticated pattern matching, depending on the instrument.
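
On point 3: windowing the frame before the FFT and interpolating around the peak bin gives a much better estimate than the raw peak bin alone. A minimal sketch (in TypeScript for brevity; the function names are mine, and it assumes you already have a magnitude spectrum from the FFT library of your choice):

// Apply a Hann window to the frame (before the FFT) to reduce spectral leakage.
function hannWindow(frame: Float64Array): void {
  const n = frame.length;
  for (let i = 0; i < n; i++) {
    frame[i] *= 0.5 * (1 - Math.cos((2 * Math.PI * i) / (n - 1)));
  }
}

// Refine a peak at bin k by fitting a parabola through the log-magnitudes of
// bins k-1, k, k+1; returns a fractional bin offset in [-0.5, 0.5].
// Assumes 0 < k < mag.length - 1 and strictly positive magnitudes.
function parabolicOffset(mag: Float64Array, k: number): number {
  const a = Math.log(mag[k - 1]);
  const b = Math.log(mag[k]);
  const c = Math.log(mag[k + 1]);
  return (0.5 * (a - c)) / (a - 2 * b + c);
}

// Estimated frequency in Hz: (k + parabolicOffset(mag, k)) * sampleRate / fftSize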

You don't want to focus on the highest frequency, but rather the lowest. Every note from any musical instrument is full of harmonics: expect to hear the fundamental, every octave above it, plus the second and third harmonics.
Harmonics are what make a trumpet sound different from a trombone when they are both playing the same note. One classic way to bias a peak search toward the fundamental is sketched below.
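
The answer above doesn't name a technique, but a classic one for this is the Harmonic Product Spectrum: downsample the magnitude spectrum by factors of 2, 3, ... and multiply, so that bins whose harmonics line up reinforce each other. A minimal TypeScript sketch (the names are mine), assuming `mag` is a magnitude spectrum:

// Returns the index of the bin most likely to be the fundamental.
// Convert to Hz with: bin * sampleRate / fftSize.
function harmonicProductSpectrumPeak(mag: Float64Array, harmonics: number = 4): number {
  const n = Math.floor(mag.length / harmonics);
  let best = 0;
  let bestProduct = -Infinity;
  for (let k = 1; k < n; k++) {
    let product = mag[k];
    for (let h = 2; h <= harmonics; h++) {
      product *= mag[k * h]; // magnitude at the h-th harmonic of bin k
    }
    if (product > bestProduct) {
      bestProduct = product;
      best = k;
    }
  }
  return best;
}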

Unfortunately this is an extremely hard problem; some of the reasons have already been given. I would start with a literature search (Google Scholar, for instance) for "musical note identification".
If this isn't a spare time project, beware - I have seen masters theses founder on this particular shoal without getting any useful results.

Related

Creating a phase vocoder with the Web Audio API using an analyser and a periodic wave

I was wondering if it would be possible to poll the AnalyserNode from the Web Audio API and use it to construct a PeriodicWave that is synthesized via an OscillatorNode?
My intuition is that something about the difference in amplitudes between analyzer frames can help calculate the right phase for a PeriodicWave, but I'm not sure how to go about implementing it. Any help on the right algorithm to use would be appreciated!
As luck would have it, I was working on a similar project just a few weeks ago. I put together a JSFiddle to explore the idea of reconstructing a phase-randomized version of a waveform using frequency data from an AnalyserNode. You can find that experiment here:
https://jsfiddle.net/mattdiamond/w4u7x8zk/
Here's the code that takes in the frequency data output from an AnalyserNode and generates a PeriodicWave:
// `context` is the AudioContext created elsewhere in the fiddle
function generatePeriodicWave(freqData) {
  const real = [];
  const imag = [];
  freqData.forEach((x, i) => {
    const amp = fromDecibels(x);
    const phase = getRandomPhase();
    // polar (amplitude, phase) -> Cartesian (real, imaginary)
    real.push(amp * Math.cos(phase));
    imag.push(amp * Math.sin(phase));
  });
  return context.createPeriodicWave(real, imag);
}

// Inverts the AnalyserNode's conversion to decibels: dB = 20 * log10(amplitude)
function fromDecibels(x) {
  return 10 ** (x / 20);
}

// Uniform random phase in [-PI, PI)
function getRandomPhase() {
  return Math.random() * 2 * Math.PI - Math.PI;
}
Since the AnalyserNode converts the FFT amplitude values to decibels, we need to recover those original values first (which we do by simply using the inverse of the formula that was used to convert them to decibels). We also need to provide a phase for each frequency, which we select at random from the range -π to π.
Now that we have an amplitude and phase, we construct a complex number by multiplying the amplitude by the cosine and sine of the phase. This is because the amplitude and phase correspond to a polar coordinate, and createPeriodicWave expects a list of real and imaginary numbers corresponding to Cartesian coordinates in the complex plane. (See here for more information on the mathematics behind this conversion.)
Once we've generated the PeriodicWave, all that's left to do is load it into an OscillatorNode, set the desired frequency, and start the oscillator. You'll notice that the default frequency is set to context.sampleRate / FFT_SIZE (you can ignore the toFixed, that was just for the sake of the UI). This causes the oscillator to play the wave at the same rate as the original samples. Increasing or decreasing the frequency from this value will pitch-shift the audio up or down, respectively.
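For reference, a minimal usage sketch of that last step; `context`, `freqData`, and FFT_SIZE are assumed to be the AudioContext, the analyser's decibel output, and the analyser's FFT size from the surrounding code:

const osc = context.createOscillator();
osc.setPeriodicWave(generatePeriodicWave(freqData));
osc.frequency.value = context.sampleRate / FFT_SIZE; // play back at the original rate
osc.connect(context.destination);
osc.start();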
You'll also notice that I chose 2^15 as the FFT size, which is the maximum size that the AnalyserNode allows. For my purposes -- creating interesting looped drones -- a larger FFT results in a more interesting and less "loopy" drone. (A while back I created a webpage that allowed users to generate drones from much larger FFTs... that experiment utilized a third-party FFT library instead of the AnalyserNode.) I'm not sure if this is the right FFT size for your purposes, but it's something to consider.
Anyway, I think that covers the core of the algorithm. Hope this helps! (And feel free to ask more questions in the comments if anything's unclear.)

Detect peak value from real time data in an array

I have connected a 3 V sine wave, shifted 1.5 V above zero, to a 12-bit ADC channel of a dsPIC33FJ32MC204 controller, and stored the samples in an array. I want to detect the peak for each interval, so please give me any suggestions. I have posted my logic for detecting the maximum value out of five samples below. I am getting zero as the output.
void read_adc_Voltage()
{
    int arr[100];
    int arr1[100];
    int max = arr[0];
    arr[0] = 0;
    int i, j = 0;
    int count = 6;
    for (i = 1; i < count; i++)
    {
        var = (ain1Buff[sampleCounter]);
        voltage = var * ((float)3.3 / (float)4095);
        arr[j] = voltage;
        if (arr[j] > max)
        {
            max = arr[j];
        }
        j++;
    }
    sprintf(data1, "%.2f", max);
    LCD_String_xy(1, 1, data1);
    sampleCounter++;
    if (sampleCounter == 6)
    {
        sampleCounter = 0;
    }
}
Define what you mean by terms like peak and interval, and decide on, and specify, your criteria for the other conditions.
You can track the maximum value by updating MaxValue whenever SampleValue is greater than the previous MaxValue in that interval. Similarly with MinValue.
For example, is your criterion for a peak that SampleValue should rise by a certain amount, followed by a fall of a certain amount? Or is it that SampleValue should exceed a certain fixed threshold? Or something more complicated? They will have different consequences when the level of the signal fluctuates with noise; you may need to set up initial conditions; and perhaps consider whether the system can lock into an undesirable condition, like a threshold that is too high or too low. The more complicated the solution, the more opportunity for undesirable conditions, so KISS.
The same sort of considerations apply to starting a new interval. Is it a fixed time? Use a timer. Is it related to the sample frequency? Count the number of samples. Is it to begin after each peak, perhaps with a minimum interval? Reset your timing or counting, and MaxValue and MinValue, after each peak, and prohibit determination of a peak until your MinTime or MinCount has passed.
Use the simplest solution that will deal with your signal, taking noise etc. into account. It might get quite complicated if you let it; a sketch of the rise-then-fall criterion follows below.
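
To make the rise-then-fall criterion concrete, here is a minimal sketch (in TypeScript for readability; the names and the single `delta` threshold are my assumptions, not part of the answer above):

// A peak is confirmed once the signal falls `delta` below the running maximum;
// the search then waits until the signal rises `delta` above the running
// minimum before looking for the next peak. The hysteresis rejects noise.
function findPeaks(samples: number[], delta: number): number[] {
  const peaks: number[] = [];
  let max = -Infinity;
  let min = Infinity;
  let maxIndex = 0;
  let lookingForPeak = true;
  for (let i = 0; i < samples.length; i++) {
    const v = samples[i];
    if (v > max) { max = v; maxIndex = i; }
    if (v < min) { min = v; }
    if (lookingForPeak && v < max - delta) {
      peaks.push(maxIndex); // confirmed peak at the last maximum
      min = v;
      lookingForPeak = false;
    } else if (!lookingForPeak && v > min + delta) {
      max = v;              // reset and look for the next peak
      maxIndex = i;
      lookingForPeak = true;
    }
  }
  return peaks;
}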

How to use Apple's Accelerate framework in Swift to compute the FFT of a real signal?

How am I supposed to use the Accelerate framework to compute the FFT of a real signal in Swift on iOS?
Available examples on the web
Apple's Accelerate framework seems to provide functions to compute the FFT of a signal efficiently.
Unfortunately, most of the examples available on the Internet, like Swift-FFT-Example and TempiFFT, crash if tested extensively, and call into the Objective-C API.
The Apple documentation answers many questions, but also raises new ones (Is this piece mandatory? Why do I need this conversion call?).
Threads on Stack Overflow
There are a few threads addressing various aspects of the FFT with concrete examples, notably FFT Using Accelerate In Swift, DFT result in Swift is different than that of MATLAB and FFT Calculating incorrectly - Swift.
None of them directly addresses the question: "What is the proper way to do it, starting from zero?"
It took me one day to figure out how to do it properly, so I hope this thread can give a clear explanation of how you are supposed to use Apple's FFT, show the pitfalls to avoid, and help developers save precious hours of their time.
TL;DR: If you need a working implementation to copy-paste, here is a gist.
What is FFT?
The Fast Fourier Transform is an algorithm that takes a signal in the time domain -- a collection of measurements taken at a regular, usually small, interval of time -- and turns it into a signal expressed in the frequency domain (a collection of frequencies).
The ability to express the signal along time is lost in the transformation (though the transformation is invertible, meaning no information is lost by computing the FFT, and you can apply an IFFT to get the original signal back), but we gain the ability to distinguish between the frequencies the signal contains. This is typically used to display the spectrograms of the music you are listening to on various hardware and YouTube videos.
The FFT works with complex numbers. If you don't know what they are, let's just pretend each one is a combination of a radius and an angle: there is one complex number per point on a 2D plane. Real numbers (your usual floats) can be seen as a position on a line (negative on the left, positive on the right).
NB: FFT(FFT(FFT(FFT(X)))) = X (up to a constant depending on your FFT implementation).
How to compute the FFT of a real signal.
Usually you want to compute the FFT of a small window of an audio signal. For the sake of example, we will take a small window of 1024 samples. You should also prefer a power of two; otherwise things get a little more difficult.
var signal: [Float] // Array of length 1024
First, you need to initialize some constants for the computation.
// The length of the input
let length = vDSP_Length(signal.count)
// log2 of the next power of two of two times the length of the input.
// Do not forget this factor of 2.
let log2n = vDSP_Length(ceil(log2(Float(length * 2))))
// Create the FFT instance, which can compute FFTs of complex vectors
// with length up to `length`.
let fftSetup = vDSP.FFT(log2n: log2n, radix: .radix2, ofType: DSPSplitComplex.self)!
Following Apple's documentation, we first need to create a complex array that will be our input.
Don't get misled by the tutorial: what you usually want is to copy your signal into the real part of the input and keep the imaginary part zero.
// Input / Output arrays
var forwardInputReal = [Float](signal) // Copy the signal here
var forwardInputImag = [Float](repeating: 0, count: Int(length))
var forwardOutputReal = [Float](repeating: 0, count: Int(length))
var forwardOutputImag = [Float](repeating: 0, count: Int(length))
Be careful: the FFT function does not allow using the same split complex as both input and output at the same time. If you experience crashes, this may be the cause; it is why we define both an input and an output.
Now we have to be careful to "lock" the pointers to these four arrays, as shown in the documentation example. If you simply use &forwardInputReal as an argument to your DSPSplitComplex, the pointer may become invalidated on the following line and you will likely experience sporadic crashes of your app.
forwardInputReal.withUnsafeMutableBufferPointer { forwardInputRealPtr in
    forwardInputImag.withUnsafeMutableBufferPointer { forwardInputImagPtr in
        forwardOutputReal.withUnsafeMutableBufferPointer { forwardOutputRealPtr in
            forwardOutputImag.withUnsafeMutableBufferPointer { forwardOutputImagPtr in
                // Input
                let forwardInput = DSPSplitComplex(realp: forwardInputRealPtr.baseAddress!, imagp: forwardInputImagPtr.baseAddress!)
                // Output
                var forwardOutput = DSPSplitComplex(realp: forwardOutputRealPtr.baseAddress!, imagp: forwardOutputImagPtr.baseAddress!)
                // FFT call goes here
            }
        }
    }
}
Now, the final line: the call to your FFT, which goes where the placeholder comment is, inside the closures:
fftSetup.forward(input: forwardInput, output: &forwardOutput)
The result of your FFT is now available in forwardOutputReal and forwardOutputImag.
If you only want the amplitude of each frequency, and you don't care about the real and imaginary parts, you can declare an additional array alongside the input and output:
var magnitudes = [Float](repeating: 0, count: Int(length))
and, right after your FFT call, compute the amplitude of each "bin" with:
vDSP.absolute(forwardOutput, result: &magnitudes)

Is an FFT necessary to find peaks and pits in audio files

I'm able to read a wav file and its values. I need to find the positions and values of peaks and pits. At first I tried smoothing with the formula (array[i-1] + array[i] + array[i+1]) / 3, then searching the array with a test like array[i-1] > array[i] && direction == 'up' --> pit, but because of noise and for other reasons related to future calculations in the project, I'm trying to find a better approach. For the past couple of days I've been researching the FFT. As I understand it, the FFT translates the audio file into a series of sines and cosines; after the FFT operation the given values are the coefficients of a0 + ak * cos(k*x) + bk * sin(k*x), summed over increasing k, as in this picture:
http://zone.ni.com/images/reference/en-XX/help/371361E-01/loc_eps_sigadd3freqcomp.gif
My question is: does the FFT help me find peaks and pits in audio? Does anybody have experience with this kind of problem?
It depends on exactly what you are trying to do, which you haven't really made clear. "Finding the peaks and pits" is one thing, but since there might be various reasons for doing this, there might be various methods. It sounds like you already tried the straightforward thing of actually looking for the local maxima and minima. Here are some tips:
1) You do not need the FFT.
2) Audio data usually swings above and below zero (there are exceptions, including 8-bit wavs, which are unsigned, but these are exceptions), so you must be aware of positive and negative values. Generally, large positive and large negative values carry large amounts of energy, though, so you want to count those as the same.
3) Due to #2, if you want to average, you might want to take the average of the absolute value, or more commonly, the average of the square. Once you find the average of the squares, take the square root of that value; this gives the RMS, which is related to the power of the signal, so you might do something like this if you are trying to indicate signal loudness or intensity, or to approximate an analog meter. The average of absolute values may be more robust against extreme values, but is less commonly used. (See the sketch after this list.)
4) Another approach is to simply look for the peak of the absolute value over some number of samples. This is commonly done when drawing waveforms, and for digital "peak" meters. It makes less sense to look at the minimum absolute value.
5) Once you've done something like the above, yes, you may want to compute the log of the value you've found in order to display the signal in dB, but make sure you use the right formula. 10 * log_10( amplitude ) is not it. Rule of thumb: when computing logs from amplitude you will usually see a 20, not a 10. If you want to compute dBFS (the amount of "headroom" before clipping, which is the standard measurement for digital meters), the formula is -20 * log_10( |amplitude| ), where amplitude is normalized to +/- 1. Watch out for amplitude = 0, which gives infinite headroom in dB.
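
A quick sketch of the RMS and headroom calculations just described (in TypeScript; the names are mine, and samples are assumed to be floats normalized to +/- 1):

// Root-mean-square of a block of samples: sqrt of the average of the squares.
function rms(samples: Float32Array): number {
  let sumOfSquares = 0;
  for (const s of samples) sumOfSquares += s * s;
  return Math.sqrt(sumOfSquares / samples.length);
}

// Headroom before clipping in dB: -20 * log10(|amplitude|).
// Watch out: amplitude = 0 gives Infinity.
function headroomDb(amplitude: number): number {
  return -20 * Math.log10(Math.abs(amplitude));
}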
If I understand you correctly, you just want to estimate the relative loudness/quietness of a digital audio sample at a given point.
For this estimation you don't need to use the FFT. However, your method of averaging the signal does not produce the appropriate picture either.
The digital signal is the value of the audio wave at a given moment. You need to find the overall amplitude of the signal around that moment. You can see it roughly as the local maximum value over a given interval around the moment in question: keep a moving maximum of the signal and you get your amplitude estimate.
With 16-bit samples, the signal value can go from 0 up to 32767. At a 44.1 kHz sample rate, you can find peaks and pits of around 0.01 s by finding the max value of 441 samples around a given moment t:
max = 1;
for (i = 0; i < 441; i++)
    if (array[t*44100 + i] > max)
        max = array[t*44100 + i];
Then, to represent it on a 0-to-1 scale (not really 0, because we used a minimum of 1):
amplitude = max / 32767;
or you might represent it on a relative dB logarithmic scale (here you see why we used 1 as the minimum value):
dB = 20 * log10(amplitude);
All you need to do is take dy/dx, which you can get approximately by scanning through the wave, subtracting the previous value from the current one, and looking at where the difference goes to zero or changes from positive to negative.
In this code I made it really brief for the sake of brevity; of course you could handle cases of dy being zero better, find the 'centre' of a long flat peak, that kind of thing. But if all you need is basic peaks and troughs, this will find them.
lastY = 0;
bool goingup = true;
for( i = 0; i < wave.length; i++ ) {
    y = wave[i];
    dy = y - lastY;
    bool stillgoingup = (dy > 0);
    if( goingup != stillgoingup ) {
        // changed direction - note the value of i (place) and y (height):
        // a peak if we were going up, a trough if we were going down
        goingup = stillgoingup;
    }
    lastY = y;
}

adding "audible ticks" to a waveform for onset detection debugging

I'm playing around with some onset/beat detection algorithms on my own. My input is a .wav file and my output is a .wav file; I have access to the entire waveform in chunks of float[] arrays.
I'm having trouble coming up with a good way to debug and evaluate my algorithms. Since my input and output are both auditory, I thought it would make the most sense if my debugging facility were also auditory, e.g. by adding audible "ticks" or "beeps" to the .wav file at onset points.
Does anyone have any ideas on how to do this? Ideally, it would be a simple for-loop that I'd run a couple hundred or couple thousand samples through.
float * sample = first sample where beep is to be mixed in
float const beep_duration = desired beep duration in seconds
float const sample_rate = sampling rate in samples per second
float const frequency = desired beep frequency, Hz
float const PI = 3.1415926f;
float const volume = desired beep volume

for( int index = 0; index < (int)(beep_duration * sample_rate); index++ )
{
    // phase advances by 2 * PI * frequency / sample_rate per sample
    sample[index] +=
        sin( float(index) * 2.f * PI * frequency / sample_rate ) * volume;
}
Poor man's answer: find a recording of a tick or beep, then mix it with the original waveform at each desired moment. You mix by simply averaging the values of the beep and the input waveform for the duration of the beep.
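
A tiny sketch of that averaging mix (in TypeScript; the function name and signature are mine):

// Mix a prerecorded tick into the waveform starting at a given onset sample,
// averaging the two signals for the duration of the tick.
function mixTick(wave: Float32Array, tick: Float32Array, onset: number): void {
  for (let i = 0; i < tick.length && onset + i < wave.length; i++) {
    wave[onset + i] = (wave[onset + i] + tick[i]) / 2;
  }
}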
Figure out where in your sample you want to insert your tick (include the length of the tick, so this is a range, not a point). Take the FFT of that section of the waveform. Add to the frequency-domain representation whatever frequency components you desire for your "tick" sound (the simplest would be a single-frequency tone). Perform the inverse FFT on the result and voilà, you have your tone mixed into the original signal. I think (it's been a while since I've done this).
