I would like to know if it's possible to resample an already written AVAudioFile.
None of the references I found address this particular problem, because:
They propose resampling while the user is recording the AVAudioFile, i.e. while installTap is running. In that approach, an AVAudioConverter works on each buffer chunk delivered by the inputNode and appends it to the AVAudioFile. [1] [2]
The point is that I would like to resample my audio file regardless of the recording process.
The harder approach would be to upsample the signal by a factor of L and then decimate it by a factor of M, using vDSP:
Audio on Compact Disc has a sampling rate of 44.1 kHz; to transfer it to a digital medium that uses 48 kHz, method 1 above can be used with L = 160, M = 147 (since 48000/44100 = 160/147). For the reverse conversion, the values of L and M are swapped. Per above, in both cases, the low-pass filter should be set to 22.05 kHz. [3]
The last one obviously seems like too hard-coded a way to solve it. I hope there's a way to resample it with AVAudioConverter, but it lacks documentation :(
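For reference, here is roughly what the L = 160, M = 147 rational resampling from the quote looks like as a minimal Python sketch using scipy.signal.resample_poly (the file names are placeholders, and this is of course not the AVAudioConverter route I'm hoping for):

# Sketch of rational (L/M) resampling: 44.1 kHz -> 48 kHz.
# Assumes SciPy and soundfile are available; "input.wav" is a placeholder.
import soundfile as sf
from scipy.signal import resample_poly

data, sr = sf.read("input.wav")                    # expected to be 44100 Hz
up, down = 160, 147                                # 48000 / 44100 == 160 / 147
resampled = resample_poly(data, up, down, axis=0)  # upsample by L, low-pass filter, decimate by M
sf.write("output_48k.wav", resampled, 48000)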
Related
I'm doing some data augmentation on a speech dataset, and I want to stretch/squeeze each audio file in the time domain.
I found the following three ways to do that, but I'm not sure which one is the best or more optimized way:
dimension = int(len(signal) * speed)  # target number of samples after stretching
res = librosa.effects.time_stretch(signal, speed)
res = cv2.resize(signal, (1, dimension)).squeeze()
res = skimage.transform.resize(signal, (dimension, 1)).squeeze()
However, I found that librosa.effects.time_stretch adds unwanted echo (or something like that) to the signal.
So, my question is: What are the main differences between these three ways? And is there any better way to do that?
librosa.effects.time_stretch(signal, speed) (docs)
In essence, this approach transforms the signal using stft (short time Fourier transform), stretches it using a phase vocoder and uses the inverse stft to reconstruct the time domain signal. Typically, when doing it this way, one introduces a little bit of "phasiness", i.e. a metallic clang, because the phase cannot be reconstructed 100%. That's probably what you've identified as "echo."
Note that while this approach effectively stretches audio in the time domain (i.e., the input is in the time domain as well as the output), the work is actually being done in the frequency domain.
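To make that pipeline concrete, here is a minimal sketch of the STFT -> phase vocoder -> inverse STFT chain using librosa (the file name is a placeholder, and keyword names may differ slightly between librosa versions):

# Rough sketch of what librosa.effects.time_stretch does internally.
import librosa

signal, sr = librosa.load("speech.wav", sr=None)   # placeholder file name
speed = 1.2                                        # > 1 speeds up, < 1 slows down

D = librosa.stft(signal, n_fft=2048, hop_length=512)                 # to the frequency domain
D_stretched = librosa.phase_vocoder(D, rate=speed, hop_length=512)   # stretch via phase vocoder
stretched = librosa.istft(D_stretched, hop_length=512)               # back to the time domain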
cv2.resize(signal, (1, dimension)).squeeze() (docs)
All this approach does is interpolate the given signal using bilinear interpolation. It is suitable for images, but strikes me as unsuitable for audio signals. Have you listened to the result? Does it sound at all like the original signal, only faster/slower? I would assume not only the tempo changes, but also the frequencies, and perhaps there are other effects.
skimage.transform.resize(signal, (dimension, 1)).squeeze() (docs)
Again, this is meant for images, not sound. In addition to the interpolation (spline interpolation with order 1 by default), this function also performs anti-aliasing for images. Note that this has nothing to do with avoiding audio aliasing effects (Nyquist/aliasing), so you should probably turn it off by passing anti_aliasing=False. Again, I would assume that the results may not be exactly what you want (changed frequencies, other artifacts).
What to do?
IMO, you have several options.
If what you feed into your ML algorithms ends up being something like a Mel spectrogram, you could simply treat it as an image and stretch it using the skimage or opencv approach. Frequency ranges would be preserved. I have successfully used this kind of approach in this music tempo estimation paper.
Use a better time_stretch library, e.g. rubberband. librosa is great, but its current time scale modification (TSM) algorithm is not state of the art. For a review of TSM algorithms, see for example this article.
Ignore the fact that the frequencies change and simply add zero samples to the signal at regular intervals, or drop samples from it at regular intervals (much like your image interpolation does). If you don't stretch too far, it may still work for data augmentation purposes. After all, the word content is not changed if the audio has slightly higher or lower frequencies.
Resample the signal to another sampling frequency, e.g. 44100 Hz -> 43000 Hz or 44100 Hz -> 46000 Hz, using a library like resampy, and then pretend that it's still 44100 Hz (see the sketch after this list). This still changes the frequencies, but at least you get the benefit that resampy does proper filtering of the result, so you avoid the aforementioned aliasing that would otherwise occur.
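A minimal sketch of that last option, assuming resampy and librosa are available and using a placeholder file name:

# Sketch of "resample, then pretend nothing happened" augmentation with resampy.
import librosa
import resampy

y, sr = librosa.load("speech.wav", sr=None)   # assume this file is 44100 Hz
y_shifted = resampy.resample(y, sr, 43000)    # proper filtering avoids aliasing
# Keep treating y_shifted as if it were still 44100 Hz downstream: the clip
# becomes slightly shorter and higher in pitch, which is the augmentation effect.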
According to what I have read on the internet, the normal range of the fundamental frequency of the female voice is 165 to 255 Hz.
I am using Praat, and also a Python library called Parselmouth, to get the fundamental frequency values of a female voice in an audio file (.wav). However, I got some values that are over 255 Hz (e.g. 400+ Hz, 500 Hz).
Is it normal to get big values like this?
It is possible, but unlikely, if you are trying to capture the fundamental frequency (F0) of a speaking voice. It seems likely that you are capturing a more easily resonating overtone (e.g. F1 or F2) instead.
My experiments with Praat give me the impression that, with good parameters, it will reliably extract F0.
What you'll want to do is verify that by comparing the pitch curve with a spectrogram. Here's an example of a fit produced by Praat (female speaker):
You can see from the image that
The most prominent frequency seems to be F2
Around 200 Hz seems likely to be F0, since there's only noise below that (compared to before/after the segment)
Praat has calculated a good estimate of F0 for the voiced speech segments
If, after a visual inspection, it seems that you are getting wrong results, you can try to tweak the parameters. Window length greatly affects the frequency resolution.
If you can't capture frequencies this low, you should try increasing the window length - the intuition is that it gives the algorithm a better chance at finding slowly changing periodic features in the data.
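One way to sanity-check the extracted values and parameters directly from Python is sketched below; the file name and the 75-300 Hz pitch range are illustrative choices, not prescriptions:

# Minimal Parselmouth sketch: extract an F0 track with an explicit pitch range.
import numpy as np
import parselmouth

snd = parselmouth.Sound("voice.wav")                          # placeholder file name
pitch = snd.to_pitch(pitch_floor=75.0, pitch_ceiling=300.0)   # constrain the search range

f0 = pitch.selected_array['frequency']   # 0.0 where a frame is unvoiced
voiced = f0[f0 > 0]
print("median F0: %.1f Hz" % np.median(voiced))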
How do I represent and use a sound wave (sine wave, 1000 Hz, 3 s, -3 dBFS, 44.1 kHz) in an FFT program? The input to the program is a list of amplitudes and the sampling rate.
I mean: how do I transform a sound file (e.g. XYZ.wav) into input for the FFT, where one input argument needs to be a .dat file consisting of amplitudes, another needs to be the sampling rate, and so on as necessary.
Typically, when you execute an FFT call you supply a one-dimensional array that represents a curve in the time domain. Often this is your audio curve, but an FFT will transform any time-series curve. When you start from an audio file, say a wav file, you must first turn the binary data into this floating-point 1D array. If it is wav, the file will begin with a 44-byte header that details essential attributes like sample rate, bit depth and endianness; the rest of the wav file is the payload. Depending on bit depth, you then need to parse a set of bytes (typically 16 bits, which consume two bytes) and turn them into an integer by doing some bit shifting. To do that you need to be aware of the notion of endianness (big or little endian), as well as handle the interleaving of a multi-channel signal like stereo. Once you have generated the floating-point array representation, just feed it into your FFT call. For starters, ignore the wav file and simply synthesize your own sine curve and feed it into an FFT call, just to confirm that the known frequency going in shows up as that frequency in the frequency-domain output.
The response back from an FFT call (or DFT) will be a 1D array of complex numbers. There is a simple formula to calculate the magnitude and phase of each frequency in this FFT result set. Be aware of what the Nyquist limit is and how to fold the frequency-domain array on top of itself to double the magnitude while using only half of its elements. Element 0 of this frequency-domain array is your DC offset, and each subsequent element is called a frequency bin, separated from its neighbours by a constant frequency increment that is also calculated by a simple formula. Pipe back if you're interested in what these formulas are.
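As a quick sanity check of the above, a minimal Python/NumPy sketch that synthesizes a known sine and reads off magnitude and bin frequency might look like this (the 1000 Hz / 44.1 kHz / -3 dBFS numbers are just the example values from the question):

# Synthesize a known sine, FFT it, and inspect magnitude, phase and bin frequency.
import numpy as np

fs = 44100                   # sample rate in Hz
f0 = 1000.0                  # known test frequency in Hz
t = np.arange(3 * fs) / fs   # 3 seconds of samples
x = 0.707 * np.sin(2 * np.pi * f0 * t)   # roughly -3 dBFS sine

X = np.fft.rfft(x)                        # complex spectrum, DC up to Nyquist
mags = np.abs(X)                          # magnitude of each frequency bin
phases = np.angle(X)                      # phase of each frequency bin
freqs = np.fft.rfftfreq(len(x), d=1/fs)   # bin k sits at k * fs / N Hz

peak = np.argmax(mags)
print("strongest bin: %.1f Hz" % freqs[peak])   # should print ~1000.0 Hz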
Now you can appreciate the people who spend their entire careers pushing the frontiers of the algorithms working behind the curtain of these API calls. Slamming together 30 lines of API calls to perform all of the above is probably possible, but it is far more noble to write the code yourself by hand, as I know it will open up new horizons and enable you to conquer more subtle questions.
A super interesting aspect of transforming a curve in the time domain into its frequency-domain counterpart by making an FFT call is that you have retained all of the information of your source signal. To prove this point, I highly suggest you take the next step and perform the symmetrical operation, transforming the output of your FFT call back into the time domain:
audio curve in time domain --> fft --> freq domain representation --> inverse fft --> back to original audio curve in time domain
This cycle of transformations is powerful, as it is a nice way to confirm that your audio-curve-to-frequency-domain step is working.
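A minimal sketch of that round trip, using NumPy and the same kind of synthetic tone as above:

# Round trip: time domain -> rfft -> irfft -> time domain, then compare.
import numpy as np

fs = 44100
x = 0.707 * np.sin(2 * np.pi * 1000.0 * np.arange(fs) / fs)   # 1 second test tone

X = np.fft.rfft(x)                  # forward transform
x_back = np.fft.irfft(X, n=len(x))  # inverse transform, same length as the input

print(np.allclose(x, x_back))       # True: no information was lost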
I have some very short signals from an oscilloscope (50k-200k samples) recorded over a time span of about 2 ms. These are acoustic signals capturing the spark of an ESD (electrostatic discharge) event.
I'd like to get some frequency data for that signal, in the near-acoustic frequency range (up to about 30 kHz), with as high a time resolution as possible.
I have tried plotting a spectrogram (specgram in Octave) to view the signal, but the output is not really useful. Using specgram( x, N, fs );, where x is my signal with sampling rate fs, I get a plot starting at very high frequencies (around 500 MHz) for low values of N; I get better frequency resolution for large N values (like 2^12-2^13), but then the window is so wide that I get only 2 spectrum values over the whole signal length.
I understand that this may be a limitation of the Fourier transform, which is probably what the specgram function uses (actually, I don't know much about signal analysis).
Is there any other way to get frequency information (as a function of time) from that kind of signal? I've read something about wavelets, but when I tried the dwt function of the signal package, I received this error:
error: 'wfilters' undefined near line 51 column 14
error: called from
dwt at line 51 column 12
Even if this worked, I am not so sure I'd know how to actually use the output of those wavelet functions ...
To get audio-frequency information from such a high sample rate, you will need to obtain a sample vector long enough to contain at least a few whole cycles at audio frequencies, e.g. many tens of milliseconds of contiguous samples, which may or may not be more than your scope can gather. To process this amount of data reasonably, you might low-pass filter the sample data down to just the audio frequencies, and then resample it to a lower sample rate (still above twice the filter's cut-off frequency). You will then end up with a much shorter sample vector to feed to an FFT for your audio spectrum analysis.
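The same idea, sketched in Python/SciPy rather than Octave (the Octave signal package has analogous decimate/resample functions); the 50 MHz capture rate and the stage factors below are illustrative assumptions, not values from the question:

# Sketch: low-pass filter + downsample the scope capture to an audio-band rate,
# then run FFTs/spectrograms on the shorter vector.
import numpy as np
from scipy.signal import decimate

fs_scope = 50_000_000            # assumed oscilloscope sample rate (Hz)
x = np.random.randn(100_000)     # stand-in for the recorded ESD signal

y = x
for q in (5, 5, 5, 5):           # overall factor 625: 50 MHz -> 80 kHz (> 2 * 30 kHz)
    y = decimate(y, q, ftype='fir', zero_phase=True)   # filter, then downsample
fs_new = fs_scope / 625          # 80 kHz

# Note: you still need tens of milliseconds of capture for useful audio-band resolution.
spectrum = np.abs(np.fft.rfft(y))
freqs = np.fft.rfftfreq(len(y), d=1 / fs_new)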
I am a student and started learning signal processing just a few months ago. I picked "A Novel Fuzzy Approach to Speech Recognition" for my project (you can google for the downloadable version).
I am a little stuck on converting the training data into a spectrogram that has been passed through a mel filter.
I use this for my mel-filterbank, with a little modification of course.
Then I wrote this simple code to make the spectrogram of my training data:
p = 25;                      % number of mel filters
fl = 0.0;                    % lower frequency limit passed to the filterbank
fh = 0.5;                    % upper frequency limit passed to the filterbank
w = 'hty';                   % filterbank options
[a, fs] = wavread('a.wav');  % you can simply record a sound and name it a.wav; the other params will follow
n = length(a) + 1;           % FFT length
fa = rfft(a);                % FFT of the entire signal
xa = melbank_me(p, n, fs);   % the mel-filterbank function
za = log(xa * abs(fa).^2);   % log filterbank energies
ca = dct(za);                % cepstral coefficients
spectrogram(ca(:,1))
All I got is something like this, which is not what the paper shows:
Please let me know whether my code and the spectrogram I got are right. If so, what do I have to do to make my spectrogram look like the one in the paper? And if not, please tell me where I went wrong.
And another question: is it OK to have an FFT length that large?
Because when I try to lower it, my code gives errors.
You shouldn't be doing an FFT of the entire file - that will include too much time-varying information - you should pick a window size in which the sound is relatively stationary, e.g. 10 ms @ 44.1 kHz = 441 samples, so perhaps N = 512 might be a good starting point. You can then generate your spectrogram over successive windows if needed, in order to display the time-varying frequency content.
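For comparison, here is how that windowed analysis looks in Python with librosa (a sketch only; the original code is Octave/MATLAB, and the 25 mel bands and 512-sample window simply mirror the values discussed above):

# Sketch of a mel spectrogram computed over successive short windows.
import librosa
import numpy as np

y, sr = librosa.load("a.wav", sr=None)              # same file name as the question
S = librosa.feature.melspectrogram(y=y, sr=sr,
                                    n_fft=512,      # ~11.6 ms window at 44.1 kHz
                                    hop_length=256, # 50% overlap between windows
                                    n_mels=25)      # 25 mel bands, like p = 25 above
log_S = np.log(S + 1e-10)                           # log-energy, one column per window
print(log_S.shape)                                  # (25, number_of_windows)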