I simply need to know when the user starts talking into the microphone. I will not be doing any speech processing or anything fancy, just detect when the microphone has picked up anything. I've been looking for an hour now and can't find anything as simple as that. Can somebody point me in the right direction?
Update 1
I apologise for how late this is; I have been having connectivity issues. Here's the code I've been using:
let audioEngine = AVAudioEngine()   // keep a strong reference so the engine isn't deallocated

override func viewDidLoad() {
    super.viewDidLoad()
    // Do any additional setup after loading the view, typically from a nib.
    let inputNode = audioEngine.inputNode
    let bus = 0
    inputNode.installTap(onBus: bus, bufferSize: 8192, format: inputNode.inputFormat(forBus: bus)) { buffer, time in
        print("Speech detected.")
    }
    audioEngine.prepare()
    try? audioEngine.start()
}
The callback you're passing to installTap(onBus:) will be called with every audio block coming from the mic. The code above detects that your app has started listening -- to silence or anything else -- not whether someone is speaking into the mic.
In order to actually identify the start of speech you would need to look at the data in the buffer.
A simple version of this is similar to an audio noise gate used in PA systems: Pick an amplitude threshold and a duration threshold and once both are met you call it speech. Because phones, mics, and environments all vary you will probably need to adaptively determine an amplitude threshold to get good performance.
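Here is a minimal sketch of such a gate, assuming the same audioEngine/inputNode setup as in the code above; the threshold and duration values are placeholders you would need to tune (possibly adaptively), not recommendations:

// Noise-gate sketch: the threshold and duration values below are placeholders to tune.
let amplitudeThreshold: Float = 0.05
let requiredDuration: TimeInterval = 0.25
var loudSince: Date? = nil

inputNode.installTap(onBus: 0, bufferSize: 8192, format: inputNode.inputFormat(forBus: 0)) { buffer, _ in
    guard let samples = buffer.floatChannelData?.pointee else { return }
    let frameCount = Int(buffer.frameLength)
    guard frameCount > 0 else { return }

    // Root-mean-square amplitude of this buffer.
    var sumOfSquares: Float = 0
    for i in 0..<frameCount {
        sumOfSquares += samples[i] * samples[i]
    }
    let rms = (sumOfSquares / Float(frameCount)).squareRoot()

    if rms > amplitudeThreshold {
        if loudSince == nil { loudSince = Date() }
        if let start = loudSince, Date().timeIntervalSince(start) >= requiredDuration {
            print("Speech (or at least sustained sound) detected.")
        }
    } else {
        // Fell below the gate; reset the duration timer.
        loudSince = nil
    }
}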
Even with an adaptive threshold, sustained loud sounds will still be considered "speech". If you need to weed those out too then you'll want to look at some sort of frequency analysis (e.g., FFT) and identify sufficient amplitude and variation over time in speech frequencies. Or you could pass the buffers to the speech recognition engine (e.g., SFSpeechRecognizer) and see whether it recognizes anything, hence piggybacking on Apple's signal processing work. But that's pretty heavyweight if you do it often.
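If you do want to piggyback on the recognizer, a rough outline using the Speech framework (iOS 10+) might look like the following; it assumes the same inputNode as above, omits authorization and error handling, and is a sketch rather than a drop-in solution:

import Speech

// Outline only: call SFSpeechRecognizer.requestAuthorization(_:) first and
// handle errors/cancellation in real code.
let recognizer = SFSpeechRecognizer()
let request = SFSpeechAudioBufferRecognitionRequest()
request.shouldReportPartialResults = true

// Keep a reference to the task if you need to cancel it later.
let recognitionTask = recognizer?.recognitionTask(with: request) { result, _ in
    if let text = result?.bestTranscription.formattedString, !text.isEmpty {
        print("Speech detected: \(text)")
    }
}

// Feed each microphone buffer from the tap to the recognizer.
inputNode.installTap(onBus: 0, bufferSize: 8192, format: inputNode.inputFormat(forBus: 0)) { buffer, _ in
    request.append(buffer)
}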
Related
I'm building a rhythm game and trying to provide extremely low latency audio response to the user using AudioKit.
I'm new to AudioKit, following the Hello world example, I built a very simple test app using AKOscillator:
...
let oscillator = AKOscillator()
...
AudioKit.output = oscillator
oscillator.frequency = 880
AKSettings.bufferLength = .shortest
AKSettings.ioBufferDuration = 0.002
AudioKit.start()
... // On Touch event ///
oscillator.start()
... // 20 ms later ///
oscillator.stop()
I measured the latency between the touch event and the first sound coming out; it's around 100 ms, which is way too slow for us...
A few possibilities I can think of:
100 ms is hitting the hardware limit of the audio output delay
some more magic settings can fix this
oscillator.start() has some delay; to achieve the lowest latency I should use something else
something wrong with the other parts of the code (touch handling etc.)
Since I have no experience with AudioKit (nor with the iOS audio system...), any piece of information will be really appreciated!
I am trying to extract MFCC vectors from the audio signal as input into a recurrent neural network. However, I am having trouble figuring out how to obtain the raw audio frames in Swift using Core Audio. Presumably, I have to go low-level to get that data, but I cannot find helpful resources in this area.
How can I get the audio signal information that I need using Swift?
Edit: This question was flagged as a possible duplicate of How to capture audio samples in iOS with Swift?. However, that question does not have the answer I am looking for: its solution is the creation of an AVAudioRecorder, which is a component of a solution to my question, not the end result.
This question How to convert WAV/CAF file's sample data to byte array? is more in the direction of where I am headed. The solutions to that are written in Objective-C, and I am wondering if there is a way to do it in Swift.
Attaching a tap to the default input node on AVAudioEngine is pretty straightforward and will get you real-time ~100ms chunks of audio from the microphone as Float32 arrays. You don't even have to connect any other audio units. If your MFCC extractor & network are sufficiently responsive this may be the easiest way to go.
let audioEngine = AVAudioEngine()
let inputNode = audioEngine.inputNode

inputNode.installTap(onBus: 0,           // bus 0 of the input node
                     bufferSize: 1000,   // a request, not a guarantee
                     format: nil,        // no format translation
                     block: { buffer, when in
    // This block will be called over and over for successive buffers
    // of microphone data until you stop() AVAudioEngine.
    let actualSampleCount = Int(buffer.frameLength)

    // buffer.floatChannelData?.pointee[n] has the data for sample n
    for i in 0..<actualSampleCount {
        let val = buffer.floatChannelData?.pointee[i]
        // do something with each sample (val) here...
    }
})

do {
    try audioEngine.start()
} catch let error as NSError {
    print("Got an error starting audioEngine: \(error.domain), \(error)")
}
You will need to request and obtain microphone permission as well.
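For reference, the permission request itself is small; you also need an NSMicrophoneUsageDescription entry in Info.plist. A minimal sketch:

import AVFoundation

// Ask for microphone access before starting the engine; the completion block
// runs once the user has answered the system prompt.
AVAudioSession.sharedInstance().requestRecordPermission { granted in
    if granted {
        // Safe to install the tap and start the engine here.
    } else {
        print("Microphone permission denied")
    }
}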
I find the amplitudes to be rather low, so you may need to apply some gain or normalization depending on your network's needs.
To process your WAV files, I'd try AVAssetReader, though I don't have code at hand for that.
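As a rough, hedged sketch of that direction (my own code, not tested against your files; the linear-PCM output settings are assumptions to adjust to whatever your MFCC code expects):

import AVFoundation

// Rough sketch: read a WAV/CAF file's samples as 32-bit float PCM.
func readSamples(from url: URL) throws -> [Float] {
    let asset = AVURLAsset(url: url)
    guard let track = asset.tracks(withMediaType: .audio).first else { return [] }

    let reader = try AVAssetReader(asset: asset)
    let outputSettings: [String: Any] = [
        AVFormatIDKey: kAudioFormatLinearPCM,
        AVLinearPCMBitDepthKey: 32,
        AVLinearPCMIsFloatKey: true,
        AVLinearPCMIsNonInterleaved: false
    ]
    let output = AVAssetReaderTrackOutput(track: track, outputSettings: outputSettings)
    reader.add(output)
    reader.startReading()

    var samples = [Float]()
    while let sampleBuffer = output.copyNextSampleBuffer(),
          let blockBuffer = CMSampleBufferGetDataBuffer(sampleBuffer) {
        let length = CMBlockBufferGetDataLength(blockBuffer)
        var chunk = [Float](repeating: 0, count: length / MemoryLayout<Float>.size)
        chunk.withUnsafeMutableBytes { ptr in
            _ = CMBlockBufferCopyDataBytes(blockBuffer, atOffset: 0, dataLength: length, destination: ptr.baseAddress!)
        }
        samples.append(contentsOf: chunk)
    }
    return samples
}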
I am writing an app to record single-channel audio with the built-in microphone on an iPhone 6. The app works as expected when configured to record at 8000 Hz. Here's the code:
// Set up the audio session
let session = AVAudioSession.sharedInstance()

// Configure the audio session and recorder
do {
    try session.setCategory(AVAudioSessionCategoryPlayAndRecord)

    var recordSettings = [String: AnyObject]()
    recordSettings[AVFormatIDKey] = Int(kAudioFormatLinearPCM) as AnyObject
    // Set the sampling rate
    recordSettings[AVSampleRateKey] = 8000.0 as AnyObject
    recordSettings[AVNumberOfChannelsKey] = 1 as AnyObject

    recorder = try AVAudioRecorder(url: outputFileURL, settings: recordSettings)
    recorder?.delegate = self
    recorder?.isMeteringEnabled = true
    recorder?.prepareToRecord()
    return true
}
catch {
    throw Error.AVConfiguration
}
To reduce the storage requirements, I would like to record at a much lower sample rate (ideally less than 1000 Hz). If I set the sample rate to 1000 Hz, the app records at 8000 Hz.
According to Apple's documentation,
The available range for hardware sample rate is device dependent. It typically ranges from 8000 through 48000 hertz.
Question: is it possible to use AVAudioSession (or another framework) to record audio at a low sample rate?
Audio recording on the iPhone is done with hardware codecs, so the available sample rates are hardcoded and can't be changed. But if you need a 1 kHz sample rate, you can record at 8 kHz and then just resample the recording with some resampling library. Personally, I prefer to use ffmpeg for such tasks.
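If you would rather resample on-device instead of going through ffmpeg, AVAudioConverter can do the sample-rate conversion. A rough sketch (my own, not part of the answer above; the format choices and error handling are illustrative only):

import AVFoundation

// Rough sketch: convert a mono buffer recorded at e.g. 8000 Hz down to a lower rate.
func downsample(_ input: AVAudioPCMBuffer, to targetSampleRate: Double) -> AVAudioPCMBuffer? {
    guard let outputFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                           sampleRate: targetSampleRate,
                                           channels: 1,
                                           interleaved: false),
          let converter = AVAudioConverter(from: input.format, to: outputFormat) else {
        return nil
    }

    let ratio = targetSampleRate / input.format.sampleRate
    let capacity = AVAudioFrameCount(Double(input.frameLength) * ratio) + 1
    guard let output = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: capacity) else {
        return nil
    }

    var consumed = false
    var error: NSError?
    let status = converter.convert(to: output, error: &error) { _, outStatus in
        // Hand the converter our single input buffer once, then signal end of stream.
        if consumed {
            outStatus.pointee = .endOfStream
            return nil
        }
        consumed = true
        outStatus.pointee = .haveData
        return input
    }
    return (status != .error && error == nil) ? output : nil
}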
I hope you are aware that, by the Nyquist theorem, you cannot expect very useful results from what you are trying to achieve.
That is, unless you are targeting low frequencies only. In that case you might want to apply a low-pass filter first. It's almost impossible to understand voices with only the frequencies below 500 Hz. Speech is usually said to require 3 kHz, which makes for a sample rate of 6000 Hz.
For an example of what you'd have to expect try something similar to:
ffmpeg -i tst.mp3 -ar 1000 tst.wav
with e.g. some vocals and listen to the result.
You can, however, possibly achieve an acceptable trade-off using e.g. a sample rate of 3000 Hz.
An alternative would be to do some compression on the fly, as @manishg suggested. Since smartphones these days can do video compression in real time, that should be entirely feasible with the iPhone's hardware and software. But it's a very different thing from reducing the sample rate.
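As a rough illustration of the low-pass idea mentioned above (my own sketch, with example values, not part of the original answer), a single-pole filter applied before downsampling could look like this:

// Minimal single-pole low-pass filter sketch; sample rate and cutoff are examples.
func lowPass(_ samples: [Float], sampleRate: Float, cutoff: Float) -> [Float] {
    // One-pole coefficient: alpha = dt / (RC + dt), with RC = 1 / (2 * pi * cutoff).
    let dt = 1 / sampleRate
    let rc = 1 / (2 * Float.pi * cutoff)
    let alpha = dt / (rc + dt)

    var filtered = [Float](repeating: 0, count: samples.count)
    var previous: Float = 0
    for (i, x) in samples.enumerated() {
        previous += alpha * (x - previous)
        filtered[i] = previous
    }
    return filtered
}

// e.g. lowPass(samples, sampleRate: 8000, cutoff: 450) before downsampling to ~1 kHz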
I am creating a metronome as part of a larger app and I have a few very short wav files to use as the individual sounds. I would like to use AVAudioEngine because NSTimer has significant latency problems and Core Audio seems rather daunting to implement in Swift. I'm attempting the following, but I'm currently unable to implement the first 3 steps and I'm wondering if there is a better way.
Code outline:
Create an array of file URLs according to the metronome's current settings (number of beats per bar and subdivisions per beat; file A for beats, file B for subdivisions)
Programmatically create a wav file with the appropriate number of frames of silence, based on the tempo and the length of the files, and insert it into the array between each of the sounds
Read those files into a single AudioBuffer or AudioBufferList
audioPlayer.scheduleBuffer(buffer, at: nil, options: .loops, completionHandler: nil)
So far I have been able to play a looping buffer (step 4) of a single sound file, but I haven't been able to construct a buffer from an array of files or create silence programmatically, nor have I found any answers on StackOverflow that address this. So I'm guessing that this isn't the best approach.
My question is: Is it possible to schedule a sequence of sounds with low latency using AVAudioEngine and then loop that sequence? If not, which framework/approach is best suited for scheduling sounds when coding in Swift?
I was able to make a buffer containing the sound from a file plus silence of the required length. Hope this helps:
// audioFile here is an instance of AVAudioFile initialized with a wav file
func tickBuffer(forBpm bpm: Int) -> AVAudioPCMBuffer {
    audioFile.framePosition = 0 // position in the file to read from; required if you read several times from one AVAudioFile
    let periodLength = AVAudioFrameCount(audioFile.processingFormat.sampleRate * 60 / Double(bpm)) // the tick's length for the given bpm (sound length + silence length)
    let buffer = AVAudioPCMBuffer(pcmFormat: audioFile.processingFormat, frameCapacity: periodLength)!
    try! audioFile.read(into: buffer) // sorry for forcing try
    buffer.frameLength = periodLength // key to success: this will append silence to the sound
    return buffer
}
// player is an instance of AVAudioPlayerNode within your AVAudioEngine
func startLoop() {
    player.stop()
    let buffer = tickBuffer(forBpm: bpm)
    player.scheduleBuffer(buffer, at: nil, options: .loops, completionHandler: nil)
    player.play()
}
I think that one of the possible ways to have sounds played with the lowest possible timing error is to provide audio samples directly via a callback. On iOS you can do this with an AudioUnit.
In this callback you can track the sample count and know which sample you are at. From the sample counter you can go to a time value (using the sample rate) and use it for your high-level tasks like the metronome. If you see that it is time to play a metronome sound, you just start copying audio samples from that sound into the output buffer.
This is the theoretical part without any code, but you can find many examples of the AudioUnit and callback technique.
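As one hedged illustration of the same sample-counting idea (my own sketch, not the answerer's), AVAudioSourceNode (iOS 13+) gives you a render callback in Swift rather than a raw AudioUnit; the bpm, click length, and tone values below are just examples:

import AVFoundation

// Sketch of a sample-counting metronome click using AVAudioSourceNode (iOS 13+).
let engine = AVAudioEngine()
let sampleRate = engine.outputNode.outputFormat(forBus: 0).sampleRate
let bpm = 120.0
let framesPerBeat = AVAudioFramePosition(sampleRate * 60.0 / bpm)
let clickFrames = AVAudioFramePosition(sampleRate * 0.02)   // 20 ms click
var frameCounter: AVAudioFramePosition = 0

let sourceNode = AVAudioSourceNode { _, _, frameCount, audioBufferList in
    let buffers = UnsafeMutableAudioBufferListPointer(audioBufferList)
    for frame in 0..<Int(frameCount) {
        let positionInBeat = frameCounter % framesPerBeat
        // Cheap click: a short burst of an 880 Hz sine at the start of each beat.
        let sample: Float = positionInBeat < clickFrames
            ? Float(sin(Double(positionInBeat) * 2 * .pi * 880 / sampleRate)) * 0.5
            : 0
        for buffer in buffers {
            let ptr = buffer.mData!.assumingMemoryBound(to: Float.self)
            ptr[frame] = sample
        }
        frameCounter += 1
    }
    return noErr
}

engine.attach(sourceNode)
engine.connect(sourceNode, to: engine.mainMixerNode, format: nil)
try? engine.start()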
To expand upon 5hrp's answer:
Take the simple case where you have two beats, an upbeat (tone1) and a downbeat (tone2), and you want them out of phase with each other so the audio will be (up, down, up, down) to a certain bpm.
You will need two instances of AVAudioPlayerNode (one for each beat); let's call them audioNode1 and audioNode2.
You will want the first beat to be in phase, so set it up as normal:
let buffer = tickBuffer(forBpm: bpm)
audioNode1.scheduleBuffer(buffer, at: nil, options: .loops, completionHandler: nil)
Then you want the second beat to be exactly out of phase, i.e. to start half a beat later. For this you can use an AVAudioTime variable:
audioTime2 = AVAudioTime(sampleTime: AVAudioFramePosition(AVAudioFrameCount(audioFile2.processingFormat.sampleRate * 60 / Double(bpm) * 0.5)), atRate: Double(1))
You can use this variable when scheduling the buffer, like so:
audioNode2.scheduleBuffer(buffer, at: audioTime2, options: .loops, completionHandler: nil)
This will play your two beats on a loop, half a beat out of phase with each other!
It's easy to see how to generalise this to more beats and create a whole bar. It's not the most elegant solution, though, because if you want to, say, do 16th notes you'd have to create 16 nodes.
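For completeness, here is a rough sketch (my own, with placeholder file and node names) of the engine plumbing these snippets assume; after this you would schedule the buffers as shown above and call play() on both nodes:

import AVFoundation

// Sketch of the surrounding setup; tone1.wav and tone2.wav are placeholder resources.
func makeMetronome() throws -> (AVAudioEngine, AVAudioPlayerNode, AVAudioPlayerNode) {
    let engine = AVAudioEngine()
    let audioNode1 = AVAudioPlayerNode()
    let audioNode2 = AVAudioPlayerNode()

    let audioFile1 = try AVAudioFile(forReading: Bundle.main.url(forResource: "tone1", withExtension: "wav")!)
    let audioFile2 = try AVAudioFile(forReading: Bundle.main.url(forResource: "tone2", withExtension: "wav")!)

    engine.attach(audioNode1)
    engine.attach(audioNode2)
    engine.connect(audioNode1, to: engine.mainMixerNode, format: audioFile1.processingFormat)
    engine.connect(audioNode2, to: engine.mainMixerNode, format: audioFile2.processingFormat)
    try engine.start()

    return (engine, audioNode1, audioNode2)
}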
I use ffmpeg to decode a video/audio stream and PortAudio to play the audio. I have run into a sync problem with PortAudio. I have a function like the one below:
double AudioPlayer::getPlaySec() const
{
    double const latency = Pa_GetStreamInfo( mPaStream )->outputLatency;
    double const bytesPerSec = mSampleRate * Pa_GetSampleSize( mSampleFormat ) * mChannel;
    double const playtime = mConsumedBytes / bytesPerSec;
    return playtime - latency;
}
mConsumedBytes is the byte count written to the audio device in the PortAudio callback function. I thought I could derive the playback time from that byte count. However, when I run another process (like opening Firefox) that makes the CPU busy, the audio becomes intermittent, but the callback doesn't stop, so mConsumedBytes grows beyond what has actually been played and getPlaySec returns a time larger than the real playing time.
I have no idea why this happens. Any suggestion is welcome. Thanks!
Latency in PortAudio is defined a bit vaguely: something like the average time between when you put data into the buffer and when you can expect it to play. That's not something you want to use for this purpose.
Instead, to find the current playback time of the device, you can actually poll the device using the Pa_GetStreamTime function.
You may want to see this document for more detailed info.
I know this is old, but still: PortAudio v19+ can report its own (actual) sample rate. You should use that for audio sync, since the actual playback sample rate can differ between hardware, and PortAudio might try to compensate (depending on the implementation). If you have drift problems, try using that.