SFSpeechRecognizer - detect end of utterance - ios

I am hacking on a small project using iOS 10's built-in speech recognition. It works with the device's microphone, and my speech is recognized very accurately.
My problem is that the recognition task callback is called for every available partial transcription. I want it to detect that the person has stopped talking and then call the callback with the isFinal property set to true. That never happens: the app keeps listening indefinitely.
Is SFSpeechRecognizer capable of detecting the end of a sentence at all?
Here's my code. It is based on an example found on the Internet and is mostly the boilerplate needed to recognize speech from the microphone.
I modified it by adding a recognition taskHint. I also set shouldReportPartialResults to false, but it seems to have been ignored.
func startRecording() {
    if recognitionTask != nil {
        recognitionTask?.cancel()
        recognitionTask = nil
    }
    let audioSession = AVAudioSession.sharedInstance()
    do {
        try audioSession.setCategory(AVAudioSessionCategoryRecord)
        try audioSession.setMode(AVAudioSessionModeMeasurement)
        try audioSession.setActive(true, with: .notifyOthersOnDeactivation)
    } catch {
        print("audioSession properties weren't set because of an error.")
    }
    recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
    recognitionRequest?.shouldReportPartialResults = false
    recognitionRequest?.taskHint = .search
    guard let inputNode = audioEngine.inputNode else {
        fatalError("Audio engine has no input node")
    }
    guard let recognitionRequest = recognitionRequest else {
        fatalError("Unable to create an SFSpeechAudioBufferRecognitionRequest object")
    }
    // Note: this overrides the `false` assigned above, so partial results are still reported.
    recognitionRequest.shouldReportPartialResults = true
    recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest, resultHandler: { (result, error) in
        var isFinal = false
        if result != nil {
            print("RECOGNIZED \(result?.bestTranscription.formattedString)")
            self.transcriptLabel.text = result?.bestTranscription.formattedString
            isFinal = (result?.isFinal)!
        }
        if error != nil || isFinal {
            self.state = .Idle
            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)
            self.recognitionRequest = nil
            self.recognitionTask = nil
            self.micButton.isEnabled = true
            self.say(text: "OK. Let me see.")
        }
    })
    let recordingFormat = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, when) in
        self.recognitionRequest?.append(buffer)
    }
    audioEngine.prepare()
    do {
        try audioEngine.start()
    } catch {
        print("audioEngine couldn't start because of an error.")
    }
    transcriptLabel.text = "Say something, I'm listening!"
    state = .Listening
}

It seems that the isFinal flag doesn't become true when the user stops talking, as you might expect. I guess this is intended behaviour on Apple's part, because "the user stops talking" is not a well-defined event.
I believe the easiest way to achieve your goal is the following:
Establish an "interval of silence": if the user doesn't talk for longer than that interval (e.g. 2 seconds), treat them as having stopped talking.
Create a Timer at the beginning of the audio session:
var timer = Timer.scheduledTimer(timeInterval: 2, target: self, selector: #selector(didFinishTalk), userInfo: nil, repeats: false)
When you get a new transcription in the recognitionTask callback, invalidate and restart your timer:
timer.invalidate()
timer = Timer.scheduledTimer(timeInterval: 2, target: self, selector: #selector(didFinishTalk), userInfo: nil, repeats: false)
If the timer expires, the user hasn't spoken for 2 seconds. You can safely stop the audio session and exit.
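The timer approach described above can be wrapped in a small helper so the invalidate-and-restart logic lives in one place. This is only a minimal sketch: SilenceTimer, onSilence, restart() and cancel() are hypothetical names, not part of the Speech framework, and it assumes restart() is called on a thread with a running run loop (normally the main thread).

```swift
import Foundation

/// Hypothetical helper: runs `onSilence` once no activity has been
/// reported for `interval` seconds.
final class SilenceTimer {
    private let interval: TimeInterval
    private let onSilence: () -> Void
    private var timer: Timer?

    init(interval: TimeInterval, onSilence: @escaping () -> Void) {
        self.interval = interval
        self.onSilence = onSilence
    }

    /// Call when listening starts and again for every new transcription.
    func restart() {
        timer?.invalidate()
        timer = Timer.scheduledTimer(withTimeInterval: interval, repeats: false) { [weak self] _ in
            self?.timer = nil
            self?.onSilence()
        }
    }

    /// Call when recognition ends for any other reason.
    func cancel() {
        timer?.invalidate()
        timer = nil
    }
}
```

In the recognition callback you would call restart() whenever a result arrives. Inside onSilence, one option is to call endAudio() on the recognition request, which should prompt the recognizer to finish and deliver a final result, rather than tearing the session down by hand.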

Based on my test on iOS10, when shouldReportPartialResults is set to false, you have to wait 60 seconds to get the result.

I am using Speech to text in an app currently and it is working fine for me. My recognitionTask block is as follows:
recognitionTask = speechRecognizer?.recognitionTask(with: recognitionRequest, resultHandler: { (result, error) in
    var isFinal = false
    if let result = result, result.isFinal {
        print("Result: \(result.bestTranscription.formattedString)")
        isFinal = true
        completion(result.bestTranscription.formattedString, nil)
    }
    if error != nil || isFinal {
        self.audioEngine.stop()
        inputNode.removeTap(onBus: 0)
        self.recognitionRequest = nil
        self.recognitionTask = nil
        // Report only the error case here; the final transcription was already delivered above.
        if !isFinal {
            completion(nil, error)
        }
    }
})

if let result = result {
    // Restart the silence timer every time a new (partial) result arrives.
    self.timerDidFinishTalk.invalidate()
    self.timerDidFinishTalk = Timer.scheduledTimer(timeInterval: TimeInterval(self.listeningTime), target: self, selector: #selector(self.didFinishTalk), userInfo: nil, repeats: false)
    let bestString = result.bestTranscription.formattedString
    self.fullsTring = bestString.trimmingCharacters(in: .whitespaces)
    self.st = self.fullsTring
}
Here self.listeningTime is how long to wait, after the last result arrives, before treating the utterance as finished.

I have a different approach that I find far more reliable for determining when the recognitionTask is done guessing: the confidence score.
When shouldReportPartialResults is set to true, the partial results have a confidence score of 0.0. Only the final guess comes back with a score above 0.
recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
    if let result = result {
        // segments can be empty, so read the first segment safely instead of indexing [0]
        let confidence = result.bestTranscription.segments.first?.confidence ?? 0.0
        print(confidence)
        self.transcript = result.bestTranscription.formattedString
    }
}
The segments array above contains each word in the transcription. The first segment is the safest one to examine, so I tend to use that one.
How you use it is up to you, but if all you want is to know when the guesser is done guessing, you can just write:
let myIsFinal = confidence > 0.0
You can also look at the score (1.0 is totally confident) and sort responses into groups from low to high confidence if that helps your application.
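Grouping results by confidence could look something like the sketch below. It assumes the score is a Float between 0.0 and 1.0, as Apple documents for SFTranscriptionSegment.confidence. ConfidenceLevel, confidenceLevel(for:) and the 0.5 / 0.8 cut-offs are hypothetical and would need tuning for a real app.

```swift
/// Hypothetical buckets; the 0.5 and 0.8 cut-offs are illustrative only.
enum ConfidenceLevel {
    case pending  // score is 0.0: still a partial result
    case low
    case medium
    case high
}

/// Maps a segment confidence (assumed to be in 0.0...1.0) to a bucket.
func confidenceLevel(for confidence: Float) -> ConfidenceLevel {
    if confidence <= 0.0 { return .pending }
    if confidence < 0.5 { return .low }
    if confidence < 0.8 { return .medium }
    return .high
}
```

Inside the result handler you would pass the first segment's confidence to confidenceLevel(for:) and treat anything other than .pending as a final guess.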

Related

SwiftUI: Speech recognition works on iPhone but crashes on iPad

I don't know why the app works on iPhone but crashes on iPad. I am building a speech-to-text feature.
This is my speech-to-text code:
func StartRecording() -> String {
    // Configure the audio session for the app.
    let audioSession = AVAudioSession.sharedInstance()
    try! audioSession.setCategory(.record, mode: .measurement, options: .duckOthers)
    try! audioSession.setActive(true, options: .notifyOthersOnDeactivation)
    let inputNode = audioEngine.inputNode
    //
    let recordingFormat = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
        self.recognitionRequest?.append(buffer)
    }
    audioEngine.prepare()
    try! audioEngine.start()
    // Create and configure the speech recognition request.
    recognitionRequest = SFSpeechAudioBufferRecognitionRequest()
    guard let recognitionRequest = recognitionRequest else { fatalError("Unable to create a SFSpeechAudioBufferRecognitionRequest object") }
    recognitionRequest.shouldReportPartialResults = true
    // Create a recognition task for the speech recognition session.
    // Keep a reference to the task so that it can be canceled.
    recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest) { result, error in
        var isFinal = false
        if let result = result {
            // Update the text view with the results.
            self.recognizedText = result.bestTranscription.formattedString
            isFinal = result.isFinal
        }
        if error != nil || isFinal {
            // Stop recognizing speech if there is a problem.
            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)
            self.recognitionRequest = nil
            self.recognitionTask = nil
        }
    }
    return recognizedText
}
The app works fine on iPhone but not on iPad.
This is the error I get when I try to run speech recognition on the iPad simulator:
Terminating app due to uncaught exception 'com.apple.coreaudio.avfaudio', reason: 'required condition is false: format.sampleRate == hwFormat.sampleRate'
What is causing the crash, and how can I fix it?

Tap audio output using AVAudioEngine

I'm trying to install a tap on the audio output played by my app. I have no problem capturing buffers from the microphone input, but I can't capture the sound that goes through the speaker, the earpiece, or whatever the output device is. Am I missing something?
In my example I'm trying to capture the audio buffers from an audio file that an AVPlayer is playing. But let's pretend I don't have direct access to the AVPlayer instance.
The goal is to perform Speech Recognition on an audio stream.
func catchAudioBuffers() throws {
    let audioSession = AVAudioSession.sharedInstance()
    try audioSession.setCategory(.playAndRecord, mode: .voiceChat, options: .allowBluetooth)
    try audioSession.setActive(true)
    let outputNode = audioEngine.outputNode
    let recordingFormat = outputNode.outputFormat(forBus: 0)
    outputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer: AVAudioPCMBuffer, when: AVAudioTime) in
        // PROCESS AUDIO BUFFER
    }
    audioEngine.prepare()
    try audioEngine.start()
    // For example I am playing an audio conversation with an AVPlayer and a local file.
    player.playSound()
}
This code results in:
AVAEInternal.h:76 required condition is false: [AVAudioIONodeImpl.mm:1057:SetOutputFormat: (_isInput)]
*** Terminating app due to uncaught exception 'com.apple.coreaudio.avfaudio', reason: 'required condition is false: _isInput'
I was facing the same problem and, after 2 days of brainstorming, found the following.
Apple says that for AVAudioOutputNode, the tap format must be specified as nil. I'm not sure how important that is, but in the version that finally worked for me, the format was nil.
You need to start recording, and don't forget to stop it.
Removing the tap is really important; otherwise you will end up with a file that you can't open.
Try to save the file with the same audio settings that were used in the source file.
Here's the code that finally worked for me. It was partly taken from the question "Saving Audio After Effect in iOS".
func playSound() {
    let rate: Float? = effect.speed
    let pitch: Float? = effect.pitch
    let echo: Bool? = effect.echo
    let reverb: Bool? = effect.reverb
    // initialize audio engine components
    audioEngine = AVAudioEngine()
    // node for playing audio
    audioPlayerNode = AVAudioPlayerNode()
    audioEngine.attach(audioPlayerNode)
    // node for adjusting rate/pitch
    let changeRatePitchNode = AVAudioUnitTimePitch()
    if let pitch = pitch {
        changeRatePitchNode.pitch = pitch
    }
    if let rate = rate {
        changeRatePitchNode.rate = rate
    }
    audioEngine.attach(changeRatePitchNode)
    // node for echo
    let echoNode = AVAudioUnitDistortion()
    echoNode.loadFactoryPreset(.multiEcho1)
    audioEngine.attach(echoNode)
    // node for reverb
    let reverbNode = AVAudioUnitReverb()
    reverbNode.loadFactoryPreset(.cathedral)
    reverbNode.wetDryMix = 50
    audioEngine.attach(reverbNode)
    // connect nodes
    if echo == true && reverb == true {
        connectAudioNodes(audioPlayerNode, changeRatePitchNode, echoNode, reverbNode, audioEngine.mainMixerNode, audioEngine.outputNode)
    } else if echo == true {
        connectAudioNodes(audioPlayerNode, changeRatePitchNode, echoNode, audioEngine.mainMixerNode, audioEngine.outputNode)
    } else if reverb == true {
        connectAudioNodes(audioPlayerNode, changeRatePitchNode, reverbNode, audioEngine.mainMixerNode, audioEngine.outputNode)
    } else {
        connectAudioNodes(audioPlayerNode, changeRatePitchNode, audioEngine.mainMixerNode, audioEngine.outputNode)
    }
    // schedule to play and start the engine!
    audioPlayerNode.stop()
    audioPlayerNode.scheduleFile(audioFile, at: nil) {
        var delayInSeconds: Double = 0
        if let lastRenderTime = self.audioPlayerNode.lastRenderTime, let playerTime = self.audioPlayerNode.playerTime(forNodeTime: lastRenderTime) {
            if let rate = rate {
                delayInSeconds = Double(self.audioFile.length - playerTime.sampleTime) / Double(self.audioFile.processingFormat.sampleRate) / Double(rate)
            } else {
                delayInSeconds = Double(self.audioFile.length - playerTime.sampleTime) / Double(self.audioFile.processingFormat.sampleRate)
            }
        }
        // schedule a stop timer for when audio finishes playing
        self.stopTimer = Timer(timeInterval: delayInSeconds, target: self, selector: #selector(EditViewController.stopAudio), userInfo: nil, repeats: false)
        RunLoop.main.add(self.stopTimer!, forMode: RunLoop.Mode.default)
    }
    do {
        try audioEngine.start()
    } catch {
        showAlert(Alerts.AudioEngineError, message: String(describing: error))
        return
    }
    // Try to save
    let dirPaths: String = (NSSearchPathForDirectoriesInDomains(.libraryDirectory, .userDomainMask, true)[0]) + "/sounds/"
    let tmpFileUrl = URL(fileURLWithPath: dirPaths + "effected.caf")
    // Save tmpFileUrl into a global variable so it isn't lost (not important if you want to do something else)
    filteredOutputURL = tmpFileUrl
    do {
        print(dirPaths)
        let settings = [AVSampleRateKey : NSNumber(value: Float(44100.0)),
                        AVFormatIDKey : NSNumber(value: Int32(kAudioFormatMPEG4AAC)),
                        AVNumberOfChannelsKey : NSNumber(value: 1),
                        AVEncoderAudioQualityKey : NSNumber(value: Int32(AVAudioQuality.medium.rawValue))]
        self.newAudio = try! AVAudioFile(forWriting: tmpFileUrl as URL, settings: settings)
        let length = self.audioFile.length
        audioEngine.mainMixerNode.installTap(onBus: 0, bufferSize: 4096, format: nil) {
            (buffer: AVAudioPCMBuffer?, time: AVAudioTime!) -> Void in
            // Let us know when to stop saving the file, otherwise saving infinitely
            if (self.newAudio.length) <= length {
                do {
                    try self.newAudio.write(from: buffer!)
                } catch _ {
                    print("Problem Writing Buffer")
                }
            } else {
                // if we don't remove it, it will keep on tapping infinitely
                self.audioEngine.mainMixerNode.removeTap(onBus: 0)
            }
        }
    }
    // play the recording!
    audioPlayerNode.play()
}
@objc func stopAudio() {
    if let audioPlayerNode = audioPlayerNode {
        let engine = audioEngine
        audioPlayerNode.stop()
        engine?.mainMixerNode.removeTap(onBus: 0)
    }
    if let stopTimer = stopTimer {
        stopTimer.invalidate()
    }
    configureUI(.notPlaying)
    if let audioEngine = audioEngine {
        audioEngine.stop()
        audioEngine.reset()
    }
    isPlaying = false
}

How to get frequency of sound from AVAudioEngine

I'm implementing a speech recognition module for an app. It works fine, but there are some additional things I need to do. For example, I need to know whether a user is speaking or shouting. I know I can achieve that by knowing the frequency of the sound. Here is how I implemented it:
let audioEngine = AVAudioEngine()
let speechRecognizer: SFSpeechRecognizer? = SFSpeechRecognizer()
let request = SFSpeechAudioBufferRecognitionRequest()
var recognitionTask = SFSpeechRecognitionTask()
func recordAndRecognizeSpeech() {
    let node = audioEngine.inputNode
    let recordingFormat = node.outputFormat(forBus: 0)
    node.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, _) in
        self.request.append(buffer)
    }
    audioEngine.prepare()
    do {
        try audioEngine.start()
    } catch {
        return print(error)
    }
    guard let myRecoginizer = SFSpeechRecognizer() else {
        return
    }
    if !myRecoginizer.isAvailable {
        return
    }
    recognitionTask = (speechRecognizer?.recognitionTask(with: request, resultHandler: { (result, error) in
        // Handling speech recognition tasks here
    }))!
}
This works fine for the speech recognition, but how can I get the frequency or amplitude value of the sound?

Getting crash while continuing conversion on speech framework ios 10

I am getting a crash when I continue talking while using Speech.framework. The crash below suggests that the AVAudioEngine is becoming NULL.
*** Terminating app due to uncaught exception 'com.apple.coreaudio.avfaudio', reason: 'required condition is false: nullptr == Tap()'
This happens because in some cases my audio engine becomes null.
Here is my startRecording function:
func startRecording() {
    if recognizationTask != nil {
        recognizationTask?.cancel()
        recognizationTask = nil
    }
    let audioSession = AVAudioSession.sharedInstance()
    do {
        try audioSession.setCategory(AVAudioSessionCategoryRecord)
        try audioSession.setMode(AVAudioSessionModeSpokenAudio)
        try audioSession.setActive(true, with: .notifyOthersOnDeactivation)
    } catch {
        print("Audio session properties weren't set because of an error.")
    }
    recognizationRequest = SFSpeechAudioBufferRecognitionRequest()
    guard let inputNode = audioEngine.inputNode as AVAudioInputNode? else {
        fatalError("Audio engine has no input node")
    }
    guard let recognizationRequest = recognizationRequest else {
        fatalError("Unable to create an SFSpeechAudioBufferRecognitionRequest object.")
    }
    recognizationRequest.shouldReportPartialResults = true
    recognizationTask = speechRecognizer?.recognitionTask(with: recognizationRequest, resultHandler: { (result, error) in
        var isFinal = false
        if result != nil {
            self.txtViewSiriDetecation.text = result?.bestTranscription.formattedString
            isFinal = (result?.isFinal)!
        }
        if error != nil || isFinal {
            self.audioEngine.stop()
            inputNode.removeTap(onBus: 0)
            self.recognizationRequest = nil
            self.recognizationTask = nil
            self.btnSiri.isEnabled = true
        }
    })
    let recordingFormat = inputNode.outputFormat(forBus: 0)
    inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, when) in
        self.recognizationRequest?.append(buffer)
    }
    audioEngine.prepare()
    do {
        try audioEngine.start()
    } catch {
        print("Audio engine couldn't start because of an error.")
    }
    txtViewSiriDetecation.text = "Say something, I'm listening!"
}
How can I overcome this NULL situation?
Can anyone guide me on this?
Thanks in advance!
I had that problem too; adding audioEngine.inputNode.removeTap(onBus: 0) fixed it for me:
fileprivate func stopRecording() {
    recordingMic.isHidden = true
    audioEngine.stop()
    audioEngine.inputNode.removeTap(onBus: 0)
    recognitionRequest?.endAudio()
    recognitionTask?.cancel()
    self.detectSpeechButton.isEnabled = true
    self.detectSpeechButton.setTitle("Detect Speech", for: .normal)
    recordingMic.isHidden = true
    self.textView.isHidden = false
}
A non-nil format is being passed to installTap. This should only be done when attaching to an output bus which is not connected to another node; an error will result otherwise. The tap and connection formats (if non-nil) on the specified bus should be identical. Otherwise, the latter operation will override any previously set format.

Voice recognizer breaks voice synthesizer

I must be overlooking something, but when I try to combine voice synthesis and voice recognition in Swift, I get bad results ("Could not get attribute 'LocalURL': Error Domain=MobileAssetError Code=1 "Unable to copy asset attributes" UserInfo={NSDescription=Unable to copy asset attributes}"). The end result is that speech to text still works afterwards, but text to speech is broken until the app is restarted.
let identifier = "\(Locale.current.languageCode!)_\(Locale.current.regionCode!)" // e.g. en_US
speechRecognizer = SFSpeechRecognizer(locale: Locale.init(identifier: identifier))!
if audioEngine.isRunning {
    audioEngine.stop() // will also stop playing music.
    recognitionRequest?.endAudio()
    speechButton.isEnabled = false
} else {
    recordSpeech() // here we do steps 1 .. 12
}
// recordSpeech() :
if recognitionTask != nil { // Step 1
    recognitionTask?.cancel()
    recognitionTask = nil
}
let audioSession = AVAudioSession.sharedInstance() // Step 2
do {
    try audioSession.setCategory(AVAudioSessionCategoryRecord)
    try audioSession.setMode(AVAudioSessionModeMeasurement)
    try audioSession.setActive(true, with: .notifyOthersOnDeactivation)
} catch {
    print("audioSession properties weren't set because of an error.")
}
recognitionRequest = SFSpeechAudioBufferRecognitionRequest() // Step 3
guard let inputNode = audioEngine.inputNode else {
    fatalError("Audio engine has no input node")
} // Step 4
guard let recognitionRequest = recognitionRequest else {
    fatalError("Unable to create an SFSpeechAudioBufferRecognitionRequest object")
} // Step 5
recognitionRequest.shouldReportPartialResults = true // Step 6
recognitionTask = speechRecognizer.recognitionTask(with: recognitionRequest, resultHandler: { (result, error) in // Step 7
    var isFinal = false // Step 8
    if result != nil {
        print(result?.bestTranscription.formattedString as Any)
        isFinal = (result?.isFinal)!
        if (isFinal) {
            if (result != nil) {
                self.speechOutput.text = self.speechOutput.text + "\n" + (result?.bestTranscription.formattedString)!
            }
        }
    }
    if error != nil || isFinal { // Step 10
        self.audioEngine.stop()
        inputNode.removeTap(onBus: 0)
        self.recognitionRequest = nil
        self.recognitionTask = nil
        self.speechButton.isEnabled = true
    }
})
let recordingFormat = inputNode.outputFormat(forBus: 0) // Step 11
inputNode.installTap(onBus: 0, bufferSize: 1024, format: recordingFormat) { (buffer, when) in
    self.recognitionRequest?.append(buffer)
}
audioEngine.prepare() // Step 12
do {
    try audioEngine.start()
} catch {
    print("audioEngine couldn't start because of an error.")
}
I based my code on this tutorial:
http://www.appcoda.com/siri-speech-framework/
func say(_ something : String, lang : String ) {
    let synth = AVSpeechSynthesizer()
    synth.delegate = self
    print(something) // debug code, works fine
    let identifier = "\(Locale.current.languageCode!)-\(Locale.current.regionCode!)"
    let utterance = AVSpeechUtterance(string: something)
    utterance.voice = AVSpeechSynthesisVoice(language: identifier)
    synth.speak(utterance)
}
So if I use the "say" method on its own, it works well. If I combine the two, the synthesizer stops working after I do speech recognition. Any hints toward a solution? I suppose something is not being gracefully restored to its prior state, but I can't figure out what.
Grrr...
This is the solution. Sorry for not looking closely enough; it cost me a lot of time, though.
func say(_ something : String, lang : String ) {
    let audioSession = AVAudioSession.sharedInstance()
    do {
        // this is the solution:
        try audioSession.setCategory(AVAudioSessionCategoryPlayback)
        try audioSession.setMode(AVAudioSessionModeDefault)
        // the recognizer uses AVAudioSessionCategoryRecord
        // so we want to set it to AVAudioSessionCategoryPlayback
        // again before we can say something
    } catch {
        print("audioSession properties weren't set because of an error.")
    }
    synth = AVSpeechSynthesizer()
    synth.delegate = self
    print(something)
    let identifier = "\(Locale.current.languageCode!)-\(Locale.current.regionCode!)"
    let utterance = AVSpeechUtterance(string: something)
    utterance.voice = AVSpeechSynthesisVoice(language: identifier)
    synth.speak(utterance)
}
