I'd like my iOS app to use text-to-speech to read to the user some information that it receives from a server, and I'd also like to let the user stop the speech with a voice command. I have tried speech recognition frameworks for iOS such as OpenEars, and the problem is that the recognizer picks up what the app itself is "saying", which interferes with the recognition of the user's voice commands.
Has anybody dealt with this scenario on iOS and found a solution? Thanks in advance.
It is not a trivial thing to implement. Unfortunately, on iOS and other platforms the microphone records the sound that is playing through the speaker. The only simple option is to use a headset; in that case speech recognition can keep listening for input. In OpenEars, recognition is disabled during TTS unless a headset is plugged in.
If you still want to implement this feature, which is called "barge-in", you have to do the following:
Store the audio you play through the speaker.
Implement a noise cancellation algorithm that effectively removes that audio from the recording. You can use cross-correlation to find the proper offset in the recording and spectral subtraction to remove the audio.
Recognize the speech in the remaining signal.
It is not possible to do that without significant modification of the OpenEars sources.
A related question is "Android Speech Recognition while music is playing".
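The alignment and removal steps above can be sketched roughly as follows. This is only an illustration in plain Swift, not OpenEars code: the function names `bestLag` and `removePlayback` are hypothetical, the cross-correlation is brute force, and the removal is a simple time-domain subtraction with a least-squares gain standing in for real spectral subtraction, which would work on short-time spectra instead.

```swift
// Finds the lag (in samples) at which `reference` (the audio the app played)
// lines up best inside `recording` (the microphone signal), using a
// brute-force cross-correlation over lags 0...maxLag. Requires maxLag >= 0.
func bestLag(recording: [Float], reference: [Float], maxLag: Int) -> Int {
    var best = 0
    var bestScore = -Float.greatestFiniteMagnitude
    for lag in 0...maxLag {
        var score: Float = 0
        for i in 0..<reference.count where i + lag < recording.count {
            score += recording[i + lag] * reference[i]
        }
        if score > bestScore {
            bestScore = score
            best = lag
        }
    }
    return best
}

// Subtracts the aligned reference from the recording. The gain is the
// least-squares estimate of how loudly the reference appears in the recording.
// Simplification: time-domain subtraction, not spectral subtraction.
func removePlayback(from recording: [Float], reference: [Float], lag: Int) -> [Float] {
    var dot: Float = 0
    var energy: Float = 0
    for i in 0..<reference.count where i + lag < recording.count {
        dot += recording[i + lag] * reference[i]
        energy += reference[i] * reference[i]
    }
    let gain = energy > 0 ? dot / energy : 0
    var cleaned = recording
    for i in 0..<reference.count where i + lag < cleaned.count {
        cleaned[i + lag] -= gain * reference[i]
    }
    return cleaned
}
```

In practice the speaker-to-microphone path also filters the signal (room echo, speaker response), so a plain subtraction like this leaves residue; that is why the answer points to spectral subtraction for the removal step.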
I'm new to iOS programming and I don't know where to start. I found code examples showing how to read frequencies from the microphone with the AudioKit framework, but this is not what I am looking for. Is it possible to retrieve the frequency of the currently playing song in real time without using a microphone?
Thank you for help.
The iOS security sandbox prevents apps from capturing general audio output of any other app, such as the Music app.
Certain music apps, such as GarageBand might share inter-app audio, but this isn't supported by the majority of apps that output "songs".
An app might play the "song" itself, via an AVAudioPlayer, and tap the AVPlayer's output to get raw sample data for spectral frequency and pitch analysis (two very different things, by the way).
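As a rough sketch of the "play it yourself and tap the output" idea: the example below swaps in AVAudioEngine with an AVAudioPlayerNode rather than AVAudioPlayer or AVPlayer, because the engine's mixer node can be tapped directly with installTap. The class name `TappedSongPlayer` and the `onSamples` callback are hypothetical, and the file URL is assumed to point at an audio file the app is allowed to read.

```swift
import AVFoundation

// Plays a local audio file through AVAudioEngine and hands the raw samples
// of each rendered buffer to a callback, with no microphone involved.
final class TappedSongPlayer {
    private let engine = AVAudioEngine()
    private let player = AVAudioPlayerNode()

    func play(fileURL: URL, onSamples: @escaping ([Float]) -> Void) throws {
        let file = try AVAudioFile(forReading: fileURL)

        engine.attach(player)
        engine.connect(player, to: engine.mainMixerNode, format: file.processingFormat)

        // The tap receives what the mixer outputs. Only the first channel is
        // copied here; feed these samples to an FFT or a pitch detector.
        engine.mainMixerNode.installTap(onBus: 0, bufferSize: 4096, format: nil) { buffer, _ in
            guard let channel = buffer.floatChannelData?[0] else { return }
            let samples = Array(UnsafeBufferPointer(start: channel,
                                                    count: Int(buffer.frameLength)))
            onSamples(samples)
        }

        try engine.start()
        player.scheduleFile(file, at: nil, completionHandler: nil)
        player.play()
    }

    func stop() {
        player.stop()
        engine.mainMixerNode.removeTap(onBus: 0)
        engine.stop()
    }
}
```

This only works for audio the app plays itself; as noted above, the sandbox does not let an app tap the output of the Music app or other apps.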
I am developing a karaoke app in which you can record your voice while listening to the music. When the user uses headphones, everything is great: he can hear the music and himself in the headphones while singing. Then we have his clean voice recorded and we can mix it with the playback.
The problem occurs when the user does not use headphones. Then we play the music through the speakers (with AVAudioSessionCategoryPlayAndRecord) and record simultaneously. In the final recording we have the user's voice and the playback from the speakers mixed together. The problem is that the playback is very loud and it "covers" the user's voice. At first I thought this was normal behaviour, because the speakers are close to the microphone, so there was nothing I could do.
However, when I tried the same thing in GarageBand, it somehow lowers the playback from the speakers, making the voice more audible.
I also tried it with Instagram (you can record while playing music, e.g. from Spotify) and I noticed that after about 1 second the playback's volume decreases and the voice can be heard more clearly.
I don't think it's post-processing, because that would be very complicated, so maybe there is an option to let "iOS handle it".
To be clear: it does not lower the playback during recording. The effect is "done" in what you hear when listening to the final video.
I use AVCaptureSession for recording and AudioKit Player for playing.
Thanks in advance for any thoughts, tips, or advice!
Regards
OK, so I asked Apple technical support and the response was exactly what I wanted: https://developer.apple.com/documentation/avfoundation/avaudiosession/mode/1616455-voicechat You just have to set this mode on AVAudioSession and the system will handle it. In this mode the device's tonal equalization is optimized for voice.
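A minimal sketch of setting that mode is below. The function name `configureSessionForVoice` is hypothetical, and the `.defaultToSpeaker` option is an assumption added for the no-headphones karaoke case; it is not something the Apple response above is quoted as requiring.

```swift
import AVFoundation

// Configures the shared audio session for simultaneous playback and recording
// with the voiceChat mode, which lets the system apply its voice processing.
func configureSessionForVoice() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord,
                            mode: .voiceChat,
                            options: [.defaultToSpeaker]) // assumption: play through the loudspeaker
    try session.setActive(true)
}
```

Call this before starting the capture session and the player, so that both run under the configured session.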
iOS cannot "just handle" that; there is no "filter out the music" function. The fact that it doesn't happen live, but only later or with a delay, strongly implies they are doing some post-processing. I'm not a machine learning expert, but I think if you just used an equalizer and a noise gate you could get this effect. It would be hard to extract an a cappella, but you could certainly improve the recording. Likely Instagram takes that second to identify where the voice frequencies are so it knows how to EQ the signal.
I'm developing an iOS app that does voice based AI; i.e. it's meant to take voice input from the microphone, turn it into text, send it to an AI agent, then output the returned text through the speaker. I've got everything working, though using a button to start and stop recording the speech (SpeechKit for voice recognition, API.AI for the AI, Amazon's Polly for the output).
The piece that I need is to have the microphone always on and to automatically start and stop the recording of the user's voice as they begin and end talking. This app is being developed for an unorthodox context, where there will be no access to the screen for the user (but they will have a high-end shotgun mic for recording their text).
My research suggests this piece of the puzzle is known as 'Voice Activity Detection' and seems to be one of the hardest steps in the whole voice-based AI system.
I'm hoping someone can either supply some straightforward (Swift) code to implement this myself, or point me in the direction of some decent libraries / SDKs that I can implement in this project.
For a good VAD implementation you can use py-webrtcvad.
It is a Python interface to C code, so you can just import the C files from the project and use them from Swift.
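To show the start/stop logic that any VAD feeds into, here is a much simpler stand-in written in plain Swift. It is not the WebRTC algorithm: it is a naive energy threshold with a "hangover" counter, the type name `EnergyVAD` is hypothetical, and the default threshold and frame counts are assumed values that would need tuning for the actual microphone.

```swift
// Naive energy-based voice activity detector. Feed it fixed-size frames of
// samples in the range -1...1; it reports whether the user is currently talking.
struct EnergyVAD {
    let threshold: Float      // RMS level above which a frame counts as speech (assumed, tune it)
    let hangoverFrames: Int   // quiet frames in a row required before speech counts as ended
    private(set) var isSpeaking = false
    private var quietFrames = 0

    init(threshold: Float = 0.02, hangoverFrames: Int = 25) {
        self.threshold = threshold
        self.hangoverFrames = hangoverFrames
    }

    // Processes one frame and returns the updated speaking state.
    mutating func process(frame: [Float]) -> Bool {
        let sumOfSquares = frame.reduce(Float(0)) { $0 + $1 * $1 }
        let rms = frame.isEmpty ? 0 : (sumOfSquares / Float(frame.count)).squareRoot()

        if rms >= threshold {
            // Loud frame: speech has started or is continuing.
            isSpeaking = true
            quietFrames = 0
        } else if isSpeaking {
            // Quiet frame during speech: end only after enough of them in a row,
            // so short pauses between words do not stop the recording.
            quietFrames += 1
            if quietFrames >= hangoverFrames {
                isSpeaking = false
                quietFrames = 0
            }
        }
        return isSpeaking
    }
}
```

A transition of the returned value from false to true would start the recognition request, and a transition back to false would end it. A pure energy threshold reacts to any loud sound, which is the weakness a model-based detector such as the WebRTC VAD is meant to address.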
I understand that this question might get a bad rating, but I've been looking at questions which ask how to reroute audio output to the loud speaker on iOS devices.
In every question I looked at, the answers talked about using AVAudioSession to reroute it. However, I'm not using AVAudioSession, I'm using an AVAudioEngine.
So basically my question is, even though I'm using an AVAudioEngine, should I still have an AVAudioSession?
If so, what is the relationship between these two objects? Or is there a way to connect an AVAudioEngine to an AVAudioSession?
If this is not the case, and there is no relation between an AVAudioEngine and an AVAudioSession, then how do you reroute audio so that it plays out of the main speakers on an iOS device rather than the earpiece?
Thank you!
AVAudioSession is specific to iOS and coordinates audio playback between apps, so that, for example, audio is stopped when a call comes in, or music playback stops when the user starts a movie. This API is needed to make sure an app behaves correctly in response to such events
AVAudioEngine is a modern Objective-C API for playback and recording. It provides a level of control for which you previously had to drop down to the C APIs of the Audio Toolbox framework (for example, with real-time audio tasks). The audio engine APIs are built to interface well with lower-level APIs, so you can still drop down to Audio Toolbox if you have to.
The basic concept of this API is to build up a graph of audio nodes, ranging from source nodes (players and microphones) through processing nodes (mixers and effects) to destination nodes (hardware outputs). Each node has a certain number of input and output busses with well-defined data formats. This architecture makes it very flexible and powerful. And it even integrates with audio units.
So neither object contains the other: the audio session configures how the app's audio behaves system-wide, while the engine builds and runs the processing graph.
Source Link : https://www.objc.io/issues/24-audio/audio-api-overview/
Yes, it is not clearly documented; however, I found this note in the iOS developer documentation:
AVFoundation playback and recording classes automatically activate your audio session.
Document Link : https://developer.apple.com/library/content/documentation/Audio/Conceptual/AudioSessionProgrammingGuide/ConfiguringanAudioSession/ConfiguringanAudioSession.html
I hope this may help you.
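For the rerouting part of the question, a sketch under the assumption that the app records and plays at the same time: the route is chosen on the shared AVAudioSession, and the AVAudioEngine then simply runs under that session. The function name `routePlayAndRecordToSpeaker` is hypothetical.

```swift
import AVFoundation

// Configures the shared audio session so that play-and-record output goes to
// the main loudspeaker instead of the earpiece (receiver).
// Call this before starting your AVAudioEngine.
func routePlayAndRecordToSpeaker() throws {
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playAndRecord,
                            mode: .default,
                            options: [.defaultToSpeaker])
    try session.setActive(true)
}
```

After this call, build the engine graph and call `engine.start()` as usual; the engine itself needs no reference to the session object.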
I have made an app that uses the OpenEars framework to read out some text. I haven't used any of OpenEars' speech recognition features, just the text-to-speech feature. My app got rejected by Apple, saying that the app asks for permission to use the microphone while it doesn't have any features of that kind. The following is the rejection message from Apple:
During review we were prompted to provide consent to use the microphone, however, we were not able to find any features or functionality that use the microphone for audio recording.
The microphone consent request is generated by the use of either AVAudioSessionCategoryRecord or AVAudioSessionCategoryPlayAndRecord audio categories.
If you do not intend to record audio with your application, it would be appropriate to choose the AVAudioSession session category that fits your application's needs or modify your app to include audio-recording features.
For more information, please refer to the Security section of the iOS SDK Release Notes for iOS 7 GM Seed.
I have searched the app for AVAudioSessionCategoryRecord or AVAudioSessionCategoryPlayAndRecord audio categories as mentioned in the message but couldn't find any. How can I disable the prompting for permission to use microphone?
Your application got rejected because it asks for microphone access without needing it. OpenEars by default interfaces with the microphone, which is why the permission prompt came up. These permission prompts cannot be suppressed, because Apple tightened its security features so that users have more control over what their applications are allowed to do. If you have to use OpenEars, see Update 1; otherwise read on for a different solution using Apple's speech synthesizer on iOS 7.
In your case, if all you want to do is read out some text, then you can use the iOS 7 speech synthesizer (AVSpeechSynthesizer), which is the same synthesizer used to create Siri's voice.
It's SO easy to set up, and I am currently using it in one of my projects to interact with the user via voice. Here's a quick tutorial on how to get it all set up:
Speech synthesizer tutorial
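In case the tutorial link is unavailable, here is a minimal sketch of the synthesizer in Swift. The class name `TextReader` is hypothetical and the "en-US" voice is an assumption; the AVFoundation calls themselves are the standard ones.

```swift
import AVFoundation

// Reads text aloud with the built-in speech synthesizer (available since iOS 7).
// Playback only, so no microphone permission prompt is involved.
final class TextReader {
    // Keep a strong reference; speech stops if the synthesizer is deallocated.
    private let synthesizer = AVSpeechSynthesizer()

    func speak(_ text: String) {
        let utterance = AVSpeechUtterance(string: text)
        utterance.voice = AVSpeechSynthesisVoice(language: "en-US") // assumed language
        utterance.rate = AVSpeechUtteranceDefaultSpeechRate
        synthesizer.speak(utterance)
    }

    func stop() {
        // The returned Bool says whether speech was actually stopped; ignored here.
        _ = synthesizer.stopSpeaking(at: .immediate)
    }
}
```

Usage would be along the lines of `let reader = TextReader()` stored in a property, followed by `reader.speak("Hello")`.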
UPDATE 1
After #halle's comment, I decided to update the post for those that have to use the OpenEars framework who will be using only the FliteController Text To Speech feature without any sort of OpenEars speech recognition.
You can set the FliteController property noAudioSessionOverrides to TRUE so that you ensure that OpenEars wont interface with the Audio recording stream and this will stop the Microphone permissions alert from popping up.
[self.fliteController setNoAudioSessionOverrides:TRUE];
UPDATE 2
Based on #Halle's comment, you no longer need to do update 1:
Just an update that starting with today's update 1.65, FliteController won't ever make audio session calls on its own, so there is no further rejection danger here and it isn't necessary to set noAudioSessionOverrides.
I'm sorry your app was rejected. To use TTS only without any of the audio session management related to speech recognition in OpenEars, set FliteController's property noAudioSessionOverrides to TRUE. This will result in no audio session changes/no use of the mic stream.
I'll see if I can make the documentation for this setting a bit more prominent for developers doing TTS with OpenEars' FliteController only.
For completeness' sake, the documentation on how to greatly reduce your app binary size when using OpenEars, since that was also an issue for you:
http://www.politepix.com/forums/topic/slimming-down-your-app/
http://www.politepix.com/openears/support/#Q_How_can_I_trim_down_the_size_of_the_final_binary_for_distribution
Edit: starting with today's version 1.65 of OpenEars and its plugins, if you just use FliteController there is no danger of rejection because the TTS classes no longer make any calls to the audio session by themselves. Thanks for the heads-up about this and, again, sorry you had a rejection due to this.