I'm trying to synchronise text in my iOS app to audio that is being streamed simultaneously. The text is a very accurate transcription of the audio that was previously done manually. Is it possible to use keyword spotting or audio-to-text to assist with this?
The text is already indexed in the app with the CLucene search engine, so it'll be very easy to search for any string of text/words in any paragraph of the text. Even if the audio-to-text conversion is not 100% accurate, the search engine should be able to handle it and still find the best match in the text within a couple of tries.
Could you point me to any open source libraries for the audio-to-text conversion that would assist with this? I would prefer one that can convert the streamed audio to text directly and not rely on the microphone, as is common in speech-to-text libraries, since there may be cases where users use headphones with the app and/or there may be background noise.
To recognize an audio file or audio stream on iOS you can use CMUSphinx with OpenEars.
To recognize a file you need to set pathToTestFile; see the PocketsphinxController class reference for details:
http://www.politepix.com/openears/#PocketsphinxController_Class_Reference
To recognize a stream you can feed the audio into Pocketsphinx through the Pocketsphinx API.
Since you know the text beforehand you can create a grammar from it and the recognition will be accurate.
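To make that concrete, here is a rough Swift sketch of the flow, assuming the OpenEars 2.x classes (OELanguageModelGenerator, OEPocketsphinxController, OEAcousticModel) are pulled in through a bridging header. The selector spellings below follow the Objective-C headers and may be bridged into Swift slightly differently depending on the OpenEars version, so treat the names as assumptions to check rather than gospel.

```swift
// Sketch only: OpenEars class and method names are taken from its Objective-C
// headers and may appear in Swift under slightly different spellings.
let words = ["KNOWN", "TRANSCRIPT", "WORDS", "GO", "HERE"]   // vocabulary from your manual transcription
let modelName = "KnownTextModel"                             // hypothetical model name

let acousticModel = OEAcousticModel.path(toModel: "AcousticModelEnglish")
let generator = OELanguageModelGenerator()
let error = generator.generateLanguageModel(from: words,
                                            withFilesNamed: modelName,
                                            forAcousticModelAtPath: acousticModel)

if error == nil {
    let lmPath  = generator.pathToSuccessfullyGeneratedLanguageModel(withRequestedName: modelName)
    let dicPath = generator.pathToSuccessfullyGeneratedDictionary(withRequestedName: modelName)
    let wavPath = Bundle.main.path(forResource: "episode1", ofType: "wav")!  // hypothetical audio file

    // Run recognition over a file instead of the microphone.
    try? OEPocketsphinxController.sharedInstance().setActive(true)
    OEPocketsphinxController.sharedInstance().runRecognitionOnWavFile(atPath: wavPath,
                                                                      usingLanguageModelAtPath: lmPath,
                                                                      dictionaryAtPath: dicPath,
                                                                      acousticModelAtPath: acousticModel,
                                                                      languageModelIsJSGF: false)
}
```

The hypotheses then come back through the OEEventsObserver delegate callbacks, which is where you would hand them to your CLucene index to locate the matching paragraph.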
I've been researching several iOS speech recognition frameworks and have found it hard to accomplish something I would think is pretty straightforward.
I have an app that allows people to record their voices. After a recording is made, they have the option to create a text version.
Looking into the services out there (e.g., Nuance), I found that most require you to use the microphone. OpenEars allows you to do this, but the dictionary is very limited because it is an offline solution (they recommend 300 words or fewer).
There are a few other things going on with the app that would make it very unappealing to switch from the current recording method. For what it is worth, I am using the Amazing Audio Engine framework.
Anyone have any other suggestions for frameworks? Or is there a way to dig deeper with Nuance to transcribe a recorded file?
Thank you for your time.
There are a few cloud-based hosted speech recognition services you can use: you simply post the audio file to their URL and receive the text back. Most of them don't have any constraint on the vocabulary, and you can of course choose any recording method you like.
See here: Server-side Voice Recognition. Many of them offer a free trial as well.
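To give a sense of how little client code that takes, here is a minimal Swift sketch of the "post the file, get the text back" pattern. The endpoint URL, auth header and response handling are placeholders; every provider defines its own API, so check the documentation of whichever service you pick.

```swift
import Foundation

// Hypothetical sketch of posting a recorded file to a hosted recognition service.
// The URL, header names and response format below are placeholders.
func transcribe(fileURL: URL, apiKey: String,
                completion: @escaping (String?, Error?) -> Void) {
    let endpoint = URL(string: "https://speech.example.com/v1/recognize")!   // placeholder endpoint
    var request = URLRequest(url: endpoint)
    request.httpMethod = "POST"
    request.setValue("audio/m4a", forHTTPHeaderField: "Content-Type")
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")

    // Upload the recording and hand back whatever text the service returns.
    URLSession.shared.uploadTask(with: request, fromFile: fileURL) { data, _, error in
        let text = data.flatMap { String(data: $0, encoding: .utf8) }   // most services actually return JSON
        completion(text, error)
    }.resume()
}
```

Since the transcription happens server-side, your Amazing Audio Engine recording code does not have to change at all; you just hand the finished file to the service.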
I am creating an iOS game in which I have to inform the user about events in the game with voice, e.g. that you have moved one piece, or two pieces, or "well done, you have performed well".
The problem is that there are a large number of these voice prompts, and if I bundle an audio file for each one the app size will grow very large.
The second option I have discovered is to use a text-to-speech library. I have tried OpenEars, but the issue is that I want a voice like a cartoon character or a bird, which is not available in any of the open source text-to-speech libraries as far as I have searched.
Can anybody suggest a better way to handle this, or a text-to-speech framework with the different voice capabilities mentioned above?
Thanks in advance.
VoiceForge offers different TTS voices.
http://www.voiceforge.com
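If a hosted service is not an option, Apple's built-in AVSpeechSynthesizer is a different, purely on-device route: it cannot give you a real cartoon-character or bird voice, but pushing the pitch up gets part of the way there without shipping any audio files. A minimal sketch:

```swift
import AVFoundation

// Keep the synthesizer alive for as long as speech should play (e.g. as a property).
let synthesizer = AVSpeechSynthesizer()

func speakGameEvent(_ text: String) {
    let utterance = AVSpeechUtterance(string: text)
    utterance.voice = AVSpeechSynthesisVoice(language: "en-US")
    utterance.pitchMultiplier = 1.8   // allowed range 0.5...2.0; higher sounds squeakier / more cartoon-like
    utterance.rate = AVSpeechUtteranceDefaultSpeechRate * 0.9
    synthesizer.speak(utterance)
}

// Usage:
// speakGameEvent("Well done, you have moved two pieces!")
```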
After finally finding a way to concatenate multiple voice files into one single audio file on the iPhone, I am now trying to superimpose an audio file over the length of the voice file.
So basically I have two .m4a files:
voice.m4a which is about 10 seconds for example.
music.m4a which is about 5 seconds.
What I require is that the two files be combined in such a manner that the resulting single audio file contains the music in the background of the voice for its full length, so basically the output should have the 10 seconds of voice with the 5 seconds of music repeated twice underneath it. It is absolutely important to have a single file that contains all of this.
I am trying to get all of this done in an application on the iPhone.
Can anyone please help me out with this?
If you are looking to do that programmatically, you will need to go deeper into Core Audio. For a simpler solution you could use Audio Queues, or for more fine-grained control, Audio Units and an AUGraph. The MultiChannelMixer is the Audio Unit you are looking for. Unfortunately there is no space for an elaborate tutorial here (it would take a couple of days to write just the tutorial itself), but I am hoping I can point you in the right direction.
If you decide to go down that path and want to do further audio programming beyond this one simple example, then I strongly suggest you buy "Learning Core Audio: A Hands-On Guide to Audio Programming for Mac and iOS" by Chris Adamson and Kevin Avila. You can find it on Amazon, in paperback or Kindle.
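For the narrower goal of mixing two files down into a single .m4a, there is also a lighter-weight route than a full AUGraph: AVFoundation's AVMutableComposition with an AVAssetExportSession. A rough sketch (URLs, function name and error handling are placeholders), which lays the voice on one track, tiles the shorter music underneath it, and exports the mix as one file:

```swift
import AVFoundation

func mixVoice(_ voiceURL: URL, withMusic musicURL: URL, to outputURL: URL,
              completion: @escaping (Error?) -> Void) {
    let voiceAsset = AVURLAsset(url: voiceURL)
    let musicAsset = AVURLAsset(url: musicURL)
    let composition = AVMutableComposition()

    guard
        let voiceTrack = composition.addMutableTrack(withMediaType: .audio,
                                                     preferredTrackID: kCMPersistentTrackID_Invalid),
        let musicTrack = composition.addMutableTrack(withMediaType: .audio,
                                                     preferredTrackID: kCMPersistentTrackID_Invalid),
        let voiceSource = voiceAsset.tracks(withMediaType: .audio).first,
        let musicSource = musicAsset.tracks(withMediaType: .audio).first
    else { completion(NSError(domain: "Mix", code: -1, userInfo: nil)); return }

    do {
        // Lay down the full voice recording (~10 s in the example).
        try voiceTrack.insertTimeRange(CMTimeRange(start: .zero, duration: voiceAsset.duration),
                                       of: voiceSource, at: .zero)

        // Tile the shorter music (~5 s) until it covers the voice, i.e. inserted twice here.
        var cursor = CMTime.zero
        while cursor < voiceAsset.duration {
            let remaining = CMTimeSubtract(voiceAsset.duration, cursor)
            let chunk = CMTimeMinimum(musicAsset.duration, remaining)
            try musicTrack.insertTimeRange(CMTimeRange(start: .zero, duration: chunk),
                                           of: musicSource, at: cursor)
            cursor = CMTimeAdd(cursor, chunk)
        }
    } catch { completion(error); return }

    // Export both tracks mixed down into one .m4a file.
    guard let export = AVAssetExportSession(asset: composition,
                                            presetName: AVAssetExportPresetAppleM4A)
    else { completion(NSError(domain: "Mix", code: -2, userInfo: nil)); return }
    export.outputURL = outputURL
    export.outputFileType = .m4a
    export.exportAsynchronously { completion(export.error) }
}
```

The export session mixes overlapping audio tracks by default; if you need to duck the music under the voice, add an AVMutableAudioMix with per-track volume settings.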
I have a DirectShow application written in Delphi 6 using the DSPACK component library. I want to be able to mix together audio coming from the output pins from multiple Capture Filters that are set to the exact same media format. Is there an open source or "sdk sample" filter that does this?
I know that intelligent mixing is a big deal and that I'd most likely have to buy a commercial library to do that. But all I need is a DirectShow filter that can accept wave audio input from multiple output pins and does a straight addition of the samples received. I know there are Tee filters for splitting a single stream into multiple streams (one-to-many), but I need something that does the opposite (many-to-one), preferably with format checking on each input connection attempt, so that any attempt to attach an output pin with a different media format than the ones already added is rejected with an error. Is there anything out there?
I'm not sure about anything available out of the box; it would definitely be a third-party component in any case.
The complexity of creating this custom filter is not very high (it is not rocket science to build such a component yourself for a specific need). You basically need to convert all input audio to the same PCM format, match the timestamps, add the samples together, and then deliver the result via the output pin.
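The addition step itself is tiny. Here is a minimal sketch of summing two 16-bit PCM buffers with clipping, shown in Swift only for brevity; the arithmetic translates directly to Delphi or C++, and the DirectShow-specific plumbing (pins, media-type checks, timestamp matching) still has to live in the filter itself:

```swift
// Straight addition of two 16-bit PCM buffers, clamped to avoid wrap-around distortion.
func mixPCM16(_ a: [Int16], _ b: [Int16]) -> [Int16] {
    let length = max(a.count, b.count)
    var mixed = [Int16](repeating: 0, count: length)
    for i in 0..<length {
        let sampleA = i < a.count ? Int32(a[i]) : 0
        let sampleB = i < b.count ? Int32(b[i]) : 0
        // Sum in a wider type, then clamp back into the Int16 range.
        mixed[i] = Int16(clamping: sampleA + sampleB)
    }
    return mixed
}
```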
I'm developing a virtual instrument app for iOS and am trying to implement a recording function so that the app can record and playback the music the user makes with the instrument. I'm currently using the CocosDenshion sound engine (with a few of my own hacks involving fades etc) which is based on OpenAL. From my research on the net it seems I have two options:
1. Keep a record of the user's inputs (i.e. which notes were played at what volume) so that the app can recreate the sound (but this cannot be shared/emailed).
2. Hack my own low-level sound engine using Audio Units, specifically RemoteIO, so that I manually mix all the sounds and populate the final output buffer by hand, and hence can save said buffer to a file. This can then be shared by email etc.
I have implemented a RemoteIO callback for rendering the output buffer in the hope that it would give me the previously played data in the buffer, but alas the buffer always contains all zeroes.
So my question is: is there an easier way to sniff/listen to what my app is sending to the speakers than my option 2 above?
Thanks in advance for your help!
I think you should use RemoteIO. I had a similar project several months ago and wanted to avoid RemoteIO and Audio Units as much as possible, but in the end, after I wrote tons of code and read lots of documentation for third-party libraries (including CocosDenshion), I ended up using Audio Units anyway. More than that, it's not that hard to set up and work with. If you are looking for a library to do most of the work for you, however, you should look for one written on top of Core Audio, not OpenAL.
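If the playback does move onto Core Audio, one straightforward way to capture what the app sends to the speakers is a tap on AVAudioEngine's main mixer. This is a newer, higher-level API than raw RemoteIO and it only sees audio that actually flows through the engine (so it will not pick up OpenAL/CocosDenshion output), but it keeps the recording side very short. A rough sketch:

```swift
import AVFoundation

// Assumes the app's sounds are played through this AVAudioEngine instance.
final class OutputRecorder {
    private let engine: AVAudioEngine
    private var file: AVAudioFile?

    init(engine: AVAudioEngine) { self.engine = engine }

    func start(writingTo url: URL) throws {
        let mixer = engine.mainMixerNode
        let format = mixer.outputFormat(forBus: 0)
        file = try AVAudioFile(forWriting: url, settings: format.settings)
        // The tap hands us every buffer the mixer renders on its way to the output.
        mixer.installTap(onBus: 0, bufferSize: 4096, format: format) { [weak self] buffer, _ in
            try? self?.file?.write(from: buffer)
        }
    }

    func stop() {
        engine.mainMixerNode.removeTap(onBus: 0)
        file = nil   // releasing the AVAudioFile closes it
    }
}
```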
You might want to take a look at the AudioCopy framework. It does a lot of what you seem to be looking for, and will save you from potentially reinventing some wheels.