"Sound" Recognition in Swift? - ios

I'm working on an application in Swift and I was thinking about a way to get non-speech sound recognition into my project.
What I mean is: is there a way to take in sound input, match it against some predefined sounds already incorporated in the project, and perform a particular action when a match occurs?
Is there any way to do the above? I'm thinking of breaking up the sounds and doing the checks, but can't seem to get any further than that.

My personal experience matches matt's comment above: this requires serious technical knowledge.
There are several ways to do this, and one typically goes as follows: extract some properties from the sound segment of interest (audio feature extraction), and classify this audio feature vector with some kind of machine learning technique. This typically requires a training phase in which the machine learning technique is given some examples of the sounds you want to recognize (your predefined sounds) so that it can build a model from that data.
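To make the "extract features, then classify" idea concrete, here is a minimal sketch in Swift. It is not what any particular SDK does: the two features (RMS energy and zero-crossing rate) and the nearest-centroid classifier are simply the easiest choices to show, and all the names (`AudioFeatures`, `extractFeatures`, `NearestCentroidClassifier`) are made up for illustration. Real systems usually use richer features such as MFCCs and a stronger classifier.

```swift
import Foundation

/// Two simple per-clip features: RMS energy and zero-crossing rate.
struct AudioFeatures {
    let rms: Double
    let zeroCrossingRate: Double
}

/// Extracts the features from mono samples in the range -1...1.
func extractFeatures(_ samples: [Double]) -> AudioFeatures {
    guard samples.count > 1 else { return AudioFeatures(rms: 0, zeroCrossingRate: 0) }
    let energy = samples.reduce(0.0) { $0 + $1 * $1 }
    let rms = (energy / Double(samples.count)).squareRoot()
    // Count sign changes between neighbouring samples.
    var crossings = 0
    for i in 1..<samples.count where (samples[i - 1] < 0) != (samples[i] < 0) {
        crossings += 1
    }
    let zcr = Double(crossings) / Double(samples.count - 1)
    return AudioFeatures(rms: rms, zeroCrossingRate: zcr)
}

/// Nearest-centroid classifier: training averages the features per label,
/// classification picks the label whose average is closest.
struct NearestCentroidClassifier {
    private var centroids: [String: AudioFeatures] = [:]

    init() {}

    /// Training phase: store the mean feature vector of the examples.
    mutating func train(label: String, examples: [[Double]]) {
        guard !examples.isEmpty else { return }
        let features = examples.map(extractFeatures)
        let n = Double(features.count)
        let meanRMS = features.reduce(0.0) { $0 + $1.rms } / n
        let meanZCR = features.reduce(0.0) { $0 + $1.zeroCrossingRate } / n
        centroids[label] = AudioFeatures(rms: meanRMS, zeroCrossingRate: meanZCR)
    }

    /// Returns the closest trained label, or nil if nothing was trained.
    func classify(_ samples: [Double]) -> String? {
        let f = extractFeatures(samples)
        func distance(_ c: AudioFeatures) -> Double {
            let dr = f.rms - c.rms
            let dz = f.zeroCrossingRate - c.zeroCrossingRate
            return dr * dr + dz * dz
        }
        return centroids.min { distance($0.value) < distance($1.value) }?.key
    }
}
```

You would call `train(label:examples:)` once per predefined sound with a few recordings of it, then call `classify(_:)` on each incoming segment and trigger your action when the returned label is the one you care about. The two features are not normalised against each other here, which a real implementation would want to do.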
Without knowing what types of sounds you want to recognize, maybe our C/C++ SDK available here might do the trick for you: http://www.samplesumo.com/percussive-sound-recognition
There's a technical demo on that page that you can download and try with your sounds. It's a C/C++ library, and there is a Mac, Windows and iOS version, so you should be able to integrate it with a Swift app on iOS. Maybe this will allow you to do what you need?
If you want to develop your own technology, you may want to start by finding and reading some scientific papers using the keywords "sound classification", "audio recognition", "machine listening", "audio feature classification", ...

Matt,
We've been developing a bunch of cool tools to speed up iOS development, especially in Swift. One of these tools is what we called TLSphinx: a Swift wrapper around Pocketsphinx which can perform speech recognition without the audio leaving the device.
I assume TLSphinx can help you solve your problem since it is a totally open source library. Search for it on GitHub ('TLSphinx') and you can also download our iOS app ('Tryolabs Mobile Showcase') and try the module live to see how it works.
Hope it is useful!
Best!

Related

Speech recognition from recorded file

I've been researching several iOS speech recognition frameworks and have found it hard to accomplish something I would think is pretty straightforward.
I have an app that allows people to record their voices. After a recording is made, they have the option to create a text version.
Looking into the services out there (e.g., Nuance), most require you to use the microphone. OpenEars allows you to do this, but the dictionary is very limited because it is an offline solution (they recommend 300 or fewer words).
There are a few other things going on with the app that would make it very unappealing to switch from the current recording method. For what it is worth, I am using the Amazing Audio Engine framework.
Does anyone have any other suggestions for frameworks? Or is there a way to dig deeper with Nuance to transcribe a recorded file?
Thank you for your time.
For services, there are a few cloud based hosted speech recognition services you can use. You simply post the audio file to their URL and receive back the text. Most of them don't have any constraint on the vocabulary. You can of course choose any recording method you like.
See here: Server-side Voice Recognition. Many of them offer a free trial as well.
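As a rough idea of what "post the audio file to their URL" looks like from Swift, here is a sketch that only builds the request. The function name `makeTranscriptionRequest` is made up, the endpoint you pass in is whatever your chosen service documents, and the `audio/wav` content type is an assumption. Real services differ in authentication, headers and whether they expect a raw body or a multipart upload, so check their documentation.

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking  // URLRequest lives here on non-Apple platforms.
#endif

/// Builds a POST request carrying the recorded audio as the raw body.
/// `endpoint` is a placeholder for the service's real transcription URL.
func makeTranscriptionRequest(audio: Data, endpoint: URL) -> URLRequest {
    var request = URLRequest(url: endpoint)
    request.httpMethod = "POST"
    // Assumed content type; use whatever format you actually recorded.
    request.setValue("audio/wav", forHTTPHeaderField: "Content-Type")
    request.httpBody = audio
    return request
}
```

You would hand the request to `URLSession` and parse the text out of the response, whose format again depends on the service.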

shazam for voice recognition on iphone

I am trying to build an app that allows the user to record individual people speaking, and then save the recordings on the device and tag each recording with the name of the person who spoke. Then there is the detection mode, in which I record someone and the app can tell me their name if they are in the local database.
First of all - is this possible at all? I am very new to iOS development and not so familiar with the available APIs.
More importantly, which API should I use (ideally free) to correlate between the incoming voice and the records I have in the local db? This should behave something like Shazam, but much simpler, since the database I am matching against is much smaller.
If you're new to iOS development, I'd start with the core app to record the audio and let people manually choose a profile/name to attach it to and worry about the speaker recognition part later.
You obviously have two options for the recognition side of things: You can either tie in someone else's speech authentication/speaker recognition library (which will probably be in C or C++), or you can try to write your own.
How many people are going to use your app? You might be able to create something basic yourself: if it's the difference between a man and a woman, you could probably figure that out by doing an FFT spectral analysis of the audio and working out where the frequency peaks are. Obviously the frequencies used to enunciate different phonemes are going to vary somewhat, so solving the general case for two people who sound fairly similar is probably hard. You'll need to train the system with a bunch of text and build some kind of model of frequency distributions. You could try to do clustering or something, but you're going to run into a fair bit of maths fairly quickly (Gaussian mixture models, et al). There are libraries/projects that'll do this. You might be able to port this from MATLAB, for example: https://github.com/codyaray/speaker-recognition
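To illustrate the "find where the frequency peaks are" step, here is a small Swift sketch that returns the strongest frequency in a frame of audio. It uses a naive O(n²) DFT rather than a real FFT purely to stay self-contained; on iOS you would more likely use the FFT routines in Accelerate's vDSP. The function name `dominantFrequency` is made up, and note that the strongest peak is not always the fundamental of a voice, so treat it as a starting point only.

```swift
import Foundation

/// Returns the frequency (Hz) of the strongest bin in a naive DFT of `samples`.
/// O(n^2): fine for a short demo frame, too slow for real-time use.
func dominantFrequency(of samples: [Double], sampleRate: Double) -> Double {
    let n = samples.count
    guard n > 1 else { return 0 }
    var bestBin = 0
    var bestPower = 0.0
    // Skip bin 0 (DC offset) and stop at the Nyquist bin.
    for k in 1...(n / 2) {
        var re = 0.0
        var im = 0.0
        for (t, x) in samples.enumerated() {
            let angle = 2 * Double.pi * Double(k) * Double(t) / Double(n)
            re += x * cos(angle)
            im -= x * sin(angle)
        }
        let power = re * re + im * im
        if power > bestPower {
            bestPower = power
            bestBin = k
        }
    }
    // Convert the bin index back to Hz.
    return Double(bestBin) * sampleRate / Double(n)
}
```

The resolution is `sampleRate / samples.count` Hz per bin, so longer frames separate nearby peaks better at the cost of more computation.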
If you want to take something off-the-shelf, I'd go with a straight C library like mistral, as it should be relatively easy to call into from Objective-C.
The SpeakHere sample code should get you started for audio recording and playback.
Also, it may well take longer for the user to train your app to recognise them than it's worth in time-saving from just picking their name from a list. Unless you're intending their voice to be some kind of security passport type thing, it might just not be worth bothering with.

Audio Framework Confusion

I've read quite a bit both here (Audio Framework in iPhone) and abroad but am still confused as to which Audio Framework to use.
I'm able to get some easier things done, like recording and playing back, but I'm looking to the future of the app where I'll be doing more complex things, like managing past recordings (although maybe that's an NSURL bookmark thing) and editing audio.
Right now I'm using AVFoundation but have started reading the docs for Core Audio (and there's also AudioToolbox). I wish there was a developer doc called "Understanding the Different Audio Frameworks and How and When to use them" because, well, the docs are dense and I'm having trouble figuring out which path to go down.
Links to good docs would also be much appreciated!
I recommend you take a look at the recent Learning Core Audio book. The purpose of it was to disambiguate the confusion around audio frameworks on Mac OS and iOS. If you want "good docs", it's well worth getting.
Depending on your requirements, you might also want to consider some of the non-Apple audio frameworks, particularly the MoMu release of STK, which in many respects will be simpler and easier to use than Apple's frameworks.

wavetables implemented on iOS

I just saw an iPhone app which uses wavetables to generate sounds. I would like to know how this can be implemented.
I am pretty sure that Core Audio has to be used, but any other pointers to further information would be appreciated.
You'll want CoreAudio or AudioUnits for a responsive program (e.g. AudioQueue's latency is a bit high).
You'll want AudioFile APIs (in AudioToolbox) for reading the tables if you save them as a common audio file format (just wave files with a new shape every cycle, which is every N samples).
Beyond that, you'll probably have to write the wavetable engine. I have done that; it's not tough if you know how wavetable synthesis works and are familiar with audio signals. It's one of the most basic synthesis types.
musicdsp.org may have something you can use as a starting point for this.
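For a feel of what the core of such an engine looks like, here is a minimal Swift sketch of a single-cycle wavetable oscillator with linear interpolation. It is only an illustration, not the engine described above: the names `WavetableOscillator` and `makeSineTable` are made up, it holds just one shape rather than a table that changes every N samples, and in a real app the output would be written into a Core Audio render callback buffer.

```swift
import Foundation

/// A single-cycle wavetable oscillator with linear interpolation.
struct WavetableOscillator {
    let table: [Double]
    let sampleRate: Double
    private var phase = 0.0   // Read position in the table, in samples.

    init(table: [Double], sampleRate: Double) {
        precondition(!table.isEmpty, "table must hold one cycle")
        self.table = table
        self.sampleRate = sampleRate
    }

    /// Produces the next output sample at the given frequency in Hz.
    mutating func next(frequency: Double) -> Double {
        let size = table.count
        let i0 = Int(phase) % size
        let i1 = (i0 + 1) % size
        let frac = phase - phase.rounded(.down)
        // Blend the two neighbouring table entries.
        let sample = table[i0] + frac * (table[i1] - table[i0])
        // Advance by size * frequency / sampleRate and wrap around the table.
        phase += Double(size) * frequency / sampleRate
        phase = phase.truncatingRemainder(dividingBy: Double(size))
        if phase < 0 { phase += Double(size) }
        return sample
    }
}

/// Builds one cycle of a sine wave as an example table.
func makeSineTable(size: Int) -> [Double] {
    return (0..<size).map { sin(2 * Double.pi * Double($0) / Double(size)) }
}
```

Swapping `makeSineTable` for any other single-cycle shape (for example one cycle loaded from an audio file) changes the timbre without touching the oscillator itself.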
After a lot of investigating I have found an open source project for this: http://gitorious.org/pdlib/
Audio file I/O: I found a great resource here. This guy created an excellent API for using ExtAudioFileServices.
A must-read is Learning Core Audio. Chris Adamson and company have really put together a great resource. Chris's blog can also be found here.
Also, sign up for the Core Audio mailing list.
Michael Tyson's blog and resources at A Tasty Pixel are great too.
Hope this helps!
Take a look at this tutorial on how to use the STK: http://arielelkin.github.io/articles/mandolin/
It is an open-source C++ library with cool synths, some with wavetables.

iOS / C: Algorithm to detect phonemes

I am searching for an algorithm to determine whether realtime audio input matches one of 144 given (and comfortably distinct) phoneme-pairs.
Preferably the lowest level that does the job.
I'm developing radical / experimental musical training software for iPhone / iPad.
My musical system comprises 12 consonant phonemes and 12 vowel phonemes, demonstrated here. That makes 144 possible phoneme pairs. The student has to sing the correct phoneme pair 'laa duu bee' etc in response to visual stimulus.
I have done a lot of research into this, and it looks like my best bet may be to use one of the iOS Sphinx wrappers (iPhone App › Add voice recognition? is the best source of information I have found). However, I can't see how I would adapt such a package. Can anyone with experience using one of these technologies give a basic rundown of the steps that would be required?
Would training be necessary by the user? I would have thought not, as it is such an elementary task, compared with full language models of thousands of words and far greater and more subtle phoneme base. However, it would be acceptable (not ideal) to have the user train 12 phoneme pairs: { consonant1+vowel1, consonant2+vowel2, ..., consonant12+vowel12 }. The full 144 would be too burdensome.
Is there a simpler approach? I feel like using a fully featured continuous speech recogniser is using a sledgehammer to crack a nut. It would be far more elegant to use the minimum technology that would solve the problem.
So really I'm hunting for any open source software that recognises phonemes.
PS: I need a solution which runs pretty much in real time, so that even as they are singing the note, it first blinks on to show that it picked up the phoneme pair that was sung, and then glows to show whether they are singing the correct pitch.
If you are looking for a phone-level open source recogniser, then I would recommend HTK. Very good documentation is available with this tool in the form of the HTK Book. It also contains an entire chapter dedicated to building a phone level real-time speech recogniser. From your problem statement above, it seems to me like you might be able to re-work that example into your own solution. Possible pitfalls:
Since you want to build a phone-level recogniser, the amount of data needed to train the phone models would be very large. Also, your training database should be balanced in terms of the distribution of the phones.
Building a speaker-independent system would require data from more than one speaker. And lots of that too.
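As a quick way to check the "balanced" part before training, here is a small Swift sketch that counts how often each phone label occurs in your transcriptions. The function name `phoneBalance` and the idea of using a max/min ratio as the imbalance measure are my own illustration, not something HTK provides.

```swift
/// Counts how often each phone label occurs and reports the imbalance ratio
/// (most frequent count divided by least frequent count; 1.0 is perfectly balanced).
func phoneBalance(_ labels: [String]) -> (counts: [String: Int], ratio: Double) {
    var counts: [String: Int] = [:]
    for label in labels {
        counts[label, default: 0] += 1
    }
    // An empty label list is reported as balanced.
    guard let lo = counts.values.min(), let hi = counts.values.max() else {
        return (counts, 1.0)
    }
    return (counts, Double(hi) / Double(lo))
}
```

A ratio well above 1 suggests recording more examples of the rare phones before training.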
Since this is open source, you should also check the licensing info for any additional details about shipping the code. A good alternative would be to use the on-phone recorder and then have the recorded waveform sent over a data channel to a server for the recognition, pretty much like what Google does.
I have a little bit of experience with this type of signal processing, and I would say that this is probably not the type of finite question that can be answered definitively.
One thing worth noting is that although you may restrict the phonemes you are interested in, the possibility space remains the same (i.e. infinite-ish). User training might help the algorithms along a bit, but useful training takes quite a bit of time and it seems you are averse to too much of that.
Using Sphinx is probably a great start on this problem. I haven't gotten very far in the library myself, but my guess is that you'll be working with its source code yourself to get exactly what you want. (Hooray for open source!)
"...using a sledgehammer to crack a nut."
I wouldn't label your problem a nut, I'd say it's more like a beast. It may be a different beast than natural language speech recognition, but it is still a beast.
All the best with your problem solving.
Not sure if this would help: check out OpenEars' LanguageModelGenerator. OpenEars uses Sphinx and other libraries.
http://www.hfink.eu/matchbox
This page links to both YouTube video demo and github source.
I'm guessing it would still be a lot of work to mould it into the shape I'm after, but it definitely does do a lot of the work.

Resources