How to detect pauses between words with PortAudio? - portaudio

PortAudio V19, Windows 11.
I have a .WAV file with someone speaking a sentence.
I need to know where a word ends and another starts.
It's not important to know or understand the words themselves; I just need to detect that the sentence contains, for instance, 7 words and when each of them is spoken.
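PortAudio itself only delivers raw sample buffers; it has no notion of words or pauses. A common approach (not from this thread, just the usual technique) is energy-based silence detection: compute the RMS level over short frames and treat a sufficiently long run of low-energy frames as a gap between words. The sketch below assumes the .WAV data has already been decoded into 16-bit mono PCM at a known sample rate; detect_words is a hypothetical helper, and the frame length, silence threshold and minimum pause length are arbitrary starting values that need tuning for your recordings.

/* Sketch only: energy-based pause/word detection on decoded 16-bit mono PCM.
   Thresholds and window sizes are illustrative and must be tuned. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

#define FRAME_MS     20      /* analysis window length in milliseconds   */
#define SILENCE_RMS  500.0   /* below this RMS a frame counts as silence */
#define MIN_PAUSE_MS 200     /* a gap at least this long separates words */

/* Prints start/end times of each word-like segment and the total count. */
void detect_words(const int16_t *samples, size_t n, int sample_rate)
{
    size_t frame_len  = (size_t)sample_rate * FRAME_MS / 1000;
    size_t min_gap    = MIN_PAUSE_MS / FRAME_MS;   /* silent frames needed */
    size_t silent_run = 0;
    int    in_word    = 0;
    size_t word_start = 0;
    int    word_count = 0;

    for (size_t i = 0; i + frame_len <= n; i += frame_len) {
        double energy = 0.0;
        for (size_t j = 0; j < frame_len; ++j)
            energy += (double)samples[i + j] * (double)samples[i + j];
        double rms = sqrt(energy / (double)frame_len);

        if (rms >= SILENCE_RMS) {              /* voiced/active frame */
            if (!in_word) { in_word = 1; word_start = i; }
            silent_run = 0;
        } else {                               /* silent frame */
            ++silent_run;
            if (in_word && silent_run >= min_gap) {
                size_t word_end = i - (silent_run - 1) * frame_len;
                printf("word %d: %.2f s .. %.2f s\n", ++word_count,
                       (double)word_start / sample_rate,
                       (double)word_end / sample_rate);
                in_word = 0;
            }
        }
    }
    if (in_word)
        printf("word %d: %.2f s .. end of file\n", ++word_count,
               (double)word_start / sample_rate);
    printf("detected %d word-like segments\n", word_count);
}

Note that plosives and brief hesitations inside a word can create short dips in energy, so the minimum-pause length matters a lot. For anything more robust than counting rough word-like segments, a speech toolkit such as the pocketsphinx/OpenEars stack discussed in the related answers below does proper word segmentation.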

Related

Chinese text-to-speech problems on iOS

We are trying to use the built-in iOS text-to-speech tool for reading Chinese words in the app.
It works well when reading whole texts, but has problems reading isolated words.
For example, we have the character 还. It can be pronounced "hái", meaning "also, in addition", or "huán", meaning "to return".
In the phrase 我还要还钱 (wǒ hái yào huán qián) it pronounces 还 both ways (correctly).
But given 还 on its own, iOS prefers to read it only as "hái". How can we make it pronounce characters the way we need (if that is possible)?
As a quick solution, you can cut the required words out of longer recordings and play them as audio clips instead of using TTS.

Does iOS AVSpeechSynthesizer support Embedded Speech Commands

The Mac OS speech synthesizer has a set of embedded commands that let you do things like change the pitch, speech rate, level of emphasis, etc. For example, you might use
That is [[emph +]]not[[emph -]] my dog!
To add emphasis to the word "not" in the phrase
That is not my dog!
Is there any such support in the iOS speech synthesizer? It looks like there is not, but I'm hoping against hope somebody knows of a way to do this.
As a follow-on question, is there a way to make global changes to the "stock" voice you get for a given locale? In the settings for Siri you can select the language and country as well as the gender. The AVSpeechSynthesizer, however, appears to give you only a single, semi-random gender for each language/country. (For example, the voice for en-US is female, en-GB is male, and en-AU is female, with no apparent way to change this.)
I agree that it doesn't seem possible. From the docs, it seems Apple intends that you would create separate utterances and manually adjust the pitch/rate:
Because an utterance can control speech parameters, you can split text into sections that require different parameters. For example, you can emphasize a sentence by increasing the pitch and decreasing the rate of that utterance relative to others, or you can introduce pauses between sentences by putting each one into an utterance with a leading or trailing delay. Because the speech synthesizer sends messages to its delegate as it starts or finishes speaking an utterance, you can create an utterance for each meaningful unit in a longer text in order to be notified as its speech progresses.
I'm thinking of creating a category extension on AVSpeechUtterance that parses embedded commands (as in your example) and automatically creates separate utterances. If someone else has done this, or wants to help, please let me know. I'll update here.

Open Ears API says every sound it hears is a word, even a cough

I am trying to use OpenEars for a small part of my app. I have three or four keywords that I want to be able to "listen" for, something like "Add", "Subtract", etc. I am just using the sample app found here. I want a special case in the app for when I hear "Add" etc., as opposed to a word that is not one of my four keywords. Right now I set my language to contain only the four keywords, but whenever the OpenEars API hears anything, it picks one of the four. So if I cough, it picks the closest match out of my four words.
How can I listen for a specific word without always choosing one of the keywords?
I was thinking I could have a whole bunch of words, a few hundred, and just check which word was spoken, with a special case for my four keywords, but I don't want to have to type out each word. Does OpenEars provide any default languages?
OpenEars developer here. Check out the dynamic grammar generation API that was just added in OpenEars 1.7 which may provide the right results for your requirements: http://www.politepix.com/2014/04/10/openears-1-7-introducing-dynamic-grammar-generation/
This approach might be more suitable for keyword detection and detection of fixed phrases. Please bring further questions to the OpenEars forums if you'd like to troubleshoot them with me.

voice recognition on iOS - convert OOV words to phonemes on iOS?

I've tried OpenEars successfully, as suggested on StackOverflow, and can generate custom vocabularies from arrays of NSStrings.
However, we also need to recognize names from the address book, and here the fallback method very often fails miserably…
I could write a parser and dynamically transcribe the texts (mainly French- and Dutch-sounding names) to phonemes myself, but that would involve a lot of guesswork… I'm pretty sure the data I need is generated somewhere in the recognition process, so maybe someone could point me to a hook in the OpenEars or Flite code that I could exploit on iOS?
Or is there some other library that would convert user speech into a string of phonemes that I can feed into OpenEars?
The right way to recognize names in OpenEars is to put specific pronunciations into the phonetic dictionary. You do not need to analyze phonetic strings yourself; in fact the recognizer does not keep the phonetic string at all, so you cannot even retrieve it. Also, there is no clear correspondence between the audio and the phoneme sequence.
For example, the grapheme-to-phoneme code can infer the following pronunciation:
tena T IH N
While the correct pronunciation is
tena T EH N AH
With the incorrect predicted pronunciation the recognizer will not be able to recognize the name; with the corrected one it will recognize it accurately.
The problem is that automatic word-to-phoneme conversion in OpenEars might fail, and for foreign words it might fail even more frequently. What you need to do is add the names to the dictionary so that the recognizer knows their proper phonetic sequences. If the proper sequence is known, the recognizer will be able to detect the word by itself. You can also improve the grapheme-to-phoneme code in OpenEars to make it more accurate. Modern pocketsphinx uses the Phonetisaurus API, which is both more accurate than Flite and also trainable on special cases like foreign names.
For any issues you have with accuracy, it is first of all recommended to collect a database of test samples in order to enable string accuracy analysis. Once you have such a database you can improve accuracy significantly. See for details:
http://cmusphinx.sourceforge.net/wiki/faq#qwhy_my_accuracy_is_poor

Synchronise an audio to accurate transcription on iOS

I'm trying to synchronise text in my iOS app with audio that is being streamed simultaneously. The text is a very accurate transcription of the audio, previously done manually. Is it possible to use keyword spotting or audio-to-text conversion to assist with this?
The text is already indexed in the app with the CLucene search engine, so it will be very easy to search for any string of text/words in any paragraph of the text. Even if the audio-to-text conversion is not 100% accurate, the search engine should be able to handle it and still find the best match in the text within a couple of tries.
Could you point me to any open-source libraries for the audio-to-text conversion that would assist with this? I would prefer one that can convert the streamed audio to text directly, rather than relying on the microphone as is common in speech-to-text libraries, since users may use headphones with the app and/or there may be background noise.
To recognize an audio file or audio stream on iOS you can use CMUSphinx via OpenEars.
To recognize a file you need to set pathToTestFile; see the details at
http://www.politepix.com/openears/#PocketsphinxController_Class_Reference
To recognize a stream you can feed the audio into pocketsphinx through the pocketsphinx API.
Since you know the text beforehand, you can create a grammar from it and the recognition will be accurate.
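For what it's worth, OpenEars is an Objective-C wrapper around pocketsphinx, so on iOS you would go through the PocketsphinxController reference linked above rather than call the C layer yourself. Purely as an illustration of what decoding a pre-recorded file looks like at the underlying pocketsphinx level, here is a sketch against the classic pocketsphinx C API; option names and function signatures vary between pocketsphinx versions, and every path below is a placeholder.

/* Illustrative only: decoding a pre-recorded utterance with the
   pocketsphinx C API that OpenEars wraps. All paths are placeholders. */
#include <pocketsphinx.h>
#include <stdio.h>

int main(void)
{
    cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm",  "/path/to/en-us",            /* acoustic model              */
        "-jsgf", "/path/to/transcript.gram",  /* grammar from the known text */
        "-dict", "/path/to/cmudict.dict",     /* pronunciation dictionary    */
        NULL);
    ps_decoder_t *ps = ps_init(config);
    if (ps == NULL)
        return 1;

    /* Raw 16 kHz, 16-bit mono PCM; a real app would strip the WAV header. */
    FILE *fh = fopen("/path/to/utterance.raw", "rb");
    if (fh == NULL)
        return 1;

    int16 buf[512];
    size_t nread;

    ps_start_utt(ps);
    while ((nread = fread(buf, sizeof(int16), 512, fh)) > 0)
        ps_process_raw(ps, buf, nread, FALSE, FALSE);
    ps_end_utt(ps);

    int32 score;
    const char *hyp = ps_get_hyp(ps, &score);
    printf("hypothesis: %s\n", hyp ? hyp : "(none)");

    fclose(fh);
    ps_free(ps);
    cmd_ln_free_r(config);
    return 0;
}

The -jsgf line is where a grammar generated from your known transcription would be plugged in; restricting the recognizer to that grammar is what makes the recognition accurate, as noted above.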
