The Mac OS speech synthesizer has a set of embedded commands that let you do things like change the pitch, speech rate, level of emphasis, etc. For example, you might use
That is [[emph +]]not[[emph -]] my dog!
to add emphasis to the word "not" in the phrase
That is not my dog!
Is there any such support in the iOS speech synthesizer? It looks like there is not, but I'm hoping against hope somebody knows of a way to do this.
As a follow-on question, is there a way to make global changes to the "stock" voice you get for a given locale? In the settings for Siri you can select the language and country as well as the gender. However, the AVSpeechSynthesizer appears to give you only a single, semi-random gender for each language/country. (For example, the voice for en-US is female, en-GB is male, and en-AU is female, with no apparent way to change this.)
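For reference, here is a minimal sketch of how to inspect which voices the device reports; this only lists what is installed and does not by itself change the behaviour described above:

    import AVFoundation

    // List the voices the device reports for each locale.
    // What comes back depends on the iOS version and the installed voices.
    for voice in AVSpeechSynthesisVoice.speechVoices() {
        print(voice.language, voice.name, voice.identifier)
    }

    // The per-locale default described above is what you get from:
    let defaultVoice = AVSpeechSynthesisVoice(language: "en-US")
    print(defaultVoice?.name ?? "no en-US voice available")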
I agree that it doesn't seem possible. From the docs, it seems Apple intends that you would create separate utterances and manually adjust the pitch/rate:
Because an utterance can control speech parameters, you can split text
into sections that require different parameters. For example, you can
emphasize a sentence by increasing the pitch and decreasing the rate
of that utterance relative to others, or you can introduce pauses
between sentences by putting each one into an utterance with a leading
or trailing delay. Because the speech synthesizer sends messages to
its delegate as it starts or finishes speaking an utterance, you can
create an utterance for each meaningful unit in a longer text in order
to be notified as its speech progresses.
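For example, here is a minimal sketch of that manual approach, emphasizing one word by splitting the phrase into three queued utterances (the pitch and rate values are just guesses, not anything Apple recommends):

    import AVFoundation

    let synthesizer = AVSpeechSynthesizer()

    // Speak "before emphasized after", approximating emphasis on the middle
    // utterance by raising its pitch and lowering its rate. Queued utterances
    // are spoken in order.
    func speakEmphasizing(_ emphasized: String, before: String, after: String) {
        let lead = AVSpeechUtterance(string: before)
        let stressed = AVSpeechUtterance(string: emphasized)
        stressed.pitchMultiplier = 1.3
        stressed.rate = AVSpeechUtteranceDefaultSpeechRate * 0.8
        let tail = AVSpeechUtterance(string: after)

        [lead, stressed, tail].forEach { synthesizer.speak($0) }
    }

    speakEmphasizing("not", before: "That is", after: "my dog!")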
I'm thinking of writing a category (extension) on AVSpeechUtterance that parses embedded commands (as in your example) and automatically creates the separate utterances. If someone else has done this, or wants to help, please let me know. I'll update here.
I am trying to implement accessibility in my iOS project.
Is there a way to correct the pronunciation of specific words when VoiceOver is turned on? For example, the correct pronunciation of 'speech' is [spiːtʃ], but I want VoiceOver to read every occurrence of the word 'speech' as 'speak' [spiːk] throughout my whole project.
I know one option is to set the accessibility label of any UI element whose pronunciation I want to change to 'speak'. However, some elements are dynamic. For example, we get the label text from the back end, so we never know when that text will contain 'speech'. If I get the word 'speech' from the back end, I would still like VoiceOver to read it as 'speak'.
Therefore, I would like to change a setting for VoiceOver so that every time the word is 'speech', VoiceOver reads it as 'speak'.
Can I do it?
Short answer
Yes you can do it, but please do not.
Long answer
Can I do it?
Yes, of course you can.
Simply fetch the data from the backend, do a find and replace on the string for any words you want spoken differently (using a dictionary of replacements), and then set the new version of the string as the accessibility label.
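Purely as an illustration of that find-and-replace idea (the dictionary and names below are my own, and the very next section explains why you should not ship this):

    import UIKit

    // Hypothetical replacement dictionary: spoken form keyed by written form.
    let pronunciationFixes = ["speech": "speak"]

    func spokenVersion(of text: String) -> String {
        pronunciationFixes.reduce(text) { partial, fix in
            partial.replacingOccurrences(of: fix.key, with: fix.value)
        }
    }

    let label = UILabel()
    label.text = "Start the speech"                                  // what sighted users see
    label.accessibilityLabel = spokenVersion(of: label.text ?? "")   // what VoiceOver reads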
SHOULD you do it?
Absolutely not.
Every time someone tries to "fix" pronunciation it ends up making things a lot worse.
I don't even understand why you would want screen reader users to hear "speak" whenever anyone else sees "speech"; it does not make sense and is likely to break the meaning of sentences:
"I attended the speech given last night, it was very informative".
Would transform into:
"I attended the speak given last night, it was very informative"
Screen reader users are used to it.
A screen reader user is used to hearing things said differently (and incorrectly!); my guess is that you have not been using a screen reader long enough to get used to the idiosyncrasies of screen reader speech.
Far from helping screen reader users, you will actually end up making things worse.
I have only ever overridden default screen reader behaviour twice: once when a version number was being read as a date, and once in a password manager that read the password back and would try to read it as words.
Other than those very narrow examples, I have not come across a reason to change things for a screen reader.
What about braille users?
You could change things because they don't sound right, but braille users also use screen readers, and changing things for them could be very confusing (as per the "speech" example above).
What about best practices?
"Give assistive technology users as similar an experience as possible to non assistive tech users." That is the number one guiding principle of accessibility. The second you change pronunciations and words, you potentially change the meaning of sentences and therefore offer a different experience.
Summing up
Anyway, this is turning into a rant when it isn't meant to be (my apologies, I am just trying to get the point across, as I answer similar questions quite often!). Hopefully you get the idea: leave it alone and present the same information. I haven't even covered the different speech synthesizers, language translation, and everything else that using "unnatural" language can interfere with.
The easiest solution is to return a second string from the backend that is used just for the accessibilityLabel.
If you need a bit more control, you can pass an attributed string as the accessibility label, which gives you a number of different options for controlling pronunciation:
https://medium.com/macoclock/ios-attributed-accessibility-labels-f54b8dcbf9fa
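For instance, a hedged sketch of that attributed-label route using the speech IPA attribute (the IPA value here is only an example):

    import UIKit

    // Sketch: tell VoiceOver to pronounce this label from an IPA string
    // instead of its default pronunciation of the written text.
    let label = UILabel()
    label.text = "speech"

    let announcement = NSMutableAttributedString(string: label.text ?? "")
    announcement.addAttribute(.accessibilitySpeechIPANotation,
                              value: "spiːk",   // example IPA value only
                              range: NSRange(location: 0, length: announcement.length))
    label.accessibilityAttributedLabel = announcement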
We are trying to use the built-in iOS text-to-speech tool to read Chinese words in the app.
It is good at reading longer texts, but we have problems with isolated words.
For example, take the character 还. It can be pronounced "hái", meaning "also, in addition", or "huán", meaning "to return".
In the phrase 我还要还钱 (wǒ hái yào huán qián) it pronounces 还 both ways, correctly.
For the isolated character "还", however, iOS only reads it as "hái". How can we make it pronounce characters the way we need (if that is possible)?
As a quick solution, you can cut the required words out of longer recordings and play them back as audio files instead of using TTS.
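A rough sketch of that workaround, assuming a pre-recorded clip ("huan2.m4a" is a hypothetical file bundled with the app):

    import AVFoundation

    // Keep a strong reference so playback isn't cut off by deallocation.
    var wordPlayer: AVAudioPlayer?

    func playRecordedWord() {
        guard let url = Bundle.main.url(forResource: "huan2", withExtension: "m4a") else { return }
        wordPlayer = try? AVAudioPlayer(contentsOf: url)
        wordPlayer?.play()
    }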
Given a known manuscript (text) which I expect the user to read (more or less accurately), what is the best approach to recognize the user's progress within the manuscript?
While I'm searching for a particular solution on iOS, I'm also interested in a more general answer.
iOS provides a speech recognition framework called Speech that I can use to recognize any speech. My current approach is to take the string results of this framework and match them against the manuscript. However, it seems to me that this has quite some overhead, and that it would save resources and increase precision if I could first feed the speech recognizer the expected words so that it "knows" what to listen for.
For example, when the next word in the manuscript is "fish", I don't
need the speech recognizer to search the whole English language
dictionary for a word that best matches the recorded audio – I only
need to get a probability value how likely it is that the user just
said "fish".
I think this is very similar to keyword spotting, except that I'm not spotting just a few keywords but the words of a whole manuscript.
Unfortunately, I haven't been able to find such an API on iOS. Is there any better approach to achieve this "speech tracking" than the one described above?
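For reference, here is a rough sketch of the matching approach I describe above (the class and names are mine; it assumes it is fed result.bestTranscription.formattedString from an SFSpeechRecognizer task with partial results enabled):

    import Foundation

    // Naive word-by-word alignment of live transcriptions against the manuscript.
    final class ManuscriptTracker {
        private let manuscriptWords: [String]
        private var consumedSpokenWords = 0
        private(set) var position = 0          // index of the next expected word

        init(manuscript: String) {
            manuscriptWords = manuscript.lowercased()
                .components(separatedBy: .whitespacesAndNewlines)
                .filter { !$0.isEmpty }
        }

        // Call with result.bestTranscription.formattedString on every (partial)
        // result. Note: partial results may revise earlier words, which this
        // naive version ignores.
        func update(with transcription: String) {
            let spoken = transcription.lowercased()
                .components(separatedBy: .whitespacesAndNewlines)
                .filter { !$0.isEmpty }
            guard spoken.count > consumedSpokenWords else { return }
            let newWords = spoken[consumedSpokenWords...]
            consumedSpokenWords = spoken.count
            for word in newWords where position < manuscriptWords.count {
                if word == manuscriptWords[position] { position += 1 }
            }
        }
    }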
However, it seems to me that this has quite some overhead, and that it would save resources and increase precision if I could first feed the speech recognizer the expected words so that it "knows" what to listen for.
Maybe it would, but the Speech framework provides no way for you to do that, so you can't.
I'm working on an application that requires the use of a text-to-speech synthesizer. Implementing this was rather simple on iOS using AVSpeechSynthesizer. However, when it comes to customizing synthesis, I was directed to documentation about speech synthesis for an OS X-only API, which allows you to input phoneme pairs, in order to customize word pronunciation. Unfortunately, this interface is not available on iOS.
I was hoping someone might know of a similar library or plugin that might accomplish the same task. If you do, it would be much appreciated if you would lend a hand.
Thanks in advance!
AVSpeechSynthesizer on iOS is not capable (out of the box) of working with phonemes. NSSpeechSynthesizer is, but it is not available on iOS.
You can create an algorithm that produces short phonemes, but it would be incredibly difficult to make it sound good by any means.
... allows you to input phoneme pairs, in order to customize word pronunciation. Unfortunately, this interface is not available on iOS.
This kind of interface is definitely available on iOS: in your device settings (iOS 12), go to General > Accessibility > Speech > Pronunciations, then:
Select the '+' icon to add a new phonetic element.
Name this new element in order to quickly find it later on.
Tap the microphone icon.
Vocalize an entire sentence or a single word.
Listen to the different system proposals.
Validate your choice with the 'OK' button or cancel to start over.
Tap the back button to confirm the newly created phonetic element.
Find all the generated elements on the Pronunciations page.
Following the steps above, you will be able to synthesize speech using phonemes on iOS.
I am trying to use OpenEars for a small part of my app. I have three or four keywords that I want to be able to "listen" for, something like "Add", "Subtract", etc. I am just using the sample app found here. I want to have a special case in the app for when I hear "Add" etc., as opposed to a word that is not one of my four keywords. Right now I set my language to be only the four keywords, but whenever the OpenEars API hears anything, it picks between my four keywords. So if I cough, it picks the closest word out of the four.
How can I listen for a specific word without always choosing one of the keywords?
I was thinking I could have a whole bunch of words, a few hundred, and just check which word was spoken, with a special case for my four keywords, but I don't want to have to type out each word. Does OpenEars provide any default languages?
OpenEars developer here. Check out the dynamic grammar generation API that was just added in OpenEars 1.7 which may provide the right results for your requirements: http://www.politepix.com/2014/04/10/openears-1-7-introducing-dynamic-grammar-generation/
This approach might be more suitable for keyword detection and detection of fixed phrases. Please bring further questions to the OpenEars forums if you'd like to troubleshoot them with me.