From what I have seen online, Apple's developer APIs expose speech recognition, but only at a very high level.
Is there a way to access lower-level speech recognition tools, so that, for example, the speech could be used to recognize a particular individual or determine the mood of the speaker? Of course, these would require bespoke algorithms; I am only asking whether the data would be available.
Thanks.
There seems to be a lot of overlap between these 3 Google libraries.
According to their sites:
MediaPipe: MediaPipe offers cross-platform, customizable ML solutions for live and streaming media.
ARCore: With ARCore, build new augmented reality experiences that seamlessly blend the digital and physical worlds.
MLKit Vision: Video and image analysis APIs to label images and detect barcodes, text, faces, and objects.
Could someone with experience working with these explain how they relate to each other and what the use cases are for each?
For example, which would be appropriate for implementing high-level, popular features such as face filters?
(Also perhaps some insight on which of the 3 is most likely to land in Google Graveyard the fastest)
Some simplified & informal explanations:
MediaPipe is a powerful but lower-level library for live and streaming ML solutions, which requires non-trivial setup and customization before it works for your use case.
ML Kit is an end-to-end solution provider, offering mobile-friendly, easy-to-use APIs and pre-built pipelines under the hood. Several ML Kit features are actually powered by MediaPipe internally (e.g. pose detection and selfie segmentation).
There is no direct relationship between ARCore and ML Kit, but there could be shared or similar ML models between them; both need ML models to power their features, but the two products have different focuses.
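To make the face-filter example concrete, here is a minimal sketch using MediaPipe's Python solutions API (the mobile equivalents live in the Android/iOS SDKs). It assumes the mediapipe and opencv-python packages; the file name selfie.jpg is just a placeholder.

```python
import cv2
import mediapipe as mp

# Hypothetical input image; any photo containing a face will do.
image_bgr = cv2.imread("selfie.jpg")
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

# Face Mesh returns hundreds of 3D landmarks per face, which is the raw
# material a face-filter pipeline would anchor effects to.
with mp.solutions.face_mesh.FaceMesh(static_image_mode=True,
                                     max_num_faces=1) as face_mesh:
    results = face_mesh.process(image_rgb)

if results.multi_face_landmarks:
    landmarks = results.multi_face_landmarks[0].landmark
    print(f"Detected {len(landmarks)} facial landmarks")
```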
When I try Google speech recognition, it performs poorly on Traditional Chinese audio files with background noise. Can I improve the recognition performance with some pre-processing (like speech enhancement)? Does that work with the Google speech service?
I would suggest that you go through this page in the Google Cloud Speech documentation outlining best practices on how to provide speech data to the service, including recommendations for pre-processing.
Keep the recording as close to the original speech signal as possible: no distortion, no clipping, no noise, and no artificial pre-processing such as noise suppression and automatic gain control. I think that kind of pre-processing can damage the useful information in the speech signal.
I copied the key points from Google and pasted them below (a small programmatic check for clipping is sketched after the list).
Position the microphone as close as possible to the person that is speaking, particularly when background noise is present.
Avoid audio clipping.
Do not use automatic gain control (AGC).
All noise reduction processing should be disabled.
Listen to some sample audio. It should sound clear, without distortion or unexpected noise.
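To check the "avoid clipping" point programmatically before uploading audio, here is a rough sketch assuming the soundfile and numpy packages; recording.wav is a placeholder file name.

```python
import numpy as np
import soundfile as sf

# Load the audio as floating-point samples in [-1.0, 1.0].
samples, sample_rate = sf.read("recording.wav")

peak = np.max(np.abs(samples))
clipped_ratio = np.mean(np.abs(samples) > 0.999)

print(f"Peak level: {peak:.3f}")
print(f"Fraction of samples at/near full scale: {clipped_ratio:.4%}")
if clipped_ratio > 0.001:
    print("Warning: audio looks clipped; re-record with lower gain.")
```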
I want to make an iOS app to count interrogative sentences. I will look for WH questions and also "will I, am I?" format questions.
I am not very familiar with the speech or audio technology world, but I did some Googling and found that there are a few speech recognition SDKs. I still have no idea how I can detect and graph intonation. Are there any SDKs that support intonation or emotional speech recognition?
AFAIK there is no cloud-based speech recognition SDK that also gives you intonation. You could search for pitch-tracking solutions and derive intonation from the pitch contour. An open-source one is available in the librosa package in Python:
https://librosa.org/librosa/generated/librosa.core.piptrack.html
If you can't embed Python in your application, there is always the option of serving it in a REST API with Flask or fastapi.
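For reference, a minimal sketch of deriving a per-frame pitch contour with librosa.piptrack (assuming a mono recording; the file name is a placeholder):

```python
import numpy as np
import librosa

# Load the recording (librosa resamples to 22050 Hz mono by default).
y, sr = librosa.load("utterance.wav")

# piptrack returns per-bin pitch candidates and their magnitudes.
pitches, magnitudes = librosa.piptrack(y=y, sr=sr)

# For each frame, keep the pitch of the strongest bin; 0 means unvoiced.
contour = np.array([pitches[magnitudes[:, t].argmax(), t]
                    for t in range(pitches.shape[1])])

# A rising contour toward the end of an utterance is a crude cue
# for question-like intonation.
print(contour[-20:])
```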
I am looking for a way / library to analyze voice patterns. Say, there are 6 people in the room. I want to identify each one by voice.
Any hints are much appreciated.
Dmitry
The task of taking a long contiguous audio recording and splitting it up into chunks in which only one speaker is speaking, without any prior knowledge about the voice characteristics of each speaker, is called "speaker diarization". You can find links to research code on the Wikipedia page.
If you have prior recordings of each voice and would rather do classification, this is a slightly different problem (speaker recognition or speaker identification). Software tools for that are available here (note that general-purpose speech recognition packages like Sphinx or HTK are flexible enough to be coaxed into doing that).
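As an illustration of the classification route (not any specific package mentioned above), here is a sketch of a classic MFCC-plus-GMM speaker identifier, assuming librosa and scikit-learn; the speaker names and file paths are placeholders.

```python
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_features(path):
    # 13 MFCCs per frame is a common baseline feature for speaker ID.
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (frames, 13)

# Enrollment: one GMM per known speaker, trained on a prior recording.
enrollment = {"alice": "alice.wav", "bob": "bob.wav"}  # placeholder files
models = {name: GaussianMixture(n_components=16).fit(mfcc_features(path))
          for name, path in enrollment.items()}

# Identification: score an unknown clip against every speaker model
# and pick the one with the highest average log-likelihood.
test = mfcc_features("unknown.wav")
scores = {name: gmm.score(test) for name, gmm in models.items()}
print(max(scores, key=scores.get), scores)
```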
Answered here https://dsp.stackexchange.com/questions/3119/library-to-differentiate-people-by-their-voice-timbre
I need to write a speech detection algorithm (not speech recognition).
At first I thought I just have to measure the microphone power and compare it to some threshold value. But the problem gets much harder once you have to take the ambient sound level into consideration (for example, in a pub a simple power threshold is crossed immediately because other people are talking).
So in the second version I thought I have to measure the current power spikes against the average sound level or something like that. Coding this idea proved to be quite hairy for me, at which point I decided it might be time to research already existing solutions.
Do you know of some general algorithm description for speech detection? Existing code or library in C/C++/Objective-C is also fine, be it commercial or free.
P.S. I guess there is a difference between “speech” and “sound” recognition, with the first one only responding to frequencies close to human speech range. I’m fine with the second, simpler case.
The key phrase that you need to Google for is Voice Activity Detection (VAD) – it's widely implemented in telecoms, particularly in Acoustic Echo Cancellation (AEC).
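The question asks for C/C++/Objective-C, and the WebRTC VAD itself is plain C that you can link directly; as a quick sketch of the idea, here it is through the Python webrtcvad bindings, assuming a 16 kHz, 16-bit, mono WAV (the file name is a placeholder).

```python
import wave
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0-3; higher = fewer false positives

with wave.open("pub_recording.wav", "rb") as wav:
    assert wav.getnchannels() == 1 and wav.getsampwidth() == 2
    sample_rate = wav.getframerate()          # must be 8/16/32/48 kHz
    frame_samples = int(sample_rate * 0.03)   # VAD accepts 10/20/30 ms frames
    while True:
        frame = wav.readframes(frame_samples)
        if len(frame) < frame_samples * 2:    # 2 bytes per 16-bit sample
            break
        print("speech" if vad.is_speech(frame, sample_rate) else "silence")
```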