Speech recognition and intonation detection - iOS

I want to make an iOS app to count interrogative sentences. I will look for WH questions and also "will I, am I?" format questions.
I am not very well versed in the speech or audio technology world, but I did some Googling and found that there are a few speech recognition SDKs. However, I still have no idea how I can detect and graph intonation. Are there any SDKs that support intonation or emotional speech recognition?

AFAIK there is no cloud-based speech recognition SDK that also gives you intonation. You could search for pitch-tracking solutions and derive intonation from the pitch contour. An open-source one is available in the librosa package in Python:
https://librosa.org/librosa/generated/librosa.core.piptrack.html
If you can't embed Python in your application, there is always the option of serving it via a REST API with Flask or FastAPI.
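For illustration, here is a minimal Python sketch of deriving a pitch contour with librosa.piptrack; the file name, frequency range, and the max-magnitude-bin heuristic are my own assumptions, not part of the original answer:

```python
import numpy as np
import librosa

def pitch_contour(path, fmin=75.0, fmax=400.0):
    """Return one pitch estimate (Hz) per frame, 0 where the frame is unvoiced."""
    y, sr = librosa.load(path, sr=None)
    pitches, magnitudes = librosa.piptrack(y=y, sr=sr, fmin=fmin, fmax=fmax)
    # For each frame, keep the pitch of the bin with the strongest magnitude.
    frames = []
    for t in range(pitches.shape[1]):
        best = magnitudes[:, t].argmax()
        frames.append(pitches[best, t])
    return np.array(frames)

# A rising contour toward the end of an utterance is a rough cue for a question.
contour = pitch_contour("utterance.wav")
print(contour[contour > 0])  # voiced frames only
```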

Related

MediaPipe vs MLKit Vision vs ARCore

There seems to be a lot of overlap between these 3 Google libraries.
According to their sites:
MediaPipe: MediaPipe offers cross-platform, customizable ML solutions for live and streaming media.
ARCore: With ARCore, build new augmented reality experiences that seamlessly blend the digital and physical worlds.
MLKit Vision: Video and image analysis APIs to label images and detect barcodes, text, faces, and objects.
Could someone with experience working with these explain how they relate to each other and what the use cases for each are?
For example, which would be appropriate for implementing high-level, popular features such as face filters?
(Also perhaps some insight on which of the 3 is most likely to land in Google Graveyard the fastest)
Some simplified & informal explanations:
MediaPipe is a powerful but lower-level library for live and streaming ML solutions, which requires non-trivial setup and customization before it works for your use case.
ML Kit is an end-to-end solution provider, offering mobile-friendly, easy-to-use APIs and pre-built pipelines under the hood. Several ML Kit features are actually powered by MediaPipe internally (e.g., pose detection and selfie segmentation).
There is no direct relationship between ARCore and ML Kit, but there could be shared or similar ML models between them, because both require ML models to power their features, although the two products have different focuses.

How to feed speech files into RNN/LSTM for speech recognition?

I am working on RNNs/LSTMs. I have done a simple project with an RNN in which I input text. But I don't know how to input speech into RNNs or how to preprocess speech for recurrent networks. I have read many articles on Medium and other sites, but I am still not able to use speech in the networks. Can you share any project that combines speech with RNNs/LSTMs, or anything else that can help me?
You will need to convert the raw audio signal into a spectrogram or some other convenient format that is easier to process with RNNs/LSTMs. This Medium blog post should be helpful. You can look at this GitHub repo for an implementation.
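To make that concrete, here is a hedged sketch of the idea: the file name, sample rate, mel-band count, and the toy Keras model are all placeholder choices, not anything from the linked posts. It turns a wav file into a log-mel spectrogram with librosa and feeds the frame sequence to an LSTM:

```python
import numpy as np
import librosa
import tensorflow as tf

# Convert a wav file into a (frames, features) log-mel spectrogram.
y, sr = librosa.load("speech.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
log_mel = librosa.power_to_db(mel).T            # shape: (frames, 64)

# A toy LSTM classifier over the frame sequence (10 output classes, untrained).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 64)),    # variable-length sequences
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# One utterance = one sequence; add a batch dimension before predicting.
probs = model.predict(log_mel[np.newaxis, ...])
print(probs.shape)   # (1, 10)
```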

Low Level Speech Recognition in iOS

From what I have seen online, Apple dev APIs enable use of speech recognition but only at a very high level.
Is there a way to access low-level speech recognition tools, so that, for example, the speech can be used to recognize a particular individual or determine the mood of the speaker? Of course, these would require bespoke algorithms, but I am asking whether or not the data would be available.
Thanks.

BPM detection options on iOS

I have scoured the net for resources on BPM detection for iOS and tried to implement various techniques and link against various libraries, etc., but I keep running into either build errors or BPM detection that simply doesn't work.
What are the viable options for basic BPM detection on iOS? It doesn't have to be highly accurate with onset positions, but rather just detect the BPM for a series of audio buffers.
I tried VAMP but cannot get it to run on iOS, and I've tried various C++ options but none of them work.
Are there any MIT-licensed BPM detection algorithms that integrate easily with iOS, or any commercial options that don't cost a fortune, since it's for a full audio library? I would like to detect BPM from a file, not through the microphone.
I would just like a BPM detector class as I don't have the time to learn and implement one myself at this point in time.
Any help will be greatly appreciated.

iOS: Real Time OCR on top of live camera feed (similar to iTunes Redeem Gift Card)

Is there a way to accomplish something similar to what the iTunes and App Store Apps do when you redeem a Gift Card using the device camera, recognizing a short string of characters in real time on top of the live camera feed?
I know that in iOS 7 there is now the AVMetadataMachineReadableCodeObject class which, AFAIK, only represents barcodes. I'm more interested in detecting and reading the contents of a short string. Is this possible using publicly available API methods, or some other third party SDK that you might know of?
There is also a video of the process in action:
https://www.youtube.com/watch?v=c7swRRLlYEo
I'm working on a project that does something similar to the App Store gift card redeem with the camera that you mentioned.
A great starting place for processing live video is a project I found on GitHub. It uses the AVFoundation framework, and you implement the AVCaptureVideoDataOutputSampleBufferDelegate methods.
Once you have the image stream (video), you can use OpenCV to process it. You need to determine the area in the image you want to OCR before you run it through Tesseract. You have to play with the filtering, but the broad steps you take with OpenCV are below (a rough sketch of these steps follows the list):
Convert the images to grayscale using cv::cvtColor(inputMat, outputMat, CV_RGBA2GRAY);
Threshold the images to eliminate unnecessary elements. You specify the threshold value for what to eliminate, and everything else is set to black (or white).
Determine the lines that form the boundary of the box (or whatever you are processing). You can either create a "bounding box" if you have eliminated everything but the desired area, or use the HoughLines algorithm (or the probabilistic version, HoughLinesP). Using this, you can determine line intersection to find corners, and use the corners to warp the desired area to straighten it into a proper rectangle (if this step is necessary in your application) prior to OCR.
Process the portion of the image with the Tesseract OCR library to get the resulting text. It is possible to create training files for letters in OpenCV so you can read the text without Tesseract. This could be faster but also could be a lot more work. In the App Store case, they are doing something similar: displaying the text that was read overlaid on top of the original image. This adds to the cool factor, so it just depends on what you need.
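For reference, here is a rough Python equivalent of those steps using OpenCV and pytesseract. The original answer targets iOS/C++; the file name, the Otsu threshold, and the bounding-box shortcut (instead of HoughLines corner finding) are assumptions on my part:

```python
import cv2
import pytesseract

img = cv2.imread("frame.png")

# 1. Convert to grayscale.
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# 2. Threshold away everything that is not clearly dark or light (Otsu picks the value).
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# 3. Take the bounding box of the largest remaining contour as the region to read.
contours, _ = cv2.findContours(255 - binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
roi = gray[y:y + h, x:x + w]

# 4. Run Tesseract OCR on the cropped region (single line of text).
text = pytesseract.image_to_string(roi, config="--psm 7")
print(text)
```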
Some other hints:
I used the book "Instant OpenCV" to get started quickly with this. It was pretty helpful.
Download OpenCV for iOS from OpenCV.org/downloads.html
I have found adaptive thresholding to be very useful; you can read all about it by searching for "OpenCV adaptiveThreshold". Also, if you have an image with very little in between the light and dark elements, you can use Otsu's binarization, which automatically determines the threshold value based on the histogram of the grayscale image.
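As a small illustration of those two options (the file name is a placeholder, and the block size and constant are just typical starting values):

```python
import cv2

gray = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)

# Adaptive threshold: each pixel is compared to the mean of its local
# neighborhood, which copes well with uneven lighting.
adaptive = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                 cv2.THRESH_BINARY, 11, 2)

# Otsu's binarization: one global threshold chosen from the histogram.
_, otsu = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```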
This Q&A thread seems to consistently be one of the top search hits for the topic of OCR on iOS, but is fairly out of date, so I thought I'd post some additional resources that might be useful that I've found as of the time of writing this post:
Vision Framework
https://developer.apple.com/documentation/vision
As of iOS 11, you can now use the included CoreML-based Vision framework for things like rectangle or text detection. I've found that I no longer need to use OpenCV with these capabilities included in the OS. However, note that text detection is not the same as text recognition or OCR so you will still need another library like Tesseract (or possibly your own CoreML model) to translate the detected parts of the image into actual text.
SwiftOCR
https://github.com/garnele007/SwiftOCR
If you're just interested in recognizing alphanumeric codes, this OCR library claims significant speed, memory consumption, and accuracy improvements over Tesseract (I have not tried it myself).
ML Kit
https://firebase.google.com/products/ml-kit/
Google has released ML Kit as part of its Firebase suite of developer tools, in beta at the time of writing this post. Similar to Apple's CoreML, it is a machine learning framework that can use your own trained models, but it also has pre-trained models for common image processing tasks, like the Vision framework does. Unlike the Vision framework, it also includes a model for on-device text recognition of Latin characters. Currently, use of this library is free for on-device functionality, with charges for using Google's cloud/SaaS API offerings. I have opted to use this in my project, as the speed and accuracy of recognition seem quite good, and I will also be creating an Android app with the same functionality, so having a single cross-platform solution is ideal for me.
ABBYY Real-Time Recognition SDK
https://rtrsdk.com/
This commercial SDK for iOS and Android is free to download for evaluation and limited commercial use (up to 5000 units as of the time of writing this post). Further commercial use requires an Extended License. I did not evaluate this offering due to its opaque pricing.
'Real time' is just a set of images. You don't even need to think about processing all of them, just enough to broadly represent the motion of the device (or the change in the camera position). There is nothing built into the iOS SDK to do what you want, but you can use a 3rd party OCR library (like Tesseract) to process the images you grab from the camera.
I would look into Tesseract. It's an open source OCR library that takes image data and processes it. You can add different regular expressions and only look for specific characters as well. It isn't perfect, but from my experience it works pretty well. Also it can be installed as a CocoaPod if you're into that sort of thing.
If you wanted to capture that in real time you might be able to use GPUImage to catch images in the live feed and do processing on the incoming images to speed up Tesseract by using different filters or reducing the size or quality of the incoming images.
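If you go the Tesseract route, restricting the recognizer to the characters a code can actually contain helps a lot. Here is the idea shown with Python's pytesseract wrapper (the whitelist and file name are only examples; most wrappers, including the iOS ones, expose the same underlying tessedit_char_whitelist setting):

```python
import pytesseract
from PIL import Image

# Restrict Tesseract to the characters a gift-card-style code can contain
# (the whitelist below is only an example set).
config = "--psm 7 -c tessedit_char_whitelist=ABCDEFGHJKLMNPQRSTUVWXYZ0123456789"
code = pytesseract.image_to_string(Image.open("code_crop.png"), config=config)
print(code.strip())
```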
There's a project similar to that on github: https://github.com/Devxhkl/RealtimeOCR
