How to identify audio files containing the same set of words? - signal-processing

I want to implement an application where, given an audio clip containing speech as a query, it returns the most similar audio clip already submitted by another user.
Here, two audio clips are similar if they contain approximately the same set of words.
For example, if the query speech is "Hello World!":
it returns "Hello my World!", "Hello Worlds!"
it doesn't necessarily return "Hello Earth" or "Bye world!"
it doesn't return "Trump is a dickhead" (even if it is true, but that's another story :) )
Notice that this "audio detector" MUST be robust against different timbres (different users' voices). It would be cool if it were also robust against noise (like reasonable outdoor noise) and melody distortion (like matching "Hello World!" with "Hellooo World!").
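One way to make that similarity definition concrete, assuming you first run both clips through any speech-to-text engine (so timbre and voice differences are absorbed by the recognizer), is to compare the transcripts as word sets. A minimal Python sketch, where the example strings stand in for real STT output:

def word_set(transcript):
    # lowercase and strip punctuation so "World!" and "world" match
    return {w.strip("!?.,").lower() for w in transcript.split()}

def similarity(a, b):
    # Jaccard similarity between the two word sets (1.0 = identical)
    wa, wb = word_set(a), word_set(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

print(similarity("Hello World!", "Hello my World!"))  # 0.67 -> similar
print(similarity("Hello World!", "Bye world!"))       # 0.33 -> less similar

The query would then return the stored clip whose transcript maximises this score; robustness to noise and stretched words ("Hellooo") is mostly inherited from whatever STT engine you pick.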

Related

Is there an AI / ML service that is generally used after STT?

I have successfully finished the speech-to-text conversion program, and the results usually show 80-90% confidence in the transcript.
Now, is there a service that can be used (or is usually used) to improve the confidence of the transcript?
For example:
I want to search for a name in a directory:
.
.
Sunil Chauhan
Sunit Chavan
Sumit Chawhan
.
.
All three names above are valid (as in, they exist). But Sunit is less common than Sunil or Sumit, and all of them have almost identical surnames.
A human can hear the difference in speech, but how do we handle the text response from Google speech recognition, which returns the most common Sunil/Sumit with the most common Chauhan when the speaker actually said Sunit Chavan?
Is there an available AI or ML service which can be used in such cases?
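One common baseline (a sketch, not a full solution): re-rank the recognizer's output against the known directory with fuzzy string matching, so results are constrained to names that actually exist. Python's standard difflib is enough to illustrate; a real system would also rescore the recognizer's n-best alternatives and their confidences against this list:

import difflib

directory = ["Sunil Chauhan", "Sunit Chavan", "Sumit Chawhan"]

def best_matches(stt_output, names, n=3):
    # returns directory entries closest to the STT output, best first
    return difflib.get_close_matches(stt_output, names, n=n, cutoff=0.0)

# illustrative misrecognition: "Sunit Chavan" heard as "Sunil Chavan"
print(best_matches("Sunil Chavan", directory))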

Text to Speech framework for iOS with a kid's voice

I am trying to build a kids' game using Swift. I want to use a text-to-speech API in my app, but all the APIs I came across offered only male- or female-sounding robot voices. Is there any API that converts text to speech with a kid's voice, or something similar?
Thanks!
You can just use the standard AVSpeechSynthesizer and increase the pitch:
import AVFoundation

let synthesizer = AVSpeechSynthesizer() // keep a strong reference so speech isn't cut off
let utterance = AVSpeechUtterance(string: "Hi, uh.. I'm a.. um kid!")
utterance.pitchMultiplier = 1.3 // or whatever value you find works well
synthesizer.speak(utterance)

NLP & ML Text Extraction

I have some user chat data categorised into various categories. The problem is that there are a lot of algorithm-generated categories; please see the example below:
Message | Category
I want to play cricket | Play cricket
I wish to watch cricket | Watch cricket
I want to play cricket outside | Play cricket outside
As you can see, the categories (essentially phrases) are extracted from the text itself.
Based on my data, there are 10,000 messages with approximately 4,500 unique categories.
Is there any suitable algorithm that can give me good prediction accuracy in such cases?
Well, I habitually use OpenNLP's DocumentCategorizer for tasks like this, but I think Stanford CoreNLP does some similar stuff. OpenNLP uses Maximum Entropy for this, but there are many ways to do it.
First, some thoughts on the number of unique labels. Basically you only have a few samples per class, and that is generally a bad thing: your classifier is going to give sucky results no matter what it is if you try to do it the way you are implying, because of overlap and/or underfitting. So here's what I've done before in a similar situation: separate concepts into different thematic classifiers, then assemble the best scores from each. For example, based on what you wrote above, you may be able to detect OUTSIDE or INSIDE with one classification model, and then WATCHING CRICKET vs PLAYING CRICKET in another. At runtime, you would pass the text into both classifiers and take the best hit from each to assemble a single category. Rough code, using OpenNLP's doccat API:
// (assumes imports: opennlp.tools.doccat.*, java.io.FileInputStream; model file names are illustrative)
DoccatModel outOrInModel = new DoccatModel(new FileInputStream("out-or-in.bin"));
DoccatModel cricketModeModel = new DoccatModel(new FileInputStream("watch-or-play.bin"));
DocumentCategorizerME outOrIn = new DocumentCategorizerME(outOrInModel);
DocumentCategorizerME cricketMode = new DocumentCategorizerME(cricketModeModel);

String stringToDetectClassOf = "Some dude is playing cricket outside, he sucks";
// take the best-scoring label from each classifier and combine them
String outOrInCat = outOrIn.getBestCategory(outOrIn.categorize(stringToDetectClassOf));
String cricketModeCat = cricketMode.getBestCategory(cricketMode.categorize(stringToDetectClassOf));
String best = outOrInCat + " " + cricketModeCat;
you get the point I think.
Also some other random thoughts:
- Use a text index to explore the amount of data you get back to figure out how to break up the categories.
- You want a few hundred examples for each model
Let me know if you want some code examples from OpenNLP, if you are doing this in Java.

Scan video for text string?

My goal is to find the title screen from a movie trailer. I need a service where I can search a video for a string, then return the frame containing that string. Pretty obscure; does anything like this exist?
e.g. for this movie, I'd scan for "Sausage Party" and retrieve this frame:
Edit: I found the CloudSight API, which would actually work, except the cost is prohibitive at $0.04 per call, assuming I need to split the video into 1s intervals and scan every image (at least 60 calls per video).
No exact service that I can find, but you could attempt to do this yourself...
ffmpeg -i sausage_party.mp4 -r 1 %04d.png
/usr/local/bin/parallel --no-notice -j 8 \
/usr/local/bin/tesseract -psm 6 -l eng {} {.} \
::: *.png
This extracts one frame per second from the video file, and then uses tesseract to extract the text via OCR into files with the same name as the image frame (e.g. 0135.txt). However, your results are going to vary massively depending on the font used and the quality of the video file.
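To finish the job, a small script can search the OCR output files generated above for the target string and report which frame it came from. A minimal Python sketch (file names follow the %04d pattern from the ffmpeg command):

import glob

def find_frame(target):
    # scan the tesseract output files; the matching frame is the PNG
    # with the same basename (e.g. 0135.txt -> 0135.png)
    for path in sorted(glob.glob("*.txt")):
        with open(path, errors="ignore") as f:
            if target.lower() in f.read().lower():
                return path.replace(".txt", ".png")
    return None

print(find_frame("Sausage Party"))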
You'd probably find it cheaper/easier to use something like Amazon Mechanical Turk, especially since the OCR is going to have a hard time doing this automatically.
Another option could be implementing this service yourself using the Scene Text Detection and Recognition module in OpenCV (docs.opencv.org/3.0-beta/modules/text/doc/text.html). You can take a look at this video to get an idea of how such a system would operate. As pointed out above, the accuracy will depend on the font used in the movie titles, the quality of the video files, and the OCR.
OpenCV relies on Tesseract as the underlying OCR, but, alternatively, you could use the text detection and localization functions (docs.opencv.org/3.0-beta/modules/text/doc/erfilter.html) in OpenCV to find text areas in the image and then employ a different OCR to perform the recognition. The text detection and localization stage can be done very quickly, so achieving real-time performance would mostly be a matter of picking a fast OCR.
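If you go the OpenCV route, here is a rough Python sketch of that detect-then-recognize split, using the MSER detector as a simple stand-in for the ER-filter pipeline (the file name is illustrative); each candidate region would then be cropped and handed to whichever OCR you pick:

import cv2

img = cv2.imread("frame.png")  # illustrative frame extracted from the trailer
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# MSER finds stable regions that often correspond to characters/words
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)
for x, y, w, h in bboxes:
    crop = img[y:y + h, x:x + w]  # candidate text area for the recognition stage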

How can I synchronize a text with audio/sound in XNA/XACT?

I want to display text while a sound is playing in the background. In short, if there is audio for "What is this", I want to display the text "What is this" in a text box synchronously. Is this possible with XNA/XACT? And can I use this in standard C#-based WPF or Silverlight applications?
I appreciate your help.
I'm not sure if XNA has any built-in support for this, but you could set up a second meta file that holds time and action information. For example, mark a time for each word or phrase spoken in the file and write out the text at the appropriate time.
The same way as when making subtitles for movies. Example:
00:02 Who are you?
00:05 An angel.
00:07 What's your name?
Compare the playback time with these timestamps and show each message in the text box for some duration, as in the sketch below.
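The lookup itself is trivial in any language (in XNA you'd write the same thing in C#, driven from Update()); a Python sketch that parses "MM:SS text" cue lines and returns the cue to display at a given playback time:

def parse_cues(lines):
    # "00:05 An angel." -> (5, "An angel.")
    cues = []
    for line in lines:
        stamp, text = line.split(" ", 1)
        m, s = stamp.split(":")
        cues.append((int(m) * 60 + int(s), text))
    return sorted(cues)

def current_cue(cues, playback_seconds):
    # show the latest cue whose timestamp has already passed
    shown = None
    for t, text in cues:
        if t <= playback_seconds:
            shown = text
    return shown

cues = parse_cues(["00:02 Who are you?", "00:05 An angel.", "00:07 What's your name?"])
print(current_cue(cues, 6))  # -> "An angel."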
