Speech Recognition Using MFCC to rectify pronunciation - iOS

I am building a speech recognition application for iOS in Objective-C/C++ for rectifying the pronunciation of the speaker.
I am using Mel-Frequency-Cepstrum Coefficients and Matching the two Sound-Waves using DTW.
Please correct me if I am wrong.
Now I want to know which word in the sentence (across the two sound files) mismatches.
e.g. My two sound files speak
1. I live in New York.
2. I laav in New York.
My algorithm should somehow point to the 2nd word with some sort of indication.
I have used the MatchBox open-source library for reference. Here is its link.
Any new algorithm or library is welcome.
P.S. I don't want to use text-to-speech synthesis or speaker recognition.
Please direct me to the right resources if I have posted this question in the wrong place.
Any little hint is also welcome.
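For illustration, here is a minimal sketch of the MFCC + DTW comparison described in the question, written in Python with librosa for brevity rather than in the asker's Objective-C/C++ or the MatchBox library. The file names and the frame-cost threshold are assumptions, and mapping flagged frames back to a specific word would still need word boundaries (e.g. from forced alignment), which is not shown.

```python
import librosa
import numpy as np

def mfcc_features(path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(path, sr=sr)
    # Each column is one short frame of 13 cepstral coefficients.
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

ref = mfcc_features("i_live_in_new_york_reference.wav")   # "I live in New York"
test = mfcc_features("i_live_in_new_york_student.wav")    # "I laav in New York"

# DTW aligns the two MFCC sequences despite different speaking rates.
D, wp = librosa.sequence.dtw(X=ref, Y=test, metric="euclidean")
wp = wp[::-1]  # warping path from start to end, as (ref_frame, test_frame) pairs

# Local cost along the aligned path: large values mark regions where the two
# utterances disagree, i.e. the mispronounced word ("laav" vs "live").
local_cost = np.array([np.linalg.norm(ref[:, i] - test[:, j]) for i, j in wp])
threshold = local_cost.mean() + 2 * local_cost.std()   # assumed heuristic
bad_ref_frames = {i for (i, j), c in zip(wp, local_cost) if c > threshold}
print("Frames flagged as mismatched:", sorted(bad_ref_frames))
```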

Related

How to feed speech files into RNN/LSTM for speech recognition?

I am working on RNNs/LSTMs. I have done a simple project in which I input text into an RNN. But I don't know how to input speech into RNNs, or how to preprocess speech for recurrent networks. I have read many articles on Medium and other sites, but I am still not able to use speech with these networks. Can you share any project that combines speech with RNNs/LSTMs, or anything else that can help me?
You will need to convert the raw audio signal into a spectrogram or some other convenient format that is easier to process with RNNs/LSTMs. This Medium blog should be helpful. You can look at this GitHub repo for an implementation.
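As a minimal sketch of that preprocessing step, the snippet below turns a WAV file into a log-mel spectrogram with librosa and feeds the frame sequence into a PyTorch LSTM. The file name, mel-band count, and hidden size are illustrative assumptions; a real recognizer would add a classifier (e.g. CTC over characters or phonemes) on top of the LSTM outputs.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def wav_to_logmel(path, sr=16000, n_mels=64):
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)          # shape: (n_mels, n_frames)
    return logmel.T.astype(np.float32)         # RNNs expect (time, features)

features = wav_to_logmel("utterance.wav")      # shape: (n_frames, 64)
x = torch.from_numpy(features).unsqueeze(0)    # batch of 1: (1, n_frames, 64)

# One LSTM layer over the frame sequence.
lstm = nn.LSTM(input_size=64, hidden_size=128, batch_first=True)
outputs, (h_n, c_n) = lstm(x)
print(outputs.shape)                           # (1, n_frames, 128)
```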

Waveform Comparison

I am working on a personal research project.
My objective is to be able to recognize a sound and identify whether it belongs to the IPA or not by comparing its waveform to a waveform in my database. I have some skill with Mathematica, SciPy, and PyBrain.
For the first phase, I'm only using the English (US) phonetic alphabet.
I have a simple test bank of English phonetic alphabet sound files I found online. The trick here is:
I want to separate a sound file into waveforms that correspond to different syllables - this will take a learning algorithm. So, 'I like apples' would be cut up into the syllable waveforms that make up the sentence.
Each waveform is then compared against the English phonetic alphabet's waveforms. I'm not certain how to do this part. I was thinking of using Praat to detect the waveforms, capture an image of each waveform, and compare it to the one stored in the database with image analysis (which is kind of fun to do).
The catch here is that I don't know how to make Praat generate a waveform file automatically and then cut it up between syllables into waveform chunks. Logically, I would just prepare test cases for a learning algorithm and teach the computer to do it.
Instead of needing a waveform image - could I do this with a fast Fourier transform and compare two FFTs, considering them syllable y if they match within an x% margin of error?
Frankly, I don't really know about Praat, but I find your project super cool and interesting. I have experience with detecting faults in car motors from their sound, which might be related to your project. I used neural networks and SVMs to do the classification because multiple research papers had validated that approach, so I didn't have any doubt about the method I chose. My advice is to research and read some papers on the topic. It really helps when you have questions like this (Will it work? Can I use this instead? Am I using the optimal solution? etc.). And good luck - that's an awesome project :)
You could try Praat scripting.
Using just an FFT will give you rather terrible results: a very long feature vector that will be really difficult to segment and train on - thousands of points for a single syllable. Some deep neural networks are able to cope with that, but only if you design them properly and provide a huge training set. The advantage of neural networks is that they can build features for you from the "raw data" (and I would consider an FFT "raw" too). However, when you work with sound this isn't really needed - you can engineer features manually. In the case of sound, science knows very well what sort of "features" sounds have.
You can calculate these features with libraries like Yaafe. I recommend checking it out even if you are not working in C++ or Python - the link I provided also gives the formulas for calculating them. I used some of them in my kiwi classifier.
Another good approach comes from scikit-talkbox, which provides exactly the tooling you might need.
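To make the "engineered features instead of raw FFT" advice concrete, here is a minimal sketch that represents each clip by MFCCs and compares a candidate syllable against a small reference bank using DTW distance. It uses librosa rather than Yaafe or scikit-talkbox for brevity; the file paths and the contents of the reference bank are illustrative assumptions.

```python
import librosa
import numpy as np

def mfcc(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# Reference bank: one recording per phonetic symbol (assumed file names).
bank = {"AE": mfcc("refs/ae.wav"), "IY": mfcc("refs/iy.wav"), "UW": mfcc("refs/uw.wav")}

def dtw_distance(a, b):
    # Cumulative cost at the end of the optimal warping path, length-normalized.
    D, wp = librosa.sequence.dtw(X=a, Y=b, metric="euclidean")
    return D[-1, -1] / len(wp)

candidate = mfcc("segments/syllable_01.wav")
scores = {label: dtw_distance(candidate, ref) for label, ref in bank.items()}
best = min(scores, key=scores.get)
print("Closest reference:", best, scores)
```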

Correlation audio opencv

I am guessing I should use OpenCV correlation. I need to know if a piece of an audio file is contained inside another audio file. Can anyone tell me how I could proceed?
Or is there another solution?
Thanks, guys.
Correlation might be the right tool for the job if the problem you are trying to solve is checking for the occurrence of an exact section of one file in another. However, if either of the following is true, you will need another solution:
You intend to search a corpus (e.g. a database of files) for occurrences [scales badly]
The audio has been processed (e.g. stretched, compressed) [correlation is not particularly robust]
The usual way of solving this problem is with Feature Extraction and feature matching algorithms. Whilst OpenCV provides examples of both of these types of algorithms for image processing, it is probably not the weapon of choice for audio.
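For the exact-match case the answer allows, here is a minimal sketch of plain cross-correlation with NumPy/SciPy: slide the short clip over the long file and look for a strong peak. The file names are assumptions, and both files are assumed to share a sample rate.

```python
import numpy as np
import librosa
from scipy.signal import correlate

long_audio, sr = librosa.load("full_recording.wav", sr=None)
clip, _ = librosa.load("snippet.wav", sr=sr)

# 'valid' mode gives one correlation value per candidate offset of the clip.
corr = correlate(long_audio, clip, mode="valid")
offset = int(np.argmax(corr))
window = long_audio[offset:offset + len(clip)]
peak = corr[offset] / (np.linalg.norm(clip) * np.linalg.norm(window) + 1e-12)

print(f"Best match at {offset / sr:.2f} s, normalized peak = {peak:.2f}")
# A peak close to 1.0 suggests the snippet occurs there; stretched or
# re-encoded audio blurs this, which is why feature matching is preferred
# for processed audio, as noted above.
```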

How to detect character in natural text image?

I have a project about Character Recognition (using openCV libraries).
I don't know how to detect characters in a text image.
Can you recommend some methods to do this?
Thanks all!
Here is a tutorial, though it is dated and uses the C-style API. This online book has a good deal on OCR using OpenCV in chapter 5. Many people have done work integrating Tesseract (an OCR engine) with OpenCV, so you might want to check that out.
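As a minimal sketch of one common OpenCV approach to *detecting* character regions (recognition itself would then go to an OCR engine such as Tesseract), the snippet below thresholds the image and treats connected regions as candidate glyphs. The input file name and the size filter are illustrative assumptions.

```python
import cv2

img = cv2.imread("text_photo.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Adaptive thresholding copes better with uneven lighting in natural images.
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY_INV, 31, 15)

# Connected regions in the binary image are candidate characters.
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if 8 < h < 100 and w < 100:              # crude size filter to drop noise
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 1)

cv2.imwrite("characters_detected.jpg", img)
```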

Analyze voice patterns iOS

I am looking for a way / library to analyze voice patterns. Say, there are 6 people in the room. I want to identify each one by voice.
Any hints are much appreciated.
Dmitry
The task of taking a long contiguous audio recording and splitting it up into chunks in which only one speaker is speaking - without any prior knowledge about the voice characteristics of each speaker - is called "speaker diarization". You can find links to research code on the Wikipedia page.
If you have prior recordings of each voice and would rather do classification, this is a slightly different problem (speaker recognition or speaker identification). Software tools for that are available here (note that general-purpose speech recognition packages like Sphinx or HTK are flexible enough to be coaxed into doing that).
Answered here https://dsp.stackexchange.com/questions/3119/library-to-differentiate-people-by-their-voice-timbre
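For the "prior recordings of each voice" case, here is a minimal sketch of a classic speaker identification pipeline: train one Gaussian mixture model per speaker on MFCCs, then score a new utterance against each model. It uses librosa and scikit-learn in Python rather than anything iOS-specific, and the enrollment file names and GMM size are assumptions for illustration.

```python
import librosa
import numpy as np
from sklearn.mixture import GaussianMixture

def mfcc_frames(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T   # (frames, 13)

# Enrollment: a few clean recordings per known speaker (assumed paths).
speakers = {
    "alice": ["enroll/alice_1.wav", "enroll/alice_2.wav"],
    "bob":   ["enroll/bob_1.wav", "enroll/bob_2.wav"],
}
models = {}
for name, paths in speakers.items():
    X = np.vstack([mfcc_frames(p) for p in paths])
    models[name] = GaussianMixture(n_components=16, covariance_type="diag").fit(X)

# Identification: pick the model with the highest average log-likelihood.
test = mfcc_frames("unknown_speaker.wav")
scores = {name: gmm.score(test) for name, gmm in models.items()}
print(max(scores, key=scores.get), scores)
```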
