I am working on RNNs/LSTMs. I have done a simple project in which I fed text into an RNN, but I don't know how to feed speech into RNNs or how to preprocess speech for recurrent networks. I have read many articles on Medium and other sites, but I am still not able to use speech with these networks. Can you share any project that combines speech with RNNs/LSTMs, or anything else that could help me?
You will need to convert the raw audio signal into a spectrogram or some other representation that is easier to process with an RNN/LSTM. This Medium blog should be helpful, and you can look at this GitHub repo for an implementation.
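As a minimal sketch of that preprocessing step in Python (assuming librosa is installed; the file name is a placeholder), the waveform is turned into a log-mel spectrogram, which gives you a (time_steps, features) sequence you can feed to an LSTM just like a sequence of word embeddings:

    import librosa
    import numpy as np

    # Load the waveform (placeholder file name; 16 kHz is a common rate for speech)
    y, sr = librosa.load("speech.wav", sr=16000)

    # Mel spectrogram: one 40-dimensional feature vector per 10 ms frame
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)

    # Log compression is standard before feeding features to a neural network
    log_mel = librosa.power_to_db(mel)

    # librosa returns (n_mels, n_frames); RNN layers usually expect (time_steps, features)
    features = log_mel.T.astype(np.float32)
    print(features.shape)  # (n_frames, 40) -> one timestep per frame

Each utterance then becomes a variable-length sequence of 40-dimensional frames, which is the speech analogue of the token sequences you already used in your text project.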
I have recently become interested in symbol recognition and started reading about it on the Internet. I found plenty of information about the preprocessing and segmentation stages, but those are only preliminary steps before the actual transformation from image to string. Everything I found online points to ready-made solutions, like Tesseract, which do all the work behind an interface. However, I am interested in a detailed description of this process and want to understand all the steps of this transformation.
Can anybody give me links to exhaustive literature or articles on this topic? For example, the algorithm behind Tesseract's image_to_string() function. I will be thankful for any help.
The most straightforward starting point is the GitHub page of Tesseract, especially its wiki.
Or, if you want to recognize specific symbols, you can build your own recognizer using neural networks; follow this step-by-step tutorial.
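If you take the build-your-own route, the core of the recognizer is simply an image classifier over the segmented symbols; segmentation hands you cropped glyphs, and string assembly is just concatenating the predicted labels in reading order. A minimal sketch with tf.keras, where the input size and number of classes are assumptions you would adapt to your symbol set:

    import tensorflow as tf

    NUM_CLASSES = 36  # assumption: digits + Latin letters; adjust to your symbol set

    # A small CNN that maps a 28x28 grayscale symbol image to class probabilities
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(28, 28, 1)),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # x_train: (N, 28, 28, 1) float32 images, y_train: (N,) integer labels
    # model.fit(x_train, y_train, epochs=10, validation_split=0.1)

This only covers the image-to-label step; Tesseract additionally does page layout analysis, line finding, and language modeling, which is what the literature on its internals describes.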
I've been doing some research on the feasibility of building a mobile/web app that allows users to say a phrase and detects the accent of the user (Boston, New York, Canadian, etc.). There will be about 5 to 10 predefined phrases that a user can say. I'm familiar with some of the speech-to-text APIs that are available (Nuance, Bing, Google, etc.), but none seem to offer this additional functionality. The closest examples that I've found are Google Now and Microsoft's Speaker Recognition API:
http://www.androidauthority.com/google-now-accents-515684/
https://www.microsoft.com/cognitive-services/en-us/speaker-recognition-api
Because there are going to be 5-10 predefined phrases, I'm thinking of using machine learning software like TensorFlow or Wekinator. I'd have initial audio created in each accent to use as the initial data. Before I dig deeper into this path, I just wanted to get some feedback on this approach or on whether there are better approaches out there. Let me know if I need to clarify anything.
There is no public API for such a rare task.
Accent detection, like language detection, is commonly implemented with i-vectors. A tutorial is here. An implementation is available in Kaldi.
You need a significant amount of data to train the system even if your sentences are fixed. It might be easier to collect accented speech without focusing on the specific sentences you have.
An end-to-end TensorFlow implementation is also possible, but it would probably require too much data, since you need to separate speaker-intrinsic characteristics from accent-intrinsic ones (basically perform the same factorization an i-vector does). You can find descriptions of similar work like this and this one.
You could use (this is just an idea, you will need to experiment a lot) a neural network with as many outputs as you have possible accents, with a softmax output layer and a cross-entropy cost function.
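To make that concrete, here is a minimal sketch of such a softmax classifier in Keras over simple MFCC summary features; the feature choice, layer sizes, and number of accent classes are assumptions for illustration, not a tested recipe:

    import numpy as np
    import librosa
    import tensorflow as tf

    N_ACCENTS = 6  # assumption: number of accent classes you have data for

    def utterance_features(path):
        """Summarize a recording as the mean and std of its MFCC frames (a simple baseline)."""
        y, sr = librosa.load(path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)            # (20, n_frames)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])  # (40,)

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(40,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(N_ACCENTS, activation="softmax"),       # one output per accent
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",             # cross-entropy over accent labels
                  metrics=["accuracy"])

    # X = np.stack([utterance_features(p) for p in audio_paths])  # audio_paths, accent_labels are yours
    # y = np.array(accent_labels)
    # model.fit(X, y, epochs=30, validation_split=0.2)

Note that this deliberately ignores the speaker/accent factorization mentioned above; with only 5-10 fixed phrases, such a simple model can easily latch onto speaker identity unless you have many speakers per accent.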
I am building a speech recognition application for iOS in Objective-C/C++ for correcting the speaker's pronunciation.
I am using Mel-frequency cepstral coefficients (MFCCs) and matching the two sound waves using DTW (dynamic time warping).
Please correct me if I am wrong.
Now I want to know which word in the sentence (between the two sound files) mismatches.
e.g. My two sound files speak
1. I live in New York.
2. I laav in New York.
My algorithm should somehow point to the 2nd word with some sort of indication.
I have used the Match-Box open-source library for reference. Here is its link.
Any new algorithm or any new library is welcome.
P.S. I don't want to use text-to-speech synthesis or speaker recognition.
Please direct me to the right resources if I have posted this question in the wrong place.
Any little hint is also welcome.
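Not an iOS library recommendation, but to make the word-localization idea concrete, here is a rough Python sketch of the MFCC + DTW comparison described above: the local cost along the warping path shows where the two recordings disagree, and if you know the word boundaries (frame ranges) in the reference, you can map the high-cost region back to a word. The file names and boundary frame ranges below are placeholders, not real data.

    import numpy as np
    import librosa

    def mfcc_features(path):
        y, sr = librosa.load(path, sr=16000)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)

    ref = mfcc_features("i_live_in_new_york_reference.wav")  # placeholder file names
    test = mfcc_features("i_laav_in_new_york_learner.wav")

    # DTW alignment between the two MFCC sequences
    D, wp = librosa.sequence.dtw(X=ref, Y=test, metric="euclidean")
    wp = wp[::-1]  # warping path as (ref_frame, test_frame) pairs, start to end

    # Local mismatch per aligned frame pair
    local_cost = np.array([np.linalg.norm(ref[:, i] - test[:, j]) for i, j in wp])

    # Placeholder word boundaries in reference frames (e.g. from forced alignment or labeling)
    word_bounds = {"I": (0, 8), "live": (8, 24), "in": (24, 32), "New": (32, 48), "York": (48, 70)}

    for word, (start, end) in word_bounds.items():
        mask = (wp[:, 0] >= start) & (wp[:, 0] < end)
        print(word, local_cost[mask].mean())  # the mispronounced word should score highest

The same idea (per-region alignment cost over a DTW path) can be ported to the Objective-C/C++ side once you have word boundaries for your reference recordings.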
I want to make an iOS app to count interrogative sentences. I will look for WH questions and also "will I, am I?" format questions.
I am not very well versed in the speech or audio technology world, but I did some googling and found that there are a few speech recognition SDKs. I still have no idea how I can detect and graph intonation, though. Are there any SDKs that support intonation or emotional speech recognition?
AFAIK there is no cloud-based speech recognition SDK which also gives you intonation. You could search for pitch-tracking solutions and derive intonation from the pitch contour. An open-source one is available in the librosa package in Python:
https://librosa.org/librosa/generated/librosa.core.piptrack.html
If you can't embed Python in your application, there is always the option of serving it in a REST API with Flask or fastapi.
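As a rough illustration of the pitch-contour idea (the Python side you might put behind such a REST API, not iOS code), here is a minimal sketch with librosa.piptrack; the "rising final pitch suggests a question" heuristic and the thresholds are assumptions, not a rule that always holds:

    import numpy as np
    import librosa

    y, sr = librosa.load("utterance.wav", sr=16000)  # placeholder file

    # piptrack returns per-frame candidate pitches and their magnitudes
    pitches, magnitudes = librosa.piptrack(y=y, sr=sr)

    # Keep the strongest pitch candidate in each frame as the contour
    contour = pitches[magnitudes.argmax(axis=0), np.arange(pitches.shape[1])]
    voiced = contour[contour > 0]  # drop unvoiced/silent frames

    # Crude intonation cue: compare the pitch at the end of the utterance to its middle
    if len(voiced) > 20:
        final = np.median(voiced[-10:])
        middle = np.median(voiced[len(voiced) // 3: 2 * len(voiced) // 3])
        print("rising final intonation" if final > 1.1 * middle else "flat or falling")

For the WH-question part of your app, a plain speech-to-text SDK plus keyword matching on the transcript is probably more reliable than intonation alone.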
I'm trying to create a lightweight diphone speech synthesizer. Everything seems pretty straightforward because my native language has pretty simple pronunciation and text processing rules. The only problem I've stumbled upon is pitch control.
As far as I understand, to control the pitch of the voice, most speech synthesizers are using LPC (linear predictive coding), which essentially separates the pitch information away from the recorded voice samples, and then during synthesis I can supply my own pitch as needed.
The problem is that I'm not a DSP specialist. I have used the Ooura FFT library to extract AFR information, and I know a little bit about using Hann and Hamming windows (I have implemented the C++ code myself), but mostly I treat DSP algorithms as black boxes.
I hoped to find some open-source library that is just bare LPC code with usage examples, but I couldn't find anything. Most of the available code (like the Festival engine) is tightly integrated into the synth, and it would be a pretty hard task to separate it and learn how to use it.
Is there any C/C++/C#/Java open-source DSP library with a "black box" style LPC algorithm and usage examples, so I can just throw PCM sample data at it and get the LPC-coded output, and then feed the coded data back in and synthesize the decoded speech?
It's not exactly what you're looking for, but maybe you can get some ideas from this quite sophisticated toolbox: Praat.
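It's not C/C++ either, but if you mainly want to see what the LPC "black box" does, here is a minimal Python sketch of analysis and resynthesis of a single frame using librosa.lpc and SciPy; in a real diphone synthesizer you would do this per overlapping frame and drive the filter with your own pitch-controlled excitation (e.g. an impulse train at the desired period) instead of the residual. The file name and frame position are placeholders.

    import numpy as np
    import librosa
    from scipy.signal import lfilter

    y, sr = librosa.load("diphone.wav", sr=16000)  # placeholder recording
    frame = y[2000:2400]                           # one 25 ms analysis frame

    # Analysis: fit an all-pole model A(z) of order 16 to the frame
    a = librosa.lpc(frame, order=16)               # a[0] == 1.0

    # Inverse-filter the frame to get the residual (excitation) signal
    residual = lfilter(a, [1.0], frame)

    # Synthesis: run an excitation through the all-pole filter 1/A(z).
    # Using the original residual reconstructs the frame; substituting an
    # impulse train with your chosen period is where pitch control comes in.
    reconstructed = lfilter([1.0], a, residual)

    print(np.max(np.abs(frame - reconstructed)))   # should be close to 0

The ports of this loop to C/C++ are small (Levinson-Durbin recursion plus two IIR filters), so even a reference implementation in another language may be enough to reimplement it in your synthesizer.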