I'm trying to create a lightweight diphone speech synthesizer. Everything seems fairly straightforward, because my native language has quite simple pronunciation and text-processing rules. The only problem I've stumbled upon is pitch control.
As far as I understand, to control the pitch of the voice, most speech synthesizers use LPC (linear predictive coding), which essentially separates the pitch information from the recorded voice samples; during synthesis I can then supply my own pitch as needed.
The problem is that I'm not a DSP specialist. I have used the Ooura FFT library to extract AFR information, and I know a little bit about using Hann and Hamming windows (I've implemented the C++ code myself), but mostly I treat DSP algorithms as black boxes.
I hoped to find some open-source library that is just bare LPC code with usage examples, but I couldn't find anything. Most of the available code (like the Festival engine) is tightly integrated into the synthesizer, and it would be a pretty hard task to separate it and learn how to use it.
Is there any C/C++/C#/Java open-source DSP library with a "black box" style LPC algorithm and usage examples, so I can just throw PCM sample data at it and get LPC-coded output, and then feed the coded data back in and synthesize the decoded speech?
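To make it concrete, here is a rough sketch (in Python just to illustrate; I'd port it to C++) of the encode/decode round trip I have in mind. The LPC order, frame handling, and impulse-train excitation are naive guesses on my part, not taken from any real synthesizer:

```python
# Rough LPC analyze/resynthesize sketch (NumPy/SciPy only).
# The LPC order and the impulse-train excitation are illustrative
# guesses, not taken from any real synthesizer.
import numpy as np
from scipy.signal import lfilter

def lpc_coeffs(frame, order=12):
    """Levinson-Durbin recursion on the frame's autocorrelation."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-9  # avoid division by zero on silent frames
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)
    return a

def encode(frame, order=12):
    """Split a frame into filter coefficients + excitation residual."""
    a = lpc_coeffs(frame, order)
    residual = lfilter(a, [1.0], frame)      # inverse filter A(z)
    return a, residual

def decode(a, excitation):
    """Run any excitation through the all-pole synthesis filter 1/A(z)."""
    return lfilter([1.0], a, excitation)

def impulse_train(n, f0, sr):
    """Voiced excitation at a pitch of my choosing: one pulse per period."""
    exc = np.zeros(n)
    exc[::max(1, int(sr / f0))] = 1.0
    return exc
```

As I understand it, changing the pitch would then be `decode(a, impulse_train(len(frame), new_f0, sr))`, frame by frame, although I gather real systems also interpolate coefficients between frames and use noise excitation for unvoiced sounds.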
It's not exactly what you're looking for, but maybe you can get some ideas from this quite sophisticated toolbox: Praat.
I am working on a personal research project.
My objective is to be able to recognize a sound and identify whether it belongs to the IPA by comparing its waveform to a waveform in my database. I have some skill with Mathematica, SciPy, and PyBrain.
For the first phase, I'm only using the English (US) phonetic alphabet.
I have a simple test bank of English phonetic alphabet sound files I found online. The trick here is:
I want to separate a sound file into waveforms that correspond to different syllables; this will take a learning algorithm. So 'I like apples' would be cut up into the syllable waveforms that make up the sentence.
Each waveform is then compared against the English phonetic alphabet's waveforms. I'm not certain how to do this part. I was thinking of using Praat to detect the waveforms, capture an image of each waveform, and compare it to the one stored in the database with image analysis (which is kind of fun to do).
The snag here is that I don't know how to make Praat generate a waveform file automatically and then cut it up between syllables into waveform chunks. Logically, I would just prepare test cases for a learning algorithm and teach the computer to do it.
Instead of needing a waveform image, could I do this with a fast Fourier transform and compare two FFTs? If they match within some x% margin of error, consider it syllable y?
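Something like this sketch of what I mean (Python/NumPy; the window and the 10% tolerance are arbitrary numbers I picked):

```python
# Naive idea: compare the magnitude spectra of two syllable chunks.
# The Hann window and the 10% tolerance are arbitrary placeholders.
import numpy as np

def same_syllable(chunk_a, chunk_b, tolerance=0.10):
    n = min(len(chunk_a), len(chunk_b))
    win = np.hanning(n)
    spec_a = np.abs(np.fft.rfft(chunk_a[:n] * win))
    spec_b = np.abs(np.fft.rfft(chunk_b[:n] * win))
    spec_a /= max(spec_a.max(), 1e-12)   # crude loudness normalization
    spec_b /= max(spec_b.max(), 1e-12)
    rel_err = np.linalg.norm(spec_a - spec_b) / np.linalg.norm(spec_b)
    return rel_err < tolerance
```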
Frankly, I don't really know about Praat, but I find your project super cool and interesting. I have experience with detecting faults in car motors from their sound, which might be related to your project. I used neural networks and SVMs for the classification, because multiple research papers backed that choice, so I didn't have any doubts about the approach. My advice is to research and read some papers on the subject. It really helps when you have questions like 'Will it work?', 'Can I use this instead?', or 'Am I using the optimal solution?'. Good luck, it's an awesome project :)
You could try Praat scripting.
Using just the FFT will give you rather terrible results: a very long feature vector that will be really difficult to segment and train on. That's thousands of points for a single syllable. Some deep neural networks are able to cope with it, but only if you design them properly and provide a huge training set. The advantage of neural networks is that they can build features for you from the "raw data" (and I would consider an FFT "raw" too). However, when you work with sound this isn't really necessary: you can engineer the features manually, because science knows very well what sorts of "features" a sound has.
You can calculate these features with libraries like Yaafe. I recommend checking it out even if you are not working in C++ or Python; the link I provided also gives the formulas for calculating them. I used some of them in my kiwi classifier.
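If you'd rather stay in Python, a library like librosa (assuming you can use it) computes the standard ones in a few lines; the particular feature mix below is just an illustration:

```python
# Sketch: a compact per-chunk feature vector instead of a raw FFT.
# Assumes librosa is available; the feature mix is illustrative.
import numpy as np
import librosa

def features(chunk, sr):
    mfcc = librosa.feature.mfcc(y=chunk, sr=sr, n_mfcc=13)
    zcr = librosa.feature.zero_crossing_rate(chunk)
    centroid = librosa.feature.spectral_centroid(y=chunk, sr=sr)
    # average over time: one short vector per syllable chunk
    return np.concatenate([mfcc.mean(axis=1),
                           zcr.mean(axis=1),
                           centroid.mean(axis=1)])
```

That's 15 numbers per chunk instead of thousands of FFT bins, which is something a classifier can actually learn from on a modest training set.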
Another good approach comes from scikits.talkbox, which provides exactly the tooling you might need.
I want to make an iOS app that lets me graph the intonation (the rise and fall of the pitch of the voice) of an audio sample as read in by the user. Intonation is very important in various languages around the world, and this would be an attempt to practice intonation as well as pronunciation.
I am not very well versed in the world of speech/audio technology, so what do I need? Are there libraries that come with Cocoa Touch that give me access to the data I need from a voice sample? What exactly am I going to be looking to capture?
If anyone has an idea of the technology I am going to need to leverage, I would appreciate a point in the right direction.
Thanks!
What you're looking for is called formant analysis.
Formants are, in essence, the spectral peaks of the uttered sounds. They are listed in order of frequency: f1, f2, and so on. It seems to me that what you're looking to plot is f1.
Formant analysis is at the core of speech recognition; usually f1 and f2 are enough to differentiate vowels. I'd recommend you do a search on formant analysis algorithms and take it from there.
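If you want to try it, the classic recipe is LPC root-finding. A rough Python sketch of the algorithm, to be ported to your iOS code (the pre-emphasis constant, the order rule of thumb, and the 90 Hz floor are conventional values I'm assuming):

```python
# Sketch: formant estimation via LPC root-finding.
# The pre-emphasis constant, LPC order (2 + sr/1000), and the
# 90 Hz floor are conventional rules of thumb, assumed here.
import numpy as np
import librosa

def formants(frame, sr):
    frame = np.append(frame[0], frame[1:] - 0.63 * frame[:-1])  # pre-emphasis
    frame = frame * np.hamming(len(frame))
    a = librosa.lpc(frame, order=2 + sr // 1000)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)   # pole angle -> frequency
    return sorted(f for f in freqs if f > 90)    # [f1, f2, ...]
```

Run that frame by frame over the recording and plot the first value over time to get your contour.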
Good luck :)
I need to write a speech detection algorithm (not speech recognition).
At first I thought I would just have to measure the microphone power and compare it to some threshold value. But the problem gets much harder once you have to take the ambient sound level into consideration (for example, in a pub a simple power threshold is crossed immediately because of other people talking).
So in the second version I thought I would have to measure the current power spikes against the average sound level, or something like that. Coding this idea proved to be quite hairy for me, at which point I decided it might be time to research existing solutions.
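For reference, that attempt looked roughly like this (a Python sketch of the idea; the smoothing factor and the 3x margin are numbers I made up while experimenting):

```python
# My "power spike vs. running ambient average" idea, roughly.
# The smoothing factor and the 3x margin are made-up values.
import numpy as np

class SpeechGate:
    def __init__(self, alpha=0.999, margin=3.0):
        self.noise_floor = None
        self.alpha = alpha      # how slowly the ambient estimate adapts
        self.margin = margin    # how far above ambient counts as speech

    def is_speech(self, frame):
        power = np.mean(np.square(frame.astype(np.float64)))
        if self.noise_floor is None:
            self.noise_floor = power
        speech = power > self.margin * self.noise_floor
        if not speech:  # only track the ambient level while not speaking
            self.noise_floor = (self.alpha * self.noise_floor
                                + (1 - self.alpha) * power)
        return speech
```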
Do you know of a general algorithm description for speech detection? Existing code or a library in C/C++/Objective-C is also fine, be it commercial or free.
P.S. I guess there is a difference between “speech” and “sound” recognition, with the first one only responding to frequencies close to the human speech range. I’m fine with the second, simpler case.
The key phrase that you need to Google for is Voice Activity Detection (VAD). It's implemented widely in telecoms, particularly in acoustic echo cancellation (AEC).
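The WebRTC project ships a well-tested VAD in C (so it fits your C/C++/Objective-C requirement), and there's a small Python binding, py-webrtcvad, that's handy for experimenting. A minimal usage sketch, assuming a 16 kHz mono 16-bit WAV as input:

```python
# Minimal py-webrtcvad sketch; assumes 16 kHz mono 16-bit PCM input.
import wave
import webrtcvad

with wave.open("input.wav", "rb") as w:
    sample_rate = w.getframerate()
    audio = w.readframes(w.getnframes())

vad = webrtcvad.Vad(2)                            # aggressiveness 0..3
frame_ms = 30                                     # VAD takes 10/20/30 ms frames
frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample

for i in range(0, len(audio) - frame_bytes + 1, frame_bytes):
    if vad.is_speech(audio[i:i + frame_bytes], sample_rate):
        print(f"speech at {i / (sample_rate * 2):.2f}s")
```

It's designed to stay robust against background noise, which is exactly the part you found hairy.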
I'd like to programmatically do some signal processing on a live sound feed.
Specifically I'd like to be able to isolate certain bands of frequencies and play around with phase shifting.
I've not worked in this area before from a purely software perspective, and a quick Google search turned up very little useful information.
Does anyone know of any good information resources for this topic area?
Matlab is a good starting point. It has the necessary toolboxes and functions to capture audio signals, run different kinds of filters over them, and write them to WAV files. The UI is easy to navigate, and it's simple to generate plots and visualize results.
http://www.mathworks.com/products/signal/
If, however, you're looking to develop real-world applications, then Python can come in handy. It has toolkits like SciPy, NumPy, and Audiolab that offer much of what Matlab does.
http://www.scipy.org
http://scikits.appspot.com/audiolab
In a nutshell, Matlab is good for testing ideas and prototyping; Python is good for testing as well as real-world application development. And Python is free, whereas Matlab might cost you if you're not a student anymore.
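To make the Python route concrete, here is a small SciPy sketch of the two things you mentioned, isolating a band and shifting its phase (the 300-3400 Hz band edges and the 90-degree shift are arbitrary examples):

```python
# SciPy sketch: isolate a frequency band, then shift its phase.
# The band edges and the 90 degree shift are arbitrary examples.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def isolate_band(x, sr, lo=300.0, hi=3400.0, order=4):
    sos = butter(order, [lo, hi], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, x)        # zero-phase filtering (offline)

def phase_shift(x, degrees):
    analytic = hilbert(x)             # x + j * (Hilbert transform of x)
    return np.real(analytic * np.exp(1j * np.deg2rad(degrees)))
```

Note that sosfiltfilt and hilbert both need the whole signal; for a live feed you'd switch to scipy.signal.sosfilt and carry the filter state (the zi argument) across buffers.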
http://www.dspguide.com/
This is a super excellent reference on digital signal processing techniques in general. It's not a programming guide, per se, but it covers the techniques and the theory clearly and simply, and provides pseudocode and examples so that you can implement them in the language of your choice. You'll be hard-pressed to find a more complete reference, and you can download it for free online!
Does anybody here do computer vision work in Mathematica? I would like to know what external libraries are available for that. The built-in image processing functions are not enough. I am looking for things like SURF, stereo, camera calibration, multi-view geometry, etc.
How difficult would it be to wrap OpenCV for use in Mathematica?
Apart from the extensive set of image processing tools that are now (version 8) natively present in Mathematica, which include a number of CV algorithms like finding morphological objects, image segmentation, and feature detection, there's the new LibraryLink functionality, which makes working with DLLs very easy. You wouldn't have to change OpenCV much to be able to call it from Mathematica: just write some wrappers for the functions to be called and you're basically done.
I don't think such a thing exists, but I'm getting started.
It has the advantage that you can apply analytic methods: for example, rather than hacking endlessly in OpenCV or even Matlab, you can compute a quantity analytically and see that the method leading to this matrix is numerically unstable as a function of the input variables. Then there is no need to hack at all, as it would be pointless.
As for wrapping OpenCV, that doesn't seem to make sense. The correct procedure would be to fix bad implementations in OpenCV based on your analysis in Mathematica and on paper.
Agreeing with Peter, I don't believe that forcing Mathematica to use OpenCV is a great idea.
All of the computer vision people I've talked to, read about, or seen examples from are using Matlab and the Imaging Toolkit. It's either that, or an OpenCV-compatible language plus OpenCV.
Mathematica has a rich set of tools for image processing, but I'm uncertain about its computer vision capabilities.