OK,
let me try and rephrase this:
I'm looking for a method, that takes an audio file as an input, and outputs a list of transients (distinctive peak), based upon a given sensitivity.
The audio is a recording of a spoken phrase of for example 5 words. The method would return a list of numbers (e.g. amount of samples or milliseconds) where the words start. My ultimate goal is to play each word individually.
As suggested in a comment (I really struck some negative chord here) I am NOT asking anyone to write any code for me.
I've been around on this forum a while now, and the community has always been very helpful. The most helpful answers were those that pointed out my rigid way of thinking, offering surprising alternatives or work arounds, based upon their own experiences.
I guess this topic is just too much of a niche.
Before Edit:
For my iOS app, I need to programmatically cut up a spoken phrase into words for further processing. I know what words to expect, so, I can make some assumptions on where words would start.
However, in any case, a transient detection algorithm/method would be very helpful.
Google points me to either commercial products, or highly academic papers that are beyond my brain power.
Luckily, you are much smarter and knowledgeable than me, so you can help and simplify my problems.
Don't let me down!
There are a couple of simple basic ideas you can put to work here.
First, take the input audio and divide into small sized buckets (on the order of 10's on milliseconds). For each bucket compute the power of the samples in it by summing the squares of each sample value.
For example say you had 16 bit samples at 44.1 kHz, in array called s. One second's worth of data would be 44100 samples. A 10 msec bucket size would give you 441 samples per bucket. To compute the power you could do this:
float power = 0;
for (int i = 0; i < 441; i++) {
float normalized = (float)s[i] / 32768.0f;
power = power + (normalized * normalized);
}
Once you build an array of power values, you can look at relative changes in power from bucket to bucket to do basic signal detection.
Good luck!
Audio analysis is a very complex topic. You could easily detect individual words and slice them apart, but actually identifying them requires a lot of processing and advanced algorythms.
Sadly, there is not much we can tell you besides that there is no way around it. You said you found commercial products and I would suggest going for those. Papers are not always complete enough or right for the language/platform/usecase you want, and often lack details for proper implementation for someone without prior knowledge of the topic.
You may be lucky and find an open source implementation that suits your needs. Here's what a little bit of research returned:
How to use Speech Recognition inside the iOS SDK?
free speech recognition engines for iOS?
You'll quickly see speech recognition is not something you should start from scratch. Choose a library, try it for a little bit and see if it works!
Related
My friend Prasad Raghavendra and me, were trying to experiment with Machine Learning on audio.
We were doing it to learn and to explore interesting possibilities at any upcoming get-togethers.
I decided to see how deep learning or any machine learning can be fed with certain audios rated by humans (evaluation).
To our dismay, we found that the problem had to be split to accommodate for the dimensionality of input.
So, we decided to discard vocals and assess by accompaniments with an assumption that vocals and instruments are always correlated.
We tried to look for mp3/wav to MIDI converter. Unfortunately, they were only for single instruments on SourceForge and Github and other options are paid options. (Ableton Live, Fruity Loops etc.) We decided to take this as a sub-problem.
We thought of FFT, band-pass filters and moving window to accommodate for these.
But, we are not understanding as to how we can go about splitting instruments if chords are played and there are 5-6 instruments in file.
What are the algorithms that I can look for?
My friend knows to play Keyboard. So, I will be able to get MIDI data. But, are there any data-sets meant for this?
How many instruments can these algorithms detect?
How do we split the audio? We do not have multiple audios or the mixing matrix
We were also thinking about finding out the patterns of accompaniments and using those accompaniments in real-time while singing along. I guess we will be able to think about it once we get answers to 1,2,3 and 4. (We are thinking about both Chord progressions and Markovian dynamics)
Thanks for all help!
P.S.: We also tried FFT and we are able to see some harmonics. Is it due to Sinc() in fft when rectangular wave is input in time domain? Can that be used to determine timbre?
We were able to formulate the problem roughly. But, still, we are finding it difficult to formulate the problem. If we use frequency domain for certain frequency, then the instruments are indistinguishable. A trombone playing at 440 Hz or a Guitar playing at 440 Hz would have same frequency excepting timbre. We still do not know how we can determine timbre. We decided to go by time domain by considering notes. If a note exceeds a certain octave, we would use that as a separate dimension +1 for next octave, 0 for current octave and -1 for the previous octave.
If notes are represented by letters such as 'A', 'B', 'C' etc, then the problem reduces to mixing matrices.
O = MI during training.
M is the mixing matrix that will have to be found out using the known O output and I input of MIDI file.
During prediction though, M must be replaced by a probability matrix P which would be generated using previous M matrices.
The problem reduces to Ipredicted = P-1O. The error would then be reduced to LMSE of I. We can use DNN to adjust P using back-propagation.
But, in this approach, we assume that the notes 'A','B','C' etc are known. How do we detect them instantaneously or in small duration like 0.1 seconds? Because, template matching may not work due to harmonics. Any suggestions would be much appreciated.
Splitting out the different parts is a machine learning problem all to its own. Unfortunately, you can't look at this problem in audio land only. You must consider the music.
You need to train something to understand musical patterns and progressions in the context of the type of music you give it. It needs to understand what the different instruments sound like, both mixed and not mixed. It needs to understand how these instruments are often played together, if it's going to have any chance at all at separating what's going on.
This is a very, very difficult problem.
This is a very hard problem mainly because converting audio to pitch isnt very simple due to Nyquist folding harmonics that are 22Khz+ back down and also other harmonic introductions such as saturators/distortion and other analogue equipment that introduce harmonics.
The fundamental harmonic isnt always the loudest which is why your plan will not work.
The hardest thing to measure would be a distorted guitar. The harmonic some pedals/plugins can make is crazy.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
I have been searching for emotion detection through voice/speech solution on mobile (iOS) and web.
I found Moodies-iOS and Vokaturi solution, but they are not free.
I couldn't find any open source or paid version software available to integrate in my app and test the solution.
Could someone share if you have any info on this related.
Is there any OPEN SOURCE for iOS for Emotion analysis and detection through Voice/Speech, Please let me know.
As a former research in affective computing, I highly doubt you can find a ready-for-use iOS open source solution for emotion recognition from speech. The main reason is that it is a damn difficult task that requires a lot of research and a lot of proper data to train models. That is why companies like BeyondVerbal and Vokaturi do not share their models with others. Thus, you will be very lucky if you can find anything in open source, I am not even talking about iOS solutions.
I am aware about some toolkits you can use for this task (namely, the openEAR toolkit), but to build something working from it, you need an expert knowledge in the field and data to train models. A comprehensive list of databases can be found here: http://emotion-research.net/wiki/Databases. A lot of them a freely available.
As Dmytro Prylipko said it is very doubtful that there is any open-source lib for emotion recognition from speech.
You may write your own solution. It is not hard. Trouble is, as mentioned before, proper training and/or trasholding takes a lot of time and nerves.
I will give you a short theory how you should begin writing the algo, but training and so on is on you.
First big trouble is that different people differently relay their emotions vocally.
For example: one shocked person will to their shock respond with overexclaimed sentence while another will "freeze" and their response would sound very flat (almost robot-like).
Therefore you will need a lot of templates from which to learn how to classify your input speech by emotions.
You can remove some difficulties by using context recognition along with voice prosody.
That is what I'd advise you to do.
First make an algorithm that will use speech-recognized text to put it into emotion context. E.g. you can use specific words and phrases that people use when expressing different emotions.
That is easily done. You may use a neural network or simple branching or whatever.
So you will be able to recognize whether person is thankful and surprised at the same time by combining context recognition and emotions from prosody.
Now, to recognize the emotion from prosody you have to get prosody parameters and some others.
For example, some emotions may be recognized by looking at duration of particular words in a sentence.
So you have the sentence and the text of that sentence. You know that the speed of normal speech is approximately 200 words per minute. Knowing this and number of words in the sentence you can see how fast is someone talking. Then you measure the duration of each word and get its speed. By knowing how fast is the speech and how long is the word you can get normalized ratios that can be used for classifications in order to determine the closest guess of the emotion.
For instance, when someone is presented with a present that he/she likes very much, the "thank you" will sound pretty long. It will also be of higher pitch than that person's usual speech.
So the next step would be to get the average pitch for each word to see the relation between them. So you will be able to see how the sentence prosody modulates. From lower to higher, or vice versa.
Also, how prosody changes inside the phrases within the sentence.
You may go about this by comparing curves of known emotion directly, or you may use aproximation to get coefficients from the prosody curve vector. The square function does good for normal speech prosody (with no particular emotions in). So some higher order polynomial should do. So, you can get coefficients of the polynom and use them to get what emotion should whole sentence or phrase relay.
The same goes for individual words within the sentence. You get the pitch for each phoneme or syllable or just the pitch curve for e.g. every 20 ms of the word. Then you either calculate few coefficients to aproximate the polynom you decided is good enough for you, or you take the whole curve and normalize it to e.g. 30 points to use it with recognition.
To compare curves directly you may use gesture recognition algorithm by Oleg Dopertchouk:
http://www.gamedev.net/reference/articles/article2039.asp
I tried it on pitch curves of melodies, it works just fine.
The trouble is, you need a database of speech with context and emotion with clear manually done classification to give your algo something to compare with.
If you use polynomials instead of whole curves, you can do some recognition by using thresholds on coefficients, but results will be a bit shaky. Only real excuse for using coeffs at all is that you do not need to know how long is the word in question. I.e. the same polynom should work on a word with 2 phonemes and on one with 5. (should work)
You see, a theory is nice and easy. Use speech recognition, measure speech rate, and duration of each word, construct pitch curve for whole phrase and pitch curve for each word using FFT, do some comparison between ready database and the input. And walla, emotion recognized.
But where will you find the database with word curves marked with emotions.
For example, you would need for each emotion at least one pitch curve for words with different number of phonemes. At least one, because it is important whether the word starts with vowel or ends with one, or simply someone differently relays the same emotion even if the curve represents the same word.
OK, so you can say that you can make one. Where would you find recorded samples to make your curves or calculate coeffs? Hm, perhaps a recording of some drama. Not bad idea, but the acted emotions aren't the same as the natural ones.
It is a big job to teach a machine such a thing.
Oh, yeah, I almost forgot, emotions aren't only, or sometimes at all transfered using pitch changes, sometimes it's only the way in which the word is being pronounced.
So, for some cases, you would probably need LPC or some other coefficients showing some more info on how phonemes in the word sounds. Or you would need to take in view other harmonics from FFT, not just the one representing the pitch of excitation train.
The best that you can do without following my hints and developing your own algo, is to use NLTK (natural language toolkit) to develop a statistical speech (emotionally rich) model and use algorithms from there (perhaps a bit modified) to try to get to the emotion in question.
But I fear it would be a greater job than going from zero. As far as I know NLTK doesn't support emotions. Just normal speech prosody.
You may try to integrate some things I wrote about into Sphinx, to develop emotion based speech models and introduce emotion recognition directly into sphinxes VR algorithm.
If you really need this, I advise you to learn enough DSP to write your own algo, then pay someone to make you initial database from audiobooks, radio dramas and similar stuff (using a tool you provide).
After your algo starts to work reasonably well, implement autolearning by giving users an option to correct the algo's wrong guesses. After some time you will get 90% reliable algo to recognize emotions from speech.
I am working on a personal research project.
My objective is to be able to recognize a sound and identify if it belongs to the IPA or not by comparing it's waveform to a wave form in my data base. I have some skill with Mathematica, SciPy, and PyBrain.
For the first phase, I'm only using the English (US) phonetic alphabet.
I have a simple test bank of English phonetic alphabet sound files I found online. The trick here is:
I want to separate a sound file into wave forms that correspond to different syllables- this will take a learning algorithm. So, 'I like apples' would be cut up into the syllable waveforms that would make up the sentence.
Each waveform is then compared against the English PA's wave forms. I'm not certain how to do this part. I was thinking of using Praat to detect the waveforms, capture the image of the wave form and compare it to the one stored in the database with image analysis (which is kind of fun to do).
The damage here, is that I don't know how to make Praat generate a wave form file automatically then cut it up between syllables into waveform chunks. Logically, I would just prepare test cases for a learning algorithm and teach the comp to do it.
Instead of needing a wave form image- could I do this with fast Fourier transformation and compare two fft's- within x% margin of error consider it y syllable?
Frankly I don't really know about Praat, But I find your project super cool and interesting. I have experience with car motor's fault detection using it's sound, which might be connected to your project. I used Neural Networks and SVM to do the classification because multiple research papers proved it. Thus I didn't have any doubt about the way I chose. So my advice is maybe you should research and read some Papers about it. It really helps when you have questions like this (Will it work?, Can I use it instead or Am I using optimal solution? etc...). And good luck that's an awesome project :)
You could try Praat scripting.
Using just FFT will give you rather terrible results. Very long feature vector that will be really difficult to segment and run any training on it. That's thousands of points for a single syllable. Some deep neural networks are able to cope with it, but that's assuming you design them properly and provide huge training set. The advantage of using neural networks is that they can build features for you from the "raw data" (and I would consider fft also "raw"). However, when you work with sound, it's not that badly needed - you can manually engineer features. In case of sounds, science knows very well what sort of "features" sound have.
You can calculate these features with libraries like Yaafe. I recommend checking it even if you are not doing it in C++ or Python - the link I provided also delivers formulas for calculating them. I used some of them in my kiwi classifier.
Another good approach comes from scikit-talkbox, which provides exactly the tooling you might need.
I want to do pedestrian detection and tracking.
Input: Video Stream from CCTV camera.
Output:
#(no of) people going from left to right
# people going from right to left
# No. of people in the middle
What have i done so far:
For pedestrian detection I am using HOG and SVM. The detection is decent with high false positive rate. And its very slow as i am running in android platform.
Question:
After detection how to do I calculate the required values listed above. Can anyone tell me what is the tracking algorithm I have to use and any good algorithm for pedestrian detection.
Or should I use tracking algorithm? Is there a way to do without it?
Any references to codes/blogs/technical papers is appreciated.
Platform: C++ & OpenCV / android.
--Thanks
This is somehow close to a research problem.
You may want to have a look to this website which gathers a lot of references.
In particular, the work done by the group from Oxford present therein is pretty close to what you are doing, since their are using HOG for detection. (That work has been extremely illuminating for me).
EPFL and Julich have as well work done in the field.
You may also want to give a look to this review which describes several detection/tracking techniques, often involving variants of the HOG algorithm.
Along with #Acorbe response, I suggest the publications section of this (archived) website.
A recent work at the end of last year also released a code base here:
https://bitbucket.org/rodrigob/doppia
There have also been earlier pedestrian detector works that have released code as well:
https://sites.google.com/site/wujx2001/home/c4
http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians
The best accurate way is to use tracking algorithm instead of statistic appearance counting of incoming people and detection occurred left right and middle..
You can use extended statistical models.. That produce how many inputs producing one of the outputs and back validate from output detection the input.
My experience is that tracking leads to better results than approach above. But is also little bit complicated. We talk about multi target tracking when the critical is match detection with tracked model which should be update based on detection. If tracking is matched with wrong model. The problems are there.
Here on youtube I developed some multi target tracker by simple LBP people detector, but multi model and kalman filter for tracking. Both capabilities are available in opencv. You need to when something is detected create new kalman filter for each object and update in case you match same detection. Predict in case detection is not here in frame and also remove the Kalman i it is not necessary to track any more.
1 Detect
2 Match detections with kalmans, hungarian algorithm and l2 norm. (for example)
3 Lot of work. Decide if kalman shoudl be established, remove, update, or results is not detected and should be predicted. This is lot of work here.
Pure statistic approach is less accurate, second one is for experience people at least one moth of coding and 3 month of tuning.. If you need to be faster and your resources are quite limited. You can by smart statistic achieve your results by pure detection much faster and little bit less accurate. People are judge the image and video tracking even multi target tracking is capable to beat human. Try to count and register each person in video and count exits point. You are not able to do this in some number of people. It is really repents on, what you want, application, customer you have, and results you show to customers. If this is 4 numbers income, left, right, middle and your error is 20 percent is still much more than one bored small paid guard should achieved by all day long counting..
https://www.youtube.com/watch?v=d-RCKfVjFI4
You can find on my BLOG Some dataset for people detection and car detection on my blog same as script for learning ideas, tutorials and tracking examples..
Opencv blog tutorials code and ideas
You can use KLT for this purpose as this will tell you the flow of person traveling from left to right then you can compute that by computing line length which in given example is drawn using cv2.line you can use input parameters of this functions to compute your case, little math involved. if there is a flow of pixels from left to right this is case 1 or right to left then case 3 and for no flow case 2. Or you can use this basic tutorial to track object movement. LINK
I want to make an iOS app that allows me to graph the intonation (the rise and fall of the pitch of their voice) of an audio sample as read in by the user. Intonation is very important in various languages around the world and this would be an attempt to practice intonation as well as pronunciation.
I am not very versed in the world of speech/audio technology, so what do I need? Are there libraries that come installed with Cocoa-touch that gives me the ability to access the data I need from a voice sample? What exactly am I going to be looking to capture?
If anyone has an idea of the technology I am going to need to leverage, I would appreciate a point in the right direction.
Thanks!
What you're looking for is called formant analysis.
Formants are, in essence, the spectral peaks of the uttered sounds. They are listed in order of frequency, as in f1, f2, etc. Seems to me that what you're looking to plot is f1.
Formant analysis is at the core of speech recognition, usually f1 and f2 are enough to differentiate vowels apart. I'd recommend you do a search on formant analysis algorithms and take it from there.
Good luck :)