Emotion detection through voice/speech solution for Mobile and web [closed] - ios

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 6 years ago.
Improve this question
I have been searching for emotion detection through voice/speech solution on mobile (iOS) and web.
I found Moodies-iOS and Vokaturi solution, but they are not free.
I couldn't find any open source or paid version software available to integrate in my app and test the solution.
Could someone share if you have any info on this related.
Is there any OPEN SOURCE for iOS for Emotion analysis and detection through Voice/Speech, Please let me know.

As a former research in affective computing, I highly doubt you can find a ready-for-use iOS open source solution for emotion recognition from speech. The main reason is that it is a damn difficult task that requires a lot of research and a lot of proper data to train models. That is why companies like BeyondVerbal and Vokaturi do not share their models with others. Thus, you will be very lucky if you can find anything in open source, I am not even talking about iOS solutions.
I am aware about some toolkits you can use for this task (namely, the openEAR toolkit), but to build something working from it, you need an expert knowledge in the field and data to train models. A comprehensive list of databases can be found here: http://emotion-research.net/wiki/Databases. A lot of them a freely available.

As Dmytro Prylipko said it is very doubtful that there is any open-source lib for emotion recognition from speech.
You may write your own solution. It is not hard. Trouble is, as mentioned before, proper training and/or trasholding takes a lot of time and nerves.
I will give you a short theory how you should begin writing the algo, but training and so on is on you.
First big trouble is that different people differently relay their emotions vocally.
For example: one shocked person will to their shock respond with overexclaimed sentence while another will "freeze" and their response would sound very flat (almost robot-like).
Therefore you will need a lot of templates from which to learn how to classify your input speech by emotions.
You can remove some difficulties by using context recognition along with voice prosody.
That is what I'd advise you to do.
First make an algorithm that will use speech-recognized text to put it into emotion context. E.g. you can use specific words and phrases that people use when expressing different emotions.
That is easily done. You may use a neural network or simple branching or whatever.
So you will be able to recognize whether person is thankful and surprised at the same time by combining context recognition and emotions from prosody.
Now, to recognize the emotion from prosody you have to get prosody parameters and some others.
For example, some emotions may be recognized by looking at duration of particular words in a sentence.
So you have the sentence and the text of that sentence. You know that the speed of normal speech is approximately 200 words per minute. Knowing this and number of words in the sentence you can see how fast is someone talking. Then you measure the duration of each word and get its speed. By knowing how fast is the speech and how long is the word you can get normalized ratios that can be used for classifications in order to determine the closest guess of the emotion.
For instance, when someone is presented with a present that he/she likes very much, the "thank you" will sound pretty long. It will also be of higher pitch than that person's usual speech.
So the next step would be to get the average pitch for each word to see the relation between them. So you will be able to see how the sentence prosody modulates. From lower to higher, or vice versa.
Also, how prosody changes inside the phrases within the sentence.
You may go about this by comparing curves of known emotion directly, or you may use aproximation to get coefficients from the prosody curve vector. The square function does good for normal speech prosody (with no particular emotions in). So some higher order polynomial should do. So, you can get coefficients of the polynom and use them to get what emotion should whole sentence or phrase relay.
The same goes for individual words within the sentence. You get the pitch for each phoneme or syllable or just the pitch curve for e.g. every 20 ms of the word. Then you either calculate few coefficients to aproximate the polynom you decided is good enough for you, or you take the whole curve and normalize it to e.g. 30 points to use it with recognition.
To compare curves directly you may use gesture recognition algorithm by Oleg Dopertchouk:
http://www.gamedev.net/reference/articles/article2039.asp
I tried it on pitch curves of melodies, it works just fine.
The trouble is, you need a database of speech with context and emotion with clear manually done classification to give your algo something to compare with.
If you use polynomials instead of whole curves, you can do some recognition by using thresholds on coefficients, but results will be a bit shaky. Only real excuse for using coeffs at all is that you do not need to know how long is the word in question. I.e. the same polynom should work on a word with 2 phonemes and on one with 5. (should work)
You see, a theory is nice and easy. Use speech recognition, measure speech rate, and duration of each word, construct pitch curve for whole phrase and pitch curve for each word using FFT, do some comparison between ready database and the input. And walla, emotion recognized.
But where will you find the database with word curves marked with emotions.
For example, you would need for each emotion at least one pitch curve for words with different number of phonemes. At least one, because it is important whether the word starts with vowel or ends with one, or simply someone differently relays the same emotion even if the curve represents the same word.
OK, so you can say that you can make one. Where would you find recorded samples to make your curves or calculate coeffs? Hm, perhaps a recording of some drama. Not bad idea, but the acted emotions aren't the same as the natural ones.
It is a big job to teach a machine such a thing.
Oh, yeah, I almost forgot, emotions aren't only, or sometimes at all transfered using pitch changes, sometimes it's only the way in which the word is being pronounced.
So, for some cases, you would probably need LPC or some other coefficients showing some more info on how phonemes in the word sounds. Or you would need to take in view other harmonics from FFT, not just the one representing the pitch of excitation train.
The best that you can do without following my hints and developing your own algo, is to use NLTK (natural language toolkit) to develop a statistical speech (emotionally rich) model and use algorithms from there (perhaps a bit modified) to try to get to the emotion in question.
But I fear it would be a greater job than going from zero. As far as I know NLTK doesn't support emotions. Just normal speech prosody.
You may try to integrate some things I wrote about into Sphinx, to develop emotion based speech models and introduce emotion recognition directly into sphinxes VR algorithm.
If you really need this, I advise you to learn enough DSP to write your own algo, then pay someone to make you initial database from audiobooks, radio dramas and similar stuff (using a tool you provide).
After your algo starts to work reasonably well, implement autolearning by giving users an option to correct the algo's wrong guesses. After some time you will get 90% reliable algo to recognize emotions from speech.

Related

Accent detection API?

I've been doing some research on the feasibility of building a mobile/web app that allows users to say a phrase and detects the accent of the user (Boston, New York, Canadian, etc.). There will be about 5 to 10 predefined phrases that a user can say. I'm familiar with some of the Speech to Text API's that are available (Nuance, Bing, Google, etc.) but none seem to offer this additional functionality. The closest examples that I've found are Google Now or Microsoft's Speaker Recognition API:
http://www.androidauthority.com/google-now-accents-515684/
https://www.microsoft.com/cognitive-services/en-us/speaker-recognition-api
Because there are going to be 5-10 predefined phrases I'm thinking of using a machine learning software like Tensorflow or Wekinator. I'd have initial audio created in each accent to use as the initial data. Before I dig deeper into this path I just wanted to get some feedback on this approach or if there are better approaches out there. Let me know if I need to clarify anything.
There is no public API for such a rare task.
Accent detection as language detection is commonly implemented with i-vectors. Tutorial is here. Implementation is available in Kaldi.
You need significant amount of data to train the system even if your sentences are fixed. It might be easier to collect accented speech without focusing on the specific sentences you have.
End-to-end tensorflow implementation is also possible but would probably require too much data since you need to separate speaker-instrinic things from accent-instrinic things (basically perform the factorization like i-vector is doing). You can find descriptions of similar works like this and this one.
You could use(this is just an idea, you will need to experiment a lot) a neural network with as many outputs as possible accents you have with a softmax output layer and cross entropy cost function

Waveform Comparison

I am working on a personal research project.
My objective is to be able to recognize a sound and identify if it belongs to the IPA or not by comparing it's waveform to a wave form in my data base. I have some skill with Mathematica, SciPy, and PyBrain.
For the first phase, I'm only using the English (US) phonetic alphabet.
I have a simple test bank of English phonetic alphabet sound files I found online. The trick here is:
I want to separate a sound file into wave forms that correspond to different syllables- this will take a learning algorithm. So, 'I like apples' would be cut up into the syllable waveforms that would make up the sentence.
Each waveform is then compared against the English PA's wave forms. I'm not certain how to do this part. I was thinking of using Praat to detect the waveforms, capture the image of the wave form and compare it to the one stored in the database with image analysis (which is kind of fun to do).
The damage here, is that I don't know how to make Praat generate a wave form file automatically then cut it up between syllables into waveform chunks. Logically, I would just prepare test cases for a learning algorithm and teach the comp to do it.
Instead of needing a wave form image- could I do this with fast Fourier transformation and compare two fft's- within x% margin of error consider it y syllable?
Frankly I don't really know about Praat, But I find your project super cool and interesting. I have experience with car motor's fault detection using it's sound, which might be connected to your project. I used Neural Networks and SVM to do the classification because multiple research papers proved it. Thus I didn't have any doubt about the way I chose. So my advice is maybe you should research and read some Papers about it. It really helps when you have questions like this (Will it work?, Can I use it instead or Am I using optimal solution? etc...). And good luck that's an awesome project :)
You could try Praat scripting.
Using just FFT will give you rather terrible results. Very long feature vector that will be really difficult to segment and run any training on it. That's thousands of points for a single syllable. Some deep neural networks are able to cope with it, but that's assuming you design them properly and provide huge training set. The advantage of using neural networks is that they can build features for you from the "raw data" (and I would consider fft also "raw"). However, when you work with sound, it's not that badly needed - you can manually engineer features. In case of sounds, science knows very well what sort of "features" sound have.
You can calculate these features with libraries like Yaafe. I recommend checking it even if you are not doing it in C++ or Python - the link I provided also delivers formulas for calculating them. I used some of them in my kiwi classifier.
Another good approach comes from scikit-talkbox, which provides exactly the tooling you might need.

People Detection and Tracking

I want to do pedestrian detection and tracking.
Input: Video Stream from CCTV camera.
Output:
#(no of) people going from left to right
# people going from right to left
# No. of people in the middle
What have i done so far:
For pedestrian detection I am using HOG and SVM. The detection is decent with high false positive rate. And its very slow as i am running in android platform.
Question:
After detection how to do I calculate the required values listed above. Can anyone tell me what is the tracking algorithm I have to use and any good algorithm for pedestrian detection.
Or should I use tracking algorithm? Is there a way to do without it?
Any references to codes/blogs/technical papers is appreciated.
Platform: C++ & OpenCV / android.
--Thanks
This is somehow close to a research problem.
You may want to have a look to this website which gathers a lot of references.
In particular, the work done by the group from Oxford present therein is pretty close to what you are doing, since their are using HOG for detection. (That work has been extremely illuminating for me).
EPFL and Julich have as well work done in the field.
You may also want to give a look to this review which describes several detection/tracking techniques, often involving variants of the HOG algorithm.
Along with #Acorbe response, I suggest the publications section of this (archived) website.
A recent work at the end of last year also released a code base here:
https://bitbucket.org/rodrigob/doppia
There have also been earlier pedestrian detector works that have released code as well:
https://sites.google.com/site/wujx2001/home/c4
http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians
The best accurate way is to use tracking algorithm instead of statistic appearance counting of incoming people and detection occurred left right and middle..
You can use extended statistical models.. That produce how many inputs producing one of the outputs and back validate from output detection the input.
My experience is that tracking leads to better results than approach above. But is also little bit complicated. We talk about multi target tracking when the critical is match detection with tracked model which should be update based on detection. If tracking is matched with wrong model. The problems are there.
Here on youtube I developed some multi target tracker by simple LBP people detector, but multi model and kalman filter for tracking. Both capabilities are available in opencv. You need to when something is detected create new kalman filter for each object and update in case you match same detection. Predict in case detection is not here in frame and also remove the Kalman i it is not necessary to track any more.
1 Detect
2 Match detections with kalmans, hungarian algorithm and l2 norm. (for example)
3 Lot of work. Decide if kalman shoudl be established, remove, update, or results is not detected and should be predicted. This is lot of work here.
Pure statistic approach is less accurate, second one is for experience people at least one moth of coding and 3 month of tuning.. If you need to be faster and your resources are quite limited. You can by smart statistic achieve your results by pure detection much faster and little bit less accurate. People are judge the image and video tracking even multi target tracking is capable to beat human. Try to count and register each person in video and count exits point. You are not able to do this in some number of people. It is really repents on, what you want, application, customer you have, and results you show to customers. If this is 4 numbers income, left, right, middle and your error is 20 percent is still much more than one bored small paid guard should achieved by all day long counting..
https://www.youtube.com/watch?v=d-RCKfVjFI4
You can find on my BLOG Some dataset for people detection and car detection on my blog same as script for learning ideas, tutorials and tracking examples..
Opencv blog tutorials code and ideas
You can use KLT for this purpose as this will tell you the flow of person traveling from left to right then you can compute that by computing line length which in given example is drawn using cv2.line you can use input parameters of this functions to compute your case, little math involved. if there is a flow of pixels from left to right this is case 1 or right to left then case 3 and for no flow case 2. Or you can use this basic tutorial to track object movement. LINK

Training the algorithm for better image recognition

This is a research question not a direct programming question.
I am working on a symbol recognition algorithm, What the software currently does, it takes an image, divide it into contours (blobs) and start matching each contour with a list of predefined templates. Then for each contour it takes the one that has the highest match rate.
The algorithm is doing fairely however I need to train it better. What I mean is this:
I want to use a machine learning algorithm that will train the algorithm to have better matching. So lets take an example:
I run the recognition on a symbol, the algorithm will run and find that this symbol is a car, then I have to confirm that result (maybe by clicking on "Yes" or "No") the algorithm should learn from that. So if I click on NO the algorithm should learn that this is not a car and will have better result next time (maybe try to match something else). while if i click on YES he will know that he was correct and next time he will perform better when searching for a car.
This is the concept I am trying to research. I need documents or algorithm that can achieve this sort of things. I am not looking for implementations or programming, just concept or researches.
I have done many researches and read a lot about machine learning, neural networks, decision trees.... but i was not able to know how can I use any in my scenarion.
I hope I was clear and this type of question is allowed on stack overflow. if not I'm sorry
Thanks a lot for any help or tip
Image recognition is still a challenge in the community. What you described in your process of manually clicking yes/no is just creating labeled data. Since this is a very broad area, I will just point you to a few links that might be useful.
To get start, you might want to use some existing image databases instead of creating your own, which saves you a lot of effort. e.g., this car dataset in UCIC image db.
Since you already have the background of machine learning, you can take a look at some survey paper that exactly match your project interests, e.g., search object recognition survey paper or feature extraction car in google.
Then you can dive into some good papers and see whether they are suitable for your project. For example, you can check the two papers below that linked with the UCIC image db.
Shivani Agarwal, Aatif Awan, and Dan Roth,
Learning to detect objects in images via a sparse, part-based representation.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(11):1475-1490, 2004.
Shivani Agarwal and Dan Roth,
Learning a sparse representation for object detection.
In Proceedings of the Seventh European Conference on Computer Vision, Part IV, pages 113-130, Copenhagen, Denmark, 2002.
Also check for some implemented softwares instead of starting from scratch, in your case, opencv should be good one to start with.
For image recognition, feature extraction is one of the most important step. You might want to check some stat-of-the-art algorithms in the community. (SIFT, mean-shift, harr features etc).
Boosting algorithm might also be useful when you reach the classification step. I see a lot of scholars mention this in image recognition community.
As #nickbar suggest, discuss more at https://stats.stackexchange.com/

Pitch detection using neural networks [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 4 years ago.
Improve this question
I am trying to use ANN for pitch detection of musical notes. The network is a simple two-layer MLP, whose inputs are basically a DFT (averaged and logarithmically distributed), and 12 outputs correspond to the 12 notes of a particular octave.
The network is trained with several samples of those 12 notes played by some instrument (one note at a time), and a few samples of "silence".
The results are actually good. The network is able to detect those notes played by different instruments preety accurately, it's relatively amune to noise, and even doesn't loose it's sanety completely when being played a song.
The goal, however, is to be able to detect polyphonic sound. So that when two or more notes are played together, the two corresponding neurons will fire. The surprising thing is that the network actually already does that to some extent (being trained over monophonic samples only), however less consistently and less accurately than for monophonic notes. My question is how can I enhance it's ability to recognise polyphnic sound?
The problem is I don't truely understand why it actually works already. The different notes (or their DFTs) are basically different points in space for which the network is trained. So I see why it does recognise similiar sounds (nearby points), but not how it "concludes" the output for a combination of notes (which form a distant point from each of the training examples). The same way an AND network which is trained over (0,0) (0,1) (1,0) = (0), is not expected to "conclude" that (1,1) = (1).
The brute force aprroach to this is to train the network with as many polyphonic samples as possible. However, since the network seem to somehow vaguely grasp the idea from the monophonic samples, there's probably something more fundemential here.
Any pointers? (sorry for the length, btw :).
The reason it works already is probably quite simply that you didn't train it to pick one and only one output (at least I assume you didn't). In the simple case when the output is just a dot product of the input and the weights, the weights would become matched filters for the corresponding pitch. Since everything is linear, multiple outputs would simultaneously get activated if multiple matched filters simultaneously saw good matches (as is the case for polyphonic notes). Since your network probably includes nonlinearities, the picture is a bit more complex, but the idea is probably the same.
Regarding ways to improve it, training with polyphonic samples is certainly one possibility. Another possibility is to switch to a linear filter. The DFT of a polyphonic sound is basically the sum of DFTs of each individual sound. You want a linear combination of inputs to become a corresponding linear combination of outputs, so a linear filter is appropriate.
Incidentally, why do you use a neural network for this in the first place? It seems that just looking at the DFT and, say, taking the maximum frequency would give you better results more easily.
Anssi Klapuri is a well-respected audio researcher who has published a method to perform pitch detection upon polyphonic recordings using Neural Networks.
You might want to compare Klapuri's method to yours. It is fully described in his master's thesis, Signal Processing Methods for the Automatic Transcription of Music. You can find his many papers online, or buy his book which explains his algorithm and test results. His master's thesis is linked below.
https://www.cs.tut.fi/sgn/arg/klap/phd/klap_phd.pdf
Pitch Detection upon polyphonic recordings is a very difficult topic and contains many controversies -- be prepared to do a lot of reading. The link below contains another approach to pitch detection upon polyphonic recordings which I developed for a free app called PitchScope Player. My C++ source code is available on GitHub.com, and is referenced within the link below. A free executable version of PitchScope Player is also available on the web and runs on Windows.
Real time pitch detection
I experimented with evolving a CTRNN (Continuous Time Recurrent Neural Network) on detecting the difference between 2 sine waves. I had moderate success, but never had time to follow up with a bank of these neurons (ie in bands similar to the cochlear).
One possible approach would be to employ Genetic Programming (GP), to generate short snippets of code that detects the pitch. This way you would be able to generate a rule for how the pitch detection works, which would hopefully be human readable.

Resources