I would like to add a gender detection capability to a news video translator app I'm working on, so that the app can switch between male and female voice according to the voice onscreen. I'm not expecting 100% accuracy.
I used EZAudio to obtain waveform data for a period of audio and used the average RMS value to set a threshold (cutOff) value between male and female. Initially cutOff = 3.3.
- (void)setInitialVoiceGenderDetectionParameters:(NSArray *)arrayAudioDetails
{
    float initialMaleAvg = ((ConvertedTextDetails *)[arrayAudioDetails firstObject]).audioAverageRMS;
    // The average RMS value of a time period of audio, say 5 sec
    float initialMaleVector = initialMaleAvg * 80;
    // MaleVector is the parameter to change the threshold according to different news clippings
    cutOff = (initialMaleVector < 5.3) ? initialMaleVector : 5.3;
    cutOff = (initialMaleVector > 23) ? initialMaleVector / 2 : 5.3;
}
Initially adjustValue = -0.9 and tanCutOff = 0.45. The values 5.3, 23, cutOff, adjustValue and tanCutOff were obtained through extensive testing. The tan of the values is used to magnify the difference between them.
- (BOOL)checkGenderWithPeekRMS:(float)pRMS andAverageRMS:(float)aRMS
{
    // pRMS is the peak RMS value in the audio snippet and aRMS is the average RMS value
    BOOL male = NO;
    if (tan(pRMS) < tanCutOff)
    {
        if (pRMS / aRMS > cutOff)
        {
            cutOff = cutOff + adjustValue;
            NSLog(@"FEMALE....");
            male = NO;
        }
        else
        {
            NSLog(@"MALE....");
            male = YES;
            cutOff = cutOff - adjustValue;
        }
    }
    else
    {
        NSLog(@"FEMALE.");
        male = NO;
    }
    return male;
}
The adjustValue is used to recalibrate the threshold each time a news video is translated, since each video has a different noise level. But I know this method is naive. What can I do to create a stable threshold? Or how can I normalise each audio snippet?
Alternative or more efficient ways to determine gender from audio waveform data are also welcome.
Edit: Following Nikolay's suggestion, I researched gender recognition using CMU Sphinx. Can anybody suggest how I can extract MFCC features and feed them into a GMM/SVM classifier using OpenEars (CMU Sphinx for the iOS platform)?
Accurate gender identification can be implemented with a GMM classifier over MFCC features. You can read about it here:
AGE AND GENDER RECOGNITION FOR TELEPHONE APPLICATIONS BASED ON GMM SUPERVECTORS AND SUPPORT VECTOR MACHINES
To date I am not aware of an open-source implementation of this, though many of the components are available in open-source speech recognition toolkits like CMUSphinx.
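As a starting point, here is a minimal sketch of the MFCC-extraction step in Python, assuming the python_speech_features package and a mono WAV file; the file name and frame parameters are placeholders, not part of the original answer.

import scipy.io.wavfile as wav
from python_speech_features import mfcc

# Hypothetical input clip; replace with your own audio
rate, signal = wav.read("news_clip.wav")
# 13 cepstral coefficients per 25 ms frame with a 10 ms hop (common defaults)
features = mfcc(signal, samplerate=rate, winlen=0.025, winstep=0.01, numcep=13)
print(features.shape)   # (num_frames, 13)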
Accurate gender identification can be implemented by training a GMM classifier on MFCC features of male and female speech. Here is how one can go about it (a code sketch follows after the link below):
1. Collect a training set for each gender.
2. Extract MFCC features from all the audio of the respective gender (Python implementations such as scikits.talkbox are available).
3. Train a GMM model for each gender on the features extracted from its training-set audio.
For details, here is an open-source Python implementation of the same. The following tutorial evaluates the code on a subset extracted from Google's AudioSet, which was released in 2017:
https://appliedmachinelearning.wordpress.com/2017/06/14/voice-gender-detection-using-gmms-a-python-primer/
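A minimal sketch of steps 2-3 with scikit-learn, assuming the MFCC matrices for the two training sets have already been stacked into arrays; the array names, component count and the random stand-in data are illustrative only, not taken from the linked tutorial.

import numpy as np
from sklearn.mixture import GaussianMixture

# Stand-ins for stacked (num_frames x num_ceps) MFCC matrices built from the training audio
maleFeatures = np.random.randn(5000, 13)
femaleFeatures = np.random.randn(5000, 13)

male_gmm = GaussianMixture(n_components=16, covariance_type='diag', max_iter=200).fit(maleFeatures)
female_gmm = GaussianMixture(n_components=16, covariance_type='diag', max_iter=200).fit(femaleFeatures)

def predict_gender(test_features):
    # score() is the average per-frame log-likelihood; pick the better-fitting model
    return 'male' if male_gmm.score(test_features) > female_gmm.score(test_features) else 'female'

print(predict_gender(np.random.randn(300, 13)))   # dummy 3-second snippet at a 10 ms hop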
Related
I'm working on an advanced vision system which consists of two static cameras (used for obtaining accurate 3D object locations) and a targeting device. The object detection and stereovision modules are already done. Unfortunately, due to the delay of the targeting system, it is necessary to develop a proper prediction module.
I did some tests using a Kalman filter, but it is not working accurately enough.
kalman = cv2.KalmanFilter(6,3,0)
...
kalman.statePre[0,0] = x
kalman.statePre[1,0] = y
kalman.statePre[2,0] = z
kalman.statePre[3,0] = 0
kalman.statePre[4,0] = 0
kalman.statePre[5,0] = 0
kalman.measurementMatrix = np.array([[1,0,0,0,0,0],[0,1,0,0,0,0],[0,0,1,0,0,0]],np.float32)
kalman.transitionMatrix = np.array([[1,0,0,1,0,0],[0,1,0,0,1,0],[0,0,1,0,0,1],[0,0,0,1,0,0],[0,0,0,0,1,0],[0,0,0,0,0,1]],np.float32)
kalman.processNoiseCov = np.array([[1,0,0,0,0,0],[0,1,0,0,0,0],[0,0,1,0,0,0],[0,0,0,1,0,0],[0,0,0,0,1,0],[0,0,0,0,0,1]],np.float32) * 0.03
kalman.measurementNoiseCov = np.array([[1,0,0],[0,1,0],[0,0,1]],np.float32) * 0.003
I noticed that the time period between two frames is different each time (due to varying detection times).
How could I use the last timestamp diff as an input? (Transition matrices? controlParam?)
I want to determine the prediction time, e.g. predict the position of the object 0.5 s or 1.5 s ahead.
I could provide example input 3d points.
Thanks in advance
1. How could I use the last timestamp diff as an input? (Transition matrices? controlParam?)
The step size is controlled through the transition (prediction) matrix. You also need to adjust the process noise covariance matrix to control how the uncertainty grows.
You are using a constant speed prediction model, so that p_x(t+dt) = p_x(t) + v_x(t)·dt will predict position in X with a time step dt (and the same for coords. Y and Z). In that case, your prediction matrix should be:
kalman.transitionMatrix = np.array([[1,0,0,dt,0,0],[0,1,0,0,dt,0],[0,0,1,0,0,dt],[0,0,0,1,0,0],[0,0,0,0,1,0],[0,0,0,0,0,1]],np.float32)
I left the process noise cov. formulation as an exercise. Be careful with squaring or not squaring the dt term.
2. I want to determine the prediction time, e.g. predict the position of the object 0.5 s or 1.5 s ahead
You can follow two different approaches:
Use a small fixed dt (e.g. 0.02 sec for 50Hz) and calculate predictions in a loop until you reach your goal (e.g. get a new observation from your cameras).
Adjust the prediction and process noise matrices online to the desired dt (0.5 / 1.5 sec in your question) and execute a single prediction step (a sketch of this follows below).
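A minimal sketch of this second approach with OpenCV's KalmanFilter, assuming a constant-velocity model and a discrete white-acceleration process noise; the sigma_a value and the helper name are my own assumptions, not taken from the question.

import cv2
import numpy as np

kalman = cv2.KalmanFilter(6, 3, 0)
kalman.measurementMatrix = np.hstack([np.eye(3), np.zeros((3, 3))]).astype(np.float32)

def predict_at(dt, sigma_a=1.0):
    # Constant-velocity transition for the desired horizon dt
    F = np.eye(6, dtype=np.float32)
    F[0, 3] = F[1, 4] = F[2, 5] = dt
    kalman.transitionMatrix = F
    # Discrete white-acceleration noise: position terms grow with dt^4, velocity with dt^2
    Q = np.zeros((6, 6), dtype=np.float32)
    for i in range(3):
        Q[i, i] = 0.25 * dt**4
        Q[i + 3, i + 3] = dt**2
        Q[i, i + 3] = Q[i + 3, i] = 0.5 * dt**3
    kalman.processNoiseCov = Q * sigma_a**2
    return kalman.predict()[:3]          # predicted (x, y, z)

# e.g. position 0.5 s ahead; call kalman.correct(...) when the next stereo measurement arrives
print(predict_at(0.5))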
If you are asking about how to anticipate the detection time of your cameras, that should be a different question and I am afraid I can't help you :-)
Recently, in some papers, face recognition approaches have been evaluated using a newly proposed protocol, namely closed-set and open-set face identification over the LFW dataset. For the open-set case, the Rank-1 accuracy is reported as the Detection and Identification Rate (DIR) at a fixed False Alarm/Acceptance Rate (FAR). I have a gallery and a probe set and am using KNN for classification; however, I don't know how to compute the DIR@FAR=1%.
Update:
Specifically, what is ambiguous to me is how the FAR is fixed via a threshold, and how curves such as ROC, precision-recall, etc. are plotted for face recognition. What does the threshold in the following paragraph mean?
Hence the performance is evaluated based on (i) Rank-1 detection and identification rate (DIR), which is the fraction of genuine probes matched correctly at Rank-1, and not rejected at a given threshold, and (ii) the false alarm rate (FAR) of the rejection step (i.e. the fraction of impostor probe images which are not rejected). We report the DIR vs. FAR curve describing the trade-off between true Rank-1 identifications and false alarms.
The reference paper is downloadable here.
Any help would be welcome.
I guess the DIR metric was established by the biometrics community. This metric involves both detection (exceeding some threshold) and identification (rank). Let the gallery consist of a set of users enrolled in a biometric database, and let the probe set contain users who may or may not be present in the database. Let g and p be elements of the gallery and probe sets, respectively. Moreover, let the probe set consist of two disjoint subsets: P1, containing samples of subjects who belong to the gallery, and P0, containing those who do not.
Assume s(p,g) is the similarity score between a probe element and a gallery element, t is a threshold and k is the identification rank. Then DIR is given by:
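In the standard open-set identification formulation (stated here as my own reconstruction, so check it against the reference below), with g*(p) denoting the enrolled gallery mate of probe p:

DIR(t, k) = |{ p in P1 : rank(p) <= k and s(p, g*(p)) >= t }| / |P1|
FAR(t) = |{ p in P0 : max over g of s(p, g) >= t }| / |P0|

Sweeping the threshold t and plotting DIR(t, 1) against FAR(t) gives the DIR vs. FAR curve; DIR@FAR=1% is the DIR value at the threshold for which FAR(t) = 0.01.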
You can find the complete formula in this reference:
Poh, N., et al. "Description of Metrics For the Evaluation of Biometric Performance." Seventh Framework Programme of Biometrics Evaluation and Testing (2012): 1-22.
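A minimal sketch of how DIR@FAR=1% could be computed from KNN similarity scores, assuming you already have the top-1 score and a correctness flag for every genuine probe (P1) and the top-1 score for every impostor probe (P0); the function and array names and the dummy data are placeholders.

import numpy as np

def dir_at_far(genuine_top1_scores, genuine_top1_correct, impostor_top1_scores, far_target=0.01):
    # Pick the threshold t so that the fraction of accepted impostor probes (FAR) equals far_target
    t = np.quantile(impostor_top1_scores, 1.0 - far_target)
    accepted = genuine_top1_scores >= t
    # DIR: genuine probes that are both accepted at t and correctly identified at Rank-1
    return np.mean(accepted & genuine_top1_correct)

# Dummy example
g_scores = np.random.rand(1000)
g_correct = np.random.rand(1000) > 0.2
i_scores = np.random.rand(500) * 0.8
print(dir_at_far(g_scores, g_correct, i_scores))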
I want to classify documents (composed of words) into 3 classes (Positive, Negative, Unknown/Neutral). A subset of the document words becomes the features.
Until now, I have programmed a Naive Bayes classifier, using information gain and chi-square statistics as feature selectors. Now, I would like to see what happens if I use the odds ratio as a feature selector.
My problem is that I don't know how to implement the odds ratio. Should I:
1) Calculate Odds Ratio for every word w, every class:
E.g. for w:
Prob of word as positive Pw,p = #positive docs with w/#docs
Prob of word as negative Pw,n = #negative docs with w/#docs
Prob of word as unknown Pw,u = #unknown docs with w/#docs
OR(Wi,P) = log( Pw,p*(1-Pw,p) / (Pw,n + Pw,u)*(1-(Pw,n + Pw,u)) )
OR(Wi,N) ...
OR(Wi,U) ...
2) How should I decide whether or not to choose the word as a feature?
Thanks in advance...
Since it took me a while to independently wrap my head around all this, let me explain my findings here for the benefit of humanity.
Using the (log) odds ratio is a standard technique for filtering features prior to text classification. It is a 'one-sided metric' [Zheng et al., 2004] in the sense that it only discovers features which are positively correlated with a particular class. As a log-odds-ratio for the probability of seeing a feature 't' given the class 'c', it is defined as:
LOR(t,c) = log { [Pr(t|c) / (1 - Pr(t|c))] / [Pr(t|!c) / (1 - Pr(t|!c))] }
         = log { [Pr(t|c) (1 - Pr(t|!c))] / [Pr(t|!c) (1 - Pr(t|c))] }
Here I use '!c' to mean a document where the class is not c.
But how do you actually calculate Pr(t|c) and Pr(t|!c)?
One subtlety to note is that feature selection probabilities, in general, are usually defined over a document event model [McCallum & Nigam 1998, Manning et al. 2008], i.e., Pr(t|c) is the probability of seeing term t one or more times in the document given the class of the document is c (in other words, the presence of t given the class c). The maximum likelihood estimate (MLE) of this probability would be the proportion of documents of class c that contain t at least once. [Technically, this is known as a Multivariate Bernoulli event model, and is distinct from a Multinomial event model over words, which would calculate Pr(t|c) using integer word counts - see the McCallum paper or the Manning IR textbook for more details, specifically on how this applies to a Naive Bayes text classifier.]
One key to using LOR effectively is to smooth these conditional probability estimates, since, as @yura noted, rare events are problematic here (e.g., the MLE of Pr(t|!c) could be zero, leading to an infinite LOR). But how do we smooth?
In the literature, Forman reports smoothing the LOR by "adding one to any zero count in the denominator" (Forman, 2003), while Zheng et al (2004) use "ELE [Expected Likelihood Estimation] smoothing" which usually amounts to adding 0.5 to each count.
To smooth in a way that is consistent with probability theory, I follow standard practices in text classification with a Multivariate Bernoulli event model. Essentially, we assume that we have seen each presence count AND each absence count B extra times. So our estimate for Pr(t|c) can be written in terms of #(t,c), the number of class-c documents in which we've seen t, and #(!t,c), the number of class-c documents in which we have not, as follows:
Pr(t|c) = [#(t,c) + B] / [#(t,c) + #(!t,c) + 2B]
        = [#(t,c) + B] / [#(c) + 2B]
If B = 0, we have the MLE. If B = 0.5, we have ELE. If B = 1, we have the Laplacian prior. Note this looks different than smoothing for the Multinomial event model, where the Laplacian prior leads you to add |V| in the denominator [McCallum & Nigam, 1998]
You can choose 0.5 or 1 as your smoothing value, depending on which prior work most inspires you, and plug this into the equation for LOR(t,c) above, and score all the features.
Typically, you then decide on how many features you want to use, say N, and then choose the N highest-ranked features based on the score.
In a multi-class setting, people have often used one-vs-all classifiers, doing feature selection independently for each classifier and hence for each positive class with the one-sided metrics (Forman, 2003); a sketch of this is given below. However, if you want to find a unique reduced set of features that works in a multiclass setting, there are more advanced approaches in the literature (e.g. Chapelle & Keerthi, 2008).
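A minimal sketch of the smoothed LOR scoring described above, computed one-vs-rest per class; it assumes a binary document-term matrix X and class labels y, and the choice B = 0.5 (ELE) is illustrative.

import numpy as np

def lor_scores(X, y, B=0.5):
    # X: (n_docs, n_terms) presence/absence matrix; y: (n_docs,) class labels
    X = (np.asarray(X) > 0).astype(float)
    y = np.asarray(y)
    classes = np.unique(y)
    scores = np.zeros((len(classes), X.shape[1]))
    for i, c in enumerate(classes):
        in_c = (y == c)
        t_c = X[in_c].sum(axis=0)                        # #(t,c)
        t_notc = X[~in_c].sum(axis=0)                    # #(t,!c)
        p_t_c = (t_c + B) / (in_c.sum() + 2 * B)         # smoothed Pr(t|c)
        p_t_notc = (t_notc + B) / ((~in_c).sum() + 2 * B)
        scores[i] = np.log(p_t_c * (1 - p_t_notc)) - np.log(p_t_notc * (1 - p_t_c))
    return classes, scores                               # one row of LOR scores per class

# Keep the N best features for class i: np.argsort(scores[i])[::-1][:N]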
References:
Zheng, Wu, Srihari, 2004
McCallum & Nigam 1998
Manning, Raghavan & Schütze, 2008
Forman, 2003
Chapelle & Keerthi, 2008
The odds ratio is not a good measure for feature selection, because it only shows what happens when a feature is present and says nothing about when it is absent. So it does not work for rare features, and since almost all features are rare, it does not work for almost all features. For example, a feature that predicts the positive class with 100% confidence but is present in only 0.0001 of documents is useless for classification. Therefore, if you still want to use the odds ratio, add a threshold on feature frequency, e.g. keep only features present in at least 5% of documents. But I would recommend a better approach: use chi-square or information gain metrics, which handle these problems automatically (a sketch with scikit-learn follows below).
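For the recommended route, a minimal sketch using scikit-learn's built-in chi-square selector; the tiny corpus, labels and k value are placeholders, and mutual_info_classif can be swapped in for an information-gain-style score.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

docs = ["good product works great", "terrible broken waste", "arrived on time"]   # placeholder corpus
labels = [1, 0, 2]                                                                # Positive / Negative / Unknown

X = CountVectorizer(binary=True).fit_transform(docs)      # presence/absence features
selector = SelectKBest(chi2, k=2).fit(X, labels)           # or SelectKBest(mutual_info_classif, k=2)
X_reduced = selector.transform(X)                          # keeps only the k best-scoring words
print(X_reduced.shape)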
I'm currently working on a program that analyses a wav file of a solo musician playing an instrument and detects the notes within it. To do this it performs an FFT and then looks at the data produced. The goal is to (at some point) produce the sheet music by writing a midi file.
I just wanted to get a few opinions on what might be difficult about it, whether anyone has tried it before, and maybe a few things it would be good to research. At the moment my biggest struggle is that not all notes are purely one frequency and I cannot yet detect chords, just single notes. Also, there has to be a pause between the notes I am detecting so I know for sure one has ended and the other has started. Any comments on this would also be very welcome!
This is the code I use when a new frame comes in from the signal; it looks for the frequency that is most dominant in the sample:
// Get frequency vector for power match
double[] frequencyVectorDoubleArray = Accord.Audio.Tools.GetFrequencyVector(waveSignal.Length, waveSignal.SampleRate);
powerSpectrumDoubleArray[0] = 0; // zero DC
double[,] frequencyPowerDoubleArray = new double[powerSpectrumDoubleArray.Length, 2];
for (int i = 0; i < powerSpectrumDoubleArray.Length; i++)
{
    if (frequencyVectorDoubleArray[i] > 15.00)
    {
        frequencyPowerDoubleArray[i, 0] = frequencyVectorDoubleArray[i];
        frequencyPowerDoubleArray[i, 1] = powerSpectrumDoubleArray[i];
    }
}

// Method for finding the highest frequency in a sample of frequency domain data
// But I want to filter out stuff
pulsePowerDouble = lowestPowerAcceptedDouble; //0; //lowestPowerAccepted;
int frequencyIndexAtPulseInt = 0;
int oldFrequencyIndexAtPulse = 0;
for (int j = 0; j < frequencyPowerDoubleArray.Length / 2; j++)
{
    if (frequencyPowerDoubleArray[j, 1] > pulsePowerDouble)
    {
        oldPulsePowerDouble = pulsePowerDouble;
        pulsePowerDouble = frequencyPowerDoubleArray[j, 1];
        oldFrequencyIndexAtPulse = frequencyIndexAtPulseInt;
        frequencyIndexAtPulseInt = j;
    }
}
foundFreq = frequencyPowerDoubleArray[frequencyIndexAtPulseInt, 0];
1) There is a lot (several decades worth) of research literature on frequency estimation and pitch estimation (which are two different subjects).
2) The peak FFT frequency is not the same as the musical pitch. Some solo musical instruments can produce well over a dozen frequency peaks for just one note, let alone a chord, and with none of the largest peaks anywhere near the musical pitch. For some common instruments, the peaks might not even be mathematically exact harmonics (a pitch-estimation sketch follows after this list).
3) Using the peak bin of a short unwindowed FFT isn't a great frequency estimator.
4) Note onset detection might require some sophisticated pattern matching, depending on the instrument.
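To illustrate point 2, here is a minimal sketch of a harmonic product spectrum (HPS) pitch estimator in Python/NumPy; it is one simple way to bias the estimate toward the fundamental rather than the loudest partial, and the window choice, frame length and harmonic count are assumptions of mine, not part of the answer above.

import numpy as np

def hps_pitch(frame, sample_rate, num_harmonics=3):
    # Windowed magnitude spectrum of one analysis frame
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    hps = spectrum.copy()
    # Multiply by downsampled copies so energy at the harmonics reinforces the fundamental
    for h in range(2, num_harmonics + 1):
        hps[:len(spectrum[::h])] *= spectrum[::h]
    bin_index = np.argmax(hps[1:]) + 1             # skip the DC bin
    return bin_index * sample_rate / len(frame)    # convert bin index to Hz

# Example: a tone at 440 Hz with strong partials at 880 and 1320 Hz;
# the estimate should land near 440, not at the louder 880 Hz partial
sr = 44100
t = np.arange(4096) / sr
tone = 0.4*np.sin(2*np.pi*440*t) + 0.8*np.sin(2*np.pi*880*t) + 0.6*np.sin(2*np.pi*1320*t)
print(hps_pitch(tone, sr))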
You don't want to focus on the highest frequency, but rather the lowest. Every note from any musical instrument is full of harmonics. Expect to hear the fundamental, and every octave above it. Plus all the second and third harmonics.
Harmonics is what makes a trumpet sound different from a trombone when they are both playing the same note.
Unfortunately this is an extremely hard problem, some of the reasons have already been given. I would start with a literature search (Google Scholar, for instance) for "musical note identification".
If this isn't a spare-time project, beware - I have seen master's theses founder on this particular shoal without getting any useful results.