I am quite new to machine learning and I am building a simple app to recognize spoken digits.
I used MFCCs to extract the spectral characteristics of my audio files. MFCC extraction gives me a 13 x length_of_audio matrix. I would like to use this information as my feature vector, but obviously each example would then have a different number of features.
My question is: what are the approaches to handling a varying number of features? For example, could I use PCA to always extract a fixed number of features and then use them in a particular learning algorithm?
I would like to use logistic regression as the learning algorithm.
This is what I obtained when analyzing one of the spoken digits.
In your case, if a user has length_of_audio = N, you don't have one feature vector of length 13*N; you have N feature vectors of length 13 (together they form a sequence, but they are separate feature vectors).
You should stack them into a matrix with 13 feature columns:
MFCCUser1,Slot1
MFCCUser1,Slot2
....
MFCCUser1,SlotN
MFCCUser2,Slot1
MFCCUser2,Slot2
...
And then you can apply Principal Component Analysis.
You only have 13 features, do you really need to reduce them?
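For illustration, here is a minimal sketch of that stacking (and the optional PCA step), assuming librosa for MFCC extraction and scikit-learn for PCA; the file names are hypothetical:

import numpy as np
import librosa
from sklearn.decomposition import PCA

def mfcc_frames(path):
    # Load the audio and compute 13 MFCCs per frame.
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)
    return mfcc.T                                       # shape: (n_frames, 13)

# Stack all recordings row-wise: every row is one 13-dimensional frame.
paths = ["digit_user1.wav", "digit_user2.wav"]          # hypothetical files
X = np.vstack([mfcc_frames(p) for p in paths])          # shape: (total_frames, 13)

# Optional: PCA over the 13 columns (as noted above, it may not be necessary).
X_reduced = PCA(n_components=8).fit_transform(X)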
I have a classic NLP problem: I have to classify news articles as fake or real.
I have created two sets of features:
A) Bigram Term Frequency-Inverse Document Frequency
B) Approximately 20 features associated with each document, obtained using pattern.en (https://www.clips.uantwerpen.be/pages/pattern-en), such as subjectivity of the text, polarity, #stopwords, #verbs, #subjects, grammatical relations, etc.
What is the best way to combine the TF-IDF features with the other features for a single prediction?
Thanks a lot to everyone.
I'm not sure whether you're asking technically how to combine the two objects in code or what to do theoretically afterwards, so I will try to answer both.
Technically, your TF-IDF matrix is just a matrix where the rows are records and the columns are features. To combine them, you can append your new features as extra columns at the end of that matrix. If you built it with sklearn, it is probably a SciPy sparse matrix, so you will have to make sure your new features are in a sparse matrix as well (or make the TF-IDF matrix dense).
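For the technical part, a minimal sketch with SciPy and sklearn (the documents and extra features here are made-up placeholders):

import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["some fake news text", "some real news text"]   # placeholder documents
extra = np.array([[0.4, 12], [0.1, 9]])                 # e.g. subjectivity, #verbs

X_tfidf = TfidfVectorizer(ngram_range=(2, 2)).fit_transform(docs)  # sparse bigram TF-IDF

# Append the dense features as extra columns, keeping everything sparse.
X = hstack([X_tfidf, csr_matrix(extra)]).tocsr()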
That gives you your training data; what to do with it is a little more tricky. The features from a bigram frequency matrix will be sparse (I'm not talking about data structures here, I just mean that you will have a lot of 0s) and largely binary, whilst your other data is dense and continuous. Most machine learning algorithms will run on this as is, although the prediction will probably be dominated by the dense variables. However, with a bit of feature engineering I have built several classifiers in the past using tree ensembles that take a combination of term-frequency variables enriched with some other, denser variables and give boosted results (for example a classifier that looks at Twitter profiles and classifies them as companies or people). I usually found better results when I could at least bin the dense variables into binary (or categorical and then one-hot encoded into binary) so that they didn't dominate.
Another option: train a classifier on the TF-IDF features alone, then use its predictions and predicted probabilities as additional features for a second model. In an AutoML-style blueprint I tried, this stacked setup scored above 90 percent, versus roughly 80 percent for the two separate classifiers.
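A minimal sketch of that stacking idea, assuming scikit-learn; the toy data, labels, and the particular choice of logistic regression plus gradient boosting are my assumptions, not something prescribed above:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier

docs = ["totally fake claim", "verified real report", "fake sources cited", "real facts checked"]
y = np.array([1, 0, 1, 0])                                    # toy labels: 1 = fake, 0 = real
X_dense = np.array([[0.9, 3], [0.2, 5], [0.8, 2], [0.1, 6]])  # e.g. subjectivity, #verbs

X_tfidf = TfidfVectorizer().fit_transform(docs)

# First-level model on TF-IDF only; its predicted probability becomes a new feature.
base = LogisticRegression(max_iter=1000).fit(X_tfidf, y)
tfidf_proba = base.predict_proba(X_tfidf)[:, 1]

# Second-level model on the dense features plus the TF-IDF probability.
X_stacked = np.column_stack([X_dense, tfidf_proba])
meta = GradientBoostingClassifier().fit(X_stacked, y)

In practice the first-level probabilities should be produced out-of-fold (e.g. with cross_val_predict) so the training labels do not leak into the second-level model.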
I sort of understand what features are: say, for an ML algorithm that learns to detect SPAM, certain keywords could be features?
But in the famous MNIST digits data set I see a matrix of numbers. Is the entire matrix one single feature, or is each number in the matrix a feature?
In my opinion, you are lacking some critical literature review.
Here are some good papers about RNNs and CNNs that can be used for image recognition applications:
https://pdfs.semanticscholar.org/86ef/e7769f2b8a0e15ca213ab09881e6705caeb0.pdf
https://arxiv.org/pdf/1506.00019.pdf
What is a feature? A feature represents one of the elements of the input vector which will be used to train the model and produce output.
The feature set is to be determined depending on the application.
Each element of the input vector is a different (dependent or independent) feature.
For example, look at this tutorial using the MNIST digit data set:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/recurrent_network.py
It says:
'''
To classify images using a recurrent neural network, we consider every image
row as a sequence of pixels. Because MNIST image shape is 28*28px, we will then
handle 28 sequences of 28 steps for every sample.
'''
The RNN is built on sequences; hence, if the image is 28 by 28, you can break it into a sequence of 28 steps with 28 features each.
# Network Parameters
num_input = 28 # MNIST data input (img shape: 28*28)
timesteps = 28 # timesteps
This is what you see in the network parameters: the 28 features (num_input = 28) represent one step (one image row) of the sequence.
To repeat, each element of the input vector is considered a feature. Furthermore, it is the analyst's responsibility to properly define these features.
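As a small illustration of that reshaping (using a random array in place of a real MNIST image, which is my own stand-in):

import numpy as np

flat_image = np.random.rand(784)            # a flattened 28*28 sample

# For the RNN above, each image row becomes one timestep of 28 features.
sequence = flat_image.reshape(28, 28)       # shape: (timesteps, num_input)
print(sequence.shape)                       # (28, 28)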
Technically, a feature is a numerical value which discriminatively represents (or attempts to discriminatively represent) the input or some part of the input. In the case of MNIST, where the image size is 28 x 28, the entire image matrix is flattened (generally row-wise) into a 1-D feature vector, and each element of this feature vector is a feature (in this case, simply an image intensity). The kind of feature one wants to use is completely problem-specific. For example, instead of flattening the entire MNIST digit image, you could have used the number of white pixels as your feature; however, it boils down to how discriminative such a feature would be for the given problem.
In the case of spam classification, the features are generally word frequencies (there are several other things involved, such as stop-word elimination, stemming, etc.).
One can of course select or design multiple features for a given problem, such as stroke length, curvature, number of edges, etc., which you mentioned in the comment above. However, the main idea is that features should be discriminative enough across all the classes and should not be derived from each other (this point leads to another topic called feature selection or dimensionality reduction). I suggest you read the Wikipedia page here and then go on to an academic presentation on feature extraction and dimensionality reduction, such as this one (which is specific to images). That should help you understand the overall idea.
An additional note: the features are combined into a compact representation called a feature vector. In this particular case, as mentioned before, you have a 1-D feature vector which contains image intensities as its features.
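To make this concrete, a small sketch with a random array standing in for one MNIST digit (the threshold for "white" is an arbitrary choice of mine):

import numpy as np

image = np.random.randint(0, 256, size=(28, 28))    # stand-in for one MNIST digit

# Feature vector 1: the flattened image, 784 intensity features.
flat_features = image.reshape(-1)                   # shape: (784,)

# Alternative (usually far less discriminative): a single feature
# counting the number of "white" pixels above a threshold.
white_pixel_count = np.sum(image > 200)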
I'm currently developing a speech recognition project and I'm trying to select the most meaningful features.
Most of the relevant papers suggest using Zero Crossing Rates, F0, and MFCC features therefore I'm using those.
My issue is that a training sample with a duration of 00:03 has 268 features. Considering I'm doing a multi-class classification project with 50+ training samples per class, including all MFCC features may expose the project to the curse of dimensionality or 'reduce the importance' of the other features.
So my question is: should I include all MFCC features, and if not, can you suggest an alternative?
You should not use F0 and zero crossing rate; they are too unstable. You can simply increase your training data and use MFCCs, which have good representational capabilities. But remember to mean-normalize them.
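For example, a minimal sketch of MFCC extraction with mean normalization, assuming librosa (the file name is hypothetical):

import numpy as np
import librosa

y, sr = librosa.load("sample_digit.wav", sr=None)    # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)

# Cepstral mean normalization: subtract each coefficient's mean over time.
mfcc_normalized = mfcc - np.mean(mfcc, axis=1, keepdims=True)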
After getting the MFCC coefficients of each frame, you can represent the MFCC features as the combination of:
1) First 12 MFCC
2) 1 energy feature
3) 12 delta MFCC feature
4) 12 double-delta MFCC feature
5) 1 delta energy feature
6) 1 double delta energy feature
The concept of delta MFCC features is described in this link.
The resulting 39-dimensional MFCC feature vector is then fed into an HMM or a recurrent neural network (a sketch of assembling it is given below).
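A rough sketch of assembling such a 39-dimensional frame-level feature, assuming librosa; using log RMS as the energy term is my assumption, not something prescribed by the answer:

import numpy as np
import librosa

y, sr = librosa.load("sample.wav", sr=None)               # hypothetical file

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)[1:]    # 12 MFCCs (drop the 0th)
energy = np.log(librosa.feature.rms(y=y) + 1e-10)         # 1 (log) energy feature

static = np.vstack([mfcc, energy])                        # 13 x n_frames
delta = librosa.feature.delta(static)                     # 13 delta features
delta2 = librosa.feature.delta(static, order=2)           # 13 double-delta features

features = np.vstack([static, delta, delta2])             # 39 x n_frames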
The point I'd like to make is that MFCCs are not required. You can use MFCCs, and you can use the energy, delta and delta-delta features, as mentioned by @Mahendra Thapa, but it is not "required". Some researchers use 40 cepstral coefficients, some drop the DCT from the MFCC calculation, making them MFSCs (spectral, not cepstral). Some add extra features; some use fewer. Susceptibility to the curse of dimensionality depends on your classifier, doesn't it? Some have recently even claimed progress towards the "holy grail" of speech recognition: training on the raw signal with deep learning, learning the best features rather than hand-crafting them.
MFCCs are widely used, and in practice they work relatively well.
I've got a set of F features, e.g. Lab color space values and entropy. By concatenating all the features together, I obtain a feature vector of dimension d (between 12 and 50, depending on which features are selected).
I usually get between 1000 and 5000 new samples, denoted x. A Gaussian Mixture Model is then trained on these vectors, but I don't know which class each feature vector comes from. What I do know, though, is that there are only 2 classes. Based on the GMM prediction I get the probability of a feature vector belonging to class 1 or class 2.
My question now is: how do I obtain the best subset of features, for instance only entropy and normalized RGB, that will give me the best classification accuracy? I guess this is achieved if class separability is increased by the feature subset selection.
Maybe I can utilize Fisher's linear discriminant analysis, since I already have the mean and covariance matrices from the GMM? But wouldn't I then have to calculate the score for each combination of features?
It would be nice to get some help on whether this is an unrewarding approach and I'm on the wrong track, and/or to hear any other suggestions.
One way of finding "informative" features is to use the features that will maximise the log likelihood. You could do this with cross validation.
https://www.cs.cmu.edu/~kdeng/thesis/feature.pdf
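A rough wrapper-style sketch of that idea, assuming scikit-learn's GaussianMixture and random stand-in data; note that held-out log likelihoods are only directly comparable between subsets of the same size, so the search below fixes the subset size:

from itertools import combinations
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

X = np.random.rand(1000, 12)    # stand-in for the d-dimensional samples

def cv_log_likelihood(X_subset, n_components=2, n_splits=5):
    # Average held-out log likelihood of a GMM fitted on the selected columns.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in kf.split(X_subset):
        gmm = GaussianMixture(n_components=n_components, random_state=0)
        gmm.fit(X_subset[train_idx])
        scores.append(gmm.score(X_subset[test_idx]))   # mean log likelihood per sample
    return np.mean(scores)

best = max(
    combinations(range(X.shape[1]), 3),                # compare only subsets of equal size
    key=lambda cols: cv_log_likelihood(X[:, list(cols)]),
)
print("best 3-feature subset:", best)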
Another idea might be to use an unsupervised algorithm that automatically selects features, such as a clustering forest:
http://research.microsoft.com/pubs/155552/decisionForests_MSR_TR_2011_114.pdf
In that case the clustering algorithm will automatically split the data based on information gain.
Fisher LDA will not select features but will project your original data into a lower-dimensional subspace. If you are looking into subspace methods, another interesting approach might be spectral clustering, which also operates in a subspace, or unsupervised neural networks such as autoencoders.
I am working on classifying reviews (paragraphs) consisting of multiple sentences. I classified them with bag-of-words features in Weka via libSVM. However, I had another idea which I don't know how to implement:
I thought creating syntactic and shallow-semantic features per sentence in the reviews would be worth trying. However, I couldn't find any way to encode those features sequentially, since the number of sentences per paragraph varies. The reason I want to keep those features in order is that the order of the sentence features may give a better clue for classification. For example, if I have two instances P1 (with 3 sentences) and P2 (with 2 sentences), I would have a space like this (assume each sentence has one binary feature, a or b):
P1 -> a b b /classX
P2 -> b a /classY
So my question is whether I can implement classification with different feature-vector sizes in the feature space or not. If yes, is there any kind of classifier I can use in Weka, scikit-learn or Mallet? I would appreciate any responses.
Thanks
Regardless of the implementation, an SVM with the standard kernels (linear, polynomial, RBF) requires fixed-length feature vectors. You can encode any information in those feature vectors as booleans; e.g. collect all syntactic/semantic features that occur in your corpus, then introduce booleans representing whether "feature such-and-such occurred in this document". If it's important to capture the fact that these features occur in multiple sentences, count them and put the frequency in the feature vector (but be sure to normalize your frequencies by document length, as SVMs are not scale-invariant).
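A small sketch of that kind of encoding with scikit-learn's DictVectorizer; the per-sentence features and their counts here are made-up placeholders:

from sklearn.feature_extraction import DictVectorizer

# Per-document counts of hypothetical syntactic/semantic features,
# normalized by the number of sentences in each document.
docs = [
    {"has_passive": 2 / 3, "has_negation": 1 / 3, "sentiment_positive": 2 / 3},  # P1, 3 sentences
    {"has_passive": 0.0, "has_negation": 1 / 2, "sentiment_positive": 1 / 2},    # P2, 2 sentences
]

vec = DictVectorizer(sparse=True)
X = vec.fit_transform(docs)       # fixed-length vectors, one column per feature
print(vec.get_feature_names_out())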
In case you are classifying textual data, I would suggest looking at "Rational Kernels", which are built on weighted finite-state transducers for classifying natural language texts. Rational kernels can be applied to variable-length inputs and are already implemented in an open-source project (OpenFST).
This is a limitation of the library: the SVM itself does not require fixed-length feature vectors, it only needs a kernel function. If you can provide a kernel function that works on variable-length inputs, it should be fine for an SVM.