Classification task based on speech recordings

I'm currently working with a large dataset consisting of speech recordings of conversations from 120 persons. For each person, I have around 5 conversation recordings of 40-60 min each (the conversations are dyadic). For each recording, I have one label (i.e., around 600 labels in total). The labels provide information about the mental state of one of the persons in the conversation (three classes). To classify this mental state based on the speech recordings, I see the following three possibilities:
1. Extracting Mel-spectrograms or MFCCs (better suited for speech) and training an RNN/CNN hybrid (e.g., ConvLSTM) for the classification task. The problem I see here is that, with so few labels, the model might overfit. In addition, the recordings are long, so training an RNN might be difficult. A network pretrained on a different task (e.g., speech recognition?) could also be used, but pretrained models are probably not available for RNNs.
2. Training a CNN autoencoder on Mel-spectrograms or MFCCs over small shifted windows (e.g., 1 minute). The encoder could then be used to extract features. The problem here is that the whole recording probably has to be used for the prediction, so features would need to be extracted over the whole recording with the same shifted windows used to train the autoencoder.
3. Extracting features manually (e.g., frequency-based features) and using an SVM or Random Forest for the prediction, which might be better suited to the small number of labels (a fully connected network could also be used for comparison). The advantage here is that the features can be chosen to be independent of the length of the recording (a rough sketch of this option is given below).
Which option do you think is best? Do you have any recommendations?
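
For illustration, here is a minimal sketch of option 3, assuming librosa and scikit-learn; paths, labels, and speaker_ids are placeholder variables for your data. Summary statistics over time yield a fixed-length vector per recording, regardless of its duration:

    import numpy as np
    import librosa
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score, GroupKFold

    def recording_features(path, sr=16000):
        """Summarise a whole recording with duration-independent statistics."""
        y, sr = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape (20, n_frames)
        # Mean and std over time make the vector independent of recording length.
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    X = np.stack([recording_features(p) for p in paths])     # ~600 recordings
    clf = RandomForestClassifier(n_estimators=500, random_state=0)
    # Group folds by speaker so the same person never appears in train and test.
    scores = cross_val_score(clf, X, labels, groups=speaker_ids, cv=GroupKFold(5))
    print(scores.mean())

With only ~600 labels, speaker-grouped cross-validation matters at least as much as the model choice; otherwise a classifier can score well simply by recognising speakers.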

Related

How do engineered features help when they are not present in the test data

I am trying to classify between drones and birds using machine learning. I have a large number of feature-vector samples from a radar, which generally consist of position (x, y, z), velocity (vx, vy, vz), acceleration (ax, ay, az), noise, SNR, etc., plus some more features. The actual classes of the samples are known. However, these basic features are not able to distinguish between drones and birds for new (out-of-bag) samples. So I am going to try feature engineering to generate new features, such as the standard deviation of speed: compute the mean speed over a track, take the differences between the individual samples' speeds and that mean, and average these differences. Similarly, I generate other new features from sums, differences, or deviations from the average (over different samples of the same track).
After obtaining these features, we will use them to train a model for classification.
However, I can apply all this feature engineering to the training dataset, whereas the same engineered features will not be present in the test data obtained in the operational scenario, where we get one sample after another. Also, in the operational scenario we do not know how many samples we will get for a track.
So how can these engineered features be computed at test time, so as to create a test feature vector with the same features in the actual operational scenario?
And if they cannot be obtained while testing, how will the engineered features used for model training be able to solve the classification problem when we do not have them in the test data?
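
One common approach, sketched below under stated assumptions, is to design the engineered features so they can be updated incrementally as samples of a track arrive. This minimal Python sketch uses Welford's online algorithm for the running mean and standard deviation of speed; track_samples and the vx/vy/vz attribute names are hypothetical:

    class RunningStats:
        """Welford's online algorithm: mean and std updated one sample at a time."""
        def __init__(self):
            self.n, self.mean, self.m2 = 0, 0.0, 0.0

        def update(self, x):
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)

        @property
        def std(self):
            return (self.m2 / self.n) ** 0.5 if self.n > 1 else 0.0

    stats = RunningStats()
    for sample in track_samples:   # hypothetical stream of radar returns
        speed = (sample.vx**2 + sample.vy**2 + sample.vz**2) ** 0.5
        stats.update(speed)
        feature_vector = [speed, stats.mean, stats.std]
        # classify with the current running features; they sharpen as n grows

This way the test-time feature vector has the same layout as in training; early in a track the aggregates are simply noisier.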

MobileNet vs SqueezeNet vs ResNet50 vs Inception v3 vs VGG16

I have recently been looking into incorporating Apple's machine learning release for iOS developers into my app. Since this is my first time using anything ML-related, I was very lost when I started reading the different model descriptions that Apple has made available. They have the same purpose/description, the only difference being the actual file size. What is the difference between these models, and how would you know which one is the best fit?
The models Apple makes available are just for simple demo purposes. Most of the time, these models are not sufficient for use in your own app.
The models on Apple's download page are trained for a very specific purpose: image classification on the ImageNet dataset. This means they can take an image and tell you what the "main" object is in the image, but only if it's one of the 1,000 categories from the ImageNet dataset.
Usually, this is not what you want to do in your own apps. If your app wants to do image classification, typically you want to train a model on your own categories (like food or cars or whatever). In that case you can take something like Inception-v3 (the original, not the Core ML version) and re-train it on your own data. That gives you a new model, which you then need to convert to Core ML again.
If your app wants to do something other than image classification, you can use these pretrained models as "feature extractors" in a larger neural network structure. But again this involves training your own model (usually from scratch) and then converting the result to Core ML.
So only in a very specific use case -- image classification using the 1,000 ImageNet categories -- are these Apple-provided models useful to your app.
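As an illustration of the retraining route, here is a minimal Keras sketch, assuming TensorFlow and a hypothetical 5-category dataset; converting the resulting model to Core ML (e.g., with coremltools) is a separate final step not shown:

    from tensorflow import keras

    # Pretrained ImageNet model used as a frozen feature extractor.
    base = keras.applications.InceptionV3(weights="imagenet", include_top=False,
                                          pooling="avg", input_shape=(299, 299, 3))
    base.trainable = False

    model = keras.Sequential([
        base,
        keras.layers.Dense(5, activation="softmax"),  # 5 = your own categories
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_images, train_labels, epochs=...)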
If you do want to use any of these models, the difference between them is speed vs. accuracy. The smaller models are fastest but also least accurate. (In my opinion, VGG16 shouldn't be used on mobile. It's just too big and it's no more accurate than Inception or even MobileNet.)
SqueezeNets are fully convolutional and use Fire modules, which have a squeeze layer of 1x1 convolutions that vastly decreases the number of parameters by restricting the number of input channels to each layer. This makes SqueezeNets extremely low-latency, in addition to the fact that they don't have dense layers.
MobileNets utilise depth-wise separable convolutions, similar to the towers in Inception. These also reduce the number of parameters and hence latency. MobileNets also have useful model-shrinking hyperparameters that you can set before training to get exactly the size you want. The Keras implementation can use ImageNet pretrained weights too.
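For example, in the Keras implementation the width multiplier (alpha) shrinks the network before training. A small sketch; the exact set of alpha values with pretrained weights may vary by version:

    from tensorflow import keras

    small = keras.applications.MobileNet(alpha=0.25, weights="imagenet")
    full = keras.applications.MobileNet(alpha=1.0, weights="imagenet")
    print(small.count_params(), full.count_params())  # far fewer params for small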
The other models are very deep, large models. There the reduced number of parameters / style of convolution is used not for low latency but for the ability to train very deep models. ResNet introduced residual connections between layers, which were originally believed to be key to training very deep models. These aren't seen in the previously mentioned low-latency models.

Adjusting hyperparameters of neural network used for (offline) handwriting recognition

I'm currently using a Java library to do some naive experimentation with offline handwriting recognition. I give my program an image of a pre-written English sentence and segment it into individual characters, which I then feed to a very naively constructed neural network.
I'm new to the idea of neural nets, so my question is where to start with regard to optimising this network's hyperparameters. Currently it's a simple feed-forward network that I train using resilient propagation, so the only parameters I can optimise are the number of hidden layers and the number of neurons in each hidden layer. I could of course do an exhaustive search through a large but finite number of combinations, but this would be very time-consuming, and I'm sure someone more informed in this art can point me in the right direction.
I found a post somewhere on here stating that a good place to start for any network in general is a single hidden layer with the number of neurons equal to the mean of the sizes of the input and output layers, so that's what I'm doing at the moment.
I'm getting about 40-60% accuracy (depending on the character) with this model.
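
If exhaustive search is too slow, random search over a handful of configurations is a common middle ground. A minimal sketch with scikit-learn rather than your Java library, purely to show the idea; X and y stand for your segmented character images and labels:

    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import RandomizedSearchCV

    param_space = {
        "hidden_layer_sizes": [(h,) for h in (32, 64, 128, 256)]
                            + [(h, h) for h in (32, 64, 128)],
        "alpha": [1e-5, 1e-4, 1e-3, 1e-2],   # L2 regularisation strength
    }
    search = RandomizedSearchCV(MLPClassifier(max_iter=500), param_space,
                                n_iter=10, cv=3, random_state=0)
    search.fit(X, y)
    print(search.best_params_, search.best_score_)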

How to do machine learning when the inputs are of different sizes?

In standard cookbook machine learning, we operate on a rectangular matrix; that is, all of our data points have the same number of features. How do we cope with situations in which our data points have different numbers of features? For example, if we want to do visual classification but all of our pictures are of different dimensions, or if we want to do sentiment analysis but all of our sentences have different numbers of words, or if we want to do stellar classification but the stars have been observed different numbers of times, etc.
I think the normal way would be to extract features of regular size from these irregularly sized data. But I attended a talk on deep learning recently where the speaker emphasized that instead of hand-crafting features from data, deep learners are able to learn the appropriate features themselves. But how do we use e.g. a neural network if the input layer is not of a fixed size?
Since you are asking about deep learning, I assume you are more interested in end-to-end systems rather than feature design. Neural network architectures that can handle variable-size inputs are:
1) Convolutional neural networks with pooling layers. They are usually used in image recognition, but have recently been applied to modeling sentences as well. (I think they should also be good at classifying stars; see the sketch after this list.)
2) Recurrent neural networks. (Good for sequential data such as time series and sequence-labeling tasks; also good for machine translation.)
3) Tree-based autoencoders (also called recursive autoencoders) for data arranged in tree-like structures (can be applied to sentence parse trees).
Lots of papers describing example applications can readily be found by googling.
For uncommon tasks you can select one of these based on the structure of your data, or you can design some variants and combinations of these systems.
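To make (1) concrete, here is a minimal Keras sketch of a CNN that accepts images of variable height and width; global pooling collapses whatever spatial size arrives into a fixed-length vector (with variable sizes you typically train with batch size 1 or pad within a batch):

    from tensorflow import keras

    inputs = keras.Input(shape=(None, None, 3))       # height/width unspecified
    x = keras.layers.Conv2D(32, 3, activation="relu")(inputs)
    x = keras.layers.Conv2D(64, 3, activation="relu")(x)
    x = keras.layers.GlobalAveragePooling2D()(x)      # fixed-size vector
    outputs = keras.layers.Dense(10, activation="softmax")(x)
    model = keras.Model(inputs, outputs)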
You can usually make the number of features the same for all instances quite easily:
if we want to do visual classification but all of our pictures are of different dimensions
Resize them all to a certain dimension / number of pixels.
if we want to do sentiment analysis but all of our sentences have different numbers of words
Keep a dictionary of the k most frequent words across all of your text data. Each instance will then be a boolean vector of size k, where the i-th entry is true if word i from the dictionary appears in that instance (this is not the best representation, but many others are based on it). See the bag-of-words model; a sketch follows below.
if we want to do stellar classification but all of the stars have been observed a different number of times
Take the features that have been observed for all the stars.
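A minimal sketch of the bag-of-words idea with scikit-learn (the two example sentences are made up): variable-length sentences become fixed-size boolean vectors over a shared vocabulary:

    from sklearn.feature_extraction.text import CountVectorizer

    sentences = ["the movie was great",
                 "the plot was thin but the cast was great"]
    vectorizer = CountVectorizer(binary=True)   # presence/absence, not counts
    X = vectorizer.fit_transform(sentences)     # shape: (n_sentences, k)
    print(vectorizer.get_feature_names_out())
    print(X.toarray())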
But I attended a talk on deep learning recently where the speaker emphasized that instead of hand-crafting features from data, deep learners are able to learn the appropriate features themselves.
I think the speaker probably referred to higher level features. For example, you shouldn't manually extract the feature "contains a nose" if you want to detect faces in an image. You should feed it the raw pixels, and the deep learner will learn the "contains a nose" feature somewhere in the deeper layers.

What is suitable neural network architecture for the prediction of popularity of articles?

I am a newbie in machine learning and also in neural networks. Currently I'm taking a course at coursera.org about neural networks, but I don't understand everything. I have a little problem with my thesis. I should use a neural network, but I don't know how to choose the right neural network architecture for my problem.
I have a lot of data from web portals (typically the online editions of newspapers and magazines). There is information about the articles, for example the name, the text, and the release date of each article. There are also large amounts of sequence data that capture the behavior of users.
My goal is to predict the popularity of an article (the number of readers, or clicks on the article by unique users). I want to make vectors from this data and feed my neural network with these vectors.
I have two questions:
1. How do I create the right vector?
2. Which neural network architecture is best suited for this problem?
Those are very broad questions. You'll need to identify smaller issues if you want more exact answers.
How do I create the right vector?
For text data, you usually use the vector space model. Best results are often obtained using tf-idf weighting.
Which neural network architecture is suitable for this problem?
This is very hard to say. I would start with a network with k input neurons, where k is the size of your vectors after applying tf-idf. (You might also want to do some feature selection to reduce the number of features; a good feature selection method is the chi-squared test.)
Then, a standard network layout is a single hidden layer with the number of neurons equal to the average of the numbers of input and output neurons. It looks like you only need a single output neuron, which will output how popular the article is going to be (this can be a linear neuron or a sigmoid neuron).
For the neurons in your hidden layer, you can also experiment with linear and sigmoid neurons.
There are many other things you can try as well: weight decay, the momentum technique, networks with multiple layers, recurrent networks and so on. It's impossible to say what would work best for your given problem without a lot of experimentation.
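For question 1, a minimal sketch of the pipeline suggested above, assuming scikit-learn; articles and popularity are placeholder variables, and popularity is binned into three classes only because chi-squared selection needs class labels:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.neural_network import MLPRegressor

    texts = articles                # placeholder: list of article texts
    y = np.asarray(popularity)      # placeholder: clicks per article
    y_bins = np.digitize(y, np.quantile(y, [0.33, 0.66]))  # low/mid/high

    X = TfidfVectorizer(max_features=50000).fit_transform(texts)
    X = SelectKBest(chi2, k=2000).fit_transform(X, y_bins)

    # one hidden layer, sized between the input (2000) and output (1) layers
    mlp = MLPRegressor(hidden_layer_sizes=(1000,), max_iter=500)
    mlp.fit(X, y)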
