How do we use LSTM to classify sequences? - machine-learning

LSTM is good for predicting what is going to happen after a sequence, but I assume that we have many sequences and that each sequence corresponds to a class label.
How can we use LSTM to classify these sequences?

LSTM can be used for prediction as well as classification tasks.
For classification, you can follow the most commonly used architectures, which I describe below; however, you can build your own model depending on your requirements.
The output of the LSTM (here I describe dynamic_rnn with time_major == False) is a tensor of shape output = [batch_size, sequence_length, cell.output_size], which means that for each row in the batch we have a [sequence_length, cell.output_size] matrix.
1. Method 1
2. Method 2
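For example, here is a minimal sketch (Keras, with assumed vocabulary and class counts) of one common setup: keep only the last time step's output and feed it to a softmax layer. This is only an illustration, not necessarily identical to the methods listed above.
from tensorflow.keras import layers, models

vocab_size, num_classes = 10000, 5   # assumed for illustration

model = models.Sequential([
    layers.Embedding(vocab_size, 128),
    # return_sequences=False keeps only the last time step's output,
    # i.e. one vector per sequence that summarizes it.
    layers.LSTM(64, return_sequences=False),
    layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X, y, ...) where X is (batch, sequence_length) integer ids and y holds class ids.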
Hope this helps.

Related

Activation functions in Neural Networks

I have a few questions related to the usage of various activation functions in neural networks. I would highly appreciate it if someone could give good explanatory answers.
Why is ReLU used only on hidden layers specifically?
Why is sigmoid not used in multi-class classification?
Why do we not use any activation function in regression problems where all values are negative?
Why do we use "average='micro','macro','average'" when calculating performance metrics in multi-class classification?
I'll answer the first two questions to the best of my ability:
ReLU (= max(0, x)) is used to extract feature maps from the data. This is why it is used in the hidden layers, where we are learning which important characteristics or features of the data could help the model learn to classify, for example. In the FC layers, it's time to make a decision about the output, so we usually use sigmoid or softmax, which give us numbers between 0 and 1 (probabilities) that can be interpreted directly.
Sigmoid gives a probability for each class independently. So, if you have 10 classes, you'll have 10 probabilities, and depending on the threshold used, your model could predict, for example, that an image belongs to two classes, whereas in multi-class classification you want just one predicted class per image. That's why softmax is used in this context: it turns the scores into a single probability distribution over the classes, so taking the class with the maximum probability yields exactly one prediction.
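To make the distinction concrete, here is a small sketch (NumPy, with made-up logits for three classes) showing that independent sigmoids can exceed a 0.5 threshold for several classes at once, while softmax yields a single distribution that sums to 1:
import numpy as np

logits = np.array([2.0, 1.9, -1.0])                 # made-up class scores
sigmoid = 1 / (1 + np.exp(-logits))                 # each class scored independently
softmax = np.exp(logits) / np.exp(logits).sum()     # one distribution over all classes

print(sigmoid)                 # ~[0.88, 0.87, 0.27] -> two classes pass a 0.5 threshold
print(softmax, softmax.sum())  # ~[0.51, 0.46, 0.03], sums to 1 -> argmax picks one class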

The best loss function for pixelwise binary classification in keras

I built a deep learning model which accepts an image of size 250*250*3 and outputs a 62500 (250*250) binary vector, containing 0s for pixels that represent the background and 1s for pixels that represent the ROI.
My model is based on DenseNet121, but when I use softmax as the activation function in the last layer together with the categorical cross-entropy loss function, the loss is nan.
What are the best loss and activation functions that I can use in my model?
What is the difference between the binary cross-entropy and categorical cross-entropy loss functions?
Thanks in advance.
What are the best loss and activation functions that I can use in my model?
Use binary_crossentropy, because every output is independent (not mutually exclusive) and can take the values 0 or 1, and use sigmoid in the last layer.
Check this interesting question/answer
What is the difference between the binary cross-entropy and categorical cross-entropy loss functions?
Here is a good set of answers to that question.
Edit 1: My bad, use binary_crossentropy.
After a quick look at the code (again), I can see that Keras uses:
for binary_crossentropy -> tf.nn.sigmoid_cross_entropy_with_logits
(From tf docs): Measures the probability error in discrete classification tasks in which each class is independent and not mutually exclusive. For instance, one could perform multilabel classification where a picture can contain both an elephant and a dog at the same time.
for categorical_crossentropy -> tf.nn.softmax_cross_entropy_with_logits
(From tf docs): Measures the probability error in discrete classification tasks in which the classes are mutually exclusive (each entry is in exactly one class). For example, each CIFAR-10 image is labeled with one and only one label: an image can be a dog or a truck, but not both.
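Putting that advice together, here is a minimal sketch of the output end of such a model (Keras; the names and the way the backbone is pooled are assumptions, not the asker's exact architecture):
from tensorflow.keras import layers, models
from tensorflow.keras.applications import DenseNet121

inputs = layers.Input(shape=(250, 250, 3))
base = DenseNet121(include_top=False, weights=None, input_tensor=inputs)  # weights assumed
x = layers.GlobalAveragePooling2D()(base.output)
# One independent sigmoid per pixel: each of the 62500 outputs is its own 0/1 decision.
outputs = layers.Dense(250 * 250, activation="sigmoid")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# y_true is the flattened 62500-dimensional 0/1 mask for each image.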

how to fine-tune word2vec when training our CNN for text classification?

I have 3 questions about fine-tuning word vectors. Please help me out; I will really appreciate it! Many thanks in advance!
When I train my own CNN for text classification, I use word2vec to initialize the words and then just employ these pre-trained vectors as input features to train the CNN. If I never have an embedding layer, it surely cannot be fine-tuned through back-propagation. My question is: if I want to do fine-tuning, does that mean creating an embedding layer? And how do I create it?
When we train word2vec, we use unsupervised training, right? In my case, I use the skip-gram model to get my pre-trained word2vec. But when I have the vec.bin and use it in the text-classification model (CNN) as my word initializer, if I want to fine-tune the word-to-vector map in vec.bin, does that mean I have to have a CNN structure exactly the same as the one used when training my word2vec? And does the fine-tuning change vec.bin, or does it only happen in memory?
Are the skip-gram and CBOW models only used for unsupervised word2vec training, or can they also be applied to other general text-classification tasks? And what is the difference between the network used for unsupervised word2vec training and the one used for supervised fine-tuning?
@Franck Dernoncourt, thank you for reminding me. I'm green here and hope to learn something from this powerful community. Please have a look at my questions when you have time. Thank you again!
1) What you need is just a good example of using a pretrained word embedding with a trainable/fixed embedding layer, with the following change in the code. In Keras this layer is updated during training by default; to exclude it from training you need to set trainable to False.
# Assumes nb_words, EMBEDDING_DIM, embedding_matrix and MAX_SEQUENCE_LENGTH
# come from your preprocessing step.
from keras.layers import Embedding

embedding_layer = Embedding(nb_words + 1,
                            EMBEDDING_DIM,
                            weights=[embedding_matrix],       # initialize with the pretrained w2v vectors
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True)                   # set to False to freeze the embeddings
2) Your w2v is used only for embedding-layer initialization; it has no further relation to whatever CNN structure you are going to use. Training will only update the weights in memory.
Answer to your 1st question-
When you set trainable=True in your Embedding constructor, your pretrained embeddings are set as the weights of that embedding layer. Any fine-tuning that then happens on those weights has nothing to do with w2v (CBOW or SG). If you want to fine-tune the w2v model itself, you will have to do so using one of those techniques; refer to these answers.
Answer 2-
Any fine-tuning of the embedding-layer weights does not affect your vec.bin. The updated weights are saved along with the model, though, so you can in principle get them back out.
Answer 3 -
gensim implements only these two methods (SG and CBOW). However, there are several newer methods for training word vectors, such as MLM (masked language modeling).
GloVe tries to model the probabilities of co-occurrences of words.
If you want to fine-tune using your own custom method, you just have to specify a task (like text classification) and then save your updated embedding-layer weights. You will have to take proper care of the indexing to assign each word its corresponding vector; see the sketch below.
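For concreteness, here is a minimal sketch (Keras) of fine-tuning that embedding layer inside a small text-classification CNN and then reading the updated vectors back out; num_classes and the training data are assumed, and embedding_layer is the layer built above:
from keras.layers import Input, Conv1D, GlobalMaxPooling1D, Dense
from keras.models import Model

num_classes = 5   # assumed for illustration

inp = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype="int32")
x = embedding_layer(inp)                    # pretrained w2v vectors, trainable=True
x = Conv1D(128, 5, activation="relu")(x)
x = GlobalMaxPooling1D()(x)
out = Dense(num_classes, activation="softmax")(x)

model = Model(inp, out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, ...)  # back-propagation now fine-tunes the word vectors

# The updated vectors live in the model, not in vec.bin:
finetuned_vectors = embedding_layer.get_weights()[0]   # shape (nb_words + 1, EMBEDDING_DIM)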

scikit learn classifies stopwords

Here is the example with a step-by-step procedure to make the system learn and classify input data.
It classifies the given 5 dataset domains correctly. However, it also classifies inputs that consist only of stopwords.
e.g
Input : docs_new = ['God is love', 'what is where']
Output :
'God is love' => soc.religion.christian
'what is where' => soc.religion.christian
Here, 'what is where' should not be classified, as it contains only stopwords. How does scikit-learn behave in this scenario?
I am not sure what classifier you are using. But let's assume you use a Naive Bayes classifier.
In this case, the sample is labeled as the class for which the posterior probability is maximum given a particular pattern of words.
And the posterior probability is calculated as
posterior = likelihood x prior
(Note that the evidence term was dropped, since it is constant.) Additionally, there is additive smoothing to avoid scenarios where the likelihood is zero.
Anyway, if you have only stop words in your input text, the likelihood is constant for all classes and the posterior probability is entirely determined by your prior probability. So, what basically happens is that a Naive Bayes classifier (if the priors were estimated from the training data) will assign the class label that occurs most often in the training data.
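As a toy illustration of that last point (the numbers and class names are made up): when the likelihood is the same for every class, the posterior simply tracks the prior:
# Two hypothetical classes with priors 0.7 / 0.3 and identical likelihoods,
# as happens when the input contributes no informative words.
priors = {"class_a": 0.7, "class_b": 0.3}
likelihood = 1e-6

posteriors = {c: likelihood * p for c, p in priors.items()}
print(max(posteriors, key=posteriors.get))   # -> "class_a", the class with the largest prior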
A classifier always predicts one of the classes that it saw during its training phase, by definition. I don't know what you did to produce the classifier, but most likely it is just predicting the majority class for any sample without interesting features; that's what naive Bayes, linear SVMs and other typical text classifiers do.
Standard text classification uses TfidfVectorizer to transform text into tokens and then into feature vectors that serve as input to the classifier.
One of its init parameters is stop_words; with stop_words='english' the vectorizer will produce no features for the sentence 'what is where'.
Stop words are matched lexically against every input token using a built-in English stop-word list, which you can examine here: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/stop_words.py
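A quick sketch of that behavior (scikit-learn; fitting the vectorizer on just these two sentences for illustration):
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(["God is love", "what is where"])
print(vec.get_feature_names_out())   # ['god' 'love'] -- the stop words are gone
print(X.toarray()[1])                # all zeros: 'what is where' has no features left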

How to do text classification with label probabilities?

I'm trying to solve a text classification problem for academic purposes. I need to classify tweets into labels like "cloud", "cold", "dry", "hot", "humid", "hurricane", "ice", "rain", "snow", "storms", "wind" and "other". Each tweet in the training data has probabilities against all the labels. Say the message "Can already tell it's going to be a tough scoring day. It's as windy right now as it was yesterday afternoon." has a 21% chance of being hot and a 79% chance of being wind. I have worked on classification problems that predict whether it's wind, hot or something else, but in this problem each training example has probabilities against all the labels. I have previously used the Mahout naive Bayes classifier, which takes a specific label for a given text to build the model. How can I feed these per-label probabilities as input to a classifier?
In a probabilistic setting, these probabilities reflect uncertainty about the class label of your training instance. This affects parameter learning in your classifier.
There's a natural way to incorporate this: in naive Bayes, for instance, when estimating the parameters of your model, instead of each word getting a count of one for the class to which the document belongs, it gets a fractional count equal to the class probability. Thus documents with a high probability of belonging to a class contribute more to that class's parameters. The situation is exactly equivalent to learning a mixture-of-multinomials model with EM, where the probabilities you have play the role of the membership/indicator variables for your instances.
Alternatively, if your classifier were a neural net with softmax output, instead of the target output being a vector with a single [1] and lots of zeros, the target output becomes the probability vector you're supplied with.
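For instance (a NumPy sketch with made-up numbers), the cross-entropy loss works unchanged when the target is a full probability vector rather than a one-hot vector:
import numpy as np

target = np.array([0.00, 0.21, 0.79])      # label probabilities, e.g. [other, hot, wind]
predicted = np.array([0.10, 0.30, 0.60])   # softmax output of the network (made up)

loss = -np.sum(target * np.log(predicted))   # same formula as with a one-hot target
print(loss)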
I don't, unfortunately, know of any standard implementations that would allow you to incorporate these ideas.
If you want an off-the-shelf solution, you could use a learner that supports multiclass classification and instance weights. Let's say you have k classes with probabilities p_1, ..., p_k. For each input instance, create k new training instances with identical features and labels 1, ..., k, and assign them weights p_1, ..., p_k respectively.
Vowpal Wabbit is one such learner that supports multiclass classification with instance weights.
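Here is a sketch of that expansion (scikit-learn is used purely for illustration, since its classifiers also accept per-instance weights via sample_weight; the feature vector and probabilities are made up, and Vowpal Wabbit's weighted format is analogous):
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([0.3, 1.2, 0.0])          # features of one original instance (made up)
probs = {"hot": 0.21, "wind": 0.79}    # its label probabilities

X, y, w = [], [], []
for label, p in probs.items():         # k weighted copies, one per class
    X.append(x)
    y.append(label)
    w.append(p)

clf = LogisticRegression()
clf.fit(np.array(X), np.array(y), sample_weight=np.array(w))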
