(Bidirectional) RNN for simple text classification - machine-learning

TL;DR: Is Bidirectional RNN helpful for simple text classification and is padding evil?
In my recent work, I created an LSTM model and a BLSTM model for the same task, namely text classification. The LSTM model did a pretty good job, yet I decided to give the BLSTM a shot to see whether it could push the accuracy even further. In the end, I found the BLSTM much slower to converge and, surprisingly, it overfitted, even though I applied dropout with a probability of 50%.
In the implementation, I used an unrolled RNN for both the LSTM and the BLSTM, expecting faster training. To meet that requirement, I manually padded the input texts to a fixed length.
Let's say we have the sentence "I slept late in the morning and missed th interview with Nebuchadnezzar", which is padded with 0 at its end when converted to an array of indices into pre-trained word embeddings. So we get something like [21, 43, 25, 64, 43, 25, 6, 234, 23, 0, 0, 29, 0, 0, 0, ..., 0]. Note that "th" (it should be "the") is a typo and the name "Nebuchadnezzar" is too rare, so neither is present in the vocabulary; both are also mapped to 0, which corresponds to a special all-zero word vector.
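Roughly, the conversion looks like this (toy vocabulary with made-up indices):
import collections  # not strictly needed; just a plain dict lookup below

word_to_idx = {"i": 21, "slept": 43, "late": 25}   # unknown words fall back to 0, like the padding
max_len = 50

def to_padded_indices(tokens):
    ids = [word_to_idx.get(t.lower(), 0) for t in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))        # pad with 0 up to the fixed length

print(to_padded_indices("I slept late".split()))   # [21, 43, 25, 0, 0, ..., 0]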
Here are my reflections:
Some people prefer changing unknown words into a special token like "< unk >" before feeding the corpus into a GloVe or Word2Vec model. Does this mean that we have to first build the vocabulary and change low-frequency words (according to a min-count setting) into "< unk >" before training? Is that better than mapping unknown words to 0, or simply removing them, when training an RNN?
The trailing 0s fed into the LSTM or BLSTM networks, as far as I'm concerned, mess up the output. Although no new information comes in from outside, the cell state still gets updated at each of the following time steps, so the output of the final cell will be heavily affected by the long run of trailing 0s. In my view, a BLSTM will be affected even more, since it also processes the text in reverse order, i.e. something like [0, 0, 0, ..., 0, 321, 231], especially if we set the initial forget gate bias to 1.0 to encourage memory at the beginning. I see a lot of people use padding, but won't it cause serious problems when the texts are padded to a great length, and especially in the case of a BLSTM?
Any idea on these issues? :-o

I agree mostly with Fabrice's answer above, but to add a few comments:
You should NEVER use the same token for UNK and PAD. Most deep learning libraries mask over PAD, since it provides no information. UNK, on the other hand, does provide information to your model (there is a word here, we just don't know what it is, and it's probably a rare or special word), so you should not mask it out. Yes, this does mean that in a separate preprocessing step you should go through your training/testing data, build a vocabulary of, say, the top 10,000 most common words, and switch everything else to UNK (a sketch of this step follows after the next point).
As noted in 1, most libraries simply mask over (i.e. ignore) padding tokens, so this is not an issue. But as you said, it isn't strictly necessary to pad the sentences at all. For example, you can group them by length while training, or feed the sentences into your model one at a time.
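Here is a minimal sketch of the preprocessing step from point 1; the vocabulary size and the special-token names are placeholders:
from collections import Counter

PAD, UNK = 0, 1   # distinct ids, so the library can mask PAD while still learning from UNK

def build_vocab(tokenized_texts, vocab_size=10000):
    counts = Counter(tok for text in tokenized_texts for tok in text)
    vocab = {"<pad>": PAD, "<unk>": UNK}
    for word, _ in counts.most_common(vocab_size):
        vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab):
    return [vocab.get(tok, UNK) for tok in tokens]   # rare or unseen words become UNK, never PAD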

In my experience, having a different embedding for UNKNOWN and PADDING is helpful. Since you're doing text classification, I guess removing unknown words wouldn't be too harmful if there aren't too many, but I'm not familiar enough with text classification to say that with certainty.
As for padding your sequences, have you tried padding them differently? For example, pad the beginning of the sequence for a forward LSTM and the end of the sequence for a backward LSTM. Since you're padding with zeros, the activations will not be as strong (if present at all), and your LSTMs will now end on your actual sequence instead of on zeros, which could otherwise overwrite your LSTM memory.
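As a rough sketch of what I mean, using Keras' pad_sequences (the indices and length below are made up):
from tensorflow.keras.preprocessing.sequence import pad_sequences

ids = [[21, 43, 25, 64]]                               # one encoded sentence, made-up indices
fwd_in = pad_sequences(ids, maxlen=10, padding='pre')  # [[0 0 0 0 0 0 21 43 25 64]]
bwd_in = pad_sequences(ids, maxlen=10, padding='post') # [[21 43 25 64 0 0 0 0 0 0]]
# the forward LSTM now ends on real tokens, and so does the reversed (backward) LSTM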
Of course, these are just suggestions off the top of my head (I do not have enough reputation to comment), so I do not have THE answer. You'll have to try it out yourself and see whether it helps. I hope it does.
Cheers.

Related

String classification, how to encode character-by-character and train?

I am trying to build a classifier that sorts files into 150 categories based on their names. Here are some examples of file names in my dataset (~700k files):
104932489 - urgent - contract validation for xyz limited.msg
treatment - an I l - contract n°4934283 received by partner.pdf
- invoice_8843238_1_europe services_business 8592342sid paris.xls
140159498736656.txt
140159498736843.txt
fsk_000000001090296_sdiselacrusefeyre_2000912.xls
fsk_000000001091293_lidlsnd1753mdeas_2009316.xls
You can see that the filenames can really be anything, but there is always some pattern that is respected within the same category. It can be in the numbers (which are sometimes close to each other), in the special characters (spaces, -, °), sometimes the length, etc.
Extracting all those patterns one by one would take ages because I have approximately 700k documents. Also, I am not aiming for 100% accuracy; 70% would be good enough.
The real problem is that I don't know how to encode this data. I have tried many methods:
Tokenizing character by character and feeding the tokens to an LSTM model with an embedding layer. However, I wasn't able to implement it and got dimension errors.
Adapting Word2Vec to convert the characters into vectors. However, this automatically drops all punctuation and space characters, and I also lose the numeric data. Another problem is that it creates useless dimensions: if the size is 20, my data lives in 20 dimensions, but if I look closely there are always the same 150 vectors in those 20 dimensions, so it's not really useful. I could use a size of 2 dimensions, but I still need the numeric data and the special characters.
Generating n-grams from each path, in the range 1-4, then using a CountVectorizer to compute the frequencies. I checked, and the special characters were not dropped, but it gave me around 400,000 features! I am running a dimensionality reduction with UMAP (n_components=5, metric='hellinger'), but the reduction runs for 2 hours and then the kernel crashes.
Any ideas?
I am currently also working on a character-level LSTM, and it works exactly the same way as when you use words. You need a vocabulary, for example a-z, and then you just take the index of each letter as its integer representation. For example:
"bad" -> "b", "a", "d" -> [1, 0, 3]
Now you can create an embedding lookup table (for example using PyTorch's nn.Embedding module). You just have to create a random vector for every index of your vocabulary. For example:
"a" -> 0 > [-0.93, 0.024, -.0.73, ..., -0.12]
You said that you tried this but encountered dimension errors? Maybe show us the code!
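In the meantime, here is a minimal sketch of that lookup-table idea with nn.Embedding (the vocabulary and the embedding size are just examples):
import torch
import torch.nn as nn

# map each character to an index, then look up a (randomly initialized) vector for it
vocab = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=32)

indices = torch.tensor([vocab[ch] for ch in "bad"])  # tensor([1, 0, 3])
vectors = embedding(indices)                         # shape (3, 32), one vector per character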
Or you could create non-random embeddings with word2vec using the Gensim library:
from gensim.models import Word2Vec

# 'total_words' is a list containing every word of your dataset split into its characters,
# e.g. "bad" -> ["b", "a", "d"]
total_words = [...]
save_model_file = "char_w2v.model"  # any file path will do

model = Word2Vec(total_words, min_count=1, size=32)  # in gensim >= 4.0, use vector_size=32
model.save(save_model_file)
# let's test it for the character 'a'
embedder = Word2Vec.load(save_model_file)
v = embedder["a"]  # in gensim >= 4.0, use embedder.wv["a"]
# v is now the embedding vector of 'a' with size 32
I hope this makes it clear how to create embeddings for characters.
You can treat characters in single-word-classification the exact same way you would treat words in sentence-classification.

Avoiding overfitting with random forest

I am training a random forest model for the first time and I have run into the following situation.
My accuracy on the training set, with the default parameters (as in https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html ), is very high, 0.95 or more, which looks a lot like overfitting. On the test set, accuracy drops to 0.66. My goal is to make the model overfit less, hoping to improve performance on the test set.
I tried to perform 5-fold cross-validation, using random grid search as described here ( https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74 ) with the following grid:
import numpy as np

n_estimators = [16, 32, 64, 128]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
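Roughly, the grid then goes into scikit-learn's RandomizedSearchCV like this (dummy data stands in for my real training set, and the exact settings are illustrative; note that recent scikit-learn versions no longer accept max_features='auto'):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# dummy data in place of the real training set; 'random_grid' is the dict defined above
X_train, y_train = make_classification(n_samples=500, n_features=20, random_state=0)
search = RandomizedSearchCV(estimator=RandomForestClassifier(),
                            param_distributions=random_grid,
                            n_iter=20, cv=5, n_jobs=-1, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)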
The best model had an accuracy of 0.7 across the folds.
I used the best parameters selected in step 2 on the training set and test set, but again accuracy on the training set was 0.95 and on the test set 0.66.
Any suggestions? What do you think is going on here? How can I avoid overfitting (and maybe improve model performance)?
Over here someone had the same question and received some helpful answers:
https://stats.stackexchange.com/questions/111968/random-forest-how-to-handle-overfitting
Your approach of using 5-fold cross-validation is already very good and can perhaps be improved by using 10-fold cross-validation instead.
Another question you can ask yourself is about the quality of your data set. Are your classes balanced? If they aren't, you could try to handle the class imbalance, because imbalance usually comes with a bias towards the majority class.
It is also possible that the dataset is simply not big enough, and increasing it could boost your performance as well.
I hope this helps a bit.
Adding this late comment in case it helps others.
In addition to the parameters mentioned above (n_estimators, max_features, max_depth, and min_samples_leaf), consider setting min_impurity_decrease.
You can use 'gini' or 'entropy' for the criterion; however, I recommend sticking with 'gini', the default. In the majority of cases they produce the same result, but 'entropy' is more expensive to compute.
Max depth works well and is an intuitive way to stop a tree from growing; however, just because a node is shallower than the max depth doesn't always mean it should split. If the information gained from splitting only addresses a single misclassification (or a few), then splitting that node may encourage overfitting. You may or may not find this parameter useful, depending on the size of your dataset and/or the size and complexity of your feature space, but it is worth considering while tuning your parameters.
You do not describe how you split your dataset, so consider using a slightly smaller training set. Also make sure you do not have categorical variables in your feature space; if you do, use OneHotEncoder or pd.get_dummies to break those out.
I'm not sure how large your feature space is, but you may want to use a smaller subset of your features (depending on how many noise variables you have). You may also want to look at a smaller max_depth; your grid tests depths all the way up to 110, which is very large. Again, I do not know your feature space, but start at the lower end of your range and expand from there, i.e. try [5, 7, 9]; if 9 is optimal, then adjust to, say, [9, 11, 13], etc. Even a depth of 9 can cause overfitting (depending on the data), so be careful not to grow this too much. Possibly pair this with the gini criterion.
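To make this concrete, here is a small sketch of a more constrained forest along the lines above; the specific values are illustrative rather than recommendations, and dummy data stands in for the real dataset:
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# dummy data in place of the real dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(
    n_estimators=128,
    max_depth=9,                  # start small and only grow if cross-validation improves
    min_samples_leaf=4,
    max_features='sqrt',
    min_impurity_decrease=1e-4,   # skip splits that barely reduce impurity
    criterion='gini',
    random_state=0,
)
clf.fit(X_train, y_train)
print(clf.score(X_train, y_train), clf.score(X_test, y_test))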

LSTM-RNN : How to shape multivariate Inputs

Hi everybody, I am struggling with the TensorFlow RNN implementation:
The problem:
I want to train an LSTM implementation of an RNN to detect malicious connections in the KDD99 dataset. It is a dataset with 41 features and (after some preprocessing) a label vector of size 5.
[
  [x1, x2, x3, ..., x40, x41],
  ...
  [x1, x2, x3, ..., x40, x41]
]
[
  [0, 1, 0, 0, 0],
  ...
  [0, 0, 1, 0, 0]
]
As a basic architecture I would like to implement the following:
cell = tf.nn.rnn_cell.LSTMCell(num_units=64, state_is_tuple=True)
cell = tf.nn.rnn_cell.DropoutWrapper(cell=cell, output_keep_prob=0.5)
cell = tf.nn.rnn_cell.MultiRNNCell(cells=[cell] * 3, state_is_tuple=True)
My question is: in order to feed it to the model, how would I need to reshape the input features?
Would I not just have to reshape the input features, but also build sliding-window sequences?
What I mean by that:
Assuming a sequence length of ten, the first sequence would contain data points 0-9, the second one data points 1-10, then 2-11, and so on.
Thanks!
I do not know the dataset, but I think your problem is the following: you have one very long sequence and you want to know how to shape it in order to feed it to the network.
The 'tf.contrib.rnn.static_rnn' function has the following signature:
tf.contrib.rnn.static_rnn(cell, inputs, initial_state=None, dtype=None, sequence_length=None, scope=None)
where
inputs: A length T list of inputs, each a Tensor of shape [batch_size, input_size], or a nested tuple of such elements.
So the inputs need to be shaped into a list, where each element of the list is the element of the input sequence at one time step.
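For example, a sketch in TensorFlow 1.x style (the sequence length of 10 is chosen just for illustration, and the cell is taken from your snippet):
import tensorflow as tf

seq_len, num_features = 10, 41
x = tf.placeholder(tf.float32, [None, seq_len, num_features])     # [batch_size, time, features]
inputs = tf.unstack(x, num=seq_len, axis=1)                       # list of 10 tensors, each [batch_size, 41]
cell = tf.nn.rnn_cell.LSTMCell(num_units=64, state_is_tuple=True)
outputs, state = tf.contrib.rnn.static_rnn(cell, inputs, dtype=tf.float32)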
The length of this list depends on your problem and/or on computational constraints.
In Natural Language Processing, for example, the length of this list can be the maximum sentence length in your corpus, with shorter sentences padded to that length. As in this case, in many domains the length of the sequence is dictated by the problem itself.
However, you may have no such evidence in your problem, or you may still be left with a very long sequence. Long sequences are very heavy from a computational point of view: the BPTT algorithm, used to optimize these models, "unfolds" the recurrent network into a very deep feedforward network with shared parameters and backpropagates through it. In these cases, it is still convenient to "cut" the sequence to a fixed length.
And here we arrive at your question: given this fixed length, let us say 10, how do I shape my input?
Usually, the dataset is cut into non-overlapping windows (in your example, we would have 0-9, 10-19, 20-29, etc.). What happens here is that the network only looks at the last 10 elements of the sequence each time it updates the weights with BPTT.
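For example, a sketch of the cutting with NumPy (random data stands in here for the preprocessed KDD99 features and one-hot labels):
import numpy as np

features = np.random.rand(1000, 41)                    # (num_time_steps, 41 features)
labels = np.eye(5)[np.random.randint(0, 5, 1000)]      # one-hot labels, (num_time_steps, 5)

seq_len = 10
num_windows = features.shape[0] // seq_len             # drop any incomplete tail
x = features[:num_windows * seq_len].reshape(num_windows, seq_len, 41)
y = labels[:num_windows * seq_len].reshape(num_windows, seq_len, 5)
# x has shape (100, 10, 41): non-overlapping windows of 10 consecutive time steps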
However, since the sequence has been arbitrarily cut, it is likely that predictions need to exploit evidence that lies further back in the sequence, outside the current window. To handle this, we initialize the initial state of the RNN at window i with the final state of window i-1, using the parameter:
initial_state: (optional) An initial state for the RNN.
Finally, here are two sources to go into more detail:
RNN Tutorial: this is the official TensorFlow tutorial, applied to the task of Language Modeling. At a certain point in the code you will see that the final state is fed back to the network from one run to the next, in order to implement what was said above.
feed_dict = {}
for i, (c, h) in enumerate(model.initial_state):
    feed_dict[c] = state[i].c
    feed_dict[h] = state[i].h
DevSummit 2017: this is a video of a talk from the TensorFlow DevSummit 2017 where, in the first section (Reading and Batching Sequence Data), it is explained how, and with which functions, you should shape your sequence inputs.
Hope this helps :)

How to use Encoder-Decoder RNN when my input and output are not length-fixed?

For example, each of my input sequences can have length 5, 10, or 15; the length is not fixed, and the same holds for the output. How do I handle this when using an Encoder-Decoder RNN (seq2seq)?
I'm assuming the problem arises when mini-batching; otherwise, there is no problem processing each sample on its own.
The simplest strategy is to ensure all samples in a mini-batch have the same length.
Otherwise, another solution is padding: add a new symbol called PAD to the vocabulary, and then pad short samples to the length of the longest sample in the mini-batch.
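A minimal sketch of both ideas combined, with PAD assumed to be id 0:
PAD = 0

def make_batches(sequences, batch_size=32):
    ordered = sorted(sequences, key=len)               # similar lengths end up in the same batch
    for start in range(0, len(ordered), batch_size):
        batch = ordered[start:start + batch_size]
        max_len = max(len(s) for s in batch)
        yield [s + [PAD] * (max_len - len(s)) for s in batch]

# example with three encoded sequences of different lengths
for batch in make_batches([[4, 7, 2], [9, 3, 8, 1, 5], [6, 2]], batch_size=2):
    print(batch)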

LSTM network learning

I have attempted to program my own LSTM (long short-term memory) neural network. I would like to verify that the basic functionality is working. I have implemented a Backpropagation Through Time (BPTT) algorithm to train a single-cell network.
Should a single-cell LSTM network be able to learn a simple sequence, or is more than one cell necessary? The network does not seem to be able to learn a simple sequence such as 1 0 0 0 1 0 0 0 1 0 0 0 1.
I am sending the sequence's 1s and 0s one by one, in order, into the network and feeding them forward. I record each output for the sequence.
After running the whole sequence through the LSTM cell, I feed the mean error signals back into the cell, saving the weight changes internal to the cell in a separate collection. After running all the errors through one by one and calculating the new weights after each error, I average these new weights together to obtain the final value for each weight in the cell.
Am I doing something wrong? I would really appreciate any advice.
Thank you so much!
Having only one cell (one hidden unit) is not a good idea, even if you are just testing the correctness of your code. You should try 50 even for such a simple problem. This paper, http://arxiv.org/pdf/1503.04069.pdf, gives you very clear gradient rules for updating the parameters. Having said that, there is no need to implement your own LSTM, even if your dataset and/or the problem you are working on is new. Picking an existing library (Theano, MXNet, Torch, etc.) and modifying from there is, I think, an easier way, given that it's less error-prone and supports GPU computing, which is essential for training LSTMs within a reasonable amount of time.
I haven't tried 1 hidden unit before, but I am sure 2 or 3 hidden units will work for the sequence 0,1,0,1,0,1. It is not necessarily the case that more cells give better results; training difficulty also increases with the number of cells.
You said you averaged the new weights together to get the new weight. Does that mean you run many training sessions and take the average of the trained weights?
There are many possible reasons why your LSTM did not work, even if you implemented it correctly. The weights are not easy to train by simple gradient descent.
Here are my suggestions for weight optimization:
Use the momentum method for gradient descent (see the sketch after this list).
Add some Gaussian noise to your training set to prevent overfitting.
Use adaptive learning rates for each unit.
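A tiny sketch of the classical momentum update mentioned in the first point; the learning rate and momentum coefficient are arbitrary example values:
def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity - lr * grad    # accumulate a running direction of descent
    return w + velocity, velocity

w, velocity = 0.5, 0.0
w, velocity = momentum_step(w, grad=0.2, velocity=velocity)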
Maybe you can take a look at Coursera's Neural Networks course offered by the University of Toronto and discuss with people there.
Or you can take a look at other examples on GitHub, for instance:
https://github.com/JANNLab/JANNLab/tree/master/examples/de/jannlab/examples
The best way to test an LSTM implementation (after gradient checking) is to try it out on the toy memory problems described in the original LSTM paper itself.
The best one, which I often use, is the 'Addition Problem':
We give a sequence of tuples of the form (value, mask). The value is a real-valued scalar between 0 and 1. The mask is a binary value, either 0 or 1.
0.23, 0
0.65, 0
...
0.86, 0
0.13, 1
0.76, 0
...
0.34, 0
0.43, 0
0.12, 1
0.09, 0
..
0.83, 0 -> 0.125
In the entire sequence of such tuples (usually of length 100), only 2 tuples should have a mask of 1; the rest should have a mask of 0. The target at the final time step is the average of the two values for which the mask was 1. The outputs at all time steps other than the last one are ignored. The values and the positions of the masked entries are chosen arbitrarily. Thus, this simple task shows whether your implementation can actually remember things over long periods of time.
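A small sketch of how such a sample can be generated, with the sequence length and format as described above:
import numpy as np

def addition_sample(seq_len=100, rng=np.random):
    values = rng.uniform(0.0, 1.0, seq_len)
    mask = np.zeros(seq_len)
    i, j = rng.choice(seq_len, size=2, replace=False)  # exactly two positions get mask = 1
    mask[i] = mask[j] = 1.0
    target = (values[i] + values[j]) / 2.0             # average of the two marked values
    return np.stack([values, mask], axis=1), target    # inputs of shape (seq_len, 2), scalar target

inputs, target = addition_sample()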

Resources