Input of LSTM seq2seq network - Tensorflow - machine-learning

Using the Tensorflow seq2seq tutorial code I am creating a character-based chatbot. I don't use word embeddings. I have an array of characters (the alphabet and some punctuation marks) and special symbols like the GO, EOS and UNK symbol.
Because I'm not using word embeddings, I use the standard tf.nn.seq2seq.basic_rnn_seq2seq() seq2seq model. However, I am confused about what shape encoder_inputs and decoder_inputs should have. Should they be an array of integers, corresponding to the index of the characters in the alphabet-array, or should I turn those integers into one-hot vectors first?
How many input nodes does one LSTM cell have? Can you specify that? Because I guess in my case an LSTM cell should have an input neuron for each letter in the alphabet (therefore the one-hot vectors?).
Also, what is the LSTM "size" you have to pass in the constructor tf.nn.rnn_cell.BasicLSTMCell(size)?
Thank you.
Appendix: these are the bugs I am trying to fix.
When I use the following code, according to the tutorial:
for i in xrange(buckets[-1][0]): # Last bucket is the biggest one.
self.encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="encoder{0}".format(i)))
for i in xrange(buckets[-1][1] + 1):
self.decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="decoder{0}".format(i)))
self.target_weights.append(tf.placeholder(dtype, shape=[None], name="weight{0}".format(i)))
And run the self_test() function, I get the error:
ValueError: Linear is expecting 2D arguments: [[None], [None, 32]]
Then, when I change the shapes in the above code to shape=[None, 32] I get this error:
TypeError: Expected int32, got -0.21650635094610965 of type 'float' instead.

The number of inputs of an lstm cell is the dimension of whatever tensor you pass as inputs to the tf.rnn function when instantiating things.
The size argument is the number of hidden units in your lstm (so a bigger number is slower but can lead to more accurate models).
I'd need a bigger stack trace to understand these errors.

It turns out the size argument passed to BasicLSTMCell represents both the size of the hidden state of the LSTM and the size of the input layer. So if you want a different hidden size than input size, you can first propagate your inputs through an additional projection layer or use the built-in seq2seq word embeddings function.

Related

String classification, how to encode character-by-character and train?

I am trying to build a classifier to classify some files into 150 categories based on the name of those files. Here are some examples of file names in my dataset (~700k files):
104932489 - urgent - contract validation for xyz limited.msg
treatment - an I l - contract n°4934283 received by partner.pdf
- invoice_8843238_1_europe services_business 8592342sid paris.xls
140159498736656.txt
140159498736843.txt
fsk_000000001090296_sdiselacrusefeyre_2000912.xls
fsk_000000001091293_lidlsnd1753mdeas_2009316.xls
You can see that the filenames can really be anything, but that however there is always some pattern that is respected for the same categories. It can be in the numbers (that are sometimes close), in the special characters (spaces, -, °), sometimes the length, etc.
Extracting all those patterns one by one will take ages because I have approximately 700k documents. Also, I am not interested in 100% accuracy, 70% can be good enough.
The real problem is that I don't know how to encode this data. I have tried many methods:
Tokenizing character by character and feeding them to an LSTM model with an embedding layer. However, I wasn't able to implement it and got dimension errors.
Adapting Word2Vec to convert the characters into vectors. However, this automatically drops all punctuation and space characters, also, I lose the numeric data. Another problem is that it creates more useless dimensions: if the size is 20, I will have my data in 20 dimensions but if I look closely, there are always the same 150 vectors in those 20 dimensions so it's really useless. I could use a 2 dimensions size but still, I need the numeric data and the special characters.
Generating n-grams from each path, in the range 1-4, then using a CountVectorizer to compute the frequencies. I checked and special characters were not dropped but it gave me like 400,000 features! I am running a dimensionality reduction using UMAP (n_components=5, metric='hellinger') but the reduction runs for 2 hours and then the kernel crashes.
Any ideas?
I am currently also working on a character level lstm. And it works exactly the same like when you would use words. You need a vocabulary, for example a - z and then you just take the index of the letter as its integer representation. For example:
"bad" -> "b", "a", "d" -> [1, 0, 3]
Now you could create an embedding lookup table (for example using pytorchs nn.Embedding function). You just have to create a random vector for every index of your vocab. For example:
"a" -> 0 > [-0.93, 0.024, -.0.73, ..., -0.12]
You said that you tried this but encountered dimension errors? Maybe show us the code!
Or you could create non-random embedding using word2vec using the Gensim libary:
from gensim.models import Word2Vec
# 'total_words' is a list containing every word of your dataset split into its characters
total_words = [...]
model = Word2Vec(total_words , min_count=1, size=32)
model.save(save_model_file)
# lets test it for the character 'a'
embedder = Word2Vec.load(save_model_file)
v = embedder["a"]
# v now will be a the embedding vector of a with size 32x1
I hope I could make clear how to create embeddings for characters.
You can treat characters in single-word-classification the exact same way you would treat words in sentence-classification.

Dimensions of LSTM variant in Deep Mind's Differentiable Neural Computer (DNC)

I'm trying to implement Deep Mind's DNC - Nature paper- with PyTorch 0.4.0.
When implementing the variant of LSTM they used I encountered some troubles with dimensions.
To simplify suppose BATCH=1.
The equations they list in the paper are these:
where [x;h] means a concatenation of x and h into one single vector, and i, f and o are column vectors.
My question is about how the state s_t is computed.
The second addendum is obtained by multiplying i with a column vector and so the result is either a scalar (transpose i first, then do scalar product) or wrong (two column vectors multiplied).
So the state results in a single scalar...
With the same reasoning the hidden state h_t is a scalar too, but it has to be a column vector.
Obviously I'm wrong somewhere, but I can't figure out where.
By looking at Wikipedia LSTM Article I think I figured it out.
This is the formal implementation of standard LSTM found in the article:
The circle represents element-by-element product.
By using this product in the corresponding parts of DNC equations (s_t and o_t) the dimensions work.

What is Sequence length in LSTM?

The dimensions for the input data for LSTM are [Batch Size, Sequence Length, Input Dimension] in tensorflow.
What is the meaning of Sequence Length & Input Dimension ?
How do we assign the values to them if my input data is of the form :
[[[1.23] [2.24] [5.68] [9.54] [6.90] [7.74] [3.26]]] ?
LSTMs are a subclass of recurrent neural networks. Recurrent neural nets are by definition applied on sequential data, which without loss of generality means data samples that change over a time axis. A full history of a data sample is then described by the sample values over a finite time window, i.e. if your data live in an N-dimensional space and evolve over t-time steps, your input representation must be of shape (num_samples, t, N).
Your data does not fit the above description. I assume, however, that this representation means you have a scalar value x which evolves over 7 time instances, such that x[0] = 1.23, x[1] = 2.24, etc.
If that is the case, you need to reshape your input such that instead of a list of 7 elements, you have an array of shape (7,1). Then, your full data can be described by a 3rd order tensor of shape (num_samples, 7, 1) which can be accepted by a LSTM.
Simply put seq_len is number of time steps that will be inputted into LSTM network, Let's understand this by example...
Suppose you are doing a sentiment classification using LSTM.
Your input sentence to the network is =["I hate to eat apples"]. Every single token would be fed as input at each timestep, So accordingly here the seq_Len would total number of tokens in a sentence that is 5.
Coming to the input_dim you might know we can't directly feed words to the netowrk you would need to encode those words into numbers. In Pytorch/tensorflow embedding layers are used where we have to specify embedding dimension.
Suppose your embedding dimension is 50 that means that embedding layer will take index of respective token and convert it into vector representation of size 50. So the input dim to LSTM network would become 50.

Sampled softmax loss over variable sequence batches?

Background info: I'm working on sequence-to-sequence models, and right now my model accepts variable-length input tensors (not lists) with input shapes corresponding to [batch size, sequence length]. However, in my implementation, sequence length is unspecified (set to None) to allow for variable length inputs. Specifically, input sequence batches are padded only to the length of the longest sequence in that batch. This has sped up my training time considerably, so I'd prefer to keep it this way, as opposed to going back to bucketed models and/or padded all sequences in the training data to the same length. I'm using TensorFlow 1.0.0.
Problem: I'm currently using the following to compute the loss (which runs just fine).
loss = tf.losses.sparse_softmax_cross_entropy(
weights=target_labels, # shape: [batch size, None]
logits=outputs[:, :-1, :], # shape: [batch size, None, vocab size]
weights=target_weights[:, :-1]) # shape: [batch size, None]
where vocab size is typically about 40,000. I'd like to use a sampled softmax, but I've ran into an issue that's due to the unspecified nature of the input shape. According to the documentation for tf.nn.sampled_softmax_loss, it requires the inputs to be fed separately for each timestep. However, I can't call, for example,
tf.unstack(target_labels, axis=1)
since the axis is unknown beforehand.Does anyone know how I might go about implementing this? One would assume that since both dynamic_rnn and tf.losses.sparse_softmax_cross_entropy seem to have no issue doing this, that a workaround could be implemented with the sampled softmax loss somehow. After digging around in the source code and even models repository, I've come up empty handed. Any help/suggestions would be greatly appreciated.

What is a "label" in Caffe?

In Caffe when you are defining your inputs for the NN in the protobuf file, you can input "data" and "label". I'm guessing label contains the expected output for training data (what it is normally considered the y values in Machine Learning literature).
My problem is that in the caffe.proto file, label is defined as a scalar (int or long). At least with data, I can set it to an numpy array, because it takes String values. If I'm training for more than one prediction output, how could I pass it as an array?
Or am I mistaken? What is label? What is it for? And how can I pass the y values to caffe?
The basic use case of caffe used to be image classification: assigning a single integer label per input image. Thus, the "datum" data structure reserves space for a 4D float array (batches of 3 channels images) and an integer "label" per image in the batch.
This restriction can be easily overcome using HDF5 input data layer.
See e.g., this answer.

Resources