In TensorFlow, the dimensions of the input data for an LSTM are [Batch Size, Sequence Length, Input Dimension].
What is the meaning of Sequence Length and Input Dimension?
How do we assign values to them if my input data is of the form:
[[[1.23] [2.24] [5.68] [9.54] [6.90] [7.74] [3.26]]] ?
LSTMs are a subclass of recurrent neural networks. Recurrent neural nets are by definition applied to sequential data, which without loss of generality means data samples that change over a time axis. The full history of a data sample is then described by the sample values over a finite time window, i.e. if your data live in an N-dimensional space and evolve over t time steps, your input representation must be of shape (num_samples, t, N).
Your data does not fit the above description. I assume, however, that this representation means you have a scalar value x which evolves over 7 time instances, such that x[0] = 1.23, x[1] = 2.24, etc.
If that is the case, you need to reshape your input such that instead of a list of 7 elements, you have an array of shape (7, 1). Then your full data can be described by a 3rd-order tensor of shape (num_samples, 7, 1), which can be accepted by an LSTM.
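For example, in NumPy the reshape looks like this (a minimal sketch using the values from the question):

import numpy as np

x = np.array([1.23, 2.24, 5.68, 9.54, 6.90, 7.74, 3.26])  # a scalar evolving over 7 time steps
x = x.reshape(1, 7, 1)                                     # (num_samples, t, N)
print(x.shape)                                             # (1, 7, 1) -- ready for an LSTM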
Simply put, seq_len is the number of time steps that will be fed into the LSTM network. Let's understand this with an example.
Suppose you are doing sentiment classification with an LSTM.
Your input sentence to the network is ["I hate to eat apples"]. One token is fed as input at each time step, so here seq_len would be the total number of tokens in the sentence, which is 5.
Coming to input_dim: as you might know, we can't feed words to the network directly; you need to encode those words as numbers. In PyTorch/TensorFlow, embedding layers are used for this, where we have to specify an embedding dimension.
Suppose your embedding dimension is 50. That means the embedding layer will take the index of each token and convert it into a vector representation of size 50, so the input_dim of the LSTM network would be 50.
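To make this concrete, here is a small Keras sketch (the vocabulary size and layer sizes are illustrative assumptions, not from the question):

import tensorflow as tf

vocab_size, embedding_dim, seq_len = 10000, 50, 5
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),   # token index -> 50-dim vector
    tf.keras.layers.LSTM(64),                               # receives (batch, 5, 50)
    tf.keras.layers.Dense(1, activation="sigmoid"),         # sentiment score
])
model.build(input_shape=(None, seq_len))
model.summary()  # the LSTM sees seq_len = 5 time steps, each with input_dim = 50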
Denote by a[2, 3] a matrix of dimension 2x3. Say there are 10 elements in each input and the network is a two-element classifier (cat or dog, for example). Say there is just one dense layer. For now I am ignoring the bias vector. I know this is an over-simplified neural net, but it is just for this example. Each output in a dense layer of a neural net can be calculated as
output = matmul(input, weights)
where weights is a 10x2 weight matrix, input is a 1x10 input vector, and output is a 1x2 output vector.
My question is this: Can an entire series of inputs be computed at the same time with a single matrix multiplication? It seems like you could compute
output = matmul(input, weights)
where there are 100 inputs total: input is 100x10, weights is 10x2, and output is 100x2.
In back propagation, you could do something similar:
input_err = matmul(output_err, transpose(weights))
weights_err = matmul(transpose(input), output_err)
weights -= learning_rate*weights_err
where weights is the same as before, output_err is 100x2, and input is 100x10.
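In NumPy, the whole scheme would look roughly like this (a sketch under the shapes above; a simple squared-error gradient is assumed for output_err):

import numpy as np

num_inputs, input_size, output_size = 100, 10, 2
learning_rate = 0.01

inputs = np.random.rand(num_inputs, input_size)    # 100x10
targets = np.random.rand(num_inputs, output_size)  # 100x2
weights = np.random.rand(input_size, output_size)  # 10x2

output = np.matmul(inputs, weights)                # 100x2: all inputs at once
output_err = output - targets                      # gradient of a squared-error loss
input_err = np.matmul(output_err, weights.T)       # 100x10
weights_err = np.matmul(inputs.T, output_err)      # 10x2
weights -= learning_rate * weights_err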
However, I tried to implement a neural network in this way from scratch and I am currently unsuccessful. I am wondering if I have some other error or if my approach is fundamentally wrong.
Thanks!
If anyone else is wondering, I found the answer to my question. This does not in fact work, for a few reasons. Essentially, computing all inputs this way is like running the network with a batch size equal to the number of inputs: the weights do not get updated between inputs, but all at once. So while the calculation itself seems valid, each input no longer individually influences the training step by step. However, with a reasonable batch size, you can do 2D matrix multiplications, where the input is batch_size by input_size, in order to speed up training.
In addition, when predicting on many inputs (at test time, for example), no weights are updated, so a single matrix multiplication with an input of num_inputs by input_size can compute all outputs in parallel.
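For example (a sketch reusing the shapes above):

import numpy as np

num_inputs, input_size, output_size = 1000, 10, 2
test_inputs = np.random.rand(num_inputs, input_size)
weights = np.random.rand(input_size, output_size)  # frozen at test time

predictions = np.matmul(test_inputs, weights)      # 1000x2, all inputs in parallel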
I have a neural network with an input layer of 10 nodes, some hidden layers, and an output layer with a single node. I feed a pattern into the input layer, and after some processing it outputs a value in the output neuron, a number from 1 to 10. After training, this model is able to produce the output, given an input pattern.
Now, my question is whether it is possible to compute the inverse model: that is, I provide a number on the output side (i.e. use the output side as input) and then get a pattern back from those 10 input neurons (i.e. use the input side as output).
I want to do this because I will first train a network on the difficulty of patterns (the input is the pattern and the output is how difficult the pattern is to understand). Then I want to feed the network a number so that it generates patterns of that difficulty.
I hope I understood your problem correctly, so I will summarize it in my own words: you have a given model and want to determine an input which yields a given output.
Supposing that this is correct, there is at least one way I know of to do this approximately. It is very easy to implement, but might take a while to compute a value; there are probably better ways, but I am not sure. (I needed this technique some weeks ago for a reinforcement learning problem and did not find anything better than this.) Let's call your model M; it maps an input x to an output y = M(x). We now have to create a new model, which we will call F: this model will later calculate the inverse of M, so that it gives you an input which yields a specific output. To construct F, we create a new model which consists of one plain Dense layer whose output has the same dimension m as the input of M. This layer is then connected to the input of M. Next, you make all weights of M non-trainable (this is very important!).
Now we are set up to find an inverse value: assume you want to find the input corresponding to the output y (corresponding means here: it creates that output, but it is not unique). You have to create a new input v which is simply the constant value 1 (the unity). Then you create an input-output data pair consisting of (v, y). Now you use any optimizer you wish to let this input-output training pair propagate through your network, until the error converges to zero. Once this has happened, you can calculate the actual input which gives the output y as follows: supposing the weights of the new input layer are called w and the bias is b, the desired input u is u = w*1 + b (where 1 is the unity input from above).
You might be asking why this equation holds, so let me try to answer it: your model will try to learn the weights of the new input layer so that the unity as an input creates the given output. As only the newly added input layer is trainable, only these weights will be changed. Therefore, each weight in this vector (together with the bias) represents the corresponding component of the desired input vector. By using an optimizer to minimize the l^2 distance between the wanted output and the output of our inverse model F, we finally determine a set of weights which gives a good approximation of the input vector.
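A minimal Keras sketch of this procedure (the model M below is a stand-in for your trained network; the layer sizes and the target output are illustrative assumptions):

import numpy as np
import tensorflow as tf

# Stand-in for the already-trained model M (10 inputs -> 1 output).
M = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])
M.trainable = False                              # freeze M: only the new layer may learn

# F: the constant input 1 goes through a trainable Dense(10) layer into M.
F = tf.keras.Sequential([
    tf.keras.layers.Dense(10, input_shape=(1,)),  # weights w (10,), bias b (10,)
    M,
])
F.compile(optimizer="adam", loss="mse")          # minimizes the l^2 distance

v = np.ones((1, 1))                              # the unity input
y = np.array([[5.0]])                            # the desired output
F.fit(v, y, epochs=2000, verbose=0)              # train on the single pair (v, y)

w, b = F.layers[0].get_weights()                 # kernel (1, 10), bias (10,)
u = w.flatten() + b                              # u = w*1 + b
print(u, M.predict(u[None, :]))                  # u approximately produces y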
Hi everybody, I am struggling with the TensorFlow RNN implementation:
The problem:
I want to train an LSTM implementation of an RNN to detect malicious connections in the KDD99 dataset. It's a dataset with 41 features and (after some preprocessing) a label vector of size 5.
[
[x1, x2, x3, .....x40, x41],
...
[x1, x2, x3, .....x40, x41]
]
[
[0, 1, 0, 0, 0],
...
[0, 0, 1, 0, 0]
]
As a basic architecture I would like to implement the following:
def lstm_cell():
    cell = tf.nn.rnn_cell.LSTMCell(num_units=64, state_is_tuple=True)
    return tf.nn.rnn_cell.DropoutWrapper(cell=cell, output_keep_prob=0.5)

# three separate cell instances; [cell] * 3 would reuse a single set of weights
cell = tf.nn.rnn_cell.MultiRNNCell(cells=[lstm_cell() for _ in range(3)], state_is_tuple=True)
My question is: in order to feed it to the model, how would I need to reshape the input features?
Would I not just have to reshape the input features, but also build sliding-window sequences?
What I mean by that:
Assuming a sequence length of ten, the first sequence would contain data points 0-9, the second one data points 1-10, the third 2-11, and so on.
Thanks!
I do not know the dataset, but I think your problem is the following: you have a very long sequence and you want to know how to shape it in order to feed it to the network.
The tf.contrib.rnn.static_rnn function has the following signature:
tf.contrib.rnn.static_rnn(cell, inputs, initial_state=None, dtype=None, sequence_length=None, scope=None)
where
inputs: A length T list of inputs, each a Tensor of shape [batch_size, input_size], or a nested tuple of such elements.
So the inputs need to be shaped into a list, where each element of the list is the input at one time step.
The length of this list depends on your problem and/or on computational constraints.
In Natural Language Processing, for example, the length of this list can be the maximum sentence length in your document, with shorter sentences padded to that length. As in this case, in many domains the length of the sequence is driven by the problem.
However, you may have no such evidence in your problem, or the sequence may still be very long. Long sequences are very heavy from a computational point of view: the BPTT algorithm used to optimize these models "unfolds" the recurrent network into a very deep feedforward network with shared parameters and backpropagates over it. In such cases, it is still convenient to "cut" the sequence to a fixed length.
And here we arrive at your question: given this fixed length, let us say 10, how do I shape my input?
Usually, what is done is to cut the dataset into non-overlapping windows (in your example, we will have 0-9, 10-19, 20-29, etc.). What happens here is that the network only looks at the last 10 elements of the sequence each time it updates the weights with BPTT.
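A small NumPy sketch of this windowing for data shaped like yours (the number of records is an illustrative assumption):

import numpy as np

seq_len = 10
features = np.random.rand(1000, 41).astype(np.float32)  # connection records, 41 features
labels = np.eye(5)[np.random.randint(0, 5, 1000)]       # one-hot labels of size 5

n = (len(features) // seq_len) * seq_len                # drop the incomplete tail
x = features[:n].reshape(-1, seq_len, 41)               # (num_windows, 10, 41)
y = labels[:n].reshape(-1, seq_len, 5)                  # (num_windows, 10, 5)
print(x.shape, y.shape)                                 # (100, 10, 41) (100, 10, 5)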
However, since the sequence has been arbitrarily cut, it is likely that predictions need to exploit evidence that lies further back in the sequence, outside the current window. To do this, we initialize the initial state of the RNN at window i with the final state of window i-1, using the parameter:
initial_state: (optional) An initial state for the RNN.
Finally, I give you two sources to go into more detail:
RNN Tutorial: this is the official TensorFlow tutorial, applied to the task of Language Modeling. At a certain point in the code, you will see that the final state is fed to the network from one run to the next, in order to implement what was said above.
feed_dict = {}
for i, (c, h) in enumerate(model.initial_state):
    feed_dict[c] = state[i].c
    feed_dict[h] = state[i].h
DevSummit 2017: this is a video of a talk from the TensorFlow DevSummit 2017 where, in the first section (Reading and Batching Sequence Data), it is explained how, and with which functions, you should shape your sequence inputs.
Hope this helps :)
Using the TensorFlow seq2seq tutorial code I am creating a character-based chatbot. I don't use word embeddings. I have an array of characters (the alphabet and some punctuation marks) and special symbols like the GO, EOS and UNK symbols.
Because I'm not using word embeddings, I use the standard tf.nn.seq2seq.basic_rnn_seq2seq() seq2seq model. However, I am confused about what shape encoder_inputs and decoder_inputs should have. Should they be arrays of integers, corresponding to the indices of the characters in the alphabet array, or should I turn those integers into one-hot vectors first?
How many input nodes does one LSTM cell have? Can you specify that? Because I guess in my case an LSTM cell should have an input neuron for each letter in the alphabet (therefore the one-hot vectors?).
Also, what is the LSTM "size" you have to pass in the constructor tf.nn.rnn_cell.BasicLSTMCell(size)?
Thank you.
Appendix: these are the bugs I am trying to fix.
When I use the following code, according to the tutorial:
for i in xrange(buckets[-1][0]):  # Last bucket is the biggest one.
    self.encoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="encoder{0}".format(i)))
for i in xrange(buckets[-1][1] + 1):
    self.decoder_inputs.append(tf.placeholder(tf.int32, shape=[None], name="decoder{0}".format(i)))
    self.target_weights.append(tf.placeholder(dtype, shape=[None], name="weight{0}".format(i)))
And run the self_test() function, I get the error:
ValueError: Linear is expecting 2D arguments: [[None], [None, 32]]
Then, when I change the shapes in the above code to shape=[None, 32] I get this error:
TypeError: Expected int32, got -0.21650635094610965 of type 'float' instead.
The number of inputs of an LSTM cell is the dimension of whatever tensor you pass as inputs to the tf.rnn function when instantiating things.
The size argument is the number of hidden units in your LSTM (so a bigger number is slower but can lead to more accurate models).
I'd need a bigger stack trace to understand these errors.
It turns out the size argument passed to BasicLSTMCell represents both the size of the hidden state of the LSTM and the size of the input layer. So if you want a hidden size different from the input size, you can first propagate your inputs through an additional projection layer or use the built-in seq2seq word-embedding functions.
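For example, with an embedding as that projection, in the TF1-era API the question uses (vocab_size and embedding_dim are illustrative assumptions):

import tensorflow as tf

vocab_size, embedding_dim, hidden_size = 40, 16, 64   # e.g. alphabet plus special symbols

ids = tf.placeholder(tf.int32, shape=[None])          # one time step of character ids
embedding = tf.get_variable("embedding", [vocab_size, embedding_dim])
inputs = tf.nn.embedding_lookup(embedding, ids)       # (batch, embedding_dim)

# The hidden size can now differ from the (embedded) input size.
cell = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)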
I've been playing with some SVM implementations and I am wondering: what is the best way to normalize feature values to fit into one range (from 0 to 1)?
Let's suppose I have 3 features with values in ranges of:
3 - 5
0.02 - 0.05
10 - 15
How do I convert all of those values into the range [0, 1]?
What if, during training, the highest value of feature number 1 that I encounter is 5, and after I begin to use my model on much bigger datasets, I stumble upon values as high as 7? Then in the converted range it would exceed 1...
How do I normalize values during training to account for the possibility of "values in the wild" exceeding the highest (or lowest) values the model has "seen" during training? How will the model react to that, and how do I make it work properly when that happens?
Besides the scaling-to-unit-length method provided by Tim, standardization is most often used in the machine learning field. Please note that when your test data arrive, it makes more sense to use the mean value and standard deviation from your training samples to do this scaling. If you have a very large amount of training data, it is safe to assume it obeys a normal distribution, so the probability that new test data are out of range won't be that high. Refer to this post for more details.
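A small sketch of standardization with training-set statistics (the numbers are illustrative):

import numpy as np

X_train = np.array([[3.2, 0.02, 11.0],
                    [4.8, 0.05, 14.5],
                    [5.0, 0.03, 10.2]])
mean, std = X_train.mean(axis=0), X_train.std(axis=0)

def standardize(X):
    # Reuse the training mean/std for any new data, including "values in the wild".
    return (X - mean) / std

X_test = np.array([[7.0, 0.04, 15.5]])  # feature 1 exceeds the training maximum of 5
print(standardize(X_test))              # out-of-range values simply land outside the usual span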
You normalise a vector by converting it to a unit vector. This trains the SVM on the relative values of the features, not the magnitudes. The normalisation algorithm will work on vectors with any values.
To convert to a unit vector, divide each value by the length of the vector. For example, a vector of [4 0.02 12] has a length of 12.6491. The normalised vector is then [4/12.6491 0.02/12.6491 12/12.6491] = [0.316 0.0016 0.949].
If "in the wild" we encounter a vector of [400 2 1200] it will normalise to the same unit vector as above. The magnitudes of the features is "cancelled out" by the normalisation and we are left with relative values between 0 and 1.