LSTM, multi-variate, multi-feature in pytorch - machine-learning

I'm having trouble understanding the format of data for an LSTM in PyTorch. Let's say I have a CSV file with 4 features, laid out in timestamps one after the other (a classic time series):
time1 feature1 feature2 feature3 feature4
time2 feature1 feature2 feature3 feature4
time3 feature1 feature2 feature3 feature4
time4 feature1 feature2 feature3 feature4, label
However, this entire set of 4 sequences only has a single label. The thing we're trying to classify started at time1, but we don't know how to label it until time 4.
My question is, can a typical PyTorch LSTM support this? All of the tutorials I've read, watched, or walked through involve looking at a time sequence of a single feature, or a word model, which is still a dataset with a single dimension.
If it can support it, does the data need to be flattened in some way?
Pytorch's LSTM reference states:
input: tensor of shape (L, N, H_in) when batch_first=False or (N, L, H_in) when batch_first=True containing the features of the input sequence. The input can also be a packed variable length sequence.
Does this mean that it cannot support any input that contains multiple sequences? Or is there another name for this?
I'm really lost here, and could use any advice, pointers, help, so on. Maybe some disambiguation too.
I've posted a couple times here but gotten no responses at all. If this post is misplaced, could someone kindly direct me towards the correct place to post it?
Edit: Following Daniel's advice, do I understand correctly that the four features should be put together like this:
[(feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4, feature1, feature2, feature3, feature4), label] when given to the LSTM?
If that's correct, is the input size (16) in this case?
Finally, I was under the impression that the output of the LSTM would be the predicted label. Do I have that wrong?

As you show, the LSTM layer's input shape is (batch_size, sequence_length, feature_size) when batch_first=True. This means that the features at each time step are assumed to form a 1D vector.
So to use it in your case, you need to stack your four features into one vector per time step (if they are more than 1D themselves, flatten them first) and use that vector as the layer's input.
Regarding the label: it is definitely supported to have a label only after several time steps. The LSTM will output a sequence with the same length as the input sequence, but when training you can choose to use any part of that sequence in the loss function. In your case you will want to use the last element only.
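A minimal sketch of how this could look in PyTorch, assuming batch_first=True, 4 time steps, and 4 features per time step. The class name SequenceClassifier, the hidden size of 16, the two classes, and the random data are illustrative choices of mine, not something from the question. Note that input_size is 4 (features per time step), not 16, and that the LSTM output is a hidden vector per time step, so a small linear layer turns the last one into class scores:

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Reads a whole sequence and classifies it from the last time step (illustrative)."""

    def __init__(self, n_features=4, hidden_size=16, n_classes=2):
        super().__init__()
        # input_size is the number of features per time step
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden_size,
                            batch_first=True)
        # The LSTM emits hidden vectors, not labels; map them to class scores
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, x):
        out, _ = self.lstm(x)      # (N, L, hidden_size): one output per time step
        last = out[:, -1, :]       # (N, hidden_size): keep only the last time step
        return self.head(last)     # (N, n_classes)

torch.manual_seed(0)
x = torch.randn(8, 4, 4)           # 8 sequences, 4 time steps, 4 features (random stand-in data)
y = torch.randint(0, 2, (8,))      # one label per sequence

model = SequenceClassifier()
logits = model(x)
loss = nn.CrossEntropyLoss()(logits, y)   # loss uses only the last-step prediction
loss.backward()
print(logits.shape)
```

No flattening across time is needed: each CSV row becomes one time step, and a group of 4 rows becomes one sequence.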

Related

Why is there an activation function in each neural net layer, and not just one in the final layer?

I'm trying to teach myself machine learning and I have a similar question to this.
Is this correct:
For example, if I have an input matrix, where X1, X2 and X3 are three numerical features (e.g. say they are petal length, stem length, flower length, and I'm trying to label whether the sample is a particular flower species or not):
x1 x2 x3 label
5 1 2 yes
3 9 8 no
1 2 3 yes
9 9 9 no
That you take the vector of the first ROW (not column) of the table above to be inputted into the network like this:
i.e. there would be three neurons (1 for each value of the first table row), and then w1,w2 and w3 are randomly selected, then to calculate the first neuron in the next column, you do the multiplication I have described, and you add a randomly selected bias term. This gives the value of that node.
This is done for a set of nodes (i.e. each column will actually have four nodes (three + a bias); for simplicity, I removed the other three nodes from the second column), and then in the last node before the output, there is an activation function to transform the sum into a value (e.g. 0-1 for sigmoid), and that value tells you whether the classification is yes or no.
I'm sorry for how basic this is, I want to really understand the process, and I'm doing it from free resources. So therefore generally, you should select the number of nodes in your network to be a multiple of the number of features, e.g. in this case, it would make sense to write:
from keras.models import Sequential
from keras.layers import Dense
model = Sequential()
model.add(Dense(6,input_dim=3,activation='relu'))
model.add(Dense(6,input_dim=3,activation='relu'))
model.add(Dense(3,activation='softmax'))
What I don't understand is why the keras model has an activation function in each layer of the network and not just at the end, which is why I'm wondering if my understanding is correct/why I added the picture.
Edit 1: Just a note I saw that in the bias neuron, I put on the edge 'b=1', that might be confusing, I know the bias doesn't have a weight, so that was just a reminder to myself that the weight of the bias node is 1.
Several issues here apart from the question in your title, but since this is not the time & place for full tutorials, I'll limit the discussion to some of your points, taking also into account that at least one more answer already exists.
So therefore generally, you should select the number of nodes in your network to be a multiple of the number of features,
No.
The number of features is passed in the input_dim argument, which is set only for the first layer of the model; the number of inputs for every layer except the first one is simply the number of outputs of the previous one. The Keras model you have written is not valid, and it will produce an error, since for your 2nd layer you ask for input_dim=3, while the previous one has clearly 6 outputs (nodes).
Beyond this input_dim argument, there is no other relationship whatsoever between the number of data features and the number of network nodes; and since it seems you have in mind the iris data (4 features), here is a simple reproducible example of applying a Keras model to them.
What is somewhat hidden in the Keras sequential API (which you use here) is that there is in fact an implicit input layer, and the number of its nodes is the dimensionality of the input; see my own answer in Keras Sequential model input layer for details.
So, the model you have drawn in your pad actually corresponds to the following Keras model written using the sequential API:
model = Sequential()
model.add(Dense(1,input_dim=3,activation='linear'))
where in the functional API it would be written as:
from keras.layers import Input, Dense
from keras.models import Model
inputs = Input(shape=(3,))
outputs = Dense(1, activation='linear')(inputs)
model = Model(inputs, outputs)
and that's all, i.e. it is actually just linear regression.
I know the bias doesn't have a weight
The bias does have a weight. Again, the useful analogy is with the constant term of linear (or logistic) regression: the bias "input" itself is always 1, and its corresponding coefficient (weight) is learned through the fitting process.
why the keras model has an activation function in each layer of the network and not just at the end
I trust this has been covered sufficiently in the other answer.
I'm sorry for how basic this is, I want to really understand the process, and I'm doing it from free resources.
We all did; no excuse though to not benefit from Andrew Ng's free & excellent Machine Learning MOOC at Coursera.
It seems your question is why there is an activation function for each layer instead of just the last layer. The simple answer is: if there are no non-linear activations in the middle, then no matter how deep your network is, it can be boiled down to a single linear equation. Therefore, non-linear activations are one of the big enablers that allow deep networks to actually be "deep" and learn high-level features.
Take the following example: say you have a 3-layer neural network without any non-linear activations in the middle, but with a final softmax layer. The weights and biases for these layers are (W1, b1), (W2, b2) and (W3, b3). Then you can write the network's final output as follows.
h1 = W1.x + b1
h2 = W2.h1 + b2
h3 = Softmax(W3.h2 + b3)
Let's do some manipulations. We'll simply replace h3 as a function of x,
h3 = Softmax(W3.(W2.(W1.x + b1) + b2) + b3)
h3 = Softmax((W3.W2.W1) x + (W3.W2.b1 + W3.b2 + b3))
In other words, h3 is in the following format.
h3 = Softmax(W.x + b)
So, without the non-linear activations, our 3-layer network has been squashed into a single-layer network. That is why non-linear activations are important.
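The collapse above can be checked numerically. This is a small sketch with NumPy; the layer sizes (3 inputs, hidden widths 5 and 4, 2 outputs) and the random weights are arbitrary values I picked for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative layer sizes: 3 -> 5 -> 4 -> 2
W1, b1 = rng.normal(size=(5, 3)), rng.normal(size=5)
W2, b2 = rng.normal(size=(4, 5)), rng.normal(size=4)
W3, b3 = rng.normal(size=(2, 4)), rng.normal(size=2)

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

x = rng.normal(size=3)

# Three stacked layers with no activation in between
h1 = W1 @ x + b1
h2 = W2 @ h1 + b2
deep = softmax(W3 @ h2 + b3)

# The equivalent single layer: W = W3.W2.W1, b = W3.W2.b1 + W3.b2 + b3
W = W3 @ W2 @ W1
b = W3 @ W2 @ b1 + W3 @ b2 + b3
shallow = softmax(W @ x + b)

print(np.allclose(deep, shallow))
```

Putting a non-linearity such as ReLU between the layers breaks this equivalence, which is exactly what gives the extra layers something to add.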
Imagine you have an activation only in the last layer (in your case, sigmoid; it can be something else too, say softmax). The purpose of this is to convert real values to a 0 to 1 range for a classification sort of answer. But the activation in the inner layers (hidden layers) has a different purpose altogether: to introduce nonlinearity. Without the activation (say ReLU, tanh, etc.), what you get is a linear function. And however many hidden layers you have, you still end up with a linear function. Finally, you convert this into a nonlinear function in the last layer. This might work for some simple nonlinear problems, but it will not be able to capture a complex nonlinear function.
Each hidden unit (in each layer) includes an activation function to introduce nonlinearity.

What is Sequence length in LSTM?

The dimensions for the input data for LSTM are [Batch Size, Sequence Length, Input Dimension] in tensorflow.
What is the meaning of Sequence Length & Input Dimension ?
How do we assign the values to them if my input data is of the form :
[[[1.23] [2.24] [5.68] [9.54] [6.90] [7.74] [3.26]]] ?
LSTMs are a subclass of recurrent neural networks. Recurrent neural nets are by definition applied to sequential data, which without loss of generality means data samples that change over a time axis. A full history of a data sample is then described by the sample values over a finite time window, i.e. if your data live in an N-dimensional space and evolve over t time steps, your input representation must be of shape (num_samples, t, N).
Your data does not fit the above description. I assume, however, that this representation means you have a scalar value x which evolves over 7 time instances, such that x[0] = 1.23, x[1] = 2.24, etc.
If that is the case, you need to reshape your input such that instead of a list of 7 elements, you have an array of shape (7, 1). Then, your full data can be described by a 3rd-order tensor of shape (num_samples, 7, 1), which can be accepted by an LSTM.
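As a quick illustration of that reshape, here is a sketch with NumPy, assuming the seven values from the question form a single sample of one scalar feature:

```python
import numpy as np

# The seven scalar observations from the question, one per time step
x = np.array([1.23, 2.24, 5.68, 9.54, 6.90, 7.74, 3.26])

sample = x.reshape(7, 1)          # (sequence_length, input_dim) = (7, 1)
batch = sample[np.newaxis, ...]   # add the batch axis: (num_samples, 7, 1)

print(batch.shape)
```

So here the sequence length is 7 and the input dimension is 1.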
Simply put, seq_len is the number of time steps that will be fed into the LSTM network. Let's understand this by example.
Suppose you are doing sentiment classification using an LSTM.
Your input sentence to the network is ["I hate to eat apples"]. Every token is fed as input at one time step, so here seq_len is the total number of tokens in the sentence, that is, 5.
Coming to input_dim: as you might know, we can't directly feed words to the network; you need to encode those words as numbers. In PyTorch/TensorFlow, embedding layers are used for this, where we have to specify an embedding dimension.
Suppose your embedding dimension is 50; that means the embedding layer takes the index of each token and converts it into a vector representation of size 50. So the input dimension of the LSTM network becomes 50.
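To make the shapes concrete, here is a rough NumPy sketch of the lookup an embedding layer performs. The tiny vocabulary and the random embedding table are stand-ins I made up for illustration; in a real model the table is a learned layer (e.g. nn.Embedding in PyTorch):

```python
import numpy as np

sentence = "I hate to eat apples".split()      # 5 tokens -> seq_len = 5

# Hypothetical vocabulary built from this one sentence only
vocab = {tok: i for i, tok in enumerate(sorted(set(sentence)))}
ids = np.array([vocab[tok] for tok in sentence])

embedding_dim = 50                             # becomes the LSTM's input_dim
rng = np.random.default_rng(0)
# Random stand-in for a learned embedding matrix: one row per vocabulary entry
embedding_table = rng.normal(size=(len(vocab), embedding_dim))

embedded = embedding_table[ids]                # (seq_len, input_dim) = (5, 50)
batch = embedded[np.newaxis]                   # (batch_size, seq_len, input_dim)

print(batch.shape)
```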

How to train a neural network in forward manner and using it in backward manner

I have a neural network with an input layer having 10 nodes, some hidden layers, and an output layer with only 1 node. I put a pattern in the input layer and, after some processing, it outputs a value in the output neuron, which is a number from 1 to 10. After training, this model is able to produce the output given the input pattern.
Now, my question is, if it is possible to calculate the inverse model: This means, that I provide a number from output side, (i.e. using output side as input) and then getting the random pattern from those 10 input neurons (i.e. using input as output side).
I want to do this because I will first train a network on basis of difficulty of pattern (input is the pattern and output is difficulty to understand the pattern). Then I want to feed the network with a number so it creates the random patterns on basis of difficulty.
I hope I understood your problem correctly, so I will summarize it in my own words: You have a given model, and want to determine the input which yields a given output.
Supposing this is correct, there is at least one way I know of to do this approximately. It is very easy to implement, but might take a while to calculate a value; there are probably better ways, but I am not sure. (I needed this technique some weeks ago for reinforcement learning and did not find anything better.) Let's assume that your model maps an input to an output. We now create a new model, which will later compute the inverse of the original model, so that it gives you an input which yields a specific output. To construct it, we add one plain Dense layer which has the same dimension m as the input, and connect this layer to the input of the original model. Next, you make all weights of the original model non-trainable (this is very important!).
Now we are set up to find an inverse value: assume you want to find an input corresponding to the output y (corresponding here means that it produces the output y, but it is not unique). You create a new input v which is the unity, i.e. the constant 1. Then you create an input-output data pair (v, y). Now you use any optimizer you wish to propagate this training pair through your network until the error converges to zero. Once this has happened, you can calculate the real input which gives the output y as follows: if the weights of the new input layer are called w and its bias b, the desired input u is u = w*1 + b (where 1 is the constant input v).
You might be asking why this equation holds, so let me try to answer: your model will try to learn the weights of your new input layer, so that the unity as an input creates the given output. As only the newly added input layer is trainable, only these weights will change. Therefore, each weight in this vector will represent the corresponding component of the desired input vector. By using an optimizer and minimizing the l^2 distance between the wanted output and the output of our inverse model, we will finally determine a set of weights which gives a good approximation of the input vector.
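Here is a rough sketch of the idea in plain NumPy, with hand-written gradients instead of a framework optimizer. Everything concrete in it is an assumption of mine for illustration: the frozen "model" is a small random tanh network with 10 inputs and 1 output, the target y is generated from a random input so that it is known to be reachable, and the learning rate and step count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
m, hidden = 10, 8

# Frozen stand-in model with random weights: these are never updated below
W1 = 0.3 * rng.normal(size=(hidden, m))
b1 = 0.1 * rng.normal(size=hidden)
w2 = rng.normal(size=hidden)
b2 = 0.0

def model(x):
    return w2 @ np.tanh(W1 @ x + b1) + b2

# Target output; produced from a random input so we know it is reachable
y_target = model(rng.normal(size=m))

# New trainable input layer fed with the constant 1: u = w*1 + b
w = np.zeros(m)
b = np.zeros(m)
lr = 0.01

for _ in range(5000):
    u = w * 1.0 + b
    h = np.tanh(W1 @ u + b1)
    err = (w2 @ h + b2) - y_target
    # Gradient of the squared error with respect to u (chain rule through tanh)
    grad_u = 2.0 * err * (W1.T @ (w2 * (1.0 - h ** 2)))
    w -= lr * grad_u * 1.0     # du/dw = 1 (the constant input)
    b -= lr * grad_u           # du/db = 1

u = w * 1.0 + b                # the recovered input
print(abs(model(u) - y_target))
```

The recovered u is generally not the input that generated y_target, since many inputs map to the same output; different starting values for w and b lead to different solutions.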

LSTM-RNN : How to shape multivariate Inputs

Hi everybody, I am struggling with the TensorFlow RNN implementation:
The problem:
I want to train an LSTM implementation of an RNN to detect malicious connections in the KDD99 dataset. It's a dataset with 41 features and (after some preprocessing) a label vector of size 5.
[
[x1, x2, x3, .....x40, x41],
...
[x1, x2, x3, .....x40, x41]
]
[
[0, 1, 0, 0, 0],
...
[0, 0, 1, 0, 0]
]
As a basic architecture I would like to implement the following:
cell = tf.nn.rnn_cell.LSTMCell(num_units=64, state_is_tuple=True)
cell = tf.nn.rnn_cell.DropoutWrapper(cell=cell, output_keep_prob=0.5)
cell = tf.nn.rnn_cell.MultiRNNCell(cells=[cell] * 3, state_is_tuple=True)
My question is: in order to feed it to the model, how would I need to reshape the input features?
Would I not just have to reshape the input features, but also build sliding-window sequences?
What I mean by that:
Assuming a sequence length of ten, the first sequence would contain data points 0 - 9, the second one data points 1 - 10, the third 2 - 11, and so on.
Thanks!
I do not know the dataset, but I think that your problem is the following: you have a very long sequence and you want to know how to shape this sequence in order to provide it to the network.
The 'tf.contrib.rnn.static_rnn' has the following signature:
tf.contrib.rnn.static_rnn(cell, inputs, initial_state=None, dtype=None, sequence_length=None, scope=None)
where
inputs: A length T list of inputs, each a Tensor of shape [batch_size, input_size], or a nested tuple of such elements.
So the inputs need to be shaped into lists, where each element of the list is the element of the input sequence at one time step.
The length of this list depends on your problem and/or on computational constraints.
In Natural Language Processing, for example, the length of this list can be the maximum sentence length of your document, where shorter sentences are padded to that length. As in this case, in many domains the length of the sequence is driven by the problem.
However, your problem may offer no such natural length, or the sequence may still be very long. Long sequences are very heavy from a computational point of view. The BPTT algorithm, used to optimize these models, "unfolds" the recurrent network into a very deep feedforward network with shared parameters and backpropagates over it. In these cases, it is still convenient to "cut" the sequence to a fixed length.
And here we arrive at your question: given this fixed length, let us say 10, how do I shape my input?
Usually, what is done is to cut the dataset into non-overlapping windows (in your example, we will have 0-9, 10-19, 20-29, etc.). What happens here is that the network only looks at the last 10 elements of the sequence each time it updates the weights with BPTT.
However, since the sequence has been arbitrarily cut, it is likely that predictions need to exploit evidence that is far back in the sequence, outside the current window. To do this, we initialize the state of the RNN at window i with the final state of window i-1, using the parameter:
initial_state: (optional) An initial state for the RNN.
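The window cutting described above can be sketched with NumPy as follows. The two helper functions and the toy array of 25 rows by 41 features are hypothetical, made up only to show the resulting shapes; the second helper builds the overlapping sliding windows the question asks about, for comparison:

```python
import numpy as np

def non_overlapping_windows(data, seq_len):
    """Cut (T, n_features) into (T // seq_len, seq_len, n_features); drops the remainder."""
    n = (len(data) // seq_len) * seq_len
    return data[:n].reshape(-1, seq_len, data.shape[1])

def sliding_windows(data, seq_len):
    """Overlapping windows with stride 1: rows 0-9, 1-10, 2-11, ..."""
    return np.stack([data[i:i + seq_len]
                     for i in range(len(data) - seq_len + 1)])

# Toy stand-in for the dataset: 25 time steps, 41 features each
data = np.arange(25 * 41, dtype=np.float32).reshape(25, 41)

blocks = non_overlapping_windows(data, 10)   # windows 0-9 and 10-19
slides = sliding_windows(data, 10)           # windows 0-9, 1-10, ..., 15-24

print(blocks.shape, slides.shape)
```

With non-overlapping windows, carrying the final state of one window into the next is what lets information flow across the cuts; with sliding windows every window is usually treated as an independent sample.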
Finally, I give you two sources to go into more details:
RNN Tutorial: This is the official TensorFlow tutorial. It is applied to the task of language modeling. At a certain point in the code, you will see that the final state is fed to the network from one run to the next, in order to implement what is described above.
feed_dict = {}
for i, (c, h) in enumerate(model.initial_state):
    feed_dict[c] = state[i].c
    feed_dict[h] = state[i].h
DevSummit 2017 This is a video of a talk during the Tensorflow DevSummit 2017 where, in the first section (Reading and Batching Sequence Data), it is explained how and using which functions you should shape your sequence inputs.
Hope this helps :)

Artificial Neural Network for formula classification/calculation

I am trying to create an ANN for calculating/classifying a/any formula.
I initially tried to replicate the Fibonacci sequence. I am using the inputs:
[1,2] output [3]
[2,3] output [5]
[3,5] output [8]
etc...
The issue I am trying to overcome is how to normalize the data that could be potentially infinite or scale exponentially? I then tried to create an ANN to calculate the slope-intercept formula y = mx+b (2x+2) with inputs
[1] output [4]
[2] output [6]
etc...
Again I do not know how to normalize the data. If I normalize only the training data how would the network be able to calculate or classify with inputs outside of what was used for normalization?
So would it be possible to create an ANN to calculate/classify the formula ((a+2b+c^2+3d-5e) modulo 2), where the formula is unknown, but (some of) the inputs a, b, c, d, and e are given, as well as the output? Essentially, this means classifying whether the calculation's output is odd or even, where the inputs range between -infinity and +infinity...
Okay, I think I understand what you're trying to do now. Basically, you are going to have a set of inputs representing the coefficients of a function. You want the ANN to tell you whether the function, with those coefficients, will produce an even or an odd output. Let me know if that's wrong. There are a few potential issues here:
First, while it is possible to use a neural network to do addition, it is not generally very efficient. You also need to set your ANN up in a very specific way, either by using a different node type than is usually used, or by setting up complicated recurrent topologies. This would explain your lack of success with the Fibonacci sequence and the line equation.
But there's a more fundamental problem. You might have heard that ANNs are general function approximators. However, in this case, the function that the ANN is learning won't be your formula. When you have an ANN that is learning to output either 0 or 1 in response to a set of inputs, it's actually trying to learn a function for a line (or set of lines, or hyperplane, depending on the topology) that separates all of the inputs for which the output should be 0 from all of the inputs for which the output should be 1. (see the answers to this question for a more thorough explanation, with pictures). So the question, then, is whether or not there is a hyperplane that separates coefficients that will result in an even output from coefficients that will result in an odd output.
I'm inclined to say that the answer to that question is no. If you consider the a coefficient in your example, for instance, you will see that every time you increment or decrement it by 1, the correct output switches. The same is true for the c, d, and e terms. This means that there aren't big clumps of relatively similar inputs that all return the same output.
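A small pure-Python check of this flipping behaviour, assuming integer inputs (the function name parity and the sample values are just for illustration):

```python
def parity(a, b, c, d, e):
    """((a + 2b + c^2 + 3d - 5e) modulo 2) for integer inputs."""
    return (a + 2 * b + c ** 2 + 3 * d - 5 * e) % 2

base = dict(a=4, b=7, c=-3, d=2, e=9)   # arbitrary starting point

for name in "abcde":
    bumped = dict(base)
    bumped[name] += 1                   # move one step along a single input
    flipped = parity(**bumped) != parity(**base)
    print(name, "flips the label" if flipped else "leaves the label unchanged")
```

Stepping a, c, d, or e by 1 changes the label every time, while b never matters because 2b is always even. Neighbouring inputs therefore almost always belong to opposite classes, which is why no single separating hyperplane exists.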
Why do you need to know whether the output of an unknown function is even or odd? There might be other, more appropriate techniques.
