LSTM connections between cells/units (not timesteps) - machine-learning

My question is regarding how an LSTM layer is built, for example in keras:
keras.layers.LSTM(units,... other options)
Are these units individual cells or the dimensions of the cell state?
I've read conflicting comments on the subject. Could someone clarify whether the LSTM units (or blocks) are distinct cells interconnected with a delay of one timestep, or whether an LSTM layer is just a single cell whose cell state has 'units' dimensions?
I've made 3 diagrams. The first is the standard LSTM cell as it is usually drawn (feel free to check it for errors); the other two show, as far as I understand them, the possible interpretations of the 'many cell' layer.
Diagram 1: the standard LSTM cell
Diagram 2: each cell connected to the next in the layer
Diagram 3: all cells connected to each other?

Units are the number of cells in your LSTM layer.
model.add(LSTM(32))
This adds an LSTM layer with 32 LSTM cells, connected to the previous and next layers. It results in an output shape of (batch_size, 32), since units also sets the dimensionality of the output (when return_sequences is False).
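As a minimal sketch (assuming TensorFlow 2.x Keras and a made-up input of 10 timesteps with 8 features each), you can check both output shapes directly:

import tensorflow as tf

inputs = tf.keras.Input(shape=(10, 8))             # 10 timesteps, 8 features per step
last_only = tf.keras.layers.LSTM(32)(inputs)       # return_sequences=False (default)
per_step = tf.keras.layers.LSTM(32, return_sequences=True)(inputs)
print(last_only.shape)   # (None, 32): one 32-dim vector per sequence
print(per_step.shape)    # (None, 10, 32): one 32-dim vector per timestep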

Related

What are the connections between two stacked LSTM layers?

This question is similar to What's the input of each LSTM layer in a stacked LSTM network?, but goes more into implementation details.
For simplicity, consider a structure of 4 units followed by 2 units, like the following:
model.add(LSTM(4, input_shape=input_shape, return_sequences=True))
model.add(LSTM(2, input_shape=input_shape))
So I know the output of LSTM_1 is of length 4, but how do the next 2 units handle these 4 inputs? Are they fully connected to the next layer of nodes?
I guess they are fully connected, like in the following figure, but I'm not sure; it is not stated in the Keras documentation.
Thanks!
It's not length 4, it's 4 "features".
The length comes from the input shape and never changes. There is absolutely no difference between feeding a regular input to an LSTM and feeding the output of one LSTM to another LSTM.
You can just look at the model's summary to see the shapes and understand what is going on. You never change the length using LSTMs.
They don't communicate at all. Each one takes the length dimension and processes it recurrently, independently of the other. When one finishes and outputs a tensor, the next one receives that tensor and processes it alone, following the same rules.
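As a sketch of the 4 -> 2 stack above (assuming TensorFlow 2.x Keras and a made-up input of 10 timesteps with 3 features), the summary shows that the length is preserved while the feature dimension changes:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10, 3)),                   # length 10, 3 features per step
    tf.keras.layers.LSTM(4, return_sequences=True),  # output: (None, 10, 4)
    tf.keras.layers.LSTM(2),                         # output: (None, 2)
])
model.summary()

Within each timestep, the 4 features do enter the second LSTM's gates through dense weight matrices, so in that limited sense the connection is "fully connected"; but there is no communication between the layers beyond passing the output tensor.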

What's the difference between LSTM() and LSTMCell()?

I've checked the source code for both, and it seems that LSTM() builds the recurrent layer as a whole, while LSTMCell() only returns one cell.
However, in most cases people only use one LSTM cell in their program. Does this mean that when you have only one LSTM cell (e.g., in a simple Seq2Seq model), calling LSTMCell() and LSTM() makes no difference?
LSTM is a recurrent layer.
LSTMCell is an object (which happens to be a layer too) used by the LSTM layer; it contains the calculation logic for one step.
A recurrent layer contains a cell object. The cell contains the core code for the calculations of each step, while the recurrent layer commands the cell and performs the actual recurrent calculations.
Usually, people use LSTM layers in their code.
Or they use RNN layers containing LSTMCell.
Both are almost the same thing: an LSTM layer is an RNN layer using an LSTMCell, as you can check in the source code.
About the number of cells: although its name suggests that LSTMCell is a single cell, it is actually an object that manages all the units/cells we may think of. In the same source code, you can see that the units argument is used when creating an instance of LSTMCell.
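As a minimal sketch (assuming TensorFlow 2.x Keras), the two spellings build the same computation:

import tensorflow as tf

inputs = tf.keras.Input(shape=(10, 8))

out_lstm = tf.keras.layers.LSTM(32)(inputs)                           # recurrent layer
out_cell = tf.keras.layers.RNN(tf.keras.layers.LSTMCell(32))(inputs)  # generic RNN loop driving the cell
print(out_lstm.shape, out_cell.shape)   # both (None, 32)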

Scikit learn multilayer neural network

As per the documentation provided by scikit-learn:
hidden_layer_sizes : tuple, length = n_layers - 2, default (100,)
I have a small doubt.
In my code what I have configured is
MLPClassifier(algorithm='l-bfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
So what do the 5 and 2 indicate?
What I understand is that 5 is the number of hidden layers, but then what is 2?
Ref - http://scikit-learn.org/dev/modules/generated/sklearn.neural_network.MLPClassifier.html#
From the link you provided, in the parameters table, the hidden_layer_sizes row says:
The ith element represents the number of neurons in the ith hidden layer.
This means that you will have len(hidden_layer_sizes) hidden layers, and each hidden layer i will have hidden_layer_sizes[i] neurons.
In your case, (5, 2) means:
the 1st hidden layer has 5 neurons
the 2nd hidden layer has 2 neurons
So the number of hidden layers is set implicitly.
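You can verify this by inspecting the fitted weight matrices. A small sketch (note that newer scikit-learn versions renamed the algorithm= keyword from the question to solver=; the toy data here is made up):

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=100, n_features=10, random_state=1)
clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(5, 2), random_state=1)
clf.fit(X, y)
# One weight matrix per pair of consecutive layers:
print([w.shape for w in clf.coefs_])   # [(10, 5), (5, 2), (2, 1)]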
Some details that I found online concerning the architecture and the units of the input, hidden, and output layers in sklearn:
The number of input units will be the number of features.
For multiclass classification, the number of output units will be the number of labels.
Try a single hidden layer first; if you use more than one, give each hidden layer the same number of units.
More units in a hidden layer tend to help; try starting from the number of input features and going up to twice, or even three or four times, that (see the sketch below).
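For illustration only (the feature count here is made up), those rules of thumb translate to configurations like:

from sklearn.neural_network import MLPClassifier

n_features = 20                                                  # hypothetical feature count
one_layer = MLPClassifier(hidden_layer_sizes=(2 * n_features,))  # single layer, 1x-4x the features
two_equal = MLPClassifier(hidden_layer_sizes=(40, 40))           # multiple equal-sized layers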

Neural Networks (input and output layers)

When dealing with multiclass classification, is it always the case that the number of nodes (which are vectors) in the input layer, excluding bias, is the same as the number of nodes in the output layer?
No. The input layer ingests the features, and the output layer makes predictions for classes. The number of features and the number of classes do not need to be the same; it also depends on how exactly you model the multi-class output.
Lars Kotthoff is right. However, when you are using an artificial neural network to build an autoencoder, you will want to have the same number of input and output nodes, and you will want the output nodes to learn the values of the input nodes.
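A minimal autoencoder sketch (assuming TensorFlow 2.x Keras; the sizes are made up) shows this symmetric shape, with the input doubling as the training target:

import tensorflow as tf

n_features = 30                                     # hypothetical input size
autoencoder = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(8, activation='relu'),    # compressed representation
    tf.keras.layers.Dense(n_features),              # output size == input size
])
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(X, X, ...)                        # the inputs are their own targets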
Nope.
Usually the number of input units equals the number of features you are going to use for training the NN classifier.
The size of the output layer equals the number of classes in the dataset. Moreover, if the dataset has only two classes, a single output unit is enough to discriminate between them.
The ANN output layer has a node for each class: if you have 3 classes, you use 3 nodes. The input layer (often called a feature vector) has a node for each feature used for prediction, and usually an extra bias node. You usually need only 1 hidden layer, and discerning its ideal size is tricky.
Having too many hidden layer nodes can result in overfitting and slow training. Having too few hidden layer nodes can result in underfitting (overgeneralizing).
Here are a few general guidelines (source) to start with:
The number of hidden neurons should be between the size of the input layer and the size of the output layer.
The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer.
The number of hidden neurons should be less than twice the size of the input layer.
If you have 3 classes and an input vector of 30 features, you can start with a hidden layer of around 23 nodes. Add and remove nodes from this layer during training to reduce your error, while testing against validation data to prevent overfitting.
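Plugging the 30-feature, 3-class example into those guidelines (simple arithmetic, shown here for clarity):

n_in, n_out = 30, 3
print((n_in + n_out) / 2)      # 16.5 -> between the input and output sizes
print(2 * n_in / 3 + n_out)    # 23.0 -> 2/3 of the input size plus the output size
print(2 * n_in)                # 60   -> stay below twice the input size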

In 3 layer MLP, why should the input to hidden weights be random?

For example, for a 3-1-1 network, if the weights are initialized equally, the MLP might not learn well. But why does this happen?
If you only have one neuron in the hidden layer, it doesn't matter. But imagine a network with two neurons in the hidden layer. If they have the same weights for their inputs, then both neurons will always have exactly the same activation, so there is no additional information from having a second neuron. And in the backpropagation step, those weights change by an equal amount. Hence, in every iteration, those hidden neurons have the same activation.
It looks like you have a typo in your question title; I'm guessing you mean to ask why the weights of the hidden layer should be random. For the example network you give (3-1-1), it won't matter, because you only have a single unit in the hidden layer. However, if you had multiple units in the hidden layer of a fully connected network (e.g., 3-2-1), you should randomize the weights, because otherwise all of the weights to the hidden layer would be updated identically. That is not what you want, because each hidden layer unit would produce the same hyperplane, which is no different from just having a single unit in that layer.
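A small numeric sketch of that symmetry (plain NumPy with made-up data, a 3-2-1 network, and a squared-error loss): both hidden neurons start out identical and stay identical after any number of gradient steps.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))            # 5 samples, 3 input features
y = rng.normal(size=(5, 1))

W1 = np.full((3, 2), 0.5)              # both hidden neurons get equal weights
W2 = np.full((2, 1), 0.5)
for _ in range(10):
    h = np.tanh(x @ W1)                # hidden activations: identical columns
    err = h @ W2 - y
    dW1 = x.T @ ((err @ W2.T) * (1 - h ** 2))   # backprop through tanh
    dW2 = h.T @ err
    W1 -= 0.01 * dW1
    W2 -= 0.01 * dW2

print(W1[:, 0] == W1[:, 1])            # [ True  True  True ]: still identical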
