Dimensions of LSTM variant in Deep Mind's Differentiable Neural Computer (DNC) - machine-learning

I'm trying to implement Deep Mind's DNC - Nature paper- with PyTorch 0.4.0.
When implementing the variant of LSTM they used I encountered some troubles with dimensions.
To simplify suppose BATCH=1.
The equations they list in the paper are these:
where [x;h] means a concatenation of x and h into one single vector, and i, f and o are column vectors.
My question is about how the state s_t is computed.
The second addendum is obtained by multiplying i with a column vector and so the result is either a scalar (transpose i first, then do scalar product) or wrong (two column vectors multiplied).
So the state results in a single scalar...
With the same reasoning the hidden state h_t is a scalar too, but it has to be a column vector.
Obviously I'm wrong somewhere, but I can't figure out where.

By looking at Wikipedia LSTM Article I think I figured it out.
This is the formal implementation of standard LSTM found in the article:
The circle represents element-by-element product.
By using this product in the corresponding parts of DNC equations (s_t and o_t) the dimensions work.

Related

Transforming Features to increase similarity

I have a large dataset (~20,000 samples x 2,000 features-- each sample w/ a corresponding y-value) that I'm constructing a regression ML model for.
The input vectors are bitvectors with either 1s or 0s at each position.
Interestingly, I have noticed that when I 'randomly' select N samples such that their y-values are between two arbitrary values A and B (such that B-A is much smaller than the total range of values in y), the subsequent model is much better at predicting other values with the A-->B range not used in the training of the model.
However, the overall similarity of the input X vectors for these values are in no way more similar than any random selection of X values across the whole dataset.
Is there an available method to transform the input X-vectors such that those with more similar y-values are "closer" (I'm not particular the methodology, but it could be something like cosine similarity), and those with not similar y-values are separated?
After more thought, I believe this question can be re-framed as a supervised clustering problem. What might be able to accomplish this might be as simple as:
import umap
print(df.shape)
>> (23,312, 2149)
print(len(target))
>> 23,312
embedding = umap.UMAP().fit_transform(df, y=target)

Doc2Vec : Paragraph matrix (D) in the structure of the PV-DBOW model

I am confused about the meaning of paragraph matrix (D) in the structure of the PV-DBOW model (Doc2Vec).
Is the paragraph matrix the result of one-hot encoding of n input paragraph IDs?
Or is the paragraph matrix a randomly initialized weight to generate the shape(nxp), where n is the number of paragraph ID inputs and p is the vector dimension?
I've never found the original paper's diagrams very clear or helpful. (And, no one has been able to reproduce their claimed results in the 'concatenation' mode.)
But I can say, from familiarity with code implementations:
At the outset, every paragraph-ID gets a randomly-initialized vector. Thus, there is a matrix in the model with all of these vectors, of shape number_of_paragraph_ids x number_of_dimensions
For backprop-training in PV-DBOW, each individual paragraph-vector (one row from the above matrix) is adjusted to better predict the matching-paragraph's individual constituent words.
While figuratively, that's sort of a 1-hot selection of a single paragraph choice, in the code it's just a lookup of the single correct row using a paragraph-ID key.

Why is there an activation function in each neural net layer, and not just one in the final layer?

I'm trying to teach myself machine learning and I have a similar question to this.
Is this correct:
For example, if I have an input matrix, where X1, X2 and X3 are three numerical features (e.g. say they are petal length, stem length, flower length, and I'm trying to label whether the sample is a particular flower species or not):
x1 x2 x3 label
5 1 2 yes
3 9 8 no
1 2 3 yes
9 9 9 no
That you take the vector of the first ROW (not column) of the table above to be inputted into the network like this:
i.e. there would be three neurons (1 for each value of the first table row), and then w1,w2 and w3 are randomly selected, then to calculate the first neuron in the next column, you do the multiplication I have described, and you add a randomly selected bias term. This gives the value of that node.
This is done for a set of nodes (i.e. each column actually will have four nodes (three + a bias), for simplicity, i removed the other three nodes from the second column), and then in the last node before the output, there is an activation function to transform the sum into a value (e.g. 0-1 for sigmoid) and that value tells you whether the classification is yes or no.
I'm sorry for how basic this is, I want to really understand the process, and I'm doing it from free resources. So therefore generally, you should select the number of nodes in your network to be a multiple of the number of features, e.g. in this case, it would make sense to write:
from keras.models import Sequential
from keras.models import Dense
model = Sequential()
model.add(Dense(6,input_dim=3,activation='relu'))
model.add(Dense(6,input_dim=3,activation='relu'))
model.add(Dense(3,activation='softmax'))
What I don't understand is why the keras model has an activation function in each layer of the network and not just at the end, which is why I'm wondering if my understanding is correct/why I added the picture.
Edit 1: Just a note I saw that in the bias neuron, I put on the edge 'b=1', that might be confusing, I know the bias doesn't have a weight, so that was just a reminder to myself that the weight of the bias node is 1.
Several issues here apart from the question in your title, but since this is not the time & place for full tutorials, I'll limit the discussion to some of your points, taking also into account that at least one more answer already exists.
So therefore generally, you should select the number of nodes in your network to be a multiple of the number of features,
No.
The number of features is passed in the input_dim argument, which is set only for the first layer of the model; the number of inputs for every layer except the first one is simply the number of outputs of the previous one. The Keras model you have written is not valid, and it will produce an error, since for your 2nd layer you ask for input_dim=3, while the previous one has clearly 6 outputs (nodes).
Beyond this input_dim argument, there is no other relationship whatsoever between the number of data features and the number of network nodes; and since it seems you have in mind the iris data (4 features), here is a simple reproducible example of applying a Keras model to them.
What is somewhat hidden in the Keras sequential API (which you use here) is that there is in fact an implicit input layer, and the number of its nodes is the dimensionality of the input; see own answer in Keras Sequential model input layer for details.
So, the model you have drawn in your pad actually corresponds to the following Keras model written using the sequential API:
model = Sequential()
model.add(Dense(1,input_dim=3,activation='linear'))
where in the functional API it would be written as:
inputs = Input(shape=(3,))
outputs = Dense(1, activation='linear')(inputs)
model = Model(inputs, outputs)
and that's all, i.e. it is actually just linear regression.
I know the bias doesn't have a weight
The bias does have a weight. Again, the useful analogy is with the constant term of linear (or logistic) regression: the bias "input" itself is always 1, and its corresponding coefficient (weight) is learned through the fitting process.
why the keras model has an activation function in each layer of the network and not just at the end
I trust this has been covered sufficiently in the other answer.
I'm sorry for how basic this is, I want to really understand the process, and I'm doing it from free resources.
We all did; no excuse though to not benefit from Andrew Ng's free & excellent Machine Learning MOOC at Coursera.
It seems your question is why there is a activation function for each layer instead of just the last layer. The simple answer is, if there are no non-linear activations in the middle, no matter how deep your network is, it can be boiled down to a single linear equation. Therefore, non-linear activation is one of the big enablers that enable deep networks to be actually "deep" and learn high-level features.
Take the following example, say you have 3 layer neural network without any non-linear activations in the middle, but a final softmax layer. The weights and biases for these layers are (W1, b1), (W2, b2) and (W3, b3). Then you can write the network's final output as follows.
h1 = W1.x + b1
h2 = W2.h1 + b2
h3 = Softmax(W3.h2 + b3)
Let's do some manipulations. We'll simply replace h3 as a function of x,
h3 = Softmax(W3.(W2.(W1.x + b1) + b2) + b3)
h3 = Softmax((W3.W2.W1) x + (W3.W2.b1 + W3.b2 + b3))
In other words, h3 is in the following format.
h3 = Softmax(W.x + b)
So, without the non-linear activations, our 3-layer networks has been squashed to a single layer network. That's is why non-linear activations are important.
Imagine, you have an activation layer only in the last layer (In your case, sigmoid. It can be something else too.. say softmax). The purpose of this is to convert real values to a 0 to 1 range for a classification sort of answer. But, the activation in the inner layers (hidden layers) has a different purpose altogether. This is to introduce nonlinearity. Without the activation (say ReLu, tanh etc.), what you get is a linear function. And how many ever, hidden layers you have, you still end up with a linear function. And finally, you convert this into a nonlinear function in the last layer. This might work in some simple nonlinear problems, but will not be able to capture a complex nonlinear function.
Each hidden unit (in each layer) comprises of activation function to incorporate nonlinearity.

Translating a TensorFlow LSTM into synapticjs

I'm working on implementing an interface between a TensorFlow basic LSTM that's already been trained and a javascript version that can be run in the browser. The problem is that in all of the literature that I've read LSTMs are modeled as mini-networks (using only connections, nodes and gates) and TensorFlow seems to have a lot more going on.
The two questions that I have are:
Can the TensorFlow model be easily translated into a more conventional neural network structure?
Is there a practical way to map the trainable variables that TensorFlow gives you to this structure?
I can get the 'trainable variables' out of TensorFlow, the issue is that they appear to only have one value for bias per LSTM node, where most of the models I've seen would include several biases for the memory cell, the inputs and the output.
Internally, the LSTMCell class stores the LSTM weights as a one big matrix instead of 8 smaller ones for efficiency purposes. It is quite easy to divide it horizontally and vertically to get to the more conventional representation. However, it might be easier and more efficient if your library does the similar optimization.
Here is the relevant piece of code of the BasicLSTMCell:
concat = linear([inputs, h], 4 * self._num_units, True)
# i = input_gate, j = new_input, f = forget_gate, o = output_gate
i, j, f, o = array_ops.split(1, 4, concat)
The linear function does the matrix multiplication to transform the concatenated input and the previous h state into 4 matrices of [batch_size, self._num_units] shape. The linear transformation uses a single matrix and bias variables that you're referring to in the question. The result is then split into different gates used by the LSTM transformation.
If you'd like to explicitly get the transformations for each gate, you can split that matrix and bias into 4 blocks. It is also quite easy to implement it from scratch using 4 or 8 linear transformations.

What is the role of the bias in neural networks? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
Improve this question
I'm aware of the gradient descent and the back-propagation algorithm. What I don't get is: when is using a bias important and how do you use it?
For example, when mapping the AND function, when I use two inputs and one output, it does not give the correct weights. However, when I use three inputs (one of which is a bias), it gives the correct weights.
I think that biases are almost always helpful. In effect, a bias value allows you to shift the activation function to the left or right, which may be critical for successful learning.
It might help to look at a simple example. Consider this 1-input, 1-output network that has no bias:
The output of the network is computed by multiplying the input (x) by the weight (w0) and passing the result through some kind of activation function (e.g. a sigmoid function.)
Here is the function that this network computes, for various values of w0:
Changing the weight w0 essentially changes the "steepness" of the sigmoid. That's useful, but what if you wanted the network to output 0 when x is 2? Just changing the steepness of the sigmoid won't really work -- you want to be able to shift the entire curve to the right.
That's exactly what the bias allows you to do. If we add a bias to that network, like so:
...then the output of the network becomes sig(w0*x + w1*1.0). Here is what the output of the network looks like for various values of w1:
Having a weight of -5 for w1 shifts the curve to the right, which allows us to have a network that outputs 0 when x is 2.
A simpler way to understand what the bias is: it is somehow similar to the constant b of a linear function
y = ax + b
It allows you to move the line up and down to fit the prediction with the data better.
Without b, the line always goes through the origin (0, 0) and you may get a poorer fit.
Here are some further illustrations showing the result of a simple 2-layer feed forward neural network with and without bias units on a two-variable regression problem. Weights are initialized randomly and standard ReLU activation is used. As the answers before me concluded, without the bias the ReLU-network is not able to deviate from zero at (0,0).
Two different kinds of parameters can
be adjusted during the training of an
ANN, the weights and the value in the
activation functions. This is
impractical and it would be easier if
only one of the parameters should be
adjusted. To cope with this problem a
bias neuron is invented. The bias
neuron lies in one layer, is connected
to all the neurons in the next layer,
but none in the previous layer and it
always emits 1. Since the bias neuron
emits 1 the weights, connected to the
bias neuron, are added directly to the
combined sum of the other weights
(equation 2.1), just like the t value
in the activation functions.1
The reason it's impractical is because you're simultaneously adjusting the weight and the value, so any change to the weight can neutralize the change to the value that was useful for a previous data instance... adding a bias neuron without a changing value allows you to control the behavior of the layer.
Furthermore the bias allows you to use a single neural net to represent similar cases. Consider the AND boolean function represented by the following neural network:
(source: aihorizon.com)
w0 corresponds to b.
w1 corresponds to x1.
w2 corresponds to x2.
A single perceptron can be used to
represent many boolean functions.
For example, if we assume boolean values
of 1 (true) and -1 (false), then one
way to use a two-input perceptron to
implement the AND function is to set
the weights w0 = -3, and w1 = w2 = .5.
This perceptron can be made to
represent the OR function instead by
altering the threshold to w0 = -.3. In
fact, AND and OR can be viewed as
special cases of m-of-n functions:
that is, functions where at least m of
the n inputs to the perceptron must be
true. The OR function corresponds to
m = 1 and the AND function to m = n.
Any m-of-n function is easily
represented using a perceptron by
setting all input weights to the same
value (e.g., 0.5) and then setting the
threshold w0 accordingly.
Perceptrons can represent all of the
primitive boolean functions AND, OR,
NAND ( 1 AND), and NOR ( 1 OR). Machine Learning- Tom Mitchell)
The threshold is the bias and w0 is the weight associated with the bias/threshold neuron.
The bias is not an NN term. It's a generic algebra term to consider.
Y = M*X + C (straight line equation)
Now if C(Bias) = 0 then, the line will always pass through the origin, i.e. (0,0), and depends on only one parameter, i.e. M, which is the slope so we have less things to play with.
C, which is the bias takes any number and has the activity to shift the graph, and hence able to represent more complex situations.
In a logistic regression, the expected value of the target is transformed by a link function to restrict its value to the unit interval. In this way, model predictions can be viewed as primary outcome probabilities as shown:
Sigmoid function on Wikipedia
This is the final activation layer in the NN map that turns on and off the neuron. Here also bias has a role to play and it shifts the curve flexibly to help us map the model.
A layer in a neural network without a bias is nothing more than the multiplication of an input vector with a matrix. (The output vector might be passed through a sigmoid function for normalisation and for use in multi-layered ANN afterwards, but that’s not important.)
This means that you’re using a linear function and thus an input of all zeros will always be mapped to an output of all zeros. This might be a reasonable solution for some systems but in general it is too restrictive.
Using a bias, you’re effectively adding another dimension to your input space, which always takes the value one, so you’re avoiding an input vector of all zeros. You don’t lose any generality by this because your trained weight matrix needs not be surjective, so it still can map to all values previously possible.
2D ANN:
For a ANN mapping two dimensions to one dimension, as in reproducing the AND or the OR (or XOR) functions, you can think of a neuronal network as doing the following:
On the 2D plane mark all positions of input vectors. So, for boolean values, you’d want to mark (-1,-1), (1,1), (-1,1), (1,-1). What your ANN now does is drawing a straight line on the 2d plane, separating the positive output from the negative output values.
Without bias, this straight line has to go through zero, whereas with bias, you’re free to put it anywhere.
So, you’ll see that without bias you’re facing a problem with the AND function, since you can’t put both (1,-1) and (-1,1) to the negative side. (They are not allowed to be on the line.) The problem is equal for the OR function. With a bias, however, it’s easy to draw the line.
Note that the XOR function in that situation can’t be solved even with bias.
When you use ANNs, you rarely know about the internals of the systems you want to learn. Some things cannot be learned without a bias. E.g., have a look at the following data: (0, 1), (1, 1), (2, 1), basically a function that maps any x to 1.
If you have a one layered network (or a linear mapping), you cannot find a solution. However, if you have a bias it's trivial!
In an ideal setting, a bias could also map all points to the mean of the target points and let the hidden neurons model the differences from that point.
Modification of neuron WEIGHTS alone only serves to manipulate the shape/curvature of your transfer function, and not its equilibrium/zero crossing point.
The introduction of bias neurons allows you to shift the transfer function curve horizontally (left/right) along the input axis while leaving the shape/curvature unaltered.
This will allow the network to produce arbitrary outputs different from the defaults and hence you can customize/shift the input-to-output mapping to suit your particular needs.
See here for graphical explanation:
http://www.heatonresearch.com/wiki/Bias
In a couple of experiments in my masters thesis (e.g. page 59), I found that the bias might be important for the first layer(s), but especially at the fully connected layers at the end it seems not to play a big role.
This might be highly dependent on the network architecture / dataset.
If you're working with images, you might actually prefer to not use a bias at all. In theory, that way your network will be more independent of data magnitude, as in whether the picture is dark, or bright and vivid. And the net is going to learn to do it's job through studying relativity inside your data. Lots of modern neural networks utilize this.
For other data having biases might be critical. It depends on what type of data you're dealing with. If your information is magnitude-invariant --- if inputting [1,0,0.1] should lead to the same result as if inputting [100,0,10], you might be better off without a bias.
Bias determines how much angle your weight will rotate.
In a two-dimensional chart, weight and bias can help us to find the decision boundary of outputs.
Say we need to build a AND function, the input(p)-output(t) pair should be
{p=[0,0], t=0},{p=[1,0], t=0},{p=[0,1], t=0},{p=[1,1], t=1}
Now we need to find a decision boundary, and the ideal boundary should be:
See? W is perpendicular to our boundary. Thus, we say W decided the direction of boundary.
However, it is hard to find correct W at first time. Mostly, we choose original W value randomly. Thus, the first boundary may be this:
Now the boundary is parallel to the y axis.
We want to rotate the boundary. How?
By changing the W.
So, we use the learning rule function: W'=W+P:
W'=W+P is equivalent to W' = W + bP, while b=1.
Therefore, by changing the value of b(bias), you can decide the angle between W' and W. That is "the learning rule of ANN".
You could also read Neural Network Design by Martin T. Hagan / Howard B. Demuth / Mark H. Beale, chapter 4 "Perceptron Learning Rule"
In simpler terms, biases allow for more and more variations of weights to be learnt/stored... (side-note: sometimes given some threshold). Anyway, more variations mean that biases add richer representation of the input space to the model's learnt/stored weights. (Where better weights can enhance the neural net’s guessing power)
For example, in learning models, the hypothesis/guess is desirably bounded by y=0 or y=1 given some input, in maybe some classification task... i.e some y=0 for some x=(1,1) and some y=1 for some x=(0,1). (The condition on the hypothesis/outcome is the threshold I talked about above. Note that my examples setup inputs X to be each x=a double or 2 valued-vector, instead of Nate's single valued x inputs of some collection X).
If we ignore the bias, many inputs may end up being represented by a lot of the same weights (i.e. the learnt weights mostly occur close to the origin (0,0).
The model would then be limited to poorer quantities of good weights, instead of the many many more good weights it could better learn with bias. (Where poorly learnt weights lead to poorer guesses or a decrease in the neural net’s guessing power)
So, it is optimal that the model learns both close to the origin, but also, in as many places as possible inside the threshold/decision boundary. With the bias we can enable degrees of freedom close to the origin, but not limited to origin's immediate region.
In neural networks:
Each neuron has a bias
You can view bias as a threshold (generally opposite values of threshold)
Weighted sum from input layers + bias decides activation of a neuron
Bias increases the flexibility of the model.
In absence of bias, the neuron may not be activated by considering only the weighted sum from the input layer. If the neuron is not activated, the information from this neuron is not passed through rest of the neural network.
The value of bias is learnable.
Effectively, bias = — threshold. You can think of bias as how easy it is to get the neuron to output a 1 — with a really big bias, it’s very easy for the neuron to output a 1, but if the bias is very negative, then it’s difficult.
In summary: bias helps in controlling the value at which the activation function will trigger.
Follow this video for more details.
Few more useful links:
geeksforgeeks
towardsdatascience
Expanding on zfy's explanation:
The equation for one input, one neuron, one output should look:
y = a * x + b * 1 and out = f(y)
where x is the value from the input node and 1 is the value of the bias node;
y can be directly your output or be passed into a function, often a sigmoid function. Also note that the bias could be any constant, but to make everything simpler we always pick 1 (and probably that's so common that zfy did it without showing & explaining it).
Your network is trying to learn coefficients a and b to adapt to your data.
So you can see why adding the element b * 1 allows it to fit better to more data: now you can change both slope and intercept.
If you have more than one input your equation will look like:
y = a0 * x0 + a1 * x1 + ... + aN * 1
Note that the equation still describes a one neuron, one output network; if you have more neurons you just add one dimension to the coefficient matrix, to multiplex the inputs to all nodes and sum back each node contribution.
That you can write in vectorized format as
A = [a0, a1, .., aN] , X = [x0, x1, ..., 1]
Y = A . XT
i.e. putting coefficients in one array and (inputs + bias) in another you have your desired solution as the dot product of the two vectors (you need to transpose X for the shape to be correct, I wrote XT a 'X transposed')
So in the end you can also see your bias as is just one more input to represent the part of the output that is actually independent of your input.
To think in a simple way, if you have y=w1*x where y is your output and w1 is the weight, imagine a condition where x=0 then y=w1*x equals to 0.
If you want to update your weight you have to compute how much change by delw=target-y where target is your target output. In this case 'delw' will not change since y is computed as 0. So, suppose if you can add some extra value it will help y = w1x + w01, where bias=1 and weight can be adjusted to get a correct bias. Consider the example below.
In terms of line slope, intercept is a specific form of linear equations.
y = mx + b
Check the image
image
Here b is (0,2)
If you want to increase it to (0,3) how will you do it by changing the value of b the bias.
For all the ML books I studied, the W is always defined as the connectivity index between two neurons, which means the higher connectivity between two neurons.
The stronger the signals will be transmitted from the firing neuron to the target neuron or Y = w * X as a result to maintain the biological character of neurons, we need to keep the 1 >=W >= -1, but in the real regression, the W will end up with |W| >=1 which contradicts how the neurons are working.
As a result, I propose W = cos(theta), while 1 >= |cos(theta)|, and Y= a * X = W * X + b while a = b + W = b + cos(theta), b is an integer.
Bias acts as our anchor. It's a way for us to have some kind of baseline where we don't go below that. In terms of a graph, think of like y=mx+b it's like a y-intercept of this function.
output = input times the weight value and added a bias value and then apply an activation function.
The term bias is used to adjust the final output matrix as the y-intercept does. For instance, in the classic equation, y = mx + c, if c = 0, then the line will always pass through 0. Adding the bias term provides more flexibility and better generalisation to our neural network model.
The bias helps to get a better equation.
Imagine the input and output like a function y = ax + b and you need to put the right line between the input(x) and output(y) to minimise the global error between each point and the line, if you keep the equation like this y = ax, you will have one parameter for adaptation only, even if you find the best a minimising the global error it will be kind of far from the wanted value.
You can say the bias makes the equation more flexible to adapt to the best values

Resources