When using tensorflow's automatic differentiation, exactly which variables does it update each iteration?
For instance, if I had a vanilla RNN with a state variable, would automatic differentiation compute its gradient and update it at each time-step too, since it contributes to the loss?
For instance, the equation for a vanilla RNN would be:
h_t = X_t W_x + h_{t-1} W_h + b
How does tensorflow know to update the values of W_x, W_h, and b, but not h_{t-1} ?
Sorry if I'm missing something obvious.
The state of the RNN is not defined as a trainable TensorFlow variable. It is just a node in the graph that connects other nodes. By contrast, the weight matrices W_x and W_h (and the bias b) are explicitly defined as TensorFlow variables with tf.Variable(..., trainable=True). Gradients still flow through h_{t-1} during backpropagation, but the optimizer applies updates only to the trainable variables, so those are the only objects changed during optimization.
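A minimal framework-free sketch of the point above, assuming a scalar RNN and a squared-error loss on the final state (both are illustrative choices, not part of the question; the function names are hypothetical). The state h is recomputed on every forward pass and only carries gradient backwards, while w_x, w_h and b are the only values that receive an update.

```python
def forward(params, xs, h0=0.0):
    """Run h_t = x_t * w_x + h_{t-1} * w_h + b and return all states."""
    w_x, w_h, b = params
    hs = [h0]
    for x in xs:
        hs.append(x * w_x + hs[-1] * w_h + b)
    return hs


def loss_and_grads(params, xs, target):
    """Squared error on the last state, with gradients for w_x, w_h, b only."""
    w_x, w_h, b = params
    hs = forward(params, xs)
    loss = 0.5 * (hs[-1] - target) ** 2
    dh = hs[-1] - target              # dL/dh_T
    g_wx = g_wh = g_b = 0.0
    for t in range(len(xs), 0, -1):
        g_wx += dh * xs[t - 1]        # contribution of x_t
        g_wh += dh * hs[t - 1]        # contribution of h_{t-1}
        g_b += dh
        dh *= w_h                     # gradient passes through h_{t-1}, which is not updated
    return loss, (g_wx, g_wh, g_b)


params = (0.5, 0.8, 0.1)              # the "trainable variables"
xs = [1.0, -2.0, 0.5]
loss, grads = loss_and_grads(params, xs, target=1.0)

learning_rate = 0.01                  # assumed value for the illustration
new_params = tuple(p - learning_rate * g for p, g in zip(params, grads))
```

Note that dh, the gradient with respect to the state, is computed at every time-step but is only used to build the gradients of the parameters.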
Usually the learning cycle contains:
optim.zero_grad()
loss(m, op).backward()
optim.step()
But what should be the cycle when the data does not fit in the graphics card?
First option:
for ip, op in DataLoader(TensorDataset(inputs, outputs),
                         batch_size=int(1e4), pin_memory=True):
    m = model(ip.to(dev))
    op = op.to(dev)
    optim.zero_grad()
    loss(m, op).backward()
    optim.step()
Second option:
optim.zero_grad()
for ip, op in DataLoader(TensorDataset(inputs, outputs),
                         batch_size=int(1e4), pin_memory=True):
    m = model(ip.to(dev))
    op = op.to(dev)
    loss(m, op).backward()
optim.step()
The third option:
Accumulate gradients after calling backward().
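The first option is correct and corresponds to mini-batch gradient descent: one parameter update per batch.
The second option does not do that: optim.step() runs only once, after the loop, so you get a single update per pass over the data. If backward() is called inside the loop, PyTorch adds each batch's gradients into .grad and that single step uses the summed gradient of all batches; if it is called only after the loop, m and op have been overwritten and the step reflects the final batch alone.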
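For the third option mentioned in the question (accumulating gradients), a small framework-free sketch may help. It assumes a one-dimensional linear model with a summed squared-error loss, and the names grad_sum and chunk_size are hypothetical. It shows that summing the gradients of successive chunks, with the parameters held fixed, reproduces the gradient of the whole data set.

```python
def grad_sum(w, b, batch):
    """Summed gradient of 0.5 * (w * x + b - y) ** 2 over a batch."""
    g_w = g_b = 0.0
    for x, y in batch:
        err = w * x + b - y
        g_w += err * x
        g_b += err
    return g_w, g_b


# Toy data generated from y = 3x + 1.
data = [(float(x), 3.0 * x + 1.0) for x in range(10)]
w, b = 0.0, 0.0

# Gradient computed in one go over all examples.
full_w, full_b = grad_sum(w, b, data)

# Same gradient built up chunk by chunk, as if only a chunk fit in memory.
chunk_size = 4
acc_w = acc_b = 0.0
for start in range(0, len(data), chunk_size):
    g_w, g_b = grad_sum(w, b, data[start:start + chunk_size])
    acc_w += g_w
    acc_b += g_b
```

A single update from the accumulated gradient is therefore a full-batch step, whereas the first option takes one step per chunk.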
The proper way to train a model with Stochastic Gradient Descent (SGD) is to follow these steps:
1. Instantiate a model and randomly initialize its weights. This is done only once.
2. Instantiate the dataset and the dataloader, defining an appropriate batch_size.
3. Iterate over all the examples, batch by batch. At each iteration:
3.a Compute a stochastic estimate of the loss using only a batch, rather than the entire set (aka "forward pass")
3.b Compute the gradient of the loss w.r.t. the model's parameters (aka "backward pass")
3.c Update the weights based on the current gradient
This is what the code should look like:
model = MyModel(...)  # instantiate a model once
dl = DataLoader(TensorDataset(inputs, outputs), batch_size=int(1e4), pin_memory=True)
for ei in range(num_epochs):
    for ip, op in dl:
        optim.zero_grad()
        predict = model(ip.to(dev))  # forward pass
        loss = criterion(predict, op.to(dev))  # estimate current loss
        loss.backward()  # backward pass - propagate gradients
        optim.step()  # update the weights based on current batch
Note that during training you iterate several times over the entire training set. Each such iteration is usually referred to as an "epoch".
I am currently building a CNN in tensorflow and I am initialising my weight matrix using a He normal weight initialisation. However, I am unsure how I should initialise my bias values. I am using ReLU as my activation function between each convolutional layer. Is there a standard method for initialising bias values?
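import numpy as np
import tensorflow as tf
# Define approximate xavier weight initialization (with ReLU correction described by He)
def xavier_over_two(shape):
    # shape is [height, width, in_channels, out_channels]; He init uses std = sqrt(2 / fan_in)
    std = np.sqrt(2.0 / (shape[0] * shape[1] * shape[2]))
    return tf.random_normal(shape, stddev=std)
def bias_init(shape):
    return #???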
Initializing the biases. It is possible and common to initialize the
biases to be zero, since the asymmetry breaking is provided by the
small random numbers in the weights. For ReLU non-linearities, some
people like to use small constant value such as 0.01 for all biases
because this ensures that all ReLU units fire in the beginning and
therefore obtain and propagate some gradient. However, it is not clear
if this provides a consistent improvement (in fact some results seem
to indicate that this performs worse) and it is more common to simply
use 0 bias initialization.
source: http://cs231n.github.io/neural-networks-2/
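As a rough illustration of the advice quoted above, here is a plain-Python sketch that pairs He-normal weights with zero biases. he_normal and zero_bias are hypothetical helper names, and the layer is assumed to be a simple fan_in by fan_out matrix rather than a convolution kernel; in TensorFlow itself, returning tf.zeros(shape) from bias_init would be the equivalent of zero_bias.

```python
import math
import random


def he_normal(fan_in, fan_out, rng):
    """Weights drawn from N(0, 2 / fan_in), the He initialisation for ReLU."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]


def zero_bias(fan_out):
    """All-zero biases; the random weights already break the symmetry."""
    return [0.0] * fan_out


rng = random.Random(0)          # fixed seed so the sketch is repeatable
weights = he_normal(200, 100, rng)
biases = zero_bias(100)
```

Swapping zero_bias for a small constant such as 0.01 gives the alternative the quote mentions.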
Be aware of the specific case of the last layer's bias. As Andrej Karpathy explains in his Recipe for Training Neural Networks:
init well. Initialize the final layer weights correctly. E.g. if you are regressing some values that have a mean of 50 then initialize the final bias to 50. If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization. Setting these correctly will speed up convergence and eliminate “hockey stick” loss curves where in the first few iteration your network is basically just learning the bias.
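For the imbalanced-classification case in the quote, the bias that makes a sigmoid output start at a chosen probability is the logit of that probability. A minimal sketch, assuming the final layer's weights are initialised near zero so that the bias dominates the initial output (the helper names are illustrative):

```python
import math


def logit(p):
    """Inverse of the sigmoid: the pre-activation value that yields p."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Target initial probability of the positive class, as in the quote.
final_bias = logit(0.1)
initial_prediction = sigmoid(final_bias)
```

For the regression case in the quote, the analogous choice is simply to set the final bias to the mean of the targets.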
Keras enables adding a layer which calculates a user defined lambda function.
What I don't get is how Keras knows to calculate the gradient of this user defined function for the backpropagation.
That is one of the benefits of using Theano/TensorFlow and the libraries built on top of them: they give you automatic gradient calculation for mathematical functions and operations.
Keras gets them by calling:
# keras/theano_backend.py
def gradients(loss, variables):
    return T.grad(loss, variables)
# keras/tensorflow_backend.py
def gradients(loss, variables):
    '''Returns the gradients of `variables` (list of tensor variables)
    with regard to `loss`.
    '''
    return tf.gradients(loss, variables, colocate_gradients_with_ops=True)
These are in turn called by the optimizers (keras/optimizers.py) via grads = self.get_gradients(loss, params) to get the gradients used to write the update rule for all the params. Here params are the trainable weights of the layers. Layers created with Lambda have no trainable weights of their own, but they affect the loss through the forward pass and hence indirectly affect the gradients of the trainable weights of other layers.
The only time you need to write a new gradient calculation is when you define a new basic mathematical operation/function. Likewise, when you write a custom loss function, autograd almost always takes care of the gradient calculation. Optionally, you can sometimes speed up training by implementing the analytical gradient of your custom function. For example, the softmax function can be expressed in terms of exp, sum and div, and autograd can handle it, but its analytical/symbolic gradient is usually implemented directly in Theano/TensorFlow.
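To make the idea concrete, here is a toy sketch of automatic differentiation in plain Python. It uses forward-mode dual numbers for brevity, which is an assumption of this example: Theano and TensorFlow build gradients symbolically in reverse mode, and the Dual class is not part of either library. The point it illustrates is the same, though: once each primitive (add, mul, exp) knows its own derivative, any user-defined lambda composed of them is differentiated with no extra work.

```python
import math


class Dual:
    """A value together with its derivative with respect to one input."""

    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule.
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)

    __rmul__ = __mul__


def exp(x):
    """Primitive op with a hand-written derivative."""
    return Dual(math.exp(x.val), math.exp(x.val) * x.dot)


def derivative(fn, x):
    """Evaluate d fn / d x at x by seeding the input derivative with 1."""
    return fn(Dual(x, 1.0)).dot


# A user-defined function, analogous to what a Lambda layer wraps.
user_fn = lambda x: x * x + 3.0 * exp(x)
slope_at_zero = derivative(user_fn, 0.0)
```

Here user_fn never states its own derivative; it is assembled from the rules attached to the primitives.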
For implementing new Ops, see the links below:
http://deeplearning.net/software/theano/extending/extending_theano.html
https://www.tensorflow.org/versions/r0.12/how_tos/adding_an_op/index.html
I'm working on implementing an interface between a TensorFlow basic LSTM that's already been trained and a javascript version that can be run in the browser. The problem is that in all of the literature that I've read LSTMs are modeled as mini-networks (using only connections, nodes and gates) and TensorFlow seems to have a lot more going on.
The two questions that I have are:
Can the TensorFlow model be easily translated into a more conventional neural network structure?
Is there a practical way to map the trainable variables that TensorFlow gives you to this structure?
I can get the 'trainable variables' out of TensorFlow, the issue is that they appear to only have one value for bias per LSTM node, where most of the models I've seen would include several biases for the memory cell, the inputs and the output.
Internally, the LSTMCell class stores the LSTM weights as one big matrix instead of 8 smaller ones for efficiency. It is quite easy to divide it horizontally and vertically to get the more conventional representation. However, it might be easier and more efficient if your library performs a similar optimization.
Here is the relevant piece of code of the BasicLSTMCell:
concat = linear([inputs, h], 4 * self._num_units, True)
# i = input_gate, j = new_input, f = forget_gate, o = output_gate
i, j, f, o = array_ops.split(1, 4, concat)
The linear function does the matrix multiplication to transform the concatenated input and the previous h state into 4 matrices of [batch_size, self._num_units] shape. The linear transformation uses a single matrix and bias variables that you're referring to in the question. The result is then split into different gates used by the LSTM transformation.
If you'd like to explicitly get the transformations for each gate, you can split that matrix and bias into 4 blocks. It is also quite easy to implement it from scratch using 4 or 8 linear transformations.
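The split described above can be sketched in plain Python as follows. The numbers are made up, and split_columns and affine are hypothetical helpers; the sketch assumes the fused kernel has one row per entry of the concatenated [inputs, h] vector and 4 * num_units columns ordered i, j, f, o, as in the snippet quoted earlier.

```python
def split_columns(matrix, parts):
    """Split a matrix (list of rows) into `parts` equal-width column blocks."""
    width = len(matrix[0]) // parts
    return [[row[p * width:(p + 1) * width] for row in matrix]
            for p in range(parts)]


def affine(vec, matrix, bias):
    """Compute vec @ matrix + bias for a single row vector."""
    return [sum(v * row[c] for v, row in zip(vec, matrix)) + bias[c]
            for c in range(len(bias))]


num_units = 2
# Made-up fused parameters: 3 rows = 1 input feature + 2 hidden units.
fused_kernel = [[float(r * 8 + c) for c in range(4 * num_units)]
                for r in range(3)]
fused_bias = [float(c) for c in range(4 * num_units)]

# One kernel and one bias vector per gate, in the order i, j, f, o.
kernels = dict(zip("ijfo", split_columns(fused_kernel, 4)))
biases = {g: block[0]
          for g, block in zip("ijfo", split_columns([fused_bias], 4))}

x_and_h = [1.0, 0.5, -1.0]      # concatenated [input, previous h]
fused_out = affine(x_and_h, fused_kernel, fused_bias)
per_gate = {g: affine(x_and_h, kernels[g], biases[g]) for g in "ijfo"}
```

Each per-gate kernel can be split further by rows into an input part and a recurrent part, which gives the 8 matrices of the textbook formulation. When porting, it is also worth checking whether the cell adds a constant forget bias to f before the sigmoid, since that constant is not stored in the trainable variables.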
Is it a good practice to use sigmoid or tanh output layers in Neural networks directly to estimate probabilities?
i.e. the probability that a given input occurs is the output of the sigmoid function in the NN
EDIT
I want to use a neural network to learn and predict the probability that a given input occurs.
You may consider the input as a State1-Action-State2 tuple.
Hence the output of the NN is the probability that State2 happens when applying Action to State1.
I hope that clears things up.
EDIT
When training the NN, I take a random Action in State1 and observe the resulting State2; then I teach the NN that the input State1-Action-State2 should result in output 1.0
First, just a couple of small points on the conventional MLP lexicon (might help for internet searches, etc.): 'sigmoid' and 'tanh' are not 'output layers' but functions, usually referred to as "activation functions". The return value of the activation function is indeed the output from each layer, but they are not the output layer themselves (nor do they calculate probabilities).
Additionally, your question recites a choice between two "alternatives" ("sigmoid and tanh"), but they are not actually alternatives, rather the term 'sigmoidal function' is a generic/informal term for a class of functions, which includes the hyperbolic tangent ('tanh') that you refer to.
The term 'sigmoidal' is probably due to the characteristic shape of the function--the return (y) values are constrained between two asymptotic values regardless of the x value. The function output is usually normalized so that these two values are -1 and 1 (or 0 and 1). (This output behavior, by the way, is obviously inspired by the biological neuron which either fires (+1) or it doesn't (-1)). A look at the key properties of sigmoidal functions and you can see why they are ideally suited as activation functions in feed-forward, backpropagating neural networks: (i) real-valued and differentiable, (ii) having exactly one inflection point, and (iii) having a pair of horizontal asymptotes.
In turn, sigmoidal functions are one category of functions used as the activation function (aka "squashing function") in FF neural networks trained using backprop. During training or prediction, the weighted sum of the inputs (for a given layer, one layer at a time) is passed as an argument to the activation function, which returns the output for that layer. Another group of functions used as activation functions is the piecewise linear functions (PLFs). The step function is the binary variant of a PLF:
def step_fn(x):
    # returns 0 for non-positive inputs and 1 for positive inputs
    if x <= 0:
        y = 0
    if x > 0:
        y = 1
    return y
(On practical grounds, I doubt the step function is a plausible choice for the activation function, but perhaps it helps understand the purpose of the activation function in NN operation.)
I suppose there are an unlimited number of possible activation functions, but in practice, you only see a handful; in fact just two account for the overwhelming majority of cases (both are sigmoidal). Here they are (in python) so you can experiment for yourself, given that the primary selection criterion is a practical one:
import math

# logistic function
def sigmoid2(x):
    return 1 / (1 + math.exp(-x))

# hyperbolic tangent
def sigmoid1(x):
    return math.tanh(x)
What are the factors to consider in selecting an activation function?
First, the function has to give the desired behavior (arising from or as evidenced by the sigmoidal shape). Second, the function must be differentiable. This is a requirement for backpropagation, which is the technique used during training to compute the gradients that adjust the weights of the hidden layers.
For instance, the derivative of the hyperbolic tangent is (in terms of the output, which is how it is usually written) :
def dsigmoid(y):
    return 1.0 - y**2
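As a quick sanity check of the "derivative in terms of the output" form, the sketch below compares it with a central finite difference. The helper names are illustrative, and the logistic case, whose derivative in terms of its output is y * (1 - y), is included for comparison.

```python
import math


def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))


def dtanh_from_output(y):
    """Derivative of tanh, written in terms of y = tanh(x)."""
    return 1.0 - y ** 2


def dlogistic_from_output(y):
    """Derivative of the logistic function, written in terms of its output."""
    return y * (1.0 - y)


def numeric_derivative(fn, x, eps=1e-6):
    """Central finite-difference approximation of fn'(x)."""
    return (fn(x + eps) - fn(x - eps)) / (2.0 * eps)


x = 0.7
tanh_gap = abs(numeric_derivative(math.tanh, x)
               - dtanh_from_output(math.tanh(x)))
logistic_gap = abs(numeric_derivative(logistic, x)
                   - dlogistic_from_output(logistic(x)))
```

Writing the derivative in terms of the output is convenient in backpropagation because the forward pass has already computed and stored y.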
Beyond those two requirements, what makes one function better than another is how efficiently it trains the network, i.e., which one leads to convergence (reaching a local minimum of the error) in the fewest epochs.
#-------- Edit (see OP's comment below) ---------#
I am not quite sure I understood (sometimes it's difficult to communicate details of a NN without the code), so I should probably just say that it's fine, subject to this proviso: what you want the NN to predict must be the same as the dependent variable used during training. So for instance, if you train your NN using two states (e.g., 0, 1) as the single dependent variable (which is obviously missing from your testing/production data), then that's what your NN will return when run in "prediction mode" (post training, or with a competent weight matrix).
You should choose the right loss function to minimize.
The squared error does not lead to the maximum likelihood hypothesis here.
The squared error is derived from a model with Gaussian noise:
P(y|x,h) = k1 * e**-(k2 * (y - h(x))**2)
You estimate the probabilities directly. Your model is:
P(Y=1|x,h) = h(x)
P(Y=0|x,h) = 1 - h(x)
P(Y=1|x,h) is the probability that event Y=1 will happen after seeing x.
The maximum likelihood hypothesis for your model is:
h_max_likelihood = argmax_h product(
h(x)**y * (1-h(x))**(1-y) for x, y in examples)
This leads to the "cross entropy" loss function.
See chapter 6 in Mitchell's Machine Learning
for the loss function and its derivation.
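The link between the likelihood above and the cross-entropy loss can be checked numerically. This is a small sketch with made-up predictions and labels: the negative log of the product equals the cross-entropy sum, and for a constant predictor the loss is lowest at the empirical frequency of the positive class.

```python
import math


def likelihood(preds, labels):
    """Product of h(x)**y * (1 - h(x))**(1 - y) over the examples."""
    result = 1.0
    for p, y in zip(preds, labels):
        result *= p ** y * (1.0 - p) ** (1 - y)
    return result


def cross_entropy(preds, labels):
    """Negative log-likelihood of binary labels under predicted probabilities."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                for p, y in zip(preds, labels))


labels = [1, 0, 0, 0]                 # made-up data: one positive in four
preds = [0.7, 0.2, 0.4, 0.1]          # made-up network outputs h(x)

gap = abs(-math.log(likelihood(preds, labels)) - cross_entropy(preds, labels))


def constant_loss(p):
    """Cross-entropy of a predictor that outputs p for every example."""
    return cross_entropy([p] * len(labels), labels)
```

Maximising the likelihood and minimising the cross-entropy therefore select the same hypothesis.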
There is one problem with this approach: if you have vectors from R^n and your network maps those vectors into the interval [0, 1], it will not be guaranteed that the network represents a valid probability density function, since the integral of the network is not guaranteed to equal 1.
E.g., a neural network could map any input from R^n to 1.0, which is clearly not a valid probability density.
So the answer to your question is: no, you can't.
However, you can just say that your network never sees "unrealistic" code samples and thus ignore this fact. For a discussion of this (and also some more cool information on how to model PDFs with neural networks) see contrastive backprop.