I am trying to build a mathematical-operation-selection NN model that works on a scalar input. The operation is selected based on the softmax result produced by the NN. Then this operation has to be applied to the scalar input in order to produce the final output. So far I've come up with applying argmax and one_hot to the softmax output in order to produce a mask, which is then applied to the concatenated values matrix of all the possible operations (as shown in the pseudo-code below). The issue is that neither argmax nor one_hot appears to be differentiable. I am new to this, so any help would be highly appreciated. Thanks in advance.
#perform softmax
logits = tf.matmul(current_input, W) + b
softmax = tf.nn.softmax(logits)
#perform all possible operations on the input
op_1_val = tf_op_1(current_input)
op_2_val = tf_op_2(current_input)
op_3_val = tf_op_3(current_input)
values = tf.concat([op_1_val, op_2_val, op_3_val], 1)
#create a mask
argmax = tf.argmax(softmax, 1)
mask = tf.one_hot(argmax, num_of_operations)
#produce the input, by masking out those operation results which have not been selected
output = values * mask
I believe that this is not possible. This is similar to Hard Attention described in this paper. Hard attention is used in image captioning to allow the model to focus only on a certain part of the image at each step. Hard attention is not differentiable, but there are two ways to get around this:
1- Use Reinforcement Learning (RL): RL is made to train models that make decisions. Even though the loss function won't back-propagate any gradients to the softmax used for the decision, you can use RL techniques to optimize the decision. For a simplified example, you can consider the loss as a penalty and send the node with the maximum value in the softmax layer a policy gradient proportional to the penalty, in order to decrease the score of the decision if it was bad (i.e. it results in a high loss).
2- Use something like soft attention: instead of picking only one operation, mix them with weights based on the softmax. So instead of:
output = values * mask
Use:
output = values * softmax
Now, the operations will converge down to zero based on how much the softmax does not select them. This is easier to train than RL, but it won't work if you must completely remove the non-selected operations from the final result (set them to zero completely).
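For example, a minimal sketch of this soft-selection variant (assuming values has shape [batch, num_of_operations] with one column per operation, matching the softmax shape):
#weight every operation result by its softmax score
weighted = values * softmax
#sum the weighted results into a single output per example
output = tf.reduce_sum(weighted, 1, keep_dims=True)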
This is another answer that talks about Hard and Soft attention that you may find helpful: https://stackoverflow.com/a/35852153/6938290
Denote a[2, 3] to be a matrix of dimension 2x3. Say there are 10 elements in each input and the network is a two-element classifier (cat or dog, for example). Say there is just one dense layer. For now I am ignoring the bias vector. I know this is an over-simplified neural net, but it is just for this example. Each output in a dense layer of a neural net can be calculated as
output = matmul(input, weights)
Where weights is a 10x2 weight matrix, input is a 1x10 input vector, and output is a 1x2 output vector.
My question is this: Can an entire series of inputs be computed at the same time with a single matrix multiplication? It seems like you could compute
output = matmul(input, weights)
Where there are 100 inputs total, and input is 100x10, weights is 10x2, and output is 100x2.
In back propagation, you could do something similar:
input_err = matmul(output_err, transpose(weights))
weights_err = matmul(transpose(input), output_err)
weights -= learning_rate*weights_err
Where weights is the same, output_err is 100x2, and input is 100x10.
However, I tried to implement a neural network in this way from scratch and I am currently unsuccessful. I am wondering if I have some other error or if my approach is fundamentally wrong.
Thanks!
If anyone else is wondering, I found the answer to my question. This does not in fact work as a drop-in replacement for per-example training, for a few reasons. Essentially, computing all inputs in this way is like running the network with a batch size equal to the number of inputs. The weights do not get updated between inputs, but rather all at once, so while the calculation itself is valid, each input no longer influences the training individually, step by step. However, with a reasonable batch size you can do 2D matrix multiplications, where the input is batch_size by input_size, in order to speed up training.
In addition, if predicting on many inputs (in the test stage, for example), since no weights are updated, an entire matrix multiplication of num_inputs by input_size can be run to compute all inputs in parallel.
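To illustrate, here is a minimal NumPy sketch of the batched forward and backward pass described above (the shapes and names are assumptions for this example, not code from the question):
import numpy as np

batch_size, input_size, output_size = 100, 10, 2
inputs = np.random.randn(batch_size, input_size)
weights = np.random.randn(input_size, output_size)
learning_rate = 0.01

#forward pass for the whole batch: (100, 10) x (10, 2) -> (100, 2)
output = np.matmul(inputs, weights)
#pretend output_err is the gradient of the loss with respect to output
output_err = np.random.randn(batch_size, output_size)
#backprop to the inputs: (100, 2) x (2, 10) -> (100, 10)
input_err = np.matmul(output_err, weights.T)
#gradient for the weights, accumulated over the batch: (10, 100) x (100, 2) -> (10, 2)
weights_err = np.matmul(inputs.T, output_err)
#one update for the whole batch
weights -= learning_rate * weights_err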
The general idea I am trying to realize is a seq2seq-model (taken from the translate.py-example in the models, based on the seq2seq-class). This trains well.
Furthermore I am using the hidden state of the rnn after all the encoding is done, right before decoding starts (I call it the “hidden state at end of encoding”). I use this hidden state at end of encoding to feed it into a further sub-graph which I call “prices” (see below). The training gradients of this sub-graph backprop not only through this additional sub-graph, but also back into the encoder-part of the rnn (which is what I want and need).
The plan is to add more such sub-graphs to the hidden state at end of encoding, as I want to analyze the input phrases in a variety of ways.
Now during training when I evaluate and train both sub-graphs (encoder+prices AND encoder+decoder) at the same time, the net does NOT converge. However, if I train by executing the training in the following way (pseudo-code):
if global_step % 10 == 0:
    execute-the-price-training_code
else:
    execute-the-decoder-training_code
So I am not training both sub-graphs simultaneously. Now it does converge, but the encoder+decoder-part converges MUCH slower than if I ONLY train this part and never train the prices-sub-graph.
My question is: I should be able to train both sub-graphs simultaneously. But probably I have to rescale the gradients flowing back into the hidden state at end of encoding, since here we get the gradients from the prices sub-graph AND from the decoder sub-graph. How should this rescaling be done? I didn't find any papers describing such an undertaking, but maybe I am searching with the wrong keywords.
Here is the training-part of the code:
This is the (almost original) training-op-preparation:
if not forward_only:
    self.gradient_norms = []
    self.updates = []
    opt = tf.train.AdadeltaOptimizer(self.learning_rate)
    for bucket_id in xrange(len(buckets)):
        tf.scalar_summary("seq2seq loss", self.losses[bucket_id])
        gradients = tf.gradients(self.losses[bucket_id], var_list_seq2seq)
        clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
        self.gradient_norms.append(norm)
        self.updates.append(opt.apply_gradients(zip(clipped_gradients, var_list_seq2seq), global_step=self.global_step))
Now, additionally, I am running a second sub-graph that takes the hidden state at end of encoding as input:
with tf.name_scope('prices') as scope:
    # First layer
    W_price_first_layer = tf.Variable(tf.random_normal([num_layers*size, self.prices_hidden_layer_size], stddev=0.35), name="W_price_first_layer")
    B_price_first_layer = tf.Variable(tf.zeros([self.prices_hidden_layer_size]), name="B_price_first_layer")
    self.output_price_first_layer = tf.add(tf.matmul(self.hidden_state, W_price_first_layer), B_price_first_layer)
    self.activation_price_first_layer = tf.nn.sigmoid(self.output_price_first_layer)
    #self.activation_price_first_layer = tf.nn.relu(self.output_price_first_layer)

    # Second layer to softmax (price ranges)
    W_price = tf.Variable(tf.random_normal([self.prices_hidden_layer_size, self.prices_bit_size], stddev=0.35), name="W_price")
    W_price_t = tf.transpose(W_price)
    B_price = tf.Variable(tf.zeros([self.prices_bit_size]), name="B_price")
    self.output_price_second_layer = tf.add(tf.matmul(self.activation_price_first_layer, W_price), B_price)
    self.price_prediction = tf.nn.softmax(self.output_price_second_layer)
    self.label_price = tf.placeholder(tf.int32, shape=[self.batch_size], name="price_label")

    # Remember the prices trainables
    var_list_prices = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, "prices")
    var_list_all = tf.trainable_variables()

    # Backprop
    self.loss_price = tf.nn.sparse_softmax_cross_entropy_with_logits(self.output_price_second_layer, self.label_price)
    self.loss_price_scalar = tf.reduce_mean(self.loss_price)
    self.optimizer_price = tf.train.AdadeltaOptimizer(self.learning_rate_prices)
    self.training_op_price = self.optimizer_price.minimize(self.loss_price, var_list=var_list_all)
Thx a bunch
I expect that running two optimizers simultaneously will lead to inconsistent gradient updates on the common variables, and this might be causing your training not to converge.
Instead, if you add the scalar loss from each sub-network to the "losses collection" (e.g. via tf.contrib.losses.add_loss() or tf.add_to_collection(tf.GraphKeys.LOSSES, ...)), you can use tf.contrib.losses.get_total_loss() to get a single loss value that can be passed to a single standard TensorFlow tf.train.Optimizer subclass. TensorFlow will derive the appropriate back-prop computation for your split network.
The get_total_loss() method simply computes an unweighted sum of the values that have been added to the losses collection. I'm not familiar with the literature on how or if you should scale these values, but you can use any arbitrary (differentiable) TensorFlow expression to combine the losses and pass the result to a single optimizer.
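As a minimal sketch of the single-optimizer approach (the names self.losses[bucket_id], self.loss_price_scalar and the weight 0.1 are assumptions based on your posted code, not a tested drop-in):
#scale the prices loss and add it to the seq2seq loss to get one scalar objective
total_loss = self.losses[bucket_id] + 0.1 * self.loss_price_scalar
opt = tf.train.AdadeltaOptimizer(self.learning_rate)
gradients = tf.gradients(total_loss, tf.trainable_variables())
clipped_gradients, norm = tf.clip_by_global_norm(gradients, max_gradient_norm)
#a single apply_gradients updates the encoder, decoder and prices variables consistently
train_op = opt.apply_gradients(zip(clipped_gradients, tf.trainable_variables()), global_step=self.global_step)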
I can't get TensorFlow RELU activations (neither tf.nn.relu nor tf.nn.relu6) working without NaN values for activations and weights killing my training runs.
I believe I'm following all the right general advice. For example I initialize my weights with
weights = tf.Variable(tf.truncated_normal(w_dims, stddev=0.1))
biases = tf.Variable(tf.constant(0.1 if neuron_fn in [tf.nn.relu, tf.nn.relu6] else 0.0, shape=b_dims))
and use a slow training rate, e.g.,
tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)
But any network of appreciable depth results in NaN for the cost and at least some weights (at least in the summary histograms for them). In fact, the cost is often NaN right from the start (before training).
I seem to have these issues even when I use L2 (about 0.001) regularization, and dropout (about 50%).
Is there some parameter or setting that I should adjust to avoid these issues? I'm at a loss as to where to even begin looking, so any suggestions would be appreciated!
Following He et al. (as suggested in lejlot's comment), initializing the weights of the l-th layer to a zero-mean Gaussian distribution with standard deviation sqrt(2 / n_l),
where n_l is the flattened length of the input vector, or
stddev=np.sqrt(2 / np.prod(input_tensor.get_shape().as_list()[1:]))
results in weights that generally do not diverge.
If you use a softmax classifier at the top of your network, try to make the initial weights of the layer just below the softmax very small (e.g. std=1e-4). This makes the initial distribution of outputs of the network very soft (high temperature), and helps ensure that the first few steps of your optimization are not too large and numerically unstable.
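For instance, a minimal sketch of that initialization (hidden_size and num_classes are placeholder names, not from the question):
#a tiny stddev keeps the initial logits near zero, so the initial softmax is nearly uniform
W_softmax = tf.Variable(tf.truncated_normal([hidden_size, num_classes], stddev=1e-4))
b_softmax = tf.Variable(tf.zeros([num_classes]))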
Have you tried gradient clipping and/or a smaller learning rate?
Basically, you will need to process your gradients before applying them, as follows (from tf docs, mostly):
# Replace this with what follows
# opt = tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)
# Create an optimizer.
opt = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.5)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(cross_entropy_loss, tf.trainable_variables())
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(tf.clip_by_value(gv[0], -5., 5.), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients; this returns the training op.
train_op = opt.apply_gradients(capped_grads_and_vars)
Also, the discussion in this question might help.
I am taking my first steps with neural networks, and to do so I am experimenting with a very simple single-layer, single-output perceptron which uses a sigmoidal activation function. I am updating my weights on-line each time a training example is presented, using:
weights += learningRate * (correct - result) * {input,1}
Here weights is an n-length vector which also contains the weight from the bias neuron (the negative threshold), result is the result computed by the perceptron (and processed using the sigmoid) when given the input, correct is the correct result, and {input,1} is the input augmented with a 1 (the fixed input from the bias neuron). Now, when I try to train the perceptron to perform logical AND, the weights don't converge for a long time; instead they keep growing similarly and maintain a ratio of circa -1.5 with the threshold. For instance, the three weights are in sequence:
5.067160008240718 5.105631826680446 -7.945513136885797
...
8.40390853077094 8.43890306970281 -12.889540730182592
I would expect the perceptron to stop at 1, 1, -1.5.
Apart from this problem, which looks like it is connected to some missing stopping condition in the learning, if I try to use the identity function as the activation function, I get weight values oscillating around:
0.43601272528257057 0.49092558197172703 -0.23106430854347537
and I obtain similar results with tanh. I can't give an explanation to this.
Thank you
Tunnuz
It is because the sigmoid activation function never actually reaches one (or zero), even for very large positive (or negative) inputs. So (correct - result) will always be non-zero, and your weights will always get updated. Try it with the step function as the activation function (i.e. f(x) = 1 for x > 0, f(x) = 0 otherwise).
Your average weight values don't seem right for the identity activation function. It might be that your learning rate is a little high -- try reducing it and see if that reduces the size of the oscillations.
Also, when doing online learning (aka stochastic gradient descent), it is common practice to reduce the learning rate over time so that you converge to a solution. Otherwise your weights will continue to oscillate.
When trying to analyze the behavior of the perceptron, it helps to also look at correct and result.
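For illustration, a minimal sketch (not the original code) of online learning of AND with a step activation, where the updates stop once every example is classified correctly:
def step(x):
    return 1.0 if x > 0 else 0.0

weights = [0.0, 0.0, 0.0]  #two input weights plus the bias weight
learning_rate = 0.1
examples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

for epoch in range(100):
    for inputs, correct in examples:
        augmented = inputs + [1]  #append the fixed bias input
        result = step(sum(w * x for w, x in zip(weights, augmented)))
        #(correct - result) is exactly zero for correctly classified examples
        weights = [w + learning_rate * (correct - result) * x
                   for w, x in zip(weights, augmented)]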
Is it good practice to use sigmoid or tanh output layers in neural networks directly to estimate probabilities?
i.e. the probability of a given input occurring is the output of the sigmoid function in the NN
EDIT
I want to use a neural network to learn and predict the probability of a given input occurring.
You may consider the input as a State1-Action-State2 tuple.
Hence the output of the NN is the probability that State2 happens when applying Action to State1.
I hope that clears things up.
EDIT
When training the NN, I do a random Action on State1 and observe the resultant State2; then I teach the NN that the input State1-Action-State2 should result in an output of 1.0.
First, just a couple of small points on the conventional MLP lexicon (might help for internet searches, etc.): 'sigmoid' and 'tanh' are not 'output layers' but functions, usually referred to as "activation functions". The return value of the activation function is indeed the output from each layer, but they are not the output layer themselves (nor do they calculate probabilities).
Additionally, your question recites a choice between two "alternatives" ("sigmoid and tanh"), but they are not actually alternatives, rather the term 'sigmoidal function' is a generic/informal term for a class of functions, which includes the hyperbolic tangent ('tanh') that you refer to.
The term 'sigmoidal' is probably due to the characteristic shape of the function--the return (y) values are constrained between two asymptotic values regardless of the x value. The function output is usually normalized so that these two values are -1 and 1 (or 0 and 1). (This output behavior, by the way, is obviously inspired by the biological neuron, which either fires (+1) or it doesn't (-1).) A look at the key properties of sigmoidal functions shows why they are ideally suited as activation functions in feed-forward, backpropagating neural networks: (i) real-valued and differentiable, (ii) having exactly one inflection point, and (iii) having a pair of horizontal asymptotes.
In turn, the sigmoidal function is one category of functions used as the activation function (aka "squashing function") in FF neural networks solved using backprop. During training or prediction, the weighted sum of the inputs (for a given layer, one layer at a time) is passed in as an argument to the activation function, which returns the output for that layer. Another group of functions apparently used as the activation function is the piecewise linear function (PLF). The step function is the binary variant of a PLF:
def step_fn(x):
    if x <= 0:
        y = 0
    if x > 0:
        y = 1
    return y
(On practical grounds, I doubt the step function is a plausible choice for the activation function, but perhaps it helps understand the purpose of the activation function in NN operation.)
I suppose there is an unlimited number of possible activation functions, but in practice you only see a handful; in fact, just two account for the overwhelming majority of cases (both are sigmoidal). Here they are (in Python) so you can experiment for yourself, given that the primary selection criterion is a practical one:
import math

# logistic function
def sigmoid2(x):
    return 1 / (1 + math.exp(-x))

# hyperbolic tangent
def sigmoid1(x):
    return math.tanh(x)
What are the factors to consider in selecting an activation function?
First the function has to give the desired behavior (arising from or as evidenced by sigmoidal shape). Second, the function must be differentiable. This is a requirement for backpropagation, which is the optimization technique used during training to 'fill in' the values of the hidden layers.
For instance, the derivative of the hyperbolic tangent is (in terms of the output, which is how it is usually written) :
def dsigmoid(y) :
return 1.0 - y**2
Beyond those two requirements, what makes one function better than another is how efficiently it trains the network--i.e., which one causes convergence (reaching the local minimum error) in the fewest epochs?
#-------- Edit (see OP's comment below) ---------#
I am not quite sure I understood--sometimes it's difficult to communicate the details of a NN without the code, so I should probably just say that it's fine, subject to this proviso: what you want the NN to predict must be the same as the dependent variable used during training. So, for instance, if you train your NN using two states (e.g., 0, 1) as the single dependent variable (which is obviously missing from your testing/production data), then that's what your NN will return when run in "prediction mode" (post-training, or with a competent weight matrix).
You should choose the right loss function to minimize.
The squared error does not lead to the maximum likelihood hypothesis here.
The squared error is derived from a model with Gaussian noise:
P(y|x,h) = k1 * e**-(k2 * (y - h(x))**2)
You estimate the probabilities directly. Your model is:
P(Y=1|x,h) = h(x)
P(Y=0|x,h) = 1 - h(x)
P(Y=1|x,h) is the probability that event Y=1 will happen after seeing x.
The maximum likelihood hypothesis for your model is:
h_max_likelihood = argmax_h product(
h(x)**y * (1-h(x))**(1-y) for x, y in examples)
This leads to the "cross entropy" loss function.
See chapter 6 in Mitchell's Machine Learning for the loss function and its derivation.
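In code, a minimal sketch of that loss (h is assumed to be the network's sigmoid output, i.e. a value strictly between 0 and 1 representing P(Y=1|x,h)):
import math

def cross_entropy_loss(examples, h):
    #negative log-likelihood of the model above over (x, y) pairs with y in {0, 1}
    return -sum(y * math.log(h(x)) + (1 - y) * math.log(1 - h(x))
                for x, y in examples)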
There is one problem with this approach: if you have vectors from R^n and your network maps those vectors into the interval [0, 1], it will not be guaranteed that the network represents a valid probability density function, since the integral of the network is not guaranteed to equal 1.
E.g., a neural network could map any input from R^n to 1.0, but such a constant output clearly cannot be a valid density (its integral is not 1).
So the answer to your question is: no, you can't.
However, you can just say that your network never sees "unrealistic" code samples and thus ignore this fact. For a discussion of this (and also some more cool information on how to model PDFs with neural networks) see contrastive backprop.