I'm using CNNs in Keras for an NLP task, and instead of max pooling I'm trying to achieve max-over-time pooling.
Any ideas/hacks on how to achieve this?
By max-over-time pooling I mean pooling the highest value, no matter where it occurs in the vector.
Assuming your data shape is (batch_size, seq_len, features), you can apply:
seq_model = Reshape((seq_len * features, 1))(seq_model)
seq_model = GlobalMaxPooling1D()(seq_model)
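For context, here is a minimal sketch of how those two lines fit into a full model (the layer sizes and names below are made up for illustration):
from keras.layers import Input, Conv1D, Reshape, GlobalMaxPooling1D, Dense
from keras.models import Model

seq_len, features = 50, 64                      # hypothetical shapes
inp = Input(shape=(seq_len, features))
conv = Conv1D(features, kernel_size=3, padding='same', activation='relu')(inp)
# Collapse the time and feature axes into one axis with a single channel,
# then take the global maximum over it -- the single largest activation,
# regardless of where it occurs.
flat = Reshape((seq_len * features, 1))(conv)
pooled = GlobalMaxPooling1D()(flat)             # shape (batch_size, 1)
out = Dense(1, activation='sigmoid')(pooled)
model = Model(inp, out)
Note that applying GlobalMaxPooling1D directly to the (batch_size, seq_len, features) tensor instead gives the maximum over time for each feature separately, which is the more common reading of "max over time pooling" in text CNNs; pick whichever matches what you need.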
I have implemented an autoencoder in Keras that takes 112*112*3 neurons as input and 100 neurons as the compressed/encoded state. I want to find which of these 100 neurons learn the important features. So far I have calculated eigenvalues (e) and eigenvectors (v) using the following steps, and I found that roughly the first 30 values of e are greater than 0. Does that mean the first 30 modes are the important ones? Is there any other method that could find the important neurons?
Thanks in advance.
x_enc = enc_model.predict(x_train, batch_size=BATCH_SIZE) # shape (3156,100)
x_mean = np.mean(x_enc, axis=0) # shape (100,)
x_stds = np.std(x_enc, axis=0) # shape (100,)
x_cov = np.cov((x_enc - x_mean).T) # shape (100,100)
e, v = np.linalg.eig(x_cov) # shape (100,) and (100,100) respectively
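As a rough sanity check of the eigenvalue idea, you can look at how much variance each mode explains, PCA-style. A sketch continuing the snippet above (since the covariance matrix is symmetric, np.linalg.eigh gives real eigenvalues in sorted order and is a bit more robust than eig here):
e, v = np.linalg.eigh(x_cov)            # eigenvalues in ascending order
explained = e[::-1] / e.sum()           # explained-variance ratio, largest first
print(np.cumsum(explained)[:30])        # fraction of variance covered by the first 30 modes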
I don't know if the approach you are using will actually give you any useful results, since the way the network learns and what exactly it learns aren't known. I suggest you use a different kind of autoencoder that automatically learns disentangled representations of the data in a latent space; this way you can be sure that all the parameters you find are actually contributing to the representation of your data. Check this article.
I have a set of inputs with around 5000 features, with values that vary from 0.005 to 9,000,000. Each feature keeps a similar scale across samples (a feature whose values are around 10 will not also take values around 0.1).
I am trying to apply linear regression to this data set, however, the wide range of input values is inhibiting effective gradient descent.
What is the best way to handle this variance? If normalization is best, please include details on the best way to implement this normalization.
Thanks!
Simply perform it as a pre-processing step. You can do it as follows:
1) Calculate the mean of each feature over the training set and store it. Be careful not to confuse the feature mean with the sample mean; you should end up with a vector of size [number_of_features] (5000ish).
2) Calculate the std of each feature over the training set and store it. Also of size [number_of_features].
3) Update each training and testing entry as:
updated = (original_vector - mean_vector)/ std_vector
That's it!
The code will look like:
import numpy as np

# train_data shape [train_length, 5000]
# test_data shape  [test_length, 5000]
mean = np.mean(train_data, axis=0)   # per-feature mean, shape (5000,)
std = np.std(train_data, axis=0)     # per-feature std, shape (5000,)
normalized_train_data = (train_data - mean) / std
normalized_test_data = (test_data - mean) / std
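If you prefer not to hand-roll this, scikit-learn's StandardScaler does the same thing (fit on the training set only, then reuse the same statistics for the test set):
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
normalized_train_data = scaler.fit_transform(train_data)   # learns per-feature mean/std
normalized_test_data = scaler.transform(test_data)         # reuses the training statistics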
I am currently building a CNN in tensorflow and I am initialising my weight matrix using a He normal weight initialisation. However, I am unsure how I should initialise my bias values. I am using ReLU as my activation function between each convolutional layer. Is there a standard method to initialising bias values?
# Define approximate Xavier weight initialization (with the ReLU correction described by He)
def xavier_over_two(shape):
    std = np.sqrt(2.0 / (shape[0] * shape[1] * shape[2]))
    return tf.random_normal(shape, stddev=std)

def bias_init(shape):
    return #???
Initializing the biases. It is possible and common to initialize the
biases to be zero, since the asymmetry breaking is provided by the
small random numbers in the weights. For ReLU non-linearities, some
people like to use small constant value such as 0.01 for all biases
because this ensures that all ReLU units fire in the beginning and
therefore obtain and propagate some gradient. However, it is not clear
if this provides a consistent improvement (in fact some results seem
to indicate that this performs worse) and it is more common to simply
use 0 bias initialization.
source: http://cs231n.github.io/neural-networks-2/
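Applied to the bias_init placeholder in the question, that advice amounts to something like the following sketch (zero by default; pass 0.01 if you want the small-constant variant for ReLU):
def bias_init(shape, value=0.0):
    # Returns an initial-value tensor, in the same style as xavier_over_two above.
    return tf.constant(value, shape=shape)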
Be aware of the specific case of the last layer's bias. As Andrej Karpathy explains in his Recipe for Training Neural Networks:
init well. Initialize the final layer weights correctly. E.g. if you are regressing some values that have a mean of 50 then initialize the final bias to 50. If you have an imbalanced dataset of a ratio 1:10 of positives:negatives, set the bias on your logits such that your network predicts probability of 0.1 at initialization. Setting these correctly will speed up convergence and eliminate “hockey stick” loss curves where in the first few iteration your network is basically just learning the bias.
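For the imbalanced-classification example above, the bias that makes a sigmoid/logit output predict 0.1 at initialization is just the log-odds of the prior. A small sketch (prior is the assumed positive-class frequency):
import numpy as np

prior = 0.1                                       # assumed positives:negatives = 1:10
final_bias_init = np.log(prior / (1.0 - prior))   # ~ -2.197, so sigmoid(bias) ~= 0.1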
When using the function tf.nn.fractional_max_pool in TensorFlow, in addition to the output pooled tensor, it also returns a row_pooling_sequence and a col_pooling_sequence, which I presume are used in backpropagation to compute the gradient. This is in contrast to normal $2 \times 2$ max pooling, which just returns the pooled tensor.
My question is: do we have to handle the row_pooling and col_pooling values ourselves? How would we include them into a network to get backpropagation working properly? I modified a simple convolutional neural network to use fractional max pooling instead of 2 x 2 max pooling without making use of these values and the results were much poorer, leading me to believe we must explicitly handle these.
Here's the relevant portion of my code that makes use of the FMP:
def add_layer_ops_FMP(conv_func, x_input, W, keep_prob_layer, training_phase):
    h_conv = conv_func(x_input, W, stride_l = 1)
    h_BN = batch_norm(h_conv, training_phase, epsilon)
    h_elu = tf.nn.elu(h_BN)  # Rectified unit layer - change accordingly

    def dropout_no_training(h_elu=h_elu):
        return dropout_op(h_elu, keep_prob = 1.0)

    def dropout_in_training(h_elu=h_elu, keep_prob_layer=keep_prob_layer):
        return dropout_op(h_elu, keep_prob = keep_prob_layer)

    h_drop = tf.cond(training_phase, dropout_in_training, dropout_no_training)
    h_pool, row_pooling_sequence, col_pooling_sequence = tf.nn.fractional_max_pool(h_drop)  # FMP layer. See Ben Graham's paper
    return h_pool
Link to function on github.
Do we need to handle row_pooling_sequence and col_pooling_sequence?
Even though the tf.nn.fractional_max_pool documentation says it returns 2 extra tensors which are needed to calculate the gradient, I believe we do not need to handle these 2 extra tensors specially or add them into the gradient calculation operation. The backpropagation of tf.nn.fractional_max_pool in TensorFlow is already registered into the gradient calculation flow by the _FractionalMaxPoolGrad function. As you can see in _FractionalMaxPoolGrad, the row_pooling_sequence and col_pooling_sequence are extracted by op.outputs[1] and op.outputs[2] and used to calculate the gradient.
@ops.RegisterGradient("FractionalMaxPool")
def _FractionalMaxPoolGrad(op, grad_0, unused_grad_1, unused_grad_2):
  """..."""
  return gen_nn_ops._fractional_max_pool_grad(op.inputs[0], op.outputs[0],
                                              grad_0, op.outputs[1],
                                              op.outputs[2],
                                              op.get_attr("overlapping"))
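So in practice you can simply keep the pooled tensor and discard the two sequences, e.g. (a sketch using the question's h_drop; the pooling_ratio values are just an example):
pooled, _, _ = tf.nn.fractional_max_pool(h_drop,
                                         pooling_ratio=[1.0, 1.44, 1.44, 1.0],
                                         pseudo_random=True)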
Possible reasons for poorer performance after using fractional_max_pool (in my personal opinion).
In the fractional max pooling paper, the author used fractional max pooling in a spatially-sparse convolutional network. According to his spatially-sparse convolutional network design, he actually extended the input image's spatial size by padding zeros. Additionally, fractional max pooling downsizes the input by a factor of pooling_ratio, which is often less than 2. These two together allow stacking more convolutional layers than regular max pooling does, and hence building a deeper network. (For example, with the CIFAR-10 dataset the (non-padded) input spatial size is 32x32; the spatial size drops to 4x4 after 3 convolutional layers and 3 max pooling operations, whereas with fractional max pooling at pooling_ratio=1.4 it only drops to 4x4 after 6 convolutional and 6 fractional max pooling layers.)
I experimented with building a CNN with 2 conv layers + 2 pooling layers (regular max pool vs. fractional max pool with pooling_ratio=1.47) + 2 fully-connected layers on the MNIST dataset. The one using regular max pooling produced better performance than the one using fractional max pooling (about a 15~20% drop for the latter). Comparing the spatial size before feeding into the fully-connected layers, the model with regular max pooling has a spatial size of 7x7, while the one with fractional max pooling has a spatial size of 12x12. Adding one more conv + fractional_max_pool to the latter model (final spatial size dropping to 8x8) brought its performance up to a level more comparable to the model with regular max pooling.
In summary, I personally think the good performance in the Fractional Max-Pooling paper is achieved by combining a spatially-sparse CNN with fractional max-pooling and small filters (and network-in-network), which enables building a deep network even when the input image's spatial size is small. Hence, in a regular CNN, simply replacing regular max pooling with fractional max pooling does not necessarily give you better performance.
I can't get TensorFlow RELU activations (neither tf.nn.relu nor tf.nn.relu6) working without NaN values for activations and weights killing my training runs.
I believe I'm following all the right general advice. For example I initialize my weights with
weights = tf.Variable(tf.truncated_normal(w_dims, stddev=0.1))
biases = tf.Variable(tf.constant(0.1 if neuron_fn in [tf.nn.relu, tf.nn.relu6] else 0.0, shape=b_dims))
and use a slow training rate, e.g.,
tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)
But any network of appreciable depth results in NaN for the cost and at least some weights (at least in the summary histograms for them). In fact, the cost is often NaN right from the start (before training).
I seem to have these issues even when I use L2 (about 0.001) regularization, and dropout (about 50%).
Is there some parameter or setting that I should adjust to avoid these issues? I'm at a loss as to where to even begin looking, so any suggestions would be appreciated!
Following He et al. (as suggested in lejlot's comment), initializing the weights of the $l$-th layer from a zero-mean Gaussian distribution with standard deviation $\sqrt{2/n_l}$, where $n_l$ is the flattened length of the input vector, i.e.
stddev=np.sqrt(2 / np.prod(input_tensor.get_shape().as_list()[1:]))
results in weights that generally do not diverge.
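Equivalently, if you'd rather compute it from the weight shape itself (assuming w_dims from the question is a conv kernel shape [height, width, in_channels, out_channels]), the fan-in is the product of all but the last dimension:
fan_in = np.prod(w_dims[:-1])    # height * width * in_channels
weights = tf.Variable(tf.truncated_normal(w_dims, stddev=np.sqrt(2.0 / fan_in)))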
If you use a softmax classifier at the top of your network, try to make the initial weights of the layer just below the softmax very small (e.g. std=1e-4). This makes the initial distribution of outputs of the network very soft (high temperature), and helps ensure that the first few steps of your optimization are not too large and numerically unstable.
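For example (a sketch; hidden_units and num_classes are hypothetical placeholders):
hidden_units, num_classes = 128, 10   # hypothetical layer sizes
# Very small initial weights for the layer feeding the softmax keep the
# initial output distribution close to uniform (high temperature).
W_softmax = tf.Variable(tf.truncated_normal([hidden_units, num_classes], stddev=1e-4))
b_softmax = tf.Variable(tf.zeros([num_classes]))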
Have you tried gradient clipping and/or a smaller learning rate?
Basically, you will need to process your gradients before applying them, as follows (from tf docs, mostly):
# Replace this with what follows
# opt = tf.train.MomentumOptimizer(0.02, momentum=0.5).minimize(cross_entropy_loss)
# Create an optimizer.
opt = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.5)
# Compute the gradients for a list of variables.
grads_and_vars = opt.compute_gradients(cross_entropy_loss, tf.trainable_variables())
# grads_and_vars is a list of tuples (gradient, variable). Do whatever you
# need to the 'gradient' part, for example cap them, etc.
capped_grads_and_vars = [(tf.clip_by_value(gv[0], -5., 5.), gv[1]) for gv in grads_and_vars]
# Ask the optimizer to apply the capped gradients (this returns a training op).
train_op = opt.apply_gradients(capped_grads_and_vars)
Also, the discussion in this question might help.