I have 2 images as input, x1 and x2 and try to use convolution as a similarity measure. The idea is that the learned weights substitute more traditional measure of similarity (cross correlation, NN, ...). Defining my forward function as follows:
def forward(self,x1,x2):
out_conv1a = self.conv1(x1)
out_conv2a = self.conv2(out_conv1a)
out_conv3a = self.conv3(out_conv2a)
out_conv1b = self.conv1(x2)
out_conv2b = self.conv2(out_conv1b)
out_conv3b = self.conv3(out_conv2b)
Now for the similarity measure:
out_cat = torch.cat([out_conv3a, out_conv3b],dim=1)
futher_conv = nn.Conv2d(out_cat)
My question is as follows:
1) Would Depthwise/Separable Convolutions as in the google paper yield any advantage over 2d convolution of the concatenated input. For that matter can convolution be a similarity measure, cross correlation and convolution are very similar.
2) It is my understanding that the groups=2 option in conv2d would provide 2 separate inputs to train weights with, in this case each of the previous networks weights. How are these combined afterwards?
For a basic concept see here.
Using a nn.Conv2d layer you assume weights are trainable parameters. However, if you want to filter one feature map with another, you can dive deeper and use torch.nn.functional.conv2d to explicitly define both input and filter yourself:
out = torch.nn.functional.conv2d(out_conv3a, out_conv3b)
Related
I have implemented Autoencoder using Keras that takes 112*112*3 neurons as input and 100 neurons as the compressed/encoded state. I want to find the neurons out of these 100 that learns the important features. So far i have calculated eigen values(e) and eigen vectors(v) using the following steps. And i found out that around first 30 values of (e) is greater than 0. Does that mean the first 30 modes are the important ones? Is there any other method that could find the important neurons?
Thanks in Advance
x_enc = enc_model.predict(x_train, batch_size=BATCH_SIZE) # shape (3156,100)
x_mean = np.mean(x_enc, axis=0) # shape (100,)
x_stds = np.std(x_enc, axis=0) # shape (100,)
x_cov = np.cov((x_enc - x_mean).T) # shape (100,100)
e, v = np.linalg.eig(x_cov) # shape (100,) and (100,100) respectively
I don't know if the approach you are using will actually give you any useful results since the way the network learns and what it exactly learns aren't known, I suggest you use a different kind of autoencoder, that automatically learns disentangled representations of the data in a latent space, this way you can be sure that all the parameters you find are actually contributing to the representation of your data. check this article
I am confused why dz=da*g'(z)?
as we all know, in forward propagation,a=g(z),after taking the derivative of z, I can get da/dz=g'(z),so dz=da*1/g'(z)?
Thanks!!
From what I remember, in many courses, representations like dZ are a shorter way of writing dJ/dZ and and so on. All derivatives are of the cost with respect to various parameters, activations and weighted sums etc.
The Differential equations come up based on the last layer and then you can build them backwards, the equation as per your last layer can be based on few of the activation functions.
Linear g'(z) = 1 or 1D of 1 vector based on layer dimensions
Sigmoid g'(z) = g(z)*(1-g(z))
Tanh g'(z) = 1 - thanh^2(z)
Relu = 1 if g(z)>0 or else 0
Leaky Relu = 1 if g(z)>0 and whatever leaky relu slope you kept otherwise.
From there you basically have to compute partial gradients for the the previous layers. Check out http://neuralnetworksanddeeplearning.com/chap2.html for a deeper understanding
I am trying to produce a mathematical operation selection nn model, which is based on the scalar input. The operation is selected based on the softmax result which is produce by the nn. Then this operation has to be applied to the scalar input in order to produce the final output. So far I’ve come up with applying argmax and onehot on the softmax output in order to produce a mask which then is applied on the concated values matrix from all the possible operations to be performed (as show in the pseudo code below). The issue is that neither argmax nor onehot appears to be differentiable. I am new to this, so any would be highly appreciated. Thanks in advance.
#perform softmax
logits = tf.matmul(current_input, W) + b
softmax = tf.nn.softmax(logits)
#perform all possible operations on the input
op_1_val = tf_op_1(current_input)
op_2_val = tf_op_2(current_input)
op_3_val = tf_op_2(current_input)
values = tf.concat([op_1_val, op_2_val, op_3_val], 1)
#create a mask
argmax = tf.argmax(softmax, 1)
mask = tf.one_hot(argmax, num_of_operations)
#produce the input, by masking out those operation results which have not been selected
output = values * mask
I believe that this is not possible. This is similar to Hard Attention described in this paper. Hard attention is used in Image captioning to allow the model to focus only on a certain part of the image at each step. Hard attention is not differentiable but there are 2 ways to go around this:
1- Use Reinforcement Learning (RL): RL is made to train models that makes decisions. Even though, the loss function won't back-propagate any gradients to the softmax used for the decision, you can use RL techniques to optimize the decision. For a simplified example, you can consider the loss as penalty, and send to the node, with the maximum value in the softmax layer, a policy gradient proportional to the penalty in order to decrease the score of the decision if it was bad (results in a high loss).
2- Use something like soft attention: instead of picking only one operation, mix them with weights based on the softmax. so instead of:
output = values * mask
Use:
output = values * softmax
Now, the operations will converge down to zero based on how much the softmax will not select them. This is easier to train compared to RL but it won't work if you must completely remove the non-selected operations from the final result (set them to zero completely).
This is another answer that talks about Hard and Soft attention that you may find helpful: https://stackoverflow.com/a/35852153/6938290
I have been playing around with neural networks and their uses in applications lately. Just recently, I came across a tutorial describing a neural network that would learn how to classify handwritten numbers from 0-9 (MNIST). The portion of code from the tutorial that I'm having trouble with is below(https://pythonprogramming.net/tensorflow-neural-network-session-machine-learning-tutorial/)
def neural_network_model(data):
hidden_1_layer = {'weights':tf.Variable(tf.random_normal([784, nodes_hl1])),
'biases':tf.Variable(tf.random_normal([nodes_hl1]))}
hidden_2_layer = {'weights':tf.Variable(tf.random_normal([nodes_hl1, nodes_hl2])),
'biases':tf.Variable(tf.random_normal([nodes_hl2]))}
hidden_3_layer = {'weights':tf.Variable(tf.random_normal([nodes_hl2, nodes_hl3])),
'biases':tf.Variable(tf.random_normal([nodes_hl3]))}
output_layer = {'weights':tf.Variable(tf.random_normal([nodes_hl3, n_classes])),
'biases':tf.Variable(tf.random_normal([n_classes])),}
l1 = tf.add(tf.matmul(data,hidden_1_layer['weights']), hidden_1_layer['biases'])
l1 = tf.nn.relu(l1)
l2 = tf.add(tf.matmul(l1,hidden_2_layer['weights']), hidden_2_layer['biases'])
l2 = tf.nn.relu(l2)
l3 = tf.add(tf.matmul(l2,hidden_3_layer['weights']), hidden_3_layer['biases'])
l3 = tf.nn.relu(l3)
output = tf.matmul(l3,output_layer['weights']) + output_layer['biases']
return output
I have a basic grasp of what is going on. The 3 hidden layers are each a set of nodes which are connected by biases and weights. The final output layer is the result of the neural network. I understand everything here, except the lines of code that include tf.nn.relu(). After looking at TensorFlow's documentation, all it mentions is that the function computes the rectified linear of features(https://www.tensorflow.org/api_docs/python/nn/activation_functions_#relu). I am rather confused as to what this function is performing, and what significance it has in the neural network.
some of the advantages of relu(rectified linear units are)
Less computationally expensive(and hence , better performance)
Some other functions like sigmoid tend to satturate
They have derivates that are easy to calc(remember the training process relies in derivates)
Please check this https://www.quora.com/What-are-the-benefits-of-using-rectified-linear-units-vs-the-typical-sigmoid-activation-function
I'm working on implementing an interface between a TensorFlow basic LSTM that's already been trained and a javascript version that can be run in the browser. The problem is that in all of the literature that I've read LSTMs are modeled as mini-networks (using only connections, nodes and gates) and TensorFlow seems to have a lot more going on.
The two questions that I have are:
Can the TensorFlow model be easily translated into a more conventional neural network structure?
Is there a practical way to map the trainable variables that TensorFlow gives you to this structure?
I can get the 'trainable variables' out of TensorFlow, the issue is that they appear to only have one value for bias per LSTM node, where most of the models I've seen would include several biases for the memory cell, the inputs and the output.
Internally, the LSTMCell class stores the LSTM weights as a one big matrix instead of 8 smaller ones for efficiency purposes. It is quite easy to divide it horizontally and vertically to get to the more conventional representation. However, it might be easier and more efficient if your library does the similar optimization.
Here is the relevant piece of code of the BasicLSTMCell:
concat = linear([inputs, h], 4 * self._num_units, True)
# i = input_gate, j = new_input, f = forget_gate, o = output_gate
i, j, f, o = array_ops.split(1, 4, concat)
The linear function does the matrix multiplication to transform the concatenated input and the previous h state into 4 matrices of [batch_size, self._num_units] shape. The linear transformation uses a single matrix and bias variables that you're referring to in the question. The result is then split into different gates used by the LSTM transformation.
If you'd like to explicitly get the transformations for each gate, you can split that matrix and bias into 4 blocks. It is also quite easy to implement it from scratch using 4 or 8 linear transformations.