Layer-wise linear combination of two neural network models in Pytorch - machine-learning

I have two trained neural network models M1 and M2 in Pytorch.
Both have exactly the same neural network architecture (same # of layers, which we will denote by 'n').
Furthermore, I have a n-dimensional vector 'lambda', where each value 'lambda_i' is in range [0, 1].
What I want to calculate now is the 'layer-wise' linear combination 'M = lambda * M1 + (1 - lambda) * M2'.
Means that the i-th layer in M is calculated as the linear combination of the i-th layer in M1 (using factor 'lambda_n') and the i-th layer in M2 (using factor '1 - lambda_i').
The linear combination shall be done both for 'weight' and 'bias' of the i-th layer.
How can I calculate this in Pytorch ?

Related

Deep Learning Log Likelihood

I am new babie to the Deep Learning field, and I am use log-likelihood method to compare the MSE metrics.Could anyone be able to show how to calculate the following 2 predicted output examples with 3 outputs neurons each. Thanks
yt = [ [1,0,0],[0,0,1]]
yp = [ [0.9, 0.2,0.2], [0.2,0.8,0.3] ]
MSE or Mean Squared Error is simply the expected value of the squared difference between the predicted and the ground truth labels, represented as
\text{MSE}(\hat{\theta}) = E\left[(\hat{\theta} - \theta)^2\right]
where theta is the ground truth labels and theta^hat is the predicted labels
I am not sure what are you referring to exactly, like a theoretical question or a part of code
As a Python implementation
def mean_squared_error(A, B):
return np.square(np.subtract(A,B)).mean()
yt = [[1,0,0],[0,0,1]]
yp = [[0.9, 0.2,0.2], [0.2,0.8,0.3]]
mse = mean_squared_error(yt, yp)
print(mse)
This will give a value of 0.21
If you are using one of the DL frameworks like TensorFlow, then they are already providing the function which calculates the mse loss between tensors
tf.losses.mean_squared_error
where
tf.losses.mean_squared_error(
labels,
predictions,
weights=1.0,
scope=None,
loss_collection=tf.GraphKeys.LOSSES,
reduction=Reduction.SUM_BY_NONZERO_WEIGHTS
)
Args:
labels: The ground truth output tensor, same dimensions as 'predictions'.
predictions: The predicted outputs.
weights: Optional Tensor whose rank is either 0, or the same rank as labels, and must be broadcastable to labels (i.e., all dimensions
must be either 1, or the same as the corresponding losses dimension).
scope: The scope for the operations performed in computing the loss.
loss_collection: collection to which the loss will be added.
reduction: Type of reduction to apply to loss.
Returns:
Weighted loss float Tensor. If reduction is NONE, this has the same
shape as labels; otherwise, it is scalar.

Weighted Training Examples in Tensorflow

Given a set of training examples for training a neural network, we want to give more or less weight to various examples in training. We apply a weight between 0.0 and 1.0 to each example based on some criteria for the "value" (e.g. validity or confidence) of the example. How can this be implemented in Tensorflow, in particular when using tf.nn.sparse_softmax_cross_entropy_with_logits()?
In the most common case where you call tf.nn.sparse_softmax_cross_entropy_with_logits with logits of shape [batch_size, num_classes] and labels of shape [batch_size], the function returns a tensor of shape batch_size. You can multiply this tensor with a weight tensor before reducing them to a single loss value:
weights = tf.placeholder(name="loss_weights", shape=[None], dtype=tf.float32)
loss_per_example = tf.nn.sparse_softmax_cross_entropy_with_logits(logits, labels)
loss = tf.reduce_mean(weights * loss_per_example)

Why input is scaled in tf.nn.dropout in tensorflow?

I can't understand why dropout works like this in tensorflow. The blog of CS231n says that, "dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise." Also you can see this from picture(Taken from the same site)
From tensorflow site, With probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0.
Now, why the input element is scaled up by 1/keep_prob? Why not keep the input element as it is with probability and not scale it with 1/keep_prob?
This scaling enables the same network to be used for training (with keep_prob < 1.0) and evaluation (with keep_prob == 1.0). From the Dropout paper:
The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2.
Rather than adding ops to scale down the weights by keep_prob at test time, the TensorFlow implementation adds an op to scale up the weights by 1. / keep_prob at training time. The effect on performance is negligible, and the code is simpler (because we use the same graph and treat keep_prob as a tf.placeholder() that is fed a different value depending on whether we are training or evaluating the network).
Let's say the network had n neurons and we applied dropout rate 1/2
Training phase, we would be left with n/2 neurons. So if you were expecting output x with all the neurons, now you will get on x/2. So for every batch, the network weights are trained according to this x/2
Testing/Inference/Validation phase, we dont apply any dropout so the output is x. So, in this case, the output would be with x and not x/2, which would give you the incorrect result. So what you can do is scale it to x/2 during testing.
Rather than the above scaling specific to Testing phase. What Tensorflow's dropout layer does is that whether it is with dropout or without (Training or testing), it scales the output so that the sum is constant.
Here is a quick experiment to disperse any remaining confusion.
Statistically the weights of a NN-layer follow a distribution that is usually close to normal (but not necessarily), but even in the case when trying to sample a perfect normal distribution in practice, there are always computational errors.
Then consider the following experiment:
DIM = 1_000_000 # set our dims for weights and input
x = np.ones((DIM,1)) # our input vector
#x = np.random.rand(DIM,1)*2-1.0 # or could also be a more realistic normalized input
probs = [1.0, 0.7, 0.5, 0.3] # define dropout probs
W = np.random.normal(size=(DIM,1)) # sample normally distributed weights
print("W-mean = ", W.mean()) # note the mean is not perfect --> sampling error!
# DO THE DRILL
h = defaultdict(list)
for i in range(1000):
for p in probs:
M = np.random.rand(DIM,1)
M = (M < p).astype(int)
Wp = W * M
a = np.dot(Wp.T, x)
h[str(p)].append(a)
for k,v in h.items():
print("For drop-out prob %r the average linear activation is %r (unscaled) and %r (scaled)" % (k, np.mean(v), np.mean(v)/float(k)))
Sample output:
x-mean = 1.0
W-mean = -0.001003985674840264
For drop-out prob '1.0' the average linear activation is -1003.985674840258 (unscaled) and -1003.985674840258 (scaled)
For drop-out prob '0.7' the average linear activation is -700.6128015029908 (unscaled) and -1000.8754307185584 (scaled)
For drop-out prob '0.5' the average linear activation is -512.1602655283492 (unscaled) and -1024.3205310566984 (scaled)
For drop-out prob '0.3' the average linear activation is -303.21194422742315 (unscaled) and -1010.7064807580772 (scaled)
Notice that the unscaled activations diminish due to the statistically imperfect normal distribution.
Can you spot an obvious correlation between the W-mean and the average linear activation means?
If you keep reading in cs231n, the difference between dropout and inverted dropout is explained.
In a network with no dropout, the activations in layer L will be aL. The weights of next layer (L+1) will be learned in such a manner that it receives aL and produces output accordingly. But with a network containing dropout (with keep_prob = p), the weights of L+1 will be learned in such a manner that it receives p*aL and produces output accordingly. Why p*aL? Because the Expected value, E(aL), will be probability_of_keeping(aL)*aL + probability_of_not_keeping(aL)*0 which will be equal to p*aL + (1-p)*0 = p*aL. In the same network, during testing time there will be no dropout. Hence the layer L+1 will receive aL simply. But its weights were trained to expect p*aL as input. Therefore, during testing time you will have to multiply the activations with p. But instead of doing this, you can multiply the activations with 1/p during training only. This is called inverted dropout.
Since we want to leave the forward pass at test time untouched (and tweak our network just during training), tf.nn.dropout directly implements inverted dropout, scaling the values.

How are the following types of neural network-like techniques called?

The neural network applications I've seen always learn the weights of their inputs and use fixed "hidden layers".
But I'm wondering about the following techniques:
1) fixed inputs, but the hidden layers are no longer fixed, in the sense that the functions of the input they compute can be tweaked (learned)
2) fixed inputs, but the hidden layers are no longer fixed, in the sense that although they have clusters which compute fixed functions (multiplication, addition, etc... just like ALUs in a CPU or GPU) of their inputs, the weights of the connections between them and between them and the input can be learned (this should in some ways be equivalent to 1) )
These could be used to model systems for which we know the inputs and the output but not how the input is turned into the output (figuring out what is inside a "black box"). Do such techniques exist and if so, what are they called?
For part (1) of your question, there are a couple of relatively recent techniques that come to mind.
The first one is a type of feedforward layer called "maxout" which computes a piecewise linear output function of its inputs.
Consider a traditional neural network unit with d inputs and a linear transfer function. We can describe the output of this unit as a function of its input z (a vector with d elements) as g(z) = w z, where w is a vector with d weight values.
In a maxout unit, the output of the unit is described as
g(z) = max_k w_k z
where w_k is a vector with d weight values, and there are k such weight vectors [w_1 ... w_k] per unit. Each of the weight vectors in the maxout unit computes some linear function of the input, and the max combines all of these linear functions into a single, convex, piecewise linear function. The individual weight vectors can be learned by the network, so that in effect each linear transform learns to model a specific part of the input (z) space.
You can read more about maxout networks at http://arxiv.org/abs/1302.4389.
The second technique that has recently been developed is the "parametric relu" unit. In this type of unit, all neurons in a network layer compute an output g(z) = max(0, w z) + a min(w z, 0), as compared to the more traditional rectified linear unit, which computes g(z) = max(0, w z). The parameter a is shared across all neurons in a layer in the network and is learned along with the weight vector w.
The prelu technique is described by http://arxiv.org/abs/1502.01852.
Maxout units have been shown to work well for a number of image classification tasks, particularly when combined with dropout to prevent overtraining. It's unclear whether the parametric relu units are extremely useful in modeling images, but the prelu paper gets really great results on what has for a while been considered the benchmark task in image classification.

Neural Networks: Why does the perceptron rule only work for linearly separable data?

I previously asked for an explanation of linearly separable data. Still reading Mitchell's Machine Learning book, I have some trouble understanding why exactly the perceptron rule only works for linearly separable data?
Mitchell defines a perceptron as follows:
That is, it is y is 1 or -1 if the sum of the weighted inputs exceeds some threshold.
Now, the problem is to determine a weight vector that causes the perceptron to produce the correct output (1 or -1) for each of the given training examples. One way of achieving this is through the perceptron rule:
One way to learn an acceptable weight vector is to begin with random
weights, then iteratively apply the perceptron to each training
example, modify- ing the perceptron weights whenever it misclassifies
an example. This process is repeated, iterating through the training
examples as many times as needed until the perceptron classifies all
training examples correctly. Weights are modified at each step
according to the perceptron training rule, which revises the weight wi
associated with input xi according to the rule:
So, my question is: Why does this only work with linearly separable data? Thanks.
Because the dot product of w and x is a linear combination of xs, and you, in fact, split your data into 2 classes using a hyperplane a_1 x_1 + … + a_n x_n > 0
Consider a 2D example: X = (x, y) and W = (a, b) then X * W = a*x + b*y. sgn returns 1 if its argument is greater than 0, that is, for class #1 you have a*x + b*y > 0, which is equivalent to y > -a/b x (assuming b != 0). And this equation is linear and divides a 2D plane into 2 parts.

Resources