Xavier initialisation - machine-learning

I have a global idea of how Xavier initialisation works. But lately I was reading a code that splitted the weights of an LSTM (the weights related to the input and the hidden state) to 4 chunks before applying xavier initialisation.
Here is a part of the code :
def _init_lstm(self, weight):
for w in weight.chunk(4, 0):
init.xavier_uniform_(w)
self._init_lstm(self.lstm.weight_ih_l0)
self._init_lstm(self.lstm.weight_hh_l0)
The only reason I can see behind this is that the fan_in and fan_out in the Xavier Formula will be divided by 4.
Why did he split the weights to 4 chunks?

Related

deep neural network model stops learning after one epoch

I am training a unsupervised NN model and for some reason, after exactly one epoch (80 steps), model stops learning.
]
Do you have any idea why it might happen and what should I do to prevent it?
This is more info about my NN:
I have a deep NN that tries to solve an optimization problem. My loss function is customized and it is my objective function in the optimization problem.
So if my optimization problems is min f(x) ==> loss, now in my DNN loss = f(x). I have 64 input, 64 output, 3 layers in between :
self.l1 = nn.Linear(input_size, hidden_size)
self.relu1 = nn.LeakyReLU()
self.BN1 = nn.BatchNorm1d(hidden_size)
and last layer is:
self.l5 = nn.Linear(hidden_size, output_size)
self.tan5 = nn.Tanh()
self.BN5 = nn.BatchNorm1d(output_size)
to scale my network.
with more layers and nodes(doubles: 8 layers each 200 nodes), I can get a little more progress toward lower error, but again after 100 steps training error becomes flat!
The symptom is that the training loss stops being improved relatively early. Suppose that your problem is learnable at all, there are many reasons for the for this behavior. Following are most relavant:
Improper preprocessing of input: Neural network prefers input with
zero mean. E.g., if the input is all positive, it will restrict the
weights to be updated in the same direction, which may not be
desirable (https://youtu.be/gYpoJMlgyXA).
Therefore, you may want to subtract the mean from all the images (e.g., subtract 127.5 from each of the 3 channels). Scaling to make unit standard deviation in each channel may also be helpful.
Generalization ability of the network: The network is not complicated
or deep enough for the task.
This is very easy to check. You can train the network on just a few
images (says from 3 to 10). The network should be able to overfit the
data and drives the loss to almost 0. If it is not the case, you may
have to add more layers such as using more than 1 Dense layer.
Another good idea is to used pre-trained weights (in applications of Keras documentation). You may adjust the Dense layers at the top to fit with your problem.
Improper weight initialization. Improper weight initialization can
prevent the network from converging (https://youtu.be/gYpoJMlgyXA,
the same video as before).
For the ReLU activation, you may want to use He initialization
instead of the default Glorot initialiation. I find that this may be
necessary sometimes but not always.
Lastly, you can use debugging tools for Keras such as keras-vis, keplr-io, deep-viz-keras. They are very useful to open the blackbox of convolutional networks.
I faced the same problem then I followed the following:
After going through a blog post, I managed to determine that my problem resulted from the encoding of my labels. Originally I had them as one-hot encodings which looked like [[0, 1], [1, 0], [1, 0]] and in the blog post they were in the format [0 1 0 0 1]. Changing my labels to this and using binary crossentropy has gotten my model to work properly. Thanks to Ngoc Anh Huynh and rafaelvalle!

Why is there an activation function in each neural net layer, and not just one in the final layer?

I'm trying to teach myself machine learning and I have a similar question to this.
Is this correct:
For example, if I have an input matrix, where X1, X2 and X3 are three numerical features (e.g. say they are petal length, stem length, flower length, and I'm trying to label whether the sample is a particular flower species or not):
x1 x2 x3 label
5 1 2 yes
3 9 8 no
1 2 3 yes
9 9 9 no
That you take the vector of the first ROW (not column) of the table above to be inputted into the network like this:
i.e. there would be three neurons (1 for each value of the first table row), and then w1,w2 and w3 are randomly selected, then to calculate the first neuron in the next column, you do the multiplication I have described, and you add a randomly selected bias term. This gives the value of that node.
This is done for a set of nodes (i.e. each column actually will have four nodes (three + a bias), for simplicity, i removed the other three nodes from the second column), and then in the last node before the output, there is an activation function to transform the sum into a value (e.g. 0-1 for sigmoid) and that value tells you whether the classification is yes or no.
I'm sorry for how basic this is, I want to really understand the process, and I'm doing it from free resources. So therefore generally, you should select the number of nodes in your network to be a multiple of the number of features, e.g. in this case, it would make sense to write:
from keras.models import Sequential
from keras.models import Dense
model = Sequential()
model.add(Dense(6,input_dim=3,activation='relu'))
model.add(Dense(6,input_dim=3,activation='relu'))
model.add(Dense(3,activation='softmax'))
What I don't understand is why the keras model has an activation function in each layer of the network and not just at the end, which is why I'm wondering if my understanding is correct/why I added the picture.
Edit 1: Just a note I saw that in the bias neuron, I put on the edge 'b=1', that might be confusing, I know the bias doesn't have a weight, so that was just a reminder to myself that the weight of the bias node is 1.
Several issues here apart from the question in your title, but since this is not the time & place for full tutorials, I'll limit the discussion to some of your points, taking also into account that at least one more answer already exists.
So therefore generally, you should select the number of nodes in your network to be a multiple of the number of features,
No.
The number of features is passed in the input_dim argument, which is set only for the first layer of the model; the number of inputs for every layer except the first one is simply the number of outputs of the previous one. The Keras model you have written is not valid, and it will produce an error, since for your 2nd layer you ask for input_dim=3, while the previous one has clearly 6 outputs (nodes).
Beyond this input_dim argument, there is no other relationship whatsoever between the number of data features and the number of network nodes; and since it seems you have in mind the iris data (4 features), here is a simple reproducible example of applying a Keras model to them.
What is somewhat hidden in the Keras sequential API (which you use here) is that there is in fact an implicit input layer, and the number of its nodes is the dimensionality of the input; see own answer in Keras Sequential model input layer for details.
So, the model you have drawn in your pad actually corresponds to the following Keras model written using the sequential API:
model = Sequential()
model.add(Dense(1,input_dim=3,activation='linear'))
where in the functional API it would be written as:
inputs = Input(shape=(3,))
outputs = Dense(1, activation='linear')(inputs)
model = Model(inputs, outputs)
and that's all, i.e. it is actually just linear regression.
I know the bias doesn't have a weight
The bias does have a weight. Again, the useful analogy is with the constant term of linear (or logistic) regression: the bias "input" itself is always 1, and its corresponding coefficient (weight) is learned through the fitting process.
why the keras model has an activation function in each layer of the network and not just at the end
I trust this has been covered sufficiently in the other answer.
I'm sorry for how basic this is, I want to really understand the process, and I'm doing it from free resources.
We all did; no excuse though to not benefit from Andrew Ng's free & excellent Machine Learning MOOC at Coursera.
It seems your question is why there is a activation function for each layer instead of just the last layer. The simple answer is, if there are no non-linear activations in the middle, no matter how deep your network is, it can be boiled down to a single linear equation. Therefore, non-linear activation is one of the big enablers that enable deep networks to be actually "deep" and learn high-level features.
Take the following example, say you have 3 layer neural network without any non-linear activations in the middle, but a final softmax layer. The weights and biases for these layers are (W1, b1), (W2, b2) and (W3, b3). Then you can write the network's final output as follows.
h1 = W1.x + b1
h2 = W2.h1 + b2
h3 = Softmax(W3.h2 + b3)
Let's do some manipulations. We'll simply replace h3 as a function of x,
h3 = Softmax(W3.(W2.(W1.x + b1) + b2) + b3)
h3 = Softmax((W3.W2.W1) x + (W3.W2.b1 + W3.b2 + b3))
In other words, h3 is in the following format.
h3 = Softmax(W.x + b)
So, without the non-linear activations, our 3-layer networks has been squashed to a single layer network. That's is why non-linear activations are important.
Imagine, you have an activation layer only in the last layer (In your case, sigmoid. It can be something else too.. say softmax). The purpose of this is to convert real values to a 0 to 1 range for a classification sort of answer. But, the activation in the inner layers (hidden layers) has a different purpose altogether. This is to introduce nonlinearity. Without the activation (say ReLu, tanh etc.), what you get is a linear function. And how many ever, hidden layers you have, you still end up with a linear function. And finally, you convert this into a nonlinear function in the last layer. This might work in some simple nonlinear problems, but will not be able to capture a complex nonlinear function.
Each hidden unit (in each layer) comprises of activation function to incorporate nonlinearity.

Deeplearning4j vs R results differ

I believe that deeplearning4j and R with exactly the same parameters should perform the same, comparable MSE. But I am not sure how to achieve that.
I have a csv file with the following format, which contains 46 variables and 2 outputs. Totally there are 1,0000 samples. All the data is normalized and the model is for regression analysis.
S1 | S2 | ... | S46 | X | Y
In R, I use neuralnet package, and the code is:
rn <- colnames(traindata)
f <- as.formula(paste("X + Y ~", paste(rn[1:(length(rn)-2)], collapse="+")))
nn <- neuralnet(f,
rep=1,
data=traindata,
hidden=c(10),
linear.output=T,
threshold = 0.5)
which is quite straightforward.
As I want to integrate the algorithm into Java project, so I consider dl4j to train the model. The trainset is exactly the same as that in R code. The test set is randomly selected. THe dl4j code is:
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.seed(rngSeed) //include a random seed for reproducibility
// use stochastic gradient descent as an optimization algorithm
.optimizationAlgo(OptimizationAlgorithm.STOCHASTIC_GRADIENT_DESCENT)
.iterations(100)
.learningRate(0.0001) //specify the learning rate
.updater(Updater.NESTEROVS).momentum(0.9) //specify the rate of change of the learning rate.
.regularization(true).l2(0.0001)
.list()
.layer(0, new DenseLayer.Builder() //create the first, input layer with xavier initialization
.nIn(46)
.nOut(10)
.activation(Activation.TANH)
.weightInit(WeightInit.XAVIER)
.build())
.layer(1, new OutputLayer.Builder(LossFunctions.LossFunction.MSE) //create hidden layer
.nIn(10)
.nOut(outputNum)
.activation(Activation.IDENTITY)
.build())
.pretrain(false).backprop(true) //use backpropagation to adjust weights
.build();
The number of epoch is 10 and batchsize is 128.
Using the test set, the performance of R is
and the performance of dl4j is the following, I think it does not work out its full potential.
The mornitor of dl4j is
As there is much more parameters in dl4j such as updater, regulization and weightInit. So I think some of the parameters are not properly set. BTW, why there are periodic thorns in the mornitor graph.
Can any one help?
Most neural net training happens in minibatches. Deeplearning4j assumes you aren't doing toy problems by default (all data in memory < 10 examples etc)
The neural net config has a function called minibatch you should look for.
Set minibatch to false on the configuration and you should get the same results.
If you're wondering why this happens, it's because minibatch learning is different from doing everything in memory. Minibatch learning automatically divides the gradient by the minibatch size. When you do everything in memory you don't want that.
Take note of this when you're running other experiments.
For more on this see:
https://deeplearning4j.org/toyproblems

Translating a TensorFlow LSTM into synapticjs

I'm working on implementing an interface between a TensorFlow basic LSTM that's already been trained and a javascript version that can be run in the browser. The problem is that in all of the literature that I've read LSTMs are modeled as mini-networks (using only connections, nodes and gates) and TensorFlow seems to have a lot more going on.
The two questions that I have are:
Can the TensorFlow model be easily translated into a more conventional neural network structure?
Is there a practical way to map the trainable variables that TensorFlow gives you to this structure?
I can get the 'trainable variables' out of TensorFlow, the issue is that they appear to only have one value for bias per LSTM node, where most of the models I've seen would include several biases for the memory cell, the inputs and the output.
Internally, the LSTMCell class stores the LSTM weights as a one big matrix instead of 8 smaller ones for efficiency purposes. It is quite easy to divide it horizontally and vertically to get to the more conventional representation. However, it might be easier and more efficient if your library does the similar optimization.
Here is the relevant piece of code of the BasicLSTMCell:
concat = linear([inputs, h], 4 * self._num_units, True)
# i = input_gate, j = new_input, f = forget_gate, o = output_gate
i, j, f, o = array_ops.split(1, 4, concat)
The linear function does the matrix multiplication to transform the concatenated input and the previous h state into 4 matrices of [batch_size, self._num_units] shape. The linear transformation uses a single matrix and bias variables that you're referring to in the question. The result is then split into different gates used by the LSTM transformation.
If you'd like to explicitly get the transformations for each gate, you can split that matrix and bias into 4 blocks. It is also quite easy to implement it from scratch using 4 or 8 linear transformations.

How are the following types of neural network-like techniques called?

The neural network applications I've seen always learn the weights of their inputs and use fixed "hidden layers".
But I'm wondering about the following techniques:
1) fixed inputs, but the hidden layers are no longer fixed, in the sense that the functions of the input they compute can be tweaked (learned)
2) fixed inputs, but the hidden layers are no longer fixed, in the sense that although they have clusters which compute fixed functions (multiplication, addition, etc... just like ALUs in a CPU or GPU) of their inputs, the weights of the connections between them and between them and the input can be learned (this should in some ways be equivalent to 1) )
These could be used to model systems for which we know the inputs and the output but not how the input is turned into the output (figuring out what is inside a "black box"). Do such techniques exist and if so, what are they called?
For part (1) of your question, there are a couple of relatively recent techniques that come to mind.
The first one is a type of feedforward layer called "maxout" which computes a piecewise linear output function of its inputs.
Consider a traditional neural network unit with d inputs and a linear transfer function. We can describe the output of this unit as a function of its input z (a vector with d elements) as g(z) = w z, where w is a vector with d weight values.
In a maxout unit, the output of the unit is described as
g(z) = max_k w_k z
where w_k is a vector with d weight values, and there are k such weight vectors [w_1 ... w_k] per unit. Each of the weight vectors in the maxout unit computes some linear function of the input, and the max combines all of these linear functions into a single, convex, piecewise linear function. The individual weight vectors can be learned by the network, so that in effect each linear transform learns to model a specific part of the input (z) space.
You can read more about maxout networks at http://arxiv.org/abs/1302.4389.
The second technique that has recently been developed is the "parametric relu" unit. In this type of unit, all neurons in a network layer compute an output g(z) = max(0, w z) + a min(w z, 0), as compared to the more traditional rectified linear unit, which computes g(z) = max(0, w z). The parameter a is shared across all neurons in a layer in the network and is learned along with the weight vector w.
The prelu technique is described by http://arxiv.org/abs/1502.01852.
Maxout units have been shown to work well for a number of image classification tasks, particularly when combined with dropout to prevent overtraining. It's unclear whether the parametric relu units are extremely useful in modeling images, but the prelu paper gets really great results on what has for a while been considered the benchmark task in image classification.

Resources