How to train and fine-tune fully unsupervised deep neural networks? - machine-learning

In scenario 1, I had a multi-layer sparse autoencoder that tries to reproduce my input, so all my layers are trained together with random-initiated weights. Without a supervised layer, on my data this didn't learn any relevant information (the code works fine, verified as I've already used it in many other deep neural network problems)
In scenario 2, I simply train multiple auto-encoders in a greedy layer-wise training similar to that of deep learning (but without a supervised step in the end), each layer on the output of the hidden layer of the previous autoencoder. They'll now learn some patterns (as I see from the visualized weights) separately, but not awesome, as I'd expect it from single layer AEs.
So I've decided to try if now the pretrained layers connected into 1 multi-layer AE could perform better than the random-initialized version. As you see this is same as the idea of the fine-tuning step in deep neural networks.
But during my fine-tuning, instead of improvement, the neurons of all the layers seem to quickly converge towards an all-the-same pattern and end up learning nothing.
Question: What's the best configuration to train a fully unsupervised multi-layer reconstructive neural network? Layer-wise first and then some sort of fine tuning? Why is my configuration not working?

After some tests I've came up with a method that seems to give very good results, and as you'd expect from a 'fine-tuning' it improves the performance of all the layers:
Just like normally, during the greedy layer-wise learning phase, each new autoencoder tries to reconstruct the activations of the previous autoencoder's hidden layer. However, the last autoencoder (that will be the last layer of our multi-layer autoencoder during fine-tuning) is different, this one will use the activations of the previous layer and tries to reconstruct the 'global' input (ie the original input that was fed to the first layer).
This way when I connect all the layers and train them together, the multi-layer autoencoder will really reconstruct the original image in the final output. I found a huge improvement in the features learned, even without a supervised step.
I don't know if this is supposed to somehow correspond with standard implementations but I haven't found this trick anywhere before.

Related

In what circumstances might using biases in a neural network not be beneficial?

I am currently looking through Michael Nielsen's ebook Neural Networks and Deep Learning and have run the code found at the end of chapter 1 which trains a neural network to recognize hand-written digits (with a slight modification to make the backpropagation algorithm over a mini-batch matrix-based).
However, having run this code and achieving a classification accuracy of just under 94%, I decided to remove the use of biases from the network. After re-training the modified network, I found no difference in classification accuracy!
NB: The output layer of this network contains ten neurons; if the ith of these neurons has the highest activation then the input is classified as being the digit i.
This got me wondering why it is necessary to use biases in a neural network, rather than just weights, and what differentiates between a task where biases will improve the performance of a network and a task where they will not?
My code can be found here: https://github.com/pipthagoras/neural-network-1
Biases are used to account for the fact that your underlying data might not be centered. It is clearer to see in the case of a linear regression.
If you do a regression without an intercept (or bias), you are forcing the underlying model to pass through the origin, which will result in a poor model if the underlying data is not centered (for example if the true generating process is Y=3000). If, on the other hand, your data is centered or close to centered, then eliminating bias is good, since you won't introduce a term that is, in fact, independent to your predictive variable (it's like selecting a simpler model, which will tend to generalize better PROVIDED that it actually reflects the underlying data).

When making prediction with a trained neural net, does input have to run through all layers?

When you have an input that you want to make a prediction on, does the input have to be run through the entire neural net?
Yes, that is how current neural networks work. The only exception are some state of the art networks such as CNNs with Adaptive Inference Graphs.
Yes. In most Neural Nets, each layer except the final one will generally produce outputs that do not correspond to output categories. Instead, these intermediate layers extract features and find patterns in the output from the previous layer. Only the final layer looks at all these high-level features (produced by the combination of all previous layers) to decide how the input should be categorised (example).

Why do neural networks work so well?

I understand all the computational steps of training a neural network with gradient descent using forwardprop and backprop, but I'm trying to wrap my head around why they work so much better than logistic regression.
For now all I can think of is:
A) the neural network can learn it's own parameters
B) there are many more weights than simple logistic regression thus allowing for more complex hypotheses
Can someone explain why a neural network works so well in general? I am a relative beginner.
Neural Networks can have a large number of free parameters (the weights and biases between interconnected units) and this gives them the flexibility to fit highly complex data (when trained correctly) that other models are too simple to fit. This model complexity brings with it the problems of training such a complex network and ensuring the resultant model generalises to the examples it’s trained on (typically neural networks require large volumes of training data, that other models don't).
Classically logistic regression has been limited to binary classification using a linear classifier (although multi-class classification can easily be achieved with one-vs-all, one-vs-one approaches etc. and there are kernalised variants of logistic regression that allow for non-linear classification tasks). In general therefore, logistic regression is typically applied to more simple, linearly-separable classification tasks, where small amounts of training data are available.
Models such as logistic regression and linear regression can be thought of as simple multi-layer perceptrons (check out this site for one explanation of how).
To conclude, it’s the model complexity that allows neural nets to solve more complex classification tasks, and to have a broader application (particularly when applied to raw data such as image pixel intensities etc.), but their complexity means that large volumes of training data are required and training them can be a difficult task.
Recently Dr. Naftali Tishby's idea of Information Bottleneck to explain the effectiveness of deep neural networks is making the rounds in the academic circles.
His video explaining the idea (link below) can be rather dense so I'll try to give the distilled/general form of the core idea to help build intuition
https://www.youtube.com/watch?v=XL07WEc2TRI
To ground your thinking, vizualize the MNIST task of classifying the digit in the image. For this, I am only talking about simple fully-connected neural networks (not Convolutional NN as is typically used for MNIST)
The input to a NN contains information about the output hidden inside of it. Some function is needed to transform the input to the output form. Pretty obvious.
The key difference in thinking needed to build better intuition is to think of the input as a signal with "information" in it (I won't go into information theory here). Some of this information is relevant for the task at hand (predicting the output). Think of the output as also a signal with a certain amount of "information". The neural network tries to "successively refine" and compress the input signal's information to match the desired output signal. Think of each layer as cutting away at the unneccessary parts of the input information, and
keeping and/or transforming the output information along the way through the network.
The fully-connected neural network will transform the input information into a form in the final hidden layer, such that it is linearly separable by the output layer.
This is a very high-level and fundamental interpretation of the NN, and I hope it will help you see it clearer. If there are parts you'd like me to clarify, let me know.
There are other essential pieces in Dr.Tishby's work, such as how minibatch noise helps training, and how the weights of a neural network layer can be seen as doing a random walk within the constraints of the problem.
These parts are a little more detailed, and I'd recommend first toying with neural networks and taking a course on Information Theory to help build your understanding.
Consider you have a large dataset and you want to build a binary classification model for that, Now you have two options that you have pointed out
Logistic Regression
Neural Networks ( Consider FFN for now )
Each node in a neural network will be associated with an activation function for example let's choose Sigmoid since Logistic regression also uses sigmoid internally to make decision.
Let's see how the decision of logistic regression looks when applied on the data
See some of the green spots present in the red boundary?
Now let's see the decision boundary of neural network (Forgive me for using a different color)
Why this happens? Why does the decision boundary of neural network is so flexible which gives more accurate results than Logistic regression?
or the question you asked is "Why neural networks works so well ?" is because of it's hidden units or hidden layers and their representation power.
Let me put it this way.
You have a logistic regression model and a Neural network which has say 100 neurons each of Sigmoid activation. Now each neuron will be equivalent to one logistic regression.
Now assume a hundred logistic units trained together to solve one problem versus one logistic regression model. Because of these hidden layers the decision boundary expands and yields better results.
While you are experimenting you can add more number of neurons and see how the decision boundary is changing. A logistic regression is same as a neural network with single neuron.
The above given is just an example. Neural networks can be trained to get very complex decision boundaries
Neural networks allow the person training them to algorithmically discover features, as you pointed out. However, they also allow for very general nonlinearity. If you wish, you can use polynomial terms in logistic regression to achieve some degree of nonlinearity, however, you must decide which terms you will use. That is you must decide a priori which model will work. Neural networks can discover the nonlinear model that is needed.
'Work so well' depends on the concrete scenario. Both of them do essentially the same thing: predicting.
The main difference here is neural network can have hidden nodes for concepts, if it's propperly set up (not easy), using these inputs to make the final decission.
Whereas linear regression is based on more obvious facts, and not side effects. A neural network should de able to make more accurate predictions than linear regression.
Neural networks excel at a variety of tasks, but to get an understanding of exactly why, it may be easier to take a particular task like classification and dive deeper.
In simple terms, machine learning techniques learn a function to predict which class a particular input belongs to, depending on past examples. What sets neural nets apart is their ability to construct these functions that can explain even complex patterns in the data. The heart of a neural network is an activation function like Relu, which allows it to draw some basic classification boundaries like:
Example classification boundaries of Relus
By composing hundreds of such Relus together, neural networks can create arbitrarily complex classification boundaries, for example:
Composing classification boundaries
The following article tries to explain the intuition behind how neural networks work: https://medium.com/machine-intelligence-report/how-do-neural-networks-work-57d1ab5337ce
Before you step into neural network see if you have assessed all aspects of normal regression.
Use this as a guide
and even before you discard normal regression - for curved type of dependencies - you should strongly consider kernels with SVM
Neural networks are defined with an objective and loss function. The only process that happens within a neural net is to optimize for the objective function by reducing the loss function or error. The back propagation helps in finding the optimized objective function and reach our output with an output condition.

Training of a CNN using Backpropagation

I have earlier worked in shallow(one or two layered) neural networks, so i have understanding of them, that how they work, and it is quite easy to visualize the derivations for forward and backward pass during the training of them, Currently I am studying about Deep neural networks(More precisely CNN), I have read lots of articles about their training, but still I am unable to understand the big picture of the training of the CNN, because in some cases people using pre- trained layers where convolution weights are extracted using auto-encoders, in some cases random weights were used for convolution, and then using back propagation they train the weights, Can any one help me to give full picture of the training process from input to fully connected layer(Forward Pass) and from fully connected layer to input layer (Backward pass).
Thank You
I'd like to recommend you a very good explanation of how to train a multilayer neural network using backpropagation. This tutorial is the 5th post of a very detailed explanation of how backpropagation works, and it also has Python examples of different types of neural nets to fully understand what's going on.
As a summary of Peter Roelants tutorial, I'll try to explain a little bit what is backpropagation.
As you have already said, there are two ways to initialize a deep NN: with random weights or pre-trained weights. In the case of random weights and for a supervised learning scenario, backpropagation works as following:
Initialize your network parameters randomly.
Feed forward a batch of labeled examples.
Compute the error (given by your loss function) within the desired output and the actual one.
Compute the partial derivative of the output error w.r.t each parameter.
These derivatives are the gradients of the error w.r.t to the network's parameters. In other words, they are telling you how to change the value of the weights in order to get the desired output, instead of the produced one.
Update the weights according to those gradients and the desired learning rate.
Perform another forward pass with different training examples, repeat the following steps until the error stops decreasing.
Starting with random weights is not a problem for the backpropagation algorithm, given enough training data and iterations it will tune the weights until they work for the given task.
I really encourage you to follow the full tutorial I linked, because you'll get a very detalied view of how and why backpropagation works for multi layered neural networks.

Why use a restricted Boltzmann machine rather than a multi-layer perceptron?

I'm trying to understand the difference between a restricted Boltzmann machine (RBM), and a feed-forward neural network (NN). I know that an RBM is a generative model, where the idea is to reconstruct the input, whereas an NN is a discriminative model, where the idea is the predict a label. But what I am unclear about, is why you cannot just use a NN for a generative model? In particular, I am thinking about deep belief networks and multi-layer perceptrons.
Suppose my input to the NN is a set of notes called x, and my output of the NN is a set of nodes y. In a discriminative model, my loss during training would be the difference between y, and the value of y that I want x to produce (e.g. ground truth probabilities for class labels). However, what about if I just made the output have the same number of nodes as the input, and then set the loss to be the difference between x and y? In this way, the network would learn to reconstruct the input, like in an RBM.
So, given that a NN (or a multi-layer perceptron) can be used to train a generative model in this way, why would you use an RBM (or a deep belief network) instead? Or in this case, would they be exactly the same?
You can use a NN for a generative model in exactly the way you describe. This is known as an autoencoder, and these can work quite well. In fact, these are often the building blocks of deep belief networks.
An RBM is a quite different model from a feed-forward neural network. They have connections going both ways (forward and backward) that have a probabilistic / energy interpretation. You'll need to read the details to understand.
A deep belief network (DBN) is just a neural network with many layers. This can be a large NN with layers consisting of a sort of autoencoders, or consist of stacked RBMs. You need special methods, tricks and lots of data for training these deep and large networks. Simple back-propagation suffers from the vanishing gradients problem. But if you do manage to train them, they can be very powerful (encode "higher level" concepts).
Hope this helps to point you in the right directions.

Resources