Machine learning: after training, how exactly does it get a prediction? (OpenCV)

So after you have trained a machine learning algorithm, with its layers, nodes, and weights, how exactly does it go about getting a prediction for an input vector? I am using a multilayer perceptron (neural network).
From what I currently understand, you start with the input vector to be predicted. You send it to the hidden layer(s), where each node adds its bias term to the sum of the products of each input value and the corresponding weight (found in training), then runs the result through the same activation function used in training. This is repeated for each hidden layer, and then again for the output layer. The values of the nodes in the output layer are your prediction(s).
Is this correct?
I got confused when using OpenCV to do this, because the guide says this about the predict function:
If you are using the default cvANN_MLP::SIGMOID_SYM activation
function with the default parameter values fparam1=0 and fparam2=0
then the function used is y = 1.7159*tanh(2/3 * x), so the output
will range from [-1.7159, 1.7159], instead of [0,1].
However, when training it is also stated in the documentation that SIGMOID_SYM uses the activation function:
f(x)= beta*(1-e^{-alpha x})/(1+e^{-alpha x} )
Where alpha and beta are user defined variables.
So, I'm not quite sure what this means. Where does the tanh function come into play? Can anyone clear this up please? Thanks for the time!
The documentation where this is found is here:
reference to the tanh is under function descriptions predict.
reference to activation function is by the S looking graph in the top part of the page.
Since this is a general question, and not code specific, I did not post any code with it.

I would suggest that you read up on the algorithm that you are using or plan to use. To be honest, there is no single definitive algorithm for solving a problem; you have to explore which features you have and what you need.
How an algorithm performs prediction depends entirely on the choice of algorithm. A Support Vector Machine (SVM) learns by fitting a hyperplane in the feature space, using a metric such as distance, and the learnt model is then used for prediction. KNN, on the other hand, predicts using a simple nearest-neighbor measurement.
Please do more work on what exactly you need and read through the research papers to get a proper understanding. There is no magic involved in prediction, just mathematical formulations.
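For the multilayer perceptron in the question specifically, the forward pass described there can be sketched as below. Note that the bias is added once per node, after the weighted sum, not to each input value. The two formulas quoted from the documentation are also the same function, because beta*(1 - e^{-alpha x})/(1 + e^{-alpha x}) = beta*tanh(alpha*x/2); choosing alpha = 4/3 and beta = 1.7159 reproduces 1.7159*tanh(2/3 * x). This is a minimal NumPy sketch, not OpenCV's implementation: the names `sigmoid_sym` and `mlp_predict`, the 3-4-2 layer sizes, and the random weights standing in for trained ones are all assumptions made for illustration.

```python
import numpy as np

def sigmoid_sym(x, alpha, beta):
    """Symmetric sigmoid as quoted in the question."""
    return beta * (1.0 - np.exp(-alpha * x)) / (1.0 + np.exp(-alpha * x))

def mlp_predict(x, layers, alpha, beta):
    """Forward pass; `layers` is a list of (W, b) pairs, one per layer."""
    a = x
    for W, b in layers:
        z = W @ a + b                      # weighted sum plus one bias per node
        a = sigmoid_sym(z, alpha, beta)    # same activation as in training
    return a

# Numerical check of the identity beta*(1-e^{-ax})/(1+e^{-ax}) == beta*tanh(a*x/2)
xs = np.linspace(-5.0, 5.0, 101)
alpha, beta = 4.0 / 3.0, 1.7159            # alpha/2 = 2/3
identity_holds = np.allclose(sigmoid_sym(xs, alpha, beta),
                             beta * np.tanh(2.0 / 3.0 * xs))

# Hypothetical 3-4-2 network; random weights stand in for trained ones
rng = np.random.default_rng(0)
layers = [(rng.normal(size=(4, 3)), rng.normal(size=4)),
          (rng.normal(size=(2, 4)), rng.normal(size=2))]
prediction = mlp_predict(np.array([0.5, -1.0, 2.0]), layers, alpha, beta)
print(identity_holds, prediction)
```

Every output value lies strictly between -beta and beta, which matches the range given in the quoted documentation.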

Related

Can gradient descent itself solve non-linear problem in ANN?

I've recently been studying the theory of neural networks, and I'm a little confused about the roles of gradient descent and the activation function in an ANN.
From what I understand, the activation function is used to turn the model into a non-linear model, so that it can solve problems that are not linearly separable. Gradient descent is the tool that helps the model learn.
So my questions are:
If I use an activation function such as sigmoid for the model, but instead of using gradient descent to improve the model I use the classic perceptron learning rule: Wj = Wj + a*(y-h(x)), where h(x) is the sigmoid function applied to the net input. Can the model learn a non-linearly separable problem?
If I do not include the non-linear activation function in the model, and use just the simple net input: h(x) = w0 + w1*x1 + ... + wj*xj, with gradient descent to improve the model. Can the model learn a non-linearly separable problem?
I'm really confused about which one is the main reason that the model can learn a non-linearly separable problem.
Supervised Learning 101
This is a pretty deep question, so I'm going to review the basics first to make sure we understand each other. In its simplest form, supervised learning, and classification in particular, attempts to learn a function f such that y=f(x), from a set of observations {(x_i,y_i)}. The following problems arise in practice:
You know nothing about f. It could be a polynomial, exponential, or some exotic highly non-linear thing that doesn't even have a proper name in math.
The dataset you're using to learn is just a limited, and potentially noisy, subset of the true data distribution you're trying to learn.
Because of this, any solution you find will have to be approximate. The type of architecture you use determines a family of functions h_w(x), and each value of w represents one function in this family. Note that because there is usually an infinite number of possible w, the family of functions h_w(x) is often infinitely large.
The goal of learning is then to determine which w is most appropriate. This is where gradient descent comes in: it is just an optimisation tool that helps you pick a reasonably good w, and thus select a particular model h(x).
The problem is, the actual function f you are trying to approximate may not be part of the family h_w you decided to pick.
Answering the actual questions
Now that the basics are covered, let's answer your questions:
Putting a non-linear activation function like sigmoid at the output of a single-layer ANN will not help it learn a non-linear function. Indeed, a single-layer ANN is equivalent to linear regression, and adding the sigmoid turns it into logistic regression. Why doesn't it work? Let me try an intuitive explanation: the sigmoid at the output of the single layer is there to squash it to [0,1], so that it can be interpreted as a class membership probability. In short, the sigmoid acts as a differentiable approximation to a hard step function. Our learning procedure relies on this smoothness (a well-behaved gradient is available everywhere), and using a step function would break e.g. gradient descent. This doesn't change the fact that the decision boundary of the model is linear, because the final class decision is taken from the value of sum(w_i*x_i). This is probably not really convincing, so let's illustrate instead using the TensorFlow Playground. Note that the learning rule does not matter here, because the family of functions you're optimising over consists only of linear functions of the input, so you will never learn a non-linear one!
If you drop the sigmoid activation, you're left with a simple linear regression. You don't even project your result back to [0,1], so the output will not be simple to interpret as class probability, but the final result will be the same. See the Playground for a visual proof.
What is needed then?
To learn a non-linearly separable problem, you have several solutions:
Preprocess the input x into x', so that taking x' as an input makes the problem linearly separable. This is only possible if you know the shape that the decision boundary should take, so generally only applicable to very simple problems. In the playground problem, since we're working with a circle, we can add the squares of x1 and x2 to the input. Although our model is linear in its input, an appropriate non-linear transformation of the input has been carefully selected, so we get an excellent fit.
We could try to automatically learn the right representation of the data by adding one or more hidden layers, which will work to extract a good non-linear transformation. It can be proven that a single hidden layer is enough to approximate any continuous function, as long as you make the number of hidden neurons high enough. For our example, we get a good fit using only a few hidden neurons with ReLU activations. Intuitively, the more neurons you add, the more "flexible" the decision boundary can become. People in deep learning have been adding depth rather than width because it can be shown that making the network deeper makes it require fewer neurons overall, even though it makes training more complex.
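The circle example above can be reproduced outside the Playground. The following is a minimal NumPy sketch under stated assumptions: the data set (points uniform in a square, class 1 inside a circle), the helper names `train_logistic` and `accuracy`, and the learning rate and step count are all illustrative choices, not taken from the Playground itself. It trains the same single-layer sigmoid model twice, once on the raw inputs and once with the squares of x1 and x2 appended.

```python
import numpy as np

def train_logistic(X, y, lr=1.0, steps=5000):
    """Single-layer model with a sigmoid output, trained by batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted class-1 probability
        grad = p - y                              # cross-entropy gradient w.r.t. the net input
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def accuracy(X, y, w, b):
    """Fraction of points on the correct side of the linear boundary w.x + b = 0."""
    return float(np.mean(((X @ w + b) > 0) == (y == 1)))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(400, 2))
y = (np.sum(X**2, axis=1) < 0.5).astype(float)    # class 1 inside a circle

acc_raw = accuracy(X, y, *train_logistic(X, y))   # inputs x1, x2 only
X_sq = np.hstack([X, X**2])                        # add x1^2 and x2^2 as inputs
acc_sq = accuracy(X_sq, y, *train_logistic(X_sq, y))
print(acc_raw, acc_sq)
```

The model and the learning procedure are identical in both runs; only the input representation changes, which is the point made in the first solution above.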
Yes, gradient descent is quite capable of solving a non-linear problem. The method works as long as the various transformations are roughly linear within a "delta" of the adjustments. This is why we adjust our learning rates: to stay within the ranges in which linear assumptions are relatively accurate.
Non-linear transformations give us a better separation to implement the ideas "this is boring" and "this is exactly what I'm looking for!" If these functions are smooth, or have a very small quantity of jumps, we can apply our accustomed approximations and iterations to solve the overall system.
Determining the useful operating ranges is not a closed-form computation, by any means; as with much of AI research, it requires experimentation and refinement. The direct answer to your question is that you've asked the wrong entity -- try the choices you've listed, and see which works best for your application.

Sigmoid activation for multi-class classification?

I am implementing a simple neural net from scratch, just for practice. I have got it working fine with sigmoid, tanh and ReLU activations for binary classification problems. I am now attempting to use it for multi-class, mutually exclusive problems. Of course, softmax is the best option for this.
Unfortunately, I have had a lot of trouble understanding how to implement softmax, cross-entropy loss and their derivatives in backprop. Even after asking a couple of questions here and on Cross Validated, I can't get any good guidance.
Before I try to go further with implementing softmax, is it possible to somehow use sigmoid for multi-class problems (I am trying to predict 1 of n characters, which are encoded as one-hot vectors)? And if so, which loss function would be best? I have been using the squared error for all binary classifications.
Your question is about the fundamentals of neural networks, and therefore I strongly suggest you start here (Michael Nielsen's book).
It is a Python-oriented book with graphical, textual and formula-based explanations, which makes it great for beginners. I am confident that you will find this book useful for your understanding. Look at chapters 2 and 3 to address your problems.
Addressing your question about the Sigmoids, it is possible to use it for multiclass predictions, but not recommended. Consider the following facts.
Sigmoids are activation functions of the form 1/(1+exp(-z)), where z is the dot product of the previous hidden layer (or the inputs) and a row of the weight matrix, plus a bias (reminder: z = w_i . x + b, where w_i is the i-th row of the weight matrix). This activation is independent of the other rows of the matrix.
Classification tasks are about categories. Without any prior knowledge, and most of the time even with it, categories have no order-value interpretation; predicting apple instead of orange is no worse than predicting banana instead of nuts. Therefore, one-hot encoding of categories usually performs better than predicting a category number using a single activation function.
To recap, we want an output layer with as many neurons as there are categories, and sigmoids are independent of each other given the previous layer's values. We would also like to predict the most probable category, which implies that we want the activations of the output layer to have the meaning of a probability distribution. But sigmoids are not guaranteed to sum to 1, while softmax activations are.
Using an L2 loss function is also problematic because of the vanishing gradient issue. In short, the derivative of the loss is (sigmoid(z)-y) . sigmoid'(z) (the error times the derivative), which makes this quantity small, even more so when the sigmoid is close to saturation. You can choose cross entropy instead, or a log-loss.
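Since the question mentions trouble implementing softmax, cross-entropy and their derivatives, here is a minimal NumPy sketch of those pieces for a single example. The function names and the example vectors `z` and `y` are illustrative assumptions. It relies on the standard result that, when softmax is followed by cross-entropy with a one-hot target, the gradient with respect to the output-layer net inputs is simply softmax(z) - y, and it checks that result against finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(z, y):
    """Loss for one example; y is a one-hot target, z the output-layer net inputs."""
    return -np.sum(y * np.log(softmax(z)))

def cross_entropy_grad(z, y):
    """Gradient of the loss w.r.t. z: softmax and cross-entropy combine to p - y."""
    return softmax(z) - y

z = np.array([2.0, -1.0, 0.5])     # example net inputs for 3 classes
y = np.array([0.0, 0.0, 1.0])      # one-hot target

# Central finite-difference check of the analytic gradient
eps = 1e-6
numeric = np.array([
    (cross_entropy(z + eps * np.eye(3)[i], y)
     - cross_entropy(z - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])
print(sigmoid(z).sum(), softmax(z).sum())   # sigmoids do not sum to 1, softmax does
print(cross_entropy_grad(z, y), numeric)
```

In backprop, this p - y vector takes the place of the (sigmoid(z)-y) . sigmoid'(z) term at the output layer; the rest of the backward pass is unchanged.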
EDIT:
Corrected phrasing about ordering the categories. To clarify, classification is a general term for many tasks related to what we call today categorical predictions over definite, finite sets of values. As of today, using softmax in deep models to predict these categories in a general "dog/cat/horse" classifier, with one-hot encoding and cross entropy, is very common practice. It is reasonable to use that if the aforementioned is correct. However, there are (many) cases where it doesn't apply, for instance when trying to balance the data. For some tasks, e.g. semantic segmentation, categories can have a meaningful ordering/distance between them (or their embeddings). So please choose the tools for your applications wisely, understanding what they are doing mathematically and what their implications are.
What you ask is a very broad question.
As far as I know, when the number of classes is 2, the softmax function is the same as the sigmoid, so yes, they are related. Cross entropy may be the best loss function.
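The two-class relationship can be written out explicitly. With net inputs $z_1$ and $z_2$ for the two classes (symbols introduced here for illustration), dividing the numerator and denominator by $e^{z_1}$ gives a sigmoid of the difference:

$$\mathrm{softmax}(z)_1 = \frac{e^{z_1}}{e^{z_1} + e^{z_2}} = \frac{1}{1 + e^{-(z_1 - z_2)}} = \sigma(z_1 - z_2)$$

So a two-class softmax output is a sigmoid applied to $z_1 - z_2$, rather than to a single net input.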
For backpropagation, it is not easy to find the formula; there are many ways to derive it. Since frameworks now do this for you (with the help of CUDA), I don't think it is necessary to spend much time on it if you just want to use NNs or CNNs in the future. Trying a framework like Tensorflow or Keras (highly recommended for beginners) may help you.
There are also many other factors, like the gradient descent method, the hyperparameter settings...
Like I said, the topic is very broad. Why not try the machine learning/deep learning courses on Coursera or Stanford's online courses?

Training of a CNN using Backpropagation

I have previously worked with shallow (one- or two-layer) neural networks, so I understand how they work, and it is quite easy to visualize the derivations of the forward and backward passes during their training. Currently I am studying deep neural networks (more precisely, CNNs). I have read lots of articles about their training, but I am still unable to understand the big picture of how a CNN is trained. In some cases people use pre-trained layers, where the convolution weights are extracted using auto-encoders; in other cases random weights are used for the convolutions, and the weights are then trained using backpropagation. Can anyone help by giving me the full picture of the training process, from the input to the fully connected layer (forward pass) and from the fully connected layer back to the input layer (backward pass)?
Thank You
I'd like to recommend a very good explanation of how to train a multilayer neural network using backpropagation. This tutorial is the 5th post of a very detailed explanation of how backpropagation works, and it also has Python examples of different types of neural nets to help you fully understand what's going on.
As a summary of Peter Roelants' tutorial, I'll try to explain a little of what backpropagation is.
As you have already said, there are two ways to initialize a deep NN: with random weights or with pre-trained weights. In the case of random weights and a supervised learning scenario, backpropagation works as follows:
Initialize your network parameters randomly.
Feed forward a batch of labeled examples.
Compute the error (given by your loss function) between the desired output and the actual one.
Compute the partial derivative of the output error w.r.t. each parameter.
These derivatives are the gradients of the error w.r.t. the network's parameters. In other words, they tell you how to change the value of the weights in order to get the desired output instead of the produced one.
Update the weights according to those gradients and the desired learning rate.
Perform another forward pass with different training examples, and repeat the previous steps until the error stops decreasing.
Starting with random weights is not a problem for the backpropagation algorithm, given enough training data and iterations it will tune the weights until they work for the given task.
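The steps listed above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not code from the linked tutorial: it uses a small fully connected network (2 inputs, 8 tanh hidden units, 1 sigmoid output) on the XOR problem rather than a CNN, and the layer sizes, learning rate and iteration count are arbitrary choices. The loop structure is the same for a convolutional network; only the forward and backward computations per layer differ.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
t = np.array([[0.], [1.], [1.], [0.]])               # XOR targets

# Initialise the network parameters randomly
W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
lr, losses = 0.5, []

for step in range(2000):
    # Forward pass on a batch of labelled examples
    h = np.tanh(X @ W1 + b1)
    y = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    # Error between the desired and the actual output (cross-entropy)
    losses.append(float(-np.mean(t * np.log(y) + (1 - t) * np.log(1 - y))))
    # Backward pass: partial derivatives of the error w.r.t. each parameter
    dz2 = (y - t) / len(X)                 # sigmoid + cross-entropy gradient
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (1 - h**2)        # chain rule through tanh
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)
    # Update the weights using the gradients and the learning rate
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

print(losses[0], losses[-1])
```

Here the whole data set is used as a single batch on every iteration; with a larger data set each iteration would draw a different batch, as the last step describes.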
I really encourage you to follow the full tutorial I linked, because you'll get a very detailed view of how and why backpropagation works for multilayer neural networks.

Estimating parameters in multivariate classification

Newbie here typesetting my question, so excuse me if this doesn't work.
I am trying to build a Bayesian classifier for a multivariate classification problem where the input is assumed to have a multivariate normal distribution. I chose to use a discriminant function defined as log(likelihood * prior).
However, from the distribution,
$$f(x \mid \mu, \Sigma) = (2\pi)^{-d/2} \det(\Sigma)^{-1/2} \exp\left[-\tfrac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)\right]$$
I encounter a term involving $-\log\det(S_i)$, where $S_i$ is my sample covariance matrix for a specific class i. Since my input actually represents square image data, my $S_i$ picks up quite a lot of correlation, resulting in $\det(S_i)$ being zero. My discriminant functions then all turn Inf, which is disastrous for me.
I know there must be a lot of things going wrong here. Is anyone willing to help me out?
UPDATE: Can anyone help me get the formula working?
I will not analyze the concept, as it is not very clear to me what you are trying to accomplish here and I do not know the dataset, but regarding the problem with the covariance matrix:
The most obvious solution for data where you need a covariance matrix and its determinant, and computing them is not feasible for numerical reasons, is to use some kind of dimensionality reduction technique to capture the most informative dimensions and simply discard the rest. One such method is Principal Component Analysis (PCA), which, applied to your data and truncated after, for example, 5-20 dimensions, would yield a reduced covariance matrix with a non-zero determinant.
PS. It may be a good idea to post this question on Cross Validated
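A minimal sketch of the PCA suggestion, using NumPy only. The data here is random and purely illustrative (30 samples of 100 "pixels", so the full sample covariance is necessarily singular), and the helper name `pca_reduce` and the choice k=5 are assumptions. The projection is computed from the SVD of the centered data, and the covariance of the reduced data has a finite log-determinant.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto their top-k principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))            # fewer samples than dimensions
S_full = np.cov(X, rowvar=False)          # 100 x 100, rank at most 29
Z = pca_reduce(X, k=5)
S_red = np.cov(Z, rowvar=False)           # 5 x 5 covariance in the reduced space

rank_full = np.linalg.matrix_rank(S_full)
sign, logdet = np.linalg.slogdet(S_red)   # log-determinant without overflow/underflow
print(rank_full, sign, logdet)
```

Working with `slogdet` rather than `det` followed by `log` also avoids underflow when the determinant is tiny but non-zero. In a classifier, the same projection would have to be applied to every class and to new inputs.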
You probably do not have enough data to infer parameters in a space of dimension d. Typically, the way to get around this is to take a MAP estimate as opposed to an ML estimate.
For the multivariate normal, the conjugate prior is a normal-inverse-Wishart distribution. The MAP estimate adds the matrix parameter of the inverse Wishart distribution to the ML covariance matrix estimate and, if chosen correctly, gets rid of the singularity problem.
If you are actually trying to create a classifier for normally distributed data, and not just doing an experiment, then a better way to do this would be with a discriminative method. The decision boundary for a multivariate normal is quadratic, so just use a quadratic kernel in conjunction with an SVM.
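The effect of the MAP-style fix on the log-determinant can be illustrated with a simplified sketch. The assumption here, which is not from the answer, is that the inverse-Wishart matrix parameter is a multiple of the identity, `lam * I`, and that the normalising constants of the exact MAP formula are ignored. The helper name `regularized_logdet`, the value of `lam`, and the random data are all illustrative.

```python
import numpy as np

def regularized_logdet(X, lam):
    """Sign and log-determinant of (S + lam * I), S being the sample covariance of X's rows."""
    S = np.cov(X, rowvar=False)
    # Adding lam * I lifts every eigenvalue by lam, so the sum is positive definite
    return np.linalg.slogdet(S + lam * np.eye(S.shape[0]))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))     # 30 samples in 100 dimensions: S alone is singular
sign, logdet = regularized_logdet(X, lam=0.1)
print(sign, logdet)
```

Because the regularised matrix is positive definite, the sign is positive and the log-determinant is finite, so the discriminant function no longer turns into Inf. How large `lam` should be is a modelling choice that would need tuning on the actual data.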

Machine Learning: Unsupervised Backpropagation

I'm having trouble with some of the concepts in machine learning through neural networks. One of them is backpropagation. In the weight updating equation,
delta_w = a*(t - y)*g'(h)*x
t is the "target output", which would be your class label, or something, in the case of supervised learning. But what would the "target output" be for unsupervised learning?
Can someone kindly provide an example of how you'd use BP in unsupervised learning, specifically for clustering or classification?
Thanks in advance.
The most common thing to do is train an autoencoder, where the desired outputs are equal to the inputs. This makes the network try to learn a representation that best "compresses" the input distribution.
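A minimal sketch of that idea in NumPy, under stated assumptions: the autoencoder is linear with a 2-unit bottleneck (real autoencoders usually add non-linear activations), the toy data is generated to lie near a 2-D subspace of a 4-D space, and the learning rate and step count are arbitrary. The only change from supervised backpropagation is that the error is computed against the input itself.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: 200 points in 4-D that lie close to a 2-D subspace
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 4)) + 0.01 * rng.normal(size=(200, 4))

W_enc = rng.normal(scale=0.1, size=(4, 2))   # encoder: 4 inputs -> 2-unit code
W_dec = rng.normal(scale=0.1, size=(2, 4))   # decoder: code -> 4 outputs
lr, losses = 0.02, []

for step in range(2000):
    code = X @ W_enc                 # compressed representation
    y = code @ W_dec                 # reconstruction of the input
    err = y - X                      # the target t is the input itself
    losses.append(float(np.mean(err**2)))
    grad_y = 2 * err / err.size      # derivative of the mean squared error w.r.t. y
    dW_dec = code.T @ grad_y
    dW_enc = X.T @ (grad_y @ W_dec.T)    # error propagated back through the decoder
    W_enc -= lr * dW_enc
    W_dec -= lr * dW_dec

print(losses[0], losses[-1])
```

After training, `X @ W_enc` gives the learned low-dimensional representation, which could then be passed to a separate clustering step.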
Here's a patent describing a different approach, where the output labels are assigned randomly and then sometimes flipped based on convergence rates. It seems weird to me, but okay.
I'm not familiar with other methods that use backpropagation for clustering or other unsupervised tasks. Clustering approaches with ANNs seem to use other algorithms (example 1, example 2).
I'm not sure which unsupervised machine learning algorithm uses backpropagation specifically; if there is one I haven't heard of it. Can you point to an example?
Backpropagation is used to compute the derivatives of the error function for training an artificial neural network with respect to the weights in the network. It's named as such because the "errors" are "propagating" through the network "backwards". You need it in this case because the final error with respect to the target depends on a function of functions (of functions ... depending on how many layers in your ANN.) The derivatives allow you to then adjust the values to improve the error function, tempered by the learning rate (this is gradient descent).
In unsupervised algorithms, you don't need to do this. For example, in k-Means, where you are trying to minimize the mean squared error (MSE), you can minimize the error directly at each step given the assignments; no gradients needed. In other clustering models, such as a mixture of Gaussians, the expectation-maximization (EM) algorithm is much more powerful and accurate than any gradient-descent based method.
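The k-means case can be sketched as follows. This is a plain implementation of the usual alternating procedure (often called Lloyd's algorithm) in NumPy, with an illustrative two-blob data set; the function name, the random initialisation from data points, and the fixed iteration count are assumptions. Each step minimises the MSE exactly given the other quantity, so no gradient is computed anywhere.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Alternate exact assignment and mean updates; returns centers, labels, MSE history."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]   # start from k data points
    history = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        history.append(float(d2[np.arange(len(X)), labels].mean()))
        # Update step: the mean of a cluster minimises its squared error exactly
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers, labels, history

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(6.0, 1.0, size=(50, 2))])
centers, labels, history = kmeans(X, k=2)
print(history[0], history[-1])
```

The recorded MSE never increases from one iteration to the next, which is the "minimize the error directly at each step" property described above.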
What you might be asking is about unsupervised feature learning and deep learning.
Feature learning is the only unsupervised method I can think of with respect to neural networks and their recent variants. (There is a variant called a mixture of RBMs, analogous to a mixture of Gaussians, and you can build a lot of models based on the two.) Basically, the two models I am familiar with are RBMs (restricted Boltzmann machines) and autoencoders.
Autoencoders (optionally with sparse activations encoded in the optimization function) are just feedforward neural networks that tune their weights in such a way that the output is a reconstruction of the input. Multiple hidden layers can be used, but the weights are then initialized with greedy layer-wise training to get a better starting point. So, to answer the question, the target is the input itself.
RBMs are stochastic networks, usually interpreted as graphical models with restrictions on their connections. In this setting there is no output layer, and the connection between the input and latent layers is bidirectional, as in an undirected graphical model. What an RBM tries to learn is a distribution over the inputs (observed and unobserved variables). Here, too, the answer is that the input is the target.
A mixture of RBMs (analogous to a mixture of Gaussians) can be used for soft clustering, and a KRBM (analogous to k-means) can be used for hard clustering. In effect, this feels like learning multiple non-linear subspaces.
http://deeplearning.net/tutorial/rbm.html
http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial
An alternative approach is to use something like generative backpropagation. In this scenario, you train a neural network by updating the weights AND the input values. The given values are used as the output values, since you can then compute an error value directly. This approach has been used in dimensionality reduction and matrix completion (missing value imputation), among other applications. For more information, see non-linear principal component analysis (NLPCA) and unsupervised backpropagation (UBP), which uses the idea of generative backpropagation. UBP extends NLPCA by introducing a pre-training stage. An implementation of UBP and NLPCA can be found in the waffles machine learning toolkit. The documentation for UBP and NLPCA can be found using the nlpca command.
To use back-propagation for unsupervised learning it is merely necessary to set t, the target output, at each stage of the algorithm to the class for which the average distance to each element of the class before updating is least. In short we always try to train the ANN to place its input into the class whose members are most similar in terms of our input. Because this process is sensitive to input scale it is necessary to first normalize the input data in each dimension by subtracting the average and dividing by the standard deviation for each component in order to calculate the distance in a scale-invariant manner.
The advantage to using a back-prop neural network rather than a simple distance from a center definition of the clusters is that neural networks can allow for more complex and irregular boundaries between clusters.
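One possible reading of the target-assignment step in this answer, sketched in NumPy. The function names, the use of Euclidean distance, and the toy data are assumptions; the answer does not specify them. Inputs are first standardized per dimension, then each input's target is set to the class whose current members are closest to it on average. The sketch assumes every class has at least one member.

```python
import numpy as np

def standardize(X):
    """Subtract the mean and divide by the standard deviation in each dimension."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def assign_targets(X, labels, n_classes):
    """For each row of X, pick the class whose current members are closest on average."""
    Z = standardize(X)
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)   # pairwise distances
    avg = np.stack([dists[:, labels == c].mean(axis=1) for c in range(n_classes)],
                   axis=1)
    return avg.argmin(axis=1)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(10.0, 0.5, size=(20, 2))])
truth = np.array([0] * 20 + [1] * 20)
labels = truth.copy()
labels[:2] = 1          # deliberately mislabel a few points
labels[20:22] = 0
targets = assign_targets(X, labels, n_classes=2)
print(targets)
```

The resulting `targets` would then serve as t in the weight update, and would be recomputed as the network's class assignments change.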

Resources