How to output a symmetric hollow matrix using Deep Learning Model? - machine-learning

I'm working on a project that requires to output a symmetric hollow matrix where diagonal element are all zero using deep learning model. Any reference on related research? or Code is appreciated.

Use whatever you want to output a matrix A (e.g. an MLP or whatever DL architecture makes sense for your problem), and then just make it symmetric and hollow by defining
sym_hollow(A) := (A + A.T) - 2*diag(A)

Related

Can gradient descent itself solve non-linear problem in ANN?

I'm recently studying the theory about neural network. And I'm a little confuse about the role of gradient descent and activation function in ANN.
From what I understand, the activation function is used for transforming the model to non-linear model. So that it can solve the problem that is not linear separable. And the gradient descent is the tool to help model learn.
So my questions are :
If I use an activation function such as sigmoid for the model, but instead of using gradient decent to improve the model, I use classic perceptron learning rule : Wj = Wj + a*(y-h(x)), where the h(x) is the sigmoid function with the net input. Can the model learn the non-linear separable problem ?
If I do not include the non-linear activation function in the model. Just simple net input : h(x) = w0 + w1*x1 + ... + wj*xj. And using gradient decent to improve the model. Can the model learn the non-linear separable problem ?
I'm really confused about this problem, that which one is the main reason that the model can learn non-linear separable problem.
Supervised Learning 101
This is a pretty deep question, so I'm going to review the basics first to make sure we understand each other. In its simplest form, supervised learning, and classification in particular, attempts to learn a function f such that y=f(x), from a set of observations {(x_i,y_i)}. The following problems arise in practice:
You know nothing about f. It could be a polynomial, exponential, or some exotic highly non-linear thing that doesn't even have a proper name in math.
The dataset you're using to learn is just a limited, and potentially noisy, subset of the true data distribution you're trying to learn.
Because of this, any solution you find will have to be approximate. The type of architecture you will use will determine a family of function h_w(x), and each value of w will represent one function in this family. Note that because there is usually an infinite number of possible w, the family of functions h_w(x) are often infinitely large.
The goal of learning will then be to determine which w is most appropriate. This is where gradient descent intervenes: it is just an optimisation tool that helps you pick reasonably good w, and thus select a particular model h(x).
The problem is, the actual f function you are trying to approximate may not be part of the family h_w you decided to pick, and so you are .
Answering the actual questions
Now that the basics are covered, let's answer your questions:
Putting a non-linear activation function like sigmoid at the output of a single layer model ANN will not help it learn a non-linear function. Indeed a single layer ANN is equivalent to linear regression, and adding the sigmoid transforms it into Logistic Regression. Why doesn't it work? Let me try an intuitive explanation: the sigmoid at the output of the single layer is there to squash it to [0,1], so that it can be interpreted as a class membership probability. In short, the sigmoid acts a differentiable approximation to a hard step function. Our learning procedure relies on this smoothness (a well-behaved gradient is available everywhere), and using a step function would break eg. gradient descent. This doesn't change the fact that the decision boundary of the model is linear, because the final class decision is taken from the value of sum(w_i*x_i). This is probably not really convincing, so let's illustrate instead using the Tensorflow Playground. Note that the learning rule does not matter here, because the family of function you're optimising over consist only of linear functions on their input, so you will never learn a non-linear one!
If you drop the sigmoid activation, you're left with a simple linear regression. You don't even project your result back to [0,1], so the output will not be simple to interpret as class probability, but the final result will be the same. See the Playground for a visual proof.
What is needed then?
To learn a non-linearly separable problem, you have several solutions:
Preprocess the input x into x', so that taking x' as an input makes the problem linearly separable. This is only possible if you know the shape that the decision boundary should take, so generally only applicable to very simple problems. In the playground problem, since we're working with a circle, we can add the squares of x1 and x2 to the input. Although our model is linear in its input, an appropriate non-linear transformation of the input has been carefully selected, so we get an excellent fit.
We could try to automatically learn the right representation of the data, by adding one or more hidden layers, which will work to extract a good non-linear transformation. It can be proven that using a single hidden layer is enough to approximate anything as long as make the number of hidden neurons high enough. For our example, we get a good fit using only a few hidden neurons with ReLU activations. Intuitively, the more neurons you add, the more "flexible" the decision boundary can become. People in deep learning have been adding depth rather than width because it can be shown that making the network deeper makes it require less neurons overall, even though it makes training more complex.
Yes, gradient descent is quite capable of solving a non-linear problem. The method works as long as the various transformations are roughly linear within a "delta" of the adjustments. This is why we adjust our learning rates: to stay within the ranges in which linear assumptions are relatively accurate.
Non-linear transformations give us a better separation to implement the ideas "this is boring" and "this is exactly what I'm looking for!" If these functions are smooth, or have a very small quantity of jumps, we can apply our accustomed approximations and iterations to solve the overall system.
Determining the useful operating ranges is not a closed-form computation, by any means; as with much of AI research, it requires experimentation and refinement. The direct answer to your question is that you've asked the wrong entity -- try the choices you've listed, and see which works best for your application.

Why do we omit coefficients of polynomial terms when feature mapping?

I've been studying the Machine Learning course on Coursera by Andrew Ng.
I've solved the programming assignment that is given at the end of the week 3. But there was something that I was wondering and I do not want to skip things without being able to reason them, at least I couldn't.
Why do we omit coefficients of polynomial terms when feature mapping?
I quote from the assignment's PDF (which you can view from the link).
2.2 Feature mapping
One way to fit the data better is to create more features from each data
point. In the provided function mapFeature.m, we will map the features into
all polynomial terms of x 1 and x 2 up to the sixth power.
As you can see coefficients are not included.
Considering I am not good at math I think we found all the polynomial terms doing:
since here we would get coefficients after we take the powers, why do we omit them in the vector?
By coefficients I mean coefficients of the terms
instead of in the vector.

Support Vector Machines understanding

Recently,I have been going through lectures and texts and trying to understand how SVM's enable use to work in higher dimensional space.
In normal logistic regression,we use the features as it is..but in SVM's we use a mapping which helps us attain a non linear decision boundary.
Normally we work directly with features..but with the help of kernel trick we can find relations in data using square of the features..product between them etc..is this correct?
We do this with the help of kernel.
Now..i understand that a polynomial kernel corresponds to a known feature vector..but i am unable to understand what gaussian kernel corresponds to( i am told an infinite dimensional feature vector..but what?)
Also,I am unable to grasp the concept that kernel is a measure of similiarity between training examples..how is this a part of the SVM's working?
I have spent lot of time trying to understand these..but in vain.Any help would be much apppreciated!
Thanks in advance :)
Normally we work directly with features..but with the help of kernel trick we can find relations in data using square of the features..product between them etc..is this correct?
Even using a kernel you still work with features, you can simply exploit more complex relations of these features. Such as in your example - polynomial kernel gives you access to low-degree polynomial relations between features (such as squares, or products of features).
Now..i understand that a polynomial kernel corresponds to a known feature vector..but i am unable to understand what gaussian kernel corresponds to( i am told an infinite dimensional feature vector..but what?)
Gaussian kernel maps your feature vector to the unnormalized Gaussian probability density function. In other words, you map each point onto a space of functions, where your point is now a Gaussian centered in this point (with variance corresponding to the hyperparameter gamma of the gaussian kernel). Kernel is always a dot product between vectors. In particular, in function spaces L2 we define classic dot product as an integral over the product, so
<f,g> = integral (f*g) (x) dx
where f,g are Gaussian distributions.
Luckily, for two Gaussian densities, integral of their product is also a Gaussian, this is why gaussian kernel is so similar to the pdf function of the gaussian distribution.
Also,I am unable to grasp the concept that kernel is a measure of similiarity between training examples..how is this a part of the SVM's working?
As mentioned before, kernel is a dot product, and dot product can be seen as a measure of similarity (it is maximized when two vectors have the same direction). However it does not work the other way around, you cannot use every similarity measure as a kernel, because not every similarity measure is a valid dot product.
just a bit of introduction about svm before i start answering the question. This will help you get overview about svm. Svm task is to find the best margin-maximing hyperplane that best separates the data. We have soft margin representation of svm which is also known as primal form and its equivalent form is dual form of svm. Dual form of svm makes the use of kernel trick.
kernel trick is partially replacing the feature engineering which is the most important step in machine learning when we have datasets that are not linear (eg. datasets in shape of concentric circles).
Now you can transform this dataset from non-linear to linear by both FE and kernel trick. By FE you can square each of the features in this dataset and it will transform into linear dataset and then you can apply techniques like logisitic regression which work best for linear data.
In kernel trick you can use the polynomial kernel whose general form is (a + x_i(transpose)x_j)^d where a and d are constants and d specifies the degree, suppose if the degree is 2 then we can say it is quadratic and likewise. now lets say we apply d =2 now our equation becomes (a + x_i(transpose)x_j)^2. lets the we 2 features in our original dataset (eg. for x_1 vector is [x_11,x__12] and x_2 the vector is [x_21,x_22]) now when we apply polynomial kernel on this we get we get 6-d vectors. Now we have transformed the features from 2-d to 6-d.Now you can see intuitively that higher your dimension of data better would svm work because it will eventually transform that features to a higher space. Infact the best case of svm is if you have high dimensionality, then go for svm.
Now you can see both kernel trick and Feature Engineering solve and transforms the dataset(concentric circle one) but the difference is we are doing FE explicitly but kernel trick comes implicitly with svm. There is also a general purpose kernel know as Radial basis function kernel which can be used when you don't know the kernel in advance.
RBF kernel has a parameter (sigma) if the value of sigma is set to 1, then you get a curve that looks like gaussian curve.
You can consider kernel as a similarity metric only and can interpret if the distance between the points is less, higher will be the similarity.

Machine Learning, After training, how exactly does it get a prediction? opencv

So after you have a machine learning algorithm trained, with your layers, nodes, and weights, how exactly does it go about getting a prediction for an input vector? I am using MultiLayer Perceptron (neural networks).
From what I currently understand, you start with your input vector to be predicted. Then you send it to your hidden layer(s) where it adds your bias term to each data point, then adds the sum of the product of each data point and the weight for each node (found in training), then runs that through the same activation function used in training. Repeat for each hidden layer, then does the same for your output layer. Then each node in the output layer is your prediction(s).
Is this correct?
I got confused when using opencv to do this, because in the guide it says when you use the function predict:
If you are using the default cvANN_MLP::SIGMOID_SYM activation
function with the default parameter values fparam1=0 and fparam2=0
then the function used is y = 1.7159*tanh(2/3 * x), so the output
will range from [-1.7159, 1.7159], instead of [0,1].
However, when training it is also stated in the documentation that SIGMOID_SYM uses the activation function:
f(x)= beta*(1-e^{-alpha x})/(1+e^{-alpha x} )
Where alpha and beta are user defined variables.
So, I'm not quite sure what this means. Where does the tanh function come into play? Can anyone clear this up please? Thanks for the time!
The documentation where this is found is here:
reference to the tanh is under function descriptions predict.
reference to activation function is by the S looking graph in the top part of the page.
Since this is a general question, and not code specific, I did not post any code with it.
I would suggest that you read about appropriate algorithm that your are using or plan to use. To be honest there is no one definite algorithm to solve a problem but you can explore what features you got and what you need.
Regarding how an algorithm performs prediction is totally depended on the choice of algorithm. Support Vector Machine (SVM) performs prediction by fitting hyperplanes on the feature space and using some metric such as distance for learning and than the learnt model is used for prediction. KNN on the other than uses simple nearest neighbor measurement for prediction.
Please do more work on what exactly you need and read through the research papers to get proper understanding. There is not magic involved in prediction but rather mathematical formulations.

Estimating parameters in multivariate classification

Newbie here typesetting my question, so excuse me if this don't work.
I am trying to give a bayesian classifier for a multivariate classification problem where input is assumed to have multivariate normal distribution. I choose to use a discriminant function defined as log(likelihood * prior).
However, from the distribution,
$${f(x \mid\mu,\Sigma) = (2\pi)^{-Nd/2}\det(\Sigma)^{-N/2}exp[(-1/2)(x-\mu)'\Sigma^{-1}(x-\mu)]}$$
i encounter a term -log(det($S_i$)), where $S_i$ is my sample covariance matrix for a specific class i. Since my input actually represents a square image data, my $S_i$ discovers quite some correlation and resulting in det(S_i) being zero. Then my discriminant function all turn Inf, which is disastrous for me.
I know there must be a lot of things go wrong here, anyone willling to help me out?
UPDATE: Anyone can help how to get the formula working?
I do not analyze the concept, as it is not very clear to me what you are trying to accomplish here, and do not know the dataset, but regarding the problem with the covariance matrix:
The most obvious solution for data, where you need a covariance matrix and its determinant, and from numerical reasons it is not feasible is to use some kind of dimensionality reduction technique in order to capture the most informative dimensions and simply discard the rest. One such method is Principal Component Analysis (PCA), which applied to your data and truncated after for example 5-20 dimensions would yield the reduced covariance matrix with non-zero determinant.
PS. It may be a good idea to post this question on Cross Validated
Probably you do not have enough data to infer parameters in a space of dimension d. Typically, the way you would get around this is to take an MAP estimate as opposed to an ML.
For the multivariate normal, this is a normal-inverse-wishart distribution. The MAP estimate adds the matrix parameter of inverse Wishart distribution to the ML covariance matrix estimate and, if chosen correctly, will get rid of the singularity problem.
If you are actually trying to create a classifier for normally distributed data, and not just doing an experiment, then a better way to do this would be with a discriminative method. The decision boundary for a multivariate normal is quadratic, so just use a quadratic kernel in conjunction with an SVM.

Resources