Direct Transcription of nonlinear system with cost function dependent on K matrices returned by time-varying LQR - drake

I'm working on implementing a trajectory optimization algorithm named DIRTREL, which is essentially direct transcription with an added cost function. However, the cost function incorporates the K matrices obtained by linearizing the system around the decision variables (x, u) and applying discrete time-varying LQR. My question is how to express this most efficiently and concisely in Drake. My current approach describes the system symbolically and results in extremely lengthy symbolic expressions (which will only grow with more timesteps) due to the recursive nature of the Riccati difference equation, and I am not sure whether this symbolic approach is even appropriate.
For more details:
1) Specify my system as a LeafSystem.
2) Declare a MathematicalProgram with decision variables x, u.
3) To obtain time-varying linearized dynamics, specify a class that takes in the dynamics and decision variables at a single timestep and returns Jacobians for that timestep with symbolic.Jacobian(args).
4) Add a cost function which takes in the entire trajectory, so all x, u.
5) Inside the cost function:
   a) Obtain linearized matrices A_i, B_i, G_i (G_i for noise) for each timestep by using the class that takes in decision variables and returns Jacobians.
   b) Compute the TVLQR cost (S[n]) with the Riccati difference equations, employing the A_i and B_i matrices and solving for the Ks.
   c) Return a cost for the mathematical program that is essentially a large linear combination of the K matrices.
One side note is I am not sure of the most tractable way to compute an inverse symbolically, but I am most concerned with my methodology and whether this symbolic description is appropriate.

I think there are several details on DIRTREL worth discussing:
The cost-to-go matrix S[n] depends on the linearized dynamics Ai, Bi. I think in DIRTREL you will need to solve a nonlinear optimization problem, which requires the gradient of the cost. So to compute the gradient of your cost, you will need the gradient of S[n], which requires the gradient of Ai, Bi. Since Ai and Bi are themselves gradients of the dynamics function f(x, u), you will need to compute the second-order derivatives of the dynamics.
We had a paper on doing trajectory optimization and optimizing a cost function related to the LQR cost-to-go. DIRTREL made several improvements upon our paper. In our implementation, we treated S as a decision variable too, so our decision variables are x, u, S, and the constraints include both the dynamics constraint x[n+1] = f(x[n], u[n]) and the Riccati equation as a constraint on S. I think DIRTREL's approach scales better because it has fewer decision variables, but I haven't compared the numerical performance of the two approaches.
I am not sure why you need to compute the inverse symbolically. First, what is the inverse you need to compute? And second, Drake supports automatic differentiation, which gives you the gradient as a numerical value. I would recommend doing numerical computation instead of symbolic computation. In numerical optimization you only need the value and gradient of the cost/constraints, so it is usually much more efficient to compute these values numerically than to first derive a symbolic expression and then evaluate it.
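To make the numerical route concrete, here is a minimal NumPy sketch of the discrete-time Riccati backward recursion that produces the K matrices from given A_i, B_i. The function name `tvlqr_gains` and the weights Q, R, Qf are assumptions for illustration and are not part of Drake's API. The gain is obtained with a linear solve, so no explicit inverse is formed.

```python
import numpy as np

def tvlqr_gains(A_list, B_list, Q, R, Qf):
    """Backward Riccati recursion for discrete time-varying LQR.

    A_list[i], B_list[i] are the dynamics Jacobians at knot point i
    (hypothetical inputs; in practice they come from linearizing f(x, u)).
    Returns the gains K[i] and the cost-to-go matrices S[i].
    """
    N = len(A_list)
    S = [None] * (N + 1)
    K = [None] * N
    S[N] = Qf
    for i in reversed(range(N)):
        A, B = A_list[i], B_list[i]
        # K_i = (R + B' S_{i+1} B)^{-1} B' S_{i+1} A, via a linear solve
        K[i] = np.linalg.solve(R + B.T @ S[i + 1] @ B, B.T @ S[i + 1] @ A)
        # S_i = Q + A' S_{i+1} (A - B K_i)
        S[i] = Q + A.T @ S[i + 1] @ (A - B @ K[i])
    return K, S

# Usage on a double integrator with time step 0.1 (the same A, B at every knot).
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
K, S = tvlqr_gains([A] * 20, [B] * 20, np.eye(2), np.eye(1), np.eye(2))
```

This sketch uses plain floats only. The gradients with respect to x and u that the optimizer needs would still have to come from automatic differentiation or a similar mechanism, as suggested above.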

Related

Full-Rank Assumption in Least Squares Estimation (Linear Regression)

In Ordinary Least Squares estimation, the sample matrix X (of shape N_samples x N_features) is assumed to have "full column rank".
This is apparently needed so that the linear regression can be reduced to a simple algebraic equation using the Moore–Penrose inverse. See this section of the Wikipedia article for OLS:
https://en.wikipedia.org/wiki/Ordinary_least_squares#Estimation
In theory this means that if all columns of X (i.e. features) are linearly independent we can make an assumption that makes OLS simple to calculate, correct?
What does this mean in practice?
Does this mean that OLS is not calculable and will result in an error for such input data X? Or will the result just be bad?
Are there any classical datasets for which linear regression fails due to this assumption not being true?
The full rank assumption is only needed if you were to use the inverse (or Cholesky decomposition, or QR, or any other method that is mathematically equivalent to computing the inverse). If you use the Moore-Penrose inverse you will still compute an answer. When the full rank assumption is violated there is no longer a unique answer, i.e. there are many x that minimise
||A*x-b||
The one you will compute with the Moore-Penrose inverse is the x of minimum norm. See here, for example.
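A small NumPy sketch may help illustrate this. The matrix below is a made-up example whose third column is the sum of the first two, so it does not have full column rank. The pseudoinverse still returns a solution, and adding any null-space vector gives another minimiser with the same residual but a larger norm.

```python
import numpy as np

# Rank-deficient design matrix: the third column is the sum of the first two.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 1.0, 3.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])

x_min = np.linalg.pinv(A) @ b          # minimum-norm least-squares solution

# Any vector in the null space of A can be added without changing the residual.
null_vec = np.array([1.0, 1.0, -1.0])  # A @ null_vec == 0
x_other = x_min + 0.5 * null_vec

res_min = np.linalg.norm(A @ x_min - b)
res_other = np.linalg.norm(A @ x_other - b)
print(np.linalg.matrix_rank(A), res_min, res_other)
print(np.linalg.norm(x_min), np.linalg.norm(x_other))
```

Both residuals agree, while `x_min` has the smaller norm of the two solutions.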

Why is inference in Markov Random Fields hard?

I'm studying Markov Random Fields, and, apparently, inference in MRF is hard / computationally expensive. Specifically, Kevin Murphy's book Machine Learning: A Probabilistic Perspective says the following:
"In the first term, we fix y to its observed values; this is sometimes called the clamped term. In the second term, y is free; this is sometimes called the unclamped term or contrastive term. Note that computing the unclamped term requires inference in the model, and this must be done once per gradient step. This makes training undirected graphical models harder than training directed graphical models."
Why are we performing inference here? I understand that we're summing over all y's, which seems expensive, but I don't see where we're actually estimating any parameters. Wikipedia also talks about inference, but only talks about calculating the conditional distribution, and needing to sum over all non-specified nodes.. but.. that's not what we're doing here, is it?
Alternatively, does anyone have good intuition on why inference in MRFs is difficult?
Sources:
Chapter 19 of ML:PP: https://www.cs.ubc.ca/~murphyk/MLbook/pml-print3-ch19.pdf
Specific section seen below
When training your CRF, you want to estimate your parameters, \theta.
In order to do this, you can differentiate your loss function (Equation 19.38) with respect to \theta, set it to 0, and solve for \theta.
You can't analytically solve the equation for \theta if you do this though. You can, however, minimise Equation 19.38 by gradient descent. Since the loss function is convex, it is guaranteed that gradient descent will get you the globally optimal solution when it converges.
Equation 19.41 is the actual gradient which you need to compute in order to be able to do gradient descent. The first term is easy (and computationally cheap) to compute as you are summing up over the observed values of y. However, the second term requires you to do inference. In this term, you are not summing up over the observed values of y as in the first term. Instead, you need the expected value of the potential function under the model's current distribution over y, and computing that expectation (that is, the relevant marginals) is the inference step. It has to be repeated every time the parameters change, which is once per gradient step.
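To see where the cost comes from, here is a toy sketch in plain Python. It is an assumption-laden illustration, not the book's model: an unconditional MRF on a 3-node binary chain with a single parameter theta and one made-up feature (the number of edges whose endpoints agree). The gradient of the log-likelihood is the clamped term minus the unclamped term, and the unclamped term is computed by brute-force enumeration of all labelings.

```python
import itertools
import math

EDGES = [(0, 1), (1, 2)]  # a 3-node chain (hypothetical toy model)

def feature(y):
    """Number of edges whose two endpoints carry the same label."""
    return sum(1 for i, j in EDGES if y[i] == y[j])

def log_likelihood_gradient(theta, y_observed, n_nodes=3):
    """d/dtheta of log p(y_observed) for p(y) proportional to exp(theta * feature(y))."""
    clamped = feature(y_observed)  # y fixed to the data: cheap
    # Unclamped term: expectation of the feature under the model.
    # Brute force over all 2**n_nodes labelings; this is the inference step.
    states = list(itertools.product([0, 1], repeat=n_nodes))
    weights = [math.exp(theta * feature(y)) for y in states]
    Z = sum(weights)
    unclamped = sum(w * feature(y) for w, y in zip(weights, states)) / Z
    return clamped - unclamped

grad_at_zero = log_likelihood_gradient(0.0, (1, 1, 1))
```

The enumeration has 2**n_nodes terms, so it is only feasible for tiny models. For larger graphs the expectation has to come from an exact or approximate inference algorithm, and it must be recomputed for every new value of theta.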

max-margin linear separator using libsvm

I have a set of N data points X with +/- labels for which I'd like to calculate the max-margin linear separator (aka classifier, hyperplane) or fail if no such linear separator exist.
I do not want to avoid overfitting in the context of this question, as I do so elsewhere. So no slack variables ; no cross-validation ; no limits on the number of support vectors ; just find max-margin separator or fail.
How do I use libsvm to do so? I believe you can't give c=0 in C-SVM and you can't give nu=1 in nu-svm.
Related question (which I think didn't provide an answer):
Which of the parameters in LibSVM is the slack variable?
In the case of C-SVM, you should use a linear kernel and a very large C value (or nu = 0.999... for nu-SVM). If you still have slacks with this setting, probably your data is not linearly separable.
Quick explanation: the C-SVM optimization function tries to find the hyperplane having maximum margin and lowest misclassification cost at the same time. The misclassification cost in the C-SVM formulation is defined as the distance from the misclassified point to its correct side of the hyperplane, multiplied by C. If you increase the C value (or the nu value for nu-SVM), every misclassified point becomes so costly that a hyperplane separating the data perfectly is preferred by the optimization function.
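As a rough sketch of the "very large C" recipe, the snippet below uses scikit-learn's `SVC`, which is built on libsvm, instead of the libsvm tools directly. The helper name `hard_margin_separator`, the value C=1e10, and the iteration cap are assumptions chosen for illustration. The "fail" case is detected by checking whether any training point is still misclassified.

```python
import warnings
import numpy as np
from sklearn.svm import SVC

def hard_margin_separator(X, y, C=1e10, max_iter=100000):
    """Approximate a hard-margin linear SVM with a very large C.

    Raises ValueError if the fitted hyperplane still misclassifies a
    training point, which suggests the data is not linearly separable.
    """
    clf = SVC(kernel="linear", C=C, max_iter=max_iter)
    with warnings.catch_warnings():
        # Silence convergence warnings that can appear on non-separable data.
        warnings.simplefilter("ignore")
        clf.fit(X, y)
    if clf.score(X, y) < 1.0:
        raise ValueError("no separating hyperplane found")
    return clf

# Usage on a small separable data set.
X_sep = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_sep = np.array([-1, -1, 1, 1])
clf = hard_margin_separator(X_sep, y_sep)
```

The iteration cap is there because the solver can take a long time with a huge C when the data cannot be separated.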

Query about SVM mapping of input vector? And SVM optimization equation

I have read through a lot of papers and understand the basic concept of a support vector machine at a very high level. You give it a training input vector which has a set of features, and based on how the "optimization function" evaluates this input vector, let's call it x (let's say we're talking about text classification), the text associated with the input vector x is classified into one of two pre-defined classes. This is only in the case of binary classification.
So my first question is through this procedure described above, all the papers say first that this training input vector x is mapped to a higher (maybe infinite) dimensional space. So what does this mapping achieve or why is this required? Lets say the input vector x has 5 features so who decides which "higher dimension" x is going to be mapped to?
Second question is about the following optimization equation:
min (1/2) w^T w + C Σ_{i=1..n} ξ_i
So I understand that w has something to do with the margins of the hyperplane from the support vectors in the graph, and I know that C is some sort of a penalty, but I don't know what it is a penalty for. And also, what does ξ_i represent in this case?
A simple explanation of the second question would be much appreciated as I have not had much luck understanding it by reading technical papers.
When they talk about mapping to a higher-dimensional space, they mean that the kernel accomplishes the same thing as mapping the points to a higher-dimensional space and then taking dot products there. SVMs are fundamentally a linear classifier, but if you use kernels, they're linear in a space that's different from the original data space.
To be concrete, let's talk about the kernel
K(x, y) = (xy + 1)^2 = (xy)^2 + 2xy + 1,
where x and y are each real numbers (one-dimensional). Note that
(x^2, sqrt(2) x, 1) • (y^2, sqrt(2) y, 1) = x^2 y^2 + 2 x y + 1
has the same value. So K(x, y) = phi(x) • phi(y), where phi(a) = (a^2, sqrt(2) a, 1), and doing an SVM with this kernel (the inhomogeneous polynomial kernel of degree 2) is the same as if you first mapped your 1d points into this 3d space and then did a linear kernel.
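The identity is easy to check numerically. The short Python sketch below evaluates the kernel directly and via the explicit feature map phi(a) = (a^2, sqrt(2) a, 1); the sample values 3 and -2 are arbitrary.

```python
import math

def poly_kernel(x, y):
    """Inhomogeneous polynomial kernel of degree 2 on real numbers."""
    return (x * y + 1) ** 2

def phi(a):
    """Explicit feature map into 3-d space for the kernel above."""
    return (a * a, math.sqrt(2) * a, 1.0)

def dot(u, v):
    """Plain dot product of two equal-length tuples."""
    return sum(ui * vi for ui, vi in zip(u, v))

x, y = 3.0, -2.0
k_direct = poly_kernel(x, y)    # (3 * -2 + 1)^2 = 25
k_mapped = dot(phi(x), phi(y))  # 36 - 12 + 1 = 25, up to rounding
```

The kernel needs one multiplication and one squaring, whatever the dimension of the implied feature space. That is why the explicit mapping is never computed in practice.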
The popular Gaussian RBF kernel function is equivalent to mapping your points into an infinite-dimensional Hilbert space.
You're the one who decides what feature space it's mapped into, when you pick a kernel. You don't necessarily need to think about the explicit mapping when you do that, though, and it's important to note that the data is never actually transformed into that high-dimensional space explicitly - then infinite-dimensional points would be hard to represent. :)
The ξ_i are the "slack variables". Without them, SVMs would never be able to account for training sets that aren't linearly separable -- which most real-world datasets aren't. The ξ in some sense are the amount you need to push data points on the wrong side of the margin over to the correct side. C is a parameter that determines how much it costs you to increase the ξ (that's why it's multiplied there).
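Here is a small plain-Python sketch of how the slacks and C enter the objective. It assumes the standard soft-margin constraints y_i (w·x_i + b) >= 1 - ξ_i and ξ_i >= 0, which the question does not spell out. Under them, the smallest feasible slack for a fixed (w, b) is max(0, 1 - y_i (w·x_i + b)). The data points and the choice of w, b are made up.

```python
def slack(w, b, x, y):
    """xi = max(0, 1 - y * (w . x + b)) for a label y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, data, C):
    """0.5 * ||w||^2 + C * (sum of slacks over the (x, y) pairs)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slack_term = sum(slack(w, b, x, y) for x, y in data)
    return margin_term + C * slack_term

w, b = (1.0, 0.0), 0.0
data = [((2.0, 0.0), +1),   # outside the margin: slack 0
        ((0.5, 0.0), +1),   # inside the margin, correct side: slack 0.5
        ((1.0, 0.0), -1)]   # wrong side of the hyperplane: slack 2.0

slacks = [slack(w, b, x, y) for x, y in data]
objective = soft_margin_objective(w, b, data, C=10.0)
```

Raising C makes the two points with non-zero slack more expensive relative to the margin term, which pushes the optimizer toward hyperplanes that classify them correctly even at the price of a narrower margin.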
1) The higher dimension space happens through the kernel mechanism. However, when evaluating the test sample, the higher dimension space need not be explicitly computed. (Clearly this must be the case because we cannot represent infinite dimensions on a computer.) For instance, radial basis function kernels imply infinite dimensional spaces, yet we don't need to map into this infinite dimension space explicitly. We only need to compute, K(x_sv,x_test), where x_sv is one of the support vectors and x_test is the test sample.
The specific higher dimensional space is chosen by the training procedure and parameters, which choose a set of support vectors and their corresponding weights.
2) C is the weight associated with the cost of not being able to classify the training set perfectly. The optimization equation says to trade-off between the two undesirable cases of non-perfect classification and low margin. The ξi variables represent by how much we're unable to classify instance i of the training set, i.e., the training error of instance i.
See Chris Burges' tutorial on SVM's for about the most intuitive explanation you're going to get of this stuff anywhere (IMO).

importance of PCA or SVD in machine learning

For a long time (especially during the Netflix contest), I have kept coming across blog posts and leaderboard forum threads that mention how applying a simple SVD step to the data helped reduce sparsity, or in general improved the performance of the algorithm at hand.
I have been thinking about this for a long time, but I cannot figure out why that is.
In general, the data I get is very noisy (which is also the fun part of big data), and I do know some basic feature scaling techniques such as log transformation and mean normalization.
But how does something like SVD help?
So let's say I have a huge matrix of users rating movies, and on this matrix I implement some version of a recommendation system (say collaborative filtering):
1) Without SVD
2) With SVD
How does it help?
SVD is not used to normalize the data, but to get rid of redundant data, that is, for dimensionality reduction. For example, if you have two variables, one being a humidity index and the other the probability of rain, then their correlation is so high that the second one does not contribute any additional information useful for a classification or regression task. The singular values in the SVD help you determine which directions in the data are most informative, and which ones you can do without.
The way it works is simple. You perform SVD over your training data (call it matrix A), to obtain U, S and V*. Then set to zero all values of S less than a certain arbitrary threshold (e.g. 0.1), call this new matrix S'. Then obtain A' = US'V* and use A' as your new training data. Some of your features are now set to zero and can be removed, sometimes without any performance penalty (depending on your data and the threshold chosen). This is called k-truncated SVD.
SVD doesn't help you with sparsity though, only helps you when features are redundant. Two features can be both sparse and informative (relevant) for a prediction task, so you can't remove either one.
Using SVD, you go from n features to k features, where each one will be a linear combination of the original n. It's a dimensionality reduction step, just like feature selection is. When redundant features are present, though, a feature selection algorithm may lead to better classification performance than SVD depending on your data set (for example, maximum entropy feature selection). Weka comes with a bunch of them.
See: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Singular_Value_Decomposition
https://stats.stackexchange.com/questions/33142/what-happens-when-you-apply-svd-to-a-collaborative-filtering-problem-what-is-th
The Singular Value Decomposition is often used to approximate a matrix X by a low rank matrix X_lr:
Compute the SVD X = U D V^T.
Form the matrix D' by keeping the k largest singular values and setting the others to zero.
Form the matrix X_lr by X_lr = U D' V^T.
The matrix X_lr is then the best approximation of rank k of the matrix X, for the Frobenius norm (the equivalent of the l2-norm for matrices). It is computationally efficient to use this representation, because if your matrix X is n by n and k << n, you can store its low rank approximation with only (2n + 1)k coefficients (by storing U, D' and V).
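The three steps above can be sketched with NumPy as follows. The helper name `low_rank_approx`, the matrix sizes, and the noise level are illustrative assumptions. The test matrix is built to have rank 2 and then perturbed slightly.

```python
import numpy as np

def low_rank_approx(X, k):
    """Best rank-k approximation of X in the Frobenius norm via the SVD."""
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    # Keep the k largest singular values (NumPy returns them in descending order).
    return U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :], d

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 5))  # true rank 2
X_noisy = X + 0.01 * rng.standard_normal(X.shape)

X_lr, d = low_rank_approx(X_noisy, k=2)
error = np.linalg.norm(X_noisy - X_lr)  # Frobenius norm by default for matrices
```

The approximation error equals the square root of the sum of the squared discarded singular values, so inspecting `d` is a cheap way to judge how much is lost for a given k.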
This was often used in matrix completion problems (such as collaborative filtering) because the true matrix of user ratings is assumed to be low rank (or well approximated by a low rank matrix). So, you wish to recover the true matrix by computing the best low rank approximation of your data matrix. However, there are now better ways to recover low rank matrices from noisy and missing observations, namely nuclear norm minimization. See for example the paper The power of convex relaxation: Near-optimal matrix completion by E. Candes and T. Tao.
(Note: the algorithms derived from this technique also store the SVD of the estimated matrix, but it is computed differently).
PCA or SVD, when used for dimensionality reduction, reduces the number of inputs. Besides saving the computational cost of learning and/or predicting, this can sometimes produce more robust models that are not optimal in a statistical sense, but perform better in noisy conditions.
Mathematically, simpler models have less variance, i.e. they are less prone to overfitting. Underfitting, of course, can be a problem too. This is known as the bias-variance dilemma. Or, as said in plain words by Einstein: Things should be made as simple as possible, but not simpler.
