Does the loss function become non-convex when we add polynomial features? - machine-learning

When we use polynomial features in polynomial regression, logistic regression, or SVM, does the loss function become non-convex?

If a loss function is convex for any estimation problem X -> y, then adding a fixed set of polynomial features won't change that. You're simply trading your initial problem for the estimation problem X' -> y, where X' has the additional features.
If you're additionally trying to estimate the parameters for the new feature(s) then it's pretty easy to get a non-convex loss in those dimensions (assuming there are parameters to choose -- if you're just talking about adding a polynomial basis then this doesn't apply).
As some measure of proof, take the example of a 1D estimation problem and choose the feature f(x) = (x-a)^3. Assume your dataset has the single point (0, 0). With a little work you can show that the loss even for linear regression over the new feature is non-convex in places with respect to the parameter a. Note that the loss IS still convex with respect to the new features -- standard linear regression always satisfies that property -- it's the fact that we used linear regression along with a choice of polynomial to build a new non-convex regressor that causes this behavior.
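As a quick numerical check of that claim, here is a minimal sketch; the fixed regression weight w and intercept b are arbitrary values I picked, not part of the original argument:

import numpy as np

w, b = 1.0, 1.0                       # arbitrary fixed weight and intercept
a = np.linspace(-1.5, 1.5, 601)       # grid over the feature parameter a
# Squared loss on the single data point (x, y) = (0, 0) with feature f(x) = (x - a)^3:
loss = (b + w * (0.0 - a) ** 3) ** 2
second_diff = np.diff(loss, 2)        # discrete second derivative; negative values mean non-convex there
print((second_diff < 0).any())        # True: the loss is not convex in a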

Difference between linear regression and gradient descent

I have a school assignment and I chose machine learning as my topic. I'm still in high school, so I don't know much about calculus.
My end goal is to use a machine learning algorithm to predict stock values, but I want to understand what I'm doing rather than copy and analyze existing code that already does what I need.
This isn't really programming-related; it's mostly about the theory. I read through articles on linear regression and watched Stanford's lecture on YouTube, but I don't get it. These are my main confusions:
Are linear regression and gradient descent different algorithms or a set of algorithms used together to predict or classify stuff?
Are y = mx + c and f(x) = θ0 + θ1x the same thing? What can I calculate with this?
This equation is shown in the linear regression part so what exactly does this do?
I will try to answer all three questions you asked.
First, let me classify ML into some categories.
Regression - Predicting a continuous-valued output (for example, stock prediction)
Classification - Predicting a discrete-valued output (for example, spam classification)
Regression can be further divided into linear regression and polynomial regression.
Linear Regression is the simplest one. This is how it works.
Suppose I have some data: house prices plotted against the size of the house. Now I want a straight line that best fits this data, so I will try one candidate line.
Then I will try more and more lines to see which one actually fits the data best. To obtain different lines, I vary the parameters a and b in y = a + bx. This answers your second question: that equation represents a straight line which you are trying to fit to the data (y = mx + c and f(x) = θ0 + θ1x are the same idea written with different symbols).
But how will I decide whether one line is a better fit than another? I will calculate a value that represents the error my line makes in predicting the y values for all the x values in my data. This is called the cost function. I can choose a cost function like this (ignore it if it doesn't make sense):
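For example, the mean squared error: J(a, b) = (1/(2n)) * sum_i (a + b*x_i - y_i)^2, the average squared gap between the line's prediction a + b*x_i and the actual price y_i (the factor of 1/2 just simplifies the derivative later).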
Basically, I want my cost function (that error value) to be as small as possible, and Gradient Descent is one algorithm that can minimize it. Gradient Descent can minimize quite general functions, so it is not exclusive to linear regression, but it is especially popular for linear regression. This answers your first question: linear regression is the model you are fitting, and gradient descent is an optimization algorithm used to fit it; they are different things that are often used together.
The next step is to understand how Gradient Descent works. The heart of the algorithm is an update rule, which is what you asked about in your third question: it is the step that repeatedly adjusts your fitted line (called the hypothesis) so as to minimize the cost function.
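As a minimal sketch of what that update loop looks like for the line y = a + b*x with a mean-squared-error cost (the toy data, learning rate, and iteration count below are arbitrary choices of mine):

import numpy as np

# Toy data: house sizes (x, e.g. in 1000 sq ft) and prices (y); purely illustrative values.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

a, b = 0.0, 0.0          # parameters of the line y = a + b*x
alpha = 0.05             # learning rate (step size)
n = len(x)

for _ in range(5000):
    pred = a + b * x
    # Gradients of the cost J(a, b) = (1/(2n)) * sum((pred - y)^2)
    grad_a = (pred - y).sum() / n
    grad_b = ((pred - y) * x).sum() / n
    # Simultaneous update: move each parameter a small step against its gradient.
    a -= alpha * grad_a
    b -= alpha * grad_b

print(a, b)              # intercept and slope of the fitted line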

Is Logistic Regression sensitive to outliers? Using it on a synthetic 2D dataset

I am currently using sklearn's LogisticRegression to work on a synthetic 2D problem. The dataset is shown below:
I'm basically plugging the data into sklearn's model, and this is what I'm getting (the light green; disregard the dark green):
The code for this is only two lines: model = LogisticRegression(); model.fit(tr_data, tr_labels). I've checked the plotting function; that's fine as well. I'm not using a regularizer (should that affect it?)
It seems really strange to me that the boundaries behave in this way. Intuitively I feel they should be more diagonal, as the data is (mostly) located top-right and bottom-left, and from testing some things out it seems a few stray datapoints are what's causing the boundaries to behave in this manner.
For example, here's another dataset and its boundaries:
Would anyone know what might be causing this? From my understanding Logistic Regression shouldn't be this sensitive to outliers.
Your model is overfitting the data (the decision regions it found do indeed perform better on the training set than the diagonal line you would expect).
The loss is optimal when all the data is classified correctly with probability 1, and the distances to the decision boundary enter into the probability computation. The unregularized algorithm can use large weights to make the transition across the decision boundary very sharp, so in your example it finds an optimal solution in which (some of) the outliers are classified correctly.
With stronger regularization you prevent that, and the distances play a bigger role. Try different values for the inverse regularization strength C, e.g.
model = LogisticRegression(C=0.1)
model.fit(tr_data,tr_labels)
Note: the default value C=1.0 already corresponds to a regularized version of logistic regression.
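For example, a minimal sketch of such a sweep on a made-up dataset that loosely mimics the one in the question (two diagonal blobs plus one stray point; all values are arbitrary):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Two blobs in the bottom-left and top-right, plus one stray point with the "wrong" label.
X, y = make_blobs(n_samples=200, centers=[[-2, -2], [2, 2]], cluster_std=0.7, random_state=0)
X = np.vstack([X, [[-2.5, -2.5]]])
y = np.append(y, 1)

for C in [100.0, 1.0, 0.1, 0.01]:
    model = LogisticRegression(C=C).fit(X, y)
    # Smaller C = stronger regularization = smaller weights = a smoother, more diagonal boundary.
    print(C, model.coef_, model.score(X, y))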
Let us look more closely at why logistic regression overfits here: after all, there are only a few outliers but hundreds of other data points. To see why, it helps to note that the logistic loss is essentially a smoothed version of the hinge loss (used in SVMs).
An SVM does not 'care' about samples on the correct side of the margin at all: as long as they do not cross the margin, they incur zero cost. Since the logistic loss is a smoothed version of the hinge loss, far-away samples do incur a cost, but it is negligible compared to the cost incurred by samples near the decision boundary.
So, unlike e.g. Linear Discriminant Analysis, samples close to the decision boundary have disproportionately more impact on the solution than far-away samples.
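A quick numerical illustration of that point (the margin values are arbitrary; "margin" here means how far a correctly classified sample sits from the boundary in units of the score y*f(x)):

import numpy as np

margins = np.array([0.5, 1.0, 2.0, 5.0, 10.0])   # correctly classified, increasingly far from the boundary
logistic_loss = np.log1p(np.exp(-margins))       # log(1 + e^(-m)): small for far-away samples, but never zero
hinge_loss = np.maximum(0.0, 1.0 - margins)      # exactly zero once the sample clears the margin (m >= 1)
print(np.column_stack([margins, logistic_loss, hinge_loss]))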

Why calculate Jacobians in EKF-SLAM?

I know it is a very basic question, but why do we calculate the Jacobian matrices in EKF-SLAM? I have tried hard to understand this; it probably isn't that hard, but I want to know it. I was wondering if anyone could help me with this.
The Kalman filter operates on linear systems. Its steps update two parts in parallel: the state x and the error covariance P. In a linear system we predict the next x as Fx, and it turns out that you can compute the exact covariance of Fx as FPF^T. In a non-linear system we can update x as f(x), but how do we update P? There are two popular approaches:
In the EKF, we choose a linear approximation of f() at x, and then use the usual method FPF^T.
In the UKF, we build an approximation of the distribution of x with covariance P. The approximation is a set of points called sigma points. Then we propagate those points through the real f(sigma_point) and measure the resulting distribution's covariance.
You are concerned with the EKF (case 1). What is a good linear approximation of a function? If you zoom way in on a curve, it starts to look like a straight line, with a slope that's the derivative of the function at that point. If that sounds strange, look at Taylor series. The multi-variate equivalent is called the Jacobian. So we evaluate the Jacobian of f() at x to give us an F. Now Fx != f(x), but that's okay as long as the changes we're making to x are small (small enough that our approximated F wouldn't change much from before to after).
The main problem with the EKF approximation is that when we use it to update the distributions after the measurement step, it tends to make the resulting covariance P too small. It acts as if corrections 'work' in a perfectly linear way, while the actual update departs slightly from the linear approximation and is not quite as good. These small amounts of overconfidence build up as the filter iterates and have to be offset by adding some fictitious process noise to Q.
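A minimal sketch of the EKF predict step for a toy 2-state nonlinear system (the dynamics f and all noise values below are made up for illustration):

import numpy as np

def f(x):
    # Made-up nonlinear dynamics for illustration.
    return np.array([np.sin(x[0]) + x[1], x[1]])

def jacobian_f(x):
    # Jacobian of f evaluated at x: the EKF's linear approximation F.
    return np.array([[np.cos(x[0]), 1.0],
                     [0.0,          1.0]])

x = np.array([0.3, 0.1])     # state estimate
P = np.diag([0.05, 0.01])    # state covariance
Q = np.diag([1e-3, 1e-4])    # process noise (often inflated to offset linearization error)

# Predict step
F = jacobian_f(x)            # linearize at the current estimate
x = f(x)                     # propagate the state through the real nonlinearity
P = F @ P @ F.T + Q          # propagate the covariance through the linear approximation
print(x, P)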

How to use principal component analysis (PCA) to speed up detection?

I am not sure whether I am applying PCA correctly or not! I have p features and n observations (instances). I put these in an n x p matrix X. I perform mean normalization and get the normalized matrix B. I then calculate the eigenvalues and eigenvectors of the p x p covariance matrix C = (1/(n-1)) B*.B, where * denotes the conjugate transpose.
The eigenvectors, ordered by descending eigenvalue, form a p x p matrix E. Let's say I want to reduce the number of attributes from p to k. I use the equation X_new = B.E_reduced, where E_reduced consists of the first k columns of E. Here are my questions:
1) Should it be X_new=B.E_reduced or X_new=X.E_reduced?
2) Should I repeat the above calculations in the testing phase? If the testing phase works the same way as the training phase, then no speed-up is gained, because I still have to compute all p features for each test instance, and PCA actually makes the algorithm slower due to the overhead of computing the eigenvectors.
3) After applying PCA, I noticed that the accuracy decreased. Is this related to the number k (I set k = p/2) or to the fact that I am using linear PCA instead of kernel PCA? What is the best way to choose k? I read that I can compute the ratio of the sum of the top k eigenvalues to the sum of all eigenvalues and make a decision based on that ratio.
1) You usually apply the multiplication to the centered data (B), so your projected data is also centered.
2) Never re-run PCA during testing. Run it only on the training data, and keep the shift vector and the projection matrix. At test time you apply exactly the same projection as during training; you do not recompute a new projection.
3) Decreased performance can have many causes. For example, did you also apply scaling using the square roots of the eigenvalues? And what method did you use in the first place?
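A minimal sketch of that train/test workflow in plain NumPy (random data stands in for the real features; p = 10 and k = 5 are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 10))      # n x p training data (stand-in)
X_test = rng.normal(size=(50, 10))        # test data with the same p features
k = 5

# Fit PCA on the training data only.
mean = X_train.mean(axis=0)               # the "shift vector"
B = X_train - mean                        # centered training data
C = (B.T @ B) / (len(B) - 1)              # p x p covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)      # eigh because C is symmetric
order = np.argsort(eigvals)[::-1]         # sort eigenvalues in descending order
E_reduced = eigvecs[:, order[:k]]         # first k eigenvectors (p x k)

# Question 1: project the centered data, X_new = B.E_reduced.
X_train_new = B @ E_reduced
# Question 2: at test time, reuse the SAME mean and E_reduced; no eigendecomposition here.
X_test_new = (X_test - mean) @ E_reduced
print(X_train_new.shape, X_test_new.shape)   # (200, 5) (50, 5)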

How many learning curves should I plot for a multi-class logistic regression classifier?

If we have K classes, do I have to plot K learning curves?
Because it seems impossible to me to calculate the train/validation error against all K theta vectors at once.
To clarify, a learning curve is a plot of the training and cross-validation/test set error (or cost) versus training set size. This plot should let you see whether increasing the training set size improves performance. More generally, the learning curve helps you identify whether your algorithm suffers from a bias (underfitting) or a variance (overfitting) problem.
It depends. Learning curves do not concern themselves with the number of classes. Like you said, it is a plot of training set and test set error, where that error is a numerical value. This is all learning curves are.
That error can be anything you want: accuracy, precision, recall, F1 score etc. (even MAE, MSE and others for regression).
However, which error metric does or does not apply depends on your specific problem, and that in turn indirectly affects how you should use learning curves.
Accuracy is well defined for any number of classes, so if you use this, a single plot should suffice.
Precision and recall, however, are defined only for binary problems. You can somewhat generalize them (see here for example) by considering the binary problem with classes x and not x for each class x. In that case, you will probably want to plot learning curves for each class. This will also help you identify problems relating to certain classes better.
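As a minimal sketch with scikit-learn's learning_curve (iris is just a stand-in multi-class dataset, and the model, training sizes, and scorers are arbitrary choices of mine):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import learning_curve

X, y = load_iris(return_X_y=True)          # stand-in multi-class dataset (K = 3)
model = LogisticRegression(max_iter=1000)
sizes = np.linspace(0.2, 1.0, 5)

# One plot for all K classes: an accuracy-based learning curve.
n, train_acc, val_acc = learning_curve(model, X, y, train_sizes=sizes, cv=5,
                                       scoring="accuracy", shuffle=True, random_state=0)
print(n, train_acc.mean(axis=1), val_acc.mean(axis=1))

# One plot per class: e.g. one-vs-rest recall for each class k.
for k in np.unique(y):
    per_class_recall = make_scorer(recall_score, labels=[k], average="macro")
    n, tr, va = learning_curve(model, X, y, train_sizes=sizes, cv=5,
                               scoring=per_class_recall, shuffle=True, random_state=0)
    print(k, va.mean(axis=1))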
If you want to read more about performance metrics, I like this paper a lot.
