based assignment and I chose machine learning as my topic. I'm still in high school so I don't know much about calculus.
My end goal is to try using a machine learning algorithm to predict stock values. But I want to understand what I'm doing, rather than copying and analyzing existing code that already does what I need.
This isn't really programming-related; it mostly concerns the theory. I read through articles on linear regression and watched the lectures that Stanford has on YouTube. But I don't get it. These are my main confusions:
Are linear regression and gradient descent different algorithms or a set of algorithms used together to predict or classify stuff?
Are y = mx + c and f(x) = θ0 + θ1x the same? What can I calculate with this?
This equation is shown in the linear regression part so what exactly does this do?

I will try to answer all three questions you asked.
First, let me classify ML into some categories.
Regression - Predicting continuous valued output (example, stock prediction)
Classification - Predicting discrete valued output (example, spam classification)
Now regression can be also classified as linear regression or polynomial regression.
Linear Regression is the simplest one. This is how it works.
Suppose I have this data.
These are the house prices plotted against size of the house. Now I want a straight line that can best fit this data. Maybe I will try this line.
And I will try more and more lines to see which one actually fits the data best. To obtain different lines I vary the parameters a and b in y = a + bx. This answers your second question: both equations represent the same straight line (only the parameter names differ), and it is the line you are trying to fit to the data.
But how will I decide whether one line is a better fit than another? I will calculate a value that represents the error my line makes in predicting the y values for all the x values in my data. This value is called the cost function. I can choose a cost function like this:
(Ignore if it doesn't make sense).
But basically I want my cost function (the value representing the error) to be as small as possible, and Gradient Descent is one algorithm that can minimize it. Gradient Descent can be applied to any differentiable function, so it is not exclusive to Linear Regression, although it is popular there. This answers your first question: linear regression is the model, and gradient descent is one way to fit it.
The next step is to know how Gradient Descent works. This is the algorithm:
This is what you asked about in your third question. It is the update rule that adjusts your fitting line (called the hypothesis) so as to minimize the cost function.
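
As a rough illustration of the steps described above, here is a minimal sketch in Python. It assumes the halved mean squared error as the cost function and a fixed learning rate `alpha`; the toy data, the names `theta0` and `theta1`, and the number of steps are illustrative choices, not something taken from the lecture or the answer.

```python
def predict(theta0, theta1, x):
    """Hypothesis: a straight line h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Assumed cost: mean squared error, halved so the derivative is tidy."""
    m = len(xs)
    return sum((predict(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient_descent(xs, ys, alpha=0.05, steps=5000):
    """Repeatedly nudge both parameters downhill on the cost surface."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [predict(theta0, theta1, x) - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                               # d(cost)/d(theta0)
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # d(cost)/d(theta1)
        # Update both parameters using gradients computed from the old values.
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Toy data generated from y = 1 + 2x, so the best line is known in advance.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
theta0, theta1 = gradient_descent(xs, ys)
print(theta0, theta1, cost(theta0, theta1, xs, ys))
```

On this toy data the fitted intercept and slope should end up very close to 1 and 2, and the cost close to zero. If `alpha` is made too large the updates overshoot and the cost grows instead of shrinking.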


Why do we need to normalize the input to zero mean and unit variance before feeding it to the network?

In deep learning, I have seen many papers apply normalization as a pre-processing step. The input is normalized to zero mean and unit variance before being fed to the convolutional network (which has BatchNorm). Why not use the original intensities? What is the benefit of the normalization step? If I use histogram matching among images, should I still use the normalization step? Thanks
Normalization is important because it brings features onto the same scale, which makes the network much easier to train. Let's assume there are two features, where one is measured on a scale of 1 to 10 and the second on a scale of 1 to 10,000. With a squared error function, the network will be busy optimizing the weights according to the larger error on the second feature.
Therefore it is better to normalize.
The answer to this can be found in Andrew Ng's tutorial: https://youtu.be/UIp2CMI0748?t=133.
TLDR: If you do not normalize input features, some features can have a very different scale and will slow down Gradient Descent.
Long explanation: Let us consider a model that uses two features Feature1 and Feature2 with the following ranges:
Feature1: [10,10000]
Feature2: [0.00001, 0.001]
The contour plot of the loss for these will look something like this (scaled for easier visibility).
[Figure: Contour plot of Feature1 and Feature2]
When you perform Gradient Descent, you compute the partial derivatives of the loss with respect to the weight of each feature (written d(Feature1) and d(Feature2) here) in order to move the model weights toward the minimum of the loss. As the elongated contours above suggest, the two derivatives differ by orders of magnitude, so even with a moderate learning rate you will zig-zag along the steep direction while barely progressing along the shallow one, and you may even overshoot the global minimum.
[Figure: Medium value of learning rate]
To avoid this you could choose a very small learning rate, but then Gradient Descent will take a very long time to converge, and you may stop training before reaching the global minimum.
[Figure: Gradient Descent with a very small learning rate]
So, as you can see from the above examples, not scaling your features leads to an inefficient Gradient Descent, which may fail to find the best model.
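
For illustration, here is a minimal sketch of the zero-mean, unit-variance normalization discussed above, using only the Python standard library. The helper name `standardize` and the sample values are made up for this example; in practice the mean and standard deviation would be computed on the training set and reused for validation and test data.

```python
import statistics


def standardize(values):
    """Rescale one feature column to zero mean and unit variance."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # A constant feature has no spread to rescale; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical samples spanning the two ranges mentioned above.
feature1 = [10.0, 2500.0, 5000.0, 10000.0]
feature2 = [0.00001, 0.0002, 0.0005, 0.001]

scaled1 = standardize(feature1)
scaled2 = standardize(feature2)
print(scaled1)
print(scaled2)
```

After this step both columns live on a comparable scale, so a single learning rate is reasonable for both weights.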

Is Logistic Regression sensitive to outliers? Using it on a synthetic 2D dataset

I am currently using sklearn's Logistic Regression function to work on a synthetic 2d problem. The dataset is shown as below:
I'm basically plugging the data into sklearn's model, and this is what I'm getting (the light green; disregard the dark green):
The code for this is only two lines; model = LogisticRegression(); model.fit(tr_data,tr_labels). I've checked the plotting function; that's fine as well. I'm using no regularizer (should that affect it?)
It seems really strange to me that the boundaries behave in this way. Intuitively I feel they should be more diagonal, as the data is (mostly) located top-right and bottom-left, and from testing some things out it seems a few stray datapoints are what's causing the boundaries to behave in this manner.
For example here's another dataset and its boundaries
Would anyone know what might be causing this? From my understanding Logistic Regression shouldn't be this sensitive to outliers.
Your model is overfitting the data (The decision regions it found perform indeed better on the training set than the diagonal line you would expect).
The loss is optimal when all the data is classified correctly with probability 1. The distances to the decision boundary enter in the probability computation. The unregularized algorithm can use large weights to make the decision region very sharp, so in your example it finds an optimal solution, where (some of) the outliers are classified correctly.
With stronger regularization you prevent that, and the distances play a bigger role. Try different values for the inverse regularization strength C (smaller values mean stronger regularization), e.g.
model = LogisticRegression(C=0.1)
Note: the default value C=1.0 corresponds already to a regularized version of logistic regression.
Let us further qualify why logistic regression overfits here. After all, there are just a few outliers, but hundreds of other data points. To see why, it helps to note that logistic loss is a kind of smoothed version of hinge loss (used in SVM).
SVM does not 'care' about samples on the correct side of the margin at all: as long as they do not cross the margin they inflict zero cost. Since logistic loss is a smoothed version of hinge loss, the far-away samples do inflict a cost, but it is negligible compared to the cost inflicted by samples near the decision boundary.
So, unlike e.g. Linear Discriminant Analysis, samples close to the decision boundary have disproportionately more impact on the solution than far-away samples.
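
To make the comparison between the two losses concrete, here is a small sketch that evaluates both as a function of the signed margin. It assumes labels y in {-1, +1} and a margin defined as y * f(x), where f(x) is the raw score of the linear model; the function names are chosen for this example.

```python
import math


def hinge_loss(margin):
    """SVM-style loss: exactly zero once the sample is beyond the margin."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """Logistic loss: smooth, never exactly zero, but tiny for large margins."""
    return math.log1p(math.exp(-margin))


# Negative margins are misclassified samples, large positive ones are far
# away on the correct side of the decision boundary.
for margin in (-2.0, 0.0, 1.0, 3.0, 10.0):
    print(f"margin={margin:5.1f}  hinge={hinge_loss(margin):.5f}  "
          f"logistic={logistic_loss(margin):.5f}")
```

Samples with a large positive margin contribute almost nothing to the logistic loss, while samples near or across the boundary dominate it, which is the behaviour described above.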

Is it possible to use libsvm for multi-output regression problems?

I have a scenario where I need to predict spherical coordinates (r, theta, phi) depending on the values of 6 attributes. I am using libsvm with the regression option. If I predict each coordinate individually for each object instance, the result doesn't make sense. If I instead combine the coordinates and assign a single label to each (r, theta, phi) triple, that is not meaningful either, and the SVM does not converge. I want the SVM to analyse the combination of the three coordinates and build a training model accordingly. Is it possible? Please advise.
Not really: SVM is primarily a classification algorithm, and libsvm's regression mode (SVR) predicts only a single scalar target per model. Treated as class labels, (0, 0, 1) is just as distinct from (0, 0, 2) as it is from (20, 3, -1): simply "not the same".
If you have a regression problem, then use a regression model: do a little research to find one that matches whatever your data set characteristics suggest.
From what little you've said, it sounds to me as if you want a multivariate regression, with a single loss function that describes the deviation from the desired output triple. You're correct that three separate regressions won't work for this scenario: the position in space depends on a non-linear combination of the three outputs.
I suggest that you make your loss function some useful distance function between the true and predicted positions. You will need to experiment with your model features, using linear, squared, and other terms for each of the six inputs. I can't suggest anything more specific, as you haven't described the problem in enough detail.
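
As one possible reading of "a distance function between the true and predicted positions", here is a minimal sketch that converts both triples to Cartesian coordinates and takes the Euclidean distance. The angle convention is an assumption (theta is the polar angle measured from the +z axis, phi is the azimuth in the x-y plane, both in radians), and the function names are made up for this example.

```python
import math


def spherical_to_cartesian(r, theta, phi):
    """Convert (r, theta, phi) to (x, y, z).

    Assumed convention: theta = polar angle from +z, phi = azimuth, radians.
    """
    x = r * math.sin(theta) * math.cos(phi)
    y = r * math.sin(theta) * math.sin(phi)
    z = r * math.cos(theta)
    return x, y, z


def position_loss(true_rtp, pred_rtp):
    """Euclidean distance between the true and predicted points in space."""
    return math.dist(spherical_to_cartesian(*true_rtp),
                     spherical_to_cartesian(*pred_rtp))


# phi differs by a full turn, yet both triples describe the same point,
# so the positional loss is (numerically) zero.
same_point = position_loss((1.0, math.pi / 2, 0.0), (1.0, math.pi / 2, 2 * math.pi))
# Same direction, radius off by one unit: the loss is one unit.
radial_miss = position_loss((1.0, 0.0, 0.0), (2.0, 0.0, 0.0))
print(same_point, radial_miss)
```

The first case shows why per-coordinate errors can mislead: an error of 2π in phi is no error at all in space, which a loss computed on positions captures directly.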

How many learning curves should I plot for a multi-class logistic regression classifier?

If we have K classes, do I have to plot K learning curves?
Because it seems impossible to me to calculate the train/validation error against all K theta vectors at once.
To clarify, the learning curve is a plot of the training & cross validation/test set error/cost vs training set size. This plot should allow you to see if increasing the training set size improves performance. More generally, the learning curve allows you to identify whether your algorithm suffers from a bias (under fitting) or variance (over fitting) problem.
It depends. Learning curves do not concern themselves with the number of classes. Like you said, it is a plot of training set and test set error, where that error is a numerical value. This is all learning curves are.
That error can be anything you want: accuracy, precision, recall, F1 score etc. (even MAE, MSE and others for regression).
However, the metric you choose has to make sense for your specific problem, and that choice in turn determines how many learning curves you need.
Accuracy is well defined for any number of classes, so if you use this, a single plot should suffice.
Precision and recall, however, are defined only for binary problems. You can somewhat generalize them (see here for example) by considering the binary problem with classes x and not x for each class x. In that case, you will probably want to plot learning curves for each class. This will also help you identify problems relating to certain classes better.
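
Here is a small sketch of the "class x versus not x" generalization described above, next to plain accuracy. The function names and the toy labels are made up for this example; each learning-curve point would be obtained by calling these on the predictions of a model trained on a given training set size.

```python
def accuracy(y_true, y_pred):
    """Fraction of correct predictions; one number for any number of classes."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_precision_recall(y_true, y_pred):
    """Treat each class c as the binary problem 'c' versus 'not c'."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = (precision, recall)
    return scores


y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]

print(accuracy(y_true, y_pred))
print(per_class_precision_recall(y_true, y_pred))
```

Accuracy yields a single curve, while the per-class scores yield one precision curve and one recall curve for each of the K classes.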
If you want to read more about performance metrics, I like this paper a lot.

What would be a sufficiently generic convergence criterion for Stochastic Gradient Descent?

I am implementing a generic module for Stochastic Gradient Descent. It takes these arguments: the training dataset, loss(x,y), and dw(x,y), i.e. the per-sample loss and the per-sample gradient.
Now, for the convergence criterion, I have thought of:
a) Checking loss function after every 10% of the dataset.size, averaged over some window
b) Checking the norm of the differences between weight vector, after every 10-20% of dataset size
c) Stabilization of error on the training set.
d) Change in the sign of the gradient (again, checked after every fixed intervals) -
I have noticed that these checks (their precision etc.) also depend on other things, like the step size and learning rate, and the effect can vary from one training problem to another.
I can't seem to make up my mind on what the generic stopping criterion should be, regardless of the training set, f(x) and df/dw thrown at the SGD module. What do you guys do?
Also, for (d), what would "change in sign" mean for an n-dimensional vector? That is, given dw_i and dw_i+1, how do I detect the change of sign, and does it even have a meaning in more than one dimension?
P.S. Apologies for non-math/latex symbols..still getting used to the stuff.
First, stochastic gradient descent is the on-line version of the gradient descent method: the update rule uses a single example at a time.
Suppose f(x) is your cost function for a single example. The stopping criterion of SGD for an N-dimensional vector is usually:
See this1, or this2 for details.
Second, there is a further twist on stochastic gradient descent using so-called “minibatches”. It works identically to SGD, except that it uses more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers. See this3.
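
As a rough sketch of how minibatches and a stopping check can fit together, the code below implements criterion (b) from the question: it stops once the weight vector barely moves over one full pass through the data. Everything here is an assumption made for illustration: the function name `minibatch_sgd`, the callback signature `grad(w, x, y)`, the defaults, and the noise-free toy data.

```python
import math
import random


def minibatch_sgd(data, grad, w, lr=0.05, batch_size=2, tol=1e-6,
                  max_epochs=5000, seed=0):
    """Minibatch SGD that stops when the weights barely move over one epoch.

    grad(w, x, y) is a hypothetical per-sample gradient callback that
    returns a list with one entry per weight.
    """
    rng = random.Random(seed)
    data = list(data)
    for epoch in range(1, max_epochs + 1):
        w_before = list(w)
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average the per-sample gradients over the minibatch.
            avg = [0.0] * len(w)
            for x, y in batch:
                for i, g in enumerate(grad(w, x, y)):
                    avg[i] += g / len(batch)
            w = [wi - lr * gi for wi, gi in zip(w, avg)]
        # Criterion (b): norm of the weight change since the last check.
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(w, w_before)))
        if change < tol:
            return w, epoch
    return w, max_epochs


def squared_error_grad(w, x, y):
    """Gradient of 0.5 * (w[0] + w[1] * x - y) ** 2 with respect to w."""
    error = w[0] + w[1] * x - y
    return [error, error * x]


# Noise-free toy data from y = 1 + 2x.
data = [(x, 1.0 + 2.0 * x) for x in (0.0, 0.5, 1.0, 1.5, 2.0, 2.5)]
weights, epochs_used = minibatch_sgd(data, squared_error_grad, [0.0, 0.0])
print(weights, epochs_used)
```

On noisy data with a constant learning rate the weights keep jittering around the minimum, so a fixed threshold like this may never trigger. In that setting a decaying learning rate, or a check on the loss averaged over a window as in (a), tends to be more robust.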
