Gradient descent stochastic update - Stopping criterion and update rule - Machine Learning - machine-learning

My dataset has m features and n data points. Let w be a vector (to be estimated). I'm trying to implement gradient descent with stochastic update method. My minimizing function is least mean square.
The update algorithm is shown below:
for i = 1 ... n data:
for t = 1 ... m features:
w_t = w_t - alpha * (<w>.<x_i> - <y_i>) * x_t
where <x> is a raw vector of m features, <y> is a column vector of true labels, and alpha is a constant.
My questions:
Now according to wiki, I don't need to go through all data points and I can stop when error is small enough. Is it true?
I don't understand what should be the stopping criterion here. If anyone can help with this that would be great.
With this formula - which I used in for loop - is it correct? I believe (<w>.<x_i> - <y_i>) * x_t is my ∆Q(w).

Now according to wiki, I don't need to go through all data points and I can stop when error is small enough. Is it true?
This is especially true when you have a really huge training set and going through all the data points is so expensive. Then, you would check the convergence criterion after K stochastic updates (i.e. after processing K training examples). While it's possible, it doesn't make much sense to do this with a small training set. Another thing people do is randomizing the order in which training examples are processed to avoid having too many correlated examples in a raw which may result in "fake" convergence.
I don't understand what should be the stopping criterion here. If anyone can help with this that would be great.
There are a few options. I recommend trying as many of them and deciding based on empirical results.
difference in the objective function for the training data is smaller than a threshold.
difference in the objective function for held-out data (aka. development data, validation data) is smaller than a threshold. The held-out examples should NOT include any of the examples used for training (i.e. for stochastic updates) nor include any of the examples in the test set used for evaluation.
the total absolute difference in parameters w is smaller than a threshold.
in 1, 2, and 3 above, instead of specifying a threshold, you could specify a percentage. For example, a reasonable stopping criterion is to stop training when |squared_error(w) - squared_error(previous_w)| < 0.01 * squared_error(previous_w) $$.
sometimes, we don't care if we have the optimal parameters. We just want to improve the parameters we originally had. In such case, it's reasonable to preset a number of iterations over the training data and stop after that regardless of whether the objective function actually converged.
With this formula - which I used in for loop - is it correct? I believe (w.x_i - y_i) * x_t is my ∆Q(w).
It should be 2 * (w.x_i - y_i) * x_t but it's not a big deal given that you're multiplying by the learning rate alpha anyway.

Related

How to squish a continuous cosine-theta score to a discrete (0/1) output?

I implemented a cosine-theta function, which calculates the relation between two articles. If two articles are very similar then the words should contain quite some overlap. However, a cosine theta score of 0.54 does not mean "related" or "not related". I should end up with a definitive answer which is either 0 for 'not related' or 1 for 'related'.
I know that there are sigmoid and softmax functions, yet I should find the optimal parameters to give to such functions and I do not know if these functions are satisfactory solutions. I was thinking that I have the cosine theta score, I can calculate the percentage of overlap between two sentences two (e.g. the amount of overlapping words divided by the amount of words in the article) and maybe some more interesting things. Then with the data, I could maybe write a function (what type of function I do not know and is part of the question!), after which I can minimize the error via the SciPy library. This means that I should do some sort of supervised learning, and I am willing to label article pairs with labels (0/1) in order to train a network. Is this worth the effort?
# Count words of two strings.
v1, v2 = self.word_count(s1), self.word_count(s2)
# Calculate the intersection of the words in both strings.
v3 = set(v1.keys()) & set(v2.keys())
# Calculate some sort of ratio between the overlap and the
# article length (since 1 overlapping word on 2 words is more important
# then 4 overlapping words on articles of 492 words).
p = min(len(v1), len(v2)) / len(v3)
numerator = sum([v1[w] * v2[w] for w in v3])
w1 = sum([v1[w]**2 for w in v1.keys()])
w2 = sum([v2[w]**2 for w in v2.keys()])
denominator = math.sqrt(w1) * math.sqrt(w2)
# Calculate the cosine similarity
if not denominator:
return 0.0
else:
return (float(numerator) / denominator)
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
As said, I would like to use variables such as p, and the cosine theta score in order to produce an accurate discrete binary label, either 0 or 1.
Here it really comes down to what you mean by accuracy. It is up to you to choose how the overlap affects whether or not two strings are "matching" unless you have a labelled data set. If you have a labelled data set (I.e., a set of pairs of strings along with a 0 or 1 label), then you can train a binary classification algorithm and try to optimise based on that. I would recommend something like a neural net or SVM due to the potentially high dimensional, categorical nature of your problem.
Even the optimisation, however, is a subjective measure. For example, in theory let's pretend you have a model which out of 100 samples only predicts 1 answer (Giving 99 unknowns). Technically if that one answer is correct, that is a model with 100% accuracy, but which has a very low recall. Generally in machine learning you will find a trade off between recall and accuracy.
Some people like to go for certain metrics which combine the two (The most famous of which is the F1 score), but honestly it depends on the application. If I have a marketing campaign with a fixed budget, then I care more about accuracy - I would only want to target consumers who are likely to buy my product. If however, we are looking to test for a deadly disease or markers for bank fraud, then it's feasible for that test to be accurate only 10% of the time - if its recall of true positives is somewhere close to 100%.
Finally, if you have no labelled data, then your best bet is just to define some cut off value which you believe indicates a good match. This is would then be more analogous to a binary clustering problem, and you could use some more abstract measure such as distance to a centroid to test which cluster (Either the "related" or "unrelated" cluster) the point belongs to. Note however that here your features feel like they would be incredibly hard to define.

Wouldn't setting the first derivative of Cost function J to 0 gives the exact Theta values that minimize the cost?

I am currently doing Andrew NG's ML course. From my calculus knowledge, the first derivative test of a function gives critical points if there are any. And considering the convex nature of Linear / Logistic Regression cost function, it is a given that there will be a global / local optima. If that is the case, rather than going a long route of taking a miniscule baby step at a time to reach the global minimum, why don't we use the first derivative test to get the values of Theta that minimize the cost function J in a single attempt , and have a happy ending?
That being said, I do know that there is a Gradient Descent alternative called Normal Equation that does just that in one successful step unlike the former.
On a second thought, I am thinking if it is mainly because of multiple unknown variables involved in the equation (which is why the Partial Derivative comes into play?) .
Let's take an example:
Gradient simple regression cost function:
Δ[RSS(w) = [(y-Hw)T(y-Hw)]
y : output
H : feature vector
w : weights
RSS: residual sum of squares
Equating this to 0 for getting the closed form solution will give:
w = (H T H)-1 HT y
Now assuming there are D features, the time complexity for calculating transpose of matrix is around O(D3). If there are a million features, it is computationally impossible to do within reasonable amount of time.
We use these gradient descent methods since they give solutions with reasonably acceptable solutions within much less time.

Interpretation of a learning curve in machine learning

While following the Coursera-Machine Learning class, I wanted to test what I learned on another dataset and plot the learning curve for different algorithms.
I (quite randomly) chose the Online News Popularity Data Set, and tried to apply a linear regression to it.
Note : I'm aware it's probably a bad choice but I wanted to start with linear reg to see later how other models would fit better.
I trained a linear regression and plotted the following learning curve :
This result is particularly surprising for me, so I have questions about it :
Is this curve even remotely possible or is my code necessarily flawed?
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
If it is not, any hint to where I made a mistake?
Here's my code (Octave / Matlab) just in case:
Plot :
lambda = 0;
startPoint = 5000;
stepSize = 500;
[error_train, error_val] = ...
learningCurve([ones(mTrain, 1) X_train], y_train, ...
[ones(size(X_val, 1), 1) X_val], y_val, ...
lambda, startPoint, stepSize);
plot(error_train(:,1),error_train(:,2),error_val(:,1),error_val(:,2))
title('Learning curve for linear regression')
legend('Train', 'Cross Validation')
xlabel('Number of training examples')
ylabel('Error')
Learning curve :
S = ['Reg with '];
for i = startPoint:stepSize:m
temp_X = X(1:i,:);
temp_y = y(1:i);
% Initialize Theta
initial_theta = zeros(size(X, 2), 1);
% Create "short hand" for the cost function to be minimized
costFunction = #(t) linearRegCostFunction(X, y, t, lambda);
% Now, costFunction is a function that takes in only one argument
options = optimset('MaxIter', 50, 'GradObj', 'on');
% Minimize using fmincg
theta = fmincg(costFunction, initial_theta, options);
[J, grad] = linearRegCostFunction(temp_X, temp_y, theta, 0);
error_train = [error_train; [i J]];
[J, grad] = linearRegCostFunction(Xval, yval, theta, 0);
error_val = [error_val; [i J]];
fprintf('%s %6i examples \r', S, i);
fflush(stdout);
end
Edit : if I shuffle the whole dataset before splitting train/validation and doing the learning curve, I have very different results, like the 3 following :
Note : the training set size is always around 24k examples, and validation set around 8k examples.
Is this curve even remotely possible or is my code necessarily flawed?
It's possible, but not very likely. You might be picking the hard to predict instances for the training set and the easy ones for the test set all the time. Make sure you shuffle your data, and use 10 fold cross validation.
Even if you do all this, it is still possible for it to happen, without necessarily indicating a problem in the methodology or the implementation.
If it is correct, how can the training error grow so quickly when adding new training examples? How can the cross validation error be lower than the train error?
Let's assume that your data can only be properly fitted by a 3rd degree polynomial, and you're using linear regression. This means that the more data you add, the more obviously it will be that your model is inadequate (higher training error). Now, if you choose few instances for the test set, the error will be smaller, because linear vs 3rd degree might not show a big difference for too few test instances for this particular problem.
For example, if you do some regression on 2D points, and you always pick 2 points for your test set, you will always have 0 error for linear regression. An extreme example, but you get the idea.
How big is your test set?
Also, make sure that your test set remains constant throughout the plotting of the learning curves. Only the train set should increase.
If it is not, any hint to where I made a mistake?
Your test set might not be large enough or your train and test sets might not be properly randomized. You should shuffle the data and use 10 fold cross validation.
You might want to also try to find other research regarding that data set. What results are other people getting?
Regarding the update
That makes a bit more sense, I think. Test error is generally higher now. However, those errors look huge to me. Probably the most important information this gives you is that linear regression is very bad at fitting this data.
Once more, I suggest you do 10 fold cross validation for learning curves. Think of it as averaging all of your current plots into one. Also shuffle the data before running the process.

how to handle large number of features machine learning

I developed a image processing program that identifies what a number is given an image of numbers. Each image was 27x27 pixels = 729 pixels. I take each R, G and B value which means I have 2187 variables from each image (+1 for the intercept = total of 2188).
I used the below gradient descent formula:
Repeat {
θj = θj−α/m∑(hθ(x)−y)xj
}
Where θj is the coefficient on variable j; α is the learning rate; hθ(x) is the hypothesis; y is real value and xj is the value of variable j. m is the number of training sets. hθ(x), y are for each training set (i.e. that's what the summation sign is for). Further the hypothesis is defined as:
hθ(x) = 1/(1+ e^-z)
z= θo + θ1X1+θ2X2 +θ3X3...θnXn
With this, and 3000 training images, I was able to train my program in just over an hour and when tested on a cross validation set, it was able to identify the correct image ~ 67% of the time.
I wanted to improve that so I decided to attempt a polynomial of degree 2.
However the number of variables jumps from 2188 to 2,394,766 per image! It takes me an hour just to do 1 step of gradient descent.
So my question is, how is this vast number of variables handled in machine learning? On the one hand, I don't have enough space to even hold that many variables for each training set. On the other hand, I am currently storing 2188 variables per training sample, but I have to perform O(n^2) just to get the values of each variable multiplied by another variable (i.e. the polynomial to degree 2 values).
So any suggestions / advice is greatly appreciated.
try to use some dimensionality reduction first (PCA, kernel PCA, or LDA if you are classifying the images)
vectorize your gradient descent - with most math libraries or in matlab etc. it will run much faster
parallelize the algorithm and then run in on multiple CPUs (but maybe your library for multiplying vectors already supports parallel computations)
Along with Jirka-x1's answer, I would first say that this is one of the key differences in working with image data than say text data for ML: high dimensionality.
Second... this is a duplicate, see How to approach machine learning problems with high dimensional input space?

Does it makes any sense that weights and threshold are growing proportionally when training my perceptron?

I am moving my first steps in neural networks and to do so I am experimenting with a very simple single layer, single output perceptron which uses a sigmoidal activation function. I am updating my weights on-line each time a training example is presented using:
weights += learningRate * (correct - result) * {input,1}
Here weights is a n-length vector which also contains the weight from the bias neuron (- threshold), result is the result as computed by the perceptron (and processed using the sigmoid) when given the input, correct is the correct result and {input,1} is the input augmented with 1 (the fixed input from the bias neuron). Now, when I try to train the perceptron to perform logic AND, the weights don't converge for a long time, instead they keep growing similarly and they maintain a ratio of circa -1.5 with the threshold, for instance the three weights are in sequence:
5.067160008240718 5.105631826680446 -7.945513136885797
...
8.40390853077094 8.43890306970281 -12.889540730182592
I would expect the perceptron to stop at 1, 1, -1.5.
Apart from this problem, which looks like connected to some missing stopping condition in the learning, if I try to use the identity function as activation function, I get weight values oscillating around:
0.43601272528257057 0.49092558197172703 -0.23106430854347537
and I obtain similar results with tanh. I can't give an explanation to this.
Thank you
Tunnuz
It is because the sigmoid activation function doesn't reach one (or zero) even with very highly positive (or negative) inputs. So (correct - result) will always be non-zero, and your weights will always get updated. Try it with the step function as the activation function (i.e. f(x) = 1 for x > 0, f(x) = 0 otherwise).
Your average weight values don't seem right for the identity activation function. It might be that your learning rate is a little high -- try reducing it and see if that reduces the size of the oscillations.
Also, when doing online learning (aka stochastic gradient descent), it is common practice to reduce the learning rate over time so that you converge to a solution. Otherwise your weights will continue to oscillate.
When trying to analyze the behavior of the perception, it helps to also look at correct and result.

Resources