Intuition behind the kernelized perceptron

I understand the derivation of the kernelized perceptron function, but I'm trying to figure out the intuition behind the final formula
f(X) = sum_i (alpha_i*y_i*K(X,x_i))
Where (x_i,y_i) are all the samples in the training data, alpha_i is the number of times we've made a mistake on that sample, and X is the sample we're trying to predict (during training or otherwise). Now, I understand why the kernel function is considered to be a measure of similarity (since it's a dot product in a higher dimensional space), but what I don't get is how this formula comes together.
My original attempt was that we're trying to predict a sample based on how similar it is to the other samples - and multiply it by y_i so that it contributes the correct sign (points that are closer are better indicators of the label than points that are farther). But why should a sample that we've made several mistakes on contribute more?
tl;dr: In a Kernelized perceptron, why should a sample that we've made several mistakes on contribute more to the prediction than ones we haven't made mistakes on?

My original attempt was that we're trying to predict a sample based on how similar it is to the other samples - and multiply it by y_i so that it contributes the correct sign (points that are closer are better indicators of the label than points that are farther).
This is pretty much what's going on. The idea is that if a point is already well classified, you don't need to update its contribution further.
But if the point is misclassified, we do need an update, and the best update is in the opposite direction: if the result is negative (but should be positive) we add a positive quantity (y_i), and if the result is positive (but the point is misclassified) we add a negative value (again y_i).
As you can see, y_i already gives us the right update direction, so we use the misclassification counter alpha_i to give that update a magnitude. A sample that keeps being misclassified is one the current decision boundary keeps failing on, so its influence is reinforced each time until it is handled correctly.
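To make this concrete, here is a minimal sketch of a kernelized perceptron in Python; the RBF kernel and the toy data are my own assumptions for illustration, not part of the question:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # K(a, b) = exp(-gamma * ||a - b||^2), a similarity measure
    return np.exp(-gamma * np.sum((a - b) ** 2))

def train_kernel_perceptron(X, y, kernel, epochs=10):
    n = len(X)
    alpha = np.zeros(n)  # alpha_i = number of mistakes made on sample i
    for _ in range(epochs):
        for j in range(n):
            # f(x_j) = sum_i alpha_i * y_i * K(x_j, x_i)
            f = sum(alpha[i] * y[i] * kernel(X[j], X[i]) for i in range(n))
            if y[j] * f <= 0:   # mistake: the prediction has the wrong sign
                alpha[j] += 1   # this sample now contributes more
    return alpha

def predict(X_train, y_train, alpha, kernel, x_new):
    f = sum(alpha[i] * y_train[i] * kernel(x_new, X_train[i]) for i in range(len(X_train)))
    return 1 if f >= 0 else -1

# Toy usage (assumed data): two small clusters with labels -1 / +1
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1, -1, 1, 1])
alpha = train_kernel_perceptron(X, y, rbf_kernel)
print(predict(X, y, alpha, rbf_kernel, np.array([0.8, 0.9])))  # expected: 1
```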

Related

Gaussian Process Regression use case

While reading the paper "Tactile-based active object discrimination and target object search in an unknown workspace", there is something that I just cannot understand:
The paper is about finding an object's position and other properties using only tactile information. In section 4.1.2, the author says that he uses GPR to guide the exploratory process, and in section 4.1.4 he describes how he trained his GPR:
Using the example from section 4.1.2, the input is (x,z) and the output is y.
Whenever there is a contact, the corresponding y-value is stored.
This procedure is repeated several times.
The trained GPR is used to estimate the next exploration point, which is the point where the variance is highest.
You can also see the demonstration in the following link: https://www.youtube.com/watch?v=ZiLq3i-BJcA&t=177s . In the first part of the video (0:24-0:29), the initialization takes place, where the robot samples 4 times. Then, over the next 25 seconds, the robot explores in the corresponding direction. I do not understand how this tiny initialization of the GPR can guide the exploratory process. Could someone please explain how the input points (x,z) from the first exploration part could be estimated?
Any regression algorithm simply maps the input (x,z) to an output y in some way unique to the specific algorithm. For a new input (x0,z0), the algorithm will likely predict something very close to the true output y0 if many similar data points were included in the training. If training data was only available in a vastly different region, the predictions will likely be poor.
GPR also provides a measure of confidence in its predictions, namely the variance. The variance will naturally be very high in regions where no training data has been seen and low close to data points that have already been seen. If the 'experiment' takes much longer than evaluating the Gaussian Process, you can use the Gaussian Process fit to make sure you sample regions where you are very uncertain of your answer.
If the goal is to fully explore the entire input space, you could draw a lot of random values of (x,z) and evaluate the variance at these values. Then you could perform the costly experiment at the input point where you are most uncertain in y. Then you can retrain the GPR with all the explored data so far and repeat the process.
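As an illustration of that loop (not taken from the paper), here is a rough sketch using scikit-learn's GaussianProcessRegressor; the RBF kernel, the random candidate sampling, and the measure() stand-in for the costly experiment are all assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def measure(xz):
    # Placeholder for the costly experiment (e.g. a tactile probe at (x, z)).
    return np.sin(xz[0]) * np.cos(xz[1])

rng = np.random.default_rng(0)
X_seen = rng.uniform(0, 1, size=(4, 2))      # tiny initial sample, like the 4 probes in the video
y_seen = np.array([measure(p) for p in X_seen])

for _ in range(20):                          # exploration budget
    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=0.2)).fit(X_seen, y_seen)
    candidates = rng.uniform(0, 1, size=(500, 2))       # cheap random candidate inputs
    _, std = gpr.predict(candidates, return_std=True)   # predictive uncertainty at each candidate
    next_point = candidates[np.argmax(std)]             # probe where the GP is most uncertain
    X_seen = np.vstack([X_seen, next_point])
    y_seen = np.append(y_seen, measure(next_point))
```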
For optimization problems (not the OP's question)
If you wish to find the lowest value of y across the input space, you are not interested in doing the experiment in regions that you already know give high values of y, even if you are uncertain about exactly how high those values are. So instead of choosing the (x,z) point with the highest variance, you choose the point whose predicted value of y, adjusted by some multiple of the standard deviation, is most promising. Choosing points this way is called Bayesian optimization, and this confidence-bound scheme is known as the Upper Confidence Bound (UCB) acquisition (a lower confidence bound when minimizing). Expected Improvement (EI), the expected amount by which the previous best score would be improved, is also commonly used.
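A minimal sketch of such a confidence-bound acquisition, reusing the gpr and candidates from the snippet above (kappa is an assumed exploration weight):

```python
def lower_confidence_bound(gpr, candidates, kappa=1.0):
    # For minimization: be optimistic about points that might turn out low.
    mean, std = gpr.predict(candidates, return_std=True)
    return mean - kappa * std

# next_point = candidates[np.argmin(lower_confidence_bound(gpr, candidates))]
```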

How to find the values for several variables so that a function returns the highest value

I have 6 variables with values between 0 and 2, and a function that takes these variables as input. It predicts the result of a football match by looking at the past matches of both teams.
Depending on the variables, the output of the function obviously changes. Each variable determines how much a certain part of the algorithm is weighted (e.g. how much less weight a game from 6 months ago should get compared to a game from a week ago).
My goal is to find out what the ideal ratios between the different variables, and thus between the different parts of the algorithm, are, so that my algorithm predicts the most matches correctly. Is there any way of achieving that?
I thought of doing something like this with machine learning, something similar to linear/polynomial regression.
To determine how close a prediction is, I thought of giving:
2 points when the tendency is right (predicted that Team A would win and Team A did win)
4 points when the goal difference is right (prediction: Team A wins 2:1, actual result: 1:0)
5 points when the exact result is predicted correctly (predicted result: 2:1, actual result: 2:1)
Which would make a loss function of:
maximal points for a game (which is 5) - points for the predicted result
If I am able to minimize that, then hopefully, after it has looked at some training sets (past seasons), it will score the most points when given a new season as input, together with the variables computed beforehand.
Now I'm trying to find out by how much, and in which direction, I have to change each of my variables so that the loss gets smaller each time a new training set is looked at.
You probably have to look at how big the loss is, but I don't know how to figure out which variable to change and in which direction. Is that possible and, if so, how do I do it?
Currently I'm using JavaScript.
I am assuming that you are trying to use gradient descent to train your regression model.
Loss functions used with gradient descent have to be differentiable, so simply awarding or subtracting points for certain properties of the prediction will not work.
A loss function that may be suitable for this task is the mean squared error, which is simply computed by averaging the squared differences between predicted and expected values. Your expected values would then just be the scores of both teams in the game.
You would then have to compute the gradient of the loss of your prediction with respect to the weights that the prediction function uses to compute its outputs. This can be done using backpropagation (details of which are way too broad for this answer, there are many tutorials available on the web).
The intuition behind the gradient of a function is that it points in the direction of steepest ascent. If you update your parameters in that direction, the output of the function grows. Since this output is the loss of your prediction function and you want it to be smaller, you take a small step in the opposite direction of the gradient.
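As a rough sketch of the whole loop (in Python rather than JavaScript, with an assumed linear model and made-up data standing in for the real football features), gradient descent on a mean squared error loss could look like this:

```python
import numpy as np

# Assumed toy setup: each row of X holds 6 input features for one match and
# y holds the quantity we try to predict (e.g. the goal difference).
rng = np.random.default_rng(42)
X = rng.uniform(0, 2, size=(200, 6))
true_w = np.array([0.5, -0.3, 1.2, 0.0, 0.8, -0.1])
y = X @ true_w + rng.normal(0, 0.1, size=200)

w = np.zeros(6)            # the 6 variables (weights) we want to tune
learning_rate = 0.05

for epoch in range(1000):
    pred = X @ w                          # model prediction
    error = pred - y
    loss = np.mean(error ** 2)            # mean squared error
    grad = 2 * X.T @ error / len(y)       # gradient of the loss w.r.t. w
    w -= learning_rate * grad             # step against the gradient to shrink the loss

print(w)   # should end up close to true_w
```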

Principal Component Analysis

I am studying principal component analysis, and I have just learnt that before applying PCA to the data samples, we have to apply two preprocessing steps: mean normalization and feature scaling. However, I have no idea what mean normalization is or how it can be implemented.
I searched for it first, but I could not find an instructive explanation. Can anyone explain what mean normalization is and how it can be implemented?
Assume there is a dataset with d features (columns) and n observations (rows). For simplicity's sake, let's consider d=2 and n=100, which means your dataset has 2 features and 100 observations.
In other words, your dataset is a 2-dimensional array with 100 rows and 2 columns (100x2).
Initially, when you visualize it, you can see that the points are scattered in 2 dimensions.
When you standardize the dataset and visualize it again, you can see that all the points have shifted towards the origin. In other words, each feature column now has a mean of 0 and a standard deviation of 1. This process is called standardization.
How do you standardize?
It's pretty simple; the formula is straightforward:
z = (X - u) / s
Where,
X - an observation in the feature column
u - mean of the feature column
s - standard deviation of the feature column
Note: you have to apply standardization to every feature (column) in the dataset.
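A quick sketch of this in Python, both by hand and with scikit-learn's StandardScaler (the random data is just for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((100, 2)) * 10            # 100 observations, 2 features

# By hand: z = (X - u) / s, column by column
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# With scikit-learn
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))              # ~0 for every feature
print(X_scaled.std(axis=0))               # ~1 for every feature
print(np.allclose(X_manual, X_scaled))    # True
```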
Reference:
https://machinelearningmastery.com/normalize-standardize-machine-learning-data-weka/
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

Is my method to detect overfitting in matrix factorization correct?

I am using matrix factorization as a recommender system algorithm based on user click behavior records. I tried two matrix factorization methods:
The first one is basic SVD, whose prediction is just the dot product of the user factor vector u and the item factor vector i: r = u * i
The second one is SVD with a bias component:
r = u * i + b_u + b_i
where b_u and b_i represent the preference biases of users and items.
One of the models has very low performance while the other one is reasonable. I really do not understand why it performs so much worse, and I suspect it is overfitting.
I googled methods to detect overfitting and found that the learning curve is a good one. However, its x-axis is the size of the training set and its y-axis is the accuracy, which confuses me: how can I change the size of the training set? By picking some of the records out of the data set?
Another problem: I tried to plot the iteration-loss curve, and the curve seems normal:
But I am not sure whether this approach is correct, because the metrics I use are precision and recall. Should I plot an iteration-precision curve instead, or does the loss curve already tell me that my model is fine?
Can anybody please tell me whether I am going in the right direction? Thank you so much. :)
I will answer in reverse:
So you are trying two different models: one that uses straight matrix factorization, r = u * i, and one that adds the biases, r = u * i + b_u + b_i.
You mentioned you are doing matrix factorization for a recommender system based on users' clicks. So my question is: is this an implicit ratings case or an explicit one? If it is about clicks, I believe it is an implicit ratings problem.
This is the first important thing to be aware of, whether your problem involves explicit or implicit ratings, because the two are used and implemented differently.
If you check here:
http://yifanhu.net/PUB/cf.pdf
With implicit ratings, the number of times someone clicked on or bought a given item is used to infer a confidence level. If you check the error function in that paper, you can see that the confidence levels are used almost as weight factors. So in this scenario the biases have no real meaning.
In the case of explicit ratings, where one has a score, for example from 1 to 5, one can calculate biases for users and items (averages of these bounded scores) and introduce them in the rating formula. They make sense in this scenario.
The whole point is that, depending on which scenario you are in, you either use the biases or you don't.
As for your actual question about overfitting: you can compare training error with test error. Depending on the size of your data you can hold out a test set, and if the two errors differ a lot, you are overfitting.
Also note that matrix factorization models usually include regularization terms, as in the article linked above, precisely to avoid overfitting.
So I think in your case the issue is not overfitting but the one I mentioned before: using bias terms in an implicit-feedback setting.
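To make the suggested check concrete, here is a rough sketch of comparing training error and held-out error for a biased matrix factorization model; the toy data, the SGD loop, and the hyperparameters are all assumptions and not from the question:

```python
import numpy as np

# Toy explicit-style ratings for illustration; only a fraction of entries are observed.
rng = np.random.default_rng(0)
n_users, n_items, rank = 50, 40, 5
R = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
observed = rng.random((n_users, n_items)) < 0.3
train = observed & (rng.random(observed.shape) < 0.8)   # 80/20 split of observed entries
test = observed & ~train

U = 0.1 * rng.standard_normal((n_users, rank))
V = 0.1 * rng.standard_normal((n_items, rank))
b_u, b_i = np.zeros(n_users), np.zeros(n_items)
mu = R[train].mean()
lr, reg = 0.01, 0.1

def rmse(which):
    pred = mu + b_u[:, None] + b_i[None, :] + U @ V.T
    return np.sqrt(np.mean((R[which] - pred[which]) ** 2))

for epoch in range(30):
    for u, i in zip(*np.where(train)):
        err = R[u, i] - (mu + b_u[u] + b_i[i] + U[u] @ V[i])
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        U[u], V[i] = U[u] + lr * (err * V[i] - reg * U[u]), V[i] + lr * (err * U[u] - reg * V[i])
    # A widening gap between these two numbers is the overfitting signal.
    print(epoch, rmse(train), rmse(test))
```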

Distance measure for categorical attributes for k-Nearest Neighbor

For my class project, I am working on the Kaggle competition - Don't get kicked
The project is to classify test data as a good or bad buy for cars. There are 34 features and the data is highly skewed. I made the following choices:
Since the data is highly skewed (out of 73,000 instances, 64,000 are bad buys and only 9,000 are good buys) and building a decision tree would overfit the data, I chose to use kNN (k nearest neighbors).
After trying kNN, I plan to try perceptron and SVM techniques if kNN doesn't yield good results. Is my understanding about overfitting correct?
Since some features are numeric, I can directly use Euclidean distance as a measure, but other attributes are categorical. To use these features properly, I need to come up with my own distance measure. I read about Hamming distance, but I am still unclear on how to merge the two distance measures so that each feature gets equal weight.
Is there a way to find a good approximate value of k? I understand that this depends a lot on the use case and varies per problem, but if I am taking a simple vote from each neighbor, what should I set k to? I'm currently trying out various values, such as 2, 3, 10, etc.
I researched around and found these links, but they are not specifically helpful:
a) Metric for nearest neighbor, which says that finding your own distance measure is equivalent to 'kernelizing', but I couldn't make much sense of it.
b) Distance independent approximation of kNN talks about R-trees, M-trees etc. which I believe don't apply to my case.
c) Finding nearest neighbors using Jaccard coeff
Please let me know if you need more information.
Since the data is unbalanced, you should either sample an equal number of good/bad instances (losing lots of "bad" records) or use an algorithm that can account for the imbalance. I think there's an SVM implementation in RapidMiner that does this.
You should use cross-validation to avoid overfitting. You might be using the term overfitting incorrectly here, though.
You should normalize so that all features contribute to the distance with the same weight. By normalize I mean force each feature to be between 0 and 1: subtract the minimum and divide by the range.
The way to find the optimal value of K is to try all possible values of K (while cross-validating) and choose the value with the highest accuracy. If a merely "good" value of K is enough, you can use a genetic algorithm or similar to find it. Or you could try K in steps of, say, 5 or 10, see which K leads to good accuracy (say it's 55), then try steps of 1 near that good value (i.e. 50, 51, 52, ...), though this may not be optimal.
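A rough sketch of the last two points, blending a normalized numeric distance with a Hamming distance over the categorical columns and picking k by cross-validation; the toy data, the column split, and the candidate k values are assumptions, not from the original post:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

rng = np.random.default_rng(0)
# Assumed toy data: 3 numeric columns, 2 categorical columns, binary labels.
X_num = rng.normal(size=(300, 3)) * [10, 1, 100]
X_cat = rng.integers(0, 4, size=(300, 2)).astype(str)
y = rng.integers(0, 2, size=300)

X_num_scaled = MinMaxScaler().fit_transform(X_num)   # each numeric feature forced into [0, 1]
X_cat_codes = OrdinalEncoder().fit_transform(X_cat)  # integer codes for the categories

def mixed_distance(a, b, n_num=3):
    # Euclidean distance over the scaled numeric part
    # plus the fraction of mismatching categorical values (normalized Hamming).
    num = np.linalg.norm(a[:n_num] - b[:n_num])
    ham = np.mean(a[n_num:] != b[n_num:])
    return num + ham

X_all = np.hstack([X_num_scaled, X_cat_codes])

for k in [1, 3, 5, 11, 21]:
    knn = KNeighborsClassifier(n_neighbors=k, metric=mixed_distance)
    score = cross_val_score(knn, X_all, y, cv=5).mean()
    print(k, round(score, 3))
```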
I'm looking at the exact same problem.
Regarding the choice of k, it's recommended to use an odd value to avoid getting "tie votes".
I hope to expand this answer in the future.
