Vector coefficients based on similarity - machine-learning

I've been looking for a solution to create a recommendation system based on vector similarity.
Basically, I have a few vectors per user, for example:
User1: [0,3,7,8,5] , [3,5,8,2,4] , [1,5,3,9,4]
User2: [3,1,6,7,9] , [2,4,1,3,8] , [7,8,3,3,1]
For every vector I need to calculate a coefficient and, based on that coefficient, differentiate one vector from another. I've found formulas that calculate a coefficient based on the similarity of two vectors, which isn't really what I want. I need a formula that calculates a coefficient per vector, so that I can then do some other calculations with those coefficients. Are there any good formulas for this?
Thanks

So going based off your response to my comment: I don't think there's a similarity coefficient measure that will do what you want. Let me explain why...
Similarity coefficients are functions f(x, y) -> c where x and y are vectors and c is a scalar. Note that f takes two parameters. f(x,y) = f(y,x), but f(x) is meaningless - it's asking for the similarity of x relative to... nothing.
So what? We could just use a function g(x) = f(x, V) where V is a fixed vector. E.g. let V = [1, 1, ..., 1]. Now we have a monadic function that gives us a similarity value for every individual vector. But...
Knowing f(x,y) = c and f(x,z) = c' doesn't tell you a whole lot about f(y,z). Take vectors in 2-space, x = [1, 1], y = [0, 1], z = [1, 0]. A similarity function symmetric in the two dimensions would say f(x,y) = f(x,z), but hopefully not equal to f(y,z). So our g function above isn't very useful, because knowing how similar two vectors are to V doesn't tell us much about how similar they are to each other.
So what can you do? I think a simple solution to your problem would be a variation of the k nearest neighbors algorithm. It allows you to find vectors close to a given vector (or, if you prefer to find clusters of vectors without specifying a given vector, look up clustering).
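As a rough illustration of the k nearest neighbors suggestion, here is a minimal sketch using scikit-learn's NearestNeighbors on the vectors from the question; the query vector and the choice of cosine distance are just assumptions for the example:

import numpy as np
from sklearn.neighbors import NearestNeighbors

# Pool all per-user vectors from the question into one matrix (one row per vector).
vectors = np.array([
    [0, 3, 7, 8, 5], [3, 5, 8, 2, 4], [1, 5, 3, 9, 4],  # User1
    [3, 1, 6, 7, 9], [2, 4, 1, 3, 8], [7, 8, 3, 3, 1],  # User2
])

# Index the vectors and look up the ones closest to a (hypothetical) query vector.
nn = NearestNeighbors(n_neighbors=3, metric="cosine").fit(vectors)
distances, indices = nn.kneighbors([[1, 4, 5, 8, 5]])
print(indices[0], distances[0])  # row indices of the nearest stored vectors and their distances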
EDIT: inspiration from Yahya's answer: if your vectors are super huge and kNN or clustering is too difficult, consider principal component analysis or some other method of cutting them down to size (reducing the number of dimensions) - just keep in mind that whatever you do will likely be lossy.

Related

Math Behind Linear Regression

I am trying to understand the math behind Linear Regression, and I have verified on multiple sites that Linear Regression works under the OLS method, with y = mx + c giving the best-fit line.
So, in order to calculate the intercept and slope, we use the formulas below (if I am not wrong):
m = sum of [ (x - mean(x)) * (y - mean(y)) ] / sum of [ (x - mean(x))^2 ]
c = mean(y) - m * mean(x)
With these we get the m and c values to substitute into the equation above, get the predicted y values, and predict y for newer x values.
But my doubt is: when is "Gradient Descent" used? I understand that it is also used for calculating the coefficients, in such a way that it reduces the cost function by finding a local minimum.
Please help me with this.
Do these two have separate functions in Python/R?
Or does linear regression work on Gradient Descent by default (and if so, when is the formula above used for calculating the m and c values)?
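To make the relationship concrete, here is a small self-contained sketch (synthetic data, plain NumPy; the learning rate and iteration count are arbitrary) showing that the closed-form formulas above and gradient descent are two routes to essentially the same m and c:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 200)   # true slope 3, intercept 2, plus noise

# Closed-form OLS (the formulas above, with the square in the denominator)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - m * x.mean()

# Gradient descent on the same squared-error cost
m_gd, c_gd, lr = 0.0, 0.0, 0.01
for _ in range(5000):
    err = (m_gd * x + c_gd) - y
    m_gd -= lr * 2 * np.mean(err * x)   # partial derivative of the cost w.r.t. m
    c_gd -= lr * 2 * np.mean(err)       # partial derivative of the cost w.r.t. c

print(m, c)        # closed-form estimates
print(m_gd, c_gd)  # gradient-descent estimates, essentially the same values

In general, for plain linear regression libraries typically solve the least-squares problem directly; gradient descent becomes attractive when the dataset or feature count is very large, or when the model has no closed-form solution.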

Fourier Transform Scaling the magnitude

I have a set of signals S1, S2, ..., SN, for which I am numerically computing the Fourier transforms F1, F2, ..., FN, where the Si's and Fi's are C++ vectors (my computations are in C++).
My computation objective is as follows:
1. Compute the product F = F1*F2*...*FN.
2. Inverse Fourier transform F to get S.
What I observe numerically is that when I try to compute the product, the numbers become either too small or too big.
I thought of scaling F1 by say a1 so that the overflow or underflow is avoided.
With the scaling my step 1 above will become
F' = (F1/a1)*(F2/a2)*...*(FN/aN) = F'1*F'2*...*F'N
And when I inverse transform, my final S' will differ from S only by a scale factor; the structural form of S will not change. By this I mean only the normalization of S is different.
My questions are:
Is my rationale correct?
If my rationale is correct, then given a C++ vector "Fi", how can I choose a good scale "ai" to scale the Fi's up or down?
Many thanks in advance.
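To illustrate that rationale, here is a minimal NumPy sketch (np.fft standing in for the C++ FFT routine; the helper name and the choice a_i = max|F_i| are just for illustration). The scale factor is accumulated in log space so that the product of the a_i's cannot itself overflow:

import numpy as np

def scaled_product_ifft(signals):
    # Assumes all signals have the same length so the spectra multiply element-wise.
    spectra = [np.fft.fft(s) for s in signals]
    product = np.ones_like(spectra[0])   # complex ones, same length as each spectrum
    log_scale = 0.0                      # log(a_1 * a_2 * ... * a_N)
    for F in spectra:
        a = np.max(np.abs(F))            # assumed scale choice: a_i = max |F_i|
        product *= F / a
        log_scale += np.log(a)
    s_scaled = np.fft.ifft(product)      # S' = S / (a_1 * ... * a_N)
    return s_scaled, log_scale           # recover S as s_scaled * np.exp(log_scale)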
You can map the range of the Fi vector onto a new domain, let's say [a, b]. So in your vector Fi, the minimum value will become a and the maximum value will become b. You can scale the vector using this equation:
Fnew[i] = (b-a)/(max-min) * (F[i] - min) + a,
where min is the minimum value in F and max is the maximum value in F.
The only problem now is choosing a and b. I would choose [1, 2], because the values are small (so you won't get overflow easily) but not close to 0 (so repeated multiplication won't underflow either).
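A short sketch of that formula, assuming F holds real values and is not constant (the helper name is made up for the example):

import numpy as np

def rescale(F, a=1.0, b=2.0):
    # Map the values of F linearly so that min(F) -> a and max(F) -> b.
    F = np.asarray(F, dtype=float)
    fmin, fmax = F.min(), F.max()
    return (b - a) / (fmax - fmin) * (F - fmin) + a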

Are there any machine learning regression algorithms that can train on ordinal data?

I have a function f(x): R^n --> R (sorry, is there a way to do LaTeX here?), and I want to build a machine learning algorithm that estimates f(x) for any input point x, based on a bunch of sample xs in a training data set. If I know the value of f(x) for every x in the training data, this should be simple - just do a regression, or take the weighted average of nearby points, or whatever.
However, this isn't what my training data looks like. Rather, I have a bunch of pairs of points (x, y), and I know the value of f(x) - f(y) for each pair, but I don't know the absolute values of f(x) for any particular x. It seems like there ought to be a way to use this data to find an approximation to f(x), but I haven't found anything after some Googling; there are papers like this but they seem to assume that the training data comes in the form of a set of discrete labels for each entity, rather than having labels over pairs of entities.
This is just making something up, but could I try kernel density estimation over f'(x), and then do integration to get f(x)? Or is that crazy, or is there a known better technique?
You could assume that f is linear, which would simplify things - if f is linear we know that:
f(x-y) = f(x) - f(y)
For example, suppose you assume f(x) = <w, x>, making w the parameter you want to learn. What would the squared loss for a sample (x, y) with known difference d look like?
loss((x,y), d) = (f(x)-f(y) - d)^2
= (<w,x> - <w,y> - d)^2
= (<w, x-y> - d)^2
= (<w, z> - d)^2 // where z:=x-y
This is simply the squared loss for the input z = x - y.
Practically, you would need to construct z=x-y for each pair and then learn f using linear regression over inputs z and outputs d.
This model might be too weak for your needs, but it's probably the first thing you should try. Otherwise, as soon as you step away from the linearity assumption, you'd likely arrive at a difficult non-convex optimization problem.
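As a concrete starting point, here is a minimal sketch of that construction, assuming scikit-learn is available; pairs holds (x, y) NumPy vector pairs, diffs holds the known f(x) - f(y) values, and the helper name is made up for the example:

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_from_differences(pairs, diffs):
    # Build z = x - y for every pair and regress the known differences on z.
    Z = np.array([x - y for x, y in pairs])
    model = LinearRegression(fit_intercept=False)  # any constant in f cancels in the differences
    model.fit(Z, np.asarray(diffs))
    return model.coef_                             # learned w, so that f(x) is approximated by <w, x>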
I don't see a way to get absolute results. Any constant in your function (f(x) = g(x) + c) will disappear, in the same way constants disappear in an integral.

Convolution Vs Correlation

Can anyone explain to me the similarities and differences of correlation and convolution? Please explain the intuition behind them, not the mathematical equations (i.e., flipping the kernel/impulse). Application examples in the image processing domain for each would be appreciated too.
You will likely get a much better answer on DSP Stack Exchange, but for starters: there are a number of similar terms here, and it can be tricky to pin down their definitions:
1. Correlation
2. Cross correlation
3. Convolution
4. Correlation coefficient
5. Sliding dot product
6. Pearson correlation
Terms 1, 2, 3, and 5 are very similar.
Terms 4 and 6 are similar.
Note that all of these terms have dot products rearing their heads.
You asked about correlation and convolution - these are conceptually the same, except that in convolution the kernel is flipped before the sliding dot products are computed. I suspect that you may have been asking about the difference between the correlation coefficient (such as Pearson's) and convolution/correlation.
Prerequisites
I am assuming that you know how to compute the dot-product. Given two equal sized vectors v and w each with three elements, the algebraic dot product is v[0]*w[0]+v[1]*w[1]+v[2]*w[2]
There is a lot of theory behind the dot product in terms of what it represents etc....
Notice that the dot product is a single number (scalar) representing a mapping between the two vectors/points v and w. In geometry one frequently computes the cosine of the angle between two vectors, which uses the dot product. The cosine of the angle between two vectors is between -1 and 1 and can be thought of as a measure of similarity.
Correlation coefficient (Pearson)
The correlation coefficient between equal-length vectors v and w is simply the dot product of the two zero-mean signals (subtract mean(v) from v to get zmv and mean(w) from w to get zmw - here zm is shorthand for zero mean), divided by the magnitudes of zmv and zmw.
This produces a number between -1 and 1. Close to zero means little correlation; close to +/- 1 means high correlation. It measures the similarity between the two vectors.
See http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient for a better definition.
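A quick NumPy check of that description (the two example vectors are arbitrary):

import numpy as np

v = np.array([1.0, 2.0, 4.0, 3.0])
w = np.array([2.0, 1.0, 5.0, 4.0])

zmv, zmw = v - v.mean(), w - w.mean()             # zero-mean versions of v and w
r = np.dot(zmv, zmw) / (np.linalg.norm(zmv) * np.linalg.norm(zmw))
print(r, np.corrcoef(v, w)[0, 1])                 # both print the same Pearson coefficient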
Convolution and Correlation
When we want to correlate/convolve v1 and v2 we basically are computing a series of dot-products and putting them into an output vector. Let's say that v1 is three elements and v2 is 10 elements. The dot products we compute are as follows:
output[0] = v1[0]*v2[0]+v1[1]*v2[1]+v1[2]*v2[2]
output[1] = v1[0]*v2[1]+v1[1]*v2[2]+v1[2]*v2[3]
output[2] = v1[0]*v2[2]+v1[1]*v2[3]+v1[2]*v2[4]
output[3] = v1[0]*v2[3]+v1[1]*v2[4]+v1[2]*v2[5]
output[4] = v1[0]*v2[4]+v1[1]*v2[5]+v1[2]*v2[6]
output[5] = v1[0]*v2[5]+v1[1]*v2[6]+v1[2]*v2[7]
output[6] = v1[0]*v2[6]+v1[1]*v2[7]+v1[2]*v2[8]
output[7] = v1[0]*v2[7]+v1[1]*v2[8]+v1[2]*v2[9]
For a true convolution, the kernel v1 is flipped before the sliding dot products are computed:
output[0] = v1[2]*v2[0]+v1[1]*v2[1]+v1[0]*v2[2]
output[1] = v1[2]*v2[1]+v1[1]*v2[2]+v1[0]*v2[3]
output[2] = v1[2]*v2[2]+v1[1]*v2[3]+v1[0]*v2[4]
output[3] = v1[2]*v2[3]+v1[1]*v2[4]+v1[0]*v2[5]
output[4] = v1[2]*v2[4]+v1[1]*v2[5]+v1[0]*v2[6]
output[5] = v1[2]*v2[5]+v1[1]*v2[6]+v1[0]*v2[7]
output[6] = v1[2]*v2[6]+v1[1]*v2[7]+v1[0]*v2[8]
output[7] = v1[2]*v2[7]+v1[1]*v2[8]+v1[0]*v2[9]
Notice that we have fewer than 10 elements in the output because, for simplicity, I am computing the convolution only where both v1 and v2 are fully defined.
Notice also that the convolution is simply a series of dot products. There has been considerable work over the years to speed up convolutions: the sweeping dot products are slow, and they can be sped up by transforming the vectors into the Fourier basis, computing a single element-wise multiplication, and then inverse transforming the result, though I won't go into that here...
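The same sliding dot products can be checked with NumPy; v1 and v2 below are arbitrary example vectors, and 'valid' mode keeps only the positions where the two vectors fully overlap:

import numpy as np

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.arange(10, dtype=float)

corr = np.correlate(v2, v1, mode="valid")  # corr[k] = v1[0]*v2[k] + v1[1]*v2[k+1] + v1[2]*v2[k+2]
conv = np.convolve(v2, v1, mode="valid")   # the same dot products with v1 flipped
print(np.allclose(conv, np.correlate(v2, v1[::-1], mode="valid")))  # True

# The FFT route mentioned above: multiply the spectra, then inverse transform.
n = len(v1) + len(v2) - 1
fft_conv = np.fft.ifft(np.fft.fft(v2, n) * np.fft.fft(v1, n)).real
print(np.allclose(fft_conv, np.convolve(v2, v1)))                   # True (full convolution)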
You might want to look at these resources as well as googling: Calculating Pearson correlation and significance in Python
The best answer I got was from this document: http://www.cs.umd.edu/~djacobs/CMSC426/Convolution.pdf
I'm just going to copy the excerpt from the doc:
"The key difference between the two is that convolution is associative. That is, if F and G are filters, then F*(GI) = (FG)*I. If you don’t believe this, try a simple example, using F=G=(-1 0 1), for example. It is very convenient to have convolution be associative. Suppose, for example, we want to smooth an image and then take its derivative. We could do this by convolving the image with a Gaussian filter, and then convolving it with a derivative filter. But we could alternatively convolve the derivative filter with the Gaussian to produce a filter called a Difference of Gaussian (DOG), and then convolve this with our image. The nice thing about this is that the DOG filter can be precomputed, and we only have to convolve one filter with our image.
In general, people use convolution for image processing operations such as smoothing, and they use correlation to match a template to an image. Then, we don’t mind that correlation isn’t associative, because it doesn’t really make sense to combine two templates into one with correlation, whereas we might often want to combine two filter together for convolution."
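That associativity claim is easy to check numerically with the suggested F = G = (-1 0 1) and an arbitrary one-dimensional stand-in for the image I, using full zero-padded convolution:

import numpy as np

F = np.array([-1.0, 0.0, 1.0])
G = np.array([-1.0, 0.0, 1.0])
I = np.arange(8, dtype=float)   # stand-in for one image row

lhs = np.convolve(F, np.convolve(G, I))  # F * (G * I)
rhs = np.convolve(np.convolve(F, G), I)  # (F * G) * I
print(np.allclose(lhs, rhs))             # True: convolution is associative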
Convolution is just like correlation, except that we flip over the filter before correlating

PCA first or normalization first?

When doing regression or classification, what is the correct (or better) way to preprocess the data?
Normalize the data -> PCA -> training
PCA -> normalize PCA output -> training
Normalize the data -> PCA -> normalize PCA output -> training
Which of the above is more correct, or is there a "standard" way to preprocess the data? By "normalize" I mean standardization, linear scaling, or some other technique.
You should normalize the data before doing PCA. For example, consider the following situation. I create a data set X with a known correlation matrix C:
>> C = [1 0.5; 0.5 1];
>> A = chol(C);
>> X = randn(100,2) * A;
If I now perform PCA, I correctly find that the principal components (the columns of the coefficient matrix wts) are oriented at an angle to the coordinate axes:
>> wts=pca(X)
wts =
0.6659 0.7461
-0.7461 0.6659
If I now scale the first feature of the data set by 100, intuitively we think that the principal components shouldn't change:
>> Y = X;
>> Y(:,1) = 100 * Y(:,1);
However, we now find that the principal components are aligned with the coordinate axes:
>> wts=pca(Y)
wts =
1.0000 0.0056
-0.0056 1.0000
To resolve this, there are two options. First, I could rescale the data:
>> Ynorm = bsxfun(@rdivide,Y,std(Y))
(The weird bsxfun notation is used to do vector-matrix arithmetic in Matlab - all I'm doing here is dividing each feature by its standard deviation; pca centers the data by subtracting the mean itself.)
We now get sensible results from PCA:
>> wts = pca(Ynorm)
wts =
-0.7125 -0.7016
0.7016 -0.7125
They're slightly different to the PCA on the original data because we've now guaranteed that our features have unit standard deviation, which wasn't the case originally.
The other option is to perform PCA using the correlation matrix of the data instead of the covariance matrix:
>> wts = pca(Y,'corr')
wts =
0.7071 0.7071
-0.7071 0.7071
In fact this is completely equivalent to standardizing the data by subtracting the mean and then dividing by the standard deviation. It's just more convenient. In my opinion you should always do this unless you have a good reason not to (e.g. if you want to pick up differences in the variation of each feature).
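The same experiment can be reproduced in Python, assuming NumPy and scikit-learn are available (the seed and sample size are arbitrary):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
C = np.array([[1.0, 0.5], [0.5, 1.0]])
X = rng.standard_normal((100, 2)) @ np.linalg.cholesky(C).T  # data with correlation matrix C

Y = X.copy()
Y[:, 0] *= 100                                    # blow up the scale of the first feature

print(PCA().fit(Y).components_)                   # components align with the coordinate axes
print(PCA().fit(StandardScaler().fit_transform(Y)).components_)  # sensible directions again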
You always need to normalize the data first. Otherwise, PCA or other dimensionality-reduction techniques will give different results.
Normalize the data first. In fact, some R packages that are used to perform PCA normalize the data automatically before performing PCA.
If the variables have different units or describe different characteristics, it is mandatory to normalize.
The answer is the third option: after doing PCA you should normalize the PCA output as well, because the transformed data will be on a completely different scale. Normalizing the dataset both before and after PCA gives more accurate results.
Another reason comes from the PCA objective function itself, whose derivation assumes that the X matrix has been normalized before PCA.
