Since the SVD decomposition consists of three matrices, U, Sigma, and V transpose...
https://intoli.com/blog/pca-and-svd/img/svd-matrices.png
Why is it that when I use the SVD algorithm for prediction I can access only two matrices, qi and pu:
https://surprise.readthedocs.io/en/stable/matrix_factorization.html
Where is the matrix Sigma which contains the singular values?
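For reference, a hedged side-by-side of the two factorizations being compared; the biased prediction rule is the one given in the linked Surprise documentation, while the first line is the classical SVD of a fully observed matrix:

```latex
% Classical (truncated) SVD of a fully observed matrix R
R \approx U \,\Sigma\, V^{\top}

% Prediction rule of Surprise's SVD algorithm (Funk-style matrix factorization),
% as given in the linked documentation: two factor matrices (p_u, q_i) plus
% bias terms, with no explicit diagonal matrix of singular values
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top} p_u
```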
I was reading the paper "Improving Distributional Similarity with Lessons Learned from Word Embeddings" by Levy et al., and while discussing their hyperparameters, they say:
Vector Normalization (nrm) As mentioned in Section 2, all vectors (i.e. W’s rows) are normalized to unit length (L2 normalization), rendering the dot product operation equivalent to cosine similarity.
I then recalled that the default for the sim2 vector similarity function in the R text2vec package is to L2-norm vectors first:
sim2(x, y = NULL, method = c("cosine", "jaccard"), norm = c("l2", "none"))
So I'm wondering what the motivation might be for this normalizing-plus-cosine combination (both in terms of text2vec and in general). I tried to read up on the L2 norm, but mostly it comes up in the context of normalizing before using the Euclidean distance. I could not find (surprisingly) anything on whether the L2 norm would be recommended for or against in the case of cosine similarity on word vector spaces/embeddings. And I don't quite have the math skills to work out the analytic differences.
So here is the question, meant in the context of word vector spaces learned from textual data (either just co-occurrence matrices, possibly weighted by tf-idf, PPMI, etc., or embeddings like GloVe) and of calculating word similarity (with the goal of course being to use a vector space + metric that best reflects the real-world word similarities). Is there, in simple words, any reason to (not) use the L2 norm on a word-feature matrix/term-co-occurrence matrix before calculating cosine similarity between the vectors/words?
If you want to get cosine similarity, you DON'T need to normalize to unit L2 norm first and then calculate cosine similarity; cosine similarity already normalizes the vectors and then takes the dot product of the two.
If you are calculating Euclidean distance, then you NEED to normalize if distance or vector length is not an important distinguishing factor. If vector length is a distinguishing factor, then don't normalize and calculate the Euclidean distance as it is.
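A minimal numpy sketch of the equivalence described above (the vectors are made up, purely for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.5, 1.0])

# Cosine similarity computed directly from its definition
cos_direct = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# L2-normalize first, then take a plain dot product
x_unit = x / np.linalg.norm(x)
y_unit = y / np.linalg.norm(y)
cos_via_norm = x_unit @ y_unit

print(np.isclose(cos_direct, cos_via_norm))  # True: same number either way
```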
text2vec handles everything automatically - it will make the rows have unit L2 norm and then call the dot product to calculate cosine similarity.
But if the matrix already has rows with unit L2 norm, the user can specify norm = "none" and sim2 will skip the first normalization step (which saves some computation).
I understand the confusion - probably I need to remove the norm option (it doesn't take much time to normalize a matrix).
I am reading the mathematical formulation of SVMs, and in many sources I have found this idea that "the max-margin hyperplane is completely determined by those \vec{x}_i which lie nearest to it. These \vec{x}_i are called support vectors."
Could an expert explain this consequence mathematically, please?
This can be explained via the representer theorem.
Your separating hyperplane w can be represented as w = sum_{i=1}^{m} alpha_i * phi(x_i),
where m is the number of examples in your training sample, x_i is example i, and phi is the function that maps x_i into some feature space.
So your SVM algorithm will find the vector alpha = (alpha_1, ..., alpha_m), one alpha_i per x_i. Every training example x_i whose alpha_i is NOT zero is a support vector of w.
Hence the name - SVM.
If your data is separable, what happens is that you only need the support vectors that lie close to the separating hyperplane, and all the rest of the training set can be discarded (its alphas are 0). The number of support vectors the algorithm will use depends on the complexity of the data and the kernel you are using.
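To make the "only the nearest points matter" part explicit, here is a sketch of the standard hard-margin KKT argument (textbook reasoning, assuming labels y_i in {-1, +1}; the label can be absorbed into alpha_i to match the expansion written above):

```latex
% Hard-margin SVM primal
\min_{w,\,b}\ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.}\quad y_i\big(w^{\top}\phi(x_i) + b\big) \ge 1 \;\; \forall i

% Lagrangian with multipliers \alpha_i \ge 0
L(w, b, \alpha) = \tfrac{1}{2}\|w\|^2
  - \sum_{i=1}^{m} \alpha_i \Big[ y_i\big(w^{\top}\phi(x_i) + b\big) - 1 \Big]

% Stationarity in w gives the expansion over training points
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{m} \alpha_i y_i\, \phi(x_i)

% Complementary slackness: for each i, either \alpha_i = 0 or the constraint is tight,
% so \alpha_i > 0 only for points with y_i (w^\top \phi(x_i) + b) = 1,
% i.e. the points lying exactly on the margin: the support vectors.
\alpha_i \Big[ y_i\big(w^{\top}\phi(x_i) + b\big) - 1 \Big] = 0
```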
How do I do the transformation on the test data when I have the trained SVM model in hand? I am trying to simulate the SVM output from the mathematical equations and the trained SVM model (using an RBF kernel). How do I do that?
In SVM, some of the common kernels used are the polynomial kernel K(xi, xj) = (xi . xj + c)^d and the RBF kernel K(xi, xj) = exp(-gamma * ||xi - xj||^2).
Here xi and xj represent two samples. Now, if the data has, say, 5 samples, does this transformation include all combinations of two samples to generate the transformed feature space, like x1 and x1, x1 and x2, x1 and x3, ..., x4 and x5, x5 and x5?
If the data has two features, then a polynomial transformation of order 2 transforms the input into 3 dimensions, as explained here on slide 15: http://www.robots.ox.ac.uk/~az/lectures/ml/lect3.pdf
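For the two-feature, degree-2 case mentioned above, a short sketch of the explicit feature map (using the homogeneous polynomial kernel as an assumption; the slide may use a slightly different variant with a constant term):

```latex
% Homogeneous degree-2 polynomial kernel on x = (x_1, x_2), z = (z_1, z_2)
K(x, z) = (x^{\top} z)^2 = (x_1 z_1 + x_2 z_2)^2

% Expanding shows it is a plain dot product in 3 dimensions
K(x, z) = x_1^2 z_1^2 + 2\, x_1 x_2\, z_1 z_2 + x_2^2 z_2^2
        = \langle \phi(x), \phi(z) \rangle,
\qquad \phi(x) = \big( x_1^2,\ \sqrt{2}\, x_1 x_2,\ x_2^2 \big)
```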
Now how can one find a similar explanation for the transformation using the RBF kernel? I am trying to write code for transforming the test data so that I can apply the trained SVM model to it.
This is way more complex than that. In short - you do not map your data directly into the feature space. You simply change the dot product to the one induced by the kernel. What happens "inside" the SVM when you work with a polynomial kernel is that each point is actually (indirectly) transformed into an O(d^p)-dimensional space (where d is the input data dimension and p is the degree of the polynomial kernel). From a mathematical perspective you work with some (often unknown) projection phi_K(x) which has the property that K(x, y) = <phi_K(x), phi_K(y)>, and nothing more. An SVM implementation does not need the actual data representation (as phi_K(x) is usually huge, sometimes even infinite-dimensional, as in the RBF case); instead it needs the vector of dot products of your point with each element of the training set.
Thus what you do (in implementations, not from a math perspective) is provide:
During training: the whole Gram matrix G, defined as G_ij = K(x_i, x_j), where x_i is the i-th training sample.
During testing: when you get a new point y, you provide it to the SVM as a vector of dot products H such that H_i = K(y, x_i), where again the x_i are your training points (in fact you only need the values for the support vectors, but many implementations, like libsvm, require a vector of the size of the training set - you can simply put 0's for K(y, x_j) if x_j is not a support vector).
Just remember that this is not the same as training a linear SVM "on top" of the above representation. This is just the way implementations usually accept your data, as they need a definition of the dot product (a function), and it is often easier to pass numbers than functions (but some of them, like scikit-learn's SVC module, actually accept a function as the kernel parameter).
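A minimal scikit-learn sketch of the "pass the Gram matrix" workflow described above (random data and an assumed gamma, purely to make the snippet runnable):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.RandomState(0)
X_train = rng.randn(20, 5)        # 20 training points, 5 features
y_train = rng.randint(0, 2, 20)   # binary labels
X_test = rng.randn(3, 5)          # 3 new points

gamma = 0.1                       # assumed RBF width parameter

# During training: the whole Gram matrix G_ij = K(x_i, x_j)
G_train = rbf_kernel(X_train, X_train, gamma=gamma)
clf = SVC(kernel="precomputed")
clf.fit(G_train, y_train)

# During testing: for each new point y, the vector H_i = K(y, x_i)
# over all training points (shape: n_test x n_train)
H_test = rbf_kernel(X_test, X_train, gamma=gamma)
print(clf.predict(H_test))
```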
So what is the RBF kernel? It is actually a mapping from points into the function space of normal distributions with means at your training points. The dot product is then just an integral from -inf to +inf of the product of two such functions. Sounds complex? It is at first sight, but it is a really nice trick, worth understanding!
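A hedged one-dimensional sketch of that statement, mapping each point x to an (unnormalized) Gaussian bump centered at x; the width sigma is an assumption and only affects the constants:

```latex
% Map a point x to the Gaussian function centered at x
\phi_x(t) = e^{-(t - x)^2 / (2\sigma^2)}

% The L2 inner product of two such functions is itself an RBF kernel in x and y
\int_{-\infty}^{\infty} \phi_x(t)\,\phi_y(t)\, dt
  = \sigma\sqrt{\pi}\; e^{-(x - y)^2 / (4\sigma^2)}
  \;\propto\; \exp\!\big(-\gamma (x - y)^2\big), \qquad \gamma = \tfrac{1}{4\sigma^2}
```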
I have only 2 classes.
Fisher Discriminant Analysis projects the data into a low-dimensional discriminative subspace. According to the papers, I can find at most C-1 non-zero eigenvalues. That means that if my initial data has dimension d, the projected data will have a dimension of at most C-1.
If the number of classes is only 2, I will get a feature vector with only one dimension.
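A small scikit-learn sketch of this constraint, with random data just to show the shapes (LinearDiscriminantAnalysis is used here as a stand-in for classical FDA):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.RandomState(0)
X = rng.randn(100, 10)            # d = 10 original features
y = rng.randint(0, 2, 100)        # C = 2 classes

# With C classes, at most C - 1 discriminant directions are available,
# so for 2 classes the projection is one-dimensional.
lda = LinearDiscriminantAnalysis(n_components=1)
X_proj = lda.fit_transform(X, y)
print(X_proj.shape)               # (100, 1)

# Asking for more components than C - 1 raises an error:
# LinearDiscriminantAnalysis(n_components=2).fit(X, y)  # ValueError
```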
My problem is that I would like to project my data into a discriminative subspace and then get a feature vector of size m (m < d, m ~= 1).
Is there any way to do this kind of discriminative projection?
I am using libsvm to train an SVM with HOG features. The model file has n support vectors, but when I try to use it in OpenCV's SVM I find that there is only one vector in OpenCV's model. How does OpenCV do it?
I guess libsvm stores the support vectors, whereas OpenCV just uses a weight vector to store the hyperplane (one vector + one scalar suffices to describe a plane) - you can get there from the decision function over the support vectors by swapping the sum and the scalar product.
Here is the explanation from Learning OpenCV 3:
In the case of linear SVM, all the support vectors for each decision plane can be compressed into a single vector that will basically describe the separating hyperplane.
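A short scikit-learn sketch of that compression for a linear kernel (random data, purely for illustration): the libsvm-style model exposes the support vectors and dual coefficients, and swapping the sum and the scalar product collapses them into one weight vector, exactly as described above.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(50, 6)
y = rng.randint(0, 2, 50)

clf = SVC(kernel="linear").fit(X, y)

# Decision function: f(x) = sum_i (alpha_i * y_i) <x_i, x> + b over support vectors.
# For a linear kernel the sum collapses into a single weight vector:
w = clf.dual_coef_ @ clf.support_vectors_   # shape (1, n_features)
b = clf.intercept_

# Same hyperplane as the one scikit-learn reports directly
print(np.allclose(w, clf.coef_))            # True
```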