Latent semantic analysis (LSA) singular value decomposition (SVD) understanding - analysis

Bear with me through my modest understanding of LSI (Mechanical Engineering background):
After performing SVD in LSI, you have 3 matrices:
U, S, and V transpose.
U compares words with topics and S is a sort of measure of strength of each feature. Vt compares topics with documents.
U dot S dot Vt
returns the original matrix from before the SVD. Without doing much (any) in-depth algebra, it seems that:
U dot S dot **Ut**
returns a term by term matrix, which provides a comparison between the terms. i.e. how related one term is to other terms, a DSM (design structure matrix) of sorts that compares words instead of components. I could be completely wrong, but I tried it on a sample data set, and the results seemed to make sense. It could just be bias though (I wanted it to work, so I saw what I wanted). I can't post the results as the documents are protected.
My question though is: Does this make any sense? Logically? Mathematically?
Thanks for any time/responses.
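For concreteness, here is a minimal numpy sketch of the setup described above; the tiny term-document matrix is made up purely for illustration:

import numpy as np

# Hypothetical 5-term x 4-document count matrix (values made up for illustration).
A = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [0, 4, 1, 0],
    [0, 3, 0, 2],
    [1, 1, 2, 2],
], dtype=float)

# SVD: A = U S Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)

# U dot S dot Vt reconstructs the original matrix (up to floating-point error).
print(np.allclose(U @ S @ Vt, A))        # True

# U dot S dot Ut is a square term-by-term matrix; the question is whether its
# entries can be read as a measure of how related each pair of terms is.
term_by_term = U @ S @ U.T
print(term_by_term.shape)                # (5, 5)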

If you want to know how related one term is to another you can just compute
(U dot S)
The terms are represented by the row vectors. You can then compute a distance matrix by applying a distance function such as Euclidean distance. Once you build the distance matrix by computing the distance between all pairs of vectors, the resulting matrix should be hollow symmetric with all distances >= 0. If the distance A[i,j] is small then the two terms are related; otherwise they are not.
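A short sketch of what this answer suggests, reusing the same made-up term-document matrix: take the rows of U·S as term vectors and compute their pairwise Euclidean distances (scipy is used here only for convenience):

import numpy as np
from scipy.spatial.distance import pdist, squareform

# Same made-up 5-term x 4-document matrix as above.
A = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [0, 4, 1, 0],
    [0, 3, 0, 2],
    [1, 1, 2, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Row i of U * S is the representation of term i in the latent space.
term_vectors = U * s                     # broadcasting scales column j of U by s[j]

# Pairwise distance matrix: hollow (zero diagonal), symmetric, all entries >= 0.
D = squareform(pdist(term_vectors, metric="euclidean"))

# Small D[i, j] means terms i and j are closely related in the latent space.
print(np.round(D, 2))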


Is there any reason to (not) L2-normalize vectors before using cosine similarity?

I was reading the paper "Improving Distributional Similarity with Lessons Learned from Word Embeddings" by Levy et al., and while discussing their hyperparameters, they say:
Vector Normalization (nrm) As mentioned in Section 2, all vectors (i.e. W’s rows) are normalized to unit length (L2 normalization), rendering the dot product operation equivalent to cosine similarity.
I then recalled that the default for the sim2 vector similarity function in the R text2vec package is to L2-norm vectors first:
sim2(x, y = NULL, method = c("cosine", "jaccard"), norm = c("l2", "none"))
So I'm wondering what the motivation might be for this normalization step before cosine similarity (both in terms of text2vec and in general). I tried to read up on the L2 norm, but mostly it comes up in the context of normalizing before using the Euclidean distance. Surprisingly, I could not find anything on whether the L2 norm would be recommended for or against in the case of cosine similarity on word vector spaces/embeddings. And I don't quite have the math skills to work out the analytic differences.
So here is the question, meant in the context of word vector spaces learned from textual data (either plain co-occurrence matrices, possibly weighted by tf-idf, PPMI, etc., or embeddings like GloVe) and of calculating word similarity (with the goal, of course, being to use a vector space + metric that best reflects real-world word similarities). Is there, in simple words, any reason to (not) use the L2 norm on a word-feature matrix/term-co-occurrence matrix before calculating cosine similarity between the vectors/words?
If you want to get cosine similarity you DON'T need to normalize to unit L2 norm first and then calculate cosine similarity. Cosine similarity already normalizes the vectors and then takes the dot product of the two.
If you are calculating Euclidean distance then you DO need to normalize if distance or vector length is not an important distinguishing factor. If vector length is a distinguishing factor then don't normalize and calculate Euclidean distance as it is.
text2vec handles everything automatically - it will make the rows have unit L2 norm and then call the dot product to calculate cosine similarity.
But if the matrix already has rows with unit L2 norm then the user can specify norm = "none" and sim2 will skip the first normalization step (which saves some computation).
I understand the confusion - probably I need to remove the norm option (it doesn't take much time to normalize the matrix).
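To make the point above concrete, here is a small numpy check (random vectors, purely illustrative) that L2-normalizing first and then taking plain dot products gives exactly the cosine similarities:

import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 10))          # 4 made-up word vectors, 10 dimensions

# Cosine similarity computed directly.
norms = np.linalg.norm(W, axis=1, keepdims=True)
cos = (W @ W.T) / (norms * norms.T)

# L2-normalize first, then take plain dot products.
W_unit = W / norms
dot_of_normalized = W_unit @ W_unit.T

# The two are the same; normalizing up front just lets you replace the
# cosine computation with a cheap matrix product later on.
print(np.allclose(cos, dot_of_normalized))   # True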

Full-Rank Assumption in Least Squares Estimation (Linear Regression)

In ordinary least squares (OLS) estimation, the assumption is that the samples matrix X (of shape N_samples x N_features) has "full column rank".
This is apparently needed so that the linear regression can be reduced to a simple algebraic equation using the Moore–Penrose inverse. See this section of the Wikipedia article for OLS:
https://en.wikipedia.org/wiki/Ordinary_least_squares#Estimation
In theory this means that if all columns of X (i.e. features) are linearly independent we can make an assumption that makes OLS simple to calculate, correct?
What does this mean in practice?
Does this mean that OLS is not calculable and will result in an error for such input data X? Or will the result just be bad?
Are there any classical datasets for which linear regression fails due to this assumption not being true?
The full rank assumption is only needed if you were to use the inverse (or Cholesky decomposition, or QR, or any other method that is (mathematically) equivalent to computing the inverse). If you use the Moore-Penrose inverse you will still compute an answer. When the full rank assumption is violated there is no longer a unique answer, i.e. there are many x that minimise
||A*x - b||
The one you compute with the Moore-Penrose inverse will be the x of minimum norm. See here, for example.
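A small numpy illustration of this, using a deliberately rank-deficient design matrix (the numbers are made up): the normal equations break down, while the pseudoinverse (or lstsq) still returns the minimum-norm solution.

import numpy as np

# Rank-deficient design matrix: the third column duplicates the first,
# so the columns are not linearly independent.
X = np.array([[1., 2., 1.],
              [2., 0., 2.],
              [3., 1., 3.],
              [4., 5., 4.]])
y = np.array([1., 2., 3., 4.])

# Normal equations: X^T X is singular, so the plain inverse fails.
try:
    beta = np.linalg.inv(X.T @ X) @ X.T @ y
except np.linalg.LinAlgError as err:
    print("normal equations failed:", err)

# The Moore-Penrose pseudoinverse still returns an answer:
# the minimum-norm x among all minimizers of ||X x - y||.
beta_pinv = np.linalg.pinv(X) @ y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_pinv, beta_lstsq)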

Is it possible to use KDTree with cosine similarity?

Looks like I can't use this similarity metric with the sklearn KDTree, for example, but I need it because I am measuring similarity between word vectors. What is a fast, robust algorithm to adapt for this case? I know about Locality-Sensitive Hashing, but it needs a lot of tuning and testing to find good parameters.
The ranking you would get with cosine similarity is equivalent to the rank order of the Euclidean distance when you normalize all the data points first. So you can use a KD-tree to get the k nearest neighbors, but you will need to recompute what the cosine similarity is.
The cosine similarity is not a distance metric as normally presented, but it can be transformed into one. If done, you can then use other structures like Ball trees to do accelerated NN with cosine similarity directly. I've implemented this in the JSAT library, if you are interested in a Java implementation.
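A sketch of that normalize-then-Euclidean trick with sklearn's KDTree (random vectors standing in for word embeddings): for unit vectors the cosine similarity can be recovered from the Euclidean distance via cos = 1 - d^2/2.

import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))              # made-up stand-ins for word vectors

# Normalize rows to unit length: Euclidean rank order then matches cosine rank order.
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)

tree = KDTree(Xn)                            # Euclidean metric by default
dist, idx = tree.query(Xn[:1], k=5)          # 5 nearest neighbours of the first vector

# For unit vectors: ||a - b||^2 = 2 - 2*cos(a, b), so recover the similarity.
cos_sim = 1.0 - dist**2 / 2.0
print(idx[0], np.round(cos_sim[0], 3))

Keep in mind the caveat in the next answer: for genuinely high-dimensional embeddings the k-d-tree's pruning becomes ineffective, and structures like Ball trees or inverted lists may serve better.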
According to the table at the end of this page, cosine support with the k-d-tree should be possible: ELKI supports cosine with the R-tree, and you can derive bounding rectangles for the k-d-tree, too; the k-d-tree supports at least five metrics in that table. So I do not see why it shouldn't work.
Indexing support in sklearn often is not very complete (albeit improving), unfortunately; so don't take that as a reference.
While the k-d-tree can theoretically support cosine by
-- transforming the data such that cosine becomes Euclidean distance, or
-- working with the bounding boxes and the minimum angle to the bounding box (which appears to be what ELKI does for the R-tree),
you should be aware that the k-d-tree does not work very well with high-dimensional data, and cosine is mostly popular for very high-dimensional data. A k-d-tree always looks at only one dimension. If you want all d dimensions to be used at least once, you need O(2^d) data points. For high d, there is no way all attributes get used.
The R-tree is slightly better here because it uses bounding boxes; these shrink with every split in all dimensions, so the pruning does get better. But this also means it needs a lot of memory for such data, and the tree construction may suffer from the same problem.
So in essence, don't use either for high dimensional data.
But also don't assume that Cosine magically improves your results, in particular for high-d data. It's very much overrated. As the above transformation indicates, there cannot be a systematic benefit of Cosine over Euclidean: Cosine is a special case of Euclidean.
For sparse data, inverted lists (c.f. Lucene, Xapian, Solr, ...) are the way to index for cosine.

What is a good metric for feature vector comparison and how to normalize them before comparison?

Background:
I am working on a bottom-up approach to image segmentation wherein I first over-segment the image into small regions/super-pixels/super-voxels and then iteratively merge adjoining over-segmented regions based on some criterion. One criterion I have been playing with is to measure how similar in appearance the two regions are. To quantify the appearance of a region, I use several measures -- intensity statistics, texture features, etc. I lump all the features I compute for a region into a long feature vector.
Question:
Given two adjacent over-segmented regions R1 and R2, let F1 and F2 be the corresponding feature vectors. My questions are the following:
-- What are good metrics to quantify the similarity between F1 and F2?
-- How best to normalize F1 and F2 before quantifying their similarity with a metric? (Using any supervised approach to normalization is not feasible because I don't want my algorithm to be tied to one set of images.)
Solution in my mind:
Similarity(R1, R2) = dot_product(F1 / norm(F1), F2 / norm(F2))
In words, I first normalize F1 and F2 to be unit vectors and then use the dot product between the two vectors as a similarity measure.
I wonder if there are better ways to normalize them and compare them with a metric. I would be glad if the community can point me to some references and write out reasons why something else is better than the similarity measure I am using.
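For what it's worth, a minimal sketch of the similarity proposed above; the helper name and the feature values are placeholders, and in practice F1 and F2 would be the concatenated intensity/texture features of two adjacent regions:

import numpy as np

def similarity(F1, F2):
    # Normalize to unit vectors, then take the dot product (cosine similarity).
    return float(np.dot(F1 / np.linalg.norm(F1), F2 / np.linalg.norm(F2)))

# Placeholder feature vectors for two adjacent regions (intensity stats, texture, ...).
F1 = np.array([0.2, 1.5, 3.0, 0.0, 7.1])
F2 = np.array([0.1, 1.7, 2.8, 0.3, 6.9])
print(similarity(F1, F2))   # close to 1.0 for regions with similar appearance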
State-of-the-art image segmentation algorithms use Conditional Random Fields over superpixels (IMO the SLIC algorithm is the best option). This type of algorithm captures the relationships between adjacent superpixels at the same time as it classifies each superpixel (normally using an SSVM).
For superpixel classification you will normally collect a bag of features for each of them, such as SIFT descriptors, histograms, or whatever features you think might help.
There are many papers that describe this process, here you have some of them which I find interesting:
Associative Hierarchical CRFs for Object Class Image Segmentation
Class Segmentation and Object Localization with Superpixel Neighborhoods
Figure-ground segmentation using a hierarchical conditional random field
However, there are not many libraries or software for dealing with CRF. The best you can find out there is this blog entry.
I lump all the features I compute for a region into a long feature vector. [...]
What are good metrics to quantify the similarity between F1 and F2? [...]
How best to normalize F1 and F2?
tl;dr: use a TF-IDF kind of scoring as described here (see Discrete Approach, slides 18-35).
There is a (quite old) CBIR engine called GIFT (a.k.a The GNU Image-Finding Tool) that precisely follows such an approach to compute similarity between images.
What is particularly interesting with GIFT is that it applies techniques from text retrieval directly to CBIR - which has in some ways become a classic approach (see A Text Retrieval Approach to Object Matching in Videos).
In practice GIFT extracts a large number of local and global low-level color and texture features, where each individual feature (e.g. the amount of the i-th color within a histogram) can be thought of as a visual word:
global color (HSV color histogram): 166 bins = 166 visual words
local color (color histogram analysis by recursively subdividing the input image into sub-regions): 340 (sub-regions) x 166 (bins) = 56,440 visual words
global texture (Gabor histogram): 3 (scales) x 4 (orientations) x 10 (ranges) = 120 visual words
local texture (Gabor histogram in a grid of sub-regions): 256 (sub-regions) x 120 (bins) = 30,720 visual words
So for any input image GIFT is able to extract a 87,446-dimensional feature vector F, keeping in mind that a feature is considered as either present (with a certain frequency F[i]) or not present in the image (F[i] = 0).
Then the trick consists in first indexing every image (here every region) into an inverted file for efficient querying. In a second step (query time) you are then free to use each region as a query image.
At query time the engine uses a classical TF-IDF scoring:
/* Sum: sum over each visual word i of the query image
* TFquery(i): term frequency of visual word i in the query image
* TFcandidate(i): term frequency of visual word i in the candidate image
* CF(i): collection frequency of visual word i in the indexed database
*/
score(query, candidate) = Sum [ TFquery(i) * TFcandidate(i) * log^2(1/CF(i)) ]
Internally things are a bit more complex since GIFT:
performs sub-queries by focusing separately on each kind of low-level features (sub query 1 = color hist only, sub query 2 = color blocks, etc) and merges the scores,
includes features pruning to evaluate only a certain percentage of the features.
GIFT is pretty efficient, so I'm pretty sure you can find interesting ideas there to adapt. Of course you could avoid using an inverted index if you have no performance constraints.
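As a rough sketch of the scoring formula quoted above (reading log^2 as the squared logarithm), assuming the term frequencies and collection frequencies have already been computed; the function name and the dictionaries below are hypothetical:

import math

def tfidf_score(tf_query, tf_candidate, cf):
    # Sum over visual words present in both images of
    # TFquery(i) * TFcandidate(i) * (log(1/CF(i)))^2.
    score = 0.0
    for word, tf_q in tf_query.items():
        tf_c = tf_candidate.get(word, 0.0)
        if tf_c == 0.0:
            continue
        idf = math.log(1.0 / cf[word])
        score += tf_q * tf_c * idf * idf
    return score

# Hypothetical sparse term-frequency vectors (visual word id -> frequency)
# and collection frequencies (fraction of indexed images containing the word).
tf_query = {12: 0.4, 507: 0.1, 30001: 0.2}
tf_candidate = {12: 0.3, 30001: 0.5, 80000: 0.1}
cf = {12: 0.25, 507: 0.6, 30001: 0.05, 80000: 0.4}
print(tfidf_score(tf_query, tf_candidate, cf))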
Just want to point out that you don't really need to create unit vectors from F1 or F2 as a separate step before computing the cosine similarity (which is then just the dot product). This is because F1/norm(F1) in your formula already makes each one a unit vector for the direction comparison.
Other metrics for vector comparison include the Euclidean distance, the Manhattan distance, and the Mahalanobis distance. The last one may not be quite applicable in your scenario. Please read Wikipedia for more.
I myself have argued a few times about which one is better to choose, Euclidean or cosine. Note that the context of either metric's usage is subjective. If, in Euclidean space, you just want to measure whether two points are aligned in the same direction, the cosine measure makes sense. If you want an explicit distance metric, Euclidean is better.

importance of PCA or SVD in machine learning

Time and again (especially in the Netflix contest), I come across blogs (or leaderboard forums) where people mention how applying a simple SVD step to the data helped them reduce sparsity in the data or in general improved the performance of the algorithm at hand.
I have been trying to work out why for a long time, but I am not able to.
In general, the data I get is very noisy (which is also the fun part of big data), and I do know some basic feature scaling techniques like log transformation and mean normalization.
But how does something like SVD help?
So let's say I have a huge matrix of users rating movies, and on this matrix I implement some version of a recommendation system (say collaborative filtering):
1) Without SVD
2) With SVD
How does it help?
SVD is not used to normalize the data, but to get rid of redundant data, that is, for dimensionality reduction. For example, if you have two variables, one a humidity index and the other a probability of rain, their correlation is so high that the second one does not contribute any additional information useful for a classification or regression task. The singular values in SVD help you determine which variables are most informative and which ones you can do without.
The way it works is simple. You perform SVD over your training data (call it matrix A), to obtain U, S and V*. Then set to zero all values of S less than a certain arbitrary threshold (e.g. 0.1), call this new matrix S'. Then obtain A' = US'V* and use A' as your new training data. Some of your features are now set to zero and can be removed, sometimes without any performance penalty (depending on your data and the threshold chosen). This is called k-truncated SVD.
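A minimal numpy sketch of that procedure; the 0.1 threshold is arbitrary, as in the text, and the matrix is random with an artificially low rank so that some singular values actually fall below it:

import numpy as np

rng = np.random.default_rng(0)
# Made-up training data with an artificially low rank (20 samples, 10 features, rank <= 6).
A = rng.normal(size=(20, 6)) @ rng.normal(size=(6, 10))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Zero out all singular values below an arbitrary threshold.
threshold = 0.1
s_trunc = np.where(s < threshold, 0.0, s)

# A' = U S' V*; use A' as the new training data.
A_prime = U @ np.diag(s_trunc) @ Vt
print("kept", int(np.count_nonzero(s_trunc)), "of", len(s), "singular values")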
SVD doesn't help you with sparsity though, only helps you when features are redundant. Two features can be both sparse and informative (relevant) for a prediction task, so you can't remove either one.
Using SVD, you go from n features to k features, where each one will be a linear combination of the original n. It's a dimensionality reduction step, just like feature selection is. When redundant features are present, though, a feature selection algorithm may lead to better classification performance than SVD depending on your data set (for example, maximum entropy feature selection). Weka comes with a bunch of them.
See: http://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Dimensionality_Reduction/Singular_Value_Decomposition
https://stats.stackexchange.com/questions/33142/what-happens-when-you-apply-svd-to-a-collaborative-filtering-problem-what-is-th
The Singular Value Decomposition is often used to approximate a matrix X by a low rank matrix X_lr:
Compute the SVD X = U D V^T.
Form the matrix D' by keeping the k largest singular values and setting the others to zero.
Form the matrix X_lr by X_lr = U D' V^T.
The matrix X_lr is then the best approximation of rank k of the matrix X, for the Frobenius norm (the equivalent of the l2-norm for matrices). It is computationally efficient to use this representation, because if your matrix X is n by n and k << n, you can store its low rank approximation with only (2n + 1)k coefficients (by storing U, D' and V).
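A short sketch of the recipe just described (n, k, and the matrix are arbitrary), including the (2n + 1)k storage count mentioned above:

import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 5
X = rng.normal(size=(n, n))                  # arbitrary n x n matrix

U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Keep the k largest singular values and the matching columns of U / rows of V^T.
U_k, d_k, Vt_k = U[:, :k], d[:k], Vt[:k, :]
X_lr = U_k @ np.diag(d_k) @ Vt_k             # best rank-k approximation in Frobenius norm

# Storage: two n x k factors plus k singular values = (2n + 1)k numbers, vs n*n.
print((2 * n + 1) * k, "vs", n * n)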
This was often used in matrix completion problems (such as collaborative filtering) because the true matrix of user ratings is assumed to be low rank (or well approximated by a low rank matrix). So, you wish to recover the true matrix by computing the best low rank approximation of your data matrix. However, there are now better ways to recover low rank matrices from noisy and missing observations, namely nuclear norm minimization. See for example the paper The power of convex relaxation: Near-optimal matrix completion by E. Candes and T. Tao.
(Note: the algorithms derived from this technique also store the SVD of the estimated matrix, but it is computed differently).
PCA or SVD, when used for dimensionality reduction, reduces the number of inputs. This, besides saving the computational cost of learning and/or predicting, can sometimes produce more robust models that are not optimal in a statistical sense but have better performance in noisy conditions.
Mathematically, simpler models have less variance, i.e. they are less prone to overfitting. Underfitting, of course, can be a problem too. This is known as the bias-variance dilemma. Or, as Einstein put it in plain words: things should be made as simple as possible, but not simpler.
