Storing TF-IDF values in an inverted index - search-engine

I'm creating a search engine to search a list of roughly 20k English phrases, each one being a few words long.
I've looked into ways to create the search engine, and currently I am using a TfidfVectorizer from sklearn and cosine similarity to compute the ranking scores.
From what I understand, in information retrieval you have retrieval and ranking phases, but I'm confused about how you could use a data structure like an inverted index to speed up the search before using TfidfVectorizer. It seems like TfidfVectorizer creates a term-document matrix, which is different from an index. Could you just store TF and IDF values in an inverted index and compute cosine similarity at run time? Ideally I want autocomplete of phrases, so I also need to store edge n-grams, and a Boolean model isn't useful here.
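Concretely, something like the following rough sketch is what I have in mind: precompute the TF-IDF weight of each term in each phrase, store it in the postings list, and accumulate partial dot products at query time so only phrases sharing a term with the query are touched. The toy phrases, the whitespace tokenizer and the log IDF formula are just placeholders.

    import math
    from collections import defaultdict, Counter

    def build_index(phrases):
        # Inverted index whose postings store precomputed TF-IDF weights.
        df = Counter()                      # document frequency of each term
        term_counts = []                    # raw term counts per phrase
        for text in phrases:
            counts = Counter(text.lower().split())   # placeholder tokenizer
            term_counts.append(counts)
            df.update(counts.keys())

        n = len(phrases)
        index = defaultdict(list)           # term -> [(phrase_id, tfidf_weight), ...]
        norms = [0.0] * n                   # vector lengths, needed for cosine
        for pid, counts in enumerate(term_counts):
            for term, tf in counts.items():
                weight = tf * math.log(n / df[term])
                index[term].append((pid, weight))
                norms[pid] += weight ** 2
        return index, [math.sqrt(x) for x in norms], df, n

    def search(query, index, norms, df, n, k=10):
        # Term-at-a-time scoring: only phrases sharing a query term are scored.
        scores = defaultdict(float)
        q_norm = 0.0
        for term, tf in Counter(query.lower().split()).items():
            if term not in df:
                continue
            q_weight = tf * math.log(n / df[term])
            q_norm += q_weight ** 2
            for pid, d_weight in index[term]:
                scores[pid] += q_weight * d_weight   # partial dot product
        q_norm = math.sqrt(q_norm) or 1.0
        ranked = [(s / ((norms[pid] * q_norm) or 1.0), pid) for pid, s in scores.items()]
        return sorted(ranked, reverse=True)[:k]

    phrases = ["machine learning basics", "deep learning for search", "learning to rank"]
    index, norms, df, n = build_index(phrases)
    print(search("learning search", index, norms, df, n))

For autocomplete I would additionally index edge n-grams of each term as extra index keys pointing at the same phrases, rather than relying on a Boolean model.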

Related

Hashtag-based Tweet similarity

I have a big dataset consisting of tweets including hashtags and I want to build a hashtag-based similarity engine to get the most similar tweets given a set of hashtags.
In the end I would like to have some kind of "hashtag to vector embedding" model (should work like a language embedding model) which outputs a comparable vector, based on a set of input hashtags.
One idea would be to fit a TF-IDF vectorizer and then take the cosine similarity between a query vector and the tweet vectors.
However, this solution comes with some problems (very large vectors, out-of-dictionary hashtags in new tweets, ...), and I feel like there are better solutions for this problem. Do you have any other approaches or pretrained models that you can recommend or that I should try?
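For reference, the baseline I'm describing looks roughly like this (made-up tweets, and the hashtag token pattern is just an assumption):

    # Treat each tweet's hashtags as one "document"; rank by cosine similarity.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    tweets = [
        "#machinelearning #python #ai",
        "#travel #photography",
        "#python #datascience #ai",
    ]

    # token_pattern keeps whole hashtags as tokens (an assumption, not a requirement)
    vectorizer = TfidfVectorizer(token_pattern=r"#\w+")
    tweet_vectors = vectorizer.fit_transform(tweets)       # sparse n_tweets x n_hashtags matrix

    query_vector = vectorizer.transform(["#ai #python"])   # unseen hashtags are simply dropped
    scores = cosine_similarity(query_vector, tweet_vectors).ravel()
    ranking = scores.argsort()[::-1]
    print([(tweets[i], round(float(scores[i]), 3)) for i in ranking])

This is the straightforward baseline; the dimensionality and out-of-vocabulary problems are exactly what I'd like a better model for.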

Can a list of websites be considered a corpus for a particular category?

I am trying to build my own corpus for particular categories such as Engineering, Business, Math, Science, etc. This will be used for automatic web page categorization. Let's say I manually collect 100 websites that are related to Math. Can these 100 websites be considered a corpus for Math?
Another related question: how does this differ from a lexicon, which, instead of a list of websites, is a list of words with weights (such as 0 or 1) for particular categories? An example would be a sentiment lexicon with words that have weights for positive and negative, except that instead of positive and negative, categories such as Math and Science are used.
You say you want to do web page categorization, so the problem you're facing is a supervised learning problem. The data you get are web pages, so I guess you actually extract their content as text; you work with textual input data. Since you want to categorize them, each of your inputs has one or more corresponding labels, which are the outputs you want to predict. You have multiple labels, so you want to do multi-label classification.
To tackle this problem, since most machine learning algorithms work with numerical vectors, you need to transform your corpus of texts into vectors (or into one matrix). To do so, you can use the bag-of-words technique, which first builds a dictionary (or lexicon) and then counts the occurrences of each dictionary word in each text. You can transform your output labels in the same way, assigning one index of your output vector to each category.
The final pipeline would be something like this:
[input_text] --bag_of_words--> [input_vector] --prediction--> [output_vector] --label_matching--> [labels]
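For illustration, a minimal sketch of that pipeline with scikit-learn might look like this (the tiny page/label dataset and the choice of logistic regression are just placeholders):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.preprocessing import MultiLabelBinarizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    pages = [
        "calculus algebra proofs and theorems",
        "marketing revenue and business strategy",
        "statistics experiments and physics research",
    ]
    labels = [["Math"], ["Business"], ["Math", "Science"]]

    # bag_of_words step: build the dictionary and count word occurrences
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(pages)

    # label_matching step: one column of the output vector per category
    binarizer = MultiLabelBinarizer()
    Y = binarizer.fit_transform(labels)

    # prediction step: one binary classifier per category (multi-label)
    model = OneVsRestClassifier(LogisticRegression(max_iter=1000))
    model.fit(X, Y)

    new_page = ["a business plan with revenue projections"]
    predicted = model.predict(vectorizer.transform(new_page))
    print(binarizer.inverse_transform(predicted))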

Find similar items based on item attributes

Most of the recommendation algorithms in Mahout require user-item preferences. But I want to find similar items for a given item. My system doesn't have user input; i.e., for any movie, these are the attributes that can be used to compute a similarity coefficient:
Genre
Director
Actor
The attribute list may be modified in the future to build a more efficient system. But to find item similarity, the Mahout data model requires user preferences for each item, whereas I want these movies to be clustered together so I can get the closest items in a cluster for a given item.
Later on, after introducing user-based recommendation, the above results could be used to boost the ranking.
If a product attribute has a fixed set of values, like Genre, do I have to convert those values to numerical values? If so, how will the system calculate the distance between two items when genre-1 and genre-2 don't have any numeric relation?
Edit:
I have found a few examples using the command line, but I want to do this in Java and save the pre-computed values for later use.
I think in the case of feature vectors, the best similarity measures are the ones based on exact matches, such as Jaccard similarity.
With Jaccard, the similarity between two item vectors is calculated as:
(number of features in the intersection) / (number of features in the union)
So converting the genre to a numerical value will not make a difference, since the exact match (which is used to find the intersection) works the same for non-numerical values.
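For example, here is a small sketch of Jaccard over raw attribute values (the movies are invented, and this is plain Python rather than Mahout):

    def jaccard(a, b):
        # |intersection| / |union| of two attribute sets
        a, b = set(a), set(b)
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    movie_1 = {"genre:action", "genre:thriller", "director:nolan", "actor:bale"}
    movie_2 = {"genre:action", "director:nolan", "actor:caine"}

    print(jaccard(movie_1, movie_2))   # 2 shared / 5 total = 0.4

Prefixing each value with its attribute name keeps values from different attributes from colliding; no numeric encoding is involved.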
Take a look at this question for how to do it in mahout:
Does Mahout provide a way to determine similarity between content (for content-based recommendations)?
It sounds like Mahout's spark-rowsimilarity algorithm, available since version 0.10.0, would be the perfect solution to your problem. It compares the rows of a given matrix (i.e., row vectors representing movies and their properties), looking for co-occurrences of values across those rows - or in your case, co-occurrences of genres, directors, and actors. No user history or item interactions are needed. The end result is another matrix mapping each of your movies to the top n most similar other movies in your collection, based on co-occurrence of genre, director, or actor.
The Apache Mahout site has a great write-up regarding how to do this from the command line, but if you want a deeper understanding of what's going on under the covers, read Pat Ferrel's machine learning blog Occam's Machete. He calls this type of similarity content or metadata similarity.

Fast k-NN search over bag-of-words models

I have a large number of documents of equal size. For each of these documents I'm building a bag-of-words (BOW) model. The number of possible words across all documents is bounded but large (2^16, for example). Generally speaking, I have N histograms of size K, where N is the number of documents and K is the histogram width. I can calculate the distance between any two histograms.
First optimization opportunity: documents usually use only a small subset of the words (usually less than 5%, most of them less than 0.5%).
Second optimization opportunity: the subset of used words varies a lot from document to document, so I can use bits instead of word counts.
Query by content
A query is a document as well. I need to find the k most similar documents.
Naive approach
Calculate the BOW model from the query.
For each document in the dataset:
Calculate its BOW model.
Find the distance between the query and the document.
Obviously, some data structure should be used to track the top-ranked documents (a priority queue, for example).
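For illustration, the naive scan with a bounded priority queue might look roughly like this (the distance function and the document iterator are placeholders):

    import heapq

    def top_k_similar(query_bow, documents, distance, k=10):
        # Naive scan: keep the k closest documents in a bounded heap.
        # `documents` yields (doc_id, bow) pairs; `distance` is any BOW distance.
        heap = []  # max-heap on distance via negation, size capped at k
        for doc_id, doc_bow in documents:
            d = distance(query_bow, doc_bow)
            if len(heap) < k:
                heapq.heappush(heap, (-d, doc_id))
            elif -d > heap[0][0]:            # closer than the current worst
                heapq.heapreplace(heap, (-d, doc_id))
        return sorted((-neg_d, doc_id) for neg_d, doc_id in heap)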
I need some sort of index to get rid of the full database scan. A KD-tree comes to mind, but the dimensionality and size of the dataset are very high. One could suggest using some subset of the possible words as features, but I don't have a separate training phase and can't extract these features beforehand.
I've thought about using the MinHash algorithm to prune the search space, but I can't design appropriate hash functions for this task.
k-d trees and similar indexes are for dense, continuous data.
Your data most likely is sparse.
A good index structure for finding nearest neighbors in sparse data is the inverted list. This is essentially how search engines like Google work.
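A rough sketch of that idea for sparse, set-like bag-of-words data (the word ids and documents are invented):

    from collections import defaultdict

    # Each document is a set of word ids (the bit-vector view from the question).
    doc_words = {
        0: {3, 17, 42},
        1: {42, 99},
        2: {5, 6, 7},
    }

    # Build the inverted lists: word id -> ids of documents containing it.
    inverted = defaultdict(set)
    for doc_id, words in doc_words.items():
        for w in words:
            inverted[w].add(doc_id)

    def nearest(query_words, k=2):
        # Count word overlaps only for candidate docs, then rank by Jaccard.
        overlap = defaultdict(int)
        for w in query_words:
            for doc_id in inverted.get(w, ()):
                overlap[doc_id] += 1
        scored = [
            (cnt / len(query_words | doc_words[d]), d)   # Jaccard similarity
            for d, cnt in overlap.items()
        ]
        return sorted(scored, reverse=True)[:k]

    print(nearest({17, 42, 99}))

Only documents sharing at least one word with the query are ever touched, which is exactly the pruning a full scan lacks.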

What is the role of latent semantic analysis in developing search engines?

I am trying to develop a music-focused search engine for my final year project. I have been doing some research on Latent Semantic Analysis and how it works on the Internet. I am having trouble understanding where LSI sits exactly in the whole search engine system.
Should it be used after a web crawler has finished looking up web pages?
I don't know much about music retrieval, but in text retrieval, LSA is only relevant if the search engine is making use of the vector space model of information retrieval. Most common search engines, such as Lucene, break each document up into words (tokens), remove stop words and put the rest of them into the index, each usually associated with a term weight indicating the importance of the term within the document.
Now the list of (token,weight) pairs can be viewed as a vector representing the document. If you combine all of these vectors into a huge matrix and apply the LSA algorithm to that (after crawling and tokenising, but before indexing), you can use the result of the LSA algorithm to transform the vectors of all documents before indexing them.
Note that in the original vectors, the tokens represented the dimensions of the vector space. LSA will give you a new set of dimensions, and you'll have to index those (e.g. in the form of auto-generated integers) instead of the tokens.
Furthermore, you will have to transform the query into a vector of (token,weight) pairs, too, and then apply the LSA-based transformation to that vector as well.
I am unsure if anybody actually does all of this in any real-life text retrieval engine. One problem is that performing the LSA algorithm on the matrix of all document vectors consumes a lot of time and memory. Another problem is handling updates, i.e. when a new document is added, or an existing one changes. Ideally, you'd recompute the matrix, re-run LSA, and then modify all existing document vectors and re-generate the entire index. Not exactly scalable.
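To make the pipeline concrete, here is a rough sketch using scikit-learn's TruncatedSVD as the LSA step on a TF-IDF matrix (the documents, the number of latent dimensions and the query are placeholders, and a real system would still need the indexing and update machinery discussed above):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.metrics.pairwise import cosine_similarity

    documents = [
        "guitar chords and blues scales",
        "symphony orchestra string section",
        "electric guitar amplifier settings",
    ]

    # 1. Tokenise and weight terms (the term-document matrix).
    vectorizer = TfidfVectorizer(stop_words="english")
    term_doc = vectorizer.fit_transform(documents)

    # 2. LSA: project documents into a small number of latent dimensions.
    lsa = TruncatedSVD(n_components=2, random_state=0)
    doc_vectors = lsa.fit_transform(term_doc)        # these reduced vectors get indexed

    # 3. At query time, apply the same transformations to the query.
    query_vector = lsa.transform(vectorizer.transform(["guitar scales"]))

    # 4. Rank documents by cosine similarity in the latent space.
    scores = cosine_similarity(query_vector, doc_vectors).ravel()
    print(sorted(zip(scores, documents), reverse=True))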
