Universal sentence encoder for big document similarity - machine-learning

I need to create a 'search engine' experience: from a short query (a few words), I need to find the relevant documents in a corpus of thousands of documents.
After analyzing a few approaches, I got very good results with the Universal Sentence Encoder from Google.
The problem is that my documents can be very long. For these very long texts it looks like performance decreases, so my idea was to split the text into sentences/paragraphs.
So I ended up with a list of vectors for each document (one vector per part of the document).
My question is: is there a state-of-the-art algorithm/methodology to compute a score from a list of vectors? I don't really want to merge them into one, as that would create the same effect as before (the relevant part would be diluted in the document). Is there any scoring algorithm to combine the multiple cosine similarities between the query and the different parts of the text?
Important information: I can have short and long texts, so a document can have anywhere from 1 to 10 vectors.

One way of doing this is to embed all sentences of all documents, typically storing them in an index such as FAISS or Elasticsearch, and to store the document identifier of each sentence. In Elasticsearch this can be metadata, but in FAISS it needs to be held in an external mapping.
Then:
1. Embed the query.
2. Calculate the cosine similarity between the query and all sentence embeddings.
3. For the top-k results, group by document identifier and take the sum. (This step is optional depending on whether you're looking for the most similar document or the most similar sentence; here I assume you are looking for the most similar document, thereby boosting documents with a higher similarity.)
You should then have an ordered list of relevant document identifiers.
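A minimal sketch of that pipeline, assuming the Universal Sentence Encoder from TensorFlow Hub and a flat FAISS index; the corpus, the document-to-parts split and the hub URL are illustrative, and the doc_ids list plays the role of the external mapping mentioned above:

```python
import numpy as np
import faiss
import tensorflow_hub as hub

# Load the Universal Sentence Encoder (TF Hub module URL assumed).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Hypothetical corpus: each document is pre-split into sentences/paragraphs.
docs = {
    "doc1": ["first paragraph of the first document", "second paragraph of the first document"],
    "doc2": ["the only paragraph of the second document"],
}

# Embed every part and remember which document it came from (external mapping).
parts, doc_ids = [], []
for doc_id, sentences in docs.items():
    parts.extend(sentences)
    doc_ids.extend([doc_id] * len(sentences))

vectors = np.asarray(embed(parts)).astype("float32")
faiss.normalize_L2(vectors)        # with normalized vectors, inner product == cosine similarity

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

def search(query, k=10):
    q = np.asarray(embed([query])).astype("float32")
    faiss.normalize_L2(q)
    scores, idx = index.search(q, min(k, index.ntotal))
    # Group the top-k sentence hits by document identifier and sum their similarities.
    per_doc = {}
    for score, i in zip(scores[0], idx[0]):
        per_doc[doc_ids[i]] = per_doc.get(doc_ids[i], 0.0) + float(score)
    return sorted(per_doc.items(), key=lambda kv: kv[1], reverse=True)

print(search("short query about the first document"))
```

Summing the per-sentence similarities is only one choice; taking the maximum per document instead ranks documents by their single best-matching part, which avoids favoring long documents with many mediocre matches.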

Related

Text Content Relevancy Check

I need to check the relevancy of content on a particular web page, and I have thousands of web pages to check this on. What is the best way to check whether the page title is relevant to the content on the page?
Your question is a bit vague. When you say:
What is the best way to check if the page title is relevant to the content on the page.
How is being relevant defined in the context of your problem?
I don't know if this is what you want, but a couple of things come to mind, all of which essentially amount to comparing how similar two documents are, with one document being the title and the other the page content.
You can think about methods to generate vector representations for both and compare how similar they are:
Jaccard similarity, using the tokens as elements of both sets (i.e., documents)
TF-IDF weighted vectors and compare them with cosine similarity
Compute distribution topic model/LDA for each document and compare them using Kullback-Leibler divergence
Encode the documents into some sort of dense vector (doc2vec, or read them through an LSTM and keep the last state), and then compare both vectors.
The only consideration is that the size of the title is very small compared to the content of the webpage.
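A minimal sketch of the TF-IDF + cosine similarity option with scikit-learn; the pages are made-up examples, and titles and contents are vectorized with a shared vocabulary so they live in the same space:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pages = [
    {"title": "Cheap flights to Rome", "content": "Book low-cost airline tickets and flights to Rome ..."},
    {"title": "Healthy breakfast ideas", "content": "Stock prices fell sharply in early trading today ..."},
]

# Fit one vocabulary over titles and contents so both are in the same vector space.
texts = [p["title"] for p in pages] + [p["content"] for p in pages]
vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(texts)

n = len(pages)
titles, contents = tfidf[:n], tfidf[n:]

# Diagonal of the similarity matrix pairs each title with its own page content.
for page, score in zip(pages, cosine_similarity(titles, contents).diagonal()):
    print(f"{score:.3f}  {page['title']}")
```

Because titles are so short, a low score may just mean few overlapping terms; comparing the score against other pages' scores (rather than an absolute threshold) is usually more robust.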

Fast k-NN search over bag-of-words models

I have a large number of documents of equal size. For each of those documents I'm building a bag-of-words (BOW) model. The number of possible words across all documents is limited and large (2^16, for example). Generally speaking, I have N histograms of size K, where N is the number of documents and K is the histogram width. I can calculate the distance between any two histograms.
First optimization opportunity: documents usually use only a small subset of words (usually less than 5%, most of them less than 0.5%).
Second optimization opportunity: the subset of used words varies a lot from document to document, so I can use bits instead of word counts.
Query by content
The query is a document as well. I need to find the k most similar documents.
Naive approach
Calculate the BOW model of the query.
For each document in the dataset:
Calculate its BOW model.
Find the distance between the query and the document.
Obviously, some data structure should be used to track the top-ranked documents (a priority queue, for example).
I need some sort of index to get rid of the full database scan. A KD-tree comes to mind, but the dimensionality and the size of the dataset are very high. One could suggest using some subset of the possible words as features, but I don't have a separate training phase and can't extract these features beforehand.
I've thought about using the MinHash algorithm to prune the search space, but I can't design appropriate hash functions for this task.
k-d trees and similar indexes are for dense and continuous data.
Your data is most likely sparse.
A good index for finding nearest neighbors in sparse data is an inverted list. This is essentially the same way search engines like Google work.
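A toy sketch of that idea, assuming the binary (presence/absence) BOW variant from the question: an inverted list maps each word id to the documents containing it, so only documents sharing at least one word with the query are ever scored, here with Jaccard similarity and a heap for the top k:

```python
from collections import defaultdict
import heapq

# Toy corpus: each document is a set of word ids (the "bits instead of counts" variant).
documents = {
    0: {3, 17, 42},
    1: {3, 99},
    2: {512, 1024},
}

# Build inverted lists: word id -> documents containing it.
inverted = defaultdict(list)
for doc_id, words in documents.items():
    for w in words:
        inverted[w].append(doc_id)

def knn(query_words, k=2):
    # Accumulate overlap only for documents that share at least one word with the query.
    overlap = defaultdict(int)
    for w in query_words:
        for doc_id in inverted.get(w, ()):
            overlap[doc_id] += 1
    # Jaccard similarity on the candidates; all other documents are implicitly 0.
    scored = ((ov / len(query_words | documents[d]), d) for d, ov in overlap.items())
    return heapq.nlargest(k, scored)

print(knn({3, 42, 77}))   # [(0.5, 0), (0.25, 1)] -- document 2 is never touched
```

The cost per query is proportional to the total length of the posting lists for the query's words, which is small precisely because the data is sparse.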

modeling feature set with text documents

Example:
I have m sets of ~1000 text documents; in each set, ~10 documents are predictive of a binary result and roughly 990 aren't.
I want to train a classifier to take a set of documents and predict the binary result.
Assume for discussion that each document maps its text to 100 features.
How is this modeled in terms of training examples and features? Do I merge all the text together and map it to a fixed set of features? Do I have 100 features per document * ~1000 documents (100,000 features) and one training example per set of documents? Do I classify each document separately and analyze the resulting set of confidences as they relate to the final binary prediction?
The most common way to handle text documents is with a bag-of-words model. The class proportions are irrelevant. Each word gets mapped to a unique index, and the value at that index is the number of times that token occurs (there are smarter things to do). The number of features/dimensions is then the number of unique tokens/words in your corpus. There are many issues with this, and some of them are discussed here, but it works well enough for many things.
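For illustration, a minimal sketch of that mapping using scikit-learn's CountVectorizer on a made-up two-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the dog sat on the cat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # one row per document, one column per unique token

print(vectorizer.get_feature_names_out())   # ['cat' 'dog' 'on' 'sat' 'the']
print(X.toarray())                          # [[1 0 0 1 1]
                                            #  [1 1 1 1 2]]
```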
I would want to approach it as a two-stage problem.
Stage 1: predict the relevancy of a document from the set of 1000. For best combination with stage 2, use something probabilistic (logistic regression is a good start).
Stage 2: Define features on the output of stage 1 to determine the answer to the ultimate question. These could be things like the counts of words for the n most relevant docs from stage 1, the probability of the most probable document, the 99th percentile of those probabilities, variances in probabilities, etc. Whatever you think will get you the correct answer (experiment!)
The reason for this is as follows: concatenating documents together will drown you in irrelevant information. You'll spend ages trying to figure out which words/features allow actual separation between the classes.
On the other hand, if you concatenate feature vectors together, you'll run into an exchangeability problem. By that I mean, word 1 in document 1 will be in position 1, word 1 in document 2 will be in position 1001, in document 3 it will be in position 2001, etc., and there will be no way to know that the features are all related. Furthermore, presenting the documents in a different order would change the positions in the feature vector, and your learning algorithm won't be aware of this. Equally valid orderings of the documents will lead to completely different results in an entirely non-deterministic and unsatisfying way (unless you spend a long time designing a custom classifier that isn't afflicted with this problem, which might ultimately be necessary, but it's not what I'd start with).
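One way the two-stage idea could look in code; this is only a sketch with synthetic stand-in data (the real per-document features and labels would come from your corpus), and logistic regression is used for stage 2 as well purely as a placeholder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: 1000 documents with 100 bag-of-words features each,
# of which only the first 10 are labelled relevant.
X_docs = rng.poisson(1.0, size=(1000, 100)).astype(float)
y_relevant = np.zeros(1000, dtype=int)
y_relevant[:10] = 1

# Stage 1: probabilistic per-document relevance model.
stage1 = LogisticRegression(max_iter=1000).fit(X_docs, y_relevant)

def set_features(doc_matrix):
    """Summarise one set of documents into a fixed-length feature vector."""
    p = stage1.predict_proba(doc_matrix)[:, 1]      # relevance probability per document
    return np.array([p.max(),                       # probability of the most relevant document
                     np.percentile(p, 99),          # 99th percentile of the probabilities
                     p.var(),                       # spread of the probabilities
                     (p > 0.5).sum()])              # how many documents look relevant

# Stage 2: classify each *set* of documents from these summary features
# (document_sets and y_binary are synthetic placeholders here).
document_sets = [rng.poisson(1.0, size=(1000, 100)).astype(float) for _ in range(20)]
y_binary = rng.integers(0, 2, size=20)
X_sets = np.vstack([set_features(m) for m in document_sets])
stage2 = LogisticRegression().fit(X_sets, y_binary)
print(stage2.predict(X_sets[:3]))
```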

Document classification using Naive Bayes

I have a question regarding the particular Naive Bayes algorithm that is used in document classification. Following is what I understand:
construct some probability for each word in the training set for each known classification
given a document, extract all the words that it contains
multiply together the probabilities of those words being present in a classification
perform (3) for each classification
compare the results of (4) and choose the classification with the highest posterior
What I am confused about is the part where we calculate the probability of each word given the training set. For example, for the word "banana": it appears in 100 documents in classification A, there are 200 documents in A in total, and in total 1000 words appear in A. To get the probability of "banana" appearing under classification A, do I use 100/200 = 0.5 or 100/1000 = 0.1?
I believe your model will classify more accurately if you count the number of documents the word appears in, not the number of times the word appears in total. In other words
Classify "Mentions Fruit":
"I like Bananas."
should be weighted no more or less than
"Bananas! Bananas! Bananas! I like them."
So the answer to your question would be 100/200 = 0.5.
The description of document classification on Wikipedia also supports my conclusion:
Then the probability that a given document D contains all of the words W, given a class C, is
P(D|C) = ∏_i P(w_i|C)
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
In other words, the document classification algorithm Wikipedia describes tests how many of the list of classifying words a given document contains.
By the way, more advanced classification algorithms will examine sequences of N-words, not just each word individually, where N can be set based on the amount of CPU resources you are willing to dedicate to the calculation.
UPDATE
My direct experience is based on short documents. I would like to highlight research that @BenAllison points out in the comments, which suggests my answer is invalid for longer documents. Specifically:
One weakness is that by considering only the presence or absence of terms, the BIM ignores information inherent in the frequency of terms. For instance, all things being equal, we would expect that if 1 occurrence of a word is a good clue that a document belongs in a class, then 5 occurrences should be even more predictive.
A related problem concerns document length. As a document gets longer, the number of distinct words used, and thus the number of values of x(j) that equal 1 in the BIM, will in general increase.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1529
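As a concrete illustration of the two estimates discussed in the question, here is a small sketch using the banana numbers (the vocabulary size used for smoothing is made up); the document-count (Bernoulli-style) estimate is the one recommended above, and the word-count (multinomial) estimate is the alternative that the update suggests matters more for longer documents:

```python
import math

# Numbers from the question, class A:
docs_with_banana = 100      # documents in A containing "banana"
docs_in_class    = 200      # total documents in A
banana_count     = 100      # total occurrences of "banana" in A (implied by the question)
words_in_class   = 1000     # total word occurrences in A

# Bernoulli-style estimate (presence/absence), as recommended in the answer:
p_bernoulli = docs_with_banana / docs_in_class      # 100/200 = 0.5

# Multinomial estimate (term frequency), the alternative in the question:
p_multinomial = banana_count / words_in_class       # 100/1000 = 0.1

print(p_bernoulli, p_multinomial)

# In practice either estimate is Laplace-smoothed so unseen words don't zero out
# the product, and log-probabilities are summed instead of multiplying raw values.
vocab_size = 5000                                    # hypothetical vocabulary size
p_smoothed = (banana_count + 1) / (words_in_class + vocab_size)
log_term = math.log(p_smoothed)                      # one such term per word in the document
```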

Calculating IDF (Inverse Document Frequency) for document categorization

I have a doubt about calculating IDF (Inverse Document Frequency) in document categorization. I have more than one category, with multiple documents for training. I am calculating IDF for each term in a document using the following formula:
IDF(t,D) = log(Total Number of Documents / Number of Documents matching term)
My questions are:
What does "Total Number of Documents in Corpus" mean? Is it the document count from the current category or from all available categories?
What does "Number of Documents matching term" mean? Is it the count of documents matching the term in the current category or in all available categories?
"Total Number of Documents in Corpus" is simply the number of documents you have in your corpus, across all categories. So if you have 20 documents, then this value is 20.
"Number of Documents matching term" is the count of how many documents the term t occurs in. So if you have 20 documents in total and the term t occurs in 15 of them, then this value is 15.
The value for this example would thus be IDF(t,D) = log(20/15) ≈ 0.1249 (using log base 10).
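For concreteness, the same calculation in Python (the number above corresponds to log base 10; with the natural log it would be about 0.288):

```python
import math

idf = math.log10(20 / 15)   # total documents / documents containing the term
print(round(idf, 4))        # 0.1249
```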
Now, if I'm correct, you have multiple categories per document and you want to be able to categorize new documents with one or more of these categories. One method to do this would be to create one document for each category. Each category-document should hold all texts which are labelled with this category. You can then perform tf*idf on these documents.
A simple way of categorizing a new document could then be to sum, for each category, the term values of the query as calculated for that category-document. The category whose term values give the highest total will then be ranked 1st.
Another possibility is to create a vector for the query using the idf of each term in the query. All terms which don't occur in the query are given the value 0. The query vector can then be compared for similarity to each category vector using, for example, cosine similarity.
Smoothing is also a useful technique to deal with words in a query which don't occur in your corpus.
I'd suggest reading sections 6.2 and 6.3 of "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.
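A sketch of the category-document approach with scikit-learn (the categories, texts and query are made up); the vectorizer's idf weights are fitted on the category-documents and reused to build the query vector, which is then compared by cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One concatenated "category document" per category (hypothetical data).
category_docs = {
    "sports":   "match goal team league score player coach stadium",
    "finance":  "stock market shares interest rate bank earnings",
    "politics": "election vote parliament policy minister campaign",
}

names = list(category_docs)
vectorizer = TfidfVectorizer()
category_vectors = vectorizer.fit_transform(category_docs[c] for c in names)

def categorize(query):
    # Build the query vector with the same vocabulary and idf weights,
    # then rank categories by cosine similarity.
    q = vectorizer.transform([query])
    sims = cosine_similarity(q, category_vectors)[0]
    return sorted(zip(names, sims), key=lambda x: x[1], reverse=True)

print(categorize("central bank raises the interest rate"))
```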
I have written a small post describing term frequency-inverse document frequency here: http://bigdata.devcodenote.com/2015/04/tf-idf-term-frequency-inverse-document.html
Here is a snippet from the post:
TF-IDF is one of the most fundamental metrics used in the classification of documents.
Let us try and define these terms:
Term frequency basically captures how frequently a certain word occurs in a document compared to the other words in that document.
Inverse document frequency, on the other hand, captures how widely the word occurs across all the documents in a given collection (the documents which we want to classify into different categories).
