How to find documents similar to a predefined set of documents - machine-learning

From a big population of documents, I would like to find those similar to a predefined set of documents.
All documents inside the set are similar to each other, but very few documents from the population are similar to those in the set. It is quite an unbalanced situation.
As a first step I will calculate the cosine similarity between all docs in the population and all docs from the set. Then for every doc I can extract features like the maximum cosine similarity, the average of the top-10 cosine similarities, the number of docs from the set with similarity greater than ...
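A minimal sketch of that first step, assuming plain TF-IDF vectors; `population_docs`, `set_docs` and the 0.5 cutoff are placeholders for your own corpora and threshold:

```python
# Cosine similarity of every population document against every document in
# the predefined set, then per-document aggregate features.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

population_docs = ["..."]   # the big population of documents
set_docs = ["..."]          # the predefined set of similar documents

vectorizer = TfidfVectorizer(stop_words="english")
X_all = vectorizer.fit_transform(population_docs + set_docs)
X_pop, X_set = X_all[:len(population_docs)], X_all[len(population_docs):]

sims = cosine_similarity(X_pop, X_set)            # shape: (n_population, n_set)

features = np.column_stack([
    sims.max(axis=1),                             # maximum similarity to the set
    np.sort(sims, axis=1)[:, -10:].mean(axis=1),  # average of the top-10 similarities
    (sims > 0.5).sum(axis=1),                     # docs from the set above a cutoff
])
```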
But what approach to use then? What model?
It doesn't seem like a classical classification problem, as I don't have labels. Maybe I can mark all documents from the set as class A and the rest as class B.
I can also try to rank all candidates, but there are several features to rank by.
Clustering algorithms? But I don't have absolute coordinates in a space; I only have similarities - relative distances between each pair of documents. Is there a clustering algorithm that can handle this?
I have an idea of how to validate the model: I can take part of the documents from the set, mix them with the population, and check how many of them are found by the model's predictions.
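A rough sketch of that validation idea; `score_documents` is a hypothetical stand-in for whatever scoring model is eventually chosen, and the held-out fraction is arbitrary:

```python
# Hide part of the set inside the population and check how many hidden
# documents are recovered near the top of the ranking.
import numpy as np

rng = np.random.default_rng(0)
hidden_idx = set(rng.choice(len(set_docs), size=max(1, len(set_docs) // 5), replace=False))
hidden = [d for i, d in enumerate(set_docs) if i in hidden_idx]
remaining_set = [d for i, d in enumerate(set_docs) if i not in hidden_idx]

candidates = population_docs + hidden
scores = score_documents(candidates, remaining_set)  # higher = more similar to the set
top_k = np.argsort(scores)[::-1][:len(hidden)]

recall = sum(i >= len(population_docs) for i in top_k) / len(hidden)
print(f"recall of hidden set documents in the top-{len(hidden)}: {recall:.2f}")
```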

Related

Is there a way to find the most representative set of samples of the entire dataset?

I'm working on text classification and I have a set of 200,000 tweets.
The idea is to manually label a small set of tweets and train classifiers to predict the labels of the rest. Supervised learning.
What I would like to know is whether there is a method to choose which samples to include in the training set so that it is a good representation of the whole data set, and so that, thanks to the high diversity captured in the training set, the trained classifiers can be trusted when applied to the rest of the tweets.
This sounds like a stratification question - do you have pre-existing labels or do you plan to design the labels based on the sample you're constructing?
If it's the first scenario, I think the steps in order of importance would be:
Stratify by target class proportions (so if you have three classes, and they are 50-30-20%, train/dev/test should follow the same proportions)
Stratify by features you plan to use
Stratify by tweet length/vocabulary etc.
If it's the second scenario, and you don't have labels yet, you may want to look into using n-grams as features, coupled with a dimensionality-reduction or clustering approach (a sketch of the clustering route follows this list). For example:
Use something like PCA or t-SNE to maximize distance between tweets (or a large subset), then pick candidates from different regions of the projected space
Cluster them based on lexical items (unigrams or bigrams, possibly using log frequencies or TF-IDF and stop word filtering, if content words are what you're looking for) - then you can cut the tree at a height that gives you n bins, which you can then use as a source for samples (stratify by branch)
Use something like LDA to find n topics, then sample stratified by topic
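A rough sketch of the clustering route, using scikit-learn; the number of bins (20), the samples per bin (10) and the vectorizer settings are arbitrary choices, not recommendations:

```python
# TF-IDF over unigrams/bigrams, KMeans into n bins, then an equal number of
# sampled tweets per bin (stratified by cluster).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["..."]  # the ~200,000 unlabelled tweets

X = TfidfVectorizer(ngram_range=(1, 2), stop_words="english").fit_transform(tweets)
bins = KMeans(n_clusters=20, random_state=0).fit_predict(X)

rng = np.random.default_rng(0)
to_label = []
for b in range(20):
    members = np.flatnonzero(bins == b)
    to_label.extend(rng.choice(members, size=min(10, len(members)), replace=False))
# `to_label` now holds tweet indices to annotate, stratified by cluster
```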
Hope this helps!
It seems that before you know anything about the classes you are going to label, a simple uniform random sample will do almost as well as any stratified sample - because you don't know in advance what to stratify on.
After labelling this first sample and building the first classifier, you can start so-called active learning: make predictions for the unlabelled dataset, and sample some tweets on which your classifier is least confident. Label them, retrain the classifier, and repeat.
Using this approach, I managed to create a good training set after several (~5) iterations, with ~100 texts in each iteration.
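A compact sketch of that least-confidence loop; `labelled_texts`, `labels`, `unlabelled_texts` and `ask_human` are placeholders for the initial manual sample and the manual labelling step:

```python
# Least-confidence active learning: train, score the unlabelled pool, label
# the texts the classifier is least sure about, and repeat.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer().fit(labelled_texts + unlabelled_texts)

for iteration in range(5):                            # ~5 iterations, ~100 texts each
    clf = LogisticRegression(max_iter=1000).fit(
        vectorizer.transform(labelled_texts), labels)
    proba = clf.predict_proba(vectorizer.transform(unlabelled_texts))
    least_confident = np.argsort(proba.max(axis=1))[:100]
    for i in sorted(least_confident, reverse=True):   # pop from the end first
        labelled_texts.append(unlabelled_texts.pop(i))
        labels.append(ask_human(labelled_texts[-1]))
```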

Document similarity: Vector embedding versus Tf-Idf performance?

I have a collection of documents, where each document is rapidly growing with time. The task is to find similar documents at any fixed time. I have two potential approaches:
A vector embedding (word2vec, GloVe or fastText): average the word vectors in a document and use cosine similarity.
Bag-of-Words: tf-idf or its variations such as BM25.
Will one of these yield a significantly better result? Has anyone done a quantitative comparison of tf-idf versus averaged word2vec vectors for document similarity?
Is there another approach that allows the document vectors to be refined dynamically as more text is added?
Doc2vec or word2vec?
According to the article "Learning Semantic Similarity for Very Short Texts" (IEEE, 2015), the performance of doc2vec (paragraph2vec) is poor for short documents.
What about short documents?
If you want to compare the similarity of short documents, you might want to vectorize them via word2vec.
How to construct a document vector?
For example, you can construct a document vector as a weighted average of its word vectors, using tf-idf weights.
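A minimal sketch of such a tf-idf-weighted average; `word_vectors` is assumed to be a dict-like mapping word -> vector (e.g. pre-trained word2vec/GloVe/fastText vectors) and `idf` a dict of idf weights fitted on your own corpus:

```python
# One document vector = weighted average of its known word vectors,
# weighted by idf so that frequent words count for less.
import numpy as np

def doc_vector(tokens, word_vectors, idf, dim=300):
    vecs, weights = [], []
    for tok in tokens:
        if tok in word_vectors:
            vecs.append(word_vectors[tok])
            weights.append(idf.get(tok, 1.0))    # neutral weight for unknown idf
    if not vecs:
        return np.zeros(dim)                     # no known words in the document
    return np.average(np.vstack(vecs), axis=0, weights=weights)
```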
Similarity measure
In addition, I recommend using TS-SS rather than cosine or Euclidean similarity.
Please refer to the following article or the summary in github below.
"A Hybrid Geometric Approach for Measuring Similarity Level Among Documents and Document Clustering"
https://github.com/taki0112/Vector_Similarity
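A compact sketch of TS-SS as I read the paper and the linked repository (the product of a triangle term and a sector term; lower values mean more similar); please consult the repository for the reference implementation:

```python
# TS-SS: combine the angle between the vectors, their magnitudes and their
# Euclidean distance into a single dissimilarity value.
import numpy as np

def ts_ss(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    cos = np.dot(a, b) / (na * nb)
    theta = np.arccos(np.clip(cos, -1.0, 1.0)) + np.radians(10)  # angle + 10 degrees
    triangle = na * nb * np.sin(theta) / 2                       # triangle similarity
    ed = np.linalg.norm(a - b)                                   # Euclidean distance
    md = abs(na - nb)                                            # magnitude difference
    sector = np.pi * (ed + md) ** 2 * np.degrees(theta) / 360    # sector similarity
    return triangle * sector
```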
Thank you.
You have to try it: the answer may vary based on your corpus and application-specific perception of 'similarity'. Effectiveness may especially vary based on typical document lengths, so if "rapidly growing with time" also means "growing arbitrarily long", that could greatly affect what works over time (requiring adaptations for longer docs).
Also note that 'Paragraph Vectors' – where a vector is co-trained like a word vector to represent a range-of-text – may outperform a simple average-of-word-vectors, as an input to similarity/classification tasks. (Many references to 'Doc2Vec' specifically mean 'Paragraph Vectors', though the term 'Doc2Vec' is sometimes also used for any other way of turning a document into a single vector, like a simple average of word-vectors.)
You may also want to look at "Word Mover's Distance" (WMD), a measure of similarity between two texts that uses word-vectors, though not via any simple average. (However, it can be expensive to calculate, especially for longer documents.) For classification, there's a recent refinement called "Supervised Word Mover's Distance" which reweights/transforms word vectors to make them more sensitive to known categories. With enough evaluation/tuning data about which of your documents should be closer than others, an analogous technique could probably be applied to generic similarity tasks.
You might also consider trying Jaccard similarity, which uses basic set algebra to determine the verbal overlap between two documents (although it is somewhat similar to a BOW approach). A nice intro on it can be found here.
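For reference, a minimal token-set version of Jaccard similarity looks like this (the tokenization is deliberately naive):

```python
# Jaccard similarity: vocabulary overlap divided by combined vocabulary size.
def jaccard(doc_a, doc_b):
    a, b = set(doc_a.lower().split()), set(doc_b.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0

print(jaccard("the cat sat on the mat", "the cat lay on the rug"))  # ~0.43
```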

Clustering of news articles

My scenario is pretty straightforward: I have a bunch of news articles (~1k at the moment) for which I know that some cover the same story/topic. I would now like to group these articles based on their shared story/topic, i.e., based on their similarity.
What I did so far is to apply basic NLP techniques including stopword removal and stemming. I also calculated the tf-idf vector for each article, and with this I can also calculate, e.g., the cosine similarity based on these tf-idf vectors. But now I struggle a bit with the grouping of the articles. I see two principal ways -- probably related -- to do it:
1) Machine learning / clustering: I already played a bit with existing clustering libraries, with more or less success; see here. On the one hand, algorithms such as k-means require the number of clusters as input, which I don't know. Other algorithms require parameters that are also not intuitive (for me, that is) to specify.
2) Graph algorithms: I can represent my data as a graph with the articles being the nodes and weighted edges representing the pairwise (cosine) similarity between the articles. With that, for example, I can first remove all edges that fall below a certain threshold and then apply graph algorithms to look for densely connected subgraphs (a rough sketch follows).
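A hedged sketch of that graph route with networkx; `sim` is assumed to be the pairwise cosine-similarity matrix over the tf-idf vectors, and the 0.3 cutoff is an arbitrary choice:

```python
# Keep only edges above a similarity threshold, then treat connected
# components of the resulting graph as story groups.
import networkx as nx
import numpy as np

threshold = 0.3
g = nx.Graph()
g.add_nodes_from(range(sim.shape[0]))
rows, cols = np.where(np.triu(sim, k=1) > threshold)   # upper triangle only
g.add_weighted_edges_from((i, j, sim[i, j]) for i, j in zip(rows, cols))

groups = [sorted(component) for component in nx.connected_components(g)]
```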
In short, I'm not sure where best to go from here -- I'm still pretty new to this area. I wonder if there are best practices for this, or some kind of guidelines on which methods/algorithms can (not) be applied in certain scenarios.
(EDIT: forgot to link to related question of mine)
Try the class of hierarchical agglomerative clustering (HAC) algorithms with single and complete linkage.
These algorithms do not need the number of clusters as input.
The basic principle is similar to growing a minimal spanning tree across a given set of data points and then stopping based on a threshold criterion. A closely related class is the divisive clustering algorithms, which first build up the minimal spanning tree and then prune off branches of the tree based on inter-cluster similarity ratios.
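A minimal sketch of complete- or single-linkage HAC on precomputed cosine distances, cut by a distance threshold instead of a fixed number of clusters; `tfidf` is the article-by-term matrix and the 0.7 cutoff is arbitrary:

```python
# HAC on a precomputed cosine-distance matrix, cut by distance threshold.
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.metrics.pairwise import cosine_distances

dist = cosine_distances(tfidf)                         # square distance matrix
condensed = squareform(dist, checks=False)             # condensed form for scipy
tree = linkage(condensed, method="complete")           # or method="single"
labels = fcluster(tree, t=0.7, criterion="distance")   # one cluster id per article
```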
You can also try a canopy variation on k-means to create a relatively quick estimate for the number of clusters (k).
http://en.wikipedia.org/wiki/Canopy_clustering_algorithm
Will you be recomputing over time or do you only care about a static set of news? I ask because your k may change a bit over time.
Since you can model your dataset as a graph, you could apply stochastic clustering based on Markov models. Here are links to resources on the MCL algorithm:
Official thesis description and code base
Gephi plugin for MCL (to experiment and evaluate the method)

How to calculate TF*IDF for a single new document to be classified?

I am using document-term vectors to represent a collection of documents. I use TF*IDF to calculate the term weights for each document vector. Then I can use this matrix to train a model for document classification.
I am looking to classify new documents in the future. But in order to classify a new document, I need to turn it into a document-term vector first, and that vector should be composed of TF*IDF values, too.
My question is, how can I calculate the TF*IDF with just a single document?
As far as I understand, TF can be calculated based on a single document itself, but IDF can only be calculated with a collection of documents. In my current experiment, I actually calculate the TF*IDF values over the whole collection of documents, and then use some documents as the training set and the others as the test set.
I just suddenly realized that this seems not so applicable to real life.
ADD 1
So there are actually two subtly different scenarios for classification:
1. classifying documents whose contents are known but whose labels are not;
2. classifying totally unseen documents.
For scenario 1, we can combine all the documents, both with and without labels, and compute TF*IDF over all of them. This way, even if we only use the labelled documents for training, the training result will still reflect the influence of the unlabelled documents.
But my scenario is 2.
Suppose I have the following information for term T from the summary of the training set corpus:
document count for T in the training set is n
total number of training documents is N
Should I calculate the IDF of T for an unseen document D as below?
IDF(T, D) = log((N + 1) / (n + 1))
ADD 2
And what if I encounter a term in the new document that didn't show up in the training corpus before?
How should I calculate the weight for it in the doc-term vector?
TF-IDF doesn't make sense for a single document, independent of a corpus. It's fundamentally about emphasizing relatively rare and informative words.
You need to keep corpus summary information in order to compute TF-IDF weights. In particular, you need the document count for each term and the total number of documents.
Whether you want to use summary information from the whole training set and test set for TF-IDF, or for just the training set is a matter of your problem formulation. If it's the case that you only care to apply your classification system to documents whose contents you have, but whose labels you do not have (this is actually pretty common), then using TF-IDF for the entire corpus is okay. If you want to apply your classification system to entirely unseen documents after you train, then you only want to use the TF-IDF summary information from the training set.
TF obviously only depends on the new document.
IDF, you compute only on your training corpus.
You can add a slack term to the IDF computation, or adjust it as you suggested. But for a reasonably sized training set, the constant +1 term will not have much effect. AFAICT, in classic document retrieval (think: search), you don't bother to do this: often the query document will not become part of your corpus, so why would it be part of the IDF?
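For reference, scikit-learn's TfidfVectorizer behaves exactly this way: fit() learns the vocabulary and IDF from the training corpus, transform() reuses those frozen statistics for new documents, and its default smoothing adds one to the document counts as if an extra document containing every term had been seen. `train_docs` and `new_doc` are placeholders:

```python
# IDF statistics come from the training corpus only; new documents are
# transformed with the stored statistics.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()                   # smooth_idf=True by default
X_train = vectorizer.fit_transform(train_docs)   # learns vocabulary + IDF
X_new = vectorizer.transform([new_doc])          # reuses the stored IDF;
                                                 # unseen terms are simply dropped
```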
For unseen words, the TF calculation is not a problem, as TF is a document-specific metric. When computing IDF, you can use the smoothed inverse document frequency technique.
IDF = 1 + log(total documents / document frequency of a term)
Here the lower bound for the IDF is 1, so if a word is not seen in the training corpus, its IDF is 1. Since there is no universally agreed-upon formula for computing tf-idf (or even idf), your formula for the tf-idf calculation is also reasonable.
Note that, in many cases, unseen terms are ignored if they don't have much impact in the classification task. Sometimes, people replace unseen tokens with a special symbol like UNKNOWN_TOKEN and do their computation.
An alternative to TF-IDF: another way of computing the weight of each term of a document is Maximum Likelihood Estimation (MLE). When computing the MLE, you can smooth it with additive smoothing, also known as Laplace smoothing. MLE is used when you use generative models such as the Naive Bayes algorithm for document classification.
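A minimal sketch of additive (Laplace) smoothing for per-document term estimates, so that terms unseen in a document still get a small non-zero weight; `vocab` and `alpha` are up to you:

```python
# P(t | d) = (count(t, d) + alpha) / (len(d) + alpha * |V|)
from collections import Counter

def smoothed_term_probs(tokens, vocab, alpha=1.0):
    counts = Counter(tokens)
    denom = len(tokens) + alpha * len(vocab)
    return {t: (counts.get(t, 0) + alpha) / denom for t in vocab}
```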

Centroid algorithm for document classification, threshold detection

I have a collection of documents related to a particular domain and have trained a centroid classifier on that collection. What I want to do is feed the classifier documents from different domains and determine how relevant they are to the trained domain. I can use cosine similarity to get a numerical value, but my question is: what is the best way to determine the threshold value?
For this, I could download several documents from different domains and inspect their similarity scores to determine the threshold value. But is this the way to go - is it statistically sound? What are the other approaches?
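A minimal sketch of that score-inspection idea, assuming dense document vectors for in-domain (`X_in`) and out-of-domain (`X_out`) validation documents plus the trained `centroid`; the threshold grid and the F1 criterion are just one possible choice:

```python
# Compute similarity to the centroid for both groups, then pick the cutoff
# that best separates in-domain from out-of-domain documents.
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import cosine_similarity

scores = cosine_similarity(np.vstack([X_in, X_out]), centroid.reshape(1, -1)).ravel()
labels = np.r_[np.ones(len(X_in), dtype=int), np.zeros(len(X_out), dtype=int)]

candidates = np.quantile(scores, np.linspace(0.05, 0.95, 19))
threshold = max(candidates, key=lambda t: f1_score(labels, (scores >= t).astype(int)))
```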
Actually, there is another issue with centroids of sparse vectors: they are usually significantly less sparse than the original data. For example, this increases computation costs, and it can yield vectors that are themselves atypical because they have a different sparsity pattern. The effect is similar to using the arithmetic mean of discrete data: say the mean number of doors in a car is 3.4; obviously no car exists that actually has 3.4 doors. So in particular, there will be no car with a Euclidean distance of less than 0.4 to the centroid - how "central" is the centroid then, really?
Sometimes it helps to use medoids instead of centroids, because they actually are proper objects of your data set.
Make sure you control such effects on your data!
A simple method to try would be to employ various machine-learning algorithms - and in particular, tree-based ones - on the distances from your centroids.
As mentioned in another answer (by Anony-Mousse), this won't necessarily give you good or usable answers, but it just might. Using an ML framework for this procedure, e.g. WEKA, will also help you estimate your accuracy in a more rigorous manner.
Here are the steps to take, using WEKA:
Generate a training set by finding a decent number of documents representing each of your classes (to get valid estimates, I'd recommend at least a few dozen per class)
Calculate the distance from each document to each of your centroids.
Generate a feature vector for each such document, composed of the distances from this document to the centroids. You can either use a single feature - the distance to the nearest centroid; or use all distances, if you'd like to try a more elaborate thresholding scheme. For example, if you chose the simpler method of using a single feature, the vector representing a document with a distance of 0.2 to the nearest centroid, belonging to class A would be: "0.2,A"
Save this set in ARFF or CSV format, load into WEKA, and try classifying, e.g. using a J48 tree.
The results would provide you with an overall accuracy estimation, with a detailed confusion matrix, and - of course - with a specific model, e.g. a tree, you can use for classifying additional documents.
These results can be used to iteratively improve the models and thresholds by collecting additional train documents for problematic classes, either by recreating the centroids or by retraining the thresholds classifier.
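If WEKA is not a hard requirement, a rough scikit-learn analogue of the same workflow might look like the sketch below, with a decision tree standing in for J48; `distances` is the document-by-centroid distance matrix from step 2 and `classes` the known labels:

```python
# Train a small decision tree on the centroid distances and estimate accuracy
# with cross-validation.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print(cross_val_score(tree, distances, classes, cv=5).mean())  # accuracy estimate
tree.fit(distances, classes)      # final model for thresholding new documents
```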
