Graph-based weighting for sentence extraction in automatic summarization? - machine-learning

I was reading the research paper Automatic Text Document Summarization Based on Machine Learning, and in its Table 1, under graph-based weighting, the authors use a feature F1 called Aggregate Similarity.
I have tried searching the web; I found mentions of things like "flexible aggregate similarity", but I'm not sure how that relates to the task of automatic summarization and weighting sentences.
What exactly is meant by aggregate similarity, and how is it calculated?

Aggregate similarity is the sum of the similarities attached to a node. The similarity between two nodes (two sentences) is simply their vocabulary overlap divided by the length of the longer of the two sentences (for normalization).
Aggregate similarity measures the importance of a sentence: instead of counting the number of links connecting a node (sentence) to other nodes, as the bushy path does, aggregate similarity sums the weights (similarities) on those links.
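As an illustration, here is a minimal sketch in Python of aggregate similarity as described above, assuming whitespace tokenization and normalization by the longer sentence's vocabulary size (the paper's exact tokenization may differ):

    def similarity(s1, s2):
        # Vocabulary overlap between two sentences, normalized by the longer one.
        w1, w2 = set(s1.lower().split()), set(s2.lower().split())
        return len(w1 & w2) / max(len(w1), len(w2))

    def aggregate_similarity(sentences):
        # For each sentence (node), sum the similarities on all its links.
        return [
            sum(similarity(s, other) for j, other in enumerate(sentences) if j != i)
            for i, s in enumerate(sentences)
        ]

    sentences = [
        "the cat sat on the mat",
        "a cat was sitting on the mat",
        "stock prices fell sharply today",
    ]
    print(aggregate_similarity(sentences))  # higher score = more central sentence

Sentences with the highest aggregate similarity are then the candidates for extraction, analogous to how the bushy path would pick the sentences with the most links.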

Related

How to find documents similar to a predefined set of documents

From a big population of documents I would like to find those similar to a predefined set of documents.
All documents inside the set are similar to each other, but very few documents from the population are similar to those in the set: quite an unbalanced situation.
As a first step I will calculate the cosine similarity between all docs in the population and all docs from the set. Then for each doc I can extract features like maximum cosine similarity, average of the top 10 cosine similarities, number of docs from the set with similarity greater than ...
But what approach to use then? What model?
It doesn't seem like a classical classification problem, as I don't have labels. Maybe I can mark all documents from the set as class A and the rest as class B.
I can also try to rank all candidates, but there are multiple features to rank by.
Clustering algorithms? But I don't have absolute coordinates in a space; I only have similarities - relative distances between each pair of documents. Is there a clustering algorithm that can handle this?
I have an idea of how to validate the model: I can take part of the documents from the set, mix them with the population, and check how many of them are found by the model's predictions.
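For the feature-extraction step described above, here is a minimal sketch assuming scikit-learn, a TF-IDF representation, and an arbitrary 0.5 threshold (all data and numbers are illustrative):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    population = ["a document about cats", "a document about finance", "dogs and cats as pets"]
    reference_set = ["cats and dogs", "pets like cats and dogs"]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(population + reference_set)
    pop_vecs, set_vecs = X[:len(population)], X[len(population):]

    sims = cosine_similarity(pop_vecs, set_vecs)              # shape: (population, set)
    top_k = min(10, sims.shape[1])
    features = np.column_stack([
        sims.max(axis=1),                                     # maximum cosine similarity
        np.sort(sims, axis=1)[:, -top_k:].mean(axis=1),       # average of the top-k similarities
        (sims > 0.5).sum(axis=1),                             # docs from the set above a threshold
    ])
    print(features)

These per-document features could then feed a ranker or a class A vs. class B classifier, as suggested in the question.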

How to find probability of predicted weight of a link in weighted graph

I have an undirected weighted graph. Say node A and node B don't have a direct link between them, but there are paths connecting both nodes through other intermediate nodes. Now I want to predict the possible weight of a direct link between node A and node B, as well as the probability of that link.
I can predict the weight by finding the possible paths and averaging their weights, but how can I find the probability?
The problem you are describing is called link prediction. Here is a short tutorial explaining the problem and some simple heuristics that can be used to solve it.
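For example, here is a minimal sketch of two of the classic heuristics (Jaccard coefficient and Adamic-Adar) on a toy graph, using networkx; note that these particular functions ignore the edge weights and return unnormalized scores rather than probabilities:

    import networkx as nx

    G = nx.Graph()
    G.add_weighted_edges_from([("A", "C", 0.8), ("C", "B", 0.6), ("A", "D", 0.3), ("D", "B", 0.9)])

    candidate = [("A", "B")]                       # the missing edge we want to score
    for u, v, score in nx.jaccard_coefficient(G, candidate):
        print("Jaccard coefficient:", score)
    for u, v, score in nx.adamic_adar_index(G, candidate):
        print("Adamic-Adar index:", score)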
Since this is an open-ended problem, these simple solutions can be improved a lot by using more sophisticated techniques. Another approach to predicting the probability of an edge is to use machine learning rather than rule-based heuristics.
A recent article, node2vec, proposed an algorithm that maps each node in a graph to a dense vector (an embedding). Then, by applying a binary operator to a pair of node vectors, we get an edge representation (another vector). This vector is used as the input features of a classifier that predicts the edge probability. The paper compared a few such binary operators over several datasets and significantly outperformed the heuristic benchmark scores on all of them.
The code to compute embeddings given your graph can be found here.
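To make the second approach concrete, here is a minimal sketch of the edge-classification step, assuming you already have node embeddings (e.g. from node2vec); the embeddings below are random placeholders and the edge lists are illustrative:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    nodes = ["A", "B", "C", "D", "E"]
    embeddings = {n: rng.normal(size=16) for n in nodes}   # placeholder node vectors

    def edge_features(u, v):
        # Hadamard operator: element-wise product of the two node embeddings.
        return embeddings[u] * embeddings[v]

    positives = [("A", "C"), ("C", "B"), ("A", "D"), ("D", "B")]   # existing edges
    negatives = [("A", "E"), ("C", "E")]                           # sampled non-edges
    X = np.array([edge_features(u, v) for u, v in positives + negatives])
    y = np.array([1] * len(positives) + [0] * len(negatives))

    clf = LogisticRegression().fit(X, y)
    print(clf.predict_proba([edge_features("A", "B")])[0, 1])      # estimated edge probability

The predicted probability for (A, B) can then be read as the edge probability; the weight itself could be estimated separately, for example with a regressor trained on the same edge features.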

How to calculate TF*IDF for a single new document to be classified?

I am using document-term vectors to represent a collection of documents. I use TF*IDF to calculate the term weight for each document vector, and then I use this matrix to train a model for document classification.
I would like to classify new documents in the future. But in order to classify a new document, I need to turn it into a document-term vector first, and the vector should be composed of TF*IDF values too.
My question is: how can I calculate the TF*IDF with just a single document?
As far as I understand, TF can be calculated based on a single document itself, but IDF can only be calculated with a collection of documents. In my current experiment, I actually calculate the TF*IDF values over the whole collection of documents, and then I use some documents as the training set and the others as the test set.
I just realized that this does not seem applicable to real life.
ADD 1
So there are actually two subtly different scenarios for classification:
1. classifying documents whose content is known but whose labels are not;
2. classifying totally unseen documents.
For scenario 1, we can combine all the documents, both with and without labels, and compute the TF*IDF over all of them. This way, even though we only use the labelled documents for training, the training result will still carry the influence of the unlabelled documents.
But my scenario is 2.
Suppose I have the following information for term T from the summary of the training set corpus:
document count for T in the training set is n
total number of training documents is N
Should I calculate the IDF of T for an unseen document D as below?
IDF(T, D) = log((N + 1) / (n + 1))
ADD 2
And what if I encounter a term in the new document which didn't show up in the training corpus before?
How should I calculate the weight for it in the doc-term vector?
TF-IDF doesn't make sense for a single document, independent of a corpus. It's fundamentally about emphasizing relatively rare and informative words.
You need to keep corpus summary information in order to compute TF-IDF weights. In particular, you need the document count for each term and the total number of documents.
Whether you want to use summary information from the whole training set and test set for TF-IDF, or for just the training set is a matter of your problem formulation. If it's the case that you only care to apply your classification system to documents whose contents you have, but whose labels you do not have (this is actually pretty common), then using TF-IDF for the entire corpus is okay. If you want to apply your classification system to entirely unseen documents after you train, then you only want to use the TF-IDF summary information from the training set.
TF obviously only depends on the new document.
IDF, you compute only on your training corpus.
You can add a slack term to the IDF computation, or adjust it as you suggested, but for a reasonably sized training set the constant +1 term will not have much of an effect. As far as I can tell, in classic document retrieval (think: search) you don't bother to do this. Often the query document will not become part of your corpus, so why would it be part of the IDF?
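As a concrete sketch of that workflow with scikit-learn (illustrative data): fit the vocabulary and IDF statistics on the training corpus only, then reuse them to transform unseen documents.

    from sklearn.feature_extraction.text import TfidfVectorizer

    train_docs = [
        "the cat sat on the mat",
        "dogs and cats are pets",
        "stock prices fell sharply today",
    ]
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_docs)   # vocabulary and IDF learned here only

    new_doc = ["a new document about cats and stock prices"]
    X_new = vectorizer.transform(new_doc)            # TF from the new doc, IDF from training
    print(X_new.toarray())

Note that transform silently drops terms that were not seen during fitting, which is one common answer to the question in ADD 2.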
For unseen words, the TF calculation is not a problem, as TF is a document-specific metric. For IDF, you can use a smoothed inverse document frequency:
IDF = 1 + log(total documents / document frequency of a term)
Here the lower bound for IDF is 1, so a word that was not seen in the training corpus can simply be assigned the minimum IDF of 1. Since there is no universally agreed-upon formula for computing TF-IDF, or even IDF, your formula for the TF-IDF calculation is also reasonable.
Note that in many cases unseen terms are simply ignored if they don't have much impact on the classification task. Sometimes people replace unseen tokens with a special symbol like UNKNOWN_TOKEN and do their computation with that.
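A minimal sketch of this smoothed IDF, with unseen terms falling back to the lower bound of 1 (toy corpus, plain Python):

    import math
    from collections import Counter

    train_docs = [["the", "cat", "sat"], ["dogs", "and", "cats"], ["stock", "prices", "fell"]]
    N = len(train_docs)
    df = Counter(term for doc in train_docs for term in set(doc))   # document frequencies

    def smoothed_idf(term):
        if term not in df:
            return 1.0                                # unseen term: fall back to the minimum
        return 1.0 + math.log(N / df[term])

    print(smoothed_idf("cat"), smoothed_idf("never_seen_before"))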
An alternative to TF-IDF: another way of computing the weight of each term of a document is Maximum Likelihood Estimation (MLE). When computing the MLE you can smooth it with additive smoothing, also known as Laplace smoothing. MLE is used when you classify documents with a generative model such as the Naive Bayes algorithm.
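For completeness, a minimal sketch of additive (Laplace) smoothing of the per-class term probabilities, as it would be used inside a Naive Bayes model (toy counts, illustrative vocabulary size):

    from collections import Counter

    class_docs = [["cheap", "pills", "buy"], ["buy", "now", "cheap"]]   # toy documents of one class
    counts = Counter(term for doc in class_docs for term in doc)
    total = sum(counts.values())
    vocab_size = 1000          # assumed size of the whole corpus vocabulary

    def p_term_given_class(term, alpha=1.0):
        # Laplace (additive) smoothing: unseen terms get a small but non-zero probability.
        return (counts[term] + alpha) / (total + alpha * vocab_size)

    print(p_term_given_class("cheap"), p_term_given_class("never_seen_before"))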

Calculating a score from multiple classifiers

I'm trying to determine the similarity between pairs of items taken from a large collection. The items have several attributes, and I'm able to calculate a discrete similarity score for each attribute, between 0 and 1. I use various classifiers depending on the attribute: TF-IDF cosine similarity, a Naive Bayes classifier, etc.
I'm stuck when it comes to compiling all that information into one final similarity score. I can't just take an unweighted average because 1) what counts as a high score depends on the classifier and 2) some classifiers are more important than others. In addition, some classifiers should be considered only for their high scores, i.e. a high score points to a higher similarity but lower scores carry no meaning.
So far I've calculated the final score with guesswork, but the increasing number of classifiers makes this a very poor solution. What techniques are there to determine an optimal formula that will take my various scores and return just one? It's important to note that the system does receive human feedback, which is how some of the classifiers work to begin with.
Ultimately I'm only interested in ranking, for each item, the ones that are most similar. The absolute scores themselves are meaningless, only their ordering is important.
There is a great book on the topic of ensemble classifiers, available online: Combining Pattern Classifiers.
There are two chapters in the book (ch. 4 and ch. 5) on the fusion of label outputs and how to get a single decision value.
A set of methods is defined there, including:
1- Weighted Majority Vote
2- Naive Bayes Combination
3- ...
I hope that this is what you were looking for.
Get a book on ensemble classification. There has been a lot of work on how to learn a good combination of classifiers, and there are numerous choices: you can learn weights and take a weighted average, you can use error-correcting codes, and so on.
Anyway, read up on "ensemble classification"; that is the keyword you need.
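As a minimal sketch of the "learn weights" option, assuming the human feedback can be turned into similar/not-similar labels for some pairs (the scores and labels below are made up): fit a logistic regression on the per-classifier scores and rank candidate pairs by its predicted probability.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Each row holds the scores of the individual classifiers for one item pair.
    scores = np.array([
        [0.9, 0.2, 0.8],
        [0.1, 0.3, 0.2],
        [0.7, 0.9, 0.6],
        [0.2, 0.1, 0.1],
    ])
    labels = np.array([1, 0, 1, 0])      # human feedback: pair judged similar or not

    model = LogisticRegression().fit(scores, labels)
    print(model.coef_)                                    # learned weight for each classifier
    ranking = model.predict_proba(scores)[:, 1]           # rank candidate pairs by this value
    print(ranking)

Since only the ordering matters, the calibrated probabilities themselves can be ignored and used purely as a ranking score.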

Bad clustering results with mahout on Reuters 21578 dataset

I've used a part of the Reuters-21578 dataset and Mahout k-means for clustering. To be more specific, I extracted only the texts that have a unique value for the category 'topics', which left me with 9494 texts belonging to one of 66 categories. I used seqdirectory to create sequence files from the texts and then seq2sparse to create the vectors. Then I ran k-means with the cosine distance measure (I've tried Tanimoto and Euclidean too, with no better luck), cd=0.1 and k=66 (the same as the number of categories).
I tried to evaluate the results with the silhouette measure, using custom Java code as well as the MATLAB implementation of silhouette (just to be sure there is no error in my code), and I get an average silhouette of 0.0405 for the clustering. Knowing that the best clustering could give an average silhouette value close to 1, I see that the clustering result I get is not good at all.
So is this due to Mahout, or is the quality of the categorization in the Reuters dataset low?
PS: I'm using Mahout 0.7.
PS2: Sorry for my bad English.
I've never actually worked with Mahout, so I cannot say what it does by default, but you might consider checking what sort of distance metric it uses by default. For example, if the metric is Euclidean distance on unnormalized document word counts, you can expect very poor cluster quality, as document length will dominate any meaningful comparison between documents. On the other hand, something like cosine distance on normalized, or TF-IDF weighted, word counts can do much better.
One other thing to look at is the distribution of topics in Reuters-21578. It is very skewed towards a few topics such as "acq" or "earn", while others are used only a handful of times. This can make it difficult to achieve good external clustering metrics.
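A minimal sketch of the first suggestion, using scikit-learn instead of Mahout (toy documents): TF-IDF weighted, L2-normalized vectors, so that k-means on them approximates a cosine-based clustering, evaluated with a cosine silhouette score.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import silhouette_score

    docs = [
        "oil prices rose on supply fears",
        "crude oil output cut by producers",
        "the company reported higher quarterly earnings",
        "profits and earnings beat analyst estimates",
    ]
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)   # L2-normalized by default

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels, silhouette_score(X, labels, metric="cosine"))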
