Calculating IDF (Inverse Document Frequency) for document categorization - machine-learning

I have a doubt about calculating IDF (Inverse Document Frequency) for document categorization. I have more than one category, each with multiple documents for training. I am calculating the IDF for each term in a document using the following formula:
IDF(t, D) = log(Total number of documents / Number of documents matching the term)
My questions are:
What does "Total Number documents in Corpus" mean? Whether the document count from a current category or from all available categories?
What does "Number of Document matching term" mean? Whether the term matching document count from a current category or from all available categories?

Total number of documents in the corpus is simply the number of documents you have in your corpus. So if you have 20 documents, then this value is 20.
Number of documents matching the term is the count of how many documents the term t occurs in. So if you have 20 documents in total and the term t occurs in 15 of them, then the value for number of documents matching the term is 15.
The value for this example would thus be IDF(t, D) = log(20/15) ≈ 0.1249.
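As a quick illustration, here is a minimal Python sketch of this computation (using a base-10 logarithm so the result matches the 0.1249 figure above):

    import math

    def idf(total_docs, docs_with_term):
        # IDF(t, D) = log10(total documents / documents containing the term)
        return math.log10(total_docs / docs_with_term)

    print(idf(20, 15))  # ~0.1249, matching the example above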
Now if I'm correct, you have multiple categories per document and you want to be able to categorize new documents with one or more of these categories. One method to do this would be to create one document for each category. Each category-document should hold all texts which are labelled with this category. You can then perform tf*idf on these documents.
A simple way of categorizing a new document is then to sum, for each category, the tf*idf values of the query's terms as calculated for that category. The category whose term values produce the highest sum is then ranked 1st.
Another possibility is to create a vector for the query using the idf of each term in the query. All terms which do not occur in the query are given a value of 0. The query vector can then be compared for similarity to each category vector using, for example, cosine similarity.
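As a minimal sketch of that comparison, assuming the per-category tf*idf weights are held in plain dictionaries (the query and category values below are made up for illustration):

    import math

    def cosine_similarity(query_vec, category_vec):
        # Both vectors are dicts mapping term -> tf*idf weight; missing terms count as 0.
        common = query_vec.keys() & category_vec.keys()
        dot = sum(query_vec[t] * category_vec[t] for t in common)
        norm_q = math.sqrt(sum(w * w for w in query_vec.values()))
        norm_c = math.sqrt(sum(w * w for w in category_vec.values()))
        return dot / (norm_q * norm_c) if norm_q and norm_c else 0.0

    # Hypothetical query vector and per-category tf*idf vectors
    query = {"banana": 0.12, "fruit": 0.30}
    categories = {
        "food":   {"banana": 0.25, "fruit": 0.40, "recipe": 0.10},
        "sports": {"match": 0.50, "goal": 0.35},
    }
    best = max(categories, key=lambda c: cosine_similarity(query, categories[c]))
    print(best)  # "food"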
Smoothing is also a useful technique to deal with words in a query which don't occur in your corpus.
I'd suggest reading sections 6.2 and 6.3 of "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.

I have written a small post describing term frequency-inverse document frequency here: http://bigdata.devcodenote.com/2015/04/tf-idf-term-frequency-inverse-document.html
Here is a snippet from the post:
TF-IDF is one of the most fundamental metrics used in the classification of documents.
Let us try and define these terms:
Term frequency reflects how frequently a certain word occurs in a document compared to the other words in that document.
Inverse document frequency, on the other hand, reflects how widely the word occurs across all the documents in a given collection (the documents we want to classify into different categories).

Related

Universal sentence encoder for big document similarity

I need to create a 'search engine' experience: from a short query (a few words), I need to find the relevant documents in a corpus of thousands of documents.
After analyzing a few approaches, I got very good results with the Universal Sentence Encoder from Google.
The problem is that my documents can be very long. For these very long texts it looks like performance decreases, so my idea was to cut the text into sentences/paragraphs.
So I ended up with a list of vectors for each document (each vector representing one part of the document).
My question is: is there a state-of-the-art algorithm/methodology to compute a score from a list of vectors? I don't really want to merge them into one, as that would create the same effect as before (the relevant part would be diluted in the document). Are there any scoring algorithms to sum up the multiple cosine similarities between the query and the different parts of the text?
Important information: I can have both short and long texts, so a document can have anywhere from 1 to 10 vectors.
One way of doing this is to embed all sentences of all documents (typically storing them in an index such as FAISS or Elastic). Store the document identifier of each sentence; in Elastic this can be metadata, but in FAISS it needs to be held in an external mapping.
Then:
Embed the query.
Calculate the cosine similarity between the query and all sentence embeddings.
For the top-k results, group by document identifier and take the sum (this step is optional, depending on whether you're looking for the most similar document or the most similar sentence; here I assume you are looking for the most similar document, thereby boosting documents with higher overall similarity).
You should then have an ordered list of relevant document identifiers.
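A rough sketch of steps 2-3 in plain NumPy, assuming you already have unit-normalized embeddings for the query and for every sentence, plus a parallel list of document identifiers (all names here are placeholders, not a specific FAISS or Elastic API):

    from collections import defaultdict
    import numpy as np

    def rank_documents(query_emb, sentence_embs, doc_ids, top_k=50):
        # Cosine similarity reduces to a dot product when embeddings are unit-normalized.
        sims = sentence_embs @ query_emb
        top = np.argsort(-sims)[:top_k]
        scores = defaultdict(float)
        for i in top:
            scores[doc_ids[i]] += float(sims[i])  # sum per document to boost multi-hit documents
        return sorted(scores.items(), key=lambda kv: -kv[1])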

Find similar items based on item attributes

Most of the recommendation algorithms in Mahout require user-item preferences, but I want to find similar items for a given item. My system doesn't have user input, i.e. for any movie these can be the attributes used to compute a similarity coefficient:
Genre
Director
Actor
The attribute list may be modified in the future to build a more efficient system. But to find item similarity in the Mahout data model, a user preference for each item is required, whereas these movies could simply be clustered together so that we can get the closest items in a cluster for a given item.
Later on, after introducing user-based recommendation, the above result can be used to boost the results.
If a product attribute has some fixed values, like Genre, do I have to convert those values to numerical values? If yes, how will the system calculate the distance between two items when genre-1 and genre-2 don't have any numeric relation?
Edit:
I have found a few examples for the command line, but I want to do it in Java and save the pre-computed values for later use.
I think in the case of feature vectors, the best similarity measures are the ones based on exact matches, such as Jaccard similarity.
With Jaccard, the similarity between two item vectors is calculated as:
number of features in the intersection / number of features in the union.
So converting the genre to a numerical value will not make a difference, since the exact match (which is used to find the intersection) is the same for non-numerical values.
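As a minimal sketch, Jaccard similarity over attribute sets could look like this (the two movies are invented examples):

    def jaccard(a, b):
        # |intersection| / |union| of the two attribute sets
        return len(a & b) / len(a | b) if (a | b) else 0.0

    movie_1 = {"genre:comedy", "director:x", "actor:y"}
    movie_2 = {"genre:comedy", "director:z", "actor:y"}
    print(jaccard(movie_1, movie_2))  # 0.5: 2 shared attributes out of 4 distinct ones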
Take a look at this question for how to do it in mahout:
Does Mahout provide a way to determine similarity between content (for content-based recommendations)?
It sounds like Mahout's spark-rowsimilarity algorithm, available since version 0.10.0, would be the perfect solution to your problem. It compares the rows of a given matrix (i.e. row vectors representing movies and their properties), looking for co-occurrences of values across those rows - or in your case: co-occurrences of Genres, Directors, and Actors. No user history or item interaction is needed. The end result is another matrix mapping each of your movies to the top n most similar other movies in your collection, based on co-occurrence of genre, director, or actor.
The Apache Mahout site has a great write-up regarding how to do this from the command line, but if you want a deeper understanding of what's going on under the covers, read Pat Ferrel's machine learning blog Occam's Machete. He calls this type of similarity content or metadata similarity.

Learning to tag sentences with keywords based on examples

I have a set (~50k elements) of small text fragments (usually one or two sentences), each one tagged with a set of keywords chosen from a list of ~5k words.
How would I go about implementing a system that, learning from these examples, can then tag new sentences with the same set of keywords? I don't need code; I'm just looking for some pointers and methods/papers/possible ideas on how to implement this.
If I understood you well, what you need is a measure of similarity for a pair of documents. I have recently been using TF-IDF for clustering documents and it worked quite well. I think you can use TF-IDF values here and calculate the cosine similarity between the corresponding TF-IDF vectors of the documents.
TF-IDF computation
TF-IDF stands for Term Frequency - Inverse Document Frequency. Here is a definition of how it can be calculated:
Compute TF-IDF values for all words in all documents.
- The TF-IDF score of a word W in document D is
TF-IDF(W, D) = TF(W, D) * IDF(W)
where TF(W, D) is the frequency of word W in document D, and
IDF(W) = log(N / (2 + #W))
with N the number of documents and #W the number of documents that contain word W.
- Words contained in the title count twice (i.e. they are considered more important).
- Normalize the TF-IDF values: the sum of all TF-IDF(W, D)^2 in a document should be 1.
Depending on the technology you use, this can be achieved in different ways. I implemented it in Python using a nested dictionary: first I use the document name D as a key, and then for each document D I have a nested dictionary with the word W as a key, where each word W maps to a numeric value, the calculated TF-IDF.
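As a rough sketch of that nested-dictionary approach, using the IDF variant defined above (the two toy documents are made up for illustration):

    import math
    from collections import Counter

    docs = {
        "d1": "the cat sat on the mat".split(),
        "d2": "the dog sat on the log".split(),
    }

    N = len(docs)
    df = Counter()                      # number of documents containing each word
    for words in docs.values():
        df.update(set(words))

    tfidf = {}                          # tfidf[document][word] -> normalized TF-IDF weight
    for name, words in docs.items():
        counts = Counter(words)
        weights = {w: tf * math.log(N / (2 + df[w])) for w, tf in counts.items()}
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        tfidf[name] = {w: v / norm for w, v in weights.items()}  # sum of squares is now 1
    # Note: with only 2 toy documents the +2 smoothing makes IDF negative;
    # for a realistic corpus (N much larger than 2) rare words get positive weights.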
Similarity computation
Let's say you have already calculated the TF-IDF values and you want to compare two documents, W1 and W2, to see how similar they are. For that we need to use some similarity metric. There are many choices, each with pros and cons. In this case, IMO, Jaccard similarity and cosine similarity would work well. Both functions would take the TF-IDF values and the names of the two documents W1 and W2 as arguments and return a numeric value indicating how similar the two documents are.
After computing the similarity between two documents you will obtain a numeric value. The greater the value, the more similar the two documents W1 and W2 are. Now, depending on what you want to achieve, there are two scenarios:
If you want to assign a new document only the tags of the most similar document, you compare it with all other documents and assign to the new document the tags of the most similar one.
Alternatively, you can set a threshold and assign all the tags of documents whose similarity with the document in question is greater than the threshold value. If you set threshold = 0.7, then a document W will receive the tags of all already-tagged documents V for which similarity(W, V) > 0.7.
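As a small sketch of that thresholding step, assuming the TF-IDF vectors are plain dictionaries as described above (tagged_vecs and tags are placeholder names):

    import math

    def cosine(u, v):
        # u, v: dicts mapping word -> TF-IDF weight
        dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
        norm_u = math.sqrt(sum(x * x for x in u.values()))
        norm_v = math.sqrt(sum(x * x for x in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    def assign_tags(new_vec, tagged_vecs, tags, threshold=0.7):
        # tagged_vecs: {doc_name: tfidf_vector}; tags: {doc_name: set of tags}
        assigned = set()
        for name, vec in tagged_vecs.items():
            if cosine(new_vec, vec) > threshold:
                assigned |= tags[name]
        return assigned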
I hope it helps.
Best of luck :)
Given your description, you are looking for some form of supervised learning. There are many methods in that class, e.g. Naive Bayes classifiers, Support Vector Machines (SVM), k-Nearest Neighbours (kNN), and many others.
For a numerical representation of your text, you can choose a bag-of-words or a frequency list (essentially, each text is represented by a vector in a high-dimensional vector space spanned by all the words).
BTW, it is far easier to tag the texts with one keyword (a classification task) than to assign up to five of them (the number of possible label combinations explodes combinatorially).

Document classification using naive Bayes

I have a question regarding the particular Naive Bayes algorithm used in document classification. Here is what I understand:
construct some probability of each word in the training set for each known classification
given a document we strip all the words that it contains
multiply together the probabilities of the words being present in a classification
perform (3) for each classification
compare the results of (4) and choose the classification with the highest posterior
What I am confused about is the part where we calculate the probability of each word given the training set. For example, the word "banana" appears in 100 documents in classification A, there are 200 documents in A in total, and 1000 words in total appear in A. To get the probability of "banana" appearing under classification A, do I use 100/200 = 0.5 or 100/1000 = 0.1?
I believe your model will classify more accurately if you count the number of documents the word appears in, not the number of times the word appears in total. In other words
Classify "Mentions Fruit":
"I like Bananas."
should be weighted no more or less than
"Bananas! Bananas! Bananas! I like them."
So the answer to your question would be 100/200 = 0.5.
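As a tiny sketch of that document-level (Bernoulli-style) estimate; the add-one smoothing is my addition so that unseen words don't zero out the product, not part of the question's numbers:

    def p_word_given_class(docs_with_word, docs_in_class, smooth=1):
        # Fraction of class documents containing the word (presence/absence model),
        # with add-one smoothing over the two outcomes (present / absent).
        return (docs_with_word + smooth) / (docs_in_class + 2 * smooth)

    print(p_word_given_class(100, 200))  # 0.5, matching the banana example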
The description of document classification on Wikipedia also supports my conclusion:
Then the probability that a given document D contains all of the words W, given a class C, is p(D|C) = Π_i p(W_i|C).
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
In other words, the document classification algorithm Wikipedia describes tests how many of the list of classifying words a given document contains.
By the way, more advanced classification algorithms will examine sequences of N words (n-grams), not just each word individually, where N can be set based on the amount of CPU resources you are willing to dedicate to the calculation.
UPDATE
My direct experience is based on short documents. I would like to highlight research that @BenAllison points out in the comments, which suggests my answer is invalid for longer documents. Specifically:
One weakness is that by considering only the presence or absence of terms, the BIM ignores information inherent in the frequency of terms. For instance, all things being equal, we would expect that if 1 occurrence of a word is a good clue that a document belongs in a class, then 5 occurrences should be even more predictive.
A related problem concerns document length. As a document gets longer, the number of distinct words used, and thus the number of values of x(j) that equal 1 in the BIM, will in general increase.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1529

How do I combine two features of different dimension?

Let us consider the problem of text classification. So if the document is represented as a bag of words, then we will have an n-dimensional feature, where n is the number of words in the document. Now if I decide that I also want to use the document length as a feature, then the dimension of this feature alone (length) will be one. So how do I combine the two features (length and bag of words)? Should I now treat the features as two parts: an n-dimensional vector (BOW) and a 1-dimensional feature (length)? If this won't work, how do I combine the features? Any pointers on this would also be helpful.
This statement is a little ambiguous: "So if the document is represented as a bag of words, then we will have an n-dimensional feature, where n is the number of words in the document."
My interpretation is that you have a column for each word that occurs in your corpus (probably restricted to some dictionary of interest), and for each document you have counted the number of occurrences of that word. Your number of columns is now equal to the number of words in your dictionary that appear in ANY of the documents. You also have a "length" feature, which could be a count of the number of words in the document, and you want to know how to incorporate it into your analysis.
A simple approach would be to divide the number of occurrences of a word by the total number of words in the document.
This has the effect of scaling the word occurrences based on the size of the document, and the new feature is called a 'term frequency'. The next natural step is to weight the term frequencies to compensate for terms that are more common in the corpus (and therefore less important). Since we give HIGHER weights to terms that are LESS common, this is called 'inverse document frequency', and the whole process is called “Term Frequency times Inverse Document Frequency”, or tf-idf. You can Google this for more information.
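One concrete (if simplistic) way to do both at once, sketched with NumPy; the count matrix below is a placeholder, not derived from any real corpus:

    import numpy as np

    # counts: one row per document, one column per dictionary word (toy numbers)
    counts = np.array([[3.0, 0.0, 1.0],
                       [0.0, 2.0, 2.0]])

    lengths = counts.sum(axis=1, keepdims=True)   # total words per document
    term_freqs = counts / lengths                 # length-normalized bag of words
    features = np.hstack([term_freqs, lengths])   # n word columns + 1 length column
    print(features.shape)                         # (2, 4)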
It's possible that you are doing word counts in a different way -- for example, counting the number of word occurrences in each paragraph (as opposed to each document). In that case, for each document, you have a word count for each paragraph, and the typical approach is to merge these paragraph-counts using a process such as Singular Value Decomposition.
