How do I combine two features of different dimension? - machine-learning

Let us consider the problem of text classification. So if the document is represented as a bag of words, then we will have an n-dimensional feature, where n is the number of words in the document. Now if I decide that I also want to use the document length as a feature, then the dimension of this feature alone (length) will be one. So how do I combine the two so that I can use both features (length and bag of words)? Should I consider the feature now as two-dimensional (the n-dimensional BOW vector and the one-dimensional length feature)? If this won't work, how do I combine the features? Any pointers on this will also be helpful.

This statement is a little ambiguous: "So if the document is represented as a bag of words, then we will have an n-dimensional feature, where n is the number of words in the document."
My interpretation is that you have a column for each word that occurs in your corpus (probably restricted to some dictionary of interest), and for each document you have counted the number of occurrences of that word. Your number of columns is now equal to the number of words in your dictionary that appear in ANY of the documents. You also have a "length" feature, which could be a count of the number of words in the document, and you want to know how to incorporate it into your analysis.
A simple approach would be to divide the number of occurrences of a word by the total number of words in the document.
This has the effect of scaling the word occurrences based on the size of the document, and the new feature is called a 'term frequency'. The next natural step is to weight the term frequencies to compensate for terms that are more common in the corpus (and therefore less important). Since we give HIGHER weights to terms that are LESS common, this is called 'inverse document frequency', and the whole process is called “Term Frequency times Inverse Document Frequency”, or tf-idf. You can Google this for more information.
It's possible that you are doing word counts in a different way -- for example, counting the number of word occurrences in each paragraph (as opposed to each document). In that case, for each document, you have a word count for each paragraph, and the typical approach is to merge these paragraph-counts using a process such as Singular Value Decomposition.
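To make the scaling and weighting steps above concrete, here is a minimal Python sketch; the toy documents and variable names are illustrative assumptions, not a prescribed implementation.

import math
from collections import Counter

# Toy corpus; in practice these are your tokenized documents.
docs = [
    "the cat sat on the mat".split(),
    "the dog ate my homework".split(),
    "the cat chased the dog".split(),
]

# Term frequency: word count divided by document length (the scaling step above).
tf = [{w: c / len(doc) for w, c in Counter(doc).items()} for doc in docs]

# Inverse document frequency: terms that are rarer in the corpus get higher weights.
n_docs = len(docs)
df = Counter(w for doc in docs for w in set(doc))
idf = {w: math.log(n_docs / d) for w, d in df.items()}

# tf-idf vector for each document.
tfidf = [{w: f * idf[w] for w, f in doc_tf.items()} for doc_tf in tf]

# If you still want the raw length as its own feature, it can simply be appended
# as one extra dimension alongside the tf-idf values.
features = [dict(vec, doc_length=len(doc)) for vec, doc in zip(tfidf, docs)]
print(features[0])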

Related

Word Frequency Feature Normalization

I am extracting the features for a document. One of the features is the frequency of the word in the document. The problem is that the number of sentences in the training set and the test set is not necessarily the same. So, I need to normalize it in some way. One possibility (that came to my mind) was to divide the frequency of the word by the number of sentences in the document. But my supervisor told me that it's better to normalize it in a logarithmic way. I have no idea what that means. Can anyone help me?
Thanks in advance,
PS: I also saw this topic, but it didn't help me.
The first question to ask is: what algorithm are you using subsequently? For many algorithms it is sufficient to normalize the bag-of-words vector so that it sums to one, or so that some other norm is one.
Instead of normalizing by the number of sentences you should, however, normalize by the total number of words in the document. Your test corpus might have longer sentences, for example.
I assume the recommendation of your supervisor means that you do not report the counts of the words but the logarithm of the counts. In addition, I would suggest looking into the TF/IDF measure in general; this is, imho, more common in text mining.
'normalize it in a logarithmic way' probably simply means to replace the frequency feature by log(frequency).
One reason why taking the log might be useful is the Zipfian nature of word occurrences.
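As a tiny sketch of the two options mentioned above (plain Python on a made-up sentence, not the only way to do it): normalizing by the total number of words versus replacing raw counts with log(1 + count).

import math
from collections import Counter

doc = "the cat sat on the mat the cat".split()
counts = Counter(doc)

# Option 1: normalize by the total number of words so the vector sums to one.
total = sum(counts.values())
tf = {w: c / total for w, c in counts.items()}

# Option 2: logarithmic damping, so a word occurring 10 times is not treated
# as ten times more important than a word occurring once.
log_tf = {w: math.log(1 + c) for w, c in counts.items()}

print(tf["the"], log_tf["the"])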
Yes, there is a logarithmic way. It's called TF-IDF.
TF-IDF is the product of the term frequency and the inverse document frequency.
TF-IDF =
(number of times your word appears in the present document ÷ total number of words in the present document) ×
log(total number of documents in your collection ÷ number of documents in your collection in which your word appears)
If you use Python, there is a nice library called gensim that contains the algorithm, but your data object must be a Dictionary from gensim.corpora.
You can find an example here: https://radimrehurek.com/gensim/models/tfidfmodel.html
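A minimal sketch of that gensim workflow, assuming you already have tokenized documents (the toy texts below are made up); see the linked page for the full set of options.

from gensim import corpora, models

texts = [
    ["human", "machine", "interface"],
    ["survey", "of", "user", "opinion"],
    ["user", "interface", "system"],
]

dictionary = corpora.Dictionary(texts)        # token -> integer id mapping
bow = [dictionary.doc2bow(t) for t in texts]  # sparse (id, count) vectors
tfidf = models.TfidfModel(bow)                # learns idf weights from the corpus
print(tfidf[bow[0]])                          # tf-idf weighted vector of document 0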
tf-idf helps with normalization -> compare the results with the tf and tf-idf weighting arguments:
dtm <- DocumentTermMatrix(corpus);dtm
<>
Non-/sparse entries: 27316/97548
Sparsity : 78%
Maximal term length: 22
Weighting : term frequency (tf)
dtm <- DocumentTermMatrix(corpus,control = list(weighting=weightTfIdf));dtm
<>
Non-/sparse entries: 24052/100812
Sparsity : 81%
Maximal term length: 22
Weighting : term frequency - inverse document frequency (normalized) (tf-idf)

Fast k-NN search over bag-of-words models

I have a large number of documents of equal size. For each of those documents I'm building a bag-of-words model (BOW). The number of possible words across all documents is limited and large (2^16, for example). Generally speaking, I have N histograms of size K, where N is the number of documents and K is the histogram width. I can calculate the distance between any two histograms.
First optimization opportunity: documents usually use only a small subset of the words (usually less than 5%, most of them less than 0.5%).
Second optimization opportunity: the subset of words used varies a lot from document to document, so I can use bits instead of word counts.
Query by content
The query is a document as well. I need to find the k most similar documents.
Naive approach
Calculate BOW model from query.
For each document in dataset:
Calculate its BOW model.
Find distance between query and document.
Obviously, some data structure should be used to track the top-ranked documents (a priority queue, for example).
I need some sort of index to get rid of the full database scan. A KD-tree comes to mind, but the dimensionality and size of the dataset are very high. One could suggest using some subset of the possible words as features, but I don't have a separate training phase and can't extract these features beforehand.
I've thought about using the MinHash algorithm to prune the search space, but I can't design appropriate hash functions for this task.
k-d-tree and similar indexes are for dense and continuous data.
Your data most likely is sparse.
A good index for finding the nearest neighbors on sparse data is inverted lists. Essentially the same way search engines like Google work.
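A rough sketch of the inverted-list idea in Python (the toy data and overlap scoring are simplified assumptions, not a production index): each word maps to the documents containing it, so a query only touches documents that share at least one word with it instead of scanning the whole database.

from collections import defaultdict, Counter

# Toy collection: document id -> set of words (the "bits instead of counts" case).
docs = {
    0: {"apple", "banana", "cherry"},
    1: {"banana", "date"},
    2: {"cherry", "elderberry", "fig"},
}

# Build the inverted index: word -> ids of the documents containing it.
index = defaultdict(set)
for doc_id, words in docs.items():
    for w in words:
        index[w].add(doc_id)

def knn(query_words, k=2):
    # Candidate generation: only documents sharing at least one query word get scored.
    scores = Counter()
    for w in query_words:
        for doc_id in index.get(w, ()):
            scores[doc_id] += 1  # simple overlap count; swap in your own distance here
    return scores.most_common(k)  # top-k most similar candidates

print(knn({"banana", "cherry"}))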

Document classification using Naive Bayes

I have a question regarding the particular Naive Bayes algorithm that is used in document classification. The following is what I understand:
1. construct some probability of each word in the training set for each known classification
2. given a document, we strip all the words that it contains
3. multiply together the probabilities of the words being present in a classification
4. perform (3) for each classification
5. compare the results of (4) and choose the classification with the highest posterior
What I am confused about is the part where we calculate the probability of each word given the training set. For example, take the word "banana": it appears in 100 documents in classification A, there are 200 documents in A in total, and in total 1000 words appear in A. To get the probability of "banana" appearing under classification A, do I use 100/200 = 0.5 or 100/1000 = 0.1?
I believe your model will classify more accurately if you count the number of documents the word appears in, not the number of times the word appears in total. In other words:
Classify "Mentions Fruit":
"I like Bananas."
should be weighed no more or less than
"Bananas! Bananas! Bananas! I like them."
So the answer to your question would be 100/200 = 0.5.
The description of Document Classification on Wikipedia also supports my conclusion
Then the probability that a given document D contains all of the words w_i, given a class C, is p(D|C) = ∏_i p(w_i|C).
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
In other words, the document classification algorithm Wikipedia describes tests how many of the list of classifying words a given document contains.
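As a concrete illustration of the document-presence estimate described above (numbers taken from the question; the Laplace smoothing is my own addition, not part of the original answer):

import math

# "banana" appears in 100 of the 200 documents of class A.
docs_in_A = 200
docs_in_A_containing_banana = 100

# Document-presence estimate: count documents containing the word, not occurrences.
p_banana_given_A = docs_in_A_containing_banana / docs_in_A  # 0.5

# In practice you would smooth and sum logs rather than multiplying raw
# probabilities, so rare or unseen words do not underflow or zero out the product.
def word_log_prob(docs_with_word, docs_in_class):
    return math.log((docs_with_word + 1) / (docs_in_class + 2))  # Laplace smoothing

print(p_banana_given_A, word_log_prob(100, 200))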
By the way, more advanced classification algorithms will examine sequences of N-words, not just each word individually, where N can be set based on the amount of CPU resources you are willing to dedicate to the calculation.
UPDATE
My direct experience is based on short documents. I would like to highlight research that @BenAllison points out in the comments, which suggests my answer is invalid for longer documents. Specifically:
One weakness is that by considering only the presence or absence of terms, the BIM ignores information inherent in the frequency of terms. For instance, all things being equal, we would expect that if 1 occurrence of a word is a good clue that a document belongs in a class, then 5 occurrences should be even more predictive.
A related problem concerns document length. As a document gets longer, the number of distinct words used, and thus the number of values of x(j) that equal 1 in the BIM, will in general increase.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1529

Sentence classification using Weka

I want to classify sentences with Weka. My features are the sentence terms (words) and a part-of-speech tag for each term. I don't know how to set up the attributes, because if each term is represented as one feature, the number of features for each instance (sentence) becomes different. And if all the words in the sentence are represented as one feature, how do I relate the words to their POS tags?
Any ideas how I should proceed?
If I understand the question correctly, the answer is as follows: It is most common to treat words independently of their position in the sentence and represent a sentence in the feature space by the number of times each of the known words occurs in that sentence. I.e. there is usually a separate numerical feature for each word present in the training data. Or, if you're willing to use n-grams, a separate feature for every n-gram in the training data (possibly with some frequency threshold).
As for the POS tags, it might make sense to use them as separate features, but only if the classification you're interested in has to do with sentence structure (syntax). Otherwise you might want to just append the POS tag to the word, which would partly disambiguate those words that can represent different parts of speech.
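A small sketch of the second suggestion (appending the POS tag to each word before building the bag of words); the tagged sentence below is a made-up example, and in Weka these counts would then become the numeric attributes of each instance.

from collections import Counter

# Made-up (word, POS) pairs, e.g. produced by a tagger in a preprocessing step.
tagged_sentence = [("time", "NN"), ("flies", "VBZ"), ("like", "IN"),
                   ("an", "DT"), ("arrow", "NN")]

# Append the tag to the word so 'flies_VBZ' and 'flies_NNS' become distinct features.
features = Counter(f"{word}_{tag}" for word, tag in tagged_sentence)
print(features)  # Counter({'time_NN': 1, 'flies_VBZ': 1, ...})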

Calculating IDF (Inverse Document Frequency) for document categorization

I have a doubt about calculating IDF (Inverse Document Frequency) in document categorization. I have more than one category, each with multiple documents for training. I am calculating the IDF for each term in a document using the following formula:
IDF(t, D) = log(total number of documents in the corpus / number of documents matching the term)
My questions are:
What does "Total Number documents in Corpus" mean? Whether the document count from a current category or from all available categories?
What does "Number of Document matching term" mean? Whether the term matching document count from a current category or from all available categories?
The total number of documents in the corpus is simply the number of documents you have in your corpus. So if you have 20 documents, then this value is 20.
The number of documents matching the term is the count of how many documents the term t occurs in. So if you have 20 documents in total and the term t occurs in 15 of them, then the value for the number of documents matching the term is 15.
The value for this example would thus be IDF(t, D) = log10(20/15) ≈ 0.1249.
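In code, that worked example is simply:

import math

total_docs = 20            # documents in the whole corpus
docs_containing_term = 15  # documents in which the term t occurs

idf = math.log10(total_docs / docs_containing_term)  # the answer uses a base-10 log
print(idf)  # ~0.1249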
Now, if I'm correct, you have multiple categories per document and you want to be able to categorize new documents with one or more of these categories. One method to do this would be to create one document for each category. Each category-document should hold all the texts that are labelled with that category. You can then perform tf*idf on these documents.
A simple way of categorizing a new document would then be to sum, for each category, the term values (as calculated for that category) of the terms occurring in the query. The category whose term values yield the highest total is then ranked first.
Another possibility is to create a vector for the query using the idf of each term in the query; all terms that don't occur in the query are given the value 0. The query vector can then be compared to each category vector using, for example, cosine similarity.
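A rough sketch of that query-vector approach (the category vectors and numbers below are illustrative assumptions): build an idf-weighted vector for the query and compare it to each category vector with cosine similarity.

import math

def cosine(u, v):
    # Cosine similarity between two dicts mapping term -> weight.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Illustrative tf-idf vectors, one per category-document (built as described above).
category_vectors = {
    "sports":  {"match": 0.8, "team": 0.6, "score": 0.4},
    "finance": {"market": 0.9, "stock": 0.7, "score": 0.1},
}

# Query vector: the idf of each term in the query; terms not in the query are 0.
query = {"team": 0.6, "score": 0.4}

ranking = sorted(category_vectors,
                 key=lambda c: cosine(query, category_vectors[c]), reverse=True)
print(ranking)  # categories ordered by similarity to the query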
Smoothing is also a useful technique to deal with words in a query which don't occur in your corpus.
I'd suggest reading sections 6.2 and 6.3 of "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.
I have written a small post describing term frequency-inverse document frequency here: http://bigdata.devcodenote.com/2015/04/tf-idf-term-frequency-inverse-document.html
Here is a snippet from the post:
TF-IDF is the most fundamental metric used extensively in classification of documents.
Let us try and define these terms:
Term frequency basically reflects how frequently a certain word occurs in a document compared to the other words in that document.
Inverse document frequency, on the other hand, reflects in how many of the documents of a given collection (of documents which we want to classify into different categories) the word occurs.
