Loglikelihood similarity for document clustering - machine-learning

I'm using the following log-likelihood formula to measure the similarity between a document and a cluster:
log p(d|c) = sum over w of c(w,d) * log p(w|c)
Here c(w,d) is the frequency of word w in document d, and p(w|c) is the probability of word w being generated by cluster c.
The problem is that, based on this similarity, the document is often assigned to the wrong cluster. If I assign the document to the cluster with the highest log p(d|c) (as it is usually a negative value, I take -log p(d|c)), then it ends up in a cluster that contains a lot of the document's words, but where the probability of these words in the cluster is low.
If I assign the document to the cluster with the lowest log p(d|c), then it ends up in a cluster that shares only one word with the document.
Can someone explain how to use the log-likelihood correctly? I am trying to implement this function in Java. I have already looked on Google Scholar, but didn't find a suitable explanation of log-likelihood in text mining.
Thanks in advance.

Your log likelihood formulation is correct for describing a document with a multinomial model (words in each document are generated independently from a multinomial distribution).
To get the maximum likelihood cluster assignment, you should take the cluster assignment, c, that maximizes log p(d|c). log p(d|c) should be a negative number, so the maximum is the number closest to zero. If you work with -log p(d|c) instead, that corresponds to the smallest value, not the largest.
If you are getting cluster assignments that don't make sense, it is likely that this is because the multinomial model does not describe your data well. So, the answer to your question is most likely that you should either choose a different statistical model or use a different clustering method.
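To make the assignment rule concrete, here is a minimal sketch in Python (the same logic ports directly to Java). The function names and the toy clusters are made up for illustration. The add-alpha smoothing is my own assumption and not something the question mentions: without some smoothing, a single word unseen in a cluster gives log(0) and ruins the sum.

```python
import math
from collections import Counter

def word_log_probs(cluster_docs, vocabulary, alpha=1.0):
    """Estimate log p(w|c) from one cluster's documents (add-alpha smoothing is an assumption)."""
    counts = Counter(word for doc in cluster_docs for word in doc)
    total = sum(counts[w] for w in vocabulary) + alpha * len(vocabulary)
    return {w: math.log((counts[w] + alpha) / total) for w in vocabulary}

def log_likelihood(doc, log_probs):
    """log p(d|c) = sum over words w of c(w,d) * log p(w|c); out-of-vocabulary words are skipped."""
    return sum(count * log_probs[w] for w, count in Counter(doc).items() if w in log_probs)

def assign(doc, clusters_log_probs):
    """Return the cluster whose log p(d|c) is largest, i.e. closest to zero."""
    return max(clusters_log_probs, key=lambda c: log_likelihood(doc, clusters_log_probs[c]))

# Toy data, invented for this example.
clusters = {
    "sports": [["ball", "goal", "team"], ["team", "win", "goal"]],
    "cooking": [["salt", "pan", "oil"], ["oil", "salt", "heat"]],
}
vocabulary = sorted({w for docs in clusters.values() for doc in docs for w in doc})
models = {name: word_log_probs(docs, vocabulary) for name, docs in clusters.items()}
print(assign(["goal", "team", "team"], models))
```

Note that the maximum is taken over the raw (negative) log-likelihoods. Negating them first and then taking the largest value would select the least likely cluster.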

Related

How to find the impact of one variable on the other when there seems to no correlation between them?

Can we predict the growth percentage in sales of an item, given the change in discount (a positive or negative number) from the previous year as a predictor variable? There seems to be no correlation between them. How can this problem be solved using machine learning?
You are on the wrong track with this question.
Correlation is a topic in statistics. Check Pearson's correlation coefficient or Spearman's rank correlation coefficient to measure the correlation between the discount changes and the sales growth.
In machine learning, we seldom compare two percentage series; instead, we compare the actual sales and discount values. A simple model such as linear regression can be applied. Most ML is used on multi-dimensional data, whereas your case has one x and one y (a single input column and a single output). Please refer to related material online and solve it with Excel or Python code.
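As a rough illustration of the three tools mentioned above, here is a small pure-Python sketch of Pearson's coefficient, Spearman's coefficient (Pearson on ranks) and a one-variable least-squares line. The function names and the numbers are invented for the example; they are not taken from the question's data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient (undefined if either series is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

# Made-up values: discount per item and units sold.
discount = [0, 5, 10, 15, 20]
sales = [100, 112, 119, 133, 140]
print(pearson(discount, sales), spearman(discount, sales), fit_line(discount, sales))
```

If both coefficients are close to zero on the real data, a one-variable regression on the discount alone is unlikely to predict much, and other predictors would be needed.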

How to classify text with Knime

I'm trying to classify some data using KNIME with the KNIME Labs deep learning plugin.
I have about 16,000 products in my DB, but I know the category of only about 700 of them.
I'm trying to classify as many as possible using some DM (data mining) technique. I've downloaded some plugins for KNIME, so now I have some deep learning tools as well as some text tools.
Here is my workflow, I'll use it to explain what I'm doing:
I transform the product name into a vector and then pass it on.
After that I train a DL4J learner with DeepMLP. (I don't really understand it all; it was the one with which I thought I got the best results.) Then I try to apply the model to the same data set.
I thought I would get a result with the predicted classes. Instead I'm getting a column named output_activations that seems to hold a pair of doubles. When I sort by this column, related data ends up close together. But I was expecting to get the classes.
Here is a screenshot of the result table, where you can see the output alongside the input.
In the column selection (learner node config), only converted_document is included, and des_categoria is selected as the Label Column. In the Predictor node I checked "Append SoftMax Predicted Label?".
nom_produto is the text column that I'm trying to use to predict the des_categoria column, which is the product category.
I'm really new to DM and DL. Any help with what I'm trying to do would be awesome. Also feel free to suggest some learning material about what I'm attempting to achieve.
PS: I also tried to apply it to the unclassified data (17,000 products), but I got the same result.
I won't answer with a workflow on this one because it is not going to be a simple one. However, be sure to find the text mining example on the KNIME server, i.e. the one that makes use of the bag of words approach.
The task
Product mapping to categories should be a straight-forward data mining task because the information that explains the target variable is available in a quasi-exhaustive manner. Depending on the number of categories to train though, there is a risk that you might need more than 700 instances to learn from.
Some resources
Here are some resources, only the first one being truly specialised in text mining:
(1) Introduction to Information Retrieval, in particular chapter 13;
(2) Data Science for Business is an excellent introduction to data mining, including text mining (chapter 10); also do not forget the chapter about similarity (chapter 6);
(3) Machine Learning with R has the advantage of being accessible enough (chapter 4 provides an example of text classification with R code).
Preprocessing
First, you will have to preprocess your product labels a bit. Use KNIME's text analytics preprocessing nodes for that purpose, after you've transformed the product labels with Strings to Document:
Case Convert, Punctuation Erasure and Snowball Stemmer;
you probably won't need Stop Word Filter; however, there may be quasi-stop words such as "product", which you may need to remove manually with Dictionary Filter;
be careful not to use any of the following without testing their impact first: N Chars Filter ("g" may be a useful word), Number Filter (numbers may indicate quantities, which may be useful for classification).
Should you encounter any trouble with the relevant nodes (e.g. Punctuation Erasure can be surprisingly tricky because of the tokenizer), you can always apply String Manipulation with regex before converting the Strings to Document.
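For readers who want to prototype the same clean-up outside KNIME, here is a minimal Python sketch of the regex-based variant: lower-casing plus punctuation removal. The function name and the sample product name are hypothetical. In line with the warning above, it deliberately keeps digits and one-letter tokens.

```python
import re

def clean_name(name):
    """Lowercase, replace punctuation with spaces, collapse repeated whitespace."""
    lowered = name.lower()
    no_punct = re.sub(r"[^\w\s]", " ", lowered)   # keeps letters, digits and underscores
    return re.sub(r"\s+", " ", no_punct).strip()

print(clean_name("Coca-Cola, 2L (PET)"))  # prints "coca cola 2l pet"
```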
Keep it short and simple: the lookup table
You could build a lookup table based on the 700 training instances. The book Data Mining Techniques as well as resource (2) present this approach in some detail. If a model performs worse than the lookup table, you should abandon that model.
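One possible reading of the lookup table idea, sketched in Python: map every token seen in the labelled product names to the category it most often appears with, then let the tokens of a new name vote. This is my own simplified interpretation, not the exact procedure from the books, and all names and data below are hypothetical.

```python
from collections import Counter, defaultdict

def build_lookup(names, categories):
    """Map each token to the category it most often co-occurs with in the training data."""
    votes = defaultdict(Counter)
    for name, category in zip(names, categories):
        for token in set(name.lower().split()):
            votes[token][category] += 1
    return {token: counter.most_common(1)[0][0] for token, counter in votes.items()}

def predict(name, lookup, default=None):
    """Majority vote over the known tokens of a new name; `default` if none is known."""
    hits = Counter(lookup[t] for t in name.lower().split() if t in lookup)
    return hits.most_common(1)[0][0] if hits else default

# Invented training rows standing in for the 700 labelled products.
train_names = ["apple juice 1l", "orange juice 1l", "cotton shirt blue", "wool shirt red"]
train_categories = ["drinks", "drinks", "clothing", "clothing"]
lookup = build_lookup(train_names, train_categories)
print(predict("grape juice 2l", lookup))
```

A baseline this crude is useful mainly as a yardstick: any heavier model should beat it on held-out data.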
Nearest neighbors
Neural networks are probably overkill for this task.
Start with a K Nearest Neighbor node (applying a string distance such as Cosine, Levenshtein or Jaro-Winkler). This approach requires the least amount of data wrangling. At the very least, it will provide an excellent baseline model, so it is most definitely worth a shot.
You'll need to tune the parameter k and to experiment with the distance types. The Parameter Optimization Loop pair will help you with optimizing k. You can include a Cross-Validation meta node inside that loop to obtain an estimate of the expected performance given k, instead of only one point estimate per value of k. Use Cohen's Kappa as the optimization criterion, as proposed by resource (3) and available via the Scorer node.
After the parameter tuning, you'll have to evaluate the relevance of your model using yet another Cross-Validation meta node, then follow up with a Loop pair including Scorer to calculate the descriptives on the performance metric(s) per iteration, and finally use Statistics. Kappa is a convenient metric for this task because the target variable consists of many product categories.
Don't forget to test its performance against the lookup table.
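To show what the k-NN baseline and the Kappa criterion compute, here is a small Python sketch using Levenshtein distance, one of the string distances named above. It is not what the KNIME nodes run internally; the function names and the sample rows are assumptions for illustration.

```python
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,               # delete ca
                               current[j - 1] + 1,            # insert cb
                               previous[j - 1] + (ca != cb))) # substitute or match
        previous = current
    return previous[-1]

def knn_predict(name, train_names, train_labels, k=3):
    """Majority label among the k training names closest to `name`."""
    nearest = sorted(range(len(train_names)),
                     key=lambda i: levenshtein(name, train_names[i]))[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

def cohen_kappa(truth, predicted):
    """Agreement corrected for chance (undefined when chance agreement equals 1)."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    truth_freq, pred_freq = Counter(truth), Counter(predicted)
    expected = sum(truth_freq[c] * pred_freq[c] for c in truth_freq) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented labelled rows.
names = ["apple juice", "orange juice", "blue shirt", "red shirt"]
labels = ["drinks", "drinks", "clothing", "clothing"]
print(knn_predict("grape juice", names, labels, k=3))
```

Tuning then amounts to repeating the prediction for several values of k inside a cross-validation loop and keeping the k with the highest average Kappa.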
What next?
Should the lookup table or k-NN work well for you, then there's nothing else to add.
Should any of those approaches fail, you might want to analyse the precise cases on which it fails. In addition, the training set may be too small, so you could manually classify another few hundred or thousand instances.
If, after increasing the training set size, you are still dealing with a bad model, you can try the bag of words approach together with a Naive Bayes classifier (see chapter 13 of the Information Retrieval reference). There is no room here to elaborate on the bag of words approach and Naive Bayes, but you'll find the resources above useful for that purpose.
One last note. Personally, I find KNIME's Naive Bayes node to perform poorly, probably because it does not implement Laplace smoothing. However, KNIME's R Learner and R Predictor nodes will allow you to use R's e1071 package, as demonstrated by resource (3).

Clustering or other mechanisms for implementing generic spam detection

Previously I tried naive Bayes and a linear SVM to classify comments on a specific page, where I had access to training data that was manually labelled as spam or ham.
Now I am being asked to check whether there are ways to classify comments as spam when we don't have training data, something like getting two clusters for any given data, which would then be marked as spam or ham.
I need to know some ways to approach this problem and what would be a good way to implement it.
I am still learning and experimenting. Any help will be appreciated.
Are the new comments very different from the old comments in terms of vocabulary? Words are almost all that the classifiers for this task look at.
You can always try using your old training data and applying the classifier to the new domain. You would have to label a few examples from your new domain in order to measure performance (or better, let others do the labeling in order to get more reliable results).
If this doesn't work well, you could try domain adaptation or look for some datasets more similar to your new domain, using Google or looking at these spam/ham corpora.
Finally, there may be some regularity or pattern in your new setting, e.g. downvotes for a comment, which may indicate spam/ham. In such cases, you could compile training data yourself. This would then be called distant supervision (you can search for papers using this keyword).
The best I could find was this research work, which mentions active learning. What I came up with is this: I first performed k-means clustering and picked the main clusters (assuming 5 clusters, I took the top 3 ordered by size, descending) and took 1000 messages from each. Then I would have the user label them. The next step would be to train a logistic regression on the labelled data and get the probabilities for the unlabelled data. Any message with a probability close to 0.5, say in the range 0.4 to 0.6, is uncertain, so I would send it for labelling, and the process would continue.
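The selection step of that loop (often called uncertainty sampling) can be sketched in a few lines of Python. The function name and the probabilities are hypothetical; in practice the probabilities would come from the logistic regression trained on the labelled messages.

```python
def select_uncertain(probabilities, low=0.4, high=0.6):
    """Return indices of items whose predicted spam probability lies in [low, high]."""
    return [i for i, p in enumerate(probabilities) if low <= p <= high]

# Hypothetical classifier outputs for five unlabelled comments.
predicted = [0.05, 0.45, 0.97, 0.58, 0.30]
to_label = select_uncertain(predicted)
print(to_label)  # prints [1, 3]
```

Each round, the selected messages are labelled by hand, added to the training set, and the classifier is retrained, until few uncertain messages remain or the labelling budget runs out.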

Naive Bayes spam filtering

I am trying to implement my first spam filter using a naive bayes classifier. I am using the data provided by UCI’s machine learning data repository (http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection). The data is a table of features corresponding to a few thousand spam and non-spam(ham) messages. Therefore, my features are limited to those provided by the table.
My goal is to implement a classifier that can calculate P(S∣M), the probability of being spam given a message. So far I have been using the following equation to calculate P(S∣F), the probability of being spam given a feature.
P(S∣F)=P(F∣S)/(P(F∣S)+P(F∣H))
from http://en.wikipedia.org/wiki/Bayesian_spam_filtering
where P(F∣S) is the probability of the feature given spam and P(F∣H) is the probability of the feature given ham. I am having trouble bridging the gap from knowing P(S∣F) to P(S∣M), where M is a message and a message is simply a bag of independent features.
At first glance I want to just multiply the feature probabilities together. But that would make most numbers very small, and I am not sure if that is normal.
In short, these are the questions I have right now.
1.) How do I go from a set of P(S∣F) to P(S∣M)?
2.) Once P(S∣M) has been calculated, how do I define a threshold for my classifier?
3.) Fortunately my feature set was selected for me. How would I go about selecting or finding my own feature set?
I would also appreciate resources that might help me out. Thanks for your time.
You want to use Naive Bayes:
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
It's probably beyond the scope of this answer to explain it in full, but essentially you multiply the probabilities of each feature given spam together, and multiply that by the prior probability of spam. Then repeat for ham (i.e. multiply the probabilities of each feature given ham together, and multiply that by the prior probability of ham). Now you have two numbers which can be normalized to probabilities by dividing each by the total of both. That will give you P(S|M) and P(H|M). Again, read the article above. If you want to avoid numerical underflow, take the log of each conditional and prior probability (any base) and add, instead of multiplying the original probabilities. Adding logs is equivalent to multiplying the original numbers. This won't give you a probability at the end, but you can still take the class with the larger value as the predicted class.
You should not need to set a threshold, simply classify each instance by what is more likely, spam or ham (or whichever gives you the greater log likelihood).
There is no simple answer to the feature selection question. Using a bag of words model is reasonable for this problem. Avoid very infrequent words (occurring in < 5 documents) and also very frequent words, such as "the" and "a". A stop word list is often used to remove the latter. A feature selection algorithm can also help. Removing features that are highly correlated will help, particularly with Naive Bayes, which is highly sensitive to this.
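The recipe for going from per-feature probabilities to a per-message decision can be sketched in Python as follows. It works in log space to avoid underflow and, as an optional extra, normalises the two scores back to probabilities with the log-sum-exp trick. The function names, the add-alpha smoothing and the toy messages are my own assumptions and are not taken from the UCI data set.

```python
import math
from collections import Counter

def train(messages, labels, alpha=1.0):
    """Return log priors and add-alpha smoothed log P(word | class)."""
    vocabulary = {w for m in messages for w in m}
    log_prior, log_cond = {}, {}
    for c in set(labels):
        docs = [m for m, y in zip(messages, labels) if y == c]
        log_prior[c] = math.log(len(docs) / len(messages))
        counts = Counter(w for m in docs for w in m)
        total = sum(counts.values()) + alpha * len(vocabulary)
        log_cond[c] = {w: math.log((counts[w] + alpha) / total) for w in vocabulary}
    return log_prior, log_cond

def posterior(message, log_prior, log_cond):
    """P(class | message); words never seen in training are ignored."""
    scores = {c: log_prior[c] + sum(log_cond[c][w] for w in message if w in log_cond[c])
              for c in log_prior}
    top = max(scores.values())
    norm = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return {c: math.exp(s - norm) for c, s in scores.items()}

# Tiny invented training set.
messages = [["win", "cash", "now"], ["free", "cash", "prize"],
            ["see", "you", "now"], ["lunch", "at", "noon"]]
labels = ["spam", "spam", "ham", "ham"]
log_prior, log_cond = train(messages, labels)
print(posterior(["free", "cash"], log_prior, log_cond))
```

Classifying by the larger posterior is the same as a 0.5 threshold on P(S|M). A different threshold only makes sense if false positives and false negatives have different costs.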

Mahout: RowSimilarity vs Clustering

I was trying to cluster some documents using the KMeansClustering approach and successfully created the clusters. I saved the cluster id corresponding to a particular document for recommendations. So whenever I wanted to recommend documents similar to a particular document, I would query all the documents in a particular cluster and return n random documents from the cluster. However, returning any random document from the cluster did not seem appropriate and I read somewhere that we should be returning the documents nearest to the document in question.
So I started searching for ways to calculate the distance between documents and stumbled upon the RowSimilarity approach, which returns the 10 most similar documents for each document, ordered by distance. This approach relies on a similarity metric such as LogLikelihood to calculate the distance between documents.
Now my question is this. How is clustering better/worse than RowSimilarity given that both the approaches use a similarity distance metric to calculate the distance between documents?
What I'm trying to achieve is to cluster products on the basis of their titles and other text properties in order to recommend similar products. Any help is appreciated.
Clustering is not just another variant of classification or recommendation. It is a different discipline.
When you are doing cluster analysis, you want to discover structure in the data. But then, you should actually be analyzing the structure you found.
Now k-means is not really meant for documents. It tries to find a near optimal partitioning of a data set into k Voronoi cells. Unless you have a good reason to believe that Voronoi cells are a good partitioning for your data, the algorithm may be pretty much useless. Just because it returns a result does not at all indicate that the result is useful.
For documents, Euclidean distances (and k-means is in fact optimizing Euclidean distances) are usually pretty much meaningless. The vectors are very sparse, and k-means cluster centers will then often resemble impossible (and thus insensible) "average documents".
And I haven't even started on the need to find an appropriate value of k, on the Mahout implementation likely just being an approximation of Lloyd's k-means approximation, and so on. Did you even check the cluster sizes? In situations like these, k-means will often produce degenerate results: for example, almost all clusters containing 1 or 0 elements, and a mega-cluster containing the rest. In this situation, you might in fact be returning just random documents from your database...
Just because you can use it does not mean it is helpful. Make sure to validate the individual steps of your approach, for example if the clusters are in any way useful and sensible!
Similarity is not the same thing as distance -- one is big when the other is small. Clustering is not the same as computing distances either. First you should decide whether you have a clustering problem -- it does not sound like you do based on what you say. So, don't use k-means.
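If the goal is simply "the n documents most similar to this one", that can be computed directly without any clustering. Below is a minimal Python sketch using cosine similarity on raw term counts. It is not Mahout's RowSimilarity implementation, and the function names and product titles are hypothetical.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va if t in vb)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def most_similar(query_id, documents, n=10):
    """Return the ids of the n documents most similar to documents[query_id]."""
    scores = {doc_id: cosine_similarity(documents[query_id], tokens)
              for doc_id, tokens in documents.items() if doc_id != query_id}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Invented product titles.
documents = {
    "a": "red cotton shirt".split(),
    "b": "blue cotton shirt".split(),
    "c": "steel frying pan".split(),
}
print(most_similar("a", documents, n=1))
```

Here a larger value means more alike; a matching distance would be 1 minus the similarity, which is small when the similarity is large.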

Resources