Bad clustering results with Mahout on the Reuters 21578 dataset

I've used part of the Reuters 21578 dataset and Mahout k-means for clustering. To be more specific, I extracted only the texts that have a unique value for the category 'topics', which left me with 9494 texts belonging to one of 66 categories. I used seqdirectory to create sequence files from the texts and then seq2sparse to create the vectors. Then I ran k-means with the cosine distance measure (I've tried Tanimoto and Euclidean too, with no better luck), cd=0.1 and k=66 (the same as the number of categories). I then tried to evaluate the results with the silhouette measure, using custom Java code and the MATLAB implementation of silhouette (just to be sure there is no error in my code), and I get an average silhouette of 0.0405 for the clustering. Knowing that the best clustering would give an average silhouette value close to 1, I can see that the clustering result I get is not good at all.
So is this due to Mahout, or is the quality of the categorization in the Reuters dataset just low?
PS: I'm using Mahout 0.7
PS2: Sorry for my bad English..

I've never actually worked with Mahout, so I cannot say what it does by default, but you might consider checking what sort of distance metric it uses by default. For example, if the metric is Euclidean distance on unnormalized document word counts, you can expect very poor cluster quality, as document length will dominate any meaningful comparison between documents. On the other hand, something like cosine distance on normalized or tf-idf weighted word counts can do much better.
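If you want to sanity-check the representation outside Mahout, here is a minimal sketch (assuming scikit-learn; the documents are placeholders for your Reuters texts) of l2-normalized tf-idf vectors with cosine distances, which is roughly what seq2sparse with tf-idf weighting and l2 normalization should give you:

```python
# Minimal sketch (assumed scikit-learn, illustrative only): l2-normalised tf-idf
# vectors plus pairwise cosine distances.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

docs = ["oil prices rise on supply fears",          # stand-ins for the Reuters texts
        "central bank raises interest rates",
        "crude oil output cut announced"]

vectors = TfidfVectorizer(stop_words="english", norm="l2").fit_transform(docs)
print(cosine_distances(vectors))   # small distance = similar documents
```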
One other thing to look at is the distribution of topics in Reuters 21578. It is very skewed towards a few topics such as "acq" or "earn", while others are used only a handful of times. This can make it difficult to achieve good external clustering metrics.

Related

Feature engineering for Gaussian-distributed input

I am designing an NN classifier where most of the input features are estimates of Gaussian distributions, i.e. one feature has a mu and a sigma value.
The classifier has about 30 input features, 60 if you consider each mu and sigma their own feature.
The number of outputs are 15, i.e. there are 15 possible classifications.
I have about 50k examples to use for training/verification.
I can think of a few different scenarios of how to transform these features into something useful but I am not clever enough to come to any conclusions on how they would impact my results.
First scenario is to just scale and blindly pass each mu and sigma individually. I don't really see how sigma would help the classifier in this case, since it's just a measure of uncertainty. Optimally this would lead to slightly "fuzzier" classifications which possibly could be used for estimating some certainty metric of a classification result.
The second scenario is to generate more training cases by drawing a value from the Gaussian of each of the 30 input features, and then normalizing these random values. This would give me more training data, which could be useful.
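To make this concrete, a minimal sketch of the second scenario (assuming NumPy; the array names, shapes and values below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins: mu and sigma are (n_examples, 30), labels is (n_examples,)
mu = rng.normal(size=(50_000, 30))
sigma = rng.uniform(0.1, 1.0, size=(50_000, 30))
labels = rng.integers(0, 15, size=50_000)

n_draws = 3                                            # augmented copies per original example
aug_x = np.concatenate([rng.normal(mu, sigma) for _ in range(n_draws)])
aug_y = np.tile(labels, n_draws)

# Normalize each augmented row, as described above
aug_x /= np.linalg.norm(aug_x, axis=1, keepdims=True)
```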
As a side note, I have the possibility of getting more data (about 50k more examples), but I am not sure how accurate that data is, so I would like to try with this smaller set first to see if it converges.
The question is: Is there any consensus or interesting paper in the community, describing how to deal with estimated uncertainty in input features?
Thanks!
P.S. Sorry for my bad wording, ML is not my professional domain nor is English my native language.

How can I normalize data to have the same average sum of squares?

In a lot of articles in my field, this sentence is repeated: "The 2 matrices have been normalized to have the same average sum-of-squares (computed across all subjects and all voxels for each modality)". Suppose we have two matrices where the rows correspond to different subjects and the columns to features (voxels). These articles give little explanation of the normalization method. Does anybody know how I should normalize the data to have the "same average sum-of-squares"? I don't understand it at all. Thanks
For a start, normalization in this context is also known as feature scaling, which pretty much sums it up. You scale your features, your data, to get rid of variances and ranges of values that would otherwise disturb your algorithm and your results in the end.
https://en.wikipedia.org/wiki/Feature_scaling
In data processing, normalization is quite useful (depending on the application). E.g. in distance-based machine learning algorithms you should normalize your features in order to get a proportional contribution to the outcome of your algorithm, independent of the range of values the features span.
To do so, you can use different statistical measures, like the sum of squares:
Σ_i (x_i - x̄)²
Other than that you could use the variance or the standard deviation of your data.
https://www.westgard.com/lesson35.htm#4
Those statistical terms can then be used to normalize your data, to improve e.g. the clustering quality of your algorithm. Which term and which method to use depends heavily on the algorithms and data you're working with and on what you're aiming at.
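As a concrete illustration of one common reading of "same average sum-of-squares" (this reading is my assumption, since the articles you cite don't spell it out): divide each matrix by the square root of its mean squared value, so that both matrices end up with an average sum-of-squares of 1.

```python
import numpy as np

def scale_to_unit_mean_square(x):
    # Scale x so that the mean of x**2 (over all subjects and voxels) equals 1
    return x / np.sqrt(np.mean(x ** 2))

rng = np.random.default_rng(0)
modality_a = rng.normal(scale=5.0, size=(20, 1000))   # subjects x voxels
modality_b = rng.normal(scale=0.3, size=(20, 800))

a = scale_to_unit_mean_square(modality_a)
b = scale_to_unit_mean_square(modality_b)
print(np.mean(a ** 2), np.mean(b ** 2))   # both ~1.0: the same average sum-of-squares
```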
Here is a paper which compares some of the approaches you could choose from for clustering:
http://maxwellsci.com/print/rjaset/v6-3299-3303.pdf
I hope this can help you a little.

How can we say that a clustering quality measure is good?

There are a few well-known measures like the silhouette width (SW), the Davies-Bouldin index (DB), the Calinski-Harabasz index (CH), and the Dunn index.
How can we say that a clustering quality measure is good?
Is there some kind of metric for the clustering quality measure to be good?
Also,
"algorithms that produce clusters with high Dunn index are more desirable" -Wikipedia
"Objects with a high silhouette value are considered well clustered" -Wikipedia
"clustering algorithm that produces a collection of clusters with the smallest Davies–Bouldin index is considered the best algorithm" -Wikipedia
How high or low should these values be? Is there a benchmark number?
Can anyone provide a small example of using a clustering quality measure on a dataset, such as the Iris dataset, to say that the particular clustering quality measure is good?
Maybe a simple starting point would be:
"Are the elements within a cluster alike, and are they different from elements in a different cluster?"
There are obviously a variety of metrics to quantify similarity vs difference - as well as considerations like density vs distance.
The Stanford NLP project has a useful reference that is approachable: http://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html
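For the small example asked for above, here is a minimal sketch (assuming scikit-learn) that clusters the Iris dataset with k-means and reports the average silhouette width; values near 1 mean compact, well-separated clusters, values near 0 mean overlapping ones.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

# Average silhouette width for several values of k; higher is better
for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```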

Centroid algorithm for document classification, threshold detection

I have a collection of documents related to a particular domain and have trained a centroid classifier on that collection. What I want to do is feed the classifier with documents from different domains and determine how relevant they are to the trained domain. I can use cosine similarity to get a numerical value, but my question is: what is the best way to determine the threshold value?
For this, I could download several documents from different domains and inspect their similarity scores to determine the threshold value. But is this the way to go; is it statistically sound? What are the other approaches?
Actually there is another issue with centroids of sparse vectors. The problem is that they are usually significantly less sparse than the original data. For example, this increases computation costs. And it can yield vectors that are themselves actually atypical, because they have a different sparsity pattern. This effect is similar to using arithmetic means of discrete data: say the mean number of doors in a car is 3.4; yet obviously no car exists that actually has 3.4 doors. So in particular, there will be no car with a Euclidean distance of less than 0.4 to the centroid! So how "central" is the centroid then, really?
Sometimes it helps to use medoids instead of centroids, because they actually are proper objects of your data set.
Make sure you control such effects on your data!
A simple method to try would be to employ various machine-learning algorithms - and in particular, tree-based ones - on the distances from your centroids.
As mentioned in another answer (by Anony-Mousse), this won't necessarily provide you with good or usable answers, but it just might. Using an ML framework such as WEKA for this procedure will also help you estimate your accuracy in a more rigorous manner.
Here are the steps to take, using WEKA:
Generate a training set by finding a decent number of documents representing each of your classes (to get valid estimates, I'd recommend at least a few dozen per class).
Calculate the distance from each document to each of your centroids.
Generate a feature vector for each such document, composed of the distances from this document to the centroids. You can either use a single feature - the distance to the nearest centroid; or use all distances, if you'd like to try a more elaborate thresholding scheme. For example, if you chose the simpler method of using a single feature, the vector representing a document with a distance of 0.2 to the nearest centroid, belonging to class A would be: "0.2,A"
Save this set in ARFF or CSV format, load into WEKA, and try classifying, e.g. using a J48 tree.
The results would provide you with an overall accuracy estimation, with a detailed confusion matrix, and - of course - with a specific model, e.g. a tree, you can use for classifying additional documents.
These results can be used to iteratively improve the models and thresholds, by collecting additional training documents for problematic classes and either recreating the centroids or retraining the thresholds classifier.
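If you would rather stay in Python than WEKA, the same procedure can be sketched with scikit-learn (the document vectors, centroids, and labels below are random placeholders, not real data):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_distances
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholders: vectors for labelled documents, plus the trained class centroids
rng = np.random.default_rng(0)
docs = rng.random((300, 50))         # 300 labelled documents, 50-dimensional vectors
centroids = rng.random((4, 50))      # one centroid per trained class
labels = rng.integers(0, 4, 300)     # true class of each document

# One distance-to-centroid feature per class, as in step 3 above
features = cosine_distances(docs, centroids)

# The decision tree plays the role of the J48 thresholding step
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print(cross_val_score(tree, features, labels, cv=5).mean())   # accuracy estimate
```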

Ways to improve the accuracy of a Naive Bayes Classifier?

I am using a Naive Bayes Classifier to categorize several thousand documents into 30 different categories. I have implemented a Naive Bayes Classifier, and with some feature selection (mostly filtering useless words), I've gotten about a 30% test accuracy, with 45% training accuracy. This is significantly better than random, but I want it to be better.
I've tried implementing AdaBoost with NB, but it does not appear to give appreciably better results (the literature seems split on this: some papers say AdaBoost with NB doesn't give better results, others say it does). Do you know of any other extensions to NB that may possibly give better accuracy?
In my experience, properly trained Naive Bayes classifiers are usually astonishingly accurate (and very fast to train--noticeably faster than any classifier-builder I have ever used).
So when you want to improve classifier prediction, you can look in several places: tune your classifier (adjusting the classifier's tunable parameters); apply some sort of classifier combination technique (e.g. ensembling, boosting, bagging); or look at the data fed to the classifier--either add more data, improve your basic parsing, or refine the features you select from the data.
With regard to naive Bayesian classifiers, parameter tuning is limited; I recommend focusing on your data, i.e. the quality of your pre-processing and the feature selection.
I. Data Parsing (pre-processing)
I assume your raw data is something like a string of raw text for each data point, which you transform by a series of processing steps into a structured vector (1D array) per data point, such that each offset corresponds to one feature (usually a word) and the value in that offset corresponds to frequency.
Stemming: either manually or by using a stemming library? The popular open-source ones are Porter, Lancaster, and Snowball. For instance, if you have the terms programmer, program, programming, programmed in a given data point, a stemmer will reduce them to a single stem (probably program), so your term vector for that data point will have a value of 4 for the feature program, which is probably what you want (a short sketch of this step follows the list below).
Synonym finding: same idea as stemming--fold related words into a single word; a synonym finder can identify developer, programmer, coder, and software engineer and roll them into a single term.
Neutral words: words with similar frequencies across classes make poor features.
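As a small illustration of the stemming step (assuming NLTK's PorterStemmer is available; the token list is made up):

```python
from collections import Counter
from nltk.stem import PorterStemmer

tokens = ["programmer", "program", "programming", "programmed"]

stemmer = PorterStemmer()
counts = Counter(stemmer.stem(t) for t in tokens)
print(counts)   # related forms fold onto a shared stem (exact stems depend on the stemmer)
```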
II. Feature Selection
Consider a prototypical use case for NBCs: filtering spam. You can quickly see how it fails, and just as quickly see how to improve it. For instance, above-average spam filters have nuanced features like: frequency of words in all caps, frequency of words in the title, and the occurrence of exclamation points in the title. In addition, the best features are often not single words but, e.g., pairs of words or larger word groups.
III. Specific Classifier Optimizations
Instead of 30 classes, use a 'one-against-many' scheme--in other words, you begin with a two-class classifier (Class A and 'all else'), then the results in the 'all else' class are returned to the algorithm for classification into Class B and 'all else', etc. (a rough off-the-shelf analogue is sketched at the end of this section).
The Fisher Method (probably the most common way to optimize a Naive Bayes classifier). To me, Fisher is about normalizing (more correctly, standardizing) the input probabilities. An NBC uses the feature probabilities to construct a 'whole-document' probability. The Fisher Method calculates the probability of a category for each feature of the document, then combines these feature probabilities and compares that combined probability with the probability of a random set of features.
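The exact cascade described above has to be coded by hand, but as a rough off-the-shelf analogue, here is a sketch (assuming scikit-learn; the 20 Newsgroups corpus is just a stand-in dataset) of wrapping Naive Bayes in a one-vs-rest scheme:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Stand-in corpus (downloads on first use); substitute your own 30-class documents
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# One binary NB per class ("class A vs. all else") instead of a single multiway model
model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      OneVsRestClassifier(MultinomialNB()))
model.fit(train.data, train.target)
print(model.score(test.data, test.target))
```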
I would suggest using an SGDClassifier as in this and tuning it in terms of regularization strength.
Also try to tune the TF-IDF formula you're using by tuning the parameters of TfidfVectorizer.
I usually see that for text classification problems SVM or Logistic Regression, when trained one-versus-all, outperforms NB. As you can see in this nice article by Stanford people, for longer documents SVM outperforms NB. The code for the paper which uses a combination of SVM and NB (NBSVM) is here.
Second, tune your TF-IDF formula (e.g. sublinear tf, smooth_idf); a short sketch combining several of these tips follows this list.
Normalize your samples with l2 or l1 normalization (the default in TfidfVectorizer), because it compensates for different document lengths.
A Multilayer Perceptron usually gets better results than NB or SVM because of the non-linearity it introduces, which is inherent to many text classification problems. I have implemented a highly parallel one using Theano/Lasagne which is easy to use and downloadable here.
Try to tune your l1/l2/elasticnet regularization. It makes a huge difference in SGDClassifier/SVM/Logistic Regression.
Try to use n-grams, which are configurable in TfidfVectorizer.
If your documents have structure (e.g. titles), consider using different features for different parts. For example, add title_word1 to your document if word1 occurs in the title of the document.
Consider using the length of the document as a feature (e.g. number of words or characters).
Consider using meta information about the document (e.g. time of creation, author name, url of the document, etc.).
Recently Facebook published their FastText classification code, which performs very well across many tasks; be sure to try it.
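To make several of these tips concrete, a minimal scikit-learn sketch (the parameter values are just starting points, not tuned recommendations):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy stand-in for your 30-category corpus
docs = ["cheap pills buy now", "meeting moved to friday", "win a free prize today",
        "quarterly report attached", "free offer limited time", "lunch at noon?"]
labels = [1, 0, 1, 0, 1, 0]

model = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, smooth_idf=True,   # tuned tf-idf formula
                    ngram_range=(1, 2),                    # unigrams + bigrams
                    norm="l2"),                            # compensates for document length
    SGDClassifier(penalty="elasticnet",                    # linear SVM (hinge loss) by default
                  alpha=1e-4, l1_ratio=0.15))              # regularization strength, l1/l2 mix

print(cross_val_score(model, docs, labels, cv=3).mean())
```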
Using Laplacian Correction along with AdaBoost.
In AdaBoost, a weight is first assigned to each data tuple in the training dataset. The initial weights are set using the init_weights method, which initializes each weight to 1/d, where d is the size of the training data set.
Then, a generate_classifiers method is called, which runs k times, creating k instances of the Naïve Bayes classifier. These classifiers are then weighted, and the test data is run on each classifier. The sum of the weighted "votes" of the classifiers constitutes the final classification.
Improving the Naive Bayes classifier for general cases
Take the logarithm of your probabilities as input features
We change to log probability space because we compute the overall probability by multiplying many probabilities, so the result becomes very small. Switching to log probability features tackles this underflow problem.
Remove correlated features.
Naive Bayes works based on the assumption of independence between features. When features are correlated, i.e. one feature depends on others, this assumption fails.
More about correlation can be found here
Work with enough data, not necessarily huge amounts of data
Naive Bayes requires less data than logistic regression, since it only needs data to understand the probabilistic relationship of each attribute in isolation with the output variable, not the interactions.
Check for the zero-frequency problem
If the test data set has a zero-frequency issue (a feature value never seen in training), apply a smoothing technique such as Laplace correction to predict the class of the test data set.
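A small sketch of the last two points (assuming scikit-learn, whose MultinomialNB works in log space internally and whose alpha parameter is the Laplace/Lidstone correction):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny word-count matrix: the word in column 2 never appears in class 0's training data
X = np.array([[3, 0, 0],
              [4, 1, 0],
              [0, 2, 5],
              [1, 3, 4]])
y = np.array([0, 0, 1, 1])

# alpha=1.0 is the Laplace correction: the unseen word no longer forces a zero probability
model = MultinomialNB(alpha=1.0).fit(X, y)

# The classifier sums log probabilities internally, which avoids numeric underflow
print(model.predict_log_proba([[2, 0, 1]]))
```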
More on this is well described in the following posts:
machinelearningmastery site post
Analyticvidhya site post
Keeping the n size small also helps NB give high-accuracy results; at its core, as the n size increases, its accuracy degrades.
Select features which have low correlation between them, and try using different combinations of features.
