In Mahout, is there any method for data classification with Naive Bayes?

I am still a newbie at using Mahout, and am currently studying Naive Bayes for data classification.
As far as I know, Mahout has two related programs: trainnb, which trains the Bayes model, and testnb, which evaluates the model. Under the current implementation of Mahout, is there a way to apply the model to classify new data with just a simple command? Or do I need to code an implementation from scratch (e.g. use the model as a base to calculate the likelihood for each possible class, then compute and return the class with the highest value) in Java?
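For illustration, here is a minimal numpy sketch of the decision rule described in that parenthetical, i.e. generic multinomial Naive Bayes scoring. This is not Mahout's API; the priors, likelihoods, and counts are made-up assumptions:

# Sketch of the decision rule: score each class as
# log P(c) + sum_i x_i * log P(w_i | c), and return the argmax.
import numpy as np

log_prior = np.log(np.array([0.5, 0.5]))             # assumed per-class priors
log_likelihood = np.log(np.array([[0.7, 0.2, 0.1],   # P(word | class 0), assumed
                                  [0.1, 0.3, 0.6]])) # P(word | class 1), assumed
x = np.array([3, 0, 1])                              # term counts of a new document

scores = log_prior + log_likelihood @ x              # one score per class
print("predicted class:", np.argmax(scores))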

Related

How to code Naïve Bayes using Information Gain (IG)

I read in a paper that Naive Bayes using IG is the best model for text classification where the dataset is small and has few positives. However, I'm not too sure how to code this specific model in Python. Would this be using TF or scikit-learn and then adjusting a parameter?
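For what it's worth, a minimal sketch of one common reading of "Naive Bayes with IG" in scikit-learn: select the top features by information gain (mutual information) and then fit a multinomial Naive Bayes. The toy texts, labels, and the value of k are assumptions for illustration:

# Hypothetical sketch: information-gain feature selection + Naive Bayes.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["spam spam buy now", "meeting at noon", "cheap pills buy", "lunch tomorrow"]
labels = [1, 0, 1, 0]

# mutual_info_classif computes the mutual information (information gain)
# of each feature with respect to the class.
model = make_pipeline(
    CountVectorizer(),                      # term-frequency features
    SelectKBest(mutual_info_classif, k=5),  # keep the 5 highest-IG features (k is an assumption)
    MultinomialNB(),
)
model.fit(texts, labels)
print(model.predict(["buy cheap spam"]))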

Unsupervised Naive Bayes - how does it work?

So, as I understand it, to implement unsupervised Naive Bayes we assign a random probability to each class for each instance, then run the data through the normal Naive Bayes algorithm. I understand that, with each iteration, the random estimates get better, but I can't for the life of me figure out exactly how that works.
Anyone care to shed some light on the matter?
The variant of Naive Bayes in unsupervised learning that I've seen is basically an application of a Gaussian Mixture Model (GMM), typically fitted with Expectation Maximization (EM), to determine the clusters in the data.
In this setting, it is assumed that the data can be classified, but the classes are hidden. The problem is to determine the most probable classes by fitting a Gaussian distribution per class. The Naive Bayes assumption defines the particular probabilistic model to use, in which the attributes are conditionally independent given the class.
From "Unsupervised naive Bayes for data clustering with mixtures of
truncated exponentials" paper by Jose A. Gamez:
From the previous setting, probabilistic model-based clustering is modeled as a mixture of models (see e.g. (Duda et al., 2001)), where the states of the hidden class variable correspond to the components of the mixture (the number of clusters), and the multinomial distribution is used to model discrete variables while the Gaussian distribution is used to model numeric variables. In this way we move to a problem of learning from unlabeled data and usually the EM algorithm (Dempster et al., 1977) is used to carry out the learning task when the graphical structure is fixed and structural EM (Friedman, 1998) when the graphical structure also has to be discovered (Pena et al., 2000). In this paper we focus on the simplest model with fixed structure, the so-called Naive Bayes structure (fig. 1) where the class is the only root variable and all the attributes are conditionally independent given the class.
See also this discussion on CV.SE.
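To make the correspondence concrete, here is a minimal illustrative sketch (not the paper's method) using scikit-learn's GaussianMixture; covariance_type='diag' encodes exactly the conditional-independence (Naive Bayes) assumption for numeric attributes, and the synthetic data is an assumption:

# "Unsupervised naive Bayes" as a Gaussian mixture fitted with EM.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
# Two hidden classes, synthetic numeric attributes (assumed data for the example).
X = np.vstack([rng.normal(0, 1, size=(100, 3)), rng.normal(4, 1, size=(100, 3))])

gmm = GaussianMixture(n_components=2, covariance_type='diag', random_state=0)
gmm.fit(X)                       # EM: iteratively re-estimate class probabilities and parameters
print(gmm.predict_proba(X[:3]))  # soft class assignments, refined over the EM iterations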

How to use different dataset for scikit and NLTK?

I am trying to use the built-in Naive Bayes classifiers of scikit-learn and NLTK on raw data I have. The data is a set of tab-separated rows, each having a label, a paragraph, and some other attributes.
I am interested in classifying the paragraphs.
I need to convert this data into a format suitable for the built-in classifiers of scikit-learn/NLTK.
I want to apply Gaussian, Bernoulli, and multinomial Naive Bayes to all paragraphs.
Question 1:
For scikit-learn, the example given imports the iris data. I checked the iris data; it has precalculated feature values from the dataset. How can I convert my data into such a format and directly call the Gaussian function? Is there any standard way of doing so?
Question 2:
For NLTK, what should the input to the NaiveBayesClassifier.classify function be? Is it a dict with boolean values? How can it be made multinomial or Gaussian?
Regarding question 2: nltk.NaiveBayesClassifier.classify expects a so-called 'featureset'. A featureset is a dictionary with feature names as keys and feature values as values, e.g. {'word1': True, 'word2': True, 'word3': False}. NLTK's Naive Bayes classifier cannot be used as a multinomial approach. However, you can install scikit-learn and use the nltk.classify.scikitlearn wrapper module to deploy scikit-learn's multinomial classifier.
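A minimal sketch of both points, with toy featuresets made up for illustration:

# NLTK featuresets and the scikit-learn wrapper (toy data).
from nltk.classify import NaiveBayesClassifier
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import MultinomialNB

# A featureset is a dict of feature name -> feature value.
train = [
    ({'word1': True, 'word2': True, 'word3': False}, 'pos'),
    ({'word1': False, 'word2': False, 'word3': True}, 'neg'),
]
nb = NaiveBayesClassifier.train(train)   # NLTK's own (boolean-style) Naive Bayes
print(nb.classify({'word1': True, 'word2': False, 'word3': False}))

# Multinomial NB via the wrapper: use counts instead of booleans as feature values.
mnb = SklearnClassifier(MultinomialNB()).train(
    [({'word1': 2, 'word2': 1}, 'pos'), ({'word3': 3}, 'neg')]
)
print(mnb.classify({'word1': 1, 'word3': 1}))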

10 fold cross validation with weka api

How can I build a classification model with 10-fold cross-validation using the Weka API?
Should I cross-validate the model first, e.g.
evaluation.crossValidateModel(classifier, trainingSet, 10, new Random(1))
and then build a new classifier based on that training set, e.g.
NaiveBayes nb2 = new NaiveBayes();
nb2.buildClassifier(train);
and then save and use this model (nb2)?
You are mixing concepts. Cross-validation is used to estimate the performance of learning techniques on a dataset; it evaluates, but does not itself produce a final model. The common procedure is to perform CV over the whole dataset, usually with 10 folds, to see which learning technique obtains the better performance. You then use that technique to learn a model over the whole dataset (with buildClassifier) for future predictions.
http://en.wikipedia.org/wiki/Cross-validation_(statistics)
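For illustration, the same procedure expressed in scikit-learn (the thread itself uses Weka): cross-validation is evaluation only, and the final model is then trained on all of the data:

# CV estimates performance; the final model is fit on the whole dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

scores = cross_val_score(GaussianNB(), X, y, cv=10)  # 10-fold CV, evaluation only
print(scores.mean())

final_model = GaussianNB().fit(X, y)                 # the model actually used for predictions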

How do I update a trained model (weka.classifiers.functions.MultilayerPerceptron) with new training data in Weka?

I would like to load a model I trained before and then update this model with new training data. But I found this task hard to accomplish.
I have learnt from Weka Wiki that
Classifiers implementing the weka.classifiers.UpdateableClassifier interface can be trained incrementally.
However, the regression model I trained uses the weka.classifiers.functions.MultilayerPerceptron classifier, which does not implement UpdateableClassifier.
Then I checked the Weka API and it turns out that no regression classifier implements UpdateableClassifier.
How can I train a regression model in Weka, and then update the model later with new training data after loading the model?
I have some data mining experience in Weka as well as in scikit-learn and R, and updatable regression models do not exist in Weka or scikit-learn as far as I know. Some R libraries, however, do support updating regression models (take a look at this linear regression model for example: http://stat.ethz.ch/R-manual/R-devel/library/stats/html/update.html), so if you are free to switch data mining tools this might help you out.
If you need to stick to Weka, then I'm afraid you would probably need to implement such a model yourself, but since I'm not a complete Weka expert, please check with the folks on the Weka mailing list (http://weka.wikispaces.com/Weka+Mailing+List).
The SGD classifier implementation in Weka supports multiple loss functions. Among them are two loss functions that are meant for linear regression, viz. the epsilon-insensitive and Huber loss functions.
Therefore one can train a linear regression with SGD as long as either of these two loss functions is used to minimize the training error; since weka.classifiers.functions.SGD implements UpdateableClassifier, the resulting model can then be updated incrementally with new training data.
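For comparison, a minimal sketch of the same idea in scikit-learn terms (the answer above concerns Weka; the synthetic data here is an assumption): an SGD-trained linear regressor with an epsilon-insensitive or Huber loss, updated incrementally with new batches of training data:

# SGD-based linear regression, trainable incrementally via partial_fit.
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
w = np.array([1.0, 2.0, 3.0])                 # assumed true coefficients
X1, X2 = rng.rand(100, 3), rng.rand(50, 3)    # initial data, then new data
y1, y2 = X1 @ w, X2 @ w

reg = SGDRegressor(loss='epsilon_insensitive')  # or loss='huber'
reg.partial_fit(X1, y1)                         # initial training
reg.partial_fit(X2, y2)                         # later update with new data
print(reg.predict(X2[:3]))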
