Classify a text with weka after classifier has been trained - machine-learning

I am a beginner with Weka.
I have managed to import a dataset from disk (one folder per category, with all texts belonging to that category inside the folder), apply StringToWordVector with a tokenizer, and train a Naive Bayes Multinomial classifier... The code is below (it is C#, but Java is fine of course).
However, I can hardly find information on how to use the classifier in a project. Say a user inputs a text of unknown category: how can I apply the classifier to this text and infer the category it belongs to? (See the "// what to do here" comment in the code below.)
Any help would be greatly appreciated ;-)
Thanks in advance
Julien
string filepath = @"C:\Users\Julien\Desktop\Meal\";
ClassificationDatasetHelper classHelper = new ClassificationDatasetHelper();
weka.core.converters.TextDirectoryLoader tdl = new weka.core.converters.TextDirectoryLoader();
tdl.setDirectory(new java.io.File(filepath));
tdl.setCharSet("UTF-8");
weka.core.Instances insts = tdl.getDataSet();
weka.filters.unsupervised.attribute.StringToWordVector swv = new weka.filters.unsupervised.attribute.StringToWordVector();
swv.setInputFormat(insts);
swv.setDoNotOperateOnPerClassBasis(false);
swv.setOutputWordCounts(true);
swv.setWordsToKeep(1000);
swv.setIDFTransform(true);
swv.setMinTermFreq(1);
swv.setDoNotOperateOnPerClassBasis(false);
swv.setPeriodicPruning(-1);
weka.core.tokenizers.NGramTokenizer tokenizer = new weka.core.tokenizers.NGramTokenizer();
tokenizer.setNGramMinSize(2);
tokenizer.setNGramMaxSize(2);
swv.setTokenizer(tokenizer);
insts = weka.filters.Filter.useFilter(insts, swv);
insts.setClassIndex(0);
weka.classifiers.Classifier cl = new weka.classifiers.bayes.NaiveBayesMultinomial();
int trainSize = insts.numInstances() * percentSplit / 100;
int testSize = insts.numInstances() - trainSize;
weka.core.Instances train = new weka.core.Instances(insts, 0, trainSize);
cl.buildClassifier(train);
string s = "Try to classify this text";
weka.core.Instance instanceToClassify = new weka.core.Instance();
// what to do here
// ???
double predictedClass = cl.classifyInstance(instanceToClassify);
Thanks

The best place to learn how to use Weka in your Java app is in the official Weka wiki.
https://waikato.github.io/weka-wiki/use_weka_in_your_java_code/
Basically, you provide a new dataset (the classifier will ignore the category attribute) and ask it to label each instance for you, like this:
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import weka.core.Instances;
...
// load unlabeled data
Instances unlabeled = new Instances(
    new BufferedReader(
        new FileReader("/some/where/unlabeled.arff")));
// set class attribute
unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
// create copy
Instances labeled = new Instances(unlabeled);
// label instances ('tree' is the classifier you trained earlier, e.g. 'cl' above)
for (int i = 0; i < unlabeled.numInstances(); i++) {
    double clsLabel = tree.classifyInstance(unlabeled.instance(i));
    labeled.instance(i).setClassValue(clsLabel);
}
// save labeled data
BufferedWriter writer = new BufferedWriter(
    new FileWriter("/some/where/labeled.arff"));
writer.write(labeled.toString());
writer.newLine();
writer.flush();
writer.close();
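Another option for the single-text case from the original question (the "// what to do here" part) is to wrap the StringToWordVector filter and the classifier in a FilteredClassifier, train it on the raw string data, and then build a one-row dataset with the same header for the user's text. Below is a rough, untested sketch along those lines (Weka 3.7+ API); it assumes the raw TextDirectoryLoader output has the document text as its first attribute and the category as its class attribute:
import weka.classifiers.meta.FilteredClassifier;
import weka.core.DenseInstance;
import weka.core.Instances;
...
// 'insts' here is assumed to be the RAW output of TextDirectoryLoader
// (string attribute + class), i.e. BEFORE StringToWordVector was applied,
// and 'swv' is the configured filter from above.
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(swv);                                        // vectorization happens inside the classifier
fc.setClassifier(new weka.classifiers.bayes.NaiveBayesMultinomial());
insts.setClassIndex(insts.numAttributes() - 1);           // class attribute from TextDirectoryLoader
fc.buildClassifier(insts);
// Build a one-row dataset with the same structure for the user's text.
Instances unlabeled = new Instances(insts, 0);            // copy the header only
DenseInstance inst = new DenseInstance(unlabeled.numAttributes());
inst.setDataset(unlabeled);
inst.setValue(0, "Try to classify this text");            // attribute 0 holds the document text
inst.setClassMissing();
unlabeled.add(inst);
double predicted = fc.classifyInstance(unlabeled.instance(0));
String category = unlabeled.classAttribute().value((int) predicted);
System.out.println("Predicted category: " + category);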

Related

Different predictions for the same data

I use Deeplearning4j to classify equipment names. I labelled ~50,000 items with 495 classes and use this data to train the neural network.
That is, as input I provide a set of 50,000 vectors consisting of 0s and 1s, and the expected class for each vector (0 to 494).
I use the IrisClassifier example as the basis for the code.
I saved the trained model to a file, and now I can use it to predict the class of equipment.
As a check, I tried predicting on the same data (50,000 items) that I used for training and compared the predictions with my labels for this data.
The result turned out to be very good; the error of the neural network is ~1%.
After that, I tried predicting on just the first 100 vectors out of these 50,000 records, with the remaining 49,900 removed.
For these 100 vectors, the predictions differ from the predictions for the same 100 vectors when they are part of the full 50,000.
That is, the less data I provide to the trained model, the greater the prediction error.
Even for exactly the same vectors.
Why does this happen?
My code.
Training:
//First: get the dataset using the record reader. CSVRecordReader handles loading/parsing
int numLinesToSkip = 0;
char delimiter = ',';
RecordReader recordReader = new CSVRecordReader(numLinesToSkip,delimiter);
recordReader.initialize(new FileSplit(new File(args[0])));
//Second: the RecordReaderDataSetIterator handles conversion to DataSet objects, ready for use in neural network
int labelIndex = 3331;
int numClasses = 495;
int batchSize = 4000;
// DataSetIterator iterator = new RecordReaderDataSetIterator(recordReader,batchSize,labelIndex,numClasses);
DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize).classification(labelIndex, numClasses).build();
List<DataSet> trainingData = new ArrayList<>();
List<DataSet> testData = new ArrayList<>();
while (iterator.hasNext()) {
    DataSet allData = iterator.next();
    allData.shuffle();
    SplitTestAndTrain testAndTrain = allData.splitTestAndTrain(0.8); //Use 80% of data for training
    trainingData.add(testAndTrain.getTrain());
    testData.add(testAndTrain.getTest());
}
DataSet allTrainingData = DataSet.merge(trainingData);
DataSet allTestData = DataSet.merge(testData);
//We need to normalize our data. We'll use NormalizeStandardize (which gives us mean 0, unit variance):
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(allTrainingData); //Collect the statistics (mean/stdev) from the training data. This does not modify the input data
normalizer.transform(allTrainingData); //Apply normalization to the training data
normalizer.transform(allTestData); //Apply normalization to the test data. This is using statistics calculated from the *training* set
long seed = 6;
int firstHiddenLayerSize = labelIndex/6;
int secondHiddenLayerSize = firstHiddenLayerSize/4;
//log.info("Build model....");
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
.seed(seed)
.activation(Activation.TANH)
.weightInit(WeightInit.XAVIER)
.updater(new Sgd(0.1))
.l2(1e-4)
.list()
.layer(new DenseLayer.Builder().nIn(labelIndex).nOut(firstHiddenLayerSize)
.build())
.layer(new DenseLayer.Builder().nIn(firstHiddenLayerSize).nOut(secondHiddenLayerSize)
.build())
.layer( new OutputLayer.Builder(LossFunctions.LossFunction.NEGATIVELOGLIKELIHOOD)
.activation(Activation.SOFTMAX) //Override the global TANH activation with softmax for this layer
.nIn(secondHiddenLayerSize).nOut(numClasses).build())
.build();
//run the model
MultiLayerNetwork model = new MultiLayerNetwork(conf);
model.init();
//record score once every 100 iterations
model.setListeners(new ScoreIterationListener(100));
for (int i = 0; i < 5000; i++) {
    model.fit(allTrainingData);
}
//evaluate the model on the test set
Evaluation eval = new Evaluation(numClasses);
INDArray output = model.output(allTestData.getFeatures());
eval.eval(allTestData.getLabels(), output);
log.info(eval.stats());
// Save the Model
File locationToSave = new File(args[1]);
model.save(locationToSave, false);
Prediction:
// Open the network file
File locationToLoad = new File(args[0]);
MultiLayerNetwork model = MultiLayerNetwork.load(locationToLoad, false);
model.init();
// First: get the dataset using the record reader. CSVRecordReader handles loading/parsing
int numLinesToSkip = 0;
char delimiter = ',';
// Data to predict
CSVRecordReader recordReader = new CSVRecordReader(numLinesToSkip, delimiter); //skip no lines at the top - i.e. no header
recordReader.initialize(new FileSplit(new File(args[1])));
//Second: the RecordReaderDataSetIterator handles conversion to DataSet objects, ready for use in neural network
int batchSize = 4000;
DataSetIterator iterator = new RecordReaderDataSetIterator.Builder(recordReader, batchSize).build();
List<DataSet> dataSetList = new ArrayList<>();
while (iterator.hasNext()) {
    DataSet allData = iterator.next();
    dataSetList.add(allData);
}
DataSet dataSet = DataSet.merge(dataSetList);
DataNormalization normalizer = new NormalizerStandardize();
normalizer.fit(dataSet);
normalizer.transform(dataSet);
// Now use it to classify some data
INDArray output = model.output(dataSet.getFeatures());
// Save result
BufferedWriter writer = new BufferedWriter(new FileWriter(args[2], true));
for (int i = 0; i < output.rows(); i++) {
    writer
        .append(output.getRow(i).argMax().toString())
        .append(" ")
        .append(String.valueOf(i))
        .append(" ")
        .append(output.getRow(i).toString())
        .append('\n');
}
writer.close();
Ensure you save the normalizer as follows alongside the model:
import org.nd4j.linalg.dataset.api.preprocessor.serializer.NormalizerSerializer;
NormalizerSerializer SUT = NormalizerSerializer.getDefault();
SUT.write(normalizer,new File("outputFile.bin"));
NormalizerStandardize restored = SUT.restore(new File("outputFile.bin"));
You need to use the same normalizer (i.e. the statistics collected on the training data) for both training and prediction; otherwise the wrong statistics are used when transforming your data.
The way you are currently doing it, the prediction data ends up looking very different from the training data, which is why you get such different results.
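In the prediction code above, that means restoring the serialized normalizer and applying it, instead of fitting a new NormalizerStandardize on the prediction data. A rough sketch (the file name "normalizer.bin" is just a placeholder for wherever you saved it during training):
// Replaces the "new NormalizerStandardize(); fit(); transform()" block in the prediction code.
NormalizerSerializer serializer = NormalizerSerializer.getDefault();
NormalizerStandardize normalizer = serializer.restore(new File("normalizer.bin")); // placeholder path
normalizer.transform(dataSet);   // apply the training-set mean/stdev; do NOT call fit() here
INDArray output = model.output(dataSet.getFeatures());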

How to see failed machine learning records

I am using the following code to create my machine learning model. The accuracy of the model is 0.76. I am just curious to know which records from my test data were misclassified. Is there a way I can see those records?
// 1. Load the dataset for training and testing
var trainData = ctx.Data.LoadFromTextFile<SentimentData>(trainDataPath, hasHeader: true);
var testData = ctx.Data.LoadFromTextFile<SentimentData>(testDataPath, hasHeader: true);
// 2. Build a transformer/estimator to transform input data so that the Machine Learning algorithm can understand it
IEstimator<ITransformer> estimator = ctx.Transforms.Text.FeaturizeText("Features", nameof(SentimentData.Text));
// 3. - set the training algorithm and create the pipeline for model builder
var trainer = ctx.BinaryClassification.Trainers.SdcaLogisticRegression();
var trainingPipeline = estimator.Append(trainer);
// 4. - Train the model
var trainedModel = trainingPipeline.Fit(trainData);
// 5. - Perform the predictions on the test data
var predictions = trainedModel.Transform(testData);
// 6. - Evaluate the model
var metrics = ctx.BinaryClassification.Evaluate(data: predictions);
By using the GetColumn and CreateEnumerable methods, you can find the data that the model didn't predict correctly.
After you get the metrics, use the GetColumn method on the predictions from the test data set to get the original label values. Then, use the CreateEnumerable method to get the predictions that hold the predicted values. Optionally, you can get the sentiment text as well.
var originalLabels = predictions.GetColumn<bool>("Label").ToArray();
var sentimentText = predictions.GetColumn<string>(nameof(SentimentData.SentimentText)).ToArray();
var predictedLabels = ctx.Data.CreateEnumerable<SentimentPrediction>(predictions, reuseRowObject: false).ToArray();
After getting the data, just loop through one of them (I did a count of the original labels) and you can access the data at each iteration. From there you can check if the actual label doesn't equal the predicted value to only print out the values that the model didn't get correctly.
for (int i = 0; i < originalLabels.Count(); i++)
{
    string outputText = String.Empty;
    if (originalLabels[i] != predictedLabels[i].Prediction)
    {
        outputText = $"Text - {sentimentText[i]} | ";
        outputText += $"Original - {originalLabels[i]} | ";
        outputText += $"Predicted - {predictedLabels[i].Prediction}";
        Console.WriteLine(outputText);
    }
}
With that you have the data that you need. :)
Hope that helps!
From your comment, I believe the method you are looking for can be found in the keras library. The method should be keras.models.predict_classes as found on their documentation page.
This will provide you with an array of predicted outputs, which you can then compare to the ground truths. Visit the documentation to see the parameters.
Hope this helps!

Predicting the "no class" / unrecognised class in Weka Machine Learning

I am using Weka 3.7 to classify text documents based on their content. I have a set of text files in folders and they all belong to a certain category.
Category A: 100 txt files
Category B: 100 txt files
...
Category X: 100 txt files
I want to predict if a document falls into one of the categories A-X, OR if it falls in the category UNRECOGNISED (for all other documents).
I am getting the total set of Instances programmatically like this:
private Instances getTotalSet(){
    ArrayList<Attribute> listOfAttributes = new ArrayList<Attribute>(2);
    Attribute classAttribute = getClassAttribute();
    listOfAttributes.add(classAttribute);
    listOfAttributes.add(new Attribute("text", (ArrayList) null));
    Instances totalSet = new Instances("Rel", listOfAttributes, 2);
    totalSet.setClassIndex(1);
    File[] classNamesFolders = new File(path).listFiles((FileFilter) FileFilterUtils.directoryFileFilter());
    for (File folder : classNamesFolders) {
        if (folder.getName().equals("UNRECOGNISED")) {
            continue;
        }
        System.out.println("Adding " + folder.getName());
        // all txt files in that subfolder
        for (File file : FileUtils.listFiles(folder.getAbsoluteFile(), new SuffixFileFilter(".txt"), DirectoryFileFilter.DIRECTORY)) {
            try {
                Instance instance = new DenseInstance(2);
                instance.setValue(listOfAttributes.get(0), folder.getName());
                instance.setValue(listOfAttributes.get(1), FileUtils.readFileToString(file.getAbsoluteFile()));
                totalSet.add(instance);
            } catch (IOException e) {
                System.out.println("Couldn't add " + e);
            }
        }
    }
    return totalSet;
}
I am using a RandomForest classifier in this case (but that shouldn't make a difference for my question):
RandomForest rf = new RandomForest();
rf.setNumTrees(500);
rf.setMaxDepth(25);
rf.setSeed(1);
System.out.println("Building random forest with " + rf.getNumTrees() + " trees");
rf.buildClassifier(train);
When I make a prediction, I can see in which category the new document should fall, but how can I find out if the document does not belong in any category? While making the prediction I can access the
double pred = rf.classifyInstance(test.instance(i));
double dist[] = rf.distributionForInstance(test.instance(i));
distribution for the instance, but how can I tell whether a document should not be recognised at all and should instead get the category UNRECOGNISED?
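One common heuristic (a sketch, not a complete solution) is to threshold that class distribution: if even the most probable class gets only a low probability, treat the document as UNRECOGNISED. The 0.5 cutoff below is an arbitrary placeholder that would have to be tuned on held-out data:
double[] dist = rf.distributionForInstance(test.instance(i));
int best = weka.core.Utils.maxIndex(dist);       // index of the most probable class
String category;
if (dist[best] < 0.5) {                          // placeholder cutoff; tune on held-out data
    category = "UNRECOGNISED";
} else {
    category = test.classAttribute().value(best);
}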

how to train a maxent classifier

[Project stack: Java, OpenNLP, Elasticsearch (datastore), twitter4j to read data from Twitter]
I intend to use a maxent classifier to classify tweets. I understand that the initial step is to train the model. From the documentation I found that there is a GISTrainer-based train method to train the model. I have managed to put together a simple piece of code which makes use of OpenNLP's maxent classifier to train the model and predict the outcome.
I have used two files, positive.txt and negative.txt, to train the model.
Contents of positive.txt
positive This is good
positive This is the best
positive This is fantastic
positive This is super
positive This is fine
positive This is nice
Contents of negative.txt
negative This is bad
negative This is ugly
negative This is the worst
negative This is worse
negative This sucks
And the java methods below generate the outcome.
@Override
public void trainDataset(String source, String destination) throws Exception {
    File[] inputFiles = FileUtil.buildFileList(new File(source)); // trains both positive.txt and negative.txt
    File modelFile = new File(destination);
    Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
    CategoryDataStream ds = new CategoryDataStream(inputFiles, tokenizer);
    int cutoff = 5;
    int iterations = 100;
    BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
    DoccatModel model = DocumentCategorizerME.train("en", ds, cutoff, iterations, bowfg);
    model.serialize(new FileOutputStream(modelFile));
}
@Override
public void predict(String text, String modelFile) {
    InputStream modelStream = null;
    try {
        Tokenizer tokenizer = SimpleTokenizer.INSTANCE;
        String[] tokens = tokenizer.tokenize(text);
        modelStream = new FileInputStream(modelFile);
        DoccatModel model = new DoccatModel(modelStream);
        BagOfWordsFeatureGenerator bowfg = new BagOfWordsFeatureGenerator();
        DocumentCategorizer categorizer = new DocumentCategorizerME(model, bowfg);
        double[] probs = categorizer.categorize(tokens);
        if (null != probs && probs.length > 0) {
            for (int i = 0; i < probs.length; i++) {
                System.out.println("double[] probs index " + i + " value " + probs[i]);
            }
        }
        String label = categorizer.getBestCategory(probs);
        System.out.println("label " + label);
        int bestIndex = categorizer.getIndex(label);
        System.out.println("bestIndex " + bestIndex);
        double score = probs[bestIndex];
        System.out.println("score " + score);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (null != modelStream) {
            try {
                modelStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
public static void main(String[] args) {
    try {
        String outputModelPath = "/home/**/sd-sentiment-analysis/models/trainPostive";
        String source = "/home/**/sd-sentiment-analysis/sd-core/src/main/resources/datasets/";
        MaximunEntropyClassifier me = new MaximunEntropyClassifier();
        me.trainDataset(source, outputModelPath);
        me.predict("This is bad", outputModelPath);
    } catch (Exception e) {
        e.printStackTrace();
    }
}
I have the following questions.
1) How do I iteratively train a model? Also, how do I add new sentences/words to the model? Is there a specific format for the data file? I found that the file needs to have a minimum of two words separated by a tab. Is my understanding valid?
2) Are there any publicly available data sets that I can use to train the model? I found some sources for movie reviews. The project I'm working on involves not just movie reviews but also other things such as product reviews, brand sentiments etc.
3) This helps to an extent. Is there a working example somewhere publicly available? I couldn't find the documentation for maxent.
Please help me out. I am kinda blocked on this.
1) You can store the samples in a database. I used Accumulo once for this. Then, at some interval, you rebuild the model and reprocess your data.
2) The format is: category name, a space, then the sample, ending with a newline. No tabs.
3) It sounds like you want to combine general sentiment with a topic or entity. You could use a name finder or just a regex to find the entity, or add the entity to your class labels for the doccat (include a product name etc.), but then your samples would have to be very specific.
AFAIK, you have to completely retrain a MaxEnt model if you want to add new training samples. It cannot be done incrementally on-line.
The default input format for opennlp maxent is a textual file where each line represents a single sample.
A sample is composed of tokens (features) delimited by whitespace. During training, the first token represents the outcome.
Take a look at my minimal working example here:
Training models using openNLP maxent

Getting classification result from mahout

Finally I am able to train a Mahout classifier; now my problem is how to get the target category for my input document.
What is the process of getting the target category for my text documents?
First, you have to vectorize the text document into a RandomAccessSparseVector.
Some sample code for your reference:
Vector vector = new RandomAccessSparseVector(FEATURES);
FeatureExtractor fe = new FeatureExtractor();
HashSet<String> fs = fe.extract(text);
for (String s : fs) {
    int index = dictionary.get(s);
    vector.setQuick(index, frequency.get(index));
}
Then, use the Classifier.classify(Vector) to get the result.
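To turn that result into a category name, pick the index with the highest score and map it back to your label dictionary. A rough sketch, assuming 'classifier' is a trained AbstractVectorClassifier (so classifyFull is available) and 'categoryNames' is a hypothetical list mapping label indices back to the category names used during training:
// Sketch: the highest-scoring index is the predicted category.
Vector scores = classifier.classifyFull(vector);   // one score per category
int bestIndex = scores.maxValueIndex();
String targetCategory = categoryNames.get(bestIndex);
double confidence = scores.get(bestIndex);
System.out.println(targetCategory + " (" + confidence + ")");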
