Get prediction class as an integer in a Weka classifier

I used the following code to implement activity recognition with the Weka library. I use a Random Forest classifier on a multi-class problem: the model is trained on a training dataset and then used to classify a test dataset that contains no class labels. The predicted class comes back as a double value. How can I get the correct integer class values? Can someone help me with that?
Also, when I cast the doubles to int I always get the same integer value (0), because the predicted doubles are between 0 and 1.
// load the training data
BufferedReader br = new BufferedReader(new FileReader("/home/thamali/Desktop/WekaProject/output.arff"));
Instances trainData = new Instances(br);
trainData.setClassIndex(trainData.numAttributes() - 1);
br.close();
// train the random forest
RandomForest rf = new RandomForest();
rf.setNumTrees(23);
rf.setMaxDepth(18);
rf.buildClassifier(trainData);
// load the unlabelled test data and classify the first instance
BufferedReader br1 = new BufferedReader(new FileReader("/home/thamali/Desktop/WekaProject/testData.arff"));
Instances testData = new Instances(br1);
testData.setClassIndex(testData.numAttributes() - 1);
br1.close();
double value = rf.classifyInstance(testData.instance(0));
System.out.print(value);
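In case it is useful: classifyInstance returns the zero-based index of the predicted class value, encoded as a double, so the usual way to recover an int index or the nominal label is a cast plus a lookup on the class attribute. A minimal sketch building on the code above (the "walking" label is only a hypothetical example):
int classIndex = (int) value;                                      // zero-based index of the predicted class
String classLabel = testData.classAttribute().value(classIndex);   // nominal label, e.g. "walking" (hypothetical)
System.out.println(classIndex + " -> " + classLabel);
// if per-class probabilities are what you actually need:
double[] distribution = rf.distributionForInstance(testData.instance(0));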

Related

ValueError: Found input variables with inconsistent numbers of samples: [1, 14048]

I am trying to run Multinomial Naive Bayes and receive the error below. Sample training data is given; the test data has exactly the same format.
from sklearn.naive_bayes import MultinomialNB

def main():
    text_train, targets_train = read_data('train')   # read_data is the asker's own helper
    text_test, targets_test = read_data('test')
    classifier1 = MultinomialNB()
    classifier1.fit(text_train, targets_train)
    prediction1 = classifier1.predict(text_test)
Sample Data:
Train:
category, text
Family, I love you Mom
University, I hate this course
I sometimes hit this error, and in most cases the cause is that the input data should be a 2-D array, for example when building a regression model. Code like the following triggers the error:
import numpy as np
from sklearn import linear_model

a = np.array([1, 2, 3]).T   # shape (3,): a 1-D array
b = np.array([4, 5, 6]).T
regr = linear_model.LinearRegression()
regr.fit(a, b)              # fails because a is 1-D, not 2-D
Then you need to add an extra pair of brackets so that the arrays are 2-D:
a = np.array([[1, 2, 3]]).T   # shape (3, 1): a 2-D column vector
b = np.array([[4, 5, 6]]).T
After that it runs normally. This is just my experience and is meant as a reference, not a definitive answer. (I am a Chinese student still learning English and Python.)

Finding the probability with which an instance is classified in Weka

I am using Weka for classification with the LibSVM classifier, and I would like some help with the output I get from the evaluation model.
In the example below, my test.arff file contains 1000 instances, and I want to know the probability with which each instance is classified as yes/no (it is a simple two-class problem).
For example, if instance 1 is classified as 'yes', I want to know with what probability it was classified that way.
Below is the code snippet that I have currently:
// Read and load the Training ARFF file
ArffLoader trainArffLoader = new ArffLoader();
trainArffLoader.setFile(new File("train_clusters.arff"));
Instances train = trainArffLoader.getDataSet();
train.setClassIndex(train.numAttributes() - 1);
System.out.println("Loaded Train File");
// Read and load the Test ARFF file
ArffLoader testArffLoader = new ArffLoader();
testArffLoader.setFile(new File("test_clusters.arff"));
Instances test = testArffLoader.getDataSet();
test.setClassIndex(test.numAttributes() - 1);
System.out.println("Loaded Test File");
LibSVM libsvm = new LibSVM();
libsvm.buildClassifier(train);
// Evaluation
Evaluation evaluation = new Evaluation(train);
evaluation.evaluateModel(libsvm, test);
System.out.println(evaluation.toSummaryString("\nPrinting the Results\n=====================\n", true));
System.out.println(evaluation.toClassDetailsString());
You should use the libsvm.distributionForInstance method. It returns a probability estimate for each class index (two in your case).
For example, to print the estimates for every instance in the test set, use something like this:
for (Instance instance : test) {
    double[] distribution = libsvm.distributionForInstance(instance);
    for (int classIndex = 0; classIndex < distribution.length; classIndex++) {
        System.out.print(distribution[classIndex] + " ");
    }
    System.out.println();
}
Note that these are not true probabilities, but estimates produced by Platt's method (see the question).
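One caveat, stated as an assumption about the Weka LibSVM wrapper rather than a certainty: Platt-style probability estimates are usually only produced when the probability-estimates option is enabled before training, roughly like this (check the exact property name against your wrapper version):
LibSVM libsvm = new LibSVM();
libsvm.setProbabilityEstimates(true);   // assumed option name, corresponds to libsvm's -b 1
libsvm.buildClassifier(train);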

No recommendation made on small dataset, despite best Pearson correlation similarity

I am facing a small problem while running a recommender engine in Mahout.
The dataset I am working with is given below:
1,101,5.0
1,102,4.0
1,103,4.0
1,107,5.0
1,108,3.0
2,101,3.0
2,102,4.0
2,104,4.0
2,105,4.0
3,101,5.0
3,102,4.0
When I calculate the Pearson similarity between users 1 and 3, I get a value of about 0.99999998, approximately 1.0.
That is the highest possible similarity, so according to the recommendation rule the output recommendation for user 3 should be item 107.
But my output gives no recommendation.
Below is my code:
public static void main(String[] args) throws Exception {
    // data model
    DataModel model = new FileDataModel(new File("data/dataset_2.csv"));
    System.out.println(model.getMaxPreference());
    // similarity between users
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    System.out.println("Pearson distance " + similarity.userSimilarity(3, 1));
    // the neighbours who satisfy the threshold level
    UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model);
    // recommender recommending the best items
    UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
    List<RecommendedItem> recommendations = recommender.recommend(3, 1);
    for (RecommendedItem recommendation : recommendations) {
        System.out.println(recommendation);
    }
}
I would appreciate it if anybody could point out the mistake, if there is one, or tell me whether my understanding of Mahout's Pearson correlation is wrong.
PearsonCorrelationSimilarity does not work well on small datasets with little overlap between users. You can change the similarity method or the neighbourhood size; as the dataset grows, the results improve.
In addition, you can increase the recommendation size (the howMany argument of recommend). Both ideas are sketched below.
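A rough sketch of those two suggestions (a nearest-N neighbourhood instead of the threshold, plus a larger howMany), assuming the rest of the setup stays exactly as in the question:
// import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
List<RecommendedItem> recommendations = recommender.recommend(3, 2);   // ask for up to 2 items for user 3
for (RecommendedItem recommendation : recommendations) {
    System.out.println(recommendation);
}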

Weka normalizing columns

I have an ARFF file containing 14 numerical columns. I want to normalize each column separately, that is, modify every value to (actual_value - min(this_column)) / (max(this_column) - min(this_column)), so that all values in a column fall in the range [0, 1]. The min and max of one column may differ from those of another.
How can I do this with Weka filters?
Thanks
This can be done using
weka.filters.unsupervised.attribute.Normalize
After applying this filter, all values in each column will be in the range [0, 1].
That's right. Just a reminder about the difference between "normalization" and "standardization": what the question describes is normalization (min-max scaling into [0, 1]), while standardization rescales each attribute by its mean and standard deviation and implicitly assumes a roughly Gaussian distribution. If your data contains outliers, the min-max Normalize filter can distort the distribution, because the min or max may lie much farther out than the rest of the instances.
In that case the weka.filters.unsupervised.attribute.Normalize filter normalizes every column; if you want to normalize only some columns, the following is the best approach.
To apply Normalize to selected columns only:
The unsupervised.attribute.PartitionedMultiFilter can be used for this task.
You then have to configure its filters and ranges sections as needed.
For example, to normalize only the humidity attribute:
Step 01:
After adding the PartitionedMultiFilter, click the filters text box, choose Normalize from weka.filters.unsupervised.attribute, and edit the Normalize filter as needed (by giving the scale and translation values).
Step 02:
Click the ranges text box, delete the default range (first-last), add the number of the column you want to filter, click OK, and then click Apply.
Now the filter is applied only to the selected (humidity) column. A scripted equivalent is sketched below.
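The same thing can be scripted. A minimal sketch, assuming the filter exposes the filters/ranges properties from the GUI as setFilters and setRanges (worth verifying against your Weka version), and assuming humidity is column 3 of your data:
// import weka.core.Range; import weka.filters.Filter;
// import weka.filters.unsupervised.attribute.Normalize;
// import weka.filters.unsupervised.attribute.PartitionedMultiFilter;
PartitionedMultiFilter partitioned = new PartitionedMultiFilter();
partitioned.setFilters(new Filter[]{ new Normalize() });
partitioned.setRanges(new Range[]{ new Range("3") });   // column 3 is assumed to be humidity
partitioned.setInputFormat(data);
Instances normalized = Filter.useFilter(data, partitioned);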
Here is a working normalization example with K-Means in Java.
final SimpleKMeans kmeans = new SimpleKMeans();
final String[] options = weka.core.Utils
.splitOptions("-init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 10 -A \"weka.core.EuclideanDistance -R first-last\" -I 500 -num-slots 1 -S 50");
kmeans.setOptions(options);
kmeans.setSeed(10);
kmeans.setPreserveInstancesOrder(true);
kmeans.setNumClusters(25);
kmeans.setMaxIterations(1000);
final BufferedReader datafile = new BufferedReader(new FileReader("/Users/data.arff"));
Instances data = new Instances(datafile);
//normalize
final Normalize normalizeFilter = new Normalize();
normalizeFilter.setInputFormat(data);
data = Filter.useFilter(data, normalizeFilter);
//remove class column[0] from cluster
data.setClassIndex(0);
final Remove removeFilter = new Remove();
removeFilter.setAttributeIndices("" + (data.classIndex() + 1));
removeFilter.setInputFormat(data);
data = Filter.useFilter(data, removeFilter);
kmeans.buildClusterer(data);
System.out.println(kmeans.toString());
// evaluate clusterer
final ClusterEvaluation eval = new ClusterEvaluation();
eval.setClusterer(kmeans);
eval.evaluateClusterer(data);
System.out.println(eval.clusterResultsToString());
If you have a CSV file, replace the BufferedReader lines above with the DataSource below:
final DataSource source = new DataSource("/Users/data.csv");
final Instances data = source.getDataSet();

How to do multi class classification using Support Vector Machines (SVM)

Every book and example shows only binary classification (two classes), where a new vector can belong to one of the two classes.
Here the problem is that I have 4 classes (c1, c2, c3, c4) and I have training data for all 4 classes.
For a new vector the output should look like:
c1 80% (the winner)
c2 10%
c3 6%
c4 4%
How can I do this? I'm planning to use libsvm (because it is the most popular). I don't know much about it. If any of you have used it before, please tell me the specific commands I am supposed to use.
LibSVM uses the one-against-one approach for multi-class learning problems. From the FAQ:
Q: What method does libsvm use for multi-class SVM ? Why don't you use the "1-against-the rest" method ?
It is one-against-one. We chose it after doing the following comparison: C.-W. Hsu and C.-J. Lin. A comparison of methods for multi-class support vector machines, IEEE Transactions on Neural Networks, 13(2002), 415-425.
"1-against-the rest" is a good method whose performance is comparable to "1-against-1." We do the latter simply because its training time is shorter.
Commonly used methods are one-vs.-rest and one-vs.-one.
In the first method you get n classifiers, and the resulting class is the one with the highest score.
In the second method the resulting class is chosen by majority vote of all the pairwise classifiers (see the small voting sketch below).
AFAIR, libsvm supports both strategies for multiclass classification.
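To make the majority-vote step concrete, here is a small library-agnostic sketch in Java; the pairwise predictions are made up purely for illustration:
public class OneVsOneVoting {
    public static void main(String[] args) {
        int numClasses = 4;
        // hypothetical outputs of the 6 pairwise classifiers, as class indices 0..3 (c1..c4)
        int[] pairwisePredictions = {0, 0, 2, 1, 1, 3};
        int[] votes = new int[numClasses];
        for (int predicted : pairwisePredictions) {
            votes[predicted]++;                 // each binary classifier casts one vote
        }
        int winner = 0;
        for (int c = 1; c < numClasses; c++) {
            if (votes[c] > votes[winner]) {
                winner = c;                     // keep the class with the most votes
            }
        }
        System.out.println("Predicted class: c" + (winner + 1));
    }
}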
You can always reduce a multi-class classification problem to a binary one by choosing random partitions of the set of classes, recursively. This is not necessarily any less effective or efficient than learning everything at once, since the sub-problems require fewer examples because each partitioning problem is smaller. (It may take at most a constant factor longer, e.g. twice as long.) It may also lead to more accurate learning.
I'm not necessarily recommending this, but it is one answer to your question, and it is a general technique that can be applied to any binary learning algorithm.
Use the SVM Multiclass library; you can find it on the SVM page of Thorsten Joachims.
It does not have a specific switch (command) for multi-class prediction; it automatically handles multi-class prediction if your training dataset contains more than two classes. There is nothing special compared with binary prediction. See the following example of 3-class prediction based on an SVM:
install.packages("e1071")
library("e1071")
data(iris)
attach(iris)
## classification mode
# default with factor response:
model <- svm(Species ~ ., data = iris)
# alternatively the traditional interface:
x <- subset(iris, select = -Species)
y <- Species
model <- svm(x, y)
print(model)
summary(model)
# test with train data
pred <- predict(model, x)
# (same as:)
pred <- fitted(model)
# Check accuracy:
table(pred, y)
# compute decision values and probabilities:
pred <- predict(model, x, decision.values = TRUE)
attr(pred, "decision.values")[1:4,]
# visualize (classes by color, SV by crosses):
plot(cmdscale(dist(iris[,-5])),
col = as.integer(iris[,5]),
pch = c("o","+")[1:150 %in% model$index + 1])
Here is a MATLAB example of the one-against-one approach with majority voting:
data = load('E:\dataset\scene_categories\all_dataset.mat');
meas = data.all_dataset;
species = data.dataset_label;
[g gn] = grp2idx(species); %# nominal class to numeric
%# split training/testing sets
[trainIdx testIdx] = crossvalind('HoldOut', species, 1/10);
%# 1-vs-1 pairwise models
num_labels = length(gn);
clear gn;
num_classifiers = num_labels*(num_labels-1)/2;
pairwise = zeros(num_classifiers ,2);
row_end = 0;
for i=1:num_labels - 1
row_start = row_end + 1;
row_end = row_start + num_labels - i -1;
pairwise(row_start : row_end, 1) = i;
count = 0;
for j = i+1 : num_labels
pairwise( row_start + count , 2) = j;
count = count + 1;
end
end
clear row_start row_end count i j num_labels num_classifiers;
svmModel = cell(size(pairwise,1),1); %# store binary-classifers
predTest = zeros(sum(testIdx),numel(svmModel)); %# store binary predictions
%# classify using one-against-one approach, SVM with RBF kernel
for k=1:numel(svmModel)
%# get only training instances belonging to this pair
idx = trainIdx & any( bsxfun(@eq, g, pairwise(k,:)) , 2 );
%# train
svmModel{k} = svmtrain(meas(idx,:), g(idx), ...
'Autoscale',true, 'Showplot',false, 'Method','QP', ...
'BoxConstraint',2e-1, 'Kernel_Function','rbf', 'RBF_Sigma',1);
%# test
predTest(:,k) = svmclassify(svmModel{k}, meas(testIdx,:));
end
pred = mode(predTest,2); %# voting: classify as the class receiving most votes
%# performance
cmat = confusionmat(g(testIdx),pred);
acc = 100*sum(diag(cmat))./sum(cmat(:));
fprintf('SVM (1-against-1):\naccuracy = %.2f%%\n', acc);
fprintf('Confusion Matrix:\n'), disp(cmat)
For multi-class classification using an SVM:
it is NOT one-vs.-one and NOT one-vs.-rest.
Instead, learn a two-class classifier where the feature vector is (x, y), with x the data and y the correct label associated with that data.
The training gap is the difference between the value for the correct class and the value of the nearest other class.
At inference time, choose the y that maximizes the value of (x, y):
y = argmax_{y'} W.(x, y')   [W is the weight vector and (x, y) is the feature vector]
See: https://nlp.stanford.edu/IR-book/html/htmledition/multiclass-svms-1.html
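To make the argmax above concrete, here is a toy Java sketch of the inference step; the joint feature map and the weight vector are made-up illustrations, not part of any library:
public class StructuralSvmInference {
    // joint feature map: copy the input features into the block that belongs to label y
    static double[] phi(double[] x, int y, int numClasses) {
        double[] features = new double[x.length * numClasses];
        System.arraycopy(x, 0, features, y * x.length, x.length);
        return features;
    }

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        int numClasses = 4;
        double[] x = {1.0, 0.5};                // example input vector
        double[] w = {2, 1, 0, 1, 1, 0, 0, 2};  // made-up learned weight vector (one block per class)
        int best = 0;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int y = 0; y < numClasses; y++) {
            double score = dot(w, phi(x, y, numClasses));
            if (score > bestScore) {
                bestScore = score;
                best = y;
            }
        }
        System.out.println("argmax_y W.(x,y) = c" + (best + 1));
    }
}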
