I'm using SMO to classify nominal values. After building the classifier, I need to predict the class of an instance, but classifyInstance (weka.classifiers.Classifier.classifyInstance) returns only a double.
How can I use that double to get the original nominal class?
Assuming you are calling the Weka classes from your Java code, you need to know that Weka handles all values internally as doubles.
When you create the Attribute, you pass it a list of strings that enumerates the possible nominal values. The double that classification returns is the index of the chosen value in that list. So if you had code that looked like this:
List<String> attributeValues = Arrays.asList("a", "b", "c");
Attribute a = new Attribute("attributeName", attributeValues);
and classifyInstance() returned 2.0, then the class it chose would be attributeValues.get(2), or "c".
How to use array features in RandomForest without flattening the input?
import numpy as np
from sklearn.ensemble import RandomForestClassifier
array_feature = np.array([0,0,1])
train_x = np.matrix([[1, 2, array_feature], [3, 4, array_feature] , [1,1, array_feature] ])
train_y = np.array([1,0,1])
clf_rf = RandomForestClassifier(n_estimators=2)
clf_rf.fit(train_x, train_y)
ValueError: setting an array element with a sequence.
You can't.
In sklearn, most models can only use numerical data, and preprocessing is done separately. Tree models (in sklearn) in particular can only make splits on whether a given feature is less or greater than a given value. You can either flatten the arrays, or provide some encoding for them, depending on what those arrays represent.
*(Tree models in other packages, and perhaps soon in sklearn, can treat categorical variables directly. Ordinal variables get treated just like continuous ones, and unordered categorical variables can be split into arbitrary bipartitions in CART or cause multiple-arity splits in Quinlan-family trees. But then still you would need to inform the model that your arrays should be treated as ordinal or unordered categorical or ...)
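If flattening is acceptable, one way to do it is sketched below. This is only an illustration under the assumption that every sample's array has the same length; `flatten_row` and `raw_rows` are hypothetical names, not part of sklearn or of the question's code.

```python
# Each sample has two scalar features plus one array-valued feature,
# mirroring the data in the question.
array_feature = [0, 0, 1]
raw_rows = [
    [1, 2, array_feature],
    [3, 4, array_feature],
    [1, 1, array_feature],
]


def flatten_row(row):
    """Expand list/tuple entries so the row contains only scalars."""
    flat = []
    for value in row:
        if isinstance(value, (list, tuple)):
            flat.extend(value)  # one column per array element
        else:
            flat.append(value)
    return flat


# Every row now has 2 + 3 = 5 numeric columns.
train_x = [flatten_row(row) for row in raw_rows]
print(train_x)
```

The resulting list of equal-length numeric rows is the kind of 2-D input that `clf_rf.fit(train_x, train_y)` expects. Whether treating each array element as an independent column makes sense depends on what the arrays represent.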
I used sklearn2pmml to serialize my decision tree classifier to a pmml file.
I used pmml4s in java to deserialize the model and use it to predict.
I use the code below to make a prediction for a single incoming value. This should return one of 0, 1, 2, 3, 4, 5, or 6.
Object[] result = model.predict(new String[]{"220"});
The result array looks like this after the prediction:
Does anyone know why this is happening? Is my way of inputting the prediction value wrong or is something wrong in the serialization/deserialization?
These are the model's confidence values for each class. In your case, it means the prediction is 4 with probability 94.5% or 5 with probability 5.5%.
In the simple case, if you want a single value, pick the index of the maximal value.
However, you can also use these probabilities for additional control logic, such as thresholding when the decision is ambiguous (two values with probability ~0.4, etc.).
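The two ideas above (take the index of the largest probability, and optionally refuse to decide when it is too low) can be sketched as follows. The sketch is in Python for brevity and does not use the pmml4s API; the layout of the probability list (one entry per class 0 to 6), the function name `pick_class`, and the 0.5 threshold are all assumptions made for illustration.

```python
def pick_class(probabilities, min_confidence=0.5):
    """Return the index of the most probable class, or None if too uncertain."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    if probabilities[best_index] < min_confidence:
        return None  # ambiguous: let the caller decide what to do
    return best_index


# Assumed layout: one probability per class 0..6.
confident = [0.0, 0.0, 0.0, 0.0, 0.945, 0.055, 0.0]
ambiguous = [0.0, 0.4, 0.4, 0.2, 0.0, 0.0, 0.0]

print(pick_class(confident))  # class 4 clears the threshold
print(pick_class(ambiguous))  # no class reaches 0.5, so None
```

Returning `None` instead of a class keeps the "not sure" case explicit, so downstream code can fall back to a default or ask for more input.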
I'm a beginner in ML and I built an SVM model to classify some inputs.
I used pandas to read my dataset. The classification results are printed as indexes, each of which corresponds to the name of a label (class) in my dataset. How can I convert these indexes to their names (strings)?
For example, I have three classes: [Question, General, Info], but when I classify an input, the result is one of these numbers: [0, 1, 2].
I want to convert these numbers to the names of the classes I have.
Here is part of my code:
data = pandas.read_csv("classes.csv",encoding='utf-16' )
Train_X, Test_X, Train_Y, Test_Y = sklearn.model_selection.train_test_split(data['input'],data['Class'],test_size=0.3,random_state=None)
Test_Y and Train_Y are lists of numbers (classes), and each number refers to one class. How do I know what each number represents?
The first thing you need to know is that your model is working as expected. Most of the time, it outputs a probability for each label. So if your model outputs something like [0.1, 0.1, 0.8], the sample you're classifying has an 80% probability of belonging to the label at position 2. If the labels are in the order you gave in your question, that is, [question, general, info], this sample belongs to the class info. Note that the order matters here, and you need to make sure the same order is used when you feed the labels to the model in your code.
Therefore, to output a string instead of a number, take the number output by the model and look up the label in a list or dictionary that holds this relationship. Using a list as an example:
labels_str = ['question', 'general', 'info']
# preds is a np.array containing the probabilities
preds = model(some_sample)
# this function returns the position of the max value in the array
pos_pred = preds.argmax()
print("The label for this sample is {}".format(labels_str[pos_pred]))
Did you get the idea?
I have already built a tree to classify instances. My tree has 14 attributes, each discretized with supervised discretization. When I created a new instance, filled in its values, and classified it with my tree, the result was wrong. So I debugged my program and found that the instance's values were not assigned to the correct intervals. For example:
The instance value 0.26879699248120303 is assigned to the interval '(-inf-0]'.
Why?
Problem solved. I hadn't discretized the instance to be tested, so Weka didn't know the format of my instance. Adding the following code fixed it:
discretize.input(instance);//discretize is a filter
instance = discretize.output();
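The idea behind the fix is that a new instance has to go through the same fitted discretization as the training data before the tree sees it. The following sketch illustrates that idea in plain Python. It is not Weka's implementation: the cut points and the names `discretize_value` and `interval_label` are hypothetical, and the intervals are assumed to be right-closed, as the '(-inf-0]' label suggests.

```python
from bisect import bisect_left

# Hypothetical cut points, standing in for what a discretization filter
# would learn from the training data.
cut_points = [0.0, 0.5]


def discretize_value(value, cuts):
    """Return the bin index of value for right-closed intervals (a, b]."""
    return bisect_left(cuts, value)


def interval_label(bin_index, cuts):
    """Build a readable label for a bin, e.g. '(0.0-0.5]'."""
    lower = "-inf" if bin_index == 0 else str(cuts[bin_index - 1])
    if bin_index == len(cuts):
        return "({}-inf)".format(lower)
    return "({}-{}]".format(lower, cuts[bin_index])


value = 0.26879699248120303
bin_index = discretize_value(value, cut_points)
print(interval_label(bin_index, cut_points))
```

With these assumed cut points, the value from the question falls in the second bin rather than the first, which is the kind of result one would expect once the test instance is passed through the fitted filter.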
While creating my training set, I entered "true" and "false" in the same order as I did when creating the test set in WEKA. These nominal values are for the class attribute.
When I run a classifier, the results look as if the classes are reversed in the test set.
My question is: if the first line of the training set has the class value "False", and the trained SVM model is then applied to a test set, should I interpret a returned class of 0 as False?
Thanks
Abhishek S
If the nominal attribute was defined in the same order in both data sets (training and test), the output will be in the same order.
Nominal values are coded as doubles, according to their position in the attribute declaration.
So if you wrote {false, true}, then "false" = 0.0 and "true" = 1.0.
Here is the relevant excerpt from the Weka documentation.
The returned double value from classifyInstance (or the index in the
array returned by distributionForInstance) is just the index for the
string values in the attribute. That is, if you want the string
representation for the class label returned above clsLabel, then you
can print it like this:
System.out.println(clsLabel + " -> " + unlabeled.classAttribute().value((int) clsLabel));
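To make the role of declaration order concrete, here is a small language-neutral sketch in Python. It does not call Weka; `decode` is a hypothetical helper that mimics looking up a nominal value by the returned index, and the two lists stand in for two possible attribute declarations.

```python
def decode(declared_values, index):
    """Map a returned double back to the nominal value at that position."""
    return declared_values[int(index)]


# Two possible declarations of the same class attribute.
same_order = ["false", "true"]
reversed_order = ["true", "false"]

predicted = 0.0  # the kind of double a classifier returns

print(decode(same_order, predicted))
print(decode(reversed_order, predicted))
```

The same returned value decodes to different labels depending on the declaration, which is why results look reversed when the training and test sets declare the nominal values in different orders.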