Ordering of nominal values for Class Attribute in weka - machine-learning

During the creation of my training set, I entered "true" and "false" in the same order as in the test set in WEKA. These nominal values are for the class attribute.
When I run a classifier, I somehow feel that the results look reversed on the test set.
My question: if the first line in the training set has the class value "False", and the trained model is used with the SVM classifier on a test set, does a returned class of 0 mean I should interpret it as "False"?
Thanks
Abhishek S

If the nominal attribute was defined in the same order in both data sets (training and test), the output will be in the same order.
Internally, nominal values are coded as doubles. So if you wrote {false, true}, then "false" = 0.0 and "true" = 1.0.

Here is the relevant excerpt from the Weka documentation:
The returned double value from classifyInstance (or the index in the
array returned by distributionForInstance) is just the index of the
string value in the attribute. That is, if you want the string
representation of the class label clsLabel returned above, you can
print it like this:
System.out.println(clsLabel + " -> " + unlabeled.classAttribute().value((int) clsLabel));
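To double-check the ordering in your own data, here is a minimal sketch (assuming the files are named train.arff and test.arff and the class is the last attribute) that prints the internal index of each class value and verifies that the two headers are compatible:
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
Instances train = DataSource.read("train.arff");
Instances test = DataSource.read("test.arff");
train.setClassIndex(train.numAttributes() - 1);
test.setClassIndex(test.numAttributes() - 1);
// print the value stored at each internal index, e.g. "0 = false, 1 = true"
System.out.println("train: 0 = " + train.classAttribute().value(0) + ", 1 = " + train.classAttribute().value(1));
System.out.println("test:  0 = " + test.classAttribute().value(0) + ", 1 = " + test.classAttribute().value(1));
// true only if both headers are identical, including the order of the nominal class values
System.out.println("compatible: " + train.equalHeaders(test));
If both data sets print "0 = false", then a classifyInstance() result of 0.0 on the test set does indeed mean "false".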

Related

pmml4s model.predict() returns array instead of single value

I used sklearn2pmml to serialize my decision tree classifier to a PMML file.
I used pmml4s in Java to deserialize the model and use it for prediction.
I use the code below to make a prediction for a single incoming value. This should return one of 0/1/2/3/4/5/6.
Object[] result = model.predict(new String[]{"220"});
After the prediction, however, the result array contains several values instead of a single class label.
Does anyone know why this is happening? Is my way of passing the input value wrong, or is something wrong in the serialization/deserialization?
It is the model's certainty for each class. In your case it means the prediction is class 4 with probability 94.5% or class 5 with probability 5.5%.
In the simple case, if you just want a single value, pick the index of the maximal probability.
However, you might also use these probabilities for additional control logic, such as thresholding when the decision is ambiguous (two classes with probability ~0.4, etc.).
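As a minimal sketch (using only the result array from the question, and assuming it holds one probability per class, as described above), you can pick the predicted class like this:
Object[] result = model.predict(new String[]{"220"});
// scan the per-class probabilities for the largest one
int bestIndex = 0;
double bestProb = Double.NEGATIVE_INFINITY;
for (int i = 0; i < result.length; i++) {
    double p = ((Number) result[i]).doubleValue();
    if (p > bestProb) {
        bestProb = p;
        bestIndex = i;
    }
}
// bestIndex is the predicted class (0..6 here), bestProb the model's confidence in it
System.out.println("class " + bestIndex + " with probability " + bestProb);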

How to get class names in classification?

I'm a beginner in ML. I built an SVM model to classify some inputs.
I used pandas to read my dataset. The classification results are printed as indices, each of which corresponds to the name of a label (class) in my dataset. How can I convert these indices to their names (strings)?
For example, I have three classes: [Question, General, Info], but when I try to classify an input, the result is one of these numbers: [0, 1, 2].
I want to convert these numbers to the names of the classes I have.
Here is part of my code:
data = pandas.read_csv("classes.csv",encoding='utf-16' )
Train_X, Test_X, Train_Y, Test_Y = sklearn.model_selection.train_test_split(data['input'],data['Class'],test_size=0.3,random_state=None)
Test_Y and Train_Y are lists of numbers (classes); each number refers to one class. How do I know what each number represents?
The first thing you need to know is: your model is working as expected. Most of the time it will output a probability for each label. So, if your model outputs something like [0.1, 0.1, 0.8], it means the sample you're classifying has an 80% probability of belonging to the label in position 2. If you pass the labels in the order you indicated in your question, that is, [Question, General, Info], this sample belongs to the class Info. Note that the order matters here, and you need to make sure it is consistent when you feed the model in your code.
Therefore, to output a string instead of a number, take the number output by the model and look up the label in a list or dictionary that holds this mapping. Using a list as an example:
labels_str = ['question', 'general', 'info']
# preds is a np.array with one probability per class
# (assumes e.g. an sklearn SVC trained with probability=True)
preds = model.predict_proba([some_sample])[0]
# argmax returns the position of the largest value in the array
pos_pred = preds.argmax()
print("The label for this sample is {}".format(labels_str[pos_pred]))
Did you get the idea?

How to understand sample_weight in sklearn.metrics?

Do we need to set sample_weight when we evaluate our model? I have trained a classification model, but the dataset is imbalanced. When I set sample_weight with compute_sample_weight('balanced'), the scores are very good: precision 0.88, recall 0.86 for the '1' class.
But the scores are bad if I don't set sample_weight: precision 0.85, recall 0.21.
Will sample_weight distort the original data distribution?
The sample-weight parameter is only used during training.
Suppose you have a dataset with 16 points belonging to class "0" and 4 points belonging to class "1".
Without this parameter, during optimization, they have a weight of 1 for loss calculation: they contribute equally to the loss that the model is minimizing. That means that 80% of the loss is due to points of class "0" and 20% is due to points of class "1".
By setting it to "balanced", scikit-learn will automatically calculate weights to assign to class "0" and class "1" such that 50% of the loss comes from class "0" and 50% from class "1".
This parameter affects the "optimal threshold" you need to use to separate class "0" predictions from class "1", and it also influences the performance of your model.
Here is my understanding: sample_weight by itself has nothing to do with balanced or unbalanced data; it is just a way to reflect the distribution of the sample data. So the following two ways of expressing the data are equivalent, and expression 1 is clearly more efficient in terms of space. This sample_weight works the same way as in any other statistical package in any language and has nothing to do with random sampling.
expression 1
X = [[1,1],[2,2]]
y = [0,1]
sample_weight = [1000,2000] # total 3000
versus
expression 2
X = [[1,1],[2,2],[2,2],...,[1,1],[2,2],[2,2]] # total 3000 rows
y = [0,1,1,...,0,1,1]
sample_weight = [1,1,1,...,1,1,1] # or just set as None

the result weka j48 classifyinstance is not correct

I already built a tree to classify instances. The tree uses 14 attributes, each discretized with supervised discretization. When I created a new instance, I filled in its values and classified it with my tree, and the result was wrong. So I debugged my program and found that the instance's values were not mapped into the intervals correctly. For example:
the instance value 0.26879699248120303 is placed into the interval '(-inf-0]'.
Why?
Problem solved. I didn't discretize the instance that was to be tested, so Weka didn't know the format of my instance. Add the following code:
discretize.input(instance);//discretize is a filter
instance = discretize.output();
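For reference, here is a minimal sketch of the whole pattern (the variable names train and testInstance are assumptions): fit the supervised Discretize filter on the training data, build J48 on the filtered data, and push every test instance through the same fitted filter before classifying it.
import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;
Discretize discretize = new Discretize();
discretize.setInputFormat(train); // the class index must already be set on train
Instances trainDiscrete = Filter.useFilter(train, discretize); // learns the cut points
J48 tree = new J48();
tree.buildClassifier(trainDiscrete);
// the raw test instance must go through the same fitted filter
discretize.input(testInstance);
Instance discretized = discretize.output();
double prediction = tree.classifyInstance(discretized);
Alternatively, wrapping J48 together with the Discretize filter in a weka.classifiers.meta.FilteredClassifier applies the filter to test instances automatically.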

predict nominal values in SMO

I'm using SMO to classify nominal values. After I build the classifier I need to predict the class for an instance, but classifyInstance (weka.classifiers.Classifier.classifyInstance) returns only a double number.
How can I use the double number to get the original nominal class?
Assuming you are calling the Weka classes from your Java code, you need to know that internally, Weka handles all values as doubles.
When you create the Attribute, you pass it a list of strings with the possible nominal values. The double that classification returns is the index of the chosen value in that original list. So if you had code that looked like this:
List<String> attributeValues = Arrays.asList("a", "b", "c");
Attribute a = new Attribute("attributeName", attributeValues);
and classifyInstance() returned 2.0, then the class it chose would be attributeValues.get(2), i.e. "c".
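Putting it together, a minimal sketch (train is assumed to be an Instances object with the class index set, and unseen an Instance belonging to a compatible data set):
import weka.classifiers.functions.SMO;
import weka.core.Instance;
import weka.core.Instances;
SMO smo = new SMO();
smo.buildClassifier(train);
double pred = smo.classifyInstance(unseen); // e.g. 2.0
// look the index up in the class attribute to recover the nominal value
String label = unseen.classAttribute().value((int) pred); // "c" in the example above
System.out.println("predicted class: " + label);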
