Hello, I am new to WEKA and am using WEKA 3.6.10.
Sorry if the answer to this question is obvious.
I have a dataset containing 10 attributes and one decision class. The decision class takes the values {1,2,3,4}. Is there a way to configure WEKA so that these values are treated as two groups, {1} and {2,3,4} (binary), rather than as four separate values, without modifying the other attributes?
I had a look at the WEKA filter but did not find anything useful.
Thanks guys
Use an Unsupervised Attribute filter, e.g. the NumericToBinary filter. In the topmost field of the configuration dialog, enter the position of the "Decision class" attribute. If it is in the 8th column, enter 8.
The filter will create "dummy variable" columns for each unique value of this attribute. If there are 4 unique values, after applying this filter your dataset will have 4 additional columns. Remove 3 of them.
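To make the idea concrete outside of WEKA, here is a minimal Python sketch of the same trick: build one dummy column per class value, then keep only the column for value 1. The sample values and the names one_hot and binary_class are made up for illustration; this is not what the WEKA filter does internally.

```python
# Hypothetical decision-class column using the four values from the question.
decision = [1, 3, 2, 1, 4, 2, 1]

def one_hot(values):
    """Return {value: indicator column}, one 0/1 column per unique value."""
    return {v: [1 if x == v else 0 for x in values] for v in sorted(set(values))}

dummies = one_hot(decision)

# Keep only the indicator for value 1 and drop the other three columns.
# The result is a binary target: 1 means {1}, 0 means {2,3,4}.
binary_class = dummies[1]
print(binary_class)  # [1, 0, 0, 1, 0, 0, 1]
```

The other attributes are never touched, because only the class column is encoded.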
I want to get the details (unique id) of the incorrectly classified instances using the Weka GUI. I am following the answers to this question. They say to use the StringToNominal filter in the Preprocess tab to convert the unique id, which is a string. However, if I do that, I suspect the classifier also treats the unique id column as a feature during classification.
Please suggest the correct way of approaching this.
I am happy to provide examples if needed.
Let's suppose you want to (1) add an instance ID, (2) not use that instance ID in the model, and (3) see the individual predictions, with the instance ID and maybe some other attributes.
We’re going to show this with a smaller data set. Open iris.arff, for example.
Use the AddID filter in the Preprocess tab, in the Unsupervised Attribute filters. ID will be the first attribute.
Now we need to ignore it during the modeling. Use the filtered classifier with the Remove filter.
And we need to output the predictions with the ID variable so we can see what happened. Here we are outputting all the attributes, although we don’t need all of them.
We get this detail in the output window:
=== Predictions on test split ===
inst#,actual,predicted,error,prediction,ID,sepallength,sepalwidth,petallength,petalwidth
1,2:Iris-versicolor,2:Iris-versicolor,,0.968,53,6.9,3.1,4.9,1.5
2,3:Iris-virginica,3:Iris-virginica,,0.968,131,7.4,2.8,6.1,1.9
3,2:Iris-versicolor,2:Iris-versicolor,,0.968,59,6.6,2.9,4.6,1.3
4,1:Iris-setosa,1:Iris-setosa,,1,36,5,3.2,1.2,0.2
5,3:Iris-virginica,3:Iris-virginica,,0.968,101,6.3,3.3,6,2.5
6,2:Iris-versicolor,2:Iris-versicolor,,0.968,88,6.3,2.3,4.4,1.3
7,1:Iris-setosa,1:Iris-setosa,,1,42,4.5,2.3,1.3,0.3
8,1:Iris-setosa,1:Iris-setosa,,1,8,5,3.4,1.5,0.2
and so on.
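For anyone who wants to see the principle rather than the GUI clicks, here is a rough Python sketch of what "FilteredClassifier plus Remove" achieves: the ID travels with each row but is stripped before the model sees it, then attached to the prediction. The function names and the threshold rule in toy_predict are assumptions for illustration only, not Weka's API or a trained model; the three rows are taken from the output above.

```python
def classify_without_id(rows, id_index, predict):
    """Return (id, prediction) pairs; predict() never sees the ID column."""
    results = []
    for row in rows:
        features = row[:id_index] + row[id_index + 1:]
        results.append((row[id_index], predict(features)))
    return results

def toy_predict(features):
    """Stand-in for a trained model (assumption): thresholds on petal length."""
    sepallength, sepalwidth, petallength, petalwidth = features
    if petallength < 2.5:
        return "Iris-setosa"
    return "Iris-versicolor" if petallength < 5.0 else "Iris-virginica"

# Rows are (ID, sepallength, sepalwidth, petallength, petalwidth).
rows = [
    (53, 6.9, 3.1, 4.9, 1.5),
    (131, 7.4, 2.8, 6.1, 1.9),
    (36, 5.0, 3.2, 1.2, 0.2),
]

results = classify_without_id(rows, 0, toy_predict)
for instance_id, label in results:
    print(instance_id, label)
```

Because the ID is removed inside the wrapper, it cannot leak into the model, yet every prediction can still be traced back to its instance.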
I’m getting an evaluation error while building a binary classification model in IBM Data Science Experience (DSX) using IBM Watson Machine Learning when one of the feature columns has unique categorical values.
The dataset I'm using looks like this:
Customer,Cust_No,Alerts,Churn
Ford,1000,8,0
GM,2000,50,1
Chrysler,3000,10,0
Tesla,4000,48,1
Toyota,5000,15,0
Honda,6000,55,1
Subaru,7000,12,0
BMW,8000,52,1
MBZ,9000,13,0
Porsche,10000,54,1
Ferrari,11000,9,0
Nissan,12000,49,1
Lexus,13000,10,0
Kia,14000,50,1
Saab,15000,12,0
Faraday,16000,47,1
Acura,17000,13,0
Infinity,18000,53,1
Eco,19000,16,0
Mazda,20000,52,1
In DSX, upload the above CSV data, then create a model using the automatic model builder. Select Churn as the label column and Customer and Alerts as feature columns. Select the Binary Classification model and use the default settings for the training/test split. Train the model. The model building fails with an evaluation error. If we instead select Cust_No and Alerts as feature columns, the model is created successfully. Why is that?
When a model is built in DSX, the data is split into training, test and holdout sets. These datasets are disjoint.
If the Customer field is chosen, it is a string field and must be converted to numeric values to be meaningful to ML algorithms (Linear regression / Logistic regression / Decision Trees etc.).
How this is done:
The algorithm iterates over each value of the Customer field and creates a dictionary mapping each string value to a numeric value (see Spark StringIndexer - https://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer).
When the model is evaluated or scored, the string fields from the test subset are converted to numeric values using the dictionary built at training time. If a value is not found there are two options: skip the entire record or throw an error. DSX chooses the first option.
Since all values in the Customer field are unique, every test value is missing from the training dictionary, so none of the records from the test dataset reach the evaluation phase. That is why you get the error that the model cannot be evaluated.
In the case of Cust_No, the field is already numeric and does not require a category encoding step. Even if the values seen at evaluation time were not present in training, they are used as is.
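Here is a small Python sketch of that mechanism, using a few rows from the question. It is a simplification and not the Spark implementation: fit_index and transform_skip are hypothetical names, and the index is assigned in order of first appearance, whereas Spark's StringIndexer orders labels by frequency. The train/test split shown is also just an assumption.

```python
def fit_index(values):
    """Map each distinct string to an integer, in order of first appearance."""
    index = {}
    for v in values:
        index.setdefault(v, len(index))
    return index

def transform_skip(rows, index):
    """Encode the first field; drop rows whose value was not seen during fit."""
    return [(index[name], *rest) for name, *rest in rows if name in index]

# Rows are (Customer, Alerts, Churn); the split is made up for illustration.
train = [("Ford", 8, 0), ("GM", 50, 1), ("Chrysler", 10, 0), ("Tesla", 48, 1)]
test = [("Toyota", 15, 0), ("Honda", 55, 1)]

index = fit_index(name for name, *_ in train)
encoded_test = transform_skip(test, index)

# Every test customer is unseen, so no record is left to evaluate.
print(len(encoded_test))  # 0
```

With a numeric column such as Cust_No there is no dictionary lookup at all, so no rows are dropped.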
Taking a step back, it seems to me like your data doesn't really contain predictive information other than in Alerts.
The Customer and Cust_No fields are basically ID columns and do not seem to contain predictive information.
Can you post a screenshot of your Evaluation error? I can try to help, I work on DSX.
So I've got a large dataset (2304 numeric attributes and the class attribute), and I want to perform feature selection to remove misleading and redundant attributes. This is because I will be running discretization to make them nominal and then running Naive Bayes on the dataset.
However, the Select attributes tab in Weka only lists them in ranking order. I know there is a Remove filter in the Preprocess tab, but it only takes a range or the number of the attribute(s).
Is there an automated way of removing these, given such a large dataset?
In the Preprocess tab,
Choose the AttributeSelection filter (supervised attribute filter).
Configure evaluator and search as desired.
Apply.
This will only keep the ones that pass the filter (keeping the class attribute, of course).
If you like the result, save this as a new arff file.
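As a rough picture of what the filter does for you, here is a Python sketch that scores every attribute, keeps the ones above a threshold, and always keeps the class column. The score used here (gap between per-class means) is a toy stand-in I picked for illustration, not one of Weka's evaluators, and the data and names are made up.

```python
def mean(xs):
    return sum(xs) / len(xs)

def class_mean_gap(rows, col, class_col):
    """Toy score: gap between the largest and smallest per-class mean."""
    groups = {}
    for r in rows:
        groups.setdefault(r[class_col], []).append(r[col])
    means = [mean(v) for v in groups.values()]
    return max(means) - min(means)

def select_attributes(header, rows, class_name, threshold):
    """Keep attributes scoring above threshold, plus the class attribute."""
    class_col = header.index(class_name)
    keep = [i for i in range(len(header))
            if i == class_col or class_mean_gap(rows, i, class_col) > threshold]
    return [header[i] for i in keep], [tuple(r[i] for i in keep) for r in rows]

header = ["a", "noise", "cls"]
rows = [(1.0, 5.0, "x"), (1.2, 4.9, "x"), (3.0, 5.1, "y"), (3.1, 5.0, "y")]

new_header, new_rows = select_attributes(header, rows, "cls", 0.5)
print(new_header)  # ['a', 'cls']
```

The point is only that the removal is driven by the scores, so nobody has to type 2304 attribute indices by hand.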
I am classifying 755 classes with 1024 attributes in weka. I suppose I need to select best attributes for better accuracy. I tried to select attributes using InfoGainAttributeEval and Ranker method but all the attributes ranked as '0'. I am not sure what is wrong. Any help is appreciated.
The "0" here is not the ranking of each attribute, but the InfoGain of each attribute.
When you apply InfoGainAttributeEval in Weka, the 3 columns correspond to:
Information Gain | "id" of the attribute | name of the attribute.
What happened here might be that the information gain was too small and has been rounded.
But again, as stated by #CAFEBABE in the comments, without having your data it is hardly possible to be sure.
Edit : A post here indicates that the precision of the output is set to four decimal places, which seems to confirm the above hypothesis.
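To show how a non-zero information gain can print as 0, here is a small Python sketch that computes the gain for a made-up binary attribute that is almost independent of the class, then formats it to four decimals. The data is invented for illustration and has nothing to do with the asker's dataset; whether rounding is really what happened there cannot be confirmed without the data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(attr, labels):
    """Entropy of the class minus its entropy conditioned on the attribute."""
    n = len(labels)
    remainder = 0.0
    for v in set(attr):
        subset = [l for a, l in zip(attr, labels) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# 10000 made-up instances: the attribute barely shifts the class balance.
attr = [0] * 5000 + [1] * 5000
labels = ["P"] * 2501 + ["N"] * 2499 + ["P"] * 2499 + ["N"] * 2501

gain = info_gain(attr, labels)
print(gain > 0)          # True: the gain is positive
print(f"{gain:.4f}")     # 0.0000 once rounded to four decimal places
```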
Good Evening,
I am working on a supervised classification task. I have a big arff file full of data in the format, "text", class. There are only two classes, E and I.
I can load this data into Weka Explorer, apply the StringToWordVector with TF-IDF on it, then using LibSVM classify it and get results. But I need to use 5x2 Cross-Validation and get the Area under the ROC Curve. So I save that processed data, open up Weka Experimenter, load it in, set it to 2 folds, 5 iterations, and then set the algorithm to libSVM.
When I go to the RUN tab and press start I get the following error:
18:31:18: Started
18:31:18: Class attribute is not nominal!
18:31:18: Interrupted
18:31:18: There was 1 error
I don't know why this is happening, what exactly the error is, or how to fix it. I googled this error, but it did not lead me to any solutions. I am not sure where I should go from here to fix this.
I can go back to Explorer, reload in that processed file, and classify it without any issues but I need to do it in Experimenter.
In my case, there were nominal attributes in the file. However, Weka expects these to be last, since they indicate the class that the record is being assigned to. Here's how I rearranged the data so that the nominal value was last:
In Explorer, open the arff file.
Click 'Edit...' then find the column which should be the class of each record.
Right click on the column header and select 'Attribute as class'.
Click 'Save...' and use this new dataset in Experimenter.
Works like a charm.
If your class attribute is numeric (like 0,1) change it to a nominal form like true, false.
The StringToWordVector filter puts the class attribute as the first attribute in the data that it outputs. The Experimenter expects the last attribute in the data to be the class. You can reorder the attributes of the filtered data, but the best (and correct approach in general when combining filters with classifiers) is to use the FilteredClassifier to encapsulate your base classifier (LibSVM) with the StringToWordVector filter. This should work out just fine because the class attribute is the last attribute in your original "text", class data.
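If you do go the reordering route, the operation itself is trivial. Here is a Python sketch of moving the class column to the end, assuming the filtered data has the class first; move_class_last and the attribute names are hypothetical and only illustrate the column order the Experimenter expects.

```python
def move_class_last(header, rows, class_name):
    """Return header and rows with the class column moved to the last position."""
    i = header.index(class_name)

    def reorder(r):
        return list(r[:i]) + list(r[i + 1:]) + [r[i]]

    return reorder(header), [reorder(r) for r in rows]

# Made-up filtered data with the class attribute first.
header = ["class", "word_a", "word_b"]
rows = [["E", 0.0, 1.2], ["I", 0.7, 0.0]]

new_header, new_rows = move_class_last(header, rows, "class")
print(new_header)  # ['word_a', 'word_b', 'class']
```

The FilteredClassifier approach avoids this step entirely, since the original data already has the class last.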