I am working on the KDD99 dataset using WEKA. There are three types of attributes in the dataset: nominal, binary, and numeric. However, WEKA treats the binary attributes as numeric as well.
I tried to use the unsupervised attribute filter Normalize to normalize the data, but it also normalizes the binary attributes. I have two questions here.
Do I need to normalize the binary attributes, given that binary data is not continuous?
If I do not need to normalize the binary attributes, how can I select which attributes the Normalize filter applies to in WEKA? It always applies to all numeric attributes (including the binary ones).
Thanks!
Weka has interpreted the binary attributes in your input file as numeric because their values are all numbers (i.e. 0 and 1). If you're going to use classifiers that can handle nominal attributes, you probably want to convert the binary attributes into nominal ones instead.
You can do this with the weka.filters.unsupervised.attribute.Discretize filter. Just specify the indices of the attributes that are binary and set the number of bins to 2.
This will give you attributes with nominal value labels of (-inf-0.5] and (0.5-inf), but if you'd rather see them as 0 and 1 you can rename the values using weka.filters.unsupervised.attribute.RenameNominalValues.
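For example, the whole pipeline could look like the sketch below (a minimal sketch only: "kdd99.arff" is a stand-in for your exported file, and the indices "7,12,21" are placeholders for wherever your binary attributes actually sit). Because Normalize only touches numeric attributes, running it last also answers your second question:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;
import weka.filters.unsupervised.attribute.Normalize;
import weka.filters.unsupervised.attribute.RenameNominalValues;

public class PrepareKdd {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("kdd99.arff"); // placeholder filename

        // Turn the 0/1 attributes into 2-bin nominal attributes.
        // "7,12,21" is a placeholder -- use the 1-based indices of YOUR
        // binary attributes.
        Discretize disc = new Discretize();
        disc.setAttributeIndices("7,12,21");
        disc.setBins(2);
        disc.setInputFormat(data);
        Instances nominal = Filter.useFilter(data, disc);

        // Optional: rename the generated bin labels back to 0 and 1
        // (the labels must match exactly what Discretize produced).
        RenameNominalValues rename = new RenameNominalValues();
        rename.setSelectedAttributes("7,12,21");
        rename.setValueReplacements("'(-inf-0.5]':0, '(0.5-inf)':1");
        rename.setInputFormat(nominal);
        Instances renamed = Filter.useFilter(nominal, rename);

        // Normalize now only touches the remaining numeric attributes,
        // leaving the nominal (formerly binary) ones alone.
        Normalize norm = new Normalize();
        norm.setInputFormat(renamed);
        Instances normalized = Filter.useFilter(renamed, norm);

        System.out.println(normalized.numInstances() + " instances prepared");
    }
}
```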
Dataset columns:
target column -- fruit name (values: mango, orange, apple)
feature columns -- size (numeric), color (red, green, yellow), weight (numeric)
I have done one-hot encoding on the color column and prepared the features, so every column now has numeric values.
I want to use a classification model for the prediction.
My target column, on which I have to do the prediction, consists of categorical data (e.g. apple, orange, mango). If I want to use a logistic regression model, which is a classification model, do I need to one-hot encode or label encode the target column as well, the way we did for the feature column (color)?
Thank you!
No, the target does not need to be one-hot encoded; it will work as-is, since logistic regression returns the probability of Y = y given your input X.
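For instance, with Weka's Logistic classifier this is direct (a minimal sketch; "fruit.arff" is a hypothetical file whose last attribute is the nominal fruit class):

```java
import weka.classifiers.functions.Logistic;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class FruitLogistic {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("fruit.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // The nominal class {mango, orange, apple} is accepted directly --
        // no manual encoding of the target is required (nominal features
        // are binarized internally by the classifier).
        Logistic lr = new Logistic();
        lr.buildClassifier(data);

        // distributionForInstance returns P(Y = y | X) for each class label
        double[] probs = lr.distributionForInstance(data.instance(0));
        for (int i = 0; i < probs.length; i++) {
            System.out.printf("P(%s) = %.3f%n",
                data.classAttribute().value(i), probs[i]);
        }
    }
}
```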
I am trying to encode a gender feature containing two values, Male and Female. I created two one-hot features from the main feature, is_male and is_female, containing boolean values. But while applying the model, I realized they are complements of each other. Does this impact model performance, since they seem to be correlated?
One-hot encoding (creating a separate column for each value of a column) should not be used with binary-valued variables (MALE/FEMALE in your case); keep a single 0/1 column instead.
Keeping both columns causes the DUMMY VARIABLE TRAP: since is_female = 1 - is_male, the two columns are perfectly collinear, which can destabilize models such as linear and logistic regression.
I'm using RandomForest from Weka 3.7.11, which in turn is bagging Weka's RandomTree. My input attributes are numerical, and the output attribute (label) is also numerical.
When training the RandomTree, K attributes are chosen at random for each node of the tree. Several splits based on those attributes are attempted and the "best" one is chosen. How does Weka determine what split is best in this (numerical) case?
For nominal attributes I believe Weka is using the information gain criterion which is based on conditional entropy.
IG(T|a) = H(T) - H(T|a)
Is something similar used for numerical attributes? Maybe differential entropy?
When a tree is split on a numerical attribute, it is split on a condition like a > 5. This condition effectively becomes a binary variable, and the criterion (information gain) is exactly the same.
P.S. For regression, the commonly used criterion is the sum of squared errors (computed per leaf, then summed over the leaves), but I do not know what Weka does specifically.
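To make the first point concrete, here is a minimal sketch (plain Java, not Weka's actual source) of the information gain of one candidate threshold split a > t for a nominal class; for a numeric class, the analogous computation would use the squared-error sums from the P.S.:

```java
public class ThresholdSplit {

    // Shannon entropy H(T) of a class-count distribution, in bits
    static double entropy(int[] counts, int total) {
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * (Math.log(p) / Math.log(2));
            }
        }
        return h;
    }

    // IG(T | a > t) = H(T) - [ |L|/n * H(L) + |R|/n * H(R) ],
    // where L and R hold the instances with a <= t and a > t respectively
    static double infoGain(double[] a, int[] label, int numClasses, double t) {
        int n = a.length;
        int[] all = new int[numClasses];
        int[] left = new int[numClasses];
        int[] right = new int[numClasses];
        int nLeft = 0;
        for (int i = 0; i < n; i++) {
            all[label[i]]++;
            if (a[i] <= t) { left[label[i]]++; nLeft++; }
            else           { right[label[i]]++; }
        }
        int nRight = n - nLeft;
        return entropy(all, n)
             - (double) nLeft  / n * entropy(left, nLeft)
             - (double) nRight / n * entropy(right, nRight);
    }
}
```

The tree builder would evaluate infoGain for each candidate threshold of each of the K randomly chosen attributes and keep the split with the highest value.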
So I read a paper that said that preprocessing your dataset correctly can increase LibSVM classification accuracy dramatically. I'm using the Weka implementation and would like some help making sure my dataset is optimal.
Here are my (example) attributes:
Power Numeric (real numbers, range is from 0 to 1.5132, 9000+ unique values)
Voltage Numeric (similar to Power)
Light Numeric (0 and 1 are the only 2 possible values)
Day Numeric (1 through 20 are the possible values, equal number of each value)
Range Nominal {1,2,3,4,5} <----these are the classes
My question is: which Weka pre-processing filters should I apply to make this dataset more effective for LibSVM?
Should I normalize and/or standardize the Power and Voltage data values?
Should I use a Discretization filter on anything?
Should I be binning the Power/Voltage values into a lot smaller number of bins?
Should I make the Light value Binary instead of numeric?
Should I normalize the Day values? Does it even make sense to do that?
Should I be using the Nominal to Binary filter, or Nominal to something else, for the class "Range"?
Please advise on these questions and anything else you think I might have missed...
Thanks in advance!!
Normalization is very important, as it influences the concept of distance used by the SVM. The two main approaches to normalization are:
Scale each input dimension to the same interval, for example [0, 1]. This is by far the most common approach. It is necessary to prevent some input dimensions from completely dominating others, and it is recommended by the LIBSVM authors in their beginner's guide (see Appendix B for examples; a Weka sketch of this approach follows the list).
Scale each instance to a given length. This is common in text mining / computer vision.
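In Weka, the first approach is what the unsupervised Normalize filter does (a minimal sketch; "power.arff" is a hypothetical filename):

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Normalize;

public class ScaleInputs {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("power.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // By default Normalize rescales every numeric attribute to [0, 1]
        // (the class attribute, if set, is left untouched)
        Normalize norm = new Normalize();
        norm.setInputFormat(data);
        Instances scaled = Filter.useFilter(data, norm);

        System.out.println(scaled.numInstances() + " instances scaled");
    }
}
```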
As to handling types of inputs:
Continuous: no work needed; the SVM works on these natively.
Ordinal: treat as continuous variables. For example, cold, lukewarm, hot can be modeled as 1, 2, 3 without imposing an unnatural structure, since the values have a natural order.
Nominal: perform one-hot encoding, i.e. for an input with N levels, generate N new binary input dimensions. This is necessary because you must avoid implicitly defining a varying distance between nominal levels. For example, modelling cat, dog, bird as 1, 2, and 3 implies that dog and bird are more similar than cat and bird, which is nonsense.
Normalization must be done after substituting inputs where necessary (i.e. after the one-hot encoding).
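In Weka terms, that ordering could look like the sketch below. It is generic: your particular dataset has no nominal inputs, but if it did ("sensors.arff" is a hypothetical file whose last attribute is the nominal class), you would encode first and normalize second:

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.NominalToBinary;
import weka.filters.unsupervised.attribute.Normalize;

public class EncodeThenScale {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("sensors.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // One-hot encode the nominal input attributes
        // (the class attribute, if set, is skipped)
        NominalToBinary oneHot = new NominalToBinary();
        oneHot.setInputFormat(data);
        Instances encoded = Filter.useFilter(data, oneHot);

        // Normalize only after the dummies exist, so every final input
        // dimension ends up on the same [0, 1] interval
        Normalize norm = new Normalize();
        norm.setInputFormat(encoded);
        Instances ready = Filter.useFilter(encoded, norm);
    }
}
```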
To answer your questions:
Should I normalize and/or standardize the Power and Voltage data values?
Yes, scale all (final) input dimensions to the same interval (including the dummies!).
Should I use a Discretization filter on anything?
No.
Should I be binning the Power/Voltage values into a lot smaller number of bins?
No. Treat them as continuous variables (e.g. one input each).
Should I make the Light value Binary instead of numeric?
No, the SVM has no concept of binary variables and treats everything as numeric, so converting it would just add an internal type conversion.
Should I normalize the Day values? Does it even make sense to do that?
If you want to use a single input dimension for it, you must normalize it just like all the others.
Should I be using the Nominal to Binary or Nominal to something else filter for the class "Range"?
Nominal to binary, using one-hot encoding.
LibSVM in WEKA isn't loading with my dataset.
I am using WEKA and LibSVM. Every time I open my dataset and then try to choose an algorithm, the LibSVM algorithm isn't enabled (the option is grayed out). But if, for instance, I load the weather.arff example dataset that comes with WEKA, then the LibSVM algorithm works...
I don't know if there is anything wrong with my dataset. Are there any limitations that I should be aware of when dealing with LibSVM? For instance, number of attributes, etc.
The strange thing is that when I run my dataset with the SMO algorithm that comes with WEKA, it works without any problem.
In my dataset I have 76 attributes and my class attribute has 100 possible values.
Am I doing anything wrong? Thanks, much appreciated.
Your dataset doesn't match the input format required by LibSVM. The capabilities are as follows:
CAPABILITIES
Class -- Nominal class, Missing class values, Binary class
Attributes -- Empty nominal attributes, Nominal attributes, Unary attributes, Binary attributes, Date attributes, Numeric attributes
Additional
min # of instances: 1
So the class in your .arff file should be nominal (binary classes and missing class values are fine), and your attributes should be nominal (including unary, binary, or empty nominal), numeric, or date. Anything outside this list, such as a string attribute, will cause the classifier to be grayed out in the GUI.
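If you want Weka to tell you exactly which part of your dataset violates these capabilities, you can ask it programmatically (a small sketch; it assumes the LibSVM wrapper package is installed and "my.arff" is a stand-in for your file):

```java
import weka.classifiers.functions.LibSVM;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CheckLibSVM {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("my.arff"); // hypothetical filename
        data.setClassIndex(data.numAttributes() - 1);

        LibSVM svm = new LibSVM();
        // Prints the same capabilities listing shown above
        System.out.println(svm.getCapabilities());
        // Throws an exception naming the first unsupported
        // attribute or class type it finds in the dataset
        svm.getCapabilities().testWithFail(data);
    }
}
```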