I want to learn a decision tree with a discrete target attribute that has 5 possible values.
However, some of the discrete input attributes have high cardinality (thousands of distinct string values), and I wonder whether it makes sense to include them. Is there any policy on the maximum cardinality an attribute may have when including it to train a decision tree?
There is no maximum cardinality, no. Of course, you could omit values that do not actually appear in the data.
You will have to use a random decision forest (RDF) implementation that handles multi-valued categorical features directly, rather than one that converts them to a series of binary indicator features.
For a categorical feature with N values there are 2^N - 2 possible decision rules on the feature, which is far too many to consider. The heuristic I have used is to compute the entropy of the target when you divide up the data by the N categorical feature values. Then order the values by that entropy and evaluate the rules you get by considering prefixes of that ordered list.
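The heuristic above might be sketched as follows (the function names are my own, not from any library):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def candidate_splits(feature_values, targets):
    """Order category values by the entropy of the target within each value,
    then yield the binary splits given by prefixes of that ordering."""
    by_value = defaultdict(list)
    for v, t in zip(feature_values, targets):
        by_value[v].append(t)
    ordered = sorted(by_value, key=lambda v: entropy(by_value[v]))
    for k in range(1, len(ordered)):
        yield set(ordered[:k])  # rule: "value in this set" vs. not

# Toy usage: values "a" and "c" have pure targets, "b" is mixed.
cats = ["a", "a", "b", "b", "c", "c"]
ys = [0, 0, 0, 1, 1, 1]
splits = list(candidate_splits(cats, ys))  # [{'a'}, {'a', 'c'}]
```

Each candidate split would then be scored by the usual information gain, so only a linear number of rules is evaluated instead of an exponential one.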
Below is a parameter for DecisionTreeClassifier: max_depth
http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
max_depth : int or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
I always thought that the depth of a decision tree should be equal to or less than the number of features (attributes) in a given dataset. What if we find pure classes before reaching the depth given for that parameter? Does it stop splitting, or does it keep splitting until it reaches the given depth?
Is it possible to use the same attribute in two different level of a decision tree while splitting?
If the number of features is very high, a decision tree can grow very large. To answer your question: yes, it will stop splitting once it finds a pure class.
This is another reason decision trees tend to overfit.
You would typically use the max_depth parameter with a Random Forest, which does not select all features for any single tree, so not every tree is expected to grow to the maximum possible depth; capping the depth acts as a form of pruning. Decision trees are weak learners, and in a Random Forest they vote together, with max_depth limiting each member. More details on the relationship between random forests and decision trees can easily be found online; a range of articles has been published.
Generally, you would use max_depth when you have a large number of features. Also, in actual implementations you would use a Random Forest rather than a decision tree alone.
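As a small illustration (using the iris data purely as a stand-in dataset), a tree stops at pure leaves well before a generous max_depth, while a forest's cap bounds every member tree:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# A generous cap: splitting still stops early once every leaf is pure.
tree = DecisionTreeClassifier(max_depth=50, random_state=0).fit(X, y)
print(tree.get_depth())  # far below 50

# In a forest, max_depth bounds every member tree.
forest = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0).fit(X, y)
print(max(est.get_depth() for est in forest.estimators_))  # at most 4
```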
Say a dataset has columns like length and width, which can be floats, and it can also have some binary elements (yes/no) or discrete numbers (categories transformed into numbers). Would it be wise to simply use all of these as features without worrying about their formats (or rather, the nature of the features)? When doing normalization, can we just normalize the discrete numbers the same way as continuous numbers? I'm really confused about dealing with multiple formats.
Yes, you can normalize discrete values, but it ought to have no real effect on learning: normalization is needed when you are doing some form of similarity measurement, which is not the case for factor variables. There are some special cases like neural networks, which are sensitive to the scale of inputs/outputs and the size of the weights (see the 'vanishing/exploding gradient' topic). It would also make sense if you are doing clustering on your data, since clustering uses some kind of distance measure, so it is better to have all features on the same scale.
There is nothing special about categorical features, except that some learning methods are especially good at using categorical features, some at using real-valued features, and some are good at both.
My first choice for a mix of categorical and real-valued features would be a tree-based method (Random Forest or a Gradient Boosting Machine); my second would be ANNs.
Also, an extremely good approach to handling factors (categorical variables) is to convert them into a set of Boolean variables. For example, if you have a factor with five levels (1, 2, 3, 4 and 5), a good way to go is to convert it into 5 features, with a 1 in the column representing the corresponding level.
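A minimal sketch of that conversion (pandas.get_dummies is one common way; scikit-learn's OneHotEncoder is another):

```python
import pandas as pd

# A factor with five levels becomes five 0/1 indicator columns.
levels = pd.Series([1, 3, 5, 2, 4, 1], name="level")
dummies = pd.get_dummies(levels, prefix="level")
print(list(dummies.columns))  # ['level_1', 'level_2', 'level_3', 'level_4', 'level_5']
```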
I have a data set of 1k records, and my job is to build a decision algorithm based on those records.
Here is what I can share:
The target is a continuous value.
Some of the predictors (or attributes) are continuous values, some are discrete, and some are arrays of discrete values (there can be more than one option per record).
My initial thought was to separate the arrays of discrete values and make them individual features (predictors). For the continuous predictors I was thinking of just randomly picking a few decision boundaries and seeing which one reduces the entropy the most. Then I would build a decision tree (or a random forest) that uses standard deviation reduction when creating the tree.
My question is: Am I on the right path? Is there a better way to do that?
I know this probably comes a bit late, but what you are searching for are model trees. Model trees are decision trees with continuous rather than categorical values in the leaves; in general these values are predicted by linear regression models. One of the more prominent model trees, and one that more or less suits your needs, is the M5 model tree introduced by Quinlan. Wang and Witten re-implemented M5 and extended its functionality so that it can handle both continuous and categorical attributes. Their version is called M5'; you can find an implementation e.g. in Weka. The only thing left would be to handle the arrays. However, your description is a bit generic in that respect. From what I gather, your choices are either flattening or, as you suggested, separating them.
Note that, since Wang and Witten's work, more sophisticated model trees have been introduced. However, M5' is robust and does not need any parameterization in its original formulation, which makes it easy to use.
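If you want a quick baseline outside Weka, here is a rough sketch (not M5', just a plain variance-reducing regression tree) of the "separate the arrays into individual features" idea using scikit-learn; the data is made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.tree import DecisionTreeRegressor

# Toy array-valued attribute: each record holds a set of discrete options.
records = [["red", "xl"], ["blue"], ["red", "s"], ["blue", "xl"]]
y = np.array([1.0, 2.0, 1.5, 2.5])  # continuous target

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(records)  # one 0/1 column per distinct option
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
```

DecisionTreeRegressor splits by variance reduction, which is the same idea as the standard-deviation reduction mentioned in the question.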
So I read a paper that said that preprocessing your dataset correctly can increase LibSVM classification accuracy dramatically. I'm using the Weka implementation and would like some help making sure my dataset is optimal.
Here are my (example) attributes:
Power Numeric (real numbers, range is from 0 to 1.5132, 9000+ unique values)
Voltage Numeric (similar to Power)
Light Numeric (0 and 1 are the only 2 possible values)
Day Numeric (1 through 20 are the possible values, equal number of each value)
Range Nominal {1,2,3,4,5} <----these are the classes
My question is: which Weka pre-processing filters should I apply to make this dataset more effective for LibSVM?
Should I normalize and/or standardize the Power and Voltage data values?
Should I use a Discretization filter on anything?
Should I be binning the Power/Voltage values into a lot smaller number of bins?
Should I make the Light value Binary instead of numeric?
Should I normalize the Day values? Does it even make sense to do that?
Should I be using the Nominal to Binary or Nominal to some thing else filter for the classes "Range"?
Please advise on these questions and anything else you think I might have missed.
Thanks in advance!!
Normalization is very important, as it influences the concept of distance which is used by SVM. The two main approaches to normalization are:
Scale each input dimension to the same interval, for example [0, 1]. This is by far the most common approach. It is necessary to prevent some input dimensions from completely dominating others, and it is recommended by the LIBSVM authors in their beginner's guide (see Appendix B for examples).
Scale each instance to a given length. This is common in text mining / computer vision.
As to handling types of inputs:
Continuous: no work needed, SVM works on these implicitly.
Ordinal: treat as continuous variables. For example cold, lukewarm, hot could be modeled as 1, 2, 3 without implicitly defining an unnatural structure.
Nominal: perform one-hot encoding, e.g. for an input with N levels, generate N new binary input dimensions. This is necessary because you must avoid implicitly defining a varying distance between nominal levels. For example, modelling cat, dog, bird as 1, 2 and 3 implies that a dog and bird are more similar than a cat and bird which is nonsense.
Normalization must be done after substituting inputs where necessary.
To answer your questions:
Should I normalize and/or standardize the Power and Voltage data
values?
Yes, scale all (final) input dimensions to the same interval (including dummies!).
Should I use a Discretization filter on anything?
No.
Should I be binning the Power/Voltage values into a lot smaller number of
bins?
No. Treat them as continuous variables (e.g. one input each).
Should I make the Light value Binary instead of numeric?
No. An SVM has no concept of binary variables; it treats everything as numeric, so converting it would just add an extra type cast internally.
Should I normalize the Day values? Does it even make sense to do
that?
If you want to use 1 input dimension, you must normalize it just like all others.
Should I be using the Nominal to Binary or Nominal to some thing else filter for the classes "Range"?
Nominal to binary, using one-hot encoding.
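Under these assumptions (hypothetical sample values standing in for your Power/Voltage/Light/Day columns), the advice above might look like this in scikit-learn rather than Weka filters:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Hypothetical rows: Power, Voltage, Light (0/1), Day (1-20); Range is the class.
X = np.array([[0.2, 1.1, 0, 3],
              [1.4, 0.9, 1, 17],
              [0.7, 1.3, 0, 8],
              [1.1, 0.5, 1, 12]])
y = np.array([1, 2, 1, 3])

pre = ColumnTransformer([
    ("scale", MinMaxScaler(), [0, 1, 3]),  # Power, Voltage, Day -> [0, 1]
    ("keep", "passthrough", [2]),          # Light is already 0/1
])
clf = make_pipeline(pre, SVC(kernel="rbf")).fit(X, y)
```

Note that the class labels need no one-hot treatment on the input side; Weka's "nominal to binary" step for the class corresponds to how multi-class SVMs internally decompose the problem.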
I'm building a binary classification tree using mutual information gain as the splitting function. Since the training data is skewed toward a few classes, it is advisable to weight each training example by the inverse class frequency.
How do I weight the training data? When calculating the probabilities to estimate the entropy, do I take weighted averages?
EDIT: I'd like an expression for entropy with the weights.
The Wikipedia article you cited goes into weighting. It says:
Weighted variants
In the traditional formulation of the mutual information,
each event or object specified by (x,y) is weighted by the corresponding probability p(x,y). This assumes that all objects or events are equivalent apart from their probability of occurrence. However, in some applications it may be the case that certain objects or events are more significant than others, or that certain patterns of association are more semantically important than others.
For example, the deterministic mapping {(1,1),(2,2),(3,3)} may be viewed as stronger (by some standard) than the deterministic mapping {(1,3),(2,1),(3,2)}, although these relationships would yield the same mutual information. This is because the mutual information is not sensitive at all to any inherent ordering in the variable values (Cronbach 1954, Coombs & Dawes 1970, Lockhead 1970), and is therefore not sensitive at all to the form of the relational mapping between the associated variables. If it is desired that the former relation — showing agreement on all variable values — be judged stronger than the latter relation, then it is possible to use the following weighted mutual information (Guiasu 1977):
I_w(X;Y) = sum over x,y of w(x,y) p(x,y) log[ p(x,y) / (p(x) p(y)) ]
which places a weight w(x,y) on the probability of each variable value co-occurrence, p(x,y). This allows that certain probabilities may carry more or less significance than others, thereby allowing the quantification of relevant holistic or prägnanz factors. In the above example, using larger relative weights for w(1,1), w(2,2), and w(3,3) would have the effect of assessing greater informativeness for the relation {(1,1),(2,2),(3,3)} than for the relation {(1,3),(2,1),(3,2)}, which may be desirable in some cases of pattern recognition, and the like.
http://en.wikipedia.org/wiki/Mutual_information#Weighted_variants
State-value weighted entropy as a measure of investment risk.
http://www56.homepage.villanova.edu/david.nawrocki/State%20Weighted%20Entropy%20Nawrocki%20Harding.pdf
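To spell out the expression the edit asks for: replace counts with total weights per class, i.e. H_w = -sum_c (W_c / W) log2(W_c / W), where W_c is the summed weight of the examples in class c. A minimal sketch (my own naming, not from any library):

```python
import math
from collections import defaultdict

def weighted_entropy(labels, weights):
    """H = -sum_c (W_c / W) * log2(W_c / W), where W_c is the total
    weight of class c and W the total weight of all examples."""
    per_class = defaultdict(float)
    for y, w in zip(labels, weights):
        per_class[y] += w
    total = sum(per_class.values())
    return -sum((wc / total) * math.log2(wc / total)
                for wc in per_class.values())

# Weights proportional to inverse class frequency turn a 3:1 skewed
# sample into a balanced one, giving 1 bit, as for a 50/50 split.
print(weighted_entropy([0, 0, 0, 1], [1, 1, 1, 3]))  # 1.0
```

The same weighted probabilities plug directly into the information-gain computation at each candidate split.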