Weighted Decision Trees using Entropy - machine-learning

I'm building a binary classification tree using mutual information gain as the splitting function. But since the training data is skewed toward a few classes, it is advisable to weight each training example by the inverse class frequency.
How do I weight the training data? When calculating the probabilities to estimate the entropy, do I take weighted averages?
EDIT: I'd like an expression for entropy with the weights.

The Wikipedia article you cited goes into weighting. It says:
Weighted variants
In the traditional formulation of the mutual information,
I(X;Y) = \sum_{y} \sum_{x} p(x,y) \log \frac{p(x,y)}{p(x) p(y)}
each event or object specified by (x,y) is weighted by the corresponding probability p(x,y). This assumes that all objects or events are equivalent apart from their probability of occurrence. However, in some applications it may be the case that certain objects or events are more significant than others, or that certain patterns of association are more semantically important than others.
For example, the deterministic mapping {(1,1),(2,2),(3,3)} may be viewed as stronger (by some standard) than the deterministic mapping {(1,3),(2,1),(3,2)}, although these relationships would yield the same mutual information. This is because the mutual information is not sensitive at all to any inherent ordering in the variable values (Cronbach 1954, Coombs & Dawes 1970, Lockhead 1970), and is therefore not sensitive at all to the form of the relational mapping between the associated variables. If it is desired that the former relation (showing agreement on all variable values) be judged stronger than the latter relation, then it is possible to use the following weighted mutual information (Guiasu 1977)
I_w(X;Y) = \sum_{y} \sum_{x} w(x,y) p(x,y) \log \frac{p(x,y)}{p(x) p(y)}
which places a weight w(x,y) on the probability of each variable value co-occurrence, p(x,y). This allows that certain probabilities may carry more or less significance than others, thereby allowing the quantification of relevant holistic or prägnanz factors. In the above example, using larger relative weights for w(1,1), w(2,2), and w(3,3) would have the effect of assessing greater informativeness for the relation {(1,1),(2,2),(3,3)} than for the relation {(1,3),(2,1),(3,2)}, which may be desirable in some cases of pattern recognition, and the like.
http://en.wikipedia.org/wiki/Mutual_information#Weighted_variants
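To make the quoted passage concrete, here is a minimal sketch of the weighted mutual information, assuming the joint distribution is given as a plain dictionary. The function name, the dictionary layout and the weight values are illustrative assumptions, not part of the article.

```python
import math

def weighted_mutual_information(joint, weights=None):
    """Sum over (x, y) of w(x,y) * p(x,y) * log(p(x,y) / (p(x) * p(y))), in nats.

    `joint` maps (x, y) pairs to probabilities p(x, y). `weights` maps the same
    pairs to w(x, y); pairs without a weight default to 1, so weights=None
    gives the ordinary mutual information.
    """
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p  # marginal p(x)
        py[y] = py.get(y, 0.0) + p  # marginal p(y)
    total = 0.0
    for (x, y), p in joint.items():
        if p > 0.0:
            w = 1.0 if weights is None else weights.get((x, y), 1.0)
            total += w * p * math.log(p / (px[x] * py[y]))
    return total

# The two deterministic mappings from the quoted example.
agree = {(1, 1): 1 / 3, (2, 2): 1 / 3, (3, 3): 1 / 3}
permuted = {(1, 3): 1 / 3, (2, 1): 1 / 3, (3, 2): 1 / 3}

# Hypothetical weights that favour agreement on the variable values.
diag_weights = {(1, 1): 2.0, (2, 2): 2.0, (3, 3): 2.0}

mi_agree = weighted_mutual_information(agree)
mi_permuted = weighted_mutual_information(permuted)
wmi_agree = weighted_mutual_information(agree, diag_weights)
wmi_permuted = weighted_mutual_information(permuted, diag_weights)
```

Without weights the two mappings score the same. With the larger diagonal weights, the agreeing mapping scores higher, which is the effect the passage describes.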

State-value weighted entropy as a measure of investment risk.
http://www56.homepage.villanova.edu/david.nawrocki/State%20Weighted%20Entropy%20Nawrocki%20Harding.pdf

Related

Difference between Generative, Discriminating and Parametric, Nonparametric Algorithm/Model

Here on SO I found the following explanation of generative and discriminative algorithms:
"A generative algorithm models how the data was generated in order to categorize a signal. It asks the question: based on my generation assumptions, which category is most likely to generate this signal?
A discriminative algorithm does not care about how the data was generated, it simply categorizes a given signal."
And here is the definition for parametric and nonparametric algorithms
"Parametric: data are drawn from a probability distribution of specific form up to unknown parameters.
Nonparametric: data are drawn from a certain unspecified probability distribution.
"
So, essentially, can we say that generative and parametric algorithms assume an underlying model, whereas discriminative and nonparametric algorithms don't assume any model?
thanks.
Say you have inputs X (probably a vector) and output Y (probably univariate). Your goal is to predict Y given X.
A generative method uses a model of the joint probability p(X,Y) to determine P(Y|X). It is thus possible given a generative model with known parameters to sample jointly from the distribution p(X,Y) to produce new samples of both input X and output Y (note they are distributed according to the assumed, not true, distribution if you do this). Contrast this to discriminative approaches which only have a model of the form p(Y|X). Thus provided with input X they can sample Y; however, they cannot sample new X.
Both assume a model. However, discriminative approaches assume only a model of how Y depends on X, not a model of X itself. Generative approaches model both. Thus, given a fixed number of parameters, you might argue (and many have) that it's easier to use them to model the thing you care about, p(Y|X), than the distribution of X, since you'll always be provided with the X for which you wish to know Y.
Useful references: this (very short) paper by Tom Minka. This seminal paper by Andrew Ng and Michael Jordan.
The distinction between parametric and non-parametric models is probably going to be harder to grasp until you have more stats experience. A parametric model has a fixed and finite number of parameters regardless of how many data points are observed. Most probability distributions are parametric: consider a variable z which is the height of people, assumed to be normally distributed. As you observe more people, your estimates of the parameters \mu and \sigma, the mean and standard deviation of z, become more accurate, but you still only have two parameters.
In contrast, the number of parameters in a non-parametric model can grow with the amount of data. Consider an induced distribution over people's heights which places a normal distribution over each observed sample, with mean given by the measurement and a fixed standard deviation. The marginal distribution over new heights is then a mixture of normal distributions, and the number of mixture components increases with each new data point. This is a non-parametric model of people's height. This specific example is called a kernel density estimator. Popular (but more complicated) non-parametric models include Gaussian Processes for regression and Dirichlet Processes.
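The kernel density estimator described above can be sketched in a few lines. The height values and the bandwidth (the fixed standard deviation) below are made-up numbers for illustration, and the function name is an assumption.

```python
import math

def gaussian_kde_pdf(x, samples, bandwidth):
    """Density at x: the average of normal densities centred on each sample."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    return sum(
        norm * math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    ) / len(samples)

heights_cm = [158.0, 165.0, 171.0, 172.0, 180.0]  # made-up observations
bandwidth_cm = 5.0  # assumed fixed standard deviation of each component

density_at_170 = gaussian_kde_pdf(170.0, heights_cm, bandwidth_cm)

# One mixture component per observation: the model grows with the data.
n_components = len(heights_cm)
```

Adding a sixth measurement adds a sixth mixture component, whereas the parametric normal model would still have only \mu and \sigma.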
A pretty good tutorial on non-parametrics can be found here, which constructs the Chinese Restaurant Process as the limit of a finite mixture model.
I don't think you can say that. For example, linear regression is a discriminative algorithm: you make an assumption about P(Y|X) and then estimate the parameters directly from the data, without making any assumption about P(X) or P(X|Y), as you would in the case of generative models. But at the same time, any inference based on linear regression, including the properties of the parameters, is parametric estimation, as there is an assumption about the behaviour of the unobserved errors.
Here I'm only talking about parametric/non-parametric. Generative/discriminative is a separate concept.
A non-parametric model means you don't make any assumptions about the distribution of your data. For example, in the real world, data will not 100% follow theoretical distributions like Gaussian, beta, Poisson, Weibull, etc. Those distributions were developed for our need to model the data.
On the other hand, parametric models try to completely explain our data using parameters. In practice, this way is preferred because it makes it easier to define how the model should behave in different circumstances (for example, we already know the derivatives/gradients of the model, what happens when we set the rate too high or too low in a Poisson, etc.).

Decision tree with high cardinality attribute

I want to learn a decision tree having a reasonable discrete target attribute with 5 possible different values.
However, there are discrete, high-cardinality input attributes (thousands of different possible string values), and I wonder whether it makes sense to include them. Is there any rule for the maximum cardinality an attribute should have when it is included in training a decision tree?
There is no maximum cardinality, no. Of course, you could omit values that do not actually appear in the data.
You will have to use an RDF implementation that handles multi-label categorical features directly rather than converting them to a series of binary indicator features.
For a categorical feature with N values there are 2^N - 2 possible decision rules on the feature, which is too many to consider by a long way. The heuristic I have used is to compute the entropy of the target when you divide up the data by the N categorical feature values. Then order the values by entropy and evaluate the N-1 rules you get by considering the non-empty proper prefixes of that list.
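A rough sketch of that heuristic, under the assumption that "evaluate" means computing the information gain of the split "value is in the prefix" versus "value is not". The function names and the toy data are illustrative only.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_prefix_split(values, labels):
    """Order category values by per-value target entropy, then try each prefix."""
    by_value = defaultdict(list)
    for v, y in zip(values, labels):
        by_value[v].append(y)
    # Ties in entropy are broken by the value's string form to stay deterministic.
    ordered = sorted(by_value, key=lambda v: (entropy(by_value[v]), str(v)))
    n = len(labels)
    base = entropy(labels)
    best_subset, best_gain = None, 0.0
    left = []
    for i, v in enumerate(ordered[:-1]):  # proper prefixes only
        left = left + by_value[v]
        right = [y for u in ordered[i + 1:] for y in by_value[u]]
        gain = (base
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best_gain:
            best_subset, best_gain = set(ordered[:i + 1]), gain
    return best_subset, best_gain

values = ["a", "a", "b", "b", "c", "c", "d", "d"]
labels = ["x", "x", "x", "y", "y", "y", "x", "y"]
subset, gain = best_prefix_split(values, labels)
```

Note that values which are pure in different classes all tie at zero entropy, so the ordering is only a heuristic and can miss the best subset.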

Information gain on non discrete dataset

Jiawei Han's book on Data Mining 2nd edition (Attribute Selection Measures - pp 297 thru 300) explains how to calculate information gain achieved by each attribute (age, income, credit_rating) and class (buys_computer yes or no).
In this example, each of the attribute values is discrete, e.g. age can be youth/middle-aged/senior, income can be high/low/medium, credit_rating can be fair/excellent, etc.
I would like to know how the same information gain can be applied to attributes which take non-discrete values. For example, the income attribute can take any currency amount, like 100.68, 120.90, etc.
If there are 1000 students, there could be 1000 different amount values.
How can we apply the same information gain over non discrete data? Any tutorial/sample example/video url would be of great help.
When your target variable is discrete (categorical), you just calculate entropy over the empirical distribution of categories in the left/right split you're considering, and compare their weighted average to the entropy without the split.
For a continuous target variable, like income, this is defined analogously as differential entropy. For your purpose you would assume that the values in your set have a normal distribution, and calculate the differential entropy accordingly. From Wikipedia:
h(X) = \frac{1}{2} \ln(2 \pi e \sigma^2)
That is, it's just a function of the variance of the values. Note that this is in nats, not bits of entropy. To compare to the Shannon entropy above, you'd have to convert, which is just a multiplication.
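As a small illustration, the sketch below assumes the standard differential entropy of a normal distribution, h = 0.5 * ln(2 * pi * e * sigma^2), with the variance estimated from the sample. The function names are made up for this example.

```python
import math

def gaussian_differential_entropy_nats(values):
    """0.5 * ln(2 * pi * e * variance), using the maximum-likelihood variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)

def nats_to_bits(h_nats):
    """Converting nats to bits is a multiplication by 1 / ln(2)."""
    return h_nats / math.log(2.0)

sample = [-1.0, 1.0]  # toy data with mean 0 and variance 1
h_nats = gaussian_differential_entropy_nats(sample)
h_bits = nats_to_bits(h_nats)
```

A sample with zero variance makes the logarithm undefined, so a real implementation would need to guard against that case.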
The most common way to split on a continuous (1-D) variable is to pick a threshold (from a discretized set of thresholds, or you can choose a prior). So you can compute the information gain for a continuous attribute by first sorting its values (you need an ordering) and then scanning them for the best threshold. http://dilekylmzr.files.wordpress.com/2011/09/data-mining_lecture9.ppt
Example of using this technique in random forests
This technique is often used in random forests (or decision trees), so I will post a few references to resources on that.
More information on random forests and this technique can be found here: http://www.cs.ubc.ca/~nando/540-2013/lectures.html . Watch the lectures on YouTube, because the slides are not very informative. The lecture describes how to match body parts using random forests in Kinect, so it is quite interesting.
Also you can look it up here : https://research.microsoft.com/pubs/145347/bodypartrecognition.pdf - the original paper being discussed in the lecture.
Note that for information gain you can also use Gaussian entropy, which basically means fitting a Gaussian to the data before and after the split.
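The sort-and-scan procedure from this answer might look like the following sketch. It assumes a discrete class label and uses midpoints between consecutive distinct values as candidate thresholds, which is one common choice rather than the only one. The income values echo the question, and the labels are invented.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Sort by value, then scan every boundary between distinct values."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best_t, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = base - (i / n) * entropy(left) - ((n - i) / n) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = (lo + hi) / 2.0, gain
    return best_t, best_gain

incomes = [100.68, 120.90, 95.10, 300.00, 250.50, 410.25]
buys_computer = ["no", "no", "no", "yes", "yes", "yes"]  # invented labels
threshold, gain = best_threshold(incomes, buys_computer)
```

This version recomputes the class counts at every boundary for clarity. Keeping running counts while scanning would avoid the repeated work on large data sets.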

Adding attributes to a training set

If I had a feature calories and another feature number of people, why does adding the feature calorie per person or adding the feature calories/10 help in improving testing? I don't see how performing simple arithmetic on two features will gain you more information.
Thanks
Consider you're using a classifier/regression mechanism which is linear (or log-linear) in the feature space. If your instance x has features x_i, then being linear means the score is something like:
y = \sum_i x_i * w_i
Now consider you think there are some important interactions between the features: maybe you think that x_i is only important if x_j takes a similar value, or their sum is more important than the individual values, or whatever. One way of incorporating this information is to have the algorithm explicitly model cross products, e.g.:
y = [ \sum_i x_i * w_i ] + [ \sum_{i,j} x_i * x_j * w_ij ]
However, linear algorithms are ubiquitous and easy to use, so a way of getting interaction-like terms into your standard linear classifier/regression mechanism is to augment the feature space so for every pair x_i, x_j you create a feature of the form [x_i * x_j] or [x_i / x_j] or whatever. Now you can model interactions between features without needing to use a non-linear algorithm.
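As a rough illustration of augmenting the feature space, the sketch below appends every pairwise product to a feature vector and then scores it with an ordinary linear function. The helper names and the weights are hypothetical.

```python
from itertools import combinations

def with_pairwise_products(x):
    """Return x extended with the product x_i * x_j for every pair i < j."""
    return list(x) + [a * b for a, b in combinations(x, 2)]

def linear_score(features, weights):
    """Plain linear score: sum of feature * weight."""
    return sum(f * w for f, w in zip(features, weights))

x = [2.0, 3.0, 5.0]
augmented = with_pairwise_products(x)  # x_1, x_2, x_3, x_1*x_2, x_1*x_3, x_2*x_3

# Hypothetical weights: the fourth weight acts on the x_1 * x_2 interaction.
weights = [0.5, -1.0, 0.0, 2.0, 0.0, 0.0]
score = linear_score(augmented, weights)
```

The scoring function is still linear in its inputs. The interaction is carried entirely by the extra features.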
Performing that type of arithmetic allows you to use that information in models that don't explicitly consider nonlinear combinations of variables. Some classifiers attempt to find features that best explain/predict the training data and often the best feature may be nonlinear.
Using your data, suppose you wanted to predict whether a group of people will - on average - gain weight. And suppose the "correct" answer is that the group will gain weight if people in the group consume over an average of 3,000 calories per day. If your inputs are group_size and group_calories, you will need to use both of those variables to make an accurate prediction. But if you also provide group_avg_calories (which is just group_calories / group_size), you could just use that single feature to make the prediction. Even if the first two features added some additional information, if you were to feed those 3 features to a decision tree classifier, it would almost certainly pick group_avg_calories as the root node and you would end up with a much simpler tree structure. There is also a downside to adding lots of arbitrary nonlinear combinations of features to your model, which is that it can add significantly to the classifier's training time.
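The calorie example can be checked with a toy experiment. The sketch below uses invented group data and a hypothetical helper that finds the best accuracy any single-threshold rule can reach on one feature.

```python
def best_stump_accuracy(feature, labels):
    """Best accuracy of any rule `feature > t`, or its negation, over all t."""
    n = len(labels)
    best = 0.0
    for t in sorted(set(feature)):
        predicted = [f > t for f in feature]
        correct = sum(p == y for p, y in zip(predicted, labels))
        best = max(best, correct / n, (n - correct) / n)
    return best

# Invented data: six groups with their size and total daily calories.
group_size = [2, 4, 10, 5, 1, 8]
group_calories = [7000, 10000, 35000, 12000, 3500, 20000]
group_avg_calories = [c / s for c, s in zip(group_calories, group_size)]

# The "correct" answer from the example: weight gain above 3,000 per person.
gains_weight = [avg > 3000 for avg in group_avg_calories]

acc_size = best_stump_accuracy(group_size, gains_weight)
acc_calories = best_stump_accuracy(group_calories, gains_weight)
acc_avg = best_stump_accuracy(group_avg_calories, gains_weight)
```

On this data a single split on group_avg_calories classifies every group correctly, while no single split on either raw feature does, so a tree restricted to the raw features would need more levels.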
With regard to calories/10, it's not clear why you would do that specifically, but normalizing the input features can improve convergence rates for some classifiers (e.g., ANNs) and can also provide better performance for clustering algorithms because the input features will all be at the same scale (i.e., distances along different feature axes are comparable).

Centroid algorithm for document classification, threshold detection

I have a collection of documents related to a particular domain and have trained a centroid classifier on that collection. I want to feed the classifier documents from different domains and determine how relevant they are to the trained domain. I can use the cosine similarity to get a numerical value, but my question is: what is the best way to determine the threshold value?
For this, I could download several documents from different domains and inspect their similarity scores to determine the threshold value. But is this the way to go, and is it statistically sound? What other approaches are there?
Actually, there is another issue with centroids of sparse vectors. The problem is that they are usually significantly less sparse than the original data. For example, this increases computation costs. And it can yield vectors that are themselves actually atypical, because they have a different sparsity pattern. This effect is similar to using arithmetic means of discrete data: say the mean number of doors in a car is 3.4; yet obviously no car exists that actually has 3.4 doors. So in particular, there will be no car with a Euclidean distance of less than 0.4 to the centroid! So how "central" is the centroid then, really?
Sometimes it helps to use medoids instead of centroids, because they actually are proper objects of your data set.
Make sure you control such effects on your data!
A simple method to try would be to employ various machine-learning algorithms - and in particular, tree-based ones - on the distances from your centroids.
As mentioned in another answer (by Anony-Mousse), this won't necessarily provide you with good or usable answers, but it just might. Using an ML framework for this procedure, e.g. WEKA, will also help you estimate your accuracy in a more rigorous manner.
Here are the steps to take, using WEKA:
1. Generate a training set by finding a decent number of documents representing each of your classes (to get valid estimates, I'd recommend at least a few dozen per class).
2. Calculate the distance from each document to each of your centroids.
3. Generate a feature vector for each such document, composed of the distances from this document to the centroids. You can either use a single feature, the distance to the nearest centroid, or use all distances if you'd like to try a more elaborate thresholding scheme. For example, if you chose the simpler method of using a single feature, the vector representing a document with a distance of 0.2 to the nearest centroid, belonging to class A, would be: "0.2,A"
4. Save this set in ARFF or CSV format, load it into WEKA, and try classifying, e.g. using a J48 tree.
The results would provide you with an overall accuracy estimation, with a detailed confusion matrix, and - of course - with a specific model, e.g. a tree, you can use for classifying additional documents.
These results can be used to iteratively improve the models and thresholds by collecting additional train documents for problematic classes, either by recreating the centroids or by retraining the thresholds classifier.
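The distance and feature-vector steps above could be prepared with a short script before loading the file into WEKA. This is only a sketch: it assumes documents are already dense term vectors, uses cosine distance (1 minus cosine similarity) as the distance, and all names, vectors and column headers are invented.

```python
import math

def centroid(vectors):
    """Component-wise mean of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(a, b):
    """1 - cosine similarity, clamped at 0 to absorb rounding error."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat an empty document as maximally distant
    return max(0.0, 1.0 - dot / (norm_a * norm_b))

def distance_rows(labelled_docs, centroids):
    """One CSV row per document: its distance to each centroid, then its class."""
    rows = []
    for vector, label in labelled_docs:
        distances = [cosine_distance(vector, c) for c in centroids]
        rows.append(",".join(f"{d:.3f}" for d in distances) + "," + label)
    return rows

centroid_a = centroid([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
centroid_b = centroid([[0.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
docs = [([2.0, 1.0, 0.0], "A"), ([0.0, 1.0, 3.0], "B")]

rows = distance_rows(docs, [centroid_a, centroid_b])
csv_text = "dist_a,dist_b,class\n" + "\n".join(rows)
```

Writing csv_text to a file gives a table with one distance column per centroid plus the class column, which corresponds to the "all distances" variant of the feature vector.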
