Information gain on a non-discrete dataset - machine-learning

Jiawei Han's book Data Mining: Concepts and Techniques, 2nd edition (Attribute Selection Measures, pp. 297-300) explains how to calculate the information gain achieved by each attribute (age, income, credit_rating) with respect to the class attribute (buys_computer: yes or no).
In this example, every attribute value is discrete: age can be youth/middle-aged/senior, income can be high/low/medium, credit_rating can be fair/excellent, and so on.
I would like to know how the same information gain can be applied to attributes that take non-discrete (continuous) values, e.g. an income attribute that takes arbitrary currency amounts such as 100.68, 120.90, and so on.
If there are 1000 students, there could be 1000 different amounts.
How can we apply the same information gain to non-discrete data? Any tutorial, worked example, or video URL would be of great help.

When your target variable is discrete (categorical), you just calculate entropy over the empirical distribution of categories in the left/right split you're considering, and compare their weighted average to the entropy without the split.
For a continuous target variable, like income, the analogous quantity is the differential entropy. For your purpose you would assume that the values in your set follow a normal distribution and calculate the differential entropy accordingly. From Wikipedia, the differential entropy of a normal distribution with variance \sigma^2 is
h(X) = \frac{1}{2} \ln(2 \pi e \sigma^2)
That is, it is just a function of the variance of the values. Note that this is in nats, not bits of entropy. To compare it to the Shannon entropy above, you would have to convert, which is just a multiplication (dividing by \ln 2 gives bits).
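As a quick illustration of that calculation, here is a minimal sketch (assuming NumPy; the function name and the toy income values are made up) that computes the Gaussian differential entropy from the sample variance and optionally converts nats to bits:

import numpy as np

def gaussian_differential_entropy(values, bits=False):
    # Differential entropy under a normality assumption:
    #   h = 0.5 * ln(2 * pi * e * sigma^2)   (in nats)
    # Dividing by ln(2) converts nats to bits.
    sigma2 = np.var(values)                  # sample variance
    h_nats = 0.5 * np.log(2 * np.pi * np.e * sigma2)
    return h_nats / np.log(2) if bits else h_nats

incomes = np.array([100.68, 120.90, 98.50, 143.20, 110.75])   # toy data
print(gaussian_differential_entropy(incomes, bits=True))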

The most common way to split on a continuous (1-d) variable is to pick a threshold (taken from a discretized set of candidate thresholds, or from a prior of your choosing). So you can compute the information gain for a continuous attribute by first sorting its values (you need an ordering) and then scanning the candidate thresholds for the one that gives the best split, as in the sketch below. http://dilekylmzr.files.wordpress.com/2011/09/data-mining_lecture9.ppt
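For concreteness, here is a minimal sketch of that sort-and-scan procedure (plain NumPy; candidate thresholds are midpoints between consecutive distinct sorted values, and the function names and toy data are just for illustration):

import numpy as np

def entropy(labels):
    # Shannon entropy (in bits) of a vector of class labels.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(values, labels):
    # Sort by the continuous attribute, then scan midpoints between
    # consecutive distinct values and keep the split with the highest gain.
    order = np.argsort(values)
    values = np.asarray(values, dtype=float)[order]
    labels = np.asarray(labels)[order]
    base = entropy(labels)
    best_gain, best_t = 0.0, None
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue
        t = (values[i] + values[i - 1]) / 2.0
        left, right = labels[:i], labels[i:]
        gain = base - (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

income = [100.68, 120.90, 99.10, 150.00, 87.30, 132.45]   # toy continuous attribute
buys   = ["no",   "yes",  "no",  "yes",  "no",  "yes"]    # toy class labels
print(best_threshold(income, buys))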
Example of using this technique in random forests
This technique is often used in random forests (and decision trees), so I will post a few references to resources on that.
More information on random forests and this technique can be found here: http://www.cs.ubc.ca/~nando/540-2013/lectures.html . Watch the lectures on YouTube, because the slides on their own are not very informative. The lecture describes how body parts are matched using random forests in Kinect, so it is quite interesting.
You can also look at the original paper discussed in the lecture: https://research.microsoft.com/pubs/145347/bodypartrecognition.pdf
Note that for information gain you can also use Gaussian entropy, which basically amounts to fitting a Gaussian to the data before and after the split.

Related

How to Intelligently Sample Parameter Space while Training a Statistical Classifier

I'm interested in a statistical classification problem. Given a feature vector X, I would like to classify X as either "yes" or "no". However, the training data will be fed in real-time based on human input. For instance, if the user sees feature vector X, the user will assign "yes" or "no" based on their expertise.
Rather than doing grid search on parameter space, I would like to more intelligently explore the parameter space based on the previously submitted data. For example, if there is a dense cluster of "no's" in part of the parameter space, it probably doesn't make sense to keep sampling there - it's probably just going to be more "no's".
How can I go about doing this? The C4.5 algorithm seems to be along these lines, but I'm unsure if it is the way to go.
An additional subtlety is that some of the features may specify random data. Suppose that the first two attributes in the feature vector specify the mean and variance of a Gaussian distribution. The data the user classifies could then be significantly different, even if all parameters are held equal.
For example, let's say the algorithm displays a sine wave with gaussian noise added, where the gaussian distribution is specified by the mean and variance in the feature vector. The user is asked "does this graph represent a sine wave?" Two very similar values in mean or variance could still have significantly different graphs.
Is there an algorithm designed to handle such cases?
The setting that you're talking about fits in the broad area of Active Learning. This topic addresses the iterative process of model building, and choosing which training examples to query next in order to optimize model performance. Here, the training cost of each data point is roughly the same, and there are no additional variable rewards in the learning phase.
However, in each iteration, if you have a variable reward which is a function of the data point chosen, you would want to look at Multi-Armed Bandits and Reinforcement Learning.
The other issue that you're talking about is one of finding the right features to represent your data points, and should be handled separately.

Feature extraction for multiple sub-features

I would like to conduct some feature extraction (or clustering) for a dataset containing sub-features.
For example, the dataset looks like the one below. The goal is to classify the type of robot using the data.
Samples : 100 robot samples [Robot 1, Robot 2, ..., Robot 100]
Classes : 2 types [Type A, Type B]
Variables : 6 parts, and 3 sub-features for each part (18 variables in total)
[Part1_weight, Part1_size, Part1_strength, ..., Part6_size, Part6_strength, Part6_weight]
I want to conduct feature extraction on [weight, size, strength] and use the extracted feature as a representative value for each part.
In short, my aim is to reduce the features to 6 - [Part1_total, Part2_total, ..., Part6_total] - and then classify the type of robot with those 6 features. So the problem to solve is how to make a combined feature out of 'weight', 'size', and 'strength'.
First I thought of applying PCA (Principal Component Analysis), because it is one of the most popular feature extraction algorithms. But it treats all 18 features individually, so 'Part1_weight' can end up being considered more important than 'Part2_weight'. What I need to know, however, is the importance of 'weight', 'size', and 'strength' across samples, so plain PCA does not seem applicable.
Is there a suggested way to solve this problem?
If you want to have exactly one feature per part I see no other way than performing the feature reduction part-wise. However, there might be better choices than simple PCA. For example, if the parts are mostly solid, their weight is likely to correlate with the third power of the size, so you could take the cubic root of the weight or the cube of the size before performing the PCA. Alternatively, you can take a logarithm of both values, which again results in a linear dependency.
Of course, there are many more fancy transformations you could use. In statistics, the Box-Cox Transformation is used to achieve a normal-looking distribution of the data.
You should also consider normalising the transformed data before performing the PCA, i.e. subtracting the mean and dividing by the standard deviation of each variable. This removes the influence of the units of measurement, so it won't matter whether you measure weight in kilograms, atomic units, or solar masses.
If the part number makes parts different from one another (e.g. Part1 differs from Part2 even if their size, weight, and strength parameters are identical), you can run one PCA per part, using only the current part's size, weight, and strength as its inputs.
Alternatively, if the order of the parts doesn't matter, you can run a single PCA over all (size, weight, strength) triples, without distinguishing them by part number.
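To make the part-wise idea from both answers concrete, here is a minimal sketch (not the answerers' code) using pandas and scikit-learn, assuming columns named Part{i}_weight / Part{i}_size / Part{i}_strength; the random toy data is only there to make it runnable:

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# toy data frame standing in for the 100 robots x 18 variables described above
rng = np.random.default_rng(0)
cols = [f"Part{i}_{f}" for i in range(1, 7) for f in ("weight", "size", "strength")]
df = pd.DataFrame(rng.normal(size=(100, 18)), columns=cols)

reduced = pd.DataFrame(index=df.index)
for i in range(1, 7):
    block = df[[f"Part{i}_weight", f"Part{i}_size", f"Part{i}_strength"]]
    scaled = StandardScaler().fit_transform(block)   # remove unit-of-measurement effects
    # first principal component of this part's three sub-features
    reduced[f"Part{i}_total"] = PCA(n_components=1).fit_transform(scaled).ravel()

print(reduced.shape)   # (100, 6): one combined feature per part

Any nonlinear transformation mentioned above (cube root of the weight, logarithms, Box-Cox) would be applied to the block before the scaling step.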

Adding attributes to a training set

If I had a feature calories and another feature number of people, why does adding the feature calories per person, or the feature calories/10, help improve test performance? I don't see how performing simple arithmetic on two features gains you more information.
Thanks
Suppose you're using a classifier/regression mechanism that is linear (or log-linear) in the feature space. If your instance x has features x_i, then being linear means the score is something like:
y = \sum_i w_i x_i
Now suppose you think there are some important interactions between the features: maybe x_i is only important if x_j takes a similar value, or their sum is more important than the individual values, or whatever. One way of incorporating this information is to have the algorithm explicitly model cross products, e.g.:
y = \sum_i w_i x_i + \sum_{i,j} w_{ij} x_i x_j
However, linear algorithms are ubiquitous and easy to use, so a way of getting interaction-like terms into your standard linear classifier/regression mechanism is to augment the feature space: for every pair x_i, x_j you create a feature of the form [x_i * x_j] or [x_i / x_j] or whatever. Now you can model interactions between features without needing a non-linear algorithm.
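For illustration, here is a minimal sketch of that augmentation (scikit-learn's PolynomialFeatures adds the pairwise products x_i * x_j; the toy data and the choice of logistic regression are assumptions, not part of the question):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 3)                       # three original features x_i
y = (X[:, 0] * X[:, 1] > 0.25).astype(int)       # target driven by an interaction

# augment the feature space with all pairwise products x_i * x_j
X_aug = PolynomialFeatures(degree=2, interaction_only=True,
                           include_bias=False).fit_transform(X)

# an ordinary linear classifier can now pick up the x_0 * x_1 interaction
clf = LogisticRegression(max_iter=1000).fit(X_aug, y)
print(clf.score(X_aug, y))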
Performing that type of arithmetic allows you to use that information in models that don't explicitly consider nonlinear combinations of variables. Some classifiers attempt to find features that best explain/predict the training data and often the best feature may be nonlinear.
Using your data, suppose you wanted to predict whether a group of people will - on average - gain weight. And suppose the "correct" answer is that the group will gain weight if people in the group consume over an average of 3,000 calories per day. If your inputs are group_size and group_calories, you will need to use both of those variables to make an accurate prediction. But if you also provide group_avg_calories (which is just group_calories / group_size), you could just use that single feature to make the prediction. Even if the first two features added some additional information, if you were to feed those 3 features to a decision tree classifier, it would almost certainly pick group_avg_calories as the root node and you would end up with a much simpler tree structure. There is also a downside to adding lots of arbitrary nonlinear combinations of features to your model, which is that it can add significantly to the classifier's training time.
With regard to calories/10, it's not clear why you would do that specifically, but normalizing the input features can improve convergence rates for some classifiers (e.g., ANNs) and can also provide better performance for clustering algorithms because the input features will all be at the same scale (i.e., distances along different feature axes are comparable).
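A tiny sketch of the calories-per-person example (the pandas column names and numbers are hypothetical), deriving the ratio feature and rescaling everything to a common scale:

import pandas as pd

df = pd.DataFrame({
    "group_calories": [45000.0, 90000.0, 24000.0],
    "group_size":     [20,       25,      10],
})

# derived ratio feature: the quantity the prediction actually depends on
df["group_avg_calories"] = df["group_calories"] / df["group_size"]

# simple mean/std rescaling so distances along different feature axes are comparable
scaled = (df - df.mean()) / df.std()
print(df, scaled, sep="\n\n")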

Centroid algorithm for document classification, threshold detection

I have a collection of documents related to a particular domain and have trained the centroid classifier based on that collection. What I want to do is, I will be feeding the classifier with documents from different domains and want to determine how much they are relevant to the trained domain. I can use the cosine similarity for this to get a numerical value but my question is what is the best way to determine the threshold value?
For this, I could download several documents from different domains and inspect their similarity scores to determine the threshold value. But is this the way to go, and is it statistically sound? What other approaches are there?
Actually there is another issue with centroids of sparse vectors: they are usually significantly less sparse than the original data. For example, this increases computation costs, and it can yield vectors that are themselves atypical because they have a different sparsity pattern. The effect is similar to using the arithmetic mean of discrete data: say the mean number of doors in a car is 3.4; obviously no car actually has 3.4 doors. In particular, there will be no car with a Euclidean distance of less than 0.4 to the centroid - so how "central" is the centroid then, really?
Sometimes it helps to use medoids instead of centroids, because they actually are proper objects of your data set.
Make sure you control such effects on your data!
A simple method to try would be to employ various machine-learning algorithms - and in particular, tree-based ones - on the distances from your centroids.
As mentioned in another answer (by Anony-Mousse), this won't necessarily provide you with good or usable answers, but it just might. Using an ML framework for this procedure, e.g. WEKA, will also help you estimate your accuracy in a more rigorous manner.
Here are the steps to take, using WEKA:
Generate a training set by finding a decent number of documents representing each of your classes (to get valid estimations, I'd recommend at least a few dozen per class).
Calculate the distance from each document to each of your centroids.
Generate a feature vector for each such document, composed of the distances from this document to the centroids. You can either use a single feature - the distance to the nearest centroid; or use all distances, if you'd like to try a more elaborate thresholding scheme. For example, if you chose the simpler method of using a single feature, the vector representing a document with a distance of 0.2 to the nearest centroid, belonging to class A would be: "0.2,A"
Save this set in ARFF or CSV format, load into WEKA, and try classifying, e.g. using a J48 tree.
The results would provide you with an overall accuracy estimation, with a detailed confusion matrix, and - of course - with a specific model, e.g. a tree, you can use for classifying additional documents.
These results can be used to iteratively improve the models and thresholds by collecting additional train documents for problematic classes, either by recreating the centroids or by retraining the thresholds classifier.
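As a rough sketch of steps 2-4 above (not WEKA itself), assuming tf-idf style document vectors, one centroid per trained class, and known labels for the training documents - the arrays and the file name below are placeholders:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_distances

docs      = np.random.rand(60, 300)               # 60 training documents, 300 tf-idf terms
centroids = np.random.rand(3, 300)                # one centroid per trained class
labels    = np.random.choice(["A", "B", "C"], size=60)

# step 2: distance from every document to every centroid
dist = cosine_distances(docs, centroids)

# step 3: one feature per centroid distance, plus the known class label
features = pd.DataFrame(dist, columns=["dist_A", "dist_B", "dist_C"])
features["class"] = labels

# step 4: save as CSV, which WEKA can load and feed to e.g. a J48 tree
features.to_csv("centroid_distances.csv", index=False)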

Weighted Decision Trees using Entropy

I'm building a binary classification tree using mutual information gain as the splitting function. But since the training data is skewed toward a few classes, it is advisable to weight each training example by the inverse class frequency.
How do I weight the training data? When calculating the probabilities to estimate the entropy, do I take weighted averages?
EDIT: I'd like an expression for entropy with the weights.
The Wikipedia article you cited goes into weighting. It says:
Weighted variants
In the traditional formulation of the mutual information,
each event or object specified by (x,y) is weighted by the corresponding probability p(x,y). This assumes that all objects or events are equivalent apart from their probability of occurrence. However, in some applications it may be the case that certain objects or events are more significant than others, or that certain patterns of association are more semantically important than others.
For example, the deterministic mapping {(1,1),(2,2),(3,3)} may be viewed as stronger (by some standard) than the deterministic mapping {(1,3),(2,1),(3,2)}, although these relationships would yield the same mutual information. This is because the mutual information is not sensitive at all to any inherent ordering in the variable values (Cronbach 1954, Coombs & Dawes 1970, Lockhead 1970), and is therefore not sensitive at all to the form of the relational mapping between the associated variables. If it is desired that the former relation, showing agreement on all variable values, be judged stronger than the latter relation, then it is possible to use the following weighted mutual information (Guiasu 1977):
I_w(X;Y) = \sum_{x,y} w(x,y) \, p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
which places a weight w(x,y) on the probability of each variable value co-occurrence, p(x,y). This allows that certain probabilities may carry more or less significance than others, thereby allowing the quantification of relevant holistic or prägnanz factors. In the above example, using larger relative weights for w(1,1), w(2,2), and w(3,3) would have the effect of assessing greater informativeness for the relation {(1,1),(2,2),(3,3)} than for the relation {(1,3),(2,1),(3,2)}, which may be desirable in some cases of pattern recognition, and the like.
http://en.wikipedia.org/wiki/Mutual_information#Weighted_variants
State-value weighted entropy as a measure of investment risk.
http://www56.homepage.villanova.edu/david.nawrocki/State%20Weighted%20Entropy%20Nawrocki%20Harding.pdf
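If a concrete expression helps: replace counts with sums of example weights, i.e. p_c = (sum of weights of examples in class c) / (total weight), and plug these p_c into the usual entropy formula H = -\sum_c p_c \log_2 p_c. A minimal sketch (NumPy, with inverse-class-frequency weights as in the question; not taken from the cited sources):

import numpy as np

def weighted_entropy(labels, weights):
    # Each example contributes its weight instead of a unit count:
    #   p_c = (sum of weights of class c) / (sum of all weights)
    #   H   = -sum_c p_c * log2(p_c)
    labels = np.asarray(labels)
    weights = np.asarray(weights, dtype=float)
    total = weights.sum()
    probs = np.array([weights[labels == c].sum() / total
                      for c in np.unique(labels)])
    return -np.sum(probs * np.log2(probs))

# skewed binary labels, weighted by inverse class frequency
y = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
freq = {c: np.mean(y == c) for c in np.unique(y)}
w = np.array([1.0 / freq[c] for c in y])
print(weighted_entropy(y, w))   # exactly 1 bit: the weights rebalance the two classes

The information gain of a split is then computed the same way, weighting each child node's entropy by its total weight rather than its example count.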
