Why Information gain feature selection gives zero scores - machine-learning

I have a dataset in which I used the Information gain feature selection method in WEKA to get the important features. Below is the output I got.
Ranked attributes:
0.97095 1 Opponent
0.41997 11 Field_Goals_Made
0.38534 24 Opp_Free_Throws_Made
0.00485 4 Home
0 8 Field_Goals_Att
0 12 Opp_Total_Rebounds
0 10 Def_Rebounds
0 9 Total_Rebounds
0 6 Opp_Field_Goals_Made
0 7 Off_Rebounds
0 14 Opp_3Pt_Field_Goals_Made
0 2 Fouls
0 3 Opp_Blocks
0 5 Opp_Fouls
0 13 Opp_3Pt_Field_Goals_Att
0 29 3Pt_Field_Goal_Pct
0 28 3Pt_Field_Goals_Made
0 22 3Pt_Field_Goals_Att
0 25 Free_Throws_Made
Which tells me that all features with score 0 can be ignored, is it correct?
Now when I tried the Wrapper subset evaluation in WEKA, I got selected attribute which were ignored in info gain method (i.e whose score was 0). Below is the output
Selected attributes: 3,8,9,11,24,25 : 6
Opp_Blocks
Field_Goals_Att
Total_Rebounds
Field_Goals_Made
Opp_Free_Throws_Made
Free_Throws_Made
I want to understand, what is the reason that the attributes ignored by info gain are considered strongly by wrapper subset evaluation method?

To understand what's happening, it helps to understand first what the two feature selection methods are doing.
The information gain of an attribute tells you how much information with respect to the classification target the attribute gives you. That is, it measures the difference in information between the cases where you know the value of the attribute and where you don't know the value of the attribute. A common measure for the information is Shannon entropy, although any measure that allows to quantify the information content of a message will do.
So the information gain depends on two things: how much information was available before knowing the attribute value, and how much was available after. For example, if your data contains only one class, you already know what the class is without having seen any attribute values and the information gain will always be 0. If, on the other hand, you have no information to start with (because the classes you want to predict are represented in equal quantities in your data), and an attribute splits the data perfectly into the classes, its information gain will be 1.
The important thing to note in this context is that the information gain is a purely information-theoretic measure, it does not consider any actual classification algorithms.
This is what the wrapper method does differently. Instead of analyzing the attributes and targets from an information-theoretic point of view, it uses an actual classification algorithm to build a model with a subset of the attributes and then evaluates the performance of this model. It then tries a different subset of attributes and does the same thing again. The subset for which the trained model exhibits the best empirical performance wins.
There are a number of reasons why the two methods would give you different results (this list is not exhaustive):
A classification algorithm may not be able to leverage all the information that the attributes can provide.
A classification algorithm may implement its own attribute selection internally (for example decision tree/forest learners do this) that considers a smaller subset than attribute selection will yield.
Individual attributes may not be informative, but combinations of them may be (for example perhaps a and b has no information separately, but a*b on the other hand, might). Attribute selection will not discover this because it evaluates attributes in isolation, while a classification algorithm may be able to leverage this.
Attribute selection does not consider the attributes sequentially. Decision trees for example use a sequence of attributes and while b may provide information on its own, it may not provide any information in addition to a, which is used higher up in the tree. Therefore b would appear useful when evaluated according to information gain, but is not used by a tree that "knows" a first.
In practice it's usually a better idea to use a wrapper for attribute selection as it takes the performance of the actual classifier you want to use into account, and different classifier vary widely in usage of information. The advantage of classifier-agnostic measures like information gain is that they are much cheaper to compute.

In filter technique(Info gain here), Features are considered in isolation from one another hence when individually considered IG is 0
But in certain cases one feature needs another feature to
boost accuracy and hence when considered together with other feature it produces predictive value.
Hope this helps and on time :)

Related

How to evaluate mean reciprocal rank(mrr) is a good model

such as AUC have a metrics
a good model will be over 0.7
great one will be over 0.85.
I want to know mean reciprocal rank(mrr) metrics evaluation.
how to define this is a good model.
very thanks!!
The metric MRR take values from 0 (worst) to 1 (best), as described here. However, the definition of a good (or acceptable) MRR depends on your use case. For example, if you build a model to be used in a recommender system, and from thousands of possible items, recommend a set of five items to users, then an MRR of 0.2 could be defined as acceptable. This means that on average, the correct item the user bought was part of the top 5 items, predicted by your model.
All in all, it mostly depends on how many possible classes are possible to predict, as well as your use case.

How do I interpret this output from sklearn.tree.export_graphviz?

I am working on analyzing grade data. As a new way to look at the data I am using a decision tree, for the first time. I believe I have the code right and now I am trying to interpret it. The features are grades gotten for a series of quizzes, and the classification is the final grade the student received. I have a few questions:
If my understanding is correct, each node has a test and a left branch representing the test being true, and the other for false. And when the tree seems to have asked enough questions, it says what the "class" is. If that is the case, how come there's a class= on boxes well before the leaves? I would have thought that just leaves have a class=
How do I "tune" the overall tree? It seems to have too many boxes. Is this an example of "overfitting"? How can I tune that better?
For example, the use of FINAL_GRADE_PA01 seems arbitrary to be based on the ordering of the data. Is that true or did the analysis actually conclude that that feature was the best discriminator?
If I'm not mistaken, those class values indicate what the model would have predicted, had it stopped branching on that node. It still stores those values, but it doesn't use them if there's a branching from that node.
About the number of nodes, as you see in the docs:
The default values for the parameters controlling the size of the
trees (e.g. max_depth, min_samples_leaf, etc.) lead to fully grown and
unpruned trees which can potentially be very large on some data sets.
To reduce memory consumption, the complexity and size of the trees
should be controlled by setting those parameter values.
There are several parameters which you can use to reduce the complexity of your model. The following two parameters are just an example:
max_leaf_nodes : int or None, optional (default=None)
Grow a tree with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited
number of leaf nodes.
min_impurity_decrease : float, optional (default=0.)
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.

How to classify text with Knime

I'm trying to classify some data using knime with knime-labs deep learning plugin.
I have about 16.000 products in my DB, but I have about 700 of then that I know its category.
I'm trying to classify as much as possible using some DM (data mining) technique. I've downloaded some plugins to knime, now I have some deep learning tools as some text tools.
Here is my workflow, I'll use it to explain what I'm doing:
I'm transforming the product name into vector, than applying into it.
After I train a DL4J learner with DeepMLP. (I'm not really understand it all, it was the one that I thought I got the best results). Than I try to apply the model in the same data set.
I thought I would get the result with the predicted classes. But I'm getting a column with output_activations that looks that gets a pair of doubles. when sorting this column I get some related date close to each other. But I was expecting to get the classes.
Here is a print of the result table, here you can see the output with the input.
In columns selection it's getting just the converted_document and selected des_categoria as Label Column (learning node config). And in Predictor node I checked the "Append SoftMax Predicted Label?"
The nom_produto is the text column that I'm trying to use to predict the des_categoria column that it the product category.
I'm really newbie about DM and DL. If you could get me some help to solve what I'm trying to do would be awesome. Also be free to suggest some learning material about what attempting to achieve
PS: I also tried to apply it into the unclassified data (17,000 products), but I got the same result.
I won't answer with a workflow on this one because it is not going to be a simple one. However, be sure to find the text mining example on the KNIME server, i.e. the one that makes use of the bag of words approach.
The task
Product mapping to categories should be a straight-forward data mining task because the information that explains the target variable is available in a quasi-exhaustive manner. Depending on the number of categories to train though, there is a risk that you might need more than 700 instances to learn from.
Some resources
Here are some resources, only the first one being truly specialised in text mining:
Introduction on Information Retrieval, in particular chapter 13;
Data Science for Business is an excellent introduction to data mining, including text mining (chapter 10), also do not forget the chapter about similarity (chapter 6);
Machine Learning with R has the advantage of being accessible enough (chapter 4 provides an example of text classification with R code).
Preprocessing
First, you will have to preprocess your product labels a bit. Use KNIME's text analytics preprocessing nodes for that purpose, that is after you've transformed the product labels with Strings to Document:
Case Convert, Punctuation Erasure and Snowball Stemmer;
you probably won't need Stop Word Filter, however, there may be quasi-stop words such as "product", which you may need to remove manually with Dictionary Filter;
Be careful not to use any of the following without testing testing their impact first: N Chars Filter (g may be a useful word), Number Filter (numbers may indicate quantities, which may be useful for classification).
Should you encounter any trouble with the relevant nodes (e.g. Punctuation Erasure can be tricky amazingly thanks to the tokenizer), you can always apply String Manipulation with regex before converting the Strings to Document.
Keep it short and simple: the lookup table
You could build a lookup table based on the 700 training instances. The book Data mining techniques as well as resource (2) present this approach in some detail. If any model performs any worse than the lookup table, you should abandon the model.
Nearest neighbors
Neural networks are probably overkill for this task.
Start with a K Nearest Neighbor node (applying a string distance such as Cosine, Levensthein or Jaro-Winkler). This approach requires the least amount of data wrangling. At the very least, it will provide an excellent baseline model, so it is most definitely worth a shot.
You'll need to tune the parameter k and to experiment with the distance types. The Parameter Optimization Loop pair will help you with optimizing k, you can include a Cross-Validation meta node inside of the said loop to obtain an estimate of the expected performance given k instead of only one point estimate per value of k. Use Cohen's Kappa as an optimization criterion, as proposed by the resource number (3) and available via the Scorer node.
After the parameter tuning, you'll have to evaluate the relevance of your model using yet another Cross-Validation meta node, then follow up with a Loop pair including Scorer to calculate the descriptives on performance metric(s) per iteration, finally use Statistics. Kappa is a convenient metric for this task because the target variable consists of many product categories.
Don't forget to test its performance against the lookup table.
What next ?
Should lookup table or k-nn work well for you, then there's nothing else to add.
Should any of those approaches fail, you might want to analyse the precise cases on which it fails. In addition, training set size may be too low, so you could manually classify another few hundred or thousand instances.
If after increasing the training set size, you are still dealing with a bad model, you can try the bag of words approach together with a Naive Bayes classifier (see chapter 13 of the Information Retrieval reference). There is no room here to elaborate on the bag of words approach and Naive Bayes but you'll find the resources here above useful for that purpose.
One last note. Personally, I find KNIME's Naive Bayes node to perform poorly, probably because it does not implement Laplace smoothening. However, KNIME's R Learner and R Predictor nodes will allow you to use R's e1071 package, as demonstrated by resource (3).

What type of ML is this? Algorithm to repeatedly choose 1 correct candidate from a pool (or none)

I have a set of 3-5 black box scoring functions that assign positive real value scores to candidates.
Each is decent at ranking the best candidate highest, but they don't always agree--I'd like to find how to combine the scores together for an optimal meta-score such that, among a pool of candidates, the one with the highest meta-score is usually the actual correct candidate.
So they are plain R^n vectors, but each dimension individually tends to have higher value for correct candidates. Naively I could just multiply the components, but I hope there's something more subtle to benefit from.
If the highest score is too low (or perhaps the two highest are too close), I just give up and say 'none'.
So for each trial, my input is a set of these score-vectors, and the output is which vector corresponds to the actual right answer, or 'none'. This is kind of like tech interviewing where a pool of candidates are interviewed by a few people who might have differing opinions but in general each tend to prefer the best candidate. My own application has an objective best candidate.
I'd like to maximize correct answers and minimize false positives.
More concretely, my training data might look like many instances of
{[0.2, 0.45, 1.37], [5.9, 0.02, 2], ...} -> i
where i is the ith candidate vector in the input set.
So I'd like to learn a function that tends to maximize the actual best candidate's score vector from the input. There are no degrees of bestness. It's binary right or wrong. However, it doesn't seem like traditional binary classification because among an input set of vectors, there can be at most 1 "classified" as right, the rest are wrong.
Thanks
Your problem doesn't exactly belong in the machine learning category. The multiplication method might work better. You can also try different statistical models for your output function.
ML, and more specifically classification, problems need training data from which your network can learn any existing patterns in the data and use them to assign a particular class to an input vector.
If you really want to use classification then I think your problem can fit into the category of OnevsAll classification. You will need a network (or just a single output layer) with number of cells/sigmoid units equal to your number of candidates (each representing one). Note, here your number of candidates will be fixed.
You can use your entire candidate vector as input to all the cells of your network. The output can be specified using one-hot encoding i.e. 00100 if your candidate no. 3 was the actual correct candidate and in case of no correct candidate output will be 00000.
For this to work, you will need a big data set containing your candidate vectors and corresponding actual correct candidate. For this data you will either need a function (again like multiplication) or you can assign the outputs yourself, in which case the system will learn how you classify the output given different inputs and will classify new data in the same way as you did. This way, it will maximize the number of correct outputs but the definition of correct here will be how you classify the training data.
You can also use a different type of output where each cell of output layer corresponds to your scoring functions and 00001 means that the candidate your 5th scoring function selected was the right one. This way your candidates will not have to be fixed. But again, you will have to manually set the outputs of the training data for your network to learn it.
OnevsAll is a classification technique where there are multiple cells in the output layer and each perform binary classification in between one of the classes vs all others. At the end the sigmoid with the highest probability is assigned 1 and rest zero.
Once your system has learned how you classify data through your training data, you can feed your new data in and it will give you output in the same way i.e. 01000 etc.
I hope my answer was able to help you.:)

Is there no information gain using entropy on 2 classes?

I have a very random population I'm trying to split using binary decision tree.
Population probability
TRUE 51%
FALSE 49%
So the entropy is 1 (rounded to 3). So for any feature the entropy will also be 1 (the same), and thus no information gain.
Am I doing this right? In my process to learn it I haven't come across anything saying that entropy is useless for 2 classes
The entropy/information gain doesn't so much depend on the distribution of the classes, but on the information contained in the features that are used to characterise the instances in your data set. If, for example, you had a feature that was always 1 for the TRUE class and always 2 for the FALSE class, it would have the highest information gain because it allows you to separate these two classes perfectly.
If the information gain you're getting is very small, it indicates that the information contained in the features is not useful for separating your classes. In this case, you need to find more informative features.

Resources