Convert Decision Table To Decision Tree - machine-learning

How to convert or visualize decision table to decision tree graph,
is there an algorithm to solve it, or a software to visualize it?
For example, I want to visualize my decision table below:
http://i.stack.imgur.com/Qe2Pw.jpg

Gotta say that is an interesting question.
I don't know the definitive answer, but I'd propose such a method:
use Karnaugh map to turn your decision table to minimized boolean function
turn your function into a tree
Lets simplyify an example, and assume that using Karnaugh got you function (a and b) or c or d. You can turn that into a tree as:
Source: my own

It certainly is easier to generate a decision table from a decision tree, not the other way around.
But the way I see it you could convert your decision table to a data set. Let the 'Disease' be the class attribute and treat the evidence as simple binary instance attributes. From that you can easily generate a decision tree using one of available decision tree induction algorithms, for example C4.5. Just remember to disable pruning and lower the minimum number of objects parameter.
During that process you would lose a bit of information, but the accuracy would remain the same. Take a look at both rows describing disease D04 - the second row is in fact more general than the first. Decision tree generated from this data would recognize the mentioned disease only from E11, 12 and 13 attributes, since it's enough to correctly label the instance.

I've spent few hours looking for a good algorithm. But I'm happy with my results.
My code is too dirty now to paste here (I can share privately on request, on your discretion) but the general idea is as the following.
Assume you have a data set with some decision criteria and outcome.
Define a tree structure (e.g. data.tree in R) and create "Start" root node.
Calculate outcome entropy of your data set. If entropy is zero you are done.
Using each criterion, one by one, as tree node calculate entropy for all branches created with this criterion. Take the minimum one entropy of all branches.
Branches created with the criterion with the smallest (minimum) entropy are your next tree node. Add them as child nodes.
Split your data according to decision point/tree node found in step 4 and remove the criterion used.
Repeat step 2-4 for each branch until your all branches have entropy = 0.
Enjoy your ideal decision tree :)

Related

Regarding prediction of Decision Tree

How does Decision tree predict the out come on a new Data set. Lets say with the hyper parameters I allowed my decision tree to grow only to a certain extent to avoid over fitting. Now a new data point is passed to this trained model, so the new data point reaches to one of the leaf nodes. But how does that leaf node predict whether the data point is either 1 or 0? ( I am talking about Classification here).
Well, you pretty much answered your own question. But just to the extension, in the last the data is labelled to either 0 or 1 is hgihly dependent on the type of algorithm you used, for example, ID3 , uses the mode value to predict. similarly C4.5 and C5 or CART have there different criteria based on info gain, ginni index etc etc....
In simplified terms, the process of training a decision tree and predicting the target features of query instances is as follows:
Present a dataset containing of a number of training instances characterized by a number of descriptive features and a target feature
Train the decision tree model by continuously splitting the target feature along the values of the descriptive features using a measure of information gain during the training process
Grow the tree until we accomplish a stopping criteria --> create leaf nodes which represent the predictions we want to make for new query instances
Show query instances to the tree and run down the tree until we arrive at leaf nodes
DONE - Congratulations you have found the answers to your questions
here is a link I am suggesting which explain the decision tree in very detail with scratch. Give it a good read -
https://www.python-course.eu/Decision_Trees.php

How to restrict the sequence prediction in an LSTM model to match a specific pattern?

I have created a word-level text generator using an LSTM model. But in my case, not every word is suitable to be selected. I want them to match additional conditions:
Each word has a map: if a character is a vowel then it will write 1 if not, it will write 0 (for instance, overflow would be 10100010). Then, the sentence generated needs to meet a given structure, for instance, 01001100 (hi 01 and friend 001100).
The last vowel of the last word must be the one provided. Let's say is e. (friend will do the job, then).
Thus, to handle this scenario, I've created a pandas dataframe with the following structure:
word last_vowel word_map
----- --------- ----------
hello o 01001
stack a 00100
jhon o 0010
This is my current workflow:
Given the sentence structure, I choose a random word from the dataframe which matches the pattern. For instance, if the sentence structure is 0100100100100, we can choose the word hello, as its vowel structure is 01001.
I subtract the selected word from the remaining structure: 0100100100100 will become 00100100 as we've removed the initial 01001 (hello).
I retrieve all the words from the dataframe which matches part of the remaining structure, in this case, stack 00100 and jhon 0010.
I pass the current word sentence content (just hello by now) to the LSTM model, and it retrieves the weights of each word.
But I don't just want to select the best option, I want to select the best option contained in the selection of point 3. So I choose the word with the highest estimation within that list, in this case, stack.
Repeat from point 2 until the remaining sentence structure is empty.
That works like a charm, but there is one remaining condition to handle: the last vowel of the sentence.
My way to deal with this issue is the following:
Generating 1000 sentences forcing that the last vowel is the one specified.
Get the rmse of the weights returned by the LSTM model. The better the output, the higher the weights will be.
Selecting the sentence which retrieves the higher rank.
Do you think is there a better approach? Maybe a GAN or reinforcement learning?
EDIT: I think another approach would be adding WFST. I've heard about pynini library, but I don't know how to apply it to my specific context.
If you are happy with your approach, the easiest way might be if you'd be able to train your LSTM on the reversed sequences as to train it to give the weight of the previous word, rather than the next one. In such a case, you can use the method you already employ, except that the first subset of words would be satisfying the last vowel constraint. I don't believe that this is guaranteed to produce the best result.
Now, if that reversal is not possible or if, after reading my answer further, you find that this doesn't find the best solution, then I suggest using a pathfinding algorithm, similar to reinforcement learning, but not statistical as the weights computed by the trained LSTM are deterministic. What you currently use is essentially a depth first greedy search which, depending on the LSTM output, might be even optimal. Say if LSTM is giving you a guaranteed monotonous increase in the sum which doesn't vary much between the acceptable consequent words (as the difference between N-1 and N sequence is much larger than the difference between the different options of the Nth word). In the general case, when there is no clear heuristic to help you, you will have to perform an exhaustive search. If you can come up with an admissible heuristic, you can use A* instead of Dijkstra's algorithm in the first option below, and it will do the faster, the better you heuristic is.
I suppose it is clear, but just in case, your graph connectivity is defined by your constraint sequence. The initial node (0-length sequence with no words) is connected with any word in your data frame that matches the beginning of your constraint sequence. So you do not have the graph as a data structure, just it's the compressed description as this constraint.
EDIT
As per request in the comment here are additional details. Here are a couple of options though:
Apply Dijkstra's algorithm multiple times. Dijkstra's search finds the shortest path between 2 known nodes, while in your case we only have the initial node (0-length sequence with no words) and the final words are unknown.
Find all acceptable last words (those that satisfy both the pattern and vowel constraints).
Apply Dijkstra's search for each one of those, finding the largest word sequence weight sum for each of them.
Dijkstra's algorithm is tailored to the searching of the shortest path, so to apply it directly you will have to negate the weights on each step and pick the smallest one of those that haven't been visited yet.
After finding all solutions (sentences that end with one of those last words that you identified initially), select the smallest solution (this is going to be exactly the largest weight sum among all solutions).
Modify your existing depth-first search to do an exhaustive search.
Perform the search operation as you described in OP and find a solution if the last step gives one (if the last word with a correct vowel is available at all), record the weight
Rollback one step to the previous word and pick the second-best option among previous words. You might be able to discard all the words of the same length on the previous step if there was no solution at all. If there was a solution, it depends on whether your LSTM provides different weights depending on the previous word. Likely it does and in that case, you have to perform that operation for all the words in the previous step.
When you run out of the words on the previous step, move one step up and restart down from there.
You keep the current winner all the time as well as the list of unvisited nodes on every step and perform exhaustive search. Eventually, you will find the best solution.
I would reach for a Beam Search here.
This is much like your current approach of starting 1000 solutions randomly. But instead of expanding each of those paths independently, it expands all candidate solutions together in a step by step manner.
Sticking with the current candidate count of 1000, it would look like this:
Generate 1000 stub solutions, for example using random starting points or selected from some "sentence start" model.
For each candidate, compute the best extensions based on your LSTM language model, which fit the constraints. This works just as it does in your current approach, except you could also try more than one option. For example using the best 5 choices for the next word would product 5000 child candidates.
Compute a score for each of those partial solution candidates, then reduce back to 1000 candidates by keeping only the best scoring options.
Repeat steps 2 and 3 until all candidates cover the full vowel sequence, including the end constraint.
Take the best scoring of these 1000 solutions.
You can play with the candidate scoring to trade off completed or longer solutions vs very good but short fits.

How to derive the top contributing factors in a binary classification problem

I have a binary classification problem with about 30 features and an ultimate pass/fail label. I first trained a classifier to be able to predict if new instances will pass or fail but now I want to get a deeper understanding.
How can I derive some analysis about why these items pass or fail based on their features? I would ideally like to be able to show the top contributing factors with a weight associated with each one. Complicating this is that my features are not necessarily statistically independent of each other. What sorts of methods should I look into, what keywords will point me in the right direction?
Some initial thoughts: Use a decision tree classifier (ID3 or CART) and look at the top of the tree for top factors. I am not sure how robust this approach would be and it isn't immediately clear to me how one can assign the importance of each factor (one would just get an ordered list).
If I understand your objectives correctly, you might want to consider a Random Forest model. Random forests have the advantage of naturally providing an importance to the features by virtue of how the algorithm works.
In Python's scikit-learn, check out sklearn.ensemble.RandomForestClassifier(). feature_importances_ would return the "weights" I believe you're looking for. Check out the example in the documentation.
Alternatively, you can use R's randomForest package. After constructing the model, you can use importance() to extract the feature importance values.

How to classify text with Knime

I'm trying to classify some data using knime with knime-labs deep learning plugin.
I have about 16.000 products in my DB, but I have about 700 of then that I know its category.
I'm trying to classify as much as possible using some DM (data mining) technique. I've downloaded some plugins to knime, now I have some deep learning tools as some text tools.
Here is my workflow, I'll use it to explain what I'm doing:
I'm transforming the product name into vector, than applying into it.
After I train a DL4J learner with DeepMLP. (I'm not really understand it all, it was the one that I thought I got the best results). Than I try to apply the model in the same data set.
I thought I would get the result with the predicted classes. But I'm getting a column with output_activations that looks that gets a pair of doubles. when sorting this column I get some related date close to each other. But I was expecting to get the classes.
Here is a print of the result table, here you can see the output with the input.
In columns selection it's getting just the converted_document and selected des_categoria as Label Column (learning node config). And in Predictor node I checked the "Append SoftMax Predicted Label?"
The nom_produto is the text column that I'm trying to use to predict the des_categoria column that it the product category.
I'm really newbie about DM and DL. If you could get me some help to solve what I'm trying to do would be awesome. Also be free to suggest some learning material about what attempting to achieve
PS: I also tried to apply it into the unclassified data (17,000 products), but I got the same result.
I won't answer with a workflow on this one because it is not going to be a simple one. However, be sure to find the text mining example on the KNIME server, i.e. the one that makes use of the bag of words approach.
The task
Product mapping to categories should be a straight-forward data mining task because the information that explains the target variable is available in a quasi-exhaustive manner. Depending on the number of categories to train though, there is a risk that you might need more than 700 instances to learn from.
Some resources
Here are some resources, only the first one being truly specialised in text mining:
Introduction on Information Retrieval, in particular chapter 13;
Data Science for Business is an excellent introduction to data mining, including text mining (chapter 10), also do not forget the chapter about similarity (chapter 6);
Machine Learning with R has the advantage of being accessible enough (chapter 4 provides an example of text classification with R code).
Preprocessing
First, you will have to preprocess your product labels a bit. Use KNIME's text analytics preprocessing nodes for that purpose, that is after you've transformed the product labels with Strings to Document:
Case Convert, Punctuation Erasure and Snowball Stemmer;
you probably won't need Stop Word Filter, however, there may be quasi-stop words such as "product", which you may need to remove manually with Dictionary Filter;
Be careful not to use any of the following without testing testing their impact first: N Chars Filter (g may be a useful word), Number Filter (numbers may indicate quantities, which may be useful for classification).
Should you encounter any trouble with the relevant nodes (e.g. Punctuation Erasure can be tricky amazingly thanks to the tokenizer), you can always apply String Manipulation with regex before converting the Strings to Document.
Keep it short and simple: the lookup table
You could build a lookup table based on the 700 training instances. The book Data mining techniques as well as resource (2) present this approach in some detail. If any model performs any worse than the lookup table, you should abandon the model.
Nearest neighbors
Neural networks are probably overkill for this task.
Start with a K Nearest Neighbor node (applying a string distance such as Cosine, Levensthein or Jaro-Winkler). This approach requires the least amount of data wrangling. At the very least, it will provide an excellent baseline model, so it is most definitely worth a shot.
You'll need to tune the parameter k and to experiment with the distance types. The Parameter Optimization Loop pair will help you with optimizing k, you can include a Cross-Validation meta node inside of the said loop to obtain an estimate of the expected performance given k instead of only one point estimate per value of k. Use Cohen's Kappa as an optimization criterion, as proposed by the resource number (3) and available via the Scorer node.
After the parameter tuning, you'll have to evaluate the relevance of your model using yet another Cross-Validation meta node, then follow up with a Loop pair including Scorer to calculate the descriptives on performance metric(s) per iteration, finally use Statistics. Kappa is a convenient metric for this task because the target variable consists of many product categories.
Don't forget to test its performance against the lookup table.
What next ?
Should lookup table or k-nn work well for you, then there's nothing else to add.
Should any of those approaches fail, you might want to analyse the precise cases on which it fails. In addition, training set size may be too low, so you could manually classify another few hundred or thousand instances.
If after increasing the training set size, you are still dealing with a bad model, you can try the bag of words approach together with a Naive Bayes classifier (see chapter 13 of the Information Retrieval reference). There is no room here to elaborate on the bag of words approach and Naive Bayes but you'll find the resources here above useful for that purpose.
One last note. Personally, I find KNIME's Naive Bayes node to perform poorly, probably because it does not implement Laplace smoothening. However, KNIME's R Learner and R Predictor nodes will allow you to use R's e1071 package, as demonstrated by resource (3).

Binary Classification with Rule Based approach rather than proper algorithms

Problem Statement is somewhat like this:
Given a website, we have to classify it into one of the two predefined classes (say whether its an e-commerce website or not?)
We have already tried Naive Bayes Algorithms for this with multiple pre-processing techniques (stop word removal, stemming etc.) and proper features.
We want to increase the accuracy to 90 or somewhat closer, which we are not getting from this approach.
The issue here is, while evaluating the accuracy manually, we look for a few identifiers on web page (e.g. Checkout button, Shop/Shopping,paypal and many more) which are sometimes missed in our algorithms.
We were thinking, if we are too sure of these identifiers, why don't we create a rule based classifier where we will classify a page as per a set of rules(which will be written on the basis of some priority).
e.g. if it contains shop/shopping and has checkout button then it's an ecommerce page.
And many similar rules in some priority order.
Depending on a few rules we will visit other pages of the website as well (currently, we visit only home page which is also a reason of not getting very high accuracy).
What are the potential issues that we will be facing with rule based approach? Or it would be better for our use case?
Would be a good idea to create those rules with sophisticated algorithms(e.g. FOIL, AQ etc)?
The issue here is, while evaluating the accuracy manually, we look for a few identifiers on web page (e.g. Checkout button, Shop/Shopping,paypal and many more) which are sometimes missed in our algorithms.
So why don't you include this information in your classification scheme? It's not hard to find a payment/checkout button in the html, so the presence of these should definitely be features. A good classifier relies on two things- good data, and good features. Make sure you have both!
If you must do a rule based classifier, then think of it more or less like a decision tree. It's very easy to do if you are using a functional programming language- basically just recurse until you hit an endpoint, at which point you are given a classification.
A Decision Tree algorithm can take your data and return a rule set for prediction of unlabeled instances.
In fact, a decision tree is really just a recursive descent partitioner comprised of a set of rules in which each rule sits at a node in the tree and application of that rule on an unlabeled data instance, sends this instance down either the left fork or right fork.
Many decision tree implementations explicitly generate a rule set, but this isn't necesary, because the rules (both what the rule is and the position of that rule in the decision flow) are easy to see just by looking at the tree that represents the trained decision tree classifier.
In particular, each rule is just a Boolean test for a particular value in a particular feature (data column or field).
For instance, suppose one of the features in each data row describes the type of Application Cache; further suppose that this feature has three possible values, memcache, redis, and custom. Then a rule might be Applilcation Cache | memcache, or does this data instance have an Application Cache based on redis?
The rules extracted from a decision tree are Boolean--either true or false. By convention False is represented by the left edge (or link to the child node below and to the left-hand-side of this parent node); and True is represented by the right-hand-side edge.
Hence, a new (unlabeled) data row begins at the root node, then is sent down either the right or left side depending on whether the rule at the root node is answered True or False. The next rule is applied (at least level in the tree hierarchy) until the data instance reaches the lowest level (a node with no rule, or leaf node).
Once the data point is filtered to a leaf node, then it is in essence classified, becasue each leaf node has a distribution of training data instances associated with it (e.g., 25% Good | 75% Bad, if Good and Bad are class labels). This empirical distribution (which in the ideal case is comprised of a data instances having just one class label) determines the unknown data instances's estimated class label.
The Free & Open-Source library, Orange, has a decision tree module (implementations of specific ML techniques are referred to as "widgets" in Orange) which seems to be a solid implementation of C4.5, which is probably the most widely used and perhaps the best decision tree implementation.
An O'Reilly Site has a tutorial on decision tree construction and use, including source code for a working decision tree module in python.

Resources