Regarding prediction of Decision Tree - machine-learning

How does a decision tree predict the outcome on a new data set? Let's say that, using the hyperparameters, I allowed my decision tree to grow only to a certain extent to avoid overfitting. Now a new data point is passed to this trained model, and it reaches one of the leaf nodes. But how does that leaf node predict whether the data point is 1 or 0? (I am talking about classification here.)

Well, you pretty much answered your own question. To expand on it: how the leaf labels the data point as 0 or 1 depends on the algorithm you used. ID3, for example, predicts the mode, i.e. the most frequent class among the training instances that ended up in that leaf. C4.5, C5.0 and CART differ mainly in the criterion they use to choose the splits (information gain, gain ratio, Gini index, etc.), but their leaves likewise predict the majority class.
In simplified terms, the process of training a decision tree and predicting the target features of query instances is as follows:
Present a dataset consisting of a number of training instances characterized by a number of descriptive features and a target feature
Train the decision tree model by repeatedly splitting the dataset along the values of the descriptive features, using a measure of information gain to choose each split
Grow the tree until a stopping criterion is met, then create leaf nodes, which represent the predictions we want to make for new query instances
Show query instances to the tree and run down the tree until we arrive at leaf nodes
DONE - Congratulations you have found the answers to your questions
Here is a link that explains decision trees in detail and builds one from scratch. Give it a good read:
https://www.python-course.eu/Decision_Trees.php
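To make the leaf step concrete, here is a minimal sketch in plain Python (not taken from the linked tutorial). It assumes the leaf simply stores the training labels that reached it and predicts the majority class; the leaf contents, the feature name and the 2.5 threshold are all made up for illustration.

```python
from collections import Counter

# Hypothetical training labels that ended up in the two leaves of a
# depth-limited tree (the numbers are invented for this example).
leaf_labels = {
    "leaf_left": [0, 0, 0, 1],
    "leaf_right": [1, 1, 0, 1, 1],
}

def leaf_prediction(labels):
    """Return the majority class and the class proportions of one leaf."""
    counts = Counter(labels)
    majority_class = counts.most_common(1)[0][0]
    proportions = {cls: n / len(labels) for cls, n in counts.items()}
    return majority_class, proportions

def route(point):
    """Toy single-split tree; the feature name and threshold are assumptions."""
    return "leaf_left" if point["feature"] <= 2.5 else "leaf_right"

new_point = {"feature": 3.1}
predicted_class, class_probs = leaf_prediction(leaf_labels[route(new_point)])
print(predicted_class, class_probs)
```

The leaf does not have to be pure: a leaf holding four 1s and one 0 still predicts 1, and the proportions can be read as a rough class probability.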

Related

Regression tree output

I'm confused about the intuition behind decision trees when used to describe continuous targets in machine learning.
I understand that decision trees use splits based on feature values to decide which branches of a tree to go down to get to a leaf value.
It intuitively makes sense to me when doing inference on classification based on nominal targets, because each leaf would have a specific value (label), so after going down enough branches one eventually arrives at a discrete value, which is the label.
But if we're doing regression where a machine learning model predicts a value on a continuum, for example a real number between 0 and 100, how could there be enough leaves to allow the model to output any real number between 0 and 100?
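Regression trees are only what you could call 'pseudo-continuous', in contrast to, for example, linear regression models. Each leaf outputs a single constant value (typically the mean of the training targets that fell into it), so the prediction stays flat over a given range of the independent variable(s) and jumps at the 'splits'. A tree with n leaves can therefore output at most n distinct values.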
However, there exists some academic work that fits (regression) models in the nodes (...). See the accepted answer here:
https://stats.stackexchange.com/questions/439756/decision-tree-that-fits-a-regression-at-leaf-nodes
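A small sketch of that piecewise-constant behaviour, assuming a single split and leaves that predict the mean of their training targets. The data and the split point 5.0 are made up.

```python
from statistics import mean

# Made-up 1-D training data as (x, y) pairs.
train = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0),
         (7.0, 80.0), (8.0, 90.0), (9.0, 85.0)]

THRESHOLD = 5.0  # hypothetical split point chosen for the example

# Each leaf stores one constant: the mean target of its training rows.
left_value = mean(y for x, y in train if x <= THRESHOLD)
right_value = mean(y for x, y in train if x > THRESHOLD)

def predict(x):
    """Return the constant of whichever leaf x falls into."""
    return left_value if x <= THRESHOLD else right_value

# Very different inputs on the same side of the split get the same output.
predictions = [predict(x) for x in (0.5, 4.9, 5.1, 100.0)]
print(predictions)
```

With only two leaves the model can output only two numbers. A deeper tree adds more steps, but it never becomes truly continuous.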

Decision Tree Performance, ML

If we don't set any constraints such as max_depth or a minimum number of samples per node, can a decision tree always reach 0 training error, or does it depend on the dataset? What about the dataset shown?
Edit: it is possible to have a split which results in lower accuracy than the parent node, right? According to the theory of decision trees, it should stop splitting there, even if the end result after several more splits could be good. Am I correct?
A decision tree will generally find a split that improves the accuracy/score.
For example, I've built a decision tree on data similar to yours:
A decision tree can get to 100% accuracy on any data set where there are no 2 samples with the same feature values but different labels.
This is one reason why decision trees tend to overfit, especially on many features or on categorical data with many options.
Indeed, sometimes, we prevent a split in a node if the improvement created by the split is not high enough. This is problematic as some relationships, like y=x_1 xor x_2 cannot be expressed by trees with this limitation.
So commonly, a tree doesn't stop because it cannot improve the model on the training data.
The reason you don't see trees with 100% accuracy is because we use techniques to reduce overfitting, such as:
Tree pruning like this relatively new example. This basically means that you build your entire tree, but then you go back and prune nodes that did not contribute enough to the model's performance.
Using gain ratio instead of information gain for the splits. This normalizes the gain by the split information, so a balanced 50%-50% split has to deliver more gain than a lopsided 10%-90% split to score equally.
Setting hyperparameters, such as max_depth and min_samples_leaf, to prevent the tree from splitting too much.
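The xor case mentioned above can be checked by hand. The sketch below uses plain Python with a four-row xor data set; the helper names are made up. It compares training accuracy at the root, after the best single split, and after splitting on both features.

```python
from collections import Counter

# Four training rows for y = x1 xor x2.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def majority_accuracy(groups):
    """Training accuracy when every group predicts its own majority label."""
    correct = sum(Counter(labels).most_common(1)[0][1]
                  for labels in groups if labels)
    total = sum(len(labels) for labels in groups)
    return correct / total

def split_labels(rows, feature):
    """Group the labels by the value (0 or 1) of one feature."""
    return [[y for x, y in rows if x[feature] == v] for v in (0, 1)]

root_acc = majority_accuracy([[y for _, y in data]])
one_split_acc = max(majority_accuracy(split_labels(data, f)) for f in (0, 1))

# Split on x1 first, then split each child on x2.
leaves = []
for v in (0, 1):
    child = [(x, y) for x, y in data if x[0] == v]
    leaves.extend(split_labels(child, 1))
two_level_acc = majority_accuracy(leaves)

print(root_acc, one_split_acc, two_level_acc)
```

No single split beats the root here, so a rule like "stop when the split does not improve the score" never reaches the second level, even though the second level is what makes the tree perfect on this data.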

Decision Tree for Choosing Most Likely Option?

I'm trying to find the right ML algorithm. Let's say I have three data columns. I have a binary outcome for each column (either the data column belongs to (Group A) classification or it does not), BUT in each set of three data columns that I feed in, exactly ONE and only one column belongs to Group A.
Which algorithm can I choose to select the ONE BEST result of the three each time? Can I do this with a decision tree?
A decision tree (e.g. ID3) can be suitable for this simple problem. The best way is to try it on the data and look at its output predictions.
ID3 is prone to overfitting, though.
Basically any classifier can do a good job on this task. If the data is linearly separable, even an SVM can be a good choice. I also suggest trying a basic neural network with 1 or 2 nodes at the output layer for classifying the 2 groups.
All of them are implemented in various packages and are fairly easy to use (in almost any programming language).

How to classify text with Knime

I'm trying to classify some data using knime with knime-labs deep learning plugin.
I have about 16,000 products in my DB, but I only know the category of about 700 of them.
I'm trying to classify as many as possible using some DM (data mining) technique. I've downloaded some plugins for KNIME, so now I have some deep learning tools as well as some text tools.
Here is my workflow, I'll use it to explain what I'm doing:
I'm transforming the product name into a vector, then applying the learner to it.
After that, I train a DL4J learner with DeepMLP. (I don't really understand all of it; it was the one that I thought gave me the best results.) Then I try to apply the model to the same data set.
I thought I would get the result with the predicted classes. But I'm getting a column named output_activations that appears to contain a pair of doubles. When sorting by this column, I get some related data close to each other. But I was expecting to get the classes.
Here is a print of the result table, here you can see the output with the input.
In columns selection it's getting just the converted_document and selected des_categoria as Label Column (learning node config). And in Predictor node I checked the "Append SoftMax Predicted Label?"
The nom_produto column is the text column that I'm trying to use to predict the des_categoria column, which is the product category.
I'm really a newbie in DM and DL. If you could give me some help with what I'm trying to do, that would be awesome. Also feel free to suggest some learning material about what I'm attempting to achieve.
PS: I also tried to apply it into the unclassified data (17,000 products), but I got the same result.
I won't answer with a workflow on this one because it is not going to be a simple one. However, be sure to find the text mining example on the KNIME server, i.e. the one that makes use of the bag of words approach.
The task
Product mapping to categories should be a straight-forward data mining task because the information that explains the target variable is available in a quasi-exhaustive manner. Depending on the number of categories to train though, there is a risk that you might need more than 700 instances to learn from.
Some resources
Here are some resources, only the first one being truly specialised in text mining:
(1) Introduction to Information Retrieval, in particular chapter 13;
(2) Data Science for Business is an excellent introduction to data mining, including text mining (chapter 10); also do not forget the chapter about similarity (chapter 6);
(3) Machine Learning with R has the advantage of being accessible enough (chapter 4 provides an example of text classification with R code).
Preprocessing
First, you will have to preprocess your product labels a bit. Use KNIME's text analytics preprocessing nodes for that purpose, that is after you've transformed the product labels with Strings to Document:
Case Convert, Punctuation Erasure and Snowball Stemmer;
you probably won't need Stop Word Filter, however, there may be quasi-stop words such as "product", which you may need to remove manually with Dictionary Filter;
Be careful not to use any of the following without testing their impact first: N Chars Filter (g may be a useful word), Number Filter (numbers may indicate quantities, which may be useful for classification).
Should you encounter any trouble with the relevant nodes (e.g. Punctuation Erasure can be surprisingly tricky because of the tokenizer), you can always apply String Manipulation with regex before converting the Strings to Document.
Keep it short and simple: the lookup table
You could build a lookup table based on the 700 training instances. The book Data Mining Techniques as well as resource (2) present this approach in some detail. If any model performs worse than the lookup table, you should abandon that model.
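Outside KNIME, the lookup-table baseline can be sketched in a few lines of Python. This assumes the table is keyed on the normalised product name with exact matching, which is only one possible reading of the approach. The product names and categories are made up.

```python
# Made-up labelled products standing in for the 700 known ones.
labelled = [
    ("USB Cable 2m", "electronics"),
    ("Red Cotton Shirt", "clothing"),
    ("HDMI Cable 1m", "electronics"),
]

def normalise(name):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())

# The "model" is just a dictionary from normalised name to category.
lookup = {normalise(name): category for name, category in labelled}

def lookup_predict(name, default=None):
    """Return the known category, or `default` when the name was never seen."""
    return lookup.get(normalise(name), default)

print(lookup_predict("  usb   cable 2M "))
print(lookup_predict("wool sweater"))
```

Whatever score this gets on held-out labelled products is the bar that any more elaborate model has to clear.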
Nearest neighbors
Neural networks are probably overkill for this task.
Start with a K Nearest Neighbor node (applying a string distance such as Cosine, Levenshtein or Jaro-Winkler). This approach requires the least amount of data wrangling. At the very least, it will provide an excellent baseline model, so it is most definitely worth a shot.
You'll need to tune the parameter k and experiment with the distance types. The Parameter Optimization Loop pair will help you optimize k. You can include a Cross-Validation meta node inside that loop to obtain an estimate of the expected performance for each value of k, instead of a single point estimate per value. Use Cohen's Kappa as the optimization criterion, as proposed by resource (3) and available via the Scorer node.
After the parameter tuning, you'll have to evaluate the relevance of your model using yet another Cross-Validation meta node, then follow up with a Loop pair including Scorer to calculate the descriptives on performance metric(s) per iteration, finally use Statistics. Kappa is a convenient metric for this task because the target variable consists of many product categories.
Don't forget to test its performance against the lookup table.
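To show what the nearest-neighbour idea does with a string distance, here is a rough Python sketch. It is not what the KNIME node runs internally. It assumes Levenshtein distance on the raw product names and a plain majority vote among the k closest labelled products. The products are made up.

```python
from collections import Counter

def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Made-up labelled products standing in for the known ones.
labelled = [
    ("red cotton shirt", "clothing"),
    ("blue cotton shirt", "clothing"),
    ("green wool sweater", "clothing"),
    ("usb cable 2m", "electronics"),
    ("usb charger", "electronics"),
    ("hdmi cable 1m", "electronics"),
]

def knn_predict(name, k=3):
    """Majority category among the k labelled names closest to `name`."""
    neighbours = sorted(labelled,
                        key=lambda item: levenshtein(name, item[0]))[:k]
    votes = Counter(category for _, category in neighbours)
    return votes.most_common(1)[0][0]

print(knn_predict("usb cable 1m"))
print(knn_predict("blue cotton shirts"))
```

Here k and the choice of distance are exactly the knobs the tuning described above would search over.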
What next?
Should lookup table or k-nn work well for you, then there's nothing else to add.
Should any of those approaches fail, you might want to analyse the precise cases on which it fails. In addition, training set size may be too low, so you could manually classify another few hundred or thousand instances.
If after increasing the training set size, you are still dealing with a bad model, you can try the bag of words approach together with a Naive Bayes classifier (see chapter 13 of the Information Retrieval reference). There is no room here to elaborate on the bag of words approach and Naive Bayes but you'll find the resources here above useful for that purpose.
One last note. Personally, I find KNIME's Naive Bayes node to perform poorly, probably because it does not implement Laplace smoothing. However, KNIME's R Learner and R Predictor nodes will allow you to use R's e1071 package, as demonstrated by resource (3).

Convert Decision Table To Decision Tree

How can I convert a decision table to a decision tree graph, or visualize it as one?
Is there an algorithm to solve this, or software to visualize it?
For example, I want to visualize my decision table below:
http://i.stack.imgur.com/Qe2Pw.jpg
Gotta say that is an interesting question.
I don't know the definitive answer, but I'd propose such a method:
use a Karnaugh map to turn your decision table into a minimized Boolean function
turn your function into a tree
Let's simplify with an example, and assume that the Karnaugh map gave you the function (a and b) or c or d. You can turn that into a tree as:
Source: my own
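One possible tree for that function, written as nested conditions in Python. This is a sketch, not necessarily the tree the answerer had in mind. The order in which the variables are tested (d, then c, then a, then b) is an arbitrary choice. The check at the end compares the tree with the formula on all 16 inputs.

```python
from itertools import product

def tree(a, b, c, d):
    """Decision tree for (a and b) or c or d; each `if` is one inner node."""
    if d:
        return True       # leaf: d alone is enough
    if c:
        return True       # leaf: c alone is enough
    if a:
        return b          # a is true, so the outcome depends on b
    return False          # leaf: none of the terms can hold

def formula(a, b, c, d):
    """The minimized Boolean function itself."""
    return (a and b) or c or d

# Exhaustive check over every combination of the four inputs.
agrees = all(tree(*bits) == formula(*bits)
             for bits in product([False, True], repeat=4))
print(agrees)
```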
It certainly is easier to generate a decision table from a decision tree, not the other way around.
But the way I see it, you could convert your decision table to a data set. Let the 'Disease' be the class attribute and treat the evidence as simple binary instance attributes. From that you can easily generate a decision tree using one of the available decision tree induction algorithms, for example C4.5. Just remember to disable pruning and lower the minimum number of objects parameter.
During that process you would lose a bit of information, but the accuracy would remain the same. Take a look at both rows describing disease D04 - the second row is in fact more general than the first. Decision tree generated from this data would recognize the mentioned disease only from E11, 12 and 13 attributes, since it's enough to correctly label the instance.
I've spent a few hours looking for a good algorithm, but I'm happy with my results.
My code is too dirty now to paste here (I can share privately on request, on your discretion) but the general idea is as the following.
Assume you have a data set with some decision criteria and an outcome.
1. Define a tree structure (e.g. data.tree in R) and create a "Start" root node.
2. Calculate the outcome entropy of your data set. If the entropy is zero, you are done.
3. Using each criterion, one by one, as a candidate tree node, calculate the entropy of all branches created by this criterion, and find the criterion with the minimum entropy.
4. The branches created by the criterion with the smallest (minimum) entropy are your next tree nodes. Add them as child nodes.
5. Split your data according to the decision point/tree node found in step 4 and remove the criterion used.
6. Repeat steps 2-5 for each branch until all your branches have entropy = 0.
7. Enjoy your ideal decision tree :)
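Since the original code was not posted, here is an independent Python sketch of the same idea (the answer used R). It makes a few assumptions that are not in the answer: the tree is a nested dict, the branch entropies of a criterion are combined as a size-weighted average, and a branch that runs out of criteria falls back to its majority outcome. The tiny decision table is made up.

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    """Shannon entropy of the outcome column over the given rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def branches(rows, criterion):
    """Group rows by their value for one criterion."""
    groups = {}
    for row in rows:
        groups.setdefault(row[criterion], []).append(row)
    return groups

def weighted_entropy(rows, criterion, target):
    """Size-weighted average entropy of the branches a criterion creates."""
    return sum(len(group) / len(rows) * entropy(group, target)
               for group in branches(rows, criterion).values())

def build_tree(rows, criteria, target):
    """Recursively pick the lowest-entropy criterion until branches are pure."""
    if entropy(rows, target) == 0 or not criteria:
        return Counter(row[target] for row in rows).most_common(1)[0][0]
    best = min(criteria, key=lambda c: weighted_entropy(rows, c, target))
    remaining = [c for c in criteria if c != best]
    return {best: {value: build_tree(group, remaining, target)
                   for value, group in branches(rows, best).items()}}

def classify(tree, row):
    """Walk the nested dict until a leaf (a plain outcome) is reached."""
    while isinstance(tree, dict):
        criterion = next(iter(tree))
        tree = tree[criterion][row[criterion]]
    return tree

# Made-up decision table: two criteria and one outcome column.
table = [
    {"fever": "yes", "cough": "yes", "disease": "flu"},
    {"fever": "yes", "cough": "no", "disease": "flu"},
    {"fever": "no", "cough": "yes", "disease": "cold"},
    {"fever": "no", "cough": "no", "disease": "healthy"},
]
decision_tree = build_tree(table, ["fever", "cough"], "disease")
print(decision_tree)
```

On this table the tree tests fever first and only asks about cough on the "no" branch, because the "yes" branch is already pure.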

Resources