Decision tree completeness and unclassified data - machine-learning

I wrote a program that trains a decision tree with the ID3 algorithm, using information gain (Shannon entropy) to select the feature to split on.
Once I trained a decision tree I tested it to classify unseen data and I realized that some data instances cannot be classified: there is no path on the tree that classifies the instance.
An example (this is an illustration example but I encounter the same problem with a larger and more complex data set):
Let f1 and f2 be the predictor variables (features) and y the categorical target variable. Their value ranges are:
f1: [a1; a2; a3]
f2: [b1; b2; b3]
y : [y1; y2; y3]
Training data:
("a1", "b1", "y1");
("a1", "b2", "y2");
("a2", "b3", "y3");
("a3", "b3", "y1");
Trained tree:
          [f2]
        /  |  \
      b1   b2   b3
      /    |     \
    y1     y2    [f1]
                 /  \
               a2    a3
               /      \
             y3        y1
The instance ("a1", "b3") cannot be classified with the given tree.
Several questions came to mind:
1. Does this situation have a name? Tree incompleteness or something like that?
2. Is there a way to know whether a decision tree will cover all combinations of unseen instances (all combinations of feature values)?
3. Does the reason for this "incompleteness" lie in the topology of the data set or in the algorithm used to train the decision tree (ID3 in this case, or another)?
4. Is there a method to classify these unclassifiable instances with the given decision tree, or must one use another tool (random forest, neural networks...)?

This situation cannot occur with the ID3 decision-tree learner---regardless of whether it uses information gain or some other heuristic for split selection. (See, for example, ID3 algorithm on Wikipedia.)
The "trained tree" in your example above could not have been returned by the ID3 decision-tree learning algorithm.
This is because when the algorithm selects a d-valued attribute (i.e. an attribute with d possible values) on which to split the given leaf, it creates d new children (one per attribute value). In particular, in your example above, the node [f1] would have three children, corresponding to the attribute values a1, a2, and a3.
It follows from the previous paragraph (and, in general, from the way the ID3 algorithm works) that any well-formed vector---of the form (v1, v2, ..., vn, y), where vi is a value of i-th attribute and y is the class value---should be classifiable by the decision tree that the algorithm learns on a given train set.
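To make the argument concrete, here is a minimal sketch of ID3 in Python, not the software from the question. Several things are assumptions of the sketch: the tree is stored as nested `(attribute_index, children)` tuples, the full value range of every attribute is passed in as `domains`, and a branch that receives no training examples becomes a leaf labelled with the majority class of its parent node (a common convention). The attribute order `[1, 0]` is chosen only so that the tie in information gain is broken in favour of f2, as in the tree shown in the question.

```python
import math
from collections import Counter

def entropy(rows):
    """Shannon entropy of the class labels (last field of each row)."""
    counts = Counter(r[-1] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(rows, attr):
    """Entropy reduction obtained by splitting rows on attribute index attr."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - remainder

def majority(rows):
    """Most frequent class label among rows."""
    return Counter(r[-1] for r in rows).most_common(1)[0][0]

def id3(rows, attrs, domains):
    labels = {r[-1] for r in rows}
    if len(labels) == 1:          # pure node: return the class as a leaf
        return labels.pop()
    if not attrs:                 # no attribute left: majority-class leaf
        return majority(rows)
    best = max(attrs, key=lambda a: info_gain(rows, a))
    rest = [a for a in attrs if a != best]
    children = {}
    # One child per value in the attribute's DOMAIN, not only per value
    # that happens to occur in rows. Empty branches become majority leaves
    # (assumption of this sketch).
    for value in domains[best]:
        subset = [r for r in rows if r[best] == value]
        children[value] = id3(subset, rest, domains) if subset else majority(rows)
    return (best, children)

def classify(tree, x):
    """Follow the branches matching x until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[x[attr]]
    return tree

train = [("a1", "b1", "y1"), ("a1", "b2", "y2"),
         ("a2", "b3", "y3"), ("a3", "b3", "y1")]
domains = {0: ["a1", "a2", "a3"], 1: ["b1", "b2", "b3"]}
tree = id3(train, [1, 0], domains)
print(classify(tree, ("a1", "b3")))  # reaches a leaf instead of failing
```

Because every internal node has a branch for each value in `domains`, `classify` can never run out of tree, whichever of the nine (f1, f2) combinations it is given.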
Would you mind providing a link to the software you used to learn the "incomplete" trees?
To answer your questions:
1. Not that I know of. It doesn't make sense to learn such "incomplete trees." If we knew that some attribute values would never occur, we would not include them in the specification (the file where you list attributes and their values) in the first place.
2. With the ID3 algorithm, you can prove, as I sketched above, that every tree returned by the algorithm covers all possible combinations of attribute values.
3. You're using the wrong algorithm. The data has nothing to do with it.
4. There is no such thing as an unclassifiable instance in decision-tree learning. One usually defines a decision-tree learning problem as follows. Given a train set S of examples x_1, x_2, ..., x_m of the form x_i = (v_1i, v_2i, ..., v_ni, y_i), where v_ji is the value of the j-th attribute and y_i is the class value in example x_i, learn a function (represented by a decision tree) f: X -> Y, where X is the space of all possible well-formed vectors (i.e. all possible combinations of attribute values) and Y is the space of all possible class values, which minimizes an error function (e.g. the number of misclassified examples). From this definition, you can see that the function f is required to map any combination to a class value; thus, by definition, each possible instance is classifiable.

Related

How can I include survey weights in a Poisson point process model fitted to a logistic regression quadrature scheme?

Is it possible to include weights in a Poisson point process model fitted to a logistic regression quadrature scheme? My data is a stratified sample and I would like to account for this sampling strategy in order to have valid population-level predictions.
This is a question about the model-fitting function ppm in the R package spatstat.
Yes, you can include survey weights. The easiest way is to create a covariate surveyweight, which could be a function(x,y) or a pixel image or a column of data associated with your quadrature scheme. Then when fitting the model using ppm, add the model term +offset(log(surveyweight)).
The result of ppm will be a fitted model that describes the observed point pattern. You can do prediction, simulation etc from this model, but be aware that these will be predictions or simulations of the observed point process including the effect of non-constant survey effort.
To get a prediction or simulation of the original point process (i.e. after removing the effect of non-constant survey effort) you need to replace the original covariate surveyweight by another covariate that is constant and equal to 1, then pass this to predict.ppm in the argument newdata.
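The role of the offset can be written out as follows. This is only a sketch, assuming the usual log-linear form of the intensity; the symbols w(u) for the survey weight at location u, Z(u) for the vector of covariates, and beta for the fitted coefficients are notation introduced here.

```latex
% Fitted model for the observed process: the offset has a fixed coefficient of 1
\log \lambda_{\mathrm{obs}}(u) = \log w(u) + \beta^{\top} Z(u)
\quad\Longleftrightarrow\quad
\lambda_{\mathrm{obs}}(u) = w(u)\, \exp\!\left(\beta^{\top} Z(u)\right)

% Prediction with the weight replaced by the constant 1 removes the survey effect
\lambda(u) = \exp\!\left(\beta^{\top} Z(u)\right)
```

Since log(1) = 0, replacing surveyweight by a covariate that is constantly 1 makes the offset vanish, leaving only the covariate effects.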
Here are a few lines to elaborate on the answer by @adrian-baddeley.
If you have the setup of your related question and we imagine you have the weights and two covariates in a data.frame in the same order as the points of your quadscheme:
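library(spatstat)
# Data points (larynx cases) and dummy points (lung cases) from the chorley dataset
X <- split(chorley)$larynx
D <- split(chorley)$lung
# Logistic regression quadrature scheme
Q <- quadscheme.logi(X, D)
# One row per quadrature point, in the same order as the points of Q
# (random values, purely for illustration)
covar <- data.frame(weights = runif(npoints(chorley)),
                    covar1 = rnorm(npoints(chorley)),
                    covar2 = rnorm(npoints(chorley)))
# The survey weights enter the model as an offset on the log scale
fit <- ppm(Q ~ offset(log(weights)) + covar1 + covar2, data = covar)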

Delta component doesn't show in weight learning rule of sigmoid activation MLP

As a basic proof of concept, consider a network that classifies inputs into K classes, with input x, bias b, output y, S samples, weights v, and teacher signal t, where t(k) equals 1 if the corresponding sample belongs to class k.
Variables
Let x_(is) represent the i-th input feature in the s-th sample.
v_(ks) represents the vector that holds the weights of the connections to the k-th output from all inputs within the s-th sample.
t_(s) represents the teacher signal for the s-th sample.
If we extend the above variables to consider multiple samples, the changes below have to be applied while declaring the variable z_(k), the activation function f(.), and using the cross-entropy as the cost function:
Derivation
Typically the learning rule includes the delta term (t_(k) - y_(k)). Why doesn't the delta show up in this equation? Have I missed something, or is it not mandatory for the delta rule to appear?
I managed to find the solution. It becomes clear when we consider the Kronecker delta (δck = 1 if class c matches classifier k, and δck = 0 otherwise), which means the derivation takes this shape:
Derivation
which leads to the delta rule.
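The original derivations were images and are not reproduced here. The following is only a sketch of how the Kronecker delta can bring back the delta term, under assumptions the post does not confirm: a single sample (sample index dropped), sigmoid outputs y_c = f(z_c) with z_c = sum_i v_ci x_i + b_c, and a cross-entropy cost summed over the K output units. The index convention v_ki (weight from input i to output k) is introduced for this sketch.

```latex
E = -\sum_{c=1}^{K} \left[ t_c \log y_c + (1 - t_c) \log (1 - y_c) \right],
\qquad
y_c = f(z_c) = \frac{1}{1 + e^{-z_c}}

\frac{\partial E}{\partial z_c} = -(t_c - y_c),
\qquad
\frac{\partial z_c}{\partial v_{ki}} = \delta_{ck}\, x_i

\frac{\partial E}{\partial v_{ki}}
  = \sum_{c=1}^{K} \frac{\partial E}{\partial z_c}\,
                   \frac{\partial z_c}{\partial v_{ki}}
  = -\sum_{c=1}^{K} (t_c - y_c)\, \delta_{ck}\, x_i
  = -(t_k - y_k)\, x_i
```

The Kronecker delta collapses the sum over classes to the single term c = k, which is exactly the delta term (t_k - y_k) multiplied by the input.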

Solve Record Linkage as a Constraint Satisfaction with Machine Learning

I have pairs of sets such as
A = { L, M, N, P } = { <"Lll", 47, 0.004>, <"Mm", 60, 0.95>, <"Nnnn", 33, 0.2892>, <"P", 47, 0.0125> }
B = { l, m, n, o } = { <"l", 46, 0.004>, <"m", 0, 0.95>, <"nn", 33, 0.2892>, <"oOo", 33, 0.5773> }
... and I want to automatically train an algorithm based on known-good data to know how to link the set members as
{ <L, l>, <M, m>, <N, n>, <?, o>, <P, ?> }
... with, at most, one match for each element of either set. The sets do not have to have the same size and have no guarantees about their overlap - maybe no matches, maybe all matches, maybe a mix of matches and non-matches. But there is expected to be a human-identifiable matching in many cases and the computer should approximate it.
Tried so far
H(a, b, w1, w2, w3) scores a pair of tuples <a1, a2, a3> from A and <b1, b2, b3> from B as f1(a1, b1) * w1 + f2(a2, b2) * w2 + f3(a3, b3) * w3, where f1, f2, and f3 are hand-crafted and w1, w2, and w3 are parameterized weights. I sort all pairs in A × B by their scores and take the pairs for which neither member is already represented by a higher-scored pair. I use crude hill-climbing to train the weights so that the resulting pairs match what the training data expects. A perfect weighting configuration has a threshold t which separates correct pair scores S_ab from incorrect pair scores. This algorithm routinely finds perfect configurations after a few hundred or thousand iterations for my training data of about 800 (A, B) sets totaling 2500 pairs of 8-tuples (instead of the 3-tuples illustrated). I have yet to give it a validation dataset to find out how badly this method is overfitting.
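The scoring and greedy pairing described above can be sketched as follows. This is an illustration, not the poster's code: the three similarity functions, the unit weights, and the threshold value 1.8 are all hypothetical choices made so that the example sets from the question produce the intended matching.

```python
from itertools import product

# Hypothetical hand-crafted similarity functions, one per tuple field.
def f_name(a, b):
    return 1.0 if a[0].lower() == b[0].lower() else 0.0

def f_int(a, b):
    return 1.0 / (1.0 + abs(a - b))

def f_float(a, b):
    return 1.0 - abs(a - b)

FEATURES = (f_name, f_int, f_float)

def score(a, b, weights):
    """Weighted sum of per-field similarities, as in H(a, b, w1, w2, w3)."""
    return sum(w * f(x, y) for w, f, x, y in zip(weights, FEATURES, a, b))

def greedy_match(A, B, weights, threshold):
    """Take pairs in decreasing score order; each element is used at most once.
    Pairs scoring below the threshold are treated as non-matches (assumption)."""
    scored = sorted(
        ((score(a, b, weights), i, j)
         for (i, a), (j, b) in product(enumerate(A), enumerate(B))),
        reverse=True,
    )
    used_a, used_b, pairs = set(), set(), []
    for s, i, j in scored:
        if s < threshold:
            break
        if i not in used_a and j not in used_b:
            pairs.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return pairs

A = [("Lll", 47, 0.004), ("Mm", 60, 0.95), ("Nnnn", 33, 0.2892), ("P", 47, 0.0125)]
B = [("l", 46, 0.004), ("m", 0, 0.95), ("nn", 33, 0.2892), ("oOo", 33, 0.5773)]
pairs = greedy_match(A, B, weights=(1.0, 1.0, 1.0), threshold=1.8)
print(sorted(pairs))  # index pairs (position in A, position in B)
```

Elements whose index appears in no pair (here P in A and o in B) are the unmatched ones, corresponding to <P, ?> and <?, o>.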
I'm not happy about the hardcoded treatment of the set-ness aspect of the problem. I can only imagine machine learning techniques for scoring pairs but the subsequent mapping is hand-crafted and perhaps isn't as smart as an ideal solution that considers the set-mapping as a whole. Because the machine learning part doesn't consider the whole set, it seems to me to be missing out on some information it could be using to make better decisions.
I think my illustration above could be refactored to first score all pairs in A × B as S_ab = < f1(a1, b1), f2(a2, b2), ..., fn(an, bn) > (for n-tuples) and then use an [n, ?, 1] neural network training on matches and non-matches by each S_ab. This considers a pair and outputs match/no match and does nothing to consider the whole set.
It is my understanding that neural networks don't handle variable-sized input, though perhaps I could choose an upper-bound for ||A|| and ||B|| and find some neutral encoding for padding unused nodes. And the output could be a matrix of matches along the axes indexing the elements of A along the side and B along the bottom, say. But then still the net would be sensitive to the order of elements, no?
So ...
Is there a machine learning technique that could reliably map sets to sets in this way? It is related to record linkage in obvious ways. It is a constraint satisfaction problem in that each element can be matched at most once. It would be ideal if human corrections of results could be incorporated as feedback for improved future results. If you have a way, could you please spell it out for me, because I'm not well versed in machine learning concepts.

Log-likelihood of a Markov network

I am having trouble understanding the following figure from Coursera class:
As far as I understand, the equation corresponds to the factor table:
And therefore the likelihood of a sample (a = 0, b = 0, c = 1), for example, would be:
It doesn't look like the graph in any way. Can you please explain the graph to me?
I think you're confusing probability and likelihood.
You have a probability distribution p, parameterised by \theta, which has support on (A, B, C). The probability distribution is a function of A, B, C for fixed theta. The likelihood function, which is what's being graphed in the figure above, is a function of \theta for fixed A, B, C. It's a function which says how probable fixed observations are given different values for the parameters.
In popular usage likelihood and probability are synonymous. In technical use they are not.
With the likelihood/probability issue sorted, that likelihood function is telling you that the joint probability of (A, B, C) is the product of pairwise potentials between all connected pairs, in this case (A, B) and (B, C). I{a^1, b^1} is an indicator function which is 1 when a=1 and b=1 and zero otherwise. \theta_{a^1, b^1} is the parameter corresponding to this outcome.
If I had to guess (I can't see the whole class), I would say there are four \thetas for each pairwise relationship, representing the four possible states (both 1, both 0, or one of each), and we've just dropped the ones where the corresponding indicator function is zero and so the parameters are irrelevant.
Your derivation of the equation is not correct. The form of the MRF basically says add together the parameters corresponding to the correct state of each of the pairs, exponentiate, and normalise. The normalising constant is the sum of the joint probability over all possible configurations.
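The recipe "add the parameters, exponentiate, normalise" and the difference between probability and likelihood can be sketched in a few lines of Python. The chain A - B - C with binary variables follows the answer, but the two indicator features used here (a=1, b=1 and b=1, c=1), the parameter names `theta_ab` and `theta_bc`, and the sample data are hypothetical, since the figure from the class is not available.

```python
import math
from itertools import product

STATES = list(product((0, 1), repeat=3))  # all configurations of (a, b, c)

def log_potential(a, b, c, theta_ab, theta_bc):
    """Sum of the parameters whose indicator feature is active.
    Hypothetical features: I{a=1, b=1} and I{b=1, c=1}."""
    return theta_ab * (a == 1 and b == 1) + theta_bc * (b == 1 and c == 1)

def log_partition(theta_ab, theta_bc):
    """log Z: normaliser summed over every possible configuration."""
    return math.log(sum(math.exp(log_potential(a, b, c, theta_ab, theta_bc))
                        for a, b, c in STATES))

def log_prob(x, theta_ab, theta_bc):
    """Log-probability of one configuration x for FIXED parameters."""
    a, b, c = x
    return log_potential(a, b, c, theta_ab, theta_bc) - log_partition(theta_ab, theta_bc)

def log_likelihood(data, theta_ab, theta_bc):
    """Log-likelihood: same formula, read as a function of the parameters
    for FIXED observations."""
    return sum(log_prob(x, theta_ab, theta_bc) for x in data)

data = [(1, 1, 1), (1, 1, 0), (0, 0, 1)]  # made-up observations
for theta_ab in (-1.0, 0.0, 1.0):
    print(theta_ab, log_likelihood(data, theta_ab, theta_bc=0.5))
```

Sweeping `theta_ab` and `theta_bc` over a grid and plotting `log_likelihood` would give a surface over the parameters, which is the kind of object a likelihood plot shows, as opposed to a table of probabilities over (A, B, C).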

Machine Learning Model for Multi-Label Classification where we know relationship between the labels

I have a problem where:
I need to classify the input data into one or more of the labels S1, S2, S3, S4.
There is a relationship between the labels S1, S2, S3 and S4:
If an input is labelled Sn, it must also be labelled S1..Sn.
S1, S2, S3 and S4 are like different stages for an entity X to pass through. Based on the input data, X might get through one or many of the stages. X must pass through S1 to go to S2, through S2 to go to S3, and so on.
We want to ensure that only those X which reach S3 are allowed to pass, so based on the input data we decide whether to allow X to go through S1 or not.
What machine learning models can we choose to predict whether X reaches S3, given information such as the input data and which stages X has passed for that input data?
I am thinking in the direction of multi-label classification. There might be some relationship between the input data for stages S1 and S2.
Update: I have to train with examples like
1. Input data is s1
2. Input data is s2
3. ..
4 ..
Some doubts
Your question is far from being clear, for example:
We want to optimize that most X reaches S3, so based on input data we decide whether to allow X to go through S1 or not
This actually suggests that the best model would be "always answer yes", as it maximizes the number of objects reaching S3 (it simply lets every object reach this point).
General ideas
I assume two possible interpretations:
You have a label "pipeline", which simply means that an object cannot be labelled S_n unless it has already been labelled with all S_i for i < n.
This does not seem to be a problem for one single model: you can pipeline models in a natural way, i.e. train a model 1 which recognizes whether an object x should have label S_1. Next, you train a model 2 on all data that has label S_1 in the training set to predict label S_2, and so on. During execution you simply ask each model i whether it accepts (labels) the incoming object x, and stop when the first one says "no".
You have some more complex constraints on the labels, which may be strict or not. For such cases, you should try one of the many methods of multi-label classification with constraints; in particular, there is a tech report regarding this aspect of ML.
Solution 1 - approximating test functions
If your problem can be described as:
You have data points X, such that for each of them you know the maximum number of some pipelineable tests T_i which x passes
You want to train a classifier able to predict the maximum number of consecutive tests that your point x passes
You do not have access to actual tests T_i or they are very inefficient
Then the simplest way would be to apply the following training procedure instead of one classifier:
Take all your data points, label those with y=0 as 0 and those with y>=1 as 1, and train some binary classifier (for example an SVM). So you simply temporarily relabel your data so that it separates the points that pass the first test from those that don't. Let's call this classifier cl_1.
Now take your data points, label those with y=1 as 0 and those with y>=2 as 1, train a binary classifier again, and call it cl_2.
Repeat until all tests have their classifier. In general, we call the classifier cl_i when it distinguishes between points labelled with y=i-1 and those with y>=i.
Now, to classify a new point, you simply check your cl_i in order for i=1,..,tests, stop at the first one that answers 0, and answer with the largest i such that cl_1(x) = ... = cl_i(x) = 1 (or 0 if cl_1 already rejects x). So you "simulate" your tests with classifiers, and simply say how many of these approximate tests the point passed.
To sum up: each test can be approximated with one binary classifier, and then the question "what is the highest consecutive test number that our point passes" is approximated with "what is the highest consecutive classifier number for which our point is classified as true".
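The procedure above can be sketched in Python as follows. To stay self-contained the sketch uses a one-dimensional threshold "stump" as a hypothetical stand-in for the binary classifier (the answer suggests an SVM); `fit_stump`, `fit_cascade`, `predict_stage`, and the toy data are all made up for illustration, and the sketch assumes every stage has at least one training point.

```python
def fit_stump(xs, labels):
    """Hypothetical stand-in for any binary classifier: pick the 1-D
    threshold with the fewest training errors and predict int(x >= threshold)."""
    best = None
    for t in sorted(set(xs)):
        errors = sum((x >= t) != bool(l) for x, l in zip(xs, labels))
        if best is None or errors < best[0]:
            best = (errors, t)
    threshold = best[1]
    return lambda x: int(x >= threshold)

def fit_cascade(xs, ys, n_tests, fit_binary=fit_stump):
    """Train cl_1..cl_n. cl_i separates points with y = i-1 (label 0)
    from points with y >= i (label 1), using only points with y >= i-1."""
    cascade = []
    for i in range(1, n_tests + 1):
        subset = [(x, int(y >= i)) for x, y in zip(xs, ys) if y >= i - 1]
        cascade.append(fit_binary([x for x, _ in subset],
                                  [l for _, l in subset]))
    return cascade

def predict_stage(cascade, x):
    """Number of consecutive classifiers that accept x, stopping at the first 'no'."""
    passed = 0
    for clf in cascade:
        if clf(x) == 0:
            break
        passed += 1
    return passed

# Toy data: y is the number of the last test passed by x.
xs = [0.2, 0.8, 1.1, 1.9, 2.2, 2.7, 3.1, 3.9]
ys = [0, 0, 1, 1, 2, 2, 3, 3]
cascade = fit_cascade(xs, ys, n_tests=3)
print([predict_stage(cascade, x) for x in (0.5, 1.5, 2.5, 3.5)])
```

Any classifier object can replace `fit_stump` as long as `fit_binary(xs, labels)` returns a callable that maps a point to 0 or 1.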
Solution 2 - simple regression
You can also simply apply regression from your input space to the number of tests a point reaches. Regression has a built-in assumption that the output values are ordered and related to each other. So if you train on pairs (x,y), where y is the number of the last test passed by x, then you are actually using the fact that the output y=3 is closely related to first getting y=2 in the computations. Such a regression (non-linear!) could be done simply using neural networks (possibly regularized).

Resources