Decision Tree, What is Wrong here? - machine-learning

I took part in a contest two days ago. One of the questions was as follows:
a decision tree of depth 2 is constructed for two binary features;
how many features in the hypothesis space can be represented by the following tree?
The answer sheet gives the solution as 16, but the committee says this
question was removed because the answer was wrong. Can anyone explain
why it was removed? Which part of the answer is wrong?

In this case, the tree already represents all possible feature combinations a decision tree can distinguish, so there are 4 possible points in the hypothesis space overall.

The number of partitions that this tree carves the hypothesis space into is equal to the number of leaves (4). That is also the maximum in this case, since with two binary features the total number of unique inputs is 2^2, or 4.
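A quick enumeration makes the counting concrete (a minimal sketch in Python; nothing here is specific to the contest question): two binary features give 2^2 = 4 distinct inputs, while assigning a binary label to each of those 4 inputs gives 2^4 = 16 distinct labelings, which is presumably where the answer sheet's 16 comes from.

```python
# Count the distinct inputs of two binary features, and the distinct binary
# labelings (Boolean functions) of those inputs.
from itertools import product

inputs = list(product([0, 1], repeat=2))               # all (x1, x2) pairs
labelings = list(product([0, 1], repeat=len(inputs)))  # one label per input

print(len(inputs))     # 4  -> the four leaves of a depth-2 tree
print(len(labelings))  # 16 -> Boolean functions on two binary features
```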

16? No way. The question is dumb, honestly.
The number of features you have in a decision tree (DT) corresponds to the maximum depth of the tree, which is the maximum number of questions you can ask the DT in order to model the feature space.
That is a logical consequence of the DT splitting the feature space on a single feature at each node.


Loss function for Question Answering posed as Multiclass Classification?

I'm working on a question answering problem with limited data (~10,000s of data points) and very few features for both the context/question as well as the options/choices. Given:
a question Q and
options A, B, C, D, E (each characterized by some features, say, string similarity to Q or number of words in each option)
(while training) a single correct answer, say B.
I wish to predict exactly one of these as the correct answer. But I'm stuck because:
If I arrange ground truth as [0 1 0 0 0], and give the concatenation of QABCDE as input, then the model will behave as if classifying an image into dog, cat, rat, human, bird, i.e. each class will have a meaning, however that's not true here. If I switched the input to QBCDEA, the prediction should be [1 0 0 0 0].
If I split each data point into 5 data points, i.e. QA:0, QB:1, QC:0, QD:0, QE:0, then the model fails to learn that they're in fact interrelated, and only one of them must be predicted as 1.
One approach that seems viable is to make a custom loss function which penalizes multiple 1s for a single question, and which penalizes no 1s as well. But I think I might be missing something very obvious here :/
I'm also aware of how large models like BERT do this over SQuAD-like datasets. They add positional embeddings to each option (e.g. A gets 1, B gets 2), and then use a sort of concatenation over QA1 QB2 QC3 QD4 QE5 as input, and [0 1 0 0 0] as output. Unfortunately, I believe this will not work in my case given the very small dataset I have.
The problem you're having is that you removed all useful information from your "ground truth". The training target is not the ABCDE labels -- the target is the characteristics of the answers that those letters briefly represent.
Those five labels are merely array subscripts for classifications that are an arbitrarily shuffled selection of 5 items from your training space. Bottom line: there is no information in those labels.
Rather, extract the salient characteristics from those answers. Your training needs to find the answer (characteristic set) that sufficiently matches the question. As such, what you're doing is close to multi-label training.
Multi-label models should handle this situation; these include the models that label photos, identifying multiple classes represented in a single input.
Does that get you moving?
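To make "find the answer whose characteristics match the question" concrete, here is a minimal sketch (not necessarily the setup described above; PyTorch, with hypothetical feature names and sizes): score every (question, option) pair with one shared network and normalize across the five options with a softmax, so exactly one answer is chosen and the letters A-E never enter the model.

```python
# Hypothetical sketch: one shared scorer for (question, option) feature pairs,
# with a softmax across the five options so exactly one answer wins.
import torch
import torch.nn as nn

N_PAIR_FEATURES = 4   # e.g. string similarity to Q, option word count (made up)
N_OPTIONS = 5         # A..E, but only as array positions, never as classes

class OptionScorer(nn.Module):
    def __init__(self, n_features):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, pair_features):                 # (batch, N_OPTIONS, n_features)
        return self.net(pair_features).squeeze(-1)    # (batch, N_OPTIONS) scores

model = OptionScorer(N_PAIR_FEATURES)
loss_fn = nn.CrossEntropyLoss()                       # softmax over the 5 options
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake batch: 8 questions, 5 options each, 4 features per (question, option) pair.
x = torch.randn(8, N_OPTIONS, N_PAIR_FEATURES)
y = torch.randint(0, N_OPTIONS, (8,))                 # index of the correct option

loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```

Because the same weights score every option, shuffling the options only permutes the scores together with the target index, which removes the QABCDE vs. QBCDEA asymmetry described in the question.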
Response to OP comment
You understand correctly: predicting 0/1 for five arbitrary responses is meaningless to the model; the single-letter variables are of only transitory meaning, and have no relation to anything trainable.
A short thought experiment will demonstrate this. Imagine that we sort the answers such that A is always the correct answer; this doesn't change the information in the inputs and outputs; it's a valid arrangement of the multiple-choice test.
Train the model; we'll get to 100% accuracy in short order. Now, consider the model weights: what has the model learned from the input? Nothing -- the weights will train to ignore the input and select A, or will have absolutely arbitrary values that come to the A conclusion.
You need to ignore the ABCDE designations entirely; the target information is in the answers themselves, not in those letters. Since you haven't posted any sample cases, we have little to guide us for an alternate approach.
If your paradigm is a typical multiple-choice examination, with few restrictions on the questions and answers, then the problem you're tackling is far larger than your project is likely to solve -- you're in "Watson" territory, requiring a large knowledge base and a strong NLP system to parse the inputs and available responses.
If you have a restricted paradigm for the answers, perhaps you can parse them into phrases and relations, yielding a finite set of classes to consider in your training. In this case, a multi-label model might well be able to solve your problem.
If your application is open-ended, i.e. open topic, then I expect that you need a different model class (such as BERT), but you'll still need to consider the five answers as text sequences, not as letters. You need a holistic match to the subject at hand. If this is a typical multiple-choice exam, then your model will still have classification troubles, as all five answers are likely to be on topic; finding the correct answer should depend on some level of semantic insight into question and answer, something stronger than "bag of words" processing.
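As a hedged illustration of "treat the five answers as text sequences and match them semantically" (this assumes the sentence-transformers package and a pretrained encoder, which may or may not be viable for a dataset this small): embed the question and each option, then pick the option closest to the question.

```python
# Hypothetical sketch: pick the option whose embedding is most similar to the
# question's (assumes the sentence-transformers package is installed).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

question = "Which planet is known as the Red Planet?"   # made-up example
options = ["Venus", "Mars", "Jupiter", "Saturn", "Mercury"]

q_emb = model.encode(question, convert_to_tensor=True)
o_emb = model.encode(options, convert_to_tensor=True)

scores = util.cos_sim(q_emb, o_emb)[0]     # cosine similarity to each option
print(options[int(scores.argmax())])       # predicted option
```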

SPSS two way repeated measures ANOVA

I am fairly new to statistics.
I ran an experiment and analysed it with a two-way repeated-measures ANOVA, calculated in SPSS. In most papers I have seen, the F-value and the degrees of freedom are reported as well. Is it normal to report those values? If so, which values do I take from the SPSS output?
How do I interpret these values? What do they mean?
When does the F-value support a significant result and when not?
What are good values for the F-value and the degrees of freedom?
In some articles I also read about critical F-values; how do I get this value?
Most articles describe how to calculate those values but do not explain their meaning for the experiment.
Some clarification on these issues is greatly appreciated.
My English is not very good, but I will try to answer your question.
The main purpose of ANOVA is to test statistically whether the measured groups have the same mean or not. So we set up a null hypothesis and an alternative hypothesis, and then apply a test statistic to the data. You can use ANOVA if the groups have the same variance (the squared standard deviation).
You need to test this first. This is a hypothesis test too: the null hypothesis is that the groups have the same variance, and the alternative hypothesis is that they don't.
You make the decision from the Sig. value: if it is higher than 0.05, we usually accept the null hypothesis. If the variances are equal, we can use ANOVA. (I assume the data follow a normal distribution.) For the ANOVA itself, the null hypothesis is that the groups have equal means, and the alternative hypothesis is that at least one group has a different mean. You make your decision from the Sig. value as before: if it is higher than 0.05, we accept the null hypothesis.
The critical F-value is not important if you are calculating on a computer. You can build an acceptance interval from the lower and upper critical F-values, and if the F-value falls inside that interval you accept the null hypothesis, but I only used this method in statistics class. You don't need the F-value and the df in the report, because they don't explain anything on their own.
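For completeness, here is a minimal sketch of where those numbers live outside SPSS (Python with statsmodels; the column names and data are made up). Each effect in a two-way repeated-measures ANOVA gets an F-value, a pair of degrees of freedom and a p-value (the "Sig." column in SPSS), and papers usually report them together as F(df1, df2) = ..., p = ... . The critical F-value is just the quantile of the F-distribution at your alpha level for those degrees of freedom.

```python
# Minimal sketch with made-up data: a 2x2 repeated-measures ANOVA and the
# critical F-value for one of its effects.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "subject": np.repeat(np.arange(10), 4),          # 10 subjects, 2x2 design
    "A": np.tile(["a1", "a1", "a2", "a2"], 10),
    "B": np.tile(["b1", "b2", "b1", "b2"], 10),
    "score": rng.normal(size=40),
})

res = AnovaRM(df, depvar="score", subject="subject", within=["A", "B"]).fit()
print(res.anova_table)        # columns: F Value, Num DF, Den DF, Pr > F

# Critical F at alpha = 0.05 for the main effect of A: the observed F is
# significant (one-sided test) exactly when it exceeds this value, i.e. p < 0.05.
num_df, den_df = res.anova_table.loc["A", ["Num DF", "Den DF"]]
print(stats.f.ppf(0.95, num_df, den_df))
```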

modeling feature set with text documents

Example:
I have m sets of ~1000 text documents, ~10 are predictive of a binary result, roughly 990 aren't.
I want to train a classifier to take a set of documents and predict the binary result.
Assume for discussion that the documents each map the text to 100 features.
How is this modeled in terms of training examples and features? Do I merge all the text together and map it to a fixed set of features? Do I have 100 features per document * ~1000 documents (100,000 features) and one training example per set of documents? Do I classify each document separately and analyze the resulting set of confidences as they relate to the final binary prediction?
The most common way to handle text documents is with a bag-of-words model. The class proportions are irrelevant. Each word gets mapped to a unique index, and the value at that index is the number of times that token occurs (there are smarter weightings, such as tf-idf). The number of features/dimensions is then the number of unique tokens/words in your corpus. There are many issues with this, and some of them are discussed here. But it works well enough for many things.
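As a minimal sketch (scikit-learn, with toy documents), a bag-of-words representation looks like this:

```python
# Each document becomes a vector of token counts over the corpus vocabulary.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the quick brown fox", "the lazy dog", "the quick dog barks"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse (n_docs, n_unique_tokens)

print(vectorizer.get_feature_names_out())   # the vocabulary, one feature per token
print(X.toarray())                          # raw counts per document
```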
I would approach this as a two-stage problem.
Stage 1: predict the relevancy of a document from the set of 1000. For best combination with stage 2, use something probabilistic (logistic regression is a good start).
Stage 2: Define features on the output of stage 1 to determine the answer to the ultimate question. These could be things like the counts of words for the n most relevant docs from stage 1, the probability of the most probable document, the 99th percentile of those probabilities, variances in probabilities, etc. Whatever you think will get you the correct answer (experiment!)
The reason for this is as follows: concatenating documents together will drown you in irrelevant information. You'll spend ages trying to figure out which words/features allow actual separation between the classes.
On the other hand, if you concatenate the feature vectors together, you'll run into an exchangeability problem. By that I mean: with, say, 100 features per document, word 1 of document 1 will be in position 1, word 1 of document 2 will be in position 101, in document 3 it will be in position 201, etc., and there will be no way to know that these features are all related. Furthermore, presenting the documents in a different order would shuffle the positions in the feature vector, and your learning algorithm won't be aware of this. Equally valid orderings of the documents will lead to completely different results in an entirely non-deterministic and unsatisfying way (unless you spend a long time designing a custom classifier that isn't afflicted with this problem, which might ultimately be necessary, but it's not the thing I'd start with).
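Here is a minimal sketch of that two-stage idea (scikit-learn, with fake data and hypothetical feature choices, just to show the plumbing): stage 1 scores each document's relevance, stage 2 turns the per-document probabilities into a small fixed-length feature vector per set and classifies that.

```python
# Hypothetical two-stage sketch. Assumes per-document relevance labels exist for
# stage 1, plus one binary label per set of documents for stage 2.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Fake data: 20 sets, 1000 docs per set, 100 features per doc.
n_sets, n_docs, n_feat = 20, 1000, 100
doc_features = rng.normal(size=(n_sets, n_docs, n_feat))
doc_relevant = rng.random((n_sets, n_docs)) < 0.01        # ~10 relevant docs/set
set_label = rng.integers(0, 2, size=n_sets)               # the binary target

# Stage 1: per-document relevance model (probabilistic, as suggested above).
stage1 = LogisticRegression(max_iter=1000)
stage1.fit(doc_features.reshape(-1, n_feat), doc_relevant.reshape(-1))

# Stage 2: aggregate per-set statistics of the stage-1 probabilities.
def set_features(docs):
    p = stage1.predict_proba(docs)[:, 1]                  # relevance per document
    top = np.sort(p)[-10:]                                # the ~10 most relevant
    return np.array([p.max(), np.percentile(p, 99), p.var(), top.mean()])

X_sets = np.stack([set_features(doc_features[i]) for i in range(n_sets)])
stage2 = LogisticRegression()
stage2.fit(X_sets, set_label)
print(stage2.predict(X_sets))
```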

Minimum number of observation when performing Random Forest

Is it possible to apply RandomForests to very small datasets?
I have a dataset with many variables but only 25 observations each. Random forests produce reasonable results with low OOB errors (10-25%).
Is there any rule of thumb regarding the minimum number of observations to use?
In fact, one of the response variables is unbalanced, and if I subsample it I will end up with an even smaller number of observations.
Thanks in advance
Absolutely, RF can be used on this type of dataset (i.e. p > n). In fact, RF is used in fields like genomics where the number of variables is >= 20,000 and there are only a very small number of rows, say 10-12. There, the entire problem is figuring out which of the 20k variables would make up a parsimonious marker (i.e. feature selection is the entire problem).
I don't have any rules of thumb about the minimum size, other than: if your model doesn't work well on a held-back sample (or with leave-one-out cross-validation, which might work well in your case), then you should try something else.
Hope this helps
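A minimal sketch of that check (scikit-learn, with fake data at roughly the scale described, p >> n and 25 rows), using leave-one-out cross-validation rather than trusting the OOB error alone:

```python
# Random forest on a tiny p >> n dataset, evaluated with leave-one-out CV.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 500))       # 25 observations, 500 variables (fake)
y = rng.integers(0, 2, size=25)      # binary response (fake labels)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(scores.mean())                 # leave-one-out accuracy estimate
```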

How to deal with feature vector of variable length?

Say you're trying to classify houses based on certain features:
Total area
Number of rooms
Garage area
Not all houses have garages, but when they do, the garage area makes for a very discriminating feature. What's a good approach to leverage the information contained in this feature?
You could incorporate a zero/one dummy variable indicating whether there is a garage, as well as the cross-product of the garage area with the dummy (for houses with no garage, set the area to zero).
The best approach is to build your dataset with all the features; in most cases it is just fine to fill the columns that are not available with zeroes.
Using your example, it would be something like:
Total area | Number of rooms | Garage area
100        | 2               | 0
300        | 2               | 5
125        | 1               | 1.5
Often, the learning algorithm you chose will be powerful enough to use those zeroes to classify that entry properly. After all, the absence of a value is still information for the algorithm. This could become a problem if your data is skewed, but in that case you need to address the skewness anyway.
EDIT:
I just realized there was another answer, with a comment from you saying you were afraid to use zeroes because they could be confused with small garages. While I still don't see a problem with that (there should be enough difference between a small garage and zero), you can still use the same structure and mark the non-existent garage area with a negative number (say, -1).
The solution indicated in the other answer is perfectly plausible too; having an extra feature indicating whether the house has a garage or not would work fine (especially in decision-tree-based algorithms). I just prefer to keep the dimensionality of the data as low as possible, but in the end this is more a preference than a technical decision.
You'll want to incorporate a garage-presence indicator: a feature which is 1 when the garage size is greater than zero and 0 when there is no garage.
Your feature vector will then be:
area | num_rooms | garage_size | garage_exists
Your machine learning algorithm will then be able to see this (non-linear) feature of garage size.
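A minimal sketch of that feature construction (pandas, with made-up numbers): zero-fill the missing garage area and add the presence indicator, matching the feature vector above.

```python
# Zero-fill missing garage areas and add a garage-presence indicator column.
import numpy as np
import pandas as pd

houses = pd.DataFrame({
    "area": [100, 300, 125],
    "num_rooms": [2, 2, 1],
    "garage_size": [np.nan, 5.0, 1.5],   # NaN = no garage
})

houses["garage_exists"] = houses["garage_size"].notna().astype(int)
houses["garage_size"] = houses["garage_size"].fillna(0)

print(houses)   # area | num_rooms | garage_size | garage_exists
```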
