Cut points from which to choose the best split in a Decision Tree regressor with a continuous feature? - machine-learning

I understand that in the Decision Tree algorithm, when a split is decided, we choose the best split based on some criterion, and when looking for the best split we have to iterate through some list of candidate values. But it seems very computationally expensive to consider every value of the feature as a possible threshold (or so-called cut point). Thus, there is a need for some heuristic for choosing these thresholds. For example, if we have a continuous feature and a categorical target (i.e., we are dealing with a classification problem), we can do the following: sort the dataset by the given feature and consider for splitting only the values where the target variable changes its value.
But what do you do if you have a regression task, i.e. both the feature and the target are continuous variables? I realize that I have to calculate, for example, the mean variance or the mean deviation from the median in both branches for each split. But how do you decide which values you choose your best split from? Surely people have come up with some optimal solution to avoid iterating over every value of the feature in the training set.
I've done some research, but most sources focus only on the different criteria and on how you determine whether a split is good, which does not really answer my question.
I've found this question, but Predictor only suggests that it can be done using percentiles, and I don't think there is any guarantee that this is how it is really done in practice.
I've also found this question, but geledek's answer is not very clear to me (it appears to be copy-pasted from the presentation he refers to). I'm pretty much fine with Method 1, but I would really appreciate it if someone could explain Method 2 in more detail, or perhaps point to a different source or give an explanation of your own.
UPD: I've also looked into the scikit-learn repo on GitHub and found this line. I can't quite follow the overall code, but this particular line seems to imply that thresholds are chosen as the averages of neighboring feature values (which corresponds to the aforementioned Method 1 from the question above). Is that correct? I also don't understand this comment: # sum of halves is used to avoid infinite value. How exactly does dividing by two prevent infinite values? Don't you only get infinity when dividing by zero? Or is dividing each term by two just a way of getting the average value, rather than a way of avoiding infinity?
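For what it's worth, the usual reading of that comment is a floating-point concern rather than division by zero: for two very large finite floats a and b, a + b can overflow to infinity even though their midpoint is perfectly representable, whereas a / 2 + b / 2 cannot overflow. A minimal sketch of the midpoint idea (a hypothetical helper, not scikit-learn's actual code):

import numpy as np

def candidate_thresholds(feature_values):
    # Midpoints between consecutive distinct sorted feature values,
    # i.e. the "averages of neighboring feature values" reading of Method 1.
    xs = np.unique(feature_values)  # sorted, duplicates removed
    # "Sum of halves": (xs[i] + xs[i+1]) / 2 could overflow to inf when
    # both values are near the float maximum; xs[i]/2 + xs[i+1]/2 cannot.
    return xs[:-1] / 2.0 + xs[1:] / 2.0

big = np.finfo(np.float64).max
print((big + big) / 2)    # inf: the sum overflows before the division
print(big / 2 + big / 2)  # ~1.798e+308: finite, the correct midpoint

So the division by two does both jobs at once: it produces the average, and doing it before the addition keeps the intermediate result finite.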

Related

Selected features in random forest subsampling

I am trying to figure out which features are being considered in each subsample in my classification problem. For this, I am assuming that there is a random subset of features of length max_features that is considered when building every tree.
I am interested in this because I am using two different types of features for my problem, so I want to make sure that in every tree both types of features are used for every node split. One way to at least make each tree consider all features is to set the max_features parameter to None. So one question here would be:
Does that mean that both types of features are being considered for every node split?
Another one derived from the previous question is:
Since Random Forest makes a subsample for every tree, is this subsampling among cases (rows) or among columns (features) as well? Besides, can this subsampling be done by groups of rows instead of randomly?
Besides, it does not seem to be a good idea to use all the features via the max_features parameter, either in Decision Trees or in Random Forests, since it runs counter to the whole point and definition of Random Forest in terms of correlation among trees (I am not completely sure about this statement).
Does anyone know if this is something that can be modified in the source code or if at least it can be approached differently?
Any suggestion or comment is very welcome.
Feel free to correct any assumption.
I have been reading the source code about this but could not find where it might be defined.
Source code inspected so far:
splitter.py code from decision tree
forest.py code from random forest
Does that mean that both types of features are being considered for every node split?
Given that you have correctly pointed out above that setting max_features to None will indeed force the algorithm to consider all features in every split, it is unclear what exactly you are asking here: all means all, and, from the point of view of the algorithm, there are no different "types" of features.
Since Random Forest makes a subsample for every tree, is this subsampling among cases (rows) or among columns (features) as well?
Both. But, regarding the rows, it is not exactly subsampling, it is actually bootstrap sampling, i.e. sampling with replacement, which means that, in each sample, some rows will be missing while others will be present multiple times.
Random forest is in fact the combination of two independent ideas: bagging, and random selection of features. The latter corresponds essentially to "column subsampling", while the former includes the bootstrap sampling I have just described.
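As a rough illustration of those two ideas (a minimal numpy sketch, not scikit-learn's actual implementation):

import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features, max_features = 100, 10, 3

# Bagging: a bootstrap sample of the rows (with replacement), drawn
# once per tree -- some rows repeat, others are left out entirely.
row_idx = rng.integers(0, n_rows, size=n_rows)

# Random feature selection: a fresh subset of max_features columns,
# drawn without replacement at every node split.
split_features = rng.choice(n_features, size=max_features, replace=False)

print(np.unique(row_idx).size, "distinct rows in this bootstrap sample")
print("features considered at this split:", split_features)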
Besides, can this subsampling be done by groups of rows instead of randomly?
AFAIK no, at least in the standard implementations (including scikit-learn).
Does anyone know if this is something that can be modified in the source code or if at least it can be approached differently?
Everything can be modified in the source code, literally; now, whether it is really necessary (or even a good idea) to do so is a different story...
Besides, it does not seem to be a good idea to use all the features via the max_features parameter
It does not indeed, as this is the very characteristic that discriminates RF from the simpler bagging approach (short for bootstrap aggregating). Experiments have shown that adding this random selection of features at each step boosts performance relative to simple bagging.
Although your question (and issue) sounds rather vague, my advice would be to "sit back and relax", letting the (powerful enough) RF algorithm do its job with your data...

Accurate general description of Regression versus Classification

So I have the following problem: I realized (while writing my master's thesis) that I am still not sure about, or have only vague descriptions of, some of the machine learning principles.
For instance, I vaguely remember that at some point I heard the following description:
The output (label) of a classification task is discrete and finite while the output (label) of a regression task is continuous and can be infinite
The one word that I am unsure of is infinite for regression in this description.
For instance, if you assume that (for whatever reason) you have 2D data points that are almost distributed like a sine wave (with some noise) and you use polyfit to fit a polynomial of degree k to it (see the figure here, with k = 8). Now you have some data in a specific range, e.g. here the range of available points in the x-direction is [0, 12], which is used to fit the polynomial.
However, wouldn't you be able to quickly get the y-result for the value x = 1M (or an arbitrarily large number), as you have the general shape of the polynomial? Is that not what infinite labels mean?
Maybe I am just wrongly remembering stuff that I learned years ago ;).
best regards
First of all, this is a question more fitting for the more theoretically inclined sites of StackExchange, like Stats StackExchange, Math StackExchange, or the Data Science StackExchange, which conveniently also provide answers to your question.
But not quite. In any case, your problem seems to be the distinction between input and output. The type of task (i.e. either classification or regression) is based solely on the output of your model and has nothing to do with the input.
You could have a ton of "continuous input variables" (or even a mixture with discrete ones) and still call it a classification task, as long as it has a discrete set of output values.
Furthermore, the infinite simply refers to the fact that these values are not bounded, i.e. you cannot easily restrict your regression task to a specific range. If you suddenly input a value completely outside of your training range (as in your example), you will likely get an "infinite" y value, since your network was only trained on that specific range; a problem that also happens with polynomial fitting, as the following example shows:
The red line could be the learned function for your network, so if you suddenly go far beyond known values, you likely get some extreme value (unless you train very well).
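The original answer illustrates this with a figure; as a stand-in, a minimal numpy sketch (using polyfit, as in the question) makes the same point numerically:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 12, 60)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)  # noisy sine wave

coeffs = np.polyfit(x, y, deg=8)  # degree-8 fit, as in the question

print(np.polyval(coeffs, 6.0))          # inside [0, 12]: a sensible value
print(np.polyval(coeffs, 1_000_000.0))  # far outside: astronomically large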
Opposed to that, a classification network would still predict one of the given classes. I like to imagine it as a kind of Voronoi diagram: even if your point is arbitrarily far from any of the previous points, it will still belong to some category.

How to evaluate a word2vec built on specific context files

Using gensim word2vec, I built a CBOW model with a bunch of litigation files to represent words as vectors for a Named-Entity-Recognition problem, but I want to know how to evaluate my representation of words. If I use other datasets like wordsim353 (NLTK) or other online datasets from Google, it doesn't work, because I built the model specifically for my domain dataset of files. How do I evaluate my word2vec's word-vector representations? I want words belonging to similar contexts to be closer in vector space. How do I ensure that the built model does that?
I started by using a technique called odd one out. E.g.:
model.wv.doesnt_match("breakfast cereal dinner lunch".split()) --> 'cereal'
I created my own dataset (for validation) using the words from the word2vec training data, and started evaluating by taking three words of similar context and one odd word out of context. But the accuracy of my model is only 30%.
Will the above method really help in evaluating my w2v model, or is there a better way?
I want to go with a word-similarity measure, but I need a (human-assessed) reference score to evaluate my model. Or is there some other technique to do it? Please do suggest any ideas or techniques.
Ultimately this depends on the purpose you intend for the word-vectors – your evaluation should mimic the final use as much as possible.
The "odd one out" approach may be reasonable. It's often done with just 2 words that are somehow, via external knowledge/categorization, known to be related (in the aspects that are important for your end use), then a 3rd word picked at random.
If you think your hand-crafted evaluation set is of high-quality for your purposes, but your word-vectors aren't doing well, it may just be that there are other problems with your training: too little data, errors in preprocessing, poorly-chosen metaparameters, etc.
You'd have to look at individual failure cases in more detail to pick what to improve next. For example, even when it fails at one of your odd-one-out tests, do the lists of most-similar words, for each of the words included, still make superficial sense in an eyeball-test? Does using more data or more training iterations significantly improve the evaluation scoring?
A common mistake during both training and evaluation/deployment is to retain too many rare words, on the (mistaken) intuition that "more info must be better". In fact, words with only a few occurrences can't get very high-quality vectors. (Compared to more-frequent words, their end vectors are more heavily influenced by the random original initialization, and by the idiosyncracies of their few occurrences available rather than their most-general meaning.) And further, their presence tends to interfere with the improvement of other nearby more-frequent words. Then, if you include the 'long tail' of weaker vectors in your evaluations, they tend to somewhat arbitrarily intrude in rankings ahead of common words with strong vectors, hiding the 'right' answers to your evaluation questions.
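In gensim this pruning is controlled by Word2Vec's min_count parameter, which drops words occurring fewer times than the threshold before training; a minimal sketch with a made-up toy corpus:

from gensim.models import Word2Vec

# Made-up toy corpus: a list of tokenized sentences, repeated so that
# every word clears the frequency threshold.
sentences = [
    ["court", "issued", "the", "ruling"],
    ["the", "verdict", "was", "appealed"],
] * 10

# min_count=5 (gensim's default) discards words seen fewer than 5 times,
# trimming the long tail of weak vectors; sg=0 selects CBOW, as in the question.
model = Word2Vec(sentences, min_count=5, sg=0)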
Also, note that the absolute value of an evaluation score may not be that important, because you're just looking for something that points your other optimizations in the right direction for your true end-goal. Word-vectors that are just slightly-better at precise evaluation questions might still work well-enough in other fuzzier information-retrieval contexts.

Decision tree entropy calculation target

I found several examples of two types.
Single feature
Given data with only two classes of items, for example only blue and yellow balls; i.e. we have only one feature, in this case color. This is a clear example to show the "divide and conquer" rule applied to entropy. But it is senseless for any prediction or classification problem, because if we have an object with only one feature and its value is known, we don't need a tree to decide that "this ball is yellow".
Multiple features
Given data with multiple features and a feature to predict (known for the training data). We can calculate a predicate based on the minimum average entropy for each feature. Closer to life, isn't it? It was clear to me until I tried to implement the algorithm.
And now I have a contradiction in my mind.
If we calculate entropy relative to the known features (one per node), we will get meaningful results when classifying with the tree only if the unknown feature depends strictly on every known feature. Otherwise a single unrelated known feature could break the whole prediction, driving the decision the wrong way. But if we calculate entropy relative to the values of the feature we want to predict, we are back at the first, senseless example. In that case it makes no difference which known feature we use for a node...
And a question about a tree building process.
Should I calculate entropy only for the known features and just trust that all the known features are related to the unknown one? Or should I calculate entropy for the unknown feature (known for the training data) TOO, to determine which feature affects the result most?
I had the same problem (in maybe a similar programming task) some years ago: do I calculate the entropy against the complete set of features, the relevant features for a branch, or the relevant features for a level?
It turned out like this: in a decision tree it comes down to comparing entropies between different branches to determine the optimal branch. Comparison requires equal base sets, i.e. whenever you want to compare two entropy values, they must be based on the same feature set.
For your problem you can go with the features relevant to the set of branches you want to compare, as long as you are aware that with this solution you cannot compare entropies between different branch sets.
Otherwise go with the whole feature set.
(Disclaimer: the above solution is a mental protocol from a problem that led to about an hour of thinking some years ago. Hopefully I got everything right.)
PS: Beware of the car dataset! ;)
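For reference, here is a minimal sketch of the textbook ID3-style computation, in which entropy is always measured over the labels of the target (the feature to predict) within each subset a candidate split produces; the node then uses the feature with the highest information gain:

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of the target labels within one node.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    # Parent entropy minus the weighted entropy of the children
    # produced by splitting on one (categorical) known feature.
    n = len(labels)
    children = {}
    for v, y in zip(feature_values, labels):
        children.setdefault(v, []).append(y)
    weighted = sum(len(ys) / n * entropy(ys) for ys in children.values())
    return entropy(labels) - weighted

# Tiny example: color is a known feature, label is the target to predict.
colors = ["red", "red", "blue", "blue"]
labels = ["yes", "no", "no", "no"]
print(information_gain(labels, colors))  # ~0.311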

SPSS two way repeated measures ANOVA

I am fairly new to statistics.
I ran an experiment and used a two-way ANOVA with repeated measures. The calculation was done in SPSS. In most papers I have seen, the F-value and the degrees of freedom are reported as well. Is it normal to report those values too? If so, which values do I take from the SPSS output?
How do I interpret these values? What do they mean?
When does the F-value support a significant result, and when does it not?
What are good values for the F-value and the degrees of freedom?
In some articles I also read about critical F-values; how do I get this value?
Most articles describe how to calculate those values but do not explain their meaning for the experiment.
Some clarification on these issues would be greatly appreciated.
My English is not very good, but I will try to answer your question.
The main purpose of ANOVA is to obtain statistical evidence of whether the measured groups have the same mean or not. So we form a null hypothesis and an alternative hypothesis, and then apply a test statistic to the data. You can use ANOVA if the groups have the same variance (squared standard deviation).
You need to test this. This is a hypothesis test too: the null hypothesis is that the groups have the same variance, and the alternative hypothesis is that they don't.
You make this decision from the Sig. value: if the value is higher than 0.05, we usually accept the null hypothesis. If the variances are equal, we can use ANOVA. (I assume that the data follows a normal distribution.) Here the null hypothesis is that the groups have equal means, and the alternative hypothesis is that at least one group has a different mean. You make your decision from the Sig. value, as I said before: if the value is higher than 0.05, we accept the null hypothesis.

The critical F-value is not important if you are calculating on a computer. You can construct an acceptance interval from the lower and upper critical F-values, and if the F-value falls within the interval you accept the null hypothesis, but I only used this method in statistics class. You don't need the F-value and the df in the report by themselves, because they don't explain anything on their own.
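If you do want a critical F-value outside SPSS, it is just a quantile of the F distribution; a minimal Python sketch (scipy assumed, with made-up degrees of freedom):

from scipy import stats

alpha = 0.05
dfn, dfd = 2, 27  # made-up degrees of freedom (effect, error)

# Critical value: the F quantile beyond which the null is rejected.
f_crit = stats.f.ppf(1 - alpha, dfn, dfd)
print(f"F critical ({dfn}, {dfd}) at alpha={alpha}: {f_crit:.3f}")

# Equivalently, compare the p-value of an observed F to alpha.
f_obs = 4.2  # hypothetical observed F from the ANOVA table
print(f"p-value for F={f_obs}: {stats.f.sf(f_obs, dfn, dfd):.4f}")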
