Just want to clarify one thing: the same attribute can appear in a decision tree multiple times, as long as the occurrences are in different "branches", right?
For obvious reasons, it does not make sense to use the same decision within the same branch.
On different branches, this reasoning obviously does not hold.
Consider the classic XOR(x,y) problem. You can solve it with a two-layer decision tree, but you will need to split on the same attribute in both branches:
If x is true:
    If y is true: return false
    If y is false: return true
If x is false:
    If y is true: return true
    If y is false: return false
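You can check this with a quick scikit-learn sketch: a depth-2 tree fits the four XOR points, splitting on one feature at the root and on the remaining feature in both subtrees.

from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 0]  # XOR labels
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(clf))  # both subtrees split on the remaining feature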
Another example: assume your data is positive for x in [0, 1] and negative outside that interval. A good tree would be the following:
If x > 1: return negative
If x <= 1:
    If x >= 0: return positive
    If x < 0: return negative
It's not the same decision, so it can make sense to use x twice.
In general, you can do whatever you want, as long as you keep the structure of a tree. Trees can be customized in many ways, and while there can be redundancy, it doesn't undermine their validity.
Binary attributes shouldn't appear twice in the same branch; that would be redundant. However, continuous attributes can appear several times in the same branch.
If the attribute is categorical, it cannot be used as the split attribute more than once along a path. If the attribute is numerical, in principle it can be used many times, but the standard decision tree algorithm (C4.5) is not implemented that way.
The following description is based on the assumption that the attributes are all categorical.
From the explanation perspective, a decision tree is explainable: how an instance is labeled can be explained by the attributes (and their values) encountered from the root to the leaf. Therefore, it does not make sense to have duplicate attributes in one branch of the tree.
From the algorithm perspective, once an attribute has been selected as the split attribute, it has no chance of being selected again under the attribute selection criteria; e.g., its information gain would be 0. This is because all the instances reaching a node share the same value of every attribute they have already been filtered by, so using such an attribute again cannot bring more information for the classification.
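To see that point concretely, here is a minimal sketch with hypothetical toy data, using standard ID3/C4.5-style information gain: once all rows share the same value of an already-used categorical attribute, splitting on it again yields a gain of 0.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Gain = entropy before the split minus the weighted entropy after it.
    total, n = entropy(labels), len(rows)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        total -= len(subset) / n * entropy(subset)
    return total

# All rows already have color == 'red' after an earlier split on color.
rows = [{'color': 'red', 'size': 'S'}, {'color': 'red', 'size': 'L'}]
labels = ['yes', 'no']
print(info_gain(rows, labels, 'color'))  # 0.0 -- nothing left to gain
print(info_gain(rows, labels, 'size'))   # 1.0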
I have some features that are zero-centered values and are supposed to represent the change between a current value and a previous value. Generally speaking, I believe there should be some symmetry between these values, i.e. there should be roughly as many positive values as negative ones, and the values should operate on roughly the same scale.
When I try to scale my samples using MaxAbsScaler, I notice that the negative values for this feature get almost completely drowned out by the positive values, and I don't really have any reason to believe my positive values should be that much larger than my negative values.
What I've noticed is that, fundamentally, the magnitudes of percentage-change values are not symmetrical. For example, if a value goes from 50 to 200, that is a 300.0% change; if a value goes from 200 to 50, that is a -75.0% change. I get that there is a reason for this, but in terms of my feature, I don't see why a change from 50 to 200 should be four times as "important" as the same change in value in the opposite direction.
Given this information, I do not believe there is any reason to want my model to treat a change of 200 to 50 as a "lesser" change than a change of 50 to 200. Since I am trying to represent the change of a value over time, I want to abstract this pattern so that my model can "visualize" the change of a value over time the same way a person would.
Right now I am solving this with the following formula:
def signed_change(curr, prev):
    # Divide the larger value by the smaller one, then attach the sign
    # of the direction, so both directions land on the same scale.
    if curr > prev:
        return curr / prev - 1
    else:
        return -(prev / curr - 1)
And this does seem to treat changes in value similarly regardless of direction; e.g., from the example above, 50 to 200 gives +300 and 200 to 50 gives -300. Is there a reason why I shouldn't be doing this? Does this accomplish my goal? Has anyone run into similar dilemmas?
This is a discussion question, and it's difficult to know the right answer without knowing the physical relevance of your feature. You are calculating a percentage change, and a percent change depends on the original value. I am not a big fan of a custom formula whose only purpose is to make percent change symmetric, since it adds a layer of complexity that is, in my opinion, unnecessary.
If you want change to be symmetric, you can try a direct difference or a factor change. There's nothing to suggest that a difference or a factor change is less correct than a percent change. So, depending on the physical relevance of your feature, each of the following symmetric measures would be a correct way to measure change:
Difference change -> 50 to 200 yields 150, 200 to 50 yields -150
Factor change with logarithm -> 50 to 200 yields log(4), 200 to 50 yields log(1/4) = -log(4)
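For concreteness, a minimal sketch of those two measures in Python (the function names are mine):

import math

def difference_change(prev, curr):
    return curr - prev            # 50 -> 200 gives 150; 200 -> 50 gives -150

def log_factor_change(prev, curr):
    # The log makes factors symmetric: log(4) vs log(1/4) = -log(4).
    return math.log(curr / prev)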
You're having trouble because you haven't brought the abstract questions into your paradigm.
"... my model can "visualize" ... same way a person would."
In this paradigm, you need a metric for "same way". There is no such empirical standard. You've dropped both of the simple standards -- relative error and absolute error -- and you posit some inherently "normal" standard that doesn't exist.
Yes, we run into these dilemmas: choosing a success metric. You've chosen a classic example from "How To Lie With Statistics"; depending on the choice of starting and finishing proportions and the error metric, you can "prove" all sorts of things.
This brings us to your central question:
Does this accomplish my goal?
We don't know. First of all, you haven't given us your actual goal. Rather, you've given us an indefinite description and a single example of two data points. Second, you're asking the wrong entity. Make your changes, run the model on your data set, and examine the properties of the resulting predictions. Do those properties satisfy your desired end result?
For instance, given your posted data points, (200, 50) and (50, 200), how would other examples fit in, such as (1, 4), (1000, 10), etc.? If you're simply training on the proportion of change over the full range of values involved in that transaction, your proposal is just what you need: use the higher value as the basis. Since you didn't post any representative data, we have no idea what sort of distribution you have.
I've been looking into Google Dataprep as an ETL solution to perform some basic data transformation before feeding it to a machine learning platform. I'm wondering if it's possible to use the Dataprep/Dataflow tools to split a dataset into train, test, and validation sets. Ideally I'm looking to do a stratified split on a target column, but for starters I'd settle for a simple uniform random split by percent of whole (e.g. 50% train, 30% validation, 20% test).
So far I haven't been able to find anything about whether this is even possible with Dataprep, so I'm wondering if anyone knows definitively if this is possible and, if so, how to accomplish it.
EDIT 1
Thanks #jakub-janoštík for getting me going in the right direction! I modified your answer slightly and came up with the following (in wrangle form):
case condition: customConditions cases: [false,0] default: rand() as: 'split_condition'
case condition: customConditions cases: [split_condition < 0.6,'train'],[split_condition >= 0.8,'test'] default: 'validation' as: 'dataset_type'
drop col: split_condition action: Drop
By assigning random values in a separate step, I got the guaranteed percentage split I was looking for. The flow ended up looking like this:
Image: final flow diagram with dataset splitting
EDIT 2
I just figured out how to do the stratified split too, so I thought I'd add it in case anyone else is trying to do this. Here are the rough steps:
Split your dataset based on whatever subpopulations you're targeting (e.g. target0, target1)
For each subpopulation, do the uniform random split described above (e.g. now you have target0-train, target0-test, target0-validation, target1-train, etc.)
For each set type (i.e. train, test, validation):
Create a new recipe from one of the sets
Edit the recipe, and use the Union transform to merge it with other datasets of the same type (e.g. target0-train union with target1-train). The union button is in the middle of the toolbar on the Edit Recipe page.
I hope that's helpful to someone!
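For anyone who wants to sanity-check the logic outside Dataprep, the same recipe can be sketched in pandas (the 0.6/0.8 thresholds mirror the wrangle steps above; column names are placeholders):

import numpy as np
import pandas as pd

def stratified_split(df, target_col, seed=42):
    # Split per subpopulation, random-split each, then union back together.
    rng = np.random.default_rng(seed)
    parts = {'train': [], 'validation': [], 'test': []}
    for _, group in df.groupby(target_col):
        r = rng.random(len(group))
        parts['train'].append(group[r < 0.6])
        parts['test'].append(group[r >= 0.8])
        parts['validation'].append(group[(r >= 0.6) & (r < 0.8)])
    return {name: pd.concat(frames) for name, frames in parts.items()}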
I'm looking at the same problem, and I was able to partially solve it using the "case on custom condition" and "Random" functions. What I do is create a new column named target and apply the following logic:
Image: case-on-custom-condition logic that assigns the 'train'/'test'/'validation' labels
After applying this you'll have a new column with these 3 labels, and you can generate 3 new datasets by applying row-filtering rules based on those values. One thing to keep in mind is that each time you run the job you'll get a different validation set, so if you want to keep it fixed you need to use the dataset created in the first run as input for future runs (and randomise only the train and test sets).
If you need more control over the distribution of labels in your datasets, there is a ROWNUMBER window function that could potentially be used, but I haven't been able to make it work yet.
I have a similarity problem. I want to predict the traffic of a new rule using historical data (the traffic of rules implemented in the past). Traffic here means the number of times a rule matched a person. Here is an example of a rule:
Person.Age<20 and
(Person.number_of_children==3 or Person.married==True) and
Person.Work==student and
Person.Car.isSportCar==False and
Person.Car.Color in [blue,pink,red]
As you can see, a rule contains many attributes linked by Boolean expressions. The rule matches a person if he and his car satisfy certain criteria. To predict the traffic of a rule, I have to find a distance or similarity metric between my rules, but I find it difficult to flatten the rules into a column representation. If I do, I'll lose information, and here is why.
An example of a column representation of my rule:
Person.Age : 20
Person.number_of_children:3
Person.married:True
Person.work:student
Person.Car.isSportCar:False
Person.Car.Color:[blue,pink,red]
With this I lose the ‘OR’ and ‘<’ and ‘in’
Is flattening my rule expressions a good idea, or is there a better one? Should I convert my rules to another data structure (a tree, for example) to better capture the similarity between them? Do you have any suggestions?
What I'd do in a case like this is convert the specifications of the rules to sets, so that flattening them makes sense, and then compute a Jaccard similarity, defined as the size of the intersection of the sets over the size of their union (the Jaccard distance is one minus that). Finally, weight the different attributes (or don't, and just use a single set for everything).
For example, given:
Person.Age<20 and (Person.number_of_children==3 or
Person.married==True) and Person.Work==student and
Person.Car.isSportCar==False and Person.Car.Color in [blue,pink,red]
and:
Person.Age<15 and (Person.number_of_children==2 or
Person.married==False) and Person.Work==student and
Person.Car.isSportCar==False and Person.Car.Color in [pink,red,white]
Convert them to something like this (here Age<20 is encoded as four 5-year buckets and Age<15 as three):
Person.Age (5,5,5,5)
Person.Relatives (Child,Child,Child,Wife)
Person.CarColor (blue,pink,red)
Person.Age (5,5,5)
Person.Relatives (Child,Child)
Person.CarColor (pink,red,white)
And then your per-attribute Jaccard similarities will be something like:
Person.Age = 3/4
Person.Relatives = 2/4
Person.CarColor = 2/4
And aggregate them (weighted if necessary).
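A minimal sketch of that computation (multiset Jaccard via collections.Counter; the bucket encoding and the uniform weights are just the assumptions from the example above):

from collections import Counter

def jaccard_similarity(a, b):
    # Multiset Jaccard: intersection over union of element counts.
    a, b = Counter(a), Counter(b)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 1.0

rule1 = {'Age': [5, 5, 5, 5],              # Age < 20 as four 5-year buckets
         'Relatives': ['Child', 'Child', 'Child', 'Wife'],
         'CarColor': ['blue', 'pink', 'red']}
rule2 = {'Age': [5, 5, 5],                 # Age < 15 as three buckets
         'Relatives': ['Child', 'Child'],
         'CarColor': ['pink', 'red', 'white']}

weights = {'Age': 1.0, 'Relatives': 1.0, 'CarColor': 1.0}  # tune per attribute
score = (sum(w * jaccard_similarity(rule1[k], rule2[k]) for k, w in weights.items())
         / sum(weights.values()))
print(score)  # (3/4 + 2/4 + 2/4) / 3 with uniform weights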
Let me suggest another approach:
Base the similarity score on the percentage of people for whom the two rules give the same result. Of course you'll need a large and heterogeneous population.
If both rules give the same result for most of the population (e.g. "false"), you may base the score only on the cases where at least one of the rules returns "true".
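A sketch of that idea, treating each rule as a boolean predicate over a person record (the names and the population argument are mine):

def rule_similarity(rule_a, rule_b, population):
    # Fraction of people for whom the two rules agree, restricted to
    # the cases where at least one of the rules returns True.
    results = [(rule_a(p), rule_b(p)) for p in population]
    relevant = [(a, b) for a, b in results if a or b]
    if not relevant:
        return 1.0  # both rules are always False on this population
    return sum(a == b for a, b in relevant) / len(relevant)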
The search algorithm is a breadth-first search. I'm not sure how to store terms from an equation in an open list. The function f(x) has the form ax^e1 + bx^e2 + cx^e3 + k, where a, b, c are coefficients and k is a constant. All exponents, coefficients, and constants are integers between 0 and 5.
Initial state: the initial state of the problem-solving process should be any single term from ax^e1, bx^e2, cx^e3, k.
The algorithm gradually expands the number of terms at each level of the list.
I'm not sure how to add the terms to an equation from an open queue. That is the question.
The general problem you are dealing with belongs to the area of regression analysis, and several techniques are available for finding a function that fits a given data set, including the popular least-squares method for finding the line of best fit (a brief starting point is the related page on Wikipedia, but if you want to go deeper into this topic you should look at the research papers out there).
If you want to stick with the breadth-first search algorithm, although this kind of approach is not common for such a problem, you first need to define all the elements of a search problem (for more information see Chapter 3 of Russell and Norvig, Artificial Intelligence: A Modern Approach):
Initial state: Some initial values for the different terms.
Actions: in your case, a change in one of the terms. Note that you should discretize the changes in the values.
Transition function: function that determines the new states given a state and an action.
Goal test: a check that recognizes whether a state is a goal state, so that the search can terminate. There are different ways to define this test in a regression problem; one way is to set a threshold on the sum of squared errors.
Step cost: the cost of an action. In such an abstract problem, you can probably consider the unweighted distance from the initial state in the search graph.
Note that you should carefully think about these elements, as, for example, they determine how efficient your search would be or whether you will have cycles in the search graph.
After you have defined all the elements of the search problem, you basically have to implement:
A Node structure that contains information about the parent, the state, and the current cost;
Function to expand a given node that returns the successor nodes (according to the transition function, the actions, and the step cost);
Goal test;
The actual search algorithm. At the beginning the queue contains only the node with the initial state; it is then updated with the successor nodes.
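Putting those pieces together, a minimal sketch (with my own assumptions: the initial state is the all-zero tuple, the actions increment one coefficient or exponent by 1, and the goal test is a sum-of-squared-errors threshold):

from collections import deque

def bfs_fit(xs, ys, threshold=5.0):
    # A state is (a, e1, b, e2, c, e3, k), all integers in 0..5.
    def sse(state):
        a, e1, b, e2, c, e3, k = state
        return sum((a * x**e1 + b * x**e2 + c * x**e3 + k - y)**2
                   for x, y in zip(xs, ys))

    start = (0, 0, 0, 0, 0, 0, 0)
    queue = deque([start])           # the "open list"
    seen = {start}                   # avoids revisiting states (no cycles)
    while queue:
        state = queue.popleft()
        if sse(state) <= threshold:  # goal test
            return state
        for i in range(7):           # actions: bump one component by 1
            if state[i] < 5:
                child = state[:i] + (state[i] + 1,) + state[i + 1:]
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
    return None                      # no state satisfies the threshold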
I'm currently trying to implement the Viterbi algorithm in Python, more specifically the version presented in an online course.
As it stands, the algorithm is presented as follows:
given a sentence with K tokens, we have to generate K tags.
We assume that tag(-1) = tag(-2) = '*'; then, for k going from 1 to K,
we set the tag for the k-th token as follows:
tag(word_k) = argmax( p(k-1, tag_k-2, tag_k-1) * e(word_k, tag_k) * q(tag_k, tag_k-1, tag_k-1) )
From my understanding this is straightforward, because the p parameters are already calculated at each step (we go from 1 forward, and we already know p0), and the max for the e and q parameters can be calculated in one iteration through the tags (since we can't come up with 2 different tags, we basically have to find the tag T for which the q * e product is maximal, and return that). This saves a lot of time, since we end up close to linear time in big-O terms, instead of the exponential complexity we would get by iterating through all possible word/tag combinations.
Am I getting the core of the algorithm right, or am I missing something?
Thanks in advance
since we can't come up with 2 different tags, we basically have to
find the tag T for which the q * e product is maximal, and return that
Yeah, sounds about right. q is the trigram (transition) probability and e is the emission probability. As you said, p is already computed and unchanged between the different paths at each stage, so the max only depends on the other two.
Each tag sequence should start with two asterisks at positions -2 and -1. So the first assumption is correct: tag(-2) = tag(-1) = '*'.
If we assume p(k, u, v) to be the maximum probability that the last two tags at position k are u and v, then, based on what we just said about the beginning asterisks, the base case would be
p(0, '*', '*') = 1.
You had two errors in the general case though. The emission probability is a conditional. Also, in the trigram, tag_k-1 is repeated two times, so the formula as given is incorrect. It should be:
p(k, u, v) = max over w of ( p(k-1, w, u) * q(v | w, u) * e(word_k | v) )
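For reference, here is the corrected recurrence as a minimal Python sketch (my own naming; q(v, w, u) and e(word, v) are assumed to be callables returning the trigram and emission probabilities, at least one path is assumed to have non-zero probability, and the final STOP transition of the full algorithm is omitted for brevity):

def viterbi(words, tags, q, e):
    K = len(words)
    if K == 0:
        return []
    def tagset(k):
        return ['*'] if k <= 0 else tags
    p = {(0, '*', '*'): 1.0}   # base case
    bp = {}                    # backpointers
    for k in range(1, K + 1):
        for u in tagset(k - 1):
            for v in tagset(k):
                # max over w of p(k-1, w, u) * q(v | w, u) * e(word_k | v)
                best_w, best = None, 0.0
                for w in tagset(k - 2):
                    score = p[(k - 1, w, u)] * q(v, w, u) * e(words[k - 1], v)
                    if score > best:
                        best_w, best = w, score
                p[(k, u, v)] = best
                bp[(k, u, v)] = best_w
    # Pick the best final tag pair, then walk the backpointers.
    u, v = max(((u, v) for u in tagset(K - 1) for v in tagset(K)),
               key=lambda uv: p[(K, uv[0], uv[1])])
    seq = [u, v]
    for k in range(K, 2, -1):
        seq.insert(0, bp[(k, seq[0], seq[1])])
    return seq[-K:]  # drop the '*' padding for one-word sentences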