Decision tree with categorical features - machine-learning

I'm implementing a decision tree.
Suppose "race" feature has the following possible values:
['Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'White', 'Other', 'Black']
Suppose the samples in a node have the following values for the "race" feature, and "race" has just been selected as the best splitting feature:
['Asian-Pac-Islander', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'White', 'White', 'White', 'Other', 'Black']
Note that the values are grouped together - "sorted".
Suppose the entropy difference tells me that the following is the best split position (marked by the vertical bar "|"):
['Asian-Pac-Islander', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', | 'White', 'White', 'White', 'Other', 'Black']
What exactly is the split rule, then? It doesn't make sense to send "Asian-Pac-Islander" and "Amer-Indian-Eskimo" left and "White", "Other" and "Black" right, because they are not numbers.
Thanks.

Remember that "left" and "right" children for decision tree nodes are arbitrary labels that humans use for visualization, not inherent mathematical properties of the trees. Flipping the left and right children of any node results in an identical (a mathematician would probably say "isomorphic") tree.
When splitting on a categorical attribute, you usually try every way of dividing the values into two groups and compare the Gini impurity or information gain of each grouping to determine the best split (with k distinct values there are 2^(k-1) - 1 such groupings). Once you've established the best split, which group is the "left" group and which is the "right" group is arbitrary, because it doesn't matter.
It also looks like you're thinking of the split in the literal sense of drawing a dividing line in a specifically ordered list. For categorical attributes, you don't create splits this way. Instead, you define the split condition as, for example, "White, Other, and Black go left; all other Race labels go right". The order of data going into the split node should not affect the resulting split.
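As a rough illustration of the idea above, here is a minimal Python sketch that tries every two-way grouping of the category values and keeps the one with the highest information gain. The function names and the class labels in the example are made up for illustration; the question does not give the labels, and a real implementation would likely restrict the search when there are many categories.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_categorical_split(values, labels):
    """Return (gain, subset) for the two-way grouping with the highest information gain.

    The split rule is "value in subset goes to one child, everything else to the other".
    """
    categories = sorted(set(values))
    first, rest = categories[0], categories[1:]
    parent = entropy(labels)
    best_gain, best_subset = 0.0, None
    # Keep `first` on one side so each grouping is tried once: 2^(k-1) - 1 candidates.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            subset = {first, *extra}
            inside = [y for v, y in zip(values, labels) if v in subset]
            outside = [y for v, y in zip(values, labels) if v not in subset]
            weighted = (len(inside) * entropy(inside)
                        + len(outside) * entropy(outside)) / len(labels)
            gain = parent - weighted
            if gain > best_gain:
                best_gain, best_subset = gain, subset
    return best_gain, best_subset

# Values from the question; the class labels are hypothetical.
race = ['Asian-Pac-Islander', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo',
        'White', 'White', 'White', 'Other', 'Black']
label = [1, 1, 1, 0, 0, 0, 0, 0]
gain, subset = best_categorical_split(race, label)
```

The resulting rule is a set-membership test (`value in subset`), so the order of the samples in the node plays no role.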

Related

Finding similar users based on String properties

I'm a software engineering student and new to data mining. I want to implement a solution to find similar users based on their interests and skills (sets of strings).
I don't think I can use k-nearest neighbors with an edit distance (Levenshtein or similar).
Could someone please help with that?
The first thing you should do is convert your data into some reasonable representation, so that you will have a well-defined notion of distance between suitably represented users.
I would recommend converting all strings into some canonical form, then sorting all n distinct skills and interest strings into a dictionary D. Now for each user u, construct a vector v(u) with n components, which has i-th component set to 1 if the property in dictionary entry i is present, and 0 otherwise. Essentially we represented each user with a characteristic vector of her interests/skills.
Now you can compare users with the Jaccard index (it's just an example, you'll have to figure out what works best for you). The index is a similarity between 0 and 1; one minus the index gives a distance. With the notion of a distance in hand, you can start trying out various approaches. Here are some that spring to mind:
apply hierarchical clustering if the number of users is sufficiently small;
apply association rule learning (I'll leave you to think out the details);
etc.
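The representation described above can be sketched in a few lines of Python. This is only an illustration: the canonical form used here (lower-casing and collapsing whitespace) and the example users are assumptions, not something the question specifies.

```python
def canonical(s):
    """One possible canonical form: lower-case with whitespace collapsed (an assumption)."""
    return " ".join(s.lower().split())

def characteristic_vectors(users):
    """users maps a name to an iterable of skill/interest strings.

    Returns (dictionary D, name -> 0/1 vector aligned with D).
    """
    canon = {name: {canonical(s) for s in items} for name, items in users.items()}
    dictionary = sorted(set().union(*canon.values()))
    vectors = {name: [1 if term in items else 0 for term in dictionary]
               for name, items in canon.items()}
    return dictionary, vectors

def jaccard_index(u, v):
    """Jaccard index of two 0/1 vectors: |intersection| / |union|."""
    both = sum(1 for x, y in zip(u, v) if x and y)
    either = sum(1 for x, y in zip(u, v) if x or y)
    return both / either if either else 1.0

# Hypothetical users for illustration.
users = {"ann": ["Python", "Data  Mining"], "bob": ["python", "hiking"]}
D, vectors = characteristic_vectors(users)
similarity = jaccard_index(vectors["ann"], vectors["bob"])
distance = 1 - similarity
```

Here the two users share one of three distinct properties, so the index is 1/3.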

Fuzzy join on multiple columns with Spark

I have two Spark RDDs without common key that I need to join.
The first RDD comes from a Cassandra table and contains a reference set of items (id, item_name, item_type, item_size), for example: (1, 'item 1', 'type_a', 20).
The second RDD is imported each night from another system and it contains roughly the same data without id and is in raw form (raw_item_name, raw_type, raw_item_size) for example ('item 1.', 'type a', 20).
Now I need to join those two RDDs based on the similarity of the data. Right now the size of each RDD is about 10000, but it will grow in the future.
My current solution is: a cartesian join of both RDDs, then calculating the distance between the ref and raw attributes for each row, then grouping by id and selecting the best match.
At this size the solution works, but I'm afraid that in the future the cartesian join might be just too big.
What would be a better solution?
I tried to look at Spark MLlib but didn't know where to start, which algorithm to use, etc. Any advice will be greatly appreciated.

Modeling arrows/relationships as nodes in Neo4j

Relationships/arrows in Neo4j cannot have more than one type/label (see here, and here). My data model needs edges with labels and (probably) properties. If I decide to use Neo4j (instead of OrientDB, which supports labeled arrows), I think I would then have two options to model an arrow, say f, between two nodes A and B:
1) encode an arrow f as a span, say A<--f-->B, such that f is also a node and --> and <-- are arrows.
or
2) encode an arrow f as A --> f -->B, such that f is a node again and two --> are arrows.
Though this seems to add unnecessary complexity to my data model, there does not seem to be any other option at the moment if I want to use Neo4j. So I am trying to see which of the above encodings fits my queries better (queries are the core of my system). To do so, I need to resort to examples. So I have two questions:
First Question:
part1) I have nodes labeled Person and father, and there are arrows between them like Person<-[:sr]-father-[:tr]->Person to model who is the father of whom (tr is the father of sr). For a given person p1, how can I get all of his ancestors?
part2) If I had the Person-[:sr]->father-[:tr]->Person structure instead for modeling the father relationship, what would the same query look like?
This is answered here when father is considered as a simple relationship (instead of being encoded as a node)
Second Question:
part1) I have nodes labeled as A nodes with the property p1 for each. I want to query A nodes, get those elements that p1<5, then create the following structure: for each a1 in the query result I create qa1<-[:sr]-isA-[:tr]->a1 such that isA and qa1 are nodes.
part2) What if I wanted to create qa1-[:sr]->isA-[:tr]->qa1 instead?
This question is answered here when isA is considered as a simple arrow (instead of being modeled as a node).
First, some terminology; relationships don't have labels, they only have types. And yes, one type per relationship.
Second, relative to modeling, I think the direction of the relationship isn't always super important, since with neo4j you can traverse it both ways easily. So the difference between A-->f-->B and A<--f-->B I think should be entirely driven by what makes sense semantically for your domain, nothing else. So your options (1) and (2) at the top seem the same to me in terms of overall complexity, which brings me to point #3:
Your main choice is between making a complex relationship into a node (which I think we're calling f here) or keeping it as a relationship. Making "a relationship into a node" is called reification and I think it's considered a fairly standard practice to accommodate a number of modeling issues. It does add complexity (over a simple relationship) but adds flexibility. That's a pretty standard engineering tradeoff everywhere.
So with all of that said, for your first question I wouldn't recommend an intermediate node at all. :father is a very simple relationship, and I don't see why you'd ever need more than one type on it. So for question one, I would pick "neither of the options you list" and would instead model it as (personA)-[:father]->(personB). That's simpler. You'd query that by saying
MATCH (personA { name: "Bob"})-[:father]->(bobsDad) RETURN bobsDad
Yes, you could model this as (personA)-[:sr]->(fatherhood)-[:tr]->(personB) but I don't see how this gains you much. As for the relationship direction, again it doesn't matter for performance or query, only for semantics of whatever :tr and :sr are supposed to mean.
I have nodes labeled as A nodes with the property p1 for each. I want
to query A nodes, get those elements that p1<5, then create the
following structure: for each a1 in the query result I create
qa1<-[:sr]-isA-[:tr]->a1 such that isA and qa1 are nodes.
That's this:
MATCH (aNode:A)
WHERE aNode.p1 < 5
WITH aNode
MATCH (qa1 { label: "some qa1 node" })
CREATE (qa1)<-[:sr]-(isA)-[:tr]->(aNode);
Note that you'll need to adjust the criteria for qa1 and also specify something meaningful for isA.
What if I wanted to create qa1-[:sr]->isA-[:tr]->qa1 instead?
It should be trivial to modify that query above, just change the direction of the arrows, same query.

Is the order ascending?

I have a question about the traversal of a tree.
When we print the values of a binary search tree using in-order traversal, are the values printed in ascending order?
Yes. An in-order traversal visits the left subtree, then the node, then the right subtree, and in the normal implementation of a binary search tree nodes to the left are smaller than nodes to the right, so the values come out in ascending order.
As the concepts of "left" and "right" are what we specify, and "lower" and "higher" depend on what the keys really represent, it's of course possible to implement the tree as a descending tree (or just a reverse traversal). In that case you might want to add "reverse" or "descending" to the name of the tree to signify the uncommon implementation.
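To make this concrete, here is a small Python sketch of an ordinary (ascending) binary search tree with an in-order traversal. The class and function names are chosen for illustration only.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """Insert key into the tree rooted at root and return the (possibly new) root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)    # smaller keys go left
    else:
        root.right = insert(root.right, key)  # equal or larger keys go right
    return root

def in_order(root):
    """Yield keys by visiting the left subtree, then the node, then the right subtree."""
    if root is not None:
        yield from in_order(root.left)
        yield root.key
        yield from in_order(root.right)

root = None
for key in [5, 2, 8, 1, 9, 3]:
    root = insert(root, key)

ordered = list(in_order(root))
```

Swapping the two recursive calls in `in_order` would give the reverse (descending) traversal of the same tree.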

what is prototype vector of a phrase in the training set

I am trying to implement an approach from a paper to disambiguate entities. The process consists of 2 steps, a training phase and a disambiguation phase. My question is about the training phase: I do not quite understand how to get the prototype vectors that this paragraph describes:
In the training phase, we compute, for each word or phrase that is linked at least 10 times to a particular entity, what we called a prototype vector: this is a tf.idf-weighted, normalized list of all terms which occur in one of the neighbourhoods (we consider 10 words to the left and right) of the respective links. Note that one and the same word or phrase can have several such prototype vectors, one for each entity linked from some occurrence of that word or phrase in the collection.
They applied the approach to Wikipedia and used the links from Wikipedia as the training set.
Could someone please give me an example of the prototype vector as explained there? I am a beginner in this field.
Here is a sketch of what the prototype vector is about:
The first thing to note is that a word in wikipedia can be a hyperlink to a wikipedia page (which we will call an entity). This entity is related in some way to the word, yet the same word could link to different entities.
"for each word or phrase that is linked at least 10 times to a particular entity"
Across wikipedia, we count the number of times that word_A links to entity_B; if it's at least 10, we continue (writing down the entities they link from):
[(wordA, entityA1), (wordA, entityA2),...]
Here wordA occurs in entityA1 where it links to entityB, etc.
"list of all terms which occur in one of the neighbourhoods of the respective links"
In entityA1, wordA has ten words to its left and right (we show only 4 on either side; here wordA is 'entity', which links to entityB):
are developed and the [entity] relationships between these data
['are', 'developed', 'and', 'the', 'relationships', 'between', 'these', 'data']
Each pair (wordA, entityAi) gives us such a list; concatenate them.
"tf.idf-weighted, normalized list"
Basically, tf.idf means you should give common words less "weight" than less-common words. For example, 'and' and 'the' are very common words so we give them less meaning (to their being next to 'entity') than 'relationships' or 'between'.
The tf part counts the number of times each term occurs in the concatenated list (the more it occurs, the more associated we think it is with wordA). Multiply this count by the weight to get a score with which to sort the list, putting the most frequent, least common words at the top. Normalising then rescales the scores (for example, to unit length) so that vectors built from different amounts of text are comparable.
"Note that one and the same word or phrase can have several such prototype vectors"
The vector depends not only on wordA but also on entityB; you could think of it as a mapping:
(wordA, entityB) -> tf.idf-weighted, normalized list (as described above)
(wordA, entityB2) -> a different tf.idf-weighted, normalized list
This is making the point that links to cats from the word 'cat' are less likely to have the neighbour 'batman' than links to cat woman.
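The steps above can be sketched roughly in Python. The quoted paragraph does not say which tf.idf variant or which normalisation the paper uses, so the idf formula log(N / df) and the unit-length (L2) normalisation below are assumptions, and the document frequencies in the example are invented.

```python
from collections import Counter
from math import log, sqrt

def prototype_vector(contexts, doc_freq, n_docs):
    """Build one prototype vector for a (wordA, entityB) pair.

    contexts: list of token lists, one per link from wordA to entityB.
    doc_freq: term -> number of documents containing the term (assumed given).
    n_docs:   total number of documents in the collection.
    """
    # tf: how often each term occurs across all neighbourhoods.
    tf = Counter(term for context in contexts for term in context)
    # Assumed idf variant: log(N / df), so very common terms get weight near 0.
    weights = {t: count * log(n_docs / doc_freq[t]) for t, count in tf.items()}
    # Assumed normalisation: scale to unit Euclidean length.
    norm = sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return {}
    return {t: w / norm for t, w in weights.items()}

# The neighbourhood from the example above; document frequencies are hypothetical.
contexts = [['are', 'developed', 'and', 'the',
             'relationships', 'between', 'these', 'data']]
doc_freq = {'are': 900, 'developed': 100, 'and': 1000, 'the': 1000,
            'relationships': 50, 'between': 400, 'these': 600, 'data': 80}
vector = prototype_vector(contexts, doc_freq, n_docs=1000)
ranked = sorted(vector, key=vector.get, reverse=True)
```

With these made-up counts, 'relationships' ends up at the top of the ranking and 'and' and 'the' receive zero weight, which matches the intuition described above.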
