Finding clusters of similar nodes with Cypher

Given a set of hundreds of thousands of nodes with a likes relationship, (Foodie)-[:likes]->(Food), I would like to find logical clusters of Foodie nodes.
For instance, suppose I want to divide the graph into two sets. As output, I would like two sets whose members have the most eating habits in common.
The same logic can be extended to 3, 4, 5 sets, etc. In the case of three sets, each set would contain the foodies with the most similar eating habits. Please note that the sets may NOT have the same number of nodes.
An application, for instance, could be coloring the nodes. If the foodies came from different countries, the node colors could indicate those countries, assuming people from the same country eat similar food.
I would like to write a Cypher query to extract these nodes. I am stumped as to where to start. Any solution or pointers would be appreciated.

What about trying the current milestone of Neo4j 2.0 (http://www.neo4j.org/download, milestone section) and assigning your nodes different labels according to their characteristics (http://www.neo4j.org/develop/labels)?
Then you'll only have to execute Cypher queries like:
MATCH (nodes:MY_LABEL)
WHERE /.../
RETURN nodes
so that you can retrieve nodes by clusters.

You might want to look into cliques. This is a general graph-theory idea, but it sounds like what you want is to define certain 'cliques' of foodies; say there are BBQ foodies, Food Truck foodies, etc.
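As a starting point, you can compute the pairwise overlap between foodies directly in Cypher. This is only a sketch, not a full clustering algorithm: it ranks pairs of foodies by how many foods they both like, using the Foodie/Food labels and likes relationship from the question.

MATCH (f1:Foodie)-[:likes]->(food:Food)<-[:likes]-(f2:Foodie)
WHERE id(f1) < id(f2)                      // avoid reporting each pair twice
RETURN f1, f2, count(food) AS sharedFoods  // number of foods both of them like
ORDER BY sharedFoods DESC;

Pairs with a high sharedFoods count are candidates for landing in the same set; the actual partitioning into k sets would still need a clustering step outside of Cypher.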

Related

Is there a benefit to implementing singletons in Neo?

My business requirement says I need to add an arbitrary number of well-defined (AKA not dynamic, not unknown) attributes to certain types of nodes. I am pretty sure that while there could be 30 or 40 different attributes, a node will probably have no more than 4 or 5 of them. Of course there will be corner cases...
In this context, I am using 'attribute' generically as a tag wanted by the business, and not in the Neo4j sense.
I'll be expected to report on which nodes have which attributes. For example, I might have to report on which nodes have the "detention", "suspension", or "double secret probation" attributes.
One way is to simply have an array of the applicable attributes on each entity, but then each query would require a scan of all nodes. Alternatively, I could create an explicit property per attribute on each node; those could then be indexed. I'm not seriously considering either of these approaches.
Another way is to implement each attribute as a singleton Neo node, and allow many (tens of thousands?) of other nodes to relate to these nodes. This implementation would have 10,000 nodes but 40,000 relationships.
Finally, the attribute nodes could be created and used by specific entity nodes on an as-needed basis. In this case, if 10,000 entities had an average of 4 attributes, I'd have a total of 50,000 nodes.
As I type this, I realize that in the 2nd case, I still have 40,000 relationships; the 'truth' of the situation did not change.
Is there a reason to avoid the 'singleton' implementation? I could put timestamps on the relationships. But those wouldn't be indexed...
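For concreteness, the 'singleton' variant might look like this in Cypher (a sketch; the Entity/Attribute labels and the HAS_ATTRIBUTE relationship type are hypothetical names):

MATCH (e:Entity {id: 42})                 // hypothetical entity lookup
MERGE (a:Attribute {name: "detention"})   // create the singleton node once, reuse it thereafter
MERGE (e)-[:HAS_ATTRIBUTE]->(a);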
For your simple use case, I'd suggest an approach you didn't list -- which is to use a node label for each "attribute".
Nodes can have multiple labels, and neo4j can quickly iterate through all the nodes with the same label -- making it very quick and easy to find all the nodes with a specific label.
For example:
MATCH (n:Detention)
RETURN n;
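Since a node can carry several labels at once, tagging and untagging is cheap. A sketch (the lookup by internal node id is just for illustration):

// tag an existing node with two attribute labels at once
MATCH (n) WHERE id(n) = 123
SET n:Detention:DoubleSecretProbation;

// remove one of them later
MATCH (n) WHERE id(n) = 123
REMOVE n:Detention;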

Neo4j graph modelling performance and queryability: property on a node, or separate node plus relationship

I am teaching myself graph modelling and am using the Neo4j 2.2.3 database with Node.js and the Express framework.
I have skimmed through the free Neo4j graph database book and learned how to model a scenario: when to use a relationship, when to create nodes, and so on.
I have modelled a vehicle-selling scenario with the following structure:
NODES
(:VEHICLE{mileage:xxx, manufacture_year: xxxx, price: xxxx})
(:VFUEL_TYPE{type:xxxx}) x 2 (one for diesel and one for petrol)
(:VCOLOR{color:xxxx}) x 8 (red, green, blue, .... yellow)
(:VGEARBOX{type:xxx}) x 2 (AUTO, MANUAL)
RELATIONSHIPS
(vehicleNode)-[:VHAVE_COLOR]->(colorNode - either of the colors)
(vehicleNode)-[:VGEARBOX_IS]->(gearboxNode - either manual or auto)
(vehicleNode)-[:VCONSUMES_FUEL_TYPE]->(fuelNode - either diesel or petrol)
Assume the above structure, and so on for the rest of the features.
As shown in the screenshot (not reproduced here; 136 & 137 are VEHICLE nodes), the majority of a vehicle's features are created as separate nodes and shared, via relationships, among the vehicles that have those features in common.
Could you please advise whether characteristics like color, body type, driving side (left or right drive), gearbox, and others should be separate nodes or properties of the vehicle node? Which option is more performance-friendly and easier to query?
I want to write JS code that queries a graph with the above structure using one or many search criteria. If the majority of those features were properties of the VEHICLE node, then querying would not be difficult:
MATCH (v:VEHICLE) WHERE v.gearbox = "MANUAL" AND v.fuel_type = "PETROL" AND v.price > x AND v.price < y AND .... RETURN v;
However, with the graph model I currently have, searching is tricky, especially when several of the criteria are not properties of the VEHICLE node but separate nodes linked via relationships.
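For comparison, here is roughly what that multi-criteria search looks like against the node-based model, using the labels and relationship types defined above (a sketch; x and y are placeholders as in the query above):

MATCH (v:VEHICLE)-[:VGEARBOX_IS]->(:VGEARBOX {type: "MANUAL"}),
      (v)-[:VCONSUMES_FUEL_TYPE]->(:VFUEL_TYPE {type: "PETROL"})
WHERE v.price > x AND v.price < y
RETURN v;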
Any ideas and advice regarding the existing structure of the graph, to make it more queryable as well as performance-friendly, would be much appreciated. If we imagine a scenario with 1,000 VEHICLE nodes, that would generate 15,000 relationships, which sounds a bit scary; if it hits a million VEHICLE nodes, that means up to 15 million relationships. Please comment if I am heading in the wrong direction.
Thank you for your time.
Modeling is full of tradeoffs; it looks like you have a decent start.
Don't be concerned about the number of relationships. Relationships are what graph databases are good at, so I wouldn't worry about over-using them.
Should something be a property, or a node? I can't answer for your scenario, but here are some things to consider:
If you look something up by a value all the time, and you have many objects, it's usually faster to find one node and then everything connected to it, because graph DBs are good at exploiting relationships. It's slower to scan all nodes with a given label and pick out those where a property equals a value.
Relationships work well when you want to express a connection to something that isn't a simple primitive data type. Take "gearbox", for example: there are manuals and other types. If gearbox is a property value, you won't easily be able to later store four other sub-types/sub-aspects of "gearbox". If it were a node, that would be easy, because you could add more properties to the node or relate other things to it.
If a piece of data really is a primitive (String, integer, etc) and you don't need extra detail about it, that usually makes a good property. Querying primitive values by connecting to other nodes will seem clunky later on. For example, I wouldn't model a person with a "date of birth" as a separate node, that would be irritating to query, and would give you flexibility you'd be very unlikely to need in the future.
Semantically, how is your data related? If two items are similar because they share an X, then that X probably should be a node. If two items happen to have the same Y value but that doesn't really mean much, then Y is probably better off as a node property.
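To make the tradeoff concrete, here are both shapes of the gearbox lookup (a sketch following the model in the question):

// gearbox as a property: scan VEHICLE nodes and filter by value
MATCH (v:VEHICLE)
WHERE v.gearbox = "MANUAL"
RETURN v;

// gearbox as a node: anchor on the shared node and expand
MATCH (:VGEARBOX {type: "MANUAL"})<-[:VGEARBOX_IS]-(v:VEHICLE)
RETURN v;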

An Example Showing the Necessity of a Relationship Type Index and Related Execution-Plan Optimization

Suppose I have a large knowledge base with many relationship types, e.g., hasChild, livesIn, locatedIn, capitalOf, largestCityOf...
The number of capitalOf relationships is relatively small (say, one hundred) compared to the number of all nodes and other types of relationships.
I want to fetch every capital which is also the largest city in its country, using the following query:
MATCH (city)-[:capitalOf]->(country), (city)-[:largestCityOf]->(country) RETURN city
Apparently it would be wise to take the capitalOf type as the clue, scan all 100 relationships of this type, and refine by [:largestCityOf]. However, the current execution planner of Neo4j does an AllNodesScan and an Expand. Why not consider adding a "RelationshipByTypeScan" operator to the query optimizer, analogous to what NodeByLabelScan does for labels?
I know that I can turn relationship types into relationship properties, index them using the legacy index, and manually indicate
START r=relationship:rels(rtype = "capitalOf")
to tell Neo4j how to make it efficient. But for a more complicated pattern query with many relationship types and no node id/label/property to start from, it is clearly the optimizer's job to decide which relationship type to start with.
I have seen many questions asking about the same problem, but the answers amount to "negative... a query TYPICALLY starts from nodes...". I just want to use the above typical scenario to ask why once more.
Thanks!
A relationship is local to its start and end node - there is no global relationship dictionary. An operation like "give me globally all relationships of type x" is therefore an expensive operation - you need to go through all nodes and collect matching relationships.
There are 2 ways to deal with this:
1) use a manual index on relationships as you've sketched
2) assign labels to your nodes. Assume all the country nodes have a Country label. You can then rewrite your query:
MATCH (city)-[:capitalOf]->(country:Country), (city)-[:largestCityOf]->(country) RETURN city
The AllNodesScan is now a NodeByLabelScan. The query grabs all countries and matches them to their cities. Since every country has exactly one capital and one largest city, this is efficient and scales independently of the rest of your graph.
If, instead, you put all relationships into one index and grab the ~100 capitalOf relationships from it, that operation scales logarithmically with the total number of relationships in your graph.

how to cluster users based on tags

I'd like to cluster users based on the categories or tags of shows they watch. What's the easiest/best algorithm to do this?
Assuming I have around 20,000 tags and several million watch events to use as signals, is there an algorithm I could implement using, say, Pig/Hadoop/Mortar, or perhaps on Neo4j?
In terms of data I have users, programs they've watched, and the tags that a program has (usually around 10 tags per program).
At the end I would expect k clusters (maybe a dozen?), or broad buckets, which I can use to classify and group my users, and which also give me some insight into how they divide, with a set of tags representing each cluster.
I've seen some posts suggesting a hierarchical algorithm, but I'm not sure how one would calculate "distance" in that case. Would that be the distance between two users, or between a user and a set of tags, etc.?
You basically want to cluster the users according to their tags.
To keep it simple, assume that you only have 10 tags (instead of 20,000 ones). Assume that a user, say user_34, has the 2nd and 7th tag. For this clustering task, user_34 can be represented as a point in the 10-dimensional space, and his corresponding coordinates are: [0,1,0,0,0,0,1,0,0,0].
In your own case, each user can be similarly represented as a point in a 20,000-dimensional space.
You can use Apache Mahout which contains many effective clustering algorithms, such as K-means.
Since everything is well defined in a mathematical coordinate system, computing the distance between any two users is easy! It can be computed using any distance function, but the Euclidean distance is the de-facto standard.
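For example, if user_34 = [0,1,0,0,0,0,1,0,0,0] and a second, hypothetical user_35 = [0,1,0,0,0,1,0,0,0,0], the two vectors differ only in the 6th and 7th coordinates, so their Euclidean distance is d(u, v) = sqrt(sum_i (u_i - v_i)^2) = sqrt(1 + 1) = sqrt(2) ≈ 1.41.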
Note: Mahout and many other data-mining programs support formats suitable for SPARSE features; i.e., you do not need to write ...,0,0,0,0,... in the file, but only need to specify which tags are set. (See RandomAccessSparseVector in Mahout.)
Note: I assumed you only want to cluster your users. Extracting representative info from clusters is somewhat tricky. For example, for each cluster you might select the tags that are most common among the users of that cluster. Alternatively, you could use concepts from information theory, such as information gain, to find out which tags carry the most information about the cluster.
You should consider using neo4j. You can model your data using the following node labels and relationship types.
If you are not familiar with neo4j's Cypher language notation, (:Foo) represents a node with the label Foo, and [:BAR] represents a relationship with the type BAR. The arrows around a relationship indicate its directionality. neo4j efficiently traverses relationships in both directions.
(:Cluster) -[:INCLUDES_TAG]-> (:Tag) <-[:HAS_TAG]- (:Program) <-[:WATCHED]- (:User)
You'd have k Cluster nodes, 20K Tag nodes, and several million WATCHED relationships.
With this model, starting with any given Cluster node, you can efficiently find all its related tags, programs, and users.
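For example, fetching the distinct users associated with one cluster is a single pattern match (a sketch; the Cluster id property is a hypothetical identifier):

MATCH (c:Cluster {id: 1})-[:INCLUDES_TAG]->(:Tag)<-[:HAS_TAG]-(:Program)<-[:WATCHED]-(u:User)
RETURN DISTINCT u;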

Node categories as label, as node or as label and node?

I'm trying to decide if I should implement categories as nodes or as labels.
In particular, the query to get a count of the nodes belonging to each category is not so easy.
Nodes have to be able to belong to multiple categories!
Categories as labels, variant 1
Keep a list of categories somewhere, then:
MATCH (a:cat1), (b:cat2), (c:cat3), ...
With a lot of categories I will get a lot of columns, so that's not really good. Also, lots of preprocessing to build the query.
I'm not even sure I could easily get a count from that.
Categories as labels, variant 2
MATCH (n:category) <-- the generic category label is used to limit the number of nodes
RETURN DISTINCT labels(n), count(*) AS count
Will return something like:
["category","the actual category label"], 2
Looks perfect, but this won't work when a node has multiple categories:
["category","cat1","cat2"], 2 <-- two nodes found with category "cat1" and "cat2"
["category","cat1"], 4 <-- four nodes found with category "cat1"
Now I don't know how to get the count per category...
Maybe something with extract(..labels()..) or filter(..labels()..) could do it, but I don't know how.
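One way to get per-category counts from labels (a sketch, assuming a Neo4j version with UNWIND, i.e. 2.1 or later):

MATCH (n:category)
UNWIND labels(n) AS lbl           // one row per label per node
WITH lbl WHERE lbl <> 'category'  // drop the generic marker label
RETURN lbl AS category, count(*) AS cnt;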
Categories as nodes
Yes, this works... this is pretty straightforward. But aren't labels supposed to be THE thing for categorizing nodes? Plus all the extra relationships I would be creating...
Maybe I should implement it as both labels and nodes?
Then with labels I could quickly get every node that has a category, and with nodes I could get the count per category.
I'm still searching for a good perspective on this problem, so I cannot ask a concrete implementation question yet.
My two cents.
For your kind of categories, I would go with a node per category and create a BELONGS_TO relationship from nodes belonging to that category. There are a number of reasons for this preference of mine.
One of the reasons labels were added is that many people were putting a "type" property on nodes. Another way to talk about labels is that they add a little bit of a "schema" to your graph - in the sense that you can categorise nodes.
With the introduction of labels, there's always the risk that they will be abused. It is just an extra tool in a database that is primarily designed for storing graphs. In an extreme case, you could use labels for almost everything, ending up with a store of "tagged" nodes.
Finally, traversing relationships is the fastest thing Neo4j does. We're talking units of microseconds. Don't be afraid of adding thousands of relationships to a node. I'd reserve labels for developer-defined, "schema-like" information.
So in your case of user-added categories, I'd definitely create category nodes and BELONGS_TO relationships, in favour of labelling.
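With category nodes, the per-category count becomes a simple expansion (a sketch; the Category label and name property are hypothetical names, BELONGS_TO as above):

MATCH (c:Category)<-[:BELONGS_TO]-(n)
RETURN c.name AS category, count(n) AS cnt;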
One last thing, with a disclaimer that this is a bit of self-marketing. If you get to a point where you have tens of thousands or millions of relationships per node, and all you're after is counting them, it might be a good idea to cache those counts as properties on the nodes. I've developed a module called "Relationship Count Module" for the GraphAware Framework, which does exactly that. I've demonstrated in my MSc thesis, which will be public in a couple of weeks, that the module speeds up count queries for high-degree vertices by several orders of magnitude, for as little as a 10-25% write-throughput penalty. Let me know if you need more detail about that.
