Node categories as label, as node or as label and node? - neo4j

I'm trying to decide whether I should implement categories as nodes or as labels.
In particular, the query to count the nodes belonging to each category is not so easy.
Nodes must be able to belong to more than one category!
Categories as labels, variant 1
Keep a list of categories somewhere, then:
MATCH (a:cat1), (b:cat2), (c:cat3), ...
With a lot of categories I will get a lot of columns, so that's not really good. It also needs a lot of preprocessing to build the query.
I'm not even sure I could get a count easily from that.
Categories as labels, variant 2
MATCH (n:category) // the category label is used to limit the amount of nodes
RETURN DISTINCT labels(n), count(*) as count
Will return something like:
["category","the actual category label"], 2
Looks perfect, but this won't work when a node has multiple categories:
["category","cat1","cat2"], 2 <-- two nodes found with category "cat1" and "cat2"
["category","cat1"], 4 <-- four nodes found with category "cat1"
Now I don't know how to get the count per category ...
Maybe something with extract(..labels()..) or filter(..labels()..) could do it, but I don't know how.
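One possible way to get a count per category from labels alone is to unwind the label list, so that each node contributes one row per label. This is only a sketch: it assumes a Neo4j version that supports UNWIND, and it reuses the "category" marker label from variant 2.

```cypher
// Each node yields one row per label it carries.
MATCH (n:category)
UNWIND labels(n) AS label
// Drop the marker label itself so only real categories are counted.
WITH label
WHERE label <> 'category'
RETURN label, count(*) AS nodeCount
ORDER BY nodeCount DESC
```

For the example result above (two nodes with "cat1" and "cat2", four nodes with only "cat1"), this would report 6 for "cat1" and 2 for "cat2".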
Categories as nodes
Yes, this works, and it is pretty straightforward. But aren't labels supposed to be THE thing for categorizing nodes? Plus all the extra relationships I would be creating ...
Maybe I should implement it as both labels and nodes?
Then with labels I can quickly get every node that has a category, and with a node I could get the category count.
I'm still searching for a good perspective on this problem, so I cannot give a concrete implementation question yet.

My two cents.
For your kind of categories, I would go with a node per category and create a BELONGS_TO relationship from nodes belonging to that category. There are a number of reasons for this preference of mine.
One of the reasons labels were added is that many people were putting a "type" property on nodes. Another way to talk about labels is that they add a little bit of a "schema" to your graph - in the sense that you can categorise nodes.
With the introduction of labels, there's always the risk that they will be abused. It is just an extra tool in a database that is primarily designed for storing graphs. In an extreme case, you could use labels for almost everything, ending up with a store of "tagged" nodes.
Finally, traversing relationships is the fastest thing Neo4j does. We're talking units of microseconds. Don't be afraid of adding thousands of relationships to a node. I'd leave labels for developer-defined "schema-like" information.
So in your case of user-added categories, I'd definitely create category nodes and BELONGS_TO relationships, in favour of labelling.
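As a rough sketch of the node-per-category model described above: the Category label, its name property, and the Item label with its id property are placeholder names, not something taken from the question.

```cypher
// Create a category once, then attach an existing node to it.
MERGE (c:Category {name: 'cat1'})
WITH c
MATCH (n:Item {id: 1})
MERGE (n)-[:BELONGS_TO]->(c);

// Count the members of every category in one query.
MATCH (c:Category)<-[:BELONGS_TO]-(n)
RETURN c.name AS category, count(n) AS members
ORDER BY members DESC;
```

A node that belongs to several categories simply has several BELONGS_TO relationships, so it is counted once in each of them.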
One last thing with a disclaimer that this is a bit of self-marketing. If you get to a point where you have tens of thousands or millions of relationships per node, and all you're after is counting the relationships, it might be a good idea to cache those counts on the nodes as properties. I've developed a module called "Relationship Count Module" for the GraphAware Framework, which does exactly that. I've demonstrated in my MSc. thesis, which is gonna be public in a couple of weeks, that the module speeds up count queries for high-degree vertices by several orders of magnitude, for as little as 10-25% write throughput penalty. Let me know if you need more detail about that.

Related

Neo4j design choice: relationships vs nodes

I'm dealing with the following situation: many trips exist between many cities. Both have various properties. E.g. the cities have a name and the number of trips that passed through them, whereas trips have a distance and a time.
What is 'best practise' in Neo4j?
a) Add all cities and trips as nodes, and connect the trips to the start and end nodes by means of 'STARTED_AT' and 'ENDS_IN' relations.
or
b) Add only cities as a node, and represent each of the trips as a relation between 2 nodes. This means there are many of the same relations between nodes, where the only difference is that they have other properties.
Information that might be useful: we only need to do all kinds of queries. No insertion needed.
Thanks!
I would argue it's better to store trips as nodes because, at least in older Neo4j versions, relationship properties cannot be indexed, and more complex queries (like finding the shortest route by time) will be slow. So if you are searching for trips by ID or something similar, you will need to store them as nodes.
On the other hand, an argument can be made for using relationships, because then you can take full advantage of APOC's weighted graph search functions.
A good way to decide whether something should be a node or a relationship is to ask yourself "are there any other relations that would make sense here?" If you are talking about whether two cities are connected, a relationship makes more sense, because they either are or are not. If you are talking about road trips, though, the trip can pass through multiple cities, can have participants (or groups thereof) and can have an owner. In that case, for future flexibility, nodes will be much easier to maintain.
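To make the trips-as-nodes option concrete, here is a small sketch. The STARTED_AT and ENDS_IN relationship types come from the question; the City and Trip labels, the property names, and all values are made up for illustration.

```cypher
// Sample data: two cities and one trip between them (values are invented).
CREATE (a:City {name: 'A'}), (b:City {name: 'B'})
CREATE (t:Trip {id: 1, distance: 650, time: 390})
CREATE (t)-[:STARTED_AT]->(a), (t)-[:ENDS_IN]->(b);

// All trips from A to B, fastest first.
MATCH (:City {name: 'A'})<-[:STARTED_AT]-(t:Trip)-[:ENDS_IN]->(:City {name: 'B'})
RETURN t.id, t.distance, t.time
ORDER BY t.time;
```

Because a trip is a node here, it can later be linked to participants, an owner, or intermediate cities without changing the existing data.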
I would say it really depends on how you model these trips. Let's assume we can generalize this as (city)-[trip]->(city). Notice that Neo4j's relationships always have a direction, so we can go on adding an unlimited number of trips between cities without having to redefine each city for each trip. This also addresses (a), by the way: we don't need separate relationships to define where a trip starts and ends, because the direction of the relationship does that work for you.
'This means there are many of the same relations between nodes' <<- on this note, if you need to differentiate trips by the time they were taken, you can add the date/timestamp as a relationship property, or you can go with a time tree (see Mark Needham's article on that here and Graphgrid's take).
Hope this helps.

Is there a benefit to implementing singletons in Neo?

My business requirement says I need to add an arbitrary number of well-defined (AKA not dynamic, not unknown) attributes to certain types of nodes. I am pretty sure that while there could be 30 or 40 different attributes, a node will probably have no more than 4 or 5 of them. Of course there will be corner cases...
In this context, I am generically using 'attribute' as a tag wanted by the business, and not in the Neo4J sense.
I'll be expected to report on which nodes have which attributes. For example, I might have to report on which nodes have the "detention", "suspension", or "double secret probation" attributes.
One way is to simply have an array of appropriate attributes on each entity. But each query would require a search of all nodes. Or, I could create explicit attributes on each node. Now they could be indexed. I'm not seriously considering either of these approaches.
Another way is to implement each attribute as a singleton Neo node, and allow many (tens of thousands?) of other nodes to relate to these nodes. This implementation would have 10,000 nodes but 40,000 relationships.
Finally, the attribute nodes could be created and used by specific entity nodes on an as-needed basis. In this case, if 10,000 entities had an average of 4 attributes, I'd have a total of 50,000 nodes.
As I type this, I realize that in the 2nd case, I still have 40,000 relationships; the 'truth' of the situation did not change.
Is there a reason to avoid the 'singleton' implementation? I could put timestamps on the relationships. But those wouldn't be indexed...
For your simple use case, I'd suggest an approach you didn't list -- which is to use a node label for each "attribute".
Nodes can have multiple labels, and neo4j can quickly iterate through all the nodes with the same label -- making it very quick and easy to find all the nodes with a specific label.
For example:
MATCH (n:Detention)
RETURN n;
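A slightly fuller sketch of the label-per-attribute idea follows. The Student label and its id property are hypothetical; Detention and Suspension stand in for the attributes named in the question.

```cypher
// Tag an existing node with an attribute.
MATCH (s:Student {id: 42})
SET s:Detention;

// Report every node carrying at least one of several attributes,
// together with the full list of labels on each node.
MATCH (n)
WHERE n:Detention OR n:Suspension
RETURN n, labels(n) AS attributes;
```

A multi-word attribute such as "double secret probation" would need either a label without spaces or backtick quoting around the label name.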

Neo4j graph modelling performance and querability, property to a node or as separate node plus relationship

I am teaching myself graph modelling and use Neo4j 2.2.3 database with NodeJs and Express framework.
I have skimmed through the free neo4j graph database book and learned how to model a scenario, when to use relationship and when to create nodes, etc.
I have modelled a vehicle selling scenario, with following structure
NODES
(:VEHICLE{mileage:xxx, manufacture_year: xxxx, price: xxxx})
(:VFUEL_TYPE{type:xxxx}) x 2 (one for diesel and one for petrol)
(:VCOLOR{color:xxxx}) x 8 (red, green, blue, .... yellow)
(:VGEARBOX{type:xxx}) x 2 (AUTO, MANUAL)
RELATIONSHIPS
(vehicleNode)-[:VHAVE_COLOR]->(colorNode - either of the colors)
(vehicleNode)-[:VGEARBOX_IS]->(gearboxNode - either manual or auto)
(vehicleNode)-[:VCONSUMES_FUEL_TYPE]->(fuelNode - either diesel or petrol)
Assuming we have the above structure and so on for the rest of the features.
As shown in the above screenshot (136 & 137 are VEHICLE nodes), the majority of a vehicle's features are created as separate nodes and shared, via relationships, among vehicles that have the feature in common.
Could you please advise whether roles (labels) like color, body type, driving side (left drive or right drive), gearbox and others should be separate nodes or properties of the vehicle node? Which option performs better and is easier to query?
I want to write JS code that allows querying the graph with the above structure by one or many search criteria. If the majority of those features were properties of the VEHICLE node, then querying would not be difficult:
MATCH (v:VEHICLE) WHERE v.gearbox = "MANUAL" AND v.fuel_type = "PETROL" AND v.price > x AND v.price < y AND .... RETURN v;
However, with the graph model that I have, it is tricky to search, especially when there are multiple criteria that are not properties of the VEHICLE node but separate nodes linked via relationships.
Any ideas and advice on how to make the existing structure of the graph more queryable and more performance friendly would be much appreciated. If we imagine a scenario with 1000 VEHICLE nodes, that would generate 15000 relationships, which sounds a bit scary, and if it hits a million VEHICLE nodes, then up to 15 million relationships. Please comment if I am heading in the wrong direction.
Thank you for your time.
Modeling is full of tradeoffs; it looks like you have a decent start.
Don't be concerned at all with the number of relationships. Handling them is what graph databases are good at, so I wouldn't worry about over-using them.
Should something be a property, or a node? I can't answer for your scenario, but here are some things to consider:
If you look something up by a value all the time, and you have many objects, it's usually going to be faster to find one node and then everything connected to it, because graph DBs are good at exploiting relationships. It's less fast to scan all nodes of a label and find the items where a property=a value.
Relationships work well when you want to express a connection to something that isn't a simple primitive data type. For example, take "gearbox". There's manuals, and other types...if it's a property value, you won't later easily be able to decide to store 4 other sub-types/sub-aspects of "gearbox". If it were a node, that would later be easy because you could add more properties to the node, or relate other things.
If a piece of data really is a primitive (String, integer, etc) and you don't need extra detail about it, that usually makes a good property. Querying primitive values by connecting to other nodes will seem clunky later on. For example, I wouldn't model a person with a "date of birth" as a separate node, that would be irritating to query, and would give you flexibility you'd be very unlikely to need in the future.
Semantically, how is your data related? If two items are similar because they share an X, then that X probably should be a node. If two items happen to have the same Y value but that doesn't really mean much, then Y is probably better off as a node property.
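For the multi-criteria search the question asks about, a query against the existing model might look like the sketch below. It uses the labels and relationship types from the question; the price bounds are arbitrary placeholder values.

```cypher
// Start from the small lookup nodes and expand to the vehicles attached to both.
MATCH (:VGEARBOX {type: 'MANUAL'})<-[:VGEARBOX_IS]-(v:VEHICLE)
      -[:VCONSUMES_FUEL_TYPE]->(:VFUEL_TYPE {type: 'PETROL'})
// Primitive values such as price stay as properties on the vehicle.
WHERE v.price > 5000 AND v.price < 15000
RETURN v
```

Each additional node-based criterion adds one more pattern part, and each property-based criterion adds one more WHERE condition, so a JS query builder can assemble the two kinds separately.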

What is the most performant way to create the following MATCH statement and why?

The question:
What is the most performant way to create the following MATCH statement and why?
The detailed problem:
Let's say we have a Place node with a variable number of properties and need to look up nodes, from potentially billions of nodes, by category. I'm trying to wrap my head around the performance of each query and it's proving to be quite difficult.
The possible queries:
Match Place node using a property lookup:
MATCH (entity:Place { category: "Food" })
Match Place node with isCategory relationship to Food node:
MATCH (entity:Place)-[:isCategory]->(category:Food)
Match Place node with Food relationship to Category node:
MATCH (entity)-[category:Food]->(:Category)
Match Food node with isCategoryFor relationship to Place node:
MATCH (category:Food)-[:isCategoryFor]->(entity:Place)
And obviously all the variations in between. With relationship directions going the other way as well.
More complexity:
Let's throw in a little more complexity and say we now need to find all Place nodes using multiple categories. For example: Find all Place nodes with category Food or Bar
Would we just tack on another MATCH statement? If not, what is the most performant route to take here?
Extra:
Is there a tool to help me describe the traversal process and tell me the best method to choose?
If I understand your domain correctly, I would recommend making your categories into nodes themselves.
MERGE (:Category {name:"Food"})
MERGE (:Category {name:"Bar"})
MERGE (:Category {name:"Park"})
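And connecting each Place node to the categories it belongs to. Merge each node on its own and then merge the relationship; a MERGE on the whole pattern would create duplicate Place and Category nodes whenever the full pattern does not already exist.
MERGE (p:Place {name:"Central Park"}) MERGE (c:Category {name:"Park"}) MERGE (p)-[:IS_A]->(c);
MERGE (p:Place {name:"Joe's Diner"}) MERGE (c:Category {name:"Food"}) MERGE (p)-[:IS_A]->(c);
MERGE (p:Place {name:"Joe's Diner"}) MERGE (c:Category {name:"Bar"}) MERGE (p)-[:IS_A]->(c);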
Then, if you want to find Places that belong to a Category, it can be pretty quick. Start by matching the category, then branch out to the places related to the category.
MATCH (c:Category {name:"Bar"}), (c)<-[:IS_A]-(p:Place)
RETURN p
You'll have a relatively limited number of categories, so matching the category will be quick. Then, because of the way Neo4j actually stores data, it will be fast to find all the places related to that category.
More Complexity
Finding places within multiple categories will be easy as well.
MATCH (c:Category)<-[:IS_A]-(p:Place)
WHERE c.name = "Bar" OR c.name = "Food"
RETURN DISTINCT p
Again, you just match the categories first (fast because there aren't many of them), then branch out to the connected places. DISTINCT keeps a place that is in both categories (like Joe's Diner) from being returned twice.
Use an Index
If you want fast, you need to use indexes where it makes sense. In this example, I would use an index on the category's name property.
CREATE INDEX ON :Category(name)
Or better yet, use a uniqueness constraint on the category names, which will index them and prevent duplicates.
CREATE CONSTRAINT ON (c:Category) ASSERT c.name IS UNIQUE
Indexes (and uniqueness) make a big difference on the speed of your queries.
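The two statements above use the schema syntax that was current when this answer was written. To my knowledge, recent Neo4j releases (4.4 and 5.x) expect the form sketched below; the index and constraint names are arbitrary choices of mine, so check the manual for your version before relying on this.

```cypher
// Option 1: a plain index on the category name.
CREATE INDEX category_name IF NOT EXISTS
FOR (c:Category) ON (c.name);

// Option 2: a uniqueness constraint, which brings its own index.
// Run one option or the other, not both, for the same label and property.
CREATE CONSTRAINT category_name_unique IF NOT EXISTS
FOR (c:Category) REQUIRE c.name IS UNIQUE;
```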
Why this is fastest
Neo4j stores nodes and relationships in a very compact, quick-to-access format. Once you have a node or relationship, getting the adjacent relationships or nodes is very fast. However, it stores each node's (and relationship's) properties separately, meaning that looking through properties is relatively slow.
The goal is to get to a starting node as quickly as possible. Once there, traversing related entities is quick. If you only have 1,000 categories, but you have a billion places, it will be faster to pick out an individual Category than an individual Place. Once you have that starting node, getting to related nodes will be very efficient.
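One way to check this reasoning on your own data, and to address the "is there a tool" part of the question, is to prefix a query with EXPLAIN or PROFILE. EXPLAIN shows the planned operators without running the query, while PROFILE runs it and reports how much work each operator did. A minimal sketch using the category query from above:

```cypher
// Runs the query and returns the execution plan with per-operator statistics.
PROFILE
MATCH (c:Category {name: "Bar"})<-[:IS_A]-(p:Place)
RETURN p
```

Comparing the profiles of the alternative models on a realistic data set is usually more reliable than reasoning about them in the abstract.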
The Other Options
Just to reinforce, this is what makes your other options slower or otherwise worse.
In your first example, you are looking through properties on each node to look for the match. Property lookup is slow and you are doing it a billion times. An index can help with this, but it's still a lot of work. Additionally, you are effectively duplicating the category data over each of your billion places, and not taking advantage of Neo4j's strengths.
In all your other examples, your data models seem odd. "Food", "Bar", "Park", etc. are all instances of categories, not separate types. They should each be their own node, but they should all have the Category label, because that's what they are. In addition, categories are things, and thus they should be nodes. A relationship describes the connection between things. It does not make sense to use categories in this way.
I hope this helps!

Finding clusters of similar nodes with Cypher

Given a set of hundreds of thousands of nodes with a like relationship, (Foodie) -likes-> (Food), I would like to find logical clusters of Foodie nodes.
For instance, suppose I want to divide the nodes into two sets. As output I would like the two sets whose members have the most eating habits in common.
The same logic can be extended to 3, 4, 5 sets, etc. In the case of three sets, each set would contain the foodies with the most similar eating habits. Please note that the sets may NOT have the same number of nodes.
An application for instance could be coloring of nodes. If the foodies were of different countries, the color of the nodes could point to various countries assuming the people of different countries ate similar food.
I would like to write a Cypher query to extract the nodes. I am stumped as where to start. Any solution or pointers would be appreciated.
What about trying the current milestone of Neo4j 2.0 (http://www.neo4j.org/download, milestone section) and assigning your nodes different labels according to their characteristics (http://www.neo4j.org/develop/labels)?
Then, you'll only have to execute Cypher queries like:
MATCH (nodes:MY_LABEL)
WHERE /.../
RETURN nodes
so that you can retrieve nodes by clusters.
You might want to look into Cliques. This is a general Graph Theory idea, but it sounds like what you want is to define certain 'cliques' of foodies, say there are BBQ foodies, Food Truck foodies, etc.
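Neither suggestion gives a query to start from, so here is one possible first step rather than a clustering algorithm: count how many foods each pair of foodies has in common. It assumes Foodie and Food are labels and likes is the relationship type, as the pattern in the question suggests.

```cypher
// Pairs of foodies ranked by the number of foods they both like.
MATCH (a:Foodie)-[:likes]->(f:Food)<-[:likes]-(b:Foodie)
// Keep each unordered pair only once.
WHERE id(a) < id(b)
RETURN a, b, count(f) AS sharedFoods
ORDER BY sharedFoods DESC
```

The resulting pair scores could then feed a clustering step outside Cypher, or be used to assign cluster labels as suggested above.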
