How to partially isolate a subgraph without using labels in neo4j

I'm creating a graph that contains a large number of subgraphs of roughly treelike structure in that the 'root' of each subgraph only has outwardly directed relationships. The many leaves and branches of this subgraph all contain data related to the root. This is so that a single query like the following will return all data associated with a given root, and only the data associated with that root:
MATCH (root:ROOT {id: 'foo'})-[*]->(leaves) RETURN leaves
There are very strong reasons to optimize for this query. However, the subgraphs are not truly isolated, because some of the leaves are actually categories that can receive relationships from many roots, so structures like this exist:
(root)-[]->(category)<-[]-(root)
This seems like a great way to preserve the integrity of the subgraphs while also allowing for complex relationships between them. However, there's one catch: I can't have simple, one-to-one relationships directly between roots, or one root will contaminate the other's response to the first query. As I see it, there are only two real options.
Build a new dummy node for each 1-to-1 relationship between roots. Like so:
(root)-[]->(dummy)<-[]-(root)
I hate this option. It proliferates useless nodes and it dilutes the concept of relationships.
Give every child of each subgraph a label identifying it as a member of the subgraph. This is an even worse option as I see it. Since the subgraphs number in the many thousands it would dramatically pollute the label space.
I've also considered filtering on the type of a direct relationship, but that only excludes the foreign root, and not its children. See below:
Filter on the type of direct 1-to-1 relationships with a structure like this:
(root)-[:bar]->(foreign_root)-[]->(foreign_leaves)
And a primary query like this:
MATCH (root {id: 'foo'})-[*]->(leaves) WHERE NOT (root)-[:bar]->(leaves) RETURN leaves
This produces a result of (foreign_leaves). That is undesirable for multiple reasons: it makes the most important query larger, and it doesn't actually isolate the subgraph.
So, in one sense I am asking, is there a way to create a direct, 1-to-1 relationship between two of these roots without massive graph pollution or cross-contamination between subgraphs? In a larger sense, am I viewing the problem wrongly?

I think you are almost there. In your last Cypher query, the WHERE clause only checks whether a :bar relationship leads directly from root to leaves, so it never looks at the path that was actually matched. Bind the path to a variable and test its first relationship instead. Like this:
MATCH p = (root {id: 'foo'})-[*]->(leaves)
WHERE type(head(relationships(p))) <> 'bar'
RETURN leaves
This way, you filter out all paths that start with a :bar relationship.
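The idea of dropping every path that starts with a :bar relationship can be modelled in a few lines of plain Python. This is only a sketch: the toy graph, the node names and the helper `reachable` are made up for illustration, and it says nothing about how Neo4j executes the query internally.

```python
from collections import deque

# Toy graph as (source, relationship type, target). All names are hypothetical.
EDGES = [
    ("foo", "has", "leaf_a"),
    ("foo", "has", "category"),
    ("foo", "bar", "other_root"),        # direct root-to-root relationship
    ("other_root", "has", "foreign_leaf"),
    ("other_root", "has", "category"),   # category shared by both roots
]


def reachable(root, skip_first=None):
    """Nodes reachable from root, ignoring paths whose first relationship has type skip_first."""
    outgoing = {}
    for src, rel, dst in EDGES:
        outgoing.setdefault(src, []).append((rel, dst))

    seen = set()
    # Only the first hop is filtered by type; later hops follow any relationship.
    queue = deque(dst for rel, dst in outgoing.get(root, []) if rel != skip_first)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(dst for _, dst in outgoing.get(node, []))
    return seen
```

With this toy data, `reachable("foo")` also collects `other_root` and `foreign_leaf`, while `reachable("foo", skip_first="bar")` keeps only `leaf_a` and the shared `category` node.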

Related

NEO4J - Matching a path where middle node might exist or not

I have the following graph:
I would like to get all contractors and subcontractors and clients, starting from David.
So I thought of a query like this:
MATCH (a:contractor)-[*0..1]->(b)-[w:works_for]->(c:client) return a,b,c
This would return:
(0:contractor {name:"David"}) (0:contractor {name:"David"}) (56:client {name:"Sarah"})
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
Which returns the desired result. The issue here is performance.
If the DB contains millions of records and I leave (b) without a label, the query will take forever. If I add a label to (b) such as (b:subcontractor) I won't hit millions of rows but I will only get results with subcontractors:
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
Is there a more efficient way to do this?
link to graph example: https://console.neo4j.org/r/pry01l
There are some things to consider with your query.
The relationship type is not specified. Is it the case that the only relationships from contractor nodes are works_for and hired? If not, you should constrain the relationship types being matched in your query. For example
MATCH (a:contractor)-[:works_for|:hired*0..1]->(b)-[w:works_for]->(c:client)
RETURN a,b,c
The fact that (b) is unlabelled does not mean that every node in the graph will be matched. It will be reached either as a result of traversing the works_for or hired relationships if specified, or any relationship from :contractor, or via the works_for relationship.
If you do want to label it, and you have a hierarchy of types, you can assign multiple labels to nodes and just use the most general one in your query. For example, you could have a label such as ExternalStaff as the generic label, and then further add Contractor or SubContractor to distinguish individual nodes. Then you can do something like
MATCH (a:contractor)-[:works_for|:hired*0..1]->(b:ExternalStaff)-[w:works_for]->(c:client)
RETURN a,b,c
Depends really on your use cases.
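To illustrate the point that an unlabelled (b) is only ever reached by traversal from (a), here is a small plain-Python emulation of the pattern. The data is a guess at the linked example graph, and the function `match` is a hypothetical helper, not part of any Neo4j driver.

```python
# Hypothetical data modelled on the question; the real graph may differ.
LABELS = {"David": "contractor", "John": "subcontractor", "Sarah": "client"}
EDGES = [
    ("David", "works_for", "Sarah"),
    ("David", "hired", "John"),
    ("John", "works_for", "Sarah"),
]


def match(optional_types):
    """Emulate (a:contractor)-[:T1|:T2*0..1]->(b)-[:works_for]->(c:client)."""
    rows = []
    for a, label in LABELS.items():
        if label != "contractor":
            continue
        # Zero hops: b is a itself. One hop: follow only the allowed types.
        middles = [a] + [dst for src, rel, dst in EDGES
                         if src == a and rel in optional_types]
        for b in middles:
            for src, rel, c in EDGES:
                if src == b and rel == "works_for" and LABELS[c] == "client":
                    rows.append((a, b, c))
    return rows
```

Only the neighbours of David are ever inspected as candidates for `b`; nodes that are not connected to a contractor never enter the loop.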

Neo4j labels and properties, and their differences

Say we have a Neo4j database with several 50,000 node subgraphs. Each subgraph has a root. I want to find all nodes in one subgraph.
One way would be to recursively walk the tree. It works but can be thousands of trips to the database.
One way is to add a subgraph identifier to each node:
MATCH(n {subgraph_id:{my_graph_id}}) return n
Another way would be to relate each node in a subgraph to the subgraph's root:
MATCH(n)-[]->(root:ROOT {id: {my_graph_id}}) return n
This feels more "graphy" if that matters. Seems expensive.
Or, I could add a label to each node. If {my_graph_id} was "BOBS_QA_COPY" then
MATCH(n:BOBS_QA_COPY) return n
would scoop up all the nodes in the subgraph.
My question is when is it appropriate to use a garden-variety property, add relationships, or set a label?
Setting a label to identify a particular subgraph makes me feel weird, like I am abusing the tool. I expect labels to say what something is, not which instance of something it is.
For example, if we were graphing car information, I could see having parts labeled "FORD EXPLORER". But I am less sure that it would make sense to have parts labeled "TONYS FORD EXPLORER". Now, I could see (USER id:"Tony") having a relationship to a FORD EXPLORER graph...
I may be having a bout of "SQL brain"...
Let's work this through, step by step.
If there are N non-root nodes, adding an extra N ROOT relationships makes the least sense. It is very expensive in storage, it will pollute the data model with relationships that don't need to be there and that can unnecessarily complicate queries that want to traverse paths, and it is not the fastest way to find all the nodes in a subgraph.
Adding a subgraph ID property to every node is also expensive in storage (but less so), and would require either: (a) scanning every node to find all the nodes with a specific ID (slow), or (b) using an index, say, :Node(subgraph_id) (faster). Approach (b), which is preferable, would also require that all the nodes have the same Node label.
But wait, if approach (b) already requires all nodes to be labelled, why don't we just use a different label for each subgraph? By doing that, we don't need the subgraph_id property at all, and we don't need an index either! And finding all the nodes with the same label is fast.
Thus, using a per-subgraph label would be the best option.
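The difference between scanning a property on every node and looking nodes up by label can be sketched in plain Python. The node data and the `LABEL_INDEX` dictionary are assumptions standing in for whatever lookup structure the database keeps per label; the counts only show how much work each approach does in this toy model.

```python
from collections import defaultdict

# Hypothetical nodes: three subgraphs (G0, G1, G2) of four nodes each.
NODES = [
    {"id": i, "subgraph_id": f"G{i % 3}", "labels": {f"G{i % 3}"}}
    for i in range(12)
]


def by_property_scan(subgraph_id):
    """Check the property on every node; returns (matching ids, nodes examined)."""
    hits = [n["id"] for n in NODES if n["subgraph_id"] == subgraph_id]
    return hits, len(NODES)


# Built once up front, as a stand-in for a per-label lookup structure.
LABEL_INDEX = defaultdict(list)
for node in NODES:
    for label in node["labels"]:
        LABEL_INDEX[label].append(node["id"])


def by_label(label):
    """Read the ids straight from the label lookup; returns (ids, nodes examined)."""
    hits = LABEL_INDEX.get(label, [])
    return hits, len(hits)
```

Both functions return the same ids for a given subgraph, but the scan touches all twelve nodes while the label lookup touches only the four that match.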

What is the most performant way to create the following MATCH statement and why?

The question:
What is the most performant way to create the following MATCH statement and why?
The detailed problem:
Let's say we have a Place node with a variable number of properties and need to look up nodes from potentially billions of nodes by its category. I'm trying to wrap my head around the performance of each query and it's proving to be quite difficult.
The possible queries:
Match Place node using a property lookup:
MATCH (entity:Place { category: "Food" })
Match Place node with isCategory relationship to Food node:
MATCH (entity:Place)-[:isCategory]->(category:Food)
Match Place node with Food relationship to Category node:
MATCH (entity)-[category:Food]->(:Category)
Match Food node with isCategoryFor relationship to Place node:
MATCH (category:Food)-[:isCategoryFor]->(entity:Place)
And obviously all the variations in between. With relationship directions going the other way as well.
More complexity:
Let's throw in a little more complexity and say we now need to find all Place nodes using multiple categories. For example: Find all Place nodes with category Food or Bar
Would we just tack on another MATCH statement? If not, what is the most performant route to take here?
Extra:
Is there a tool to help me describe the traversal process and tell me the best method to choose?
If I understand your domain correctly, I would recommend making your categories into nodes themselves.
MERGE (:Category {name:"Food"})
MERGE (:Category {name:"Bar"})
MERGE (:Category {name:"Park"})
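And connect each Place node to the categories it belongs to. Merge the nodes and the relationship separately, because a MERGE on the whole pattern creates new Place and Category nodes whenever the complete pattern does not already exist.
MATCH (c:Category {name:"Park"}) MERGE (p:Place {name:"Central Park"}) MERGE (p)-[:IS_A]->(c);
MATCH (c:Category {name:"Food"}) MERGE (p:Place {name:"Joe's Diner"}) MERGE (p)-[:IS_A]->(c);
MATCH (c:Category {name:"Bar"}) MERGE (p:Place {name:"Joe's Diner"}) MERGE (p)-[:IS_A]->(c);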
Then, if you want to find Places that belong to a Category, it can be pretty quick. Start by matching the category, then branch out to the places related to the category.
MATCH (c:Category {name:"Bar"}), (c)<-[:IS_A]-(p:Place)
RETURN p
You'll have a relatively limited number of categories, so matching the category will be quick. Then, because of the way Neo4j actually stores data, it will be fast to find all the places related to that category.
More Complexity
Finding places within multiple categories will be easy as well.
MATCH (c:Category)<-[:IS_A]-(p:Place)
WHERE c.name = "Bar" OR c.name = "Food"
RETURN DISTINCT p
Again, you just match the categories first (fast because there aren't many of them), then branch out to the connected places.
Use an Index
If you want fast, you need to use indexes where it makes sense. In this example, I would use an index on the category's name property.
CREATE INDEX ON :Category(name)
Or better yet, use a uniqueness constraint on the category names, which will index them and prevent duplicates.
CREATE CONSTRAINT ON (c:Category) ASSERT c.name IS UNIQUE
Indexes (and uniqueness) make a big difference on the speed of your queries.
Why this is fastest
Neo4j stores nodes and relationships in a very compact, quick-to-access format. Once you have a node or relationship, getting the adjacent relationships or nodes is very fast. However, it stores each node's (and relationship's) properties separately, meaning that looking through properties is relatively slow.
The goal is to get to a starting node as quickly as possible. Once there, traversing related entities is quick. If you only have 1,000 categories, but you have a billion places, it will be faster to pick out an individual Category than an individual Place. Once you have that starting node, getting to related nodes will be very efficient.
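The "find the starting node first, then traverse" strategy can be sketched in plain Python. The two dictionaries are assumptions that stand in for an index on the category name and for the stored IS_A relationships; the data mirrors the Place and Category examples in this answer.

```python
# Hypothetical data mirroring the example places and categories.
CATEGORY_INDEX = {"Food": 0, "Bar": 1, "Park": 2}   # name -> category node id
IS_A_INCOMING = {                                    # category node id -> places
    0: ["Joe's Diner"],
    1: ["Joe's Diner"],
    2: ["Central Park"],
}


def places_in(*category_names):
    """Find each category node via the index, then follow IS_A backwards."""
    found = []
    for name in category_names:
        category_id = CATEGORY_INDEX.get(name)
        if category_id is None:
            continue  # unknown category: nothing to traverse
        for place in IS_A_INCOMING.get(category_id, []):
            if place not in found:  # skip places already collected via another category
                found.append(place)
    return found
```

The work done is proportional to the number of requested categories plus the number of places attached to them, not to the total number of places in the store.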
The Other Options
Just to reinforce, this is what makes your other options slower or otherwise worse.
In your first example, you are looking through properties on each node to look for the match. Property lookup is slow and you are doing it a billion times. An index can help with this, but it's still a lot of work. Additionally, you are effectively duplicating the category data over each of your billion places, and not taking advantage of Neo4j's strengths.
In all your other examples, your data models seem odd. "Food", "Bar", "Park", etc. are all instances of categories, not separate types. They should each be their own node, but they should all have the Category label, because that's what they are. In addition, categories are things, and thus they should be nodes. A relationship describes the connection between things. It does not make sense to use categories in this way.
I hope this helps!

Neo4j labels, relationship types, and cypher matching performance

Say I have a massive graph of users and other types of nodes. Each type has a label, some may have multiple labels. Since I am defining users and their access to nodes, there is one relationship type between users and nodes: CAN_ACCESS. Between other objects, there are different relationship types, but for the purpose of access control, everything involves a CAN_ACCESS relationship when we start from a user.
I never perform a match without using labels, so my intention and hope is that any performance downsides to having one heavily-used relationship type from my User nodes should be negated by matching a label. Obviously, this match could get messy:
MATCH (n:`User`)-[r1:`CAN_ACCESS`]->(n2)
But I'd never do that. I'd do this:
MATCH (n:`User`)-[r1:`CAN_ACCESS`]->(n2:`LabelX`)
My question, then is whether the use of labels on the destination side of the match is effectively equivalent to having a dedicated relationship type between a User and any given label. In other words, does this:
MATCH (n:`User`)-[r1:`CAN_ACCESS`]->(n2:`LabelX`)
Give me the same performance as this:
MATCH (n:`User`)-[r1:`CAN_ACCESS_LABEL_X`]->(n2)
If CAN_ACCESS_LABEL_X ALWAYS goes (n:`User`)-->(n2:`LabelX`)?
As pointed out by Michael Hunger's comment, Mark Needham's blog post here demonstrates that performance is best when you use a dedicated relationship type instead of relying on labels.

Why do relationships as a concept exist in neo4j or graph databases in general?

I can't seem to find any discussion on this. I had been imagining a database that was schemaless and node based and hierarchical, and one day I decided it was too common sense to not exist, so I started searching around and neo4j is about 95% of what I imagined.
What I didn't imagine was the concept of relationships. I don't understand why they are necessary. They seem to add a ton of complexity to all topics centered around graph databases, but I don't quite understand what the benefit is. Relationships seem to be almost exactly like nodes, except more limited.
To explain what I'm thinking, I was imagining starting a company, so I create myself as my first nodes:
create (u:User { name:"mindreader"});
create (c:Company { name:"mindreader Corp"});
One day I get a customer, so I put his company into my db.
create (c:Company { name:"Customer Company"});
create (u:User { name:"Customer Employee1" });
create (u:User { name:"Customer Employee2"});
I decide to link users to their customers
match (u:User) where u.name =~ "Customer.*"
match (c:Company) where c.name =~ "Customer.*"
create (u)-[:Employee]->(c);
match (u:User) where u.name = "mindreader"
match (c:Company) where c.name =~ "mindreader.*"
create (u)-[:Employee]->(c);
Then I hire some people:
match (c:Company) where c.name =~ "mindreader.*"
create (:User { name:"Employee1"})-[:Employee]->(c)
create (:User { name:"Employee2"})-[:Employee]->(c);
One day hr says they need to know when I hired employees. Okay:
match (c:Company)<-[r:Employee]-(u:User)
where c.name =~ "mindreader.*" and u.name =~ "Employee.*"
set r.hiredate = '2013-01-01';
Then hr comes back and says hey, we need to know which person in the company recruited a new employee so that they can get a cash reward for it.
Well now what I need is for a relationship to point to a user but that isn't allowed (:Hired_By relationship between :Employee relationship and a User). We could have an extra relationship :Hired_By, but if the :Employee relationship is ever deleted, the hired_by will remain unless someone remembers to delete it.
What I could have done in neo4j was just have a
(u:User)-[:hiring_info]->(hire_info:HiringInfo)-[:hired_by]->(recruiter:User)
In which case the relationships only confer minimal information, the name.
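A small plain-Python model shows what the intermediate HiringInfo node buys: the hire date and the recruiter hang off one node, so removing that node removes both links at once. The store layout, the node ids and the helpers `recruiter_of` and `detach_delete` are hypothetical and only mirror the pattern above.

```python
# Hypothetical toy store; ids and relationship names follow the pattern above.
nodes = {
    "employee1": {"label": "User", "name": "Employee1"},
    "mindreader": {"label": "User", "name": "mindreader"},
    "hire1": {"label": "HiringInfo", "hire_date": "2013-01-01"},
}
rels = [
    ("employee1", "hiring_info", "hire1"),
    ("hire1", "hired_by", "mindreader"),
]


def recruiter_of(user_id):
    """Follow hiring_info, then hired_by, and return the recruiter's name."""
    for src, rel, info in rels:
        if src == user_id and rel == "hiring_info":
            for src2, rel2, recruiter in rels:
                if src2 == info and rel2 == "hired_by":
                    return nodes[recruiter]["name"]
    return None


def detach_delete(node_id):
    """Remove a node together with every relationship that touches it."""
    nodes.pop(node_id, None)
    rels[:] = [r for r in rels if node_id not in (r[0], r[2])]
```

Calling `detach_delete("hire1")` leaves no dangling hired_by relationship behind, which is the concern raised about a separate :Hired_By relationship.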
What I originally envisioned was that there would be nodes, and then each property of a node could be a datatype or it could be a pointer to another node. In my case, a user record would end up looking like:
User {
name: "Employee1"
hiring_info: {
hire_date: "2013-01-01"
hired_by: u:User # -> would point to a user
}
}
Essentially it is still a graph. Nodes point to each other. The name of the relationship is just a field in the origin node. To query it you would just go
match (u:User) where ... return u.name, u.hiring_info.hire_date, u.hiring_info.hired_by.name
If you needed a one to many relationship of the same type, you would just have a collection of pointers to nodes. If you referenced a collection in return, you'd get essentially a join. If you delete hiring_info, it would delete the pointer. References to other nodes would not have to be a disorganized list at the toplevel of a node. Furthermore when I query each user I will know all of the info about a user without both querying for the user itself and also all of its relationships. I would know his name and the fact that he hired someone in the same query. From the database backend, I'm not sure much would change.
I see quite a few questions from people asking whether they should use nodes or relationships to model this or that, and occasionally people asking for a relationship between relationships. It feels like the XML problem where you are wondering if a piece of information should be its own tag or just a property of its parent tag.
The query engine goes to great pains to handle relationships, so there must be some huge advantage to having them, but I can't quite see it.
Different databases are for different things. You seem to be looking for a noSQL database.
This is an extremely wide topic area that you've reached into, so I'll give you the short of it. There's a spectrum of database schemas, each of which has different use cases.
NoSQL aka Non-relational Databases:
Every object is a single document. You can have references to other documents, but any additional traversal means you're making another query. Use these when your data rarely has relationships and you usually just want to query once and get back a large amount of flexibly stored data as a single document. (Note: These are not "nodes". Node has a very specific definition and implies that there are edges.)
SQL aka Relational Databases:
This is table land, this is where foreign keys and one-to-many relationships come into play. Here you have strict schemas and very fast queries. This is honestly what you should use for your user example. Small amounts of data where the relationships between things are shallow (You don't have to follow a relationship more than 1-2 times to get to the relevant entry) are where these excel.
Graph Database:
Use this when relationships are key to what you're trying to do. The most common example of a graph is something like a social graph where you're connecting different users together and need to follow relationships for many steps. (Figure out if two people are connected within a depth of 4, for instance.)
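The "connected within a depth of 4" check is a depth-limited breadth-first search. Here is a minimal plain-Python sketch of that idea; the function name `connected_within` and the sample friendship chain are made up for illustration.

```python
from collections import deque


def connected_within(friends, start, goal, max_depth):
    """Breadth-first search that stops expanding after max_depth hops."""
    if start == goal:
        return True
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        person, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not look beyond the allowed number of hops
        for other in friends.get(person, ()):
            if other == goal:
                return True
            if other not in seen:
                seen.add(other)
                frontier.append((other, depth + 1))
    return False


# Hypothetical chain of acquaintances: a - b - c - d - e - f
FRIENDS = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d", "f"],
    "f": ["e"],
}
```

In this chain, `a` and `e` are four hops apart, so `connected_within(FRIENDS, "a", "e", 4)` succeeds, while `f` is five hops from `a` and is not found within the limit.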
Relationships exist in graph databases because that is the entire concept of a graph database. It doesn't really fit your application, but to be fair you could just keep more in the node part of your database. In general the whole idea of a database is something that lets you query a LOT of data very quickly. Depending on the intrinsic structure of your data there are different ways that that makes sense. Hence the different kinds of databases.
In strongly connected graphs, Neo4j is 1000x faster on 1000x the data than a SQL database. NoSQL would probably never be able to perform in a strongly connected graph scenario.
Take a look at what we're building right now: http://vimeo.com/81206025
Update: In reaction to mindreader's comment, we added the related properties to the picture:
RDBM systems are tabular and put more information in the tables than the relationships. Graph databases put more information in relationships. In the end, you can accomplish much the same goals.
However, putting more information in relationships can make queries smaller and faster.
Here's an example:
Graph databases are also good at storing human-readable knowledge representations, being edge (relationship) centric. RDF takes it one step further, where all information is stored as edges rather than nodes. This is ideal for working with predicate logic, propositional calculus, and triples.
Maybe the right answer is an object database.
Objectivity/DB, which now supports a full suite of graph database capabilities, allows you to design complex schema with one-to-one, one-to-many, many-to-one, and many-to-many reference attributes. It has the semantics to view objects as graph nodes and edges. An edge can be just the reference attribute from one node to another or an edge can exist as an edge object that sits between two nodes.
An edge object can have any number of attributes and can have references off to other objects, as shown in the diagram below.
Being able to "hang" complex objects off of an edge allows Objectivity/DB to support weighted queries where the edge-weight can be calculated using a user-defined weight calculator operator. The weight calculator operator can build the weight from a static attribute on the edge or build the weight by digging down through the objects connected to the edge. In the picture above, we could create an edge-weight calculator that computes the sum of the CallDetail lengths connected to the Call edge.
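The weight calculation described here can be sketched in plain Python. This is not Objectivity/DB's API: the classes `CallDetail` and `CallEdge`, the field `length_seconds` and the function `call_weight` are hypothetical names chosen to mirror the description.

```python
from dataclasses import dataclass, field


@dataclass
class CallDetail:
    """One record hanging off a Call edge (hypothetical shape)."""
    length_seconds: int


@dataclass
class CallEdge:
    """An edge object between two nodes that can reference further objects."""
    caller: str
    callee: str
    details: list = field(default_factory=list)


def call_weight(edge):
    """Compute the edge weight as the sum of the attached CallDetail lengths."""
    return sum(detail.length_seconds for detail in edge.details)
```

For an edge carrying two details of 30 and 90 seconds, `call_weight` returns 120; an edge with no details has weight 0.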

Resources