I have the following graph:
I would like to get all contractors, subcontractors and clients, starting from David.
So I thought of a query like this:
MATCH (a:contractor)-[*0..1]->(b)-[w:works_for]->(c:client) return a,b,c
This would return:
(0:contractor {name:"David"}) (0:contractor {name:"David"}) (56:client {name:"Sarah"})
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
Which returns the desired result. The issue here is performance.
If the DB contains millions of records and I leave (b) without a label, the query will take forever. If I add a label to (b) such as (b:subcontractor) I won't hit millions of rows but I will only get results with subcontractors:
(0:contractor {name:"David"}) (1:subcontractor {name:"John"}) (56:client {name:"Sarah"})
Is there a more efficient way to do this?
link to graph example: https://console.neo4j.org/r/pry01l
There are some things to consider with your query.
The relationship type is not specified. Is it the case that the only relationships from contractor nodes are works_for and hired? If not, you should constrain the relationship types being matched in your query. For example:
MATCH (a:contractor)-[:works_for|:hired*0..1]->(b)-[w:works_for]->(c:client)
RETURN a,b,c
The fact that (b) is unlabelled does not mean that every node in the graph will be matched. It will only be reached by traversing the works_for or hired relationships (if you specify them as above), or any relationship from a :contractor node, and it must still have an outgoing works_for relationship to a :client.
If you do want to label it, and you have a hierarchy of types, you can assign multiple labels to nodes and just use the most general one in your query. For example, you could have a label such as ExternalStaff as the generic label, and then further add Contractor or SubContractor to distinguish individual nodes. Then you can do something like
MATCH (a:contractor)-[:works_for|:hired*0..1]->(b:ExternalStaff)-[w:works_for]->(c:client)
RETURN a,b,c
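For illustration, a node carrying both the generic and the specific label (using the label names suggested above) could be created like this:
CREATE (s:ExternalStaff:SubContractor {name: "John"})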
It really depends on your use cases.
Suppose you've got two nodes that represent the same thing, and you want to merge those two nodes. Both nodes can have any number of relations with other nodes.
The basics are fairly easy, and would look something like this:
MATCH (a), (b) WHERE a.id == b.id
MATCH (b)-[r]->(o)
CREATE (a)-[s]->(o)
SET s = PROPERTIES(r)
DELETE DETACH b
Only I can't create a relation without a type. And Cypher doesn't support variable labels either. I'd love to be able to do something like
CREATE (a)-[s:{LABELS(r)}]->(o)
but that doesn't work. To create the relation, you need to know the type of the relation, and in this case I really don't.
Is there a way to dynamically assign types to relationships, or am I going to have to query the type of each old relation and then string-concatenate new queries with the proper types? That's not impossible, but it's a lot slower and more complex. And this could potentially match a lot of elements and even more relationships, so having to generate a separate query for every instance is going to slow things down quite a lot.
Or is there a way to change the target of the old relationship? That would probably be the fastest, but I'm not aware of any way to do that.
I think you need to take a look at APOC, especially apoc.create.relationship, which enables creating relationships with a dynamic type.
Adapting your example, you should end up with something along the lines of (not tested):
// keep (a) and fold (b) into it; id(a) < id(b) prevents a node matching itself
MATCH (a), (b) WHERE a.id = b.id AND id(a) < id(b)
MATCH (b)-[r]->(n)
CALL apoc.create.relationship(a, type(r), properties(r), n) YIELD rel
DETACH DELETE b
NB:
Relationships have a TYPE, not a label.
The proper Cypher statement to delete a node together with its attached relationships is DETACH DELETE (not DELETE DETACH).
Related resource: https://markhneedham.com/blog/2016/10/30/neo4j-create-dynamic-relationship-type/
The APOC procedure apoc.refactor.mergeNodes should be very helpful. That procedure is very powerful, and you need to read the documentation to understand how to configure it to do what you want in your specific situation.
Here is a simple example that shows how to use the procedure's default configuration to merge nodes with the same id:
MATCH (node:Foo)
WITH node.id AS id, COLLECT(node) AS nodes
WHERE SIZE(nodes) > 1
CALL apoc.refactor.mergeNodes(nodes, {}) YIELD node
RETURN node
In this example, I specified an arbitrary Foo label to avoid accidentally merging unwanted nodes. Doing so also helps to speed up the query if you have a lot of nodes with other labels (since they will not need to be scanned for the id property).
The aggregating function COLLECT is used to collect a list of all the nodes with the same id. After checking the size of the list, it is passed to the procedure.
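If you also need to control how conflicting property values and existing relationships are handled, the procedure takes a configuration map instead of the empty {}; a rough sketch, replacing the CALL line above (check the APOC documentation for the exact option names supported by your APOC version):
CALL apoc.refactor.mergeNodes(nodes, {properties: "combine", mergeRels: true}) YIELD node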
My graph is 1M nodes. The data model is intentionally simple. There are Entities and IDType nodes. A single Entity may have 1:many IDType nodes. And an IDType node may be connected to 1:many Entities. This forms the graph.
The goal is to find all groups of IDType and Entity nodes that are connected together into what I call a cluster (a subgraph, I guess some call it). I would like to find such "clusters" in the graph data, and I'm trying to figure out how to do that. I've written a Cypher query that I believe does it, but it's not clear to me whether it's doing what is intended.
The question: how do I efficiently traverse my graph and cluster together nodes so that there is a single row, or group of rows, that I can return as a row-based result set to my Python driver program, which will then operate over that cluster? While this doesn't need to be the exact structure of my result, it gives a sense of what I'm looking for:
cluster|nodes
1|2,3,4,5,6,7
2|10,11,12,13
3|15,17,19,20,21,25,27,28,33
Where the "cluster" is some arbitrary clustering of the list of nodes (frankly if I have a single line that's just a collection of clusters or some other way of telling they are all related, then I'm golden). The "nodes" number represents a unique integer-based property that we tag to every Entity node.
The query is below. The concept is that an "Entity" node can have 1 or many "ID" nodes and I'm trying to get all "Entity" and "ID" that are related to each other through the relationship "HAS_ID".
Conceptually, if there is a relationship that exists in the data like this Entity1-->ID1<--Entity2-->ID2<--Entity3-->ID3<--Entity4-->ID4<--Entity5 then I want to "cluster" them together so that I can create a unique number that represents this group of nodes. With my example, there are 5 entities, but there could just as easily be 2 entities, or 50 entities, which are all related to one another, that's why I'm thinking the variable length path is what I need.
The below is my attempt to do this in the graph. But 1) is it correct? 2) Is it efficient? It seems to run indefinitely. 3) How do I best "group" these together?
match
(n:Entity)-[e1:HAS_ID*]-(o)
where n.key <> o.key
return *
limit 10
;
I've also tried
match (n:Entity)-[e1:HAS_ID*]-(o)
where n.key <> o.key
with distinct n.key as key_1, o.key as key_2
return key_1, collect(key_2)
limit 100
;
This seems to do close to what I want, but I'm still not getting a single group for a given key. In other words, I can get 5 rows returned that are all still related to each other, where I'd rather have 1 row. Here's an example: you can see that key "49518" appears in both the first and second row, and I'd rather have one row that grouped them all together.
49518 [49004, 49871, 49940, 50525, 49101, 49625, 50165, 50017, 49098, 50383]
49940 [49088, 49706, 50292, 50470, 49140, 49258, 49216, 49559, 50004, 50346, 49237, 49518, 49894, 49101, 49625, 50165, 50017, 49098, 50383]
Well, for one, your query doesn't match the relationship pattern you described.
Each of your arrows in your pattern is a [:HAS_ID] relationship, so if entities and IDs are always alternating between each relationship, then your current query would only match patterns like this:
(:Entity)-[:HAS_ID]->(:ID)<-[:HAS_ID]-(:Entity)-[:HAS_ID]->(:ID)<-[:HAS_ID]-(:Entity)
3 entities, 2 IDs, 4 relationships. That doesn't match your example pattern of 5 entities, 4 IDs, and 8 relationships. So at the very least, you'll want to alter your pattern to use *8.
As for efficiency...the thing you're trying to do seems rather inefficient, as it must attempt to find this pattern on every single :Entity node in your graph, trying every single :HAS_ID relationship it finds. If your entire graph is made of this same pattern of :Entity and :ID and :HAS_ID, then your query is going to be traversing your entire graph, not once but multiple times.
You are going to get duplicate results. Even if we assume that your entire graph is made up of isolated 5 entity / 4 ID / 8 relationship chains like a snake, as in your example (an entity either being at the end of the chain with one link to an ID, or somewhere in the middle with links to 2 IDs), then you'll be getting 2 matches for that same group of nodes, one matching from one end of the chain, the other matching the other end. And that's the simple case...I'm guessing your graph could be much more complex than this, allowing even more possibilities for many different patterns to match on the exact same group of nodes. A unique path using your pattern does not equate to a unique grouping of nodes.
At the very least, you'll probably want to assign a path variable to your pattern and use RETURN DISTINCT nodes(p) to enforce unique sets of nodes, but I still think the matching may take quite a bit of time.
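A minimal sketch of that approach against the model described in the question (still likely to be slow on a large graph):
MATCH p = (n:Entity)-[:HAS_ID*]-(o:Entity)
WHERE n.key <> o.key
RETURN DISTINCT nodes(p)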
When writing a query to add relationships to existing nodes, it keeps warning me with this message:
"This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (e))"
If I run the query, it creates no relationships.
The query is:
match
(a{name:"Angela"}),
(b{name:"Carlo"}),
(c{name:"Andrea"}),
(d{name:"Patrizia"}),
(e{name:"Paolo"}),
(f{name:"Roberta"}),
(g{name:"Marco"}),
(h{name:"Susanna"}),
(i{name:"Laura"}),
(l{name:"Giuseppe"})
create
(a)-[:mother]->(b),
(a)-[:grandmother]->(c), (e)-[:grandfather]->(c), (i)-[:grandfather]->(c), (l)-[:grandmother]->(c),
(b)-[:father]->(c),
(e)-[:father]->(b),
(l)-[:father]->(d),
(i)-[:mother]->(d),
(d)-[:mother]->(c),
(c)-[:boyfriend]->(f),
(g)-[:brother]->(f),
(g)-[:brother]->(h),
(f)-[:sister]->(g), (f)-[:sister]->(h)
Can anyone help me?
PS: if I run the same query, but with just one or two relationships (and fewer nodes in the match clause), it creates the relationships correctly.
What is wrong here?
First of all, as I mentioned in my comments, you don't have any labels. That's really bad practice, because labels let you match properties within a certain dataset (if you match on a "name" property, you don't want to match it on nodes that don't have a name; labels are there for that).
The second problem is that your query doesn't know how many nodes it will match until it runs. If you have 500 000 nodes with name: "Angela" and 500 000 nodes with name: "Carlo", you will create one relation from each Angela node to each Carlo node; that's quite a big query (500 000 * 500 000 relations to create, if my maths aren't bad). Cypher is giving you a warning for that.
Cypher shows you this warning because you aren't using unique properties to match your nodes; even with labels, you will still get the warning.
Solution?
Use unique properties to create and match your nodes, so you avoid the cartesian product (a sketch follows below).
Always use labels; Neo4j without labels is like using one giant table in SQL to store all of your data.
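A minimal sketch of both points, assuming a Person label and unique names, creating one relationship at a time from labelled, uniquely identified nodes:
MATCH (a:Person {name: "Angela"}), (b:Person {name: "Carlo"})
CREATE (a)-[:mother]->(b)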
If you want to know how your query will run, put PROFILE before your query to see its execution plan.
Does every single one of those name strings exist? If not, then you're not going to get any results, because it's all one big match. You could try changing it to a MERGE.
But Supamiu is right, you really should have a label (say Person) and an index on :Person(name).
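For reference, such an index can be created with the older syntax shown below (newer Neo4j versions use the CREATE INDEX ... FOR form instead):
CREATE INDEX ON :Person(name)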
We have a scenario where we display how pictures (or messages) spread to users through relationships.
For example: Relationship 1 of Node A has a message "Foo", Relationship 2 of Node 2 also has the same message "Foo" ... Relationship n of Node n also has the same message "Foo".
Now we are going to display a relationship graph by querying Neo4j.
This is my query:
MATCH (a)-[r1]-()-[r2]-()-[r3]-()-[r4]-()
WHERE a.id = '59072662'
and r2.message_id = r1.target_message_id
and r3.message_id = r2.target_message_id
and r4.message_id = r3.target_message_id
RETURN r1,r2,r3,r4
The problem is, this query does not work if there are only 2 levels of linking. If there are only r1 and r2, this query returns nothing.
Please tell me how to write a Cypher query that returns a set of relationships for my case.
Adding to Stefan's answer.
If you want to keep track of how pictures spread then you would also include a relationship to the image like:
(message)-[:INCLUDES]->(image)
If you want to see how a specific picture got spread in the message network:
MATCH (i:Image {url: "X"}), p=(recipient:User)<-[*]-(m:Message)<-[*]-(sender:User)
WHERE (m)-[:INCLUDES]->(i)
WITH length(p) AS length, sender
ORDER BY length
RETURN DISTINCT sender
This will return all senders, ordered by path length, so the top one should be the original sender.
If you're just interested in the original sender you could use LIMIT 1.
Alternatively, if you find yourself traversing huge networks and hitting performance issue because of the massive paths that have to be traversed, you could also add a relationship between the message and the original uploader.
The answer to the question you posted at the bottom, about the way to get a set of relationships in a variable length path:
You define a path, like in the example above
p=(recipient:User)<-[*]-(m:Message)<-[*]-(sender:User)
Then, to access the relationships in that path, you use the relationships function
RETURN relationships(p)
You didn't provide much detail on your use case. From my experience, I suggest that you rethink your graph data modelling.
A message seems to be a central concept in your domain. Therefore the message should probably be modelled as a node. To connect (a) and (b) via message (m), you might use something like (a)-[:SENT]->(m {message_id: ....})-[:TO]->(b).
Using this model, (m) could easily have a REFERS_TO relationship to another message, making the query above much more graphy.
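A sketch of what the query could then look like (the User and Message labels and the direction of REFERS_TO are assumptions here):
MATCH (a:User {id: '59072662'})-[:SENT]->(m:Message)
MATCH p = (m)-[:REFERS_TO*0..]->(:Message)
RETURN p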
My database contains about 300k nodes and 350k relationships.
My current query is:
start n=node(3) match p=(n)-[r:move*1..2]->(m) where all(r2 in relationships(p) where r2.GameID = STR(id(n))) return m;
The nodes touched in this query are all of the same kind; they are different positions in a game. Each of the relationships carries a property "GameID", which is used to identify the right relationship when traversing the graph along a path. So if you start traversing at a node and follow the relationships with the right GameID, there is exactly one path from that start node whose relationships all fit the GameID.
There are nodes that have hundreds of in and outgoing relationships, some others only have a few.
The problem is that I don't know how to tell Cypher to do this. The above query works for a depth of 1 or 2, but it should look like [r:move*] to return the whole path, which is about 20-200 hops.
But if I raise the values, the queries won't finish. I think that Cypher looks at every outgoing relationship at every single path depth from the start node, but as I already explained, there is only one right path. So it should do some kind of DFS instead of a BFS. Is there a way to do so?
I would consider configuring a relationship index for the GameID property. See http://docs.neo4j.org/chunked/milestone/auto-indexing.html#auto-indexing-config.
Once you have done that, you can try a query like the following (I have not tested this):
START n=node(3), r=relationship:rels(GameID = 3)
MATCH (n)-[r*1..]->(m)
RETURN m;
Such a query would limit the relationships considered by the MATCH clause to just the ones with the GameID you care about. And getting that initial collection of relationships would be fast, because of the indexing.
As an aside: since Neo4j reuses its internally generated IDs (for nodes that are deleted), storing those IDs as GameIDs will make your data unreliable (unless you never delete any such nodes). You may want to generate and use your own unique IDs, store them in your nodes, and use them for your GameIDs; if you do this, you should also create a uniqueness constraint for your own IDs -- this will, as a nice side effect, automatically create an index for your IDs.
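A sketch of such a constraint, using the older constraint syntax (Position and gameId are placeholder names for your node label and ID property):
CREATE CONSTRAINT ON (p:Position) ASSERT p.gameId IS UNIQUE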