I want to create a relationship between two nodes that have certain properties. The Neo4j query for this could be written as:
MATCH (x:User {username: "user2064000"}), (y:User {username: "user2064001"}) MERGE (x)-[:KNOWS]->(y)
While the query does have the intended effect, the Neo4j web console also warns about the query creating a cartesian product (and about them being slow).
How should I rewrite the above query in order to prevent a cartesian product?
This is just a warning, and in your case you don't need to worry about it, because the cartesian product you are building is 1 x 1 (I assume that you have a unique constraint on username).
The warning appears whenever a single MATCH clause describes two disjoint patterns.
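As a rough sketch of the two points above, assuming that usernames really are unique: the constraint name user_username_unique is a placeholder, and the constraint syntax shown is the one used by recent Neo4j releases (older releases use the form in the comment). Splitting the pattern into two MATCH clauses typically makes the browser stop raising the notification, although the plan is still the same harmless 1 x 1 product.

```cypher
// Assumption: username is meant to be unique per User node.
// Older releases: CREATE CONSTRAINT ON (u:User) ASSERT u.username IS UNIQUE
CREATE CONSTRAINT user_username_unique
FOR (u:User) REQUIRE u.username IS UNIQUE;

// Same effect as the original query, with each node matched in its own clause.
MATCH (x:User {username: "user2064000"})
MATCH (y:User {username: "user2064001"})
MERGE (x)-[:KNOWS]->(y);
```

With the constraint in place, each MATCH is an index lookup that returns at most one node, so the MERGE runs at most once.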
Cheers.
Related
Let's start from a simple query that finds all coworkers recursively.
match (user:User {username: 'John'})
match (user)-[:WORKED_WITH *1..]-(coworkers:User)
return user, coworkers
Now I have to modify it so that I receive only those users that are connected through the first N relationships.
Every User node has a value N in its properties, and every relationship has its creation date in its properties.
I suppose it could be reasonable to create and maintain a separate set of relationships that satisfies this condition.
UPD: The limitation has to be applied only to those who know each other directly.
The limitation has to be applied to each node in the path. For example, if the first user has 3 :WORKED_WITH relationships (on the first level) and a limit of 5, then everything is OK and we can continue to check the connected users; if a user has 6 relationships and a limit of 5, only 5 of the relationships may be used to move on.
I understand that this can be a slow query, but how can I do it without hand-written tools? One improvement would be to move all of these limitations out of query execution into a preprocessing step and create an additional relationship type that encodes them. That would require validation, because those relationships are not part of the state but a projection of the state.
The following query should work (as long as you do not have a lot of data). It uses DISTINCT to remove duplicates.
MATCH (user:User {username: 'John'})-[:WORKED_WITH*]-(coworker:User)
WITH DISTINCT user, coworker
ORDER BY coworker.createDate
RETURN COLLECT(coworker)[0..user.N] AS coworkers;
Note: since variable-length paths have exponential complexity, you would usually want to specify a reasonable upper bound (e.g., [:WORKED_WITH*..5]) to avoid the query running too long or causing an out-of-memory error. Also, since the LIMIT operator does not accept a variable as its argument, this query uses COLLECT(coworker)[0..user.N] to get the N coworkers with the earliest createDate -- which is also a bit expensive.
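For illustration, here is a bounded variant of the query, using the upper bound of 5 from the note. The intermediate WITH that collects into a list named ordered is my own addition, not part of the original answer: some newer Neo4j versions reject a projection that mixes an aggregate such as COLLECT with a non-grouped expression such as user.N, and collecting first with user as the grouping key sidesteps that.

```cypher
MATCH (user:User {username: 'John'})-[:WORKED_WITH*..5]-(coworker:User)
WITH DISTINCT user, coworker
ORDER BY coworker.createDate
// Collect per user first, so that user is an explicit grouping key.
WITH user, COLLECT(coworker) AS ordered
// Slice the ordered list down to the first N entries.
RETURN ordered[0..user.N] AS coworkers;
```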
Now, if (as you suggested) you had created a specific type of relationship (e.g., FIRST_WORKED_WITH) between each User and its N earliest "coworkers", that would allow you to use the following trivial and fast query:
MATCH (user:User {username: 'John'})-[:FIRST_WORKED_WITH]->(coworker:User)
RETURN coworker;
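A possible preprocessing step that builds those relationships might look like the sketch below. It assumes the creation date is stored on each WORKED_WITH relationship under the property name createDate (the question says the date lives on the relationship, but the property name is a guess) and that N is stored on each User node.

```cypher
// For every user, order the direct coworkers by relationship creation date.
MATCH (user:User)-[r:WORKED_WITH]-(coworker:User)
WITH user, coworker, r
ORDER BY r.createDate
WITH user, COLLECT(coworker) AS coworkers
// Keep only the first N and materialise them as FIRST_WORKED_WITH.
UNWIND coworkers[0..user.N] AS firstCoworker
MERGE (user)-[:FIRST_WORKED_WITH]->(firstCoworker);
```

Because these relationships are a projection of the WORKED_WITH data, they would need to be rebuilt or revalidated whenever a WORKED_WITH relationship is added or removed, or when a user's N changes. If two users can be linked by more than one WORKED_WITH relationship, the same coworker can appear twice in the list, so it may need deduplicating first.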
I am trying to build a Cypher statement for Neo4j where I know 2 to n starting nodes by name and need to find a node (if any) that can be reached from all of the starting nodes.
At first I thought it was similar to the "Mutual Friend" situation, which could be handled like
(start1)-[*..2]->(main)<-[*..2]-(start2)
but in my case I often have more than 2 starting points, up to around 6, that I know by name.
So basically I am puzzled by how I can include the third, fourth, and further nodes in the Cypher query to find a common root among them.
In the above example from the Neo4j website, I would need a path starting with 'Dilshad', 'Becky' and 'Cesar' to check whether they have a common friend (Anders), excluding 'Filipa' and 'Emil' as they are not friends of all three.
So far I would create a statement programmatically that looks like
MATCH (start1 {name:'Person1'}), (start2 {name:'Person2'}),
(start3 {name: 'Person3'}), (main)
WHERE (start1)-[*..2]->(main) AND
(start2)-[*..2]->(main) AND
(start3)-[*..2]->(main) RETURN distinct main
But I was wondering whether there is a more elegant or efficient way in Cypher, possibly one where I could pass the list of names as a parameter.
The query shown in your question is building a cartesian product because you are matching multiple disconnected patterns.
Instead of matching all nodes separately and using WHERE to restrict the relationships between them, you can do something like:
MATCH (start1 {name:'Person1'})-[*..2]->(main),
(start2 {name:'Person2'})-[*..2]->(main),
(start3 {name: 'Person3'})-[*..2]->(main)
RETURN main
The above query will be more efficient because it matches only the required pattern. Note that when you write MATCH (start1 {name:'Person1'}), (start2 {name:'Person2'}), (start3 {name: 'Person3'}), (main), the part (main) matches every node in your graph, because no restrictions are specified for it. You can run your query with PROFILE to see this more clearly.
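For the part of the question about passing the names as a list, one common approach is to match from any of the named start nodes and keep only the targets reached by all of them. This is a sketch rather than a tested drop-in: $names is a hypothetical parameter holding the start names, and it assumes the names in the list are distinct and each identifies exactly one node.

```cypher
// $names is an assumed parameter, e.g. ['Person1', 'Person2', 'Person3'].
MATCH (start)-[*..2]->(main)
WHERE start.name IN $names
// Count how many different start nodes reach each candidate.
WITH main, COUNT(DISTINCT start) AS reached
WHERE reached = size($names)
RETURN main;
```

In the Neo4j Browser the parameter can be set beforehand with :param names => ['Person1', 'Person2', 'Person3']. Adding a label to start and an index on its name property should help the planner find the start nodes without scanning the whole graph.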
I'm defining the relationship between two entities, Gene and Chromosome, in what I think is the simple and normal way, after importing the data from CSV:
MATCH (g:Gene),(c:Chromosome)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
Yet, when I do so, neo4j (browser UI) complains:
This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (c)).
I don't see what the issue is. chromosomeID is a very straightforward foreign key.
The browser is telling you that:
It is handling your query by doing a comparison between every Gene instance and every Chromosome instance. If your DB has G genes and C chromosomes, then the complexity of the query is O(GC). For instance, if we are working with the human genome, there are 46 chromosomes and maybe 25000 genes, so the DB would have to do 1150000 comparisons.
You might be able to improve the complexity (and performance) by altering your query. For example, if we created an index on :Gene(chromosomeID), and altered the query so that we initially matched just on the node with the smallest cardinality (the 46 chromosomes), we would only do O(G) (or 25000) "comparisons" -- and those comparisons would actually be quick index lookups! This approach should be much faster.
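Creating the index mentioned above could look like this. The index name gene_chromosome_id is a placeholder, and the exact syntax depends on the Neo4j version, so check it against the documentation for your release.

```cypher
// Neo4j 4.x and later (named index).
CREATE INDEX gene_chromosome_id FOR (g:Gene) ON (g.chromosomeID);

// Older 3.x releases use this form instead:
// CREATE INDEX ON :Gene(chromosomeID);
```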
Once we have created the index, we can use this query:
MATCH (c:Chromosome)
WITH c
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
It uses a WITH clause to force the first MATCH clause to execute first, avoiding the cartesian product. The second MATCH (and WHERE) clause uses the results of the first MATCH clause and the index to quickly get the exact genes that belong to each chromosome.
[UPDATE]
The WITH clause was helpful when this answer was originally written. The Cypher planner in newer versions of Neo4j (like 4.0.3) now generates the same plan even if the WITH is omitted, and without creating a cartesian product. You can always PROFILE both versions of your query to see the effect with/without the WITH.
As logisima mentions in the comments, this is just a warning. Matching a cartesian product is slow. In your case it should be OK, since you want to connect previously unconnected Gene and Chromosome nodes and you know the size of the cartesian product. There are not too many chromosomes and a smallish number of genes. If you were to MATCH, e.g., genes against proteins, the query might blow up.
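One way to compare plans without writing anything is to profile a read-only version of the join. The variant below replaces CREATE with a count, which is my substitution, so the plan may differ slightly from that of the writing query.

```cypher
PROFILE
MATCH (c:Chromosome)
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
// Count the pairs instead of creating relationships.
RETURN count(*) AS pairs;
```

If the resulting plan shows an index seek on :Gene(chromosomeID) rather than a CartesianProduct operator followed by a filter, the index is being used.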
I think the warning is intended for other problematic queries:
if you MATCH a cartesian product but you don't know if there is a relationship you could use OPTIONAL MATCH
if you want to MATCH both a Gene and a Chromosome without any relationships, you should split up the query
In case your query takes too long or does not finish, here is another question giving some hints how to optimize cartesian products: How to optimize Neo4j Cypher queries with multiple node matches (Cartesian Product)
When I write a query to add relationships to existing nodes, I keep getting this warning:
"This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (e))"
If I run the query, it creates no relationships.
The query is:
match
(a{name:"Angela"}),
(b{name:"Carlo"}),
(c{name:"Andrea"}),
(d{name:"Patrizia"}),
(e{name:"Paolo"}),
(f{name:"Roberta"}),
(g{name:"Marco"}),
(h{name:"Susanna"}),
(i{name:"Laura"}),
(l{name:"Giuseppe"})
create
(a)-[:mother]->(b),
(a)-[:grandmother]->(c), (e)-[:grandfather]->(c), (i)-[:grandfather]->(c), (l)-[:grandmother]->(c),
(b)-[:father]->(c),
(e)-[:father]->(b),
(l)-[:father]->(d),
(i)-[:mother]->(d),
(d)-[:mother]->(c),
(c)-[:boyfriend]->(f),
(g)-[:brother]->(f),
(g)-[:brother]->(h),
(f)-[:sister]->(g), (f)-[:sister]->(h)
Can anyone help me?
PS: if I run the same query, but with just one or two relationships (and fewer nodes in the match clause), it creates the relationships correctly.
What is wrong here?
First of all, as I mentioned in my comments, you don't have any labels. That is a really bad practice, because labels let you match properties within a certain set of nodes (if you match on the "name" property, you don't want to test nodes that don't have a name; labels are there for that).
The second problem is that your query doesn't know how many nodes it will get before it runs. This means that if you have 500 000 nodes with name "Angela" and 500 000 nodes with name "Carlo", you will create one relationship from each Angela node to each Carlo node. That's quite a big query (500 000 * 500 000 relationships to create, if my maths is right). Cypher is giving you a warning for that.
Cypher will still show this warning as long as you aren't using unique properties to match your nodes; even with labels, you will still get the warning.
Solution?
Use unique properties to create and match your nodes, so you avoid cartesian product.
Always use labels, Neo4j without labels is like using one giant table in SQL to store all of your data.
If you want to know how your query will run, put PROFILE before it to see the execution plan.
Does every single one of those name strings exist? If not then you're not going to get any results because it's all one big match. You could try changing it to a MERGE.
But Supamiu is right, you really should have a label (say Person) and an index on :Person(name).
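To check whether every name actually exists, a small diagnostic query along these lines could help. It is only a sketch; it uses the names from the question and no label, since the nodes in the question have none.

```cypher
UNWIND ["Angela", "Carlo", "Andrea", "Patrizia", "Paolo",
        "Roberta", "Marco", "Susanna", "Laura", "Giuseppe"] AS name
// OPTIONAL MATCH keeps the row even when no node has this name.
OPTIONAL MATCH (p {name: name})
RETURN name, count(p) AS matches
ORDER BY matches;
```

Any name with 0 matches makes the whole original MATCH return nothing, so no relationships are created; a name with more than one match multiplies the relationships created. Once the nodes carry a label (Person here is the hypothetical label suggested above), each relationship can be created from its own pair of lookups:

```cypher
// Assumes the nodes have been given the :Person label.
MATCH (a:Person {name: "Angela"})
MATCH (b:Person {name: "Carlo"})
MERGE (a)-[:mother]->(b);
```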