Relationship between two nodes not being formed in Neo4j [duplicate] - neo4j

I'm defining the relationship between two entities, Gene and Chromosome, in what I think is the simple and normal way, after importing the data from CSV:
MATCH (g:Gene),(c:Chromosome)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
Yet, when I do so, neo4j (browser UI) complains:
This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (c)).
I don't see what the issue is. chromosomeID is a very straightforward foreign key.

The browser is telling you that:
It is handling your query by doing a comparison between every Gene instance and every Chromosome instance. If your DB has G genes and C chromosomes, then the complexity of the query is O(GC). For instance, if we are working with the human genome, there are 46 chromosomes and maybe 25000 genes, so the DB would have to do 1150000 comparisons.
You might be able to improve the complexity (and performance) by altering your query. For example, if we created an index on :Gene(chromosomeID), and altered the query so that we initially matched just on the node with the smallest cardinality (the 46 chromosomes), we would only do O(G) (or 25000) "comparisons" -- and those comparisons would actually be quick index lookups! This is approach should be much faster.
Once we have created the index, we can use this query:
MATCH (c:Chromosome)
WITH c
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
It uses a WITH clause to force the first MATCH clause to execute first, avoiding the cartesian product. The second MATCH (and WHERE) clause uses the results of the first MATCH clause and the index to quickly get the exact genes that belong to each chromosome.
[UPDATE]
The WITH clause was helpful when this answer was originally written. The Cypher planner in newer versions of neo4j (like 4.0.3) now generate the same plan even if the WITH is omitted, and without creating a cartesian product. You can always PROFILE both versions of your query to see the effect with/without the WITH.

As logisima mentions in the comments, this is just a warning. Matching a cartesian product is slow. In your case it should be OK since you want to connect previously unconnected Gene and Chromosome nodes and you know the size of the cartesian product. There are not too many chromosomes and a smallish number of genes. If you would MATCH e.g. genes on proteins the query might blow.
I think the warning is intended for other problematic queries:
if you MATCH a cartesian product but you don't know if there is a relationship you could use OPTIONAL MATCH
if you want to MATCH both a Gene and a Chromosome without any relationships, you should split up the query
In case your query takes too long or does not finish, here is another question giving some hints how to optimize cartesian products: How to optimize Neo4j Cypher queries with multiple node matches (Cartesian Product)

Related

neo4j for fraud detection - efficient data structure

I'm trying to improve a fraud detection system for a commerce website. We deal with direct bank transactions, so fraud is a risk we need to manage. I recently learned of graphing databases and can see how it applies to these problems. So, over the past couple of days I set up neo4j and parsed our data into it: example
My intuition was to create a node for each order, and a node for each piece of data associated with it, and then connect them all together. Like this:
MATCH (w:Wallet),(i:Ip),(e:Email),(o:Order)
WHERE w.wallet="ex" AND i.ip="ex" AND e.email="ex" AND o.refcode="ex"
CREATE (w)-[:USED]->(o),(i)-[:USED]->(o),(e)-[:USED]->(o)
But this query runs very slowly as the database size increases (I assume because it needs to search the whole data set for the nodes I'm asking for). It also takes a long time to run a query like this:
START a=node(179)
MATCH (a)-[:USED*]-(d)
WHERE EXISTS(d.refcode)
RETURN distinct d
This is intended to extract all orders that are connected to a starting point. I'm very new to Cypher (<24 hours), and I'm finding it particularly difficult to search for solutions.
Are there any specific issues with the data structure or queries that I can address to improve performance? It ideally needs to complete this kind of thing within a few seconds, as I'd expect from a SQL database. At this time we have about 17,000 nodes.
Always a good idea to completely read through the developers manual.
For speeding up lookups of nodes by a property, you definitely need to create indexes or unique constraints (depending on if the property should be unique to a label/value).
Once you've created the indexes and constraints you need, they'll be used under the hood by your query to speed up your matches.
START is only used for legacy indexes, and for the latest Neo4j versions you should use MATCH instead. If you're matching based upon an internal id, you can use MATCH (n) WHERE id(n) = xxx.
Keep in mind that you should not persist node ids outside of Neo4j for lookup in future queries, as internal node ids can be reused as nodes are deleted and created, so an id that once referred to a node that was deleted may later end up pointing to a completely different node.
Using labels in your queries should help your performance. In the query you gave to find orders, Neo4j must inspect every end node in your path to see if the property exists. Property access tends to be expensive, especially when you're using a variable-length match, so it's better to restrict the nodes you want by label.
MATCH (a)-[:USED*]-(d:Order)
WHERE id(a) = 179
RETURN distinct d
On larger graphs, the variable-length match might start slowing down, so you may get more performance by installing APOC Procedures and using the Path Expander procedure to gather all subgraph nodes and filter down to just Order nodes.
MATCH (a)
WHERE id(a) = 179
CALL apoc.path.expandConfig(a, {bfs:true, uniqueness:"NODE_GLOBAL"}) YIELD path
RETURN LAST(NODES(path)) as d
WHERE d:Order

Cypher query slow when intermediate node labels are specified

I have the following Cypher query
MATCH (p1:`Article` {article_id:'1234'})--(a1:`Author` {name:'Jones, P'})
MATCH (p2:`Article` {article_id:'5678'})--(a2:`Author` {name:'Jones, P'})
MATCH (p1)-[:WRITTEN_BY]->(c1:`Author`)-[h1:HAS_NAME]->(l1)
MATCH (p2)-[:WRITTEN_BY]->(c2:`Author`)-[h2:HAS_NAME]->(l2)
WHERE l1=l2 AND c1<>a1 AND c2<>a2
RETURN c1.FullName, c2.FullName, h1.distance + h2.distance
On my local Neo4j server, running this query takes ~4 seconds and PROFILE shows >3 million db hits. If I don't specify the Author label on c1 and c2 (it's redundant thanks to the relationship labels), the same query returns the same output in 33ms, and PROFILE shows <200 db hits.
When I run the same two queries on a larger version of the same database that's hosted on a remote server, this difference in performance vanishes.
Both dbs have the same constraints and indexes. Any ideas what else might be going wrong?
Your query has a lot of unnecessary stuff in it, so first off, here's a cleaner version of it that is less likely to get misinterpreted by the planner:
MATCH (name:Name) WHERE NOT name.name = 'Jones, P'
WITH name
MATCH (:`Article` {article_id:'1234'})-[:WRITTEN_BY]->()-[h1:HAS_NAME]->(name)<-[h2:HAS_NAME]-()<-[:WRITTEN_BY]-(:`Article` {article_id:'5678'})
RETURN name.name, h1.distance + h2.distance
There's really only one path you want to find, and you want to find it for any author whose name is not Jones, P. Take advantage of your shared :Name nodes to start your query with the smallest set of definite points and expand paths from there. You are generating a massive cartesian product by stacking all those MATCH statements and then filtering them out.
As for the difference in query performance, it appears that the query planner is trying to use the Author label to build your 3rd and 4th paths, whereas if you leave it out, the planner will only touch the much narrower set of :Articles (fixed by indexed property), then expand relationships through the (incidentally very small) set of nodes that have -[:WRITTEN_BY]-> relationships, and then the (also incidentally very small) set of those nodes that have a -[:HAS_NAME]-> relationship. That decision is based partly on the predictable size of the various sets, so if you have a different number of :Author nodes on the server, the planner will make a smarter choice and not use them.

Neo4j: error in adding relationships among existing nodes

When writing a query to add relationships to existing nodes, it keeps me warning with this message:
"This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (e))"
If I run the query, it creates no relationships.
The query is:
match
(a{name:"Angela"}),
(b{name:"Carlo"}),
(c{name:"Andrea"}),
(d{name:"Patrizia"}),
(e{name:"Paolo"}),
(f{name:"Roberta"}),
(g{name:"Marco"}),
(h{name:"Susanna"}),
(i{name:"Laura"}),
(l{name:"Giuseppe"})
create
(a)-[:mother]->(b),
(a)-[:grandmother]->(c), (e)-[:grandfather]->(c), (i)-[:grandfather]->(c), (l)-[:grandmother]->(c),
(b)-[:father]->(c),
(e)-[:father]->(b),
(l)-[:father]->(d),
(i)-[:mother]->(d),
(d)-[:mother]->(c),
(c)-[:boyfriend]->(f),
(g)-[:brother]->(f),
(g)-[:brother]->(h),
(f)-[:sister]->(g), (f)-[:sister]->(h)
Can anyone help me?
PS: if I run the same query, but with just one or two relationships (and less nodes in the match clause), it creates the relationships correctly.
What is wrong here?
First of all, as I mentionned in my comments, you don't have any Labels, it's a really bad practice because Labels are useful to match properties in a certains dataset (if you match "name" property, you don't want to match it on a node who doesn't have a name, Labels are here for that.
The second problem is that your query doesn't know how many nodes it will get before it does. It means that if you have 500 000 nodes having name : "Angela" and 500 000 nodes having name : "Carlo", you will create one relation from each Angela node, going on each Carlo, that's quite a big query (500 000 * 500 000 relations to create if my maths aren't bad). Cypher is giving you a warning for that.
Cypher will still tell you this warning because you aren't using Unique properties to match your nodes, even with Labels, you will still have the warning.
Solution?
Use unique properties to create and match your nodes, so you avoid cartesian product.
Always use labels, Neo4j without labels is like using one giant table in SQL to store all of your data.
If you want to know how your query will run, use PROFILE before your query, here is the profile plan for your query:
Does every single one of those name strings exist? If not then you're not going to get any results because it's all one big match. You could try changing it to a MERGE.
But Supamiu is right, you really should have a label (say Person) and an index on :Person(name).

Why does neo4j warn: "This query builds a cartesian product between disconnected patterns"?

I'm defining the relationship between two entities, Gene and Chromosome, in what I think is the simple and normal way, after importing the data from CSV:
MATCH (g:Gene),(c:Chromosome)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
Yet, when I do so, neo4j (browser UI) complains:
This query builds a cartesian product between disconnected patterns.
If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (c)).
I don't see what the issue is. chromosomeID is a very straightforward foreign key.
The browser is telling you that:
It is handling your query by doing a comparison between every Gene instance and every Chromosome instance. If your DB has G genes and C chromosomes, then the complexity of the query is O(GC). For instance, if we are working with the human genome, there are 46 chromosomes and maybe 25000 genes, so the DB would have to do 1150000 comparisons.
You might be able to improve the complexity (and performance) by altering your query. For example, if we created an index on :Gene(chromosomeID), and altered the query so that we initially matched just on the node with the smallest cardinality (the 46 chromosomes), we would only do O(G) (or 25000) "comparisons" -- and those comparisons would actually be quick index lookups! This is approach should be much faster.
Once we have created the index, we can use this query:
MATCH (c:Chromosome)
WITH c
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
It uses a WITH clause to force the first MATCH clause to execute first, avoiding the cartesian product. The second MATCH (and WHERE) clause uses the results of the first MATCH clause and the index to quickly get the exact genes that belong to each chromosome.
[UPDATE]
The WITH clause was helpful when this answer was originally written. The Cypher planner in newer versions of neo4j (like 4.0.3) now generate the same plan even if the WITH is omitted, and without creating a cartesian product. You can always PROFILE both versions of your query to see the effect with/without the WITH.
As logisima mentions in the comments, this is just a warning. Matching a cartesian product is slow. In your case it should be OK since you want to connect previously unconnected Gene and Chromosome nodes and you know the size of the cartesian product. There are not too many chromosomes and a smallish number of genes. If you would MATCH e.g. genes on proteins the query might blow.
I think the warning is intended for other problematic queries:
if you MATCH a cartesian product but you don't know if there is a relationship you could use OPTIONAL MATCH
if you want to MATCH both a Gene and a Chromosome without any relationships, you should split up the query
In case your query takes too long or does not finish, here is another question giving some hints how to optimize cartesian products: How to optimize Neo4j Cypher queries with multiple node matches (Cartesian Product)

Cypher query optimisation - Utilising known properties of nodes

Setup:
Neo4j and Cypher version 2.2.0.
I'm querying Neo4j as an in-memory instance in Eclipse created TestGraphDatabaseFactory().newImpermanentDatabase();.
I'm using this approach as it seems faster than the embedded version and I assume it has the same functionality.
My graph database is randomly generated programmatically with varying numbers of nodes.
Background:
I generate cypher queries automatically. These queries are used to try and identify a single 'target' node. I can limit the possible matches of the queries by using known 'node' properties. I only use a 'name' property in this case. If there is a known name for a node, I can use it to find the node id and use this in the start clause. As well as known names, I also know (for some nodes) if there are names known not to belong to a node. I specify this in the where clause.
The sorts of queries that I am running look like this...
START
nvari = node(5)
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvari:C4)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION),
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION),
WHERE
NOT(nvarj.Name IN ['nf']) AND NOT(nvarm.Name IN ['nb','nj'])
RETURN DISTINCT target
Another way to think about this (if it helps), is that this is an isomorphism testing problem where we have some information about how nodes in a query and target graph correspond to each other based on restrictions on labels.
Question:
With regards to optimisation:
Would it help to include relation variables in the match clause? I took them out because the node variables are sufficient to distinguish between relationships but this might slow it down?
Should I restructure the match clause to have match/where couples including the where clauses from my previous example first? My expectation is that they can limit possible bindings early on. For example...
START
nvari = node(5)
MATCH
(nvarj:C2)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarj.Name IN ['nf'])
MATCH
(nvarm:C3)-[:IN_LOCATION]->(nvarg:LOCATION)
WHERE NOT(nvarm.Name IN ['nb','nj'])
MATCH
(target:C5)-[:IN_LOCATION]->(nvara:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarb:LOCATION),
(nvara:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvarc:LOCATION),
(nvard:LOCATION)-[:CONNECTED]->(nvare:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarf:LOCATION),
(nvarg:LOCATION)-[:CONNECTED]->(nvarh:LOCATION),
(nvare:LOCATION)-[:CONNECTED]->(nvark:LOCATION)
RETURN DISTINCT target
On the side:
(Less important but still an interest) If I make each relationship in a match clause an optional match except for relationships containing the target node, would cypher essentially be finding a maximum common sub-graph between the query and the graph data base with the constraint that the MCS contains the target node?
Thanks a lot in advance! I hope I have made my requirements clear but I appreciate that this is not a typical use-case for Neo4j.
I think querying with node properties is almost always preferable to using relationship properties (if you had a choice), as that opens up the possibility that indexing can help speed up the query.
As an aside, I would avoid using the IN operator if the collection of possible values only has a single element. For example, this snippet: NOT(nvarj.Name IN ['nf']), should be (nvarj.Name <> 'nf'). The current versions of Cypher might not use an index for the IN operator.
Restructuring a query to eliminate undesirable bindings earlier is exactly what you should be doing.
First of all, you would need to keep using MATCH for at least the first relationship in your query (which binds target), or else your result would contain a lot of null rows -- not very useful.
But, thinking clearly about this, if all the other relationships were placed in separate OPTIONAl MATCH clauses, you'd be essentially saying that you want a match even if none of the optional matches succeeded. Therefore, the logical equivalent would be:
MATCH (target:C5)-[:IN_LOCATION]->(nvara:LOCATION)
RETURN DISTINCT target
I don't think this is a useful result.

Resources