Confused about MERGE sometimes creating duplicate relationship [duplicate]

My database model has users and MAC addresses. A user can have multiple MAC addresses, but a MAC can only belong to one user. If some user sets his MAC and that MAC is already linked to another user, the existing relationship is removed and a new relationship is created between the new owner and that MAC. In other words, a MAC moves between users.
This is a particular instance of the Cypher query I'm using to assign MAC addresses:
MATCH (new:User { Id: 2 })
MERGE (mac:MacAddress { Value: "D857EFEF1CF6" })
WITH new, mac
OPTIONAL MATCH ()-[oldr:MAC_ADDRESS]->(mac)
DELETE oldr
MERGE (new)-[:MAC_ADDRESS]->(mac)
The query runs fine in my tests, but in production, for some strange reason it sometimes creates duplicate MacAddress nodes (and a new relationship between the user and each of those nodes). That is, a particular user can have multiple MacAddress nodes with the same Value.
I can tell they are different nodes because they have different node IDs. I'm also sure the Values are exactly the same because I can do a collect(distinct mac.Value) on them and the result is a collection with one element. The query above is the only one in the code that creates MacAddress nodes.
I'm using Neo4j 2.1.2. What's going on here?
Thanks,
Jan

Are you sure this is the entirety of the queries you're running? MERGE has this really common pitfall where it merges everything that you give it. So here's what people expect:
neo4j-sh (?)$ MERGE (mac:MacAddress { Value: "D857EFEF1CF6" });
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 1
Properties set: 1
Labels added: 1
1650 ms
neo4j-sh (?)$ MERGE (mac:MacAddress { Value: "D857EFEF1CF6" });
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
17 ms
neo4j-sh (?)$ match (mac:MacAddress { Value: "D857EFEF1CF6" }) return count(mac);
+------------+
| count(mac) |
+------------+
| 1 |
+------------+
1 row
200 ms
So far, so good. That's what we expect. Now watch this:
neo4j-sh (?)$ MERGE (mac:MacAddress { Value: "D857EFEF1CF6" })-[r:foo]->(b:SomeNode {label: "Foo!"});
+-------------------+
| No data returned. |
+-------------------+
Nodes created: 2
Relationships created: 1
Properties set: 2
Labels added: 2
178 ms
neo4j-sh (?)$ match (mac:MacAddress { Value: "D857EFEF1CF6" }) return count(mac);
+------------+
| count(mac) |
+------------+
| 2 |
+------------+
1 row
2 ms
Wait, WTF happened here? We specified only the same MAC address again, so why was a duplicate created?
The documentation on MERGE specifies that "MERGE will not partially use existing patterns — it's all or nothing. If partial matches are needed, this can be accomplished by splitting a pattern up into multiple MERGE clauses". So when we run this path MERGE, the whole path doesn't already exist, and Cypher therefore creates everything in it, including a duplicate MacAddress node.
There are frequently questions about duplicated nodes created by MERGE, and 99 times out of 100, this is what's going on.
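Following the documentation's advice, the fix is to split the pattern into multiple MERGE clauses, so each node is matched or created on its own before the relationship is merged. A minimal sketch for the example above:
MERGE (mac:MacAddress { Value: "D857EFEF1CF6" })
MERGE (b:SomeNode { label: "Foo!" })
MERGE (mac)-[r:foo]->(b);
This way the existing MacAddress node is reused, and only the missing parts of the pattern are created.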

This is the response I got back from Neo4j's support (emphasis mine):
I got some feedback from our team already, and it's currently known that this can happen in the absence of a constraint. MERGE is effectively MATCH or CREATE - and those two steps are run independently within the transaction. Given concurrent execution, and the "read committed" isolation level, there's a race condition between the two.
The team have done some discussion on how to provide a higher guarantee in the face of concurrency, and do have it noted as a feature request for consideration.
Meanwhile, they've assured me that using a constraint will provide the uniqueness you're looking for.
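For the data model in this question, that would be a uniqueness constraint on MacAddress.Value, shown here as a sketch in the 2.x syntax:
CREATE CONSTRAINT ON (mac:MacAddress) ASSERT mac.Value IS UNIQUE;
With the constraint in place, concurrent MERGEs on the same Value take locks on the constraint's index instead of racing, so duplicates can no longer be committed.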

Related

Performance of the transitive closure query in Neo4j

I am trying to compute the transitive closure of an undirected graph in Neo4j using the following Cypher Query ("E" is the label that every edge of the graph has):
MATCH (a) -[:E*]- (b) WHERE ID(a) < ID(b) RETURN DISTINCT a, b
I tried to execute this query on a graph with 10k nodes and around 150k edges, but even after 8 hours it did not finish. I find this surprising, because even the most naive SQL solutions are much faster, and I expected that Neo4j would be much more efficient for this kind of standard graph query. So is there something I am missing, maybe some tuning of the Neo4j server or a better way to write the query?
Edit
Here is the result of EXPLAINing the above query:
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
908 ms
Compiler CYPHER 3.3
Planner COST
Runtime INTERPRETED
+-----------------------+----------------+------------------+--------------------------------+
| Operator | Estimated Rows | Variables | Other |
+-----------------------+----------------+------------------+--------------------------------+
| +ProduceResults | 14069 | a, b | |
| | +----------------+------------------+--------------------------------+
| +Distinct | 14069 | a, b | a, b |
| | +----------------+------------------+--------------------------------+
| +Filter | 14809 | anon[11], a, b | ID(a) < ID(b) |
| | +----------------+------------------+--------------------------------+
| +VarLengthExpand(All) | 49364 | anon[11], b -- a | (a)-[:E*]-(b) |
| | +----------------+------------------+--------------------------------+
| +AllNodesScan | 40012 | a | |
+-----------------------+----------------+------------------+--------------------------------+
Total database accesses: ?
You can limit the search to a single direction, but that requires the graph to be directed.
After doing some testing and profiling of my own, I found that even for very small data sets (randomly generated sets of 10 nodes with 2 random edges each), restricting the query to a single direction cut database hits by roughly four orders of magnitude (from 2,266,909 to 149).
Adding a direction to your query cuts down the search space by a great deal, but again, it only works if the graph is directed; a sketch of the directed variant follows.
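This is the same query with the variable-length pattern made directional, so it only finds pairs reachable by following edges in their stored direction:
MATCH (a)-[:E*]->(b) WHERE ID(a) < ID(b) RETURN DISTINCT a, b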
I also tried simply adding a reverse relationship for each directed one, to see if that would give similar performance. It did not; the query had not completed after 5 minutes, at which point I killed it.
Unfortunately, you are not doing anything wrong, but your query is massive.
Neo4J being a graph database does not mean that all mathematical operations involving graphs will be extremely fast; they are still subject to performance constraints, up to and including the transitive closure operation.
The query you have written is an unbounded path search for every single pair of nodes. The node pairs are bounded, but not in a very meaningful way: the bound of ID(a) < ID(b) just means that the search only needs to be done one way, while the number of paths the search has to explore still grows factorially with the size of the graph.
And the result can only be pared down after every single path is checked. Searching for the entire transitive closure of a graph of the size you specified will be extremely expensive performance-wise.
The SQL that you posted is not performing the same operation.
You mentioned in the comments that you tried this query on a relational table in recursive form:
WITH RECURSIVE temp_tc AS (
    SELECT v AS a, v AS b FROM nodes
  UNION
    SELECT a, b FROM edges
  UNION
    SELECT t.a, g.b FROM temp_tc t, edges g WHERE t.b = g.a
)
SELECT a, b FROM temp_tc;
I should note that this query is not doing the same work that Neo4j does when it tries to find all paths. Before Neo4j can start to pare down your results, it must generate a result set that consists of every single path in the entire graph.
The recursive SQL query does not do that; it starts from the list of links, and the recursion removes any potential duplicate links as it goes, discovering new links while searching for others. For example, if the graph looks like (A)-(B)-(C), that query will find that B connects to C in the process of finding that A connects to C.
With Neo4j, every path must be discovered separately.
If this is your general use-case, it is possible that Neo4J is not a good choice if speed is a concern.

What is the limitation of the Neo4j community edition in terms of data storage (i.e. number of records)?

I am trying to load 500,000 nodes, but the query does not execute successfully. Can anyone tell me the limit on the number of nodes in the Neo4j community edition database?
I am running this query:
result = session.run("""
    USING PERIODIC COMMIT 10000
    LOAD CSV WITH HEADERS FROM "file:///relationships.csv" AS row
    MERGE (s:Start {ac: row.START})
      ON CREATE SET s.START = row.START
    MERGE (e:End {en: row.END})
      ON CREATE SET e.END = row.END
    FOREACH (_ IN CASE row.TYPE WHEN "PAID" THEN [1] ELSE [] END |
      MERGE (s)-[:PAID {cr: row.CREDIT}]->(e))
    FOREACH (_ IN CASE row.TYPE WHEN "UNPAID" THEN [1] ELSE [] END |
      MERGE (s)-[:UNPAID {db: row.DEBIT}]->(e))
    RETURN s.START AS index, count(e) AS connections
    ORDER BY connections DESC
    """)
I don't think the community edition is more limited than the enterprise edition in that regard, and most of the limits have been removed in 3.0.
Anyway, I can easily create a million nodes (in one transaction):
neo4j-sh (?)$ unwind range(1, 1000000) as i create (n:Node) return count(n);
+----------+
| count(n) |
+----------+
| 1000000 |
+----------+
1 row
Nodes created: 1000000
Labels added: 1000000
3495 ms
Running that 10 times, I've definitely created 10 million nodes:
neo4j-sh (?)$ match (n) return count(n);
+----------+
| count(n) |
+----------+
| 10000000 |
+----------+
1 row
3 ms
Your problem is most likely related to the size of the transaction: if it's too large, it can result in an OutOfMemory error, and before that it can slow the instance to a crawl because of all the garbage collection. Split the node creation into smaller batches, e.g. with USING PERIODIC COMMIT if you use LOAD CSV.
Update:
Your query already includes USING PERIODIC COMMIT and only creates 2 nodes and 1 relationship per line of the CSV file, so the problem most likely has to do with the performance of the query itself rather than the size of the transaction.
You have Start nodes with 2 properties set to the same value from the CSV (ac and START), and End nodes also with 2 properties set to the same value (en and END). Is there a uniqueness constraint on the property used for the MERGE? Without one, as nodes are created, processing each line takes longer and longer, because the MERGE has to scan all the existing nodes with the wanted label (an O(n^2) algorithm overall, which is pretty bad for 500K nodes).
CREATE CONSTRAINT ON (n:Start) ASSERT n.ac IS UNIQUE;
CREATE CONSTRAINT ON (n:End) ASSERT n.en IS UNIQUE;
That's probably the main improvement to apply.
However, do you actually need to MERGE the relationships (instead of CREATE)? Either the CSV contains a snapshot of the current credit relationships between all Start and End nodes (in which case there's a single relationship per pair), or it contains all transactions and there's no real reason to merge those for the same amount.
Finally, do you actually need to report the sorted, aggregated result from that loading query? It requires more memory and could be split into a separate query, after the loading has succeeded.
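As a sketch of that split, based on the Start/End model in the question, the report could run as its own query once the load has finished:
MATCH (s:Start)-->(e:End)
RETURN s.START AS index, count(e) AS connections
ORDER BY connections DESC;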

Neo4j Cypher script freezes for too many nodes

I have a database with about 300,000 nodes. For comparison purposes with a previous database version, I need to get all the nodes and, for each one, the number of nodes it is connected to (in either direction).
My cypher query looks like this:
match (node)-[r]-(n) return node.Name, count(n)
And my expected result looks like this:
Name | Count
Node1 | 8
Node2 | 3
Node3 | 5
I'm testing this on Neo4j's web interface (version 3.0.3). For some reason the web interface freezes, maybe because of the number of results coming back, so this definitely looks like a performance issue with the query.
Can this query still be optimized?
This is faster, has fewer db hits, and will also give you the names of unconnected nodes (if you want them). size() on a relationship pattern is answered from the node's stored degree, without actually expanding the relationships, which is where the savings come from.
MATCH (node) RETURN node.Name, size((node)-[]-())

Why does my Cypher query return null?

I have two structures with different labels:
tree nodes' label: Person
graph nodes' label: Friend
1) tree:
a-->b-->d
|   |
|   -->e
|
-->c-->f
    |
    -->g
    |
    -->h
2) graph:
b-->a-->f-->g-->b
    |
    -->b
I have this Cypher query, and what I expect it to return is "b", but it returns null. How should I write this query?
MATCH (a),(b)
WHERE a.name='ali'
AND (a:Friend)-[:FRIEND_OF]-(b:Friend)
AND (a:Person)-[:PARENT_OF]->(b:Person)
RETURN b.name
It would help a lot if you could explain what you're trying to do. I don't really understand.
Since the relationships between people already have types (FRIEND_OF, PARENT_OF), why would you use a Friend node label? Unless you have a good reason, I would just have nodes with a Person label.
The query you posted will return Person nodes who are friends with ali and of whom ali is the parent. It can be rewritten to:
MATCH (a:Person {name:"ali"})-[:FRIEND_OF]-(b:Person)
WHERE (a)-[:PARENT_OF]->(b)
RETURN b.name
I don't have your data set so it's hard to test for the results. You are stating that you are not getting any results.
Can you verify the following? (A query sketch to help check is shown below.)
There is only one node with name ali, and it has the 2 relationships you are looking for. If there are 2 nodes with name ali and each of them has one of the aforementioned relationships, Neo4j won't find any matches; you are looking for exactly one node which has both relationships.
Node b exists and has a name property.
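For example, this sketch lists every node named ali together with its labels and relationship types, which makes duplicates easy to spot (the property name name is taken from your query):
MATCH (n {name: 'ali'})
OPTIONAL MATCH (n)-[r]-()
RETURN id(n), labels(n), collect(distinct type(r))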

Retrieving relationships for a single node is slow

I'm getting started with Neo4j 2.0.1 and I'm already running into performance problems that make me think my approach is wrong. I have a single node type so far (all with the label NeoPerson) and one type of relationship (all with the type NeoWeight). In my test setup, there are about 100,000 nodes, and each node has between 0 and 300 relationships to other nodes. There is a Neo4j 2.0-style index on NeoPerson's only field, called profile_id (e.g. CREATE INDEX ON :NeoPerson(profile_id)). Looking up a NeoPerson by profile_id is reasonably fast:
neo4j-sh (?)$ match (n:NeoPerson {profile_id:38}) return n;
+----------------------------+
| n |
+----------------------------+
| Node[23840]{profile_id:38} |
+----------------------------+
1 row
45 ms
However, once I throw relationships into the mix, it gets quite slow.
neo4j-sh (?)$ match (n:NeoPerson {profile_id:38})-[e:NeoWeight]->() return n, e;
+----------------------------------------------------------------------------+
| n | e |
+----------------------------------------------------------------------------+
| Node[23840]{profile_id:38} | :NeoWeight[8178324]{value:384} |
| Node[23840]{profile_id:38} | :NeoWeight[8022460]{value:502} |
...
| Node[23840]{profile_id:38} | :NeoWeight[54914]{} |
+----------------------------------------------------------------------------+
244 rows
2409 ms
My understanding was that traversing relationships from a single node should be quite efficient (isn't that the point of using a graph database?), so why is it taking over 2 seconds for such a simple query on a small data set? I didn't see a way to add an index on a relationship whose keys are the source and/or destination nodes.
People use Neo4j in production without issues. If they have the requirement that the first user query has to return in a few milliseconds, they warm up the caches after server start, e.g. by running their most important use-case queries upfront.
It takes some time to load the nodes and relationships from disk, especially if the relationships (and their properties) of the single node are spread across the relationship store file and are loaded from a spinning disk.
For the first query it also takes a bit longer as its query plan has to be built and compiled.
That's why in production you usually use parameters to allow query caching.
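As a sketch using the 2.x parameter syntax, the lookup above can be parameterized so its plan is compiled once and reused across calls:
MATCH (n:NeoPerson { profile_id: {profile_id} })-[e:NeoWeight]->()
RETURN n, e;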
What is the use-case you're trying to address?
