Is there a race condition when creating unique paths? - neo4j

I recently discovered that a race condition exists when executing concurrent MERGE statements. Specifically, duplicate nodes can be created in the scenario where a node is created after the MATCH step but before the CREATE step of a given MERGE.
This can be worked around in some instances using unique constraints on the merged nodes; however, this falls short in scenarios where:
There is no single unique property to enforce (e.g. pairs of properties need to be unique but individual ones don't).
Trying to merge relationships and paths.
Does using CREATE UNIQUE solve this problem (or do the same pitfalls exist)? If so, is it the only option? It feels like the usefulness of MERGE is fairly heavily diminished when it effectively can't guarantee the uniqueness of the path or node being merged...

When MERGE statements are executed concurrently, these situations may occur. Basically, each transaction gets a view of the graph at the first point of reading, and won't see updates made after that point (with some variations). The main exception to this is uniquely constrained nodes, where Neo4j will initialise a fresh reader from the index when reading, regardless of what was previously read in the transaction.
A workaround could be to create a 'dummy' property and a unique constraint on it and one of the node labels. In Neo4j 2.2.5, this should work to get around your problem.
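The dummy-property workaround can be illustrated outside the database. The Python below is a minimal sketch, not Neo4j code: `composite_key` and `UniqueIndex` are hypothetical names, and the "|" delimiter is an assumption. It shows how folding a pair of properties into a single value lets a single-property uniqueness check reject a duplicate pair, provided the check and the insert happen atomically.

```python
import threading


def composite_key(first, second, delimiter="|"):
    """Fold two property values into one string for a single-property constraint."""

    def escape(value):
        # Escape the delimiter so ("a|b", "c") and ("a", "b|c") stay distinct.
        text = str(value).replace("\\", "\\\\")
        return text.replace(delimiter, "\\" + delimiter)

    return escape(first) + delimiter + escape(second)


class UniqueIndex:
    """Toy stand-in for a uniqueness constraint on one property (hypothetical)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._nodes = {}

    def merge(self, key, properties):
        """Return (node, created); the check and the insert share one lock."""
        with self._lock:
            if key in self._nodes:
                return self._nodes[key], False
            self._nodes[key] = dict(properties)
            return self._nodes[key], True
```

Merging the same key twice returns the existing node the second time instead of creating a duplicate, which is the behaviour the unique constraint is meant to restore.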

Related

apoc.periodic.iterate fails the batch if there is duplicate data in the parameter

I am using an apoc.periodic.iterate query to store millions of records. Since the data may contain duplicates, I am using MERGE to create nodes, but whenever the data contains a duplicate the whole batch fails with an error like this:
"LockClient[200] can't wait on resource RWLock[NODE(14), hash=1645803399] since => LockClient[200] <-[:HELD_BY]- RWLock[NODE(101)"
Setting parallel to false works fine.
Removing the duplicates also lets the query succeed.
But both of these solutions take more time, since I am dealing with millions of records. Is there an alternative, such as making the query wait for the lock?
You cannot use parallel:true, because you are creating relationships in your query. Every time you want to add a relationship to a node, the cypher engine adds a write lock to a node, and other processes can't add to that particular node. That is why you have the write lock exception. Not much you can do except to run it with parallel:false setting.
To avoid deadlocks, concurrent requests that update the DB should avoid touching the same nodes or relationships (including the nodes on both ends of those relationships). One way to achieve this is to figure out a way to have the concurrent requests work on disjoint subgraphs.
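One way to picture "disjoint subgraphs" is to group the input rows by connected component before handing them to workers. The Python below is a rough sketch under assumptions: each row is reduced to a `(start, end)` pair of node keys, and `partition_disjoint` is a hypothetical helper, not part of APOC or any driver.

```python
from collections import defaultdict


def partition_disjoint(edges):
    """Group (start, end) pairs so that no two groups share a node."""
    edges = list(edges)  # iterated twice below
    parent = {}

    def find(node):
        # Walk up to the root of this node's component.
        parent.setdefault(node, node)
        root = node
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited node straight at the root.
        while parent[node] != root:
            parent[node], node = root, parent[node]
        return root

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for start, end in edges:
        union(start, end)

    groups = defaultdict(list)
    for start, end in edges:
        groups[find(start)].append((start, end))
    return list(groups.values())
```

Duplicate rows land in the same group, so a single worker handles them one after another. If the data is highly connected, most rows collapse into one group and little parallelism remains.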
Or, you can retry queries that throw a DeadlockDetectedException. The docs show an example of how to do that.
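The retry approach can be sketched generically. In the Python below, `DeadlockDetected` is a hypothetical stand-in for whatever deadlock error your driver raises, and `run_with_retry`, the attempt limit, and the backoff values are illustrative assumptions rather than the example from the Neo4j docs.

```python
import random
import time


class DeadlockDetected(Exception):
    """Hypothetical stand-in for the driver's deadlock error."""


def run_with_retry(work, max_attempts=5, base_delay=0.01, sleep=time.sleep):
    """Call work() and retry it when it fails with a deadlock.

    work should run one whole transaction, so each retry starts from scratch.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return work()
        except DeadlockDetected:
            if attempt == max_attempts:
                raise  # give up and surface the last failure
            # Exponential backoff with jitter, so competing workers
            # are less likely to collide again on the next attempt.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.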

Why does MERGE sometimes create a duplicate relationship?

My last question was closed for being a duplicate of "Confused about MERGE sometimes creating duplicate relationship"; however, I was unable to find a solution there, and this question deals with duplicate relationships, not duplicate nodes.
I have a query when a user VISITED another user's profile
MATCH (you:User {user_id: { myId }}), (youVisited:User {user_id: { id }})
MERGE (you)-[yvr:VISITED]->(youVisited)
SET yvr.seen = false, yvr.created_at = timestamp()
RETURN yvr.created_at as visited_at
I noticed that in rare cases, a duplicate [:VISITED] relationship is created. For (1057)-[:VISITED]->(630), both relationships have the same properties, and there's really only supposed to be one [:VISITED] no matter what (the next time the user visits, it should simply MERGE the [:VISITED] and update the [:VISITED {created_at: ..., seen: false}] between the same User nodes):
{
created_at: 1485800172734,
seen: false
}
I thought the point of MERGE was to prevent this? Clearly not, so why does this happen and how can I ensure this doesn't happen?
I have looked up some other things, but I am not sure if the information is reliable or up to date. For example: http://neo4j.com/docs/developer-manual/current/cypher/clauses/create-unique/, am I supposed to be using CREATE UNIQUE instead? I thought MERGE was pretty much a better replacement for it.
I agree that in some cases, MERGE and CREATE UNIQUE can be used for the same purpose. MERGE does not replace CREATE UNIQUE, however.
For example, MERGE allows multiple matches, and its pattern has to fully match the graph to be considered a match - it will simply duplicate partial matches; CREATE UNIQUE, on the other hand, will error on multiple matches, and allows partial matches - it will attempt to re-use existing parts of your graph and add the missing parts.
As mentioned in the docs, there also seems to be a difference regarding uniqueness of relationships, i.e. what you are experiencing:
MERGE might be what you want to use instead of CREATE UNIQUE. Note however, that MERGE doesn’t give as strong guarantees for relationships being unique.
I'll leave it up to the developers of Neo4j to explain exactly what those guarantees are. I can only say that in your particular case, CREATE UNIQUE seems a better fit than MERGE anyway: if your intent is to only ever allow a single VISITED relationship from one user to another - his last visit - and multiple VISITED relationships are a violation of your data model, then by all means use CREATE UNIQUE to document this intent, and enforce it at the database level at the same time.
In this case, one could argue that the VISITED relationship is also not particularly well-named, since it implies that there could be more: one for each time that a user visited another user's profile.
As mentioned in my comments, there was a locking bug with MERGE upon Neo4j switching to the COST planner.
As far as I can tell it works like this:
Due to the bug, double-checked locking wasn't occurring. After MERGE determines the relationship doesn't exist, it locks the nodes in preparation to CREATE the relationship. But there is a race between the existence check and the lock acquisition: a concurrent MERGE or CREATE could create the relationship just before the locks are acquired, resulting in duplicate relationships.
The fix will ensure MERGE checks for the existence of the relationship again after the locks are acquired. This should restore concurrency guarantees for MERGE.
This fix is not yet in current Neo4j releases as of 2/10/2017.
In the meantime, you can explicitly lock on the nodes in question before you MERGE to prevent the race condition.
You can do this by setting/removing nonexistent values on the nodes in question, or use APOC locking procedures.
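The check, lock, re-check sequence described above can be simulated in plain Python. This is a toy model under assumptions, not Neo4j's implementation: `RelationshipStore` and `merge_concurrently` are hypothetical names, and per-node `threading.Lock` objects stand in for the database's node locks.

```python
import threading
from contextlib import ExitStack


class RelationshipStore:
    """Toy in-memory store with one lock per node (hypothetical)."""

    def __init__(self):
        self._rels = []
        self._node_locks = {}
        self._registry_lock = threading.Lock()

    def _lock_for(self, node_id):
        with self._registry_lock:
            return self._node_locks.setdefault(node_id, threading.Lock())

    def count(self, start, rel_type, end):
        return self._rels.count((start, rel_type, end))

    def merge(self, start, rel_type, end):
        """Create the relationship unless it exists; return True if created."""
        rel = (start, rel_type, end)
        if rel in self._rels:  # first check, without locks
            return False
        with ExitStack() as stack:
            # Lock both end nodes in a fixed order to avoid deadlocks;
            # the set handles a relationship from a node to itself.
            for node_id in sorted({start, end}):
                stack.enter_context(self._lock_for(node_id))
            if rel in self._rels:  # second check, now holding the locks
                return False
            self._rels.append(rel)
            return True


def merge_concurrently(store, start, rel_type, end, workers=16):
    """Run the same merge from several threads; return each thread's result."""
    results = []
    results_lock = threading.Lock()

    def worker():
        created = store.merge(start, rel_type, end)
        with results_lock:
            results.append(created)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Removing the second check reproduces the bug in this model: two threads can both pass the first check, then take the locks one after the other and each append a relationship.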

neo4j concurrency issue - deleting and matching

I have one program that builds graphs. I have a reaper which deletes old graphs.
Sometimes, the set of nodes returned by the queries used when building the graphs overlaps with the set of nodes being reaped. This is giving me a spurious "EntityNotFound: Node with id xxxxxx" error.
I say it is spurious because the reality is that we're not deleting the nodes we're adding - they are on separate graphs.
However, the loader's MATCH has two parts:
MATCH(n: MYNODE {indexed-var:"ddd", version:"xxx"} ...
It is true that some n indexed by "ddd" can be in the graph being deleted, but the specific version of node n I am adding will always have a 'safe' version number. However, EXPLAIN clearly shows I am sucking in multiple MYNODE nodes, and then filtering to the specific node. I am guessing that the delete program is deleting a MYNODE node after the loader has fetched it, but before it is filtered.
The loader and deleter are both running with transactions, so it isn't an immediate thing - the failure happens on commit.
Can I use _LOCK_ to prevent the read and the delete from acting on the same nodes at the same time? Other ideas?
One solution is to:
Add to MYNODE another property (let's call it id) whose value is a string that concatenates index and version, separated by a delimiter character (let's say it is "|").
Create an index on :MYNODE(id).
Change your MATCH clause to:
MATCH(n: MYNODE {id:"ddd|xxx"} ...
The use of an index allows Cypher to immediately get the desired node(s), avoiding the need to iterate through all the MYNODE nodes and filtering out the undesired ones (some of which may no longer exist). This approach has the added benefit of being much faster.

Add a relationship without data already in the database

It appears that I can't add a relationship unless there is already some data in some Entities that obey that relationship. Is this correct? I want to be able to set up my relationships and Labels first and then populate with data and have the data just use the relationships.
I am using:
MATCH (from:this_label),(to:that_label)
WHERE from.id = to.uuid
CREATE (from)-[:hasARelationship]->(to);
Basically, I want to be able to define a bunch of relationships on nodes of a certain label, even if those node-type do not yet exist. And then when some data of those nodes comes into the database it will hook up the relationships automatically.
It may be helpful to distinguish between the responsibilities of enforcing a constraint and fulfilling a constraint.
Neo4j allows for indices and constraints associated with labels. Indices and constraints created for a label are used to index and constrain the nodes that have that label. As of version 2.2.5, there is only one type of constraint: a uniqueness constraint for a single property. There has been talk about adding constraints for combinations of properties, and for relationships, but I don't know the status of these conversations.
The Neo4j schema constraints enforce something, but they will not fulfill, in the sense of changing your operations on the database to satisfy the constraint. If there were constraints enforcing that a node with label A may only be created if it has a relationship of type R to a node with label B, they would block your operation if it did not satisfy the constraint, but they would not satisfy it for you.
The best way to achieve this is a) to satisfy this requirement in your client application, or b) to create an extension for Neo4j. For an extension example, consider neo4j-uuid by Stefan Armbruster. It listens to transactions (using what's called a TransactionEventListener) and makes sure that any node that is created in the database has a UUID. This extension satisfies what could only be enforced by a corresponding Neo4j schema constraint (there are other differences, e.g., the constraint would be limited to the scope of a label).
A way to achieve your intention could be to either create an extension which listens to what you write to the database and satisfies your constraint, altering your operations if necessary; or, one which provides an invocation target in the server (a RESTful endpoint) that you can invoke whenever you want to create a node with a particular label. The extension would then create the node and other elements necessary to fulfill your schema. A downside to the former could be the overhead of listening to all your operations, a downside to the latter could be that it breaks your flow of interaction with the database to introduce a separate type of invocation (e.g., if you normally execute cypher statements and have to pause to issue a separate POST request and interpret the response before continuing).
If I understand you correctly, you want to use MERGE instead of CREATE. Note that MERGE does not accept a WHERE clause, so the MATCH stays:
MATCH (from:this_label), (to:that_label)
WHERE from.id = to.uuid
MERGE (from)-[:hasARelationship]->(to);
If you are trying to create relationships without nodes, I guess that is not even possible in Neo4j. In fact, it wouldn't be possible in any graph in general.
It does not make sense to pre-populate your DB with relationships that connect to dummy nodes. Among the many reasons are these:
You would not be able to make any meaningful queries involving such relationships
Trying to fill in the dummy nodes later on with actual data may be a complex endeavor
It is very easy to create relationships right when they are needed. neo4j is a "schemaless" DB (except when you define uniqueness constraints, as @jjaderberg mentions). You can create a relationship of any type connecting nodes with any labels (or no labels) at any time. To keep things organized, you may choose to write your DB client code and Cypher queries to conform to your own conceptual "schema", but neo4j has no such requirement.

Neo4j unique IDs by tree with root node counter?

Is using a tree with a counter on the root node, to be referenced and incremented when creating new nodes, a viable way of managing unique IDs in Neo4j? In a previous question on performance on this forum (Neo4j merge performance VS create/set), the approach was described, and it occurred to me it may suggest a methodology for unique ID management without having to extend the Neo4j database (and support that extension). However, I noticed this approach has not been mentioned in other discussions on best practice for unique ID management (Best practice for unique IDs in Neo4J and other databases?).
Can anyone help validate or reject this approach?
Thanks!
You can just create a singleton node (I'll give it the label IdCounter in my example) to hold the "next-valid ID counter" value. There is no need for it to be part of any "tree" or for it to have any relationships at all.
When you create the singleton, initialize it with the first id value that you want to use. For example:
CREATE (:IdCounter {nextId: 1});
Here is a simple example of how to use it when creating a new node.
MATCH (c:IdCounter)
CREATE (x {id: c.nextId})
SET c.nextId = c.nextId + 1
RETURN x;
Since all Cypher queries are transactional, if the node creation did not happen for any reason, then the nextId increment would also not be done, so you should never end up with any gaps in assigned id numbers.
However, to avoid re-using the same id number, you would have to write your queries carefully to ensure that the increment always happens whenever you create a new node (using CREATE, CREATE UNIQUE, or MERGE).
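The counter idea can be modelled in Python to show the property it depends on: the read and the increment must happen as one atomic step. The sketch below is an illustration under assumptions, not Neo4j code. `IdCounter` and `allocate_concurrently` are hypothetical names, and the `threading.Lock` stands in for whatever locking the database applies to the counter node, which is not verified here for any particular Neo4j version.

```python
import threading


class IdCounter:
    """Toy singleton counter; the lock makes read-and-increment atomic."""

    def __init__(self, first_id=1):
        self._next_id = first_id
        self._lock = threading.Lock()

    def allocate(self):
        """Return the current id and advance the counter in one step."""
        with self._lock:
            allocated = self._next_id
            self._next_id += 1
            return allocated


def allocate_concurrently(counter, workers, per_worker):
    """Allocate ids from several threads and return all of them."""
    results = []
    results_lock = threading.Lock()

    def worker():
        local = [counter.allocate() for _ in range(per_worker)]
        with results_lock:
            results.extend(local)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With the lock in place, every allocated id is distinct and the sequence has no gaps. Without it, two threads could read the same value before either increments it.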
