I know that you're not supposed to rely on IDs as identifiers for nodes over the long term, because when you delete nodes the IDs may be reassigned to new nodes (ref).
Neo4j reuses its internal ids when nodes and relationships are deleted. This means that applications using, and relying on, internal Neo4j ids are brittle or at risk of making mistakes. It is therefore recommended to use application-generated ids instead.
If I'm understanding this correctly, the only risk is when you look up a node/relationship by its ID without being able to guarantee that it hasn't been deleted (and its ID possibly reused) in the meantime.
If, through my application design, I can guarantee that the node with a certain ID hasn't been deleted since the time the ID was queried, am I alright to use the IDs? Or is there still some problem that I might run into?
My use case is that I wish to perform a complex operation which spans multiple transactions, and I need to know whether the ID I obtained for a node during the first transaction of that operation is a valid way of identifying the node during the last transaction of the operation.
As long as you are certain that a node/relationship with a given ID won't be deleted, you can use its native ID indefinitely.
However, over time you may want to add support for other use cases that will need to delete that entity. Once that happens, your existing query could start producing intermittent errors (that may not be obvious).
So, it is still generally advisable to use your own identification properties.
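For illustration, one way to do this with the embedded Java API (a rough sketch, assuming a 3.x-style API; the Person label and appId property are just placeholder names) is to generate your own identifier at create time and always look nodes up by that property:

import java.util.UUID;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

public class AppIds {
    private static final Label PERSON = Label.label("Person");

    // Create a node with an application-generated id that survives deletion and id reuse.
    static String createPerson(GraphDatabaseService db, String name) {
        try (Transaction tx = db.beginTx()) {
            Node node = db.createNode(PERSON);
            String appId = UUID.randomUUID().toString();
            node.setProperty("appId", appId);  // stable, application-owned identifier
            node.setProperty("name", name);
            tx.success();
            return appId;                      // hand this out instead of node.getId()
        }
    }

    // Later transactions find the node by the application id, never by the native id.
    static String findPersonName(GraphDatabaseService db, String appId) {
        try (Transaction tx = db.beginTx()) {
            Node node = db.findNode(PERSON, "appId", appId);
            String result = node == null ? null : (String) node.getProperty("name");
            tx.success();
            return result;
        }
    }
}

Backing the property with an index or unique constraint keeps those lookups fast.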
In Neo4j I have an application where an API endpoint does CRUD operations on the graph, then I materialize reachable parts of the graph starting at known nodes, and finally I send out the materialized subgraphs to a bunch of other machines that don’t know how to query Neo4j directly. However, the materialized views are moderately large, and within a given minute only small parts of each one will change, so I’d like to be able to query “what has changed since last time I checked” so that I only have to send the deltas. What’s the best way to do that? I’m not sure if it helps, but my data doesn’t contain arbitrary-length paths; if needed I can explicitly write each node and edge type into my query.
One possibility I imagined was adding a “last updated” timestamp as a property on every node and edge, and instead of deleting things directly, just adding a “deleted” boolean property and updating the timestamp, then using some background process to actually delete a few minutes later (after the deltas have been sent out); there’s a rough sketch at the end of this question. Then in my query, I’d select all reachable nodes and edges and filter them based on the timestamp property. However:
If there’s clock drift between two different Neo4j write servers and the Raft leader changes from one to the other, can the timestamps go back in time? Or, even worse, can two concurrent writes be reordered within a single box, or will they always give me transaction times that are in commit order? I would rather use a graph-wide monotonically increasing integer like the write commit ID, but I can’t find a function that gives me that. Or theoretically I could use the cookie used for causal consistency, but since you only get that after the transaction is complete, it’d be messy to have to do every write as two separate transactions.
Also, it just sucks to use deletion markers, because then you have to explicitly filter out deleted edges/nodes in every other query you do.
Are there other better patterns here?
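For concreteness, here’s roughly what I had in mind for the soft-delete write and the delta query, using the Java driver (1.x-style imports; the Item label, uuid property, and sync timestamp are just placeholders, and timestamp() is the server’s wall clock, which is exactly where my drift worry comes from):

import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Record;
import org.neo4j.driver.v1.Session;
import org.neo4j.driver.v1.StatementResult;
import static org.neo4j.driver.v1.Values.parameters;

public class DeltaSync {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt+routing://cluster:7687",
                                                  AuthTokens.basic("neo4j", "secret"));
             Session session = driver.session()) {

            // Soft delete: mark the node and bump its timestamp instead of deleting it,
            // so the next delta still sees it; a background job deletes it for real later.
            session.run("MATCH (n:Item {uuid: $uuid}) "
                      + "SET n.deleted = true, n.lastUpdated = timestamp()",
                        parameters("uuid", "some-uuid"));

            // Delta query: everything touched since the last sync point.
            StatementResult changed = session.run(
                    "MATCH (n:Item) WHERE n.lastUpdated > $since "
                  + "RETURN n.uuid AS uuid, n.deleted AS deleted",
                    parameters("since", 1234567890L));
            while (changed.hasNext()) {
                Record row = changed.next();
                System.out.println(row.get("uuid") + " deleted=" + row.get("deleted"));
            }
        }
    }
}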
I'm making a simple Swift meditation app and want to have a feature to allow users to see how many others have installed the app as well ("You're part of a community of 354 other meditators")
My current plan - save a "blank" record on first load to public DB in CloudKit.
Then - each client on login retrieves all the records and counts how many there are?
Is there a better solution? I could imagine this getting slow if there are lots of users...
Thanks!
In terms of your CloudKit example, as far as I'm aware there is no option to return just the number of records; instead, CloudKit returns the actual records in batches (it decides how many to return). However, you may specify a limit on the number of records it returns.
If you did specify a limit, you would need to keep raising it, since once the number of records grows larger than the limit it will no longer retrieve them all and your count will be wrong.
This would probably be a bad idea, since you would have to keep releasing app updates just to increase the limit (unless you stored this value in some other external DB, which would then probably be preferable to CloudKit itself). Basically, CloudKit is probably not the best fit for this.
It would probably be much easier to use a different public DB setup. Either set up your own or use a service like 'Parse.com', which makes setting up and connecting to a public DB very simple. An additional benefit of doing it this way is that you can run the count query on the server and return just the count value itself, rather than returning all records and counting them locally, which is very inefficient.
Having issues with deadlocks on a real-time, transactional, multi-user Neo4j embedded system. Can you point me to documentation that spells out which locks are acquired for each graph action? I'm especially concerned with adding and deleting relationships, as that seems to cause most of the deadlocks.
e.g.
Add relationship: write locks placed on both end nodes (is it true that write locks are also placed on all relationships that exist for both end nodes?)
Delete relationship: write locks placed on relationship and both end nodes (is it true that write locks are also placed on all relationships for both end nodes?).
Why do the end nodes need to be locked during a relationship deletion?
Thanks
When you add a relationship, the graph locks the nodes involved. You can get a deadlock if you lock items in an unpredictable order. In my case, I was creating one-to-many relationships, so we could order the many-side nodes by node ID before writing, and this prevented deadlocks for us.
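Roughly what that looked like in code (a sketch, assuming a 3.x-style embedded API; the relationship type is a placeholder):

import java.util.Comparator;
import java.util.List;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

public class OrderedCreate {
    private static final RelationshipType RELATES_TO = RelationshipType.withName("RELATES_TO");

    // Take locks in a predictable order: sort the many-side nodes by node id first,
    // so two concurrent transactions touching the same nodes lock them in the same order.
    static void connect(GraphDatabaseService db, Node parent, List<Node> children) {
        children.sort(Comparator.comparingLong(Node::getId));
        try (Transaction tx = db.beginTx()) {
            for (Node child : children) {
                parent.createRelationshipTo(child, RELATES_TO);  // write-locks parent and child
            }
            tx.success();
        }
    }
}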
When you delete, it's more complicated. It still locks the nodes involved, but under the covers Neo4j stores all of a node's relationships as a doubly linked list, so when you remove a relationship it has to lock the previous and next entries in order to relink them without issue. This is something you cannot predict, as you don't really have any way to get those IDs from under the covers.
Your best bet is a deadlock retry policy. Do a try{}catch(DeadlockDetectedException){}, and if you catch the deadlock exception, retry (I did this by putting the entire operation in a while loop that wouldn't break until the operation completed free of deadlocks).
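Something like this (a sketch; the exception has lived in org.neo4j.kernel in the versions I used, and a bounded retry count plus a small backoff is a bit safer than a bare while loop):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;
import org.neo4j.kernel.DeadlockDetectedException;

public class DeadlockRetry {
    // Retry the whole unit of work when a deadlock is detected, backing off a little
    // so the competing transaction gets a chance to finish first.
    static void withRetry(GraphDatabaseService db, Runnable work, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try (Transaction tx = db.beginTx()) {
                work.run();
                tx.success();
                return;                           // committed without deadlocking
            } catch (DeadlockDetectedException e) {
                if (attempt >= maxAttempts) {
                    throw e;                      // give up and surface the deadlock
                }
                try {
                    Thread.sleep(50L * attempt);  // simple linear backoff
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
    }
}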
Adding and removing relationships also requires updating both nodes' references to the relationships they have. In other words, adding and removing relationships implies writes to the nodes at both ends, so Neo4j needs to take write locks on all three entities.
The documentation is unfortunately out of date, it seems. There is more to locking in Neo4j than what that page reveals, especially now that it has support for things like unique constraints.
Nicholas's advice about ordering the entities you want to write to by their IDs is worth trying. You can also try to split things out in the graph, so that transactions that would otherwise conflict are less likely to work on the same data.
Found this: http://docs.neo4j.org/chunked/stable/transactions-locking.html. It covers some basic info but does not mention the linked lists of relationships.
Is there a way to ensure an ordered atomic change set from Simperium?
I have a data model with complex relationships. Looking things over, it seems possible for the object graph to end up in an invalid state if the communication pipe is severed. Is there a way to indicate to Simperium that a group of changes belong together? This would be helpful, as the client or server could then refuse to apply those changes unless all the data from a "transaction" is present, thus keeping the object graph in a valid state.
Presently it's expected that your relationships are marked as optional, which allows objects to be synced and stored in any order without technically violating your model structure. Relationships are lazily re-established by Simperium at first opportunity, even if the connection is severed and later restored.
But this approach does pass some burden to your application logic. The code is open source, and suggestions for changes in this regard are welcome.
I would like to synchronize access to a particular insert, so that if multiple applications execute this one insert, the inserts happen one at a time. The reason for the synchronization is that there should only be ONE instance of this entity: if multiple applications try to insert the same entity, only one should succeed and the others should fail.
One option considered was to create a composite unique key that would uniquely identify the entity and rely on the unique constraint. For some reason, the DBA department rejected this idea. The other option that came to mind was to create a stored proc for the insert: if the stored proc can obtain a global lock, then multiple applications invoking it from their separate database sessions would be serialized by that lock.
My question: is it possible for a stored proc in Oracle 10/11 to obtain such a lock? Any pointers to documentation would be helpful.
If you want the inserted entities to be unique, then in Oracle you don't need to serialise anything - a unique constraint is perfectly designed and suited for exactly this purpose. Oracle handles all the locking required to ensure that only one entity gets inserted.
I can't think of a reason why the DBA department rejected the idea of a unique constraint; this is pretty basic. Perhaps they rejected some other aspect of your proposed solution.
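For illustration, this is roughly what the insert looks like from the application side once a unique constraint is in place (a sketch; the table and column names are made up, and depending on the driver version you may get a plain SQLException with error code 1 / ORA-00001 rather than the specific subclass):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLIntegrityConstraintViolationException;

public class UniqueInsert {
    // With a unique constraint on (entity_type, entity_key), Oracle guarantees that only
    // one of several concurrent inserts of the same entity succeeds; the rest get ORA-00001.
    static boolean tryInsert(Connection conn, String type, String key) throws SQLException {
        String sql = "INSERT INTO entities (entity_type, entity_key) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, type);
            ps.setString(2, key);
            ps.executeUpdate();
            return true;                 // this session created the entity
        } catch (SQLIntegrityConstraintViolationException e) {
            return false;                // someone else inserted it first
        }
    }
}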
If you want to serialise access for some reason (and I can't think of a reason why you would), you could (a) get a lock on the whole table, which would serialise all DML on the table; or (b) get a user-named lock using DBMS_LOCK - which would only serialise the particular process(es) in which you get the lock. Both options have advantages and disadvantages.
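For completeness, if you do go the DBMS_LOCK route, a user-named lock can be requested from any session, e.g. via JDBC (a sketch; the lock name is arbitrary, EXECUTE on DBMS_LOCK must be granted, and note that ALLOCATE_UNIQUE issues a commit):

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;

public class NamedLock {
    // Request an exclusive user-named lock; whatever runs between a successful request
    // and DBMS_LOCK.RELEASE (or session end) is serialised across all sessions.
    static boolean acquire(Connection conn, String lockName, int timeoutSeconds) throws SQLException {
        String handle;
        try (CallableStatement cs = conn.prepareCall("{call DBMS_LOCK.ALLOCATE_UNIQUE(?, ?)}")) {
            cs.setString(1, lockName);               // arbitrary application-chosen name
            cs.registerOutParameter(2, Types.VARCHAR);
            cs.execute();                            // note: ALLOCATE_UNIQUE commits
            handle = cs.getString(2);
        }
        try (CallableStatement cs = conn.prepareCall("{? = call DBMS_LOCK.REQUEST(?, ?, ?)}")) {
            cs.registerOutParameter(1, Types.INTEGER);
            cs.setString(2, handle);
            cs.setInt(3, 6);                         // 6 = exclusive mode (X_MODE)
            cs.setInt(4, timeoutSeconds);            // how long to wait for the lock
            cs.execute();
            return cs.getInt(1) == 0;                // 0 = success, 1 = timeout, 4 = already held
        }
    }
}

The lock stays held until the same session calls DBMS_LOCK.RELEASE with that handle or disconnects, so remember to release it after the insert.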