In Neo4j I have an application where an API endpoint does CRUD operations on the graph, then I materialize reachable parts of the graph starting at known nodes, and finally I send the materialized subgraphs out to a bunch of other machines that don't know how to query Neo4j directly. However, the materialized views are moderately large, and within a given minute only small parts of each one will change, so I'd like to be able to query "what has changed since the last time I checked" so that I only have to send the deltas. What's the best way to do that? I'm not sure if it helps, but my data doesn't contain arbitrary-length paths; if needed I can explicitly write each node and edge type into my query.
One possibility I imagined was adding a "last updated" timestamp as a property on every node and edge; instead of deleting things directly, I'd just set a "deleted" boolean property, update the timestamp, and have some background process actually delete a few minutes later (after the deltas have been sent out). Then my query would select all reachable nodes and edges and filter them based on the timestamp property (a rough sketch of that query is at the end of this question). However:
If there's clock drift between two different Neo4j write servers and the Raft leader changes from one to the other, can the timestamps go back in time? Or even worse, will two concurrent writes always give me a transaction time that is in commit order, or can they be reordered within a single box? I would rather use a graph-wide monotonically increasing integer like the write commit ID, but I can't find a function that gives me that.
Or theoretically I could use the cookie used for causal consistency, but since you only get that after the transaction is complete, it'd be messy to have to do every write as two separate transactions.
Also, it just sucks to use deletion markers because then you have to explicitly remove deleted edges / nodes in every other query you do.
Are there other better patterns here?
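For reference, here is roughly the delta query I have in mind for the timestamp-plus-tombstone idea above (the sketch mentioned earlier). The :Root and :Item labels, the :HAS relationship type, and the updatedAt / deleted property names are placeholders for my real model; I'm using the official Python driver:

# Sketch of the delta query for the timestamp-plus-tombstone idea above.
# Labels (:Root, :Item), the relationship type [:HAS], and the property names
# (updatedAt, deleted) are placeholders for the real model.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

DELTA_QUERY = """
MATCH (root:Root {id: $rootId})-[rel:HAS]->(item:Item)
WHERE item.updatedAt > $since OR rel.updatedAt > $since
RETURN item, rel,
       coalesce(item.deleted, false) AS itemDeleted,
       coalesce(rel.deleted, false)  AS relDeleted
"""

def fetch_delta(root_id, since_millis):
    # Returns only the nodes/relationships touched since the last poll,
    # including tombstoned ones so the downstream machines can remove them.
    with driver.session() as session:
        return list(session.run(DELTA_QUERY, rootId=root_id, since=since_millis))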
Related
I know that you're not supposed to rely on IDs as identifiers for nodes over the long term, because when you delete nodes, the IDs may be re-assigned to new nodes (ref).
Neo4j reuses its internal ids when nodes and relationships are deleted. This means that applications using, and relying on internal Neo4j ids, are brittle or at risk of making mistakes. It is therefore recommended to rather use application-generated ids.
If I'm understanding this correctly, looking up a node/relationship by its id only puts you at risk when you can't guarantee that it hasn't been deleted in the meantime.
If through my application design I can guarantee that the node with a certain ID hasn't been deleted since the time the ID was queried, am I alright to use the IDs? Or is there still some problem that I might run into?
My use case is that I wish to perform a complex operation which spans multiple transactions. And I need to know if the ID I obtained for a node during the first transaction of that operation is a valid way of identifying the node during the last transaction of the operation.
As long as you are certain that a node/relationship with a given ID won't be deleted, you can use its native ID indefinitely.
However, over time you may want to add support for other use cases that will need to delete that entity. Once that happens, your existing query could start producing intermittent errors (that may not be obvious).
So, it is still generally advisable to use your own identification properties.
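As a rough sketch of what the application-generated-id approach can look like (the :Item label, the uuid property, and the setup here are illustrative only; the constraint syntax shown is the older ON ... ASSERT form, and newer Neo4j versions use FOR ... REQUIRE):

# Sketch: identify nodes by an application-generated id instead of the native id.
# The :Item label and "uuid" property are placeholders; the constraint syntax
# is the older ON ... ASSERT form (newer versions use FOR ... REQUIRE).
import uuid
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # One-time setup: make the application id unique (and therefore indexed).
    session.run("CREATE CONSTRAINT ON (n:Item) ASSERT n.uuid IS UNIQUE")

    # First transaction of the long-running operation: create and remember the uuid.
    item_id = str(uuid.uuid4())
    session.run("CREATE (n:Item {uuid: $uuid, name: $name})",
                uuid=item_id, name="example")

    # Last transaction, possibly much later: look the node up by the stable uuid,
    # never by its native id.
    record = session.run("MATCH (n:Item {uuid: $uuid}) RETURN n", uuid=item_id).single()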
Say I have 3 unrelated time-series. Each written row key starts with the current timestamp: timestamp#.....
Having each time-series in a separate table will cause hotspotting because new rows are always added at one extremity (latest timestamp).
If we join all 3 time-series in one BigTable table with prefixes:
series1#timestamp#....
series2#timestamp#....
series3#timestamp#....
Does that avoid hotspotting? Will each cluster node handle one time-series?
I'm assuming that there are 3 nodes per cluster and that each of the 3 time-series will receive similar load and will grow in size evenly.
If yes, is there any disadvantage to having multiple unrelated time-series in one BigTable table?
Because you have a timestamp as the first part of your rowkey, I think you're going to get hotspots either way.
In a Bigtable instance, your data is split into groups of contiguous rowkeys (called tablets) and those are distributed evenly across the nodes. To maximize efficiency with Bigtable, you need that data to be distributed across the nodes and within the nodes as tablets. You get hotspotting when you're writing to the same row or contiguous set of rows since that is all happening within one tablet. If you are constantly writing with the timestamp as a prominent part of the key, you will keep writing to the same tablet until it fills up and you have to go to the next one rather than writing to multiple tablets within a node.
The Bigtable documentation has a guide for time-series schema design which recommends a few solutions for a use case like yours:
Field promotion: add an additional field to the rowkey before your timestamp to separate out a group of data (USER_ID#timestamp#...)
Salting: take a hash of the timestamp modulo the number of nodes and prepend the result to the rowkey (SALT_RESULT#timestamp#...); a small sketch of this and field promotion follows this list
Reverse timestamps: if neither of those works, reverse the timestamp. This works best if your most common query is for the latest values, but it can make other queries more difficult
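Here is what the first two options could look like with the Cloud Bigtable Python client; the project, instance, table, and column-family names, and the bucket count, are placeholders:

# Sketch of field promotion and salting for the rowkeys above. Project, instance,
# table, and column-family names are placeholders.
import hashlib
from google.cloud import bigtable

NUM_SALT_BUCKETS = 3  # roughly one bucket per node, per the question's assumption

def promoted_key(series_id, user_id, ts_millis):
    # Field promotion: a field with natural spread (here a user id) goes before
    # the timestamp, so concurrent writes land on many tablets at once.
    return f"{series_id}#{user_id}#{ts_millis}".encode()

def salted_key(series_id, ts_millis):
    # Salting: a deterministic bucket derived from the timestamp is prepended,
    # which spreads "now" across NUM_SALT_BUCKETS ranges of rowkeys.
    bucket = int(hashlib.md5(str(ts_millis).encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
    return f"{bucket}#{series_id}#{ts_millis}".encode()

client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("timeseries")

row = table.direct_row(salted_key("series1", 1700000000000))
row.set_cell("cf1", b"value", b"42")
row.commit()

Note that with salting, a query over a time range has to fan out over all buckets, which is the usual trade-off for spreading the write load.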
Edit:
Your approach is definitely similar to salting, but since your data is already in separate tables you're not actually getting any extra benefit, because the hotspotting happens at the tablet level.
To draw it out more, let's say you have this data in separate tables and start writing data. Each table is going to be composed of tablets, which capture timestamps 0-10, 11-20, etc... Those tablets will automatically be distributed amongst nodes for the best performance. If the loads are all similar, tablets 0-10 should all be on separate nodes, 11-20 will all be on separate nodes etc.
With the way your schema is set up, you are constantly writing to the latest tablet (let's say the time is now 91): you're only writing to the 91-100 tablet while ignoring all the other tablets on that node. Since that 91-100 tablet is the only one getting work, the node isn't going to give you optimized performance, and this is what we refer to as hotspotting. A certain tablet gets a spike of load, and there won't be enough time for the load balancer to correct it.
If you have it in the same table, we can just focus on one node now. series1#0-10 will first get slammed, then series1#11-20, then series1#21-30. There is always one tablet that is getting too much load and not making use of the full node.
There is some more information about load balancing in the documentation.
What is the method for removing inactive, unwanted node labels in a Neo4j database (community edition version 2.2.2)?
I've seen this question in the past but for whatever reason it gets many interpretations, such as clearing browser cache, etc.
I am referring here to labels actually contained in the database, such that the REST command
GET /db/data/labels
will produce them in its output. The labels have been removed from all nodes and there are no active constraints attached to them.
I am aware this question has been asked in the past and that there is a cumbersome way of solving it, which is basically to dump and reload the database. The dump doesn't even contain intermediate commit statements and thus needs to be edited before it can be executed again. Of course this takes forever with big databases. There has to be a better way, or at least a feature in the queue of requirements waiting to be implemented. Can someone clarify?
If you delete the last node with a certain label - as you've observed - the label itself does not get deleted. As of today there is no way to delete a label.
However you can copy over the datastore in offline mode using e.g. Michael's store copy tool to achieve this.
The new store is then aware of only those labels which actually are used.
We started designing a process for detecting changes in our ERP database for creating data warehouse databases. Since they don't like placing triggers on the ERP databases, or even enabling CDC (SQL Server), we are thinking of reading changes from the databases that get replicated to a central repository through transactional replication, then keeping an extra copy that will merge the changes (we will have CDC on the extra copy)...
I wonder whether there is a scenario where data that changes within, say, 15 minutes is important enough to affect our design. The way we plan to design this, we would not be able to track every single change; we would only get the latest value after a period of time. For example, if a value in a row changes from A to B and then, 1 minute later, from B to C, the replication system will bring only that last value to the central repository. When we then merge the table with our extra copy (which might have had the value A), it will be updated to C, and we will have lost the value B.
Is there a good scenario in a data warehouse database where you need to track ALL the changes a table has gone through?
Taking care of historical data in a DW is important in some cases such as:
When the dimension value changes. Say, a supplier merged with another and changed their commercial name
When the fact table uses calculations derived from other information outside the fact table that changes; say the conversion rate changes, for example.
When you need to run queries that reflect fact information in previous periods (versions of the fact table).
An example where every change matters may be a bank account's balance, a storage warehouse item count, a stock price, etc.
For your particular case, you should check with your customer how the system will be used and what exactly its benefits are, and design accordingly. How granularly the changes should be captured (every hour, day, etc.) is primarily your customer's call.
Some techniques for handling dimension data changes are described in Kimball's Slowly Changing Dimension patterns.
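For the supplier-rename case above, a Type 2 slowly changing dimension keeps every version of the row rather than overwriting it. A minimal sketch, with invented table and column names (sqlite3 used only for illustration):

# Minimal Type 2 slowly changing dimension sketch; table and column names are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_supplier (
        supplier_key  INTEGER PRIMARY KEY,  -- surrogate key, one per version
        supplier_id   TEXT,                 -- business key, stable across versions
        supplier_name TEXT,
        valid_from    TEXT,
        valid_to      TEXT,                 -- NULL means "current version"
        is_current    INTEGER
    )
""")

def rename_supplier(supplier_id, new_name, change_date):
    # Close the current version instead of overwriting it ...
    conn.execute(
        "UPDATE dim_supplier SET valid_to = ?, is_current = 0 "
        "WHERE supplier_id = ? AND is_current = 1",
        (change_date, supplier_id),
    )
    # ... and insert a new version, so old facts keep pointing at the name
    # that was valid when they occurred.
    conn.execute(
        "INSERT INTO dim_supplier (supplier_id, supplier_name, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (supplier_id, new_name, change_date),
    )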
In direct answer to your question: it depends on the application.
Examples:
The value is the description field of an item in some inventory, where the items themselves do not change (i.e. item ID X is always a sparkly-thingy). In this case saving short lived descriptions is probably not required.
The value is the last reading of a temperature sensor. If it goes over a certain value, action is taken to bring the temperature back down. In this case you certainly need to save each and every change.
This raises three points:
The second case, where every single change is required, shows very bad design. Such a system would surely insert new values with a timestamp into a table rather than update a single value (a small sketch of that append-only approach is at the end of this answer).
Bad designs do exist. Hence:
The amount of data being warehoused depends on the nature of the data.
a. Will you be able to derive any intelligence from your warehoused data?
b. Will you be able to know based on changes at the database level what happened at the business level?
c. What happens to your data when the database schema changes because you upgraded the ERP product?
I'm wondering whether saving a log of changes on the table level is usable. You might be able to reverse engineer what a set of changes means and then save that to the warehouse, or actually get the ERP to "tell" you what it has done and save those changes.
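As a concrete version of the append-only design mentioned above for the sensor example (sqlite3 and the table/column names are just for illustration):

# Append-only version of the temperature-sensor example (sqlite3 for illustration only).
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sensor_readings (
        sensor_id   TEXT,
        read_at     REAL,   -- epoch seconds
        temperature REAL
    )
""")

def record_reading(sensor_id, temperature):
    # Each reading is a new row; nothing is overwritten, so no change is ever lost.
    conn.execute(
        "INSERT INTO sensor_readings (sensor_id, read_at, temperature) VALUES (?, ?, ?)",
        (sensor_id, time.time(), temperature),
    )

def latest_reading(sensor_id):
    return conn.execute(
        "SELECT temperature FROM sensor_readings WHERE sensor_id = ? "
        "ORDER BY read_at DESC LIMIT 1",
        (sensor_id,),
    ).fetchone()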
Is it possible to find out the updates / modifications / changes done to a Neo4j DB over a time interval?
The Neo4j DB will be polled at periodic intervals to find the changes that happened to it over that time period.
Then these changes have to be synced with other DBs. This is the real task.
Here, changes include the addition, update, and deletion of nodes, relationships, and properties.
How do we track the changes that have been made in a particular timeframe? Not all nodes and relationships have timestamps set on them.
Add a timestamp property to each of your nodes and relationships that records timestamp() when they are created. Then write a Cypher query to bring back all nodes and relationships within the given time range.
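A rough sketch of that with the official Python driver; the :Entity label and the updatedAt property name are placeholders:

# Sketch of the suggestion above: stamp entities on write, then pull everything
# in a time window. The :Entity label and property names are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CREATE_STAMPED = """
CREATE (n:Entity {name: $name, createdAt: timestamp(), updatedAt: timestamp()})
"""

CHANGED_IN_WINDOW = """
MATCH (n:Entity)
WHERE n.updatedAt >= $fromMillis AND n.updatedAt < $toMillis
OPTIONAL MATCH (n)-[r]-()
WHERE r.updatedAt >= $fromMillis AND r.updatedAt < $toMillis
RETURN n, collect(r) AS changedRels
"""

def create_stamped(name):
    with driver.session() as session:
        session.run(CREATE_STAMPED, name=name)

def changes_between(from_millis, to_millis):
    # Everything created/updated inside [from, to) since the last poll.
    with driver.session() as session:
        return list(session.run(CHANGED_IN_WINDOW,
                                fromMillis=from_millis, toMillis=to_millis))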
EDIT
There are two ways of implementing this synchronization.
Option 1
If you can use Spring Data Neo4j then you can use the lifecycle events as explained here to intercept the CUD operations and do the necessary synchronization either synchronously or asynchronously.
Option 2
If you can't use Spring, then you need to implement the interception code yourself. The best way I can think of is to publish all the CUD operations to a topic and then write subscribers that each synchronize to one of the stores. In your case you would have Neo4jSubscriber, DbOneSubscriber, Db2Subscriber etc.
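A bare-bones, in-process sketch of that idea; the Topic here is just an in-memory fan-out and the subscriber bodies are placeholders (a real system would put a message broker in the middle):

# Bare-bones in-process sketch of Option 2: every CUD operation is published to
# a topic and each subscriber applies it to its own store.
class Topic:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, subscriber):
        self.subscribers.append(subscriber)

    def publish(self, event):
        for subscriber in self.subscribers:
            subscriber.handle(event)


class Neo4jSubscriber:
    def handle(self, event):
        print("apply to Neo4j:", event)    # placeholder for the Cypher write


class DbOneSubscriber:
    def handle(self, event):
        print("apply to DB one:", event)   # placeholder for the other store's write


topic = Topic()
topic.subscribe(Neo4jSubscriber())
topic.subscribe(DbOneSubscriber())

# The application publishes each create/update/delete as it performs it.
topic.publish({"op": "create", "label": "Person", "properties": {"name": "Alice"}})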
There is also something called a time tree, where you use year, month, and day nodes to track changes; you can use this as well to get the history.
You also need to make sure you set the changed attributes / properties on the relating object nodes when linking them to the day, month, or year nodes.
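For example, a plain-Cypher sketch of that idea via the Python driver; all labels, relationship types, and property names here are invented:

# Sketch of the year/month/day tree idea above. All labels, relationship types,
# and property names are invented; changed_props is a list of property names.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

ATTACH_CHANGE = """
MERGE (y:Year {value: $year})
MERGE (y)-[:HAS_MONTH]->(m:Month {value: $month})
MERGE (m)-[:HAS_DAY]->(d:Day {value: $day})
WITH d
MATCH (n:Entity {uuid: $uuid})
MERGE (n)-[c:CHANGED_ON]->(d)
SET c.changedProperties = $changedProps
"""

def record_change(uuid, year, month, day, changed_props):
    # Links the changed node to the day node and notes which properties changed,
    # so "what changed on day X" becomes a traversal from that day node.
    with driver.session() as session:
        session.run(ATTACH_CHANGE, uuid=uuid, year=year, month=month,
                    day=day, changedProps=changed_props)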
I hope this helps someone.