I have an incoming event stream of player interactions from an MMO. I want to construct a graph of the player's moment-to-moment interactions, continuously run queries on the activities of the past ~30-240 seconds, and update a graphical view, all in real-time.
Some more details about my particular case:
I have ~100-500 incoming events every second. These look like:
(PlayerA)->[:TAKES_ACTION]->(Event)->[:RECIPIENT]->(PlayerB)
where every event is time-stamped. Timestamps are accurate to the second. I plan on attaching every event to a node representing a timestamp, so I can restrict queries to the events attached to a set of X most recent timestamps.
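For illustration, the per-event write I have in mind looks roughly like this (the :AT relationship and the property names are just working names):

    // one Timestamp node per second; each event is attached to one
    MERGE (t:Timestamp {epoch: {epoch}})
    MERGE (a:Player {name: {actor}})
    MERGE (b:Player {name: {recipient}})
    CREATE (a)-[:TAKES_ACTION]->(e:Event {type: {type}})-[:RECIPIENT]->(b)
    CREATE (e)-[:AT]->(t)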
I expect at any given time-frame for there to be ~1000-2000 active players.
My queries will be to group players together based on mutual interactions, to figure out which groups are currently engaged in combat with which other groups.
My main questions are:
Does Neo4j have any sort of "incremental update" functionality to efficiently update query results without re-running the entire query for every set of changes?
Does Neo4j have any sort of ability to "push" any changes to the results of a query to a client? Or would a client have to continuously poll the database?
Are there any optimisations or tricks to making a continuously repeated query as efficient as possible?
Answers
1) No. You can only execute a query and get its results.
2) No. Currently the client can only make requests to the server; there is no push channel.
3) Yes.
Details
Let's get to the bottom of this one. Out of the box, Neo4j offers you:
REST API
Transactional Cypher endpoint
Traversal endpoint
Custom plugins
Custom unmanaged extensions
In your case you should implement an unmanaged extension. This is the best option for getting the desired functionality: you develop it yourself.
More information on extensions:
How to create unmanaged Neo4j extension?
Unmanaged extension template
Graphaware framework for extension development
Inside an extension you can do everything you want (a sketch follows this list):
Use the Core API to make efficient queries for the latest data
Create a WebSocket endpoint for a full-duplex communication channel between client and server
Implement any additional logic to format/represent your data correctly
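A minimal sketch of the query side of such an extension, assuming the timestamp-node model from the question (a JAX-RS resource against the Neo4j 2.x/3.x embedded API; the /recent path, the :AT relationship, and the property names are assumptions, and the WebSocket side is omitted):

    import java.util.Collections;
    import java.util.Map;

    import javax.ws.rs.DefaultValue;
    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.QueryParam;
    import javax.ws.rs.core.Context;
    import javax.ws.rs.core.MediaType;
    import javax.ws.rs.core.Response;

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Result;
    import org.neo4j.graphdb.Transaction;

    @Path("/recent")
    public class RecentEventsResource {

        private final GraphDatabaseService db;

        // Neo4j injects the database into unmanaged extension resources
        public RecentEventsResource(@Context GraphDatabaseService db) {
            this.db = db;
        }

        @GET
        @Produces(MediaType.APPLICATION_JSON)
        public Response recent(@QueryParam("window") @DefaultValue("60") long window) {
            long cutoff = System.currentTimeMillis() / 1000L - window;
            StringBuilder json = new StringBuilder("[");
            try (Transaction tx = db.beginTx()) {
                // Parameterized Cypher: only events attached to recent Timestamp nodes
                Result result = db.execute(
                        "MATCH (a:Player)-[:TAKES_ACTION]->(e:Event)-[:AT]->(t:Timestamp) "
                      + "WHERE t.epoch >= {cutoff} "
                      + "MATCH (e)-[:RECIPIENT]->(b:Player) "
                      + "RETURN a.name AS actor, b.name AS target",
                        Collections.<String, Object>singletonMap("cutoff", cutoff));
                while (result.hasNext()) {
                    Map<String, Object> row = result.next();
                    if (json.length() > 1) json.append(',');
                    // no JSON escaping here; fine for a sketch, not for production
                    json.append("{\"actor\":\"").append(row.get("actor"))
                        .append("\",\"target\":\"").append(row.get("target")).append("\"}");
                }
                tx.success();
            }
            return Response.ok(json.append(']').toString()).build();
        }
    }

The class is mounted via org.neo4j.server.thirdparty_jaxrs_classes (2.x) or dbms.unmanaged_extension_classes (3.x) in the server configuration.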
Queries and performance
Cypher queries are compiled and cached on first execution; after that the cached plan is reused, so repeated query execution is quite fast by itself.
Recommendations:
Always use query parameters where possible. This allows Neo4j to reuse query plans efficiently.
Be smart when writing queries. Try to lower cardinality as early as possible.
Think about your data model. You can probably model the data in such a way that a query always fetches only the latest events; in your case relationships such as :LAST_EVENT and :PREVIOUS_EVENT may help (see the sketch after this list).
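For instance, with a :LAST_EVENT/:PREVIOUS_EVENT chain per player, a parameterized query can start from the newest event instead of scanning everything (the relationship names and the epoch property are assumptions):

    // Start at the player's newest event and walk the chain backwards,
    // keeping only events inside the time window
    MATCH (p:Player {name: {player}})-[:LAST_EVENT]->(latest:Event)
    MATCH (latest)-[:PREVIOUS_EVENT*0..200]->(e:Event)
    WHERE e.epoch >= {cutoff}
    RETURN e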
Related
In neo4j I have an application where an API endpoint does CRUD operations on the graph, then I materialize reachable parts of the graph starting at known nodes, and finally I send out the materialized subgraphs to a bunch of other machines that don’t know how to query neo4j directly. However, the materialized views are moderately large, and within a given minute only small parts of each one will change, so I’d like to be able to query “what has changed since last time I checked” so that I only have to send the deltas. What’s the best way to do that? I’m not sure if it helps, but my data doesn’t contain arbitrary-length paths — if needed I can explicitly write each node and edge type into my query.
One possibility I imagined was adding a “last updated” timestamp as a property on every node and edge, and instead of deleting things directly, just add a “deleted” boolean property and update the timestamp, and then use some background process to actually delete a few minutes later (after the deltas have been sent out). Then in my query, select all reachable nodes and edges and filter them based on the timestamp property. However:
If there's clock drift between two different neo4j write servers and the Raft leader changes from one to the other, can the timestamps go back in time? Or even worse, will two concurrent writes always give me a transaction time that is in commit order, or can they be reordered within a single box? I would rather use a graph-wide monotonically-increasing integer like the write commit ID, but I can't find a function that gives me that. Or theoretically I could use the cookie used for causal consistency, but since you only get that after the transaction is complete, it'd be messy to have to do every write as two separate transactions.
Also, it just sucks to use deletion markers, because then you have to explicitly filter out deleted edges/nodes in every other query you do.
Are there other better patterns here?
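For concreteness, the delta query I'm imagining would look roughly like this, with each edge type written out explicitly (lastUpdated, deleted, and the labels are the placeholder names from above):

    // all children of a known root that changed since the last sync,
    // including tombstones marked with the deleted flag
    MATCH (root:Root {id: {rootId}})-[r:HAS_CHILD]->(n:Child)
    WHERE n.lastUpdated > {since} OR r.lastUpdated > {since}
    RETURN n, r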
I'm new to working with CloudKit and database fetching. I've looked at the CKDatabaseOperation calls, and I'm trying to understand the real differences between adding an operation to a database and using "normal" function calls on that database, given that they both produce more or less the same results.
Why would adding an operation be more desirable over a function call and in what situations?
Thanks for helping me understand this. I'm trying to learn as much as I can about Swift.
Overview:
In CloudKit, most tasks can be done in two ways:
Convenience APIs (functions with completion handlers)
Operations
1. Convenience APIs
Advantages:
As the name implies, they are convenient to use
Disadvantages:
They usually require more server requests.
You can't build dependencies between them.
2. Operations:
Advantages:
More configurable, with more options.
They require fewer server requests (better for your server-request quota).
They are built on Operation, so you get all of Operation's capabilities, such as dependencies (you will need them in a real app).
Disadvantages:
They are not as convenient to use; you need to create and configure the operation yourself. It takes a little more time to code, but it is well worth it.
Example 1 (Fetch):
If you use CKDatabase.fetch, you would need to specify the record IDs that you want to fetch.
If you use CKQueryOperation, you can query based on field values.
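A rough sketch of the difference (the record type, field names, and record ID are made up):

    import CloudKit

    // Convenience API: you must already know the record's ID
    let recordID = CKRecord.ID(recordName: "event-123")
    CKContainer.default().publicCloudDatabase.fetch(withRecordID: recordID) { record, error in
        // one record (or an error) comes back
    }

    // Operation: query by field value instead
    let query = CKQuery(recordType: "Event",
                        predicate: NSPredicate(format: "type == %@", "combat"))
    let operation = CKQueryOperation(query: query)
    operation.recordFetchedBlock = { record in
        // called once per matching record
    }
    operation.queryCompletionBlock = { cursor, error in
        // a non-nil cursor means more results can be fetched
    }
    CKContainer.default().publicCloudDatabase.add(operation)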
Example 2 (Save / Update):
If you use CKDatabase.save, you can save one record per function call, and each call results in a separate server request. If you want to save 200 records, you would have to run it in a loop and make 200 server requests, which is not very efficient. CloudKit also limits the number of server requests you can make per second, so you would exhaust your quota very quickly.
If you use CKModifyRecordsOperation, you can save all 200 records at once* by passing them as an array, making far fewer server requests.
*Note: The server imposes a limit on the number of records it can save in 1 request but it is definitely better than creating a separate request to save each record.
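A sketch of the batched save (record construction is simplified, and the record type and fields are made up):

    import CloudKit

    // Build the 200 records locally first
    let records: [CKRecord] = (0..<200).map { index in
        let record = CKRecord(recordType: "Event")
        record["type"] = "combat" as NSString
        record["index"] = NSNumber(value: index)
        return record
    }

    // One operation, one server request for the whole batch
    // (subject to the per-request record limit mentioned above)
    let operation = CKModifyRecordsOperation(recordsToSave: records,
                                             recordIDsToDelete: nil)
    operation.modifyRecordsCompletionBlock = { saved, deletedIDs, error in
        // a single completion for the entire batch
    }
    CKContainer.default().privateCloudDatabase.add(operation)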
Reference:
https://developer.apple.com/library/content/documentation/DataManagement/Conceptual/CloudKitQuickStart/Introduction/Introduction.html#//apple_ref/doc/uid/TP40014987-CH1-SW1
Watch WWDC CloudKit videos
It might also help to watch the WWDC videos about Operation (formerly known as NSOperation).
I have been testing neo4j for graph projects for a month or two now and it has been really efficient, but I'm having a hard time solving one of my problems, so I'm looking for advice.
I'm using neo4j to store graph databases and check that they follow some structural requirements. For example, I have a DB modeling dependencies between items: the nodes are the items and the links are labeled "need" or "incompatible" to model the dependencies, and I want neo4j to check the coherence of the data.
I coded the checker in a server plugin and it works very well. But now I would like to allow users to connect to the database, modify the data (without saving the modification yet), check that the modifications are not breaking the coherence and then save the modifications.
I found the HTTP endpoint, which can keep a transaction open, and it completely fits the "modifying the db without saving" need, but I can't find a way to run my checker on the modified data: is there a way to run something other than a Cypher query through the HTTP endpoint, or do I have to consider another way to solve this?
I know it would be possible to run my checker in a TransactionEventHandler's beforeCommit, but that means the user couldn't know whether their data are okay without starting a commit, and the fact that the data are split between the unmodified DB and the TransactionData holding the modifications makes the checker tricky to apply.
So, if someone knows how I could solve this, it would be great.
Thank you.
Your option is to use an unmanaged extension together with the Transaction Event API.
You are able to intercept an incoming transaction and read all the data in it. If the transaction breaks your rules, you can discard it.
I recommend using the GraphAware framework for that.
Here is a great article about it: http://graphaware.com/neo4j/transactions/2014/07/11/neo4j-transaction-event-api.html
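A minimal sketch of such a handler (the Neo4j 2.x/3.x Transaction Event API; the relationship name and the rule check are placeholders for your existing checker):

    import org.neo4j.graphdb.GraphDatabaseService;
    import org.neo4j.graphdb.Relationship;
    import org.neo4j.graphdb.event.TransactionData;
    import org.neo4j.graphdb.event.TransactionEventHandler;

    public class CoherenceHandler implements TransactionEventHandler<Void> {

        @Override
        public Void beforeCommit(TransactionData data) throws Exception {
            // Inspect what the transaction is about to change
            for (Relationship r : data.createdRelationships()) {
                if ("incompatible".equals(r.getType().name()) && breaksCoherence(r)) {
                    // Throwing here rolls the whole transaction back
                    throw new Exception("Coherence violation: " + r);
                }
            }
            return null;
        }

        @Override
        public void afterCommit(TransactionData data, Void state) { }

        @Override
        public void afterRollback(TransactionData data, Void state) { }

        private boolean breaksCoherence(Relationship r) {
            // Placeholder for the checker already implemented in your server plugin
            return false;
        }

        // Registration, e.g. at extension start-up:
        public static void register(GraphDatabaseService db) {
            db.registerTransactionEventHandler(new CoherenceHandler());
        }
    }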
Is it possible to find out the updates/modifications/changes done to a neo4j DB over a time interval?
The Neo4j DB will be polled at periodic intervals to find the changes that happened to it over that time period.
These changes then have to be synced with other DBs. This is the real task.
Here, changes include the addition, update, and deletion of nodes, relationships, and properties.
How do we track the changes made in a particular timeframe? Not all nodes and relationships have timestamps set on them.
Add a timestamp field to each of your nodes and relationships that stores timestamp() when they are created. Then write a Cypher query to bring back all nodes and relationships within the given time range, for example:
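Something along these lines (the label and property names are illustrative; relationships work the same way via a created property on each one):

    // at creation time
    CREATE (n:Item {name: {name}, created: timestamp()})

    // at poll time: everything created inside the window
    MATCH (n)
    WHERE n.created >= {from} AND n.created < {to}
    RETURN n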
EDIT
There are two ways of implementing this synchronization.
Option 1
If you can use Spring Data Neo4j then you can use the lifecycle events as explained here to intercept the CUD operations and do the necessary synchronization either synchronously or asynchronously.
Option 2
If you can't use Spring, then you need to implement the interception code yourself. The best way I can think of is to publish all the CUD operations to a topic and then write subscribers that each synchronize one of the stores. In your case you would have Neo4jSubscriber, DbOneSubscriber, Db2Subscriber, etc. (a sketch of this shape follows).
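A sketch of the shape this could take, with a trivial in-process topic standing in for a real message broker (all names are illustrative):

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;

    public class ChangeBus {

        // one CUD operation published to the topic
        public static class ChangeEvent {
            public enum Kind { CREATE, UPDATE, DELETE }
            public final Kind kind;
            public final String entityId;
            public ChangeEvent(Kind kind, String entityId) {
                this.kind = kind;
                this.entityId = entityId;
            }
        }

        // e.g. Neo4jSubscriber, DbOneSubscriber, Db2Subscriber
        public interface ChangeSubscriber {
            void onChange(ChangeEvent event);
        }

        private final List<ChangeSubscriber> subscribers = new CopyOnWriteArrayList<>();

        public void subscribe(ChangeSubscriber s) { subscribers.add(s); }

        // every intercepted CUD operation goes through here
        public void publish(ChangeEvent e) {
            for (ChangeSubscriber s : subscribers) {
                s.onChange(e);
            }
        }
    }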
There is also a pattern called a time tree, where you use year, month, and day nodes to track changes; you can use it to get at the history as well.
You also need to make sure you set the changing attributes/properties on the related object nodes when linking them to the day, month, or year nodes, for example:
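A hand-rolled sketch of the day-node linking (the GraphAware TimeTree module can manage such a tree for you; the names here are illustrative):

    // Ensure the day node exists, then link the changed item to it
    MERGE (y:Year {value: {year}})
    MERGE (y)-[:HAS_MONTH]->(m:Month {value: {month}})
    MERGE (m)-[:HAS_DAY]->(d:Day {value: {day}})
    WITH d
    MATCH (n:Item {id: {itemId}})
    MERGE (n)-[c:CHANGED_ON]->(d)
    SET c.at = timestamp()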
I hope this helps someone.
I'm building an app that needs to store a fair amount of events that the users carry out. (Think LOTS as in millions per month).
I need to report on these events (total of type x in the last month, etc.) and need something resilient and fast.
I've toyed with Redis etc. to store aggregates of the data, but this could just mean that I'm building up a massive store of single-figure aggregates that aren't rebuildable.
Whilst this isn't a bad solution, I'm looking at storing the raw event data in tables that I can then query on a needs basis, and potentially generate aggregate counters on a periodic basis. This would thus give me the ability to add counters over time, and also carry out ad-hoc inspections on what is going on, something which aggregates don't allow.
The question is, what is the best way to do this? I obviously don't want to have to create a model for each table (which is what Rails would prefer), so do I just create the tables and interact with raw SQL on a needs basis, or is there some other choice for dealing with this sort of data?
I've worked on an app that had that type of data flow, and the solution was the following:
-> store everything
-> create aggregates
-> delete everything after a short period (1 week or something) to free up resources
So you can simply store events with Rails, have a background script (cron + SQL) build the aggregates, read the aggregates with Rails, and run yet another background script for raw-event deletion. A sketch of the two background steps follows below.
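In plain SQL, the two background steps could look something like this (Postgres-flavored; the table and column names are made up):

    -- roll yesterday's raw events up into a daily counter table (run from cron)
    INSERT INTO event_daily_counts (event_type, day, total)
    SELECT event_type, DATE(created_at), COUNT(*)
    FROM events
    WHERE created_at >= CURRENT_DATE - INTERVAL '1 day'
      AND created_at <  CURRENT_DATE
    GROUP BY event_type, DATE(created_at);

    -- drop raw events once the retention window has passed
    DELETE FROM events
    WHERE created_at < CURRENT_DATE - INTERVAL '7 days';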
Also... Rails and performance don't usually go hand in hand ;)