I build a social network with Neo4j, it includes:
Node labels: User, Post, Comment, Page, Group
Relationships: LIKE, WRITE, HAS, JOIN, FOLLOW,...
It is like Facebook.
example: A user follow B user: when B have a action such as like post, comment, follow another user, follow page, join group, etc. so that action will be sent to A. Similar, C, D, E users that follow B will receive the same notification.
I don't know how to design the data model for this problem and I have some solutions:
create Notification nodes for every user. If a action is executed, create n notification for n follower. Benefit: we can check that this user have seen notification, right? But, number of nodes quickly increase, power of n.
create a query for every call API notification (for client application), this query only get a action list of users are followed in special time (24 hours or a 2, 3 days). But Followers don't check this notification seen or yet, and this query may make server slowly.
create node with limited quantity such as 20, 30 nodes per user.
Create unlimited nodes (include time of action) on 24 hours and those nodes has time of action property > 24 hours will be deleted (expire time maybe is 2, 3 days).
Who can help me solve this problem? I should chose which solution or a new way?
I believe that the best approach is the option 1. As you said, you will be able to know if the follower has read or not the notification. About the number of notification nodes by follower: this problem is called "supernodes" or "dense nodes" - nodes that have too many connections.
The book Learning Neo4j (by Rik Van Bruggen, available for download in the Neo4j's web site) talk about "Dense node" or "Supernode" and says:
"[supernodes] becomes a real problem for graph traversals because the graph
database management system will have to evaluate all of the connected
relationships to that node in order to determine what the next step
will be in the graph traversal."
The book proposes a solution that consists in add meta nodes between the follower and the notification (in your case). This meta node should got at most a hundred of connections. If the current meta node reaches 100 connections a new meta node must be created and added to the hierarchy, according to the example of figure, showing a example with popular artists and your fans:
I think you do not worry about it right now. If in the future your followers node becomes a problem then you will be able to refactor your database schema. But at now keep things simple!
In the series of posts called "Building a Twitter clone with Neo4j" Max de Marzi describes the process of building the model. Maybe it can help you to make best decisions about your model!
Related
I am using total count of devices as the "server attributes" at customer entity level that is in turn being used for Dashboard widgets like in "Doughnut charts". Hence to get the total count information, I have put a rule chain in place that handles "Device" Addition / Assignment event to increment the "totalDeviceCount" attribute at customer level. But when the device is getting deleted / UnAssigned , I am unable to get access to the Customer entity using "Enrichment" node as the relation is already removed at the trigger of this event. With this I have the challenge of maintaining the right count information for the widgets.
Has anyone come across similar requirement? How to handle this scenario?
Has anyone come across similar requirement? How to handle this scenario?
What you could do is to count your devices periodically, instead of tracking each individual addition/removal.
This you can achieve using the Aggregate Latest Node, where you can indicate a period (say, each minute), the entity or devices you want to count, and to which variable name you want to save it.
This node outputs a POST_TELEMETRY_REQUEST. If you are ok with that then just route that node to Save Timeseries. If you want an attribute, route that node to a Script Transformation Node and change the msgType to POST_ATTRIBUTE_REQUEST.
Let's assume this Neo4j data construct.
(:Store)<-[:FROM]-(:Notification)-[:FOR]->(//users//:User)
Which should serve as, for example, a notification for users on a new sale in a store.
Such a notification may be addressed to a large number of users simultaneously, in this case I can see two approaches for this data modelling:
Create separate :Notification with relationships for each of the :User's; All connected to single :Store node. So when notification is received - relationships and :Notification node is removed.
Approach which i thought should be more effective performance wise and about which my question is: Create a single :Notification node connected to a store and multiple [:FOR]'s for different users. Notification received - :FOR for this user removed, if no :FOR's left - :Notification itself removed .
So my questions are:
Am I correct in assuming that the 2nd one is a better practice?
How can I generally, after deleting a relationship, check if there are more such relationships, and do it without making Neo4j match all connected ones of this type, i.e check if there is at least one relationship of this type left?
Definitely having fewer objects created in the first place would be more efficient in general. It will result in 2n fewer objects (1 node and relationship per user). The exception would be if you had so many users per notification that the node density was too high. To avoid that though you could simply create additional notifications at a particular user threshold.
The following query is adapted from #Luanne's answer to this question which i though was pretty slick.
It presumes that besides the User nodes, the only other connection to the Notification nodes is the store nodes. If the degree of nodes connected to the Notification node is one then it must be just a Store node remaining. Remove the relationship to the Store node and remove then remove the Notification node.
MATCH (n:Notification {name: 'Notification One'} )-[store_rel:FROM]->(s:Store)
WHERE size((n)--())=1
DELETE store_rel, n
I am used to have tables for ongoing activities from my former life as relational DB guy. I am wondering how I would store ongoing information like transactions, logs or whatever in a neo4j DB. Let#s assume I have an account, which is been assigned to a user A:
(u:User {name:"A"})
I want to keep track on transactions he does, e.g. deducting or adding a value:
(t:Transaction {value:"-20", date:timestamp()})
Would I do for every transaction a new node and assign it to the user:
(u) -[r:changeBalance]-> (t)
In the end I might have lots of nodes which are assigned to the user and keep one transaction each resulting in lots of nodes with only one information. I was pondering if a query that has a limit on the last 50 transactions (limit 50, sort by t.date) might still have to read all available transaction nodes to get the total sorting queue before the limit applies - this seems a bit unperformant.
How would you model a list of actions in a neo4j DB? Any hint is very appreciated.
If you used a simple query like the following, you would NOT be reading all Transaction nodes per User.
MATCH (u:User)-[r:ChangeBalance]->(t:Transaction)
RETURN u, t
ORDER BY t.date;
You'd only be reading the Transaction nodes that are directly related to each User (via a ChangeBalance relationship). So, the performance would not be as bad as you are afraid it might be.
Although everything is fine with your query - you are reading only transactions, that are related to this specific user - this approach can be improved.
Let's imagine that, for some reason, you application will work 5 years and you have user that have 10 transactions per day. It will result in ~18250 transaction connected to single node.
This is not great idea, from data model perspective. In this case if you want to filter result (get 50 latest transaction) on some non-indexed field, then this will result in full 18250 node traverse.
This can be solved by adding additional relations to database.
Currently you have such graph: (user)-[:HAS]->(transasction)
( user )
/ | \
(transasction1) (transaction2) (transaction3)
You can add additional relation between transactions, to specify sequence of events.
Like that: (transaction)-[:NEXT]->(transasction)
( user )
/ | \
(transasction1)-(transaction2)-(transaction3)
Note: there is no need to have additional PREVIOUS relation, because Neo4j store relationship pointers in both directions, so traversing backwards can be done at same speed as forward.
And maintain relations to first and last user transasctions:
(user)-[:LAST_TRANSACTION]->(transaction)
(user)-[:FIRST_TRANSACTION]->(transaction)
This allows you to get last transaction in 1 hop. And then latest 50 with additional 50 hops.
So, adding additional complexity, you can traverse/manipulate with your data in more efficient ways.
This idea come from EventStore database (and similar to them).
Moreover, with such data model User balance can be aggerated by wrapping up sequence of transaction. This can give you a nice and fast way to get user balance at any point.
Getting latest 50 transaction in this model can look like this:
MATCH (user:User {id: 1} WITH user
MATCH (user)-[:LAST_TRANSACTION]->(last_transaction:Transaction) WITH last_transaction
MATCH (last_transasction)<-[:NEXT*0..50]-(transasctions:Transaction)
RETURN transactions
Getting total user balance can be:
MATCH (user:User {id: 1}) WITH user
MATCH (user)-[:FIRST_TRANSACTION]->(first_transaction:Transaction) WITH first_transaction
MATCH (first_transaction)-[:NEXT*]->(transactions:Transaction)
RETURN first_transaction.value + sum(transasctions.value)
Are there sets of best practices to approach how to model data in a graph database (I am considering arangodb right now but the question would apply to other platforms)? Here is a practical case to illustrate my question:
Assuming we are creating a centralised contact list for users. Each user has contacts but some contacts could be common to users e.g. John knows Mary, and Marc knows Mary. I would thus have 3 nodes (John, Mary and Marc) but John should only see his relationship to Mary, not Marc's relationship to Mary
So how should a full graph be designed in order to support user access to their information?
Option 1: Create 1 graph per user. That way, I know exactly who can see what (I could for example prefix all my collections with the user id). That would be simple but would duplicate a lot of data (e.g. if I put all my family in the db, my brother will do too, creating twice the same data, in different graphs)
Option 2: Create 1 general graph with Contact nodes, plus User nodes. I would have the contact John, Mary and Marc connected, but the User node representing John, would be linked to the Contact nodes John and Mary only. That way I would know to get only the contact nodes that are connected to the User node I am focusing on.
The problem is that edges cannot be linked to the User node (I cannot have an edge going from a node to an edge...can I?). So I would have to add an attribute of user_id to all the edges in order to only fetch the ones relevant to the current user.
This is slightly nicer as I do not have to duplicate nodes, but I would still have to duplicate edges as they would be user specific
Option 3: Do it SQL like with a Rights table, maintaining a list of Contact ids along with what user can see what Node and what Edge (heavy on joins)
Options 4: ???
As in everything, there are many ways to reach a solution but I was wondering what was considered best practice to balance cleanliness of approach and performance for insertion/deletion...knowing that performance might be platform dependent
i would suggest an Option 4:
First i would not distinguish between User and Contact Nodes, but all of them should be Contact Nodes.
If you create a new User you basically create a new Contact for him (or use an existing one) and connect your Applications Authentication to this specific Contact.
Then you can use directed edges to create the contact list for a user.
Say you have two users John and Mary, than John can add Mary to his contact list, but Mary would not recognize. If she wants to add John this means you will add a second edge.
If you want to have symmetrical contacts only (if John adds Mary to his list, he should automatically appear in her list) you simply ignore this direction in your queries.
If you now want to get the contacts for John this can be done by selecting the Neighbors of John.
In ArangoDB this can be realized with two collections, say Contact and Knows, where Knows holds the edges.
The following code pasted into arangosh creates your situation described above:
db._create("Contact");
db._createEdgeCollection("Knows");
db.Contact.save({_key: "John", mail: "john#example.com"});
db.Contact.save({_key: "Mary", mail: "mary#somewhere.com"});
db.Contact.save({_key: "Marc", mail: "marc#somewhereelse.com"});
db.Knows.save("Contact/John", "Contact/Mary", {});
db.Knows.save("Contact/Marc", "Contact/Mary", {});
To query the contact list for user John:
db._query('RETURN NEIGHBORS(Contact, Knows, "John", "outbound")').toArray()
Should give Mary as result, no information about Marc.
If you do not want to join Contacts and User Accounts as i suggested you could also separate them in different collections, in this case you have to slightly modify the edges and the query:
db.Knows.save("User/John", "Contact/Mary", {});
db.Knows.save("User/Marc", "Contact/Mary", {});
db._query('RETURN NEIGHBORS(Users, Knows, "John", "outbound")').toArray()
should give the same result.
Edit:
Regarding your question in Option 2:
In ArangoDB it is actually possible to point edges to other edges, however build in graph functionality will now consider the edges pointed to as if they were nodes. This means they do not follow their direction automatically. But you can use these resulting edges in further AQL statements and continue the search with AQL features.
I'm exploring potential use cases for neo4j, and I find that the relationship model is great, but I'm curious if the database can support something along the lines of a business transaction log.
For instance, a video rental store:
Customer A rents Video A on 01/01/2014
Customer A returns Video A on 01/20/2014
Customer B rents Video A on 01/25/2014
Customer B returns video A on 02/15/2014
Customer C rents Video A on 03/10/2014
etc...
The business requirement would be to track all rental transaction relationships relating to the Video A node.
This seems to be technically possible. Would one create a new relationship for every time that a new rental occurs? Are there better ways to approach this? Is this a misuse of the technology?
Nice! This is the exact use case that led me to develop FlockData (github link). FD uses Neo4J to track event type activity against a domain document (Rental in your example). Then use Tags to create Nodes that represent Meta Data associated with the domain doc (Movie/Person). You have an event node for each change in state of the Rental. Couple of graphs over here on LinkedIn showing "User Created", "User Approved" and "User Audited".
FD uses 3 databases to achieve its goals - Neo4j for the network of relationships, KV store for the bulky data (Redis or Riak) and ElasticSearch to let users find their Business Context Document (the Rental) via free text.
In terms of your specific question exercise caution with nodes that have a lot of relationships. Checkout this article on modelling dates. Peter Neubauer has a similar article somewhere in the Neo4j docs.
I'd look at it depending on what you're trying to get out of it. If you're looking to develop a recommendation engine, or see the relationships between users and/or movies, a graphDB is a pretty natural solution. If you're looking at tracking the state changes of Video A over time, a Temporal database is modeled for that (http://en.wikipedia.org/wiki/Temporal_database). For a straight up transactional system, a traditional relational database will work easily. Personally, I think you'll have better options with a graphDB. In your example, you would have 3 consumer nodes, 1 video node, 3 relationships of type :RENTS and two of :RETURNS. You'd want to make sure that your property model supports the same user re-renting the same movie (store the date in an array, not a single value). Just some thoughts...