Are there best practices for modelling data in a graph database? (I am considering ArangoDB right now, but the question would apply to other platforms.) Here is a practical case to illustrate my question:
Assume we are creating a centralised contact list for users. Each user has contacts, but some contacts can be shared between users, e.g. John knows Mary, and Marc knows Mary. I would thus have 3 nodes (John, Mary and Marc), but John should only see his relationship to Mary, not Marc's relationship to Mary.
So how should a full graph be designed in order to support user access to their information?
Option 1: Create 1 graph per user. That way, I know exactly who can see what (I could for example prefix all my collections with the user id). That would be simple but would duplicate a lot of data (e.g. if I put all my family in the db, my brother will do too, creating twice the same data, in different graphs)
Option 2: Create 1 general graph with Contact nodes, plus User nodes. I would have the contact John, Mary and Marc connected, but the User node representing John, would be linked to the Contact nodes John and Mary only. That way I would know to get only the contact nodes that are connected to the User node I am focusing on.
The problem is that edges cannot be linked to the User node (I cannot have an edge going from a node to an edge...can I?). So I would have to add an attribute of user_id to all the edges in order to only fetch the ones relevant to the current user.
This is slightly nicer as I do not have to duplicate nodes, but I would still have to duplicate edges as they would be user specific
Option 3: Do it SQL like with a Rights table, maintaining a list of Contact ids along with what user can see what Node and what Edge (heavy on joins)
Option 4: ???
As in everything, there are many ways to reach a solution but I was wondering what was considered best practice to balance cleanliness of approach and performance for insertion/deletion...knowing that performance might be platform dependent
I would suggest an Option 4:
First, I would not distinguish between User and Contact nodes; all of them should be Contact nodes.
If you create a new user, you basically create a new Contact for them (or use an existing one) and connect your application's authentication to this specific Contact.
Then you can use directed edges to create the contact list for a user.
Say you have two users, John and Mary. Then John can add Mary to his contact list, but nothing changes on Mary's side. If she wants to add John, you add a second edge.
If you want symmetrical contacts only (if John adds Mary to his list, he should automatically appear in her list), you simply ignore the edge direction in your queries.
If you now want to get the contacts for John, you can do so by selecting the neighbors of John.
In ArangoDB this can be realized with two collections, say Contact and Knows, where Knows holds the edges.
The following code pasted into arangosh creates your situation described above:
db._create("Contact");
db._createEdgeCollection("Knows");
db.Contact.save({_key: "John", mail: "john@example.com"});
db.Contact.save({_key: "Mary", mail: "mary@somewhere.com"});
db.Contact.save({_key: "Marc", mail: "marc@somewhereelse.com"});
db.Knows.save("Contact/John", "Contact/Mary", {});
db.Knows.save("Contact/Marc", "Contact/Mary", {});
To query the contact list for user John:
db._query('RETURN NEIGHBORS(Contact, Knows, "John", "outbound")').toArray()
This should return Mary and no information about Marc.
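To make the difference between one-way and symmetrical contact lists concrete, here is a minimal in-memory sketch in Python. It is not ArangoDB code; the `neighbors` helper and its `direction` values are made up for illustration and only mirror the idea of following or ignoring the edge direction.

```python
# Edges mirror the Knows collection above: (from, to) pairs of contact keys.
knows = [("John", "Mary"), ("Marc", "Mary")]


def neighbors(edges, start, direction="outbound"):
    """Return the contacts adjacent to `start`, honoring the requested direction."""
    found = set()
    for src, dst in edges:
        if src == start and direction in ("outbound", "any"):
            found.add(dst)
        if dst == start and direction in ("inbound", "any"):
            found.add(src)
    return sorted(found)


john_contacts = neighbors(knows, "John")          # one-way list: only edges John created
mary_one_way = neighbors(knows, "Mary")           # Mary has not added anybody herself
mary_symmetric = neighbors(knows, "Mary", "any")  # direction ignored

print(john_contacts, mary_one_way, mary_symmetric)
```

With directed lists Mary's own list stays empty until she adds an edge herself; once the direction is ignored she sees both John and Marc, while John still only sees Mary.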
If you do not want to merge Contacts and User accounts as I suggested, you could also keep them in separate collections. In this case you have to slightly modify the edges and the query:
db.Knows.save("User/John", "Contact/Mary", {});
db.Knows.save("User/Marc", "Contact/Mary", {});
db._query('RETURN NEIGHBORS(User, Knows, "John", "outbound")').toArray()
This should give the same result.
Edit:
Regarding your question in Option 2:
In ArangoDB it is actually possible to point edges to other edges; however, the built-in graph functionality will then treat the edges pointed to as if they were nodes. This means it does not follow their direction automatically. But you can use these resulting edges in further AQL statements and continue the search with AQL features.
I have a schema which looks like below:
A customer is linked to another customer by a SIMILAR relationship that carries a similarity score.
Example: (c1:Customer)-->(c2:Customer)
An Email node is connected to each customer with relationship MAIL_AT with the following node properties:
{
"active_email_address": "a@mail.com",
"cibil_email_addresses": [
"b@mail.com", "c@mail.com"
]
}
Example: (e1:Email)<-[:MAIL_AT]-(c1:Customer)-[:SIMILAR]->(c2:Customer)-[:MAIL_AT]->(e2:Email)
A Risk node with some risk-related properties (below) is related to the customer with relationship HAS_RISK:
{
"f0_score": 870.0,
"pta_score": 430.0
}
A Fraud node with some fraud-related properties (below) is related to the customer with relationship IS_FRAUD:
{
"has_commited_fraud": true
}
My Objectives:
To find the customers with common email addresses (irrespective of whether they are active or secondary).
My tentative solution:
MATCH (email:Email)
WITH email.cibil_email_addresses + email.active_email_address AS emailAddress, email
UNWIND emailAddress AS eaddr
WITH DISTINCT eaddr AS deaddr, email
UNWIND deaddr AS eaddress
MATCH (customer:Customer)-[]->(someEmail:Email)
WHERE eaddress IN someEmail.cibil_email_addresses + someEmail.active_email_address
WITH eaddress, COLLECT(customer.customer_id) AS customers
RETURN eaddress, customers
Problem: It is taking forever to execute this. I understand that working with lists takes time; however, I am flexible about changing the schema (if suggested). Should I break the email addresses into separate nodes? If yes, how do I break cibil_email_addresses into different nodes, given that their number can vary? Should I create two nodes with different cibil email addresses and connect both of them to the customer with a relationship HAS_CIBIL_EMAIL? (Is this a valid schema design?) Also, it is possible that one customer's active_email_address is present in another customer's cibil_email_addresses. I'm trying to find a synthetic identity attack. PS: If there is an APOC procedure that can help achieve this and the points below, please suggest it with an example.
In production, given a new customer with email addresses, risk values and a similarity score, and given that other customers may or may not be tagged with fraud_status, I want to check whether this new person falls into a fraud ring. PS: If I need to use GDS to solve this, please suggest it with examples.
If I were to do this same exercise with some other node, such as Address, which may match only partially and which also holds a list of historical addresses, what would be the ideal approach?
I know I'm tagging someone in my question, but that person seems to be the only one active with respect to Cypher on StackOverflow. @cybersam, any help?
Thanks.
This should work:
MATCH (e:Email)
UNWIND (e.cibil_email_addresses + e.active_email_address) AS address
WITH address, COLLECT(e) AS es
UNWIND es AS email
MATCH (email)<-[:MAIL_AT]-(cust)
RETURN address, COLLECT(cust) AS customers
The WITH clause takes advantage of the aggregating function COLLECT to automatically collect all the Email nodes containing the same address, using address as the grouping key.
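The same grouping idea can be sketched outside the database. The following Python snippet is only an illustration with made-up sample data (customer ids and addresses are hypothetical): every address, active or secondary, becomes a grouping key, and the customers behind it are collected once, instead of re-scanning all Email nodes per address.

```python
from collections import defaultdict

# Hypothetical sample data: customer id -> (active address, secondary addresses).
emails = {
    "c1": ("a@mail.com", ["b@mail.com", "c@mail.com"]),
    "c2": ("b@mail.com", []),
    "c3": ("d@mail.com", ["c@mail.com"]),
}

# Group by address, the same role `address` plays as grouping key for COLLECT.
customers_by_address = defaultdict(set)
for customer, (active, secondary) in emails.items():
    for address in secondary + [active]:
        customers_by_address[address].add(customer)

# Keep only the addresses that are shared by more than one customer.
shared = {
    address: sorted(customers)
    for address, customers in customers_by_address.items()
    if len(customers) > 1
}
print(shared)
```

Here c1 and c2 are linked through b@mail.com even though it is a secondary address for one and the active address for the other, which is the "irrespective of active and secondary" case from the question.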
You should only ask one question at a time. You have a couple of other questions at the bottom. If you continue to need help with them, please create new questions.
I am building a social network with Neo4j. It includes:
Node labels: User, Post, Comment, Page, Group
Relationships: LIKE, WRITE, HAS, JOIN, FOLLOW,...
It is like Facebook.
Example: user A follows user B. When B performs an action (likes a post, comments, follows another user, follows a page, joins a group, etc.), that action should be sent to A as a notification. Similarly, users C, D and E who follow B will receive the same notification.
I don't know how to design the data model for this problem. I have thought of these solutions:
1. Create Notification nodes for every user. When an action is executed, create n notifications for the n followers. Benefit: we can check whether each user has seen the notification, right? But the number of nodes increases quickly, because every action adds n nodes.
2. Run a query on every notification API call (from the client application). This query only fetches the actions of the followed users within a given time window (24 hours, or 2 to 3 days). But then we cannot track whether a follower has seen a notification, and this query may slow the server down.
3. Create a limited number of nodes per user, such as 20 or 30.
4. Create unlimited nodes (each including the time of the action) and delete those whose time of action is more than 24 hours old (the expiry time could also be 2 or 3 days).
Can anyone help me solve this problem? Should I choose one of these solutions, or is there a better way?
I believe the best approach is option 1. As you said, you will be able to know whether the follower has read the notification. As for the number of notification nodes per follower: this problem is known as "supernodes" or "dense nodes", i.e. nodes that have too many connections.
The book Learning Neo4j (by Rik Van Bruggen, available for download on the Neo4j web site) talks about the "dense node" or "supernode" and says:
"[supernodes] becomes a real problem for graph traversals because the graph
database management system will have to evaluate all of the connected
relationships to that node in order to determine what the next step
will be in the graph traversal."
The book proposes a solution that consists of adding meta nodes between the follower and the notifications (in your case). Each meta node should have at most about a hundred connections. When the current meta node reaches 100 connections, a new meta node must be created and added to the hierarchy, as in the book's figure, which shows an example with popular artists and their fans.
I think you should not worry about it right now. If your follower nodes become a problem in the future, you will be able to refactor your database schema. For now, keep things simple!
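Should it ever become necessary, the meta-node idea could look roughly like the following Python sketch. This is an in-memory illustration only, not Neo4j code and not the book's implementation: the `Follower` class and `MAX_LINKS` name are hypothetical, and the meta nodes are kept in a flat list rather than a real hierarchy.

```python
MAX_LINKS = 100  # cap per meta node, taken from the description above


class Follower:
    """A follower whose notifications hang off capped meta nodes (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.meta_nodes = []  # each meta node is a list of notifications

    def add_notification(self, notification):
        # Open a new meta node when there is none yet or the current one is full.
        if not self.meta_nodes or len(self.meta_nodes[-1]) >= MAX_LINKS:
            self.meta_nodes.append([])
        self.meta_nodes[-1].append(notification)


follower = Follower("A")
for i in range(250):
    follower.add_notification(f"B performed action {i}")

sizes = [len(meta) for meta in follower.meta_nodes]
print(sizes)
```

The follower then has one connection per meta node instead of one per notification, and no single node carries more than `MAX_LINKS` notification links.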
In the series of posts called "Building a Twitter clone with Neo4j", Max de Marzi describes the process of building the model. Maybe it can help you make the best decisions about your model!
In Ian Robinson's book Graph Databases on page 73, he states: "We can use NEXT and/or PREVIOUS relationships (depending on our preference)..."
My question: what benefit is there to implementing both?
There is no benefit; it will only increase the store size on disk.
Cypher behaves the same when traversing a relationship one way or the other.
In some cases you will also want a LAST relationship, like:
(User)-[:LAST_EVENT]->(:Click)-[:PREVIOUS]->(:Click)-[:ETC....
Here you can choose to have an additional LAST_EVENT relationship between the user and the last event, in addition to the PREVIOUS relationships between the events.
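As a rough illustration of why a single relationship type is enough, here is a small Python sketch with hypothetical event names. Only PREVIOUS is stored, plus a LAST_EVENT shortcut; the "next" event is recovered by reading PREVIOUS against its direction. The reverse lookup is a linear scan here only because a plain dict is one-directional, whereas a graph database can follow a stored relationship from either end.

```python
# Hypothetical click events; only the PREVIOUS relationship is stored.
previous = {"click3": "click2", "click2": "click1", "click1": None}
last_event = {"user": "click3"}  # the LAST_EVENT shortcut


def history(user):
    """Walk from the newest event back to the oldest via PREVIOUS."""
    event, seen = last_event[user], []
    while event is not None:
        seen.append(event)
        event = previous[event]
    return seen


def next_of(event):
    """Follow PREVIOUS against its direction, which is what NEXT would give."""
    return next((e for e, p in previous.items() if p == event), None)


print(history("user"), next_of("click1"))
```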
I'm working with Users assigned to a Grid location
(User)-[:PICK_UP]->(Grid)
With the query
MATCH (u:User)-[:PICK_UP]->(g:Grid)-[:TO]-(g2:Grid)<-[:PICK_UP]-(u2:User)
RETURN g,g2,u,u2
I get the following result:
In the image I have two groups of nodes, which represent the grids and their neighbors, together with their users (red nodes). I would like to group the nearby users by creating relationships from them to a Spot node.
E.g. the first group has grids 34, 40 and 41, with users 1, 4, 5 and 9. I would like to group the users in my query so that I get the result [user1, u4, u5, u9], and then assign those users to a Spot, like this:
Any suggestions?
Thank you !!
The thing to keep in mind is that your (u:User)-[:PICK_UP]->(g:Grid)-[:TO]-(g2:Grid)<-[:PICK_UP]-(u2:User) is matching a specific path, and while you see two groups in the graphical display, there are actually overlapping paths there. Viewing your result in table mode might be helpful.
So onto answering your question! Firstly, this was a tricky one, but a really cool one. I think I've got a good solution:
MATCH path=(grid:Grid)-[:TO]-(other_grid:Grid)
WITH CASE WHEN ID(grid) < ID(other_grid) THEN ID(other_grid) ELSE ID(grid) END AS id_to_reject
WITH collect(DISTINCT id_to_reject) AS ids_to_reject
MATCH (grid:Grid)
WHERE NOT(ID(grid) IN ids_to_reject)
CREATE (spot:Spot)
WITH grid, spot
MATCH (grid)-[:TO|PICK_UP*1..6]-(user:User)
MERGE (user)-[:AT_SPOT]->(spot)
The first thing the query does is compare all Grid nodes that are related to each other. For each of these pairs it passes on the ID() of the Grid node which is greater. The IDs which aren't in the list are therefore the smallest in the group and can act as a representative of the group. For each one of these representative Grid nodes we create a Spot node.
Using that node, it finds all User nodes within six hops via both TO and PICK_UP relationships. That should give all users in the group (both the users of our representative grid as well as the users of the other grids).
Then it's a simple matter to MERGE a relationship from each user to the Spot.
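To see the representative trick in isolation, here is a small Python sketch. The data is hypothetical (the user-to-grid assignments and the extra grid 50 are assumptions loosely based on the first group in the question), and the reachability search is unbounded here rather than limited to six hops.

```python
# Hypothetical data loosely based on the first group in the question.
to_edges = [(34, 40), (34, 41)]          # TO relationships between grids
pick_up = {1: 34, 4: 40, 5: 41, 9: 41}   # user id -> grid id
grids = {34, 40, 41, 50}                 # grid 50 has no TO neighbor

# Step 1: from every TO pair, reject the larger id; what remains represents a group.
rejected = {max(a, b) for a, b in to_edges}
representatives = sorted(grids - rejected)


def reachable(start):
    """Grids reachable from `start` over TO edges, ignoring direction."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for a, b in to_edges:
            if node in (a, b):
                other = b if node == a else a
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
    return seen


# Step 2: one spot per representative, holding the users of every grid in its group.
spots = {}
for rep in representatives:
    members = reachable(rep)
    spots[rep] = sorted(user for user, grid in pick_up.items() if grid in members)

print(spots)
```

Two things worth noticing in this sketch: a grid without any TO neighbor is never rejected, so it gets its own (possibly empty) spot, and a grid survives whenever it is smaller than all of its direct neighbors, so a group whose grids are not all adjacent to its smallest grid could end up with more than one representative.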
I have read the Neo4j manual and seen the numerous short examples based on the movie graph. I have also installed it locally and played with Cypher.
Here is the setup:
I have the following nodes: Movies (with name and id, owned by friend), Actors (with name and id), Directors (with name and id), Genre (with id and name)
Relations are: Actors acted in Movies (1 movie, many actors), Directors directed a Movie (1 director per movie, but a director can direct many movies), and Movies have several Genres (many to many)
1) Owned by friend: in the LOAD CSV example, USA is modelled as a node rather than a property. Is there a logical reason why it is better to make it a node rather than a property, as I did?
2)
What I want to search for is similar to the answer given to this question:
Nearest nodes to a give node, assigning dynamically weight to relationship types
However, I do not have a weight on the relationship; it is more a case of "go find the first given nodes connected to it".
Note that a movie can only be owned by 1 person.
Given the movie title "Spider-Man" (which, for the purpose of this example, is owned by Frank), go find the next occurrence of a movie that is owned by John.
So after reading the Neo4j manual, I believe that I don't need to specify which relationship to traverse but can just go find the next movie that meets my criteria, right?
So, following the above link:
MATCH (n:Start { title: 'Spider-Man' }),
(n)-[:CONNECTED*0..2]-(x)
RETURN x
So: go to the node Spider-Man and find me x as long as it is connected. But I got stumped by *0..2, because it is a range. What if I just want to say "go find me the first one that is owned by John"?
3) Following up on #2: how do I insert the filter "owned by John"?
There are a number of things in your question that don't quite make sense. Here's a stab at an answer.
1) Making 'USA' a node rather than a property is useful if you want to search based on country. If 'USA' is a node, you are able to limit your search by starting at the 'USA' node. If you don't care to do this, then it doesn't really matter. It may also save a small amount of space for longer country names to store the name once and link to it via relationships.
2) Your example doesn't match your described graph. I can't really speak to it without a better example.
3) This is probably easy to answer once you improve your example.
OK. Based on the comments to the answer, here's what you need. To find one movie owned by John that is connected via common actors, directors, etc. to the movie Spider-Man owned by Frank (that is, sub-graphs like (movie)<--(actor)-->(movie)), you can write:
MATCH (n:Movie {title : 'Spider-Man', owned_by : 'Frank'})-[*2]-(m:Movie {owned_by : 'John'})
RETURN m LIMIT 1
If you want more results, alter or remove the LIMIT on the RETURN clause. If you want to allow longer chains like (movie)<--(actor)-->(movie)<--(director)-->(movie), you can increase the number of relationships matched (the *2) to 4, 6, 8, etc. You probably shouldn't just write the relationship part of the MATCH clause as -[*]-, because an unbounded pattern can end up exploring a huge number of paths and may take a very long time to finish.
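For intuition, the two-hop match can be imitated in a few lines of Python. This is a sketch over a tiny made-up graph (the relationship list and ownership table are illustrative assumptions, not your data): start at the movie, step to any connected person, then step to another movie and keep it only if the owner matches.

```python
# Hypothetical graph: each relationship connects a person to a movie.
relationships = [
    ("Tobey Maguire", "Spider-Man"),
    ("Tobey Maguire", "Seabiscuit"),
    ("Sam Raimi", "Spider-Man"),
    ("Sam Raimi", "Darkman"),
]
owned_by = {"Spider-Man": "Frank", "Seabiscuit": "John", "Darkman": "Mary"}


def movies_two_hops_away(start, owner):
    """Movies owned by `owner` that share a person (actor, director, ...) with `start`."""
    # Hop 1: everybody connected to the start movie, whatever the relationship type.
    people = {person for person, movie in relationships if movie == start}
    # Hop 2: other movies of those people, filtered by owner.
    return sorted(
        {
            movie
            for person, movie in relationships
            if person in people and movie != start and owned_by.get(movie) == owner
        }
    )


print(movies_two_hops_away("Spider-Man", "John"))
```

Taking only the first element of the returned list corresponds to the LIMIT 1 in the query.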