We have six different types of permissions for content nodes. If we want to query neo4j for the content by the permission type, is it better to store the permissions as an attribute for each content node, or as a separate node to which each piece of content has a relationship?
This is a good data modeling question, and the truth is it depends.
I'm personally in favor of storing them as a separate node, so you don't have to traverse all nodes (or at least all user nodes) to find the permissions you are looking for, especially if you start to get a lot of users and will be looking for all users with permission X.
This also adds a level of normalization, as well as the ability to perform counts easily.
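As a rough illustration of that trade-off, here is a minimal in-memory Python sketch. It is not Neo4j code, and the node names and the `READ`/`WRITE` permission labels are made up for the example. With the permission stored as a property, every content node is inspected; with a permission node, you follow only that node's relationships, and a count is just its degree.

```python
from collections import defaultdict

# Hypothetical content nodes; "permission" is stored as a property on each one.
content = {
    "doc1": {"permission": "READ"},
    "doc2": {"permission": "WRITE"},
    "doc3": {"permission": "READ"},
}

def by_property(permission):
    # Property model: every content node has to be inspected.
    return sorted(n for n, props in content.items() if props["permission"] == permission)

# Permission-node model: each permission node keeps its incoming relationships.
permission_nodes = defaultdict(set)
for name, props in content.items():
    permission_nodes[props["permission"]].add(name)

def by_permission_node(permission):
    # Start at the permission node and follow its relationships only.
    return sorted(permission_nodes.get(permission, set()))

def count_for(permission):
    # Counting is the degree of the permission node.
    return len(permission_nodes.get(permission, set()))
```

In a real database an index on the property would also avoid the full scan, so treat this as a picture of the modeling choice rather than a benchmark.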
I have the following architecture.
You will find a duplicated HAS relationship. The main one is between Badge and Skill, as I want to be able to aggregate/count the same Skill across different Badges of the same User.
So, the duplicate relationship is between User and Skill. That is because, for instance, if an Organization wants to know all the skills of single or multiple recipients I would follow the following path:
Org -OWNS-> Badges -IS_AWARDED_TO-> User -HAS-> Skill
//Skill nodes for a specific or multiple user represent each skill contained in every Badge the user was awarded.
However, if I do not add the duplicated relationship HAS between User and Skill, I will follow the following path instead:
Org -OWNS-> Badges -IS_AWARDED_TO-> User -IS_AWARDED-> Badges -HAS-> Skill
//Now I have all skills for a specific or multiple User for every badge awarded
The difference between the two paths is obvious. The first one will result in fewer queries, but the duplication of the relationship is a concern. The second one removes the duplication problem (is it a problem?) but needs more queries. I am still a newbie to Neo4j, so feel free to tell me that both of my approaches seem convoluted and that there is a more optimized way to do what I am trying to do.
Your two models are valid, and you can use both of them.
But as you said, with the first one you duplicate some data. Generally we do that when we have performance issues. Is that the case for you right now?
As a starting point, I recommend model 2 (i.e. without duplication). If you run into issues with it, you can easily change to model 1 (the flexibility of Neo4j is really great for graph refactoring!).
In IT, nothing is free: if you duplicate some data to get better read performance, you will pay for it on writes.
When you write a (badge)-[:HAS]->(skill) relationship, you also need to create a (user)-[:HAS]->(skill) rel (same for update or delete).
So you need to keep this data consistent whenever you update the graph. In effect, it's like maintaining a stored (materialized) view in SQL by hand.
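The write-side cost can be sketched in a few lines of Python. This is an in-memory stand-in for the graph, not Neo4j code, and the function names (`award`, `add_skill_to_badge`, `skills_model_2`) are invented for the illustration. Model 1 has to touch the duplicated user-to-skill set on every write; model 2 does one extra hop at read time instead.

```python
from collections import defaultdict

badge_skills = defaultdict(set)   # (badge)-[:HAS]->(skill)
user_badges = defaultdict(set)    # badges awarded to each user
user_skills = defaultdict(set)    # duplicated (user)-[:HAS]->(skill), model 1 only

def award(user, badge):
    user_badges[user].add(badge)
    # Model 1: the duplicated relationships must be written at the same time.
    user_skills[user] |= badge_skills[badge]

def add_skill_to_badge(badge, skill):
    badge_skills[badge].add(skill)
    # Model 1: every user holding the badge needs the new relationship too.
    for user, badges in user_badges.items():
        if badge in badges:
            user_skills[user].add(skill)

def skills_model_2(user):
    # Model 2: one extra hop at read time, nothing extra to maintain on write.
    return set().union(*(badge_skills[b] for b in user_badges.get(user, ())))
```

Deletes would need the same care in model 1: removing a skill from a badge only removes it from a user if no other badge of that user still carries it.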
I searched and read the capacity documentation, but I can't find figures on the maximum capacity of a single node.
If I have a user with a great many posts, comments, uploads, etc. related to him, is there a maximum number of relationships that I can attach to him?
thanks!
There is not really a maximum / limit.
The relationships are stored in separate structures by type and direction.
For some use cases it might make sense to move some information out to a separate node. It depends on the use cases that you want to support with your graph model.
In my Neo4j project I have Role and Permission entities which represent user roles and permissions. Each User in the system has relationships to appropriate sets of roles and permissions.
I think Role and Permission are some kind of supernodes that can become a major headache from a performance point of view in future.
What is the best practice for this case ? How to reimplement Role and Permission in order to avoid possible issues with supernodes ?
Do you plan to make some aggregate/mass queries based on Roles (i.e. count number of people of certain role, list them)?
If not, and you just want to check whether a specific user has a certain Role, then in my humble opinion it should not cause significant, hard-to-maintain performance issues (as you will traverse only certain relationships of the graph, ignoring the vast majority of the relationships of your "supernodes"). I would stick with the simple design ("premature optimization is the root of all evil" ;) ). Internally, relationships are stored in a linked-list-like structure, so finding the right one may take time on a supernode, even if you restrict the search to a certain relationship type. Once such problems are noticed, splitting the Role nodes using the meta-node approach should do the job (it's described in Learning Neo4j).
If yes, you have a problem. That's probably a field in which RDBMSs are better... Using meta nodes probably won't help, as you will still have to process all of them to list/count all users... So caching that data in a separate store may simply be the best idea...
I'm going to assume that you're just using Neo4j as a permissions lookup data source (like hasPermission(current_user, 'permission_string')) and not tied into any queries to other entities. That can be fine, especially if you have a hierarchical access schema. If that's not true then this might not apply and it would be good to have a clearer idea of what your entities look like.
Since you're likely using permissions throughout your application, and if they're going to grow in size and scope, it could make sense for performance to use some form of caching, such as an in-memory store or Redis.
It might even make sense to generate a denormalized cache of every permission state for every user. So you would evaluate your rules which might be based on hierarchical roles/permissions and come out with a list of "User X has permission Y". Then whenever you change a user or a permission you'd regenerate the cache for that entity, and if you changed a role you would regenerate the cache for all of the associated users and permissions.
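A minimal sketch of such a denormalized cache, in plain Python with made-up roles and users (a dict stands in for whatever store you would actually use, such as Redis). The names `rebuild_user`, `rebuild_role` and `has_permission` are hypothetical.

```python
# Hypothetical source-of-truth data: roles grant permissions, users hold roles.
role_permissions = {"editor": {"edit", "read"}, "viewer": {"read"}}
user_roles = {"alice": {"editor"}, "bob": {"viewer"}}

cache = {}  # user -> frozenset of permissions ("User X has permission Y")

def rebuild_user(user):
    # Re-evaluate the rules for one user and store the flattened result.
    perms = set()
    for role in user_roles.get(user, ()):
        perms |= role_permissions.get(role, set())
    cache[user] = frozenset(perms)

def rebuild_role(role):
    # A role change invalidates every user that holds the role.
    for user, roles in user_roles.items():
        if role in roles:
            rebuild_user(user)

def has_permission(user, permission):
    # Plain key/value lookup, no graph traversal at request time.
    return permission in cache.get(user, frozenset())

# Build the cache once up front.
for u in user_roles:
    rebuild_user(u)
```

For example, after adding a permission to the `viewer` role you would call `rebuild_role("viewer")`, and only the users holding that role are recomputed.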
Also I don't know if I would apply this advice to just Neo4j. If you're talking about a simple key/value lookup then a lot of general purpose databases would be overkill in performance critical situations.
I have a request to develop an application that keeps track of the movements of a certain item (or items). To better demonstrate what the application must do, I drew a diagram (simplified abstraction).
As I never worked with any databases other than the relational ones, I really don't know if I can solve this problem with a graph database.
These questions must be answered by the system:
What was the path that a certain pen drive walked?
I passed some pen drives. Where are they now?
What are the pens I received, where did they come from, and where did they go?
Where are the pens I burned and passed? And with whom?
Any help and suggestions are much appreciated.
Thanks
In Neo4j everything is either a node or a relationship. So it's useful to think: what would be my nodes and relationships?
Here it might be, for example, that every "pen drive", "person" and "location" is a node. Verbs like "walk" or "give" would be your relationships.
In this model, you'd be able to use Cypher to query for things like "give me all location nodes connected to pen nodes by the relationship walk." Or, say, "start at all person nodes and return nodes that have a give relationship to a pen drive node that doesn't have a give relationship connecting back to the starting person node."
This rich graph query language gives you nice algorithms like shortest distance for free, so beyond a transactional record you could determine whether, for example, a pen drive made it from A to B using the optimal path. But as you can see above, "relational joins" do not beget simple queries or descriptions thereof.
When it comes to database design, when the model becomes cumbersome to map mentally, it's going to be a pain to develop too. Design your database based on how you plan to query it. If you're unable to easily explain those queries in terms of Neo4j, it's possible that Neo4j isn't going to be the best fit.
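To make the node/relationship model above a bit more concrete, here is a small Python sketch that treats each "give" as an edge and answers two of the asker's questions. It is only an in-memory illustration with invented people and pen drives, not Cypher and not a Neo4j API.

```python
# Hypothetical "give" relationships: (giver, pen drive, receiver), in time order.
gives = [
    ("ann", "pen1", "bob"),
    ("bob", "pen1", "cid"),
    ("ann", "pen2", "bob"),
    ("bob", "pen2", "ann"),
]

def path_of(pen):
    # Follow the chain of give relationships for one pen drive.
    hops = [(g, r) for g, p, r in gives if p == pen]
    return [hops[0][0]] + [r for _, r in hops] if hops else []

def current_holder(pen):
    # The last receiver on the path is where the pen drive is now.
    path = path_of(pen)
    return path[-1] if path else None

def passed_and_not_returned(person):
    # Pen drives this person gave away with no give relationship leading back to them.
    given = {p for g, p, _ in gives if g == person}
    received = {p for _, p, r in gives if r == person}
    return sorted(given - received)
```

If queries like these are easy to phrase as "start at node X and follow relationship Y", that is a reasonable sign the graph model fits.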
I've created a graph model for a social network and needed some concrete advice regarding the design in regards to scaling. Pardon the n00bness of these questions but I'm not finding very many clear examples out there...
NOTE: the status updates and activity nodes /relationships are linked lists - with the newest entries constantly being placed at the top of the list.
Linked lists allow for news feed generation, but there could be hundreds of records per user - I presume the limit clause isn't sufficient even though the data is in descending order by date. Do I have to keep a separate linked list that only holds the most recent 10 status/activity updates, and constantly replace the head of that list, to get better activity feed generation? Or will one properly sorted list do the job (with a limit clause)?
These nodes all have properties (json data with content, IDs, etc) - how do "global" indexes come into play here so that I can find, for example, users that like Depeche Mode without waiting a lifetime for results? I know how to add a node to an index, just wondering if I'm missing a part of the picture here..
Security - logins and passwords.. I would presume a graph database could store them, but I'd presume it's a security risk at this point - would it be better to keep this in postgres etc?
How would you improve this model to handle scalability? Imagine 20 million users banging away on this..
Imagine 40 million users - what's wrong with this model when it comes to scalability?
Part 1.
You can write cypher or gremlin queries that do what you want. Remember that you can traverse forwards and backwards on edges. Given a user, it should always be relatively constant time to pull up the last ten things they did.
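The reason a single list is enough can be shown with a plain Python linked list (a stand-in for the status/activity chain, with hypothetical class names `Activity` and `Feed`). New entries go on the head, and reading the latest ten walks at most ten links no matter how long the list has grown.

```python
from itertools import islice

class Activity:
    def __init__(self, text, next_item=None):
        self.text = text
        self.next = next_item  # points to the next-older entry

class Feed:
    def __init__(self):
        self.head = None  # newest entry

    def push(self, text):
        # New entries become the head, so the list stays newest-first.
        self.head = Activity(text, self.head)

    def __iter__(self):
        node = self.head
        while node is not None:
            yield node.text
            node = node.next

    def latest(self, n=10):
        # Walks at most n links, regardless of how long the list is.
        return list(islice(self, n))
```

So a second "top ten" list should not be needed just for this read pattern.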
Part 2.
If you are representing a band as an entity of a certain type, index on that attribute. Then you'll be able to pull out that node and traverse outwards to find all the users who like that band. If you don't have an independent entity, or it is somehow implicit, you'll want to enable full text search for your respective graph database.
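In sketch form, the lookup is two steps: an index hit to find the band node, then a traversal of its incoming "likes" edges. The Python below uses dicts as stand-ins for the index and the relationship store; the node ids and user names are invented.

```python
# Hypothetical index on band name -> band node id.
band_index = {"Depeche Mode": "band:1", "New Order": "band:2"}

# Incoming LIKES relationships, stored per band node.
liked_by = {"band:1": ["ann", "cid"], "band:2": ["bob"]}

def users_who_like(band_name):
    node = band_index.get(band_name)   # index lookup finds the start node
    if node is None:
        return []
    return liked_by.get(node, [])      # then traverse outwards from it
```

The cost depends on how many users like that one band, not on the total number of users.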
Part 3.
Learn more about security. The only thing you would be storing would be a properly hashed string of the user's password. At that point you would be fine using any graph db and good security practices.
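As one possible shape for "a properly hashed string", here is a short sketch using PBKDF2 from Python's standard library. The iteration count and salt size are assumptions for the example, not a recommendation from the answer; the salt and digest are what you would store as node properties.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor for the sketch; tune for your hardware

def hash_password(password, salt=None):
    # A fresh random salt per user; pass the stored salt when verifying.
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password, salt, expected):
    _, digest = hash_password(password, salt)
    # Constant-time comparison of the two digests.
    return hmac.compare_digest(digest, expected)
```

The plaintext password never reaches the database, so the choice of graph versus relational store matters much less than getting this step right.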
Part 4/5.
Once you have one user, worry about the next thousand.
When you have a thousand users, worry about the next hundred thousand.
When you have one hundred thousand, worry about the next million.
When you have a million users, you can start worrying about the questions you asked.
Until you have at least 0.1% of the users/volume you want to scale to, it's mental masturbation to try and ask questions about how to scale up to a certain size.