I have the following architecture.
You will notice a duplicated HAS relationship. The main one is between Badge and Skill, since I want to be able to aggregate/count the same Skill across the different Badges of the same User.
The duplicate relationship is between User and Skill. I added it because, for instance, if an Organization wants to know all the skills of one or several recipients, I would follow this path:
Org -OWNS-> Badges -IS_AWARDED_TO-> User -HAS-> Skill
//The Skill nodes for one or several users represent each skill contained in every Badge the user was awarded.
However, if I do not add the duplicated HAS relationship between User and Skill, I have to follow this path instead:
Org -OWNS-> Badges -IS_AWARDED_TO-> User -IS_AWARDED-> Badges -HAS-> Skill
//Now I have all skills for one or several Users, for every badge awarded.
The difference between the two paths is obvious. The first one results in a shorter traversal, but the duplicated relationship is a concern. The second one removes the duplication (is it actually a problem?) but requires a longer traversal. I am still a newbie to Neo4j, so feel free to tell me that both of my approaches are convoluted and that there is a more optimized way to achieve what I am trying to do.
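To make the comparison concrete, here is roughly what the two queries would look like in Cypher, using the relationship names from above (the Organization label and the $orgName parameter are placeholders of mine, and the aggregation can of course be swapped for a count):

// Path 1 - using the duplicated (User)-[:HAS]->(Skill) relationship
MATCH (o:Organization {name: $orgName})-[:OWNS]->(:Badge)-[:IS_AWARDED_TO]->(u:User),
      (u)-[:HAS]->(s:Skill)
RETURN u.name AS user, collect(DISTINCT s.name) AS skills

// Path 2 - no duplication, going back out through the awarded badges
MATCH (o:Organization {name: $orgName})-[:OWNS]->(:Badge)-[:IS_AWARDED_TO]->(u:User),
      (u)-[:IS_AWARDED]->(:Badge)-[:HAS]->(s:Skill)
RETURN u.name AS user, collect(DISTINCT s.name) AS skills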
Both of your models are valid, and you can use either of them.
But as you said, in the first one you duplicate some data. Generally we only do that when we have performance issues. Is that the case for you right now?
As a starting point, I recommend you start with model 2 (i.e. without duplication), and if you run into issues with it, you can easily change it to model 1 (the flexibility of Neo4j is really great for graph refactoring!).
In IT, nothing is free: if you duplicate some data to get better read performance, you will pay for it on writes.
When you write a (badge)-[:HAS]->(skill) relationship, you also need to create a (user)-[:HAS]->(skill) relationship (and the same goes for updates and deletes).
So you need to keep this data consistent whenever you update the graph. In fact, it's like maintaining a materialized view in SQL.
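For example, a write under model 1 might look like this (just a sketch, assuming MERGE is used so the duplicated relationship is only created once):

// Award a badge and record its skill, keeping the duplicated relationship in sync
MATCH (b:Badge {id: $badgeId}), (u:User {id: $userId}), (s:Skill {name: $skillName})
MERGE (b)-[:IS_AWARDED_TO]->(u)
MERGE (b)-[:HAS]->(s)
MERGE (u)-[:HAS]->(s)   // the duplicated relationship that has to be maintained

And on delete you have to decide whether the (user)-[:HAS]->(skill) relationship can go too, i.e. whether another badge of the same user still carries that skill.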
In the book Neo4j in Action by Aleksa Vukotic and Nicki Watt, the authors say:
In our experience, it is less common for relationship indexes to be good solutions. We are not saying that relationship indexing is poor practice, but if you find yourself adding lots of relationship indexes, it is worth asking why.
It sounds as though the authors do not recommend indexing relationships in a graph database, but no explanation is given afterwards. Does anyone know why?
I've voted for this question to be migrated to SO, and am answering it in the hope that it really does get migrated. I used Neo4j for a couple of years. Although it has changed a lot since then, I believe the principles of being a graph database won't have altered much. In my opinion, if you need a lot of indices to query the relationships between nodes promptly, you could have designed your data model in some other way so that it focuses more on the graph nodes (for example, relationships becoming your nodes and nodes becoming your relationships, as in a line graph), because the querying mechanism (e.g. a Cypher query) is generally optimised for nodes.
First, it's important to understand the role of indexes in Neo4j, in that indexes are used to find starting points in the graph, after which relationship traversal and filtering are used to perform the remainder of the pattern matching and to complete the query.
The advice therefore is about the same as: "we do not recommend using relationships as starting points in the graph", and we find that true more often than not.
Usually when you need to do index lookups, you have certain "things" in mind as your starting places, and important things in graphs are typically represented by nodes. If we're asking "what employees are connected to this particular company" we're interested in starting quickly by finding that particular company and expanding out, not in finding all :EMPLOYED_BY relationships in the graph and filtering by the connected company, which would take far more time.
Often we find that those who encounter this restriction, and need this kind of fast lookup of relationships anyway, may need to rethink their model. When there is a need to look up relationships as starting places in the graph, it is usually an indication that the thing represented by the relationship is important enough that it really should be a node in the graph (with its own relationships to the previously connected nodes), so this becomes a "modeling smell" that drives refactoring changes to the model. This kind of change often feels more natural afterwards, and affords more capability for the thing as a node that wasn't available when it was modeled as a relationship (for example, the ability to apply multiple labels to it, or to connect it via relationships to more nodes than just the original two).
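As a rough illustration (the labels and properties here are invented), refactoring an :EMPLOYED_BY relationship that keeps needing to be looked up in its own right into an Employment node could look like this:

// Turn each EMPLOYED_BY relationship into an Employment node
// with its own relationships to the two previously connected nodes
MATCH (p:Person)-[r:EMPLOYED_BY]->(c:Company)
CREATE (p)-[:HAS_EMPLOYMENT]->(e:Employment {since: r.since, title: r.title})-[:AT]->(c)
DELETE r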
All that said, there will be cases where a relationship really does just need to be a relationship (either for business reasons, or because it truly is most practical, modeling-wise, for it to be kept as a relationship), and using those relationships as starting points in the graph makes sense.
With the fulltext schema indexes introduced in Neo4j 3.5, we added the capability to add relationship indexes by relationship type(s) and property(or properties). So the capability is there, if needed, after you've ruled out refactoring of your model.
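For reference, creating and querying such an index in 3.5 looks roughly like this (the index name, relationship type and property are just examples):

// Create a fulltext index over :EMPLOYED_BY relationships by their 'title' property
CALL db.index.fulltext.createRelationshipIndex('employmentTitles', ['EMPLOYED_BY'], ['title']);

// Query it: the matching relationships come back as starting points, with a score
CALL db.index.fulltext.queryRelationships('employmentTitles', 'engineer')
YIELD relationship, score
RETURN startNode(relationship), endNode(relationship), score;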
I am concerned I am not getting the full benefit from relationships in Neo4j. While we use them to relate two nodes (of course), we rarely add properties to relationships, and I feel like we're missing the bigger picture.
Consider a case where there's an EVENT and affected people. We want confirmation from all people that they are informed of the event.
Here is what we do, and I think it is not great:
(e:EVENT)-[:NOTIFICATION]->(:EVENT_STATUS)-[:AFFECTED]->(a:PERSON)
Now it isn't so bad, because we need EVENTs and we already have PERSON. So we're adding the stuff that connects them. It works. However, the only purpose of EVENT_STATUS is to track a notification date and the PERSON's confirmation information. The fact is, it feels like we're implementing a relational database structure.
Would it be wrong/suicidal to add the notification date and the PERSON's confirmation to the relation?
(e:EVENT)-[:INFORMED {notification_date: 123123123,
confirmation_date: 123123999,
confirmation_type: 'ATTENDING'}]->(a:PERSON)
Help me understand the purpose of properties on Relationships, please!
edit - English... is a skill.
Your proposed solution is just fine, since you are tracking different pieces of information about a particular type of relationship between 2 nodes. This is exactly what relationship properties are for.
There is no need to add extra relationships and nodes, as you are now doing. Not only are you wasting resources, but your queries are made unnecessarily complex.
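For example, with those properties on the :INFORMED relationship, a question like "who was notified but has not confirmed yet" stays a single short pattern (a sketch based on the model in the question):

// People affected by an event who were notified but have not yet confirmed
MATCH (e:EVENT {id: $eventId})-[r:INFORMED]->(p:PERSON)
WHERE r.confirmation_date IS NULL
RETURN p, r.notification_date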
In my Neo4j project I have Role and Permission entities which represent user roles and permissions. Each User in the system has relationships to appropriate sets of roles and permissions.
I think Role and Permission are the kind of supernodes that can become a major performance headache in the future.
What is the best practice for this case? How should Role and Permission be reimplemented in order to avoid possible issues with supernodes?
Do you plan to make aggregate/mass queries based on Roles (e.g. counting the number of people with a certain role, or listing them)?
If not, and you just want to check whether a specific user has a certain Role, then in my humble opinion it should not cause hard-to-maintain or significant performance issues (you will traverse only certain relationships of the graph, ignoring the vast majority of your "supernodes'" relationships). I would keep the simple design ("premature optimization is the root of all evil" ;) ), and once problems are noticed (internally, relationships are stored in a linked-list-like structure, so finding the right one may take time on a supernode, even if you restrict the search to a certain relationship type), splitting the Role nodes using the meta-node approach should do the job (it's described in Learning Neo4j).
If yes, you have a problem. That's probably an area in which RDBMSs are better... Using meta nodes probably won't help, as you will still have to process all of them to list/count all users... So caching that data in a separate store may simply be the best idea.
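For completeness, the meta-node approach mentioned above looks roughly like this (RoleMeta and the relationship types are made-up names): each user hangs off one of several intermediate nodes instead of the single Role supernode, so checking one user's role stays cheap:

(u1:User)-[:HAS_ROLE]->(m1:RoleMeta)-[:META_OF]->(r:Role {name: 'ADMIN'})
(u2:User)-[:HAS_ROLE]->(m1)
(u3:User)-[:HAS_ROLE]->(m2:RoleMeta)-[:META_OF]->(r)

MATCH (u:User {id: $userId})-[:HAS_ROLE]->(:RoleMeta)-[:META_OF]->(r:Role)
RETURN r.name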
I'm going to assume that you're just using Neo4j as a permissions lookup data source (like hasPermission(current_user, 'permission_string')) and not tied into any queries to other entities. That can be fine, especially if you have a hierarchical access schema. If that's not true then this might not apply and it would be good to have a clearer idea of what your entities look like.
Since you're likely using permissions throughout your application, and if they're going to grow in size and scope, it could make sense for performance to use some form of caching, such as an in-memory store or Redis, for example.
It might even make sense to generate a denormalized cache of every permission state for every user. So you would evaluate your rules which might be based on hierarchical roles/permissions and come out with a list of "User X has permission Y". Then whenever you change a user or a permission you'd regenerate the cache for that entity, and if you changed a role you would regenerate the cache for all of the associated users and permissions.
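If the rules themselves live in the graph, something along these lines could produce that denormalized list to push into the cache (the labels and relationship types are assumptions about your model):

// Flatten user -> role -> permission into "user X has permission Y" pairs
MATCH (u:User)-[:HAS_ROLE]->(:Role)-[:GRANTS]->(p:Permission)
RETURN u.id AS userId, collect(DISTINCT p.name) AS permissions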
Also I don't know if I would apply this advice to just Neo4j. If you're talking about a simple key/value lookup then a lot of general purpose databases would be overkill in performance critical situations.
I'm working on financial data mart structure.
And I'm having some doubts on whats the better approach to do so.
The source system database, Dynamics AX 2009, has three tables for customer transactions:
One table for open transactions, where the customer still needs to pay for the service/product;
One table for settled transactions, which holds what the customer has already paid;
Finally, a table that has all customer transactions, holding transactions from open to settled and also other transactions such as customer-to-bank or ledger account transactions.
I thought of two options. The first is to maintain a fact table for each of the three tables: a fact for open transactions, a fact for all customer transactions, and a fact for settled transactions.
The second is to create a single fact table to hold all transactions; to do so I would have to do a full join on the three tables.
I'm not sure about either approach, as the first seems to just copy the tables from production and create the proper dimensions.
With the second I would create a massive fact table in which the data would constantly change, as open transactions are deleted from the source system once they are settled.
Another doubt: should I create a fact with an SCD (slowly changing dimension) structure to maintain historical data (start date, end date, flag)?
It's hard to say from the information given whether this needs to be one or more Fact tables. However, the key point which you should use to decide is whether all of the information is at the same granularity. Consider the grain of your intended Fact table(s) and you should find an answer for whether you need one table or multiple tables.
If all of the information sits at the same grain - i.e. all of the same dimensions apply to all of the measures you are considering putting into the same Fact table - then they can probably all live in the same Fact table. If you're finding that some of the Dimensions wouldn't apply to some of the measures then you probably need to re-think your design. Either you might need multiple Fact tables, or you might need to take all of your measures down to the lowest grain and combine hierarchies into single Dimensions if you currently have them split across multiple Dimensions.
While it's been mentioned that having measures in separate cubes could make it difficult to compare things, keep in mind that you don't need one cube per Fact table. You can have multiple Fact tables in a single cube, and sometimes this is very helpful when you need to be able to compare measures which share some Dimensions but not others. This is far, far better than forcing data which does not have the same grain into one Fact table.
Also, it sounds like what you're trying to model is the sales ledger of an organisation. I'd suggest having a dig around via Google as you may well be able to find materials discussing dimensional data warehouse design for sales ledger structures, rather than reinventing the wheel. If you don't have a decent understanding of the accounting concepts you're trying to model I would especially recommend looking for a reference schema to work from, or failing that doing some reading up on accountancy concepts (and sales ledgers specifically). Understanding the account structure should help you understand what the grain of your Fact table(s) needs to be, how to model the Dimensions, and so on.
This is a really helpful abridged version of Kimball's modelling techniques which discusses grain, and the different types of Fact table, amongst many other topics:
http://www.kimballgroup.com/wp-content/uploads/2013/08/2013.09-Kimball-Dimensional-Modeling-Techniques11.pdf
I think you should just use one fact table (one cube) and use a dimension to differentiate between open/settled/etc. transactions. That's what dimensions are for: they help you categorize your measures and get a specific view on them. This approach will also open up many more possibilities to create knowledge with your cube. With separate cubes for open/settled/etc. transactions, it will be harder or even impossible to compare this data.
Since the data is changing constantly, you should consider updating your fact table on a schedule and rebuilding your cube when needed.
Whether you use SCD or not really depends on the data you process and what it is used for. Is there a business case calling for it? Is there a technical use?
I think this is something you have to decide on your own.
I've created a graph model for a social network and needed some concrete advice regarding the design in regards to scaling. Pardon the n00bness of these questions but I'm not finding very many clear examples out there...
NOTE: the status updates and activity nodes/relationships are linked lists - with the newest entries constantly being placed at the top of the list.
Linked lists allow for news feed generation, but there could be hundreds of records per user - I presume the LIMIT clause isn't sufficient even though the data is in descending order by date. Do I need a separate linked list that holds only the most recent 10 status/activity updates, constantly replacing the head of that list, to get better activity feed generation, or will one properly sorted list do the job (with a LIMIT clause)?
These nodes all have properties (JSON data with content, IDs, etc.) - how do "global" indexes come into play here so that I can find, for example, users that like Depeche Mode without waiting a lifetime for results? I know how to add a node to an index; I'm just wondering if I'm missing part of the picture here.
Security - logins and passwords. I would presume a graph database could store them, but I'd also assume that's a security risk at this point - would it be better to keep these in Postgres, etc.?
How would you improve this model to handle scalability? Imagine 20 million users banging away on this..
Imagine 40 million users - what's wrong with this model when it comes to scalability?
Part 1.
You can write cypher or gremlin queries that do what you want. Remember that you can traverse forwards and backwards on edges. Given a user, it should always be relatively constant time to pull up the last ten things they did.
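For example, assuming a :STATUS relationship from the user to the head of the list and :NEXT pointers between updates (both names are just placeholders), the ten most recent items are a small, bounded traversal:

// Latest 10 items from the user's linked list of status updates
MATCH (u:User {name: $userName})-[:STATUS]->(head:StatusUpdate)
MATCH path = (head)-[:NEXT*0..9]->(update:StatusUpdate)
RETURN update
ORDER BY length(path)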
Part 2.
If you are representing a band as an entity of a certain type, index on that attribute. Then you'll be able to pull out that node and traverse outwards to find all the users who like that band. If you don't have an independent entity, or it is somehow implicit, you'll want to enable full text search for your respective graph database.
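For example, with a Band node indexed on its name (older schema-index syntax; the label and relationship names are just placeholders), finding everyone who likes Depeche Mode is one index lookup plus a traversal:

// One-time schema index so the band itself can be found quickly
CREATE INDEX ON :Band(name);

// Start from the indexed band node and expand out to its fans
MATCH (b:Band {name: 'Depeche Mode'})<-[:LIKES]-(u:User)
RETURN u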
Part 3.
Learn more about security. The only thing you would be storing would be a properly hashed string of the user's password. At that point you would be fine using any graph db and good security practices.
Part 4/5.
Once you have one user, worry about the next thousand.
When you have a thousand users, worry about the next hundred thousand.
When you have one hundred thousand, worry about the next million.
When you have a million users, you can start worrying about the questions you asked.
Until you have at least 0.1% of the users/volume you want to scale to, it's mental masturbation to try and ask questions about how to scale up to a certain size.