Neo4j: how to avoid supernodes

In my Neo4j project I have Role and Permission entities which represent user roles and permissions. Each User in the system has relationships to appropriate sets of roles and permissions.
I think Role and Permission are likely to become supernodes, which could turn into a major performance headache in the future.
What is the best practice for this case? How should Role and Permission be remodelled to avoid potential supernode issues?

Do you plan to run aggregate/mass queries based on Roles (e.g. counting the number of people with a certain role, or listing them)?
If not, and you just want to check whether a specific user has a certain Role, then in my humble opinion it should not cause hard-to-maintain or significant performance issues, because you will traverse only a handful of relationships and ignore the vast majority of the relationships attached to your "supernodes". I would stick with the simple design ("premature optimization is the root of all evil" ;)). Internally, relationships are stored in a linked-list-like structure, so finding the right one on a supernode can take time even if you restrict the search to a single relationship type; once such problems are actually noticed, splitting the Role nodes using the meta-node approach should do the job (it's described in Learning Neo4j).
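For illustration, a membership check like the following only walks the relationships attached to the user node, so the Role node's high degree never comes into play (the labels, relationship type, and properties are assumptions about your model):

MATCH (u:User {id: $userId})-[:HAS_ROLE]->(r:Role {name: $roleName})
RETURN count(r) > 0 AS hasRole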
If yes, you have a problem, and that's probably an area where an RDBMS is better... Using meta nodes probably won't help, as you will still have to process all of them to list/count all users. So caching that data in a separate store may simply be the best idea.

I'm going to assume that you're just using Neo4j as a permissions lookup data source (like hasPermission(current_user, 'permission_string')) and not tied into any queries to other entities. That can be fine, especially if you have a hierarchical access schema. If that's not true then this might not apply and it would be good to have a clearer idea of what your entities look like.
Since you're likely checking permissions throughout your application, and if they're going to grow in size and scope, it could make sense for performance to use some form of caching, such as an in-memory store or Redis, for example.
It might even make sense to generate a denormalized cache of every permission state for every user. So you would evaluate your rules (which might be based on hierarchical roles/permissions) and come out with a list of "User X has permission Y". Then, whenever you change a user or a permission, you'd regenerate the cache for that entity; and if you changed a role, you would regenerate the cache for all of the associated users and permissions.
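As a rough sketch, the cache-priming query in Cypher could look like this (the HAS_ROLE and GRANTS relationship types are assumptions, since the actual model isn't shown):

MATCH (u:User)-[:HAS_ROLE]->(:Role)-[:GRANTS]->(p:Permission)
RETURN u.id AS userId, collect(DISTINCT p.name) AS permissions
// one row per user, listing every permission they effectively have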
Also, I don't know if I would apply this advice only to Neo4j: if you're talking about a simple key/value lookup, a lot of general-purpose databases would be overkill in performance-critical situations.

Graph databases for modeling specific domain

With a typical 'graph database', the data is broken up into nodes and edges, and there isn't much of a restriction/schema on the connections. Because of this, it seems great for modeling straightforward graphs where the relationships are relatively consistent -- Movies with cast and crew; Computer networks with IPs and devices; Social networks with users and connections; etc.
Are there any graph-like databases that can be more specialized? For example, to be able to model something like an electrical circuit, where each component has a sort of 'schema', i.e. well-defined inputs, outputs, and properties: a Resistor has two connections and various properties, a Transistor has three connections and various properties, etc.
I'm not asking about particular circuit simulators, such as https://www.falstad.com/circuit/circuitjs.html, but more about whether it's possible in any graph (or pseudo-graph) databases to model and enforce very specific, well-defined relationships in a network, such as circuit design.
Definitely possible.
I've been working on this problem with Neo4j, and Restagraph is the result. It provides a REST API that enforces a schema on any updates to the database, and I've packaged it as a Docker image.
I haven't really promoted it so far, because it's only recently been mature enough for my own use, and I really need to improve the documentation. If you try it out, though, I'd love to hear any feedback you have.
TLDR: in general yes, but it depends.
This is a really broad question, so let me break it down.
It's a bit of a stretch to talk about all graph databases (they are not as standardized as SQL databases, which in turn are not very standardized either), so take this answer with a grain of salt: yes, that is possible.
As in SQL databases, you can usually set up constraints to be checked before any change to the data is persisted.
Most graph databases incorporate something along the lines of a "type", similar to what a table represents in SQL databases. Some allow you to constrain relationships so that they only connect specific types, so you could, for example, restrict relationships between a node using a CAN bus and one using an I2C bus to those specific types.
If a database does not provide these mechanisms, it's usually possible to constrain relationships based on the existence of specific keys and values in the model. To take another example than your circuit one: imagine a node-based system which has typed inputs and outputs - an int-based output can only be connected to an int-based input, a float-based output only to a float-based input, etc. Then you could add output_type and input_type fields to the nodes and constrain relationships based on those values.
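As a sketch of what such a check could look like in Cypher (the CONNECTS_TO relationship and the *_type properties are assumptions, and Neo4j has no declarative constraint for this, so you would run it as a validation query or inside a trigger/procedure):

// find every connection whose endpoint types don't match
MATCH (a)-[r:CONNECTS_TO]->(b)
WHERE a.output_type <> b.input_type
RETURN a, r, b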
As soon as a database adds the ability to write stored procedures (similar to SQL's), you can express very complex data-integrity constraints.
So while it is possible, the question is whether you should.
How much logic you actually want to put into your database is a decades-long heated argument. At some point in your application architecture, you will have to check the validity of the data that you are handling. Handling the data consistency in the database itself solves a lot of problems with race conditions or performance issues through multiple round trips between the application and the database, which would occur if the consistency checks are done in the application layer.
Putting a lot of your logic into the database severely limits your ability to switch databases ("vendor lock-in"), might lead to code duplication between your application layer and your database, and sprays your logic between two (or more) layers of your architecture (which makes it harder to find bugs, introduces temporal coupling, and might re-introduce race conditions and performance problems where you have to use transactions again).
My personal take is along the lines of Steve Wozniak - see your database as another service. If that service can provide you with everything you need to ensure data integrity, it might be a good idea to just use the database directly. But if this increases the problems I mentioned before, you might be better off putting a layer between your database and your business logic.
Objectivity/DB is an object/graph database that uses a schema. You can absolutely do what you are proposing. It supports complex object definitions, including type inheritance, and it has a full graph/navigational query language similar to Cypher. www.objectivity.com

Duplicating relations vs executing more queries

I have the following architecture.
You will find a duplication of the HAS relationship. The main one is between Badge and Skill, as I want to be able to aggregate/count the same Skill across different Badges of the same User.
So the duplicate relationship is between User and Skill. That is because, for instance, if an Organization wants to know all the skills of one or more recipients, I would follow this path:
Org -OWNS-> Badges -IS_AWARDED_TO-> User -HAS-> Skill
// Skill nodes here represent every skill contained in any Badge awarded to the specified user(s)
However, if I do not add the duplicated HAS relationship between User and Skill, I would follow this path instead:
Org -OWNS-> Badges -IS_AWARDED_TO-> User -IS_AWARDED-> Badges -HAS-> Skill
// Now I get all skills for one or more specific users, across every badge they were awarded
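Spelled out as Cypher, the two reads would look roughly like this (the Organization label and the id/name properties are just placeholders):

// Model 1: using the duplicated (user)-[:HAS]->(skill) relationship
MATCH (o:Organization {id: $orgId})-[:OWNS]->(:Badge)-[:IS_AWARDED_TO]->(u:User)-[:HAS]->(s:Skill)
RETURN u.name AS user, collect(DISTINCT s.name) AS skills

// Model 2: no duplication, going back out through every awarded badge
MATCH (o:Organization {id: $orgId})-[:OWNS]->(:Badge)-[:IS_AWARDED_TO]->(u:User),
      (u)-[:IS_AWARDED]->(:Badge)-[:HAS]->(s:Skill)
RETURN u.name AS user, collect(DISTINCT s.name) AS skills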
The difference between the two paths is obvious: the first one requires a shorter traversal (fewer queries), but the duplicated relationship is a concern; the second one removes the duplication (is it even a problem?) but requires a longer traversal. I am still a newbie to Neo4j, so feel free to tell me that both of my approaches are convoluted and there is a more optimized way to achieve what I am trying to do.
Your two models are valid, and you can use both of them.
But as you said, in the first one you duplicate some data. Generally we only do that when we have performance issues. Is that your case right now?
As a starting point, I recommend you start with model 2 (i.e. without duplication), and if you run into issues with it, you can easily change it to model 1 (the flexibility of Neo4j is really great for graph refactoring!).
In IT, nothing is free: if you duplicate some data to get better read performance, you will pay for it on writes.
When you write a (badge)-[:HAS]->(skill) relationship, you also need to create the corresponding (user)-[:HAS]->(skill) relationship (and the same goes for updates and deletes).
So you need to keep this data consistent whenever you update the graph. In effect, you are maintaining something like a SQL materialized view by hand.
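A sketch of such a dual write in Cypher (the lookup patterns and property names are assumptions):

MATCH (u:User {id: $userId})-[:IS_AWARDED]->(b:Badge {id: $badgeId})
MATCH (s:Skill {name: $skillName})
MERGE (b)-[:HAS]->(s)
MERGE (u)-[:HAS]->(s)  // the duplicated relationship must be kept in sync here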

Does a serialized hash column make more sense than an associated model/table for flexibility?

I have been researching quite a bit and the general consensus is to avoid serialized hashes in a DB whenever possible, however the design I have lends itself to this structure, so I'm hoping to get some opinions and/or advice. Here is the scenario:
I have a model/table :products which houses financial products. Each product has_many investment strategies, which I had originally stored in a separate :strategies model/table. Since each product has completely different strategies, and each strategy has different attributes, it's become extremely difficult (and hacky) to manipulate each strategy's attributes into normalized, consistent columns (to the point where there are products that I simply cannot add to the application). Additionally, a strategy's attributes can sometimes change based on the amount of money allocated to that strategy.
In order to solve this issue, I am looking into removing the :strategies model/table altogether and simply adding a strategies column to my :products model/table. The new column would house a multi-dimensional hash of each product's strategies. This option allows for tremendous flexibility from a data storage perspective.
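For concreteness, the column-based version could look something like this (a jsonb column is just one possible implementation; a plain text column with serialize would also work):

class AddStrategiesToProducts < ActiveRecord::Migration[7.0]
  def change
    # On PostgreSQL a jsonb column keeps the hash queryable;
    # on other databases a text column plus serialize works too.
    add_column :products, :strategies, :jsonb, default: {}, null: false
  end
end

class Product < ApplicationRecord
  # With jsonb, ActiveRecord already exposes the column as a Hash.
  # With a plain text column you would instead declare:
  #   serialize :strategies, coder: JSON   # Rails 7.1+; older Rails: serialize :strategies, JSON
end

# product = Product.new(strategies: { "growth" => { "allocation" => 0.6 } })
# Product.where("strategies ->> 'growth' IS NOT NULL")   # jsonb querying (PostgreSQL)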
My primary question is: do I lose any functionality by restructuring my database this way? There will be times when I need to search for a product by its strategies' attributes, and I have read that searching within a multi-dimensional hash is difficult at best. Is this considered bad practice? Is there a third solution that I haven't considered?
The advantage of rolling with multiple tables for this design is that you can leverage the database to protect your data with constraints, functions, and triggers. The database is the only place you can protect your customers' data with 100% confidence. These tried-and-true techniques have lost their luster in recent years and are viewed as cumbersome and/or unnecessary by those who do not understand them.
Hash-based stores within relational databases are currently changing quickly due to the popularity of NoSQL databases; traditionally, however, it has been difficult to fully protect your customers' data at the database level with this approach, so much of that protection ends up living in the application layer. That said, this area is being innovated on, and maybe someday the database side will catch up.
The big advantage of using the hash as a column in a table is that you can get up and running more quickly while you're figuring out your problem. In addition, you can pivot more easily, because most modifications are made in the application layer on the fly.
Full-text searching and complex queries can also be a bit more difficult if you're using a hash-based store within a relational database.
The general rule of thumb is: if you need the data to be safe and/or have complex reporting to do, go relational. Think a big financial-services-type app ;) Otherwise, if you're building a more social, data-display-style app, or just mocking things up, there is nothing wrong with a serialized hash column. Most importantly, remember to write tests so you can refactor more confidently if you choose wrong!
My $0.02
I would be curious to know which decision you choose and how it has worked out.

Dynamic business rules engine for ruby on rails

I have an application which will require a "dynamic business rules" engine. Some of the business rules change very frequently, and some of them apply only to a limited set of business accounts. For example: my customer has a process where they qualify stores based on their size, number of salespeople, number of products, location, etc. But they manage different accounts, and each account gives different "weights" to each attribute.
How do I implement this engine using Ruby? I know Java has drools, but I find drools annoying and complex. And I prefer not having to use JRuby...
Regards,
Rubem
If you're sure a rule engine is what you need, you will need to find one you can use in Ruby. A quick Google search brought up Rools (http://rools.rubyforge.org/) and Ruby Rules (http://xircles.codehaus.org/projects/ruby-rules). I'm not sure of the status of either project though. Using JRuby with Drools might be your best bet but then again, I'm a Java developer and a big Drools advocate. :)
Without knowing all the details, it's a little hard to say how that should be implemented. It also depends on how you want the rules to be updated. One approach is to write a collection of rules similar to this: "if a store exists with more than 50 salespeople and the store hasn't had its weight updated to reflect that, then update the store's weight." In a way, though, that's comparable to hardcoding.
A better approach might be to create Weight objects with criteria that need to be met for the weight to apply. Then you could write one rule that matches on both Weights and Stores: "if a Store exists that matches a Weight's criteria and the Store doesn't already have that Weight assigned to it, then add the Weight to the Store." Then the business folks could just create and update Weights, possibly in a web-front-ended database, instead of maintaining rules.
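A minimal Ruby sketch of that second approach (all class, attribute, and criteria names here are hypothetical, purely to illustrate weights as data rather than hardcoded rules):

# Weights are data, not code, so business users can add or edit them
# without touching the rule logic.
Weight = Struct.new(:attribute, :operator, :threshold, :value) do
  def applies_to?(store)
    store_value = store.public_send(attribute)
    case operator
    when :gte then store_value >= threshold
    when :lte then store_value <= threshold
    when :eq  then store_value == threshold
    else false
    end
  end
end

Store = Struct.new(:name, :sales_people, :product_count, keyword_init: true)

weights = [
  Weight.new(:sales_people, :gte, 50, 10),
  Weight.new(:product_count, :gte, 1000, 5)
]

store = Store.new(name: "Acme", sales_people: 60, product_count: 200)
score = weights.select { |w| w.applies_to?(store) }.sum(&:value)
# => 10 (only the sales_people weight matched)

Persisting Weight as an ActiveRecord model with the same applies_to? method would let the business folks edit the criteria through a web front end, as suggested above.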

The Ruby community values simplicity...what's your argument for simplifying a db schema in a new project?

I'm working on a project with developers who have not worked with Ruby OR Rails before.
They have created a schema that is too complicated, in my opinion. The schema has 117 tables, and obtaining the simplest piece of information requires traversing/joining 7 tables... and of course, there's no "main" table that serves as a sort of key between them. The schema renders many of the Rails tools, like the find method and the has_many/belongs_to associations, almost useless. And coding for all of these relationships will likely be more time-consuming than we have the money for.
THE QUESTION:
Assuming you are VERY convinced (IMHO...hehe) that the schema is not ideal, and there are multiple ways to represent the domain, how would you argue FOR simplifying the schema (aside from what I've already said)?
I'll stand up in 2 roles here
DBA: Database admin/designer.
Dev: Application developer.
I assume the DBA is a person who really knows all the database tricks. Reaallyy knows.
DBA:
The database is the key to the application and should have a predefined structure in order to serve its purpose well and with the best performance.
If your tools cannot work with an arbitrary schema (one that is reasonably normalised and good), then the tools are wrong.
Dev:
The database is just a data store, so we need to keep it simple and concentrate on the application.
DBA:
The database is not just a store, it is the core of the application. There is no application without the database.
Dev:
No. The application is the core. There is no application without the front-end and the business logic applied to it.
And the war begins...
Both points are valid and it is always trade off.
If the database will ONLY be used by RoR, then you can use it more like a simple store.
If the DB can be used by other applications, OR it will hold a large amount of data under high traffic, it must enforce some best practices.
Generally, there is no way you can simply disagree with the DBA.
But they can understand your situation and might allow you to loosen the standards a bit so you can be more productive.
So you need to work closely together.
And you need to talk to each other to explain and prove why the database should be like this or that.
Otherwise, the team is broken and the project is likely to fail.
ActiveRecord is a very handy tool, but it cannot do everything for you, and it does not provide exactly the database structure you expect by default, so it has to be tuned.
On one side, if the DBA can accept that all PKs are auto-incremented integers, that makes the developers' lives easier (ActiveRecord does this by default).
On the other side, if the developers accept some of the DBA's constraints, that makes the DBA's life easier.
Now to answer your question:
how would you argue FOR simplifying the schema
Do not argue. Meet the team, deliver the message, and focus on WHY it should be done.
Maybe it really shouldn't be done and you don't know the whole picture; or maybe they are not aware of something.
You could agree on the general structure of the database AND try to describe it using RoR migrations as a meta language.
This way they would see the general picture, and you would use your great ActiveRecords.
And also everybody would be on the same page.
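For example, a migration like the following (tables and columns invented purely for illustration) doubles as a readable schema description the whole team can review:

class CreateAccounts < ActiveRecord::Migration[7.0]
  def change
    # A migration read as documentation: names, types, nullability,
    # foreign keys and indexes are all visible in one place.
    create_table :accounts do |t|
      t.string :name, null: false
      t.references :organization, null: false, foreign_key: true
      t.timestamps
    end
    add_index :accounts, :name, unique: true
  end
end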
Your DB schema should reflect the domain and its relationships.
De-normalisation should only be done when you have measured that there is a performance problem.
7 joins is not excessive or bad, provided you have good indexes in place.
The general way to make this argument up the chain is based on cost. If you do things simply, there will be less code and fewer bugs. The system will be able to be built more quickly, or with more features, and thus will create more ROI. If you can get the money manager on board with that approach, he or she may let you dictate terms to the team. There is the counterargument that extreme over-normalization prevents bad data, but I have found that this is not the case, as the complexity it engenders tends to lead to more errors and more database code in general.
The architectural and technical argument here is simple. You have decided to use Ruby on Rails. Therefore you have decided to use the ActiveRecord pattern. The ActiveRecord pattern is driven by having the database tables match the object model. That's the pattern in use here, and in many other places, so the best practices they are trying to apply for extreme data normalization simply do not apply. Buy a copy of Patterns of Enterprise Application Architecture and put the little red bookmark at page 160 so they can understand how the pattern works from the architecture perspective.
What the DBA types tend to be unaware of is how much work ActiveRecord does for you: query generation, cascading deletes, optimistic locking, auto-populated columns, versioning (with acts_as_versioned), soft deletes (with acts_as_paranoid), and so on. There is a strong argument for using well-tested, community-supported library functions to perform these operations rather than custom code that must be maintained by a DBA.
The real issue with DBAs is then that they need some work to do. Let them focus on monitoring performance, finding slow queries in the code, creating indexes and doing backups.
If you end up losing the political battle for a sane schema, you may want to consider switching to DataMapper. It's the next pattern in PoEAA. The other thing you may be able to get them to do is to create views in the database that correspond to the object model. This way, you could use many of the finding capabilities in the ActiveRecord model based on the views, but have custom insert, update, and delete methods.
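A sketch of the view-backed idea (the view and class names are hypothetical):

class CustomerSummary < ApplicationRecord
  # Backed by a database view named customer_summaries that flattens the
  # normalized tables into one row per customer.
  self.table_name = "customer_summaries"

  # Views are typically not writable through ActiveRecord, so reads use the
  # normal finders while inserts/updates/deletes go through custom methods
  # against the underlying tables.
  def readonly?
    true
  end
end

# CustomerSummary.where("lifetime_value > ?", 1000).order(:name)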
