Ensuring consistency with relational database without foreign keys - ruby-on-rails

I have seen examples like Discourse where tables in a relational database don't have foreign keys, while the other tenets of an RDB are still used, such as CONSTRAINTS, INDEXES, FULL-TEXT SEARCH, etc. But as per the Rails Active Record guidelines, foreign keys are dropped.
https://meta.discourse.org/t/foreign-key-constraints-in-db/2642
Do we need to periodically check for consistency in such applications? And in that case, should each request-response cycle check that there is no invalid foreign key and correct it at the same time in the application layer?

OK, so the first thing to understand is why we generally put such constraints in the database. The second is why some people don't like this. The third is what the ramifications of not doing so are.
Why we put RI checks in the database
A relational database is basically a big math engine performing set operations (well, actually bag operations, due to concessions to real-world data integrity problems) on large sets of data. As the sets grow, the ability to verify the integrity of the data diminishes, until at some point one has trouble verifying the entire validity of the data according to the set model one follows. I have worked with PostgreSQL databases where constraints were not possible, and in some areas we had to accept that there would be referential integrity violations.
The problem of managing referential integrity where one software project owns the database can be formidable, but it becomes far worse when many different programs can read or write the same data, because normalization and encapsulation concerns increase with the number of pathways for reading (and, worse, writing) the data.
Being able to guarantee that referential integrity is not violated on each write is thus an important tool in data management.
Why some people avoid RI checks in the database
Referential integrity constraints, however, are not free. There are two important drawbacks that sometimes lead developers to decide against them:
They impact database performance, and the database is often understood to be the least scalable part of a system; and
They divide logic, placing it in different locations and segregating data-model logic from application logic. While this separation of concerns is usually desirable, where a single application owns a database it is sometimes (but not always!) considered a less desirable tradeoff.
It is worth noting, further, that the Rails guidelines don't offer firm guidance on this tradeoff. Like many ORMs, Active Record offers tools for addressing this in the application; I found plenty of examples of people using foreign keys in the database, and nobody saying "don't use them."
Concerns from avoiding RI checks in the database
The concerns, and any further mitigating measures, of course depend on the importance and further use of the data. A lower-impact data set that is just the private data store of an application (the normal Rails way) doesn't have the same implications as a higher-impact data store intended to be used for decision support later. So repeated read-use is an important question in deciding whether you need to periodically re-scan.
The second concern is alternate sources of writes. In general, in this model the most important safeguard is to prevent alternate sources of writes, outside of these specific ActiveRecord-using classes.
So in answer to your question, you may or may not need to. But you should probably do a risk assessment and decide what to do. Such a risk assessment will guide this decision not only at the moment but also in the future.
As a side note
You can use foreign keys to insist on consistency while using hooks and so forth to ensure that the logic is properly handled in the ActiveRecord component, i.e. instead of using ON DELETE CASCADE, have that handled by a hook.
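As a hedged illustration of that last point (the Post/Comment names here are hypothetical, not from the question), the foreign key can stay in the schema while the cascading behaviour lives in ActiveRecord:

class AddPostForeignKeyToComments < ActiveRecord::Migration[7.0]
  def change
    # The constraint guards consistency; there is no ON DELETE CASCADE,
    # so deletes must flow through the application layer.
    add_foreign_key :comments, :posts
  end
end

class Post < ApplicationRecord
  # dependent: :destroy plays the role of the cascade, but runs each
  # comment's callbacks - the "handled by a hook" approach described above.
  has_many :comments, dependent: :destroy
end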

Related

Graph databases for modeling specific domain

With a normal 'graph database' the data is broken up into nodes and edges, and there isn't much of a restriction/schema between the connections. With this, it seems great for modeling straightforward graphs where the relationships are relatively consistent -- Movies with cast and crew; Computer networks with IPs and devices; Social networks with users and connections; etc.
Are there any graph-like databases that can be more specialized? For example, to be able to model something like an electrical circuit, where each component has a sort of 'schema' or well-defined inputs and outputs - e.g., a Resistor has two connections and various properties;
a Transistor has three connections and various properties, etc.
I'm not asking about particular circuit simulators, such as https://www.falstad.com/circuit/circuitjs.html, but more about whether it's possible in any graph (or pseudo-graph) databases to model and enforce very specific, well-defined relationships in a network, such as circuit design.
Definitely possible.
I've been working on this problem with Neo4j, and Restagraph is the result. It provides a REST API that enforces a schema on any updates to the database, and I've packaged it as a Docker image.
I haven't really promoted it so far, because it's only recently been mature enough for my own use, and I really need to improve the documentation. If you try it out, though, I'd love to hear any feedback you have.
TLDR: in general yes, but it depends.
This is a really broad question, so let me break it down.
It's a bit of a stretch to talk about all graph databases (which are even less standardized than SQL databases - themselves not all that standardized), so take this answer with a grain of salt: yes, that is possible.
As in SQL databases, you can usually set up constraints to be checked before any change to the data is persisted.
Most graph databases incorporate something along the lines of a "type", similar to what a table represents in SQL databases. Some allow you to constrain relationships to target only specific types, so you could, for example, restrict relationships between a node using a CAN bus and one using an I2C bus to those specific types.
If a database does not provide these mechanisms, it's usually possible to constrain relationships based on the existence of specific keys and values in the model. To take an example other than your circuit one: imagine a node-based system with typed inputs and outputs - an int-based output can only be connected to an int-based input, a float-based output only to a float-based input, and so on. You could then add output_type and input_type fields to the nodes and constrain relationships based on those values.
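To make that node-based example concrete, here is a minimal, database-agnostic sketch in Ruby (the Node and Connection classes are hypothetical, not any particular graph database's API); a graph database with relationship constraints would let you push the same check down into the store itself:

# Plain-Ruby illustration of the output_type / input_type rule:
# a connection is only legal when the two types match.
Node = Struct.new(:name, :input_type, :output_type)

class Connection
  def initialize(from, to)
    unless from.output_type == to.input_type
      raise ArgumentError,
            "cannot connect #{from.output_type} output to #{to.input_type} input"
    end
    @from = from
    @to = to
  end
end

int_source = Node.new("counter", nil, "int")
float_sink = Node.new("plotter", "float", nil)
Connection.new(int_source, float_sink) # raises ArgumentError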
As soon as you add the ability to write stored procedures (similar to those in SQL), you can express very complex data integrity constraints.
So, while it is possible, the question is whether you should.
How much logic you actually want to put into your database is a decades-long heated argument. At some point in your application architecture, you will have to check the validity of the data that you are handling. Handling the data consistency in the database itself solves a lot of problems with race conditions or performance issues through multiple round trips between the application and the database, which would occur if the consistency checks are done in the application layer.
Putting a lot of your logic into the database severely limits your ability to switch databases ("vendor lock-in"), might lead to code duplication between your application layer and your database, and sprays your logic between two (or more) layers of your architecture (which makes it harder to find bugs, introduces temporal coupling, and might re-introduce race conditions and performance problems where you have to use transactions again).
My personal take is along the lines of Steve Wozniak - see your database as another service. If that service can provide you with everything you need to ensure data integrity, it might be a good idea to just use the database directly. But if this increases the problems I mentioned before, you might be better off putting a layer between your database and your business logic.
Objectivity/DB is an object/graph database that uses a schema. You can absolutely do what you are proposing. It supports complex object definitions, including type inheritance, and it has a full graph/navigational query language similar to Cypher. www.objectivity.com

Does a serialized hash column make more sense than an associated model/table for flexibility?

I have been researching quite a bit and the general consensus is to avoid serialized hashes in a DB whenever possible, however the design I have lends itself to this structure, so I'm hoping to get some opinions and/or advice. Here is the scenario:
I have a model/table :products which houses financial products. Each product has_many investment strategies, which I had originally stored in a separate :strategies model/table. Since each product has completely different strategies, and each strategy has different attributes, it's become extremely difficult (and hacky) to manipulate each strategy's attributes into normalized, consistent columns (to the point where I have products that I simply cannot add to the application). Additionally, a strategy's attributes can sometimes change based on the amount of money allocated to that strategy.
In order to solve this issue, I am looking into removing the :strategies model/table altogether and simply adding a strategies column to my :products model/table. The new column would house a multi-dimensional hash of each product's strategies. This option allows for tremendous flexibility from a data storage perspective.
My primary question is: do I lose any functionality by restructuring my database this way? There will be times when I need to search a product by its strategies' attributes, and I have read that searching within a multi-dimensional hash is difficult at best. Is this considered bad practice? Is there a third solution that I haven't considered?
The advantage of rolling with multiple tables for this design is that you can leverage the database to protect your data with constraints, functions, and triggers. The database is the only place you can protect your customers' data with 100% confidence. These tried-and-true techniques have lost their luster in recent years and are viewed as cumbersome and/or unnecessary by those who do not understand them.
Hash-based stores within relational databases are currently changing quickly due to the popularity of NoSQL databases; however, traditionally it has been difficult to fully protect your customers' data from within the database with this implementation. Therefore, the application layer is where much of this protection lives. With that said, this is being innovated on, and maybe someday they will solve it.
The big advantage of using the hash as a column in a table is that you can get up and going more quickly while you're figuring out your problem. In addition, you can pivot more easily because most modifications are made in the application layer on the fly.
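For concreteness, a hedged sketch of what the hash-in-a-column route might look like, assuming PostgreSQL and a hypothetical strategies column (not the asker's actual schema):

class AddStrategiesToProducts < ActiveRecord::Migration[7.0]
  def change
    # jsonb keeps the flexible, per-product structure but remains queryable.
    add_column :products, :strategies, :jsonb, default: {}, null: false
  end
end

# Rails maps jsonb to a plain Hash, so no serialize call is required:
product = Product.create!(name: "Fund A",
                          strategies: { "growth" => { "min_allocation" => 10_000 } })

# Searching inside the hash is possible, but clumsier than a joined table:
Product.where("strategies @> ?", { "growth" => { "min_allocation" => 10_000 } }.to_json)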
Full-text searching and complex queries can also be a bit more difficult if you're using a hash-based store within a relational database.
The general rule of thumb is: if you need the data to be safe and/or have complex reporting to do, go relational. Think a big financial-services-type app ;) Otherwise, if you're building a more social, data-display-style app, or just mocking things up, there is nothing wrong with a serialized hash column. Most importantly, remember to write tests so you can refactor more confidently if you choose wrong!
My $0.02
I would be curious to know which decision you choose and how it has worked out.

How do you keep your business rules DRY?

In the perfect application every business rule would exist only once.
I work for a shop that enforces business rules as much as possible in the database. In many cases, to achieve a better user experience, we're performing identical validations on the client side. Not very DRY. Being a SPOT purist, I hate this.
On the other end of the spectrum, some shops create dumb databases (the Rails community leans in this direction) and relegate the business logic to a separate tier. But even with this tack, some validation logic ends up repeated client side.
To further complicate the matter, I understand why the database should be treated as a fortress and so I agree that validations be enforced/repeated at the database.
Trying to enforce validations in one spot isn't easy in light of conflicting concerns -- stay DRY, keep the database a fortress, and provide a good user experience. I have some idea for overcoming this issue, but I imagine there are better.
Can we balance these conflicting concerns in a DRY manner?
Anyone who doesn't enforce the required business rules in the database where they belong is going to have bad data, simple as that. Data integrity is the job of the database. Databases are affected by many more sources than the application and to put required rules in the application only is short-sighted. If you do this you will get bad data from imports, from other applications when they connect, from ad hoc queries to fix large amounts of data (think increasing all the prices by 10%), etc. It is foolish in the extreme to enforce rules only through an application. But then again, I'm the person who has to fix the bad data that gets into poorly designed databases where application developers think they should do stuff only in the application.
The data will live on long past the application in many cases. You lose the rules when this happens as well.
"In a way, a clean separation of concerns where all business logic exists in a single core location is a Utopian fantasy that's hard to uphold."
Can't see why.
"handle all of the business logic in a separate tier (in Rails the models would house “most” of this)"
Correct. Django does this also.
"some business logic does end up spilling into other places (in Rails it might spill over into the controllers)"
Not really. The business logic can -- and should -- be in the model tier. Some of it will be encoded as methods of classes, libraries, and other bundles of logic that can be used elsewhere. Django does this with Form objects that validate input. These come from the model, but are used as part of the front-end HTML as well as any bulk loads.
There's no reason for business logic to be defined elsewhere. It can be used in other tiers, but it should be defined in the model.
Use an ORM layer to generate the SQL from the model. Everything in one place.
"[database] built on constraints that cause it to reject bad data"
I think that's a misreading of the Database As A Fortress post. It says "A solid data model", "reject data that does not belong, and to prevent relationships that do not make sense". These are simple declarative referential integrity.
An ORM layer can generate this from the model.
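As a hedged sketch of that idea (the table and model names are invented for illustration), the same referential rule can be declared once in the migration and surfaced by the model:

class CreateOrders < ActiveRecord::Migration[7.0]
  def change
    create_table :orders do |t|
      # NOT NULL plus a real foreign key: the database itself rejects
      # orphaned orders, regardless of which client wrote them.
      t.references :customer, null: false, foreign_key: true
      t.timestamps
    end
  end
end

class Order < ApplicationRecord
  # belongs_to validates presence by default (Rails 5+), so users get
  # a friendly error before the database constraint ever has to fire.
  belongs_to :customer
end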
The concerns of database-as-a-fortress and single-point-of-truth are one and the same thing. They are not mutually exclusive. And so there is no need whatsoever to "balance" them.
What you call "database-as-a-fortress" really is the only possible way to implement single-point-of-truth.
"database-as-a-fortress" is a good thing. It means that the database forces anyone to stay away who does not want to comply (to the truth that the database is supposed to adhere to). That so many people find that a problem (rather than the solution that it really is), says everything about how the majority of database users thinks about the importance of "truth" (as opposed to "I have this shit load of objects I need to persist, and I just want the database to do that despite any business rule what so ever.").
The Model-View-Controller framework is one way to solve this issue. The Rails community uses a similar concept...
The basic idea is that all business logic is handled in the controller, and where rules need to be applied in the view or model they are passed into it by the controller.
Well, in general, business rules are much more than sets of constraints, so it's obvious not all business rules can be placed in the database. On the other hand, HLGEM pointed out that it's naive to expect the application to handle all data validation: I can confirm this from experience.
I don't think there's a nice way to put all of the business rules in one place and have them applied on the client, the server side, and the database. I live with business rules at the entity level (as Hibernate recreates them at the database level) and the remaining rules at the service level.
Database-as-a-fortress is nonsense for relational databases. You need an OODB for that.

The Ruby community values simplicity...what's your argument for simplifying a db schema in a new project?

I'm working on a project with developers who have not worked with Ruby OR Rails before.
They have created a schema that is too complicated, in my opinion. The schema has 117 tables, and obtaining the simplest piece of information would require traversing/joining 7 tables... and of course, there's no "main" table that serves as a sort of key between them. The schema renders many of the Rails tools, like the find method and many of the has_many/belongs_to relationships, almost useless. And coding for all of these relationships will likely be more time-consuming than we have the money to code for.
THE QUESTION:
Assuming you are VERY convinced (IMHO...hehe) that the schema is not ideal, and there are multiple ways to represent the domain, how would you argue FOR simplifying the schema (aside from what I've already said)?
I'll stand up in 2 roles here
DBA: Database admin/designer.
Dev: Application developer.
I assume the DBA is a person who really knows all the database tricks. Really knows.
DBA:
The database is the key to the application and should have a predefined structure in order to serve its purpose well and with the best performance.
If you cannot use an arbitrary schema (one that is reasonably normalised and good), then the tools are wrong.
Dev:
The database is just a data store, so we need to keep it simple and concentrate on the application.
DBA:
The database is not just a store; it is the core of the application. There is no application without the database.
Dev:
No. The application is the core. There is no application without the front-end and the business logic applied to it.
And the war begins...
Both points are valid, and it is always a trade-off.
If the database will ONLY be used by RoR, then you can use it more like a simple store.
If the DB can be used by other applications, OR it will hold a large amount of data under high traffic, it must enforce some best practices.
Generally there is no way you can disagree with the DBA.
But they can understand your situation and might allow you to loosen the standards a bit so you can be more productive.
So you need to work closely, together.
And you need to talk to each other, to explain and prove the point of why the database should be like this or that.
Otherwise the team is broken, and the project has a high probability of failure.
ActiveRecord is a very handy tool, but it cannot do everything for you. It does not, by default, produce exactly the database structure you expect, so it has to be tuned.
On one side, if the DBA can accept that all PKs are auto-incremented integers, that would make the developers' lives easier (ActiveRecord does this by default).
On the other side, if the developers accept some of the DBA's constraints, it would make the DBA's life easier.
Now to answer your question:
how would you argue FOR simplifying the schema
Do not argue. Meet the team, deliver the message, and point out WHY it should be done.
Maybe it really shouldn't be, and you don't know all the details; or maybe they are not aware of something.
You could agree on the general structure of the database AND try to describe it using RoR migrations as a meta language.
This way they would see the general picture, and you would use your great ActiveRecords.
And also everybody would be on the same page.
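As a rough illustration of that migrations-as-meta-language idea (the tables here are made up, not from the project in question), a schema.rb-style description is something both the DBA and the developers can read and review:

ActiveRecord::Schema[7.0].define do
  create_table :authors do |t|
    t.string :name, null: false
    t.timestamps
  end

  create_table :books do |t|
    # The association and the foreign key are visible in one place.
    t.references :author, null: false, foreign_key: true
    t.string :title, null: false
    t.timestamps
  end

  add_index :books, :title
end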
Your DB schema should reflect the domain and its relationships.
De-normalisation should only be done when you have measured that there is a performance problem.
7 joins is not excessive or bad, provided you have good indexes in place.
The general way to make this argument up the chain is based on cost. If you do things simply, there will be less code and fewer bugs. The system will be able to be built more quickly, or with more features, and thus will create more ROI. If you can get the money manager on board with that approach, he or she may let you dictate terms to the team. There is the counterargument that extreme over-normalization prevents bad data, but I have found that this is not the case, as the complexity it engenders tends to lead to more errors and more database code in general.
The architectural and technical argument here is simple. You have decided to use Ruby on Rails. Therefore you have decided to use the ActiveRecord pattern. The ActiveRecord pattern is driven by having the database tables match the object model. That's the pattern in use here, and in many other places, so the best practices they are trying to apply for extreme data normalization simply do not apply. Buy a copy of Patterns of Enterprise Application Architecture and put the little red bookmark at page 160 so they can understand how the pattern works from the architecture perspective.
What the DBA types tend to be unaware of is how much work ActiveRecord does for you, from query generation and cascading deletes to optimistic locking, auto-populated columns, versioning (with acts_as_versioned), and soft deletes (with acts_as_paranoid). There is a strong argument for using well-tested, community-supported library functions to perform these operations versus custom code that must be maintained by a DBA.
The real issue with DBAs is then that they need some work to do. Let them focus on monitoring performance, finding slow queries in the code, creating indexes and doing backups.
If you end up losing the political battle for a sane schema, you may want to consider switching to DataMapper. It's the next pattern in PoEAA. The other thing you may be able to get them to do is to create views in the database that correspond to the object model. This way, you could use many of the finding capabilities in the ActiveRecord model based on the views, but have custom insert, update, and delete methods.
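A hedged sketch of the view idea (the view and model names are hypothetical): create the view in a migration, then point a read-only ActiveRecord model at it so the finder side of the API keeps working:

class CreateProductSummariesView < ActiveRecord::Migration[7.0]
  def up
    execute <<~SQL
      CREATE VIEW product_summaries AS
      SELECT p.id, p.name, COUNT(o.id) AS order_count
      FROM products p
      LEFT JOIN orders o ON o.product_id = p.id
      GROUP BY p.id, p.name;
    SQL
  end

  def down
    execute "DROP VIEW product_summaries;"
  end
end

class ProductSummary < ApplicationRecord
  self.primary_key = :id

  def readonly?
    true # reads go through the view; writes use custom methods instead
  end
end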

server side db programming: why?

Given that the database is generally the least scalable component (of a web application), are there any situations where one would put logic in procedures/triggers over keeping it in his favorite programming language (Ruby...) or her favorite web framework (...Rails!)?
Server-side logic is often much faster, even with a procedural approach.
You can fine-tune your grant options and hide the data you don't want to show.
Having all your queries in one place is more convenient than having them scattered all around the code.
And here's a (very subjective) article in my blog on the reason I prefer stored procedures:
Schema Junk
BTW, I generally dislike triggers (as opposed to functions / stored procedures / packages).
They are a completely different story.
You're keeping the processing in the database, along with the data.
If you process on the server side, then you have to transfer the data out to a server process across the network, process it, and (optionally) send it back. You have the network bandwidth/latency issues, plus memory overheads.
To clarify - if I have 10m rows of data, my two extreme scenarios are to a) pull those 10m rows across the network and process on the server side, or b) process in place in the database using the server and language (SQL) optimised for this purpose. Note that this is a generalisation and not a hard-and-fast rule, but it's the one I follow for most scenarios.
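A small Rails-flavoured illustration of the same trade-off (the Order model and its columns are hypothetical): both lines compute the same number, but only the first keeps the processing next to the data:

Order.where(status: "paid").sum(:total)        # one SQL SUM, runs in the database
Order.where(status: "paid").map(&:total).sum   # pulls every matching row over the network first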
When many heterogeneous applications and various other systems need to access your single database, and need to be sure that through all their operations the data stays consistent without integrity conflicts, you put your logic into triggers and stored procedures that offer an interface to those external clients.
Maybe not for most web-based systems, but certainly for enterprise databases. Stored procedures and the like allow you much greater control over security and performance, as well as offering a bit of encapsulation for the database itself. You can change the schema all you want as long as the stored procedure interface remains the same.
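A hedged, PostgreSQL-flavoured sketch of that encapsulation (the function and table names are hypothetical): the application keeps calling the same function even if the underlying tables are reshaped later.

class AddAccountBalanceFunction < ActiveRecord::Migration[7.0]
  def up
    execute <<~SQL
      CREATE FUNCTION account_balance(acct_id bigint) RETURNS numeric AS $$
        SELECT COALESCE(SUM(amount), 0) FROM ledger_entries WHERE account_id = acct_id;
      $$ LANGUAGE sql STABLE;
    SQL
  end

  def down
    execute "DROP FUNCTION account_balance(bigint);"
  end
end

# Application code depends only on the function's signature:
balance = ActiveRecord::Base.connection.select_value("SELECT account_balance(42)")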
In (almost) every situation you would keep the processing that is part of the database in the database. Application code cannot substitute for triggers; you won't get very far before you have updated the database and failed to fire the application's equivalent of the triggers (the first time you use the DBMS's management console, for instance).
Let the database do the database's work and let the application do the application's work. If you have a specific performance problem with the database, and that performance problem can be addressed by moving processing out of the database, then you might want to consider doing so.
But worrying about database performance without a database performance problem existing (which is what you seem to be doing here) is both silly and, sadly, apparently a preoccupation of many Stack Overflow posters.
Least scalable? SQL???
Look up "federating."
If the database is shared, having logic in the database is better in order to control everything that happens. If it's not it might just make the system overly complicated.
If you have multiple applications that talk to your database, stored procedures and triggers can enforce correctness more pervasively. Accordingly, if correctness is more important than convenience, putting logic in the database is sensible.
Scalability may be a red herring, though. Sometimes it's easier to express the behavior you want in the domain layer of an OO language, but it can actually be more expensive than doing it the idiomatic SQL way.
The security mechanism at a previous company was first built in the service layer, then pushed to the db side. The motivation was actually due to some limitations in a data access framework we were using. The solution turned out to be a bit buggy because our security model was complicated, but the upside was that bugs only had to be fixed in the database; we didn't have to worry about different clients following different rules.
Triggers mean 3rd-party apps can modify the database without creating logical inconsistencies.
If you do that, you are tying your business logic to your model. If you code all your business logic in T-SQL, you aren't going to have a lot of fun if later you need to use Oracle or what have you as your database server. Actually, I'm not sure I understand this question exactly. How do you think this would improve scalability? It really shouldn't.
Personally, I'm really not a fan of triggers, particularly in a database dedicated to a single application. I hate trying to track down why some data is inconsistent, to find it's down to a poorly written trigger (and they can be tricky to get exactly correct).
Security is another advantage of using stored procs. You do not have to set security at the table level if you don't use dynamic code (including in the stored proc). This means your users cannot do anything unless they have a proc to do it. This is one way of reducing the possibility of fraud.
Further, procs are easier to performance-tune than most application code, and even better, when one needs to change, that proc is all you have to put on production, not a recompile of the whole application.
Data integrity must be maintained at the database level. That means constraints, default values, foreign keys, and possibly triggers (if you have very complex rules or ones involving multiple tables). If you do not do this at the database level, you will eventually have integrity issues. People will write a quick fix for a problem and run the code in the query window, and the required rules are missed, creating a larger problem. A million new records will have to be imported through an ETL program that doesn't go through the application, because running the application code one record at a time would take too long.
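For instance, a minimal migration-level sketch of pushing such rules into the database (the table, constraint, and default are invented for illustration; add_check_constraint needs Rails 6.1+):

class TightenProducts < ActiveRecord::Migration[7.0]
  def change
    # The check constraint binds every writer equally: the application,
    # ETL jobs, and ad hoc fixes run in a query window.
    add_check_constraint :products, "price >= 0", name: "products_price_nonnegative"
    change_column_default :products, :status, from: nil, to: "draft"
  end
end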
If you think you are building an application where scalability will be an issue, you need to hire a database professional and follow his or her suggestions for a design based on performance. Databases can scale to terabytes of data, but only if they are originally designed by someone who is a specialist in this kind of thing. By the time the whole application is running slower than dirt and you have a new large client coming on board, it is too late. Database design must consider performance from the beginning, as it is very hard to redesign when you already have millions of records.
A good way to reduce the scalability of your data tier is to interact with it on a procedural basis (fetch a row, process it, update a row, repeat).
This can be done within a stored procedure by using cursors, or within an application (fetch a row, process, update a row); the result (poor performance) is the same.
When people say they want to do processing in their application, it sometimes implies this kind of procedural interaction.
Sometimes it's necessary to treat data procedurally; however, in my experience developers with limited database experience tend to design systems in a way that does not leverage the strength of the platform, because they are not comfortable thinking in terms of set-based solutions. This can lead to severe performance issues.
For example, to add 1 to a count field on every row in a table, the following is all that's needed:
UPDATE table SET cnt = cnt + 1
The procedural treatment of the same task is likely to be orders of magnitude slower in execution, and developers can easily overlook concurrency issues that make their process inconsistent. For example, this kind of code is inconsistent under the available read isolation levels of many RDBMS platforms.
SELECT id, cnt FROM table
...
foreach row:
    UPDATE table SET cnt = row.cnt + 1 WHERE id = row.id
...
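In ActiveRecord terms (Item is a hypothetical model standing in for the table above), the set-based version is a one-liner that ships the whole operation to the database:

Item.update_all("cnt = cnt + 1")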
I think that, purely in terms of abstraction and ease of servicing a running environment, stored procedures can be a useful tool.
Procedure plan caching and a reduced number of network round trips in high-latency environments can also bring significant performance advantages.
It is also true that trying to be too clever or work very complex problems in the RDBMS's half-baked procedural language can easily become a recipe for disaster.
"Given that database is generally the least scalable component (of a web application), are there any situations where one would put logic in procedures/triggers over keeping it in his favorite programming language (ruby...) or her favorite web framework (...rails!)."
What makes you think that "scalability" is the only relevant concern in a system design? I agree with rexem where he commented that it is very obvious that you are "not" biased ...
Databases are sets of assertions of fact. Those sets become more valuable if they can also be guaranteed to conform to certain integrity rules. Those guarantees are not worth a dime if it is the applications that are expected to enforce such integrity. Triggers and sprocs are the only way SQL systems have to allow such guarantees to be offered by the DBMS itself.
That aspect outweighs "scalability" anytime, anywhere, anyhow.
