I want to work with versioned ActiveRecord associations. E.g., I want to find the object that another object belongs_to as of a certain past date, or the one that it belonged to before that. Does there already exist a library subclassing Rails' ActiveRecord to provide versioned relations? Or some other Ruby library which provides persistable versioned relations?
Try the ActsAsVersioned plugin
Provided you're not dealing with huge amounts of data, and the extra temporal dimension won't push your db over the edge, there are no major downsides to historically versioned data. Extra query complexity can be a slight pain, but it's nothing major.
In my case I wrote a rails plugin that handles versioning, it adds 5 columns to each versioned table (and helps handle querying/manipulation etc):
valid_from - datetime - the datetime that this version was created at
valid_to - datetime - the datetime that this version stopped being valid
root_id - integer - the id of the original row (that this is a subsequent version of)
created_by - integer - The user id of the user that performed the creation of this version
retired_by - integer - The user id of the user that retired this version
For currently active rows, valid_to is null. Adding an index on valid_to aids in keeping performance snappy.
Supporting historical state in a transactional application is a good way to massively expand complexity, slow DB performance and make life difficult for yourself. If you only need to display or report on historical state and do not need it up to the minute consider building a star schema with Type-II slowly changing dimensions and a periodic process that updates it.
This will be substantially less complex than building an application with systemic ad-hoc history tracking running through the code base. If this approach will do what you require of the application you will probably be better off doing it. It also means that the application database will play nicely with the vanilla database access mechanisms that come with the system.
If you need reasonably frequent refresh you can implement a changed-data capture system on the database, which is relatively simple if the application only has to be concerned with current state. With a CDC mechanism the load process only has to update based on changes and will run relatively quickly.
Related
I'm working on an existing Rails app, on Postgresql, that calculates commissions and various data for contractors.
Employees have many Contractors. Contractors and Employees both have fields that are used in business logic to calculate commissions.
My client wants to have a yearly snapshot of all of their data, so that they can be free to change business logic, add and remove employees, etc without losing their past (calculated) data.
My initial thought in implementing this would be Postgres schemas. I would have a cron task every year that takes the database as-is and copies every table and record to a schema for that year. That would be equivalent to simply having the older version of the DB in the future. I am worried, however, that application logic would break once columns are added in the future.
For example, a schema is created one year and a column gets added to a contractors table later that is used in a commissions calculation. How would I also save the old version of this commissions formula that doesn't depend on the new column?
The only solution I can think of is to simply keep the old formula and conditionally use them based on schema. I feel like this is very dirty and can lead to a lot of garbage as business logic changes.
How do you recommend I approach this problem? Thanks in advance for your help!
I think you should have stored the calculated commision in your db to prevent recalculation. An accepted calculated value is a fact, just persist that value.
Should you need to audit the calculated fields sometime later, Im not sure the old calculation logic should be made very convenient to retrieve on application layer. You might need to trace back your code svn for this. Or the data warehouse should have the calculation logic. The application can only provide the required calculation parameters and let the auditor handle it.
If the usecase is to easily rollback to a specific historical business rules out of blue, then I wouldnt recommend to accommodate such requirement.
I'm using PostgreSQL in a Rails 3.2 application that receives updates from a third party all day long. Sometimes this third party will throw over 2,000 requests a minute at my application, each update consisting of a large XML file.
Right now I am storing basic information from each XML file into a table. Then, a background process picks up big chunks of data in that table and copies the data into a table using PostgreSQL's COPY feature.
Am I doing the right thing or the wrong thing here? This table that is the load target is also the major CRUD target of the UI. Does the COPY feature lock the entire table when the load happens, and should I be doing a bunch of inserts instead? I originally thought the inserts would be too expensive, but if the direct load locks the whole table then that's going to be a problem.
COPY is the lowest level way to mass-insert records into PostgreSQL. I like your solution to post-process the records in a background job.
Alternatively, if you need to have performance and maintain some Rails/Ruby functionality, consider the
activerecord-import gem. The gem will perform mass-insertions and allow ActiveRecord callbacks and validations to be used as needed. Even if you use this for post-processing of the bulk COPYed records, it may gain you a significant performance increase.
Here is a good article for using activerecord-import:
http://ruby-journal.com/how-to-import-millions-records-via-activerecord-within-minutes-not-hours/
This is what the Postgres team recommends for optimal import performance: http://www.postgresql.org/docs/current/interactive/populate.html
I'm writing a rails application that has a user document which has about 20 different attributes. Each time an attribute is updated, I need to store it in a transactions document which will have who changed, which attribute was changed and the old value and new value of the attribute.
Does it make sense to have a separate document to store transactions? or should I use a noSQL DB like CouchDB which supports versioning by default and then I don't have to worry about creating a transactions document.
If I do decide to create a transaction document, then the key of the document will be dynamic.
When I need to pull history, I can pull out all versions of a document and dynamically figure it out?
I would not store all transactions for a given user in a single document. This document will become very large and it may begin to take a up a lot of memory when you have to bring it into memory. It might also be difficult to query on various transactions (i.e. find all transactions for a given user that modified the name attribute).
The Versioning in CouchDB and similar NoSQL databases is a little bit misleading, and I tapped into the same mistake as you just did before. Versioning simply means - optimistic concurrency. This means that if you want to update a document, you need to provide the old version number with it, to be sure that nothing has been overwritten. So when you get a document and in the meanwhile someone else changes it, your version number is out of date (or out of sync) and you need to read it again and apply the required changes before submitting it to the database. Also some NoSQL stores allow you to ignore this versioning, while others (like CouchDB) enforce it.
Back to topic - Versioning won’t do what you want, you are rather looking for a log store with write often, read seldom (as I assume, you won’t read the history that often). In that case Cassandra is perfect for this, if you require a high throughput, but any other NoSQL or SQL DB might do the job as well, depending on your performance requirements.
We have to create rather large Ruby on Rails application based on large database. This database is updated daily, each table has about 500 000 records (or more) and this number will grow over time. We will also have to provide proper versioning of all data along with referential integrity. It must be possible for user to move from version to version, which are kind of "snapshots" of main database at different points of time. In addition some portions of data need to be served to other external applications with and API.
Considering large amounts of data we thought of splitting database into pieces:
State of the data at present time
Versioned attributes of each table
Snapshots of the first database at specific, historical points in time
Each of those would have it's own application, creating a service with API to interact with the data. It's needed as we don't want to create multiple applications connecting to multiple databases directly.
The question is: is this the proper approach? If not, what would you suggest?
We've never had any experience with project of this magnitude and we're trying to find the best possible solution. We don't know if this kind of data separation has any sense. If so, how to provide proper communication of different applications with individual services and between services themselves, as this will be also required.
In general the amount of data in the tables should not be your first concern. In PostgreSQL you have a very large number of options to optimize queries against large tables. The larger question has to do with what exactly you are querying, when, and why. Your query loads are always larger concerns than the amount of data. It's one thing to have ten years of financial data amounting to 4M rows. It's something different to have to aggregate those ten years of data to determine what the balance of the checking account is.
In general it sounds to me like you are trying to create a system that will rely on such aggregates. In that case I recommend the following approach, which I call log-aggregate-snapshot. In this, you have essentially three complementary models which work together to provide up-to-date well-performing solution. However the restrictions on this are important to recognize and understand.
Event model. This is append-only, with no updates. In this model inserts occur, and updates to some metadata used for some queries only as absolutely needed. For a financial application this would be the tables representing the journal entries and lines.
The aggregate closing model. This is append-only (though deletes are allowed for purposes of re-opening periods). This provides roll-forward information for specific purposes. Once a closing entry is in, no entries can be made for a closed period. In a financial application, this would represent closing balances. New balances can be calculated by starting at an aggregation point and rolling forward. You can also use partial indexes to make it easier to pull just the data you need.
Auxiliary data model. This consists of smaller tables which do allow updates, inserts, and deletes provided that integrity to the other models is not impinged. In a financial application this might be things like customer or vendor data, employee data, and the like.
I'm working on a project with developers who have not worked with Ruby OR Rails before.
They have created a schema that is too complicated, in my opinion. The schema has 117 tables, and obtaining the simplest piece of information would require traversing/joining 7 tabels...and of course, there's no "main" table that serves as a sort of key between them. The schema renders many of the rails tools like 'find' method, and many of the has_many/belongs to relationships almost useless. And coding for all of these relationships will likely be more time-consuming than we have the money to code for.
THE QUESTION:
Assuming you are VERY convinced (IMHO...hehe) that the schema is not ideal, and there are multiple ways to represent the domain, how would you argue FOR simplifying the schema (aside from what I've already said)?
I'll stand up in 2 roles here
DBA: Database admin/designer.
Dev: Application developer.
I assume the DBA is a person who really know all the Database tricks. Reaallyy Knows.
DBA:
Database is the key of the application and should have predefined structure in order to serve its purpose well and with best performance.
If you cannot use random schema (which is reasonably normalised and good) then the tools are wrong.
Dev:
The database is just a data store, so we need to keep it simple and concentrate on the application.
DBA:
Database is not a store it is the core of the application. There is no application without database.
Dev:
No. The application is the core. There is no application without the front-end and the business logic applied to it.
And the war begins...
Both points are valid and it is always trade off.
If the database will ONLY be used by RoR, then you can use it more like a simple store.
If the DB can be used by other application OR it will be used with large amount of data and high traffic it must enforce some best practices.
Generally there is no way you can disagree with DBA.
But they can understand your situation and might allow you to loose the standards a bit so you could be more productive.
So you need to work closely, together.
And you need to talk to each other to explain and prove the point why database should be like this or that.
Otherwise, the team is broken and project can be failure with hight probability.
ActiveRecord is a very handy tool. But it cannot do everything for you. It does not provide Database structure by default that you expect exactly. So it should be tuned.
On the other side. If DBA can accept that all PKs are Auto incremented integers that would make Developer's life easier (ActiveRecord does it by default).
On the other side, if developers would accept some of DBA constraints it would make DBA's life easier.
Now to answer your question:
how would you argue FOR simplifying the schema
Do not argue. Meet the team and deliver the message and point on WHY it should be done.
Maybe it really shouldn't and you don't know all the things, maybe they are not aware of something.
You could agree on the general structure of the database AND try to describe it using RoR migrations as a meta language.
This way they would see the general picture, and you would use your great ActiveRecords.
And also everybody would be on the same page.
Your DB schema should reflect the domain and its relationships.
De-normalisation should only be done when you have measured that there is a performance problem.
7 joins is not excessive or bad, provided you have good indexes in place.
The general way to make this argument up the chain is based on cost. If you do things simply, there will be less code and fewer bugs. The system will be able to be built more quickly, or with more features, and thus will create more ROI. If you can get the money manager on board with that approach, he or she may let you dictate terms to the team. There is the counterargument that extreme over-normalization prevents bad data, but I have found that this is not the case, as the complexity it engenders tends to lead to more errors and more database code in general.
The architectural and technical argument here is simple. You have decided to use Ruby on Rails. Therefore you have decided to use the ActiveRecord pattern. The ActiveRecord pattern is driven by having the database tables match the object model. That's the pattern in use here, and in many other places, so the best practices they are trying to apply for extreme data normalization simply do not apply. Buy a copy of Patterns of Enterprise Application Architecture and put the little red bookmark at page 160 so they can understand how the pattern works from the architecture perspective.
What the DBA types tend to be unaware of is how much work ActiveRecord does for you, from query generation, cascading deletes, optimistic locking, auto populated columns, versioning (with acts_as_versioned), soft deletes (with acts_as_paranoid), etc. There is a strong argument to use well tested, community supported library functions to perform these operations versus custom code that must be maintained by a DBA.
The real issue with DBAs is then that they need some work to do. Let them focus on monitoring performance, finding slow queries in the code, creating indexes and doing backups.
If you end up losing the political battle for a sane schema, you may want to consider switching to DataMapper. It's the next pattern in PoEAA. The other thing you may be able to get them to do is to create views in the database that correspond to the object model. This way, you could use many of the finding capabilities in the ActiveRecord model based on the views, but have custom insert, update, and delete methods.