I'm working on an existing Rails app, on PostgreSQL, that calculates commissions and various data for contractors.
Employees have many Contractors. Contractors and Employees both have fields that are used in business logic to calculate commissions.
My client wants to have a yearly snapshot of all of their data, so that they can be free to change business logic, add and remove employees, etc without losing their past (calculated) data.
My initial thought for implementing this would be Postgres schemas. I would have a cron task every year that takes the database as-is and copies every table and record into a schema for that year, which would be equivalent to simply keeping that year's version of the DB around. I am worried, however, that application logic would break once columns are added later.
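For what it's worth, here is a minimal sketch of what I have in mind for that task; the table list and the snapshot_<year> schema naming are placeholders, not things that exist in the app yet:

# lib/tasks/yearly_snapshot.rake - rough sketch only
namespace :snapshot do
  desc "Copy every business table into a per-year Postgres schema"
  task yearly: :environment do
    year   = Time.current.year
    schema = "snapshot_#{year}"
    conn   = ActiveRecord::Base.connection

    conn.execute("CREATE SCHEMA IF NOT EXISTS #{schema}")

    %w[employees contractors commissions].each do |table|
      # LIKE ... INCLUDING ALL copies column definitions, defaults and indexes.
      conn.execute("CREATE TABLE #{schema}.#{table} (LIKE public.#{table} INCLUDING ALL)")
      conn.execute("INSERT INTO #{schema}.#{table} SELECT * FROM public.#{table}")
    end
  end
end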
For example, say a schema is created one year, and later a column that is used in a commission calculation gets added to the contractors table. How would I also save the old version of the commission formula that doesn't depend on the new column?
The only solution I can think of is to simply keep the old formulas and conditionally use them based on the schema. I feel like this is very dirty and can lead to a lot of garbage as business logic changes.
How do you recommend I approach this problem? Thanks in advance for your help!
I think you should store the calculated commission in your DB to prevent recalculation. An accepted calculated value is a fact; just persist that value.
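As a sketch of what persisting that fact could look like (the table and column names here are assumptions, not your actual schema):

class CreateCommissionSnapshots < ActiveRecord::Migration[7.0]
  def change
    create_table :commission_snapshots do |t|
      t.references :contractor, null: false, foreign_key: true
      t.integer    :year,       null: false
      t.decimal    :amount,     precision: 12, scale: 2, null: false
      t.jsonb      :inputs,     null: false, default: {}  # parameters used, kept for later audit
      t.timestamps
    end
    add_index :commission_snapshots, [:contractor_id, :year], unique: true
  end
end

# At calculation time, persist the accepted value as a fact:
# CommissionSnapshot.create!(contractor: contractor, year: Date.current.year,
#                            amount: result.amount, inputs: result.params)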
Should you need to audit the calculated fields sometime later, I'm not sure the old calculation logic needs to be conveniently retrievable at the application layer. You might need to trace back through your code's version control (SVN) for this, or the data warehouse could hold the calculation logic. The application can just provide the required calculation parameters and let the auditor handle it.
If the use case is to be able to roll back to a specific set of historical business rules out of the blue, then I wouldn't recommend accommodating such a requirement.
I am fairly new to data warehouses and I want to make sure my plan makes sense. I am using a star schema to capture insurance information for reporting purposes. The users want to see everything as of a specific day. Since every field can change daily, I am planning to have each record in the fact table join to a unique record in each dimension. I know that this will generate a lot of records, but I am unable to think of another way to do it.
I appreciate any insight.
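To make the plan concrete, the daily-grain fact table I have in mind would look roughly like this (all names are illustrative):

class CreatePolicyFacts < ActiveRecord::Migration[7.0]
  def change
    create_table :policy_facts do |t|
      t.references :date_dim,    null: false  # the "as of" day
      t.references :policy_dim,  null: false  # the policy's state on that day
      t.references :insured_dim, null: false  # the insured party's state on that day
      t.decimal    :premium, precision: 12, scale: 2
    end
    add_index :policy_facts, [:date_dim_id, :policy_dim_id], unique: true
  end
end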
Hello and good morning.
I am working on a side project where I am adding an analytics board to an already existing app. The problem is that the users table now has over 400 columns. My question is: what's a better way of organizing this table, such as splitting it off into separate tables? How do you do that, and how do you link the new tables together?
Another concern is that if I separate the table, will I still be able to save into it through the User model? I have code right now that says:
user.wallet += 100
user.save
If I separate wallet from user and link the two tables, will I have to change this code? The reason I'm asking is that there is a ton of code like this in the app.
Thank you so much if you can help me understand how to organize a database. As a bonus, if there is a book that talks about database organization, can you recommend it to me (preferably one that is Rails-oriented)?
Edit: Is there also a way to do all of this without losing any data? For example, transferring the data to a new column on the new table and then destroying the old column.
Please read about:
Database Normalization
You'll get loads of hits when searching for that string and there are many books about database design covering that subject.
It is most likely that this table of yours lacks normalization, but you have to see for yourself!
Just to give some orientation - I would get a little anxious when dealing with a tenth of that number of columns. That said, I clearly have to stress that there can be well-normalized tables with 400 columns as well as sloppily created examples with just 10 columns.
Generally speaking, the probability of dealing with badly designed tables, and hence of facing trouble, simply rises with the number of columns.
So take your time, and if you find that the users table needs normalization, the next step would indeed be to spread the data over several tables. Because that clearly (and most likely even heavily) affects the coding of your application, this is where you have to thoroughly balance the pros and cons - it is simply impossible to judge that from far away.
Say you have substantial problems (e.g. fierce performance problems, or you probably wouldn't be posting) that could be eased by normalization; there are different approaches to how to split the data. Here, please read about:
Cardinalities
Usually the new tables are linked by
Foreign Keys
- that is, identical data (like a user id) that appears in multiple tables and is used to join them.
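For example, taking the wallet column from your question (purely as a sketch - the Wallet model, the wallets table and its balance column are made-up names), the split could look like this:

class Wallet < ApplicationRecord
  belongs_to :user
end

class User < ApplicationRecord
  has_one :wallet_record, class_name: "Wallet", autosave: true

  # Keep existing call sites like user.wallet += 100 followed by user.save working.
  def wallet
    wallet_record&.balance
  end

  def wallet=(amount)
    (wallet_record || build_wallet_record).balance = amount
  end
end

With autosave: true, user.save also persists the changed wallet row (assuming every user gets a wallet row when the data is moved), so existing code like yours would keep working.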
And finally, yes, you can do that without losing data as the overall amount of information never changes when normalizing.
In case your last question was meant to be technical: there is no problem reading data from one column and inserting it into a new one (in a new table). That has to happen in a certain order, as foreign keys have to be filled before you can use them. See
Referential Integrity
However, quite obviously, deleting data and dropping columns interferes with the operability of your application. Good planning is due.
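In Rails terms, that order of operations might look roughly like this (table and column names are again just illustrative):

class ExtractWalletsFromUsers < ActiveRecord::Migration[7.0]
  def up
    create_table :wallets do |t|
      t.references :user, null: false, foreign_key: true  # referential integrity
      t.decimal    :balance, precision: 12, scale: 2, null: false, default: 0
      t.timestamps
    end

    # Backfill the new table from the old column before anything depends on it.
    execute <<~SQL
      INSERT INTO wallets (user_id, balance, created_at, updated_at)
      SELECT id, COALESCE(wallet, 0), now(), now() FROM users
    SQL

    # In practice you would usually defer this until nothing reads users.wallet anymore.
    remove_column :users, :wallet
  end

  def down
    add_column :users, :wallet, :decimal, precision: 12, scale: 2, default: 0
    execute "UPDATE users SET wallet = wallets.balance FROM wallets WHERE wallets.user_id = users.id"
    drop_table :wallets
  end
end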
We have to create a rather large Ruby on Rails application based on a large database. This database is updated daily; each table has about 500,000 records (or more), and this number will grow over time. We will also have to provide proper versioning of all data along with referential integrity. It must be possible for a user to move from version to version, which are a kind of "snapshot" of the main database at different points in time. In addition, some portions of the data need to be served to other external applications with an API.
Considering the large amounts of data, we thought of splitting the database into pieces:
State of the data at present time
Versioned attributes of each table
Snapshots of the first database at specific, historical points in time
Each of those would have its own application, creating a service with an API to interact with the data. This is needed because we don't want to create multiple applications connecting to multiple databases directly.
The question is: is this the proper approach? If not, what would you suggest?
We've never had any experience with a project of this magnitude, and we're trying to find the best possible solution. We don't know if this kind of data separation makes any sense. If it does, how do we provide proper communication between the different applications and the individual services, and between the services themselves, as this will also be required?
In general the amount of data in the tables should not be your first concern. In PostgreSQL you have a very large number of options to optimize queries against large tables. The larger question has to do with what exactly you are querying, when, and why. Your query loads are always larger concerns than the amount of data. It's one thing to have ten years of financial data amounting to 4M rows. It's something different to have to aggregate those ten years of data to determine what the balance of the checking account is.
In general it sounds to me like you are trying to create a system that will rely on such aggregates. In that case I recommend the following approach, which I call log-aggregate-snapshot. In this, you have essentially three complementary models which work together to provide an up-to-date, well-performing solution. However, the restrictions on this are important to recognize and understand.
Event model. This is append-only, with no updates. In this model only inserts occur, plus updates to some metadata used for certain queries, and only as absolutely needed. For a financial application this would be the tables representing the journal entries and lines.
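A sketch of what that could look like, with assumed names:

class CreateJournal < ActiveRecord::Migration[7.0]
  def change
    create_table :journal_entries do |t|
      t.date     :posted_on,  null: false
      t.string   :description
      t.datetime :created_at, null: false  # no updated_at: rows are never updated
    end
    add_index :journal_entries, :posted_on

    create_table :journal_lines do |t|
      t.references :journal_entry, null: false, foreign_key: true
      t.references :account,       null: false
      t.decimal    :amount, precision: 14, scale: 2, null: false
    end
  end
end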
The aggregate closing model. This is append-only (though deletes are allowed for purposes of re-opening periods). This provides roll-forward information for specific purposes. Once a closing entry is in, no entries can be made for a closed period. In a financial application, this would represent closing balances. New balances can be calculated by starting at an aggregation point and rolling forward. You can also use partial indexes to make it easier to pull just the data you need.
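Continuing the made-up names from the sketch above, the roll-forward could look like this (an index on the open period, e.g. a partial index, helps if that is what most queries touch):

class JournalLine < ApplicationRecord
  belongs_to :journal_entry
end

class AccountClosing < ApplicationRecord
  # columns: account_id, closed_on, balance
end

# Current balance = last closing balance + everything posted since.
def current_balance(account_id)
  closing = AccountClosing.where(account_id: account_id).order(:closed_on).last
  start_balance = closing ? closing.balance : 0
  posted_after  = closing ? closing.closed_on : Date.new(1970, 1, 1)

  start_balance + JournalLine.joins(:journal_entry)
                             .where(account_id: account_id)
                             .where("journal_entries.posted_on > ?", posted_after)
                             .sum(:amount)
end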
Auxiliary data model. This consists of smaller tables which do allow updates, inserts, and deletes, provided that integrity with the other models is not compromised. In a financial application this might be things like customer or vendor data, employee data, and the like.
Let's say, for example, I'm managing a Rails application that has static content that's relevant in all of my environments but that I still want to be able to modify if needed. Examples: states, questions for a quiz, wine varietals, etc. There are relations between user content and this static data, and I want to be able to modify it live if need be, so it has to be stored in the database.
I've always managed that with migrations, in order to keep my team and all of my environments in sync.
I've had people tell me dogmatically that migrations should only be for structural changes to the database. I see the point.
My counterargument is that this mostly "static" data is essential for the app to function and if I don't keep it up to date automatically (everyone's already trained to run migrations), someone's going to have failures and search around for what the problem is, before they figure out that a new mandatory field has been added to a table and that they need to import something. So I just do it in the migration. This also makes deployments much simpler and safer.
The way I've concretely been doing it is to keep my test fixture files up to date with the good data (which has the side effect of letting me write more realistic tests) and re-importing them whenever necessary. I do it with connection.execute "some SQL" rather than with the models, because I've found that Model.reset_column_information plus a bunch of Model.create calls sometimes worked if everyone immediately updated, but would eventually explode in my face when I pushed to prod, say, a few weeks later, because I'd have newer validations on the model that would conflict with the two-week-old migration.
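Concretely, a data migration in that style looks something like this (the table and values here are just placeholders):

class AddBordeauxVarietals < ActiveRecord::Migration[7.0]
  def up
    execute <<~SQL
      INSERT INTO wine_varietals (name, created_at, updated_at)
      VALUES ('Merlot', now(), now()),
             ('Cabernet Sauvignon', now(), now())
    SQL
  end

  def down
    execute "DELETE FROM wine_varietals WHERE name IN ('Merlot', 'Cabernet Sauvignon')"
  end
end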
Anyway, I think this YAML + SQL process explodes a little less, but I also find it pretty kludgy. I was wondering how people manage that kind of data. Are there other tricks available right in Rails? Are there gems to help manage static data?
In an app I work with, we use a concept we call "DictionaryTerms", which act as lookup values. Every term has a category that it belongs to. In our case these are demographic terms (hence the data in the screenshot), including terms having to do with gender, race, and location (e.g. state), among others.
You can then use the typical CRUD actions to add/remove/edit dictionary terms. If you need to migrate terms between environments, you could write a rake task to export/import the data from one database to another via a CSV file.
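That rake task could be as small as something like this (the CSV path and the category/name columns are assumptions about your schema):

# lib/tasks/dictionary_terms.rake - rough sketch
require "csv"

namespace :dictionary_terms do
  desc "Export terms to CSV"
  task export: :environment do
    CSV.open("dictionary_terms.csv", "w") do |csv|
      csv << %w[category name]
      DictionaryTerm.find_each { |term| csv << [term.category, term.name] }
    end
  end

  desc "Import terms from CSV (idempotent)"
  task import: :environment do
    CSV.foreach("dictionary_terms.csv", headers: true) do |row|
      DictionaryTerm.find_or_create_by!(category: row["category"], name: row["name"])
    end
  end
end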
If you don't want to have to import/export, then you might want to host that data separate from the app itself, accessible via something like a JSON request, and have your app pull the terms from that request. That seems like a lot of extra work if your case is a simple one.
I want to work with versioned ActiveRecord associations. E.g., I want to find the object that another object belongs_to as of a certain past date, or the one that it belonged to before that. Does there already exist a library subclassing Rails' ActiveRecord to provide versioned relations? Or some other Ruby library which provides persistable versioned relations?
Try the ActsAsVersioned plugin
Provided you're not dealing with huge amounts of data, and the extra temporal dimension won't push your db over the edge, there are no major downsides to historically versioned data. Extra query complexity can be a slight pain, but it's nothing major.
In my case I wrote a Rails plugin that handles versioning; it adds 5 columns to each versioned table (and helps handle querying/manipulation, etc.):
valid_from - datetime - the datetime that this version was created at
valid_to - datetime - the datetime that this version stopped being valid
root_id - integer - the id of the original row (that this is a subsequent version of)
created_by - integer - The user id of the user that performed the creation of this version
retired_by - integer - The user id of the user that retired this version
For currently active rows, valid_to is null. Adding an index on valid_to aids in keeping performance snappy.
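For illustration, the columns plus a couple of query helpers could look like this (the contracts table is a made-up example; the plugin itself isn't shown):

class AddVersioningToContracts < ActiveRecord::Migration[7.0]
  def change
    add_column :contracts, :valid_from, :datetime
    add_column :contracts, :valid_to,   :datetime
    add_column :contracts, :root_id,    :integer
    add_column :contracts, :created_by, :integer
    add_column :contracts, :retired_by, :integer

    add_index :contracts, :valid_to  # keeps "currently active" lookups snappy
  end
end

class Contract < ApplicationRecord
  scope :current, -> { where(valid_to: nil) }
  scope :as_of,   ->(time) {
    where("valid_from <= :t AND (valid_to IS NULL OR valid_to > :t)", t: time)
  }
end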
Supporting historical state in a transactional application is a good way to massively expand complexity, slow DB performance, and make life difficult for yourself. If you only need to display or report on historical state and do not need it up to the minute, consider building a star schema with Type-II slowly changing dimensions and a periodic process that updates it.
This will be substantially less complex than building an application with systemic ad-hoc history tracking running through the code base. If this approach will do what you require of the application you will probably be better off doing it. It also means that the application database will play nicely with the vanilla database access mechanisms that come with the system.
If you need a reasonably frequent refresh, you can implement a change data capture (CDC) system on the database, which is relatively simple if the application only has to be concerned with current state. With a CDC mechanism the load process only has to update based on changes and will run relatively quickly.
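As a sketch of the periodic refresh (the dimension table and the tracked columns here are assumptions), a Type-II update can be done in two statements: close out the changed rows, then insert new current versions:

# lib/tasks/warehouse.rake - rough sketch of a Type-II dimension refresh
namespace :warehouse do
  desc "Refresh the customer dimension from the transactional customers table"
  task refresh_dim_customers: :environment do
    conn = ActiveRecord::Base.connection

    # 1. Close out current dimension rows whose tracked attributes changed.
    conn.execute(<<~SQL)
      UPDATE dim_customers d
      SET valid_to = now()
      FROM customers s
      WHERE d.customer_id = s.id
        AND d.valid_to IS NULL
        AND (d.name IS DISTINCT FROM s.name OR d.region IS DISTINCT FROM s.region)
    SQL

    # 2. Insert a fresh current row for changed and brand-new customers.
    conn.execute(<<~SQL)
      INSERT INTO dim_customers (customer_id, name, region, valid_from, valid_to)
      SELECT s.id, s.name, s.region, now(), NULL
      FROM customers s
      LEFT JOIN dim_customers d
        ON d.customer_id = s.id AND d.valid_to IS NULL
      WHERE d.customer_id IS NULL
    SQL
  end
end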