nosql dynamic fields in rails

I'm writing a Rails application that has a user document with about 20 different attributes. Each time an attribute is updated, I need to store the change in a transactions document that records who made the change, which attribute was changed, and the attribute's old and new values.
Does it make sense to have a separate document to store transactions? Or should I use a NoSQL DB like CouchDB, which supports versioning by default, so that I don't have to worry about creating a transactions document at all?
If I do decide to create a transactions document, then the keys of the document will be dynamic.
When I need to pull history, can I pull out all versions of a document and figure out the changes dynamically?

I would not store all transactions for a given user in a single document. That document will become very large and may take up a lot of memory when you have to bring it into memory. It might also be difficult to query on individual transactions (i.e. find all transactions for a given user that modified the name attribute).

The versioning in CouchDB and similar NoSQL databases is a little misleading, and I fell into the same trap you just did. Versioning here simply means optimistic concurrency: if you want to update a document, you need to provide its current version number with the update, to be sure that nothing has been overwritten. So if you fetch a document and someone else changes it in the meantime, your version number is out of date (out of sync) and you need to read the document again and re-apply your changes before submitting it to the database. Some NoSQL stores let you ignore this versioning, while others (like CouchDB) enforce it.
Back to the topic: versioning won't do what you want. What you are really looking for is a log store that is written often and read seldom (I assume you won't read the history very often). Cassandra is a perfect fit if you require high throughput, but any other NoSQL or SQL DB might do the job as well, depending on your performance requirements.
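To make the "separate transactions document" idea concrete, here is a minimal sketch assuming Mongoid and hypothetical model and field names; adapt it to whichever store you end up picking:

class AttributeChange
  include Mongoid::Document
  include Mongoid::Timestamps::Created

  field :attribute_name, type: String
  field :old_value,      type: String
  field :new_value,      type: String

  belongs_to :user                               # the document that was changed
  belongs_to :changed_by, class_name: "User"     # who made the change

  # supports "all changes to :name for this user" queries
  index(user_id: 1, attribute_name: 1)
end

# Recording a change, e.g. from a callback or a service object:
AttributeChange.create!(user: user, changed_by: current_user,
                        attribute_name: "email",
                        old_value: old_email, new_value: new_email)

One row (or document) per change keeps each record small and makes the history easy to filter by user, attribute, or actor.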

Related

Ruby on Rails database and application design

We have to create a rather large Ruby on Rails application based on a large database. This database is updated daily; each table has about 500,000 records (or more) and this number will grow over time. We will also have to provide proper versioning of all data along with referential integrity. It must be possible for users to move from version to version, where each version is a kind of "snapshot" of the main database at a different point in time. In addition, some portions of the data need to be served to other external applications with an API.
Considering large amounts of data we thought of splitting database into pieces:
State of the data at present time
Versioned attributes of each table
Snapshots of the first database at specific, historical points in time
Each of those would have its own application, creating a service with an API to interact with the data. This is needed because we don't want multiple applications connecting to multiple databases directly.
The question is: is this the proper approach? If not, what would you suggest?
We've never had any experience with a project of this magnitude and we're trying to find the best possible solution. We don't know if this kind of data separation makes any sense. If it does, how do we provide proper communication between the different applications and the individual services, and between the services themselves, since this will also be required?
In general the amount of data in the tables should not be your first concern. In PostgreSQL you have a very large number of options to optimize queries against large tables. The larger question has to do with what exactly you are querying, when, and why. Your query loads are always larger concerns than the amount of data. It's one thing to have ten years of financial data amounting to 4M rows. It's something different to have to aggregate those ten years of data to determine what the balance of the checking account is.
In general it sounds to me like you are trying to create a system that will rely on such aggregates. In that case I recommend the following approach, which I call log-aggregate-snapshot. In it, you have essentially three complementary models which work together to provide an up-to-date, well-performing solution. However, the restrictions on this are important to recognize and understand.
Event model. This is append-only, with no updates: inserts occur, and metadata used by certain queries is updated only when absolutely needed. For a financial application these would be the tables representing the journal entries and lines.
The aggregate closing model. This is append-only (though deletes are allowed for purposes of re-opening periods). This provides roll-forward information for specific purposes. Once a closing entry is in, no entries can be made for a closed period. In a financial application, this would represent closing balances. New balances can be calculated by starting at an aggregation point and rolling forward. You can also use partial indexes to make it easier to pull just the data you need.
Auxiliary data model. This consists of smaller tables which do allow updates, inserts, and deletes, provided that the integrity of the other models is not compromised. In a financial application this might be things like customer or vendor data, employee data, and the like.
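As a rough illustration only, here is what the first two models might look like as Rails migrations for a financial application; all table and column names are hypothetical:

class CreateLogAggregateTables < ActiveRecord::Migration[7.0]
  def change
    # Event model: append-only journal lines, never updated
    create_table :journal_lines do |t|
      t.references :account, null: false
      t.decimal    :amount, precision: 15, scale: 2, null: false
      t.date       :entered_on, null: false
      t.timestamps
    end

    # Aggregate closing model: one closing balance per account per closed period
    create_table :closing_balances do |t|
      t.references :account, null: false
      t.date       :period_end, null: false
      t.decimal    :balance, precision: 15, scale: 2, null: false
    end
    add_index :closing_balances, [:account_id, :period_end], unique: true

    # Partial index (PostgreSQL) so "entries since the last closing" queries
    # only scan the open period; the cutoff date here is purely illustrative
    add_index :journal_lines, :entered_on,
              where: "entered_on > '2024-01-01'",
              name: "index_journal_lines_open_period"
  end
end

Current balances are then computed by taking the latest closing balance and rolling forward only the journal lines entered after that closing point.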

Lazily downloading fields from DB with Mongoid, or handling large documents

We've designed our Mongo database to be highly denormalized, so many documents in our collections contain very large arrays in some of their fields. Naturally, this can make downloads from our DB take longer than necessary because the documents are just so large.
Whenever we need to grab some records from the DB, I have mitigated the performance impact by using .only to select just the fields I want, but this means I have to download up front any data I might need later, and in general it's a lot more involved to keep track of which fields end up being needed when I query for the document(s).
Does Mongoid have a way that I can simply define particular fields in my model as ones that should be lazily loaded so that I grab them from the server just when they're first accessed? I searched through Mongoid's documentation to see if it had anything built in, but I'm not seeing any such thing. Perhaps there's a third party gem that adds this functionality to Mongoid?
Mongoid doesn't have support for lazily loading fields from the server, and I'm not aware of any plugins that add it either.
While technically you could add this to Mongoid, you would still be better off manually specifying .only so you load everything you need in one go. If you lazily loaded fields based on usage, you would have to pull data from MongoDB every time a field was accessed while still nil.
That means if you accessed 5 different fields on top of the original document load, you would be sending 6 queries to MongoDB, each with its own round trip and processing overhead, compared to just specifying the fields in .only in the first place.
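For what it's worth, a minimal sketch of the .only approach, using a hypothetical User model whose :activity_log field is the large array:

users = User.only(:name, :email).where(country: "US")
users.each do |u|
  u.name          # selected, so it is available
  # u.activity_log was not fetched; accessing a field you did not select is an
  # error on recent Mongoid versions, so include everything you know you will need.
end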

What's the best practice for handling mostly static data I want to use in all of my environments with Rails?

Let's say, for example, I'm managing a Rails application that has static content that's relevant in all of my environments but that I still want to be able to modify if needed. Examples: states, questions for a quiz, wine varietals, etc. There are relations between the user content and this static data, and I want to be able to modify it live if need be, so it has to be stored in the database.
I've always managed that with migrations, in order to keep my team and all of my environments in sync.
I've had people tell me dogmatically that migrations should only be for structural changes to the database. I see the point.
My counterargument is that this mostly "static" data is essential for the app to function and if I don't keep it up to date automatically (everyone's already trained to run migrations), someone's going to have failures and search around for what the problem is, before they figure out that a new mandatory field has been added to a table and that they need to import something. So I just do it in the migration. This also makes deployments much simpler and safer.
The way I've concretely been doing it is to keep my test fixture files up to date with the good data (which has the side effect of letting me write more realistic tests) and re-import them whenever necessary. I do it with connection.execute "some SQL" rather than with the models, because I've found that Model.reset_column_information plus a bunch of Model.create calls sometimes worked if everyone updated immediately, but would eventually blow up in my face when I pushed to prod, say, a few weeks later, because by then I'd have newer validations on the model that conflicted with the two-week-old migration.
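Concretely, one of those data migrations looks roughly like this minimal sketch (the table and column names here are just hypothetical examples of the kind of thing I seed):

class AddCabernetVarietal < ActiveRecord::Migration[7.0]
  def up
    # raw SQL so the insert doesn't depend on the current state of the model code
    connection.execute <<~SQL
      INSERT INTO wine_varietals (name, created_at, updated_at)
      VALUES ('Cabernet Sauvignon', NOW(), NOW());
    SQL
  end

  def down
    connection.execute "DELETE FROM wine_varietals WHERE name = 'Cabernet Sauvignon';"
  end
end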
Anyway, I think this YAML + SQL process explodes a little less, but I also find it pretty kludgey. I was wondering how other people manage this kind of data. Are there other tricks available right in Rails? Are there gems to help manage static data?
In an app I work with, we use a concept we call "DictionaryTerms" that works as lookup values. Every term has a category that it belongs to. In our case these are demographic terms, including terms having to do with gender, race, and location (e.g. state), among others.
You can then use the typical CRUD actions to add/remove/edit dictionary terms. If you need to migrate terms between environments, you could write a rake task to export/import the data from one database to another via a CSV file.
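A minimal sketch of such export/import rake tasks, assuming a hypothetical DictionaryTerm model with name and category columns:

# lib/tasks/dictionary_terms.rake
namespace :dictionary_terms do
  desc "Export dictionary terms to a CSV file"
  task export: :environment do
    require "csv"
    CSV.open("dictionary_terms.csv", "w") do |csv|
      csv << %w[name category]
      DictionaryTerm.find_each { |t| csv << [t.name, t.category] }
    end
  end

  desc "Import dictionary terms from a CSV file"
  task import: :environment do
    require "csv"
    CSV.foreach("dictionary_terms.csv", headers: true) do |row|
      DictionaryTerm.find_or_create_by!(name: row["name"], category: row["category"])
    end
  end
end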
If you don't want to have to import/export, then you might want to host that data separate from the app itself, accessible via something like a JSON request, and have your app pull the terms from that request. That seems like a lot of extra work if your case is a simple one.

Should I keep a file as text or import it to a database?

I am constructing an anagram generator as a coding exercise; it uses a word list that's about 633,000 lines long (one word per line). I wrote the program in plain Ruby originally, and I would like to modify it so it can be deployed online.
My hosting service supports Ruby on Rails as about the only Ruby-based solution. I thought of hosting on my own machine, and using a smaller framework, but I don't want to deal with the security issues at this moment.
I have only used RoR for database-driven (CRUD) apps. However, I have never populated an SQLite database this way, so this is a two-part question:
1) Should I import the list into a database? If so, what's the best method to do so? I would like to stick with SQLite to keep things simple if that's the case.
2) Is a 'flat file' better? I won't be doing any creating or updating, just checking against the list of words.
Thank you.
How about keeping it in memory? Storing that many words would take just a few megabytes of RAM, and otherwise you'd be accessing the file frequently so it'd probably be cached anyway. The advantage of keeping the word list in memory is that you can organize it in whatever data structure suits your needs best (I'm thinking a trie). If you can't spare that much memory, it might be to your advantage to use a database so you can efficiently load only the parts of the word list you need for any given query - of course, in that case you'd want to create some index columns (well at least one) so you can take advantage of the indexing capabilities of SQL.
Assuming that what you're doing is looking up whether a word exists in your list, I would say that SQLite with an indexed column will likely be faster than scanning through the word list linearly. Now, if your current approach is fast enough for your purposes, then I see no reason to bother porting it over to a database; it's just an added headache for no gain as far as you're concerned. If you're seeing the search times become a burden, then dumping it into an indexed database would be a good idea.
You can create the table with the following schema:
CREATE TABLE words (
  word TEXT PRIMARY KEY  -- the primary key already gives you a unique index on word
);
And import your data with:
# create the database and table from the schema above
sqlite3 words.db < schema.sql
# bulk-load the word list (one word per line) into the words table in a single pass
sqlite3 words.db ".import words.txt words"
I would skip the database for the reasons listed above. A simple hash in memory will perform about as fast as a lookup in the database.
Even if the database was a bit faster for the lookup, you're still wasting time with the DB having to parse the query and create a plan for the lookup, then assemble the results and send them back to your program. Plus you can save yourself a dependency.
If you plan on moving other parts of your program to a persistent store, then go for it. But a hashmap should be sufficient for your use.
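A minimal sketch of that in-memory approach for the anagram use case: key each word by its sorted letters so that all anagrams share a bucket (the file name is assumed):

# Build the lookup table once at startup
ANAGRAMS = Hash.new { |hash, key| hash[key] = [] }
File.foreach("words.txt") do |line|
  word = line.chomp.downcase
  ANAGRAMS[word.chars.sort.join] << word
end

def anagrams_of(word)
  ANAGRAMS[word.downcase.chars.sort.join]
end

anagrams_of("listen")   # => e.g. ["enlist", "listen", "silent"], depending on the word list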

How to achieve versioned ActiveRecord associations?

I want to work with versioned ActiveRecord associations. E.g., I want to find the object that another object belongs_to as of a certain past date, or the one that it belonged to before that. Does there already exist a library subclassing Rails' ActiveRecord to provide versioned relations? Or some other Ruby library which provides persistable versioned relations?
Try the ActsAsVersioned plugin
Provided you're not dealing with huge amounts of data, and the extra temporal dimension won't push your db over the edge, there are no major downsides to historically versioned data. Extra query complexity can be a slight pain, but it's nothing major.
In my case I wrote a Rails plugin that handles versioning; it adds five columns to each versioned table (and helps handle querying, manipulation, etc.):
valid_from - datetime - the datetime that this version was created at
valid_to - datetime - the datetime that this version stopped being valid
root_id - integer - the id of the original row (that this is a subsequent version of)
created_by - integer - The user id of the user that performed the creation of this version
retired_by - integer - The user id of the user that retired this version
For currently active rows, valid_to is null. Adding an index on valid_to aids in keeping performance snappy.
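The plugin itself isn't shown here, but as a rough sketch, a migration adding those columns to a hypothetical products table might look like this:

class AddVersioningColumnsToProducts < ActiveRecord::Migration[7.0]
  def change
    add_column :products, :valid_from, :datetime, null: false
    add_column :products, :valid_to,   :datetime   # NULL while this version is the active one
    add_column :products, :root_id,    :integer    # id of the original row this is a version of
    add_column :products, :created_by, :integer    # user who created this version
    add_column :products, :retired_by, :integer    # user who retired this version

    # Queries for currently active rows filter on valid_to IS NULL, so index it
    add_index :products, :valid_to
  end
end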
Supporting historical state in a transactional application is a good way to massively expand complexity, slow DB performance, and make life difficult for yourself. If you only need to display or report on historical state, and do not need it up to the minute, consider building a star schema with Type-II slowly changing dimensions and a periodic process that updates it.
This will be substantially less complex than building an application with systemic ad-hoc history tracking running through the code base. If this approach will do what you require of the application you will probably be better off doing it. It also means that the application database will play nicely with the vanilla database access mechanisms that come with the system.
If you need reasonably frequent refresh you can implement a changed-data capture system on the database, which is relatively simple if the application only has to be concerned with current state. With a CDC mechanism the load process only has to update based on changes and will run relatively quickly.
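As a rough sketch of what a Type-II dimension could look like (a hypothetical customer dimension; in practice the warehouse schema would usually live outside the Rails app):

class CreateDimCustomers < ActiveRecord::Migration[7.0]
  def change
    create_table :dim_customers do |t|
      t.integer  :customer_id,    null: false  # natural key from the transactional system
      t.string   :name
      t.string   :segment
      t.datetime :effective_from, null: false  # when this version became current
      t.datetime :effective_to                 # NULL while this version is current
      t.boolean  :current,        null: false, default: true
    end
    add_index :dim_customers, [:customer_id, :current]
  end
end

Each change to a customer closes the current row (sets effective_to and current to false) and appends a new row, so historical reports can join facts against the version that was in effect at the time.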
