Updating large amounts of data in a Rails app

I have a Rails app with a table of about 30 million rows that I build from a text document my data provider gives me quarterly. From there I do some manipulation and comparison against some other tables and create an additional table with more customized data.
The first time I did this, I ran a Ruby script through the Rails console. This was slow and obviously not the best way.
What is the best way to streamline this process and update it on my production server with no downtime, or at least very limited downtime?
This is the process I'm thinking is best for now:
Create rake tasks for reading in the data. Use the activerecord-import plugin to do batch writes and to skip ActiveRecord validations (a rough sketch follows this list). Load this data into brand-new, duplicate tables.
Build indexes on the newly created tables.
Rename the newly created tables to the names the Rails app expects.
Delete the old tables.
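Roughly, the import step I have in mind looks like this (the table, column, and file names below are just placeholders, and the line parsing stands in for the script I've already written):

    # lib/tasks/quarterly_import.rake -- rough sketch only; the table, column,
    # and file names are placeholders.
    class MyTableNew < ActiveRecord::Base
      self.table_name = 'my_table_new'   # the brand-new duplicate table
    end

    namespace :quarterly do
      desc 'Bulk-load the quarterly text file into the duplicate table'
      task import: :environment do
        columns = [:col_a, :col_b, :col_c]          # whatever the real columns are
        rows    = []

        File.foreach(ENV.fetch('FILE', 'data/quarterly.txt')) do |line|
          rows << line.chomp.split("\t")            # stand-in for my existing parsing logic
          if rows.size >= 10_000
            MyTableNew.import(columns, rows, validate: false)   # one multi-row INSERT per batch
            rows.clear
          end
        end
        MyTableNew.import(columns, rows, validate: false) if rows.any?
      end
    end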
All of this I'm planning on doing right on the production server.
Is there a better way to do this?
Other notes from comments:
Tables already exist
Old tables and data are disposable
Tables can be locked for select only
Must minimize downtime
Our current setup is two High-CPU Amazon EC2 instances. I believe they have 1.7 GB of RAM each, so holding the entire import in memory is probably not an option.
The new data is a raw, line-delimited text file. I have already written the Ruby script for parsing it.

1) Create "my_table_new" as an empty clone of "my_table".
2) Import the file (in batches of x lines) into my_table_new, building the indexes as you go.
3) Run: RENAME TABLE my_table TO my_table_old, my_table_new TO my_table;
Doing this as one command makes it instant (or close enough), so there is virtually no downtime. I've done this with large data sets, and since the rename is the 'switch', you should retain uptime.
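Assuming MySQL (whose RENAME TABLE can swap several tables in one statement), the switch can be issued straight through the ActiveRecord connection; the table names mirror the ones above:

    # One statement, so the swap is effectively atomic (MySQL syntax); the old
    # data is disposable, so it can be dropped afterwards.
    ActiveRecord::Base.connection.execute(<<-SQL)
      RENAME TABLE my_table TO my_table_old,
                   my_table_new TO my_table
    SQL
    ActiveRecord::Base.connection.execute('DROP TABLE my_table_old')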

Depending on your logic, I would seriously consider processing the data in the database using SQL. This keeps the work close to the data, and 30m rows is typically not something you want to be pulling out of the database and comparing against other data you have also pulled out of the database.
So think outside of the Ruby on Rails box.
SQL has built-in capabilities for joining, comparing, inserting, and updating data; those capabilities can be very powerful and fast, and they let the processing happen right where the data lives.
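As a rough illustration of what that can look like (every table and column name here is invented), the derived table can often be filled by a single INSERT ... SELECT issued from Rails, with the join and comparison done inside the database:

    # Illustration only -- table and column names are invented.
    ActiveRecord::Base.connection.execute(<<-SQL)
      INSERT INTO customized_records (source_id, category, score)
      SELECT s.id, r.category, s.value * r.weight
      FROM   my_table s
      JOIN   reference_table r ON r.code = s.code
      WHERE  s.value IS NOT NULL
    SQL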

Related

How to send SQL queries to two databases simultaneously in Rails?

I have a very high-traffic Rails app. We use an older version of PostgreSQL as the backend database, which we need to upgrade. We cannot use the data-directory copy method because the formats of the data files have changed too much between our existing release and the current PostgreSQL release (10.x at the time of writing). We also cannot use the dump-restore process for the migration because we would either incur downtime of several hours or lose important customer data. Replication is not possible either, as the two DB versions are incompatible for that.
The strategy so far is to have two databases and copy all the data (and functions) from the existing installation to the new one. However, while the copy is happening, we need data arriving at the backend to reach both servers, so that once the data migration is complete, the switch becomes a matter of redeploying the code.
I have figured out the other parts of the puzzle but am unable to determine how to send all writes happening on the Rails app to both DB servers.
I am not bothered if both installations get queried for displaying data to the user (I can discard the data coming out of the new installation); so if it is possible at the driver level, or by adding a line somewhere in ActiveRecord, I am fine with it.
PS: Rails version is 4.1 and the company is not planning to upgrade that.
You can have multiple databases by adding another entry to the database.yml file. After that you can have a separate abstract class, similar to ActiveRecord::Base, and connect it to the new entry.
Have a look at this post.
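A minimal sketch of that setup, assuming you add an entry called new_db to config/database.yml (the entry and class names are illustrative):

    # Sketch only -- "new_db" is an entry you add to config/database.yml.
    class NewDatabaseRecord < ActiveRecord::Base
      self.abstract_class = true
      # Explicit lookup of the named entry; works on Rails 4.1.
      establish_connection ActiveRecord::Base.configurations.fetch('new_db')
    end

    # Models that should talk to the new server inherit from the abstract class:
    class NewUser < NewDatabaseRecord
      self.table_name = 'users'
    end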
However, as far as I can see, that will not solve your problem. Redirecting new data to the new DB while copying from the old one can lead to data inconsistencies.
For example, the ID of a record can end up different when there are two data sources feeding it.
If you are upgrading the DB, I would recommend defining a scheduled downtime and letting your users know in advance. A small downtime is far better than fixing inconsistent data down the line.
When you have a downtime,
Let the customers know well in advance
Keep the downtime minimal
Have a backup procedure: in the event the new site takes longer than you think, roll back to the old site.

How to use a separate table for activity on a model

I am using Rails 4.2.6 and Postgres 9.4.
I have a Queryable table which we use for managing query data. It has about 20k rows, and several different models converge at this point. We have the ability to "rebuild" the table (i.e. delete everything in it and recreate it). However, this takes about 20 minutes, so we don't do it on production.
Is there a way to tell our Queryable model to build a copy at, say, 'queryables_future', rebuild the table there, and when it's complete, delete our current 'queryables' table and rename 'queryables_future' to 'queryables'? Or is there any other workaround you'd propose?
This is something you would do in a queued background job using a tool such as Sidekiq. Background jobs run in a separate process from your main web application, so it takes some configuration and effort to set them up, but they're immensely powerful once you do.
This is a rather broad subject so I'd recommend checking out these links:
https://github.com/mperham/sidekiq
http://edgeguides.rubyonrails.org/active_job_basics.html
https://github.com/tobiassvn/sidetiq
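To give a rough idea of the shape of such a job (the rebuild step itself is left as a comment, and everything except the queryables table name is invented):

    # app/workers/rebuild_queryables_worker.rb -- sketch only
    class RebuildQueryablesWorker
      include Sidekiq::Worker

      def perform
        conn = ActiveRecord::Base.connection

        # Build an empty copy with the same columns and indexes (Postgres).
        conn.execute('DROP TABLE IF EXISTS queryables_future')
        conn.execute('CREATE TABLE queryables_future (LIKE queryables INCLUDING ALL)')

        # ... run the existing 20-minute rebuild here, pointed at queryables_future ...

        # Swap the names in one transaction so readers never see an empty table.
        conn.transaction do
          conn.execute('ALTER TABLE queryables RENAME TO queryables_old')
          conn.execute('ALTER TABLE queryables_future RENAME TO queryables')
          conn.execute('DROP TABLE queryables_old')
        end
      end
    end

You would then kick it off with RebuildQueryablesWorker.perform_async, either manually or on a schedule.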

Give users read-only access to Neo4j while doing Batch Update

This is just a general question, not too technical. We have a use case where we need to load hundreds of thousands of records into an existing Neo4j database. We cannot afford to take the database offline because of the users who are accessing it. I know that Neo4j requires an exclusive lock on the database while it's performing batch updates. Is there a way around my problem? I don't want to lock my database while doing updates. I still want my users to access it, even if only for read-only access. Thanks.
Neo4j never requires an exclusive lock on the whole database. It selectively locks the portions of the graph that are affected by mutating operations. So there are some things you can do to achieve your goal. Are you a Neo4j Enterprise customer?
Option 1: If so, you can run your batch insert on the master node and route users to slaves for reading.
Option 2: Alternatively, you could do a "blue-green" style deployment where you:
take a backup (B) of your existing database (A), then mark the A database read-only
apply your batch inserts onto B, either by starting a separate instance or, even better, by using BatchInserters. That way, you'll insert your hundreds of thousands of records in a few seconds
start the new database B
flip a switch on a load balancer, so that users start to be routed to B instead of A
take A down
(Please let me know if you need some tips on how to make a read-only DB.)
Option 3: If you can only afford to run one instance at any one time, then there are techniques you can employ to let your users access the database as usual and still insert large volumes of data. One of them could be using a single-threaded "writer" with a queue that batches write operations. Because one thread only ever writes to the database, you never run into deadlock scenarios and people can happily read from the database. For option 3, I suggest using GraphAware Writer.
I've assumed you are not trying to insert the hundreds of thousands of nodes into a running Neo4j database using Cypher. If you are, I would start there and change it to use the Java APIs or the BatchInserter API.
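GraphAware Writer itself is a Java library, but the single-writer idea from option 3 is easy to illustrate. Here is a rough, framework-agnostic Ruby sketch of the pattern; write_batch is a stand-in for whatever actually performs the database writes:

    require 'thread'

    # One dedicated thread drains a queue and is the only code path that writes,
    # so readers never contend with concurrent writers.
    class SingleThreadedWriter
      def initialize(batch_size: 1_000)
        @queue      = Queue.new
        @batch_size = batch_size
        @thread     = Thread.new { drain }
      end

      def enqueue(operation)
        @queue << operation   # callers return immediately; only the writer thread touches the DB
      end

      private

      def drain
        loop do
          batch = [@queue.pop]   # block until work arrives
          batch << @queue.pop while !@queue.empty? && batch.size < @batch_size
          write_batch(batch)     # one writer, so no write-write contention or deadlocks
        end
      end

      def write_batch(batch)
        # stand-in: apply the batched operations in a single transaction
      end
    end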

RoR: resetting the staging table after processing a user-uploaded CSV file?

I am pretty new to Ruby on Rails and have been studying it using the Ruby on Rails Tutorial by Michael Hartl.
I am now working on my own project, which allows users to log in to the website, provide personal biometric information, and upload a CSV file of their choice (workout data) to populate the database with the workout information.
I sought help from friends with more experience, and their advice was to create a staging table and use it to populate the other tables (I currently have eight different tables for workout measurements).
I did quite a bit of research on staging table usage online, but couldn't find a solid answer on how to effectively use a staging table to import a CSV file into multiple models.
From my understanding of staging tables, I should reset the staging table every time I (the user) am done uploading and importing the CSV file into the database, but I could not find anything online on whether this is the right practice or not.
Is this the right approach to using staging tables? The only other option that I can think of is creating and dropping a staging table every time the user uploads a file, but that seems too costly for it to be correct.
Thanks!
A "staging table" is simply an intermediate table which will have the field types in the same format as the expected CSV. When a user uploads a CSV file you can read the CSV row wise and populate this table. Having a staging server has the advantage that any expensive processing on the data prior to populating the actual domain tables can be done in background. Two approaches for doing that are described below:
Trigger background processing after saving a data set to staging table.
Once the data has been uploaded to staging server, you can trigger a background job to process the data and populate the models asynchronusly in the backend. I'd recommend the library sidekiq for this purpose. Many other alternative are available in the Ruby Toolbox
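Concretely, the asynchronous step might look roughly like this; StagingRow, Workout, and the column names are stand-ins for the real staging and domain models:

    # Sketch only -- model and column names are placeholders.
    class ProcessStagingRowsWorker
      include Sidekiq::Worker

      def perform(upload_id)
        StagingRow.where(upload_id: upload_id).find_each do |row|
          Workout.create!(
            user_id:      row.user_id,
            performed_on: row.performed_on,
            distance_km:  row.distance_km
          )
          row.destroy   # remove each staging row once it has been processed
        end
      end
    end

    # Enqueued from the controller once the CSV has been written to the staging table:
    # ProcessStagingRowsWorker.perform_async(upload.id)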
Cron jobs
Using this approach, you have a job that periodically checks the staging table and then fills in the data accumulated so far into the relevant target tables. A suitable Ruby library for this is the whenever gem.
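With whenever, the schedule can be as small as this (the runner target is a placeholder):

    # config/schedule.rb -- sketch; the runner target is a placeholder.
    every 10.minutes do
      runner 'StagingRow.process_pending'   # picks up whatever has accumulated in the staging table
    end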
You need not process the staging table in one go, and dropping the staging table after each operation is certainly not recommended: what would happen if someone tried to upload data to the staging table while it was being dropped? Client-server systems should be designed so that they can be used by multiple users concurrently. A good strategy is to lazily process the data in the staging table one row at a time, deleting each row once it has been processed.
Also, for a simpler use case (a single save-process-discard sequence) you can simply save the CSV on disk and process it in the background through the strategies mentioned above, eliminating the need for a staging table. A staging table is especially useful if you plan to populate multiple data stores (possibly spread across geographic boundaries) and/or perform processing through several workers crunching the data concurrently.

Is it safe to run migrations on a live database?

I have a simple rails-backed app running 2-3 million pageviews a day off a Heroku Ronin database. The load on the database is pretty light, though, and it could handle a lot more than we're throwing at it.
Is it safe for me to run a migration to add tables to this database without going into maintenance mode? Also, would it be safe to run a migration to add a few columns to the core table responsible for almost all of the reads and writes?
Downtime is not acceptable, even for a few minutes.
If running migrations live isn't advisable, what I'll probably do is set up a new database, run the migrations on that, write a script to sync the two databases, and then point the app at the new one.
But I'd rather avoid that if possible. :)
Sounds like your migration includes:
adding new tables (perhaps indexes? If so, that could take a bit longer than you might expect)
adding new columns (default values and/or nullable?)
wrapping your changes in a transaction (?)
Suggest you gauge the impact that your changes will have on your Prod environment by:
taking a backup of Prod (with all the Prod data within)
running your change scripts against that copy, timing each operation
Balance the two points above against the typical read and write load at the time you're expecting to run this (02:00, right?).
Consider a 'soft' downtime by disabling (somehow) write operations to the tables being affected.
Overall, adding n tables and new nullable columns to an existing table can likely be done without any downtime or performance impact.
Always measure the impact your changes will have on a copy of Prod. Measure 'responsiveness' at the time you apply your changes to this copy. Of course this means deploying another copy of your Prod app as well, but the effort would be worthwhile.
Assuming it's a Postgres database (which it should be on Heroku):
http://www.postgresql.org/docs/current/static/explicit-locking.html
ALTER TABLE will acquire an ACCESS EXCLUSIVE lock, so the table will be locked while the statement runs.
On top of this, if you are going to be adding tables to the application or modifying model code in any way, you will need to restart the Rails application so that it is aware of the new models.
As for pointing the app at a freshly modified copy of the database: how are you going to sync the data, and also sync the changes that arrive in the old database during the time the sync takes?
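To make the lock concern concrete, here is a hedged sketch of the lower-risk kind of change (table and column names are made up): a nullable column with no default is a catalog-only change in Postgres, whereas a column with a default rewrites the table on versions before 11.

    # Sketch only -- table and column names are invented.
    class AddPreferencesToAccounts < ActiveRecord::Migration
      def up
        # Nullable column, no default: a catalog-only change in Postgres, so the
        # ACCESS EXCLUSIVE lock is held only for an instant.
        add_column :accounts, :preferences, :text

        # add_column :accounts, :plan, :string, default: 'free'
        # ^ on PostgreSQL versions before 11 this rewrites the whole table and
        #   holds the lock for the duration -- avoid on a busy core table.
      end

      def down
        remove_column :accounts, :preferences
      end
    end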
Adding tables shouldn't be a concern, as your application won't be aware of them until the proper upgrades are done. As for adding columns to a core table, I'm not so sure. If you really need to prevent downtime, perhaps it's better to add a secondary table (linked to the core table by an ID) that holds your extra columns.
Just my two cents.
