EDIT: I have totally rewritten this question for clarity. I got no comments and no answers earlier.
I am maintaining a Rails 2.x app with plenty of statistical data. Some of the data is real and some is estimated for future years. Every year I need to replace the estimated data with real data and calculate new estimates.
I have been loading the data into the app every year with big YAML files and migrations. My migrations are full of estimation calculations and data corrections.
Problem
My migrations are full of non-schema-related material, and I can't even dream of running db:migrate:reset without waiting a few hours (if it even works). I'd love to see my migrations nice and clean - containing only schema-related modifications. But how am I supposed to update the data every year if not through migrations?
Help needed
I'd like to hear your comments and answers. I'm not looking for a silver bullet - more like best practices and ideas on how people handle similar situations.
It sounds like you have a large operation (data load using yml files) once a year but smaller operations once a month.
From my experience with statistical data you will probably end up doing more and more of these operations to clean and add more data.
I would use a job processing framework like Resque together with resque-scheduler.
You can schedule jobs to run once a month, once a year, once a day, or continuously. A job would be something like loading a YAML file (or a set of YAML files) or cleaning up data. You can pass parameters to your jobs, so a single job class can vary how it updates or cleans your data depending on how you enqueue or schedule it.
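For illustration, here is a minimal sketch of such a job; the LoadYearlyData class, the :statistics queue and the file path are my own assumptions, not part of the question:

# app/jobs/load_yearly_data.rb (hypothetical class and queue names)
class LoadYearlyData
  @queue = :statistics

  def self.perform(yml_path)
    data = YAML.load_file(yml_path)
    # ... create or update your statistics records from the parsed data ...
  end
end

# enqueue it manually, or point resque-scheduler's schedule at the same class:
Resque.enqueue(LoadYearlyData, 'db/static_data/2011.yml')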
First of all, I have to say that this is a very interesting question. As far as I know, it isn't a good idea to load data from migrations. Generally speaking you should use db/seeds.rb for loading data into your db, and I think it could be a good idea to write a little helper class to put in your lib dir and then call it from db/seeds.rb. I imagine you could organize your files in the following way:
lib/data_loader.rb
lib/years/2009.rb
lib/years/2010.rb
Obviously, you should clean up your migrations and write lib/data_loader.rb however you prefer; I was only trying to offer a general idea of how I'd organize my code if I had to face a problem like that.
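For example, the seeds file could simply delegate to the helper. This is only a rough sketch; the DataLoader name and the contents of the yearly files are assumptions carried over from the layout above:

# lib/data_loader.rb (hypothetical helper)
class DataLoader
  def self.load_year(year)
    # each lib/years/<year>.rb file would create or update its own records
    load File.join(RAILS_ROOT, 'lib', 'years', "#{year}.rb")
  end
end

# db/seeds.rb
DataLoader.load_year(2009)
DataLoader.load_year(2010)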
I'm not sure I've replied to your question in a way that helps but I hope it does.
If I were you I would go with creating a custom rake task. You will have access to all your models and ActiveRecord connections, and once a year you will end up running:
rake calculate
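A bare-bones version of such a task might look like this; the task body and the Statistic model are assumptions, so put your own calculations inside:

# lib/tasks/calculate.rake
desc "Replace last year's estimates with real data and compute new estimates"
task :calculate => :environment do
  # Statistic is a hypothetical model name
  Statistic.find(:all, :conditions => ["estimated = ?", true]).each do |stat|
    # ... your yearly corrections and estimation calculations here ...
  end
end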
I have a situation where I need to load data from CSV files that change infrequently and update data from the Internet daily. I'll include a somewhat complete example on how to do the former.
First I have a rake file in lib/tasks/update.rake:
require 'update/from_csv_files.rb'

namespace :update do
  task :csvfiles => :environment do
    Dir.glob('db/static_data/*.csv') do |file|
      Update::FromCsvFiles.load(file)
    end
  end
end
The => :environment means we will have access to the database via the usual models.
Then I have code in the lib/update/from_csv_files.rb file to do the actual work:
require 'csv'

module Update
  module FromCsvFiles
    def FromCsvFiles.load(file)
      csv = CSV.open(file, 'r')
      csv.each do |row|
        id = row[0]
        s = Statistic.find_by_id(id)
        if s.nil?
          s = Statistic.new
          s.id = id
        end
        s.survey_area = row[1]
        s.nr_of_space_men = row[2]
        s.save
      end
    end
  end
end
Then I can just run rake update:csvfiles whenever my CSV files change to load the new data. I also have another task that is set up in a similar way to update my daily data.
In your case you should be able to write similar code to load your YML files or run your calculations directly. To handle your smaller corrections you could write a generic method for loading YML files and call it with specific files from the rake task (sketched below); that way you only need to add the YML file and a new task to the rake file. To handle execution order you can make a rake task that calls the other rake tasks in the appropriate order. I'm just throwing around some ideas here - you know your data better than I do.
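Here is a rough sketch of that generic YML loader, mirroring the CSV example above; the file layout, column names and task name are assumptions:

# lib/update/from_yml_files.rb
require 'yaml'

module Update
  module FromYmlFiles
    def FromYmlFiles.load(file)
      # assumes the YAML file contains an array of attribute hashes
      YAML.load_file(file).each do |row|
        s = Statistic.find_by_id(row['id'])
        if s.nil?
          s = Statistic.new
          s.id = row['id']
        end
        s.survey_area = row['survey_area']
        s.nr_of_space_men = row['nr_of_space_men']
        s.save
      end
    end
  end
end

# lib/tasks/update.rake
namespace :update do
  task :corrections_2010 => :environment do
    Update::FromYmlFiles.load('db/static_data/corrections_2010.yml')
  end
end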
Related
I have some .json files with data that are automatically updated from time to time. I also have a Ruby on Rails app where I want to insert the information that's in those files.
Right now I'm parsing the JSON and inserting the data into the database in the seeds.rb file, but I want to be able to add more data without having to restart the app - that is, on the go.
From time to time, I'll check those files and if they have modifications, I want to insert those new items into my database.
What's the best way to do it?
Looks like a job for cron.
Create the code you need in a rake task (in /lib/tasks):
task :import => :environment do
  Importer.read_json_if_modified # your importer class
end
Then run this with the period you want using your system's cron.
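The Importer class above is just a placeholder. One way it might be fleshed out (all names and paths here are assumptions) is to compare file modification times against a timestamp left by the previous run:

# lib/importer.rb (hypothetical)
require 'json'
require 'fileutils'

class Importer
  STAMP = 'tmp/last_import'

  def self.read_json_if_modified
    last_run = File.exist?(STAMP) ? File.mtime(STAMP) : Time.at(0)
    Dir.glob('db/json_data/*.json').each do |file|
      next unless File.mtime(file) > last_run
      # assumes each file contains an array of attribute hashes
      JSON.parse(File.read(file)).each do |attributes|
        Item.create(attributes) # Item is a hypothetical model
      end
    end
    FileUtils.touch(STAMP)
  end
end

The crontab entry then just changes into the app directory and runs rake import RAILS_ENV=production on whatever schedule you need.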
I have an app that I've converted over from another cms. The old URLs were being stored in the database like so:
/this-is-an-old-permalink/
And I need them to be like this:
this-is-an-old-permalink
Note the absence of forward slashes. What is the easiest way to go about removing them?
I'm not necessarily looking for the exact code to do it (although that'd be nice!) -- I'm also asking as a Rails newb: what is the best way to go about doing things like this? I've only really worked with Rails in setting up a model, controller, views and outputting data. I haven't had to do any processing like this. Would it go in the model? Any help is appreciated!
edit
Do I need to get all records, loop through them, do regex on that one field and then save it?
Since you're likely only going to write this once, your best bet is to create a script for it within lib, or to write a migration for it. I recommend the latter, because it will then be executed automatically with rake db:migrate if you restore from your old backup at a later date. You can then use all your standard Model processing tricks (like you would use on a Controller) within the migration without exposing the substitution code to a Controller.
EDIT:
You can add the following to a new file within lib/tasks to create a new rake task for this called db:substitute_slashes:
namespace :db do
  desc "Remove slashes from old-style URLs"
  task :substitute_slashes => :environment do
    Modelname.find(:all).each do |obj|
      obj.fieldname.gsub!(/regex here/, '')
      obj.save!
    end
  end
end
The exclamation on the end of save! means it will throw an exception if the resulting object fails validation, which is a good thing to check for in this case.
You would then be able to execute this with the command rake db:substitute_slashes.
Question 1
Suppose I define some variables in db/seeds.rb, e.g.: user = User.create(...).
What is the scope of these variables?
Question 2
If I have a large amount of code in db/seeds.rb, is it recommended to put it in a class?
The variables are in the scope of the rake instance that has been started,
so they would be in scope for other tasks if multiple tasks were started at once.
For example
rake db:seed custom:sometask
Instance variables defined in db:seed could be accessed in 'sometask'
If the seeds file is too big because it adds too many records, you could move the data to be inserted into a YAML file; that would make your seeds file cleaner than wrapping the code in a class.
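As a sketch of that approach (the file name, the User model and the attribute layout are assumptions):

# db/seeds.rb
require 'yaml'

# assumes the YAML file contains an array of attribute hashes
YAML.load_file(File.join(RAILS_ROOT, 'db', 'seed_data', 'users.yml')).each do |attributes|
  User.create!(attributes)
end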
Seed data is anything that must be loaded for an application to work properly. An application needs its seed data loaded in order to run in development, test, and production.
Seed data is mostly unchanging. It typically won’t be edited in your application. But requirements can and do change, so seed data may need to be reloaded on deployed applications.
Answer for your second question
The number of lines of code in seeds.rb doesn't affect performance; the basic task of seeding is to initialize the database with predefined records. Keep one thing in mind: parent records must be created before their children.
Here are some references that might help you
ASCIICasts
Rail Spikes
I need to populate my production database app with data in particular tables. This is before anyone ever even touches the application. This data would also be required in development mode as it's required for testing against. Fixtures are normally the way to go for testing data, but what's the "best practice" for Ruby on Rails to ship this data to the live database also upon db creation?
Ultimately this is a two-part question, I suppose.
1) What's the best way to load test data into my database for development? This will be roughly 1,000 items. Is it through a migration or through fixtures? The reason this is a different answer than the question below is that in development there are certain fields in the tables that I'd like to make random. In production, these fields would all start with the same value of 0.
2) What's the best way to bootstrap a production db with live data I need in it, is this also through a migration or fixture?
I think the answer is to seed as described here: http://lptf.blogspot.com/2009/09/seed-data-in-rails-234.html but I need a way to seed for development and seed for production. Also, why bother using Fixtures if seeding is available? When does one seed and when does one use fixtures?
Usually fixtures are used to provide your tests with data, not to populate data into your database. You can - and some people have, like the links you point to - use fixtures for this purpose.
Fixtures are OK, but using Ruby gives us some advantages: for example, being able to read from a CSV file and populate records based on that data set. Or reading from a YAML fixture file if you really want to: since you're starting with a programming language, your options are wide open from there.
My current team tried using db/seeds.rb, checking RAILS_ENV to load only certain data in certain environments.
The annoying thing about db:seed is that it's meant to be a one-shot thing: so if you have additional items to add in the middle of development - or once your app has hit production - ... well, you need to take that into consideration (ActiveRecord's find_or_create_by...() methods might be your friend here).
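For example, a re-runnable seed line might look like this (the Role model and the role names are just assumptions for illustration):

# db/seeds.rb -- safe to run repeatedly
%w(admin editor member).each do |name|
  Role.find_or_create_by_name(name)
end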
We tried the Bootstrapper plugin, which puts a nice DSL over the RAILS_ENV checking and lets you run only the environment you want. It's pretty nice.
Our needs actually went beyond that - we found we needed database style migrations for our seed data. Right now we are putting normal Ruby scripts into a folder (db/bootstrapdata/) and running these scripts with Arild Shirazi's required gem to load (and thus run) the scripts in this directory.
Now this only gives you part of the database style migrations. It's not hard to go from this to creating something where these data migrations can only be run once (like database migrations).
Your needs might stop at Bootstrapper: we have pretty unique needs (developing the system when we only know half the spec, a large-ish Rails team, a big data migration from the previous generation of software); your needs might be simpler.
If you did want to use fixtures, the advantage over seeds is that you can also easily export data.
A quick guess at how the rake tasks may look is as follows:
desc 'Export the data objects to Fixtures from data in an existing database. Defaults to development database. Set RAILS_ENV to override.'
task :export => :environment do
  sql = "SELECT * FROM %s"
  skip_tables = ["schema_info"]
  export_tables = [
    "roles",
    "roles_users",
    "roles_utilities",
    "user_filters",
    "users",
    "utilities"
  ]

  time_now = Time.now.strftime("%Y_%h_%d_%H%M")
  folder = "#{RAILS_ROOT}/db/fixtures/#{time_now}/"
  FileUtils.mkdir_p folder
  puts "Exporting data to #{folder}"

  ActiveRecord::Base.establish_connection(:development)
  export_tables.each do |table_name|
    i = "000"
    File.open("#{folder}/#{table_name}.yml", 'w') do |file|
      data = ActiveRecord::Base.connection.select_all(sql % table_name)
      file.write data.inject({}) { |hash, record|
        hash["#{table_name}_#{i.succ!}"] = record
        hash
      }.to_yaml
    end
  end
end
desc "Import the models that have YAML files in
db/fixture/defaults or from a specified path."
task :import do
location = 'db/fixtures/default'
puts ""
puts "enter import path [#{location}]"
location_in = STDIN.gets.chomp
location = location_in unless location_in.blank?
ENV['FIXTURES_PATH'] = location
puts "Importing data from #{ENV['FIXTURES_PATH']}"
Rake::Task["db:fixtures:load"].invoke
end
I have fixtures with initial data that needs to reside in my database (countries, regions, carriers, etc.). I have a task rake db:seed that will seed a database.
namespace :db do
  desc "Load seed fixtures (from db/fixtures) into the current environment's database."
  task :seed => :environment do
    require 'active_record/fixtures'
    Dir.glob(RAILS_ROOT + '/db/fixtures/yamls/*.yml').each do |file|
      Fixtures.create_fixtures('db/fixtures/yamls', File.basename(file, '.*'))
    end
  end
end
I am a bit worried because this task wipes my database clean and loads the initial data. The fact that this is even possible to do more than once on production scares the crap out of me. Is this normal and do I just have to be cautious? Or do people usually protect a task like this in some way?
Seeding data with fixtures is an extremely bad idea.
Fixtures are not validated and since most Rails developers don't use database constraints this means you can easily get invalid or incomplete data inserted into your production database.
Fixtures also set strange primary key ids by default, which is not necessarily a problem but is annoying to work with.
There are a lot of solutions for this. My personal favorite is a rake task that runs a Ruby script that simply uses ActiveRecord to insert records. This is what Rails 3 will do with db:seed, but you can easily write this yourself.
I complement this with a method I add to ActiveRecord::Base called create_or_update. Using this I can run the seed script multiple times, updating old records instead of throwing an exception.
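Here is a sketch of what such a create_or_update helper could look like; this is my own guess at the approach, not necessarily the author's exact implementation:

# config/initializers/create_or_update.rb (hypothetical location)
class ActiveRecord::Base
  def self.create_or_update(attributes = {})
    id = attributes.delete(:id)
    record = find_by_id(id) || new
    record.id = id
    record.attributes = attributes
    record.save!
    record
  end
end

# usage in a seed script, safe to run more than once:
Country.create_or_update(:id => 1, :name => 'Finland')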
I wrote an article about these techniques a while back called Loading seed data.
For the first part of your question: yes, I'd just put some precautions around running a task like this in production. I put a guard like this in my bootstrapping/seeding task:
task :exit_or_continue_in_production? do
  if Rails.env.production?
    puts "!!!WARNING!!! This task will DESTROY " +
         "your production database and RESET all " +
         "application settings"
    puts "Continue? y/n"
    continue = STDIN.gets.chomp
    unless continue == 'y'
      puts "Exiting..."
      exit!
    end
  end
end
I have created this gist for some context.
For the second part of the question -- usually you really want two things: a) very easy seeding of the database and setup of the application for development, and b) bootstrapping the application on a production server (e.g. inserting an admin user, creating folders the application depends on, etc.).
I'd use fixtures for seeding in development -- everyone on the team then sees the same data in the app, and what's in the app is consistent with what's in the tests. (Usually I wrap rake app:bootstrap, rake app:seed, rake gems:install, etc. into rake app:install so everyone can work on the app by just cloning the repo and running this one task.)
I'd however never use fixtures for seeding/bootstrapping on a production server. Rails' db/seeds.rb is really fine for this task, but you can of course put the same logic in your own rake app:seed task, as others have pointed out.
Rails 3 will solve this for you using a seeds.rb file.
http://github.com/brynary/rails/commit/4932f7b38f72104819022abca0c952ba6f9888cb
We've built up a bunch of best practices that we use for seeding data. We rely heavily on seeding, and we have some unique requirements since we need to seed multi-tenant systems. Here's some best practices we've used:
Fixtures aren't the best solution, but you should still store your seed data in something other than Ruby. Ruby code for storing seed data tends to get repetitive, and storing data in a parseable file means you can write generic code to handle your seeds in a consistent fashion.
If you're going to potentially update seeds, use a marker column named something like code to match your seeds file to your actual data. Never rely on ids being consistent between environments.
Think about how you want to handle updating existing seed data. Is there any potential that users have modified this data? If so, should you maintain the user's information rather than overriding it with seed data?
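For example, matching seed rows on a code column instead of ids might look like this (a sketch; the Carrier model, file path and column names are assumptions):

# db/seeds.rb
require 'yaml'

YAML.load_file('db/seed_data/carriers.yml').each do |row|
  carrier = Carrier.find_or_initialize_by_code(row['code'])
  carrier.attributes = row
  carrier.save!
end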
If you're interested in some of the ways we do seeding, we've packaged them into a gem called SeedOMatic.
How about just deleting the task off your production server once you have seeded the database?
I just had an interesting idea...
What if you created db/seeds/ and added migration-style files:
file: 200907301234_add_us_states.rb
class AddUsStates < ActiveRecord::Seeds
  def up
    add_to(:states, [
      {:name => 'Wisconsin', :abbreviation => 'WI', :flower => 'someflower'},
      {:name => 'Louisiana', :abbreviation => 'LA', :flower => 'cypress tree'}
    ])
  end

  def down
    remove_from(:states).based_on(:name).with_values('Wisconsin', 'Louisiana', ...)
  end
end
alternately:
def up
  State.create!(:name => ...)
end
This would allow you to run migrations and seeds in an order that would allow them to coexist more peaceably.
thoughts?