Goal: Using a CRON task (or other scheduled event) to update database with nightly export of data from an existing system.
All data is created/updated/deleted in an existing system. The website does not directly integrate with this system, so the Rails app simply needs to reflect the updates that appear in the data export.
I have a .txt file of ~5,000 products that looks like this:
"1234":"product name":"attr 1":"attr 2":"ABC Manufacturing":"2222"
"A134":"another product":"attr 1":"attr 2":"Foobar World":"2447"
...
All values are strings enclosed in double quotes (") that are separated by colons (:)
Fields are:
id: unique id; alphanumeric
name: product name; any character
attribute columns: strings; any character (e.g., size, weight, color, dimension)
vendor_name: string; any character
vendor_id: unique vendor id; numeric
Vendor information is not normalized in the current system.
What are best practices here? Is it okay to delete the products and vendors tables and rewrite with the new data on every cycle? Or is it better to only add new rows and update existing ones?
Notes:
This data will be used to generate Orders that will persist through nightly database imports. OrderItems will need to be connected to the product ids that are specified in the data file, so we can't rely on an auto-incrementing primary key to be the same for each import; the unique alphanumeric id will need to be used to join products to order_items.
Ideally, I'd like the importer to normalize the Vendor data
I cannot use vanilla SQL statements, so I imagine I'll need to write a rake task in order to use Product.create(...) and Vendor.create(...) style syntax.
This will be implemented on EngineYard
I wouldn't delete the products and vendors tables on every cycle. Is this a rails app? If so there are some really nice ActiveRecord helpers that would come in handy for you.
If you have a Product active record model, you can do:
p = Product.find_or_initialize_by_identifier(<id you get from file>)
p.name = <name from file>
p.size = <size from file>
etc...
p.save!
find_or_initialize_by_identifier looks up the product in the database by the id you specify, and if it can't find one, it initializes a new, unsaved record; the save! call then writes it. The really handy thing about doing it this way is that ActiveRecord only writes to the database if the data has changed, and it automatically updates the timestamp fields you have in the table (updated_at) accordingly. One more thing: since you would be looking up records by the identifier (the id from the file), I would make sure to add an index on that field in the database.
To make a rake task to accomplish this, I would add a rake file to the lib/tasks directory of your rails app. We'll call it data.rake.
Inside data.rake, it would look something like this:
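namespace :data do
  desc "Import data from files into the database"
  task :import => :environment do
    # File.foreach closes the file once every line has been read
    File.foreach(<file to import>) do |line|
      # Split on the colons and strip the double quotes around each value
      attrs = line.chomp.split(":").map { |value| value.delete('"') }
      p = Product.find_or_initialize_by_identifier(attrs[0])
      p.name = attrs[1]
      # etc...
      p.save!
    end
  end
end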
Then, to call the rake task, run "rake data:import" from the command line.
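The quoting in the file matters when parsing: a plain split on ":" leaves the double quotes on each value and breaks if a value itself contains a colon. A minimal sketch of the parsing step, assuming Ruby's standard csv library is acceptable and that the field order matches the sample lines in the question (the parse_product_line helper and the hash keys are illustrative names, not from the original):

```ruby
require "csv"

# Field order taken from the sample lines in the question.
# The key names are illustrative assumptions.
FIELDS = %i[identifier name attr_1 attr_2 vendor_name vendor_id].freeze

# Parse one colon-separated, double-quoted line into a hash.
def parse_product_line(line)
  values = CSV.parse_line(line.chomp, col_sep: ":", quote_char: '"')
  FIELDS.zip(values).to_h
end

row = parse_product_line('"1234":"product name":"attr 1":"attr 2":"ABC Manufacturing":"2222"')
puts row[:identifier]   # 1234
puts row[:vendor_name]  # ABC Manufacturing

# A colon inside a quoted value is kept as part of the value.
tricky = parse_product_line('"A134":"ratio 2:1 mix":"attr 1":"attr 2":"Foobar World":"2447"')
puts tricky[:name]      # ratio 2:1 mix
```

The resulting hash can then feed whatever lookup-and-save step the import uses.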
Since Products don't really change that often, the best way I would see is to update only the records that change.
Get all the deltas
Mass update using a single SQL statement
If your normalization code lives in the models, you could use Product.create and Vendor.create; otherwise it would be overkill. Also, look into inserting multiple records in a single SQL transaction; it's much faster.
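The "get all the deltas" step can be sketched in plain Ruby before any database work happens. This is only an illustration under assumptions: both sides are represented as hashes keyed by the file's unique id, and compute_deltas is a hypothetical helper name.

```ruby
# existing: rows currently stored, keyed by the file's unique id
# incoming: rows parsed from tonight's export, keyed the same way
def compute_deltas(existing, incoming)
  {
    to_create: incoming.reject { |id, _attrs| existing.key?(id) },
    to_update: incoming.select { |id, attrs| existing.key?(id) && existing[id] != attrs },
    to_delete: existing.keys - incoming.keys
  }
end

existing = {
  "1234" => { name: "product name", vendor_id: "2222" },
  "A134" => { name: "another product", vendor_id: "2447" },
  "Z999" => { name: "discontinued", vendor_id: "2222" }
}
incoming = {
  "1234" => { name: "product name", vendor_id: "2222" },       # unchanged
  "A134" => { name: "another product v2", vendor_id: "2447" }, # changed
  "B200" => { name: "new product", vendor_id: "2447" }         # new
}

deltas = compute_deltas(existing, incoming)
puts deltas[:to_create].keys.inspect # ["B200"]
puts deltas[:to_update].keys.inspect # ["A134"]
puts deltas[:to_delete].inspect      # ["Z999"]
```

Since orders refer to products, ids that land in to_delete may be better flagged as inactive than removed; that choice depends on the application.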
Create an importer rake task that is cronned
Parse the file line by line using Faster CSV or via vanilla ruby like:
file.each do |line|
products_array = line.split(":")
end
Split each line on the ":" and push it into a hash
Use a find_or_initialize to populate your db such as:
Product.find_or_initialize_by_name_and_vendor_id("foo", 111)
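Normalizing the vendor data amounts to collecting one vendor per vendor_id and leaving only that id on each product. A rough plain-Ruby sketch, assuming the rows have already been parsed into hashes (the key names and the third sample row are made up for illustration):

```ruby
rows = [
  { identifier: "1234", name: "product name", vendor_name: "ABC Manufacturing", vendor_id: "2222" },
  { identifier: "A134", name: "another product", vendor_name: "Foobar World", vendor_id: "2447" },
  { identifier: "B200", name: "third product", vendor_name: "ABC Manufacturing", vendor_id: "2222" }
]

# One vendor entry per vendor_id; the last name seen for an id wins.
vendors = rows.each_with_object({}) do |row, acc|
  acc[row[:vendor_id]] = { vendor_id: row[:vendor_id], name: row[:vendor_name] }
end

# Products keep only the reference to the vendor, not its name.
products = rows.map { |row| row.slice(:identifier, :name, :vendor_id) }

puts vendors.size                  # 2
puts vendors["2222"][:name]        # ABC Manufacturing
puts products.first.key?(:vendor_name) # false
```

Each vendor hash would then go through the same find-or-initialize step as the products, keyed on vendor_id.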
I am currently trying to import over 40 CSV's exported from sqlite3 into oracle db but I seem to have issues whilst importing some of the CSV's, into the corresponding tables.
The code line with:
class_name.create!(row.to_hash)
produces errors on some classes because the callbacks are also triggered when the
.create!() method is called
def import_csv_into_db
Dir.foreach(Rails.root.join('db', 'csv_export')) do |filename|
next if filename == '.' or filename == '..' or filename == 'extract_db_into_csv.sh' or filename =='import_csv.rb'
filename_renamed = File.basename(filename, File.extname(filename)).chomp('s').titleize.gsub(/\s+/, "")
CSV.foreach(Rails.root.join('db', 'csv_export',filename), headers: true) do |row|
class_name = Object.const_get(filename_renamed)
puts class_name
class_name.create!(row.to_hash)
puts "Insert on table #{class_name}s complete with: #{row.to_hash}"
end
end
end
The issue at hand is that my CSV import function is in seeds.rb, so whenever I run bundle exec rake db:seed the CSVs are imported.
How exactly can I avoid the callbacks being triggered when class_name.create!(row.to_hash) is triggered within the function in the seeds.rb ?
In my customer.rb I have callbacks such as:
after_create :add_default_user or after_create :add_build_config
I'd like to manipulate my function within the seeds.rb to skip the callbacks when the function tries importing a CSV file like customers.csv (which would logically call Customer.create!(row.to_hash)).
There are lower level methods which will not run callbacks. For example, instead of create! you can call insert!. Instead of destroy you can call delete.
Side note: use insert_all! to bulk insert multiple rows at once. Your import will be much faster, and it skips validations as well as callbacks. Though I would recommend the more flexible activerecord-import gem instead.
However, skipping callbacks might cause problems if they are necessary for the integrity of the data. If you delete instead of destroy associated data may not be deleted, or you may get errors because of referential integrity. Be sure to add on delete actions on your foreign keys to avoid this. Then the database itself will take care of it.
Consider whether your db:seeds is doing too much. If importing this CSV is a hindrance to seeding the database, consider if it should be a separate rake task instead.
Consider whether your callbacks can be rewritten to be idempotent, that is to be able to run multiple times. For example, after_create :add_default_user should recognize there already is a default user and not try to re-add it.
Finally, consider whether callbacks which are run every time a model is created are the correct place to do this work.
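To illustrate what an idempotent callback means, here is a plain-Ruby stand-in rather than a real ActiveRecord model. Customer and add_default_user mirror the names in the question; the users array and the :default flag are assumptions made for the sketch.

```ruby
# Plain-Ruby stand-in for a model; not ActiveRecord.
class Customer
  attr_reader :users

  def initialize
    @users = []
  end

  # Safe to call any number of times: adds the default user only if missing.
  def add_default_user
    return if users.any? { |user| user[:default] }

    users << { name: "default", default: true }
  end
end

customer = Customer.new
3.times { customer.add_default_user }
puts customer.users.size # 1
```

In a real model the guard would be a query against the association, but the shape is the same: check first, then add.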
I'm not sure if this is just a lacking of the Rails language, or if I am searching all the wrong things here on Stack Overflow, but I cannot find out how to add an attribute to each record in an array.
Here is an example of what I'm trying to do:
@news_stories.each do |individual_news_story|
  @user_for_record = User.where(:id => individual_news_story[:user_id]).pluck('name', 'profile_image_url');
  individual_news_story.attributes(:author_name) = @user_for_record[0][0]
  individual_news_story.attributes(:author_avatar) = @user_for_record[0][1]
end
Any ideas?
If the NewsStory model (or whatever its name is) has a belongs_to relationship to User, then you don't have to do any of this. You can access the attributes of the associated User directly:
@news_stories.each do |news_story|
  news_story.user.name # gives you the name of the associated user
  news_story.user.profile_image_url # same for the avatar
end
To avoid an N+1 query, you can preload the associated user record for every news story at once by using includes in the NewsStory query:
NewsStory.includes(:user)... # rest of the query
If you do this, you won't need the @user_for_record query. Rails will do the heavy lifting for you, and you could even see a performance improvement, thanks to not issuing a separate pluck query for every single news story in the collection.
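Conceptually, preloading swaps one lookup per story for a single lookup covering all the users involved, followed by an in-memory match. A plain-Ruby sketch of that idea, with Structs standing in for the models (this is not how ActiveRecord implements includes, only an illustration; the sample data is made up):

```ruby
User = Struct.new(:id, :name, :profile_image_url)
NewsStory = Struct.new(:id, :user_id, :title)

USERS = [
  User.new(1, "Ann", "/img/ann.png"),
  User.new(2, "Bob", "/img/bob.png"),
  User.new(3, "Cy", "/img/cy.png")
].freeze

stories = [
  NewsStory.new(10, 1, "First"),
  NewsStory.new(11, 2, "Second"),
  NewsStory.new(12, 1, "Third")
]

# One pass over the users builds an id => user index (the "single query").
wanted_ids = stories.map(&:user_id).uniq
users_by_id = USERS.select { |user| wanted_ids.include?(user.id) }
                   .to_h { |user| [user.id, user] }

# Each story then finds its author without another pass over USERS.
authors = stories.map { |story| [story.title, users_by_id.fetch(story.user_id).name] }
authors.each { |title, name| puts "#{title}: #{name}" }
```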
If you need to have those extra attributes there regardless:
You can select them as extra attributes in your NewsStory query:
NewsStory.
includes(:user).
joins(:user).
select([
NewsStory.arel_table[Arel.star],
User.arel_table[:name].as("author_name"),
User.arel_table[:profile_image_url].as("author_avatar"),
]).
where(...) # rest of the query
It looks like you're trying to cache the name and avatar of the user on the NewsStory model, in which case, what you want is this:
@news_stories.each do |individual_news_story|
  user_for_record = User.find(individual_news_story.user_id)
  individual_news_story.author_name = user_for_record.name
  individual_news_story.author_avatar = user_for_record.profile_image_url
end
A couple of notes.
I've used find instead of where. find returns a single record identified by its primary key (id); where returns a collection of records. There are definitely more efficient ways to do this (eager loading, for one), but since you're just starting out, I think it's more important to learn the basics before you dig into the advanced stuff to make things more performant.
I've gotten rid of the pluck call because, again, you're just learning, and pluck is a performance optimization that is useful when you're working with large amounts of data. If that's what you're doing, ActiveRecord has a batch API you should look into.
I've changed @user_for_record to user_for_record. The @ denotes an instance variable in Ruby. Instance variables are shared and accessible from any instance method in an instance of a class. In this case, all you need is a local variable.
In our Rails app, the user (or we on his behalf) load some data or even insert it manually using a crud.
After this step the user must validate all the configuration (the data) and "accept and agree" that it's all correct.
On a given day, the application will execute some tasks according to the configuration.
Today, we already have a "freeze" flag, where we can prevent changes in the data, so the user cannot mess the things up...
But we also would like to do something like hash the data and say something like "your config is frozen and the hash is 34FE00...".
This would give the user certainty that the system is running with the configuration he approved.
How can we do that? There are 7 or 8 tables. The total of records created would be around 2k or 3k.
How to hash the data to detect changes after the approval? How would you do that?
I'm thinking about doing a find_by_user on each table, looping over all records, and using some fields (or all) to build a string that gets hashed at the end of the loop.
After looping over all the tables, I would have 8 hash strings, which I would concatenate and hash into a final hash.
How does that look? Any ideas?
Here's a possible implementation. Just define object as an Array of all the stuff you'd like to hash :
require 'digest/md5'
def validation_hash(object, len = 16)
Digest::MD5.hexdigest(object.to_json)[0,len]
end
puts validation_hash([Actor.first,Movie.first(5)])
# => 94eba93c0a8e92f8
# After changing a single character in the first Actor's biography:
# => 35f342d915d6be4e
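The per-table-then-combined idea from the question can be sketched along the same lines. This is one possible shape, not a definitive implementation: it assumes each table's records are available as hashes with string keys and an "id" field, it sorts records and keys so the digest does not depend on load order, and it swaps MD5 for SHA-256, which is the safer pick if the hash is meant as evidence against tampering. table_digest and config_digest are hypothetical helper names.

```ruby
require "digest"
require "json"

# Hash one table: sort records by id and sort each record's keys so the
# same data always serializes to the same string.
def table_digest(records)
  canonical = records.sort_by { |record| record["id"] }.map { |record| record.sort.to_h }
  Digest::SHA256.hexdigest(JSON.generate(canonical))
end

# Combine the per-table digests in a fixed (alphabetical) table order.
def config_digest(tables)
  combined = tables.keys.sort.map { |name| "#{name}:#{table_digest(tables[name])}" }.join("|")
  Digest::SHA256.hexdigest(combined)
end

approved = {
  "actors" => [{ "id" => 2, "name" => "B" }, { "id" => 1, "name" => "A" }],
  "movies" => [{ "id" => 1, "title" => "M" }]
}
# Same data, loaded in a different order.
reloaded = {
  "movies" => [{ "title" => "M", "id" => 1 }],
  "actors" => [{ "id" => 1, "name" => "A" }, { "id" => 2, "name" => "B" }]
}
# One character changed after approval.
tampered = {
  "actors" => [{ "id" => 1, "name" => "A" }, { "id" => 2, "name" => "b" }],
  "movies" => [{ "id" => 1, "title" => "M" }]
}

puts config_digest(approved) == config_digest(reloaded) # true
puts config_digest(approved) == config_digest(tampered) # false
```

Storing the digest at approval time and recomputing it before the scheduled tasks run gives a simple check that nothing changed in between.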
My seeds file populated the countries table with a list of countries. But now it needs to be changed to hard-code the id (instead of rails generating the id column for me).
I added the id column and values as per below:
zmb: {id: 103,code: 'ZMB', name: Country.human_attribute_name(:zambia, default: 'Error!'), display_order: nil, create_user: user, update_user: user, eff_date: Time.now, exp_date: default_exp_date},
skn: {id: 104,code: 'SKN', name: Country.human_attribute_name(:st_kitts_and_nevis, default: 'Error!'), display_order: nil, create_user: user, update_user: user, eff_date: Time.now, exp_date: default_exp_date}
countries.each { |key, value| countries_for_later[key] = Country.find_or_initialize_by(id: value[:id]); countries_for_later[key].assign_attributes(value); countries_for_later[key].save!; }
The above is just a snippet. I have added an id: for every country.
But when I run db:seed I get the following error:
ActiveRecord::RecordInvalid: Validation failed: Code has already been taken
I am new to rails so I'm not sure what is causing this - is it because the ID column already exists in the database?
What I think is happening is you have existing data in your database ... let's say
[{id:1 , code: 'ABC'},
{id:2 , code: 'DEF'}]
Now you run your seed file, which has {id: 3, code: 'DEF'} for example.
Because you are using find_or_initialize_by with id, no existing record with id 3 is found, so a new Country with the already-used code 'DEF' is initialized and the uniqueness validation fails on save.
I reckon you should just clear your data, but you can try doing find_or_initialize_by using code instead of id. That way you won't ever have the problem of trying to create a duplicate country code.
Country.find_or_initialize_by(code: value[:code])
I think you might run into problems with your ids, but you will have to test that. It's generally bad practice to do what you are doing. Whether the ids change or not should be irrelevant. Your seed file should reference the objects that are being created, not ids.
Also make sure you aren't using any default_scopes ... this would affect how find_or_initialize_by works.
The error is about Code: Code has already been taken. You have a validation which says Code should be unique. You can delete all Countries and load the seeds again.
Run this in the rails console:
Country.delete_all
Then re-run the seed:
rake db:seed
Yes, it is due to duplicate entry. In that case run ModelName.delete_all in your rails console and then run rake db:seed again being in the current project directory. Hope this works.
ActiveRecord::RecordInvalid: Validation failed: Code has already been taken
is the default error message for the uniqueness validator for :code.
Running rake db:reset will definitely clear and reseed your database. Not sure about the hardcoded ids though.
Check this : Overriding id on create in ActiveRecord
You will have to skip validations with
save(validate: false)
or, on Rails 3.x, bypass mass-assignment protection with
Country.create(attributes_for_country, without_protection: true)
I haven't tested this though, be careful with your validators.
Add the line for
countries_for_later[key].id = value[:id]
the problem is that you can't set :id => value[:id] to Country.new because id is a special attribute, and is automatically protected from mass-assignment
so it will be:
countries.each { |key, value|
  countries_for_later[key] = Country.find_or_initialize_by(id: value[:id])
  countries_for_later[key].assign_attributes(value)
  countries_for_later[key].id = value[:id] if countries_for_later[key].new_record?
  # save(validate: false) skips validations; the older save(false) form no longer works
  countries_for_later[key].save(validate: false)
}
The ids data that you are using in your seeds file: does that have any meaning outside of Rails? Eg
zmb: {id: 103,code: 'ZMB',
is this some external data for Zambia, where 103 is its ID in some internationally recognised table of country codes? (In my countries database, Zambia's "numcode" value is 894.) If it is, then you should rename it to something else, and let Rails decide what the id field should be.
Generally, mucking about with the value of ID in rails is going to be a pain in the ass for you. I'd recommend not doing it. If you need to do tests on data, then use some other unique field (like 'code') to test whether associations etc have been set up, or whatever you want to do, and let Rails worry about what value to use for ID.
I'm trying to figure out how to run a migration that prepends a string to the beginning of a column. In this specific case I have a column called url that currently stores everything after the domain (e.g. /test.html). However I now want to prepend a single string http://google.com to the beginning of the url. In the case of this example, the resulting string value of url for this entry would be http://google.com/test.html.
How can I accomplish this with a migration?
I'm not sure this really qualifies as something you should put into a migration; generally, migrations change the structure of your database, rather than change the format of the data inside of it.
The easiest and quickest way to do this would not be to futz around in your database at all, and instead just make the url method of that model return something like "http://google.com#{read_attribute(:url)}". If you really want to change the data in your database, I'd make a rake task to do it, something like:
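namespace :data do
  desc "Prepend the domain to every stored url"
  # Depend on :environment so the Rails models are loaded
  task :add_domain => :environment do
    # find_each loads the records in batches
    Model.find_each do |model|
      model.url = "http://google.com#{model.url}" if model.url !~ /google\.com/
      model.save if model.changed?
    end
  end
end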
If this must be a migration for you, then your migration's up would look very similar to the internals of that rake task. (Or it would call that rake task directly.)
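Whichever route is taken, the rewrite should be safe to run twice. A small plain-Ruby sketch of just the string logic, where with_domain is a hypothetical helper and the domain is the one from the question:

```ruby
DOMAIN = "http://google.com"

# Prepend the domain unless the url already starts with it.
def with_domain(url, domain = DOMAIN)
  return url if url.start_with?(domain)

  "#{domain}#{url}"
end

once = with_domain("/test.html")
twice = with_domain(once)
puts once  # http://google.com/test.html
puts twice # http://google.com/test.html
```

Checking the start of the string is a little stricter than matching google.com anywhere, so a path that merely contains the domain name still gets the prefix.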
You could use a migration or a rake task to do this.
If you want to run it as a migration:
def up
  # '||' concatenates strings in Postgres; use the equivalent provided by your database
  execute("update TABLE set url = 'http://google.com' || url")
end
def down
  # This is a little tricky. I would advise leaving this empty.
end