How to import large amounts of data into Rails? - ruby-on-rails

To load small amounts of data, I've been using rake tasks to import data from CSVs into Rails:
desc "Import users."
task :import_users => :environment do
File.open("users.txt", "r").each do |line|
name, age, profession = line.strip.split("\t")
u = User.new(:name => name, :age => age, :profession => profession)
u.save
end
end
For larger files (about 50,000 records), though, this is incredibly slow. Is there a faster way to import the data?

You might want to take a look at activerecord-import and check out this similar thread.

I agree that a bulk import with ar-extensions should be quicker (though it skips model validations). Without extra libraries, you can add a little concurrency and take advantage of a multicore machine:
# Returns the number of processors on Linux, OS X or Windows.
def number_of_processors
if RUBY_PLATFORM =~ /linux/
return `cat /proc/cpuinfo | grep processor | wc -l`.to_i
elsif RUBY_PLATFORM =~ /darwin/
return `sysctl -n hw.logicalcpu`.to_i
elsif RUBY_PLATFORM =~ /win32/
# this works for windows 2000 or greater
require 'win32ole'
wmi = WIN32OLE.connect("winmgmts://")
wmi.ExecQuery("select * from Win32_ComputerSystem").each do |system|
begin
processors = system.NumberOfLogicalProcessors
rescue
processors = 0
end
return [system.NumberOfProcessors, processors].max
end
end
raise "can't determine 'number_of_processors' for '#{RUBY_PLATFORM}'"
end
desc "Import users."
task :fork_import_users => :environment do
procs = number_of_processors
lines = IO.readlines('users.txt')
nb_lines = lines.size
slices = nb_lines / procs
procs.times do
subset = lines.slice!(0..slices)
fork do
subset.each do |line|
name, age, profession = line.strip.split("\t")
u = User.new(:name => name, :age => age, :profession => profession)
u.save
end
end
end
Process.waitall
end
On my machine with 2 cores, the fork version gives:
real 1m41.974s
user 1m32.629s
sys 0m7.318s
while with your version:
real 2m56.401s
user 1m21.953s
sys 0m7.529s
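The slicing and forking idea above can be sketched more compactly. This is only an illustration under stated assumptions: it uses `Etc.nprocessors` from Ruby's standard library in place of the hand-written `number_of_processors`, the sample lines are made up, and the child processes only validate their lines where the real task would call `User.new(...).save`. It relies on `fork`, so it will not run on Windows.

```ruby
require 'etc'

# Made-up sample input standing in for IO.readlines('users.txt').
lines = (1..10).map { |i| "user#{i}\t#{20 + i}\tengineer\n" }

# One worker per core, but never more workers than lines.
workers = [Etc.nprocessors, lines.size].min
slice_size = (lines.size / workers.to_f).ceil
slices = lines.each_slice(slice_size).to_a

slices.each do |subset|
  fork do
    # In the rake task, this is where each line would become a User record.
    ok = subset.all? { |line| line.strip.split("\t").size == 3 }
    exit!(ok ? 0 : 1)
  end
end

# Wait for every child and collect whether each one succeeded.
statuses = Process.waitall.map { |_pid, status| status.success? }
```

Using `each_slice` with a rounded-up slice size hands every line to exactly one worker, without mutating the original array.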

You should try FasterCSV. It is quite fast and dead easy to use for me.
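For reference, FasterCSV became Ruby's standard `CSV` library in Ruby 1.9, so on a modern Ruby the same approach needs no extra gem. A minimal sketch, assuming tab-delimited input like the question's `users.txt` (the sample data here is invented):

```ruby
require 'csv'

# Invented sample data in the same name/age/profession layout as users.txt.
data = "Alice\t30\tEngineer\nBob\t45\tTeacher\n"

# col_sep tells CSV to split on tabs instead of commas.
rows = CSV.parse(data, col_sep: "\t").map do |name, age, profession|
  { name: name, age: age.to_i, profession: profession }
end
```

Each hash in `rows` could then be passed to `User.new` inside the rake task.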

Related

Migrating uploaded files from Active Storage to Carrierwave

For a variety of reasons I am migrating my uploads from ActiveStorage (AS) to CarrierWave (CW).
I am writing a rake task and have the logic sorted out, but I am stumped on how to feed the AS blob into the CW file.
I am trying something like this:
@files.each.with_index(1) do |a, index|
if a.attachment.attached?
a.attachment.download do |file|
a.file = file
end
a.save!
end
end
This is based on these two links:
https://edgeguides.rubyonrails.org/active_storage_overview.html#downloading-files
message.video.open do |file|
system '/path/to/virus/scanner', file.path
# ...
end
and
https://github.com/carrierwaveuploader/carrierwave#activerecord
# like this
File.open('somewhere') do |f|
u.avatar = f
end
I tested this locally and the files are not mounted via the uploader. My question(s) would be:
am I missing something obvious here?
is my approach wrong and needs a new one?
Bonus Karma Question:
I can't seem to see a clear path to set the CW filename when I do this?
Here is my final rake task (based on the accepted answer), open to tweaks. It does the job for me:
namespace :carrierwave do
desc "Import the old AS files into CW"
task import: :environment do
@files = Attachment.all
puts "#{@files.count} files to be processed"
puts "+" * 50
@files.each.with_index(1) do |a, index|
if a.attachment.attached?
puts "Attachment #{index}: Key: #{a.attachment.blob.key} ID: #{a.id} Filename: #{a.attachment.blob.filename}"
class FileIO < StringIO
def initialize(stream, filename)
super(stream)
@original_filename = filename
end
attr_reader :original_filename
end
a.attachment.download do |file|
a.file = FileIO.new(file, a.attachment.blob.filename.to_s)
end
a.save!
puts "-" * 50
end
end
end
desc "Purge the old AS files"
task purge: :environment do
@files = Attachment.all
puts "#{@files.count} files to be processed"
puts "+" * 50
@files.each.with_index(1) do |a, index|
if a.attachment.attached?
puts "Attachment #{index}: Key: #{a.attachment.blob.key} ID: #{a.id} Filename: #{a.attachment.blob.filename}"
a.attachment.purge
puts "-" * 50
@count = index
end
end
puts "#{@count} files purged"
end
end
In my case I am doing this in steps: I have branched my master with this rake task and the associated MVC updates. If my site were in true production, I would probably run the import rake task first, confirm all went well, and only THEN purge the old AS files.
The file object you get from the attachment.download block is a string. More precisely, the response from .download is the file, "streamed and yielded in chunks" (see documentation). I validated this by calling file.class to make sure the class is what I expected.
So, to solve your issue, you need to provide an object on which .read can be called. Commonly that is done using the Ruby StringIO class.
However, considering Carrierwave also expects a filename, you can solve it using a helper model that inherits StringIO (from blogpost linked above):
class FileIO < StringIO
def initialize(stream, filename)
super(stream)
@original_filename = filename
end
attr_reader :original_filename
end
And then you can replace a.file = file with a.file = FileIO.new(file, 'new_filename')
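To see what the wrapper provides, here is a small standalone sketch. The string contents and the filename `report.pdf` are placeholders, not values from the question:

```ruby
require 'stringio'

# StringIO subclass that also carries a filename, as described above.
class FileIO < StringIO
  attr_reader :original_filename

  def initialize(stream, filename)
    super(stream)
    @original_filename = filename
  end
end

# Placeholder bytes standing in for the downloaded blob.
io = FileIO.new("fake file bytes", "report.pdf")
name = io.original_filename
contents = io.read
```

The object responds to both `read` and `original_filename`, which is what the uploader needs here. Note that when `download` is called with a block the file is yielded in chunks, while calling it without a block returns the whole file as one string (at the cost of holding it in memory).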

Ruby task uses 97 %CPU

My Ruby application uses about 97% CPU and eventually gets killed. The program reads files from a folder; if a file name already exists in the database, it skips that file and checks the next one. While running this procedure, the application usually gets killed.
COMMAND %CPU
ruby 96.5
Even if I insert almost all the files and launch the application again (because it was killed), the system tends to kill it even sooner. How can I decrease the CPU usage?
task :process_data, [:data_directory] => :environment do |_task, args|
# add data to a database
saver = CsvToSqlSaver.new
saver.fill_files_names
Dir.foreach(args.data_directory) do |filename|
# if not present in records already we read it
Base.logger.info "> Found #{filename}."
next if saver.files_names.to_s.include?(filename) ||
!filename.include?('csv')
Base.logger.info "> Reading #{filename}."
begin
saver.generate_db_rows_from_csv_file(args.data_directory, filename)
# handle Malformed .csv exception
rescue CSV::ArgumentError, CSV::MalformedCSVError => e
Base.logger.info e.message
next
end # we continue csv file loop?
unless integrator.insert_data_to_database
Base.logger.info '> No new data saved.'
end
end
end
This is fill_files_names:
def fill_files_names
@files_names = []
files_names = MyFilesTable.select(:filename).distinct
files_names.each do |row|
@files_names.push(row[:filename])
end
end
This is Base:
class Base
class << self
attr_accessor :logger
end
@logger ||= Logger.new(STDERR)
end
This is generate_db_rows_from_csv_file
def generate_db_rows_from_csv_file(directory, filename)
@incoming_data = []
CSV.foreach("#{directory}/#{filename}",
headers: true, quote_char: "\x00") do |csv_record|
# if invalid record, go further
next if record_invalid?(csv_record)
generate_row_in_the_database(csv_record, filename)
end
end

Relationships created by a rake task are not persisted though the rails server

I'm working on my first project using Neo4j. I'm parsing Wikipedia's page and pagelinks dumps to create a graph where the nodes are pages and the edges are links.
I've defined some rake tasks that download the dumps, parse the data, and save it in a Neo4j database. At the end of the rake task I print the number of pages and links created, and some of the pages with the most links. Here is the output of the rake task for zawiki.
$ rake wiki[zawiki]
[ omitted ]
...
:: Done parsing zawiki
:: 1984 pages
:: 2144 links
:: The pages with the most links are:
9625.0 - Emijrp/List_of_Wikipedians_by_number_of_edits_(bots_included): 40
1363.0 - Gvangjsih_Bouxcuengh_Swcigih: 30
9112.0 - Fuzsuih: 27
1367.0 - Cungzcoj: 26
9279.0 - Vangz_Yenfanh: 19
It looks like pages and links are being created, but when I start a Rails console or the server, the links aren't found.
$ rails c
jruby-1.7.5 :013 > Pages.all.count
=> 1984
jruby-1.7.5 :003 > Pages.all.reduce(0) { |count, page| count + page.links.count}
=> 0
jruby-1.7.5 :012 > Pages.all.sort_by { |p| p.links.count }.reverse[0...5].map { |p| p.links.count }
=> [0, 0, 0, 0, 0]
Here is the rake task, and this is the project's GitHub page. Can anyone tell me why the links aren't saved?
DUMP_DIR = Rails.root.join('lib','assets')
desc "Download wiki dumps and parse them"
task :wiki, [:wiki] => 'wiki:all'
namespace :wiki do
task :all, [:wiki] => [:get, :parse] do |t, args|
# Print info about the newly created pages and links.
link_count = 0
Pages.all.each do |page|
link_count += page.links.count
end
indent "Done parsing #{args[:wiki]}"
indent "#{Pages.count} pages"
indent "#{link_count} links"
indent "The pages with the most links are:"
Pages.all.sort_by { |a| a.links.count }.reverse[0...5].each do |page|
puts "#{page.page_id} - #{page.title}: #{page.links.count}"
end
end
desc "Download wiki page and page links database dumps to /lib/assets"
task :get, :wiki do |t, args|
indent "Downloading dumps"
sh "#{Rails.root.join('lib', "get_wiki").to_s} #{args[:wiki]}"
indent "Done"
end
desc "Parse all dumps"
task :parse, [:wiki] => 'parse:all'
namespace :parse do
task :all, [:wiki] => [:pages, :pagelinks]
desc "Read wiki page dumps from lib/assets into the database"
task :pages, [:wiki] => :environment do |t, args|
parse_dumps('page', args[:wiki]) do |obj|
page = Pages.create_from_dump(obj)
end
indent = "Created #{Pages.count} pages"
end
desc "Read wiki pagelink dumps from lib/assets into the database"
task :pagelinks, [:wiki] => :environment do |t, args|
errors = 0
parse_dumps('pagelinks', args[:wiki]) do |from_id, namespace, to_title|
from = Pages.find(:page_id => from_id)
to = Pages.find(:title => to_title)
if to.nil? || from.nil?
errors = errors.succ
else
from.links << to
from.save
end
end
end
end
end
def indent *args
print ":: "
puts args
end
def parse_dumps(dump, wiki_match, &block)
wiki_match ||= /\w+/
DUMP_DIR.entries.each do |file|
file, wiki = *(file.to_s.match(Regexp.new "(#{wiki_match})-#{dump}.sql"))
if file
indent "Parsing #{wiki} #{dump.pluralize} from #{file}"
each_value(DUMP_DIR.join(file), &block)
end
end
end
def each_value(filename)
f = File.open(filename)
num_read = 0
begin # read file until line starting with INSERT INTO
line = f.gets
end until line.match /^INSERT INTO/
begin
line = line.match(/\(.*\)[,;]/)[0] # ignore begining of line until (...) object
begin
yield line[1..-3].split(',').map { |e| e.match(/^['"].*['"]$/) ? e[1..-2] : e.to_f }
num_read = num_read.succ
line = f.gets.chomp
end while(line[0] == '(') # until next insert block, or end of file
end while line.match /^INSERT INTO/ # Until line doesn't start with (...
f.close
end
app/models/pages.rb
class Pages < Neo4j::Rails::Model
include Neo4j::NodeMixin
has_n(:links).to(Pages)
property :page_id
property :namespace, :type => Fixnum
property :title, :type => String
property :restrictions, :type => String
property :counter, :type => Fixnum
property :is_redirect, :type => Fixnum
property :is_new, :type => Fixnum
property :random, :type => Float
property :touched, :type => String
property :latest, :type => Fixnum
property :length, :type => Fixnum
property :no_title_convert, :type => Fixnum
def self.create_from_dump(obj)
# TODO: I wonder if there is a way to compine these calls
page = {}
# order of this array is important, it corresponds to the data in obj
attrs = [:page_id, :namespace, :title, :restrictions, :counter, :is_redirect,
:is_new, :random, :touched, :latest, :length, :no_title_convert]
attrs.each_index { |i| page[attrs[i]] = obj[i] }
page = Pages.create(page)
return page
end
end
I must admit that I have no idea of how Neo4j works.
Going by experience with other databases, though, I assume that either some validation is failing or something is misconfigured in your use of the database. For the latter I can't advise where to look, but if it's about validation, you can look at Page#errors or try calling Page#save! and see what it raises.
One crazy idea that just came to mind looking at this example is that maybe for that relation to be configured properly, you need a back reference, too.
Maybe has_n(:links).to(Page, :links) will help you. Or, if that doesn't work:
has_n(:links_left).to(Page, :links_right)
has_n(:links_right).from(Page, :links_left)
The more I look at this, the more I think the back reference to the same table is not configured properly and thus won't validate.

How to initialize a method in a rake file?

Apologies for the probably noobie question:
I have a rake task that is designed to take data from a site and save it as a RateData object.
rs.each do |market,url|
doc = Nokogiri::HTML(open(url))
doc.xpath("//table/tr").each do |item|
provider = "rs"
market = market
rate = item.xpath('td[1]').text.gsub!(/[^0-9\.]/, '')
volume = item.xpath('td[2]').text.gsub(/[^k0-9\.]/, '')
volume = volume.gsub(/\.(?=.k)/, '')
volume = volume.gsub(/k/, '00')
volume = volume.to_f
rate = rate.to_f
RateData.create(:provider => provider, :market => market, :rate => rate, :volume => volume, :bid_ask => 1)
end
end
The RateData.create method is in the rate_data_controller and is accessible when I call it in the rails console. How can I make it available in this rake task?
Many thanks!
You need to make the task depend on :environment, which loads the Rails environment (and with it your models):
task :your_task, [] => :environment do
or with args
task :your_task, [:foo] => :environment do |task, args|
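To show how the dependency plays out, here is a sketch that runs outside Rails. The `:environment` task below is a stand-in (in a real app Rails defines it and it boots the application), and `run_order` and the argument value `'bar'` exist only for illustration:

```ruby
require 'rake'
extend Rake::DSL # makes `task` available when run as a plain script

run_order = [] # records which tasks ran, in order

# Stand-in for the :environment task that Rails normally provides.
task :environment do
  run_order << :environment
end

# The prerequisite runs first, so models would be loaded before this body.
task :your_task, [:foo] => :environment do |_task, args|
  run_order << [:your_task, args[:foo]]
end

Rake::Task[:your_task].invoke('bar')
```

From the command line, the equivalent call would be `rake your_task[bar]`.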

how to import data into rails?

I have a Rails 3 application with a User class, and a tab-delimited file of users that I want to import.
How do I get access to the Active Record model outside the rails console, so that I can write a script to do
require "???active-record???"
File.open("users.txt", "r").each do |line|
name, age, profession = line.strip.split("\t")
u = User.new(:name => name, :age => age, :profession => profession)
u.save
end
Do I use the "ar-extensions" gem, or is there another way? (I don't particularly care about speed right now, I just want something simple.)
You can write a rake task to do so.
Add this to a my_rakes.rake file in your_app/lib/tasks folder:
desc "Import users."
task :import_users => :environment do
File.open("users.txt", "r").each do |line|
name, age, profession = line.strip.split("\t")
u = User.new(:name => name, :age => age, :profession => profession)
u.save
end
end
And then call $ rake import_users from the root folder of your app in Terminal.
Use the activerecord-import gem for bulk importing.
Install via your Gemfile:
gem 'activerecord-import'
Collect your users and import:
desc "Import users."
task :import_users => :environment do
users = File.open("users.txt", "r").map do |line|
name, age, profession = line.strip.split("\t")
User.new(:name => name, :age => age, :profession => profession)
end
User.import users
end
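For very large files, building every record before a single import call keeps the whole file in memory. One possible variation is to import in fixed-size batches. The sketch below shows only the parsing and batching in plain Ruby; the sample lines and the batch size of 2 are made up, and in the rake task each batch would be turned into `User` objects and handed to `User.import` as above.

```ruby
# Invented sample lines standing in for the contents of users.txt.
lines = [
  "Alice\t30\tEngineer\n",
  "Bob\t45\tTeacher\n",
  "Carol\t27\tDesigner\n",
  "Dave\t52\tPilot\n",
  "Eve\t33\tChemist\n"
]

# Parse each tab-delimited line into an attributes hash.
rows = lines.map do |line|
  name, age, profession = line.strip.split("\t")
  { name: name, age: age.to_i, profession: profession }
end

# Group the rows into batches; a real import would use a much larger size.
batch_size = 2
batches = rows.each_slice(batch_size).to_a

batches.each do |batch|
  # In the rake task: User.import(batch.map { |attrs| User.new(attrs) })
  puts "would import #{batch.size} users"
end
```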
