Exceptionally slow import task - ruby-on-rails

I have a rake task (Rails 3 / Mongoid) that takes a lot of time to complete for no apparent reason. My guess is that I'm doing something multiple times where it's not needed, or that I'm missing something very obvious (I'm no MongoDB or Mongoid expert):
task :fix_editors => :environment do
  (0...50).each do |num|
    CSV.foreach("foo_20141013_ascii.csv-#{num}.csv", col_sep: ";", headers: true, force_quotes: true) do |row|
      editors = Hash[*Editor.all.collect {|ed| [ed.name, ed.id]}.flatten]
      begin
        book = Book.where(internal_id: row["ID"], editorial_data_checked: false).first
        if book && !row["Marchio"].nil?
          editor_name = HTMLEntities.new.decode(row['Marchio']).strip.titleize
          editor_id = editors[editor_name]
          unless editor_id
            editor = Editor.create(name: editor_name)
            editors[editor_name] = editor.id
            editor_id = editor.id
          end
          if book.update_attributes(editor_id: editor_id, editorial_data_checked: true)
            puts "#{book.slug} updated with editor data"
          else
            puts "Nothing done for #{book.slug}"
          end
        end
      rescue => e
        puts e
        retry
      end
    end
  end
end
The CSV I had to read at the beginning was very big, so I've split it into 50 smaller files (that was my first attempt to speed things up).
Then I tried to remove all the queries I could; that's why it doesn't read from the Editor collection for every row but collects all the editors at the beginning and then just looks things up in a hash.
In the end I removed all save calls and used update_attributes instead.
The Book collection is more or less 1 million records, so it's pretty large. I have 13k Editors, so no big deal there.
Here is my Book class:
https://gist.github.com/anonymous/087e6c81ef5f355a160d
Locally it takes more than 1 second per row; I don't think that's normal, but feel free to let me know if you disagree. All writes take less than 0.1-0.2 seconds (I've used Benchmark.measure).
I'm out of ideas. Can anybody help me? Am I missing something? Thanks in advance.

Move the line
editors = Hash[*Editor.all.collect {|ed| [ed.name, ed.id]}.flatten]
up so that it comes right after
task :fix_editors => :environment do
That way the hash is built once instead of once for every CSV row. Another thing you could do is batch processing: load 1,000 rows, then the matching 1,000 books, and then process those books together.
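A minimal sketch of the first change (hoisting the hash; names are the same as in the question, and the per-row body is elided):
task :fix_editors => :environment do
  # build the name => id lookup once, not once per CSV row
  editors = Hash[*Editor.all.collect { |ed| [ed.name, ed.id] }.flatten]
  (0...50).each do |num|
    CSV.foreach("foo_20141013_ascii.csv-#{num}.csv", col_sep: ";", headers: true, force_quotes: true) do |row|
      # ... same per-row processing as before, reusing the prebuilt editors hash ...
    end
  end
end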

Do you have an index on the internal_id field of the books collection?
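Without one, every Book.where(internal_id: ...) lookup scans the whole collection of roughly 1 million documents. A sketch of the declaration, assuming Mongoid 3 syntax (Mongoid 2 would use index :internal_id):
class Book
  include Mongoid::Document
  # compound index covering the exact query used in the task
  index({ internal_id: 1, editorial_data_checked: 1 })
end
Then build it with rake db:mongoid:create_indexes.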

Related

Ruby on Rails Performance - update row or check first

I have a model Product with an attribute description and an attribute code, which is indexed.
I would like to update each product's description based on a CSV file.
What is faster?
@p = Product.find_by_code(row[:code])
if @p.description != row[:desc]
  @p.update_attribute(:description, row[:desc])
end
or
@p = Product.find_by_code(row[:code])
@p.update_attribute(:description, row[:desc])
Let's consider all cases, such as when the descriptions are equal and when they are not equal at all.
How is the == comparison implemented for strings and texts?
You should use the Ruby Benchmark module and measure it directly!
require 'benchmark'

Benchmark.bm do |x|
  x.report('check first') do
    @p = Product.find_by(code: row[:code])
    if @p.description != row[:desc]
      @p.description = row[:desc]
      @p.save
    end
  end
  x.report('update always') do
    @p = Product.find_by(code: row[:code])
    @p.description = row[:desc]
    @p.save
  end
end
Ruby on Rails is clever enough to know whether an attribute has actually changed, and so won't roundtrip to the database to update a field when it hasn't changed. You can see this on the Rails console (rails c) if you run your update_attribute code with the same value, and then with a changed value - you'll only see the SQL log output when it's changed.
If you use update_attributes instead (which takes a hash of attributes to change) and there is nothing to update, you'll see it does begin and end a transaction with the database, albeit with no commands within it.
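You can see the same thing from the console; a minimal sketch, assuming a Product model with a description attribute:
p = Product.first
p.description = p.description   # assign the same value
p.changed?                      # => false; save will not issue an UPDATE
p.description = "something new" # assign a different value
p.changed?                      # => true; save will issue an UPDATE
p.save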
Hope that helps!

Rails Everyday Action

I need the system to run the following code every day, but I don't know how to accomplish this.
@user = User.all
date = Date.today
if date.workingday?
  @user.each do |user|
    if !Bank.where(:user_id => user.id, :created_at => (date.beginning_of_day..date.end_of_day), :bank_type => 2).exists?
      banco = Bank.new
      banco.user = user
      banco.bank_type = 2
      banco.hours = 8
      banco.save
    end
  end
end
The most conventional way is to set this up to be executed with rails runner in a cron job.
There are tools like whenever that make it easier to create these jobs by defining how often they need to be executed in Ruby rather than in the peculiar and sometimes difficult to understand crontab format.
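With whenever, for example, the schedule lives in config/schedule.rb; a minimal sketch (the runner class is made up for illustration):
every 1.day, at: '6:00 am' do
  runner "DailyBankFiller.run"
end
Then whenever --update-crontab writes the corresponding crontab entry for you.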
As a note, User.all is a very dangerous thing to do. As the number of users in your system grows, loading them all into memory will eventually blow up your server. You should load them in groups of 100 or so to avoid overloading the memory.
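ActiveRecord has this batching built in; a minimal sketch using find_each:
User.find_each(batch_size: 100) do |user|
  # only batch_size records are instantiated in memory at a time
end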
Additionally, that where clause shouldn't be necessary if you've set up proper has_many and belongs_to relationships here. I would expect this could work:
unless user.bank
  user.create_bank(
    bank_type: 2,
    hours: 8
  )
end
It's not clear how created_at factors in here. Are these assigned daily? If so, that should be something like bank_date as a DATE type column, not a date and time. Using the created_at timestamp as part of the relationship is asking for trouble; it should reflect when the record was created, nothing more.
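If you add such a column, a sketch of the migration (column and index names are assumptions, and this uses Rails 3.1+ change-style migrations):
class AddBankDateToBanks < ActiveRecord::Migration
  def change
    add_column :banks, :bank_date, :date
    # at most one bank row per user, per day, per type
    add_index :banks, [:user_id, :bank_date, :bank_type], :unique => true
  end
end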

Rails: build for difference between relationships

A doc has many articles and can have many edits.
I want to build an edit for each article, up to the total number of @doc.articles. This code works on the first build (i.e., when no edits exist yet).
def editing
  @doc = Doc.find(params[:id])
  unbuilt = @doc.articles - @doc.edits
  unbuilt.reverse.each do |article|
    @doc.edits.build(:body => article.body, :article_id => article.id, :doc_id => @doc.id)
  end
end
But when edits already exist it'll keep those edits and still build up to the @doc.articles total, ending up with too many edits and some duplicates if only one article was changed.
I want to put some condition on :article_id, which exists in both edits and articles, to say (in pseudocode):
unbuilt = @doc.articles - @doc.edits
unbuilt.where('article_id not in (?)', @doc.edits).reverse.each do |article|
  @doc.edits.build(...)
end
Any help would be excellent! Thank you so much.
You are doing something weird here:
unbuilt = @doc.articles - @doc.edits
That subtracts Edit objects from an array of Article objects, so nothing is ever removed. You probably want this instead:
unbuilt = @doc.articles - @doc.edits.map(&:article)
This works if @doc.articles and @doc.edits are small collections; otherwise a SQL solution would be preferable.
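A sketch of the SQL-side approach (this assumes edits have an article_id column and uses pluck, which needs Rails 3.2+; on older versions use map(&:article_id)):
built_ids = @doc.edits.pluck(:article_id)
# NOT IN (NULL) matches nothing, so guard against the empty case
unbuilt = built_ids.empty? ? @doc.articles : @doc.articles.where('articles.id NOT IN (?)', built_ids)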
-- EDIT: added explanation --
this piece of Ruby
@doc.edits.map(&:article)
is equivalent to
@doc.edits.map do |edit| edit.article end
The former is much more compact and exploits Symbol#to_proc, which entered core Ruby in 1.8.7.
It basically takes a symbol (:article) and calls to_proc on it (that is what the & character does). You can think of the to_proc method as something very similar to this:
def to_proc
  proc { |object| object.send(self) }
end
In Ruby, blocks and procs are generally equivalent (kind of), so this works!

How do I skip over the first three rows instead of only the first in FasterCSV

I am using FasterCSV and I am looping with foreach like this:
FasterCSV.foreach("#{Rails.public_path}/uploads/transfer.csv", :encoding => 'u', :headers => :first_row) do |row|
but the problem is that my CSV has its first 3 lines as headers... any way to make FasterCSV skip the first three rows rather than only the first?
Not sure about FasterCSV, but in the Ruby 1.9 standard CSV library (which is based on FasterCSV), I can do something like:
c = CSV.open '/path/to/my.csv'
c.drop(3).each do |row|
  # do whatever with row
end
I'm not a user of FasterCSV, but why not handle the skipping yourself:
additional_rows_to_skip = 2
FasterCSV.foreach("...", :encoding => 'u', :headers => :first_row) do |row|
  if additional_rows_to_skip > 0
    additional_rows_to_skip -= 1
  else
    # do stuff...
  end
end
Thanks to Mladen Jablanović, I got my clue. But I realized something interesting:
In 1.9, reading seems to continue from the current file position.
By this I mean that if you do
c = CSV.open iFileName
logger.debug c.first
logger.debug c.first
logger.debug c.first
you'll get three different results in your log, one for each of the three header rows.
c.each do |row| # now seems to start on the 4th row
It makes perfect sense that it would read the file this way; then it only has to keep the current row in memory.
I still like Mladen Jablanović's answer, but this is an interesting bit of logic too.

How can I speed up this Rails code?

It's a vague question, I know... but the performance of this block of code is horrible. It takes about 15 seconds from the original POST to the action to rendering the page...
The purpose of this action is to retrieve all Occupations from a CV, plus all the skills from that CV and from those occupations. They need to be organized into 2 arrays:
the first array contains all the Occupations (no duplicates) and has them ordered according to their score. For each duplicate entry found, the score is increased by 1
the second array contains ALL the skills from both the occupation array and the CV. Again no duplicates are allowed, but for every duplicate encountered the score of the existing entry is increased by one.
Below is the code block that performs this operation. It's relatively big compared to my other code snippets, but I hope it's understandable. I know working with the arrays like I do is confusing, but here is what each array position means:
position 0 : the actual skill/occupation object
position 1 : the score of the entry
position 2 : the location found in the db
position 3 : the location found in the cv
def categorize
  @cv = Cv.find(params[:cv_id], :include => [:desired_occupations, :past_occupations, :educational_skills])
  @menu = :second
  @language = Language.resolve(:code => :en, :name => :en)
  @occupation_hashes = []
  @skill_hashes = []
  (@cv.desired_occupations + @cv.past_occupations).each do |occupation|
    section = []
    section << 'Desired occupation' if @cv.desired_occupations.include? occupation
    section << 'Work experience' if @cv.past_occupations.include? occupation
    unless (array = @occupation_hashes.assoc(occupation)).blank?
      array[1] += 1
      array[2] = (array[2] & section).uniq
    else
      @occupation_hashes << [occupation, 1, section]
    end
    occupation.skills.each do |skill|
      unless (array = @skill_hashes.assoc skill).blank?
        label = occupation.concept.label(@language).value
        array[1] += 1
        array[3] << label unless array[3].include? label
      else
        @skill_hashes << [skill, 1, [], [occupation.concept.label(@language).value]]
      end
    end
  end
  @cv.educational_skills.each do |skill|
    unless (array = @skill_hashes.assoc skill).blank?
      array[1] += 1
      array[3] << 'Education skills' unless array[3].include? 'Education skills'
    else
      @skill_hashes << [skill, 1, ['Education skills'], []]
    end
  end
  # Sort the hashes
  @occupation_hashes.sort! { |x,y| y[1] <=> x[1] }
  @skill_hashes.sort! { |x,y| y[1] <=> x[1] }
  @max = @skill_hashes.first[1]
  @min = @skill_hashes.last[1]
end
I can post the additional models and migrations to make it clear what each class does, but I think the first few lines of the above script should make the associations clear. I'm looking for a way to optimize the each loops...
That's quite the block of code there. Generally, if you're writing methods that large, you're going to have trouble maintaining them in the future. A technique that would help is breaking that monolithic chunk of code into a helper class that does the processing in more logical stages, making it easier to fine-tune aspects of it.
For instance, an interface might be:
@categorizer = CvCategorizer.new(params[:cv_id])
This would encapsulate all of the above and save it into instance variables made accessible by being declared with attr_reader.
Using a utility class means you can break up the initialization into clearer steps:
def initialize(cv_id)
  # Call a wrapper method that loads the CV
  @cv = self.load_cv(cv_id)
  # Perform discrete steps to re-order the imported data
  self.organize_occupations
  self.organize_skills
end
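Putting the pieces together, a minimal skeleton of such a helper class (method bodies elided; the names follow this answer, not the original action):
class CvCategorizer
  attr_reader :cv, :occupation_hashes, :skill_hashes

  def initialize(cv_id)
    @cv = load_cv(cv_id)
    @occupation_hashes = []
    @skill_hashes = []
    organize_occupations
    organize_skills
  end

  private

  def load_cv(cv_id)
    Cv.find(cv_id, :include => [:desired_occupations, :past_occupations, :educational_skills])
  end

  def organize_occupations
    # the occupation loop from the question moves here
  end

  def organize_skills
    # the two skill loops from the question move here
  end
end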
It's really hard to say why this is slow just by looking at it, though I would pay very close attention to log/development.log to see what's going on in there. It could be that the initial load is painfully slow but the rest of the method is fine.
You should do a bit of profiling in your code to see what is taking a large chunk of time. You can figure out how to work one of the profilers, or just sprinkle some simple puts or logger.info statements throughout your code with a timestamp. It's probably easiest to do this using Benchmark. Note: you may need to require 'benchmark'... not sure if it is auto-required in Rails or not.
For a single line, you can do something like this:
logger.info Benchmark.measure { @cv = Cv.find(params[:cv_id], :include => [:desired_occupations, :past_occupations, :educational_skills]) }
And for timing larger blocks of code:
# note the parentheses: without them the do...end block would attach to
# logger.info rather than Benchmark.measure
logger.info(Benchmark.measure do
  (@cv.desired_occupations + @cv.past_occupations).each do |occupation|
    section = []
    section << 'Desired occupation' if @cv.desired_occupations.include? occupation
    section << 'Work experience' if @cv.past_occupations.include? occupation
    unless (array = @occupation_hashes.assoc(occupation)).blank?
      array[1] += 1
      array[2] = (array[2] & section).uniq
    else
      @occupation_hashes << [occupation, 1, section]
    end
  end
end)
I'd just start with large blocks and then narrow it down. Not knowing how large of a dataset you are dealing with, it is hard to say what the problem zone is.
I'll also concur with others that you will be way better off breaking this thing into smaller methods. This will also make it easier to test for performance, since you can do things like:
Benchmark.measure { 10000.times { foo.do_that_thing_that_might_be_slow }}
