It's a vague question, I know, but the performance of this block of code is horrible. It takes about 15 seconds from the original POST to the action until the page renders...
The purpose of this action is to retrieve all occupations from a CV, plus all the skills from that CV and those occupations. They need to be organized in two arrays:
the first array contains all the occupations (no duplicates) and orders them by their score. For each duplicate entry found, the score is increased by 1.
the second array contains ALL the skills from both the occupation array and the CV. Again, no duplicates are allowed, but for every duplicate encountered the score of the existing entry is increased by one.
Below is the code block that performs this operation. It's relatively big compared to my other code snippets, but I hope it's understandable. I know working with the arrays like I do is confusing, but here is what each array position means:
position 0: the actual skill/occupation object
position 1: the score of the entry
position 2: the location found in the db
position 3: the location found in the cv
def categorize
  @cv = Cv.find(params[:cv_id], :include => [:desired_occupations, :past_occupations, :educational_skills])
  @menu = :second
  @language = Language.resolve(:code => :en, :name => :en)
  @occupation_hashes = []
  @skill_hashes = []
  (@cv.desired_occupations + @cv.past_occupations).each do |occupation|
    section = []
    section << 'Desired occupation' if @cv.desired_occupations.include? occupation
    section << 'Work experience' if @cv.past_occupations.include? occupation
    unless (array = @occupation_hashes.assoc(occupation)).blank?
      array[1] += 1
      array[2] = (array[2] & section).uniq
    else
      @occupation_hashes << [occupation, 1, section]
    end
    occupation.skills.each do |skill|
      unless (array = @skill_hashes.assoc skill).blank?
        label = occupation.concept.label(@language).value
        array[1] += 1
        array[3] << label unless array[3].include? label
      else
        @skill_hashes << [skill, 1, [], [occupation.concept.label(@language).value]]
      end
    end
  end
  @cv.educational_skills.each do |skill|
    unless (array = @skill_hashes.assoc skill).blank?
      array[1] += 1
      array[3] << 'Education skills' unless array[3].include? 'Education skills'
    else
      @skill_hashes << [skill, 1, ['Education skills'], []]
    end
  end
  # Sort the hashes
  @occupation_hashes.sort! { |x, y| y[1] <=> x[1] }
  @skill_hashes.sort! { |x, y| y[1] <=> x[1] }
  @max = @skill_hashes.first[1]
  @min = @skill_hashes.last[1]
end
I can post the additional models and migrations to make it clear what each class does, but I think the first few lines of the above script make the associations clear. I'm looking for a way to optimize the each loops...
That's quite the block of code there. Generally, if you're writing methods that long, you're going to have trouble maintaining them in the future. A technique that would help is breaking up that monolithic chunk of code into a helper class that does the processing in more logical stages, making it easier to fine-tune aspects of it.
For instance, an interface might be:
@categorizer = CvCategorizer.new(params[:cv_id])
This would encapsulate all of the above and save it into instance variables made accessible by being declared with attr_reader.
Using a utility class means you can break up the initialization into steps that are made more clear:
def initialize(cv_id)
  # Call a wrapper method that loads the CV
  @cv = load_cv(cv_id)
  # Perform discrete steps to re-order the imported data
  organize_occupations
  organize_skills
end
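Fleshed out, the helper class might look something like this (a sketch only; the organize_* bodies would hold the loops from the question, and everything except the model names is hypothetical):
class CvCategorizer
  # Results are exposed to the view via attr_reader
  attr_reader :cv, :occupation_hashes, :skill_hashes, :max, :min

  def initialize(cv_id)
    @cv = load_cv(cv_id)
    @occupation_hashes = []
    @skill_hashes = []
    organize_occupations
    organize_skills
  end

  private

  # Eager-load everything the categorization needs in one query
  def load_cv(cv_id)
    Cv.find(cv_id, :include => [:desired_occupations, :past_occupations, :educational_skills])
  end

  def organize_occupations
    # score and de-duplicate occupations here
  end

  def organize_skills
    # score and de-duplicate skills here, then set @max and @min
  end
end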
It's really hard to say why this is slow just by looking at it, though I would pay very close attention to log/development.log to see what's going on in there. It could be that the initial load is painfully slow while the rest of the method is fine.
You should do a bit of profiling in your code to see what is taking a large chunk of time. You can figure out how to work one of the profilers, or just sprinkle some simple puts or logger.info statements throughout your code with a timestamp. It's probably easiest to do this using Benchmark. Note: you may need to require 'benchmark'; I'm not sure if it is auto-required in Rails or not.
For a single line, you can do something like this:
logger.info Benchmark.measure { @cv = Cv.find(params[:cv_id], :include => [:desired_occupations, :past_occupations, :educational_skills]) }
And for timing larger blocks of code:
# Parentheses are needed here: a bare do..end block would bind to
# logger.info rather than Benchmark.measure.
logger.info(Benchmark.measure do
  (@cv.desired_occupations + @cv.past_occupations).each do |occupation|
    section = []
    section << 'Desired occupation' if @cv.desired_occupations.include? occupation
    section << 'Work experience' if @cv.past_occupations.include? occupation
    unless (array = @occupation_hashes.assoc(occupation)).blank?
      array[1] += 1
      array[2] = (array[2] & section).uniq
    else
      @occupation_hashes << [occupation, 1, section]
    end
  end
end)
I'd just start with large blocks and then narrow it down. Not knowing how large a dataset you are dealing with, it's hard to say where the problem zone is.
I'll also concur with others that you will be way better off breaking this thing into smaller methods. This will also make it easier to test for performance, since you can do things like:
Benchmark.measure { 10000.times { foo.do_that_thing_that_might_be_slow }}
Related
I have a PORO (Plain Old Ruby Object) that deals with some business logic. It receives an ActiveRecord object and classifies it. For simplicity's sake, take the following as an example:
class Classificator
  STATES = {
    1 => "Positive",
    2 => "Neutral",
    3 => "Negative"
  }

  def initialize(item)
    @item = item
  end

  def name
    STATES.fetch(state_id)
  end

  private

  def state_id
    return 1 if @item.value > 0
    return 2 if @item.value == 0
    return 3 if @item.value < 0
  end
end
However, I also want to run queries that group objects based on this state_id "virtual attribute". I'm currently dealing with that by creating the attribute in the SQL queries and using it in GROUP BY statements. See the example:
class Classificator::Query
  SQL_CONDITIONS = {
    1 => "items.value > 0",
    2 => "items.value = 0",
    3 => "items.value < 0"
  }

  def initialize(relation = Item.all)
    @relation = relation
  end

  def count
    @relation.select(group_conditions).group('state_id').count
  end

  private

  def group_conditions
    'CASE ' + SQL_CONDITIONS.map do |k, v|
      'WHEN ' + v.to_s + " THEN " + k.to_s
    end.join(' ') + " END AS state_id"
  end
end
This way, I can push this business logic into SQL and run this kind of query very efficiently.
The problem is: I have duplicated business logic. It exists in Ruby code, to classify a single object, and also in SQL, to classify a collection of objects at the database level.
Is this a bad practice? Is there a way to avoid it? I actually was able to do the following:
item = Item.find(4)
Item.select(group_conditions).where(id: item.id).select('state_id')
But by doing this, I lose the ability to classify objects that are not persisted in the database. The other way out would be classifying each object in Ruby, using an iterator, but then I would lose the database performance.
It seems unavoidable to keep duplicated business logic if I want the best of both cases. But I just want to be sure about this. :)
Thanks!
I'd rather keep the database simple and put the logic in Ruby code as much as possible. Since the classification is not stored in the database, I wouldn't expect the queries to return it.
My solution is to define a concern which will be included into ActiveRecord model classes.
module Classified
  extend ActiveSupport::Concern

  STATES = {
    1 => "Positive",
    2 => "Neutral",
    3 => "Negative"
  }

  included do
    def state_name
      STATES.fetch(state_id)
    end

    private

    # value > 0 -> 1, value == 0 -> 2, value < 0 -> 3
    def state_id
      (0 <=> value.to_i) + 2
    end
  end
end

class Item < ActiveRecord::Base
  include Classified
end
And I fetch items from the database just as usual:
items = Item.where(...)
Since each item knows its own classification value, I don't have to ask the database for it.
items.each do |item|
  puts item.state_name
end
ActiveRecord itself implies a degree of coupling between your persistence and business logic. However, as much as the pattern allows, and if you don't have real performance constraints, the first option should be to keep your persistence code as dumb as possible, and move this "classification" (which is clearly a business rule) away from the database as much as possible.
The rationale is that database-related code is more expensive to change (especially as your system is already in production) and generally more difficult and slower to test than pure business logic.
Is there any chance to introduce a trigger in the database? If so, I would go with a "calculated" state_id field in the database that updates its value on both INSERT and UPDATE (this will bring even more performance benefit), plus this code in Ruby:
def state_id
  return @item.state_id if @item.state_id # persisted object
  case @item.value
  when 0 then 2
  when -Float::INFINITY...0 then 3
  else 1
  end
end
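As a sketch of what that trigger could look like (assuming PostgreSQL and an items table with a state_id column to fill; the function and trigger names are made up):
class AddStateIdTrigger < ActiveRecord::Migration
  def up
    execute <<-SQL
      CREATE OR REPLACE FUNCTION set_state_id() RETURNS trigger AS $$
      BEGIN
        -- Same rules as the Ruby code: positive -> 1, zero -> 2, negative -> 3
        NEW.state_id := CASE
          WHEN NEW.value > 0 THEN 1
          WHEN NEW.value = 0 THEN 2
          ELSE 3
        END;
        RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER items_set_state_id
        BEFORE INSERT OR UPDATE ON items
        FOR EACH ROW EXECUTE PROCEDURE set_state_id();
    SQL
  end

  def down
    execute "DROP TRIGGER items_set_state_id ON items; DROP FUNCTION set_state_id();"
  end
end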
I'm working on an A/B test and need to divide a population. How should I divide something like
User.where(:condition => true) randomly into two roughly equal groups?
I'm considering iterating through the whole array and pushing onto one of two other arrays based on a random value, but this is a large query and that sounds very slow.
e.g.
array.each do |object|
  if rand(2) == 0
    first_group << object
  else
    second_group << object
  end
end
To get a random ordering right from the database you can do
# MySQL
User.order('RAND()')
# PostgreSQL
User.order('RANDOM()')
A nice one-liner to split an array into two halves can be found here:
left, right = a.each_slice( (a.size/2.0).round ).to_a
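Putting the two together, something like this should work (an untested sketch, assuming the User scope from the question):
# Fetch in random order, then split into two roughly equal halves
users = User.where(:condition => true).order('RANDOM()').to_a
first_group, second_group = users.each_slice((users.size / 2.0).round).to_a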
I would write a method that returns the two groups:
def randomizer(sample_size)
  initial_arr = ["obj1", "obj2", "objn"]
  sampler = initial_arr.sample(sample_size)
  sampled_data = initial_arr - sampler
  [sampler, sampled_data]
end
Here sample_size is the number of elements you want in the first group, e.g. 50 or 100 depending on your data size.
As a basic trial, I did:
[:foo, :bar, :hello, :world, :ruby].sample(3)
The output would be something like [:hello, :ruby, :bar].
The second group is the original array minus the sampled elements, i.e. [:foo, :bar, :hello, :world, :ruby] - [:hello, :ruby, :bar], which is [:foo, :world]. (Store the sample in a variable first; calling sample a second time would pick different elements.)
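Concretely, storing the sample first so both groups line up:
population = [:foo, :bar, :hello, :world, :ruby]
first_group  = population.sample(3)      # e.g. [:hello, :ruby, :bar]
second_group = population - first_group  # the rest, e.g. [:foo, :world]
# Note: Array#- also removes duplicates, which is fine for unique records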
This way you avoid looping over the array, and the code runs faster.
For more information, see http://www.ruby-doc.org/core-2.1.1/Array.html#method-i-sample (Array#sample is available as of Ruby 1.9.1).
You could perform basic Array operations on the query result:
results = User.where(:condition => true)
Start by using shuffle to get a random ordering of the results:
array = results.shuffle
Then slice the array into two roughly equal parts:
# Exclusive range so the two groups differ by at most one element
group1 = array.slice(0...array.length / 2)
group2 = array.slice(array.length / 2..-1)
If order is important, sort the groups back into the initial order:
group1.sort! {|a, b| results.index(a) <=> results.index(b) }
group2.sort! {|a, b| results.index(a) <=> results.index(b) }
I'm trying to implement my first Ruby sorting algorithm. This algorithm is based on some specific rules ("always prefer objects of type xxx over objects of type yyy"), and if none of these rules triggers, it falls back to Ruby's <=> operator. I'm doing this on a Ruby on Rails one-to-many association.
The problem is that this algorithm does not return the array itself; it just returns -1 or 1, the result of the comparison. I actually don't understand why, as my result is only returned inside the sort block.
Here is my current code:
def sort_products!
  products.sort! do |p1, p2|
    result = 0
    # Scalable Products are always the last ones in order
    if p1.class.name == "ScalableProduct"
      result = -1
    elsif p2.class.name == "ScalableProduct"
      result = 1
    end
    if result == 0
      # Put products producing electricity and heating down
      if p1.can_deliver_electricity?
        result = -1
      elsif p2.can_deliver_electricity?
        result = 1
      end
    end
    # Else: just compare names
    result = p1.name <=> p2.name if result == 0
    result
  end
end
The best practice here, in my opinion, would be to implement <=> in the Product model. You'll need to include the Comparable module to achieve this:
class Product
  include Comparable

  def <=>(another_product)
    # Compare self with another_product
    # Return -1, 0, or 1
  end
end
Then your sorting method will be reduced to:
def sort_products!
  products.sort!
end
Change the do..end block delimiters to curly braces. With do..end, the precedence rules mean it first sorts and then passes the block to the result; with braces, the block is passed to the sorting method itself, which is what you wanted.
Also, note that if both products are ScalableProducts, your comparison will not order them in a sensible way. When both are ScalableProduct at the same time, you might want to keep result at 0 so it falls through to comparing by name. The same goes for can_deliver_electricity?.
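Putting both suggestions together, the <=> body might look something like this (a sketch; it keeps the return values from the question and assumes can_deliver_electricity? is defined on every product):
class Product
  include Comparable

  def <=>(other)
    # Rule 1 from the question; when both sides match, fall through
    if self.class.name == "ScalableProduct" && other.class.name != "ScalableProduct"
      return -1
    elsif other.class.name == "ScalableProduct" && self.class.name != "ScalableProduct"
      return 1
    end
    # Rule 2, with the same tie handling
    if can_deliver_electricity? && !other.can_deliver_electricity?
      return -1
    elsif other.can_deliver_electricity? && !can_deliver_electricity?
      return 1
    end
    # Fall back to comparing names
    name <=> other.name
  end
end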
I know that serializing an object is (to my knowledge) the only way to effectively deep-copy an object (as long as it isn't stateful like IO and whatnot), but is one way particularly more efficient than another?
For example, since I'm using Rails, I could always use ActiveSupport::JSON or to_xml, and from what I can tell marshalling the object is one of the most accepted ways to do this. I'd expect marshalling to be the most efficient of these since it's a Ruby internal, but am I missing anything?
Edit: note that the implementation is something I already have covered - I don't want to replace existing shallow copy methods (like dup and clone), so I'll likely just add Object::deep_copy, using whichever of the above methods (or any suggestions you have :) has the least overhead.
I was wondering the same thing, so I benchmarked a few different techniques against each other. I was primarily concerned with Arrays and Hashes - I didn't test any complex objects. Perhaps unsurprisingly, a custom deep-clone implementation proved to be the fastest. If you are looking for a quick and easy implementation, Marshal appears to be the way to go.
I also benchmarked an XML solution with Rails 3.0.7, not shown below. It was much, much slower: ~10 seconds for only 1,000 iterations (the solutions below all ran 10,000 times for the benchmark).
Two notes regarding my JSON solution. First, I used the C variant, version 1.4.3. Second, it doesn't actually work 100%, as symbols will be converted to strings.
This was all run with ruby 1.9.2p180.
#!/usr/bin/env ruby

require 'benchmark'
require 'yaml'
require 'json/ext'
require 'msgpack'

def dc1(value)
  Marshal.load(Marshal.dump(value))
end

def dc2(value)
  YAML.load(YAML.dump(value))
end

def dc3(value)
  JSON.load(JSON.dump(value))
end

def dc4(value)
  if value.is_a?(Hash)
    result = value.clone
    value.each { |k, v| result[k] = dc4(v) }
    result
  elsif value.is_a?(Array)
    result = value.clone
    result.clear
    value.each { |v| result << dc4(v) }
    result
  else
    value
  end
end

def dc5(value)
  MessagePack.unpack(value.to_msgpack)
end

value = {'a' => {:x => [1, [nil, 'b'], {'a' => 1}]}, 'b' => ['z']}

Benchmark.bm do |x|
  iterations = 10000
  x.report { iterations.times { dc1(value) } }
  x.report { iterations.times { dc2(value) } }
  x.report { iterations.times { dc3(value) } }
  x.report { iterations.times { dc4(value) } }
  x.report { iterations.times { dc5(value) } }
end
results in:
user system total real
0.230000 0.000000 0.230000 ( 0.239257) (Marshal)
3.240000 0.030000 3.270000 ( 3.262255) (YAML)
0.590000 0.010000 0.600000 ( 0.601693) (JSON)
0.060000 0.000000 0.060000 ( 0.067661) (Custom)
0.090000 0.010000 0.100000 ( 0.097705) (MessagePack)
I think you need to add an initialize_copy method to the class you are copying, and put the logic for the deep copy in there. Then when you call clone, it will fire that method. I haven't done it myself, but that's my understanding.
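For illustration, a minimal sketch of that approach (the class and attribute here are made up):
class Tree
  attr_accessor :children

  def initialize(children = [])
    @children = children
  end

  # Called on the new copy by both dup and clone, with the source as
  # argument; deep-copy the mutable state here.
  def initialize_copy(source)
    super
    @children = source.children.map { |child| child.dup }
  end
end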
I think plan B would be just overriding the clone method:
class CopyMe
  attr_accessor :var

  def initialize(var = '')
    @var = var
  end

  def clone(deep = false)
    deep ? CopyMe.new(@var.clone) : CopyMe.new
  end
end
a = CopyMe.new("test")
puts "A: #{a.var}"
b = a.clone
puts "B: #{b.var}"
c = a.clone(true)
puts "C: #{c.var}"
Output
mike@sleepycat:~/projects$ ruby ~/Desktop/clone.rb
A: test
B:
C: test
I'm sure you could make that cooler with a little tinkering, but for better or for worse that is probably how I would do it.
Probably the reason Ruby doesn't include a deep clone has to do with the complexity of the problem; see the notes at the end.
To make a clone that will "deep copy" Hashes, Arrays, and elemental values, i.e., make a copy of each element in the original such that the copy has the same values but new objects, you can use this:
class Object
  def deepclone
    case
    when self.class == Hash
      hash = {}
      self.each { |k, v| hash[k] = v.deepclone }
      hash
    when self.class == Array
      array = []
      self.each { |v| array << v.deepclone }
      array
    else
      if defined?(self.class.new)
        self.class.new(self)
      else
        self
      end
    end
  end
end
If you want to redefine the behavior of Ruby's clone method, you can name it clone instead of deepclone (in 3 places), but I have no idea how redefining Ruby's clone behavior will affect Ruby libraries or Ruby on Rails, so caveat emptor. Personally, I can't recommend doing that.
For example:
a = {'a'=>'x','b'=>'y'}  # => {"a"=>"x", "b"=>"y"}
b = a.deepclone          # => {"a"=>"x", "b"=>"y"}
puts "#{a['a'].object_id} / #{b['a'].object_id}"  # => 15227640 / 15209520
If you want your classes to deepclone properly, their new method (initialize) must be able to deepclone an object of that class in the standard way, i.e., if the first parameter is given, it's assumed to be an object to be deepcloned.
Suppose we want a class M, for example. The first parameter must be an optional object of class M. Here we have a second optional argument z to pre-set the value of z in the new object.
class M
  attr_accessor :z

  def initialize(m = nil, z = nil)
    if m
      # deepclone all the variables in m to the new object
      @z = m.z.deepclone
    else
      # default all the variables in M
      @z = z # default is nil if not specified
    end
  end
end
The z pre-set is ignored during cloning here, but your method may have a different behavior. Objects of this class would be created like this:
# a new 'plain vanilla' object of M
m = M.new            # => #<M:0x0000000213fd88 @z=nil>
# a new object of M with m.z pre-set to 'g'
m = M.new(nil, 'g')  # => #<M:0x00000002134ca8 @z="g">
# a deepclone of m in which the strings are the same value, but different objects
n = m.deepclone      # => #<M:0x00000002131d00 @z="g">
puts "#{m.z.object_id} / #{n.z.object_id}"  # => 17409660 / 17403500
Where objects of class M are part of a hash:
a = {'a' => M.new(nil, 'g'), 'b' => 'y'}  # => {"a"=>#<M:0x00000001f8bf78 @z="g">, "b"=>"y"}
b = a.deepclone                           # => {"a"=>#<M:0x00000001766f28 @z="g">, "b"=>"y"}
puts "#{a['a'].object_id} / #{b['a'].object_id}"  # => 12303600 / 12269460
puts "#{a['b'].object_id} / #{b['b'].object_id}"  # => 16811400 / 17802280
Notes:
If deepclone tries to clone an object which doesn't clone itself in the standard way, it may fail.
If deepclone tries to clone an object which can clone itself in the standard way, and if it is a complex structure, it may (and probably will) make a shallow clone of itself.
deepclone doesn't deep copy the keys in Hashes. The reason is that they are not usually treated as data, but if you change hash[k] to hash[k.deepclone], they will be deep copied as well.
Certain elemental values have no new method, such as Fixnum. These objects always have the same object ID, and are copied, not cloned.
Be careful because when you deep copy, two parts of your Hash or Array that contained the same object in the original will contain different objects in the deepclone.
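A quick illustration of that last point:
s = "shared"
a = [s, s]       # a[0] and a[1] are the same object
a[0] << "!"      # a is now ["shared!", "shared!"]
b = a.deepclone  # each element is cloned separately
b[0] << "?"      # b is ["shared!?", "shared!"]; b[1] no longer tracks b[0]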
This is the code in my reports controller. It just looks so bad; can anyone give me some suggestions on how to tidy it up?
# app/controllers/reports_controller.rb
@report_lines = []
@sum_wp, @sum_projected_wp, @sum_il, @sum_projected_il, @sum_li, @sum_gross_profit, @sum_opportunities = [0, 0, 0, 0, 0, 0, 0]
date = @start_date
num_of_months.times do
  wp, projected_wp, invoice_line, projected_il, line_item, opp = Report.data_of_invoicing_and_delivery_report(@part_or_service, date)
  @sum_wp += wp
  @sum_projected_wp += projected_wp
  @sum_il += invoice_line
  @sum_projected_il += projected_il
  @sum_li += line_item
  gross_profit = invoice_line - line_item
  @sum_gross_profit += gross_profit
  @sum_opportunities += opp
  @report_lines << [date.strftime("%m/%Y"), wp, projected_wp, invoice_line, projected_il, line_item, gross_profit, opp]
  date = date.next_month
end
I'm looking to use some method like
@sum_a, @sum_b, @sum_c += [1, 2, 3]
My instant thought is: move the code to a model.
The objective should be "Thin Controllers", so they should not contain business logic.
Second, I like to present my report lines to my views as OpenStruct objects, which seems cleaner to me.
So I'd consider moving this accumulation logic into (most likely) a class method on Report and returning an array of "report line" OpenStructs and a single totals OpenStruct to pass to my View.
My controller code would become something like this:
@report_lines, @report_totals = Report.summarised_data_of_inv_and_dlvry_rpt(@part_or_service, @start_date, num_of_months)
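For what it's worth, a sketch of what that class method might look like (untested; it assumes Report.data_of_invoicing_and_delivery_report keeps its current signature, and the OpenStruct field names are made up):
require 'ostruct'

class Report < ActiveRecord::Base
  def self.summarised_data_of_inv_and_dlvry_rpt(part_or_service, start_date, num_of_months)
    lines = []
    totals = OpenStruct.new(:wp => 0, :projected_wp => 0, :il => 0, :projected_il => 0,
                            :li => 0, :gross_profit => 0, :opportunities => 0)
    date = start_date
    num_of_months.times do
      wp, projected_wp, il, projected_il, li, opp =
        data_of_invoicing_and_delivery_report(part_or_service, date)
      gross_profit = il - li
      # One OpenStruct per report line for the view
      lines << OpenStruct.new(:month => date.strftime("%m/%Y"), :wp => wp,
                              :projected_wp => projected_wp, :il => il,
                              :projected_il => projected_il, :li => li,
                              :gross_profit => gross_profit, :opportunities => opp)
      # Accumulate the running totals
      totals.wp += wp
      totals.projected_wp += projected_wp
      totals.il += il
      totals.projected_il += projected_il
      totals.li += li
      totals.gross_profit += gross_profit
      totals.opportunities += opp
      date = date.next_month
    end
    [lines, totals]
  end
end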
EDIT: (A day later)
Looking at that accumulate-into-an-array idea, I came up with this:
require 'test/unit'

class Array
  def add_corresponding(other)
    each_index { |i| self[i] += other[i] }
  end
end

class TestProblem < Test::Unit::TestCase
  def test_add_corresponding
    a = [1, 2, 3, 4, 5]
    assert_equal [3, 5, 8, 11, 16], a.add_corresponding([2, 3, 5, 7, 11])
    assert_equal [2, 3, 6, 8, 10], a.add_corresponding([-1, -2, -2, -3, -6])
  end
end
Look: a test! It seems to work OK. There are no checks for differences in size between the two arrays, so there are lots of ways it could go wrong, but the concept seems sound enough. I'm considering trying something similar that would let me take an ActiveRecord result set and accumulate it into an OpenStruct, which is what I tend to use in my reports...
Our new Array method might reduce the original code to something like this:
totals = [0, 0, 0, 0, 0, 0, 0]
date = @start_date
num_of_months.times do
  wp, projected_wp, invoice_line, projected_il, line_item, opp = Report.data_of_invoicing_and_delivery_report(@part_or_service, date)
  totals.add_corresponding [wp, projected_wp, invoice_line, projected_il, line_item, opp, invoice_line - line_item]
  @report_lines << [date.strftime("%m/%Y"), wp, projected_wp, invoice_line, projected_il, line_item, invoice_line - line_item, opp]
  date = date.next_month
end
@sum_wp, @sum_projected_wp, @sum_il, @sum_projected_il, @sum_li, @sum_opportunities, @sum_gross_profit = totals
...which, if Report#data_of_invoicing_and_delivery_report could also calculate gross_profit, would reduce even further to:
num_of_months.times do
  totals.add_corresponding(Report.data_of_invoicing_and_delivery_report(@part_or_service, date))
end
Completely untested, but that's a hell of a reduction for the addition of a one-line method to Array and a single extra subtraction in a model.
Create a summation object that contains all those fields, and pass the entire array to @sum.increment_sums(Report.data_of...).
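A sketch of such an object (the field names and their ordering are assumptions based on the question):
class Summation
  FIELDS = [:wp, :projected_wp, :invoice_line, :projected_il, :line_item, :opportunities]

  attr_reader(*FIELDS)

  def initialize
    FIELDS.each { |f| instance_variable_set("@#{f}", 0) }
  end

  # Adds one row of report data (ordered as FIELDS) to the running totals
  def increment_sums(row)
    FIELDS.each_with_index do |field, i|
      instance_variable_set("@#{field}", send(field) + row[i])
    end
    self
  end
end

# Usage:
# @sum = Summation.new
# @sum.increment_sums(Report.data_of_invoicing_and_delivery_report(@part_or_service, date))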