I am looking for the fastest way to convert a huge number of objects (1M instances) to another object type.
Unfortunately, I have no control over the input (the million objects).
So far I've tried each_slice, but it does not improve the speed much.
It looks like this:
expected_objects_of_type_2 = []
huge_array.each_slice(3000) do |batch|
  batch.each do |object_type_1|
    expected_objects_of_type_2 << NewType2.new(object_type_1)
  end
end
Any idea?
Thanks!
I did a quick test with a few different methods of looping the array and measured the timings:
huge_array = Array.new(10000000){rand(1..1000)}
a = Time.now
string_array = huge_array.map{|x| x.to_s}
b = Time.now
puts b-a
Same with:
sa = []
huge_array.each do |x|
  sa << x.to_s
end
and
sa = []
huge_array.each_slice(3000) do |batch|
  batch.each do |x|
    sa << x.to_s
  end
end
No idea what you are converting so I did a bit of simple int to string.
Timings
Map: 1.7
Each: 2.3
Slice: 3.2
So apparently your slice overhead makes things slower. Map seems to be the fastest (which is internally just a for loop but with a non-dynamic length array as output). The << seems to slow things down a bit.
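The same comparison could be run with the standard Benchmark module instead of Time.now. This is only a sketch: the array is reduced to 100,000 elements so it finishes quickly, and the absolute numbers will vary by machine and Ruby version.

```ruby
require 'benchmark'

# Smaller than the 10M-element array above so the sketch runs quickly.
huge_array = Array.new(100_000) { rand(1..1000) }

mapped = nil
appended = []
sliced = []

Benchmark.bm(6) do |x|
  x.report('map:')   { mapped = huge_array.map(&:to_s) }
  x.report('each:')  { huge_array.each { |v| appended << v.to_s } }
  x.report('slice:') do
    huge_array.each_slice(3000) { |batch| batch.each { |v| sliced << v.to_s } }
  end
end
```

All three variants produce the same array, so only the timings differ.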
So if each object needs to be converted individually, you are stuck with O(n) complexity and can't speed things up by much. Just avoid overhead.
Depending on your data, sorting to exploit caching effects might help, as might avoiding duplicate work if you have a lot of identical data, but we can't tell without knowing your actual conversion.
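If the input does contain many identical values, one way to avoid repeated work is to cache each conversion result. A minimal sketch, assuming the conversion is a pure function of the input value; Converted is a hypothetical stand-in for NewType2, and sharing one converted object between duplicates is only safe if the results are never mutated afterwards:

```ruby
# Hypothetical stand-in for the real target type.
Converted = Struct.new(:value)

huge_array = Array.new(100_000) { rand(1..1000) }

# The Hash default block runs only the first time a key is looked up,
# so each distinct input is converted exactly once.
cache = Hash.new { |hash, key| hash[key] = Converted.new(key.to_s) }
converted = huge_array.map { |obj| cache[obj] }
```

With at most 1,000 distinct inputs, only that many Converted objects are ever built, however long the array is.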
I would treat each slice in its own thread:
huge_array.each_slice(3000) do |batch|
  Thread.new do
    batch.each do |object_type_1|
      expected_objects_of_type_2 << NewType2.new(object_type_1)
    end
  end
end
Then you have to wait for the threads to terminate using join. They should be accumulated in an array and joined.
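A minimal sketch of the accumulate-and-join step, with Type2 as a hypothetical stand-in for NewType2. Each thread builds its own batch result rather than appending to a shared array, and Thread#value waits for the thread and returns the block's result. Note that on MRI the global VM lock keeps Ruby code from running in parallel, so this mainly helps when the conversion waits on I/O:

```ruby
# Hypothetical stand-in for the real target type.
Type2 = Struct.new(:source)

huge_array = (1..10_000).to_a

# One thread per slice; the threads are collected in an array.
threads = huge_array.each_slice(3000).map do |batch|
  Thread.new { batch.map { |object_type_1| Type2.new(object_type_1) } }
end

# value waits for each thread (like join) and returns its result,
# so the original order of the slices is preserved.
expected_objects_of_type_2 = threads.flat_map(&:value)
```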
I have a rather noobish question about ActiveRecord in Ruby on Rails.
I'm working on an app backed by a PostgreSQL database that will need to handle large amounts of data from multiple platforms as quickly as possible. I'm going through the process of trying to optimize for speed.
I have two functions and I'm wondering which one would be faster theoretically.
Example #1
def spend_branded(date_range)
  total_branded_spend = 0.0
  platform_list.each do |platform|
    platform.where(date: date_range).each do |platform_performance|
      total_branded_spend += platform_performance.spend["branded"].to_f
    end
  end
  total_branded_spend
end
VS.
Example #2
def spend_branded(date_range)
  total_branded_spend = 0.0
  platform_list.each do |platform|
    total_branded_spend += (platform.where(date: date_range).sum(:branded_spend)).to_f
  end
  total_branded_spend
end
As you can see, in the first example, a selection of records are retrieved via the .where() method and then are iterated on with the desired field summed manually. In the second example however, I'm making use of the .sum() method to do the summing at the database level.
I'm wondering if anyone knows which method is faster in general. I suspect the second method is faster, but is it faster by orders of magnitude?
Thanks so much for taking the time to read this question.
EDIT:
As @lacostenycoder pointed out, I should have clarified what platform_list is. It references an array with 1 to 3 ActiveRecord collections, each containing one record for each day in the date_range.
Upon benchmarking with the method provided in his answer, I found the 2nd method to be slightly faster.
                        user     system      total        real
spend_branded       0.000000   0.000000   0.000000 (  0.003632)
spend_branded_sum   0.000000   0.000000   0.000000 (  0.002612)
(102 records processed)
Here's how you can benchmark your methods for comparison. Open a Rails console (rails c), then paste this in.
def spend_branded(date_range)
  total_branded_spend = 0.0
  platform_list.each do |platform|
    platform.where(date: date_range).each do |platform_performance|
      total_branded_spend += platform_performance.spend["branded"].to_f
    end
  end
  total_branded_spend
end
def spend_branded_sum(date_range)
  total_branded_spend = 0.0
  platform_list.each do |platform|
    total_branded_spend += (platform.where(date: date_range).sum(:branded_spend)).to_f
  end
  total_branded_spend
end
require 'benchmark'
Benchmark.bm do |x|
  x.report(:spend_branded) { spend_branded(date_range) }
  x.report(:spend_branded_sum) { spend_branded_sum(date_range) }
end
Of course we would expect the 2nd way to be faster. We can probably offer more help if you showed more about the model relations and how platform_list is defined.
Also, you might want to check out the PgHero gem, which can help identify slow queries and show where to add indexes for better performance. In general, when done correctly, calculations at the database level will be orders of magnitude faster than iterating over large sets of Ruby objects.
Also you might try to refactor your first version to this:
def spend_branded(date_range)
  platform_list.map do |platform|
    platform.where(date: date_range)
      .pluck(:spend).map{|h| h['branded'].to_f}.sum
  end.sum
end
And 2nd version to
def spend_branded_sum(date_range)
  platform_list.map do |platform|
    platform.where(date: date_range).sum(:branded_spend).to_f
  end.sum
end
lacostenycoder is correct to recommend that you benchmark your code.
If the values you are trying to sum are directly available in the database, Calculations are very likely going to be faster. I do not know how much faster.
If platform_list is a collection of models, something like this might work and might outperform your iteration:
Platform.
where(date: date_range).
where(id: platform_list.map(&:id)).
sum(:branded_spend)
I have an array of ranges:
[[39600..82800], [39600..70200],[70200..80480]]
I need to determine whether any of them overlap. What is an easy way to do it in Ruby?
In the above case the output should be 'Overlapping'.
This is a very interesting puzzle, especially if you care about performance.
If there are just two ranges, it's a fairly simple algorithm, which is also covered by ActiveSupport's overlaps? extension.
def ranges_overlap?(r1, r2)
  r1.cover?(r2.first) || r2.cover?(r1.first)
end
If you want to compare multiple ranges, it's a fairly interesting algorithm exercise.
You could loop over all the ranges and compare each one with every other, but that algorithm has quadratic cost, O(n^2).
A more efficient solution is to sort the ranges and run a binary search, or to use a data structure (such as a tree) that makes it possible to compute the overlaps.
This problem is also explained on the Interval tree page. Computing an overlap essentially consists of finding the intersection of the trees.
Is this not a way to do it?
def any_overlapping_ranges(array_of_ranges)
  array_of_ranges.sort_by(&:first).each_cons(2).any?{|x,y|x.last>y.first}
end
p any_overlapping_ranges([50..100, 1..51,200..220]) #=> true
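The comparison above uses >, so two inclusive ranges that share only an endpoint (such as 39600..70200 and 70200..80480) are not reported as overlapping. A possible variant, offered as a sketch, treats a shared endpoint as an overlap unless the earlier range excludes its end; the method name is hypothetical:

```ruby
def any_overlap?(ranges)
  # After sorting by start, any overlap shows up between two neighbours.
  ranges.sort_by(&:begin).each_cons(2).any? do |a, b|
    # An exclusive end (a...b) does not contain its end value.
    a.exclude_end? ? a.end > b.begin : a.end >= b.begin
  end
end

p any_overlap?([39600..82800, 39600..70200, 70200..80480])
p any_overlap?([1..2, 2..3])   # inclusive ranges sharing the value 2
p any_overlap?([1...2, 2..3])  # 1...2 stops before 2
p any_overlap?([1..2, 3..4])
```

Sorting dominates the cost, so this runs in O(n log n) rather than comparing every pair.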
Consider this:
class Range
  include Comparable
  def <=>(other)
    self.begin <=> other.begin
  end
  def self.overlap?(*ranges)
    edges = ranges.sort.flat_map { |range| [range.begin, range.end] }
    edges != edges.sort.uniq
  end
end
Notes:
This could take in any number of ranges.
Have a look at the gist with some tests to play with the code.
The code creates a flat array with the start and end of each range.
This array will retain its order if the ranges don't overlap. (It's easier to visualize with some examples than to explain in text, so try it.)
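To make that visible, here is a small worked example of the flattened edge list. It uses a standalone helper (a hypothetical edges method) instead of reopening Range, but follows the same idea:

```ruby
# Sort by start, then flatten each range into [start, end].
def edges(ranges)
  ranges.sort_by(&:begin).flat_map { |range| [range.begin, range.end] }
end

disjoint    = edges([10..12, 1..3, 5..8])    # [1, 3, 5, 8, 10, 12]
overlapping = edges([2..12, 6..36, 42..96])  # [2, 12, 6, 36, 42, 96]

# Disjoint ranges yield a strictly increasing list. An overlap puts an end
# after the next start (12 before 6 here) or repeats a value.
puts disjoint == disjoint.sort.uniq        # true
puts overlapping == overlapping.sort.uniq  # false
```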
For the sake of simplicity and readability I'd suggest this approach:
def overlaps?(ranges)
  ranges.each_with_index do |range, index|
    # Compare each range only with the ranges that follow it.
    ((index + 1)...ranges.size).each do |i|
      next_range = ranges[i]
      # Array#& always returns an array (truthy even when empty), so test with any?.
      if (range.to_a & next_range.to_a).any?
        puts "#{range} overlaps with #{next_range}"
      end
    end
  end
end
r = [(39600..82800), (39600..70200),(70200..80480)]
overlaps?(r)
and the output:
ruby ranges.rb
39600..82800 overlaps with 39600..70200
39600..82800 overlaps with 70200..80480
39600..70200 overlaps with 70200..80480
I have a loop that will iterate tens of thousands of times, and a set that may have only 50 distinct values. Which of the following is more efficient to have as part of the loop?
if !myset.include?('value')
  myset.add('value')
end
or
myset.add('value')
If myset usually already has the value, then the first version mostly just evaluates the if condition, and the second version, which calls add regardless, would probably be slightly slower.
If myset usually does not have the value, then in the first version the condition check is extra work and would be slower, whereas the second version would be slightly faster.
Either way, I think the difference is so small that it falls within the measurement error.
If we randomize over a set of 50 distinct values:
require 'benchmark'
Benchmark.bm do |b|
  b.report do
    set = []
    100_000.times do
      i = rand(50)
      set.push(i)
    end
  end
  b.report do
    set = []
    100_000.times do
      i = rand(50)
      unless set.include?(i)
        set.push(i)
      end
    end
  end
end
The result I get is 0.04 against 0.2 with checking. So it's 5 times faster if you don't perform the check in this case.
The larger the set of randomized values, the longer it takes (with checking).
You can run a similar benchmark with your own code to see what trends you get. Run it with large numbers and multiple times to get cleaner results.
Update:
require 'set'
require 'benchmark'
Benchmark.bm do |b|
  b.report do
    set = Set.new
    100_000.times do
      i = rand(50)
      set.add(i)
    end
  end
  b.report do
    set = Set.new
    100_000.times do
      i = rand(50)
      unless set.include?(i)
        set.add(i)
      end
    end
  end
end
Running with an actual Set, both examples are slower and perform about the same: around 0.48.
If you use a Set, you don't need the if at all; just call myset.add('value'). As for speed, set.add costs about the same as array.push.
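If the loop also needs to know whether a value was new, Set#add? does the check and the insert in one call: it returns the set when the element was added and nil when it was already present. A small sketch with made-up values:

```ruby
require 'set'

myset = Set.new
newly_added = []

%w[a b a c b].each do |value|
  # add? returns nil for duplicates, so the append runs once per distinct value.
  newly_added << value if myset.add?(value)
end

p newly_added
```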
I've written a function to remove email addresses from my data using gsub. The code is below. The problem is that it takes a total of 27 minutes to execute the function on a set of 10,000 records. (16 minutes for the first pattern, 11 minutes for the second). Elsewhere in the code I process about 20 other RegExp's using a similar flow (iterating through data.each) and they all finish in less than a second. (BTW, I recognize that my RegExp's aren't perfect and may catch some strings that aren't email addresses.)
Is there something about these two RegExp's that is causing the processing time to be so high? I've tried it on seven different data sources all with the same result, so the problem isn't peculiar to my data set.
def remove_email_addresses!(data)
  email_patterns = [
    /[[:graph:]]+@[[:graph:]]+/i,
    /[[:graph:]]+ +at +[^ ][ [[:graph:]]]{0,40} +dot +com/i
  ]
  data.each do |row|
    email_patterns.each do |pattern|
      row[:title].gsub!(pattern,"") unless row[:title].blank?
      row[:description].gsub!(pattern,"") unless row[:description].blank?
    end
  end
end
Check that your faster code isn't just doing var =~ /blah/ matching, rather than replacement: that is several orders of magnitude faster.
In addition to reducing backtracking and replacing + and * with ranges for safety, as follows...
email_patterns = [
  /\b[-_.\w]{1,128}@[-_.\w]{1,128}/i,
  /\b[-_.\w]{1,128} {1,10}at {1,10}[^ ][-_.\w ]{0,40} {1,10}dot {1,10}com/i
]
... you could also try "unrolling your loop", though this is unlikely to make a difference unless there is some kind of interaction between the iterators (which there shouldn't be, but...). That is:
data.each do |row|
  row[:title].gsub!(email_patterns[0],"") unless row[:title].blank?
  row[:description].gsub!(email_patterns[0],"") unless row[:description].blank?
  row[:title].gsub!(email_patterns[1],"") unless row[:title].blank?
  row[:description].gsub!(email_patterns[1],"") unless row[:description].blank?
end
Finally, if this causes little to no speedup, consider profiling with something like ruby-prof to find out whether the regexes themselves are the issue, or whether there's a problem in the do iterator or the unless clauses instead.
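Before reaching for a full profiler, a rough first check is to time each pattern separately on a sample of the data. A sketch using only the standard library; the sample strings are made up, and the patterns are the bounded ones suggested above, written with @ as the address separator:

```ruby
require 'benchmark'

email_patterns = [
  /\b[-_.\w]{1,128}@[-_.\w]{1,128}/i,
  /\b[-_.\w]{1,128} {1,10}at {1,10}[^ ][-_.\w ]{0,40} {1,10}dot {1,10}com/i
]

# Made-up sample rows; substitute a slice of the real data.
samples = [
  'contact john.doe@example.com now',
  'write to jane at example dot com please'
]

cleaned = samples.dup
email_patterns.each_with_index do |pattern, index|
  # Time one pattern at a time so a slow one stands out.
  seconds = Benchmark.realtime do
    cleaned = cleaned.map { |text| text.gsub(pattern, '') }
  end
  puts format('pattern %d: %.6f s', index, seconds)
end

p cleaned
```

If one pattern dominates the total on real rows, the regex is the likely culprit; if both are fast, the time is probably being spent elsewhere in the loop.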
Could it be that the data is large enough that it causes issues with paging once read in? If so, might it be faster to read the data in and parse it in chunks of N entries, rather than process the whole lot at once?
I have a very big hash and I want to iterate it. Hash.each seems to be too slow.
Is there any efficient way to do this?
What about converting the hash to an array?
In each loop I'm doing very simple string stuff:
name_hash.each {|name, str|
  record += name.to_s + "\|" + str +"\n"
}
and the hash uses people's names as the key, some related content as the value:
name_hash = {:"jose garcia" => "ca:tw#2#1,2#:th#1#3#;ar:tw#1#4#:fi#1#5#;ny:tw#1#6#;"}
Consider the following example, which uses a hash of 1 million elements:
#! /usr/bin/env ruby
require 'benchmark'
h = {}
1_000_000.times do |n|
  h[n] = rand
end
puts Benchmark.measure { h.each { |k, v| } }
a = nil
puts Benchmark.measure { a = h.to_a }
puts Benchmark.measure { a.each { |k, v| } }
I ran this on my system at work (running Ruby 1.8.5) and got:
0.350000 0.020000 0.370000 ( 0.380571)
0.300000 0.020000 0.320000 ( 0.307207)
0.160000 0.040000 0.200000 ( 0.198388)
So iterating over the array is indeed faster (0.16 seconds versus 0.35 seconds for the hash). But it took 0.3 seconds to generate the array, so the whole process is slower: 0.46 seconds versus 0.35 seconds.
So it seems it's best just to iterate over the hash, at least in this test case.
String#+ is slow. This should improve it:
record = name_hash.map{|line| line.join("|")}.join("\n")
If you are using this to output to somewhere, you should not create a huge string but rather write line by line to the output.
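A sketch of the line-by-line approach. StringIO stands in for the real destination so the example is self-contained; in practice it could be a File or $stdout. The sample hash is made up:

```ruby
require 'stringio'

name_hash = { :"jose garcia" => 'ca:tw', :"ann lee" => 'ny:fi' }

io = StringIO.new
name_hash.each do |name, str|
  # Each line is written immediately instead of growing one large string.
  io.puts "#{name}|#{str}"
end

print io.string
```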
A more idiomatic way to do that in ruby:
record = name_hash.map{|k,v| "#{k}|#{v}"}.join("\n")
I don't know how that will compare with speed, but part of the problem might be because you keep adding a little bit onto a string and creating new (ever longer) string objects with each iteration. The join is done in C and might perform better.
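To illustrate the point about allocation: String#+ (and therefore +=) builds a new string on every iteration, while String#<< appends to the existing string in place. A sketch with a made-up two-entry hash, showing that the in-place version produces the same lines as map and join:

```ruby
name_hash = { :"jose garcia" => 'ca:tw', :"ann lee" => 'ny:fi' }

# Unary plus yields an unfrozen string even under frozen_string_literal.
record = +''
name_hash.each do |name, str|
  # << mutates record instead of allocating a longer copy each time.
  record << name.to_s << '|' << str << "\n"
end

joined = name_hash.map { |name, str| "#{name}|#{str}" }.join("\n")

puts record == joined + "\n"
```

The only difference between the two results is the trailing newline that the loop appends after the last entry.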
Iterating over large collections is slow; the each method itself is not what's throttling it. What are you doing in your loop that's so slow? If you need to convert to an array, you can do that by calling some_hash.to_a
I had thought ruby 1.9.x had made hash iteration faster but could have been wrong. If it's simple structures you could try a different hash, like https://github.com/rdp/google_hash which is one I hacked up to make #each more reliable...
Probably "by making a single db query"
Converting a large Hash to an Array will require creating a large object and will require two iterations, albeit with one of them being internal to the interpreter and probably very fast.
This is unlikely to be faster than just iterating over the Hash, but it might be for large objects.
Check out the Standard Library Benchmark package for an easy way to measure runtime.
I would also venture a guess that the real problem here is that you have a Hash-like ActiveRecord object that imposes a round-trip to your db server in each cycle of the enumeration. It's possible that what you really want is to bypass AR and run a native query to retrieve everything at once in a single round-trip.